The content on this site is my personal opinion and does not represent my employer.

Warning: this blog entry is constantly being updated as I think of things to add. If you have questions or things you think should be covered, please leave them in the comments below.

Overview

The Arty A7 comes with 256MB of DDR3L, but actually using it isn’t always the simplest thing. Unlike the Zynq series, there is no hard IP for memory controllers, and so you have to roll your own or use Xilinx’s soft IP. There is no one definitive guide (that I could find), and there is quite a bit of wrong information out there. Much of the needed info either doesn’t exist, is spread out all over the internet, or assumes a good deal of previous FPGA experience to understand. So I am writing this blog post as a place to hold all my notes, because I will most certainly forget all of it in the near future.

Requirements for following along are: an Arty A7 board, Vivado (I am using 2021.1), Windows 10, infinite patience for dealing with Vivado’s bugs, and way too much free time to read a super long and dry blog post that explicitly spells out every step of the process.

Controller Levels

Vivado uses IP called MIG for generating DDR controllers. MIG officially stands for Memory Interface Generator, although Massively Insufferable Garbage also fits: it’s basically a GUI where you specify the parameters of your memory, and it crashes… err, I mean creates a controller for you. There are four levels of controller this blog post mentions, but sadly the only two that MIG can create are AXI and UI. However, the UI controller can be used as a base for creating your own native controller.

AXI

The MIG IP gives you the option to generate an AXI interface. AXI is, to directly steal text from wikipedia, part of the ARM Advanced Microcontroller Bus Architecture 3 (AXI3) and 4 (AXI4) specifications, and is a parallel high-performance, synchronous, high-frequency, multimaster, multi-slave communication interface, mainly designed for on-chip communication. If that sounds like your jam, then AXI might be for you.

UI

UI is the other interface that MIG can generate. It’s a somewhat higher-level interface that lets you add read and write commands fairly easily, while handling things like data reordering and user address mapping for you.

Native

UI is a convenience wrapper on top of the native layer. If you don’t need some of the UI helper functionality, such as reordering requests, you can get better latency by going with native directly. The MIG manual has some very sparse details on how native is supposed to work, but sadly MIG has no option to generate a native-level controller. The best option I’ve come up with so far is generating a UI controller, and then modifying the generated IP to make it do what you want.

PHY

This is the real live boy. At the PHY level, you are worrying about temperature, refreshing cells, and directly driving memory signals. Not recommended for beginners, but I purposely chose a board with no DDR controller hard IP in the hopes of someday working out how to write a controller at this level. I am nowhere near that yet. Give me a few years.

DDR3 Refresher

This is meant to serve as a quick refresher of things that are useful to know when generating and using the controller. It is in no way meant to be an exhaustive explanation of the inner workings of DDR3 or SDRAM.

Structure

DDR3 consists of ranks, banks, rows, and columns. The A7 is a single rank configuration, and so the rank number will always be 0. Each rank consists of 8 banks, each bank is made up of 2^14=16,384 rows, and each row is made up of 2^10=1024 columns. Therefore the total number of columns would be 8 x 16,384 x 1024 = 134,217,728.  If we have 256MB of memory, that’s 256MB / 134,217,728 = 2 bytes, or 16 bits per column which is what we’d expect.
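To keep that arithmetic straight, here’s the geometry written out as localparams (the names are mine, not anything MIG generates):

```systemverilog
// DDR3 geometry for the Arty A7's MT41K128M16 part (single rank, x16)
localparam BANK_BITS     = 3;  // 2^3  = 8 banks
localparam ROW_BITS      = 14; // 2^14 = 16,384 rows per bank
localparam COL_BITS      = 10; // 2^10 = 1,024 columns per row
localparam BYTES_PER_COL = 2;  // one column = 16 bits

// 8 * 16,384 * 1,024 columns * 2 bytes = 268,435,456 bytes = 256MB
localparam int TOTAL_BYTES =
    (1 << (BANK_BITS + ROW_BITS + COL_BITS)) * BYTES_PER_COL;
```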

Address Encoding

All this row and column business is important for performance reasons. To use a row, it must first be opened, which can take some amount of time. Each bank can have at most one row open at a time. Once a row is opened, subsequent accesses to its columns are less expensive, but as a general rule you don’t want to be unnecessarily opening a new row every access. Vivado’s memory controller generator gives you two options for address encoding: {row, bank, column} and {bank, row, column}. In {row, bank, column}, you go through all 1024 columns and then increment the bank. Because there are 8 banks, this gives you one giant contiguous working area of 8 banks x 1024 columns x 2 bytes/column, or 16,384 bytes that can be randomly accessed without having to reopen a row.

Address encoding {bank, row, column} is useful in a different situation. Instead of each bank contributing a row to form one big contiguous area, here each bank has its own 32MB area, of which only one 2048 byte row can be open at a time. This can be useful in situations where you have different memory fabric clients that need to access their own far-apart physical address regions, and once you open a new row you’re unlikely to need the previous one. So as an example, maybe I have a texture at address 0 that I (mainly) linearly read through and bring into the cache, so that I’m unlikely to need previous data. And at the same time I have a render target located 32MB after the texture that I write to semi-randomly, and I don’t want it to interfere with rows that the texture reads are using.
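As a sketch of how the two mappings pack an address (these helper functions are purely my own illustration, not part of the MIG API; the rank bit is always 0 on the single-rank Arty A7):

```systemverilog
// How the two address encodings lay out the 28-bit app_addr.
function automatic logic [27:0] addr_row_bank_col(
    input logic [13:0] row, input logic [2:0] bank, input logic [9:0] col);
    return {1'b0, row, bank, col};  // banks adjacent: one big open area
endfunction

function automatic logic [27:0] addr_bank_row_col(
    input logic [13:0] row, input logic [2:0] bank, input logic [9:0] col);
    return {1'b0, bank, row, col};  // each bank is its own 32MB region
endfunction
```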

Banks vs Bank Machines

Just to be clear, the number of banks != the number of bank machines in the controller. For an example of how this can affect performance, I’m going to directly quote Xilinx:

Increasing the number of Bank Machines can improve the efficiency of the design. For example, if a traffic pattern sends writes across a single row in all banks before moving to the next row (writes to B0R0, B1R0, B2R0, B3R0 followed by writes to B0R1, B1R1, B2R1, B3R1), five bank machines will provide higher efficiency than 2, 3 or 4. Requests for each bank in Row0 can be assigned to the first four Bank Machines. When the request comes in for Row1, requiring a Precharge/Activate, the 5th Bank Machine can be assigned this request while the other Bank Machines complete any pending requests

In my controller, I left the default number of bank machines at 4, but that’s just because I haven’t done any profiling yet to see if my memory access patterns could benefit from additional bank machines.

Generating The Controller

Digilent is kind enough to provide MIG-loadable project files containing the DDR parameters; however, MIG import is horribly broken. It will do things like ignore the clock period defined in the project and just set something else. Therefore it’s probably best to just manually enter everything yourself. You might even learn something along the way. You will still need the UCF file containing the pin constraints for the board. It can be downloaded here.

With that downloaded, go ahead and open the Memory Interface Generator in the IP Catalog. On this first screen you’ll want to select Create Design (because we’re not importing a project), set a component name for your controller, set the number of controllers to 1, and make sure that AXI4 is unchecked. Hit next, and enter in the FPGA part number, which should be xc7a100t-csg324. Hit next again and select DDR3 SDRAM.

There is quite a bit to enter on the next page. The clock period should be set to 3000ps, or 333.33MHz. Set the PHY to controller ratio to 4:1, the memory part to MT41K128M16XX-15E, the memory voltage to 1.35v, and the data width to 16. Ordering should be normal, number of bank machines is 4, and enable the data mask if you think you’re going to need it.

On the next page, set the input clock period to 6000ps (166.666MHz), the read burst type to sequential, and output driver impedance control and RTT to RZQ/6. It’s up to you if you want to use a chip select pin and what address mapping you want to use.

some of the more cryptic DDR3 settings live here

The above page is where stuff starts getting interesting. First up, we need to choose configurations for the system clock and reference clock, with options like single-ended, differential, and no buffer. Differential is for differential pairs, single ended is useful for when we directly connect to an external clock pin, and no buffer is probably what you want if you are deriving your clocks from a clocking wizard (like me!). This naming always threw me off, because “no buffer” sounds like it can’t be used with clock buffers. Rather it means there is no clock buffer, so feel free to use your own. So let’s just say we set these both to no buffer and move on. It will save you having to track down obscure clock backbone errors in implementation.

System reset polarity can be anything you want, as long as you remember what you chose later when we hook up the controller. Maybe it’s just best to leave it at its default, active low. Debug signals should be off, unless you somehow really need them. Then turn on internal vref and IO power reduction. Finally, XADC instantiation is turned off in the suggested settings, but what do they know?! Go ahead and enable it unless you are using the XADC block somewhere else in the design. Click next to go to the next page and set the impedance to 50Ω.

We’re almost done now. On the next screen, choose Fixed Pin Out, and hit next. Then click Read XDC/UCF and browse to the UCF file you previously downloaded. Click on the validate button to validate, and you should be greeted with a popup saying the current pinout is valid. Here’s a funny story: if you had imported the Arty project instead of entering all the above settings manually, this step would fail with a lot of very scary sounding warnings. Aren’t you glad you did it all manually?

The final config screen asks you to select pins for sys_rst, init_calib_complete, and tg_compare_error. Honestly, this screen is for fancy people who want to hook things up to actual pins for some reason. Leave all of these unconnected, and we’ll hook them up to logic later. After this, it’s a bunch of summary screens, disclaimers clearing Xilinx of any liability if your device explodes, and accept prompts. Blow through all these and you will have yourself a shiny new memory controller.

Using The Controller

So you have a controller generated, but what now? Hooking it up is easier than you might think. Note that this is just for synthesis, as the process for simulation is quite a bit more complicated, and will be explained later in this post.

Hooking It Up

Go to your top module, and add the following DDR3 signals to your module ports.

parameter DQ_WIDTH = 16;
parameter DQS_WIDTH = 2;
parameter CS_WIDTH = 1;
parameter ROW_WIDTH = 14;
parameter DM_WIDTH = 2;
parameter ODT_WIDTH = 1;

module jpu2_top_module(
    input wire logic CLK100MHZ,
    input wire logic [3:0] sw,
    output wire logic [3:0] led,
    input wire logic [3:0] btn,
    // HDMI
    output wire logic [3:0] hdmi_out_p,
    output wire logic [3:0] hdmi_out_n,
    // DDR3 signals
    inout wire logic [DQ_WIDTH-1:0] ddr3_dq,
    inout wire logic [DQS_WIDTH-1:0] ddr3_dqs_n,
    inout wire logic [DQS_WIDTH-1:0] ddr3_dqs_p,
    output wire logic [ROW_WIDTH-1:0] ddr3_addr,
    output wire logic [3-1:0] ddr3_ba,
    output wire logic ddr3_ras_n,
    output wire logic ddr3_cas_n,
    output wire logic ddr3_we_n,
    output wire logic ddr3_reset_n,
    output wire logic [1-1:0] ddr3_ck_p,
    output wire logic [1-1:0] ddr3_ck_n,
    output wire logic [1-1:0] ddr3_cke,
    output wire logic [(CS_WIDTH*1)-1:0] ddr3_cs_n,
    output wire logic [DM_WIDTH-1:0] ddr3_dm,
    output wire logic [ODT_WIDTH-1:0] ddr3_odt);

The clever among you will notice that normally the names of signals in your top level module come from your constraints file, but none of these DDR3 signals are in Arty-A7-100-Master.xdc. This is because MIG generates a second secret constraints file that isn’t added to the project. I named my controller ddr3_native_controller, and so the constraints file would be something like ddr3_native_controller.xdc, and it’s stored around jpu2.gen\sources_1\ip\ddr3_native_controller\ddr3_native_controller\user_design\constraints\.

The other thing to note is those params at the top. Where they come from will make a bit more sense once we get to the section on simulation. But for now, it’s fairly safe to use them as-is if you stick to the controller settings outlined above.

To actually instantiate the controller, just do this:

localparam DATA_WIDTH = 16;
localparam PAYLOAD_WIDTH = DATA_WIDTH;
localparam nCK_PER_CLK = 4;
localparam APP_DATA_WIDTH = 2 * nCK_PER_CLK * PAYLOAD_WIDTH;
localparam APP_MASK_WIDTH = APP_DATA_WIDTH / 8;
localparam ADDR_WIDTH = 28;

// DDR3 controller app signals
logic [(2*nCK_PER_CLK)-1:0] app_ecc_multiple_err;
logic [(2*nCK_PER_CLK)-1:0] app_ecc_single_err;
logic [ADDR_WIDTH-1:0] app_addr = 0;
logic [2:0] app_cmd = 0;
logic app_en = 0;
logic app_rdy;
logic [APP_DATA_WIDTH-1:0] app_rd_data;
logic app_rd_data_end;
logic app_rd_data_valid;
logic [APP_DATA_WIDTH-1:0] app_wdf_data = 0;
logic app_wdf_end;
logic [APP_MASK_WIDTH-1:0] app_wdf_mask = 0;
logic app_wdf_rdy;
logic app_sr_active;
logic app_ref_ack;
logic app_zq_ack;
logic app_wdf_wren = 0;

logic sys_rst;  // driven by the reset logic shown below

// UI clock is returned by the UI, and is 1/4 the memory interface clock
// 333.33MHz / 4 = 83.33MHz
logic ui_clk;
logic rst_from_ui;          // the controller's ui_clk_sync_rst output
logic init_calib_complete;  // goes high once DDR3 calibration finishes
logic [11:0] device_temp;

ddr3_native_controller u_ddr3_native_controller
(
    // Memory interface ports
    .ddr3_addr(ddr3_addr),                          // output
    .ddr3_ba(ddr3_ba),                              // output
    .ddr3_cas_n(ddr3_cas_n),                        // output
    .ddr3_ck_n(ddr3_ck_n),                          // output
    .ddr3_ck_p(ddr3_ck_p),                          // output
    .ddr3_cke(ddr3_cke),                            // output
    .ddr3_ras_n(ddr3_ras_n),                        // output
    .ddr3_we_n(ddr3_we_n),                          // output
    .ddr3_dq(ddr3_dq),                              // inout
    .ddr3_dqs_n(ddr3_dqs_n),                        // inout
    .ddr3_dqs_p(ddr3_dqs_p),                        // inout
    .ddr3_reset_n(ddr3_reset_n),                    // output
    .init_calib_complete(init_calib_complete),      // output
    .ddr3_cs_n(ddr3_cs_n),                          // output
    .ddr3_dm(ddr3_dm),                             // output
    .ddr3_odt(ddr3_odt),                            // output
    // Application interface ports
    .app_addr(app_addr),                            // input
    .app_cmd(app_cmd),                              // input
    .app_en(app_en),                                // input
    .app_wdf_data(app_wdf_data),                    // input
    .app_wdf_end(app_wdf_end),                      // input
    .app_wdf_wren(app_wdf_wren),                    // input
    .app_rd_data(app_rd_data),                      // output
    .app_rd_data_end(app_rd_data_end),              // output
    .app_rd_data_valid(app_rd_data_valid),          // output
    .app_rdy(app_rdy),                              // output
    .app_wdf_rdy(app_wdf_rdy),                      // output
    .app_sr_req(1'b0),                              // input
    .app_ref_req(1'b0),                             // input
    .app_zq_req(1'b0),                              // input
    .app_sr_active(app_sr_active),                  // output
    .app_ref_ack(app_ref_ack),
    .app_zq_ack(app_zq_ack),                        // output
    .ui_clk(ui_clk),                                // output
    .ui_clk_sync_rst(rst_from_ui),                  // output
    .app_wdf_mask(app_wdf_mask),                    // input
    // System Clock Ports
    .sys_clk_i(sysclock_166MHz),                    // input
    // Reference Clock Ports
    .clk_ref_i(refclock_200MHz),                    // input
    //.device_temp_i(device_temp_i),                // input
    .device_temp(device_temp),                      // output
    .sys_rst(sys_rst)                               // input
);

Clocking

The controller takes in two clocks and outputs one. The input clocks are a 200MHz reference clock and a 166.666MHz system clock, both of which can be generated by a Vivado clocking wizard. These are refclock_200MHz and sysclock_166MHz in the above code. The output clock is ui_clk, and is the clock that the user interface part of the controller runs on. So any logic you have that drives controller signals should be clocked on the positive edge of this depressingly slow 83.33MHz clock.

Assuming you chose for the reset signal to be active low, you have to hold sys_rst low for a minimum of 200 nanoseconds before bringing it high again. You can do this any way you want, but I usually do something silly like this:

// reset must be held for at least 200 nanoseconds
// a 100MHz clock has a period of 10 nanoseconds
// so we can just hold it low for 20+ clocks
localparam RST_ACT_LOW = 1;  // matches the reset polarity chosen in MIG
logic [4:0] reset_cnt = 1;
logic sys_rst_n = 0;
always_ff @(posedge CLK100MHZ) begin : DDR_RESET
    // count up until we wrap around to zero, and then stay there
    reset_cnt <= reset_cnt + |reset_cnt;
    sys_rst_n <= ~|reset_cnt;
end : DDR_RESET

assign sys_rst = RST_ACT_LOW ? sys_rst_n : ~sys_rst_n;

Paths

There are two different paths to the controller: one for writing commands and another for adding the data you want to write to memory. These paths are (semi) independent in that you can add write data before or at the same time as you add the corresponding command. You can also add write data after you add the command, but it must be within two clocks of adding the command.

Command Path

The command path deals with writing commands, and any signals that are common to reads and writes. To issue commands, set the command number, the address, and assert app_en. This path can randomly become unavailable as the controller performs various internal functions, so you have to always check app_rdy the clock after setting your data to know if you need to continue holding or if the command was accepted. To make this a bit clearer, here is a sample waveform to illustrate.

In the above example, app_rdy transitions from 1 to 0 on the exact same clock as a write command is being added. If we were to sample app_rdy on that clock, we’d get the old value (1) and mistakenly think the command would be accepted. Therefore we have to start checking app_rdy the clock *after* we set command data.
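A minimal sketch of that rule, holding the command stable until the clock where app_rdy is seen high (the FSM and the have_request/request_* signals are hypothetical names of mine):

```systemverilog
// Hold app_en and the command data steady until app_rdy says it was taken
typedef enum logic { CMD_IDLE, CMD_ISSUE } cmd_state_t;
cmd_state_t cmd_state = CMD_IDLE;

always_ff @(posedge ui_clk) begin
    case (cmd_state)
        CMD_IDLE: if (have_request) begin
            app_cmd   <= request_is_read ? 3'b001 : 3'b000;
            app_addr  <= request_addr;
            app_en    <= 1'b1;
            cmd_state <= CMD_ISSUE;
        end
        // app_en went high last clock; the command is accepted on the
        // first clock where app_rdy is also high
        CMD_ISSUE: if (app_rdy) begin
            app_en    <= 1'b0;
            cmd_state <= CMD_IDLE;
        end
    endcase
end
```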

Signals

app_rdy: output, indicates whether the command presented the previous clock could be accepted
app_en: input, indicates to the controller that we're adding a command
app_cmd: input, 3'b001 for reads and 3'b000 for writes
app_addr: input, the address of the read or write command we are adding
ui_clk_sync_rst: output, active high reset synchronized to ui_clk; the UI is ready to use once this deasserts
init_calib_complete: output, active high signal indicating DDR3 calibration is complete

Write Data Path

The write data path is basically a FIFO holding 128-bit data to be written to memory. Like the command path, it also has a signal to indicate when data has been successfully added, but unlike the command path’s app_rdy, which can go high and low at seemingly random times, app_wdf_rdy is basically there to tell you if the write data FIFO is full, and won’t suddenly go low unless you add data. This makes it a bit easier to work with.

Signals

app_wdf_rdy: output, indicates the write data FIFO isn't full, and new data can be added
app_wdf_wren: input, write data FIFO write enable
app_wdf_data: input, the 128 bit write data to be added to the FIFO
app_wdf_mask: input, per-byte write enable mask. Active low
app_wdf_end: input, must always be tied to app_wdf_wren, despite what the incorrect docs say

This explains the timing requirements. Because write data is just added to a FIFO, it can be added before or together with a write command. When a write command is encountered, a few clocks later the write path will get the data to be written from the write data FIFO. Like I said before, it’s possible to add the write data up to two clocks after the write command, but I’ve not tried that before and it feels like you’re asking for trouble.
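Here’s a sketch of a write under those rules, presenting the command and its data together and letting each path drain independently (start_write, write_addr, and write_data are my own hypothetical signals; it assumes start_write doesn’t pulse again while a write is still pending):

```systemverilog
always_ff @(posedge ui_clk) begin
    if (start_write) begin
        app_addr     <= write_addr;
        app_cmd      <= 3'b000;     // write
        app_en       <= 1'b1;
        app_wdf_data <= write_data; // the full 128-bit burst
        app_wdf_mask <= '0;         // active low: write every byte
        app_wdf_wren <= 1'b1;
    end else begin
        if (app_en && app_rdy)           app_en       <= 1'b0; // command taken
        if (app_wdf_wren && app_wdf_rdy) app_wdf_wren <= 1'b0; // data queued
    end
end

// as noted above, just tie these together
assign app_wdf_end = app_wdf_wren;
```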

Performance

Simulation

This was initially a bit confusing for a dumb reason. With most other IP, you set it up, generate it, instantiate it, and it all just works both for synthesis and for simulation. The memory controller is a bit different because controller != memory. You need both the controller IP *and* the DDR model for it to work, or else the controller will sit around forever waiting for calibration. Sadly the info on how to make this work isn’t easy to find (or I am garbage at searching).

opening an example design from the sources view

The first step is to right click on the controller IP in the sources pane, and go to Open IP Example Design. Save it somewhere, and go to that directory in explorer.  You’ll see a subdirectory called imports, which you’ll want to copy and paste to the root directory of your project.

copy the imports directory from the sample design to your design

Some tutorials I have seen say it’s ok to only copy some of the RTL files, but it’s probably safer to just copy everything and then remove what you don’t need. Next, add wiredly.v, ddr3_model.sv, and sim_tb_top.sv to your project’s simulation sources. You may need to add some of the other sources as well, if you get errors about undefined modules. Then, set sim_tb_top to be the new top module for simulation.

My new simulation hierarchy. sim_tb_top from the example DDR design is my new top module, and sim_top is the old one containing my testbenches

We’re almost done. Essentially what we’re going to do is let that new top module from the sample design handle all the DDR stuff for us, but inside of it we instantiate our old top module and pass in the relevant DDR signals. Search sim_tb_top.sv for the section that says FPGA Memory Controller, and just below you’ll see a module called example_top being instantiated. Replace example_top with the name of your previous simulation top (sim_top in the above image), and modify the module to take the DDR signals passed in from sim_tb_top. So my sim_tb_top.sv now looks like this:

// this is sim_tb_top.sv. Its the file that came from the generated reference design.
// it used to instantiate a module called example_top, but I replaced that with my sim_top

//===========================================================================
//                         FPGA Memory Controller
//===========================================================================
sim_top u_ip_top(
    // DDR signals here

and my sim_top.sv looks like this:

// this is in sim_top.sv
// it's my old top module before I made sim_tb_top the new top module.
// I also modified it to add all the DDR signals
module sim_top(
    // DDR3 signals from sim_tb_top
    inout wire logic [DQ_WIDTH-1:0] ddr3_dq,
    inout wire logic [DQS_WIDTH-1:0] ddr3_dqs_n,
    inout wire logic [DQS_WIDTH-1:0] ddr3_dqs_p,
    output wire logic [ROW_WIDTH-1:0] ddr3_addr,
    output wire logic [3-1:0] ddr3_ba,
    output wire logic ddr3_ras_n,
    output wire logic ddr3_cas_n,
    output wire logic ddr3_we_n,
    output wire logic ddr3_reset_n,
    output wire logic [1-1:0] ddr3_ck_p,
    output wire logic [1-1:0] ddr3_ck_n,
    output wire logic [1-1:0] ddr3_cke,
    output wire logic [(CS_WIDTH*1)-1:0] ddr3_cs_n,
    output wire logic [DM_WIDTH-1:0] ddr3_dm,
    output wire logic [ODT_WIDTH-1:0] ddr3_odt,
    // clocks. todo: these need to be generated 
    input wire logic sys_clk_i,
    input wire logic clk_ref_i,
    // misc
    input wire logic [11:0] device_temp_i,
    output wire logic init_calib_complete,
    output wire logic tg_compare_error,
    input wire logic sys_rst);

    // your code here!

endmodule

And that’s it! Those signals can be hooked up to your DDR3 controller, and you should be in business. There are three things to note, though. First of all, simulation can take a Very Long Time. Like minutes or more of wall clock time can pass before DDR calibration finishes, so be prepared to wait unless you disable calibration. Second, you’ll notice I have input ports for sys_clk_i, clk_ref_i, and sys_rst. The top level simulation generates these signals for you as a convenience, so you’re free to use them in simulation, but it’s probably a better idea to just use what’s described in the clocking section above. And finally, you’ll notice I don’t have any of those module parameters such as DQ_WIDTH, ROW_WIDTH, and CS_WIDTH that you’ll see in your version of the files. So that I can use them all over, I pulled all the parameters into a header file (ddr3_params.vh) that is included anywhere it’s needed. I suggest you do the same.

Interfacing With Native Directly

Signals and Flow

It’s probably useful to begin by looking at the general flow of native mode.

demonstration of the native interface protocol
native interface flow
native interface processing for read and write data. The wr_data timing shown is incorrect

The timing diagrams are fairly self-explanatory, but I’ll go over the general usage anyway. To add a command, on clock N set the command data. Command data is the command itself, the priority bit, data_buf_addr, and the address in {rank, bank, row, column} form. This is the first difference between native and UI. When creating the controller, you chose a mapping from user address to bank, row, and column, and the address encoding was handled by the UI layer. But with native you have to manually specify these yourself. Next, on clock N+1, assert the use_addr signal to indicate a request wants to be added. If accept was asserted, then use_addr can be safely deasserted on clock N+2.

Figure 1-65 shows the flow for adding multiple simultaneous commands. Assuming the command data is set on clock N, you still assert use_addr on clock N+1. If accept is high at that time, then it’s OK to move on to the next command data. You can see this when the command data transitions from 1 to 2 on clock N+1. Things get more interesting on clock N+3. use_addr is asserted and the command data wants to transition from 3 to 4, but it can’t because accept is low. Therefore the command data must remain stable until accept goes high.
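A sketch of that hold-until-accept rule (use_addr and accept come from the MIG docs; cmd_pending and the surrounding logic are mine):

```systemverilog
always_ff @(posedge clk) begin
    if (cmd_pending && !use_addr) begin
        // clock N: drive cmd, the priority bit, data_buf_addr, and the
        // {rank, bank, row, col} address, then raise use_addr for clock N+1
        use_addr <= 1'b1;
    end else if (use_addr && accept) begin
        // taken: safe to deassert, or load the next command's data
        use_addr <= 1'b0;
    end
    // while use_addr is high and accept is low, all of the command data
    // must be held stable
end
```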

The biggest difference between UI and native has to do with data_buf_addr. Because the native layer can reorder requests for efficiency, it needs a way to signal to the user design which request is being referenced. For writes, data_buf_addr might be an index into a user buffer that stores the data to be written. For reads, it might be an index into a user buffer where we will store data returned from memory for reordering.

This is illustrated in Figure 1-66. For writes, wr_data_en is a signal from the native layer that indicates it is pulling the write data for a previously added write command. wr_data_addr will be the same as the data_buf_addr we added with the command data. You’ll notice that while the write data is 128 bits, write data D0~D3 (the first 16×4=64 bits) is read from the user buffer on the first clock, and D4~D7 (the last 64 bits) on the next clock. To make this easier to handle, the native layer also provides wr_data_offset, which will be 0 during the first 64 bits and 1 during the last 64 bits. When wr_data_en is high, it’s your job to take wr_data_addr and wr_data_offset and use them to look up the write data for the command. This write data is used to drive wr_data.

Reads function similarly. rd_data_en indicates that the native layer is returning read data to you. rd_data_addr is an echo of the data_buf_addr you added with the read command, rd_data_offset will be 0 for the lower 64 bits and 1 for the upper 64 bits, and rd_data is the 64 bits of read data being returned from memory. Once you receive the data, it’s up to you to reorder it if that’s something you care about.

Corrections

First of all, a confession: I probably should have said this earlier, but while the above descriptions do match the timing diagrams shown in the docs, they don’t match the actual Arty A7 DDR3 configuration. The diagrams assume a burst length of 4, meaning data is returned 16×4=64 bits at a time. But we have BURST_LENGTH set to 8, or 16×8=128 bits, meaning we only need one read/write per request, and both rd_data_offset and wr_data_offset will always be zero.  To be fair to Xilinx, this is called out in the docs, as the description of wr_data_offset specifically states

This bus is used when processing the burst length requires more than one cycle.

You can see all this in the waveform below.

I should also mention there are a few confusing errors in Figure 1-66. First of all, rd_data_addr is listed twice. The second one obviously should be rd_data_offset. Next, the way the write timing is shown is incorrect. It seems to imply that you are supposed to provide the write data on the same clock that the system tells you the buffer address to pull it from. Surely that can’t be right. Figure 1-64 seems to get it right, since wr_data_en and wr_data_addr are provided on clock N, and you’re expected to provide the corresponding write data on clock N+1. This can be seen in the following waveform.

waveform from the lower level DDR3 controller. Notice how the write data changes one clock *after* wr_data_en goes high

Here you’ll see wr_data_en go high the same clock that wr_data_addr comes (red circle). And it is one clock after this that the user buffer provides the data to be written to memory (blue circle).

How The UI Layer Does It

The UI provides a (possibly overcomplicated?) sample implementation for the read and write buffers. These are located at user_design\rtl\ui\mig_7series_v4_2_ui_rd_data.v and user_design\rtl\ui\mig_7series_v4_2_ui_wr_data.v respectively. I recommend taking a look through these to see what they are doing, or at least reading the extensive comments for a pretty good description of how they work. Actually, while you’re there, check out all the source for the IP. Everything is in there, even PHY.

As a bit of starting advice, before you go modifying the UI, try setting the controller data ordering to STRICT. Much of the UI latency comes from read reordering, which is turned off by the following lines:

if (ORDERING == "STRICT") begin : strict_mode
    assign app_rd_data_valid_ns = 1'b0;
    assign single_data = 1'b0;
    assign rd_buf_full = 1'b0;
    reg [DATA_BUF_ADDR_WIDTH-1:0] rd_data_buf_addr_r_lcl;
    wire [DATA_BUF_ADDR_WIDTH-1:0] rd_data_buf_addr_ns = rst? 0 : rd_data_buf_addr_r_lcl + rd_accepted;
    always @(posedge clk) rd_data_buf_addr_r_lcl <= #TCQ rd_data_buf_addr_ns;
    assign rd_data_buf_addr_r = rd_data_buf_addr_ns;

    always @(rd_data) app_rd_data = rd_data;
    always @(rd_data_en) app_rd_data_valid = rd_data_en;
    always @(rd_data_end) app_rd_data_end = rd_data_end;

The obvious caveat is that you need to understand your memory access pattern *and* profile, to see if you can 0) get away without reordering and 1) actually benefit from a reduction in latency. But if you’re not concerned with latency, then why are you still reading this?


Cache Details

Overview

This post represents my first attempt to implement the simplest possible texture cache that still works with my current GPU, in preparation for getting texturing working. Originally, it wasn’t meant to be a releasable post, but rather was written before coding the cache as a way to work out the high level design ideas and to find potential issues. Turns out pretending to explain my ideas to a fictional reader works really well for me as a way of designing. As a bonus, it’s also a way to help me remember how my own stuff works, should work get busy and I need to come back to the project after N months. The cache is currently working both in sim and in hardware, but with fake values stuffed into memory instead of actual textures.

Cache Organisation

The texture cache is a read-only cache with an LRU policy and single-clock invalidation. The cacheline size is 256 bits, which is 16 pixels in 565 format, representing a 4×4 tiled block. The data store for the cache is built out of pairs of 36 kbit simple dual-port BRAMs. This appears to be a requirement stemming from the cache having asymmetric read and write sizes (one-pixel reads and half-cacheline writes). This gives a total memory size of 64 kbits, or 256 lines, if the ECC bits aren’t repurposed as usable data. And with a 2-way cache, that is 128 sets. Therefore read addresses presented to the cache will have 4 bits indicating the pixel within a line, 7 bits to determine the set, and the MSB 16 bits will be the tag.

cacheline block
Figure 1: a cacheline is a 4×4 block of pixels. Hex numbers represent the offset inside a cacheline

The tags are kept separate from the cache datastore. Each line has a 16 bit tag and a valid bit, and each set shares one bit to indicate which way is LRU. This data doesn’t represent the actual current state of the cache, but rather the state the cache will be in once the next request is processed.

High Level Cache View
Figure 2: high level cache view. T is cache tags and LRU bits, H is a bit indicating hit or miss, and DATA is the cachelines

Client Arbiter

Each cache serves four clients. To give me the simplest test, the clients are currently just the rasterizer tiles, although eventually I may try to add texture reads to my pixel shaders. Each client can have multiple requests in flight, and data is returned in the order that requests are accepted.

The arbiter chooses a client request with a simple four bit rotating onehot mask, where the set bit determines the first client to consider. So if the mask was 4’b0100, the arbiter would start with client 2, and then try 3, 0, and 1 in order, looking for a valid request. A mask of 4’b0010 would check clients 1, 2, 3, and then 0. The priority mask is rotated one bit left when a request is accepted. Early on I tried to be more “clever” with arbitration, favouring client requests that would be hits before ones that would be misses, but this really complicated the design and made it harder to meet timing. It also starved the cache of DDR3 requests that could have been in flight.
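The rotation scheme can be modelled in a few lines of Python (a behavioural sketch with my own function name; the RTL is of course a mask register plus a priority chain):

```python
def pick_client(mask, valid):
    """mask: 4-bit one-hot; start at the set bit and scan upward (wrapping)
    for the first client with a valid request. Returns (client, new_mask),
    or (None, mask) if no client has a valid request."""
    start = mask.bit_length() - 1            # index of the single set bit
    for i in range(4):
        c = (start + i) % 4
        if valid[c]:
            # rotate the one-hot mask left by one on accept
            new_mask = ((mask << 1) | (mask >> 3)) & 0xF
            return c, new_mask
    return None, mask

# mask 4'b0100 considers clients 2, 3, 0, 1 in order, as in the text
assert pick_client(0b0100, [False, False, False, True]) == (3, 0b1000)
```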

The arbiter itself is a two-state machine. In the first state, it looks for a valid client request. If one is found, it saves off the request address and client number, presents the request to the cache, rotates the accept mask, and notifies the client that its request was accepted, before transitioning to the second state. The second state simply waits for the ack from the cache, which usually comes within two clocks unless the cache input FIFO is full or it’s not in the request accept state.

Requests consist of only an address and client ID. Because we have 256MB of RAM, the full address width is 28 bits. However, cache requests are for 16 bit pixels, meaning the LSB of the address isn’t needed in the request, and so request addresses are 27 bits wide. Client ID is just a 2 bit number indicating client. This ID is returned by the cache, along with the pixel data, and is used to signal a client that there is valid data for them on the bus.

Cache Input Processing

The cache itself is made of four state machines: input, DDR request, DDR return, and output. The input machine is a three state FSM where the first state accepts and saves off an incoming request from the arbiter, the second state gets the tags and LRU bit for the relevant cache set, and the third state updates the cache tags, adds misses to the DDR3 request FIFO, and adds the request to the output request FIFO.

The output request FIFO entry contains a client number, one bit to indicate whether a request is a hit or miss, and some address information. Cache reads are pixel aligned, and so the read address is constructed as {set_bits (7b), way (1b), pixel bits (4b)}. Writes are half cacheline granularity, and so the addresses are {set_bits (7b), way (1b), halfline (1b)}, but the halfline isn’t needed until a miss’s data is returned from the memory fabric, and therefore isn’t stored in the output request FIFO entry.

The cache tags and LRU bits are stored separately from the cacheline data, and don’t reflect the current state of the cache. For example, a miss will set the valid bit so that the next request for the address will be a hit, even though the data is not yet in the cache. The cache is updated with the following pseudocode:

cache_set_flags[input_req_addr.set].lru <= |input_is_hit ? input_is_hit[0] : ~input_set_flags.lru;
cache_set_flags[input_req_addr.set].tag_entry[input_way_to_use] <= {1'b1, input_req_addr.tag};

input_is_hit is a two bit vector (one for each way), that is guaranteed to be either zero in the case of a miss, or $onehot for a hit. The bitwise-or reduction of this, |input_is_hit, indicates whether the request hit in any way. In the case of a hit, the set’s LRU bit is set to input_is_hit[0], so that LRU is now set to whatever way we didn’t hit in. Otherwise for a miss, the current LRU bit is just inverted. The tag_entry is always updated, either with the existing tag for a hit, or the new tag for a miss.
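In Python terms (a behavioural model with my own names, not the RTL), the update is:

```python
def update_set(flags, tag, hit_vec):
    """Model of the tag/LRU update. flags = {'lru': 0/1, 'tags': [way0, way1]};
    hit_vec is two bits, one per way: 0 for a miss, one-hot for a hit.
    Returns the way that was used."""
    if hit_vec:                       # hit: LRU now points at the way we didn't hit
        flags['lru'] = hit_vec & 1    # i.e. input_is_hit[0]
        way = 0 if (hit_vec & 1) else 1
    else:                             # miss: fill the LRU way, then flip LRU
        way = flags['lru']
        flags['lru'] ^= 1
    flags['tags'][way] = tag          # tag_entry always updated (valid bit implied)
    return way

flags = {'lru': 0, 'tags': [None, None]}
assert update_set(flags, 5, 0) == 0 and flags['lru'] == 1   # miss fills way 0
assert update_set(flags, 7, 0) == 1 and flags['lru'] == 0   # miss fills way 1
assert update_set(flags, 5, 0b01) == 0 and flags['lru'] == 1  # hit in way 0
```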

DDR3 Request Processing

In the case of a miss, the input FSM adds memory requests to a CDC FIFO for the DDR3 request FSM to consume, with the write side clocked at the system clock frequency, and the read side clocked at the much lower DDR3 controller user interface frequency. The DDR3 request addresses are just the upper 23 bits of the byte address. This is because byte addresses are 28 bit and cache request addresses are 27 bits (since pixels are 2 bytes), but I want DDR3 request addresses to be cache line aligned. With a cache line holding 16 pixels, this means DDR3 request addresses only need 28 – $clog2(2) – $clog2(32/2) = 23 bits. It is this 23 bit address and the number of 128 bit reads (two for a 256 bit cacheline) that is sent over the fabric.
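As a sanity check, the width arithmetic can be reproduced in a couple of lines of Python (constant names are mine):

```python
import math

BYTE_ADDR_BITS = 28        # 256 MB of DDR3
BYTES_PER_PIXEL = 2        # 16-bit pixels
PIXELS_PER_LINE = 16       # 256-bit cacheline

# a cacheline-aligned DDR3 address drops the byte-in-pixel and
# pixel-in-line offset bits
ddr3_addr_bits = (BYTE_ADDR_BITS
                  - int(math.log2(BYTES_PER_PIXEL))
                  - int(math.log2(PIXELS_PER_LINE)))
assert ddr3_addr_bits == 23
```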

The DDR3 request FSM is a simple two-state machine. The first state gets an entry from the DDR3 request FIFO, if there are any. The second state presents the request to the memory fabric, and waits for the ack to come back before getting the next FIFO entry.

DDR3 Return Processing

Data returned from the memory fabric is 128 bits wide, but cachelines are 256 bits, so I have some logic to gather the returned data into cachelines before adding to the return data FIFO. This is actually really simple, and just requires a single bit to indicate whether incoming data is the high 128 or low 128, and a 128 bit vector to store the previously returned data. And so the return data FIFO input becomes {new_data_from_ddr, prev_data_from_ddr}, and the FIFO write enable is (valid_data & is_high_128).
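As a model, the gather logic amounts to the following (Python sketch mirroring the single high/low bit plus 128-bit holding register; names are mine):

```python
def make_gatherer():
    """Collect pairs of 128-bit beats into one 256-bit cacheline."""
    state = {'high': False, 'prev': 0}
    def push(beat128):
        if not state['high']:                     # low half: just store it
            state['prev'] = beat128
            state['high'] = True
            return None                           # FIFO write enable stays low
        state['high'] = False                     # high half: emit {new, prev}
        return (beat128 << 128) | state['prev']
    return push

g = make_gatherer()
assert g(1) is None                 # low 128 bits arrive, nothing written
assert g(2) == (2 << 128) | 1       # high 128 bits complete the cacheline
```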

Output Block

Here’s where I admit I have been lying. The output FSM is actually two FSMs: one for hit processing and another for misses.

Hit FSM

The hit machine has four states. The first state waits for a valid entry in the output request FIFO. Once output_fifo_empty goes low, the request is saved off, the cache address is sent to the BRAM, and it transitions to the second state. All the second state does is transition to the third state, to give the BRAM time to return the requested data. In the third state, the output pixel is read from the BRAM, and it transitions to the final sync state.

output block
Figure 3: output block hit and miss FSMs

Miss FSM

The write machine only has three states. Like the read machine, the write machine’s initial state also waits until output_fifo_empty goes low, but it additionally waits for (was_hit | ~ddr3_ret_fifo_empty). This is because in the case of a hit, the write machine has nothing to do and can safely continue, but in the case of a miss, it also has to wait until the required data is returned from the memory fabric. Once the condition is met, the cacheline data is read from the DDR3 return FIFO. The lower 128 bits of that data, the cache write address, and the write enable bit (high if the request was a miss) are sent to the BRAM, and it transitions to the second state. In the second state, the BRAM address is incremented and the upper 128 bits of the cacheline are sent to the BRAM. Like the read machine, the final state is the sync state, which waits for both FSMs to be in sync before transitioning back to the first state.

SVA Verification

The bulk of architectural asserts in the cache are for verifying FIFOs, checking for overflow, underflow, and validity, and making sure that the FIFO element sizes match what was set up in the IP. The rest are sanity checks verifying the expected flow of the various FSMs. For example, if an output request FIFO entry is a miss, once the data from memory is available, the cache write enable should rise, stay asserted for two clocks, and then fall. There are also quite a few internal consistency asserts, such as making sure a request never hits in more than one way, that valid bits go low one clock after an invalidation, DDR return FIFO must be empty if the output request FIFO is empty, and that data from the cache is valid exactly two clocks after a read request.

There are some asserts in the arbiter as well. Most of these are sanity checks like the accepted client must actually have a valid request, the client request bit goes low one clock after the arbiter accepts a request, and that the arbiter FSM must be in the kArbStateWaitCacheAck state when the cache asserts arb2cache_if.req_accepted. I bring these three up as an example because all three of them were failing due to a stupid typo that I feel really should have been an error, or at least a warning. Thanks, Vivado. The lesson here is to assert everything, no matter how stupid or small it may seem.

Finally, testbench asserts are used to verify correct high level operation. Each test has three modes: linear, all misses, and alternating ways. Linear uses linearly increasing addresses, and verifies both that the data returned is correct and that I get 1 miss followed by 15 hits. The all misses mode requests the first pixel of every cacheline, and so asserts check that every request is a miss. Alternating ways is a bit harder to explain. Basically I want to test

  1. [tag A, set 0, way 0: miss]
  2. [tag B, set 0, way 1: miss]
  3. [tag A, set 0, way 0: hit]
  4. [tag B, set 0, way 1: hit]
  5. [tag C, set 2, way 0: miss]
  6. [tag D, set 2, way 1: miss]
  7. [tag C, set 2, way 0: hit]
  8. [tag D, set 2, way 1: hit]
  9. [tag E, set 1, way 0: miss]
  10. [tag F, set 1, way 1: miss]
  11. [tag E, set 1, way 0: hit]
  12. [tag F, set 1, way 1: hit]
and so on. That pattern is used to go through all of memory, cycling through all sets and ways, and then wrapping around. I then assert that req_was_hit == test_counter[1], since this pattern should give two misses followed by two hits. One caveat is that the tests above only apply to testbenches where the arbiter has one client. Things get harder in the four-client case, since the client request order is randomised. I can still verify the returned data is correct, but since I can no longer rely on any client being the first to touch a cacheline, it becomes harder to verify hits and misses. But this is definitely on my list of things to do.
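The expected miss, miss, hit, hit cadence can be checked against a tiny behavioural model of a 2-way LRU cache (Python; I've simplified the set ordering relative to the list above, and all names are mine):

```python
NUM_SETS = 128

def make_cache():
    """One LRU bit and two tag slots per set."""
    return [{'lru': 0, 'tags': [None, None]} for _ in range(NUM_SETS)]

def access(cache, tag, set_idx):
    """Returns True for a hit, and applies the LRU/tag update."""
    s = cache[set_idx]
    hit = tag in s['tags']
    if hit:
        s['lru'] = 1 - s['tags'].index(tag)   # LRU points at the other way
    else:
        s['tags'][s['lru']] = tag             # fill the LRU way
        s['lru'] ^= 1
    return hit

cache = make_cache()
results = []
for i in range(8):
    tag = (i // 4) * 2 + (i % 2)              # pair of distinct tags per set
    set_idx = i // 4                          # new set every four accesses
    results.append(access(cache, tag, set_idx))

# two misses followed by two hits, i.e. hit == bit 1 of the counter
assert results == [False, False, True, True] * 2
```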

Future Optimisations

Like I said, this represents a first attempt at getting something simple up and running, and in doing so I took quite a few shortcuts. Currently, data is returned in the order requests are accepted, and a miss can block subsequent hits. I’m considering having per-client queues in the cache, where requests from a particular client are always processed in order, but there is some freedom to return a hit from client A while client B’s miss is pending. I actually looked into this pretty early on, but it really complicated the design. I may have to come back to this in the future if I am feeling brave.

Fake Scanout Demo

So, why fake a demo? Well, lighting up a green LED on test success is fine, but this is a “GPU” so it would be nice to see something visual on screen. The current rasterizer is a remnant from my previous tile racing design, and rewriting it to support texturing and framebuffers is going to take a Very Long Time. So in order to get some visual confirmation that the texture caches and memory fabric are working, I tried to come up with a minimal-effort demo that would get some textures displaying.

Demo Setup

I tried to keep the demo as close as possible to the final GPU configuration. So there are twenty “rasterizers”, each covering a 32×32 pixel tile, or a total area of 640×32. Four rasterizers share a texture cache, and so there are five texture caches. These five caches talk to DDR3 through a memory fabric arbiter, which is also shared with a single write client that fills RAM with tiled texture data stored in a BRAM. The addressing used is identical to the final addressing calculation, except that the texture width is locked to 256 pixels. The fake rasterizers themselves don’t do any rasterization. Rather they only contain two state machines: one for requesting the needed texels and the other for receiving them from the cache.

High level demo config. 20 rasterizers connect to 5 caches

DDR3 Fabric and Arbiter

To save time, the memory fabric arbiter is based on the cache client arbiter, but with some modifications. The biggest of these is that there are now both read and write clients, the number of clients is increased from four to six, and the priority is now fixed based on client number. In the demo, client 0 is the write client. It fills RAM with texture data and then goes idle forever, and so it’s given the highest priority to allow it to finish before any reads occur. For read clients, client 1 has the highest priority and client 5 has the lowest, since client 1 handles pixels on the left side of the screen which are needed before client 5’s right side of the screen pixels.

fabric arbiter
fabric arbiter. The red boxes show the write client’s last four writes finishing before read clients can start. The blue boxes show the cache 0 read clients locking out the other lower priority caches. Click to enlarge image.

The other major modification is that data coming back from memory can consist of a variable number of 128 bit transactions, rather than always being one like for the cache arbiter, and so the transaction count (minus one) needs to be sent to the DDR3 state machine. For writes, I’m only writing 128 bits at a time, so the number of transactions is always one. For reads, this is currently always going to be two since the only read client is the caches, and caches want to fill 128×2=256 bit cachelines.

When a read request comes along, the requesting client number is put in a FIFO. When data comes back from memory, a return transaction counter is incremented, and when the counter has all bits set, the client number FIFO has its read signal asserted to move on to the next entry. This allows me to add the client number to the FIFO once, even though I’m expecting two transactions back.
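A Python sketch of that bookkeeping (structure and names are mine; I've used a modulo counter in place of the all-bits-set check, which is equivalent for two beats per read):

```python
from collections import deque

def make_return_tracker(beats_per_read=2):
    """The client number is queued once per read request, and popped only
    after all beats of that read have come back."""
    clients = deque()
    state = {'count': 0}

    def request(client):
        clients.append(client)

    def beat():
        """Called once per returned 128-bit beat; returns the client whose
        read just completed, or None if more beats are pending."""
        state['count'] = (state['count'] + 1) % beats_per_read
        if state['count'] == 0:
            return clients.popleft()
        return None

    return request, beat

req, beat = make_return_tracker()
req(3); req(1)
assert beat() is None     # first beat of client 3's read
assert beat() == 3        # second beat completes it
```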

DDR3 State Machine

The memory controller has two separate and (semi) independent paths: a command path that takes a command (read or write) and address, and a write data path for storing data to write. The write data can be added before the command, on the same clock as the command, or up to two clocks after the command is added. But because the command path can become unavailable (app_rdy goes low) for long periods of time, it was safer to always add the write data first, if it’s a write command. And so the state for adding write data looks like

app_en <= 0;
calibration_success <= 1;
read_transaction_count <= from_ddr3_arb_if.read_transaction_count;
app_addr <= (from_ddr3_arb_if.req_addr << kDdr3ReqAddrToFullAddrShift);
app_cmd <= from_ddr3_arb_if.req_is_read;

// if a valid read comes in, or a write comes in and there is space in the write data FIFO
if (from_ddr3_arb_if.req_valid & (from_ddr3_arb_if.req_is_read | app_wdf_rdy)) begin
     app_wdf_wren <= ~from_ddr3_arb_if.req_is_read;
     from_ddr3_arb_if.req_accepted <= 1;
     app_en <= 1;
     current_state_ui <= kStateAddCommand;
end

And the state for adding commands (one per transaction) just becomes the following simple loop

// but requests consist of multiple contiguous transactions, so count down
read_transaction_count <= read_transaction_count - app_rdy;
app_addr <= app_addr + (app_rdy << kTransactionIncShift);
app_en <= ~app_rdy || |read_transaction_count;
current_state_ui <= app_rdy && (read_transaction_count == 0) ? kStateWaitRequest : kStateAddCommand;

The return data logic simply waits for app_rd_data_valid to go high, and then assigns from_ddr3_arb_if.ret_data <= app_rd_data and from_ddr3_arb_if.ret_valid <= app_rd_data_valid.

Fake Rasterizers

These aren’t really rasterizers, as they don’t do any rasterizing, but rather are just stand-ins for the 20 rasterizer tiles I will eventually have to update to work with texturing. For now, think of them as just FSMs that request and receive texels from the texture caches.

Cache Request

The request FSM begins when receiving a signal from HDMI, and requests 32 contiguous (in X) texels from the texture cache. Texture data is tiled as shown in figure 1, and therefore the texel address is computed as

// shouldn't really require any muls and adds. Everything is just a vector insert
function TexCacheAddr PixelXyToCacheAddr(input FakeTexCoord x_in, input FakeTexCoord y_in);
    return  (x_in >> 2) * 16                       // start of 4x4 tile in a row
        + (y_in >> 2) * (kFakeTextureWidth * 4)    // start of 4x4 tile in a col
        + {y_in[1], x_in[1], y_in[0], x_in[0]};    // offset inside the tile
endfunction
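A Python port of the same function, with the bit concatenation written out explicitly, confirms the 4×4 tile mapping from figure 1 (kFakeTextureWidth is assumed to be 256, as stated above):

```python
FAKE_TEXTURE_WIDTH = 256   # texture width is locked to 256 in the demo

def pixel_xy_to_cache_addr(x, y):
    """Python port of PixelXyToCacheAddr: 4x4 tiled addressing."""
    return ((x >> 2) * 16                           # start of 4x4 tile in a row
            + (y >> 2) * (FAKE_TEXTURE_WIDTH * 4)   # start of 4x4 tile in a col
            # offset inside the tile: {y[1], x[1], y[0], x[0]}
            + (((y >> 1) & 1) << 3 | ((x >> 1) & 1) << 2
               | (y & 1) << 1 | (x & 1)))

# the 16 texels of the first 4x4 tile map exactly to offsets 0..15
offsets = sorted(pixel_xy_to_cache_addr(x, y) for y in range(4) for x in range(4))
assert offsets == list(range(16))
assert pixel_xy_to_cache_addr(4, 0) == 16   # next tile in the row
```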

Once all 32 requests are accepted by the cache, the FSM transitions to idle and waits for HDMI to signal the next scanline’s data is needed.

Cache Return

The return FSM waits for data to come back from the cache, and builds up a per-rasterizer vector of the last 32 texels returned.

logic [kNumRastTiles-1:0] ret_from_cache_valid;
generate
    // sadly Vivado won't let you index into interfaces with for loop variables
    for (genvar tg = 0; tg < kNumRastTiles; ++tg) begin
        assign ret_from_cache_valid[tg] = cache_client_ifs[tg>>2][tg&3].ret_valid;
    end
endgenerate

// per rasterizer cache read
always_ff @(posedge clock_in_200MHz_rasterizer) begin : PER_RAST_READ
    for (int t = 0; t < kNumRastTiles; ++t) begin
        if (ret_from_cache_valid[t]) begin
            // {old_pix, new_pix} = (old_pix << 16) | new_pix
            per_rast_row[t] <= (per_rast_row[t] << $bits(CachePixel)) | ret_data_from_cache[t >> $clog2(kNumClientsPerTextureCache)];
        end
    end     
end : PER_RAST_READ

Here per_rast_row[t] is the per-rasterizer vector of 32 texels that is eventually passed to scanout to be used as the colour data. ret_from_cache_valid[t] is asserted if the cache is returning valid texels for client t, and ret_data_from_cache is the per-cache returned texel data.

HDMI Scanout

HDMI processing is largely unchanged from previous GPUs. There are three clock domains: a 25MHz pixel clock, a 250MHz TMDS clock, and a clock that runs at the GPU system clock. That final one is responsible for interfacing between scanout and the rest of the GPU.

The interface block is the most “interesting” bit. It signals the fake rasterizers to start requesting texels when

(pixelX == gHdmiWidth - 7) && (pixelY >= gHdmiHeight-2 || pixelY < gHdmiHeightVisible-2)

This is because to request texels for line N, I want to get them from the cache during line N-1, which means kicking everything off at the end of line N-2. Here, gHdmiHeight is 525 lines, and so I want to start when line == 523 or when line < 480-2.

The interface block also signals the fake rasterizers to latch or shift out the fetched texels, shift register style. The latch happens when

(pixelX == gHdmiWidth - 7) && (pixelY == gHdmiHeight-1 || pixelY <= gHdmiHeightVisible-2)

To process texels for line N, I want to latch them during the previous line (N-1). So for 525 lines, latch scanline 0’s texels at the end of line 524, wrap around, and then end with latching scanline 479’s texels at the end of line 478. Once the row of pixels is latched, the interface block can shift out one texel per clock, and add it to a very small CDC FIFO. This data is then consumed by the 250MHz block, which pulls the final colour to be scanned out from the shared FIFO.

Results

Alternate demo mascot: rirakkuma
My desk is a chaotic mess

A fun historical note

This post was published on 2021/06/27. Last night I was banging my head against a wall (figuratively) trying to figure out the final RTL bug in the texturing demo. I woke up, having apparently dreamt the solution, ran to my computer and confirmed the fix. Finally being able to relax, I checked Facebook, where I was greeted with this amazingly timed memory from exactly one year ago today!

Exactly one year ago today I finished my last faked demo

So yeah, see you on 2022/06/27 for a faked depth buffer demo, I guess?

Special Thanks

As always, thanks to @domipheus, @CarterBen, @Laxer3A, and @sparsevoxel for keeping me inspired with their absolutely mental projects. Thanks to @blackjero for instilling in me the importance of debugging graphics by actually putting things on screen instead of stuffing numbers into buffers.

And super mega ultra thanks to @mikeev, @tom_forsyth, @rygorous, and everyone else participating in the twitter FPGA threads. It’s an honour being able to ask such accomplished professionals so many stupid questions and not get laughed out of the room.


Overview

Dokukon (独身コンピュター), also known as Hitokon (一人コンピューター), Bocchikon, and any other name that implies the opposite of ファミコン, is an 18-bit retro console mainly made in a weekend as part of a 48 hour console jam. It tries to stick to clock speeds that would have been plausible in the 16-bit era, with a few notable exceptions described in later sections.

The general design philosophy can be summarised as: I hate CPUs, and want to do everything I can to minimise CPU involvement in everything. The audio system uses command buffers that can play complex songs just by writing a single MMIO register. The scatter/gather pattern DMA is designed to transfer data to and from different source and destination patterns to avoid CPU loops. The start address of palette index and colour writes encodes a pattern so that every CPU write updates the destination address to the next location in the pattern.

Parts Used

  • Arty A7 100T board
  • HDMI PMOD
  • AMP3 Audio PMOD
  • Famicom Controller

The Arty can be purchased from Digilent here. The AMP3 PMOD is also from Digilent, and is sold here. Finally, the Famicom controller is just a regular Famicom controller with the connector cut off, and wired to PMOD-pluggable pins. See below for more info.

Scanout

This is the biggest departure from the 16-bit era clock speed goal. The screen is fixed at 640×480 @ 60Hz, for which the HDMI spec requires a 25.175MHz pixel clock and a 251.75MHz TMDS clock. The HDMI block is responsible for signalling the GPU both when new scanlines start, and N clocks before a scanline starts, where N is the latency of HDMI => GPU messages. The GPU then adds colours to a pixel colour FIFO, which scanout then reads from and displays.

Audio

Audio output is fixed at 16 bits per channel, two channel stereo, and a 24,000Hz sample rate. This is driven by a 768kHz bit clock, and an MCLK of 6.144MHz that drives the audio logic. Technically it could be clocked much lower, but good luck getting a PLL/MMCM to generate a clock in the kilohertz range.

Audio is programmed via GPU-like command buffers stored in audio RAM. There are per-channel CPU-accessible MMIO registers responsible for setting the command buffer address, pausing and unpausing the audio, and setting volume, with more registers currently being added. The currently available commands are

  1. square wave: Play a note from C3 to B5, for a length of time between a whole note and sixteenth note, including dotted
  2. set BPM: set the channel BPM. Supported range is 50 to 177
  3. rest: insert a rest. Supports all note lengths supported by square waves
  4. set label: write an initialisation value to one of the four labels
  5. conditional jump: takes a label number, condition, decrement flag, and jump offset. The command looks at label [0..3] and tests it for a condition, optionally decrementing the label. This can be used for conditional jumps, unconditional jumps, and repeating a section of music a number of times

Future support is planned for setting note decay and fadeout, PCM data, triangle waves, and noise.

Synchronisation and loops/repeats are achieved through labels. To synchronise two channels, you can have one wait on a label that the other one writes. To implement loops and repeats, use a label write to set the repeat count, and then use a conditional jump with label decrement.
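The loop/repeat scheme can be illustrated with a toy interpreter (Python; the command encoding here is entirely made up for illustration — the real commands are packed words in audio RAM):

```python
def run(commands):
    """Toy model of the label scheme: set a label to a repeat count, then
    conditionally jump while decrementing it."""
    labels = [0] * 4          # the four labels described above
    played, pc = [], 0
    while pc < len(commands):
        op, *args = commands[pc]
        if op == 'note':                      # stand-in for a square wave command
            played.append(args[0])
        elif op == 'set_label':               # write an init value to a label
            labels[args[0]] = args[1]
        elif op == 'jnz_dec':                 # jump if label != 0, then decrement
            n, offset = args
            if labels[n] != 0:
                labels[n] -= 1
                pc += offset
                continue
        pc += 1
    return played

# repeat a two-note section three times (initial pass plus two jumps)
song = [('set_label', 0, 2), ('note', 'C3'), ('note', 'E3'), ('jnz_dec', 0, -2)]
assert run(song) == ['C3', 'E3'] * 3
```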

GPU

The GPU is designed around sprites and tile graphics, with a max of 128 sprites, of which 64 are displayable per scanline. The tile size is 32×32, and so a 640×480 screen becomes 20×15=300 background tiles.

State Machines

The GPU is divided into three state machines: background tile processing (BG), sprite search, and foreground sprite processing (FG).

GPU processing starts when HDMI signals the GPU to start a new scanline, which runs the BG processing and sprite search state machines in parallel. Sprite search is pretty simple. It looks at all 128 sprites for the first 64 that intersect the current scanline, and adds them to a list. BG processing is a bit more involved. It looks at each pixel in a scanline and works out what BG tile it lives in. It then looks up that tile’s index in GPU RAM, and uses that index to fetch graphics data for the pixel from graphics tile ROM. The graphics tile data is then used as an index into a colour palette to get the background pixel colour. Lastly, it determines the final colour by pulling from the FG sprite FIFO to see if there is a sprite overlapping this pixel and whether it’s transparent. This final colour is added to an HDMI FIFO for scanout to consume.

BG and FG both need to share a read port for graphics tile ROM, and so they can’t run at the same time. So when BG processing reads the final pixel’s data from the ROM, it then signals FG sprite processing to begin. FG processing does a parallel reduction of the 64 intersecting sprites found by the sprite search pass, and finds the first one to intersect the current pixel. It then looks up that sprite’s pixel in graphics tile ROM to get its palette index, and to see if it’s transparent. Note that FG must always run one scanline ahead of BG so that BG processing has the expected data in its FG FIFO.

The GPU is also responsible for signalling the CPU when a new scanline is started, when BG processing finishes, and when FG processing finishes. The CPU can then poll on an MMIO status register to wait for a specific part of a scanline to finish. This is useful for when you only want to change tile indices, and need to wait for BG processing but not FG processing to finish for the current scanline. Or maybe you want to change sprites for the next scanline, and therefore need to wait for FG processing to finish. There is also logic in there to signal the CPU as early as possible, to make up for the round trip latency of the CDC FIFO used to signal the CPU, and the time it takes for a CPU register write to be seen by the GPU.

Palettes

Currently there are 8 background palettes with 8 colours each. The 32×32 background tiles are grouped into 64×64 supertiles, where each tile in a supertile uses the same palette. This means a screen becomes 10×8=80 supertiles, which is a lot of data for the CPU to update. To make it a bit more convenient to update sparse indices without the CPU constantly changing the destination address, I encode the address in a similar way to what I use for DMA.

address = dest index (7b), dest X inc (3b), dest Y inc (2b), writes per row (4b)

Just like DMA, every time a new palette number is written, the destination address is automatically changed according to the pattern. And just like the DMA encoding, this allows you to update a linear array of indices, a sparse block of indices, a row of indices, or a column.
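Assuming the fields are packed MSB-to-LSB in the order listed (my assumption; only the field order and widths come from the text), decoding the start address looks like:

```python
def decode_palette_addr(addr16):
    """Unpack {dest index (7b), dest X inc (3b), dest Y inc (2b),
    writes per row (4b)} from a 16-bit start address."""
    writes_per_row = addr16 & 0xF
    dest_y_inc = (addr16 >> 4) & 0x3
    dest_x_inc = (addr16 >> 6) & 0x7
    dest_index = (addr16 >> 9) & 0x7F
    return dest_index, dest_x_inc, dest_y_inc, writes_per_row

# index 3, X inc 2, Y inc 1, two writes per row
assert decode_palette_addr(0b0000011_010_01_0010) == (3, 2, 1, 2)
```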

There is a similar trick when updating the colours in the palettes. The start address encodes the start palette, the start colour within the palette, the number of colours per palette to update, and the colour increment. So you could start at palette 2, and update colours 1, 4, and 7 in all subsequent palettes without the CPU ever changing the address.

CPU

The CPU runs at 10MHz, and has a three stage pipeline with no hazards. It reads 16-bit instructions from a 4096 deep instruction memory ROM bank.

ISA

The CPU utilises a RISK ISA. This is not a typo or an acronym, I just think you’re taking a big RISK if you expect it to correctly execute instructions. It currently has the following instructions

  • add/sub
  • shift
  • bitwise: contains a 4 bit function field that specifies AND, OR, XOR, NOT
  • popcnt
  • mov 8-bit literal low
  • mov reg
  • insert 8-bit literal high
  • load
  • pop
  • pack
  • unpack
  • jump
  • branch if zero
  • branch if not zero
  • branch if carry
  • branch if negative
  • call
  • return
  • store
  • push
  • compare: compare register with register, or register with literal
  • nop

Fun facts:

All branch and jump instructions have two forms: one where the instruction encodes a literal offset, and another where it encodes register number that holds the offset. The instruction after any PC-modifying instruction is considered a branch delay slot.

Mov 8-bit literal low is usually used in conjunction with insert 8-bit literal high to build 16-bit literals. The low mov instruction will write 8 bits to the destination’s LSB8, and zero out the MSB8. Insert leaves the LSB8 as it is, and inserts 8 bits to the MSB8.
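A quick Python model of the pair (function names are mine):

```python
def mov_lit_low(lit8):
    """mov 8-bit literal low: write the literal to LSB8, zero the MSB8."""
    return lit8 & 0xFF

def insert_lit_high(reg, lit8):
    """insert 8-bit literal high: keep LSB8 as-is, insert the literal into MSB8."""
    return ((lit8 & 0xFF) << 8) | (reg & 0xFF)

# build the 16-bit literal 0x1234 in two instructions
r = mov_lit_low(0x34)
r = insert_lit_high(r, 0x12)
assert r == 0x1234
```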

And then there is pack/unpack. I wanted a fast way to extract some bits from a field in a register, update the value, and insert the new value back in, but there weren’t enough bits available for the pack. The initial solution was to have unpack set an internal insert mask that pack then uses when reinserting the bits. This mask persists as long as you don’t do another unpack, and so it can be used for multiple packs in a row. I am not 100% in love with this, and am currently considering other solutions such as implicit pairs of aligned registers to save bits for this instruction.

Memory System

All ROMs and RAMs are grouped into the appropriate sounding but vaguely named Memory System. It is also home to the CPU-accessible MMIO registers used to program other bits of the hardware. The CPU accesses these via load and store instructions to four apertures.

CPU accessible apertures

  • MMIO: RW, aperture for MMIO registers
  • RAM: RW, aperture for CPU accessible RAM
  • ROM: RO, aperture for reading from ROM
  • tile indices: RW, currently unused, but eventually shares a GPU RAM write port with DMA

So for example, a write to MMIO might look like this

mov r0, kMmioRegGpuSetSpriteNum
mov r1, #0
store r1, r0, kApertureMmio, #0

The 16 bits of address and 2 bits of aperture form an 18-bit address. The memory map then becomes:

aper offs CR CW GR GW DR DW 
0x00_0000  1  1  0  0  0  0 <= start of RAM
0x00_FFFF  1  1  0  0  0  0 <= end of RAM
0x01_0000  1  0  0  0  1  0 <= start of general ROM data
0x01_FFFF  1  0  0  0  1  0 <= end of general ROM data
0x02_0000  1  1  1  1  0  0 <= start of MMIO regs
0x02_FFFF  1  1  1  1  0  0 <= end of MMIO regs
0x03_0000  0  0  1  0  0  0 <= start of gfx tile data ROM
0x03_FFFF  0  0  1  0  0  0 <= end of gfx tile data ROM (131,072 bytes, 341 tiles, 1.14 screens)
(memory map doesn't include internal GPU RAM)

RAMs and ROMs

The memory system encompasses the following RAM and ROM banks:

  • CPU RAM: 16-bit RAM accessible to the CPU
  • CPU ROM: 2048 deep ROM of 18 bit halfwords. CPU can only access the LSB16 of each halfword
  • GPU RAM: 1024 deep RAM of 18 bit halfwords. Each halfword stores two 9-bit indices into tile ROM
  • GPU graphics tile ROM: 65,536 deep ROM of 18-bit halfwords

The sizes need a bit of clarification. A graphics tile is a 32×32 pixel block of image data, where each pixel is a 3-bit palette index. So a single 18-bit halfword stores palette indices for six pixels. Storing one row of a 32×32 tile therefore requires six halfwords (with four unused index slots in the last one), and a whole tile is 32×6 = 192 halfwords. A 640×480 screen is covered by 20×15=300 tiles, and so one screen worth of data would require 57,600 halfwords. Graphics tile ROM is 65,536 deep, so we can store about 1.14 screens worth of tiles. This isn’t great, and I plan to improve it, but for the time being do plan on reusing graphics tiles a lot.
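The budget above can be re-derived in a few lines (Python; constant names are mine):

```python
# Re-deriving the tile ROM budget from the figures in the text.
PIXELS_PER_HALFWORD = 6                 # six 3-bit palette indices in 18 bits
TILE_DIM = 32
halfwords_per_row = -(-TILE_DIM // PIXELS_PER_HALFWORD)   # ceil(32/6) = 6
halfwords_per_tile = TILE_DIM * halfwords_per_row         # 32 * 6 = 192
tiles_per_screen = 20 * 15                                # 300
rom_depth = 65536

assert halfwords_per_tile == 192
assert rom_depth // halfwords_per_tile == 341             # whole tiles in ROM
screens = rom_depth / (halfwords_per_tile * tiles_per_screen)
assert abs(screens - 1.14) < 0.005                        # about 1.14 screens
```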

GPU RAM needs to store 9-bit indices for the 300 background tiles. This works out to 150 halfwords, and yet the RAM is 1024 deep. This is because the smallest BRAM primitive on the board is 18kbit. I should probably look into finding other uses for the unused memory.

Finally, CPU ROM has a bit of a dual nature. It’s 18-bits wide, but the only client that can access the whole 18 bits is DMA. Only the LSB16 of each halfword is visible to the CPU. I wish I had a great excuse for why this is, but the jam was only 48 hours long and I jumped into making the CPU before realising everything else in the system should be 18 bits.

Gather/Scatter Pattern DMA

DMA is how background graphics tile indices are transferred from ROM to the GPU, and is clocked with the GPU clock. There are three MMIO registers used to program and kick off DMA transfers:

  • DMA config reg 0
    • src start addr: 12 bits, source start, [0..4095] tiles
    • num Y tiles: 4 bits, number of tiles to transfer in the Y direction, [1..15]
  • DMA config reg 1
    • dest start addr: 9 bits, destination start, [0..299] tiles
    • num X tiles: 5 bits, number of tiles to transfer in the X direction, [1..20]
    • src X inc MSB2: 2 bits, the MSB2 of the source X increment
  • DMA config reg 2
    • src X inc LSB3: 3 bits, source X direction increment, [0..20] tiles
    • src Y inc: 4 bits, source Y direction increment, [1..15] tiles
    • dest X inc: 5 bits, destination X direction increment, [1..20] tiles
    • dest Y inc: 4 bits, destination Y direction increment, [1..15] tiles

Writing to that final register is what initiates the DMA transfer. This can be used to do a few useful things. First, you can copy data linearly, or copy a packed block of indices to a packed area in GPU RAM. You can also copy a packed block of indices to a sparse area in GPU RAM, and vice versa. Finally you can splat a single source index into either a packed or sparse area in GPU RAM. Here is an example from the SDK katakana sample

Don’t feel bad, even I can’t remember how my own DMA engine works

Finally, there is a 16-deep DMA request queue. This was supposed to be a quick hack to get some of the functionality of DMA chains. However, the DMA status MMIO registers aren’t currently hooked up, so be careful not to overflow the FIFO. I will have to fix this eventually.
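My reading of the register fields above suggests address generation roughly like the following. This is a speculative Python model, not the RTL — in particular, applying the Y increments relative to the start address is my assumption:

```python
def dma_addresses(src_start, dst_start, num_x, num_y,
                  src_x_inc, src_y_inc, dst_x_inc, dst_y_inc):
    """Yield (src, dst) tile-index pairs for one gather/scatter transfer."""
    for y in range(num_y):
        for x in range(num_x):
            src = src_start + x * src_x_inc + y * src_y_inc
            dst = dst_start + x * dst_x_inc + y * dst_y_inc
            yield src, dst

# Packed-to-packed linear copy of a 3x2 block: X increments of 1,
# Y increments equal to the row width.
pairs = list(dma_addresses(0, 0, num_x=3, num_y=2,
                           src_x_inc=1, src_y_inc=3,
                           dst_x_inc=1, dst_y_inc=3))
```

Setting src_x_inc to 0 (the only increment whose range starts at 0) would splat a single source index, and mismatched source/destination increments give the packed-to-sparse copies described above.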

Controller

The controller was “designed” to feel “familiar” to retro gamers. It has a d-pad, two buttons, plus start and select.

This controller somehow feels familiar to retro gamers

The latest controller data is sampled at roughly 625KHz. It’s probably safe to clock it a bit higher, but the Japanese datasheet for the shift register is not very clear on the maximum clock speed when voltage is 3.3v, so I kept it under 1MHz just to be safe. The CPU can query the latest controller data at any time via an MMIO register. The Vivado constraints for PMOD port JD are set up as the following

Dev Environment and IDE

Current system requirements are

  • high tolerance for crap
  • low expectations
  • Windows10 64 bit

Because you can’t make games without a proper IDE, I decided to write one using internet snippets of C#, a garbage language with no redeeming value. The theme I am going for is making people long for the days of PS2 CodeWarrior, if you know what I mean. Currently you can build code and data, verify and connect to devkits, and upload executables and data to them. Eventually I’ll add sound and graphics editors.

IDE tab for editing and assembling code
IDE tab for devkit management

Host Target Comms

Host target communication is done over UART, with 8 data bits, 2 stop bits, and no parity. Parody, on the other hand, is enabled for this entire project. The host sends the target update files, consisting of

  • global update header
  • section header (18 bit one-hot)
  • section checksum (18 bit XOR)
  • num 18 bit halfwords in the data payload
  • data payload (18 bits x num_halfwords)
  • additional sections (section header + checksum + payload)
  • global update end (18’b0)

The first four section headers (0001, 0010, 0100, 1000) correspond to the four flashable ROMs: code, data, graphics tile, and audio. A write to any of these ROMs will trigger a devkit reset, and light up the corresponding colour LED on the Arty. Yellow means a particular ROM will be updated, green means that ROM’s update succeeded, and red means the ROM data failed the checksum test.
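A section of an update file might be framed like this. This is a hypothetical Python sketch — only the 18-bit XOR checksum and the header/checksum/length/payload layout follow the list above; the function names are mine:

```python
MASK18 = (1 << 18) - 1

def xor_checksum(halfwords):
    """18-bit XOR checksum over the data payload."""
    csum = 0
    for hw in halfwords:
        csum ^= hw & MASK18
    return csum

def build_section(section_header, payload):
    """Frame one section: one-hot header, checksum, length, then payload."""
    return [section_header, xor_checksum(payload), len(payload)] + list(payload)

# Section header 4'b0001 targets the code ROM.
section = build_section(0b0001, [0x3FFFF, 0x00001])
```

On the target side, recomputing the XOR over the received payload and comparing against the checksum halfword decides between the green and red LEDs.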

Other section headers include confirm (send a message to the host when the IDE 確認 button is pressed), LED on (light up some LED pattern), and eventually debugger messages should I ever decide to try and make one.

Special Thanks

Extra special thanks to Rupert Renard, Mike Evans, and the rest of hardware twitter for being so generous with your advice and patient with my stupid questions. Thanks to Colin Riley for the expertly crafted HDMI PMOD that made HDMI possible. Also shout out to Ben Carter (yes, that Ben Carter. The Super Famicom ray tracing one) and Laxer3A for working on what is arguably two of the coolest projects out there, and inspiring me to keep pushing.

FAQ

why 18 bits?  Two reasons. First, with ECC disabled, a block RAM byte is 9 bits. This seemed pretty useful at the time, so I just designed everything around that and the numbers worked out really well. Second and most important, Jake Kazdal thinks he’s so cool for calling his company 17bit, and I just felt like I needed to one-up him.

Console X does it this way. Why don’t you? Because it’s more fun to try and make up your own weird ways to do things without copying others. Sometimes you come up with the same solution others did, and sometimes you make something worse. All that matters is you have fun in the process!

How do I become a registered developer?  I dunno, sign an NDA and pay a developer fee to The City Of Elderly Love animal shelter?

Is everything rigorously verified/validated, and confirmed working?  LOLZ no. I do have lots of SVA asserts, though, if that makes you feel better.


table of contents

terms and acronyms

rasterizer overview

architecture overview

SDN

AKB and NMBs

SKE

ELT


terms and acronyms

This post discusses the blocks used in building up, launching, and executing shader work. Because shaders can’t (yet) access memory in a general purpose way, the focus is on pixel shaders rather than compute. The rasterizers will be covered in depth in another post. Before getting started, the following table summarises some of the more commonly used terms and acronyms used throughout this post. Tooltips, shown in bold, are also added at the beginning of each section to make looking up acronyms easier.

Term  Meaning                          Notes
Nami  vector of 8 pixels/lanes         This is also the ALU width
SDN   Setup Data for Nami              No relation to SDN48
AKB   ALU Kernel Block                 No relation to AKB48
NMB   Nami Management Block            No relation to NMB48
SKE   Shader Kernel Executor           No relation to SKE48
STU   Shader Texture Unit              No relation to STU48
ELT   Export [Lit | Luminous] Texels   No relation to Every Little Thing
Figure 1: Terms and acronyms used in this post

rasterizer overview

Figure 2: Twenty rasterizers cover a total of 640×32 pixels

Each rasterizer covers a 32 by 32 pixel screen tile. With a fixed screen resolution of 640×480, the twenty rasterizers together operate on a 640×32 row of screen tiles. Before pixel shaders were added, each rasterizer had its own bank in the render cache, and would output the new blended colour (using the previous colour, the new per-vertex barycentric interpolated colour, and a programmable blending function) and GPU metadata to it. These banks are double buffered so that the rasterizers render to bank N while HDMI scans out the data in bank N^1. Think of it like racing the beam, but instead of a single scanline, you have the time HDMI takes to scan out a tile row. This works out to be

row = 640 + 16 (front porch) + 48 (back porch) + 96 (sync) = 800
total pixels = 800 * 32 = 25,600
25,600 pixels @ 25MHz HDMI clk = 204,800 ticks at 200MHz gfx clk

So if a rasterizer takes 3072 clocks to process a triangle for a 32×32 tile, this works out to 67 triangles per rasterizer per tile row per frame. And since hardware manufacturers like to maximise stats using unlikely scenarios, if every screen tile row has its own set of triangles, then the rasterizers process a maximum of 60,300 triangles per second, and output pixels at a rate of 1.24 Gpix/sec.
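The per-tile-row budget above can be written out as a quick sketch (numbers from the text; note 204,800 / 3072 is really about 66.7, which the figure of 67 rounds up):

```python
# HDMI scanout time for one 32-line tile row, converted to gfx clocks.
H_ACTIVE, H_FP, H_BP, H_SYNC = 640, 16, 48, 96
row_px = H_ACTIVE + H_FP + H_BP + H_SYNC                # 800 pixels per line
tile_row_px = row_px * 32                               # 25,600 pixels per tile row
gfx_ticks = tile_row_px * 200_000_000 // 25_000_000     # 204,800 ticks at 200 MHz
triangles_per_tile_row = round(gfx_ticks / 3072)        # ~67 at 3072 clocks/triangle
```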

Things to eventually fix: The rasterizers initially worked on pixel pairs, and processed all three triangle edges in a single clock. However this used up too many DSP slices and made it hard to meet timing. I need to re-investigate whether I can use what I learned about meeting timing this last year to hit the initial target of one triangle taking 512 clocks rather than 3072.

Due to a current limitation with the ISA, the rasterizers are still outputting colour rather than barycentric weights. The pixel cache address and new per-vertex barycentric interpolated colour are output to the SDN for use in shaders. Unfortunately the previous colour isn’t currently available due to losing the ordering guarantee that came from the rasterizers directly writing the cache.

In the previous design, all twenty rasterizers had to stay in sync, meaning you could cull triangles based on vertical position but not horizontal. With the new shader system, this limitation has been removed, and so horizontal culling should eventually be added.


architecture overview

As shown in Figure 2, the twenty rasterizers are grouped into ten pairs of two. Each rasterizer pair outputs pixels to an SDN, which packs the pixels into an eight-lane vector called a Nami. Once a full Nami is built up, or a hang-preventing timer expires, the Nami is then sent to the AKB and assigned to one of four NMBs. Each clock, the AKB selects the next NMB, receives a Nami from it, and sends that Nami to the SKE for execution. Finally, export instructions send the final colour and cache address to the ELT for export to the shader cache.


SDN

The SDN takes in pixels from a pair of rasterizers and packs them into an eight-lane Nami. There was quite a bit of experimentation done with respect to the number of rasterizers that share an SDN. Larger numbers (4+) can build up a full Nami faster, but introduce quite a bit of logic complexity in the number of possible input and overflow cases. Two rasterizers were settled on because there is only one overflow case.

Figure 3 below shows a few of the possible cases the SDN has to handle. In the first case (upper left) we start with a Nami that has seven out of eight slots full. We get a valid pixel from rasterizer A and no pixel from rasterizer B. Pixel A is placed into the final slot in the Nami, and the completed Nami is sent to the AKB. The SDN then starts over with an empty Nami.

Figure 3: a few possible SDN input and overflow cases. Red X indicates an already filled slot

The second case (upper right) shows the opposite. A valid pixel is received from rasterizer B but not from A. B’s pixel completes the Nami, it’s sent to the AKB, and the SDN starts over again with an empty Nami.

In the third case (lower left), we get valid pixels from both rasterizers A and B, but this time the Nami has two unfilled slots. Both pixels A and B are placed in the Nami, it’s sent to the AKB, and the SDN starts over again with an empty Nami.

The final case shows the one possible overflow scenario. Two valid pixels are received but the Nami only has one available slot. Rasterizer A’s pixel completes the Nami, which is sent to the AKB, but then the SDN starts over with a one pixel Nami, containing pixel B. Note in all of the above cases, a full Nami is sent. It’s possible for there not to be enough pixels to form a full Nami. To prevent hangs, a 32 clock timer is reset on every new pixel. If the timer goes down to zero, a partial Nami will be sent.
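All four cases reduce to the same fill-count bookkeeping, which can be modelled in a few lines. This is a simplified Python sketch (it tracks only slot counts, not the actual pixel data or the timer):

```python
NAMI_SLOTS = 8

def sdn_pack(filled, pixel_a, pixel_b):
    """Model one SDN clock.

    filled: number of occupied Nami slots; pixel_a/pixel_b: valid flags
    from the two rasterizers. Returns (sent_full_nami, new_fill_count).
    """
    total = filled + int(pixel_a) + int(pixel_b)
    if total >= NAMI_SLOTS:
        # Completed Nami is sent to the AKB; at most one overflow pixel
        # (the single overflow case) starts the next Nami.
        return True, total - NAMI_SLOTS
    return False, total
```

With two rasterizers the leftover after sending is at most one pixel, which is exactly why the pair-per-SDN arrangement keeps the overflow logic down to a single case.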

Notes on rearchitecting: because the rasterizers currently can process one triangle edge per clock, or one triangle every three clocks, the SDN relies on this to reduce logic. If the rasterizers are rewritten to support one triangle per clock, the SDN needs to be redone as well.

Things to eventually fix: to simplify the cache design, each rasterizer works on screen pixels R + 20N. So rasterizer 0 processes pixels 0, 20, 40, 60, 80, etc., and rasterizer 1 processes pixels 1, 21, 41, 61, 81, etc. Because of this, more partial Nami end up being sent than would be with a quad pattern. This was a simplification made when the project started to reduce muxes and help meet timing, and should definitely get fixed before texturing is added.


AKB and NMBs

The AKB has four functions:

  1. take in Nami from an SDN, and assign them to one of the four NMBs
  2. each clock, select the next NMB via round robin, receive a Nami from it, and pass the Nami on to the SKE for execution
  3. receive a Nami’s updated dynamic data (new active lane mask, status flags, program counter, retire status, etc) from the SKE, and route it back to the proper NMB
  4. when all NMBs are full, send a backpressure signal to the SDN, which clock gates the SDN and rasterizer until Nami slots are freed up
Figure 4: an AKB connects to an SDN, and contains the interesting shader bits

The AKB takes as input Nami from an SDN, and must distribute them to the next non-full NMB. Because of the way NMBs are selected for execution via round robin, it’s preferable to also add new Nami via round robin as well, to help saturate the ALU. The AKB keeps track of the last NMB to receive a new Nami, looks at the full flags of each NMB, and determines which NMB the next new Nami will be sent to.

Because pipeline latency requires at least four clocks between consecutive executions of the same Nami, the AKB round robins the four NMBs with no ability to skip empty NMBs. In the case where an empty NMB is selected, a null Nami is passed to the AKB to avoid side effects. Once the next NMB is selected and its Nami is received, the AKB appends the NMB number and Nami ID, and passes the data to the SKE for execution. The NMB number and Nami ID are used to route the updated dynamic data returned from the SKE back to the right place.
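The fixed round robin might look like the following. This is a hypothetical Python sketch — modelling each NMB as a simple queue is my simplification (the real NMB selects among its four Nami slots more fairly):

```python
NUM_NMBS = 4

def next_issue(clock, nmbs):
    """Fixed round robin with no skipping of empty NMBs.

    nmbs: list of per-NMB Nami queues. Returns (nmb_number, nami);
    an empty NMB yields a null Nami (None) with no side effects.
    """
    sel = clock % NUM_NMBS
    if nmbs[sel]:
        # Tag with the NMB number so the SKE's writeback can be routed home.
        return sel, nmbs[sel].pop(0)
    return sel, None
```

The modulo-by-four selection is what guarantees four clocks between consecutive executions of the same Nami, at the cost of wasted slots when an NMB is empty.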

The NMBs are fairly simple. They store the execution state, consisting of constant and dynamic data, for up to four Nami. They also choose the next Nami to be presented to the AKB for execution, maintain status flags like is_full, and handle the add/retire management logic. Dynamic data consists of things that can change during execution, like the valid lane mask, status flags, program counter, and retire status. The const data is the previous colour for a pixel, the new barycentric interpolated colour, and the address in the shader cache. Because the Nami in the NMB can be sparse, the NMB remembers the last Nami sent for execution, and tries to fairly select a different one to run next time. This is mainly to help hide some of the latency once texturing is implemented.

Figure 5: simplified view of NMB functionality

In the above image, the NMB has two valid Nami (slots 0 and 2) indicated by the valid mask V. The generated add mask, A, is 4’b0010, meaning the incoming new Nami will be added to slot 1. R is the retire mask coming from the AKB/SKE, and in this case means the Nami in slot 2 is retiring. Each clock, the new valid mask is calculated as V = (V | A) & ~R, which both adds a new Nami and retires any Nami that had their export request granted. The full signal is calculated as the bitwise AND reduction of the valid mask.
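The mask update can be checked against the Figure 5 example with a two-line sketch (assuming, as I read the figure, that R is active high for retiring slots):

```python
def update_valid(v, a, r, width=4):
    """Add new Nami (A) and clear retired ones (R, active high)."""
    v = (v | a) & ~r & ((1 << width) - 1)
    full = v == (1 << width) - 1   # AND reduction of the valid mask
    return v, full

# Figure 5: slots 0 and 2 valid, add to slot 1, retire slot 2.
v, full = update_valid(0b0101, 0b0010, 0b0100)
```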

Notes on rearchitecting: The AKB and NMBs take a few clocks to update state, and therefore rely on an SDN not being able to provide more than one Nami every four clocks. If the SDN bottleneck is removed, it will require a significant update to the AKB.

Things to eventually fix: right now, every assigned Nami is always ready to run. There is no ability to skip a Nami that is waiting on a texture fetch or an i-cache miss. This is definitely going to need fixing as texturing is added, and if compute shaders requiring general DDR3 memory access are added.


SKE

SKE, shown as a red box in Figure 4, stands for Shader Kernel Executor. It contains the execution pipeline, register file, i-cache, and ALU used in executing Nami.

Each clock, the AKB gets a Nami from the next NMB and passes it to the SKE. In the case where an NMB sends the same Nami back to back, four clocks isn’t enough time for the updated dynamic data from the first execution to reach the NMB, and so a forwarding path is needed. The ALU sends the updated dynamic data and Nami ID both to the originating NMB and over the forwarding path to the SKE input. If the current Nami ID is equal to the forwarded Nami ID, then the Nami is being executed twice in a row, and the forwarded data must be used to avoid out-of-date dynamic data. This forwarding path is shown both in Figure 4 and in the following waveform.

Figure 6: single Nami data forwarding case

The above waveform shows the case where only one Nami exists and it’s assigned to NMB0. The waveform starts at (A) when fetch_nami goes high, meaning the AKB has selected NMB0 to send a Nami. One clock later (B), valid_nami_from_akb goes high when the SKE receives the Nami and it enters the execution pipeline. Four clocks after fetch_nami originally went high, it goes high again (C), and NMB0 sends the same Nami as last time. At this point, NMB0 still hasn’t received the updated dynamic data from the first execution of the Nami, and so it sends stale data. The SKE detects this case, and asserts use_forwarded_data (D). Finally, at (E) the updated dynamic data from the first execution arrives at NMB0.

Handling of the export instruction is different than that of other instructions. It’s currently the only instruction that can fail and require a retry. The ELT has a small Nami FIFO for holding export requests, and if this FIFO is full an export request can’t be granted. Retry is implemented simply by not updating the program counter if an export request is not granted.

A successful export grant relies on the data forwarding mechanism outlined above. Export functions as an end of shader signal, and will retire the Nami in the NMB, but as shown in Figure 6 it’s possible for the NMB to issue the Nami for execution one more time before seeing the retire signal from the SKE. It’s for this reason that the dynamic data contains a successful export grant signal, which will insert a NOP if forwarded data is used.

The register file consists of two banks of eight registers each. Instructions that take two input operands must take one from bank A and the other from bank B. This is to make it easier to achieve two register reads per clock, since reading asynchronous distributed RAM on posedge and negedge clk makes it difficult to meet timing. Instruction output operand encoding can specify bank A, bank B, or both. The registers themselves are eight lanes wide, the same width as a Nami, and each lane is 16 bits. The total register file is 16 (registers/Nami) x 4 (Nami/NMB) x 4 (NMBs/AKB) x 8 (lanes) x 16 (bits/lane) = 32,768 bits per AKB. Despite the size, using BRAM would be wasteful as native BRAM primitives don’t support read widths of 128 bits, and can’t provide the byte write enable bits needed to implement lane masking, meaning an 18k BRAM per two lanes would have to be used, which leaves each BRAM only used at 25% capacity. The current implementation uses distributed RAM, but this has drawbacks as well, as it requires high LUT usage for multiplexing reads.
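The sizing argument can be written out explicitly (all numbers from the paragraph above):

```python
# Register file dimensions per AKB.
REGS_PER_NAMI = 16     # two banks of eight registers
NAMI_PER_NMB = 4
NMBS_PER_AKB = 4
LANES = 8              # one Nami = eight lanes
BITS_PER_LANE = 16

entries = REGS_PER_NAMI * NAMI_PER_NMB * NMBS_PER_AKB   # 256 addressable registers
read_width = LANES * BITS_PER_LANE                      # 128-bit reads needed
bits_per_akb = entries * read_width                     # 32,768 bits per AKB
```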

Notes on rearchitecting: because Nami do not currently have the ability to wait, the way the i-cache is being used is more like a separate program memory. It will go back to being used as an actual cache once the new DDR3 controller is done and work implementing texturing begins.


ELT

ELT stands for Export (Lit | Luminous) Texels, and is the block responsible for exporting to the cache. It places incoming Nami data from the SKE into a Nami FIFO, and then for each pixel in the Nami, adds its colour/metadata/cache address to one of two pixel FIFOs for consumption by the shader cache. The reason there are two pixel FIFOs is that two rasterizers connect to a single SDN / AKB, and therefore a Nami can contain a mix of pixels to be written to either cache bank in the pair. The image below shows an example of this. Letters A through H represent Nami data, and [0,1] specifies which cache in the pair to write the colour to.

Figure 7: various FIFOs inside the export block

When the Nami FIFO is not empty, the ELT takes the next entry and loops over each pixel, distributing it to one of the two pixel FIFOs, depending on the cache pair number. In the above example, pixel data A, D, and G are routed to pixel FIFO 0, to be consumed by cache bank 2A+0. Confusingly, the A in the bank number is unrelated to pixel data A, but rather stands for AKB number, such that AKB0 connects to cache banks 0 and 1, AKB1 connects to cache banks 2 and 3, and so on.

Because a cache bank can have multiple clients, it’s possible to add pixels to the ELT pixel FIFO faster than the cache consumes them. This case is shown above, where pixel H should be added to pixel FIFO 1, but there is no room. In this case, a backpressure signal is asserted, and the ELT will retry adding the pixel on the next clock.
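The drain loop with backpressure might be modelled like this. A hypothetical Python sketch — the FIFO depth and function shape are my assumptions; only the two-FIFO split by bank and the retry-on-full behaviour come from the text:

```python
from collections import deque

PIXEL_FIFO_DEPTH = 4   # depth is an assumption, not the real value

def elt_drain(nami_pixels, fifo0, fifo1):
    """Distribute one Nami's pixels into the two per-bank pixel FIFOs.

    Each pixel is (data, bank) with bank 0 or 1. Returns the pixels that
    were stalled by backpressure, to be retried on a later clock.
    """
    for i, (data, bank) in enumerate(nami_pixels):
        fifo = fifo0 if bank == 0 else fifo1
        if len(fifo) >= PIXEL_FIFO_DEPTH:
            return nami_pixels[i:]   # backpressure: retry the rest later
        fifo.append(data)
    return []
```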

The image below shows the first ever pixel shader running on hardware, displaying in a small PIP window. It’s not pretty or fancy or advanced, and the shader was only around 7 instructions, but admittedly it felt pretty amazing actually seeing the GPU saturate and having it not mess up.

special thanks

Ben Carter (@CarterBen) for reading the whole thing, even though it was a mess and made little sense

Colin Riley (@domipheus) for both inspiring me with his CPU project, and for sending me the HDMI PMOD that he made