terms and acronyms
This post discusses the blocks used in building up, launching, and executing shader work. Because shaders can’t (yet) access memory in a general purpose way, the focus is on pixel shaders rather than compute. The rasterizers will be covered in depth in another post. Before getting started, the following table summarises some of the more commonly used terms and acronyms in this post. Tooltips are also added at the beginning of each section to make looking up acronyms easier.
| Nami | vector of 8 pixels/lanes | This is also the ALU width |
| SDN | Setup Data for Nami | No relation to SDN48 |
| AKB | ALU Kernel Block | No relation to AKB48 |
| NMB | Nami Management Block | No relation to NMB48 |
| SKE | Shader Kernel Executor | No relation to SKE48 |
| STU | Shader Texture Unit | No relation to STU48 |
| ELT | Export [Lit \| Luminous] Texels | No relation to Every Little Thing |
Each rasterizer covers a 32 by 32 pixel screen tile. With a fixed screen resolution of 640×480, the twenty rasterizers together operate on a 640×32 row of screen tiles. Before pixel shaders were added, each rasterizer had its own bank in the render cache, and would output the new blended colour (using the previous colour, the new per-vertex barycentric interpolated colour, and a programmable blending function) and GPU metadata to it. These banks are double buffered so that the rasterizers render to bank N while HDMI scans out the data in bank N^1. Think of it like racing the beam, but instead of a single scanline, you have the time HDMI takes to scan out a tile row. This works out to be
row = 640 + 16 (front porch) + 48 (back porch) + 96 (sync) = 800 total pixels
800 * 32 = 25,600 pixels
25,600 pixels @ 25MHz HDMI clk = 204,800 ticks at 200MHz gfx clk
So if a rasterizer takes 3072 clocks to process a triangle for a 32×32 tile, this works out to roughly 67 triangles (204,800 / 3072 ≈ 66.7) per rasterizer per tile row per frame. And since hardware manufacturers like to maximise stats using unlikely scenarios, if every screen tile row has its own set of triangles, then the rasterizers process a maximum of 60,300 triangles per second (67 × 15 tile rows × 60 frames per second), and output pixels at a rate of 1.24 Gpix/sec.
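The arithmetic above can be checked with a few lines of Python. This is only a sketch: the 60 Hz refresh rate and the way the pixel rate is derived (triangles × 1024 pixels per tile × 20 rasterizers) are assumptions on my part, and the constant names are made up for illustration.

```python
# Timing budget for one 32-pixel-tall tile row, using the figures quoted above.
ACTIVE_PIXELS = 640
FRONT_PORCH = 16
BACK_PORCH = 48
SYNC = 96
TILE_SIZE = 32                    # tiles are 32 by 32 pixels
HDMI_CLK_HZ = 25_000_000
GFX_CLK_HZ = 200_000_000
CLOCKS_PER_TRIANGLE = 3072        # per rasterizer, per 32x32 tile
TILE_ROWS = 480 // TILE_SIZE      # 15 tile rows on a 640x480 screen
FRAMES_PER_SECOND = 60            # assumption: nominal 640x480 refresh rate
RASTERIZERS = 20

line_pixels = ACTIVE_PIXELS + FRONT_PORCH + BACK_PORCH + SYNC
tile_row_pixels = line_pixels * TILE_SIZE
gfx_ticks = tile_row_pixels * GFX_CLK_HZ // HDMI_CLK_HZ

# The text rounds 66.67 up to 67; strictly only 66 triangles fully fit.
triangles_per_tile_row = round(gfx_ticks / CLOCKS_PER_TRIANGLE)
triangles_per_second = triangles_per_tile_row * TILE_ROWS * FRAMES_PER_SECOND

# Assumption: every triangle touches every pixel of every rasterizer's tile.
pixels_per_second = triangles_per_second * TILE_SIZE * TILE_SIZE * RASTERIZERS

print(line_pixels, tile_row_pixels, gfx_ticks)
print(triangles_per_tile_row, triangles_per_second, pixels_per_second)
```

Under these assumptions the pixel rate comes out a little over 1.23 billion pixels per second, which is in line with the roughly 1.24 Gpix/sec quoted above.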
Things to eventually fix: The rasterizers initially worked on pixel pairs, and processed all three triangle edges in a single clock. However this used up too many DSP slices and made it hard to meet timing. I need to re-investigate whether I can use what I learned about meeting timing this last year to hit the initial target of one triangle taking 512 clocks rather than 3072.
Due to a current limitation with the ISA, the rasterizers are still outputting colour rather than barycentric weights. The pixel cache address and new per-vertex barycentric interpolated colour are output to the SDN for use in shaders. Unfortunately the previous colour isn’t currently available due to losing the ordering guarantee that came from the rasterizers directly writing the cache.
In the previous design, all twenty rasterizers had to stay in sync, meaning you could cull triangles based on vertical position but not horizontal. With the new shader system, this limitation has been removed, and so horizontal culling should eventually be added.
As shown in Figure 2, the twenty rasterizers are grouped into ten pairs of two. Each rasterizer pair outputs pixels to an SDN, which packs the pixels into an eight lane vector called a Nami. Once a full Nami is built up, or a hang-preventing timer expires, the Nami is then sent to the AKB and assigned to one of four NMBs. Each clock, the AKB selects the next NMB, receives a Nami from it, and sends that Nami to the SKE for execution. Finally, export instructions will send the final colour and cache address to the ELT for export to the shader cache.
The SDN takes in pixels from a pair of rasterizers, and packs them into an eight lane Nami. There was quite a bit of experimentation done with respect to the number of rasterizers that share an SDN. Larger numbers (4+) can build up a full Nami faster, but can introduce quite a bit of logic complexity in the number of possible input cases and overflow cases. Two rasterizers was settled on because the number of overflow cases is only one.
Figure 3 below shows a few of the possible cases the SDN has to handle. In the first case (upper left) we start with a Nami that has seven out of eight slots full. We get a valid pixel from rasterizer A and no pixel from rasterizer B. Pixel A is placed into the final slot in the Nami, and the completed Nami is sent to the AKB. The SDN then starts over with an empty Nami.
The second case (upper right) shows the opposite. A valid pixel is received from rasterizer B but not from A. B’s pixel completes the Nami, it’s sent to the AKB, and the SDN starts over again with an empty Nami.
In the third case (lower left), we get valid pixels from both rasterizers A and B, but this time the Nami has two unfilled slots. Both pixels A and B are placed in the Nami, it’s sent to the AKB, and the SDN starts over again with an empty Nami.
The final case shows the one possible overflow scenario. Two valid pixels are received but the Nami only has one available slot. Rasterizer A’s pixel completes the Nami, which is sent to the AKB, but then the SDN starts over with a one pixel Nami, containing pixel B. Note in all of the above cases, a full Nami is sent. It’s possible for there not to be enough pixels to form a full Nami. To prevent hangs, a 32 clock timer is reset on every new pixel. If the timer goes down to zero, a partial Nami will be sent.
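The four cases and the timeout can be captured in a small behavioural model. This is a minimal sketch rather than the actual RTL: the class and method names are hypothetical, and it assumes the timer only counts down while at least one pixel is being held.

```python
NAMI_WIDTH = 8        # lanes per Nami
TIMEOUT_CLOCKS = 32   # hang-prevention timer, reset on every new pixel


class SdnPacker:
    """Behavioural model of an SDN packing pixels from rasterizers A and B."""

    def __init__(self):
        self.slots = []               # pixels collected so far
        self.timer = TIMEOUT_CLOCKS

    def clock(self, pixel_a=None, pixel_b=None):
        """Advance one clock. Returns the Nami sent this clock, or None."""
        incoming = [p for p in (pixel_a, pixel_b) if p is not None]
        if incoming:
            self.timer = TIMEOUT_CLOCKS       # reset on every new pixel
        elif self.slots:
            self.timer -= 1                   # assumption: only counts while holding pixels

        combined = self.slots + incoming
        if len(combined) >= NAMI_WIDTH:
            # Full Nami. At most one pixel (B) can overflow into the next one.
            sent = combined[:NAMI_WIDTH]
            self.slots = combined[NAMI_WIDTH:]
            return sent

        self.slots = combined
        if self.slots and self.timer == 0:
            # Timer expired: send a partial Nami to avoid a hang.
            sent, self.slots = self.slots, []
            self.timer = TIMEOUT_CLOCKS
            return sent
        return None


packer = SdnPacker()
for i in range(7):
    packer.clock(pixel_a=i)                   # seven of eight slots filled
full = packer.clock(pixel_a="A", pixel_b="B") # overflow case
print(full, packer.slots)
```

With only two inputs, the single slice `combined[NAMI_WIDTH:]` can never hold more than one pixel, which is the one overflow case described above.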
Notes on rearchitecting: because the rasterizers currently can process one triangle edge per clock, or one pixel every three clocks, the SDN relies on this to reduce logic. If the rasterizers are rewritten to support one pixel per clock, the SDN needs to be redone as well.
Things to eventually fix: to simplify the cache design, each rasterizer works on screen pixel R + 20N. So rasterizer 0 would process pixels 0, 20, 40, 60, 80, etc. Rasterizer 1 would process pixels 1, 21, 41, 61, 81, etc. Because of this, more partial Nami end up getting sent than would be if a quad pattern was used. This was a simplification made when the project started to reduce muxes and help meet timing, and should definitely get fixed before texturing is added.
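The R + 20N interleave is easy to express in code. The sketch below assumes pixels are numbered linearly and that rasterizers 2k and 2k+1 share SDN k; that pairing and the function names are my assumptions, not something stated above.

```python
NUM_RASTERIZERS = 20
RASTERIZERS_PER_SDN = 2   # assumption: rasterizers 2k and 2k+1 feed SDN k


def pixels_for_rasterizer(r, count):
    """First `count` pixel indices handled by rasterizer r (pixel = R + 20N)."""
    return [r + NUM_RASTERIZERS * n for n in range(count)]


def rasterizer_for_pixel(pixel):
    """Which rasterizer owns a given pixel index."""
    return pixel % NUM_RASTERIZERS


def sdn_for_pixel(pixel):
    """Which SDN a pixel ends up in, under the pairing assumption above."""
    return rasterizer_for_pixel(pixel) // RASTERIZERS_PER_SDN


print(pixels_for_rasterizer(0, 5))
print(pixels_for_rasterizer(1, 5))
print([sdn_for_pixel(x) for x in range(6)])
```

Under this model only two horizontally adjacent pixels ever land in the same SDN, so a small triangle spreads its pixels thinly across many SDNs, which would be consistent with more partial Nami being sent.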
AKB and NMBs
The AKB has four functions:
- take in Nami from an SDN, and assign them to one of the four NMBs
- each clock, select the next NMB via round robin, receive a Nami from it, and pass the Nami on to the SKE for execution
- receive a Nami’s updated dynamic data (new active lane mask, status flags, program counter, retire status, etc) from the SKE, and route it back to the proper NMB
- when all NMBs are full, send a backpressure signal to the SDN, which clock gates the SDN and rasterizer until Nami slots are freed up
The AKB takes as input Nami from an SDN, and must distribute them to the next non-full NMB. Because of the way NMBs are selected for execution via round robin, it’s preferable to also add new Nami via round robin as well, to help saturate the ALU. The AKB keeps track of the last NMB to receive a new Nami, looks at the full flags of each NMB, and determines which NMB the next new Nami will be sent to.
Because pipeline latency requires at least four clocks between consecutive executions of the same Nami, the AKB round robins the four NMBs with no ability to skip empty NMBs. In the case where an empty NMB is selected, a null Nami is passed to the AKB to avoid side effects. Once the next NMB is selected, and its Nami is received, the AKB appends the NMB number and Nami ID, and passes the data to the SKE for execution. The NMB number and Nami ID are used to route the updated dynamic data returned from the SKE back to the right place.
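The two round robins (one for adding new Nami, one for execution) can be sketched as follows. The function names are hypothetical and this models behaviour only, not the hardware implementation.

```python
NUM_NMBS = 4


def next_nmb_for_add(last_added, full_flags):
    """Pick the NMB that receives the next new Nami.

    Searches round robin starting after the last NMB that received one,
    skipping full NMBs. Returns None when every NMB is full, which is the
    case where backpressure would be asserted to the SDN.
    """
    for offset in range(1, NUM_NMBS + 1):
        candidate = (last_added + offset) % NUM_NMBS
        if not full_flags[candidate]:
            return candidate
    return None


def execution_order(start, clocks):
    """NMB selected for execution on each clock. Empty NMBs are never
    skipped, so the same NMB is revisited exactly every four clocks."""
    return [(start + i) % NUM_NMBS for i in range(clocks)]


print(next_nmb_for_add(0, [False, False, False, False]))
print(next_nmb_for_add(1, [False, True, True, True]))
print(execution_order(0, 6))
```

The asymmetry is the point: adding a Nami may skip full NMBs, while execution uses a fixed rotation so the four clock spacing between executions of the same Nami is always respected.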
The NMBs are fairly simple. They store the execution state, consisting of constant and dynamic data, for up to four Nami. They also choose the next Nami to be presented to the AKB for execution, maintain status flags like is_full, and handle the add/retire management logic. Dynamic data consists of things that can change during execution, like the valid lane mask, status flags, program counter, and retire status. The const data is the previous colour for a pixel, the new barycentric interpolated colour, and the address in the shader cache. Because the Nami in the NMB can be sparse, the NMB remembers the last Nami sent for execution, and tries to fairly select a different one to run next time. This is mainly to help hide some of the latency once texturing is implemented.
In the above image, the NMB has two valid Nami (slots 0 and 2) indicated by the valid mask V. The generated add mask, A, is 4’b0010, meaning the incoming new Nami will be added to slot 1. R is the retire mask coming from the AKB/SKE, and in this case means the Nami in slot 2 is retiring. Each clock, the new valid mask is calculated as V = (V | A) & ~R, which both adds a new Nami and retires any Nami that had their export request granted. With the values above, V goes from 4’b0101 to 4’b0011. The full signal is calculated as the bitwise AND reduction of the valid mask.
Notes on rearchitecting: The AKB and NMBs take a few clocks to update state, and therefore rely on an SDN not being able to provide more than one Nami every four clocks. If the SDN bottleneck is removed, it will require significant updates to the AKB.
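The mask update can be modelled with plain integer bit operations. This sketch assumes that a set bit in R clears the matching slot and that the add mask picks the lowest free slot, which matches the example but may not be the exact policy used in hardware. The function names are hypothetical.

```python
NUM_SLOTS = 4
SLOT_MASK = (1 << NUM_SLOTS) - 1   # 0b1111


def add_mask(valid):
    """One-hot mask of the lowest free slot, or 0 when the NMB is full."""
    free = ~valid & SLOT_MASK
    return free & -free            # isolate the lowest set bit


def next_valid(valid, add, retire):
    """Add the incoming Nami and clear any retiring slots in one step."""
    return (valid | add) & ~retire & SLOT_MASK


def is_full(valid):
    """AND reduction of the valid mask: true only if every slot is valid."""
    return valid == SLOT_MASK


v = 0b0101                         # slots 0 and 2 valid
a = add_mask(v)                    # slot 1 is the lowest free slot
r = 0b0100                         # slot 2 is retiring
print(f"{a:04b} {next_valid(v, a, r):04b} {is_full(v)}")
```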
Things to eventually fix: right now, every assigned Nami is always ready to run. There is no ability to skip a Nami that is waiting on a texture fetch or an i-cache miss. This is definitely going to need fixing as texturing is added, and if compute shaders requiring general DDR3 memory access are added.
SKE, shown as a red box in Figure 4, stands for Shader Kernel Executor. It contains the execution pipeline, register file, i-cache, and ALU used in executing Nami.
Each clock, the AKB gets a Nami from the next NMB, and passes it to the SKE. In the case where an NMB sends the same Nami back to back, four clocks isn’t enough time for the updated dynamic data from the first execution to reach the NMB, and so a forwarding path is needed. The ALU sends the updated dynamic data and Nami ID both to the originating NMB, and also over the forwarding path to the SKE input. If the current Nami ID is equal to the forwarded Nami ID, then the Nami is being executed twice in a row, and forwarded data must be used to avoid out of date dynamic data. This forwarding path is shown both in Figure 4 and in the following waveform.
The above waveform shows the case where only one Nami exists and it’s assigned to NMB0. The waveform starts at (A) when fetch_nami goes high, meaning the AKB has selected NMB0 to send a Nami. One clock later (B), valid_nami_from_akb goes high when the SKE receives the Nami and it enters the execution pipeline. Four clocks after fetch_nami originally went high, it goes high again (C), and NMB0 sends the same Nami as last time. At this point, NMB0 still hasn’t received the updated dynamic data from the first execution of the Nami, and so it sends stale data. The SKE detects this case, and asserts use_forwarded_data (D). Finally, at (E) the updated dynamic data from the first execution arrives at NMB0.
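The forwarding decision boils down to an ID comparison at the SKE input. Here is a minimal sketch, with hypothetical type and field names, and with the Nami ID modelled as an (NMB number, slot) pair, which is an assumption:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DynamicData:
    pc: int
    lane_mask: int


@dataclass
class NamiIssue:
    nami_id: Tuple[int, int]   # assumption: (NMB number, slot within the NMB)
    dynamic: DynamicData


def select_dynamic(issued: NamiIssue, forwarded: Optional[NamiIssue]):
    """Return (dynamic data to execute with, use_forwarded_data flag).

    If the Nami entering the SKE is the same one that just left the ALU,
    the copy sent by the NMB is stale and the forwarded copy wins.
    """
    if forwarded is not None and forwarded.nami_id == issued.nami_id:
        return forwarded.dynamic, True
    return issued.dynamic, False


stale = NamiIssue((0, 0), DynamicData(pc=4, lane_mask=0xFF))
fresh = NamiIssue((0, 0), DynamicData(pc=5, lane_mask=0xFF))
data, use_forwarded_data = select_dynamic(stale, fresh)
print(data.pc, use_forwarded_data)
```

In the single Nami case from the waveform, every issue after the first would take the forwarded path, since NMB0 keeps re-sending the same Nami before its update has arrived.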
Handling of the export instruction is different than that of other instructions. It’s currently the only instruction that can fail and require a retry. The ELT has a small Nami FIFO for holding export requests, and if this FIFO is full an export request can’t be granted. Retry is implemented simply by not updating the program counter if an export request is not granted.
A successful export grant relies on the data forwarding mechanism outlined above. Export functions as an end of shader signal, and will retire the Nami in the NMB, but as shown in Figure 6 it’s possible for the NMB to issue the Nami for execution one more time before seeing the retire signal from the SKE. It’s for this reason that the dynamic data contains a successful export grant signal, which will insert a NOP if forwarded data is used.
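The retry and NOP behaviour can be sketched in a few lines. This assumes the program counter simply advances by one on a granted export, and the function names are hypothetical.

```python
def step_export(pc, elt_fifo_has_room):
    """Execute an export instruction once.

    Returns (next_pc, granted). When the ELT's Nami FIFO is full the
    request is not granted and the PC is left alone, so the same export
    instruction is simply executed again on the Nami's next turn.
    """
    granted = elt_fifo_has_room
    next_pc = pc + 1 if granted else pc   # assumption: PC advances by one
    return next_pc, granted


def should_insert_nop(use_forwarded_data, forwarded_export_granted):
    """A Nami re-issued before the NMB saw its retire must not run again."""
    return use_forwarded_data and forwarded_export_granted


print(step_export(6, elt_fifo_has_room=False))
print(step_export(6, elt_fifo_has_room=True))
print(should_insert_nop(True, True))
```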
The register file consists of two banks of eight registers each. Instructions that take two input operands must take one from bank A and the other from bank B. This is to make it easier to achieve two register reads per clock, since reading asynchronous distributed RAM on posedge and negedge clk makes it difficult to meet timing. Instruction output operand encoding can specify bank A, bank B, or both. The registers themselves are eight lanes wide, the same width as a Nami, and each lane is 16 bits. The total register file is 16 (registers/Nami) x 4 (Nami/NMB) x 4 (NMBs/AKB) x 8 (lanes) x 16 (bits/lane) = 32,768 bits per AKB. Despite the size, using BRAM would be wasteful as native BRAM primitives don’t support read widths of 128 bits, and can’t provide the byte write enable bits needed to implement lane masking, meaning an 18k BRAM per two lanes would have to be used, which leaves each BRAM only used at 25% capacity. The current implementation uses distributed RAM, but this has drawbacks as well, as it requires high LUT usage for multiplexing reads.
Notes on rearchitecting: because Nami do not currently have the ability to wait, the way the i-cache is being used is more like a separate program memory. It will go back to being used as an actual cache once the new DDR3 controller is done, and work implementing texturing begins.
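The sizing and the bank rule can be written down directly. The constants come from the paragraph above; the function name and the "A"/"B" bank labels as strings are hypothetical.

```python
REGISTERS_PER_NAMI = 16   # two banks of eight
NAMI_PER_NMB = 4
NMBS_PER_AKB = 4
LANES = 8
BITS_PER_LANE = 16

register_width_bits = LANES * BITS_PER_LANE
register_file_bits = (REGISTERS_PER_NAMI * NAMI_PER_NMB * NMBS_PER_AKB
                      * register_width_bits)


def operands_legal(src0_bank, src1_bank):
    """Two-operand instructions must read one source from each bank."""
    return {src0_bank, src1_bank} == {"A", "B"}


print(register_width_bits, register_file_bits)
print(operands_legal("A", "B"), operands_legal("A", "A"))
```

The 128 bit register width is the read width that the paragraph above says native BRAM primitives can't provide.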
ELT stands for Export (Lit | Luminous) Texels, and is the block responsible for exporting Nami to the cache. It places incoming Nami data from the SKE into a Nami FIFO, and then for each pixel in the Nami, adds its colour/metadata/cache address to one of two pixel FIFOs for consumption by the shader cache. The reason there are two pixel FIFOs is that two rasterizers connect to a single SDN/AKB, and therefore a Nami can contain a mix of pixels to be written to either cache bank in the pair. The image below shows an example of this. Letters A through H represent Nami data, and [0,1] specifies which cache in the pair to write the colour to.
When the Nami FIFO is not empty, the ELT takes the next entry and loops over each pixel, distributing it to one of the two pixel FIFOs, depending on the cache pair number. In the above example, pixel data A, D, and G are routed to pixel FIFO 0, to be consumed by cache bank 2A+0. Confusingly, the A in the bank number is unrelated to pixel data A, but rather stands for AKB number, such that AKB0 connects to cache banks 0 and 1, AKB1 connects to cache banks 2 and 3, and so on.
Because a cache bank can have multiple clients, it’s possible to add pixels to the ELT pixel FIFO faster than the cache consumes them. This case is shown above, where pixel H should be added to pixel FIFO 1, but there is no room. In this case, a backpressure signal is asserted, and the ELT will retry adding the pixel next clock.
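The ELT's distribution loop and retry can be modelled roughly as below. This is a behavioural sketch under several assumptions: one pixel is moved per clock, the pixel FIFO depth of four is invented for the example, the class and method names are hypothetical, and the bank assignments for pixels B, C, E, and F are guessed since only A, D, G, and H are described in the text.

```python
from collections import deque


def cache_bank(akb_number, pair_index):
    """Cache bank 2A+0 or 2A+1, where A is the AKB number."""
    return 2 * akb_number + pair_index


class Elt:
    def __init__(self, pixel_fifo_depth=4):   # depth is an assumption
        self.nami_fifo = deque()
        self.pixel_fifos = (deque(), deque())
        self.depth = pixel_fifo_depth
        self.current = deque()                # pixels left in the Nami being drained

    def push_nami(self, nami):
        """nami is a list of (pixel data, pair index 0 or 1) tuples."""
        self.nami_fifo.append(list(nami))

    def clock(self):
        """Try to move one pixel. Returns True if backpressure is asserted."""
        if not self.current:
            if not self.nami_fifo:
                return False
            self.current = deque(self.nami_fifo.popleft())
        data, pair_index = self.current[0]
        fifo = self.pixel_fifos[pair_index]
        if len(fifo) >= self.depth:
            return True                       # no room: retry the same pixel next clock
        fifo.append(data)
        self.current.popleft()
        return False


elt = Elt()
elt.push_nami([("A", 0), ("B", 1), ("C", 1), ("D", 0),
               ("E", 1), ("F", 1), ("G", 0), ("H", 1)])
stalls = [elt.clock() for _ in range(8)]
print(list(elt.pixel_fifos[0]), list(elt.pixel_fifos[1]), stalls[-1])
```

Once the cache drains an entry from pixel FIFO 1, the next call to `clock()` places H and the ELT moves on to the next Nami in its FIFO.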
The image below shows the first ever pixel shader running on hardware, displaying in a small PIP window. It’s not pretty or fancy or advanced, and the shader was only around 7 instructions, but admittedly it felt pretty amazing actually seeing the GPU saturate and having it not mess up.
Ben Carter (@CarterBen) for reading the whole thing, even though it was a mess and made little sense
Colin Riley (@domipheus) for both inspiring me with his CPU project, and for sending me the HDMI PMOD that he made