The content on this site is my personal opinion and does not represent my employer.

 

Overview

Dokukon (独身コンピューター), also known as Hitokon (一人コンピューター), Bocchikon, and any other name that implies the opposite of ファミコン, is an 18-bit retro console mainly made in a weekend as part of a 48 hour console jam. It tries to stick to clock speeds that would have been plausible in the 16-bit era, with a few notable exceptions described in later sections.

The general design philosophy can be summarised as: I hate CPUs, and want to do everything I can to minimise CPU involvement in everything. The audio system uses command buffers that can play complex songs after a single MMIO register write. The gather/scatter pattern DMA is designed to transfer data between differently shaped source and destination patterns to avoid CPU copy loops. And the start address of palette index and colour writes encodes a pattern, so that every CPU write automatically advances the destination address to the next location in the pattern.

Parts Used

  • Arty A7 100T board
  • HDMI PMOD
  • AMP3 Audio PMOD
  • Famicom Controller

The Arty can be purchased from Digilent here. The AMP3 PMOD is also from Digilent, and is sold here. Finally, the Famicom controller is just a regular Famicom controller with the connector cut off, and wired to PMOD-pluggable pins. See below for more info.

Scanout

This is the biggest departure from the 16-bit era clock speed goal. The screen is fixed at 640×480 @ 60Hz, for which the HDMI spec requires a 25.175MHz pixel clock and a 251.75MHz TMDS clock. The HDMI block is responsible for signalling the GPU both when a new scanline starts, and N clocks before a scanline starts, where N is the latency of HDMI => GPU messages. The GPU then adds colours to a pixel colour FIFO, which scanout reads from and displays.

Audio

Audio output is fixed at 16 bits per channel, two-channel stereo, and a 24,000Hz sample rate. This is driven by a 768kHz bit clock, and an MCLK of 6.144MHz that drives the audio logic. Technically it could be clocked much lower, but good luck getting a PLL/MMCM to generate a clock in the kilohertz range.

Audio is programmed via GPU-like command buffers stored in audio RAM. There are per-channel CPU-accessible MMIO registers responsible for setting the command buffer address, pausing and unpausing the audio, and setting volume, with more registers currently being added. The currently available commands are:

  1. square wave: play a note from C3 to B5, for a length between a whole note and a sixteenth note, including dotted lengths
  2. set BPM: set the channel BPM. The supported range is 50 to 177
  3. rest: insert a rest. Supports all note lengths supported by square waves
  4. set label: write an initialisation value to one of the four labels
  5. conditional jump: takes a label number, condition, decrement flag, and jump offset. The command tests label [0..3] against the condition, optionally decrementing the label. This covers conditional jumps, unconditional jumps, and repeating a section of music a number of times

Future support is planned for setting note decay and fadeout, PCM data, triangle waves, and noise.

Synchronisation and loops/repeats are achieved through labels. To synchronise two channels, you can have one wait on a label that the other one writes. To implement loops and repeats, use a label write to set the repeat count, and then use a conditional jump with label decrement.
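To make the label mechanics concrete, here is a sketch of building a command buffer that plays a two-note phrase four times, written as C. The cmd() packing, the OP_*/NOTE_*/LEN_* constants, and the jump offset convention are all invented for illustration; only the command set and the label trick come from above.

#include <stdint.h>

enum { OP_SQUARE, OP_BPM, OP_REST, OP_SET_LABEL, OP_COND_JUMP };   /* invented opcodes */
enum { NOTE_C4 = 12, NOTE_E4 = 16, LEN_QUARTER = 2 };              /* invented values  */
enum { COND_NONZERO = 1, DEC = 1 };

/* placeholder packing into an 18-bit audio RAM halfword */
static uint32_t cmd(uint32_t op, uint32_t arg) { return (op << 14) | (arg & 0x3FFF); }

int build_song(uint32_t *buf) {
    int n = 0;
    buf[n++] = cmd(OP_BPM, 120);                      /* set channel BPM               */
    buf[n++] = cmd(OP_SET_LABEL, (0 << 8) | 3);       /* label 0 = 3 repeats remaining */
    int phrase = n;                                   /* phrase start                  */
    buf[n++] = cmd(OP_SQUARE, (NOTE_C4 << 4) | LEN_QUARTER);
    buf[n++] = cmd(OP_SQUARE, (NOTE_E4 << 4) | LEN_QUARTER);
    int off = phrase - n;                             /* jump back over the phrase     */
    buf[n++] = cmd(OP_COND_JUMP, (0u << 10) | (COND_NONZERO << 9) | (DEC << 8) | ((uint32_t)off & 0xFF));
    return n;  /* writing the buffer address to the channel's MMIO register starts playback */
}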

GPU

The GPU is designed around sprites and tile graphics, with a max of 128 sprites, of which 64 are displayable per scanline. The tile size is 32×32, and so a 640×480 screen becomes 20×15=300 background tiles.

State Machines

The GPU is divided into three state machines: background tile processing (BG), sprite search, and foreground sprite processing (FG).

GPU processing starts when HDMI signals the GPU to start a new scanline, which runs the BG processing and sprite search state machines in parallel. Sprite search is pretty simple: it scans all 128 sprites for the first 64 that intersect the current scanline, and adds them to a list. BG processing is a bit more involved. It looks at each pixel in a scanline and works out what BG tile it lives in. It then looks up that tile’s index in GPU RAM, and uses that index to fetch graphics data for the pixel from graphics tile ROM. The graphics tile data is then used as an index into a colour palette to get the background pixel colour. Lastly, it determines the final colour by pulling from the FG sprite FIFO to see if there is a sprite overlapping this pixel and whether it’s transparent. This final colour is added to an HDMI FIFO for scanout to consume.
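For a clearer picture of the per-pixel BG path, here is the same lookup as sequential C. The array shapes match the ROM/RAM sizes described later in the post, but the bit packing order and the supertile_palette() helper (palettes are covered below) are guesses.

#include <stdint.h>

extern uint32_t gpu_ram[1024];     /* two 9-bit tile indices per 18-bit halfword    */
extern uint32_t tile_rom[65536];   /* six 3-bit palette indices per 18-bit halfword */
extern uint16_t palettes[8][8];    /* 8 background palettes of 8 colours            */
extern int supertile_palette(int tx, int ty);  /* hypothetical: palette for a supertile */

uint16_t bg_pixel(int x, int y) {
    int tx = x / 32, ty = y / 32;                      /* which of the 20x15 tiles */
    int slot = ty * 20 + tx;
    uint32_t pair = gpu_ram[slot / 2];
    uint32_t tile = (slot & 1) ? (pair >> 9) & 0x1FF : pair & 0x1FF;
    int px = x % 32, py = y % 32;                      /* pixel within the tile    */
    uint32_t data = tile_rom[tile * 192 + py * 6 + px / 6];  /* 6 halfwords per row */
    int idx = (int)(data >> (3 * (px % 6))) & 0x7;     /* 3-bit palette index      */
    return palettes[supertile_palette(tx, ty)][idx];
}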

BG and FG both need to share a read port for graphics tile ROM, and so they can’t run at the same time. So when BG processing reads the final pixel’s data from the ROM, it then signals FG sprite processing to begin. FG processing does a parallel reduction of the 64 intersecting sprites found by the sprite search pass, and finds the first one to intersect the current pixel. It then looks up that sprite’s pixel in graphics tile ROM to get its palette index, and to see if it’s transparent. Note that FG must always run one scanline ahead of BG so that BG processing has the expected data in its FG FIFO.

The GPU is also responsible for signalling the CPU when a new scanline starts, when BG processing finishes, and when FG processing finishes. The CPU can then poll an MMIO status register to wait for a specific part of a scanline to finish. This is useful when you only want to change tile indices, and need to wait for BG processing but not FG processing to finish for the current scanline. Or maybe you want to change sprites for the next scanline, and therefore need to wait for FG processing to finish. There is also logic in there to signal the CPU as early as possible, to make up for the round-trip latency of the CDC FIFO used to signal the CPU, and the time it takes for a CPU register write to be seen by the GPU.
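In CPU terms the polling pattern is a simple spin loop. The register offset and status bits below are invented; only the flow (wait for BG done, then touch tile indices) is from the text.

#include <stdint.h>

extern volatile uint16_t mmio_regs[];     /* stand-in for the MMIO aperture */
enum { REG_GPU_STATUS = 0x08,             /* invented offset                */
       ST_SCANLINE = 1 << 0, ST_BG_DONE = 1 << 1, ST_FG_DONE = 1 << 2 };

void wait_for_bg(void) {
    while (!(mmio_regs[REG_GPU_STATUS] & ST_BG_DONE))
        ;   /* safe to update tile indices for this scanline now; sprite changes must wait for FG */
}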

Palettes

Currently there are 8 background palettes with 8 colours each. The 32×32 background tiles are grouped into 64×64 supertiles, where each tile in a supertile uses the same palette. This means a screen becomes 10×8 = 80 supertiles, which is a lot of data for the CPU to update. To make it a bit more convenient to update sparse indices without the CPU constantly changing the destination address, I encode the address in a similar way to what I use for DMA:

address = dest index (7b), dest X inc (3b), dest Y inc (2b), writes per row (4b)

Just like DMA, every time a new palette number is written, the destination address is automatically changed according to the pattern. And just like the DMA encoding, this allows you to update a linear array of indices, a sparse block of indices, a row of indices, or a column.
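Here is my reading of the pattern walk, modelled in C. The bit widths come from the encoding above and the supertile grid is 10 wide; everything else (field order, exact advance rules) is a guess.

#include <stdint.h>
#include <stdio.h>

/* pack the start-address fields: 7b dest, 3b X inc, 2b Y inc, 4b writes per row */
static uint16_t pattern_addr(int dest, int x_inc, int y_inc, int per_row) {
    return (uint16_t)((dest << 9) | (x_inc << 6) | (y_inc << 4) | per_row);
}

/* simulate where successive palette-number writes land */
static void walk(int dest, int x_inc, int y_inc, int per_row, int writes) {
    int row_start = dest, col = 0;
    for (int i = 0; i < writes; i++) {
        printf("write %d -> supertile %d\n", i, dest);
        if (++col == per_row) {          /* pattern row finished...                   */
            col = 0;
            row_start += y_inc * 10;     /* ...drop down y_inc rows (grid is 10 wide) */
            dest = row_start;
        } else {
            dest += x_inc;               /* step right within the row                 */
        }
    }
}

int main(void) {
    printf("start address = 0x%04X\n", (unsigned)pattern_addr(0, 2, 1, 5));
    walk(0, 2, 1, 5, 10);                /* a sparse block: every other column, two rows */
    return 0;
}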

There is a similar trick when updating the colours in the palettes. The start address encodes the start palette, start colour within the palette, number of colours per palette to update, and colour increment. So you could start at palette 2, and update colours 1, 4, and 7 in all subsequent palettes without the CPU ever changing the address.

CPU

The CPU runs at 10MHz, and has a three stage pipeline with no hazards. It reads 16-bit instructions from a 4096 deep instruction memory ROM bank.

ISA

The CPU utilises a RISK ISA. This is not a typo or an acronym, I just think you’re taking a big RISK if you expect it to correctly execute instructions. It currently has the following instructions:

  • add/sub
  • shift
  • bitwise: contains a 4 bit function field that specifies AND, OR, XOR, NOT
  • popcnt
  • mov 8-bit literal low
  • mov reg
  • insert 8-bit literal high
  • load
  • pop
  • pack
  • unpack
  • jump
  • branch if zero
  • branch if not zero
  • branch if carry
  • branch if negative
  • call
  • return
  • store
  • push
  • compare: compare register with register, or register with literal
  • nop

Fun facts:

All branch and jump instructions have two forms: one where the instruction encodes a literal offset, and another where it encodes a register number that holds the offset. The instruction after any PC-modifying instruction is considered a branch delay slot.

Mov 8-bit literal low is usually used in conjunction with insert 8-bit literal high to build 16-bit literals. The low mov instruction writes 8 bits to the destination’s LSB8, and zeroes out the MSB8. Insert leaves the LSB8 as it is, and inserts 8 bits into the MSB8.
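As C, the pair behaves like this (a model of the semantics, not the actual encodings):

#include <stdint.h>

static uint16_t mov_lit_low(uint8_t imm) {
    return imm;                          /* writes LSB8, zeroes MSB8 */
}

static uint16_t insert_lit_high(uint16_t reg, uint8_t imm) {
    return (uint16_t)(((uint16_t)imm << 8) | (reg & 0x00FF));   /* keeps LSB8, replaces MSB8 */
}

/* building 0xBEEF:  mov r0, #0xEF    -> r0 = 0x00EF
                     insert r0, #0xBE -> r0 = 0xBEEF */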

And then there is pack/unpack. I wanted a fast way to extract some bits from a field in a register, update the value, and insert the new value back in, but there weren’t enough bits available for the pack. The initial solution was to have unpack set an internal insert mask that pack then uses when reinserting the bits. This mask persists as long as you don’t do another unpack, and so it can be used for multiple packs in a row. I am not 100% in love with this, and am currently considering other solutions such as implicit pairs of aligned registers to save bits for this instruction.
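Modelled in C, the unpack-latches-a-mask idea looks like this. The single shared mask register is from the text; the field arguments and everything else are illustrative, and unpack must run before the first pack.

#include <stdint.h>

static uint16_t insert_mask;   /* internal state: set by unpack, reused by every following pack */

static uint16_t unpack(uint16_t src, int shift, int width) {
    insert_mask = (uint16_t)(((1u << width) - 1u) << shift);
    return (uint16_t)((src & insert_mask) >> shift);
}

static uint16_t pack(uint16_t dst, uint16_t field) {
    int shift = __builtin_ctz(insert_mask);    /* recover the position from the latched mask */
    return (uint16_t)((dst & ~insert_mask) | ((uint16_t)(field << shift) & insert_mask));
}

/* extract bits [7:4], add one, insert the result back:
   uint16_t f = unpack(r1, 4, 4);
   r1 = pack(r1, f + 1);
   insert_mask survives, so further packs can reuse it */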

Memory System

All ROMs and RAMs are grouped into the appropriate-sounding but vaguely named Memory System. It is also home to the CPU-accessible MMIO registers used to program other bits of the hardware. The CPU accesses these via load and store instructions to four apertures.

CPU accessible apertures

  • MMIO: RW, aperture for MMIO registers
  • RAM: RW, aperture for CPU accessible RAM
  • ROM: RO, aperture for reading from ROM
  • tile indices: RW, currently unused, but eventually shares a GPU RAM write port with DMA

So for example, a write to MMIO might look like this:

mov r0, kMmioRegGpuSetSpriteNum
mov r1, #0
store r1, r0, kApertureMmio, #0

The 16 bits of address and 2 bits of aperture form an 18-bit address. The memory map then becomes:

aper offs CR CW GR GW DR DW   (C = CPU, G = GPU, D = DMA; R = read, W = write)
0x00_0000  1  1  0  0  0  0 <= start of RAM
0x00_FFFF  1  1  0  0  0  0 <= end of RAM
0x01_0000  1  0  0  0  1  0 <= start of general ROM data
0x01_FFFF  1  0  0  0  1  0 <= end of general ROM data
0x02_0000  1  1  1  1  0  0 <= start of MMIO regs
0x02_FFFF  1  1  1  1  0  0 <= end of MMIO regs
0x03_0000  0  0  1  0  0  0 <= start of gfx tile data ROM
0x03_FFFF  0  0  1  0  0  0 <= end of gfx tile data ROM (131,072 bytes, 341 tiles, 1.14 screens)
(memory map doesn't include internal GPU RAM)
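Address formation for a load or store then looks like this in C; the aperture numbering is inferred from the map above.

#include <stdint.h>

enum { APER_RAM = 0, APER_ROM = 1, APER_MMIO = 2, APER_GFX = 3 };   /* inferred from the map */

static uint32_t system_addr(unsigned aper, uint16_t offset) {
    return ((uint32_t)(aper & 0x3u) << 16) | offset;   /* 2 aperture bits + 16 offset bits = 18-bit address */
}

/* system_addr(APER_MMIO, 0x0000) == 0x20000, the start of the MMIO regs */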

RAMs and ROMs

The memory system encompasses the following RAM and ROM banks:

  • CPU RAM: 16-bit RAM accessible to the CPU
  • CPU ROM: 2048 deep ROM of 18 bit halfwords. CPU can only access the LSB16 of each halfword
  • GPU RAM: 1024 deep RAM of 18 bit halfwords. Each halfword stores two 9-bit indices into tile ROM
  • GPU graphics tile ROM: 65,536 deep ROM of 18-bit halfwords

The sizes need a bit of clarification. A graphics tile is a 32×32 pixel block of image data, where each pixel is a 3-bit palette index. So a single 18-bit halfword stores palette indices for six pixels. Storing one row of a 32×32 tile therefore takes six halfwords (with the last halfword holding only two pixels, wasting four pixel slots, or 12 bits), and a whole tile is 32×6 = 192 halfwords. A 640×480 screen is covered by 20×15 = 300 tiles, so one screen’s worth of data requires 57,600 halfwords. Graphics tile ROM is 65,536 deep, so we can store about 1.14 screens worth of tiles. This isn’t great, and I plan to improve it, but for the time being do plan on reusing graphics tiles a lot.

GPU RAM needs to store 9-bit indices for the 300 background tiles. This works out to 150 halfwords, and yet the RAM is 1024 deep. This is because the smallest BRAM primitive on the board is 18kbit. I should probably look into finding other uses for the unused memory.

Finally, CPU ROM has a bit of a dual nature. It’s 18 bits wide, but the only client that can access the whole 18 bits is DMA. Only the LSB16 of each halfword is visible to the CPU. I wish I had a great excuse for why this is, but the jam was only 48 hours long and I jumped into making the CPU before realising everything else in the system should be 18 bits.

Gather/Scatter Pattern DMA

DMA is how background graphics tile indices are transferred from ROM to the GPU, and is clocked with the GPU clock. There are three MMIO registers used to program and kick off DMA transfers:

  • DMA config reg 0
    • src start addr: 12 bits, source start, [0..4095] tiles
    • num Y tiles: 4 bits, number of tiles to transfer in the Y direction, [1..15]
  • DMA config reg 1
    • dest start addr: 9 bits, destination start, [0..299] tiles
    • num X tiles: 5 bits, number of tiles to transfer in the X direction, [1..20]
    • src X inc MSB2: 2 bits, the MSB2 of the source X increment
  • DMA config reg 2
    • src X inc LSB3: 3 bits, the LSB3 of the source X increment (the full 5-bit value spans [0..20] tiles)
    • src Y inc: 4 bits, source Y direction increment, [1..15] tiles
    • dest X inc: 5 bits, destination X direction increment, [1..20] tiles
    • dest Y inc: 4 bits, destination Y direction increment, [1..15] tiles

Writing to that final register is what initiates the DMA transfer. This can be used to do a few useful things. First, you can copy data linearly, or copy a packed block of indices to a packed area in GPU RAM. You can also copy a packed block of indices to a sparse area in GPU RAM, and vice versa. Finally, you can splat a single source index into either a packed or sparse area in GPU RAM. Here is an example from the SDK katakana sample:

Don’t feel bad, even I can’t remember how my own DMA engine works
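For a rough idea of what the register dance looks like, here is a hedged C sketch (not the katakana sample). The register offsets and field positions are invented; the field widths and the write-reg-2-last rule come from the list above.

#include <stdint.h>

extern volatile uint16_t mmio[];                              /* hypothetical MMIO aperture */
enum { DMA_CFG0 = 0x20, DMA_CFG1 = 0x21, DMA_CFG2 = 0x22 };   /* invented offsets           */

/* copy a packed 4x3 block of tile indices from ROM into GPU RAM */
void dma_block(int src, int dst) {
    int src_x_inc = 1, src_y_inc = 1, dst_x_inc = 1, dst_y_inc = 1;   /* plausible, untested values */
    mmio[DMA_CFG0] = (uint16_t)((src << 4) | 3);                      /* src start (12b) | num Y (4b) */
    mmio[DMA_CFG1] = (uint16_t)((dst << 7) | (4 << 2) | (src_x_inc >> 3));   /* dest start | num X | src X inc MSB2 */
    mmio[DMA_CFG2] = (uint16_t)(((src_x_inc & 7) << 13) | (src_y_inc << 9)
                              | (dst_x_inc << 4) | dst_y_inc);        /* writing reg 2 kicks the transfer */
}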

Finally, there is a 16-deep DMA request queue. This was supposed to be a quick hack to get some of the functionality of DMA chains. However, the DMA status MMIO registers aren’t currently hooked up, so be careful not to overflow the FIFO. I will have to fix this eventually.

Controller

The controller was “designed” to feel “familiar” to retro gamers. It has a d-pad, two buttons, plus start and select.

This controller somehow feels familiar to retro gamers

The latest controller data is sampled at roughly 625kHz. It’s probably safe to clock it a bit higher, but the Japanese datasheet for the shift register is not very clear on the maximum clock speed at 3.3V, so I kept it under 1MHz just to be safe. The CPU can query the latest controller data at any time via an MMIO register. The Vivado constraints for PMOD port JD are set up as the following
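For reference, the sampler is doing the usual 4021 shift register dance: latch once, then clock out eight bits. Here it is as C bit-banging with hypothetical gpio_* helpers (the real thing is a small state machine in fabric):

#include <stdint.h>

extern void gpio_write(int pin, int level);   /* hypothetical */
extern int  gpio_read(int pin);               /* hypothetical */
enum { PIN_LATCH, PIN_CLOCK, PIN_DATA };

uint8_t read_pad(void) {
    uint8_t state = 0;
    gpio_write(PIN_LATCH, 1);                 /* snapshot button state into the shift register */
    gpio_write(PIN_LATCH, 0);
    for (int i = 0; i < 8; i++) {             /* A, B, Select, Start, Up, Down, Left, Right */
        state |= (uint8_t)(gpio_read(PIN_DATA) << i);
        gpio_write(PIN_CLOCK, 1);             /* shift the next button onto DATA */
        gpio_write(PIN_CLOCK, 0);
    }
    return state;                             /* note: the lines are active low on real hardware */
}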

Dev Environment and IDE

Current system requirements are

  • high tolerance for crap
  • low expectations
  • Windows 10, 64-bit

Because you can’t make games without a proper IDE, I decided to write one using internet snippets of C#, a garbage language with no redeeming value. The theme I am going for is making people long for the days of PS2 CodeWarrior, if you know what I mean. Currently you can build code and data, verify and connect to devkits, and upload executables and data to them. Eventually I’ll add sound and graphics editors.

IDE tab for editing and assembling code
IDE tab for devkit management

Host Target Comms

Host target communication is done over UART, with 8 data bits, 2 stop bits, and no parity. Parody, on the other hand, is enabled for this entire project. The host sends the target update files, consisting of:

  • global update header
  • section header (18 bit one-hot)
  • section checksum (18 bit XOR)
  • num 18 bit halfwords in the data payload
  • data payload (18 bits x num_halfwords)
  • additional sections (section header + checksum + payload)
  • global update end (18’b0)

The first four section headers (0001, 0010, 0100, 1000) correspond to the four flashable ROMs: code, data, graphics tile, and audio. A write to any of these ROMs will trigger a devkit reset, and light up the corresponding colour LED on the Arty. Yellow means a particular ROM will be updated, green means that ROM’s update succeeded, and red means the ROM data failed the checksum test.
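The section checksum is easy to model. A C version, assuming the XOR simply runs over the 18-bit payload halfwords:

#include <stdint.h>
#include <stddef.h>

#define MASK18 0x3FFFFu

/* XOR all payload halfwords together to get the 18-bit section checksum */
static uint32_t section_checksum(const uint32_t *payload, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum ^= payload[i] & MASK18;
    return sum;
}

/* a section is then: one-hot header, this checksum, num halfwords, payload */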

Other section headers include confirm (send a message to the host when the IDE 確認 button is pressed), LED on (light up some LED pattern), and eventually debugger messages should I ever decide to try and make one.

Special Thanks

Extra special thanks to Rupert Renard, Mike Evans, and the rest of hardware twitter for being so generous with your advice and patient with my stupid questions. Thanks to Colin Riley for the expertly crafted HDMI PMOD that made HDMI possible. Also shout out to Ben Carter (yes, that Ben Carter, the Super Famicom ray tracing one) and Laxer3A for working on what are arguably two of the coolest projects out there, and inspiring me to keep pushing.

FAQ

why 18 bits?  Two reasons. First, with ECC disabled, a block RAM byte is 9 bits. This seemed pretty useful at the time, so I just designed everything around that and the numbers worked out really well. Second and most important, Jake Kazdal thinks he’s so cool for calling his company 17bit, and I just felt like I needed to one-up him.

Console X does it this way. Why don’t you? Because it’s more fun to try and make up your own weird ways to do things without copying others. Sometimes you come up with the same solution others did, and sometimes you make something worse. All that matters is you have fun in the process!

How do I become a registered developer?  I dunno, sign an NDA and pay a developer fee to The City Of Elderly Love animal shelter?

Is everything rigorously verified/validated, and confirmed working?  LOLZ no. I do have lots of SVA asserts, though, if that makes you feel better.


table of contents

  • terms and acronyms
  • rasterizer overview
  • architecture overview
  • SDN
  • AKB and NMBs
  • SKE
  • ELT

terms and acronyms

This post discusses the blocks used in building up, launching, and executing shader work. Because shaders can’t (yet) access memory in a general purpose way, the focus is on pixel shaders rather than compute. The rasterizers will be covered in depth in another post. Before getting started, the following table summarises some of the more commonly used terms and acronyms in this post. Tooltips, shown in bold, are also added at the beginning of each section to make looking up acronyms easier.

Term   Meaning                          Notes
Nami   vector of 8 pixels/lanes         This is also the ALU width
SDN    Setup Data for Nami              No relation to SDN48
AKB    ALU Kernel Block                 No relation to AKB48
NMB    Nami Management Block            No relation to NMB48
SKE    Shader Kernel Executor           No relation to SKE48
STU    Shader Texture Unit              No relation to STU48
ELT    Export [Lit | Luminous] Texels   No relation to Every Little Thing
Figure 1: Terms and acronyms used in this post

rasterizer overview

Figure 2: Twenty rasterizers cover a total of 640×32 pixels

Each rasterizer covers a 32 by 32 pixel screen tile. With a fixed screen resolution of 640×480, the twenty rasterizers together operate on a 640×32 row of screen tiles. Before pixel shaders were added, each rasterizer had its own bank in the render cache, and would output the new blended colour (using the previous colour, the new per-vertex barycentric interpolated colour, and a programmable blending function) and GPU metadata to it. These banks are double buffered so that the rasterizers render to bank N while HDMI scans out the data in bank N^1. Think of it like racing the beam, but instead of a single scanline, you have the time HDMI takes to scan out a tile row. This works out to be

row = 640 + 16 (front porch) + 48 (back porch) + 96 (sync) = 800
total pixels = 800 * 32 = 25,600
25,600 pixels @ 25MHz HDMI clk = 204,800 ticks at 200MHz gfx clk

So if a rasterizer takes 3072 clocks to process a triangle for a 32×32 tile, this works out to 67 triangles per rasterizer per tile row per frame. And since hardware manufacturers like to maximise stats using unlikely scenarios, if every screen tile row has its own set of triangles, then the rasterizers process a maximum of 60,300 triangles per second, and output pixels at a rate of 1.24 Gpix/sec.

Things to eventually fix: The rasterizers initially worked on pixel pairs, and processed all three triangle edges in a single clock. However this used up too many DSP slices and made it hard to meet timing. I need to re-investigate whether I can use what I learned about meeting timing this last year to hit the initial target of one triangle taking 512 clocks rather than 3072.

Due to a current limitation with the ISA, the rasterizers are still outputting colour rather than barycentric weights. The pixel cache address and new per-vertex barycentric interpolated colour are output to the SDN for use in shaders. Unfortunately the previous colour isn’t currently available due to losing the ordering guarantee that came from the rasterizers directly writing the cache.

In the previous design, all twenty rasterizers had to stay in sync, meaning you could cull triangles based on vertical position but not horizontal. With the new shader system, this limitation has been removed, and so horizontal culling should eventually be added.


architecture overview

As shown in Figure 2, the twenty rasterizers are grouped into ten pairs. Each rasterizer pair outputs pixels to an SDN, which packs the pixels into an eight-lane vector called a Nami. Once a full Nami is built up, or a hang-preventing timer expires, the Nami is sent to the AKB and assigned to one of four NMBs. Each clock, the AKB selects the next NMB, receives a Nami from it, and sends that Nami to the SKE for execution. Finally, export instructions send the final colour and cache address to the ELT for export to the shader cache.


SDN

The SDN takes in pixels from a pair of rasterizers, and packs them into an eight-lane Nami. There was quite a bit of experimentation around the number of rasterizers that share an SDN. Larger numbers (4+) can build up a full Nami faster, but introduce quite a bit of logic complexity in the number of possible input and overflow cases. Two rasterizers per SDN was settled on because there is only one overflow case.

Figure 3 below shows a few of the possible cases the SDN has to handle. In the first case (upper left) we start with a Nami that has seven out of eight slots full. We get a valid pixel from rasterizer A and no pixel from rasterizer B. Pixel A is placed into the final slot in the Nami, and the completed Nami is sent to the AKB. The SDN then starts over with an empty Nami.

Figure 3: a few possible SDN input and overflow cases. Red X indicates an already filled slot

The second case (upper right) shows the opposite. A valid pixel is received from rasterizer B but not from A. B’s pixel completes the Nami, it’s sent to the AKB, and the SDN starts over again with an empty Nami.

In the third case (lower left), we get valid pixels from both rasterizers A and B, but this time the Nami has two unfilled slots. Both pixels A and B are placed in the Nami, it’s sent to the AKB, and the SDN starts over again with an empty Nami.

The final case shows the one possible overflow scenario. Two valid pixels are received but the Nami only has one available slot. Rasterizer A’s pixel completes the Nami, which is sent to the AKB, but then the SDN starts over with a one-pixel Nami containing pixel B. Note that in all of the above cases, a full Nami is sent. It’s also possible for there not to be enough pixels to form a full Nami. To prevent hangs, a 32-clock timer is reset on every new pixel; if the timer reaches zero, a partial Nami is sent.
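All of that case handling reduces to surprisingly little when written as sequential C. In hardware this is single-cycle mux logic; the names and the send_to_akb() hook are mine.

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t pix[8]; int count; } Nami;

extern void send_to_akb(Nami n);      /* hypothetical */

static Nami cur;
static int flush_timer = 32;

void sdn_clock(bool a_valid, uint32_t a, bool b_valid, uint32_t b) {
    if (a_valid) cur.pix[cur.count++] = a;
    if (cur.count == 8) { send_to_akb(cur); cur.count = 0; }   /* A may complete the Nami...          */
    if (b_valid) cur.pix[cur.count++] = b;                     /* ...and B overflows into a fresh one */
    if (cur.count == 8) { send_to_akb(cur); cur.count = 0; }
    if (a_valid || b_valid) {
        flush_timer = 32;                                      /* any new pixel resets the timer      */
    } else if (--flush_timer == 0) {
        if (cur.count > 0) { send_to_akb(cur); cur.count = 0; }   /* expiry sends a partial Nami      */
        flush_timer = 32;
    }
}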

Notes on rearchitecting: because the rasterizers currently can process one triangle edge per clock, or one triangle every three clocks, the SDN relies on this to reduce logic. If the rasterizers are rewritten to support one triangle per clock, the SDN needs to be redone as well.

Things to eventually fix: to simplify the cache design, each rasterizer works on screen pixels R + 20N. So rasterizer 0 processes pixels 0, 20, 40, 60, 80, etc., and rasterizer 1 processes pixels 1, 21, 41, 61, 81, etc. Because of this, more partial Nami end up being sent than would be with a quad pattern. This was a simplification made when the project started to reduce muxes and help meet timing, and should definitely get fixed before texturing is added.


AKB and NMBs

The AKB has four functions:

  1. take in Nami from an SDN, and assign them to one of the four NMBs
  2. each clock, select the next NMB via round robin, receive a Nami from it, and pass the Nami on to the SKE for execution
  3. receive a Nami’s updated dynamic data (new active lane mask, status flags, program counter, retire status, etc) from the SKE, and route it back to the proper NMB
  4. when all NMBs are full, send a backpressure signal to the SDN, which clock gates the SDN and rasterizer until Nami slots are freed up
Figure 4: an AKB connects to an SDN, and contains the interesting shader bits

The AKB takes as input Nami from an SDN, and must distribute them to the next non-full NMB. Because of the way NMBs are selected for execution via round robin, it’s preferable to also add new Nami via round robin as well, to help saturate the ALU. The AKB keeps track of the last NMB to receive a new Nami, looks at the full flags of each NMB, and determines which NMB the next new Nami will be sent to.

Because pipeline latency requires at least four clocks between consecutive executions of the same Nami, the AKB round robins the four NMBs with no ability to skip empty NMBs. In the case where an empty NMB is selected, a null Nami is passed to the AKB to avoid side effects. Once the next NMB is selected and its Nami is received, the AKB appends the NMB number and Nami ID, and passes the data to the SKE for execution. The NMB number and Nami ID are used to route the updated dynamic data returned from the SKE back to the right place.

The NMBs are fairly simple. They store the execution state, consisting of constant and dynamic data, for up to four Nami. They also choose the next Nami to be presented to the AKB for execution, maintain status flags like is_full, and handle the add/retire management logic. Dynamic data consists of things that can change during execution, like the valid lane mask, status flags, program counter, and retire status. The const data is the previous colour for a pixel, the new barycentric interpolated colour, and the address in the shader cache. Because the Nami in the NMB can be sparse, the NMB remembers the last Nami sent for execution, and tries to fairly select a different one to run next time. This is mainly to help hide some of the latency once texturing is implemented.

Figure 5: simplified view of NMB functionality

In the above image, the NMB has two valid Nami (slots 0 and 2), indicated by the valid mask V. The generated add mask A is 4’b0010, meaning the incoming new Nami will be added to slot 1. R is the retire mask coming from the AKB/SKE, and in this case means the Nami in slot 2 is retiring. Each clock, the new valid mask is calculated as V = (V | A) & ~R, which both adds a new Nami and retires any Nami that had its export request granted. The full signal is calculated as the AND reduction of the valid mask.
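The slot bookkeeping from Figure 5, written as C (masks are 4 bits, one per Nami slot):

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint8_t valid; } Nmb;

static void nmb_update(Nmb *n, uint8_t add_mask, uint8_t retire_mask) {
    n->valid = (uint8_t)(((n->valid | add_mask) & ~retire_mask) & 0xF);
}

static bool nmb_full(const Nmb *n) {
    return (n->valid & 0xF) == 0xF;   /* AND reduction of the valid mask */
}

/* Figure 5's example: V = 0101, A = 0010 (add to slot 1), R = 0100 (retire slot 2) -> V = 0011 */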

Notes on rearchitecting: the AKB and NMBs take a few clocks to update state, and therefore rely on an SDN not being able to provide more than one Nami every four clocks. If the SDN bottleneck is removed, it will require significant rework of the AKB.

Things to eventually fix: right now, every assigned Nami is always ready to run. There is no ability to skip a Nami that is waiting on a texture fetch or an i-cache miss. This is definitely going to need fixing as texturing is added, and if compute shaders requiring general DDR3 memory access are added.


SKE

SKE, shown as a red box in Figure 4, stands for Shader Kernel Executor. It contains the execution pipeline, register file, i-cache, and ALU used in executing Nami.

Each clock, the AKB gets a Nami from the next NMB, and passes it to the SKE. In the case where an NMB sends the same Nami back to back, four clocks isn’t enough time for the updated dynamic data from the first execution to reach the NMB, and so a forwarding path is needed. The ALU sends the updated dynamic data and Nami ID both to the originating NMB, and also over the forwarding path to the SKE input. If the current Nami ID is equal to the forwarded Nami ID, then the Nami is being executed twice in a row, and forwarded data must be used to avoid out of date dynamic data. This forwarding path is shown both in Figure 4 and in the following waveform.

Figure 6: single Nami data forwarding case

The above waveform shows the case where only one Nami exists and it’s assigned to NMB0. The waveform starts at (A) when fetch_nami goes high, meaning the AKB has selected NMB0 to send a Nami. One clock later (B), valid_nami_from_akb goes high when the SKE receives the Nami and it enters the execution pipeline. Four clocks after fetch_nami originally went high, it goes high again (C), and NMB0 sends the same Nami as last time. At this point, NMB0 still hasn’t received the updated dynamic data from the first execution of the Nami, and so it sends stale data. The SKE detects this case, and asserts use_forwarded_data (D). Finally, at (E) the updated dynamic data from the first execution arrives at NMB0.
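The forwarding select itself is just one comparator at the SKE input. As C, with illustrative types and names:

#include <stdint.h>

typedef struct { uint8_t lane_mask, flags, pc, retired; } DynData;

static uint16_t fwd_id;      /* NMB number + Nami ID of the last result leaving the ALU */
static DynData  fwd_data;    /* its updated dynamic data                                */

void alu_writeback(uint16_t nami_id, DynData d) {
    fwd_id = nami_id;        /* every result goes back to the NMB and is also latched here */
    fwd_data = d;
}

DynData select_dyn(uint16_t nami_id, DynData from_nmb) {
    /* same Nami twice in a row: the NMB copy hasn't been written back yet, use the forwarded copy */
    return (nami_id == fwd_id) ? fwd_data : from_nmb;
}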

Handling of the export instruction is different from that of other instructions. It’s currently the only instruction that can fail and require a retry. The ELT has a small Nami FIFO for holding export requests, and if this FIFO is full an export request can’t be granted. Retry is implemented simply by not updating the program counter if an export request is not granted.

A successful export grant relies on the data forwarding mechanism outlined above. Export functions as an end of shader signal, and will retire the Nami in the NMB, but as shown in Figure 6 it’s possible for the NMB to issue the Nami for execution one more time before seeing the retire signal from the SKE. It’s for this reason that the dynamic data contains a successful export grant signal, which will insert a NOP if forwarded data is used.

The register file consists of two banks of eight registers each. Instructions that take two input operands must take one from bank A and the other from bank B. This is to make it easier to achieve two register reads per clock, since reading asynchronous distributed RAM on posedge and negedge clk makes it difficult to meet timing. Instruction output operand encoding can specify bank A, bank B, or both. The registers themselves are eight lanes wide, the same width as a Nami, and each lane is 16 bits. The total register file is 16 (registers/Nami) x 4 (Nami/NMB) x 4 (NMBs/AKB) x 8 (lanes) x 16 (bits/lane) = 32,768 bits per AKB. Despite the size, using BRAM would be wasteful: native BRAM primitives don’t support read widths of 128 bits, and can’t provide the byte write enable bits needed to implement lane masking, meaning an 18k BRAM per two lanes would have to be used, leaving each BRAM at only 25% capacity. The current implementation uses distributed RAM, but this has drawbacks as well, as it requires high LUT usage for multiplexing reads.

Notes on rearchitecting: because Nami do not currently have the ability to wait, the i-cache is currently being used more like a separate program memory. It will go back to being used as an actual cache once the new DDR3 controller is done and work implementing texturing begins.


ELT

ELT stands for Export (Lit | Luminous) Texels, and is the block responsible for exporting to the cache. It places incoming Nami data from the SKE into a Nami FIFO, and then for each pixel in the Nami, adds its colour/metadata/cache address to one of two pixel FIFOs for consumption by the shader cache. The reason there are two pixel FIFOs is that two rasterizers connect to a single SDN / AKB, and therefore a Nami can contain a mix of pixels to be written to either cache bank in the pair. The image below shows an example of this. Letters A through H represent Nami data, and [0,1] specifies which cache in the pair to write the colour to.

Figure 7: various FIFOs inside the export block

When the Nami FIFO is not empty, the ELT takes the next entry and loops over each pixel, distributing it to one of the two pixel FIFOs, depending on the cache pair number. In the above example, pixel data A, D, and G are routed to pixel FIFO 0, to be consumed by cache bank 2A+0. Confusingly, the A in the bank number is unrelated to pixel data A, but rather stands for AKB number, such that AKB0 connects to cache banks 0 and 1, AKB1 connects to cache banks 2 and 3, and so on.

Because a cache bank can have multiple clients, it’s possible to add pixels to the ELT pixel FIFO faster than the cache consumes them. This case is shown above, where pixel H should be added to pixel FIFO 1, but there is no room. In this case, a backpressure signal is asserted, and the ELT will retry adding the pixel on the next clock.
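The distribution loop, as C. The FIFO API is hypothetical; in hardware the retry is simply trying again next clock.

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t colour, addr; uint8_t bank; } Pixel;   /* bank is 0 or 1 within the pair */

extern bool fifo_try_push(int fifo, Pixel p);   /* hypothetical: returns false when full */

void elt_drain(const Pixel *nami, int count) {
    for (int i = 0; i < count; i++) {
        while (!fifo_try_push(nami[i].bank, nami[i]))
            ;                                   /* backpressure: retry until the cache drains the FIFO */
    }
}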

The image below shows the first ever pixel shader running on hardware, displaying in a small PIP window. It’s not pretty or fancy or advanced, and the shader was only around 7 instructions, but admittedly it felt pretty amazing actually seeing the GPU saturate and having it not mess up.

special thanks

Ben Carter (@CarterBen) for reading the whole thing, even though it was a mess and made little sense

Colin Riley (@domipheus) for both inspiring me with his CPU project, and for sending me the HDMI PMOD that he made