This project (and therefore also this blog post) are very much in-progress, and are being actively updated as I add new features.


I am not a professional RTL engineer, so please don’t look to me as an example of how to make things properly. If you ask me why I did something a particular way, 9/10 times the answer will be because I don’t know any better. To help you better understand, I am including a visualisation below

So if the design is neither good nor novel, why make a blog post? Well, please allow me to answer the question I just pretended you asked. You don’t have to be abnormally smart or professionally experienced to have fun designing your own game hardware. It’s my hope that someone on twitter will see an average normal unremarkable human making game hardware, and decide they want to try too.  Believe me, if I can do this then so can you, and it’s incredibly fun!

This Year’s Retro Console Jam Theme

Once a year, I take a short break from working on my 3D GPUs to try making a retro console in a weekend. It’s a nice 気分転換 from the stress of the more hardcore hobby projects, and allows me to just quickly make a thing without spending months unhealthily obsessing over the minimum number of bits needed for every single net

And like any fun jam, each year there is a theme. This year’s theme is: The year is 1993. You work for a maker of game consoles, and you’re almost ready to release your latest sprite-based system. But all is not well. You hear rumours of a new console coming out, a console rumoured to have some pretty incredible 3D capabilities. With the deadline for tapeout being Monday morning, there is no time to completely rearchitect, but you do have one weekend to try and augment the sprite system to fake 3D as much as possible. Can you do it? Can you save the company and win the generation? Can you become the Greatest Console Hero? (hint: no. No, you can’t)


In previous years, I did the normal game jam thing where you don’t sleep/eat for 48 straight hours to get as much done as possible. Last year, I did a CPU, GPU, sound chip, controls, and even a custom IDE for development and running.

This year, I decided to choose life, and limited myself to GPU only. To avoid having to do a custom CPU, I broke out my old friend the Zynq so that I could just use the hard ARM cores for the CPU. While I tried to stay period appropriate by limiting myself to what was possible in 1993, the CPU is the notable exception. Whatever, my game, my rules.

This project was “finished” in a single weekend (or two (or three (or seven (actually I had alot of fun and may still be actively be working on it right now)))). Again, my game, my rules.

GPU Overview

I wanted to do a traditional scanline-based sprite system, but with “full 3D” transformation of the sprites. Each sprite is tagged with a 3 bit field that serves as an index into an eight entry table of inverse matrices. Each scanline pixel is then tested against the sprites by reverse projecting into sprite space [-4.0 .. 4.0) where it is trivial to test if a pixel is inside the sprite, and to calculate the texture UVs. The system supports sprite scaling, X, Y, and Z rotation, as well as anything you can do with a 3×3 matrix (of which only 2×2 is ever used). Translation is a separate field and stored per sprite.

Why inverse matrices? Well, I retroactively came up with a bunch of great excuses like ease of implementation in hardware, and how easy it made texturing and texture coordinates. But really, in my heart, I am a contrarian. I made an entire GPU around ray marching just because the internet wouldn’t STFU about hardware ray tracing. Normal humans take vertices and project with matrices, so I figured I’d be weird for the sake of being weird, and do the opposite by using inverse matrices. Luckily for me, it accidentally turned out to be a great idea!

The primary downside is that it’s up to the user to calculate matrix inverses, to which I respond:

  1. Most transformations will be pure rotations, where the inverse is the transpose, or pure scale where diagonal scale factors are just the reciprocals
  2. If you don’t constantly swap matrices during rendering, there are only 8 matrices, so I’m not worried about this not scaling
  3. Most importantly, sounds like your matrix issues are your own, and not my problem 🙂

I also wanted to support multiple texture sizes to make scaling less garbage looking. All textures are square with power of two dimensions, and originally sizes 8×8 to 1024×1024 were supported. However, larger sizes are impractical, so anything larger than 64×64 is disabled via ifdef. This limits the size of the texture dimension field in the sprite data to two bits per sprite.

Hardware Blocks

Sprite Pipelines (SP)

The visible portion of a scanline is 640 pixels, and so that’s the total number of pixels that need to be tested against all sprites. Each sprite pipe is a 16 pixel wide SIMD, that loops five times per sprite to cover an 80 pixel area. Therefore it takes eight sprite pipes to cover the full scanline. Sixteen was chosen as SIMD width to allow me to hit the max sprites-per-scanline target with minimal overhead. Lower numbered sprites are higher priority, so any lane that already intersected with a sprite avoids all further hits. Currently sprites look like this

I make Very Bad Decisions when rushed

Each sprite pipe independently accepts an 80 pixel wide range of the current scanline from the scheduler. All pipes share access to the sprite registers, with the further left pipes having higher arbiter priority due to those being needed by scanout first. Each sprite is loaded, its parameters are fetched, it’s inverse matrix is looked up in the matrix table by ID, and all scanline pixels are then transformed by the inverse matrix.

There are a few optimisations probably worth mentioning. First and most obvious is that any multiplies between a matrix element and the Y coordinate can be shared between all pixels, since all pixels in a scanline have the same Y. Second, because X coordinates increment by 1, instead of having sixteen multipliers, I can only do the multiply for lane 0, and then lanes 1..15 can just do a simple addition with the matrix element adjusted as below

Finally, the sprite pipe then calculates valid flags based on whether or not the result is inside [-4.0 .. 4.0), offsets the coordinates to the [0.0 .. 8.0) range, and passes the result off to address generation.

UV And Address Generation

Texel data is either R5G6B5 with no alpha, or R5G5B5A1 with 1 bit to indicate transparency. Because the texel BRAM and cache is 32 bits wide, texels are stored as 16×2=32 bit pairs. Data is assumed to be stored as 8×8 tiles, with texture sizes larger than 8×8 having their tiles stored in Morton order. Internally the texels in an 8×8 tile can be linear or tiled. Finally a texture start offset can be specified in number of 8×8 tiles. All this allows a fun subset of tricks you can do by having textures of different sizes alias each other.

Generating addresses is trivial from the sprite pipe inputs, and can be done with multiplexing alone. Some of the address calculation for linear mode is shown below

Data comes in as s.12.11 fixed point coordinates. This can be smaller, but I’m currently just reusing the same type I am using elsewhere. The final address will be a start offset (in number of 8×8 tiles), a tile number that the pixel falls into, and the X and Y offset inside that tile. Because I am using inverse matrices, the calculation itself only depends on the coordinates within the sprite and the texture size.

For example, all transformed sprites are 8×8, but if the texture is 32×32, then four pixels should map to the same texel. And so the UV would be be fractional bits 1/4, 1/8, 1/6, etc. This is then turned into a tile number and inner tile offsets

Example: 16×16 texture with tilemode=linear

I’m also doing the obvious fetch minimisation optimisation, by marking consecutive scanline pixels that share the same texel as valid but no-fetch. Furthermore since texel data is fetched in pairs, if pixel N wants texel 2M, and pixel N+1 wants texel 2M+1, no fetch is required for the second pixel.

Finally, address generation will output a per-lane state, with valid and shared flags, a fetch flag, and a texel address to load from. Some of those flags are redundant, and can be eventually be removed.

Texel Fetch (TF)

There are eight TF blocks, one for each SP. This block is notified when it’s sprite pipe finishes outputting the lane state for a segment of the current scanline, and begins to fetch texels to one of four shared FIFOs.

Technically there are two paths: fetch from DRAM and fetch from BRAM, but in the course of chasing a synthesis bug, the DRAM path wasn’t exactly maintained. So everything in this section applies to BRAM, but is also designed in a way that will eventually make it easier to re-add the DDR path.

All eight TFs share a single texel BRAM, again with the leftmost TFs having the highest priority since their texels are needed first. Pairs of TFs share a single dual ported destination BRAM with TFs 2N using port A to write addresses 0..79, and TFs 2N+1 using port B to write addresses 80..159. This minimises the number of BRAMs while still allowing all TFs to operate somewhat independently.

Each TF looks at the lane state for pixel 80N, gets the fetch address, and requests the corresponding texel pair from the BRAM. It then shifts the lane state array right to begin fetching the data for the next pixel. If the needed texel is a member of the last fetched pair, no fetch is issue and the previously fetched texel pair is used.

TODO: background tile work isn’t done yet, so I don’t want to commit to a concrete plan, but this part of TF is also where I’d select between texels for valid sprites and some background image. Currently any pixel not covered by a sprite defaults to a background colour.

Scanout FIFO Build (SF)

Finally, a section of the blog post I can phone in with minimal effort. There is only one SF, and it reads texels from the four TF FIFOs (left to right), and adds them to a CDC FIFO read by HDMI scanout. Moving right along…

Synchronisation Between Blocks

I’ll describe current synchronisation in this section, but be aware this is temporary and has to change if I want to add my per-scanline craziness.

Scanout → SP Sync

In general, the SP is allowed to run as far ahead of scanout as it can. However, in practice, it can only get a few lines ahead due to pipeline limitations and FIFO depths. The only real sync dependency is that the SP must wait for a signal from scanout in order to start processing pixel row 0. This is to make sure the SP starts at the same time every frame.

SP → TF Sync

SP also has a dependency on TF. Because SP produces lane state that is copied to TF local state, the SP can’t produce state for line N+1 unless TF will be done consuming it’s current line N local state by the time SP finishes. Conversely, the TF can’t run until it gets its input from the SP. In practice this means that at any given time, SP is probably working on line N+1 while TF is processing line N.

All eight SPs are independent of each other, and SPn can run as long as its dependencies are met by TFn.

TF → SF Sync

SF loops over the four texel fetch BRAMs, where each BRAM is written by a pair of TFs. So if TF0 and TF1 are finished writing to BRAM 0, then SF is free to start processing that BRAM. BRAMs must be processed in order, so if BRAM 0 isn’t ready, SF can’t jump to BRAM 1. In practice, this shouldn’t happen often since all hardware favours leftmost blocks.

SF → Scanout Sync

SF adds pixels to the CDC FIFO at a faster rate than scanout consumes them, so the FIFO frequently fills up. However, SF is free to add pixels to the CDC FIFO whenever space becomes available by scanout consuming them.


Texel DMA

Screenshot of an older (but still relevant) block design

I’m going to come right out and admit this took me longer to work out than I would have liked. There were precisely zero examples anywhere, and I was thrown off by AXI DMA transfers specifying a source data address, but providing no way to set a destination address, which seems pretty essential for writing to a BRAM. I’m not sure if what I came up with is canon in the AXI extended universe, but I’ll explain what I ended up with.

First up, create a AXI Direct Memory Access in the block design editor, and connect the following interfaces, all illustrated in the above image:

  1. Connect Zynq’s M_AXI_GPU (general purpose AXI master 0) to the slave port of an AXI interconnect, and connect the interconnect’s master port to the AXI DMA’s S_AXI_LITE. This interface will be used for the CPU sending configuration commands to the DMA unit
  2. Create a second AXI interconnect. Connect it’s master port to the Zynq’s S_AXI_HP0 (high performance AXI slave 0, used for talking to DDR). Connect the new AXI interconnect’s slave ports to the DMA’s M_AXI_SG (used for fetching the buffer descriptors used for scatter gather), as well as M_AXI_MM2S and M_AXI_S2MM for DDR data
  3. Create a AXI4 Stream Data FIFO and connect it’s slave port to DMA’s M_AXIS_MM2S. This is the streaming interface that DMA uses to send the DDR-fetched data over the AXI bus to the FIFO
  4. Connect all the clocks and reset lines for the above

Or for a human-understandable visualisation of the interfaces, please see this image from fpgadeveloper.

Cool, but now what? We have a way to DMA texel data from DRAM to a FIFO, but still have no way to do BRAM writes or provide BRAM addresses. I couldn’t find any relevant/useful Xilinx IP for this, so I had to make a custom thing. cleverly called axi_dma_fifo_to_bram. One end talks to the FIFO using it’s expected AXI protocol, and the other end controls a BRAM port. It has two states, the first state looking for a 32 bit header (16 bit texel pair address and 16 bit number of texel pairs), and the second state accepting data and writing it to the BRAM.  Even if you are AXI averse, its not as bad as it sounds as the FIFO only has ready, valid, and data signals


Sprite data is written to the sprite BRAM through PS-side AXI transfers. Each transfer is 32 bits, and writes the parameters for a single sprite using XBram_WriteReg(). Despite my original BRAM being word-addressable, the BD-generated BRAM uses 32 bit AXI addressing and has a 4 bit we mask, and therefore all addresses must be 4-byte aligned byte addresses

Currently In Progress

  1. I need to do something about backgrounds. Right now, any pixel not covered by a sprite defaults to some solid background colour
  2. Current sync makes it difficult, but I need some clever way of changing things per scanline. X and Y scroll, sprite index offset, matrix table entries, texel data, or any number of things you might want to do per scanline. Whether this is interrupts, command buffers, or 480 copies of register state has yet to be determined, but I am 100% dedicated to making whatever I decide on weird
  3. I need to readd the DDR path back in for texel fetches, and make BRAM use optional


I have yet to work out how to embed videos, so instead, here is a link to a rotating scaling sprites video demonstrating scaling and 3D rotation using the SDK I made. It features the unique and non copyright-infringing Fabrizio the Sicilian Electrician, and therefore I am not expecting any large corporations to legally object. Either way, he’ll eventually be replaced with the Cort Stratton texture I use for all my other consoles.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>