Overview

The content on this site is my personal opinion and does not represent my employer.

This post represents my first attempt to implement the simplest possible texture cache that still works with my current GPU, in preparation for getting texturing working. Originally, it wasn’t meant to be a releasable post, but rather was written before coding the cache as a way to work out the high level design ideas and to find potential issues. Turns out pretending to explain my ideas to a fictional reader works really well for me as a way of designing. As a bonus, it’s also a way to help me remember how my own stuff works, should work get busy and I need to come back to the project after N months. The cache is currently working both in sim and in hardware, but with fake values stuffed into memory instead of actual textures.

Cache Organisation

The texture cache is a read-only cache with an LRU replacement policy and single-clock invalidation. The cacheline size is 256 bits, which is 16 pixels in 565 format, representing a 4×4 tiled block. The data store for the cache is built out of pairs of 36 kbit simple dual-port BRAMs. This appears to be a requirement stemming from the cache having asymmetric read and write sizes (one-pixel reads and half-cacheline writes). A pair gives 72 kbits of raw storage, which is 64 kbits, or 256 lines, if the ECC bits aren’t repurposed as usable data. And with a 2-way cache, that is 128 sets. Therefore read addresses presented to the cache will have 4 bits indicating the pixel within a line, 7 bits to determine the set, and the MSB 16 bits will be the tag.
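
As a reference for the rest of the post, here’s how a 27 bit request address gets sliced up, written as a minimal sketch (the struct and field names are mine, purely for illustration):

typedef struct packed {
    logic [15:0] tag;    // MSB 16 bits, compared against the stored tags
    logic [6:0]  set;    // selects one of the 128 sets
    logic [3:0]  pixel;  // selects one of the 16 pixels in the 4x4 block
} tex_addr_t;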

cacheline block
A cacheline is a 4×4 block of pixels. Hex numbers represent the offset inside a cacheline

The tags are kept separate from the cache datastore. Each line has a 16 bit tag and a valid bit, and each set has one bit, shared between its two ways, to indicate which way is LRU. This data doesn’t represent the actual current state of the cache, but rather the state the cache will be in once the next request is processed.

High Level Cache View
High level cache view. T is cache tags and LRU bits, H is a bit indicating hit or miss, and DATA is the cachelines

Client Arbiter

Each cache serves four clients. To give me the simplest test, the clients are currently just the rasterizer tiles, although eventually I may try to add texture reads to my pixel shaders. Each client can have multiple requests in flight, and data is returned in the order that requests are accepted.

The arbiter chooses a client request with a simple four bit rotating onehot mask, where the set bit determines the first client to consider. So if the mask was 4’b0100, the arbiter would start with client 2, and then try 3, 0, and 1 in order, looking for a valid request. A mask of 4’b0010 would check clients 1, 2, 3, and then 0. The priority mask is rotated one bit left when a request is accepted. Early on I tried to be more “clever” with arbitration, favouring client requests that would be hits before ones that would be misses, but this really complicated the design and made it harder to meet timing. It also starved the cache of DDR3 requests that could have been in flight.
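
Here’s a rough sketch of that pick logic, assuming four valid bits in client_req_valid and a registered onehot accept_mask (all names are invented for the example, and reset of accept_mask to 4'b0001 is omitted):

function automatic logic [1:0] onehot_to_idx(input logic [3:0] oh);
    case (oh)
        4'b0001: return 2'd0;
        4'b0010: return 2'd1;
        4'b0100: return 2'd2;
        default: return 2'd3;
    endcase
endfunction

logic [3:0] client_req_valid;   // one request valid bit per client
logic [3:0] accept_mask;        // rotating onehot priority mask
logic       grant_valid;
logic [1:0] grant_client;
logic [1:0] candidate;

always_comb begin
    grant_valid  = 1'b0;
    grant_client = 2'd0;
    for (int i = 0; i < 4; i++) begin
        candidate = onehot_to_idx(accept_mask) + i[1:0];   // 2 bit add wraps mod 4
        if (!grant_valid && client_req_valid[candidate]) begin
            grant_valid  = 1'b1;
            grant_client = candidate;
        end
    end
end

// Rotate the priority mask one bit left whenever a request is accepted
always_ff @(posedge clk)
    if (req_accepted)
        accept_mask <= {accept_mask[2:0], accept_mask[3]};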

The arbiter itself is a two-state machine. In the first state, it looks for a valid client request. If one is found, it saves off the request address and client number, presents the request to the cache, rotates the accept mask, and notifies the client that its request was accepted, before transitioning to the second state. The second state simply waits for the ack from the cache, which usually comes within two clocks unless the cache input FIFO is full or the cache isn’t in its request-accept state.

Requests consist of only an address and a client ID. Because we have 256MB of RAM, the full address width is 28 bits. However, cache requests are for 16 bit pixels, meaning the LSB of the address isn’t needed in the request, and so request addresses are 27 bits wide. The client ID is just a 2 bit number indicating which client made the request. This ID is returned by the cache, along with the pixel data, and is used to signal a client that there is valid data for them on the bus.
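
So the whole arbiter-to-cache request payload is tiny; something like this sketch (struct and field names mine):

typedef struct packed {
    logic [1:0]  client_id;  // which of the four clients made the request
    logic [26:0] addr;       // pixel address: the 28 bit byte address >> 1
} tex_cache_req_t;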

Cache Input Processing

The cache itself is made of four state machines: input, DDR request, DDR return, and output. The input machine is a three state FSM where the first state accepts and saves off an incoming request from the arbiter, the second state gets the tags and LRU bit for the relevant cache set, and the third state updates the cache tags, adds misses to the DDR3 request FIFO, and adds the request to the output request FIFO.

The output request FIFO entry contains a client number, one bit to indicate whether a request is a hit or miss, and some address information. Cache reads are pixel aligned, and so the read address is constructed as {set_bits (7b), way (1b), pixel bits (4b)}. Writes are half cacheline granularity, and so the addresses are {set_bits (7b), way (1b), halfline (1b)}, but the halfline isn’t needed until a miss’s data is returned from the memory fabric, and therefore isn’t stored in the output request FIFO entry.
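
Concretely, the two BRAM address forms might be built like this (a sketch; hit_way, victim_way, and halfline_sel are invented names):

// Read port: one 16 bit pixel per access, 12 bit address
logic [11:0] bram_rd_addr;
assign bram_rd_addr = {req_set, hit_way, req_pixel};        // {7b, 1b, 4b}

// Write port: one 128 bit halfline per access, 9 bit address
logic [8:0] bram_wr_addr;
assign bram_wr_addr = {req_set, victim_way, halfline_sel};  // {7b, 1b, 1b}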

The cache tags and LRU bits are stored separately from the cacheline data, and don’t reflect the current state of the cache. For example, a miss will set the valid bit so that the next request for the address will be a hit, even though the data is not yet in the cache. The cache is updated with the following pseudocode:

cache_set_flags[input_req_addr.set].lru <= |input_is_hit ? input_is_hit[0] : ~input_set_flags.lru;
cache_set_flags[input_req_addr.set].tag_entry[input_way_to_use] <= {1'b1, input_req_addr.tag};

input_is_hit is a two bit vector (one bit per way) that is guaranteed to be either zero in the case of a miss, or $onehot for a hit. The bitwise-or reduction of this, |input_is_hit, indicates whether the request hit in either way. In the case of a hit, the set’s LRU bit is set to input_is_hit[0], so that LRU now points at whichever way we didn’t hit in. Otherwise, for a miss, the current LRU bit is just inverted. The tag_entry is always updated, either with the existing tag for a hit, or the new tag for a miss.
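
For completeness, a sketch of the hit detection that feeds this, keeping the {valid, tag} layout from the update above (signal names are mine):

logic [1:0] input_is_hit;
logic       input_way_to_use;

always_comb begin
    for (int w = 0; w < 2; w++) begin
        // Bit 16 of each tag_entry is the valid bit and bits 15:0 are the tag,
        // matching the {1'b1, input_req_addr.tag} update above.
        input_is_hit[w] = input_set_flags.tag_entry[w][16] &&
                          (input_set_flags.tag_entry[w][15:0] == input_req_addr.tag);
    end
    // Hit: use the way that hit. Miss: fill the LRU way.
    input_way_to_use = |input_is_hit ? input_is_hit[1] : input_set_flags.lru;
end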

DDR3 Request Processing

In the case of a miss, the input FSM adds memory requests to a CDC FIFO for the DDR3 request FSM to consume, with the write side clocked at the system clock frequency, and the read side clocked at the much lower DDR3 controller user interface frequency. The DDR3 request addresses are just the upper 23 bits of the byte address. This is because byte addresses are 28 bit and cache request addresses are 27 bits (since pixels are 2 bytes), but I want DDR3 request addresses to be cache line aligned. With a cache line holding 16 pixels, this means DDR3 request addresses only need 28 – $clog2(2) – $clog2(32/2) = 23 bits. It is this 23 bit address and the number of 128 bit reads (two for a 256 bit cacheline) that is sent over the fabric.
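
Spelled out as parameters, that width calculation looks roughly like this (a sketch, assuming 2 byte pixels and 32 byte cachelines):

localparam int BYTE_ADDR_BITS  = 28;              // 256MB of DDR3
localparam int PIXEL_SHIFT     = $clog2(2);       // 2 bytes per pixel  -> 1
localparam int LINE_PIXEL_BITS = $clog2(32 / 2);  // 16 pixels per line -> 4
localparam int DDR_ADDR_BITS   = BYTE_ADDR_BITS - PIXEL_SHIFT - LINE_PIXEL_BITS;  // 23

// The 23 bit cacheline-aligned address is just the 27 bit pixel address
// with the 4 pixel-within-line bits dropped.
logic [26:0] miss_pixel_addr;
logic [22:0] ddr3_req_addr;
assign ddr3_req_addr = miss_pixel_addr[26:4];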

The DDR3 request FSM is a simple two-state machine. The first state gets an entry from the DDR3 request FIFO, if there are any. The second state presents the request to the memory fabric, and waits for the ack to come back before getting the next FIFO entry.

DDR3 Return Processing

Data returned from the memory fabric is 128 bits wide, but cachelines are 256 bits, so I have some logic to gather the returned data into cachelines before adding to the return data FIFO. This is actually really simple, and just requires a single bit to indicate whether incoming data is the high 128 or low 128, and a 128 bit vector to store the previously returned data. And so the return data FIFO input becomes {new_data_from_ddr, prev_data_from_ddr}, and the FIFO write enable is (valid_data & is_high_128).
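
A sketch of that gather logic, using the names from above and assuming the low 128 bits of a line always come back first (reset of is_high_128 to zero is elided):

logic         is_high_128;         // does the next beat carry the upper half?
logic [127:0] prev_data_from_ddr;  // lower half, waiting for its partner

always_ff @(posedge ui_clk) begin
    if (valid_data) begin
        is_high_128        <= ~is_high_128;
        prev_data_from_ddr <= new_data_from_ddr;  // only matters on the low beat
    end
end

// Write a full 256 bit cacheline into the return FIFO on the second beat
assign ret_fifo_din   = {new_data_from_ddr, prev_data_from_ddr};
assign ret_fifo_wr_en = valid_data & is_high_128;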

Output Block

Here’s where I admit I have been lying. The output FSM is actually two FSMs: one for hit processing and another for misses.

Hit FSM

The hit machine has four states. The first state waits for a valid entry in the output request FIFO. Once output_fifo_empty goes low, the request is saved off, the cache address is sent to the BRAM, and it transitions to the second state. All the second state does is transition to the third state, to give the BRAM time to return the requested data. In the third state, the output pixel is read from the BRAM, and it transitions to the final sync state.
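
Roughly, with the middle state existing purely to cover the BRAM’s registered output, it could look like this sketch (state and signal names are mine):

typedef enum logic [1:0] {kHitWaitReq, kHitWaitBram, kHitReadData, kHitSync} hit_state_t;
hit_state_t hit_state;

always_ff @(posedge clk) begin
    case (hit_state)
        kHitWaitReq:  if (!output_fifo_empty) begin
                          bram_rd_addr <= {req_set, req_way, req_pixel};
                          hit_state    <= kHitWaitBram;
                      end
        kHitWaitBram: hit_state <= kHitReadData;      // wait out the BRAM output register
        kHitReadData: begin
                          out_pixel <= bram_rd_data;  // pixel data is valid here
                          hit_state <= kHitSync;
                      end
        kHitSync:     if (miss_fsm_in_sync) hit_state <= kHitWaitReq;
    endcase
end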

output block
output block hit and miss FSMs

Miss FSM

The write machine only has three states. Like the read machine, the write machine’s initial state waits until output_fifo_empty goes low, but it also waits for (was_hit | ~ddr3_ret_fifo_empty). This is because in the case of a hit, the write machine has nothing to do and can safely continue, but in the case of a miss, it also has to wait until the required data is returned from the memory fabric. Once the condition is met, the cacheline data is read from the DDR3 return FIFO. The lower 128 bits of that data, the cache write address, and the write enable bit (high if the request was a miss) are sent to the BRAM, and it transitions to the second state. In the second state, the BRAM address is incremented and the upper 128 bits of the cacheline are sent to the BRAM. Like the read machine, the final state is the sync state, which waits for both FSMs to be in sync before transitioning back to the first state.
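
And the corresponding miss-side writes, as a sketch (FIFO pop strobes and reset are elided; was_hit comes from the output request FIFO entry, and all names are invented):

typedef enum logic [1:0] {kMissWaitReq, kMissWriteHigh, kMissSync} miss_state_t;
miss_state_t miss_state;

always_ff @(posedge clk) begin
    case (miss_state)
        kMissWaitReq:   if (!output_fifo_empty && (was_hit || !ddr3_ret_fifo_empty)) begin
                            bram_wr_addr <= {req_set, req_way, 1'b0};  // lower halfline
                            bram_wr_data <= ret_fifo_dout[127:0];
                            bram_wr_en   <= ~was_hit;                  // hits have nothing to write
                            miss_state   <= kMissWriteHigh;
                        end
        kMissWriteHigh: begin
                            bram_wr_addr <= bram_wr_addr + 1'b1;       // now the upper halfline
                            bram_wr_data <= ret_fifo_dout[255:128];
                            miss_state   <= kMissSync;
                        end
        kMissSync:      if (hit_fsm_in_sync) begin
                            bram_wr_en <= 1'b0;
                            miss_state <= kMissWaitReq;
                        end
    endcase
end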

SVA Verification

The bulk of architectural asserts in the cache are for verifying FIFOs, checking for overflow, underflow, and validity, and making sure that the FIFO element sizes match what was set up in the IP. The rest are sanity checks verifying the expected flow of the various FSMs. For example, if an output request FIFO entry is a miss, once the data from memory is available, the cache write enable should rise, stay asserted for two clocks, and then fall. There are also quite a few internal consistency asserts, such as making sure a request never hits in more than one way, that valid bits go low one clock after an invalidation, that the DDR return FIFO is empty if the output request FIFO is empty, and that data from the cache is valid exactly two clocks after a read request.
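
For flavour, here’s roughly what a couple of those consistency checks look like in SVA (the signal names are stand-ins for whatever the RTL actually calls them):

// A request must never hit in more than one way
assert property (@(posedge clk) disable iff (!rst_n)
    $onehot0(input_is_hit));

// All valid bits must be low one clock after an invalidation
assert property (@(posedge clk) disable iff (!rst_n)
    invalidate |=> (all_valid_bits == '0));

// Pixel data comes out of the cache two clocks after a read request
assert property (@(posedge clk) disable iff (!rst_n)
    bram_rd_en |-> ##2 out_pixel_valid);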

There are some asserts in the arbiter as well. Most of these are sanity checks like the accepted client must actually have a valid request, the client request bit goes low one clock after the arbiter accepts a request, and that the arbiter FSM must be in the kArbStateWaitCacheAck state when the cache asserts arb2cache_if.req_accepted. I bring these three up as an example because all three of them were failing due to a stupid typo that I feel really should have been an error, or at least a warning. Thanks, Vivado. The lesson here is to assert everything, no matter how stupid or small it may seem.

Finally, testbench asserts are used to verify correct high level operation. Each test has three modes: linear, all misses, and alternating ways. Linear uses linearly increasing addresses, and verifies both that the data returned is correct and that I get 1 miss followed by 15 hits. The all misses mode requests the first pixel of every cacheline, and so asserts check that every request is a miss. Alternating ways is a bit harder to explain. Basically, I want to test:

  1. [tag A, set 0, way 0: miss]
  2. [tag B, set 0, way 1: miss]
  3. [tag A, set 0, way 0: hit]
  4. [tag B, set 0, way 1: hit]
  5. [tag C, set 2, way 0: miss]
  6. [tag D, set 2, way 1: miss]
  7. [tag C, set 2, way 0: hit]
  8. [tag D, set 2, way 1: hit]
  9. [tag E, set 1, way 0: miss]
  10. [tag F, set 1, way 1: miss]
  11. [tag E, set 1, way 0: hit]
  12. [tag F, set 1, way 1: hit]
and so on. That pattern is used to go through all of memory, cycling through all sets and ways, and then wrapping around. I then assert that req_was_hit == test_counter[1], since this pattern should give two misses followed by two hits. One caveat is that the tests above only apply to testbenches where the arbiter has one client. Things get harder in the four client case, since the client request order is randomised. I can still verify the returned data is correct, but since I can no longer rely on any client being the first to touch a cacheline, it becomes harder to verify hits and misses. But this is definitely on my list of things to do.
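
For the single-client case, the alternating-ways stimulus and check can be sketched like this (my sketch walks the sets linearly rather than in the exact order shown above, and assumes one request in flight at a time, but the two-misses-then-two-hits behaviour is the same; all names invented):

logic [31:0] test_counter;    // increments once per accepted request
logic [26:0] test_addr;

// counter[0] picks one of two tags that map to the same set, counter[1] is the
// miss/hit phase, counter[8:2] advances the set every 4 requests, and the bits
// above that pick a fresh tag pair each time the sets wrap around.
assign test_addr = {{test_counter[23:9], test_counter[0]},  // 16 bit tag
                    test_counter[8:2],                      // 7 bit set
                    4'h0};                                  // first pixel in the line

// Two misses followed by two hits, forever
assert property (@(posedge clk) disable iff (!rst_n)
    resp_valid |-> (req_was_hit == test_counter[1]));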

Future Optimisations

Like I said, this represents a first attempt at getting something simple up and running, and in doing so I took quite a few shortcuts. Currently, data is returned in the order requests are accepted, and a miss can block subsequent hits. I’m considering having per-client queues in the cache, where requests from a particular client are always processed in order, but there is some freedom to return a hit from client A while client B’s miss is pending. I actually looked into this pretty early on, but it really complicated the design. I may have to come back to this in the future if I am feeling brave.

Special Thanks

As always, thanks to @domipheus, @CarterBen, @Laxer3A, and @sparsevoxel for keeping me inspired with their absolutely mental projects.

And super mega ultra thanks to @mikeev, @tom_forsyth, @rygorous, and everyone else participating in the twitter FPGA threads. It’s an honour being able to ask such accomplished professionals so many stupid questions and not get laughed out of the room.
