Post

Spinning A Donut (Creating an OoO CPU to Run donut.c) [Part 1 of 2]

Spinning A Donut (Creating an OoO CPU to Run donut.c) [Part 1 of 2]

In this write-up, I will go through many weeks of development of an out-of-order CPU core. I will talk about lessons learned, explain microarchitecture concepts, and illustrate justifications (and shortcomings) of my design. The reader is expected to be familiar with assembly and the absolute basics of what CPUs do. Most of my examples will be in x86 rather than RISC-V in the beginning, as more people seem to be familiar with the former.


Main Resources Used

Defining The Scope

Modern CPUs are insanely complex, so the scope of this project is limited. To limit the number of pipeline stages that will need to be built, the clock used is only 50MHz (although the maximum clock is higher according to timing reports under the current design). The CPU will fetch two instructions at a time, issue three at a time, and retire two at a time (these are maximums due to reasons described in the following sections). To take full advantage of the superscalar capability of the CPU, the CPU will be out-of-order, meaning instructions can execute in a different order, but to maintain architectural correctness, they have to be retired in-order (as modern CPUs do). To prevent stalling whenever there is a jump, speculative execution is supported with simple branch prediction (and immediate jumps are immediately executed). Peripheral wise, raw UART is used for loading programs and I/O.

Defining Terms

  • RAW means read after write. If we have two instructions, and one of them reads a just-written-to register, then they are logically dependent. So an instruction that depends on a written register should never execute before the register is written to. An example is mov rax, rcx then mov rdx, rax. The second instruction should never execute before the first in this case.
  • WAR means write after read; it is similar, the read must be from the register before any sequential write occurs. An example: mov rdx, rcx then mov rcx, rax should always execute in-order (we will bypass this using renaming).
  • WAW means write after write, simply put mov rdx, rax then mov rdx, rcx should reflect rcx at the end rather than rax (we will also bypass this using renaming).
  • RAR is not a hazard, mov rax, rcx then mov rdx, rcx can execute at the same time since they are reading from the same source and committing to different registers (so no conflict!)
  • Pipelining: a common metaphor used to describe this is laundry. When you are washing laundry, it makes sense to fold laundry that has just been washed. If you sit idle while the clothes are washing, then you have the same latency for clothes going from dirty to folded, but your throughput is halved. Similarly, CPUs have multiple stages that each take a cycle (fetch, decode, rename, issue, execute, write-back, commit) or more! So, to make full use of the CPU, there is an instruction inside each stage (most of the time) to get the maximum throughput.
  • Superscalar: This means that the CPU executes more than one instruction per cycle. This allows for programs to execute faster (higher throughput of instructions, commonly called IPC) when you can not just increase the clock Hz. Several problems arise when you do this; one is that you must avoid hazards.
  • Out-of-order: if one instruction is waiting for a register that has not yet been written to (e.g., mov rax, [rcx] and mov rdx, rax, rdx would be dependent on a multi-cycle memory read), then the instruction will not be able to execute. For this reason, other instructions with resolved dependencies are allowed to run with no consequence to the microarchitecture state. The stage responsible for seeing if dependencies are ready is the issue stage; the write-back stage marks registers as ready once they are written back, and the rename stage marks registers as no longer ready.
  • Speculative execution: when there is an instruction such as jne 0x123 it is not known if the branch is taken. We could insert bubbles (NOPs) into the pipeline until the branch is issued and resolved, but that would cost many, many cycles. Instead, we make an educated (see branch prediction below) guess about whether the branch is taken. If we are wrong, then we flush the entire pipeline, copy over the architectural (real) state into the physical (speculative) state, and then continue from the real address.
  • Branch prediction: since guessing wrong can cost us cycles, we can make better guesses by looking at the history of whether or not we were correct. A very simple implementation of this will be illustrated in this writeup, but note that real implementations are stunningly complex.
  • uop: a “micro-operation”, which stores many fields relevant to our instruction. In CISC (complex instruction set) architectures like x86 (fun fact: there are 1500+ defined x86 instructions), a single instruction can encode to multiple uops, but since RISC-V is RISC (reduced instruction set) there is no need (at least for our implementation).
  • LSQ: Since memory writes may be incorrect under a speculated branch, we write to a buffer instead, and when retire reaches a store instruction, the buffer writes to real memory. This buffer is also read from during memory reads because the writes still have to have effect on the state. You may call this an SQ, since it is queueing stores, but note that when I refer to an LSQ I will be talking about this.

The Plan

Since there are so many stages, I outlined an architecture from the beginning, so integrating features like rename would not require dozens of iterations. The following is written after making the code, as the original plan had several shortcomings which led to development issues, and I would not like to cover them all in this write-up.

CPU Block Diagram

Fetch

Fetch must handle control flow changes and a lack of them. Since the stage is pretty simple, it is largely combinational, so that the BRAM read (which requires a cycle delay) only takes a cycle rather than two. Normally, the stage simply increments the PC (also known as RIP or IP) by eight (since a RISC-V instruction is 4 bytes, and this is a dual fetch pipeline). On a flush (meaning misprediction) or jump, the PC is changed to the new address.

Decode

The decode stage is largely a switch statement that rewires the instruction into the appropriate uop fields. This makes it easier to use at pretty much no cost. Along with this, the PC is also stored, which is needed for jumps and other instructions. Decode is also responsible for branch prediction and immediate jumps, since it decodes the jump instructions, it is able to find out if an instruction should jump with minimal overhead (another unit would require decoding the jump again, although our way is messier).

Rename

The rename stage is the most complex module so far; it must find a physical register that is free (it stalls the decode and fetch stages to prevent physical registers from ever running out) and then rename the destination register for each instruction to the new free register. The source registers are renamed to whatever they map to in the RAT, which is a table that stores what all the architectural registers are mapped to. There are usually far more physical registers in order to bypass certain dependencies (e.g., mov rax, rcx and mov rax, rdx become mov x2, x4 and mov x3, x8 with the rename table reflecting that rax is x3 for the following instructions). On a misprediction, the RAT will be changed to whatever the commit stage holds, which reflects the committed RAT rather than the speculative one (see the shortcomings section).

Issue

Issue’s responsibility is to find instructions whose registers are ready and issue them. Additionally, for the AGU, it must make sure that the unit is ready before issuing an instruction. The ALU is single-cycle, so instructions can be issued to both ALUs every cycle. Issue must also insert instructions into the issue queue. For memory operations, this is done with a head/tail structure because memory access is in-order in our pipeline (see shortcomings section). For ALU operations, this is done with a free list (indicating what places in our ALU issue queue are able to be written to) and a priority encoder (finds the first available location) since instructions execute out-of-order. If either of these queues becomes too small, we must stall everything behind us. Small note: a golden rule is to never allow a unit to stall ahead, because this can lead to deadlock

Execute: ALU

This is mostly a switch statement that executes arithmetic operations and moves data around in registers, along with calculating whether or not jumps should be taken or what jumps to registers actually have as a destination. It also does sign extension sometimes. This is literally the simplest part of the entire project.

Execute: AGU

This unit must do many things, so I will gloss over many important details that will be covered in the implementation section. For writes, we store the data in an LSQ. For reads, we read from the LSQ to see if anything matches our address. If the write gives us all the data we need, we can end early without using the real memory read. If not, we read from real memory and apply partial writes from the LSQ if applicable. AGU is also responsible for memory-mapped UART and loading programs using UART. The ready signal is set if there are enough entries available in the LSQ and there are no memory reads in flight. While the memory may be easily extendable to SRAM and even DDR, we only use BRAM (see shortcomings).

Write-back

All this stage does is read the output of the execute stage, commit data to the physical state, and mark the written-to registers as ready.

Commit

This takes the output uop from the execute stages and data from the rename stage. The data from the rename stage is used to insert initial data into the reorder buffer, while the outputted uop is used for stuff like “was this instruction a misprediction?” (among other things). Many instructions can be retired from the ROB (which is a head-tail structure). Retiring can be as simple as freeing the previous physical register (which allows rename to allocate a physical register) or as complex as flushing the entire pipeline on a misprediction, updating the PC, and updating the physical RAT to match the architectural RAT. It also has to commit stores one at a time.

Implementation

This section will be long and very HDL-dense, so I would recommend learning some (System)Verilog before reading.

UART

To spin a donut, you must be able to see the donut. To do this, we can implement UART, which allows for input and output of text. Additionally, this allows us to load programs onto the CPU.

UART is asynchronous, so we have to agree on an arbitrary clock. Since our CPU has a fixed 50MHz clock, we can derive a 115,200Hz clock with 50,000,000//115,200=434. So every 434 cycles, we trigger one of the parts of the UART protocol. Note that there is only one wire going out (UART_TX) and one wire going in (UART_RX). The protocol itself to send a byte is (assume a 434 cycle delay (derived above) in between each step):

  1. Start with an output of one
  2. Once we need to send a byte, we pull the wire (UART_TX) low (set it to zero)
  3. We then send the bits of the char one at a time, with the delay in between each
  4. Then we pull the wire high, indicating the end of the byte

To implement this, I created a simple state machine:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
case (state)
    2'b00 : begin // lower signal to signal a start of a byte
        UART_TX <= 0;
        state <= state + 1;
    end
    2'b01 : begin
        if (bit_ptr == 3'b111) // end of byte
            state <= state + 1;
        UART_TX <= chr[bit_ptr]; // put bit in
        bit_ptr <= bit_ptr + 1;
    end
    2'b10 : begin
        UART_TX <= 1; // mark byte as done
        state <= 0;
        tm_ready <= 1;
    end
endcase

Receiving a byte is similar; we wait for the signal to lower, wait for one and a half baud cycles (so we sample in the middle rather than on the edges, which impacts error rate), then we sample a bit every baud cycle until we have sampled all bits. But the signal, UART_RX, can change in the middle of one of the cycles. To address this, we can use a register to store the previous cycle’s UART_RX which contains a “settled” signal.

The core part of the code is:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
if (baud == 434) begin
    baud <= 0;
    if (delay) begin // delays initial pulse to make sure it goes in the middle
        delay <= 0;
        baud <= 0;
    end
    else begin
        chr[bit_ptr] <= RX; // read in bit-by-bit
        if (bit_ptr == 3'b111) begin
            rc_done <= 1'b1;
            wait_rise <= 1'b1;
        end
        bit_ptr <= bit_ptr + 1;
    end
end
else
    baud <= baud + 1;

We will come back to UART near the end for loading programs and MMIO.

Fetch

To spin a donut, you must be able to receive instructions on how to spin the donut.

Since our design is constrained to LUTRAM and BRAM (see shortcomings section), and LUTRAM is too small for large structures like instruction memory, we must use BRAM. Since we have a two-wide processor, we must fetch 64 bits at a time (two 4-byte instructions). But, since M10K (what our target FPGA uses) only supports up to 32 bit single port reads, we have to split our BRAM into two different arrays, which doubles our bandwidth since different M10K blocks will be being used (I would hope synthesis tools do this by default, but my experience showed the software I used to be semi-unreliable so I did this prematurely). BRAM also requires a cycle delay after computing the address, so by posedge clk (which means on the clock’s rise (clock goes from 0 to 1)), the address already needs to be computed. After a complete clock cycle, the data will propagate in whatever variable we want.

1
2
3
4
5
6
7
8
9
10
(* ramstyle = "M10K" *) reg [31:0] mem1 [2047:0]; // memory, 2 * 32 * 2048 bits, or 16kb
(* ramstyle = "M10K" *) reg [31:0] mem2 [2047:0];

reg [31:0] tmp_load_word;
assign predecode_instr = {i_buf_two, i_buf_one};
always @(posedge `CLK)
      i_buf_one <= mem1[fetch_addr_aligned];

always @(posedge `CLK)
      i_buf_two <= mem2[fetch_addr_aligned];

There are many cases we must handle. On a reset, we must set the aligned fetch address to zero; for a flush on misprediction, we must set the fetch address to the new PC; on a stall, we must, well, stall. But for a normal case, all we have to do is increment the fetch_addr. But this logic would require a 2-cycle deep fetch, so our code utilizes a combination of clock and combinational logic.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
always @(*) begin
      fetch_addr = prev_fetch_addr;
      next_n_first_valid = 0;
      fetch_addr_aligned = 0;
      if (flush) begin
            next_n_first_valid = 1'b0;
            fetch_addr = {new_flush_pc >> 3, 3'b000};
            if ({new_flush_pc >> 3, 3'b000} != new_flush_pc)
                  next_n_first_valid = 1'b1;
      end else begin
            if (!(STALL_FROM_RENAME|STALL_FROM_ISSUE)) begin
                  next_n_first_valid = 1'b0;
                  if (jmp) begin
                        fetch_addr = {new_pc >> 3, 3'b000};
                        if ({new_pc >> 3, 3'b000} != new_pc)
                              next_n_first_valid = 1'b1;
                  end else begin
                        fetch_addr = prev_fetch_addr + 8;
                  end
            end
      end
      fetch_addr_aligned = fetch_addr >> 3;
end

always @(posedge `CLK or negedge CPU_RESET_n) begin
      if (!CPU_RESET_n) begin
            n_first_valid <= 0;
            prev_fetch_addr <= -8;
      end else if (flush) begin
            n_first_valid <= next_n_first_valid;
            prev_fetch_addr <= {new_flush_pc >> 3, 3'b000};
      end else begin
            if (!(STALL_FROM_RENAME|STALL_FROM_ISSUE)) begin
                  n_first_valid <= next_n_first_valid;
                  if (jmp) begin
                        prev_fetch_addr <= {new_pc >> 3, 3'b000};
                  end else begin
                        prev_fetch_addr <= fetch_addr;
                  end
            end
      end
end

Note that we must stall from rename because rename allocates ROB entries (and physical register entries), and must stall if the ROB is nearly full. Issue must be able to send a stall signal because the IQ can become full.

You may see references to n_first_valid, which is an artifact of attempting to keep everything 64 bit aligned. This is not strictly needed, but it simplifies address calculation. Otherwise, the code is simply aligning the address and handling different conditions.

Decode

To spin a donut, you must be able to understand the instructions you receive on how to spin such a donut.

In my personal experience, this is the most tedious part of the pipeline. Simply put: you need a large switch statement to find out what type of instruction it is (what hardware resources (ALU or AGU) to delegate to and the sub-operation (eg add, or, xor, etc), what the relevant registers are (and if they are valid), the PC at the time (for jumping instructions and auipc), what the immediate is, whether the decoding is even valid (fault bit), and whether we predict a jump was taken (see the branch prediction section later).

I will not show most of the code, but the most non-trivial part is the jumping logic:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
...
if ((just_jumped && !stall)) begin
    just_jumped <= 1'b0;
    new_pc <= 1'b0;
    uops <= '0;
    jmp <= 1'b0;
end else if (!stall) begin
    ...
    for (i = 0; i < 2; i = i + 1) begin // despite the for loop, this is done in parallel
                      ...
                      J_type : begin // assigns immediate, dst1
                        just_jumped <= 1'b1; // so next cycle inserts bubbles
                        jumped = 1; // so next instruction becomes bubble
                        jmp <= 1'b1; // so fetch knows to update PC
                        new_pc <= prev_fetch_addr + (i*4) + { {12{instructions[i][31]}}, instructions[i][19:12], instructions[i][20], instructions[i][30:21], 1'b0};

This logic basically says: if this is an immediate jump instruction, we compute the address, then forward it to fetch (shown in the above section), then we stall for a cycle to allow the fetch to propagate.

Small note: n_first_valid is forwarded to this module, which we handle by inserting a non-faulted all-zero uop which functions as a nop in future stages

Rename

To spin a donut, you must remove false dependencies to interpret instructions on how to spin a donut faster.

Renaming

Since I already defined hazards, I will not repeat that. Instead, I will demonstrate how these hazards are bypassed. To prevent most hazards, we can simply remove the fact that we write to a register that something architecturally ahead reads from or writes to (which addresses WAW and WAR). To do this, we “rename” registers to different registers.

Given the following code:

add x1, x2, 6
sub x1, x3, 4
sub x3, x1, 7

A write-after-write dependency is created, so for the first instruction we make x1 point to a physical register, say x45 (the physical register amount is almost always more than the architectural (real, as in x0-x31)), and look up x2 in the rename table. So the instruction can become something like add x45, x54, 6. The next instruction will do something similar, it looks up x3, let us say it’s x9, it renames x1 to a free physical register, let us say it is x16, so now the instruction is sub x16, x9, 4. Then, for the last instruction, we look up x1, see that it is renamed to x16 by the previous instruction, then rename x3 to some free physical register, say x5. So now our code looks like:

add x45, x54, 6
sub x16, x9, 4
sub x5, x16, 7

Look at that! Less hazards! But to prevent something like the third instruction executing before x16 is written to, we mark x16 as not-ready, we will mark it as ready in the write-back stage.

Time for the code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
if (uops[i].src1_valid) begin
    renamed[i].src1_reg <= rename_table[uops[i].src1_reg]; // update areg to preg
end
if (uops[i].src2_valid) begin
    renamed[i].src2_reg <= rename_table[uops[i].src2_reg];
end
if (uops[i].dst_valid) begin
    if (uops[i].dst_reg != 0) begin
        for (int a = 0; a < 64; a++) begin // finds free physical register, should be optimized at synthesis time (?)
            if (f_list_dup[a] == 1'b1) begin
                allocated_preg = 6'(a);
                f_list_dup[a] = 1'b0; // allocates it
                f_list_allocated[a] <= 1'b1;
                break;
            end
        end
    end else
        allocated_preg = 6'b0;

ROB Entry Insertion

Rename does not just rename registers, it also allocates ROB entries. Let’s look at the ROB entry structure:

1
2
3
4
5
6
7
8
9
10
11
12
13
typedef struct packed {
    logic       finished; // is it done?
    logic       spec;     // is it a speculative branch?
    logic       store;    // does it have to be retired from the LSQ?
    logic       dst_valid;// is the destination a register?
    logic [5:0] a_dst_reg;// destination for retiring architectural state
    logic [5:0] pp_dst_reg;// for freeing
    logic [5:0] p_dst_reg;// to update architectural state
    logic       misspredict;
    logic [31:0]new_pc; // for mispred
    logic       faulted;
    logic       taken;
} rob_ent_t;

The comments are pretty self-explanatory, for rename the only entries set are: not finished, speculative (basically marks whether or not it is a conditional branch or jump to a register, although this likely differs from traditional implementations), store, whether it was a taken jump, and the destination information. The destination information is pretty important; the architectural register and physical register allow the ROB to update the architectural (real) state, while the previous physical register allows for physical registers to be freed (since the previous destination physical register should not be allowed to be read from by any instructions that are going to execute).

The code is exactly what I described (just moving data from one place into another), so I will omit it.

Stalling

There are two possible conditions in which we stall: if we run out of free physical registers or if we run out of free ROB entries (since retire forwards the head, we are able to calculate how many free entries are left).

Issue

To correctly spin a donut, you must follow the instructions somewhat in order.

Issuing to The ALU(s)

While renaming does prevent WAR and WAW hazards, we still have to account for real dependencies to preserve correctness. To do this, we create an indicator of whether or not registers have been written to. When we allocate a register in rename, we mark it as unready, when we write to a register in write-back we mark it as ready. The issue stage must read this structure in order to issue instructions while preserving real dependencies.

1
2
3
4
5
6
7
8
9
10
11
12
if (alu_ready) begin
    for (int a = 0; a < 8; a = a + 1) begin
        if ((!alu_uops_fl[a]) && (!alu_issue_queue[a].src1_valid || p_reg_ready[alu_issue_queue[a].src1_reg]) &&
            (!alu_issue_queue[a].src2_valid || p_reg_ready[alu_issue_queue[a].src2_reg])) begin // if valid and sources are ready
            alu_uops_fl[a] = 1'b1;
            alu_uop <= alu_issue_queue[a];
            alu_valid <= 1;
            break;
        end
    end
end else
    alu_valid <= 0;

This code is basically saying that if the ALU is ready to receive an instruction, then we search the queue for valid instructions (non-empty entries) and have ready source registers (if it needs such source registers). Once we find a valid instruction, we send it to the ALU.

To extend this code for two ALUs, we can either scan the queue twice and insert uops into two different slots or use two different queues and scan them in parallel. Timing-wise, it makes more sense to have two different queues. Our naive implementation does this at the cost of the number of ALU uops we can store in the queue (see insertion later in this section).

Issuing to The AGU

The AGU is somewhat different, while we still need source registers for data to write with and address calculation, we have to do this in order (see shortcomings/justifications section). Since the issue unit receives instructions in order, we can simply use a head/tail structure.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
if (agu_ready && (!agu_valid)) begin
    if (!(mem_issue_queue[mem_head]))
        agu_valid <= '0;
    else begin
        if ((!mem_issue_queue[mem_head].src1_valid || p_reg_ready[mem_issue_queue[mem_head].src1_reg]) &&
            (!mem_issue_queue[mem_head].src2_valid || p_reg_ready[mem_issue_queue[mem_head].src2_reg])) begin
                agu_valid <= '1;
                agu_uop <= mem_issue_queue[mem_head];
                mem_issue_queue[mem_head] <= '0;
                mem_head = mem_head + 1;
        end else
            agu_valid <= '0;
    end
end

Insertion

To know what resource to delegate a uop to, we can check the op_type that we set in decode. For the ALU, all we have to do is check a free list. For the AGU, we just insert at the tail and increment its value. To prevent these queues from filling up, we stall. The “nops” are all 0-uops, which we skip.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
if ((acum_alu[0] <= 3) || (acum_alu[1] <= 3) // stalling logic
    || ((mem_tail+3) == (mem_head))
    || ((mem_tail+2) == (mem_head))
    || ((mem_tail+1) == (mem_head))
    || (((mem_tail) == (mem_head)) &&
    |mem_issue_queue[mem_head])) begin
    stall_backwards <= 1'b1;
end else begin
    stall_backwards <= 1'b0;
end
...
for (int i = 0; i < 2; i = i + 1) begin
    if (|uops_renamed[i]) begin // do not insert NOPs
        if (uops_renamed[i].op_type == 3'b010 || uops_renamed[i].op_type == 3'b100 || uops_renamed[i].op_type == 3'b110 // inside statements seemingly unsupported by synth
            || uops_renamed[i].op_type == 3'b111  || uops_renamed[i].op_type == 3'b000  || uops_renamed[i].op_type ==  3'b101) begin // ALU
            for (int a = 0; a < 8; a = a + 1) begin // finds free IQ entry, should be auto-optimized to something better
                if (alu_uops_fl[i][a] == 1'b1) begin
                    alu_issue_queue[i][a] <= uops_renamed[i];
                    alu_uops_fl[i][a] = 1'b0; // allocates it
                    break;
                end
            end
        end else begin
            mem_issue_queue[mem_tail] <= uops_renamed[i];
            mem_tail = mem_tail + 1;
        end
    end
end

ALU

To spin a donut, you must run the instructions that you are ready to execute.

In this architecture, the ALU is responsible for handling control flow instructions, loads (into registers with immediates, not memory loads), and arithmetic. The logic is largely repetitive, so I will only cover some core bits.

The uop is copied over into the destination uop, from there, it is largely handled by a case statement (... means that there is more logic, but it is repetitive and omitted).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
    case (uop.op_type)
        3'b110: begin
            if (uop.op)
                uop_out.dst_val <= (uop.immediate << 12) + uop.pc;
            else
                uop_out.dst_val <= { {12{1'b0}}, uop.immediate} << 12;
        end
        3'b101: // jmp+pc // actual jumping handled during decode; dst reg assignment now
            uop_out.dst_val <= uop.pc + 4;
        3'b100: begin // condtional jumps
            uop_out.was_jmp <= 1'b1;
            uop_out.pred_taken <= uop.pred_taken;
            case (uop.op)
                4'b0000:
                    was_taken = (src_1 == src_2);
                4'b0001:
                    was_taken = (src_1 != src_2);
                4'b0100:
                ...
            endcase
            if (was_taken) begin
                uop_out.taken <= 1'b1;
                uop_out.new_pc <= { {19{uop.immediate[11]}}, uop.immediate[11:0], 1'b0} + uop.pc; // sign extends
            end else begin
                uop_out.taken <= 1'b0;
                uop_out.new_pc <= uop.pc + 4; // jumps to the instruction after, not this instruction
            end
        end
        3'b010: begin // JALR (reg)
            uop_out.was_jmp <= 1'b1;
            uop_out.pred_taken <= 1'b0; // manually invoke mispred logic
            uop_out.taken <= 1'b1;
            uop_out.new_pc <= ((src_1) + { {20{uop.immediate[11]}}, uop.immediate[11:0]}) & { {31{1'b1}}, 1'b0};
            uop_out.dst_val <= uop.pc + 4;
        end
        3'b000: begin // the real ALU part
            logic [31:0] imm_tmp;
            imm_tmp = { {20{uop.immediate[11]}}, uop.immediate[11:0]};
            case (uop.op)
                4'b0000: // add
                    uop_out.dst_val <= uop.src2_valid ?
                        (src_1+src_2) : (src_1+imm_tmp);
                4'b1000: // sub
                    uop_out.dst_val <= src_1-src_2;
                ...
                default:
                    uop_out.faulted <= 1'b1;
            endcase
        end
        default:
            uop_out.faulted <= 1'b1;
    endcase
end

The reason that this was not clumped into one execution section is because of the complexity of the AGU.

AGU

This is the most complex unit in the entire CPU, so we must go over quite a few things.

The LSQ

Despite being in-order (which grants us the ability to use a head/tail structure), we still have to deal with speculative loads and reads. Since there is no way to undo a memory write, we have to write to a speculative buffer, called the LSQ. There are several things we must consider, first we have to consider how reads will work.

Reads

We have to consider all speculative writes that would affect the reads we want to issue. So, we can, in parallel, issue a read to the address while searching up our address in the LSQ, then we create a mask of writes. In the next cycle, we take the read data and apply the mask, then we forward the result.

1
2
3
4
5
6
7
8
9
10
for (int i = 0; i < 8; i++) begin
    logic [2:0] q_ptr;
    q_ptr = qhead+i[2:0];
    if (q_ptr == qtail)
        break;
    if ((queue[q_ptr].addr>>2 == addr>>2) && (q_ptr != qtail)) begin
        mask = queue[q_ptr].mask;
        data = queue[q_ptr].data>>(offset*8);
    end
end

A few small details:

  • Our address is aligned to simplify lookup; this is kept consistent for write logic as well
  • The newest write with a matching address is used to help with timing. The write logic makes the write store all previous valid writes to the same address
  • Not shown here, but if the mask covers all the data we need, we exit early without applying the read memory
  • Also not shown here, the op is used to find out where we need to read from

Writes

Since writes do not always cover a full word, we have to do partial writes. To do this, we need masks. We will also need to apply all previous writes to the same address (for the reason shown in Reads).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
if (qhead != qtail) begin
    for (int i = 0; i < 8; i++) begin
        logic [2:0] q_ptr;
        q_ptr = qhead+i[2:0];
        if (q_ptr == qtail)
            break;
        if ((queue[q_ptr].addr>>2 == addr>>2) && (q_ptr != qtail)) begin
            mask_ins = queue[q_ptr].mask;
            data_ins = queue[q_ptr].data;
        end
    end
end

... // uop.op switch statement to find the needed mask

if (!mask_needed_ins)
    uop_out.faulted <= 1'b1;
src_2_shifted = src_2<<(offset*8);
all_dat = { {mask_needed_ins[3] ? src_2_shifted[31:24] : data_ins[31:24]},
            {mask_needed_ins[2] ? src_2_shifted[23:16] : data_ins[23:16]},
            {mask_needed_ins[1] ? src_2_shifted[15:8]  : data_ins[15:8] },
            {mask_needed_ins[0] ? src_2_shifted[7:0]   : data_ins[7:0]  }};
queue[qtail].rob_id <= uop.rob_id;
queue[qtail].data <= all_dat;
queue[qtail].addr <= {addr>>2, 2'b00};
queue[qtail].mask <= mask_needed_ins|mask_ins;
qtail <= qtail + 1;

Applying Writes

This is probably the simplest part of the entire unit. If we receive a signal from retire to retire a write, all we have to do is use the head/tail structure to do a real write.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
always @(posedge clk) begin
    if (write_enable) begin
        if (queue_mask[0]) mem[queue_addr][7:0]   <= queue_data[7:0];
        if (queue_mask[1]) mem[queue_addr][15:8]  <= queue_data[15:8];
        if (queue_mask[2]) mem[queue_addr][23:16] <= queue_data[23:16];
        if (queue_mask[3]) mem[queue_addr][31:24] <= queue_data[31:24];
    end
    raw_mem_data <= mem[addr[14:2]];
end
...
always @(posedge clk or negedge reset) begin
...
        if (prev_written) begin
            t_qhead = t_qhead + 1;
            qhead <= qhead + 1;
            prev_written <= 1'b0;
        end
        ...
            write_enable <= 1;
            queue_mask <= queue[t_qhead].mask;
            queue_addr <= queue[t_qhead].addr>>2;
            queue_data <= queue[t_qhead].data;
            prev_written <= 1'b1;

The extra delay prevents a 1-cycle window where a read can read stale data.

Some notes on the entire AGU:

  • The AGU can fill up and stall; it can also have an in-flight read. For this reason, the AGU has a ready/not-ready bit that the issue stage has to read
  • Flushing (on a misprediction) wipes the entire LSQ
  • This unit also handles MMIO and program loading, which will be covered later

Writeback

To spin a donut, you must store the result of the instructions you executed.

Writeback has a very simple task: write the result of registers and mark them as ready. The entire piece of code is below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
always @(posedge `CLK or negedge CPU_RESET_n) begin
      if (!CPU_RESET_n) begin
            p_regs <= '0;
            p_reg_ready <= { {63{1'b0}},1'b1};
      end else if (!(flush|loading)) begin 
            logic [63:0] p_reg_ready_tmp;
            p_reg_ready_tmp = p_reg_ready & ~f_list_allocated; // just allocated == not yet ready
            for (int i = 0; i < 3; i++) begin
                  if (ex_uops[i].valid 
                  && ex_uops[i].dst_valid 
                  && !ex_uops[i].faulted
                  && (ex_uops[i].dst_reg != 0)) begin
                        p_regs[ex_uops[i].dst_reg] <= ex_uops[i].dst_val;
                        p_reg_ready_tmp[ex_uops[i].dst_reg] = 1'b1;
                  end
            end
            p_reg_ready <= p_reg_ready_tmp;
      end
end

Commit/Retire

I do-not have a metaphor for this unit.

Retire has the main responsibility of making sure the architectural state is correct. To do this, it has the ROB, the main structure within it that keeps uops in-order. Rename, when it receives the uops, generates an ROB entry. When a uop is done executing, the result gets forwarded to the retire unit to update the entry. The format of these entries is:

1
2
3
4
5
6
7
8
9
10
11
12
13
typedef struct packed {
    logic       finished; // is it done? // set after execution
    logic       spec;     // is it a speculative branch? // set by rename
    logic       store;    // does it have to be retired from the LSQ? // set by rename & after execution (due to MMIO, which is later in this section)
    logic       dst_valid;// is the destination a register? // set by rename
    logic [5:0] a_dst_reg;// destination for retiring architectural state // set by rename
    logic [5:0] pp_dst_reg;// for freeing // set by rename
    logic [5:0] p_dst_reg;// to update architectural state // set by rename
    logic       misspredict; // set after execution
    logic [31:0]new_pc; // for mispred // set after execution
    logic       faulted; // set by rename and after execution
    logic       taken; // set by rename
} rob_ent_t;

The bits we need for determining how to retire are: finished, spec, store, and misspredict. If the entry at the head is finished, then we are able to retire it and move on to the next entry. If the entry is a speculative branch, we use the misprediction bit to decide if we need to flush (we also have to send a signal to the decode stage for the BHT, but that section is later). If it is a store, we have to send a signal to the AGU so it retires the write from the LSQ.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
logic [3:0] tmp_tail;
tmp_tail = tail; 
for (int i = 0; i < 2; i++) begin
    if (rob_ent_val[i]) begin
        rob[tmp_tail] <= rob_entries[i];
        tmp_tail = tmp_tail + 1;
    end
end
tail <= tmp_tail;
for (int i = 0; i < 3; i++) begin
    if (ex_uops[i].valid) begin
        rob[ex_uops[i].rob_id].store <= (rob[ex_uops[i].rob_id].store | ex_uops[i].was_uart);
        rob[ex_uops[i].rob_id].misspredict <= (ex_uops[i].was_jmp && 
            (ex_uops[i].pred_taken != ex_uops[i].taken));
        rob[ex_uops[i].rob_id].taken <= ex_uops[i].taken;
        rob[ex_uops[i].rob_id].finished <= 1'b1;
        rob[ex_uops[i].rob_id].new_pc <= ex_uops[i].new_pc;
        rob[ex_uops[i].rob_id].faulted <= rob[ex_uops[i].rob_id].faulted | ex_uops[i].faulted;
    end
end
...
logic prev_ready;
logic [3:0] tmp_head;
logic [1:0] jmp_count;
jmp_count = '0;
f_list_freed <= '0;
tmp_head = head;
prev_ready = 1;
retire_rob_valid <= 0;
update_btb <= 2'b00;
for (int i = 0; i < 6; i++) begin // subject to change
    if (prev_ready && rob[tmp_head].finished && (jmp_count != 2'b11)) begin
        if (rob[tmp_head].spec) begin
            ... // BHT logic
            jmp_count = jmp_count + 1;
        end
        if (rob[tmp_head].faulted) begin
            halt <= 1'b1;
        end
        if (rob[tmp_head].store) begin
            prev_ready = 0;
            retire_rob_valid <= 1;
            retire_rob_id <= tmp_head;
        end else if (rob[tmp_head].misspredict) begin
            flush <= 1'b1; 
            prev_ready = 0;
            new_pc <= rob[tmp_head].new_pc;
        end
        if (rob[tmp_head].dst_valid) begin // not an else statement because misspredicted jumps can set regs
            if (rob[tmp_head].pp_dst_reg)
                f_list_freed[rob[tmp_head].pp_dst_reg] <= 1'b1;
            a_reg_state[rob[tmp_head].a_dst_reg] <= rob[tmp_head].p_dst_reg;
        end
        tmp_head = tmp_head + 1;
    end else
        prev_ready = 0;
end
head <= tmp_head;

One line of this code is extremely important flush <= 1'b1;. This gets forwarded to nearly every unit and triggers something close to what would happen on a CPU reset. But we still have to preserve branch prediction state and fix the rename table. The internal a_reg_state (which is present in the block above) gets sent to rename and gets copied over into the rename table and the free list gets updated. For other structures that should not get reset (like the BHT) we simply skip them.

What else?

As you may have noticed, we have not made a donut spin yet. For that, we need MMIO, program loading, and a few optimizations to make it run at more than half a frame a second. All of this will be covered in part two, since this write-up is already the longest I have ever written.

Shortcomings & Justifications

  • The architectural register state being copied over in a single cycle from the commit stage makes it nearly (emphasis on nearly) unsynthesizable for the FPGA. I learned that traditional architectures typically copy over architectural state in the rename stage whenever a speculative jump is encountered (called “snapshots”). Another choice would be to copy over the state X bits at a time to prevent such a wide (192-bit in our design) output port from being forwarded to rename.
  • Memory access is not like register access; we can not rename or similar. What modern CPUs do is called “memory disambiguation,” where they speculate that two or more memory accesses do not conflict and then issue them out-of-order. Since writes do not really commit in this process (see LSQ) we are able to squash incorrect accesses at a small penalty. Our CPU does not do this, instead it treats memory as in order and only reaps benefits from the LSQ allowing speculative accesses.
  • Memory access is only BRAM on our CPU. SRAM and DDR accesses would mean we would need to introduce a multi-layered cache structure, which introduces a lot of complexity for a relatively slow processor.
  • Only the base RISC-V 32 instruction set is supported (and the multiply/division extensions). This comes at the cost of many things, such as speed, privilege boundaries, etc. This project was largely made for learning SystemVerilog, how OoO CPU cores work, and general microarchitecture design. Implementing more instructions from a standardized instruction set is necessary for completeness on a real CPU, but requires a lot of extra work with minimal educational benefit. The base RISC-V 32 ISA was used because compiler toolchains already support it, so running gcc compiled programs like donut.c is possible with minimal effort. Additionally, extensions like the compressed instruction set also introduce additional decoding complexity.
This post is licensed under CC BY 4.0 by the author.