Spinning A Donut (Creating an OoO CPU to Run donut.c) [Part 2 of 2]

Posted Jun 25, 2026 Updated Jun 25, 2026

By bytebrew

14 min read

This is part two of two talking about an out-of-order CPU I built. In this write-up, I will discuss the implementation of program loading, speculative memory-mapped I/O, branch prediction, and the implementation of the RV32IM extension.

Program Loading

To run any program that spins a donut, we must be able to load it onto the CPU. Since we must be able to read instruction memory and program memory at the same time, it was decided that the design would have two separate memory arrays (for simplicity & bandwidth, although it is not an ideal design). The memory array that fetch had access to was small, as it only needed to house the .text section, while the memory array that the AGU had access to was large, as it needed to house .text, the stack, globals, and more.

To write any program, we need to know how much we are writing. To do that, we read a 32-bit header from the UART before reading the program.

  
else if (!rc_done) // waits for the UART module to say that it is receiving a byte before allowing reads to occur again
    byte_written <= 0;
else if (rc_done && !byte_written) begin
    if (!header_loaded) begin
        byte_written <= 1;
        header[load_word_ptr] <= RXDATA[1];
        load_word_ptr <= load_word_ptr + 1;
        if (load_word_ptr == 2'b11) // terminate on the last byte being written
            header_loaded <= 1'b1; 

Once the header is loaded, we know how much we need to receive. From there, it is pretty simple. We forward one loaded word (since our memory is composed of words instead of bytes) after another to both memory arrays and increment the memory array pointers.

  
byte_written <= 1;
tmp_word[load_word_ptr] = RXDATA[1];
loaded_word[load_word_ptr] <= RXDATA[1];
load_word_ptr <= load_word_ptr + 1;
header <= header - 1;
if (load_word_ptr == 2'b11 || !(header-1)) begin // every 4 bytes we transmit a word
    loaded_valid <= 1'b1;
    write_enable <= '1;
    queue_data <= tmp_word;
    queue_mask <= 4'b1111;
    load_ptr <= load_ptr + 1;
    queue_addr <= load_ptr;
end
if (header == 1) begin
    loading_done <= 1'b1;
end

As you can see, this is directly setting registers which effect the memory array that the AGU uses. Since we have to avoid multiple drivers, this logic is embedded inside the AGU. The word is also forwarded outside of the module:

  
(* ramstyle = "M10K" *) reg [31:0] mem1 [2047:0];
(* ramstyle = "M10K" *) reg [31:0] mem2 [2047:0];

reg [31:0] tmp_load_word;
assign predecode_instr = {i_buf_two, i_buf_one};
always @(posedge `CLK) begin
      i_buf_one <= mem1[fetch_addr_aligned];
      if (we1)
            mem1[load_ptr_tmp] <= tmp_load_word;
end

always @(posedge `CLK) begin
      i_buf_two <= mem2[fetch_addr_aligned];
      if (we2)
            mem2[load_ptr_tmp] <= tmp_load_word;
end
...
if (loading) begin
      we1 <= 1'b0;
      we2 <= 1'b0;
      if (loaded_valid) begin
            if (load_ptr <= 4095) begin
                  tmp_load_word <= loaded_word;
                  load_ptr <= load_ptr + 1;
                  load_ptr_tmp <= {load_ptr} >> 1;
                  if (load_ptr[0]) // even
                        we2 <= 1'b1;
                  else
                        we1 <= 1'b1;
            end
      end
end

This looks kinda wonky because of the non-blocking assignments, but essentially it splits word writes into two buffers while keeping the writes in bounds. If you can recall, the two buffers exist to increase our bandwidth since we read 8 bytes at a time in order to achieve a 2-instruction-wide CPU.

With all of this done, we can finally run programs! Unfortunately, the only feedback we can receive in this design is by wiring some registers to LEDs or similar.

Memory Mapped UART

To achieve text input and output, we can use UART. Since we already built the modules, the abstraction we can use is:

We only need to set a bit high for one cycle to indicate readiness and fill a 1-byte register in order to transmit a byte
We only need to read a 1-byte register if the ready signal goes from low to high

For the program running on the CPU, it needs:

A status byte, with one bit indicating if there was a new byte read since the last byte was read. The other byte indicates if the CPU is ready to transmit another byte
A byte from the receiver to read from
A byte for the transmission to write to

As you may notice, a read has to also write to a status bit. To make things more complicated, the writer has to be speculative. Additionally, the receiver can only overwrite the byte if the previous byte was architecturally read, which needs to set the architectural and speculative status bit. For writing, it is slightly simpler. We need to speculatively write to the status register and transmission buffer, then copy it over into the real thing during the commit stage.

Since this is memory-mapped, it can be embedded into our AGU. But since the dual write structure makes things complicated, we can use pseudo-memory, which is really just registers.

  
queue[qtail].rob_id <= uop.rob_id;
queue[qtail].data <= src_2;
queue[qtail].addr <= {addr>>2, 2'b00};
if (addr == 32'h10000000) begin
    qtail <= qtail + 1;
    STATUS[0][0] <= 1'b0;
end else if (addr == 32'h10000004) begin
    qtail <= qtail + 1;
    case (uop.op)
        4'b0000: // load byte 
            uop_out.dst_val <= { {25{RXDATA[0][7]}}, RXDATA[0][6:0]};
        4'b0001: // load half 
            ... // op handeling
        default:
            uop_out.faulted <= 1'b1;
    endcase
    STATUS[0][1] <= 1'b0;
end else if (addr == 32'h10000008) begin
    uop_out.was_uart <= 1'b0;
    uop_out.dst_val <= { {31{1'b0}}, status};
end

This is kinda coarse, as it only accounts for memory addresses, so invalid usages are handled incorrectly. The STATUS register is a 2*2 array. The 0th array is the speculative state, [0][0] indicates whether the transmission is ready, [0][1] contains a bit indicating whether there is a new byte to receive. So when we want to read a byte, we copy over RXDATA[0] (the 1st index contains the live state driven by the UART module) into the destination value and speculatively mark it as read. For writing, we speculatively mark the transceiver as not-ready. For both of these cases, we allocate an entry in the LSQ. For status reads, we simply read the status.

For retiring, we simply copy over the physical state to the architectural state. But to keep the architectural state consistent, we update based on signals from UART:

  
if (!STATUS[1][1] && (!prev_rc_done) && (rc_done)) begin
    new_status[1][1] <= !new_status[1][1];
    STATUS[1][1] <= 1'b1;
    RXDATA[0] <= RXDATA[1];
end
if (!STATUS[1][0] && (!prev_out_ready) && (out_ready)) begin
    new_status[1][0] <= !new_status[1][0];
    STATUS[1][0] <= 1'b1;
end
if (prev_written) begin
    t_qhead = t_qhead + 1;
    qhead <= qhead + 1;
    prev_written <= 1'b0;
end
...
if ((queue[t_qhead].addr == 32'h10000000 ||
        queue[t_qhead].addr == 32'h10000004) && retire_rob_valid) begin
    prev_written <= 1'b1;
    case (queue[t_qhead].addr)
        32'h10000000 : begin
            TXDATA <= queue[t_qhead].data;
            dat_out_ready <= 1'b1;
            STATUS[1][0] <= 1'b0;
            new_status[1][0] <= !new_status[1][0];
        end
        32'h10000004 : begin
            STATUS[1][1] <= 1'b0;
            new_status[1][1] <= !new_status[1][1];
        end
    endcase
end ...

As you can see, we also update a new_status register. This indicates to the other clocked logic that the architectural state has changed, so it can copy it over without multiple drivers:

  
if (new_status[1][1] != new_status[0][1]) begin
    STATUS[0][1] <= STATUS[1][1];
    new_status[0][1] <= new_status[1][1];
    status[1] = STATUS[1][1];
end if (new_status[1][0] != new_status[0][0]) begin
    STATUS[0][0] <= STATUS[1][0];
    new_status[0][0] <= new_status[1][0];
    status[0] = STATUS[1][0];
end

Branch Prediction

For our implementation, we will use a simple BHT with a few 2-bit entries. The major reason we need at least some form of branch prediction is that a multi-cycle penalty gets incurred if we do not predict correctly, so minimizing this is important. Additionally, it is relatively easy to predict branches at over a 70% success rate, so the small amount of complexity is worth it.

To do this, I embedded a few structures inside the decode stage. On every predicted jump taken, we must tell the fetch unit to jump. Implementing this was simple since we could build off the immediate jump logic. We also must indicate whether or not a branch was taken inside the uop. Upon retiring or flushing, the decode stage receives a signal from the retire stage to indicate if the branch was taken. To prevent storing the PC in the ROB, a head/tail structure is embedded inside the decode stage, which queues the speculative jumps. If a jump is taken, then the 2-bit counter is incremented; if it is not taken, then it is decremented (with overflow being prevented). The reason the structure is 2 bits is that we do not want jumps that rarely occur to flip the state, which could lead to an additional misprediction.

When we flush, we need to update the BHT but also clear the buffer:

  
for (i = 0; i < 2; i = i + 1) begin
	if (update_btb[i]) begin
		if ((taken[i] == 2'b11 && btb[btb_ent_q[btb_head]] != 2'b00)
			|| (taken[i] == 2'b01 && btb[btb_ent_q[btb_head]] != 2'b11))
				btb[btb_ent_q[btb_head]] <= btb[btb_ent_q[btb_head]] + taken[i]; // "taken" will be 1 or negative 1
			btb_head = btb_head + 1;
	end
end
btb_head <= '0;
btb_tail <= '0;
btb_ent_q <= '0;
...

Every cycle, we poll the retire unit for updates:

  
for (i = 0; i < 2; i = i + 1) begin
	if (update_btb[i]) begin
		if ((taken[i] == 2'b11 && btb[btb_ent_q[btb_head]] != 2'b00)
			|| (taken[i] == 2'b01 && btb[btb_ent_q[btb_head]] != 2'b11))
				btb[btb_ent_q[btb_head]] <= btb[btb_ent_q[btb_head]] + taken[i]; // "taken" will be 1 or negative 1
		btb_head = btb_head + 1;
	end
end

And every jump we allocate an entry and read from the BHT:

  
B_type : begin // assigns src1, src2, and immediate
		btb_ent_q[btb_tail] <= prev_fetch_addr + (i*4);
		btb_tail = btb_tail + 1;
		if (btb[btb_addr[5:0]][1]) begin
			uops[i].pred_taken <= 1'b1; // whole uop initialized to zero, so not taken otherwise
			just_jumped <= 1'b1; // so next cycle inserts bubbles
			jumped = 1; // so next instruction becomes bubble
			jmp <= 1'b1; // so fetch knows to update PC
			new_pc <= prev_fetch_addr + (i*4) + , instructions[i][7], instructions[i][30:25], instructions[i][11:8], 1'b0};
		end

Donut!

To load the donut program onto our FPGA, we must wrap the printf/puts/putchar calls in UART helper functions, add helpers for multiplication and addition, compile it with our specific instruction set (bare RV32I) and a small assembly helper that sets the stack pointer, then copy over the ELF into a raw machine code file. To transmit, we have to configure our UART to the right baud rate and make it raw, then send the header, then the raw binary, and then read the output:

RV32IM

The demo is pretty slow. Possibly a frame every two seconds. Luckily, the reason why is very clear: we are emulating multiplication and division at a very heavy penalty, and donuts need a lot of multiplication and division.

Multiplication

Multiplication is pretty easy, since shifts represent multiplication by a power of 2, we can decompose a single multiplication into multiple additions of multiplications by powers of 2. Luckily, we do not have to implement this by hand, and we can use DSPs to make it have a single-cycle delay.

First things first, we have to make the multiplication unsigned. This is because two’s complement does not apply to multiplication. To do this, we use combinational logic (which allows us to feed it to the DSP faster), which is dependent on the opcode passed through.

  
logic [31:0] src_1_unsigned;
assign src_1_unsigned = (src_1[31] & uop.op == 4'b0001) || 
    (src_1[31] & uop.op == 4'b0010) ? -src_1 : src_1;
logic [31:0] src_2_unsigned;
assign src_2_unsigned = (src_2[31] & uop.op == 4'b0001) ? -src_2 : src_2;

The DSP invocation is pretty simple:

  
always @(posedge clk)
    mul_res <= src_1_unsigned*src_2_unsigned;

For our main logic, all we have to do is plug in the result into the destination value with respect to whatever opcode we are executing and mark the ALU as no-longer ready.

  
always @(posedge clk or negedge CPU_RESET_n) begin
...
    end else if (mul) begin
        logic [63:0] negative_mul_res;
        ready <= 1'b1;
        mul <= 1'b0;
        negative_mul_res = -mul_res;

        valid_count <= valid_count + valid;
        ... // set uops to default values

        case (working_uop.op)
            4'b0000 :
                uop_out.dst_val <= mul_res[31:0];
            4'b0001 :
                uop_out.dst_val <= (src_1_sign ^ src_2_sign) ?
                    negative_mul_res[63:32] : mul_res[63:32];
            4'b0010 :
                uop_out.dst_val <= (src_1_sign) ?
                    negative_mul_res[63:32] : mul_res[63:32];
            4'b0011 :
                uop_out.dst_val <= mul_res[63:32];
            default :
                uop_out.faulted <= 1'b1;
        endcase
    end else if (valid | valid_count) begin
        ...
        if (uop.op_type == 3'b111) begin // RV32M logic
            ready <= 1'b0;
            if (uop.op == 4'b0000 || uop.op == 4'b0001
                    || uop.op == 4'b0010 || uop.op == 4'b0011) begin
                mul <= 1'b1;
                uop_out.valid <= 1'b0;
                src_1_sign <= src_1[31];
                src_2_sign <= src_2[31];
            end
...

Division

Division is a bit heavier. We compute the quotient by determining one bit at a time, starting from the highest bit. This is not as parallel as multiplication, which is why we need many more cycles to do this. Additionally, it is a variable cycle operation because our remainder can become zero before we compute the entire quotient.

OP handling is relatively similar, although we have to store our signs in registers because the next instruction can overwrite any combinational logic (as there is a one-cycle delay between the ready signal, so issue may fill the uop data).

  
if (uop.op_type == 3'b111) begin // RV32M logic
    ready <= 1'b0;
    ... // mul handeling
    end else if (uop.op == 4'b0101 || uop.op == 4'b0111) begin
        div_acum <= '0;
        acum_ptr <= 5'd31;
        div <= 1'b1;
        uop_out.valid <= 1'b0;
        src_1_sign <= 1'b0;
        src_2_sign <= 1'b0;
        div_src1 <= src_1;
        div_src2 <= src_2;
    end else if (uop.op == 4'b0100 || uop.op == 4'b0110) begin
        div_acum <= '0;
        acum_ptr <= 5'd31;
        div <= 1'b1;
        uop_out.valid <= 1'b0;
        src_1_sign <= src_1[31];
        src_2_sign <= src_2[31];
        div_src1 <= src_1[31] ? -src_1 : src_1;
        div_src2 <= src_2[31] ? -src_2 : src_2;
    end
end 

The core logic computes two quotient bits by determining if multiplying (which is implemented as shifting) the divisor by a power of 2 and subtracting from the running dividend (which accumulates all past subtractions). If the result is under one, then we set the quotient to zero and move on; if it is over one, we set the quotient to one and move on; if it is zero, then we set the quotient bit to one and exit early.

  
begin
    for (int i = 0; i < 2; i++) begin
        tmp_div = , div_src2} << (acum_ptr-i);
        if (tmp_rem >= tmp_div) begin
            tmp_acum[acum_ptr-i] = 1'b1;
            tmp_rem = , tmp_rem} - tmp_div;
        end
    end

    div_src1 <= tmp_rem;
    if (acum_ptr == 1) begin
        acum_ptr <= '0;
    end else
        acum_ptr <= acum_ptr - 2;
    div_acum <= tmp_acum;
end

The implementation of exit-handling is copying over output values based on op types, so I will omit it.

Donut!

As you can see, this is at least 4fps! An 8x improvement!

microarchitecture

dev

This post is licensed under CC BY 4.0 by the author.

Program Loading

Memory Mapped UART

Branch Prediction

Donut!

RV32IM

Multiplication

Division

Donut!

Trending Tags