Spinning A Donut (Creating an OoO CPU to Run donut.c) [Part 2 of 2]
This is part two of two talking about an out-of-order CPU I built. In this write-up, I will discuss the implementation of program loading, speculative memory-mapped I/O, branch prediction, and the implementation of the RV32IM extension.
Program Loading
To run any program that spins a donut, we must be able to load it onto the CPU. Since we must be able to read instruction memory and program memory at the same time, it was decided that the design would have two separate memory arrays (for simplicity & bandwidth, although it is not an ideal design). The memory array that fetch had access to was small, as it only needed to house the .text section, while the memory array that the AGU had access to was large, as it needed to house .text, the stack, globals, and more.
To write any program, we need to know how much we are writing. To do that, we read a 32-bit header from the UART before reading the program.
1
2
3
4
5
6
7
8
9
else if (!rc_done) // waits for the UART module to say that it is receiving a byte before allowing reads to occur again
byte_written <= 0;
else if (rc_done && !byte_written) begin
if (!header_loaded) begin
byte_written <= 1;
header[load_word_ptr] <= RXDATA[1];
load_word_ptr <= load_word_ptr + 1;
if (load_word_ptr == 2'b11) // terminate on the last byte being written
header_loaded <= 1'b1;
Once the header is loaded, we know how much we need to receive. From there, it is pretty simple. We forward one loaded word (since our memory is composed of words instead of bytes) after another to both memory arrays and increment the memory array pointers.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
byte_written <= 1;
tmp_word[load_word_ptr] = RXDATA[1];
loaded_word[load_word_ptr] <= RXDATA[1];
load_word_ptr <= load_word_ptr + 1;
header <= header - 1;
if (load_word_ptr == 2'b11 || !(header-1)) begin // every 4 bytes we transmit a word
loaded_valid <= 1'b1;
write_enable <= '1;
queue_data <= tmp_word;
queue_mask <= 4'b1111;
load_ptr <= load_ptr + 1;
queue_addr <= load_ptr;
end
if (header == 1) begin
loading_done <= 1'b1;
end
As you can see, this is directly setting registers which effect the memory array that the AGU uses. Since we have to avoid multiple drivers, this logic is embedded inside the AGU. The word is also forwarded outside of the module:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
(* ramstyle = "M10K" *) reg [31:0] mem1 [2047:0];
(* ramstyle = "M10K" *) reg [31:0] mem2 [2047:0];
reg [31:0] tmp_load_word;
assign predecode_instr = {i_buf_two, i_buf_one};
always @(posedge `CLK) begin
i_buf_one <= mem1[fetch_addr_aligned];
if (we1)
mem1[load_ptr_tmp] <= tmp_load_word;
end
always @(posedge `CLK) begin
i_buf_two <= mem2[fetch_addr_aligned];
if (we2)
mem2[load_ptr_tmp] <= tmp_load_word;
end
...
if (loading) begin
we1 <= 1'b0;
we2 <= 1'b0;
if (loaded_valid) begin
if (load_ptr <= 4095) begin
tmp_load_word <= loaded_word;
load_ptr <= load_ptr + 1;
load_ptr_tmp <= {load_ptr} >> 1;
if (load_ptr[0]) // even
we2 <= 1'b1;
else
we1 <= 1'b1;
end
end
end
This looks kinda wonky because of the non-blocking assignments, but essentially it splits word writes into two buffers while keeping the writes in bounds. If you can recall, the two buffers exist to increase our bandwidth since we read 8 bytes at a time in order to achieve a 2-instruction-wide CPU.
With all of this done, we can finally run programs! Unfortunately, the only feedback we can receive in this design is by wiring some registers to LEDs or similar.
Memory Mapped UART
To achieve text input and output, we can use UART. Since we already built the modules, the abstraction we can use is:
- We only need to set a bit high for one cycle to indicate readiness and fill a 1-byte register in order to transmit a byte
- We only need to read a 1-byte register if the ready signal goes from low to high
For the program running on the CPU, it needs:
- A status byte, with one bit indicating if there was a new byte read since the last byte was read. The other byte indicates if the CPU is ready to transmit another byte
- A byte from the receiver to read from
- A byte for the transmission to write to
As you may notice, a read has to also write to a status bit. To make things more complicated, the writer has to be speculative. Additionally, the receiver can only overwrite the byte if the previous byte was architecturally read, which needs to set the architectural and speculative status bit. For writing, it is slightly simpler. We need to speculatively write to the status register and transmission buffer, then copy it over into the real thing during the commit stage.
Since this is memory-mapped, it can be embedded into our AGU. But since the dual write structure makes things complicated, we can use pseudo-memory, which is really just registers.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
queue[qtail].rob_id <= uop.rob_id;
queue[qtail].data <= src_2;
queue[qtail].addr <= {addr>>2, 2'b00};
if (addr == 32'h10000000) begin
qtail <= qtail + 1;
STATUS[0][0] <= 1'b0;
end else if (addr == 32'h10000004) begin
qtail <= qtail + 1;
case (uop.op)
4'b0000: // load byte
uop_out.dst_val <= { {25{RXDATA[0][7]}}, RXDATA[0][6:0]};
4'b0001: // load half
... // op handeling
default:
uop_out.faulted <= 1'b1;
endcase
STATUS[0][1] <= 1'b0;
end else if (addr == 32'h10000008) begin
uop_out.was_uart <= 1'b0;
uop_out.dst_val <= { {31{1'b0}}, status};
end
This is kinda coarse, as it only accounts for memory addresses, so invalid usages are handled incorrectly. The STATUS register is a 2*2 array. The 0th array is the speculative state, [0][0] indicates whether the transmission is ready, [0][1] contains a bit indicating whether there is a new byte to receive. So when we want to read a byte, we copy over RXDATA[0] (the 1st index contains the live state driven by the UART module) into the destination value and speculatively mark it as read. For writing, we speculatively mark the transceiver as not-ready. For both of these cases, we allocate an entry in the LSQ. For status reads, we simply read the status.
For retiring, we simply copy over the physical state to the architectural state. But to keep the architectural state consistent, we update based on signals from UART:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
if (!STATUS[1][1] && (!prev_rc_done) && (rc_done)) begin
new_status[1][1] <= !new_status[1][1];
STATUS[1][1] <= 1'b1;
RXDATA[0] <= RXDATA[1];
end
if (!STATUS[1][0] && (!prev_out_ready) && (out_ready)) begin
new_status[1][0] <= !new_status[1][0];
STATUS[1][0] <= 1'b1;
end
if (prev_written) begin
t_qhead = t_qhead + 1;
qhead <= qhead + 1;
prev_written <= 1'b0;
end
...
if ((queue[t_qhead].addr == 32'h10000000 ||
queue[t_qhead].addr == 32'h10000004) && retire_rob_valid) begin
prev_written <= 1'b1;
case (queue[t_qhead].addr)
32'h10000000 : begin
TXDATA <= queue[t_qhead].data;
dat_out_ready <= 1'b1;
STATUS[1][0] <= 1'b0;
new_status[1][0] <= !new_status[1][0];
end
32'h10000004 : begin
STATUS[1][1] <= 1'b0;
new_status[1][1] <= !new_status[1][1];
end
endcase
end ...
As you can see, we also update a new_status register. This indicates to the other clocked logic that the architectural state has changed, so it can copy it over without multiple drivers:
1
2
3
4
5
6
7
8
9
if (new_status[1][1] != new_status[0][1]) begin
STATUS[0][1] <= STATUS[1][1];
new_status[0][1] <= new_status[1][1];
status[1] = STATUS[1][1];
end if (new_status[1][0] != new_status[0][0]) begin
STATUS[0][0] <= STATUS[1][0];
new_status[0][0] <= new_status[1][0];
status[0] = STATUS[1][0];
end
Branch Prediction
For our implementation, we will use a simple BHT with a few 2-bit entries. The major reason we need at least some form of branch prediction is that a multi-cycle penalty gets incurred if we do not predict correctly, so minimizing this is important. Additionally, it is relatively easy to predict branches at over a 70% success rate, so the small amount of complexity is worth it.
To do this, I embedded a few structures inside the decode stage. On every predicted jump taken, we must tell the fetch unit to jump. Implementing this was simple since we could build off the immediate jump logic. We also must indicate whether or not a branch was taken inside the uop. Upon retiring or flushing, the decode stage receives a signal from the retire stage to indicate if the branch was taken. To prevent storing the PC in the ROB, a head/tail structure is embedded inside the decode stage, which queues the speculative jumps. If a jump is taken, then the 2-bit counter is incremented; if it is not taken, then it is decremented (with overflow being prevented). The reason the structure is 2 bits is that we do not want jumps that rarely occur to flip the state, which could lead to an additional misprediction.
When we flush, we need to update the BHT but also clear the buffer:
1
2
3
4
5
6
7
8
9
10
11
12
for (i = 0; i < 2; i = i + 1) begin
if (update_btb[i]) begin
if ((taken[i] == 2'b11 && btb[btb_ent_q[btb_head]] != 2'b00)
|| (taken[i] == 2'b01 && btb[btb_ent_q[btb_head]] != 2'b11))
btb[btb_ent_q[btb_head]] <= btb[btb_ent_q[btb_head]] + taken[i]; // "taken" will be 1 or negative 1
btb_head = btb_head + 1;
end
end
btb_head <= '0;
btb_tail <= '0;
btb_ent_q <= '0;
...
Every cycle, we poll the retire unit for updates:
1
2
3
4
5
6
7
8
for (i = 0; i < 2; i = i + 1) begin
if (update_btb[i]) begin
if ((taken[i] == 2'b11 && btb[btb_ent_q[btb_head]] != 2'b00)
|| (taken[i] == 2'b01 && btb[btb_ent_q[btb_head]] != 2'b11))
btb[btb_ent_q[btb_head]] <= btb[btb_ent_q[btb_head]] + taken[i]; // "taken" will be 1 or negative 1
btb_head = btb_head + 1;
end
end
And every jump we allocate an entry and read from the BHT:
1
2
3
4
5
6
7
8
9
10
B_type : begin // assigns src1, src2, and immediate
btb_ent_q[btb_tail] <= prev_fetch_addr + (i*4);
btb_tail = btb_tail + 1;
if (btb[btb_addr[5:0]][1]) begin
uops[i].pred_taken <= 1'b1; // whole uop initialized to zero, so not taken otherwise
just_jumped <= 1'b1; // so next cycle inserts bubbles
jumped = 1; // so next instruction becomes bubble
jmp <= 1'b1; // so fetch knows to update PC
new_pc <= prev_fetch_addr + (i*4) + , instructions[i][7], instructions[i][30:25], instructions[i][11:8], 1'b0};
end
Donut!
To load the donut program onto our FPGA, we must wrap the printf/puts/putchar calls in UART helper functions, add helpers for multiplication and addition, compile it with our specific instruction set (bare RV32I) and a small assembly helper that sets the stack pointer, then copy over the ELF into a raw machine code file. To transmit, we have to configure our UART to the right baud rate and make it raw, then send the header, then the raw binary, and then read the output:
RV32IM
The demo is pretty slow. Possibly a frame every two seconds. Luckily, the reason why is very clear: we are emulating multiplication and division at a very heavy penalty, and donuts need a lot of multiplication and division.
Multiplication
Multiplication is pretty easy, since shifts represent multiplication by a power of 2, we can decompose a single multiplication into multiple additions of multiplications by powers of 2. Luckily, we do not have to implement this by hand, and we can use DSPs to make it have a single-cycle delay.
First things first, we have to make the multiplication unsigned. This is because two’s complement does not apply to multiplication. To do this, we use combinational logic (which allows us to feed it to the DSP faster), which is dependent on the opcode passed through.
1
2
3
4
5
logic [31:0] src_1_unsigned;
assign src_1_unsigned = (src_1[31] & uop.op == 4'b0001) ||
(src_1[31] & uop.op == 4'b0010) ? -src_1 : src_1;
logic [31:0] src_2_unsigned;
assign src_2_unsigned = (src_2[31] & uop.op == 4'b0001) ? -src_2 : src_2;
The DSP invocation is pretty simple:
1
2
always @(posedge clk)
mul_res <= src_1_unsigned*src_2_unsigned;
For our main logic, all we have to do is plug in the result into the destination value with respect to whatever opcode we are executing and mark the ALU as no-longer ready.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
always @(posedge clk or negedge CPU_RESET_n) begin
...
end else if (mul) begin
logic [63:0] negative_mul_res;
ready <= 1'b1;
mul <= 1'b0;
negative_mul_res = -mul_res;
valid_count <= valid_count + valid;
... // set uops to default values
case (working_uop.op)
4'b0000 :
uop_out.dst_val <= mul_res[31:0];
4'b0001 :
uop_out.dst_val <= (src_1_sign ^ src_2_sign) ?
negative_mul_res[63:32] : mul_res[63:32];
4'b0010 :
uop_out.dst_val <= (src_1_sign) ?
negative_mul_res[63:32] : mul_res[63:32];
4'b0011 :
uop_out.dst_val <= mul_res[63:32];
default :
uop_out.faulted <= 1'b1;
endcase
end else if (valid | valid_count) begin
...
if (uop.op_type == 3'b111) begin // RV32M logic
ready <= 1'b0;
if (uop.op == 4'b0000 || uop.op == 4'b0001
|| uop.op == 4'b0010 || uop.op == 4'b0011) begin
mul <= 1'b1;
uop_out.valid <= 1'b0;
src_1_sign <= src_1[31];
src_2_sign <= src_2[31];
end
...
Division
Division is a bit heavier. We compute the quotient by determining one bit at a time, starting from the highest bit. This is not as parallel as multiplication, which is why we need many more cycles to do this. Additionally, it is a variable cycle operation because our remainder can become zero before we compute the entire quotient.
OP handling is relatively similar, although we have to store our signs in registers because the next instruction can overwrite any combinational logic (as there is a one-cycle delay between the ready signal, so issue may fill the uop data).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
if (uop.op_type == 3'b111) begin // RV32M logic
ready <= 1'b0;
... // mul handeling
end else if (uop.op == 4'b0101 || uop.op == 4'b0111) begin
div_acum <= '0;
acum_ptr <= 5'd31;
div <= 1'b1;
uop_out.valid <= 1'b0;
src_1_sign <= 1'b0;
src_2_sign <= 1'b0;
div_src1 <= src_1;
div_src2 <= src_2;
end else if (uop.op == 4'b0100 || uop.op == 4'b0110) begin
div_acum <= '0;
acum_ptr <= 5'd31;
div <= 1'b1;
uop_out.valid <= 1'b0;
src_1_sign <= src_1[31];
src_2_sign <= src_2[31];
div_src1 <= src_1[31] ? -src_1 : src_1;
div_src2 <= src_2[31] ? -src_2 : src_2;
end
end
The core logic computes two quotient bits by determining if multiplying (which is implemented as shifting) the divisor by a power of 2 and subtracting from the running dividend (which accumulates all past subtractions). If the result is under one, then we set the quotient to zero and move on; if it is over one, we set the quotient to one and move on; if it is zero, then we set the quotient bit to one and exit early.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
begin
for (int i = 0; i < 2; i++) begin
tmp_div = , div_src2} << (acum_ptr-i);
if (tmp_rem >= tmp_div) begin
tmp_acum[acum_ptr-i] = 1'b1;
tmp_rem = , tmp_rem} - tmp_div;
end
end
div_src1 <= tmp_rem;
if (acum_ptr == 1) begin
acum_ptr <= '0;
end else
acum_ptr <= acum_ptr - 2;
div_acum <= tmp_acum;
end
The implementation of exit-handling is copying over output values based on op types, so I will omit it.
Donut!
As you can see, this is at least 4fps! An 8x improvement!