Jamie Iles - Computer Systems Engineer

RXV: RISC-V RV32IMA CPU

RXV microarchitecture

Trial layout on SKY130A

FPGA SoC Diagram

The RXV core is a RISC-V RV32IMA CPU that successfully runs mainline Linux with a CoreMark score of 2.61/MHz. The core is in-order with a 7 stage pipeline, out of order completion, register renaming, dynamic branch prediction and indirect branch target prediction, set-associative caches, fully associative TLBs and AXI4 support. A design for the Digilent Arty S7 board uses 32KB 8-way associated cache and 16 way fully associative TLB each for instruction and data paths. Externally, Xilinx IP provides AXI4 interconnect, SPI controller, interrupt controller, DDR3 controller and 16550A.

On top of the RV32IMA base, the following extensions are implemented:

Zicsr
Cifencei
Sstc
Sscofpmf

The pipeline consists of:

Fetch 1: next PC generation, combined BTB/PHT lookup
Fetch 2: instruction cache tag lookup, instruction TLB lookup, branch prediction resteer
Fetch 2: tag compare, prefetch queue write
Decode: instruction decode, register rename, misidentified branch resteer and issue. All instructions apart from atomic fetch/op/store are a single uop, atomic operations may require more than one uop
Execute: execute in one of 4 functional units:
- LSU: fully pipelined 3 cycle load-use latency. Misaligned load/store exceptions are raised in the first cycle, invalid AMO (atomic to device memory) and page faults are handled in the second cycle
- Integer: integer operations, branches, CSR accesses, 1 cycle latency with result forwarding back to input. Illegal instruction, branch alignment and environment call exceptions are raised here
- Multiply: fully pipelined 5 cycle result-use latency
- Divide: non-pipelined 33 cycle result-use latency
Writeback: results are written back to the register file
Commit: rename file updated and exceptions raised

Atomic operations are broken down into multiple micro-ops using register rename.

Performance Counters

There are a configurable number of performance counters in addition to the mcycle+minstret fixed counters which can be used for profiling:

- PMU_CYCLES
- PMU_INSTRET
- PMU_BRANCH
- PMU_BRANCH_MISPRED
- PMU_FE_STALL
- PMU_BE_STALL
- PMU_L1D_READ
- PMU_L1D_READ_MISS
- PMU_L1D_WRITE
- PMU_L1D_WRITE_MISS
- PMU_L1I_READ
- PMU_L1I_READ_MISS
- PMU_DTLB_READ
- PMU_DTLB_READ_MISS
- PMU_ITLB_READ
- PMU_ITLB_READ_MISS

FPGA Implementation

The default Arty S7 configuration has:

RXV Core
- 32KB 8-way set associated instruction cache
- 32KB 8-way set associated data cache
- 16-way fully associated instruction TLB
- 16-way fully associated data TLB
- MTIME reference at ~10MHz
- Separate AXI4 instruction+data busses
Xilinx AXI interrupt controller
Xilinx AXI 16550A UART
Xilinx AXI SPI master with 1 chip select and 256 byte FIFO
Xilinx AXI4 SmartConnect
Xilinx AXI memory adapter connecting to the BootROM

The bootrom fetches the generic platform build of OpenSBI and a flattened device tree from a FAT partition on the micro SD along with a gzip compressed Linux kernel before transferring control to OpenSBI.