processor.risc_v

The module implements a RISC-V processor core using RV32I base ISA with optional support for MUL* instructions from the “M” standard extension. Its features include support for multiple hardware threads (harts), Memory Mapped IO, instruction fetching from external memory (with an option to implement instruction cache), and support for custom instructions. The microarchitecture deploys a five-stage pipeline and dynamic branch prediction.

Processor pipeline stalls occur only when a handler provided by the Execution Environment doesn’t return results immediately (e.g., when implementing asynchronous Memory Mapped IO load/store or instruction fetching from external memory with multi-cycle read), or if a trap/exception handler stalls. There are no stalls due to late results or data hazards.

The RISC-V core can be configured into various variants that make different area/fmax/performance trade-offs.

Hardware Threads

The core can be configured with 1, 2, 4, or 8 hardware threads (harts). Harts have independent architectural states but are time-multiplexed onto underlying core hardware and may share micro-architecture state (e.g., branch target buffer). All harts share tightly coupled instruction memory, and depending on configuration, may or may not share data memory or external instruction memory.

One Hardware Thread

The 1-hart configuration is designed to balance single-threaded execution performance and small core size.

Instruction Cycles
Branch predicted (taken or not taken) 1
Branch mispredicted 4
Memory load 2
ECALL, EBREAK 1+
MUL, MULH, MULHSU, MULHU 5
All other instructions 1

Two Hardware Threads

The 2-hart configuration provides 2 hardware execution threads and is designed to balance high IPC (instructions per cycle) and single-thread execution performance.

Instruction Cycles
Branch predicted (taken or not taken) 1
Branch mispredicted 3 (or 4 in Fmax optimized configuration)
Memory load 1
ECALL, EBREAK 1+
MUL, MULH, MULHSU, MULHU 4
All other instructions 1

Four Hardware Threads

The 4-hart configuration provides 4 hardware execution threads and supports 1-cycle execution of all instructions, including multiplication instructions from the “M” extension.

Instruction Cycles
Branch predicted (taken or not taken) 1
Branch mispredicted 2 (or 3 in Fmax optimized configuration)
Memory load 1
ECALL, EBREAK 1+
MUL, MULH, MULHSU, MULHU 1
All other instructions 1

Eight Hardware Threads

The 8-hart configuration provides 8 hardware execution threads with an average performance of 1 instruction per cycle. This configuration is designed for highly parallel workloads that need maximum aggregate performance but can tolerate lower per-thread throughput.

Instruction Cycles
Branch predicted (taken or not taken) 1
Branch mispredicted 1
Memory load 1
ECALL, EBREAK 1+
MUL, MULH, MULHSU, MULHU 1
All other instructions 1

Memory Map

Memory map can be controlled using RISC_V template parameters. In the default configuration, instruction memory starts at address 0, and separate data memory for each hardware thread follows immediately after instruction memory, e.g.:

RISC_V<2, 0x1000, 0x2000> core;

The core implements fast unaligned memory access in hardware, and there is no benefit to compile programs to force only aligned access (e.g., using gcc strict-align flag).

All memory and MMIO addresses in the interfaces between the core and the execution environment are byte-aligned and all sizes are in bytes, unless explicitly specified otherwise.

Extensibility

The core can be customized via callbacks provided by the Execution Environment to handle Memory Mapped IO, instruction fetching from external memory, system traps and exceptions, decoding and execution of custom instructions, and dynamic instruction tracing. All callbacks have default values and need to be overridden only when a particular functionality is enabled.

Callbacks can be implemented as regular inline functions or fixed latency extern functions implemented in Verilog.

Branch Prediction

The RISC-V core implements dynamic branch prediction using a 2-bit Branch Target Buffer.

Size of BTB can be configured via BTB_SIZE parameter.


template <
    auto HARTS,
    auto IMEM_LENGTH,
    auto DMEM_LENGTH,
    auto MMIO_LENGTH = 0,
    auto IMEM_ORIGIN = 0,
    auto DMEM_ORIGIN = ((IMEM_ORIGIN + IMEM_LENGTH) << 2),
    auto MMIO_ORIGIN = (DMEM_ORIGIN + DMEM_LENGTH),
    auto IMEM_TCM_SIZE = IMEM_LENGTH,
    template <typename, auto> typename DataMemory = memory_norep,
    auto EXTENSIONS = Extension::None,
    auto CONFIG = Optimize::Area,
    Base ISA = Base::RV32I,
    auto BTB_SIZE = 1024,
    template <typename, auto> typename InstrMemory = memory_init
    >
class RISC_V §source

Class template implementing RISC-V processor core.

Examples

A simple 1-hart core with 8KB of data and 64KB of instruction tightly coupled memory (TCM).

import processor.risc_v

const auto HARTS = 1;
const auto IMEM_LENGTH = 0x4000;
const auto DMEM_LENGTH = 0x2000;

RISC_V<1, IMEM_LENGTH DMEM_LENGTH> core;

By default the data TCM is implemented using simple dual-port memory, and the methods that expose DMEM access to the Execution Environment use the same read/write ports as the core. In this configuration, access to data TCM is safe only when the core is not running. Data TCM can be configured to use quad-port memory (on devices that support it) using the Memory template argument. In this configuration, the Execution Environment can read and write data memory concurrently with the core. Note that concurrent writes to the same address are still undefined. The Execution Environment may have at most one call-site for a DMEM read method and one for a DMEM write method.

import processor.risc_v

const auto HARTS = 1;
const auto IMEM_LENGTH = 0x4000;
const auto DMEM_LENGTH = 0x2000;
const auto MMIO_LENGTH = 0x0;

template <typename T, auto N>
using memory_quad_port = [[memory, quad_port]] T[N];

RISC_V<
    HARTS,
    IMEM_LENGTH,
    DMEM_LENGTH,
    MMIO_LENGTH,
    0,
    (IMEM_LENGTH << 2),
    0,
    IMEM_LENGTH,
    memory_quad_port> core;

Parameters

  • auto HARTS
    

    Number of hardware threads (harts), must be a power of 2.

  • auto IMEM_LENGTH
    

    Length of instruction memory address space in 32-bit words.

  • auto DMEM_LENGTH
    

    Length of data memory in bytes. By default the value specifies length of per-hart data memory. If Option::HartsShareDMEM bit of CONFIG parameter is set, the value specifies total length of data memory shared by all harts. In multi-hart configuration DMEM_LENGTH must be a power of 2 unless shared DMEM is used.

  • auto MMIO_LENGTH = 0
    

    Length of Memory Mapped IO address space in bytes. The default value is 0 which disables MMIO support.

  • auto IMEM_ORIGIN = 0
    

    Start address of address space mapped to instruction memory in 32-bit words.

  • auto DMEM_ORIGIN = ((IMEM_ORIGIN + IMEM_LENGTH) << 2)
    

    Start address of address space mapped to data memory. By default data memory starts right after instruction memory.

  • auto MMIO_ORIGIN = (DMEM_ORIGIN + DMEM_LENGTH)
    

    Beginning address of Memory Mapped IO address space. By default memory mapped IO starts right after data memory.

  • auto IMEM_TCM_SIZE = IMEM_LENGTH
    

    Size in 32-bit words of tightly coupled instruction memory (TCM) instantiated by the core. The default value is equal to IMEM_LENGTH, which means that all of the instruction address space is mapped to the internal TCM. If the specified value is less than the IMEM_LENGTH, the reminder of instruction memory address space is handled by external_fetch callback.

  • template <typename, auto> typename DataMemory = memory_norep
    

    Type alias template providing underlying implementation of data memory. The default is mmemory_norep.

  • auto EXTENSIONS = Extension::None
    

    ISA extensions implemented by the core. The default is Extension::None. The only other valid value is Extension::M

  • auto CONFIG = Optimize::Area
    

    Configuration flags. The default is Optimize::Area. Possible optimization flags are Optimize::Area, Optimize::Fmax (mutually exclusive). Other supported flags are defined by the Option enum.

  • Base ISA = Base::RV32I
    

    Specifies the base ISA. The default and the only supported value at this time is Base::RV32I

  • auto BTB_SIZE = 1024
    

    Number of entries in Branch Target Buffer. The default of 1024 is usually a good value for BTB realized as block RAM. If using block RAM is undesirable, a small BTB, for example with 8 entries, can give good results. When BTB_SIZE is less than 1024, the BTB is implemented as an array. The value must be a power of 2.

  • template <typename, auto> typename InstrMemory = memory_init
    

    Type alias template providing underlying implementation of instruction memory. The default is mmemory_init which instantiates memory with initialization. Specifying memory instead may be useful when initialization is not desirable, for example to allow using memories such as Xilinx UltraRAM which don’t support initialization. Note that when using memory without initialization, the execution environment must make sure that that any area of IMEM the core might read from is initialized before starting the core. Failure to do so will likely result in asserts during simulation due to undefined data being read from IMEM.

Aliases

Callbacks and Fields

  • RISC_V::system_trap_t system_trap = RISC_V::core.default_system_trap §source
    

    Callback to handle exceptions and traps. The default callback asserts in simulation on any system exception.

  • RISC_V::trace_t trace = RISC_V::core.default_trace §source
    

    Dynamic instruction trace callback. The default callback is no-op. The print_trace method can be used to print disassembled instructions to the log stream.

  • RISC_V::mmio_access_t mmio_access = RISC_V::core.default_mmio_access §source
    

    The callback handles memory load/store within MMIO address space. If the access operation completes immediately, the is_valid of the returned result should be set to true. For load access the value of the returned result is the result of the load. For store access the value of the returned result is ignored. If the operation will complete asynchronously the is_valid of the returned result should be set to false, and execution environment should indicate completion of load/store by calling the mmio_load_result or mmio_store_completed method respectively.

  • RISC_V::mmio_load_t mmio_load = RISC_V::core.default_mmio_load §source
    

    The callback handles memory loads within MMIO address space. The callback result is optional<int_t>. The callback may return result immediately, setting is_valid to true, or asynchronously by calling the mmio_load_result method later. By default the callback asserts in simulation. The callbacks is used only when mmio_access callback uses the default implementation.

  • RISC_V::mmio_store_t mmio_store = RISC_V::core.default_mmio_store §source
    

    The callback handles memory stores within MMIO address space. The callback result is bool. The callback should return true if the program running on the core can consider the store as committed, or false if the store should be considered as pending. In the latter case, the hart that made the request stalls until the core’s method mmio_store_completed is called. By default the callback asserts in simulation. The callbacks is used only when mmio_access callback uses the default implementation.

  • RISC_V::external_fetch_t external_fetch = RISC_V::core.default_external_fetch §source
    

    Optional external instruction memory support is enabled by specifying IMEM_TCM_SIZE that is less than IMEM_LENGTH. In this case instruction fetch for addresses above IMEM_TCM_SIZE are directed to the Execution Environment via this callback. The default callback asserts in simulation.

  • RISC_V::custom_decode_t custom_decode = RISC_V::core.default_custom_decode §source
    

    The custom_decode callback is called for instructions using one of the four major opcodes reserved for custom extensions:

    RVG::custom_0
    RVG::custom_1
    RVG::custom_2
    RVG::custom_3

    The callback should return one of the standard encoding formats used the custom instruction, or Format::Invalid if the major opcode is not implemented by the custom extension. The default callback decodes all custom instructions as invalid.

  • RISC_V::custom_execute_t custom_execute = RISC_V::core.default_custom_execute §source
    

    Callback is called to execute a custom instruction. The result returned by the callback is written to the destination register if the instruction encoding format specifies one. The result is ignored for formats B and S, or when destination register is zero. The default callback is no-op.

Methods

  • inline void start(RISC_V::uint_t[HARTS] pc) §source
    

    Start the core.

    Arguments

    • RISC_V::uint_t[HARTS] pc
      

      Array of initial program counters for each hart.

  • inline void stop() §source
    

    Request the core to stop. Note that the method doesn’t block until the core has finished running. Use is_running method to wait for the core to stop.

  • inline bool is_running() §source
    

    Returns true if the core is running. Can be used to wait for the core to finish running after stop request.

  • inline 
    void
    init_stack_pointers(
        RISC_V::uint_t stack_start,
        RISC_V::uint_t stack_size
        ) §source
    

    Initialize stack pointers. Not necessary if the firmware startup code performs the initialization. When the option Option::HartsShareDMEM is set then each hart is allocated its own stack of the specified size, with the stack for hart 0 starting at the specified address, and stacks for remaining harts following below it in DMEM address space.

    Arguments

    • RISC_V::uint_t stack_start
      

      Start address of the stack. Must be 16 byte aligned. Since the stack grows down, the starting address usually is set to DMEM_ORIGIN + DMEM_LENGTH - 0x10

    • RISC_V::uint_t stack_size
      

      Size of the stack for each hart.

  • inline void print_memory_map() §source
    

    Print a diagram of memory address space configuration.

  • inline 
    void
    print_trace(
        RISC_V::hart_index_t hid,
        RISC_V::imem_addr_t addr,
        RISC_V::uint_t inst,
        Decoded decoded,
        optional<RISC_V::int_t> value
        ) §source
    

    Print dynamic instruction trace. The method can be used to initialize the trace callback.

    Arguments

  • inline 
    void
    external_fetch_result(
        RISC_V::hart_index_t hid,
        RISC_V::uint_t addr,
        RISC_V::uint_t value
        ) §source
    

    Complete asynchronous external instruction fetch.

    Arguments

  • inline 
    void mmio_load_result(RISC_V::hart_index_t hid, RISC_V::int_t value) §source
    

    Complete asynchronous load from memory mapped IO.

    Arguments

    • RISC_V::hart_index_t hid
      

      Index of the hart that performed the asynchronous load from memory mapped IO.

    • RISC_V::int_t value
      

      Result of memory mapped IO load.

  • inline void mmio_store_completed(RISC_V::hart_index_t hid) §source
    

    Complete asynchronous store to memory mapped IO.

    Arguments

    • RISC_V::hart_index_t hid
      

      Index of the hart that performed the asynchronous store to memory mapped IO.

  • inline void imem_write(RISC_V::uint_t addr, RISC_V::uint_t value) §source
    

    Write 32-bit word to tightly coupled instruction memory. The method is safe to call only when the core is not running.

    Arguments

  • inline 
    RISC_V::int_t dmem_read(RISC_V::hart_index_t hid, RISC_V::uint_t addr) §source
    

    Read a 32-bit word from hart’s data memory.

    Arguments

  • inline 
    void
    dmem_write(
        RISC_V::hart_index_t hid,
        RISC_V::uint_t addr,
        RISC_V::int_t value,
        count_t<4> size
        ) §source
    

    Write specified number of bytes to hart’s data memory.

    Arguments

  • inline 
    RISC_V::int_t
    dmem_read_aligned(RISC_V::hart_index_t hid, RISC_V::uint_t addr) §source
    

    Read a 32-bit word from hart’s data memory at a 32-bit aligned address. More resource efficient than dmem_read when unaligned address support is not required.

    Arguments

  • inline 
    void
    dmem_write_aligned(
        RISC_V::hart_index_t hid,
        RISC_V::uint_t addr,
        RISC_V::int_t value
        ) §source
    

    Write a 32-bit word to hart’s data memory at a 32-bit aligned address. More resource efficient than dmem_write when unaligned address support is not required.

    Arguments