Attention

This document was last updated Nov 25 24 at 21:59

System On Chip

Important

The purpose of this lecture is as follows.

To describe the typical architecture of a system-on-chip (SoC)
To introduce the IBEX RISC-V 32-bit processor as a central control element in SoC
To introduce the relevant abstraction layers of design in SoC
To discuss memory-mapped coprocessing as a mechanism to integrate custom-hardware in SoC
To experiment with memory-mapping in the context of IBEX

Attention

The following references are relevant background to this lecture.

1. Schaumont, “System on Chip,” Chapter in A Practical Introduction to Hardware Software Codesign, Springer. https://link.springer.com/chapter/10.1007/978-1-4614-3737-6_8.
Pasquale Davide Schiavone, Francesco Conti, Davide Rossi, Michael Gautschi, Antonio Pullini, Eric Flamand, Luca Benini, “Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for Internet-of-Things applications,” PATMOS 2017: 1-8. https://doi.org/10.1109/PATMOS.2017.8106976
RISCV Consortium, RISC-V Technical Specifications.
RISCV Instruction Reference Card

The System On Chip Architecture

In many ways, modern hardware design is about finding the right tradeoff between specialization and flexibility in the design. A higher level of specialization results in higher performance of the implementation and better energy efficiency. On the other hand, highly specialized designs have limited use. When chips are expensive to build, it’s better to make sure that a single chip can be used for multiple purposes.

The system-on-chip is a system architecture that is fundamentally built around this trade-off of specialization and general-purpose design. The core idea is to combine a programmable element, typically a small processor, with several dedicated hardware modules that perform specialized operations. The general-purpose functionality is achieved by software running on the processor, while the specialized functionality is achieved by the hardware modules integrated around the processor.

Micro-controllers have followed the system-on-chip architecture for many years, because of their need to combine dedicated input/output peripherals with software functionality. For example, a motor speed control application may require a high-resolution timer module with pulse-width-modulation output, while an automotive application may need one or more CAN (Controller Area Network) interfaces for communication with sensor units distributed over a car’s body. Different application domains can thus be translated into different microcontroller families, each of which provides a mix of peripherals that is optimized for the application domain.

The following figure (taken from a TI MSP-430 User Guide) is a blueprint of a typical microcontroller with an SoC architecture. The left side of the figure shows the processor along with on-chip storage for data and code. The right side of the figure shows a number of peripherals, all attached to the same bus as the processor memories. From the hardware point-of-view, the interface between the peripherals and the processor looks the same as the interface between the memories and the processor. From the software point-of-view, the peripherals appear as memory locations to the processor. Software can read and write into these memory locations to exchange data with the hardware peripherals. This concept of memory-mapped hardware is central to the system-on-chip architecture. It supports the integration of dedicated hardware into a larger system, while it is backward compatible with the software development tools used by the processor.

The memory-mapped interface is extremely scalable: as many peripherals can be integrated as needed. Of course, these peripherals must each respond to different locations in the memory space of the processor. However, the memory-mapped interface is not a panacea, and the memory bus can quickly become a bottleneck because it is shared between all peripherals and the processor memory. Hence, memory-mapped interfaces are not suited when high-speed communication between the peripheral and the software (or between peripherals among themselves) is needed. Still, that disadvantage is easily outweighed by the advantage of an easy-to-use, extensible interface for dedicated hardware modules. The chapter on System on Chip under the reading materials summarizes the most important characteristics of the system-on-chip architecture.
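
The requirement that each peripheral responds to its own range of addresses can be sketched as a small software model of the bus. This is a minimal sketch and not the IBEX bus: the base addresses, the sizes, and the names bus_decode, bus_read and bus_write are assumptions chosen for illustration. It shows that memory and a peripheral are reached through the same read and write operations, and that an address claimed by no device results in an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address map, chosen only for illustration. */
#define RAM_BASE   0x00100000u
#define RAM_SIZE   0x00001000u   /* 4 KiB of on-chip memory  */
#define TIMER_BASE 0x00030000u
#define TIMER_SIZE 0x00000010u   /* four 32-bit registers    */

static uint32_t ram[RAM_SIZE / 4];
static uint32_t timer_regs[TIMER_SIZE / 4];

/* Address decoder: return the storage word selected by a bus address,
   or NULL when no device claims the address. */
static uint32_t *bus_decode(uint32_t addr) {
    assert((addr & 3u) == 0);  /* this sketch handles aligned word accesses only */
    if (addr - RAM_BASE < RAM_SIZE)
        return &ram[(addr - RAM_BASE) / 4];
    if (addr - TIMER_BASE < TIMER_SIZE)
        return &timer_regs[(addr - TIMER_BASE) / 4];
    return NULL;
}

/* A bus write looks the same for memory and for a peripheral. */
static int bus_write(uint32_t addr, uint32_t data) {
    uint32_t *word = bus_decode(addr);
    if (word == NULL)
        return -1;             /* bus error: unmapped address */
    *word = data;
    return 0;
}

/* A bus read looks the same for memory and for a peripheral. */
static int bus_read(uint32_t addr, uint32_t *data) {
    const uint32_t *word = bus_decode(addr);
    if (word == NULL)
        return -1;             /* bus error: unmapped address */
    *data = *word;
    return 0;
}
```

In this model, adding a peripheral means adding one more address range to bus_decode. That is the scalability argument made above, and it also makes visible that every access, to any device, passes through the same shared path.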

In this course, we will use the system-on-chip architecture as a baseline for hardware designs. We will use a RISC-V processor as the central processor, and extend it with memory-mapped peripherals. We will also study other mechanisms of customization, such as modifying the RISC-V processor with new instructions. However, this is not a course on System-on-Chip design, and the focus will be squarely on the hardware design aspects of SoC. By studying hardware design in the context of SoC architectures, we experience the constraint of a real hardware interface, and a testing environment corresponding to a common integration scenario.

IBEX, a 32-bit RISC-V processor

IBEX is a processor that was initially developed under the name zero_riscy, and is now maintained by lowRISC as an open-source project. IBEX has been used by a number of highly visible open-source projects such as OpenTitan, a root of trust for modern compute infrastructure.

IBEX splits the execution of each instruction into three stages: fetch, decode/execute, and writeback. IBEX supports two configurations of the base RISC-V instruction set. These instruction sets are focused on embedded and low-power operation.

Name	Description
RV32I	32-bit Base Integer Instruction Set
RV32E	32-bit Embedded with 16 Registers

Several extensions can be added to the base instruction sets. When an IBEX core is configured, these extension identifiers are appended to define the overall instruction set of the core. For example, RV32IMC means a core with a 32-bit base instruction set, integer multiplication and division, and compressed instruction support.

Name	Extension
C	Extension for compressed instructions (always on)
M	Integer Multiplication/Division
B	Bit Manipulation
ZICSR	Control/Status Reg Manipulation (always on)
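
The naming scheme can be illustrated with a short C sketch that expands a string such as RV32IMC into the entries of the two tables above. The helper names isa_letter_name and isa_describe are hypothetical, and the sketch only handles the single-letter identifiers listed here; multi-letter identifiers such as ZICSR are deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Describe one single-letter identifier from the tables above.
   Returns NULL for letters this sketch does not know. */
static const char *isa_letter_name(char c) {
    switch (toupper((unsigned char)c)) {
    case 'I': return "32-bit base integer instruction set";
    case 'E': return "32-bit embedded base (16 registers)";
    case 'M': return "integer multiplication/division";
    case 'C': return "compressed instructions";
    case 'B': return "bit manipulation";
    default:  return NULL;
    }
}

/* Hypothetical helper: print the meaning of a string such as "RV32IMC".
   Returns the number of recognized letters, or -1 if the string does
   not start with "RV32" or contains an unknown letter. */
static int isa_describe(const char *isa) {
    int count = 0;
    assert(isa != NULL);
    if (strncmp(isa, "RV32", 4) != 0)
        return -1;
    for (const char *p = isa + 4; *p != '\0'; p++) {
        const char *name = isa_letter_name(*p);
        if (name == NULL)
            return -1;
        printf("%c: %s\n", *p, name);
        count++;
    }
    return count;
}
```

Calling isa_describe("RV32IMC") prints one line each for I, M and C and returns 3.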

Attention

The main repository for the IBEX core is https://github.com/lowRISC/ibex. This repository contains the hardware description as well as the configuration scripts to create cores with various capabilities. The documentation website is an indispensable guide to configure and use the core.

Abstraction Levels in SoC Design

In SoC design, it is common to reason about a design at multiple levels of abstraction. Hardware designers commonly use RTL (register-transfer level) and gate-level abstractions. However, when a programmable processor enters the picture, RTL is too low-level, and incapable of distinguishing the fixed part (hardware and processor) from the flexible part (software running on the processor). Therefore, additional, higher-level abstractions are used. In the following, we assume that the software running on the SoC processor is running in bare-metal mode, i.e., there is no operating system nor virtualization involved, and ‘software’ means a single-threaded executive.

We define four abstraction levels (Application, Instruction-set, RTL and Gate-level). Each of these abstraction levels is naturally captured by a model for time, for software and for hardware, as in the following table.

Level	Time Model	Software Model	Hardware Model
Application	Transaction	ABI	HAL
Instruction-set	Instructions	ISA	Memory Ops, Custom-Ins Ops
RTL	Clock Cycle	Opcodes & Image	Register Transfers
Gate-level	Event	Bits	Gates + Macros

We will work through the table from the lowest abstraction level to the highest. The Level column shows the name of each abstraction level. The Time Model column shows the smallest time step in behavior that is typically considered. The Software Model shows the common abstraction model used to define software. In essence, the software model is the interface under which the lower abstraction levels are hidden. The Hardware Model is the common abstraction model used to define hardware. As with software, the hardware model is expressed in terms of interfaces that hide low-level details.

The Gate-level is the lowest abstraction level defined in the table. Software and hardware are logically indistinguishable, although at this level you can map hardware into structural elements (a netlist of gates and dedicated macros), and map software into bits occupying memory modules and/or registers. The gate-level is very detailed, and captures all activities in the chip as events: (time, value) tuples. The Gate-level is a detailed representation that helps one to understand the physical characteristics of the chip, such as its power consumption, maximum clock frequency, and silicon area. But it is exceedingly difficult to understand the chip’s activities at the macroscopic level.

The Register-transfer level (RTL) is a crucial stepping stone between a chip’s high-level model that emphasizes functionality, and the gate-level model that emphasizes physical realization. In an RTL model, all registers in the design are accurately captured, and their update is captured per clock cycle. However, the operations between registers are abstracted out into functional expressions. One of the great achievements of modern EDA is that design automation tools are able to make the transition from the RTL to the gate-level automatically through logic synthesis. Yet, RTL is much easier for a hardware designer to understand. RTL also recognizes the grouping of bits into words. This allows us to think of software models as a sequence of opcodes. RTL is a natural modeling level for hardware designers, but requires a bit of effort in understanding from the software point of view.

The Instruction-Set level captures the behavior of the SoC in terms of instructions of the target processor. The smallest activity in the SoC is the execution of a single instruction, even if this instruction extends over multiple clock cycles. At this level, software is defined by the Instruction-set Architecture (ISA) of the processor. Also, at this level, hardware starts to disappear behind these abstractions. For example, the ISA hides the low-level implementation details of the SoC processor, while the SoC peripherals and hardware modules are abstracted into memory locations and (in some cases) specialized instructions on the processor. The instruction-set level is naturally useful to the firmware designer. Because of the pronounced abstraction of low-level details, instruction-level simulation tends to be very fast (with simulation speeds ranging from kHz to MHz on typical workstations). Yet, the reduced visibility of hardware details at this level means that it is hard to capture hardware bugs.

The Application level is the highest abstraction level we’ll consider at this moment. At this point, both the hardware and the software are abstracted into an interface that emphasizes functionality. Software uses an Application Binary Interface (ABI) to abstract the major functional details of the SoC processor (such as the stack, the interrupt model, the memory organization, and so on). Hardware uses the Hardware Abstraction Layer (HAL) which replaces the interaction with dedicated hardware modules by a simple function call. The large granularity of modeling usually also affects the time granularity. Application-level models emphasize transaction-level statistics such as the number of bus operations, or the number of bytes transferred in one HAL call. Application-level models are crucial for system designers to create the SoC’s blueprint. But they cannot reveal detailed implementation metrics such as the area, gate-count or power of a design.

The table above is not an exhaustive list of abstraction levels. An important abstraction level below that of gate-level is the layout, which is a purely structural description. Likewise, there are additional levels above the application level, where hardware and software merge into a purely functional model of computation. In this course on Advanced Digital Systems Design, our emphasis will be on the RTL and below (Gate, Layout). However, to understand the SoC’s design and operation, the instruction-set level plays an important role as well.

Important

Our objective in the project will be to design hardware accelerator modules for an IBEX based SoC. That is the primary motivation for studying SoC design in this lecture.

A corollary of the above observations on abstraction levels is that it is meaningful to consider multiple simulation levels for a system on chip. Instruction-level simulation, for example, would be useful to develop and test software (firmware) for the SoC once the hardware is more or less fixed. Cycle-accurate simulation (at the RTL level) would be useful to efficiently co-simulate hardware and software, but without considering the detailed timing-dependent hardware-level effects. Cycle-accurate simulation will not be as quick as instruction-level simulation, but still fast enough to cover, say, a few million clock cycles in an hour. Gate-level simulation, with full timing accuracy, would help understand the low-level effects on the hardware platform, such as race conditions and timing faults. But realistically, a gate-level simulation can cover only a few thousand clock cycles in an hour, and so this level is no longer suitable for SoC firmware development.

While we use IBEX, we will similarly use multiple simulators to study the design at each abstraction level. IBEX supports Verilator, a cycle-accurate HDL simulator that efficiently runs firmware applications on IBEX. We will eventually also use Xcelium as a gate-level simulator for IBEX-based chips. But the initial experiments will be based on Verilator.

IBEX Instructions

When evaluating an embedded SoC, we frequently run into assembly programs for the embedded processor. Even if we write C and use a C compiler, the output of the compiler will often be evaluated at the assembly level in order to answer precise questions regarding the performance of the design. For this reason, we also consider RISC-V assembly coding.

The standard RISC-V processor has 32 32-bit registers. Their generic name starts with an X: X0 to X31. However, assemblers will often use a naming that is compatible with the standard C ABI for RISC-V. The following table lists the register names and their primary purpose.

Reg Name	Symbolic Name	Description	Saved by
X0	zero	Always zero
X1	ra	Return Address	Caller
X2	sp	Stack Pointer	Callee
X3	gp	Global Pointer
X4	tp	Thread Pointer
X5-7	t0-2	Temporary	Caller
X8	s0/fp	Frame Pointer	Callee
X9	s1	Saved Register	Callee
X10-17	a0-7	Function Argument	Caller
X18-27	s2-11	Saved Register	Callee
X28-31	t3-6	Temporary	Caller

Assembly code for an embedded processor such as IBEX is typically created as a byproduct of compiling a firmware application. A compiler that produces code for a processor different from the processor that is used for running the compiler is a cross-compiler. To create IBEX programs, we will use a cross-compiler for C programs. On the class design server, the cross-compiler is installed under /opt/lowrisc-toolchain-rv32imcb-... The dots hold the compiler release date. To use the compiler, you’d also set the path to the compiler by adding the following lines in your ~/.bashrc.

# RISCV LOWRISC IBEX
export PATH=$PATH:/opt/lowrisc-toolchain-rv32imcb-20240206-1/bin

Consider the following C function stored in a file max.c.

int max(int a, int b) {
   return (a > b) ? a : b;
}

To create assembly code for this program, you compile max.c as follows.

riscv32-unknown-elf-gcc -S max.c

This produces the following assembly listing in max.s.

     .file   "max.c"
     .option nopic
     .attribute arch, "rv32i2p0_m2p0_c2p0"
     .attribute unaligned_access, 0
     .attribute stack_align, 16
     .text
     .align  1
     .globl  max
     .type   max, @function
max:
     addi    sp,sp,-32
     sw      s0,28(sp)
     addi    s0,sp,32
     sw      a0,-20(s0)
     sw      a1,-24(s0)
     lw      a4,-20(s0)
     lw      a5,-24(s0)
     bge     a5,a4,.L3
     mv      a5,a4
.L3:
     mv      a0,a5
     lw      s0,28(sp)
     addi    sp,sp,32
     jr      ra
     .size   max, .-max
     .ident  "GCC: (crosstool-NG 1.26.0_rc1) 10.2.0"

To study the file, look for lines with instruction mnemonics such as addi (add immediate) and sw (store word). The lines starting with dots are either jump target labels (like .L3) or else directives (like .size and .text). The official RISC-V instruction-set specification describes the functionality of each instruction in detail, but it is easy to find quick-reference cards such as this one by James Zhu or the reference card from the book by Patterson and Hennessy.

Using a basic description of the instruction set, it is possible to figure out what this code does.

max:
     addi    sp,sp,-32    # make room on the stack (32 bytes)
     sw      s0,28(sp)    # save s0 (the frame pointer)
     addi    s0,sp,32     # set up the frame pointer for the new frame
     sw      a0,-20(s0)   # store argument 'a' on the stack
     sw      a1,-24(s0)   # store argument 'b' on the stack
     lw      a4,-20(s0)   # load argument 'a'
     lw      a5,-24(s0)   # load argument 'b'
     bge     a5,a4,.L3    # if b >= a, jump to .L3
     mv      a5,a4        # otherwise a > b, so move a into a5
.L3:
     mv      a0,a5        # a5 holds the max; move it into the return register
     lw      s0,28(sp)    # restore the saved s0
     addi    sp,sp,32     # release the stack frame
     jr      ra           # and return

The point is that it is fairly straightforward to determine the functionality of a section of assembly code, especially if you have the C code as a reference.
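
One way to check such a reading is to transliterate the listing back into C, one statement per instruction, and compare it against the original function. The following is a sketch for that purpose only: max_from_asm and check_max_from_asm are hypothetical names, the registers are modeled as local variables, and the stack frame is modeled as a small array.

```c
#include <assert.h>

/* Hypothetical helper: the unoptimized listing for max, one C statement
   per instruction. Register names are kept as variable names. */
static int max_from_asm(int a0, int a1) {
    int frame[8];      /* addi sp,sp,-32 : a 32-byte frame, i.e. 8 words */
    int a4, a5;        /* sw s0,28(sp) and addi s0,sp,32 only set up s0  */
    frame[3] = a0;     /* sw a0,-20(s0)  : s0-20 is sp+12, word 3        */
    frame[2] = a1;     /* sw a1,-24(s0)  : s0-24 is sp+8, word 2         */
    a4 = frame[3];     /* lw a4,-20(s0)                                  */
    a5 = frame[2];     /* lw a5,-24(s0)                                  */
    if (a5 >= a4)      /* bge a5,a4,.L3  : skip the move when b >= a     */
        goto L3;
    a5 = a4;           /* mv a5,a4       : a is larger, so it is the max */
L3:
    a0 = a5;           /* mv a0,a5       : the return value goes in a0   */
    return a0;         /* lw s0,28(sp); addi sp,sp,32; jr ra             */
}

/* Compare the transliteration against what max() in max.c must return. */
static void check_max_from_asm(void) {
    assert(max_from_asm(3, 7) == 7);
    assert(max_from_asm(7, 3) == 7);
    assert(max_from_asm(-5, -5) == -5);
}
```

The transliteration also makes it plain that the stores to and loads from the stack frame do no useful work here; they only move the arguments through memory.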

Important

In this course on hardware design, we won’t spend an extensive amount of time on firmware. However, basic manipulation of firmware and binaries will be part of our design activities because these are an inherent part of SoC design.

The compiler can heavily optimize assembly code. For example, adding the -O2 flag during compilation reduces the assembly implementation to just five instructions.

riscv32-unknown-elf-gcc -O2 -S max.c

max:
     mv      a5,a0
     mv      a0,a1
     bge     a1,a5,.L2
     mv      a0,a5
.L2:
     ret

Of course, we care about the efficiency of firmware. In the context of SoC, optimized firmware has two obvious benefits. First, an optimized program may be faster, thus completing more work in the same number of clock cycles. Second, an optimized binary may have a smaller footprint, so it can run with a smaller on-chip memory. Note that the use of the -O flag can also lead to larger binaries, for example when compiler optimizations such as loop unrolling are performed. If a small footprint is most important, use the -Os flag instead (Optimize for Size).

Hello IBEX

Important

IBEX has an extensive online documentation manual

We are now ready to configure and run a basic IBEX system. The following steps can be run on the class design server to create an IBEX core, compile a Hello World program and run it.

The following prerequisites are already available on the machine. You do not have to add these tools.

verilator, a hardware simulator (/usr/local/bin/verilator)
RISC-V cross-compiler (/opt/lowrisc-toolchain-rv32imcb-20240206-1/bin/riscv32-unknown-elf-gcc)

IBEX makes use of a code generator called FuseSoC. We will discuss this code generator in further detail in the design-flow lecture (Week 5). For now, think of it as a preprocessor that can instantiate an IBEX core with the specific configuration needed. The first step is to install fusesoc on your account.

pyenv virtualenv fusesoc
pyenv activate fusesoc
pip3 install --upgrade fusesoc

If pyenv is not yet available on your account, install it locally using:

curl https://pyenv.run | bash

If you are using the class server (RHEL7), also switch your compiler toolchain to the newer one provided by Developer Toolset 9 (devtoolset-9).

scl enable devtoolset-9 bash

Now you can download and configure the IBEX core as follows.

git clone https://github.com/lowRISC/ibex
cd ibex
pip3 install -r python-requirements.txt

The IBEX system comes with a built-in default configuration, described as IBEX Simple System <https://github.com/lowRISC/ibex/tree/master/examples/simple_system> in the documentation. To configure/compile the Simple System, run fusesoc as follows.

fusesoc --cores-root=. \
        run            \
        --target=sim   \
        --setup        \
        --build        \
        lowrisc:ibex:ibex_simple_system \
        $(util/ibex_config.py small fusesoc_opts)

This command runs FuseSoC to configure an IBEX simulator for Verilator, and compiles the simulator. When the command terminates, a simulator is ready and the core can simulate. The Simple System is bare bones in terms of the SoC concept. It contains an IBEX core, a single on-chip memory, a peripheral to capture output, and a timer peripheral. Moreover, the Simple System also shows how to prepare firmware for the IBEX core.

The Hello World example can be built as follows.

make -C examples/sw/simple_system/hello_test

Take a close look at the result of the firmware compilation.

cd examples/sw/simple_system/hello_test

The initial firmware directory contains two files: Makefile and hello_test.c. All other files are created by the compilation process. Type make distclean followed by make if you want to see the compilation process again.

Let’s look at the files that are produced by compiling hello_test.

File	Function
`hello_test.o`	Object code obtained by compiling C
`hello_test.elf`	Executable and Linkable Format (ELF) file from linking the object code
`hello_test.vmem`	Binary executable in VMEM format (Verilog compatible)
`hello_test.bin`	Binary executable for loading in a prototype

These files all have a different purpose. The object code contains the RISC-V instructions that are used to implement the C program. The ELF code contains the linked version of the object code. If external functions and libraries are used, then these external functions are combined with the object code into a single file in ELF format. The ELF format contains several sections, corresponding to the instruction bytes and data bytes (for example, constants) that are needed for the C program.

The contents of the object code and the ELF code can be visualized (and disassembled!) with an object dumper, a processor-specific program that understands these file formats. The first few lines of the disassembled object code are produced as follows. Note the command riscv32-unknown-elf-objdump to produce this information.

$ riscv32-unknown-elf-objdump -D hello_test.o | head -40

hello_test.o:     file format elf32-littleriscv


Disassembly of section .rodata.str1.4:

00000000 <.LC0>:
   0:   6548                    flw     fa0,12(a0)
   2:   6c6c                    flw     fa1,92(s0)
   4:   6973206f                j       32e9a <.LASF17+0x32cca>
   8:   706d                    c.lui   zero,0xffffb
   a:   656c                    flw     fa1,76(a0)
   c:   7320                    flw     fs0,96(a4)
   e:   7379                    lui     t1,0xffffe
  10:   6574                    flw     fa3,76(a0)
  12:   0a6d                    addi    s4,s4,27
  14:   0000                    unimp
        ...

00000018 <.LC1>:
  18:   6954                    flw     fa3,20(a0)
  1a:   0a216b63                bltu    sp,sp,d0 <.L13+0x4>
        ...

00000020 <.LC2>:
  20:   6f54                    flw     fa3,28(a4)
  22:   0a216b63                bltu    sp,sp,d8 <.L3+0x4>
        ...

Disassembly of section .text.startup:

00000000 <main>:
   0:   1101                    addi    sp,sp,-32
   2:   cc22                    sw      s0,24(sp)
   4:   ce06                    sw      ra,28(sp)
   6:   ca26                    sw      s1,20(sp)
   8:   c84a                    sw      s2,16(sp)
   a:   c64e                    sw      s3,12(sp)
   c:   c452                    sw      s4,8(sp)
   e:   c256                    sw      s5,4(sp)

The object code shown has two kinds of sections, read-only data (.rodata.str1.4) and code (.text.startup), corresponding to constants and instructions, respectively. While the disassembler tries to guess instructions for the read-only data, the instructions shown are meaningless, because they correspond to data. In fact, if you observe the opcode bytes closely, you can see that these are ASCII characters. However, you can recognize (part of) a valid program in the text section, which starts with the main function performing stack operations.
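
The claim that the bytes under .LC0 are ASCII characters can be checked with a few lines of C. The sketch below copies the hexadecimal values from the listing above and unpacks them in little-endian byte order, matching the elf32-littleriscv format reported by objdump. The type and function names (parcel, parcels_to_bytes, lc0_is_hello) are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values printed by objdump for .LC0, in listing order.
   `bits` is the size of each value: 16 or 32 bits. */
struct parcel { uint32_t value; int bits; };

static const struct parcel lc0[] = {
    {0x6548, 16}, {0x6c6c, 16}, {0x6973206f, 32}, {0x706d, 16},
    {0x656c, 16}, {0x7320, 16}, {0x7379, 16},     {0x6574, 16},
    {0x0a6d, 16}, {0x0000, 16},
};

/* Unpack little-endian values into bytes, least significant byte first.
   Returns the number of bytes written to `out`. */
static size_t parcels_to_bytes(const struct parcel *p, size_t n,
                               char *out, size_t out_size) {
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        for (int shift = 0; shift < p[i].bits; shift += 8) {
            assert(len < out_size);
            out[len++] = (char)((p[i].value >> shift) & 0xffu);
        }
    }
    return len;
}

/* Returns 1 when the decoded bytes spell the expected C string. */
static int lc0_is_hello(void) {
    char text[32];
    size_t len = parcels_to_bytes(lc0, sizeof lc0 / sizeof lc0[0],
                                  text, sizeof text);
    return len == 22 && strcmp(text, "Hello simple system\n") == 0;
}
```

For example, the first value 0x6548 unpacks to the bytes 0x48 and 0x65, which are the characters H and e. The 22 bytes as a whole are the string "Hello simple system\n" followed by zero bytes.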

The same command can also be applied to the ELF file, and because this is now a fully functional executable, it will show the actual memory address allocated to each function and each variable. For example, from the ELF file we can now see that the main function starts at address 0x1003c6, and that it calls a subroutine pcount_reset located at address 0x1000d2 (the call itself is the jal instruction at address 0x1003dc).

$ riscv32-unknown-elf-objdump -D hello_test.elf

 001003c6 <main>:
   1003c6:       1101                    addi    sp,sp,-32
   1003c8:       cc22                    sw      s0,24(sp)
   1003ca:       ce06                    sw      ra,28(sp)
   1003cc:       ca26                    sw      s1,20(sp)
   1003ce:       c84a                    sw      s2,16(sp)
   1003d0:       c64e                    sw      s3,12(sp)
   1003d2:       c452                    sw      s4,8(sp)
   1003d4:       c256                    sw      s5,4(sp)
   1003d6:       547d                    li      s0,-1
   1003d8:       32041073                csrw    mcountinhibit,s0
   1003dc:       39dd                    jal     1000d2 <pcount_reset>
   1003de:       4781                    li      a5,0

The ELF file is a complete program, but it is not an exact representation of the physical memory layout. The two other files, hello_test.vmem and hello_test.bin, serve that purpose. They are a direct representation of the memory contents that would hold the instruction opcodes. The vmem file is a Verilog-specific format that can be read directly by the $readmemh system task in Verilog. It is an ASCII file that you can directly inspect. For example, in the listing below, the field starting with @ on each line is an address, and the words following that address are the contents of subsequent memory locations. The address counts 32-bit words rather than bytes: each line holds seven words, and the address advances by seven from one line to the next.

$ cat hello_test.vmem

/* http://srecord.sourceforge.net/ */
@00000000 33E0006F 33A0006F 3360006F 3320006F 32E0006F 32A0006F 3260006F
@00000007 3260006F 31E0006F 31A0006F 3160006F 3120006F 30E0006F 30A0006F
@0000000E 3060006F 3020006F 2FE0006F 2FA0006F 2F60006F 2F20006F 2EE0006F
@00000015 2EA0006F 2E60006F 2E20006F 2DE0006F 2DA0006F 2D60006F 2D20006F
@0000001C 2CE0006F 2CA0006F 2C60006F 00000000 2C60006F 0FF57793 00020737
...
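
The words in this listing are instruction encodings, and a few lines of C are enough to interpret them. Every nonzero word in the first rows ends in 6F, which is the opcode of the RISC-V JAL instruction. The sketch below decodes the J-type immediate as defined in the RISC-V specification and computes the jump target relative to the start of the image. The function names are hypothetical, and reading the start of the image as a table of jumps is an inference from the encodings, not something stated in the vmem file.

```c
#include <assert.h>
#include <stdint.h>

/* Byte address of a 32-bit word, given its vmem word address. */
static uint32_t vmem_byte_addr(uint32_t word_addr) {
    return word_addr * 4u;
}

/* True when the word is a JAL instruction (opcode 0x6f). */
static int is_jal(uint32_t insn) {
    return (insn & 0x7fu) == 0x6fu;
}

/* Sign-extended J-type immediate: imm[20|10:1|11|19:12] in bits 31:12. */
static int32_t jal_offset(uint32_t insn) {
    uint32_t imm;
    assert(is_jal(insn));
    imm = (((insn >> 31) & 0x1u) << 20)     /* imm[20]    */
        | (((insn >> 21) & 0x3ffu) << 1)    /* imm[10:1]  */
        | (((insn >> 20) & 0x1u) << 11)     /* imm[11]    */
        | (((insn >> 12) & 0xffu) << 12);   /* imm[19:12] */
    if (imm & (1u << 20))
        imm |= 0xffe00000u;                 /* sign-extend from bit 20 */
    return (int32_t)imm;
}

/* Jump target as a byte address relative to the start of the image. */
static uint32_t jal_target(uint32_t word_addr, uint32_t insn) {
    return vmem_byte_addr(word_addr) + (uint32_t)jal_offset(insn);
}
```

With these helpers, jal_target(0, 0x33E0006F) and jal_target(1, 0x33A0006F) both evaluate to 0x33E, while jal_target(7, 0x3260006F) evaluates to 0x342. In other words, the first two words jump to the same location, and the word at address 7 jumps to a different one.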

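The bin file has the same purpose as the vmem file, but it uses a binary format. You can inspect it with the od utility as follows. Note that od prints the byte offsets in the first column in octal by default: with 28 bytes per line (-w28), the second line starts at offset 034 octal, which is 28 decimal.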

$ od -t x4 -w28 hello_test.bin
  0000000 33e0006f 33a0006f 3360006f 3320006f 32e0006f 32a0006f 3260006f
  0000034 3260006f 31e0006f 31a0006f 3160006f 3120006f 30e0006f 30a0006f
  0000070 3060006f 3020006f 2fe0006f 2fa0006f 2f60006f 2f20006f 2ee0006f
  0000124 2ea0006f 2e60006f 2e20006f 2de0006f 2da0006f 2d60006f 2d20006f
  0000160 2ce0006f 2ca0006f 2c60006f 00000000 2c60006f 0ff57793 00020737
  0000214 8082c31c 00020737 00054783 4501e399 05058082 bfcdc31c 46254721
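
The two listings use different address conventions: od labels each line with a byte offset in octal, while the vmem file labels each line with a word address in hexadecimal. The following sketch converts one into the other for 32-bit words. The function names are hypothetical; strtoul with base 8 is the standard C library call for parsing octal text.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert an od offset label (octal text) into a byte offset. */
static unsigned long od_offset_to_bytes(const char *label) {
    char *end;
    unsigned long bytes = strtoul(label, &end, 8);
    assert(*end == '\0');    /* the whole label must be octal digits */
    return bytes;
}

/* Convert a byte offset into a vmem word address, for 32-bit words. */
static unsigned long bytes_to_vmem_word(unsigned long bytes) {
    assert(bytes % 4 == 0);  /* words start on 4-byte boundaries */
    return bytes / 4;
}
```

For example, the od label 0000034 is byte offset 28, which is word address 7 and corresponds to the line starting with @00000007 in the vmem file.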

Finally, we can run the simulation (hardware model in Verilator + compiled firmware) as follows.

$ ./build/lowrisc_ibex_ibex_simple_system_0/sim-verilator/Vibex_simple_system \
    -t --meminit=ram,examples/sw/simple_system/hello_test/hello_test.elf

  Simulation of Ibex
  ==================

  Tracing can be toggled by sending SIGUSR1 to this process:
  $ kill -USR1 10990

  Simulation running, end by pressing CTRL-c.
  Tracing enabled.
  Writing simulation traces to sim.fst
  TOP.ibex_simple_system.u_top.u_ibex_tracer.unnamedblk2.unnamedblk3: Writing execution trace to      trace_core_00000000.log
  Terminating simulation by software request.
  - ../src/lowrisc_ibex_sim_shared_0/./rtl/sim/simulator_ctrl.sv:93: Verilog $finish
  Received $finish() from Verilog, shutting down simulation.

  Simulation statistics
  =====================
  Executed cycles:  13138
  Wallclock time:   0.101 s
  Simulation speed: 130079 cycles/s (130.079 kHz)
  Trace file size:  531021 B

  You can view the simulation traces by calling
  $ gtkwave sim.fst

  Performance Counters
  ====================
  Cycles:               475
  Instructions Retired: 261

The -t parameter in the simulation command line instructs the simulator to create a trace file in FST format. Similar to a Value Change Dump (VCD) file, the FST file collects the activities of every signal in the simulation. It can be viewed using GTKWave, a stand-alone VCD/FST waveform viewer. The waveform below shows the instruction bus and the data bus. The core is activated by timer interrupts. Through the waveform diagram, the exact behavior per clock cycle can be evaluated, while the firmware is executing. As we will design hardware accelerator modules, the ability to simulate and inspect low-level behavior at the cycle-accurate hardware level will be very useful.

$ gtkwave sim.fst

Important

The IBEX has a path to implementation, as well. In this lecture, we focus on (co)simulation of IBEX software and custom hardware modules. Hardware implementation will be covered later.

SoC Hardware Interfaces

We are now ready to discuss the SoC hardware interfaces that are relevant to this course. In this lecture, we will focus on memory-mapped hardware. That is the most generic form of SoC interfacing, and it only requires us to understand the IBEX on-chip bus.

Memory-mapped Interface

In a memory-mapped interface, hardware is integrated on the same memory bus as other on-chip memories. The idea of a memory-mapped interface is to define a hardware register that can be written using a memory-write operation on the SoC bus, and that can be read using a memory-read operation. Hence, a memory-mapped interface is a mechanism to allocate a hardware register into the memory space of a processor.

We will illustrate the memory-mapped interface in IBEX through the operation of the Hello World example, and its access to the timer. First, let’s see how the timer is used in the C program. Inspect ibex/examples/sw/simple_system/hello_test/hello_test.c and locate the main activity.

timer_enable(2000);

uint64_t last_elapsed_time = get_elapsed_time();

while (last_elapsed_time <= 4) {
  uint64_t cur_time = get_elapsed_time();
  if (cur_time != last_elapsed_time) {
    last_elapsed_time = cur_time;

    if (last_elapsed_time & 1) {
      puts("Tick!\n");
    } else {
      puts("Tock!\n");
    }
  }
  asm volatile("wfi");
}
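
A host-side sketch can help predict what this loop prints. It is not part of the IBEX sources: fake_wfi is a hypothetical stand-in that pretends exactly one timer interrupt arrives each time the core goes to sleep, and run_tick_tock and check_tick_tock are names made up here. The loop itself follows the code above, with the strings shortened to one line each.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t time_elapsed;   /* incremented by the simulated timer ISR */

static uint64_t get_elapsed_time(void) { return time_elapsed; }

/* Stand-in for `wfi`: assume one timer interrupt fires per sleep. */
static void fake_wfi(void) { time_elapsed++; }

/* Runs the loop from hello_test.c; returns the number of lines printed. */
static int run_tick_tock(void) {
    int lines = 0;
    uint64_t last_elapsed_time = get_elapsed_time();

    while (last_elapsed_time <= 4) {
        uint64_t cur_time = get_elapsed_time();
        if (cur_time != last_elapsed_time) {
            last_elapsed_time = cur_time;
            puts((last_elapsed_time & 1) ? "Tick!" : "Tock!");
            lines++;
        }
        fake_wfi();
    }
    return lines;
}

/* Starting from time zero, the loop prints five lines before it exits. */
static void check_tick_tock(void) {
    time_elapsed = 0;
    assert(run_tick_tock() == 5);
}
```

Under that assumption the loop prints Tick!, Tock!, Tick!, Tock!, Tick! and then exits, because the fifth tick makes last_elapsed_time larger than 4. One line is printed per timer tick.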

This program waits for timer interrupts from a periodic timer that is set at a timeout count of 2000 cycles. On each timer tick, the program prints either the string Tick! or the string Tock!, alternating between the two, and the loop ends once more than four ticks have elapsed. The inline assembly instruction wfi (wait for interrupt) puts the core to sleep until the next timer interrupt arrives. However, we are interested in the activities on the timer itself. The timer interrupt service routine is located in ../common/simple_system_common.c relative to the main program. At the end of this file, you will find the following C code.

inline static void increment_timecmp(uint64_t time_base) {
  uint64_t current_time = timer_read();
  current_time += time_base;
  timecmp_update(current_time);
}

uint64_t timer_read(void) {
  uint32_t current_timeh;
  uint32_t current_time;
  // check if time overflowed while reading and try again
  do {
    current_timeh = DEV_READ(TIMER_BASE + TIMER_MTIMEH, 0);
    current_time = DEV_READ(TIMER_BASE + TIMER_MTIME, 0);
  } while (current_timeh != DEV_READ(TIMER_BASE + TIMER_MTIMEH, 0));
  uint64_t final_time = ((uint64_t)current_timeh << 32) | current_time;
  return final_time;
}

void timecmp_update(uint64_t new_time) {
  DEV_WRITE(TIMER_BASE + TIMER_MTIMECMP, -1);
  DEV_WRITE(TIMER_BASE + TIMER_MTIMECMPH, new_time >> 32);
  DEV_WRITE(TIMER_BASE + TIMER_MTIMECMP, new_time);
}

uint64_t get_elapsed_time(void) { return time_elapsed; }

void simple_timer_handler(void) __attribute__((interrupt));

void simple_timer_handler(void) {
  increment_timecmp(time_increment);
  time_elapsed++;
}

The timer ISR is called simple_timer_handler, which calls the function increment_timecmp() to reprogram the timer. The reprogramming first reads the current timestamp (timer_read()) and then loads the timer with an updated timeout value (timecmp_update()). Each of these functions eventually boils down to calls of DEV_READ and DEV_WRITE. These are memory-mapped operations. Note that DEV_READ and DEV_WRITE are macros; they expand into memory read/write operations.

#define DEV_WRITE(addr, val) (*((volatile uint32_t *)(addr)) = val)
#define DEV_READ(addr, val) (*((volatile uint32_t *)(addr)))
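Because the macros are plain pointer dereferences, they can be tried out on a host computer by pointing them at ordinary memory. The sketch below does that with a small array that stands in for the timer registers. The array fake_timer_regs and the function demo_timecmp_roundtrip are hypothetical, and the two offsets are assumptions taken from the register map of the timer hardware (8 and 12). On the real SoC, the base address is TIMER_BASE and the accesses go to the peripheral over the bus.

```c
#include <assert.h>
#include <stdint.h>

#define DEV_WRITE(addr, val) (*((volatile uint32_t *)(addr)) = val)
#define DEV_READ(addr, val) (*((volatile uint32_t *)(addr)))

/* Assumed offsets, taken from the register map in timer.sv. */
#define TIMER_MTIMECMP 0x8
#define TIMER_MTIMECMPH 0xC

/* Stand-in for the peripheral: four 32-bit words of ordinary host memory. */
static uint32_t fake_timer_regs[4];

/* Writes a 64-bit compare value with three 32-bit stores, then reads it
   back with two 32-bit loads. The second argument of DEV_READ is unused. */
static uint64_t demo_timecmp_roundtrip(uint64_t new_time) {
  uintptr_t base = (uintptr_t)fake_timer_regs; /* plays the role of TIMER_BASE */
  DEV_WRITE(base + TIMER_MTIMECMP, -1);
  DEV_WRITE(base + TIMER_MTIMECMPH, new_time >> 32);
  DEV_WRITE(base + TIMER_MTIMECMP, new_time);
  uint32_t high = DEV_READ(base + TIMER_MTIMECMPH, 0);
  uint32_t low = DEV_READ(base + TIMER_MTIMECMP, 0);
  return ((uint64_t)high << 32) | low;
}

static void macro_self_check(void) {
  assert(demo_timecmp_roundtrip(0x0000000100000100ull) == 0x0000000100000100ull);
  assert(fake_timer_regs[2] == 0x00000100u); /* word at offset 8 */
  assert(fake_timer_regs[3] == 0x00000001u); /* word at offset 12 */
}
```

The volatile qualifier matters on the real hardware: it tells the compiler that every access must actually be performed, in program order, because the register may change on its own.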

So in summary, for each timer interrupt, the timer will be reprogrammed, and the reprogramming operation consists of three memory reads (more if the high word changes during the read and timer_read() retries) followed by three memory writes to addresses allocated to the timer. These addresses are expressed as offsets from a TIMER_BASE, such as TIMER_BASE + TIMER_MTIMEH and TIMER_BASE + TIMER_MTIMECMPH. Now, we can trace these timer operations in the FST waveform generated during the simulation. The following shows the output of gtkwave, with some of the timer peripheral interface signals added to the waveform diagram.

timer_intr_o: Interrupt request signal to the processor
timer_addr_i: Address coming from the processor to the timer peripheral
timer_rdata_o: Data read from timer peripheral to processor
timer_wdata_i: Data written to the timer peripheral from processor
timer_rvalid_o: Response-valid signal from the timer peripheral, asserted one cycle after each request; it marks when timer_rdata_o holds valid read data
timer_we: Write operation into the timer peripheral

We can clearly identify the periodic interrupt. Zooming in on one such periodic interrupt, you can identify the three read operations followed by the three write operations. This is evidence that the processor is running the functions timer_read() and timecmp_update().
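The reason timer_read() needs three reads rather than two can be shown with a small host-side model. A 64-bit counter is read as two 32-bit halves, and the counter keeps running between the two bus accesses. In the sketch below, model_read, read_unchecked and read_checked are hypothetical names, and the counter advances by one per access, a rate chosen only to make the effect easy to see.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t model_mtime; /* modeled 64-bit counter */

/* Models one 32-bit bus read; the counter advances by one per access
   (a hypothetical rate, chosen to make the effect easy to see). */
static uint32_t model_read(int high_half) {
  uint32_t value = high_half ? (uint32_t)(model_mtime >> 32)
                             : (uint32_t)model_mtime;
  model_mtime += 1;
  return value;
}

/* Two reads, no check: the halves may belong to different counter values. */
static uint64_t read_unchecked(void) {
  uint32_t hi = model_read(1);
  uint32_t lo = model_read(0);
  return ((uint64_t)hi << 32) | lo;
}

/* Same retry structure as timer_read(): high, low, then high again. */
static uint64_t read_checked(void) {
  uint32_t hi, lo;
  do {
    hi = model_read(1);
    lo = model_read(0);
  } while (hi != model_read(1));
  return ((uint64_t)hi << 32) | lo;
}

static void read_self_check(void) {
  /* Start just before the low word wraps around. */
  model_mtime = 0x00000000FFFFFFFFull;
  assert(read_unchecked() == 0x0000000000000000ull); /* torn value */
  model_mtime = 0x00000000FFFFFFFFull;
  assert(read_checked() == 0x0000000100000003ull); /* one retry, consistent */
}
```

In the unchecked case the old high word is combined with the new low word, so the result is off by about 2^32 counts. The checked version detects that the high word changed and reads again.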

The cursor is positioned at the moment of the first write operation to the timer, namely:

DEV_WRITE(TIMER_BASE + TIMER_MTIMECMP, -1);
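A plausible reason for writing -1 to the low word first is that a 64-bit compare value cannot be updated atomically over a 32-bit bus, and a careless ordering can briefly create a compare value that is smaller than both the old and the new value. The RISC-V privileged specification describes a similar three-store sequence for mtimecmp. The sketch below is a host-side model with hypothetical names (cmp_model_t, update_low_then_high, update_as_in_timecmp_update); it records the smallest compare value that exists after any single write.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint64_t value;       /* current 64-bit compare value */
  uint64_t lowest_seen; /* smallest value observed after any single write */
} cmp_model_t;

static void cmp_init(cmp_model_t *c, uint64_t old_value) {
  c->value = old_value;
  c->lowest_seen = UINT64_MAX;
}

static void cmp_note(cmp_model_t *c) {
  if (c->value < c->lowest_seen) c->lowest_seen = c->value;
}

static void write_low(cmp_model_t *c, uint32_t data) {
  c->value = (c->value & 0xFFFFFFFF00000000ull) | data;
  cmp_note(c);
}

static void write_high(cmp_model_t *c, uint32_t data) {
  c->value = (c->value & 0x00000000FFFFFFFFull) | ((uint64_t)data << 32);
  cmp_note(c);
}

/* Careless ordering: low word first, then high word. */
static void update_low_then_high(cmp_model_t *c, uint64_t new_time) {
  write_low(c, (uint32_t)new_time);
  write_high(c, (uint32_t)(new_time >> 32));
}

/* Same ordering as timecmp_update(): all-ones low, high, then real low. */
static void update_as_in_timecmp_update(cmp_model_t *c, uint64_t new_time) {
  write_low(c, 0xFFFFFFFFu);
  write_high(c, (uint32_t)(new_time >> 32));
  write_low(c, (uint32_t)new_time);
}

static void cmp_self_check(void) {
  const uint64_t old_time = 0x00000000FFFFF000ull;
  const uint64_t new_time = 0x0000000100000100ull;
  cmp_model_t c;

  cmp_init(&c, old_time);
  update_low_then_high(&c, new_time);
  assert(c.value == new_time);
  assert(c.lowest_seen == 0x0000000000000100ull); /* below old and new */

  cmp_init(&c, old_time);
  update_as_in_timecmp_update(&c, new_time);
  assert(c.value == new_time);
  assert(c.lowest_seen == 0x00000000FFFFFFFFull); /* never below old */
}
```

With the careless ordering, the intermediate compare value 0x100 is far in the past, so a comparator would see the timer as expired. With the three-write ordering, no intermediate value falls below the old compare value.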

Finally, now that we are able to connect the hardware execution to the software C code, we can also inspect the SystemVerilog of the actual timer module. The timer module is located in shared/rtl/timer.sv. We include the full module below to highlight the various sections of the timer.

// Copyright lowRISC contributors.
// Licensed under the Apache License, Version 2.0, see LICENSE for details.
// SPDX-License-Identifier: Apache-2.0

// Example memory mapped timer

`include "prim_assert.sv"

module timer #(
  // Bus data width (must be 32)
  parameter int unsigned DataWidth    = 32,
  // Bus address width
  parameter int unsigned AddressWidth = 32
) (
  input  logic                    clk_i,
  input  logic                    rst_ni,
  // Bus interface
  input  logic                    timer_req_i,

  input  logic [AddressWidth-1:0] timer_addr_i,
  input  logic                    timer_we_i,
  input  logic [ DataWidth/8-1:0] timer_be_i,
  input  logic [   DataWidth-1:0] timer_wdata_i,
  output logic                    timer_rvalid_o,
  output logic [   DataWidth-1:0] timer_rdata_o,
  output logic                    timer_err_o,
  output logic                    timer_intr_o
);

  // The timers are always 64 bits
  localparam int unsigned TW = 64;
  // Upper bits of address are decoded into timer_req_i
  localparam int unsigned ADDR_OFFSET = 10; // 1kB
  // Register map
  localparam bit [9:0] MTIME_LOW = 0;
  localparam bit [9:0] MTIME_HIGH = 4;
  localparam bit [9:0] MTIMECMP_LOW = 8;
  localparam bit [9:0] MTIMECMP_HIGH = 12;

  logic                 timer_we;
  logic                 mtime_we, mtimeh_we;
  logic                 mtimecmp_we, mtimecmph_we;
  logic [DataWidth-1:0] mtime_wdata, mtimeh_wdata;
  logic [DataWidth-1:0] mtimecmp_wdata, mtimecmph_wdata;
  logic [TW-1:0]        mtime_q, mtime_d, mtime_inc;
  logic [TW-1:0]        mtimecmp_q, mtimecmp_d;
  logic                 interrupt_q, interrupt_d;
  logic                 error_q, error_d;
  logic [DataWidth-1:0] rdata_q, rdata_d;
  logic                 rvalid_q;

  // Global write enable for all registers
  assign timer_we = timer_req_i & timer_we_i;

  // mtime increments every cycle
  assign mtime_inc = mtime_q + 64'd1;

  // Generate write data based on byte strobes
  for (genvar b = 0; b < DataWidth / 8; b++) begin : gen_byte_wdata

    assign mtime_wdata[(b*8)+:8]     = timer_be_i[b] ? timer_wdata_i[b*8+:8] :
                                                       mtime_q[(b*8)+:8];
    assign mtimeh_wdata[(b*8)+:8]    = timer_be_i[b] ? timer_wdata_i[b*8+:8] :
                                                       mtime_q[DataWidth+(b*8)+:8];
    assign mtimecmp_wdata[(b*8)+:8]  = timer_be_i[b] ? timer_wdata_i[b*8+:8] :
                                                       mtimecmp_q[(b*8)+:8];
    assign mtimecmph_wdata[(b*8)+:8] = timer_be_i[b] ? timer_wdata_i[b*8+:8] :
                                                       mtimecmp_q[DataWidth+(b*8)+:8];
  end

  // Generate write enables
  assign mtime_we     = timer_we & (timer_addr_i[ADDR_OFFSET-1:0] == MTIME_LOW);
  assign mtimeh_we    = timer_we & (timer_addr_i[ADDR_OFFSET-1:0] == MTIME_HIGH);
  assign mtimecmp_we  = timer_we & (timer_addr_i[ADDR_OFFSET-1:0] == MTIMECMP_LOW);
  assign mtimecmph_we = timer_we & (timer_addr_i[ADDR_OFFSET-1:0] == MTIMECMP_HIGH);

  // Generate next data
  assign mtime_d    = {(mtimeh_we    ? mtimeh_wdata    : mtime_inc[63:32]),
                       (mtime_we     ? mtime_wdata     : mtime_inc[31:0])};
  assign mtimecmp_d = {(mtimecmph_we ? mtimecmph_wdata : mtimecmp_q[63:32]),
                       (mtimecmp_we  ? mtimecmp_wdata  : mtimecmp_q[31:0])};

  // Generate registers
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (~rst_ni) begin
      mtime_q <= 'b0;
    end else begin
      mtime_q <= mtime_d;
    end
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (~rst_ni) begin
      mtimecmp_q <= 'b0;
    end else if (mtimecmp_we | mtimecmph_we) begin
      mtimecmp_q <= mtimecmp_d;
    end
  end

  // interrupt remains set until mtimecmp is written
  assign interrupt_d  = ((mtime_q >= mtimecmp_q) | interrupt_q) & ~(mtimecmp_we | mtimecmph_we);

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (~rst_ni) begin
      interrupt_q <= 'b0;
    end else begin
      interrupt_q <= interrupt_d;
    end
  end

  assign timer_intr_o = interrupt_q;

  // Read data
  always_comb begin
    rdata_d = 'b0;
    error_d = 1'b0;
    unique case (timer_addr_i[ADDR_OFFSET-1:0])
      MTIME_LOW:     rdata_d = mtime_q[31:0];
      MTIME_HIGH:    rdata_d = mtime_q[63:32];
      MTIMECMP_LOW:  rdata_d = mtimecmp_q[31:0];
      MTIMECMP_HIGH: rdata_d = mtimecmp_q[63:32];
      default: begin
        rdata_d = 'b0;
        // Error if no address matched
        error_d = 1'b1;
      end
    endcase
  end

  // error_q and rdata_q are only valid when rvalid_q is high
  always_ff @(posedge clk_i) begin
    if (timer_req_i) begin
      rdata_q <= rdata_d;
      error_q <= error_d;
    end
  end

  assign timer_rdata_o = rdata_q;

  // Read data is always valid one cycle after a request
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      rvalid_q <= 1'b0;
    end else begin
      rvalid_q <= timer_req_i;
    end
  end

  assign timer_rvalid_o = rvalid_q;
  assign timer_err_o    = error_q;

  // Assertions
  `ASSERT_INIT(param_legal, DataWidth == 32)
endmodule

Lines 9-27 of the listing (the module header and port list) show the interface of the memory-mapped module. This interface is the same for every memory-mapped module. It defines an address bus and data buses, as well as a set of control signals that define read/write operations. Note that there are separate read-data and write-data buses: on-chip digital signals are typically unidirectional, so we need two data buses, one for writes and one for reads.
Lines 72-75 (under the comment "Generate write enables") show the write address decoding. This is where the write-control signal for each memory-mapped location is created. There are four such locations. The actual write operations on the registers are in lines 83-98. The timer has two 64-bit registers, mtime and mtimecmp, so there are two 32-bit memory-mapped locations (LOW and HIGH) for each register.
Lines 114-128 show the read address decoding. In this case, data from the internal registers is selected onto rdata_d according to the memory address provided through timer_addr_i, and is returned to the processor over timer_rdata_o.
The interrupt logic is shown on line 101. An interrupt is asserted when the timer counter mtime_q reaches or exceeds the timer bound in mtimecmp_q, and it remains asserted until mtimecmp is written.
Read/write operations from the memory bus to the timer have to follow the proper memory bus protocol. For this reason, additional registers and logic are added, such as in lines 130-147, which deliver the read data and the valid signal one cycle after the request.
In our experiments, we will typically start from a working memory-mapped interface for a hardware peripheral, and customize that peripheral further to our needs.
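As a way to check our reading of the RTL, the write path and the interrupt logic of the timer can be restated as a small cycle-level model in C. This is only a sketch under stated assumptions: the names timer_model_t, merge_bytes and timer_clock are hypothetical, the read path (rdata, rvalid, error) is left out, and the address 0x30008 used in the self-check is an arbitrary example whose lower 10 bits select MTIMECMP_LOW.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  uint64_t mtime;    /* mtime_q */
  uint64_t mtimecmp; /* mtimecmp_q */
  bool intr;         /* interrupt_q */
} timer_model_t;

/* Keep old bytes where the byte enable is 0, take bus bytes where it is 1. */
static uint32_t merge_bytes(uint32_t old, uint32_t wdata, uint8_t be) {
  uint32_t mask = 0;
  for (int b = 0; b < 4; b++) {
    if (be & (1u << b)) mask |= 0xFFu << (8 * b);
  }
  return (wdata & mask) | (old & ~mask);
}

/* One clock edge. req/we/addr/be/wdata mirror the timer_*_i inputs. */
static void timer_clock(timer_model_t *t, bool req, bool we, uint32_t addr,
                        uint8_t be, uint32_t wdata) {
  bool wr = req && we;
  uint32_t offset = addr & 0x3FFu; /* timer_addr_i[9:0] */
  bool mtime_we = wr && offset == 0;
  bool mtimeh_we = wr && offset == 4;
  bool mtimecmp_we = wr && offset == 8;
  bool mtimecmph_we = wr && offset == 12;

  uint32_t q_lo = (uint32_t)t->mtime, q_hi = (uint32_t)(t->mtime >> 32);
  uint32_t c_lo = (uint32_t)t->mtimecmp, c_hi = (uint32_t)(t->mtimecmp >> 32);
  uint64_t inc = t->mtime + 1; /* mtime increments every cycle */

  uint32_t d_lo = mtime_we ? merge_bytes(q_lo, wdata, be) : (uint32_t)inc;
  uint32_t d_hi = mtimeh_we ? merge_bytes(q_hi, wdata, be) : (uint32_t)(inc >> 32);
  uint32_t e_lo = mtimecmp_we ? merge_bytes(c_lo, wdata, be) : c_lo;
  uint32_t e_hi = mtimecmph_we ? merge_bytes(c_hi, wdata, be) : c_hi;

  /* Computed from the current (_q) values, before they are updated. */
  bool intr_d = ((t->mtime >= t->mtimecmp) || t->intr) &&
                !(mtimecmp_we || mtimecmph_we);

  t->mtime = ((uint64_t)d_hi << 32) | d_lo;
  t->mtimecmp = ((uint64_t)e_hi << 32) | e_lo;
  t->intr = intr_d;
}

static void timer_self_check(void) {
  timer_model_t t = {0, 0, false};               /* state after reset */
  timer_clock(&t, false, false, 0, 0, 0);        /* idle: 0 >= 0 sets intr */
  assert(t.intr && t.mtime == 1);
  timer_clock(&t, true, true, 0x30008u, 0xF, 5); /* write MTIMECMP_LOW = 5 */
  assert(!t.intr && t.mtimecmp == 5 && t.mtime == 2);
  for (int i = 0; i < 3; i++) timer_clock(&t, false, false, 0, 0, 0);
  assert(!t.intr && t.mtime == 5);
  timer_clock(&t, false, false, 0, 0, 0);        /* compares mtime_q == 5 */
  assert(t.intr && t.mtime == 6);
  timer_clock(&t, true, true, 0x30008u, 0x1, 0xAABBCCDDu); /* byte 0 only */
  assert(!t.intr && t.mtimecmp == 0xDD);
}
```

The self-check shows two properties of the design: the interrupt is sticky and is only cleared by a write to mtimecmp, and a write with a partial byte enable changes only the selected bytes of the register.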