Lecture 10 - Nios II and Quartus Platform Designer

Introduction
Nios II core overview
- Nios II Implementations
- Nios II Instructions
Platform Designer Overview (Demonstration)

Introduction

The Nios II is a 32-bit soft-core processor that is implemented in the FPGA fabric. ‘Soft-core’ means that Nios II exists as a set of Verilog files that are synthesized in Quartus.

The advantage of a soft-core design is that it is portable: it will run in any FPGA that can synthesize the Nios II Verilog implementation. The disadvantage is that the processor is not implemented directly in silicon, but is instead mapped onto the configurable FPGA fabric through a bitstream. This results in a performance hit and a less energy-efficient solution. The FPGA on your DE1-SoC board demonstrates this. A Nios II in this FPGA - a Cyclone V - can run at up to 210 MHz. The hard-macro ARM cores in the same FPGA, on the other hand, can run at up to 925 MHz.

Often, however, the advantage of having a programmable micro-controller in an FPGA - next to other dedicated hardware logic - outweighs the performance overhead.

Nios II core overview

The Nios II is a highly configurable 32-bit microcontroller, optimized for the Cyclone V FPGA fabric. The Nios II core by itself only contains a processor capable of executing the Nios II instruction set. When you make a design with Nios II you still have to add a bus, program/data memory, and peripherals.

The standard reference on the Nios II processor is the Nios II Processor Reference Guide.

Figure: Nios II Processor

The blue boxes in the figure make up the ‘essential components’ of a Nios. They include instruction-fetch and decode logic, registers, exception logic, and an ALU. The grey boxes in the figure are optional elements, used to enhance the capabilities (and typically the performance) of the Nios. Let’s explain some of these optional elements.

Shadow Register Sets: Shadow register sets reduce context-switching overhead when the processor switches from one function to another. Rather than saving registers on the stack (as is normally done in a software function call or an interrupt service routine), the Nios has dedicated instructions to move the 32 registers from the main register file to a shadow register set.
Data and Instruction Cache: The caches are useful when the Nios data or program memory is located off-chip in a slow main memory. The cache size can be configured, and ideally must be chosen such that the instructions of the largest performance-critical loop fit in the instruction cache, and the largest performance-critical data structure fits in the data cache. When only on-chip memory is used for instructions and data, a cache memory can be avoided. Note that a data cache introduces a subtle complication for memory-mapped registers. Since the content of memory-mapped registers is not under control of (only) the software, memory-mapped registers cannot be cached. The Nios provides several cache-bypass methods that make it possible to turn off the cache for specific addresses.
Memory Management Unit and Translation Lookaside Buffer: The Nios can implement virtual memory and memory protection, which is used to support memory management in an operating system.
Tightly-Coupled Memory: Microcontrollers inside of an FPGA face the same challenge as every other processor, namely that the memory bandwidth very quickly becomes a limiting factor. In a system-on-chip context, where peripherals often share the same bus as the instruction and data memory, the memory bandwidth bottleneck can also be compounded with a bus bottleneck. A tightly-coupled memory provides a data memory of limited size close to the processor, with similar performance to cache memory. However, tightly-coupled memory is directly under control of the programmer (unlike the cache): accessing tightly-coupled memory looks no different to the programmer than accessing any other memory location.
Custom Instructions: The standard ALU in Nios supports a basic set of instructions for arithmetic, comparison, logical and shift/rotate operations. These instructions follow a two-operand, one-destination format. The instruction set can be extended with new instructions, implemented directly in the FPGA datapath. This offers a convenient mechanism to integrate custom hardware at the top of the memory hierarchy.
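
The remark about memory-mapped registers in the cache item above has a counterpart at the compiler level: the compiler, too, must not keep a stale copy of a register value. The following is a minimal host-side sketch of that idea in C. The names fake_timer_reg, TIMER_REG, reg_read, reg_write and hardware_tick are hypothetical, and an ordinary variable stands in for the peripheral register. On a real Nios II with a data cache, the access would additionally have to go through one of the cache-bypass methods mentioned above; this sketch does not show that part.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a peripheral register. On real hardware this
   would be a fixed address from the system's address map. */
volatile uint32_t fake_timer_reg;

/* 'volatile' forces the compiler to perform every read and write,
   because the value can change outside the program's control. */
volatile uint32_t *const TIMER_REG = &fake_timer_reg;

void reg_write(volatile uint32_t *reg, uint32_t value) { *reg = value; }

uint32_t reg_read(volatile uint32_t *reg) { return *reg; }

/* Simulates the hardware changing the register behind the software's back. */
void hardware_tick(void) { fake_timer_reg = fake_timer_reg + 1u; }

/* Two consecutive reads must observe the update made in between. */
void demo_register_access(void) {
    reg_write(TIMER_REG, 100u);
    uint32_t before = reg_read(TIMER_REG);
    hardware_tick();
    uint32_t after = reg_read(TIMER_REG);
    assert(before == 100u);
    assert(after == 101u);
}
```

Calling demo_register_access() from a test program exercises the write, the simulated hardware update, and the two reads.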

Nios II Implementations

There are two flavors of the Nios II design, one called Nios II/e and the other called Nios II/f. The extensions stand for ‘economical’ and ‘fast’, respectively. Nios II/e is a slower, but smaller core, used where the microcontroller capability of the Nios is most important. Nios II/f is a faster, but larger core, used when performance is critical. The Nios II/e and Nios II/f are not simply two different instantiations of the same Nios II design (with different options). Instead, Nios II/e and Nios II/f are two different implementations of the same Nios instruction set.

Table: Nios II Feature Comparison

	Nios II/f	Nios II/e
Pipeline Stages	6	1
Branch Prediction	Dynamic	-
Multiplier	hardware	software
Shifter	barrel shifter	1-bit-per-cycle
ICache	Configurable	-
DCache	Configurable	-
MMU	Configurable	-

The six pipeline stages of the Nios II/f comprise fetch, decode, execute, memory-access, operand alignment, and write-back. The branches are resolved during decoding, and branch prediction drives the next program counter when a branch is being decoded.

The larger feature set of the Nios II/f yields a higher performance at a larger area cost. The Nios II/f is almost three times as large as the Nios II/e, and it runs at a lower clock frequency. However, it is about nine times more efficient per clock cycle when running a benchmark (0.9 versus 0.1 DMIPS/MHz). DMIPS stands for ‘Dhrystone MIPS’; Dhrystone is a synthetic benchmark program used to test processor performance.

Table: Nios II Performance Comparison (Cyclone V)

	Nios II/f	Nios II/e
Max Clock Frequency	170 MHz	210 MHz
Area (ALM)	867	308
DMIPS/MHz	0.9	0.1

The Nios II/f offers reasonably good performance compared to a standard ARM. For example, an ARM Cortex M4 (the same ARM as used in MSP-432) achieves 1.25 to 1.95 DMIPS/MHz. And both the Nios II/f and Nios II/e are very compact. The Cyclone V FPGA on your DE1-SoC board has 32,070 ALMs, which is the equivalent of about 100 Nios II/e cores. Of course, a Nios II core is not a complete system; you would need at least some memory to supply instructions to the core. Here are a few typical logic use sizes for Nios II peripherals.

Peripheral	Size (ALM)
JTAG Debug Module	125
UART	57
SDRAM Controller	2475
Timer	54
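
The numbers in the tables above can be combined into a few back-of-the-envelope figures. The sketch below is only an illustration: the struct and function names are hypothetical, and the ‘small system’ is assumed to be one core plus JTAG debug module, UART and timer, ignoring memory and bus interconnect. DMIPS/MHz is stored in tenths so that all arithmetic stays in integers.

```c
#include <assert.h>
#include <stdio.h>

/* Figures copied from the tables above. */
struct core {
    const char *name;
    unsigned fmax_mhz;           /* maximum clock frequency on Cyclone V */
    unsigned alm;                /* core area in ALMs */
    unsigned dmips_per_mhz_x10;  /* DMIPS/MHz multiplied by 10 */
};

const struct core NIOS_F = { "Nios II/f", 170, 867, 9 };
const struct core NIOS_E = { "Nios II/e", 210, 308, 1 };

enum { ALM_JTAG = 125, ALM_UART = 57, ALM_TIMER = 54, ALM_DEVICE = 32070 };

/* Peak throughput = (DMIPS/MHz) * (maximum clock in MHz). */
unsigned peak_dmips(const struct core *c) {
    return c->fmax_mhz * c->dmips_per_mhz_x10 / 10;
}

/* Assumed small system: core + JTAG debug + UART + timer. */
unsigned small_system_alm(const struct core *c) {
    return c->alm + ALM_JTAG + ALM_UART + ALM_TIMER;
}

/* How many such systems fit in the device, counting ALMs only. */
unsigned systems_that_fit(const struct core *c) {
    unsigned size = small_system_alm(c);
    assert(size > 0);
    return ALM_DEVICE / size;
}

void print_summary(const struct core *c) {
    printf("%s: %u DMIPS peak, %u ALM per small system, %u systems fit\n",
           c->name, peak_dmips(c), small_system_alm(c), systems_that_fit(c));
}
```

With these inputs the Nios II/f reaches 153 DMIPS against 21 DMIPS for the Nios II/e, so the faster core delivers roughly seven times the peak throughput despite its lower clock frequency.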

Nios II Instructions

The Nios II is a load-store architecture. In a load-store architecture, a dedicated set of instructions (load and store) is used to access the memory hierarchy. Other instructions (including arithmetic and logical instructions, for example) are limited to register-to-register operations. A load-store architecture simplifies the addressing of operands, at the expense of a larger number of instructions.

Section 7 of the Nios II Processor Reference Guide describes the Application Binary Interface for the Nios II. The Nios II data types have a different size than those of the MSP-430.

Data Type	Size (bytes)	MSP-430 size
`char`	1	1
`short`	2	2
`unsigned`	4	2
`long`	4	4
pointer	4	2
`long long`	8	8
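
Because `unsigned` and pointers change size between the two processors, code that depends on a particular width is safer when written with the fixed-width types from stdint.h. The following is a minimal sketch, assuming a C11 compiler with stdint.h on both targets; the function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Compile-time checks: these widths hold on every target by definition. */
_Static_assert(sizeof(uint16_t) * CHAR_BIT == 16, "uint16_t must be 16 bits");
_Static_assert(sizeof(uint32_t) * CHAR_BIT == 32, "uint32_t must be 32 bits");

/* A 16-bit counter that wraps from 65535 to 0 on both targets.
   With plain 'unsigned' the wrap point would depend on the target. */
uint16_t next_index(uint16_t i) {
    return (uint16_t)(i + 1u);
}

/* Combine two 16-bit halves into one 32-bit word. The cast widens 'hi'
   before shifting, so the shift is valid even where int is 16 bits. */
uint32_t pack_halves(uint16_t hi, uint16_t lo) {
    return ((uint32_t)hi << 16) | lo;
}
```

Both functions behave identically whether `unsigned` is 2 bytes or 4 bytes wide.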

The Nios II uses word-aligned memory (32-bit alignment), whereas the MSP-430 uses 16-bit aligned memory. This means that an integer or larger data type (unsigned, long, long long, pointer) must always start on a 4-byte boundary. The alignment requirement stems from the hardware implementation of the Nios II core, which has a 32-bit data bus. Without alignment, a simple memory load of an integer could require two memory accesses, which is inefficient.
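
The effect of the alignment rule is visible in the layout of a C structure, where the compiler inserts padding so that each member starts on a suitable boundary. The sketch below uses a hypothetical struct sample to inspect that layout with offsetof and _Alignof (C11). With 4-byte alignment for `unsigned`, as described above, the member value lands at offset 4 and three padding bytes follow tag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record mixing a 1-byte, a 4-byte and a 2-byte member. */
struct sample {
    char     tag;    /* offset 0 */
    unsigned value;  /* must start on an 'unsigned' alignment boundary */
    short    flags;
};

/* Number of padding bytes the compiler inserted between tag and value. */
size_t padding_before_value(void) {
    return offsetof(struct sample, value) - sizeof(char);
}

/* True when 'value' starts on a multiple of its required alignment. */
int value_is_aligned(void) {
    return offsetof(struct sample, value) % _Alignof(unsigned) == 0;
}

/* True when the struct size is a multiple of its alignment, so that
   every element of an array of struct sample stays aligned as well. */
int array_elements_stay_aligned(void) {
    return sizeof(struct sample) % _Alignof(struct sample) == 0;
}
```

Reordering the members from largest to smallest is a common way to reduce the padding in such a structure.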

The Nios II is a little-endian machine, just like the MSP-430. This means that lower bytes are stored in lower-valued addresses. The Nios II has 32 registers, and uses many of the same ideas as found in the MSP-430. We will look into those aspects later.
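
Byte order can be observed from C by looking at the individual bytes of a multi-byte value. The following is a small sketch with hypothetical helper names: is_little_endian reports the byte order of the machine it runs on, while store_le32 and load_le32 lay out a word in little-endian order explicitly, independent of the machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when the lowest-addressed byte holds the least significant
   byte of the word, which is the little-endian convention. */
int is_little_endian(void) {
    uint32_t word = 0x12345678u;
    uint8_t bytes[4];
    memcpy(bytes, &word, sizeof word);
    return bytes[0] == 0x78;
}

/* Store a 32-bit value with the low byte at the lowest address. */
void store_le32(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v & 0xFFu);
    out[1] = (uint8_t)((v >> 8) & 0xFFu);
    out[2] = (uint8_t)((v >> 16) & 0xFFu);
    out[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Rebuild the value from four bytes stored in little-endian order. */
uint32_t load_le32(const uint8_t in[4]) {
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

On a little-endian machine, the bytes written by store_le32 match the in-memory representation of the same uint32_t.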

The instructions of the Nios II are listed in section 8 of the Nios II Processor Reference Guide. The best way to learn about the Nios II instructions, though, is to look at the compiler output in assembly. The following function does an interesting bit of magic. It computes a modulo-179 operation without making use of an integer division operation. It only uses modulo and divide operations with powers of 2, which can be implemented using simple shift and bitwise-and.

unsigned modk(unsigned x, unsigned k) { return (x & ((1 << k) - 1)); }
unsigned divk(unsigned x, unsigned k) { return (x >> k); }

unsigned modulo(unsigned x) {
  unsigned r, q, k, a, m, z;
  m = 0xB3; // 179
  k = 8;
  a = (1 << k) - m;
  r = modk(x, k);
  q = divk(x, k);
  do {
    do {
      r = r + modk(q * a, k);
      q = divk(q * a, k);
    } while (q != 0);
    q = divk(r, k);
    r = modk(r, k);
  } while (q != 0);
  z = (r >= m) ? r - m : r; // final correction: at this point r < 256 < 2 * m
  return z;
}
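
The trick behind the modulo function rests on one identity: since 256 = 179 + 77, any x = 256*q + r is congruent to r + 77*q modulo 179. The sketch below is a simplified restatement of that idea, not the loop structure used above; the names fold179 and mod179 are hypothetical. It repeatedly folds the upper bits into the lower byte until the value fits in 8 bits, and then applies one conditional subtraction. It assumes a 32-bit `unsigned`, as on the Nios II.

```c
#include <assert.h>

/* One folding step: x = 256*q + r becomes r + 77*q.
   The result is congruent to x modulo 179 and is smaller than x
   whenever q >= 1, so repeated folding terminates. */
unsigned fold179(unsigned x) {
    return (x & 0xFFu) + 77u * (x >> 8);
}

/* Reduce x modulo 179 using only shift, mask, multiply and subtract. */
unsigned mod179(unsigned x) {
    while (x > 0xFFu) {
        x = fold179(x);
    }
    /* Now x < 256 < 2 * 179, so at most one subtraction is needed. */
    return (x >= 179u) ? x - 179u : x;
}
```

Comparing mod179(x) against x % 179 over a range of inputs is a quick way to gain confidence in the identity. The constant 77 is the value of a = (1 << k) - m in the function above.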

The assembly output of the modulo function is as follows. Some useful things to know are the following.

The input argument x is provided in r4.
The return value is given in r2.
Instructions ending in ‘i’ use an immediate operand (constant).

Try to explain what assembly instructions correspond to what parts of the C program!

0000001c <modulo>:
1c: 200ad23a    srli r5,r4,8      ; r5 = r4 >> 8
20: 20803fcc    andi r2,r4,255    ; r2 = r4 & 255
24: 28c01364    muli r3,r5,77     ; r3 = r5 * 77
28: 180ad23a    srli r5,r3,8      ; r5 = r3 >> 8
2c: 18c03fcc    andi r3,r3,255    ; r3 = r3 & 255
30: 10c5883a    add r2,r2,r3      ; r2 = r2 + r3
34: 283ffb1e    bne r5,zero,24 <modulo+0x8> ; if (r5 != 0) goto 24
38: 100ad23a    srli r5,r2,8      ; r5 = r2 >> 8
3c: 10803fcc    andi r2,r2,255    ; r2 = r2 & 255
40: 283ff81e    bne r5,zero,24 <modulo+0x8> ; if (r5 != 0) goto 24
44: 00c02c84    movi r3,178       ; r3 = 178
48: 1880012e    bgeu r3,r2,50 <modulo+0x34> ; if (r3 >= r2) goto 50
4c: 10bfd344    addi r2,r2,-179   ; r2 = r2 - 179
50: f800283a    ret               ; return r2
