Introduction

So far, we have introduced the basics of SoC design using Nios II and the Platform Designer tool in Quartus. We gave a broad overview of on-chip buses and the ensemble of tricks that are used to speed up bus transfers. To attach coprocessor hardware to the SoC, we identified three possible locations in the interconnect fabric. First, we can use a memory-mapped coprocessor attached to the on-chip bus. Second, we can use a tightly-coupled coprocessor attached to a processor-specific local bus. Third, we can hack the processor and modify the processor internals to add new instructions that do precisely what we want.

The Nios II custom-instruction interface fits in the second category. Even though the ‘custom instruction’ terminology may suggest that we are modifying the Nios II internals (i.e., strategy 3 above), this is not the case. The Nios II custom instruction hardware is attached to a specialized local bus. The Nios II processor uses an opcode reserved specifically for this bus.

Custom-instruction design is a convenient coprocessor design strategy that offers easy software integration: the software does all the work of fetching data from memory and returning it to memory, while the custom instruction still offers a customizable datapath. The objectives of this lecture are to learn the basics of the Nios II custom-instruction interface, and to use it in a few hardware acceleration experiments.

The Nios-II Custom Instruction Interface is documented in detail. The following is a summary of the features.

Nios-II Custom Instruction Interface

The Nios-II Custom Instruction Interface enables calling hardware modules that fit a two-operand, one-result model. A single Nios-II core has a single custom-instruction master, and thus can support a single custom instruction.

However, the custom-instruction interface can accept additional parameters, such that it is possible to support multiple variants of a hardware function. These variants include combinational instructions, multi-cycle instructions, extended instructions and internal register-file instructions. All of them, however, use the same basic hardware interface: the Nios II Custom Instruction Slave interface.

Hardware Interface

Figure: Nios II Custom Instruction Slave Interface

A combinational custom instruction accepts two arguments, dataa and datab, from the Nios II. It returns the result of the custom instruction through result. No other signals are needed, and the custom instruction is expected to complete within one clock cycle.

A multi-cycle custom instruction accepts a clock signal clk, a clock-enable signal clk_en, and a start signal start. The start input is asserted when new operands are available on the data inputs dataa and datab. Next, the instruction can continue over an arbitrary number of clock cycles, while the Nios II processor is stalled. When the custom instruction completes, the result is presented on the result output and the done output is asserted for one clock cycle. The clk_en signal must gate the update of every internal register in the custom instruction. If the Nios II has to stall for a reason other than the custom instruction, it will de-assert clk_en, which stalls the custom instruction as well.
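
The handshake described above can be sketched as a small cycle-level model in C. This is only an illustration under stated assumptions: the names ci_model and ci_clock, the latency of three clock edges, and the addition used as the computation are all hypothetical and do not come from the Nios II documentation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical latency: clock edges needed after the start edge. */
#define CI_LATENCY 3

typedef struct {
    uint32_t a, b;       /* operands latched when start is seen  */
    unsigned remaining;  /* clock edges left until completion    */
    bool     busy;       /* an operation is in flight            */
    bool     done;       /* completion pulse                     */
    uint32_t result;     /* valid when done is asserted          */
} ci_model;

/* Model one rising clock edge. Every state update is gated by clk_en,
   so a stalled processor (clk_en low) freezes the custom instruction. */
void ci_clock(ci_model *m, bool clk_en, bool start,
              uint32_t dataa, uint32_t datab)
{
    if (!clk_en)
        return;                    /* hold all registers */

    m->done = false;               /* done lasts one enabled cycle */
    if (start) {
        m->a = dataa;              /* capture operands */
        m->b = datab;
        m->remaining = CI_LATENCY;
        m->busy = true;
    } else if (m->busy && --m->remaining == 0) {
        m->result = m->a + m->b;   /* placeholder computation */
        m->busy = false;
        m->done = true;
    }
}
```

Calling ci_clock once with start set and then repeatedly with start cleared shows that done only rises after CI_LATENCY enabled edges, and that edges with clk_en low do not count toward the latency.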

Both combinational and multi-cycle custom instructions can use an additional argument n, turning them into extended instructions. The n argument is a static (compile-time) parameter between 0 and 255. It is used as an additional function qualifier for the custom instruction. For example, a custom instruction could offer three different computations for the input arguments. The n argument could then be used to select one of the three computations.

The Nios II works with a register file of 32 registers, and the register source and destination indices of the custom instruction are offered through the fields a (source), b (source) and c (destination). Simultaneously, a program can assert three additional control bits readra, readrb and writerc, that select if the register address fields a, b, and c refer to the Nios II register file or not. If the control bits are asserted, then the source or destination will be the Nios II register file. If the control bits are not asserted, then it is up to the custom instruction to provide meaning to a, b and c. The custom instruction is expected to provide its own internal register file, and use the addressing indices a, b and c to control that internal register file.
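
The role of readra, readrb and writerc can be sketched with a behavioral model in C. The sketch assumes a hypothetical custom instruction with a 32-entry internal register file and an addition as its datapath; the names regfiles and ci_execute are made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t cpu_regs[32]; /* stands in for the Nios II register file     */
    uint32_t ci_regs[32];  /* register file inside the custom instruction */
} regfiles;

/* Execute one custom instruction. The control bits decide, per operand,
   whether an index refers to the processor registers (bit asserted) or
   to the internal registers of the custom instruction (bit de-asserted). */
void ci_execute(regfiles *rf, unsigned a, unsigned b, unsigned c,
                bool readra, bool readrb, bool writerc)
{
    uint32_t opa = readra ? rf->cpu_regs[a & 31] : rf->ci_regs[a & 31];
    uint32_t opb = readrb ? rf->cpu_regs[b & 31] : rf->ci_regs[b & 31];
    uint32_t res = opa + opb;   /* placeholder datapath */

    if (writerc)
        rf->cpu_regs[c & 31] = res;
    else
        rf->ci_regs[c & 31] = res;
}
```

With writerc de-asserted, a result stays inside the coprocessor and can be reused by a later instruction that de-asserts readra or readrb, without a round trip through the processor registers.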

Most interface signals are optional. At the minimum, we use dataa, datab and result. For multi-cycle instructions we add clk, clk_en, reset, start and done. For extended variants of either of these two groups, we add n. And finally, when register index fields are needed, we also include a, b, c, readra, readrb and writerc.

Software Interface

A custom-instruction is a Nios-II R-type instruction, one of three different kinds of instructions on the Nios-II. An R-type instruction takes both operands and results from registers.

Figure: Nios II custom-instruction opcode format

The lower 6 bits of the instruction opcode are fixed to 0x32 to mark it as a custom instruction. The N field holds the 8 bits of the n input for extended instructions. The A, B and C fields hold the register indices of the source and destination registers. These fields are wired to the a, b and c ports of the custom instruction. Finally, readra, readrb and writerc can be used for internal register file operations in the custom-instruction hardware. When the source and destination of the custom instruction are the Nios II register file, readra, readrb and writerc will always be asserted.
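
As a minimal sketch, the instruction word can be assembled in C from its fields. The bit positions used here (A in bits 31..27, B in 26..22, C in 21..17, readra, readrb and writerc in bits 16, 15 and 14, N in 13..6, and the opcode in 5..0) are an assumption based on the Nios II custom-instruction word format; the function name ci_encode is hypothetical.

```c
#include <stdint.h>

/* Pack the fields of a custom instruction into a 32-bit word.
   Assumed layout, from most to least significant bits:
   A[5] B[5] C[5] readra readrb writerc N[8] OP[6]            */
uint32_t ci_encode(unsigned a, unsigned b, unsigned c,
                   unsigned readra, unsigned readrb, unsigned writerc,
                   unsigned n)
{
    return ((uint32_t)(a & 0x1F)    << 27) |
           ((uint32_t)(b & 0x1F)    << 22) |
           ((uint32_t)(c & 0x1F)    << 17) |
           ((uint32_t)(readra & 1)  << 16) |
           ((uint32_t)(readrb & 1)  << 15) |
           ((uint32_t)(writerc & 1) << 14) |
           ((uint32_t)(n & 0xFF)    <<  6) |
           0x32u;                   /* fixed custom-instruction opcode */
}
```

For instance, ci_encode(2, 3, 2, 1, 1, 1, 0) describes an instruction with n = 0 that reads r2 and r3 and writes r2, all in the Nios II register file.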

Because the opcode has user-defined parts, the opcode has to be specially formed by the compiler. The Nios II compiler provides a collection of builtin functions that can be used to create these opcodes easily. The format for these builtin functions is as follows.

RETURN_TYPE __builtin_custom_RETURNnPARAMETER(n, PARAMETER A, PARAMETER B);

For example, the following is a custom instruction that uses two integer inputs a and b, and that returns an integer c.

int __builtin_custom_inii(int n, int a, int b);

There are several variants of these instructions, depending on the type of the operands and the result. All variants are listed in the Nios-II Custom Instruction Interface.

When a board support package is available, the system.h file will provide an additional C macro wrapper for the custom instruction.

Example 1: A basic byte-merge instruction

A first example of a hardware design using custom instructions is a byte-merge. Given two 32-bit integers A and B, the custom instruction creates a new integer C which consists of the upper two bytes of A and the lower two bytes of B. This directly fits a custom-instruction template.

module bitmergeci(input wire [31:0] dataa,
                  input wire [31:0] datab,
                  output wire [31:0] result);

  assign result = {dataa[31:16],datab[15:0]};

endmodule

We discuss how to integrate this custom instruction in Quartus. Download the example repository example-nios-ci to obtain the starter files.

git clone https://github.com/vt-ece4530-f19/example-nios-ci

Open the design in Quartus, and run the Platform Designer tool. The IP catalog shows two custom instructions, bitmergeci and macci. Only one of these two can be used per Nios II, and the platform needs to be re-generated when you swap them out.

The bitmergeci must be attached to the Nios II Custom-Instruction Master. After generating the implementation (Generate HDL …) of the platform, compile the design in Quartus to obtain a bitstream. Alternatively, you can use the command line:

quartus_sh --flow compile exampleniosci

Next, generate the board support package. Since this platform has only 32K of memory, select the ‘small driver’ options as done previously with the example-nios-timer example (https://github.com/vt-ece4530-f19/example-nios-timer).

nios2-bsp-editor
cd software
nios2-bsp-generate-files.exe \
           --settings=hal_bsp/settings.bsp \
           --bsp-dir=hal_bsp
cd hal_bsp
make

The software driver for the custom instruction applies a test pattern and reads the result. Note that we use the C macro generated in system.h to call the custom instruction.

#include "system.h"
#include <stdio.h>

int main() {
        int a = 0xAAAAAAAA;
        int b = 0x55555555;
        int r;

        r = ALT_CI_BITMERGECI_0(a, b);
        printf("%x %x %x\n", a, b, r);

        return 0;
}
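
To know what the test program should print, we can mirror the Verilog concatenation in a small software reference model. This is a sketch for checking results on a host computer; the function name bitmerge_model is hypothetical.

```c
#include <stdint.h>

/* Software reference model of bitmergeci:
   result = {dataa[31:16], datab[15:0]}                     */
uint32_t bitmerge_model(uint32_t dataa, uint32_t datab)
{
    return (dataa & 0xFFFF0000u) | (datab & 0x0000FFFFu);
}
```

For the test pattern a = 0xAAAAAAAA and b = 0x55555555, the model returns 0xAAAA5555, which is the value the custom instruction is expected to produce.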

To compile and run the software, use the same strategy as before. Compile the software, run a second Nios II command shell to open a nios2-terminal, and download the executable onto the Nios II. Note that this example contains a slightly different organization for the software, as there will be two software applications for the same example repository.

cd software/bitmerge
nios2-app-generate-makefile.exe \
        --bsp-dir=../hal_bsp \
        --src-files=bitmerge.c \
        --elf-name=bitmerge.elf
make

# In the second Nios II Command Shell
nios2-terminal

# In the first Nios II Command Shell
nios2-download bitmerge.elf --go

The generated custom instruction can be inspected in the bitmerge.objdump file produced by the software compilation flow.

0000805c <main>:
    ...
    8084:       e0bffd17        ldw     r2,-12(fp)
    8088:       e0fffe17        ldw     r3,-8(fp)
    808c:       10c5c032        custom  0,r2,r2,r3
    8090:       e0bfff15        stw     r2,-4(fp)
    ...
    80c0:       f800283a        ret

Breaking down the opcode 10c5c032 = {5'h2, 5'h3, 5'h2, 1'h1, 1'h1, 1'h1, 8'h0, 6'h32}, we find that the bitfields for this custom instruction are defined as follows, listed from the least-significant field upward:

  • OP = 6'h32, the opcode for a custom instruction
  • N = 8'h0, the N parameter (unused in this case)
  • writerc = 1'h1, the destination is the Nios-II register file
  • readrb = 1'h1, operand B comes from the Nios-II register file
  • readra = 1'h1, operand A comes from the Nios-II register file
  • C = 5'h2, the destination register index
  • B = 5'h3, the operand B register index
  • A = 5'h2, the operand A register index
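
The same breakdown can be reproduced with a few shifts and masks. The following C sketch assumes the field positions implied by the concatenation above; the names ci_fields and ci_decode are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned a, b, c;                  /* register indices        */
    unsigned readra, readrb, writerc;  /* register-file selectors */
    unsigned n;                        /* extended-instruction N  */
    unsigned op;                       /* 6-bit opcode            */
} ci_fields;

/* Split a 32-bit custom-instruction word into its bitfields. */
ci_fields ci_decode(uint32_t insn)
{
    ci_fields f;
    f.a       = (insn >> 27) & 0x1F;
    f.b       = (insn >> 22) & 0x1F;
    f.c       = (insn >> 17) & 0x1F;
    f.readra  = (insn >> 16) & 0x1;
    f.readrb  = (insn >> 15) & 0x1;
    f.writerc = (insn >> 14) & 0x1;
    f.n       = (insn >>  6) & 0xFF;
    f.op      =  insn        & 0x3F;
    return f;
}
```

Applied to 0x10c5c032, the decoder yields A = 2, B = 3, C = 2, all three control bits set, N = 0 and OP = 0x32, in agreement with the disassembly custom 0,r2,r2,r3.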

Example 2: An extended multiply-accumulate instruction

A second example demonstrates a signed 64-bit multiply-accumulator fitted into a custom-instruction template. The custom instruction takes two signed 32-bit integers, a and b, performs a signed multiplication with a 64-bit result, and accumulates that result. The design contains an internal 64-bit accumulator that must be properly initialized. Furthermore, since the custom instruction can return only 32 bits at a time, multiple operations are needed to retrieve the 64-bit result. Both of these requirements fit very well into the extended custom-instruction template. We define the following six cases of the n input.

dataa  datab  n          result        operation
A      B      1          NA            ACC = A * B
A      B      2          NA            ACC = ACC + A * B
NA     NA     3          lower(ACC)    returns ACC[31:0]
NA     NA     4          upper(ACC)    returns ACC[63:32]
NA     NA     15         32'h0ECE4530  test operation
NA     NA     all other  32'hDEADBEEF  test operation


Translating this table into Verilog is straightforward, although care must be taken to perform a signed multiplication. The signed multiplication differs from the unsigned multiplication when we need to keep the full precision of the result (that is, a 2k-bit result for two k-bit inputs).

For example, -1 * 1, computed as a 32-bit unsigned multiplication, would yield 64'h00000000FFFFFFFF, since -1 = 32'hFFFFFFFF and 1 = 32'h1. As a signed multiplication, the 64-bit result has to be properly sign-extended to 64'hFFFFFFFFFFFFFFFF. In Verilog, we can handle this by sign-extending the input operands before multiplying them.

wire [31:0] dataa;
wire [63:0] dataa_64_signed;

assign dataa_64_signed = { {32{dataa[31]}} , dataa };
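
The difference between the two multiplications can be checked on a host computer with a few lines of C. This is a sketch with hypothetical function names; the conversion from uint32_t to int32_t assumes the usual two's-complement behavior of mainstream compilers.

```c
#include <stdint.h>

/* 32 x 32 -> 64 unsigned multiplication: operands are zero-extended. */
uint64_t mul_unsigned_32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* 32 x 32 -> 64 signed multiplication: operands are sign-extended. */
int64_t mul_signed_32x32(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}

/* Mirror of the Verilog approach: sign-extend the raw 32-bit patterns
   to 64 bits, multiply, and keep the lower 64 bits of the product.   */
uint64_t mul_sign_extended(uint32_t a, uint32_t b)
{
    uint64_t a64 = (uint64_t)(int64_t)(int32_t)a; /* replicate bit 31 */
    uint64_t b64 = (uint64_t)(int64_t)(int32_t)b;
    return a64 * b64;            /* unsigned product wraps modulo 2^64 */
}
```

With a = 32'hFFFFFFFF and b = 32'h1, the unsigned version returns 0x00000000FFFFFFFF, while both signed versions produce the bit pattern 0xFFFFFFFFFFFFFFFF, that is, -1.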

The Verilog design of the custom instruction uses the multi-cycle interface (start and done), but every operation completes in a single clock cycle. It is written in a single-process style.

module macci(input  wire  clk,
             input  wire  clk_en,
             input  wire  reset,
             input  wire  [7:0]  n,
             input  wire  start,
             output reg   done,
             input  wire  [31:0] dataa,
             input  wire  [31:0] datab,
             output reg   [31:0] result);

  reg signed [63:0] acc;

  always @(posedge clk or posedge reset)
  begin
    if (reset)
      begin
        acc  <= 64'h0;
        done <= 1'b0;
      end
    else
      begin
        done <= 1'b0;
        if (clk_en & start & (n == 8'd1))
          begin
            acc  <= { {32{dataa[31]}},dataa} * { {32{datab[31]}},datab};
            done <= 1'b1;
          end
        else if (clk_en & start & (n == 8'd2))
          begin
            acc  <= acc + { {32{dataa[31]}},dataa} * { {32{datab[31]}},datab};
            done <= 1'b1;
          end
        else if (clk_en & start & (n == 8'd3))
          begin
            result <= acc[31:0];
            done   <= 1'b1;
          end
        else if (clk_en & start & (n == 8'd4))
          begin
            result <= acc[63:32];
            done   <= 1'b1;
          end
        else if (clk_en & start & (n == 8'hf))
          begin
            result <= 32'h0ECE4530;
            done   <= 1'b1;
          end
        else if (clk_en & start)
          begin
            result <= 32'hDEADBEEF;
            done   <= 1'b1;
          end
      end
  end

endmodule
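
A behavioral model in C is a convenient way to predict what the hardware should return for each value of n. The sketch below follows the table of operations; the names macci_state and macci_model are hypothetical, and the model ignores timing (clk, clk_en, start and done).

```c
#include <stdint.h>

typedef struct {
    uint64_t acc;     /* 64-bit accumulator, kept as a raw bit pattern */
    uint32_t result;  /* models the registered result output           */
} macci_state;

/* Execute one macci operation and return the value seen on result. */
uint32_t macci_model(macci_state *s, unsigned n,
                     int32_t dataa, int32_t datab)
{
    /* Signed 32 x 32 -> 64 multiplication, stored as a bit pattern so
       that the accumulation wraps around like the hardware register. */
    uint64_t product = (uint64_t)((int64_t)dataa * (int64_t)datab);

    switch (n & 0xFF) {
    case 1:  s->acc  = product;                      break;
    case 2:  s->acc += product;                      break;
    case 3:  s->result = (uint32_t)s->acc;           break;
    case 4:  s->result = (uint32_t)(s->acc >> 32);   break;
    case 15: s->result = 0x0ECE4530u;                break;
    default: s->result = 0xDEADBEEFu;                break;
    }
    return s->result;  /* unchanged for n = 1 and n = 2 */
}
```

For example, after an operation with n = 1, a = -2 and b = 3, the accumulator holds -6, so n = 3 returns 0xFFFFFFFA and n = 4 returns 0xFFFFFFFF.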

The integration of this custom instruction in Nios II goes through Platform Designer. The steps to integrate a new Verilog module as a custom instruction in the IP library are explained in Chapter 4 of Nios-II Custom Instruction Interface. A key aspect of the integration is to properly identify the signals on the Nios II Custom Instruction Slave Interface. Note that the Nios II Custom Instruction Interface has an embedded clock signal, and that it does not make use of the clock interface used by other components such as the Avalon Slaves.

Figure: The macci custom instruction in Platform Designer

The following sequence of commands can be used to build and download the bitstream, build the board support package and compile and download the application.

# In Nios-II Command Shell
quartus_sh --flow compile exampleniosci
nios2-configure-sof -d 2 exampleniosci.sof

nios2-bsp-editor
cd software
nios2-bsp-generate-files \
         --settings=hal_bsp/settings.bsp \
         --bsp-dir=hal_bsp
cd hal_bsp
make
cd ..

cd mac
nios2-app-generate-makefile \
          --bsp-dir=../hal_bsp \
          --src-files=mac.c \
          --elf-name=main.elf
make

# In second Nios-II Command Shell
nios2-terminal

# In first Nios-II Command Shell
nios2-download main.elf --go

The application software, shown next, uses the custom instruction through builtin functions that create the opcode for the custom instruction.

#include "system.h"
#include <stdio.h>

int main() {
  int a = -2;
  int b =  3;
  int rlo, rhi;
  int i;
  long long r;

  // test
  rlo = __builtin_custom_inii (15, a, b); 
  printf("Expect 0x0ECE4530: %8x\n", rlo);
  rlo = __builtin_custom_inii (14, a, b); 
  printf("Expect 0xDEADBEEF: %8x\n", rlo);

  // initialize: MAC = A * B
  rlo = __builtin_custom_inii (1, a, b); 
    
  for (i=0; i<20; i++) {
      // read result
      rlo = __builtin_custom_inii (3, a, b); 
      rhi = __builtin_custom_inii (4, a, b);
      printf("%6d HI %6x LO %6x -> ", i, rhi, rlo);
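      // combine the two 32-bit halves without pointer type-punning
      r = (long long) (((unsigned long long) (unsigned int) rhi << 32) | (unsigned int) rlo);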
      printf("%lld\n", r);

      // MAC: MAC = MAC + A * B
      rlo = __builtin_custom_inii (2, a, b); 
  }

  return 0;
}