Lecture 13 - The Nios II Custom-Instruction Interface
Introduction
So far, we introduced the basics of SoC design using Nios II and the Platform Designer tool in Quartus. We gave a broad overview of on-chip buses and the ensemble of tricks that are used to speed up bus transfers. To attach coprocessor hardware to the SoC, we identified three possible locations in the interconnect fabric. First, we can use a memory-mapped coprocessor attached to the on-chip bus. Second, we can use a tightly-coupled coprocessor attached to a processor-specific local bus. Third, we can hack the processor and modify the processor internals to add new instructions that do precisely what we want.
The Nios II custom-instruction interface fits in the second category. Even though the ‘custom instruction’ terminology may suggest that we are modifying the Nios II internals (i.e., strategy 3 above), this is not the case. The Nios II custom instruction hardware is attached to a specialized local bus. The Nios II processor uses an opcode reserved specifically for this bus.
Custom-instruction design is a convenient coprocessor design strategy that offers an easy software integration: the software does all the work of fetching data from memory and returning it to memory, while the custom-instruction still offers a customizable datapath. The objectives of this lecture are to learn about the basics of the Nios II custom-instruction interface, and to use it in a few hardware acceleration experiments.
The Nios-II Custom Instruction Interface is documented in detail. The following is a summary of the features.
Nios-II Custom Instruction Interface
The Nios-II Custom Instruction Interface enables calling hardware modules that fit a two-operand, one-result model. A single Nios-II core has a single custom-instruction master, and thus can support a single custom instruction.
However, the custom-instruction interface can accept additional parameters, such that it is possible to support multiple variants of a hardware function. These variants include combinational instructions, multi-cycle instructions, extended instructions and internal register-file instructions. All of them, however, use the same basic hardware interface: the Nios II Custom Instruction Slave interface.
Hardware Interface
Figure: Nios II Custom Instruction Slave Interface
A combinational custom instruction accepts two arguments, dataa and datab, from the Nios II. It returns the result of the custom instruction through result. No other signals are needed, and the custom instruction is expected to complete within one clock cycle.
A multi-cycle custom instruction accepts a clock signal clk, a clock-enable signal clk_en, and a start signal start. The start input is asserted when new operands are available on the data inputs dataa and datab. Next, the instruction can continue over an arbitrary number of clock cycles, while the Nios II processor is stalled. When the custom instruction completes, the result is presented on the result output and the done output is asserted for one clock cycle.
The clk_en signal must gate the update of every internal register in the custom instruction. If the Nios II has to stall for a reason other than the custom instruction, it will de-assert clk_en, which stalls the custom instruction as well.
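The handshake described above can be sketched as a small cycle-level software model. This is only an illustration under stated assumptions: the names ci_model, ci_clock, ci_run and CI_LATENCY are hypothetical, the operation (an addition) and the three-cycle latency are arbitrary, and the model assumes that start stays asserted until it is sampled with clk_en high.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a multi-cycle custom instruction.
 * Assumed behavior: result = dataa + datab after CI_LATENCY enabled edges. */
#define CI_LATENCY 3

typedef struct {
    uint32_t a, b;      /* operands captured when start is sampled */
    int      remaining; /* enabled clock edges left before completion */
    bool     done;      /* completion pulse */
    uint32_t result;    /* registered result */
} ci_model;

/* One rising clock edge. No register changes while clk_en is low. */
void ci_clock(ci_model *ci, bool clk_en, bool start,
              uint32_t dataa, uint32_t datab)
{
    if (!clk_en)
        return;                       /* processor stalled: hold all state */
    ci->done = false;                 /* done lasts one enabled cycle */
    if (start) {
        ci->a = dataa;
        ci->b = datab;
        ci->remaining = CI_LATENCY - 1;
    } else if (ci->remaining > 0) {
        ci->remaining--;
        if (ci->remaining == 0) {
            ci->result = ci->a + ci->b;
            ci->done = true;
        }
    }
}

/* Runs one instruction and returns the number of clock edges until done.
 * Bit i of stall_mask set means clk_en is low on clock edge i. */
int ci_run(ci_model *ci, uint32_t dataa, uint32_t datab, uint32_t stall_mask)
{
    int  edges = 0;
    bool start = true;

    ci->done = false;
    ci->remaining = 0;
    while (!ci->done && edges < 32) {
        bool clk_en = !((stall_mask >> edges) & 1u);
        ci_clock(ci, clk_en, start, dataa, datab);
        if (clk_en)
            start = false;            /* start was consumed on this edge */
        edges++;
    }
    return edges;
}
```

Without stalls the model completes in three clock edges. A single stalled edge (stall_mask = 0x2) stretches the same instruction to four edges, because every register, including the cycle counter, holds its value while clk_en is low.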
Both combinational and multi-cycle custom instructions can use an additional argument n, turning them into extended instructions. The n argument is a static (compile-time) parameter between 0 and 255. It is used as an additional function qualifier for the custom instruction. For example, a custom instruction could offer three different computations for the input arguments. The n argument could then be used to select one of the three computations.
The Nios II works with a register file of 32 registers, and the source and destination register indices of the custom instruction are offered through the fields a (source), b (source) and c (destination). In addition, the instruction carries three control bits, readra, readrb and writerc, that select whether the register address fields a, b and c refer to the Nios II register file or not. If a control bit is asserted, then the corresponding source or destination is the Nios II register file. If a control bit is not asserted, then it is up to the custom instruction to give meaning to a, b and c. In that case, the custom instruction is expected to provide its own internal register file, and to use the addressing indices a, b and c to control that internal register file.
Most interface signals are optional. At the minimum, we use dataa, datab and result. For multi-cycle instructions we add clk, clk_en, reset, start and done. For extended variants of either of these two groups, we add n. And finally, when register index fields are needed, we also include a, b, c, readra, readrb and writerc.
Software Interface
A custom-instruction is a Nios-II R-type instruction, one of three different kinds of instructions on the Nios-II. An R-type instruction takes both operands and results from registers.
The lower 6 bits of the instruction word are fixed to 0x32 to mark it as a custom instruction. The N field holds the 8 bits of the n input for extended instructions. The A, B and C fields hold the register indices of the source and destination registers. These fields are wired to the a, b and c ports of the custom instruction. Finally, readra, readrb and writerc can be used for internal register file operations in the custom-instruction hardware. When the sources and the destination of the custom instruction are in the Nios II register file, readra, readrb and writerc will always be asserted.
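As a rough illustration of how these fields combine into one 32-bit instruction word, the following sketch packs them in software. The helper name ci_encode and the macro CI_OPCODE are hypothetical. The bit positions (A in bits 31..27, B in 26..22, C in 21..17, readra, readrb and writerc in bits 16, 15 and 14, N in 13..6, opcode in 5..0) are an assumption based on the Nios II custom-instruction word format and should be checked against the processor reference.

```c
#include <assert.h>
#include <stdint.h>

#define CI_OPCODE 0x32u  /* fixed 6-bit opcode of a custom instruction */

/* Hypothetical helper: packs the custom-instruction fields into a word.
 * Assumed layout, MSB to LSB: A, B, C, readra, readrb, writerc, N, OP. */
uint32_t ci_encode(unsigned a, unsigned b, unsigned c,
                   unsigned readra, unsigned readrb, unsigned writerc,
                   unsigned n)
{
    assert(a < 32 && b < 32 && c < 32);              /* 5-bit register indices */
    assert(readra < 2 && readrb < 2 && writerc < 2); /* 1-bit control flags */
    assert(n < 256);                                 /* 8-bit selector */

    return ((uint32_t)a << 27) | ((uint32_t)b << 22) | ((uint32_t)c << 17) |
           ((uint32_t)readra << 16) | ((uint32_t)readrb << 15) |
           ((uint32_t)writerc << 14) | ((uint32_t)n << 6) | CI_OPCODE;
}
```

In practice the compiler builds this word from the builtin call, so a helper like this is only useful for checking a disassembly by hand.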
Because the opcode has user-defined parts, the opcode has to be specially formed by the compiler. The Nios II compiler provides a collection of builtin functions that can be used to create these opcodes easily. The format for these builtin functions is as follows.
RETURN_TYPE __builtin_custom_RETURNnPARAMETER(n, PARAMETER A, PARAMETER B);
For example, the following is a custom instruction that uses two integer inputs a and b, and that returns an integer c.
int __builtin_custom_inii(int n, int a, int b);
There are several variants of these instructions, depending on the type of the operands and the result. All variants are listed in the Nios-II Custom Instruction Interface.
When a board support package is available, the system.h file will provide an additional C macro wrapper for the custom instruction.
Example 1: A basic byte-merge instruction
A first example of a hardware design using custom instructions is a byte-merge. Given two 32-bit integers A and B, the custom instruction creates a new integer C which consists of the upper two bytes of A and the lower two bytes of B. This directly fits a custom-instruction template.
module bitmergeci(input  wire [31:0] dataa,
                  input  wire [31:0] datab,
                  output wire [31:0] result);

  // upper two bytes of dataa, lower two bytes of datab
  assign result = {dataa[31:16], datab[15:0]};

endmodule
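A software reference for the same operation is handy when checking the hardware result. The sketch below is one way to write it in C; the function name bitmerge_ref is hypothetical and is not part of the example repository.

```c
#include <stdint.h>

/* Hypothetical software reference for the byte-merge:
 * upper two bytes of a combined with the lower two bytes of b. */
uint32_t bitmerge_ref(uint32_t a, uint32_t b)
{
    return (a & 0xFFFF0000u) | (b & 0x0000FFFFu);
}
```

For a = 0xAAAAAAAA and b = 0x55555555 this reference returns 0xAAAA5555.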
We discuss how to integrate this custom instruction in Quartus. Download the example repository example-nios-ci to obtain the starter files.
git clone https://github.com/vt-ece4530-f19/example-nios-ci
Open the design in Quartus, and run the Platform Designer tool.
The IP catalog shows two custom instructions, bitmergeci and macci. Only one of these two can be used per Nios II, and the platform needs to be re-generated when you swap them out.
The bitmergeci instruction must be attached to the Nios II Custom-Instruction Master.
After generating the implementation (Generate HDL …) of the platform, compile the design in Quartus to obtain a bitstream. Alternately you can use
the command line:
quartus_sh --flow compile exampleniosci
Next, generate the board support package. Since this platform has only 32K of memory, select the ‘small driver’ options as done previously with the example-nios-timer example (https://github.com/vt-ece4530-f19/example-nios-timer).
nios2-bsp-editor
cd software
nios2-bsp-generate-files.exe \
--settings=hal_bsp/settings.bsp \
--bsp-dir=hal_bsp
cd hal_bsp
make
The software driver for the custom instruction applies a test pattern and reads the result. Note that we are making use of the C macro generated in system.h to call the custom instruction.
#include "system.h"
#include <stdio.h>

int main() {
  int a = 0xAAAAAAAA;
  int b = 0x55555555;
  int r;

  // ALT_CI_BITMERGECI_0 is the macro wrapper generated in system.h
  r = ALT_CI_BITMERGECI_0(a, b);
  printf("%x %x %x\n", a, b, r);

  return 0;
}
To compile and run the software, use the same strategy as before. Compile the software, run a second Nios II command shell to open a nios2-terminal, and download the executable onto the Nios II. Note that this example contains a slightly different organization for the software, as there will be two software applications for the same example repository.
cd software/bitmerge
nios2-app-generate-makefile.exe \
--bsp-dir=../hal_bsp \
--src-files=bitmerge.c \
--elf-name=bitmerge.elf
make
# In the second Nios II Command Shell
nios2-terminal
# In the first Nios II Command Shell
nios2-download bitmerge.elf --go
The custom instructions generated can be inspected in the bitmerge.objdump file generated out of the software compilation flow.
0000805c <main>:
...
8084: e0bffd17 ldw r2,-12(fp)
8088: e0fffe17 ldw r3,-8(fp)
808c: 10c5c032 custom 0,r2,r2,r3
8090: e0bfff15 stw r2,-4(fp)
...
80c0: f800283a ret
Breaking down the opcode 10c5c032 = {5'h2, 5'h3, 5'h2, 1'h1, 1'h1, 1'h1, 8'h0, 6'h32}, we find that the bitfields for this custom instruction are defined as follows:
- OP = 6'h32, the opcode for a custom instruction
- N = 8'h0, the N parameter (unused in this case)
- writerc = 1'h1, the destination is the Nios-II register file
- readrb = 1'h1, operand B comes from the Nios-II register file
- readra = 1'h1, operand A comes from the Nios-II register file
- C = 5'h2, the destination register index
- B = 5'h3, the operand B register index
- A = 5'h2, the operand A register index
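The same breakdown can be reproduced with a few shifts and masks. The following is a minimal sketch; the type ci_fields and the function ci_decode are hypothetical names, and the field widths and order are taken from the concatenation above (A, B, C, readra, readrb, writerc, N, OP from most significant to least significant bit).

```c
#include <stdint.h>

/* Hypothetical container for the fields of a custom-instruction word. */
typedef struct {
    unsigned a, b, c;                 /* 5-bit register indices */
    unsigned readra, readrb, writerc; /* 1-bit register-file selects */
    unsigned n;                       /* 8-bit function selector */
    unsigned op;                      /* 6-bit opcode */
} ci_fields;

/* Splits a 32-bit instruction word into its custom-instruction fields. */
ci_fields ci_decode(uint32_t word)
{
    ci_fields f;
    f.a       = (word >> 27) & 0x1Fu;
    f.b       = (word >> 22) & 0x1Fu;
    f.c       = (word >> 17) & 0x1Fu;
    f.readra  = (word >> 16) & 0x1u;
    f.readrb  = (word >> 15) & 0x1u;
    f.writerc = (word >> 14) & 0x1u;
    f.n       = (word >> 6)  & 0xFFu;
    f.op      = word & 0x3Fu;
    return f;
}
```

Applied to 0x10c5c032, the decoder gives op = 0x32, n = 0, all three control bits set, c = 2, b = 3 and a = 2, which agrees with the disassembly custom 0,r2,r2,r3.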
Example 2: An extended multiply-accumulate instruction
A second example demonstrates a signed 64-bit multiply-accumulator fitted into a custom-instruction template. The custom instruction accepts two signed 32-bit integers, a and b, performs a signed multiplication with a 64-bit product, and accumulates the signed result. The design contains an internal 64-bit accumulator that must be properly initialized. Furthermore, since the custom instruction can return only 32 bits at a time, multiple operations are needed to retrieve the 64-bit result. Both of these cases fit very well into the extended custom-instruction template. We define the following six cases of the n input.
dataa | datab | n | result | operation |
---|---|---|---|---|
A | B | 1 | NA | ACC = A * B |
A | B | 2 | NA | ACC = ACC + A * B |
NA | NA | 3 | lower(ACC) | returns ACC |
NA | NA | 4 | upper(ACC) | returns ACC |
NA | NA | 15 | 32'h0ECE4530 | test operation |
NA | NA | all other | 32'hDEADBEEF | test operation |
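The table can also be read as a small behavioral model. The sketch below captures the six cases in C; the names macci_model and macci_exec are hypothetical, and returning 0 for the cases marked NA is an arbitrary choice of this model, not a property of the hardware.

```c
#include <stdint.h>

/* Hypothetical behavioral model of the multiply-accumulate instruction. */
typedef struct {
    uint64_t acc;   /* 64-bit accumulator, stored unsigned so it wraps */
} macci_model;

/* Executes one operation selected by n and returns the 32-bit result.
 * For n = 1 and n = 2 the table lists the result as NA; 0 is returned here. */
uint32_t macci_exec(macci_model *m, uint8_t n, int32_t dataa, int32_t datab)
{
    /* signed 32 x 32 -> 64-bit product */
    uint64_t product = (uint64_t)((int64_t)dataa * (int64_t)datab);

    switch (n) {
    case 1:  m->acc = product;               return 0;
    case 2:  m->acc += product;              return 0;
    case 3:  return (uint32_t)(m->acc & 0xFFFFFFFFu);  /* lower(ACC) */
    case 4:  return (uint32_t)(m->acc >> 32);          /* upper(ACC) */
    case 15: return 0x0ECE4530u;                       /* test operation */
    default: return 0xDEADBEEFu;                       /* test operation */
    }
}
```

With A = -2 and B = 3, operation 1 followed by operations 3 and 4 yields 0xFFFFFFFA and 0xFFFFFFFF, the two halves of -6. One further operation 2 brings the accumulator to -12.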
Translating this specification into Verilog is straightforward, although care is needed to perform a signed multiplication. The signed multiplication differs from the unsigned multiplication when we need to keep the full precision of the result (that is, a 2k-bit result for two k-bit inputs).
For example, -1 * 1, computed as a 32-bit unsigned multiplication, would compute 64'h00000000FFFFFFFF, since -1 = 32'hFFFFFFFF and 1 = 32'h1. The correct signed result, -1, is 64'hFFFFFFFFFFFFFFFF.
As a signed multiplication, the 64-bit result has to be properly sign-extended. In Verilog, we can handle this by sign-extending the input operands before multiplying them.
wire [31:0] dataa;
wire [63:0] dataa_64_signed;
assign dataa_64_signed = { {32{dataa[31]}} , dataa };
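The difference between the two multiplications is easy to reproduce in software. In this sketch the function names mul32_unsigned and mul32_signed are hypothetical; the cast of a signed 32-bit value to int64_t performs the same sign extension as the Verilog concatenation above.

```c
#include <stdint.h>

/* Full-precision unsigned product: operands are zero-extended to 64 bits. */
uint64_t mul32_unsigned(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Full-precision signed product: operands are sign-extended to 64 bits. */
int64_t mul32_signed(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}
```

With the bit pattern 0xFFFFFFFF as the first operand and 1 as the second, the unsigned version returns 0x00000000FFFFFFFF, while the signed version returns -1, that is 0xFFFFFFFFFFFFFFFF.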
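The Verilog design of the custom instruction completes every operation in a single clock cycle after start, and is written in a single-process style. Because the accumulator and the result are registered, the module uses the multi-cycle signals clk, clk_en, reset, start and done.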
module macci(input  wire        clk,
             input  wire        clk_en,
             input  wire        reset,
             input  wire [7:0]  n,
             input  wire        start,
             output reg         done,
             input  wire [31:0] dataa,
             input  wire [31:0] datab,
             output reg  [31:0] result);

  reg signed [63:0] acc;

  always @(posedge clk or posedge reset)
    begin
      if (reset)
        begin
          acc  <= 64'h0;
          done <= 1'b0;
        end
      else begin
        done <= 1'b0;
        if (clk_en & start & (n == 8'd1))
          begin
            // ACC = A * B, operands sign-extended to 64 bits
            acc  <= { {32{dataa[31]}},dataa} * { {32{datab[31]}},datab};
            done <= 1'b1;
          end
        else if (clk_en & start & (n == 8'd2))
          begin
            // ACC = ACC + A * B
            acc  <= acc + { {32{dataa[31]}},dataa} * { {32{datab[31]}},datab};
            done <= 1'b1;
          end
        else if (clk_en & start & (n == 8'd3))
          begin
            result <= acc[31:0];
            done   <= 1'b1;
          end
        else if (clk_en & start & (n == 8'd4))
          begin
            result <= acc[63:32];
            done   <= 1'b1;
          end
        else if (clk_en & start & (n == 8'hf))
          begin
            result <= 32'h0ECE4530;
            done   <= 1'b1;
          end
        else if (clk_en & start)
          begin
            result <= 32'hDEADBEEF;
            done   <= 1'b1;
          end
      end
    end

endmodule
The integration of this custom instruction in Nios II goes through Platform Designer. The steps to integrate a new Verilog module as a custom instruction in the IP library are explained in Chapter 4 of Nios-II Custom Instruction Interface. A key aspect of the integration is to properly identify the signals on the Nios II Custom Instruction Slave Interface. Note that the Nios II Custom Instruction Interface has an embedded clock signal, and that it does not make use of the clock interface used by other components such as the Avalon Slaves.
The following sequence of commands can be used to build and download the bitstream, build the board support package and compile and download the application.
# In Nios-II Command Shell
quartus_sh --flow compile exampleniosci
nios2-configure-sof -d 2 exampleniosci.sof
nios2-bsp-editor
cd software
nios2-bsp-generate-files \
--settings=hal_bsp/settings.bsp \
--bsp-dir=hal_bsp
cd hal_bsp
make
cd ..
cd mac
nios2-app-generate-makefile \
--bsp-dir=../hal_bsp \
--src-files=mac.c \
--elf-name=main.elf
make
# In second Nios-II Command Shell
nios2-terminal
# In first Nios-II Command Shell
nios2-download main.elf --go
The application software, shown next, uses the custom instruction through builtin functions that create the opcode for the custom instruction.
#include "system.h"
#include <stdio.h>

int main() {
  int a = -2;
  int b = 3;
  int rlo, rhi;
  int i;
  long long r;

  // test
  rlo = __builtin_custom_inii (15, a, b);
  printf("Expect 0x0ECE4530: %8x\n", rlo);
  rlo = __builtin_custom_inii (14, a, b);
  printf("Expect 0xDEADBEEF: %8x\n", rlo);

  // initialize: MAC = A * B
  rlo = __builtin_custom_inii (1, a, b);

  for (i=0; i<20; i++) {
    // read result
    rlo = __builtin_custom_inii (3, a, b);
    rhi = __builtin_custom_inii (4, a, b);
    printf("%6d HI %6x LO %6x -> ", i, rhi, rlo);

    // combine both halves into one 64-bit value without pointer casts
    r = (long long) (((unsigned long long) (unsigned int) rhi << 32) |
                     (unsigned int) rlo);
    printf("%lld\n", r);

    // MAC: MAC = MAC + A * B
    rlo = __builtin_custom_inii (2, a, b);
  }

  return 0;
}