Lecture 20 - Cordic Coprocessor
- Introduction
- The CORDIC Reference Implementation
- CORDIC Custom Instruction Design
- Building the reference implementation
- Extra: Controlling where to place a variable in memory
Introduction
We will discuss the implementation of a CORDIC coprocessor. This lecture is based on Chapter 15 of the Codesign Book.
The example repository for this lecture is example-nios-sdram-cordic.
You can find a list of previous codesign challenges at the following webpage: https://rijndael.ece.vt.edu/schaum/teaching/4530/
The CORDIC Reference Implementation
Here is a 20-bit CORDIC reference implementation in C, computed using fix<32,28>.
#define PI 843314856
#define AG_CONST 163008218
static const int angles[] = {
210828714,
124459457,
65760959,
33381289,
16755421,
8385878,
4193962,
2097109,
1048570,
524287,
262143,
131071,
65535,
32767,
16383,
8191,
4095,
2047,
1024,
511 };
void golden_cordic(int target, int *rX, int *rY) {
int X, Y, T, current;
unsigned step;
X = AG_CONST;
Y = 0;
current = 0;
for(step=0; step < 20; step++) {
if (target > current) {
T = X - (Y >> step);
Y = (X >> step) + Y;
X = T;
current += angles[step];
} else {
T = X + (Y >> step);
Y = -(X >> step) + Y;
X = T;
current -= angles[step];
}
}
*rX = X;
*rY = Y;
}
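The fix<32,28> notation means a 32-bit signed integer whose lower 28 bits are fractional, so the real value is the integer divided by 2^28. The following is a minimal sketch of that conversion; the helper names `to_fix28` and `from_fix28` and the macro `FIX28_ONE` are assumptions for illustration and are not part of the reference code.

```c
/* fix<32,28>: 32-bit signed value with 28 fractional bits.
 * FIX28_ONE, to_fix28 and from_fix28 are hypothetical helper names. */
#define FIX28_ONE 268435456.0   /* 2^28 as a double */

/* Convert a real number to fix<32,28>, truncating toward zero. */
int to_fix28(double x) {
    return (int)(x * FIX28_ONE);
}

/* Convert a fix<32,28> value back to a real number. */
double from_fix28(int v) {
    return (double)v / FIX28_ONE;
}
```

With these helpers, `to_fix28(3.14159265358979323846)` gives 843314856, the value of `PI` above, and `from_fix28(210828714)` is approximately 0.785398, the first entry of `angles`.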
The table of constants follows from the CORDIC design. The angles are fix<32,28> representations of atan(1/(1 << i)), for i going from 0 to 19.
AG_CONST is a normalization factor equal to the product of cos(atan(1/(1 << i))), for i going from 0 to 19.
A few table entries differ from the constants in the code by one least-significant bit.
n | alpha | alpha<32,28> | 1/cos(alpha) | PROD | 1/PROD | (1/PROD)<32,28> |
---|---|---|---|---|---|---|
0 | 0.785398 | 210828714 | 1.414214 | 1.414214 | 0.707107 | 189812531 |
1 | 0.463648 | 124459457 | 1.118034 | 1.581139 | 0.632456 | 169773489 |
2 | 0.244979 | 65760959 | 1.030776 | 1.629801 | 0.613572 | 164704477 |
3 | 0.124355 | 33381290 | 1.007782 | 1.642484 | 0.608834 | 163432609 |
4 | 0.062419 | 16755422 | 1.001951 | 1.645689 | 0.607648 | 163114337 |
5 | 0.031240 | 8385879 | 1.000488 | 1.646492 | 0.607352 | 163034749 |
6 | 0.015624 | 4193963 | 1.000122 | 1.646693 | 0.607278 | 163014851 |
7 | 0.007812 | 2097109 | 1.000031 | 1.646744 | 0.607259 | 163009877 |
8 | 0.003906 | 1048571 | 1.000008 | 1.646756 | 0.607254 | 163008633 |
9 | 0.001953 | 524287 | 1.000002 | 1.646759 | 0.607253 | 163008322 |
10 | 0.000977 | 262144 | 1.000000 | 1.646760 | 0.607253 | 163008244 |
11 | 0.000488 | 131072 | 1.000000 | 1.646760 | 0.607253 | 163008225 |
12 | 0.000244 | 65536 | 1.000000 | 1.646760 | 0.607253 | 163008220 |
13 | 0.000122 | 32768 | 1.000000 | 1.646760 | 0.607253 | 163008219 |
14 | 0.000061 | 16384 | 1.000000 | 1.646760 | 0.607253 | 163008219 |
15 | 0.000031 | 8192 | 1.000000 | 1.646760 | 0.607253 | 163008219 |
16 | 0.000015 | 4096 | 1.000000 | 1.646760 | 0.607253 | 163008219 |
17 | 0.000008 | 2048 | 1.000000 | 1.646760 | 0.607253 | 163008219 |
18 | 0.000004 | 1024 | 1.000000 | 1.646760 | 0.607253 | 163008219 |
19 | 0.000002 | 512 | 1.000000 | 1.646760 | 0.607253 | 163008219 |
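The columns of the table follow from the standard CORDIC rotation. As a sketch in conventional notation (the symbols $x_i$, $y_i$, $d_i$ and $K$ are introduced here and do not appear in the reference code):

```latex
\begin{aligned}
x_{i+1} &= x_i - d_i \, 2^{-i} \, y_i, \\
y_{i+1} &= y_i + d_i \, 2^{-i} \, x_i, \qquad d_i \in \{+1, -1\}, \\
\alpha_i &= \arctan\left(2^{-i}\right), \\
\frac{1}{\cos \alpha_i} &= \sqrt{1 + 2^{-2i}}, \\
K &= \prod_{i=0}^{19} \cos \alpha_i
   = \prod_{i=0}^{19} \frac{1}{\sqrt{1 + 2^{-2i}}} \approx 0.607253 .
\end{aligned}
```

Each step rotates the vector by $\pm\alpha_i$ and stretches it by $1/\cos\alpha_i$, which is the 1/cos(alpha) column. PROD is the running product of these stretch factors, and 1/PROD after 20 steps is $K$, the value stored in AG_CONST. Starting from $X = K$, $Y = 0$ therefore yields a vector of length close to 1, so that $X$ approximates the cosine and $Y$ the sine of the target angle.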
CORDIC Custom Instruction Design
The following is a Verilog design for a custom-instruction implementation of a 20-bit CORDIC. The instruction set is defined as follows.
dataa | datab | n | result | operation |
---|---|---|---|---|
Angle | NA | 0 | NA | Load Angle, start rotation |
NA | NA | 1 | X | X = cos(Angle) |
NA | NA | 2 | Y | Y = sin(Angle) |
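To make the three-operation protocol concrete, here is a host-side behavioral sketch in C. It is not the hardware and not the Nios II calling interface: the function `cordic_ci_model` and the variables `ci_X`, `ci_Y` and `ci_angles` are hypothetical names, and the model finishes all 20 rotations within the n = 0 call instead of taking multiple clock cycles.

```c
/* Hypothetical host-side model of the CORDIC custom instruction.
 * cordic_ci_model, ci_X, ci_Y and ci_angles are illustrative names. */

/* Rotation angles atan(2^-i) in fix<32,28>, copied from the reference code. */
static const int ci_angles[20] = {
    210828714, 124459457, 65760959, 33381289, 16755421,
    8385878, 4193962, 2097109, 1048570, 524287,
    262143, 131071, 65535, 32767, 16383,
    8191, 4095, 2047, 1024, 511 };

/* X and Y registers: they keep their values between instructions. */
static int ci_X, ci_Y;

/* One custom-instruction call: n selects the operation, dataa is the operand. */
int cordic_ci_model(int n, int dataa) {
    int step, current, t;
    switch (n) {
    case 0:                        /* load angle, run all 20 rotations */
        ci_X = 163008218;          /* AG_CONST */
        ci_Y = 0;
        current = 0;
        for (step = 0; step < 20; step++) {
            /* >> on a negative int is assumed to be an arithmetic shift */
            if (dataa > current) {
                t = ci_X - (ci_Y >> step);
                ci_Y = ci_Y + (ci_X >> step);
                current += ci_angles[step];
            } else {
                t = ci_X + (ci_Y >> step);
                ci_Y = ci_Y - (ci_X >> step);
                current -= ci_angles[step];
            }
            ci_X = t;
        }
        return 0;                  /* result is not used for n = 0 */
    case 1:  return ci_X;          /* X = cos(Angle) */
    case 2:  return ci_Y;          /* Y = sin(Angle) */
    default: return 0;
    }
}
```

For example, calling `cordic_ci_model(0, 140552476)` (the angle PI/6 in fix<32,28>) followed by calls with n = 1 and n = 2 should return values close to 232471924 and 134217728, the fix<32,28> encodings of cos(30°) and sin(30°). On the target, the equivalent calls would go through the custom-instruction macros that the BSP generates; their names depend on how the component is named in the system, so they are not shown here.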
module cordicci(input wire clk,
input wire clk_en,
input wire reset,
input wire [7:0] n,
input wire start,
output reg done,
input wire [31:0] dataa,
input wire [31:0] datab,
output reg [31:0] result);
reg signed [31:0] X, X_next;
reg signed [31:0] Y, Y_next;
reg signed [31:0] A, A_next;
reg signed [31:0] T, T_next;
reg [4:0] cnt, cnt_next;
always @(posedge clk)
begin
X <= reset ? 32'h0 : (clk_en ? X_next : X);
Y <= reset ? 32'h0 : (clk_en ? Y_next : Y);
A <= reset ? 32'h0 : (clk_en ? A_next : A);
T <= reset ? 32'h0 : (clk_en ? T_next : T);
cnt <= reset ? 5'h0 : (clk_en ? cnt_next : cnt);
end
// state machine
localparam
Sidle = 0,
SX = 1,
SY = 2,
Srot0 = 3,
Srot1 = 4,
Srot2 = 5;
reg [2:0] state, state_next;
always @(posedge clk or posedge reset)
state <= reset ? Sidle : (clk_en ? state_next : state);
reg [31:0] angle;
always @(*)
begin
angle = 32'd0;
case (cnt)
5'd0: angle = 32'd210828714;
5'd1: angle = 32'd124459457;
5'd2: angle = 32'd65760959;
5'd3: angle = 32'd33381289;
5'd4: angle = 32'd16755421;
5'd5: angle = 32'd8385878;
5'd6: angle = 32'd4193962;
5'd7: angle = 32'd2097109;
5'd8: angle = 32'd1048570;
5'd9: angle = 32'd524287;
5'd10: angle = 32'd262143;
5'd11: angle = 32'd131071;
5'd12: angle = 32'd65535;
5'd13: angle = 32'd32767;
5'd14: angle = 32'd16383;
5'd15: angle = 32'd8191;
5'd16: angle = 32'd4095;
5'd17: angle = 32'd2047;
5'd18: angle = 32'd1024;
5'd19: angle = 32'd511;
default: angle = 32'd0;
endcase
end
always @(*)
begin
done = 0;
state_next = state; // default: hold the current state (avoids an inferred latch)
result = 32'd0;
cnt_next = cnt;
X_next = X;
Y_next = Y;
A_next = A;
T_next = T;
case (state)
Sidle: if (start)
state_next = (n == 8'd0) ? Srot0 :
(n == 8'd1) ? SX :
(n == 8'd2) ? SY :
Sidle;
SX: begin
result = X;
done = 1;
state_next = Sidle;
end
SY: begin
result = Y;
done = 1;
state_next = Sidle;
end
Srot0: begin
cnt_next = 0;
X_next = 32'd163008218;
Y_next = 32'd0;
T_next = dataa;
A_next = 32'd0;
state_next = Srot1;
end
Srot1: begin
cnt_next = cnt + 1'd1;
Y_next = (T > A) ? (X >>> cnt) + Y : Y - (X >>> cnt);
X_next = (T > A) ? X - (Y >>> cnt) : X + (Y >>> cnt);
A_next = (T > A) ? A + angle : A - angle;
state_next = (cnt == 5'd19) ? Srot2 : Srot1;
end
Srot2: begin
done = 1;
state_next = Sidle;
end
default: begin
state_next = Sidle;
end
endcase
end
endmodule
Building the reference implementation
Download the repository
(Cygwin)
git clone https://github.com/vt-ece4530-f19/example-nios-sdram-cordic
Simulate the custom-instruction hardware
(Cygwin)
cd example-nios-sdram-cordic
cd hardware-verification
vlib work
vlog ../cordicci.v
vlog ../cordiccitb.v
modelsim work.cordiccitb
Implement the hardware
(Nios2 Command Shell)
cd example-nios-sdram-cordic
quartus_sh --flow compile example-nios-sdram
Implement the software
(Nios2 Command Shell)
cd example-nios-sdram-cordic
nios2-bsp-editor
cd software
nios2-bsp-generate-files \
--settings=hal_bsp/settings.bsp \
--bsp-dir=hal_bsp
cd hal_bsp
make
Remember to connect both timers (to the system clock source and the timestamp timer source, respectively). Remember to disable the alt_load facility.
Verify the mapping of linker sections to memories.
Compile the application
(Nios2 Command Shell)
cd example-nios-sdram-cordic
cd software
cd cordic
nios2-app-generate-makefile \
--bsp-dir ../hal_bsp \
--elf-name main.elf \
--src-files main.c
make
Run the application
(Nios2 Command Shell)
cd example-nios-sdram-cordic
nios2-configure-sof -d 2 exampleniossdram.sof
nios2-terminal
(Nios2 Command Shell)
cd example-nios-sdram-cordic
cd software
cd cordic
nios2-download main.elf --go
This produces the following output:
$ nios2-terminal.exe
nios2-terminal: connected to hardware target using JTAG UART on cable
nios2-terminal: "DE-SoC [USB-1]", device 2, instance 0
nios2-terminal: (Use the IDE stop button or Ctrl-C to terminate)
Starting Cordic Measurement
Overhead correction: 74151
Software Cordic Cycles: 9703222
Hardware Cordic Cycles: 471958
Errors: 0
Sum_abs_error: 0
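The two cycle counts give the ratio between the software and hardware versions. The following small sketch computes that ratio; the function name `cycle_ratio` is an assumption, and the calculation uses the printed counts as they are, without making any assumption about how the overhead correction was applied.

```c
/* Ratio of software cycles to hardware cycles.
 * cycle_ratio is a hypothetical helper name. */
double cycle_ratio(unsigned long sw_cycles, unsigned long hw_cycles) {
    return (double)sw_cycles / (double)hw_cycles;
}
```

For the run shown above, `cycle_ratio(9703222UL, 471958UL)` is about 20.56, so the custom instruction is roughly 20 times faster than the software CORDIC in this measurement.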
Extra: Controlling where to place a variable in memory
The location of a variable in memory affects the speed at which it is accessed.
Consider the readram software application.
It performs 10000 writes and 10000 reads of an array allocated in memory.
#include <system.h>
#include <stdio.h>
#include "sys/alt_timestamp.h"
#define ARRAYSIZE 10000
// volatile unsigned int __attribute__((section (".localmem"))) a[ARRAYSIZE];
volatile unsigned int a[ARRAYSIZE];
int main() {
register unsigned int i;
register int c = 0; // initialized so that the read loop accumulates from a defined value
unsigned long ticks;
unsigned long overhead;
alt_timestamp_start();
printf("Timer Frequency %lu\n", alt_timestamp_freq());
// Note: the inner loops reuse i, so each outer loop (here and below) runs only once.
for (i=0; i<9; i++) {
ticks = alt_timestamp();
for (i = 0; i< ARRAYSIZE; i++)
;
overhead = alt_timestamp() - ticks;
printf("%5ld ", overhead);
}
printf("\n");
printf("RAM write ticks: ");
for (i=0; i<9; i++) {
ticks = alt_timestamp();
for (i = 0; i< ARRAYSIZE; i++)
a[i] = i;
ticks = alt_timestamp() - ticks - overhead;
printf("%5ld (per write %5ld)", ticks, ticks/ARRAYSIZE);
}
printf("\n");
printf("RAM read ticks: ");
for (i=0; i<9; i++) {
ticks = alt_timestamp();
for (i = 0; i< ARRAYSIZE; i++)
c += a[i];
ticks = alt_timestamp() - ticks - overhead;
printf("%5ld (per read %5ld)", ticks, ticks/ARRAYSIZE);
}
printf("\n");
i = c;
return 0;
}
If we compile the code as shown, we get the following output.
RAM write ticks: 455838 (per write 45)
RAM read ticks: 773470 (per read 77)
Consulting the objdump file, we can figure out exactly what code is being run. Here is the loop for read:
40003a8: 400eaa00 call 400eaa0 <alt_timestamp>
40003ac: e0bffd15 stw r2,-12(fp)
40003b0: 0021883a mov r16,zero
40003b4: 00000a06 br 40003e0 <main+0x174>
40003b8: 00810134 movhi r2,1028
40003bc: 10800004 addi r2,r2,0
40003c0: 8407883a add r3,r16,r16
40003c4: 18c7883a add r3,r3,r3
40003c8: 10c5883a add r2,r2,r3
40003cc: 10800017 ldw r2,0(r2)
40003d0: 8807883a mov r3,r17
40003d4: 10c5883a add r2,r2,r3
40003d8: 1023883a mov r17,r2
40003dc: 84000044 addi r16,r16,1
40003e0: 8089c430 cmpltui r2,r16,10000
40003e4: 103ff41e bne r2,zero,40003b8 <a+0xfffc03b8>
400030c: 400eaa00 call 400eaa0 <alt_timestamp>
The loop body is 12 instructions (on a Nios/e), including one memory read. Each iteration takes 77 cycles.
The memory access can be accelerated by moving the variable to on-chip memory. This requires a change in the C source code.
A variable can be allocated to a specific section using an attribute.
volatile unsigned int __attribute__((section (".localmem"))) a[ARRAYSIZE];
In the board support package, the new section (.localmem) must be explicitly added to the linker section mappings, so that we can allocate it to a local memory. In this case, the bus contains an extra local memory called scratchpad.
We can now re-compile the BSP and the application and run it again.
cd example-nios-sdram-cordic
cd software
nios2-bsp-generate-files \
--settings hal_bsp/settings.bsp \
--bsp-dir hal_bsp
cd hal_bsp
make
cd ../readram
nios2-app-generate-makefile \
--bsp-dir ../hal_bsp \
--elf-name main.elf \
--src-files main.c
make
nios2-download main.elf --go
This yields the following output. Each read now takes about 10 fewer cycles (67 instead of 77), while the write time does not change. The improvement is not very impressive, primarily because we are using a Nios/e rather than a Nios/f.
RAM write ticks: 455878 (per write 45)
RAM read ticks: 675793 (per read 67)
Important: I was not successful generating a Nios/f in Quartus 18.1. (The Nios/f does not appear to support hardware multiply instructions, which the compiler generates by default for the Nios/f setting). I don’t like to admit it, but it appears that the ‘free’ version of Quartus is a step down from previous versions in quality.