Introduction

We will discuss the implementation of a CORDIC coprocessor. This lecture is based on Chapter 15 of the Codesign Book.

The example repository for this lecture is example-nios-sdram-cordic.

You can find a list of previous codesign challenges at the following webpage: https://rijndael.ece.vt.edu/schaum/teaching/4530/

The CORDIC Reference Implementation

Here is a 20-bit CORDIC reference implementation in C, computed using fix<32,28> fixed-point arithmetic, that is, 32-bit integers with 28 fractional bits.

#define PI 843314856       
#define AG_CONST 163008218

static const int angles[] = {
  210828714,
  124459457,
   65760959,
   33381289,
   16755421,
    8385878,
    4193962,
    2097109,
    1048570,
     524287,
     262143,
     131071,
      65535,
      32767,
      16383,
       8191,
       4095,
       2047,
       1024,
        511 };

void golden_cordic(int target, int *rX, int *rY) {
  int X, Y, T, current;
  unsigned step;
  X       = AG_CONST;  
  Y       = 0;         
  current = 0;
  for(step=0; step < 20; step++) {
    if (target > current) {
      T          =  X - (Y >> step);
      Y          = (X >> step) + Y;
      X          = T;
      current   += angles[step];
    } else {
      T          = X + (Y >> step);
      Y          = -(X >> step) + Y;
      X          = T;
      current   -= angles[step]; 
    }
  }
  *rX = X;
  *rY = Y;
}
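The reference function takes and returns fix<32,28> values, so the integer 1 << 28 stands for 1.0. As a minimal sketch of that encoding (the helper names to_fix28 and from_fix28 are illustrative and are not part of the example repository), a conversion to and from double could look as follows.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 28                 /* fractional bits of fix<32,28> */
#define FIX_ONE   (1 << FRAC_BITS)   /* integer encoding of 1.0       */

/* Hypothetical helper: convert a real value to fix<32,28>, rounding to nearest. */
static int32_t to_fix28(double v) {
  /* fix<32,28> covers [-8, 8); reject values that would overflow 32 bits */
  assert(v > -8.0 && v < 8.0 - 1.0 / FIX_ONE);
  return (int32_t)(v * FIX_ONE + (v >= 0 ? 0.5 : -0.5));
}

/* Hypothetical helper: convert a fix<32,28> value back to a real value. */
static double from_fix28(int32_t f) {
  return (double)f / FIX_ONE;
}
```

With these helpers, from_fix28(163008218) gives approximately 0.607253, which is the value of AG_CONST in the listing, and to_fix28(1.0) gives 268435456.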

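The table of constants follows from the CORDIC design. The angles are fix<32,28> representations of atan(1/(1 << i)), for ‘i’ going from 0 to 19. AG_CONST is a normalization factor equal to the product of cos(atan(1/(1 << i))) for ‘i’ going from 0 to 19. The table below lists, for each iteration n, the rotation angle alpha, the gain 1/cos(alpha), the accumulated gain PROD, and its inverse.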

 n     alpha  alpha<32,28>  1/cos(alpha)      PROD    1/PROD  (1/PROD)<32,28>
 0  0.785398     210828714      1.414214  1.414214  0.707107        189812531
 1  0.463648     124459457      1.118034  1.581139  0.632456        169773489
 2  0.244979      65760959      1.030776  1.629801  0.613572        164704477
 3  0.124355      33381290      1.007782  1.642484  0.608834        163432609
 4  0.062419      16755422      1.001951  1.645689  0.607648        163114337
 5  0.031240       8385879      1.000488  1.646492  0.607352        163034749
 6  0.015624       4193963      1.000122  1.646693  0.607278        163014851
 7  0.007812       2097109      1.000031  1.646744  0.607259        163009877
 8  0.003906       1048571      1.000008  1.646756  0.607254        163008633
 9  0.001953        524287      1.000002  1.646759  0.607253        163008322
10  0.000977        262144      1.000000  1.646760  0.607253        163008244
11  0.000488        131072      1.000000  1.646760  0.607253        163008225
12  0.000244         65536      1.000000  1.646760  0.607253        163008220
13  0.000122         32768      1.000000  1.646760  0.607253        163008219
14  0.000061         16384      1.000000  1.646760  0.607253        163008219
15  0.000031          8192      1.000000  1.646760  0.607253        163008219
16  0.000015          4096      1.000000  1.646760  0.607253        163008219
17  0.000008          2048      1.000000  1.646760  0.607253        163008219
18  0.000004          1024      1.000000  1.646760  0.607253        163008219
19  0.000002           512      1.000000  1.646760  0.607253        163008219
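To see how the constants work together, the following sketch rotates the start vector (AG_CONST, 0) over a target angle and compares the result with the expected cosine and sine. It repeats the angle table and the rotation loop of the reference implementation so that it compiles on its own. The function name cordic_sketch, the tolerance TOL and the helper within_tol are assumptions chosen for illustration, and the code relies on arithmetic right shift of negative integers, as the reference code does.

```c
#include <assert.h>
#include <stdlib.h>

#define PI        843314856   /* pi in fix<32,28>                    */
#define AG_CONST  163008218   /* 0.607253 in fix<32,28>              */
#define TOL       4096        /* assumed tolerance: 2^-16 in <32,28> */

static const int angles[20] = {
  210828714, 124459457, 65760959, 33381289, 16755421,
    8385878,   4193962,  2097109,  1048570,   524287,
     262143,    131071,    65535,    32767,    16383,
       8191,      4095,     2047,     1024,      511 };

/* Same rotation loop as golden_cordic, repeated for self-containment. */
static void cordic_sketch(int target, int *rX, int *rY) {
  int X = AG_CONST, Y = 0, T, current = 0;
  unsigned step;
  for (step = 0; step < 20; step++) {
    if (target > current) {          /* rotate counter-clockwise */
      T = X - (Y >> step);
      Y = (X >> step) + Y;
      X = T;
      current += angles[step];
    } else {                         /* rotate clockwise */
      T = X + (Y >> step);
      Y = Y - (X >> step);
      X = T;
      current -= angles[step];
    }
  }
  *rX = X;
  *rY = Y;
}

/* Returns 1 when value lies within TOL of expected. */
static int within_tol(int value, int expected) {
  return abs(value - expected) <= TOL;
}
```

Calling cordic_sketch(PI / 6, &x, &y) should leave x close to 232471924 (0.866025) and y close to 134217728 (0.5). Because the start vector is pre-scaled by AG_CONST, no multiplication by 1/PROD is needed after the loop.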


CORDIC Custom Instruction Design

The following is a Verilog design for a custom-instruction implementation of a 20-bit CORDIC. The instruction set is defined as follows.

dataa  datab  n  result  operation
Angle  NA     0  NA      Load Angle, start rotation
NA     NA     1  X       X = cos(Angle)
NA     NA     2  Y       Y = sin(Angle)


module cordicci(input  wire  clk,
                input wire        clk_en,
                input wire        reset,
                input wire [7:0]  n,
                input wire        start,
                output reg        done,
                input wire [31:0] dataa,
                input wire [31:0] datab,
                output reg [31:0] result);

   reg signed [31:0]              X, X_next;
   reg signed [31:0]              Y, Y_next;
   reg signed [31:0]              A, A_next;
   reg signed [31:0]              T, T_next;
   reg [4:0]                      cnt, cnt_next;

   always @(posedge clk)
     begin
        X   <= reset ? 32'h0 : (clk_en ? X_next : X);
        Y   <= reset ? 32'h0 : (clk_en ? Y_next : Y);
        A   <= reset ? 32'h0 : (clk_en ? A_next : A);
        T   <= reset ? 32'h0 : (clk_en ? T_next : T);
        cnt <= reset ? 5'h0  : (clk_en ? cnt_next : cnt);
     end

   // state machine
   localparam
     Sidle     = 0,
     SX        = 1,
     SY        = 2,
     Srot0     = 3,
     Srot1     = 4,
     Srot2     = 5;

   reg [2:0] state, state_next;
   always @(posedge clk or posedge reset)
     state <= reset ? Sidle : (clk_en ? state_next : state);

   reg [31:0] angle;
   always @(*)
     begin
        angle = 32'd0;
        case (cnt)
          5'd0:  angle = 32'd210828714;
          5'd1:  angle = 32'd124459457;
          5'd2:  angle = 32'd65760959;
          5'd3:  angle = 32'd33381289;
          5'd4:  angle = 32'd16755421;
          5'd5:  angle = 32'd8385878;
          5'd6:  angle = 32'd4193962;
          5'd7:  angle = 32'd2097109;
          5'd8:  angle = 32'd1048570;
          5'd9:  angle = 32'd524287;
          5'd10: angle = 32'd262143;
          5'd11: angle = 32'd131071;
          5'd12: angle = 32'd65535;
          5'd13: angle = 32'd32767;
          5'd14: angle = 32'd16383;
          5'd15: angle = 32'd8191;
          5'd16: angle = 32'd4095;
          5'd17: angle = 32'd2047;
          5'd18: angle = 32'd1024;
          5'd19: angle = 32'd511;
          default: angle = 32'd0;
        endcase
     end

   always @(*)
     begin
        done   = 0;
        result = 32'd0;
        state_next = state;   // default: hold the state, so no latch is inferred
        cnt_next = cnt;
        X_next   = X;
        Y_next   = Y;
        A_next   = A;
        T_next   = T;

        case (state)
          Sidle: if (start)
            state_next = (n == 8'd0) ? Srot0 :
                         (n == 8'd1) ? SX :
                         (n == 8'd2) ? SY :
                         Sidle;

          SX: begin
             result = X;
             done   = 1;
             state_next = Sidle;
          end

          SY: begin
             result = Y;
             done   = 1;
             state_next = Sidle;
          end

          Srot0: begin
             cnt_next = 0;
             X_next   = 32'd163008218;
             Y_next   = 32'd0;
             T_next   = dataa;
             A_next   = 32'd0;
             state_next = Srot1;
          end

          Srot1: begin
             cnt_next = cnt + 1'd1;
             Y_next   = (T > A) ? (X >>> cnt) + Y : Y - (X >>> cnt);
             X_next   = (T > A) ? X - (Y >>> cnt) : X + (Y >>> cnt);
             A_next   = (T > A) ? A + angle : A - angle;
             state_next = (cnt == 5'd19) ? Srot2 : Srot1;
          end

          Srot2: begin
             done = 1;
             state_next = Sidle;
          end

          default: begin
             state_next = Sidle;
          end

        endcase
     end

endmodule
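From the software side, the coprocessor behaves like a small stateful function: n = 0 loads the angle and runs the 20 rotations, while n = 1 and n = 2 read back X and Y. The sketch below is a host-side C model of that protocol, not the Nios II driver code from the repository. The names cordic_ci_model, CI_LOAD, CI_GETX and CI_GETY are hypothetical. The model mirrors the register updates of the Srot0 and Srot1 states and assumes that >> on a negative int is an arithmetic shift, which matches >>> on a signed reg in the Verilog.

```c
#include <assert.h>
#include <stdint.h>

enum { CI_LOAD = 0, CI_GETX = 1, CI_GETY = 2 };   /* values of n */

static const int32_t ci_angles[20] = {
  210828714, 124459457, 65760959, 33381289, 16755421,
    8385878,   4193962,  2097109,  1048570,   524287,
     262143,    131071,    65535,    32767,    16383,
       8191,      4095,     2047,     1024,      511 };

static int32_t ci_X, ci_Y;    /* model of the X and Y registers */

/* Hypothetical model of one custom-instruction call; n selects the operation. */
static int32_t cordic_ci_model(int n, int32_t dataa) {
  int32_t T, A, Xn, Yn;
  unsigned cnt;
  assert(n >= CI_LOAD && n <= CI_GETY);
  switch (n) {
  case CI_LOAD:
    /* Srot0: initialize the registers and latch the target angle */
    ci_X = 163008218; ci_Y = 0; T = dataa; A = 0;
    /* Srot1: one rotation per clock cycle, cnt = 0..19 */
    for (cnt = 0; cnt < 20; cnt++) {
      if (T > A) {
        Xn = ci_X - (ci_Y >> cnt); Yn = ci_Y + (ci_X >> cnt); A += ci_angles[cnt];
      } else {
        Xn = ci_X + (ci_Y >> cnt); Yn = ci_Y - (ci_X >> cnt); A -= ci_angles[cnt];
      }
      ci_X = Xn; ci_Y = Yn;   /* both registers update from the old values */
    }
    return 0;                 /* Srot2: done is raised, result stays 0 */
  case CI_GETX: return ci_X;  /* SX */
  case CI_GETY: return ci_Y;  /* SY */
  }
  return 0;
}
```

A typical sequence is cordic_ci_model(CI_LOAD, angle), followed by cordic_ci_model(CI_GETX, 0) and cordic_ci_model(CI_GETY, 0). In the hardware, the load operation spends one cycle in Srot0, twenty in Srot1 and one in Srot2 before done is asserted.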

Building the reference implementation

Download the repository

(Cygwin)

git clone https://github.com/vt-ece4530-f19/example-nios-sdram-cordic

Simulate the custom-instruction hardware

(Cygwin)

cd example-nios-sdram-cordic
cd hardware-verification
vlib work
vlog ../cordicci.v
vlog ../cordiccitb.v
modelsim work.cordiccitb

[Figure: l20-modelsim]

Implement the hardware

(Nios2 Command Shell)

cd example-nios-sdram-cordic
quartus_sh --flow compile example-nios-sdram

Implement the software

(Nios2 Command Shell)

cd example-nios-sdram-cordic
nios2-bsp-editor
cd software
nios2-bsp-generate-files  \
    --settings=hal_bsp/settings.bsp  \
    --bsp-dir=hal_bsp
cd hal_bsp
make

Remember to connect both timers in the BSP editor (to the system clock timer source and the timestamp timer source, respectively). Remember to disable the alt_load facility. Verify the mapping of linker sections to memories.

[Figure: l20-linker]

Compile the application

(Nios2 Command Shell)

cd example-nios-sdram-cordic
cd software
cd cordic
nios2-app-generate-makefile \
     --bsp-dir ../hal_bsp \
     --elf-name main.elf \
     --src-files main.c
make

Run the application

(Nios2 Command Shell)

cd example-nios-sdram-cordic
nios2-configure-sof -d 2 exampleniossdram.sof
nios2-terminal

(Nios2 Command Shell)

cd example-nios-sdram-cordic
cd software
cd cordic
nios2-download main.elf --go

This produces the following output:

$ nios2-terminal.exe
nios2-terminal: connected to hardware target using JTAG UART on cable
nios2-terminal: "DE-SoC [USB-1]", device 2, instance 0
nios2-terminal: (Use the IDE stop button or Ctrl-C to terminate)

Starting Cordic Measurement
Overhead correction: 74151
Software Cordic Cycles: 9703222
Hardware Cordic Cycles: 471958
Errors: 0
Sum_abs_error: 0
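The two cycle counts give the speedup of the custom instruction over the software loop. As a small worked calculation (assuming both printed counts are already corrected for the measurement overhead, which the output does not state explicitly; the function name speedup_x100 is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Speedup of hardware over software, scaled by 100 to stay in integer arithmetic. */
static uint32_t speedup_x100(uint32_t sw_cycles, uint32_t hw_cycles) {
  assert(hw_cycles != 0);
  return (uint32_t)(((uint64_t)sw_cycles * 100u) / hw_cycles);
}
```

With the counts above, speedup_x100(9703222, 471958) evaluates to 2055, so the hardware version is roughly 20.5 times faster than the software version.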

Extra: Controlling where to place a variable in memory

The location of a variable in memory affects the speed at which it is accessed.

Consider the readram software application. It performs 10000 writes and 10000 reads on an array allocated in memory.

#include <system.h>
#include <stdio.h>
#include "sys/alt_timestamp.h"

#define ARRAYSIZE 10000

// volatile unsigned int __attribute__((section (".localmem"))) a[ARRAYSIZE];

volatile unsigned int a[ARRAYSIZE];

int main() {
  register unsigned int i;
  register int c = 0;       // accumulator for the read loop
  unsigned long ticks;
  unsigned long overhead = 0;

  alt_timestamp_start();

  printf("Timer Frequency %lu\n", (unsigned long) alt_timestamp_freq());

  // Measure the empty-loop overhead. Note that the inner loop reuses i and
  // leaves it at ARRAYSIZE, so each of the outer loops runs only once.
  for (i=0; i<9; i++) {
    ticks = alt_timestamp();
    for (i = 0; i< ARRAYSIZE; i++)
      ;
    overhead = alt_timestamp() - ticks;
    printf("%5lu ", overhead);
  }
  printf("\n");

  // Time ARRAYSIZE writes, corrected for the loop overhead.
  printf("RAM write ticks: ");
  for (i=0; i<9; i++) {
    ticks = alt_timestamp();
    for (i = 0; i< ARRAYSIZE; i++)
      a[i] = i;
    ticks = alt_timestamp() - ticks - overhead;
    printf("%5lu (per write %5lu)", ticks, ticks/ARRAYSIZE);
  }
  printf("\n");

  // Time ARRAYSIZE reads, corrected for the loop overhead.
  printf("RAM read ticks: ");
  for (i=0; i<9; i++) {
    ticks = alt_timestamp();
    for (i = 0; i< ARRAYSIZE; i++)
      c += a[i];
    ticks = alt_timestamp() - ticks - overhead;
    printf("%5lu (per read %5lu)", ticks, ticks/ARRAYSIZE);
  }
  printf("\n");

  // Use c so that the compiler keeps the read loop.
  i = c;

  return 0;
}

If we compile the code as shown, we get the following output.

RAM write ticks: 455838 (per write    45)
RAM read ticks: 773470 (per read    77)

Consulting the objdump file, we can figure out exactly what code is being run. Here is the loop for read:

 40003a8:       400eaa00        call    400eaa0 <alt_timestamp>
 40003ac:       e0bffd15        stw     r2,-12(fp)
 40003b0:       0021883a        mov     r16,zero
 40003b4:       00000a06        br      40003e0 <main+0x174>
 
 40003b8:       00810134        movhi   r2,1028
 40003bc:       10800004        addi    r2,r2,0
 40003c0:       8407883a        add     r3,r16,r16
 40003c4:       18c7883a        add     r3,r3,r3
 40003c8:       10c5883a        add     r2,r2,r3
 40003cc:       10800017        ldw     r2,0(r2)
 40003d0:       8807883a        mov     r3,r17
 40003d4:       10c5883a        add     r2,r2,r3
 40003d8:       1023883a        mov     r17,r2
 40003dc:       84000044        addi    r16,r16,1
 40003e0:       8089c430        cmpltui r2,r16,10000
 40003e4:       103ff41e        bne     r2,zero,40003b8 <a+0xfffc03b8>

 400030c:       400eaa00        call    400eaa0 <alt_timestamp>

The loop body contains 12 instructions, including one memory read (ldw). On a Nios/e, one iteration takes 77 cycles.

The memory access can be accelerated by moving the variable to on-chip memory. This requires a change in the C source code. A variable can be specifically allocated to a section using an attribute.

volatile unsigned int __attribute__((section (".localmem"))) a[ARRAYSIZE];

In the board support package, the new section (.localmem) must be added explicitly to the linker script, so that we can map it to a local memory. In this case, the bus contains an extra local memory called scratchpad.

[Figure: l20-linker-scratch]
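The attribute can be wrapped in a macro, which makes it easy to switch the placement on and off while experimenting. The sketch below is one possible way to organize this and is not code from the repository: the macro name LOCALMEM, the array name buffer and the functions fill_buffer and sum_buffer are hypothetical. The section name only has the intended effect when the linker script maps .localmem to the scratchpad memory. On non-ELF hosts the macro expands to nothing, so the code still compiles.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAYSIZE 10000

/* Place a variable in the .localmem section on ELF targets (assumed macro name). */
#if defined(__ELF__)
#define LOCALMEM __attribute__((section(".localmem")))
#else
#define LOCALMEM
#endif

/* Array that the linker should map to the on-chip scratchpad. */
static volatile unsigned int buffer[ARRAYSIZE] LOCALMEM;

/* Write the index value into the first n elements. */
static void fill_buffer(size_t n) {
  size_t i;
  assert(n <= ARRAYSIZE);
  for (i = 0; i < n; i++)
    buffer[i] = (unsigned int)i;
}

/* Read back the first n elements and return their sum. */
static unsigned int sum_buffer(size_t n) {
  size_t i;
  unsigned int sum = 0;
  assert(n <= ARRAYSIZE);
  for (i = 0; i < n; i++)
    sum += buffer[i];
  return sum;
}
```

Removing the attribute from the macro definition moves the array back to the default data section, which makes it simple to compare both placements with the same source file.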

We can now re-compile the BSP and the application and run it again.

cd example-nios-sdram-cordic
cd software
nios2-bsp-generate-files \
      --settings hal_bsp/settings.bsp \
      --bsp-dir hal_bsp
cd hal_bsp
make
cd ../readram
nios2-app-generate-makefile \
      --bsp-dir ../hal_bsp \
      --elf-name main.elf \
      --src-files main.c
make
nios2-download main.elf --go

This yields the following output. Each read now takes 10 fewer cycles (67 instead of 77), while the write time does not change. It’s not a very impressive improvement, primarily because we are using a Nios/e rather than a Nios/f.

RAM write ticks: 455878 (per write    45)
RAM read ticks: 675793 (per read    67)
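The per-access figures follow from dividing the corrected tick count by the number of loop iterations. As a sketch of that arithmetic (assuming that one timer tick corresponds to one processor clock cycle; the function name ticks_per_access is illustrative):

```c
#include <assert.h>

/* Average number of ticks per loop iteration, rounded down. */
static unsigned long ticks_per_access(unsigned long ticks, unsigned long iterations) {
  assert(iterations != 0);
  return ticks / iterations;
}
```

For the read loop, ticks_per_access(773470, 10000) gives 77 with the array in the default memory, and ticks_per_access(675793, 10000) gives 67 with the array in the scratchpad. For the write loop, both measurements give 45.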

Important: I was not successful in generating a Nios/f in Quartus 18.1. (The Nios/f does not appear to support hardware multiply instructions, which the compiler generates by default for the Nios/f setting.) I don’t like to admit it, but it appears that the ‘free’ version of Quartus is a step down in quality from previous versions.