Attention
This document was last updated Aug 22 24 at 21:58
Optimized 32-bit CORDIC Module for GPDK45
Important
The purpose of this lab is as follows.
Study a high-level reference implementation of the CORDIC algorithm in C, including a reference testbench with 1024 test vectors.
Select an optimization strategy (Area or Throughput) and develop an RTL implementation of the CORDIC algorithm, and verify the correctness of the design using the same set of test vectors. Optimizing for area means that your final design must be as small as possible in terms of layout area. Optimizing for throughput means that your final design must compute as many test vectors per second as possible at the maximum clock speed possible with the implemented layout.
Synthesize the design into GPDK45 standard cells. Perform static timing analysis on the design result and demonstrate that your design meets timing.
Create a layout for the design. Perform parasitic extraction. Perform static timing analysis on the post-route design and demonstrate that your design meets timing. Perform gate-level simulation using back-annotated post-route timing and demonstrate that your design simulation correctly computes the set of 1024 test vectors.
Summarize your design result in a written report.
Attention
The due date of this lab is 10 November
Introduction
The CORDIC algorithm efficiently computes transcendental functions (sine, cosine, tangent). CORDIC stands for COordinate Rotation DIgital Computer. The remarkable property of CORDIC is that it requires only integer arithmetic and yet can obtain highly accurate approximations of sine and cosine functions. There is extensive online documentation of this algorithm, so this lab assignment does not elaborate further on the algorithm. Some example references are the following.
Meher, J. Valls, T. -B. Juang, K. Sridharan and K. Maharatna, “50 Years of CORDIC: Algorithms, Architectures, and Applications,” in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 9, pp. 1893-1907, Sept. 2009, doi: 10.1109/TCSI.2009.2025803.
Schaumont, P.R. (2010). CORDIC Coprocessor. In: A Practical Introduction to Hardware/Software Codesign. Springer, Boston, MA. doi:10.1007/978-1-4419-6000-9_12
The purpose of this lab is to build an optimized hardware implementation of this algorithm in GPDK45 standard cells. You will go through all steps of Digital Design that we discussed so far: RTL design and optimization, RTL synthesis, static timing analysis, layout, gate-level simulation and gate-level static timing analysis.
Important
In the context of this lab, optimization means that your design is the best among all submissions. You can optimize for area (layout area as small as possible) or for throughput (final design as fast as possible), so there will be two ‘best’ designs. A portion of your lab grade is directly derived from the rank of your design in the overall ranking for the selected optimization criterion.
You must submit exactly one design, and you may select only one optimization criterion.
Reference Implementation
After you have accepted the assignment, a starter repository will be available for you. Clone it to the design server (your repo name includes your account name).
git clone git@github.com:wpi-ece574-f23/lab-2-patrickschaumont.git
You will obtain the following lab directory structure:
Directory | Contents
refC | Reference implementation in C. Golden model and test vectors
refV | Behavioral implementation in Verilog. Defines the testbench
rtl | A sample RTL implementation of a multiplexed solution
constraints | Constraints for the sample RTL implementation
sim | Simulation for the sample RTL implementation
syn | RTL synthesis for the sample RTL implementation
sta | Static timing analysis for the sample RTL implementation
layout | Layout for the sample RTL implementation
glsim | Gate-level simulation for the sample RTL implementation
glsta | Gate-level STA for the sample RTL implementation
The golden model is in refC. It contains a CORDIC implementation that computes sine/cosine of angles in the first quadrant, with a precision of 32 bits and 28 fractional bits (FIX<32,28>). You can compile and run this reference implementation:
cd refC
make
./cordic
Running this program will produce a set of test vectors:
rad c12ac7f cos ba83bca sin af5a193 rad +0.7546 cos +0.7286 sin +0.6850 verif cos +0.7286 sin +0.6850
rad 31102e9 cos fb5011c sin 30c37cf rad +0.1917 cos +0.9817 sin +0.1905 verif cos +0.9817 sin +0.1905
rad bfb8eb2 cos bb80511 sin ae4bea8 rad +0.7489 cos +0.7324 sin +0.6808 verif cos +0.7324 sin +0.6808
In this table, rad represents an encoded input in radians, and cos and sin represent an encoded output. The hex numbers in this output represent FIX<32,28> fixed-point numbers. For example, c12ac7f represents a fractional number in radians equal to c12ac7f/(1 << 28) = 0.7546.
This pattern rad, cos, sin is repeated three times. The first is the input/output of the C program in FIX<32,28>. The second group is the input/output of the C program as floating point numbers. The third group is the verification computed using functions of the C math library.
Running this program also creates a test vector file, vectors.txt. The test vector file has three columns: the 32-bit input, and two 32-bit outputs. The output columns can be used to verify the Verilog implementation.
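If you want to check results outside of the Verilog testbench, a small C helper along the following lines could read one line of the test vector file and compare it bit-for-bit against the outputs of your design. This is only a sketch under assumptions: it assumes each line holds three whitespace-separated hexadecimal words in the order input, cosine, sine, so check the actual layout of vectors.txt before relying on it. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One test vector: 32-bit input and the two expected 32-bit outputs. */
typedef struct {
    uint32_t target;
    uint32_t x;   /* assumed: first output column (cosine)  */
    uint32_t y;   /* assumed: second output column (sine)   */
} cordic_vector_t;

/* Parse one line of three hex words (assumed format). Returns 1 on success. */
int parse_vector_line(const char *line, cordic_vector_t *v)
{
    assert(line && v);
    unsigned int t, x, y;
    if (sscanf(line, "%x %x %x", &t, &x, &y) != 3)
        return 0;
    v->target = t;
    v->x = x;
    v->y = y;
    return 1;
}

/* Bit-exact comparison of design outputs against the expected vector. */
int vector_matches(const cordic_vector_t *v, uint32_t x, uint32_t y)
{
    assert(v);
    return v->x == x && v->y == y;
}
```

A bit-exact comparison is used here because the design is considered correct only when every test vector matches.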
Important
Your design is correct only if it simulates every test vector in vectors.txt correctly. Before proceeding to synthesis/layout, make sure that your RTL design is correct.
Verilog Reference Module
Your design must provide the following I/O ports.
module cordic(
input wire clk,
input wire reset,
input wire start,
input wire signed [31:0] target,
output wire signed [31:0] X,
output wire signed [31:0] Y,
output wire done
);
Port | Direction | Description
clk | I | positive-edge-triggered clock
reset | I | synchronous active-high reset
start | I | synchronous active-high load signal
target | I | 32-bit target input, read when start=1
X, Y | O | 32-bit outputs with the cosine (X) and sine (Y) components
done | O | active-high completion signal, marks X and Y valid
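The port semantics listed above can be illustrated with a small cycle-level C model of the start/done handshake. This is a sketch only: the datapath is left out, the LATENCY value of 22 cycles is an assumed figure for illustration, and the behavior of done after completion (held high until the next start or reset) is also an assumption. The names handshake_model_t and handshake_tick are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LATENCY 22   /* assumption: clock edges from start to done */

typedef struct {
    int32_t target;  /* input word, latched when start=1 */
    int     count;   /* remaining busy cycles */
    bool    busy;    /* computation in progress */
    bool    done;    /* high when X and Y would be valid */
} handshake_model_t;

/* Advance the model by one positive clock edge. */
void handshake_tick(handshake_model_t *m, bool reset, bool start, int32_t target)
{
    assert(m);
    if (reset) {                    /* synchronous active-high reset */
        m->target = 0;
        m->count  = 0;
        m->busy   = false;
        m->done   = false;
    } else if (start) {             /* load: sample target, begin computing */
        m->target = target;
        m->count  = LATENCY - 1;    /* the start edge counts as the first cycle */
        m->busy   = true;
        m->done   = false;
    } else if (m->busy) {
        if (--m->count == 0) {      /* last iteration finished */
            m->busy = false;
            m->done = true;
        }
    }
}
```

A testbench written against this model would assert reset for one edge, pulse start together with target, and then wait for done before sampling X and Y.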
Verilog Testbench Implementation
The refV directory contains a behavioral (non-synthesizable) Verilog implementation of the CORDIC module, along with a testbench cordictb.v. This testbench drives the input/output ports of the CORDIC module with the test vectors generated under refC. You can reuse this testbench for later testing of the RTL and gate-level design of your CORDIC.
To simulate, run the testbench as follows.
cd refV
make
You will see the output for each testvector followed by the message ‘TESTBENCH PASSES’. If you don’t see that message, your design is not correct and you have to debug the RTL before continuing.
target 0c12ac7f X 0ba83bca chkX 0ba83bca Y 0af5a193 chkY 0af5a193 OK 1
target 031102e9 X 0fb5011c chkX 0fb5011c Y 030c37cf chkY 030c37cf OK 1
target 0bfb8eb2 X 0bb80511 chkX 0bb80511 Y 0ae4bea8 chkY 0ae4bea8 OK 1
target 00a06e71 X 0ffcdbd8 chkX 0ffcdbd8 Y 00a0625b chkY 00a0625b OK 1
target 1074b5bf X 0841fae1 chkX 0841fae1 Y 0db451be chkY 0db451be OK 1
target 0b414d96 X 0c33af6b chkX 0c33af6b Y 0a597de0 chkY 0a597de0 OK 1
TESTBENCH PASSES
Sample RTL Implementation
To illustrate the steps you have to take during your design, you get one sample RTL implementation. This design is a multiplexed solution that has not been particularly optimized for either area or throughput. You may use this design as the starting point for your final solution, but a design that stays close to it is unlikely to earn a good performance rank, since everybody has access to it.
To simulate the sample solution:
cd sim
make
To perform RTL synthesis, see below. Note that this is where you select the clock period of your design. The sample RTL implementation is implemented at a clock period of 8ns. Also, note that the RTL synthesis makes use of the constraints SDC file in constraints.
cd syn
make
You can verify that this design meets timing:
cat reports/cordic_report_qor.rpt
Timing
--------
Clock Period
-------------
clk 8000.0
Cost Critical Violating
Group Path Slack TNS Paths
-------------------------------------
clk 1437.1 0.0 0
default No paths 0.0
-------------------------------------
Total 0.0 0
Instance Count
--------------
Leaf Instance Count 1261
Physical Instance count 0
Sequential Instance Count 136
Combinational Instance Count 1125
Hierarchical Instance Count 0
Area
----
Cell Area 2918.628
Physical Cell Area 0.000
Total Cell Area (Cell+Physical) 2918.628
Net Area 0.000
Total Area (Cell+Physical+Net) 2918.628
After synthesis, you also perform static timing analysis:
cd sta
make
In particular, refer to the first path in late.rpt to check the critical path of your design.
cat late.rpt
Path 1: MET Setup Check with Pin Yr_reg[31]/CK
Endpoint: Yr_reg[31]/D (^) checked with leading edge of 'clk'
Beginpoint: A_reg[5]/Q (^) triggered by leading edge of 'clk'
Path Groups: {clk}
Other End Arrival Time 0.000
- Setup 0.119
+ Phase Shift 8.000
= Required Time 7.881
- Arrival Time 6.484
= Slack Time 1.397
Clock Rise Edge 0.000
+ Clock Network Latency (Ideal) 0.000
= Beginpoint Arrival Time 0.000
-----------------------------------------------------------------------------------
Instance Arc Cell Delay Arrival Required
Time Time
-----------------------------------------------------------------------------------
A_reg[5] CK ^ - - 0.000 1.397
A_reg[5] CK ^ -> Q ^ DFFHQX1 0.186 0.186 1.583
gt_88_21_g2159__5477 AN ^ -> Y ^ NAND2BX1 0.082 0.268 1.665
gt_88_21_g2109__2802 A2 ^ -> Y v AOI32X1 0.155 0.423 1.820
gt_88_21_g2098__6417 A1 v -> Y ^ OAI211X1 0.150 0.573 1.970
gt_88_21_g2087__1881 A0 ^ -> Y v OAI211X1 0.167 0.740 2.137
g6174 A1 v -> Y v AO21XL 0.145 0.885 2.282
gt_88_21_g2084__8246 B v -> Y ^ NAND2X1 0.033 0.918 2.315
gt_88_21_g2083__5122 A1 ^ -> Y v AOI21X1 0.102 1.020 2.417
g6168 A v -> Y ^ INVX2 0.164 1.184 2.581
add_90_38_Y_sub_90_49_g2874 A ^ -> Y v INVX2 0.197 1.381 2.778
add_90_38_Y_sub_90_49_g2873__4733 A v -> Y ^ MXI2XL 0.167 1.548 2.945
add_90_38_Y_sub_90_49_g2792__5477 CI ^ -> CO ^ ADDFX1 0.217 1.765 3.162
add_90_38_Y_sub_90_49_g2791__6417 CI ^ -> CO ^ ADDFX1 0.186 1.951 3.348
add_90_38_Y_sub_90_49_g2790__7410 CI ^ -> CO ^ ADDFX1 0.186 2.137 3.534
add_90_38_Y_sub_90_49_g2789__1666 CI ^ -> CO ^ ADDFX1 0.186 2.323 3.720
add_90_38_Y_sub_90_49_g2788__2346 CI ^ -> CO ^ ADDFX1 0.186 2.509 3.906
add_90_38_Y_sub_90_49_g2787__2883 CI ^ -> CO ^ ADDFX1 0.186 2.695 4.092
add_90_38_Y_sub_90_49_g2786__9945 CI ^ -> CO ^ ADDFX1 0.186 2.881 4.278
add_90_38_Y_sub_90_49_g2785__9315 CI ^ -> CO ^ ADDFX1 0.188 3.069 4.466
add_90_38_Y_sub_90_49_g2784__6161 C ^ -> Y v NAND3BXL 0.108 3.177 4.574
add_90_38_Y_sub_90_49_g2783__4733 C0 v -> Y ^ OAI211X1 0.093 3.270 4.667
add_90_38_Y_sub_90_49_g2779__6131 CI ^ -> CO ^ ADDFX1 0.212 3.482 4.879
add_90_38_Y_sub_90_49_g2778__7098 CI ^ -> CO ^ ADDFX1 0.186 3.668 5.065
add_90_38_Y_sub_90_49_g2777__8246 CI ^ -> CO ^ ADDFX1 0.186 3.854 5.251
add_90_38_Y_sub_90_49_g2776__5122 CI ^ -> CO ^ ADDFX1 0.186 4.040 5.437
add_90_38_Y_sub_90_49_g2775__1705 CI ^ -> CO ^ ADDFX1 0.186 4.226 5.623
add_90_38_Y_sub_90_49_g2774__2802 CI ^ -> CO ^ ADDFX1 0.188 4.414 5.811
add_90_38_Y_sub_90_49_g2773__1617 C ^ -> Y v NAND3BXL 0.134 4.548 5.945
add_90_38_Y_sub_90_49_g2769__8428 A1 v -> Y ^ OAI221X1 0.172 4.720 6.117
add_90_38_Y_sub_90_49_g2763__6417 A2 ^ -> Y v AOI31X1 0.171 4.891 6.288
add_90_38_Y_sub_90_49_g2758__9945 A1 v -> Y ^ OAI21X1 0.110 5.001 6.398
add_90_38_Y_sub_90_49_g2755__4733 A1 ^ -> Y v AOI21X1 0.116 5.117 6.514
add_90_38_Y_sub_90_49_g2753__5115 C v -> Y v OR3XL 0.171 5.288 6.685
add_90_38_Y_sub_90_49_g2749__8246 A1 v -> Y ^ OAI221X1 0.093 5.381 6.778
add_90_38_Y_sub_90_49_g2746__2802 CI ^ -> CO ^ ADDFX1 0.219 5.600 6.997
add_90_38_Y_sub_90_49_g2745__1617 CI ^ -> CO ^ ADDFX1 0.186 5.786 7.183
add_90_38_Y_sub_90_49_g2744__3680 CI ^ -> CO ^ ADDFX1 0.186 5.972 7.369
add_90_38_Y_sub_90_49_g2743__6783 CI ^ -> CO ^ ADDFX1 0.184 6.156 7.553
add_90_38_Y_sub_90_49_g2742__5526 B ^ -> Y ^ XNOR2X1 0.163 6.319 7.716
g6091__4733 A1 ^ -> Y ^ AO22XL 0.165 6.484 7.881
Yr_reg[31] D ^ DFFHQX1 0.000 6.484 7.881
-----------------------------------------------------------------------------------
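The arithmetic in the header of this timing report can be reproduced with a few lines of C, which may help when you compare reports across different clock constraints. The struct and function names are hypothetical. The min_period estimate is a first-order guess only: it assumes that the path delays stay the same, whereas re-synthesis at a different constraint generally changes the netlist.

```c
#include <assert.h>
#include <math.h>

/* Fields of a setup check as printed in the report header (times in ns). */
typedef struct {
    double other_end_arrival;  /* capture clock arrival time */
    double setup;              /* setup time of the endpoint flip-flop */
    double phase_shift;        /* clock period */
    double arrival;            /* data arrival time at the endpoint */
} setup_check_t;

/* Required Time = Other End Arrival Time - Setup + Phase Shift */
double required_time(const setup_check_t *c)
{
    assert(c);
    return c->other_end_arrival - c->setup + c->phase_shift;
}

/* Slack Time = Required Time - Arrival Time (positive means timing is met) */
double slack_time(const setup_check_t *c)
{
    return required_time(c) - c->arrival;
}

/* Rough estimate of the period at which this path would reach zero slack. */
double min_period(const setup_check_t *c)
{
    return c->phase_shift - slack_time(c);
}
```

With the values of Path 1 above (0.000, 0.119, 8.000 and 6.484), required_time gives 7.881, slack_time gives 1.397, and min_period gives 6.603.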
The next step is to generate the layout. Note that the layout makes use of the chip I/O pin constraints in chip/chip.io. The synthesis step of the layout directly reads the synthesis netlist produced in syn. Hence, you must complete RTL synthesis before building the layout.
cd layout
make syn
make layout
When the layout run ends, the script prints the final core area. This is the number that will be used for the area cost of your design. The sample design measures 6140.61 square microns.
FINAL DESIGN CORE AREA:
@file 173: puts [expr ($xu - $xl)*($yu - $yl)]
6140.61
You can inspect the layout with the gui_show command. Pay particular attention to white crosses in your design, which indicate design violation errors. While we did not discuss design violations in detail, a correct design should avoid them. Design violations can occur because of overly tight design constraints.

After layout, you can run gate-level simulation:
cd glsim
make
The testbench must, of course, still pass:
...
target 00a06e71 X 0ffcdbd8 chkX 0ffcdbd8 Y 00a0625b chkY 00a0625b OK 1
target 1074b5bf X 0841fae1 chkX 0841fae1 Y 0db451be chkY 0db451be OK 1
target 0b414d96 X 0c33af6b chkX 0c33af6b Y 0a597de0 chkY 0a597de0 OK 1
TESTBENCH PASSES
Finally, you can run gate-level static timing analysis.
cd glsta
make
The design must still meet timing, i.e. have a positive slack.
Path 1: MET Setup Check with Pin Xr_reg[31]/CK
Endpoint: Xr_reg[31]/D (^) checked with leading edge of 'clk'
Beginpoint: T_reg[2]/Q (v) triggered by leading edge of 'clk'
Path Groups: {clk}
Other End Arrival Time 0.104
- Setup 0.113
+ Phase Shift 8.000
= Required Time 7.991
- Arrival Time 7.641
= Slack Time 0.350
Clock Rise Edge 0.000
+ Clock Network Latency (Prop) 0.104
= Beginpoint Arrival Time 0.104
----------------------------------------------------------------
Instance Arc Cell Delay Arrival Required
Time Time
----------------------------------------------------------------
T_reg[2] CK ^ - - 0.104 0.454
T_reg[2] CK ^ -> Q v DFFHQX1 0.262 0.366 0.716
g37105 A v -> Y ^ NAND2X1 0.060 0.426 0.776
g36792 C ^ -> Y v NAND3BXL 0.143 0.568 0.919
g36636 A v -> Y ^ NAND3X1 0.136 0.705 1.055
g37800 B0 ^ -> Y v OAI21X1 0.098 0.803 1.153
g36385__2802 B0 v -> Y ^ OAI21X1 0.065 0.868 1.218
g36369__2883 B0 ^ -> Y v AOI2BB1X1 0.073 0.942 1.292
g36293__4319 A2 v -> Y ^ OAI31X1 0.143 1.084 1.434
g36290__2398 A2 ^ -> Y v AOI31X1 0.245 1.329 1.680
g36289__5477 A v -> Y ^ NAND3X1 0.148 1.477 1.827
g36288__6417 A ^ -> Y v NAND2X1 0.123 1.601 1.951
FE_OFC7_n_938 A v -> Y v CLKBUFX6 0.203 1.804 2.154
g36287 A v -> Y ^ CLKINVX6 0.153 1.957 2.307
g36196__2883 B ^ -> Y ^ MX2X1 0.254 2.211 2.561
g36013__5477 B0 ^ -> Y v AOI2BB1X1 0.075 2.286 2.636
g35937__6131 B v -> Y ^ NOR2X1 0.082 2.368 2.719
g35874__1705 A1 ^ -> Y v OAI21X1 0.131 2.500 2.850
g35847__6131 A1 v -> Y ^ AOI21X1 0.130 2.630 2.980
g35816__5115 B0 ^ -> Y v AOI2BB1X1 0.085 2.715 3.065
g35809__2802 B v -> Y ^ NOR2X1 0.089 2.803 3.153
g35792__2883 B0 ^ -> Y v AOI2BB1X1 0.087 2.890 3.240
g35790__9315 B v -> Y ^ NOR2X1 0.084 2.974 3.325
g35758__2883 A1 ^ -> Y v OAI21X1 0.130 3.105 3.455
g35730__5477 A1 v -> Y ^ AOI22X1 0.158 3.263 3.613
g35704__6783 A1 ^ -> Y v OAI21X1 0.165 3.428 3.778
g35674__1705 A1 v -> Y ^ AOI21X1 0.140 3.567 3.917
g35655__2883 A1 ^ -> Y v OAI21X1 0.163 3.730 4.080
g37644 B0 v -> Y v OA21X1 0.149 3.879 4.229
g35620__5107 B v -> Y ^ NOR2X1 0.074 3.953 4.303
g35597__5122 B0 ^ -> Y v AOI2BB1X1 0.083 4.037 4.387
g35595__2802 B v -> Y ^ NOR2X1 0.083 4.120 4.470
g35573__4733 A1 ^ -> Y v OAI21X1 0.150 4.270 4.620
g35548__6417 B1 v -> Y ^ AOI22X1 0.161 4.431 4.781
g35520__8428 B0 ^ -> Y v AOI2BB1X1 0.110 4.541 4.891
g35518__6260 B v -> Y ^ NOR2X1 0.105 4.645 4.996
g35488__6783 A1 ^ -> Y v OAI21X1 0.157 4.803 5.153
g35466__1881 B v -> Y ^ NAND2X1 0.102 4.905 5.255
g35454__4319 B ^ -> Y v NOR2X1 0.069 4.974 5.324
g35428__3680 B1 v -> Y ^ AOI221X1 0.175 5.148 5.499
g35401__5115 A1 ^ -> Y v OAI21X1 0.204 5.352 5.702
g35378 A v -> Y ^ INVX1 0.108 5.461 5.811
g35365__9315 A1 ^ -> Y v OAI21X1 0.123 5.584 5.934
g35350 A v -> Y ^ INVX1 0.102 5.685 6.036
g35339__4319 A1 ^ -> Y v OAI21X1 0.098 5.784 6.134
g35336 A v -> Y ^ INVX1 0.075 5.859 6.209
g35323__6161 A1 ^ -> Y v OAI21X1 0.113 5.972 6.323
g35299__5107 B v -> Y ^ NAND2X1 0.108 6.080 6.430
g35287__6161 A1 ^ -> Y v OAI21X1 0.124 6.204 6.554
g35281 A v -> Y ^ INVX1 0.093 6.297 6.647
g35253__9315 A1 ^ -> Y v OAI21X1 0.109 6.406 6.756
g35238__1617 A1 v -> Y ^ AOI22X1 0.158 6.564 6.914
g35214__6161 A1 ^ -> Y v OAI22X1 0.188 6.752 7.102
g37699 CI v -> CO v ADDFX1 0.245 6.996 7.346
g35194 A v -> Y ^ INVX1 0.072 7.068 7.418
g35184__2883 A1 ^ -> Y v OAI22X2 0.141 7.210 7.560
g35170__5122 B v -> Y v XNOR2X1 0.181 7.391 7.741
g35163__6783 A1 v -> Y ^ AOI21X1 0.078 7.469 7.819
g35160__8428 B0 ^ -> Y v OAI21X1 0.103 7.573 7.923
g35158__4319 B0 v -> Y ^ OAI2BB1X1 0.069 7.641 7.991
Xr_reg[31] D ^ DFFX2 0.000 7.641 7.991
Performance Metrics
The performance metrics used in this lab differ slightly from the convention we have adopted in Lecture 8.
When you optimize for area, the final area of your design is the core area as reported by the run_innovus.tcl script.
When you optimize for throughput, your final throughput is the maximum clock frequency achieved by your design divided by the number of clock cycles per output. For example, the sample RTL implementation has a cycle budget of 22 cycles and a maximum clock frequency of 125MHz, so the throughput of this design is 125MHz / 22 = 5.682 million CORDIC evaluations per second.
These metrics differ from our earlier convention as follows. To define area, we use the physical core area, while previously we used the active area (the area of the standard cells). To define throughput, we determine the clock frequency from the clock period constraint you use for your design, and not from the critical path.
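As a quick sanity check of the throughput metric, the calculation for the sample design (8ns clock period constraint, 22 cycles per output) can be written out in C. The function name throughput_per_second is hypothetical; the formula is the clock frequency divided by the number of cycles per output, which reproduces the 5.682 million evaluations per second quoted above.

```c
#include <assert.h>
#include <math.h>

/* Throughput in CORDIC evaluations per second, from the clock period
 * constraint (in ns) and the number of clock cycles per output. */
double throughput_per_second(double clock_period_ns, int cycles_per_output)
{
    assert(clock_period_ns > 0.0 && cycles_per_output > 0);
    double f_hz = 1e9 / clock_period_ns;   /* 8 ns corresponds to 125 MHz */
    return f_hz / cycles_per_output;
}
```

For your own design, substitute the clock period from your SDC constraint and the cycle count of your architecture. Either a shorter clock period or fewer cycles per output raises the result.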
Grading Rubric
The lab receives 100 points, to be distributed as follows.
20 points for the RTL design: correctness and quality of the code and the testbench
20 points for the synthesis result: design synthesizes correctly (no latches), meets advertised timing, meets input/output specification
20 points for the layout: design has a layout that meets timing, that has no design violations
20 points for the rank of your design within the category you choose to optimize for (area/throughput). The rank-1 design gets 20 points, the rank-2 design gets 18 points, and so on.
20 points for your report: clearly document your design process, include listings, include tables with area and critical path, include a snapshot of the layout.
Attention
The following are important report requirements.
Use a typesetting tool such as LaTeX or Word. Do not submit handwritten scanned reports.
Use a screen capture tool to collect graphics such as layout information. Do not use your smartphone to take a picture from the screen.
Be clear and complete in your report. I am not looking for the correct solution; I am looking to understand your solution. Explain the steps that lead up to a result. The report does not have a minimum length nor a maximum length, as long as it is clear what you did and your answers to questions are complete.
Make sure to update, commit and push the results of your lab to your repository. You do not have to turn in anything on Canvas. All results will be communicated through your GitHub repository.