.. ECE 574

.. attention:: This document was last updated |today|

.. _08optimizing:

Optimizing Area and Timing
==========================

.. important::

   The purpose of this lecture is as follows.

   * To explore the impact of optimization on design cost factors (gate count, performance, layout area) for a concrete design
   * To apply two techniques that help in building *high-speed* hardware
   * To apply two techniques that help in building *low-area* hardware
   * To discuss Verilog coding and testbench design for all these cases

.. important::

   The designs discussed in this lecture are on https://github.com/wpi-ece574-f24/ex-layout-avg64

Moving Average Application
--------------------------

We will study a concrete design consisting of a moving-average filter application with a 64-bit wordlength. The filter adds up the last 4 samples that have been entered at the ``din`` input. When two 64-bit numbers are added, the carry-out bit is thrown out, and only the lower 64 bits are kept.

.. figure:: img/mtfig.jpg
   :figwidth: 600px
   :align: center

This implementation exhibits the following characteristics.

* The inputs and outputs are produced at a rate of 1 data element per clock cycle. Thus, the data introduction interval of the reference implementation is one sample per clock cycle.

* There is a combinational path that runs from input to output through three 64-bit additions. The *latency* of the implementation, counted in clock cycles, is 0 cycles. That is, if the input is applied at :math:`t=0` (ns), then the output will be ready at :math:`t=T_{add}` (ns), with :math:`T_{add}` the time of the combinational path through three 64-bit additions. Obviously, to avoid a timing path violation on a clock period of :math:`T_c`, we must ensure that :math:`T_{add} < T_c`. If the *input delay* constraint on *din* is :math:`T_{din}`, and the *output delay* constraint on *dout* is :math:`T_{dout}`, then we must ensure that :math:`T_{add} + T_{din} + T_{dout} < T_c`.
* The input is captured into a register which forms part of a register delay chain. The combinational delay along the chain is very small, and defined by the clock-to-Q delay of the register. Therefore, it's reasonable to assume that the slowest path in the overall design will be dominated by three sequential additions. Nevertheless, one must keep in mind that the chain of 64-bit adders enables considerable parallelism, as the adder chain sums up four independent numbers which are available at the start of the cycle (plus the input delay for *din*, plus the clock-to-Q delay for the *registers*).

.. code::
   :number-lines: 1

   module movavg( input  logic clk,
                  input  logic reset,
                  input  logic [63:0] din,
                  output logic [63:0] dout );

      logic [63:0] tap1, tap1_next;
      logic [63:0] tap2, tap2_next;
      logic [63:0] tap3, tap3_next;

      always_ff @(posedge clk)
         begin
            if (reset)
               begin
                  tap1 <= 64'h0;
                  tap2 <= 64'h0;
                  tap3 <= 64'h0;
               end
            else
               begin
                  tap1 <= tap1_next;
                  tap2 <= tap2_next;
                  tap3 <= tap3_next;
               end
         end

      logic [63:0] doutreg;

      always_comb
         begin
            doutreg   = din + tap1 + tap2 + tap3;
            tap1_next = din;
            tap2_next = tap1;
            tap3_next = tap2;
         end

      assign dout = doutreg;

   endmodule

The correctness of the design is evaluated using a testbench that feeds in random values. The testbench performs a parallel computation while feeding these random values, so that the correctness of the module implementation is verified. In the following testbench, pay special attention to the timing of the testbench. Since the DUT has no data-ready signal, the testbench is sensitive to the time when the output *DUT.dout* is read.

* (Line 7) We apply a 10ns clock. At this point, we are looking for functional timing verification, so the period is an arbitrary but convenient number.

* (Line 33) We use a one-cycle reset sequence and start applying inputs from cycle 2. The reference values ``chk_tap1``, ``chk_tap2``, ``chk_tap3`` are reset simultaneously with the module.

* (Line 42) Each clock cycle, a new vector is applied.
  The input to the DUT is provided with a tiny delay after the clock edge. This ensures that the inputs are not changing exactly *at* the clock edge. While an RTL simulation suffers no hold timing effects, non-deterministic behavior may still occur when multiple variables change at the same simulation timestamp. In this case, we have to make sure that the registers in the DUT are updated before ``din`` changes value.

* (Line 54) We assert the outputs at the end of the clock period, just before the clock edge.

* (Line 60) We update the delay line after the output assertion is complete. Then, we await the next clock edge and move to the next input test vector.

.. code::
   :number-lines: 1

   `timescale 1ns/1ps

   `define CLOCKPERIOD 10

   module tb;

      logic clk, reset;

      always
         begin
            clk = 1'b0;
            #(`CLOCKPERIOD/2);
            clk = 1'b1;
            #(`CLOCKPERIOD/2);
         end

      logic [63:0] din;
      logic [63:0] dout;

      movavg DUT(.clk(clk),
                 .reset(reset),
                 .din(din),
                 .dout(dout)
                 );

      logic [63:0] chk_tap3;
      logic [63:0] chk_tap2;
      logic [63:0] chk_tap1;
      logic [63:0] chk_din;
      logic [63:0] chk_dout;

      initial
         begin
            reset = 1'b1;
            @(negedge clk);
            reset = 1'b0;
            chk_tap1 = 0;
            chk_tap2 = 0;
            chk_tap3 = 0;
            din      = 0;
            @(posedge clk);
            while (1)
               begin
                  #1; // making sure we assign new inputs just after the clock edge
                  din[31:0]  = $random;
                  din[63:32] = $random;
                  chk_din    = din;
                  #(`CLOCKPERIOD - 1);
                  chk_dout = chk_din + chk_tap3 + chk_tap2 + chk_tap1;
                  $display("%t out %h expected %h OK %d", $time, dout, chk_dout, dout == chk_dout);
                  // $display("%t DUT in %h taps %h %h %h", $time, DUT.din, DUT.tap1, DUT.tap2, DUT.tap3);
                  // $display("%t CHK in %h taps %h %h %h", $time, chk_din, chk_tap1, chk_tap2, chk_tap3);
                  chk_tap3 = chk_tap2;
                  chk_tap2 = chk_tap1;
                  chk_tap1 = chk_din;
                  @(posedge clk);
               end
         end

      initial
         begin
            repeat(256) @(posedge clk);
            $finish;
         end

   endmodule

FAST design strategy #1: Pipelining
-----------------------------------

The first design strategy to improve the performance of the reference design is to introduce pipeline registers.
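Before transforming the design, it is worth pinning down the behavior that every optimized version must preserve. The check computation maintained by the testbench above can be sketched as a small software golden model. This is a hypothetical Python sketch, not part of the course repository; it assumes 64-bit wraparound addition, as in the RTL.

```python
MASK = (1 << 64) - 1   # 64-bit additions wrap around, as in the RTL

class MovAvgModel:
    """Golden model of the moving averager: current input plus last three."""
    def __init__(self):
        self.taps = [0, 0, 0]              # tap1, tap2, tap3 after reset

    def step(self, din):
        dout = (din + sum(self.taps)) & MASK
        self.taps = [din] + self.taps[:2]  # shift the delay line
        return dout

m = MovAvgModel()
outs = [m.step(x) for x in [1, 2, 3, 4, 5]]
# outs == [1, 3, 6, 10, 14]: each output sums up to four most recent samples
```

The `step` method mirrors one clock cycle of the RTL: it computes ``dout`` from the current input and the three taps, then shifts the delay line, exactly like the ``chk_tap1``/``chk_tap2``/``chk_tap3`` updates in the testbench.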
Because the design does not have a latency requirement, an arbitrary number of pipeline registers can be inserted. The main challenge of pipelining is to ensure that pipeline registers are inserted consistently, i.e. according to the rules of retiming as discussed in our lecture on Timing. Because the reference design has no feedback loops, the insertion of pipeline registers can be done by leveling the circuit graph.

The *expected* critical path of the design runs through three sequential adders. Therefore, these adders are isolated by two pipeline registers. We do not insert a pipeline register at the input (din) or the output (dout) because we are assuming that the input delay and the output delay are both 0. If a different constraint is chosen, this may lead to the insertion of an additional pipeline register at din or dout.

.. figure:: img/refpipe.jpg
   :figwidth: 600px
   :align: center

Furthermore, we do not insert a pipeline register *within* the adder, although it is perfectly feasible to pipeline the adder itself as well:

.. figure:: img/adderpipe.jpg
   :figwidth: 400px
   :align: center

In the overall pipelined design, notice how the insertion of a pipeline level implies the insertion of multiple 64-bit registers. For example, the first pipeline level leads to three registers: ``pipe1a`` at the output of the first adder, ``pipe1b`` at the input of the second adder, and ``pipe1c`` at the output of the third tap. We can thus rewrite the Verilog description while inserting these additional registers in the computation flow.
.. code::
   :number-lines: 1

   `define WL  63
   `define WL1 64

   module movavg(input wire clk,
                 input wire reset,
                 input wire [`WL:0] din,
                 output wire [`WL:0] dout);

      reg [`WL:0] tap1, tap1_next;
      reg [`WL:0] tap2, tap2_next;
      reg [`WL:0] tap3, tap3_next;

      reg [`WL:0] pipe1a, pipe1a_next;
      reg [`WL:0] pipe1b, pipe1b_next;
      reg [`WL:0] pipe1c, pipe1c_next;
      reg [`WL:0] pipe2a, pipe2a_next;
      reg [`WL:0] pipe2b, pipe2b_next;

      always @(posedge clk)
         begin
            if (reset)
               begin
                  tap1   <= `WL1'h0;
                  tap2   <= `WL1'h0;
                  tap3   <= `WL1'h0;
                  pipe1a <= `WL1'h0;
                  pipe1b <= `WL1'h0;
                  pipe1c <= `WL1'h0;
                  pipe2a <= `WL1'h0;
                  pipe2b <= `WL1'h0;
               end
            else
               begin
                  tap1   <= tap1_next;
                  tap2   <= tap2_next;
                  tap3   <= tap3_next;
                  pipe1a <= pipe1a_next;
                  pipe1b <= pipe1b_next;
                  pipe1c <= pipe1c_next;
                  pipe2a <= pipe2a_next;
                  pipe2b <= pipe2b_next;
               end
         end // always @ (posedge clk)

      reg [`WL:0] doutreg;

      always @(*)
         begin
            pipe1a_next = din + tap1;
            pipe1b_next = tap2;
            pipe1c_next = tap3;
            pipe2a_next = pipe1a + pipe1b;
            pipe2b_next = pipe1c;
            doutreg     = pipe2a + pipe2b;
            tap1_next   = din;
            tap2_next   = tap1;
            tap3_next   = tap2;
         end

      assign dout = doutreg;

   endmodule

.. attention:: The delay of the pipelined design will be given by the critical path, since a new result is computed every clock cycle.

For a pipelined design, we have to adjust the testbench so that it deals with the increased latency of the design. With two pipeline registers, the result is available two clock cycles later. Additionally, the testbench has to take into account that a pipelined design will complete several sums at the same time.

* (Line 57) The testbench keeps track of the two previous results, ``pp_chk_dout`` and ``p_chk_dout``, while computing the current result, ``chk_dout``.

* (Line 62) The result is tested from the first cycle. This implies that we can expect the first two results to be wrong. So, we would check OK only from the third result produced.
.. code::
   :number-lines: 1

   `timescale 1ns/1ps

   `define CLOCKPERIOD 10

   module tb;

      logic clk, reset;

      always
         begin
            clk = 1'b0;
            #(`CLOCKPERIOD/2);
            clk = 1'b1;
            #(`CLOCKPERIOD/2);
         end

      logic [63:0] din;
      logic [63:0] dout;

      movavg DUT(.clk(clk),
                 .reset(reset),
                 .din(din),
                 .dout(dout)
                 );

      logic [63:0] chk_tap3;
      logic [63:0] chk_tap2;
      logic [63:0] chk_tap1;
      logic [63:0] chk_din;
      logic [63:0] chk_dout;
      logic [63:0] p_chk_dout;
      logic [63:0] pp_chk_dout;

      initial
         begin
            reset = 1'b1;
            @(negedge clk);
            reset = 1'b0;
            chk_tap1 = 0;
            chk_tap2 = 0;
            chk_tap3 = 0;
            din      = 0;
            @(posedge clk);
            while (1)
               begin
                  #1; // making sure we assign new inputs just after the clock edge
                  din[31:0]  = $random;
                  din[63:32] = $random;
                  chk_din    = din;
                  #(`CLOCKPERIOD - 1);
                  // This delay ensures the tested output matches the pipelined output (2 pipe stages)
                  pp_chk_dout = p_chk_dout;
                  p_chk_dout  = chk_dout;
                  chk_dout    = chk_din + chk_tap3 + chk_tap2 + chk_tap1;
                  $display("%t out %h expected %h OK %d", $time, dout, pp_chk_dout, dout == pp_chk_dout);
                  $display("%t CHK in %h taps %h %h %h -> %h", $time, chk_din, chk_tap1, chk_tap2, chk_tap3, chk_dout);
                  chk_tap3 = chk_tap2;
                  chk_tap2 = chk_tap1;
                  chk_tap1 = chk_din;
                  @(posedge clk);
               end
         end

      initial
         begin
            repeat(256) @(posedge clk);
            $finish;
         end

   endmodule

Here are the first few lines of the testbench output demonstrating the latency effect. Note that the third result ``b9594e719d53863a``, i.e. the output after three inputs, is only checked when the fifth input is read in at timestamp 65000.
.. code::

   25000 out 0000000000000000 expected xxxxxxxxxxxxxxxx OK x
   25000 CHK in c0895e8112153524 taps 0000000000000000 0000000000000000 0000000000000000 -> c0895e8112153524
   35000 out 0000000000000000 expected xxxxxxxxxxxxxxxx OK x
   35000 CHK in b1f056638484d609 taps c0895e8112153524 0000000000000000 0000000000000000 -> 7279b4e4969a0b2d
   45000 out c0895e8112153524 expected c0895e8112153524 OK 1
   45000 CHK in 46df998d06b97b0d taps b1f056638484d609 c0895e8112153524 0000000000000000 -> b9594e719d53863a
   55000 out 7279b4e4969a0b2d expected 7279b4e4969a0b2d OK 1
   55000 CHK in 89375212b2c28465 taps 46df998d06b97b0d b1f056638484d609 c0895e8112153524 -> 4290a08450160a9f
   65000 out b9594e719d53863a expected b9594e719d53863a OK 1
   65000 CHK in 06d7cd0d00f3e301 taps 89375212b2c28465 46df998d06b97b0d b1f056638484d609 -> 88df0f103ef4b87c

FAST design strategy #2: Unfolding
----------------------------------

An alternate strategy to improve design performance is to exploit the parallelism of hardware and remove the input/output constraint of a single data item per cycle. This strategy is called *unfolding*. Unfolding is more than simply parallelizing. For example, simply doubling the averager is not a correct implementation of the averaging algorithm, because that implementation would compute the average on two independent streams. In *unfolding*, we perform computations on a single stream of data items which are delivered in groups of *n* data items at a time.

The following example illustrates this concept. A simple counter circuit counts at a rate of one per clock cycle. The 2x unfolded version of this counter will produce two outputs per clock cycle, *n* and *n+1*. We achieve this by doubling the combinational hardware. However, the state variables of the unfolded circuit are identical to those of the original circuit: there is still one single register.

.. figure:: img/unfold1.jpg
   :figwidth: 400px
   :align: center

The unfolding transformation is general, and applies to any circuit.
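The counter example above can be sketched in software. This is a hypothetical Python model (not from the repository) that contrasts the original counter with its 2x unfolded version: the logic doubles, but the state does not.

```python
def counter(cycles):
    """Original counter: one state register, one output per clock cycle."""
    n, out = 0, []
    for _ in range(cycles):
        n = n + 1
        out.append(n)
    return out

def counter_unfolded(cycles):
    """2x unfolded counter: doubled logic, still a single state register."""
    n, out = 0, []
    for _ in range(cycles):
        out.append((n + 1, n + 2))  # two outputs computed combinationally
        n = n + 2                   # the one register is updated once per cycle
    return out
```

Four cycles of the unfolded counter produce the same stream as eight cycles of the original, delivered in pairs: (1,2), (3,4), (5,6), (7,8).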
For example, assume we have a counter circuit with two back-to-back registers. If the initial state of the registers is zero, then the sequence of values produced at the output will be 1, 1, 2, 2, 3, 3, 4, 4, and so forth. The 2x unfolded version of this circuit is shown below. It uses two incrementer circuits, and the state variables of the original circuit are *distributed* over the unfolded circuit. It's easy to see that this circuit produces the same outputs as the original circuit, but in pairs of two: (1,1), (2,2), (3,3), ...

.. figure:: img/unfold2.jpg
   :figwidth: 400px
   :align: center

It is possible to derive systematic unfolding rules for arbitrary circuits, but since the averager is sufficiently small and simple, we can attempt to derive a 2x unfolded version by hand. The unfolded version of the averager reads in pairs of data *(dinA, dinB)* and produces pairs of output data *(doutA, doutB)*. The output data is the sum of the last four data items in the stream. For example, if the stream consists of the tuples *(dinA1, dinB1), (dinA2, dinB2), (dinA3, dinB3)* (with dinA1 the most recent), then the outputs are defined by *(doutA = dinA1 + dinB1 + dinA2 + dinB2)* and *(doutB = dinB1 + dinA2 + dinB2 + dinA3)*.

Graphically, this leads to the unfolded circuit we aim to create. Notice that the number of state variables is the same as the original circuit (3 registers) while the logic has doubled.

.. figure:: img/unfoldavg.jpg
   :figwidth: 600px
   :align: center
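The input/output relation derived above can be captured in a small golden model. This is a hypothetical Python sketch (not from the repository), assuming 64-bit wraparound addition; within each pair, *dinB* is the older sample and *dinA* the newer one.

```python
MASK = (1 << 64) - 1   # additions wrap at 64 bits, as in the RTL

class UnfoldedAvgModel:
    """2x unfolded averager: one (dinA, dinB) pair in, one (doutA, doutB)
    pair out per cycle. dinB is the older sample, dinA the newer one."""
    def __init__(self):
        self.tap1 = self.tap2 = self.tap3 = 0

    def step(self, dinA, dinB):
        doutA = (dinA + dinB + self.tap1 + self.tap2) & MASK
        doutB = (dinB + self.tap1 + self.tap2 + self.tap3) & MASK
        # same three state registers as the folded design
        self.tap3 = self.tap1
        self.tap2 = dinB
        self.tap1 = dinA
        return doutA, doutB

u = UnfoldedAvgModel()
# stream (oldest first): 1, 2, 3, 4 -> pairs (A=2, B=1), then (A=4, B=3)
# first pair:  doutA = 2+1 = 3,       doutB = 1
# second pair: doutA = 4+3+2+1 = 10,  doutB = 3+2+1 = 6
```

Note that the three state registers update exactly as in the flattened stream: ``tap1`` receives the newest sample, ``tap2`` the one before it, and ``tap3`` the previous cycle's ``tap1``.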
.. code::
   :number-lines: 1

   module movavg(input wire clk,
                 input wire reset,
                 input wire [63:0] dinA,
                 input wire [63:0] dinB,
                 output wire [63:0] doutA,
                 output wire [63:0] doutB);

      reg [63:0] tap1, tap1_next;
      reg [63:0] tap2, tap2_next;
      reg [63:0] tap3, tap3_next;

      always @(posedge clk)
         begin
            if (reset)
               begin
                  tap1 <= 64'h0;
                  tap2 <= 64'h0;
                  tap3 <= 64'h0;
               end
            else
               begin
                  tap1 <= tap1_next;
                  tap2 <= tap2_next;
                  tap3 <= tap3_next;
               end
         end // always @ (posedge clk)

      reg [63:0] doutregA;
      reg [63:0] doutregB;

      always @(*)
         begin
            doutregA  = dinA + dinB + tap1 + tap2;
            doutregB  = dinB + tap1 + tap2 + tap3;
            tap1_next = dinA;
            tap2_next = dinB;
            tap3_next = tap1;
         end

      assign doutA = doutregA;
      assign doutB = doutregB;

   endmodule

.. attention:: The delay of the unfolded design will be given by the critical path divided by the unfolding factor (2). Each clock cycle, two new results are computed.

The testbench of this design will have to reflect the double-rate nature of the hardware, and hence verify two outputs per iteration. The latency of the design is still 0 clock cycles, as with the original design. The testbench verification does not implement the unfolded design, but rather implements the original design working at twice the speed. That is, each loop iteration, two new items are inserted into the delay line and two outputs are computed.

* (Line 61-62) The first (oldest) element is inserted into the delay line.

* (Line 66-67) Two consecutive outputs of the moving averager are computed.
.. code::
   :number-lines: 1

   `timescale 1ns/1ps

   `define CLOCKPERIOD 10

   module tb;

      logic clk, reset;

      always
         begin
            clk = 1'b0;
            #(`CLOCKPERIOD/2);
            clk = 1'b1;
            #(`CLOCKPERIOD/2);
         end

      logic [63:0] dinA;
      logic [63:0] doutA;
      logic [63:0] dinB;
      logic [63:0] doutB;

      movavg DUT(.clk(clk),
                 .reset(reset),
                 .dinA(dinA),
                 .dinB(dinB),
                 .doutA(doutA),
                 .doutB(doutB)
                 );

      logic [63:0] chk_tap4;
      logic [63:0] chk_tap3;
      logic [63:0] chk_tap2;
      logic [63:0] chk_tap1;
      logic [63:0] chk_dinA;
      logic [63:0] chk_doutA;
      logic [63:0] chk_dinB;
      logic [63:0] chk_doutB;

      initial
         begin
            reset = 1'b1;
            @(negedge clk);
            reset = 1'b0;
            chk_tap1 = 0;
            chk_tap2 = 0;
            chk_tap3 = 0;
            dinA     = 0;
            dinB     = 0;
            @(posedge clk);
            while (1)
               begin
                  #1; // making sure we assign new inputs just after the clock edge
                  dinA[31:0]  = $random;
                  dinA[63:32] = $random;
                  dinB[31:0]  = $random;
                  dinB[63:32] = $random;
                  chk_dinB    = dinB;
                  chk_dinA    = dinA;
                  #(`CLOCKPERIOD - 1);
                  chk_doutA = chk_dinA + chk_dinB + chk_tap1 + chk_tap2;
                  chk_doutB = chk_dinB + chk_tap1 + chk_tap2 + chk_tap3;
                  $display("A %t out %h expected %h OK %d", $time, doutA, chk_doutA, doutA == chk_doutA);
                  $display("B %t out %h expected %h OK %d", $time, doutB, chk_doutB, doutB == chk_doutB);
                  // $display("%t DUT in %h taps %h %h %h", $time, DUT.din, DUT.tap1, DUT.tap2, DUT.tap3);
                  // $display("%t CHK in %h taps %h %h %h", $time, chk_din, chk_tap1, chk_tap2, chk_tap3);
                  chk_tap3 = chk_tap1;
                  chk_tap2 = chk_dinB;
                  chk_tap1 = chk_dinA;
                  @(posedge clk);
               end
         end

      initial
         begin
            repeat(256) @(posedge clk);
            $finish;
         end

   endmodule

SMALL design strategy #1: Sequentializing
-----------------------------------------

To reduce the area of a hardware design, we have to reuse hardware elements over multiple clock cycles. An obvious candidate for reuse is the 64-bit adder. The flow graph of the averager is partitioned in clusters that contain similar operations. At the edge of a cluster, a register is placed to carry signals over to the next clock cycle.
* (Line 48-75) We use a 5-state FSM that reads the input, accumulates all the taps, and finally shifts the taps and produces the output.

* The level of area reduction depends on two things. First, the synthesis tools have to realize that the 64-bit addition can be reused over different FSM states. Second, the area saved by *sharing* the adder hardware must be larger than the area overhead added by the sharing support hardware. This includes the finite state machine, the accumulator register (line 11), and the multiplexers (invisible in the RTL!). This is no small requirement, and we will see that this overhead is significant.

.. figure:: img/avgseq.jpg
   :figwidth: 600px
   :align: center

.. code::
   :number-lines: 1

   module movavg(input logic clk,
                 input logic reset,
                 input logic [63:0] din,
                 output logic [63:0] dout,
                 output logic read );

      typedef enum logic [3:0] {
         S0 = 4'b0000,
         S1 = 4'b0001,
         S2 = 4'b0010,
         S3 = 4'b0011,
         S4 = 4'b0100
      } state_t;

      logic [63:0] tap1, tap1_next;
      logic [63:0] tap2, tap2_next;
      logic [63:0] tap3, tap3_next;
      state_t      state, state_next;
      logic [63:0] acc, acc_next;

      always_ff @(posedge clk)
         begin
            if (reset)
               begin
                  tap1  <= 64'h0;
                  tap2  <= 64'h0;
                  tap3  <= 64'h0;
                  acc   <= 64'h0;
                  state <= S0;
               end
            else
               begin
                  tap1  <= tap1_next;
                  tap2  <= tap2_next;
                  tap3  <= tap3_next;
                  acc   <= acc_next;
                  state <= state_next;
               end
         end

      logic [63:0] doutreg;
      logic readreg;

      always @(*)
         begin
            state_next = state;
            acc_next   = acc;
            tap1_next  = tap1;
            tap2_next  = tap2;
            tap3_next  = tap3;
            doutreg    = 64'h0;
            readreg    = 1'b0;
            case (state)
               S0: begin
                  acc_next   = din;
                  readreg    = 1'b1;
                  state_next = S1;
               end
               S1: begin
                  acc_next   = acc + tap1;
                  state_next = S2;
               end
               S2: begin
                  acc_next   = acc + tap2;
                  state_next = S3;
               end
               S3: begin
                  acc_next   = acc + tap3;
                  state_next = S4;
               end
               S4: begin
                  doutreg    = acc;
                  tap1_next  = din;
                  tap2_next  = tap1;
                  tap3_next  = tap2;
                  state_next = S0;
               end
               default: state_next = S0;
            endcase
         end

      assign dout = doutreg;
      assign read = readreg;

   endmodule
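The cycle-by-cycle schedule of the FSM above can be sketched as a software model. This is a hypothetical Python sketch (not from the repository); like the testbench, it assumes that ``din`` is held stable for all five cycles of one iteration, and that additions wrap at 64 bits.

```python
MASK = (1 << 64) - 1   # 64-bit wraparound, as in the RTL

class SeqAvgModel:
    """Cycle-level sketch of the 5-state sequentialized averager: a single
    64-bit adder is reused once per clock cycle (states S1..S3)."""
    def __init__(self):
        self.taps = [0, 0, 0]   # tap1, tap2, tap3
        self.acc = 0
        self.state = 0          # S0

    def cycle(self, din):
        dout, read = 0, False
        if self.state == 0:                 # S0: load input, assert read
            self.acc = din
            read = True
        elif self.state in (1, 2, 3):       # S1..S3: one addition per cycle
            self.acc = (self.acc + self.taps[self.state - 1]) & MASK
        else:                               # S4: produce output, shift taps
            dout = self.acc
            self.taps = [din] + self.taps[:2]
        self.state = (self.state + 1) % 5
        return dout, read

m = SeqAvgModel()
outs = []
for x in [1, 2, 3, 4]:          # din held stable over each 5-cycle iteration
    for _ in range(5):
        dout, read = m.cycle(x)
    outs.append(dout)           # dout of the S4 cycle
# outs == [1, 3, 6, 10], matching the single-cycle reference design
```

One output now takes five clock cycles instead of one, which is exactly the area-versus-delay trade the sequentialized design makes.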
.. attention:: The delay of the sequential design equals the critical path times the number of clock cycles required to compute the output dout (5).

The testbench of the sequential design is derived from the original testbench and simply increases the latency of the design by 4 cycles. We have also introduced an additional synchronization signal, ``read``, which indicates when the FSM is in state ``S0``. In that state, the new input is read into the delay line.

.. code::
   :number-lines: 1

   `timescale 1ns/1ps

   `define CLOCKPERIOD 10

   module tb;

      logic clk, reset;

      always
         begin
            clk = 1'b0;
            #(`CLOCKPERIOD/2);
            clk = 1'b1;
            #(`CLOCKPERIOD/2);
         end

      logic [63:0] din;
      logic [63:0] dout;
      logic read;

      movavg DUT(.clk(clk),
                 .reset(reset),
                 .din(din),
                 .dout(dout),
                 .read(read)
                 );

      logic [63:0] chk_tap3;
      logic [63:0] chk_tap2;
      logic [63:0] chk_tap1;
      logic [63:0] chk_din;
      logic [63:0] chk_dout;

      initial
         begin
            reset = 1'b1;
            @(negedge clk);
            reset = 1'b0;
            chk_tap1 = 0;
            chk_tap2 = 0;
            chk_tap3 = 0;
            din      = 0;
            while (read == 1'b0) // wait until computation can start
               @(posedge clk);
            while (1)
               begin
                  #1; // making sure we assign new inputs just after the clock edge
                  din[31:0]  = $random;
                  din[63:32] = $random;
                  chk_din    = din;
                  repeat(4) @(posedge clk); // wait until FSM complete
                  #(`CLOCKPERIOD - 2);
                  chk_dout = chk_din + chk_tap3 + chk_tap2 + chk_tap1;
                  $display("%t out %h expected %h OK %d", $time, dout, chk_dout, dout == chk_dout);
                  // $display("%t DUT in %h taps %h %h %h", $time, DUT.din, DUT.tap1, DUT.tap2, DUT.tap3);
                  // $display("%t CHK in %h taps %h %h %h", $time, chk_din, chk_tap1, chk_tap2, chk_tap3);
                  chk_tap3 = chk_tap2;
                  chk_tap2 = chk_tap1;
                  chk_tap1 = chk_din;
                  @(posedge clk);
               end
         end

      initial
         begin
            repeat(256) @(posedge clk);
            $finish;
         end

   endmodule

SMALL design strategy #2: Bit-serializing
-----------------------------------------

The second strategy towards small area is to take the hardware reuse concept to the extreme, and transform the 64-bit averager to a bit-serial design.
The basic components of the design are easy to transform to bit-serial circuits:

* A 64-bit adder becomes a 1-bit bit-serial adder
* A 64-bit tap register becomes a 64-bit shift register (a 64-position 1-bit FIFO)
* Parallel inputs and outputs are created using appropriate parallel-to-serial and serial-to-parallel registers

A challenge of bit-serial design is the controller, which needs to create a sync signal with a duty cycle of 1/64. The phase of the sync signal changes with the position of the bit-serial operator in the overall design.

.. figure:: img/bitserialavg.jpg
   :figwidth: 600px
   :align: center

The advantage of bit-serialization over sequentializing is that we can get rid of most multiplexers by ensuring that the data always arrives just in time. The latency of this design is a little more complicated: every 64 clock cycles, a new result is produced. However, the output is offset by 4 clock cycles from the input due to the phase delay inserted during bit-serial computation.

* The controller is implemented as a 6-bit cyclic counter. The phase signals :math:`\phi_0, \phi_1, \phi_2, \phi_3` are derived from this counter by decoding the proper counter state. An alternate implementation would be a 64-bit circular shift register -- which is probably larger than this solution due to the relatively large size of a flip-flop cell.

* Three additional modules are used to create this implementation: a parallel-to-serial converter, a serial-to-parallel converter and a bit-serial adder.

* The tap registers from earlier implementations are reused as 64x1 FIFO registers.
.. code::
   :number-lines: 1

   module movavg(input wire clk,
                 input wire reset,
                 input wire [63:0] din,
                 output wire [63:0] dout);

      reg [63:0] tap1, tap1_next;
      reg [63:0] tap2, tap2_next;
      reg [63:0] tap3, tap3_next;
      reg [5:0]  ctr, ctr_next;

      always @(posedge clk)
         begin
            if (reset)
               begin
                  tap1 <= 64'h0;
                  tap2 <= 64'h0;
                  tap3 <= 64'h0;
                  ctr  <= 6'd0;
               end
            else
               begin
                  tap1 <= tap1_next;
                  tap2 <= tap2_next;
                  tap3 <= tap3_next;
                  ctr  <= ctr_next;
               end
         end // always @ (posedge clk)

      // controller
      reg ctl0, ctl1, ctl2, ctl3, ctl4;

      always @(*)
         begin
            ctr_next = ctr + 6'b1;
            ctl0     = (ctr == 6'd0);
            ctl1     = (ctr == 6'd1);
            ctl2     = (ctr == 6'd2);
            ctl3     = (ctr == 6'd3);
            ctl4     = (ctr == 6'd4);
         end

      // first stage: parallel to serial conversion
      wire din_s;
      ps stage1(din, ctl0, clk, din_s);

      // second stage: delay line
      always @(*)
         begin
            tap1_next = {din_s,   tap1[63:1]};
            tap2_next = {tap1[0], tap2[63:1]};
            tap3_next = {tap2[0], tap3[63:1]};
         end

      // third stage: first set of adders
      wire a1_s, a2_s;
      serialadd a1(din_s,   tap1[0], a1_s, ctl1, clk);
      serialadd a2(tap2[0], tap3[0], a2_s, ctl1, clk);

      // fourth stage: second adder
      wire a3_s;
      serialadd a3(a1_s, a2_s, a3_s, ctl2, clk);

      // final stage: s/p converter
      sp stage9(a3_s, ctl3, clk, dout);

   endmodule

   module ps(input wire [63:0] a,
             input wire sync,
             input wire clk,
             output wire as);

      reg [63:0] ra;

      always @(posedge clk)
         if (sync)
            ra <= a;
         else
            ra <= {1'b0, ra[63:1]};

      assign as = ra[0];

   endmodule // ps

   module sp(input wire as,
             input wire sync,
             input wire clk,
             output wire [63:0] a);

      reg [63:0] ra;

      always @(posedge clk)
         ra <= {as, ra[63:1]};

      assign a = sync ? ra : 64'b0;

   endmodule // sp

   module serialadd(input wire a,
                    input wire b,
                    output wire s,
                    input wire sync,
                    input wire clk);

      reg carry, q;

      always @(posedge clk)
         if (sync)
            begin
               carry <= a & b;
               q     <= a ^ b;
            end
         else
            begin
               q     <= a ^ b ^ carry;
               carry <= (a & b) | (b & carry) | (carry & a);
            end

      assign s = q;

   endmodule

.. attention:: The delay of this design is equal to the critical path times 64.
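The ``serialadd`` module above reuses a single 1-bit full adder over 64 clock cycles, with the carry flip-flop cleared at ``sync``. The same computation can be sketched in Python (a hypothetical model, not from the repository):

```python
def serial_add(a, b, width=64):
    """Bit-serial addition, LSB first: a single 1-bit full adder is reused
    for `width` clock cycles; the carry flip-flop holds state in between."""
    carry, s = 0, 0                    # carry is cleared at sync
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        total = abit + bbit + carry    # 1-bit full adder
        s |= (total & 1) << i          # sum bit shifted into position i
        carry = total >> 1             # carry saved for the next cycle
    return s                           # the final carry-out is dropped

# serial_add(a, b) equals (a + b) mod 2**64, like the parallel adder
```

Dropping the final carry-out reproduces the wraparound behavior of the 64-bit parallel adder in the reference design, at the cost of 64 clock cycles per addition.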
Note that the latency is higher (64 + 64 + 4) because the result must be converted back to parallel format. The testbench of this design is more complex than the sequentialized design. This is because the outputs are delivered at cycle (4 mod 64) while inputs need to be inserted at cycle (0 mod 64). Due to the extra serial-to-parallel conversion at the output, the DUT produces the result of the previous average computation rather than the current average computation.

* (Line 9) The LATENCY is set at 63 (i.e. 63 cycles longer than a single-cycle design).

* (Line 78) We delay the expected result by one for-loop iteration to make it line up with the bit-serial computation.

* (Line 80) At cycle 4 of the 64-cycle period, the output is valid and captured from the parallel output dout.

* (Line 83) Then we count off the remaining cycles of the 64-cycle period before verifying the expected result.

.. code::
   :number-lines: 1

   `timescale 1ns/1ps

   // The Data Introduction Interval (DII) specifies the number of clock cycles
   // between successive data inputs.
   `define DII 1

   // The Latency specifies the number of clock cycles between the entry of the first
   // input, and the first output.
   `define LATENCY 63

   // The CLOCKPERIOD defines the number of time units in a clock period
   `define CLOCKPERIOD 10

   module movavgtb;

      reg clk, reset;

      always
         begin
            clk = 1'b0;
            #(`CLOCKPERIOD/2);
            clk = 1'b1;
            #(`CLOCKPERIOD/2);
         end

      initial
         begin
            reset = 1'b1;
            #(`CLOCKPERIOD);
            reset = 1'b0;
         end

      reg [63:0]  din, snapdout, expdout, prev_expdout;
      wire [63:0] dout;

      movavg DUT(.clk(clk), .reset(reset), .din(din), .dout(dout));

      reg [63:0] chk_in;
      reg [63:0] chk_tap1;
      reg [63:0] chk_tap2;
      reg [63:0] chk_tap3;

      integer n;

      initial
         begin
            $dumpfile("trace.vcd");
            $dumpvars(0, movavgtb);

            #(`CLOCKPERIOD/2); // reset delay

            chk_in   = 64'b0;
            chk_tap1 = 64'b0;
            chk_tap2 = 64'b0;
            chk_tap3 = 64'b0;

            prev_expdout = 64'd0;

            for (n = 0; n < 1024; n = n + 1)
               begin
                  din[31: 0] = $random;
                  din[63:32] = $random;

                  chk_tap3 = chk_tap2;
                  chk_tap2 = chk_tap1;
                  chk_tap1 = chk_in;
                  chk_in   = din;

                  // the testbench will check the *previous* data output
                  // because it takes 64 cycles (+ latency) before a
                  // bitserial result is converted back to bitparallel
                  prev_expdout = expdout;
                  expdout      = chk_in + chk_tap1 + chk_tap2 + chk_tap3;

                  repeat (4) @(posedge clk);
                  snapdout = dout;

                  repeat (`LATENCY - 4) @(posedge clk);
                  #(`CLOCKPERIOD - 1);

                  $display("%d din %x dout %x exp %x OK %d", n, din, snapdout, prev_expdout, snapdout == prev_expdout);

                  @(posedge clk);
               end
            $finish;
         end

   endmodule

Implementation Flow
-------------------

All of the above designs go through a standard block-level layout flow. We discuss the steps that you would take, along with command-line execution and parameters.

1. ``rtl/``: The RTL directory in each folder holds one or more SystemVerilog (or Verilog) files. The standard convention is to call these files ``module1.sv`` for a module named ``module1``. Further, each (System)Verilog file holds a single module.

2. ``sim/``: The RTL simulation directory uses xcelium to execute an RTL testbench on the design in the ``rtl/`` directory.
   Before any synthesis and layout steps, the first activity is to verify the correctness of your design using RTL simulation.

   .. code::

      make sim    # runs the simulation
      make simgui # runs the simulation in the GUI

3. ``constraints/``: The directory sets the synthesis constraints. At a minimum, this includes identifying the clock signal, clock period, and the input delay and output delay constraints.

4. ``syn/``: The RTL synthesis directory converts your RTL design to a gate-level netlist in a chosen technology. In the Makefile, you can edit important synthesis parameters including

   * ``BASENAME`` Name used for report files, usually corresponds to the top-level module
   * ``CLOCKPERIOD`` Clock period constraint used for the global clk
   * ``TIMINGPATH`` Location of the synthesis .lib files
   * ``TIMINGLIB`` Specific .lib file to use (depending on mode and constraints)
   * ``VERILOG`` A list of (System)Verilog files that are to be synthesized

   The synthesis script relies on a constraint file in the ``sdc/`` directory. The constraint filename is hardcoded in the synthesis script. After synthesis, the following outputs are generated.

   * ``reports/.._qor.rpt``: Quality of results report
   * ``reports/.._timing.rpt``: Post-synthesis timing analysis
   * ``reports/.._power.rpt``: Post-synthesis power estimation
   * ``reports/.._area.rpt``: Post-synthesis area estimation
   * ``outputs/.._netlist.v``: Resulting netlist
   * ``outputs/.._delays.sdf``: Post-synthesis delay file (for gate-level simulation)
   * ``outputs/.._constraints.sdc``: Post-synthesis constraints file

   .. code::

      make syn # runs the synthesis

5. ``sta/``: The static timing analysis on the resulting synthesis output uses Tempus with a standard script that produces the following analysis files. Additional analysis can be performed by modifying the tempus script (application and case dependent).
   * ``late.rpt``: Shows the three slowest paths from a setup timing perspective
   * ``early.rpt``: Shows the three slowest paths from a hold timing perspective
   * ``allpaths.rpt``: Shows the slowest path for each primary starting point in four different groups: reg-reg, input-reg, reg-output, input-output

   .. code::

      make sta # runs STA

6. ``chip/``: This directory contains layout-level constraints on input/output pins and pads.

   * ``chip.io`` specifies the order and location of pins
   * For full-chip design including pads, this directory will also include the pad frame and its connection to the top-level module.

7. ``layout/``: This directory contains the backend layout script and supporting files. The layout is created with a two-step process, starting with synthesis, and followed by backend layout. The synthesis run here is essentially identical to the one of step 4, but it produces a design database that is directly used by the layout tool. The synthesis and layout scripts, run_genus.tcl and run_innovus.tcl, require adaptation depending on the case you are implementing. The following parameters, set in the Makefile, play a role in the process.

   * ``BASENAME`` Name of the top-level module
   * ``CLOCKPERIOD`` Clock period constraint used for the global clk
   * ``TIMINGPATH`` Location of the synthesis .lib files
   * ``TIMINGLIB`` Specific .lib file to use (depending on mode and constraints)
   * ``VERILOG`` A list of (System)Verilog files that are to be synthesized
   * ``LEF`` Layout views of the standard cells, hard macros and pad cells used in the design
   * ``QRC`` Technology file used for parasitic extraction

   .. code::

      make syn    # runs synthesis
      make layout # runs layout

   After the layout is complete, the following design information is produced.
   * ``syndb`` and ``synout`` contain outputs of the synthesis process
   * ``reports/.._qor.rpt`` Quality of results report file
   * ``reports/.._area.rpt`` (Synthesis) Area report
   * ``reports/.._timing.rpt`` (Synthesis) Timing report
   * ``reports/.._power.rpt`` (Synthesis) Power estimation report
   * ``reports/..check_timing_intent.rpt`` presents the timing constraints as understood for this design
   * ``reports/..ccopt_skew_groups.rpt`` reports on the skew in the clock tree
   * ``reports/report_clock_trees.rpt`` describes the clock trees from the layout
   * ``reports/layout_summary.rpt`` describes physical data on the layout
   * ``reports/layout_check_drc.rpt`` describes design rule violations found in the layout
   * ``reports/layout_check_connectivity.rpt`` describes connectivity violations found in the layout
   * ``reports/STA/...`` contains an extensive collection of post-layout timing analysis reports
   * ``out/design_default_rc.spec`` contains post-layout parasitics, used for post-layout STA
   * ``out/design.v`` contains the post-layout netlist
   * ``out/final_route.db`` contains the post-layout database

8. ``glsim/``: This directory performs gate-level simulation using the delay file (sdf) created during layout. Other SDF can be used by modifying the testbench.

   .. code::

      make sim # run simulation

9. ``glsta/``: This directory performs post-layout static timing analysis using the sdf (delay) and spef (parasitics) files created during place and route. This static timing analysis has a higher accuracy than the one from ``sta/``. It produces the same output reports.

   * ``late.rpt``: Shows the three slowest paths from a setup timing perspective
   * ``early.rpt``: Shows the three slowest paths from a hold timing perspective
   * ``allpaths.rpt``: Shows the slowest path for each primary starting point in four different groups: reg-reg, input-reg, reg-output, input-output

   .. code::

      make sta # Perform post-layout timing analysis
attention:: The normal design flow consists of performing steps 1 through 8, as described above. Each of the five designs discussed so far, can be processed along these steps. Performance Metrics ------------------- Reporting the performance of a design requires selecting a set of metrics that will be used to capture the main features and cost factors such as area and performance. There are many possible ways of processing the data produced by the layout construction script. We will use one 'standard' way of collecting metrics, so that we are able to compare results. Post-synthesis Metrics ^^^^^^^^^^^^^^^^^^^^^^ Post synthesis metrics are extracted from step 4 (syn) and step 5 (sta) in the flow. They show a first-order approximation of the area cost and the performance of the design. * **Post-synthesis Area**: The area of the design is the active area of standard cells used to implement the digital block, as found in ``syn/reports/.._report_area.rpt``. For example, the following block uses 2750.364 square micron. .. code:: Instance Module Cell Count Cell Area Net Area Total Area Wireload -------------------------------------------------------------------------- movavg 1054 2750.364 0.000 2750.364 (D) * **Post-synthesis Timing**: The performance of the design is the worst-case slack on the setup time of the design as reported in static timing analysis, in the report ``sta/late.rpt``. For example, the following block has a worst-case slack of -4 picoseconds. .. code:: Path 1: VIOLATED Late External Delay Assertion Endpoint: dout[31] (^) checked with leading edge of 'clk' Beginpoint: tap2_reg[12]/Q (v) triggered by leading edge of 'clk' Path Groups: {clk} Other End Arrival Time 0.000 - External Delay 0.000 + Phase Shift 2.000 = Required Time 2.000 - Arrival Time 2.061 = Slack Time -0.061 Post-layout Metrics ^^^^^^^^^^^^^^^^^^^ Post-layout metrics are extracted from step 6 (layout) from the design. 
They show an improved estimation of the area cost and the performance of the design.

* **Post-layout Area**: The area of the design is the total area of the core
  after place and route, as reported in ``layout/reports/layout_summary.rpt``.
  For example, the total area of the core is 5300.316 square micron. If you
  report the core density, then report the density of the core without the
  physical cells. For example, the core density of the following report is
  71.112% (Core Density #2).

  .. code::

     Total area of Standard cells: 5300.316 um^2
     Total area of Standard cells(Subtracting Physical Cells): 3769.182 um^2
     Total area of Macros: 0.000 um^2
     Total area of Blockages: 0.000 um^2
     Total area of Pad cells: 0.000 um^2
     Total area of Core: 5300.316 um^2
     Total area of Chip: 6878.304 um^2
     Effective Utilization: 1.0000e+00
     Number of Cell Rows: 62
     % Pure Gate Density #1 (Subtracting BLOCKAGES): 100.000%
     % Pure Gate Density #2 (Subtracting BLOCKAGES and Physical Cells): 71.112%
     % Pure Gate Density #3 (Subtracting MACROS): 100.000%
     % Pure Gate Density #4 (Subtracting MACROS and Physical Cells): 71.112%
     % Pure Gate Density #5 (Subtracting MACROS and BLOCKAGES): 100.000%
     % Pure Gate Density #6 (Subtracting MACROS and BLOCKAGES for insts are not placed): 71.112%
     % Core Density (Counting Std Cells and MACROs): 100.000%
     % Core Density #2(Subtracting Physical Cells): 71.112%
     % Chip Density (Counting Std Cells and MACROs and IOs): 77.058%
     % Chip Density #2(Subtracting Physical Cells): 54.798%
     # Macros within 5 sites of IO pad: No
     Macro halo defined?: No

* **Post-layout Timing**: The performance of the design is the Worst Negative
  Slack (WNS) of the design as reported in
  ``layout/reports/STA/..._postRoute.summary.gz``. A .gz file can be printed
  with the ``zcat`` command. For example, in the following design the WNS is
  -10ps. Note that the Cadence tools will also report a positive WNS, to
  indicate a design that meets timing.

  .. code::

     +--------------------+---------+---------+---------+
     |     Setup mode     |   all   | reg2reg | default |
     +--------------------+---------+---------+---------+
     |           WNS (ns):| -0.010  |  1.514  | -0.010  |
     |           TNS (ns):| -0.010  |  0.000  | -0.010  |
     |    Violating Paths:|    1    |    0    |    1    |
     |          All Paths:|   271   |   128   |   263   |
     +--------------------+---------+---------+---------+

Hence, we can report the quality of a block with four numbers. These numbers are
necessarily a strong simplification, but they are sufficient to reasonably
compare apples to apples when you trade one design solution against another. The
following is a summary table for the example we discussed. Note that it is
crucial to also mention the main clock constraint that applies to these results.
In this case, we used a ``CLOCK_PERIOD`` of 2ns, which corresponds to 500MHz. A
single combined metric, the area-delay product, can then be computed by
multiplying the area with the delay, where the delay is (CLOCK_PERIOD - Timing
Slack).

+-----------------+-----------------+-------------------+-----------------+
| Perf at 500MHz  | Area (sqmu)     | Timing (ns)       | Area x Delay    |
+-----------------+-----------------+-------------------+-----------------+
| Post-synthesis  | 2750.364        | -0.061            | 5668            |
+-----------------+-----------------+-------------------+-----------------+
| Post-layout     | 5300.316        | -0.010            | 10654           |
+-----------------+-----------------+-------------------+-----------------+

Analysis for Ref, Small1, Small2, Fast1, Fast2
----------------------------------------------

The following data shows the relevant metrics for each of the five designs,
implemented at four different clock targets. Within a given target, it is
relatively easy to make comparisons. For example, for a 2ns clock period target
(500MHz), the variation in post-layout area is substantial, from 3488 (for the
bitserial design) to 8897 (for the pipelined design). Likewise, the slack shows
considerable variation, from comfortably positive (bitserial) to slightly
negative (for the reference).
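The area-delay product in the summary table can be recomputed directly from the area and slack numbers quoted above. The following Python sketch is only an illustration of that arithmetic, using the values from the reports in this section; it is not part of the tool flow.

```python
# Sketch: recompute the area-delay product from area and worst-case slack.
# Delay is defined as (CLOCK_PERIOD - slack); values are taken from the
# reports quoted above, for a 2ns clock target.

CLOCK_PERIOD = 2.0  # ns


def area_delay(area_sqmu, slack_ns, clock_ns=CLOCK_PERIOD):
    """Area-delay product: area times (clock period minus slack)."""
    return area_sqmu * (clock_ns - slack_ns)


post_syn = area_delay(2750.364, -0.061)  # ~5668 sqmu.ns (post-synthesis)
post_lay = area_delay(5300.316, -0.010)  # ~10654 sqmu.ns (post-layout)
```

Note that a negative slack increases the effective delay, and therefore the area-delay product, beyond what the clock period alone would suggest.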
The area-delay product of this comparison shows that the bitserial design has an
area-delay product that is less than one fifth of that of the pipelined design.
However, that doesn't mean that the bitserial design is the best. What the AxD
column doesn't show is the latency of the design. The reference needs a
single-cycle budget, small1 uses 4 clock cycles, small2 uses 64 cycles, fast1
uses 1/3 of a cycle (because of the two pipeline registers), and fast2 uses 1/2
of a cycle (because of the unfolding).

+--------+--------+---------+---------+-----------+-----------+------------+
|        |        | PostSyn | PostSyn | PostRoute | PostRoute | PostRoute  |
+--------+--------+---------+---------+-----------+-----------+------------+
| Design | Target | Slack   | Area    | Slack     | Area      | A x D      |
+--------+--------+---------+---------+-----------+-----------+------------+
|        |        | ns      | sqmu    | ns        | sqmu      | ns.sqmu    |
+--------+--------+---------+---------+-----------+-----------+------------+
| ref    | 1.50   | -0.026  | 3161.7  | -0.219    | 11979.6   | 37,877,886 |
+--------+--------+---------+---------+-----------+-----------+------------+
| small1 | 1.50   | -0.042  | 5152.2  | 0.006     | 16070.6   | 82,799,228 |
+--------+--------+---------+---------+-----------+-----------+------------+
| small2 | 1.50   | 0.429   | 2349.9  | 0.336     | 3500.0    | 8,223,477  |
+--------+--------+---------+---------+-----------+-----------+------------+
| fast1  | 1.50   | -0.047  | 5559.6  | 0.003     | 11656.4   | 64,804,249 |
+--------+--------+---------+---------+-----------+-----------+------------+
| fast2  | 1.50   | -0.048  | 5165.4  | -0.285    | 14411.9   | 74,447,434 |
+--------+--------+---------+---------+-----------+-----------+------------+
| ref    | 1.75   | -0.023  | 2803.0  | -0.057    | 10690.9   | 29,967,600 |
+--------+--------+---------+---------+-----------+-----------+------------+
| small1 | 1.75   | -0.024  | 4957.3  | 0.005     | 10752.5   | 53,303,108 |
+--------+--------+---------+---------+-----------+-----------+------------+
| small2 | 1.75   | 0.651   | 2348.5  | 0.676     | 3488.4    | 8,190,198  |
+--------+--------+---------+---------+-----------+-----------+------------+
| fast1  | 1.75   | -0.025  | 5499.4  | 0.002     | 9766.5    | 53,710,111 |
+--------+--------+---------+---------+-----------+-----------+------------+
| fast2  | 1.75   | -0.028  | 4283.2  | -0.052    | 14483.7   | 62,037,453 |
+--------+--------+---------+---------+-----------+-----------+------------+
| ref    | 2.00   | -0.061  | 2750.4  | -0.010    | 5300.3    | 14,577,851 |
+--------+--------+---------+---------+-----------+-----------+------------+
| small1 | 2.00   | -0.019  | 4897.8  | 0.026     | 9028.8    | 44,220,859 |
+--------+--------+---------+---------+-----------+-----------+------------+
| small2 | 2.00   | 0.944   | 2348.2  | 0.836     | 3488.4    | 8,188,447  |
+--------+--------+---------+---------+-----------+-----------+------------+
| fast1  | 2.00   | -0.040  | 5159.8  | 0.002     | 8897.1    | 45,906,984 |
+--------+--------+---------+---------+-----------+-----------+------------+
| fast2  | 2.00   | -0.052  | 3879.0  | -0.008    | 8790.8    | 34,099,741 |
+--------+--------+---------+---------+-----------+-----------+------------+
| ref    | 2.25   | -0.017  | 2723.3  | 0.018     | 4601.6    | 12,531,693 |
+--------+--------+---------+---------+-----------+-----------+------------+
| small1 | 2.25   | 0.009   | 3212.1  | 0.016     | 8265.5    | 26,549,041 |
+--------+--------+---------+---------+-----------+-----------+------------+
| small2 | 2.25   | 1.194   | 2348.2  | 1.189     | 3500.0    | 8,214,506  |
+--------+--------+---------+---------+-----------+-----------+------------+
| fast1  | 2.25   | -0.041  | 4965.5  | 0.010     | 8428.6    | 41,852,063 |
+--------+--------+---------+---------+-----------+-----------+------------+
| fast2  | 2.25   | 0.000   | 3651.2  | 0.016     | 7155.7    | 26,126,596 |
+--------+--------+---------+---------+-----------+-----------+------------+

Hence, if we evaluate the design efficiency on the actual throughput achieved,
the AxD number has to be multiplied by the appropriate cycle latency of each
design. The following table lists the designs sorted from smallest AxDxLatency
(best) to largest (worst).
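The ranking in that table is purely mechanical: multiply each AxD value by the cycle latency of the design and sort. The following Python sketch illustrates this for the 2.25ns rows only; the AxD values and latencies are taken from the tables in this section.

```python
# Sketch: rank designs by AxD x latency (smaller is better).
# AxD values are the 2.25ns rows of the table above; the cycle latencies
# are 1 (ref), 4 (small1), 64 (small2), 1/3 (fast1), 1/2 (fast2).

designs = {
    "ref":    (12_531_693, 1.0),
    "small1": (26_549_041, 4.0),
    "small2": (8_214_506, 64.0),
    "fast1":  (41_852_063, 1 / 3),
    "fast2":  (26_126_596, 0.5),
}

# Sort design names by the AxDxLatency product.
ranking = sorted(designs, key=lambda d: designs[d][0] * designs[d][1])
print(ranking)  # ref comes first, small2 comes last
```

Even for this single clock target, the ranking already shows the pattern discussed below: the reference and the fast designs lead, while the multiplexed designs trail.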
There are some interesting insights available from the table.

1. The least constrained designs (with the highest clock period target) are the
   best ones in terms of AxDxLatency. Remember what AxDxLatency really
   reflects: it reflects how efficiently the gates of the design are used to
   achieve the purpose of the application. When designs are highly multiplexed
   (such as small1 and small2), or when the design implementation is highly
   stressed (such as with a small target clock period), then additional gates
   are required to achieve the same goal. For multiplexed designs, the reuse of
   computational gates requires additional multiplexers and registers. For
   highly stressed designs, the additional performance requirements become
   especially demanding on the hardware needed. Either of these effects results
   in an inferior AxDxLatency.

2. The design that appeared outstanding when ignoring the latency (i.e. small2)
   is actually the worst of all. Bitserial designs are not very efficient from
   a gate-utilization perspective. True, the resulting design has the smallest
   possible area of all designs, but this comes at an enormous performance
   overhead. Moreover, a register cell is significantly larger than a logic
   cell such as a NAND2. If we tried to multiplex two NAND2 gates into a single
   NAND2 and a register, the resulting design would actually grow in size. The
   table does not show the power metrics, but you can reasonably expect that
   highly multiplexed designs are rich in additional bit transitions, which
   cause the power to significantly increase.
+--------+---------+----------------+---------+------------------+
| Target | Design  | AxD            | Latency | AxDxLatency      |
+--------+---------+----------------+---------+------------------+
| 2.25   | ref     | 12,531,693     | 1.000   | 12,531,693       |
+--------+---------+----------------+---------+------------------+
| 2.25   | fast2   | 26,126,595     | 0.500   | 13,063,297       |
+--------+---------+----------------+---------+------------------+
| 2.25   | fast1   | 41,852,062     | 0.333   | 13,950,686       |
+--------+---------+----------------+---------+------------------+
| 2.00   | ref     | 14,577,851     | 1.000   | 14,577,851       |
+--------+---------+----------------+---------+------------------+
| 2.00   | fast1   | 45,906,984     | 0.333   | 15,302,326       |
+--------+---------+----------------+---------+------------------+
| 2.00   | fast2   | 34,099,740     | 0.500   | 17,049,870       |
+--------+---------+----------------+---------+------------------+
| 1.75   | fast1   | 53,710,111     | 0.333   | 17,903,368       |
+--------+---------+----------------+---------+------------------+
| 1.50   | fast1   | 64,804,249     | 0.333   | 21,601,414       |
+--------+---------+----------------+---------+------------------+
| 1.75   | ref     | 29,967,600     | 1.000   | 29,967,600       |
+--------+---------+----------------+---------+------------------+
| 1.75   | fast2   | 62,037,452     | 0.500   | 31,018,726       |
+--------+---------+----------------+---------+------------------+
| 1.50   | fast2   | 74,447,434     | 0.500   | 37,223,717       |
+--------+---------+----------------+---------+------------------+
| 1.50   | ref     | 37,877,885     | 1.000   | 37,877,885       |
+--------+---------+----------------+---------+------------------+
| 2.25   | small1  | 26,549,041     | 4.000   | 106,196,165      |
+--------+---------+----------------+---------+------------------+
| 2.00   | small1  | 44,220,859     | 4.000   | 176,883,437      |
+--------+---------+----------------+---------+------------------+
| 1.75   | small1  | 53,303,107     | 4.000   | 213,212,431      |
+--------+---------+----------------+---------+------------------+
| 1.50   | small1  | 82,799,227     | 4.000   | 331,196,911      |
+--------+---------+----------------+---------+------------------+
| 2.00   | small2  | 8,188,446      | 64.000  | 524,060,601      |
+--------+---------+----------------+---------+------------------+
| 1.75   | small2  | 8,190,198      | 64.000  | 524,172,677      |
+--------+---------+----------------+---------+------------------+
| 2.25   | small2  | 8,214,506      | 64.000  | 525,728,397      |
+--------+---------+----------------+---------+------------------+
| 1.50   | small2  | 8,223,476      | 64.000  | 526,302,514      |
+--------+---------+----------------+---------+------------------+

Layout Analysis for Ref @ T=2ns
-------------------------------

The layout itself is also an excellent source for additional analysis. The
following figures illustrate some interesting aspects for a single design, the
reference design at 500MHz (2ns clock).

.. figure:: img/reflayout1.png
   :figwidth: 600px
   :align: center

   The full layout, showing 64 input bits coming in from the left, and 64
   output bits produced on the right. Clock and reset ports are on top.

.. figure:: img/reflayout2.png
   :figwidth: 600px
   :align: center

   A view of the standard cells only shows that the design has a relatively
   high utilization, with an empty area at the top. This could be optimized
   with a modified aspect ratio, or a re-arrangement of the I/O ports.

.. figure:: img/reflayout3.png
   :figwidth: 600px
   :align: center

   This shows the clock tree. At the leaf position of each branch is a
   flip-flop.

.. figure:: img/reflayout4.png
   :figwidth: 600px
   :align: center

   The clock tree is remarkably well balanced. The y-axis in this graph shows
   clock delay, and the horizontal position enumerates flip-flops. Almost every
   flip-flop falls within a window of 0.05ns.

.. figure:: img/reflayout5.png
   :figwidth: 600px
   :align: center

   The layout uses a standard power grid, and no effort was made to create a
   mesh structure. This is very likely to cause problems for the standard cells
   in the middle of a row, which are too far from the power ring around the
   core.
Hence, additional work is needed to design the power network of this layout.
Separate analysis (signal integrity tools) can be used to determine whether a
power grid was well designed.

.. figure:: img/reflayout6.png
   :figwidth: 600px
   :align: center

   A pin density map shows which areas of the layout require the most
   connections. A high pin density correlates with routing congestion, as many
   wires need to converge in a single area of the chip.
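The utilization observed in these layout views can be cross-checked against the numbers in ``layout_summary.rpt``. A minimal Python sketch, using the core density definition from the metrics section (standard-cell area minus physical cells, divided by total core area) and the report values quoted earlier for the reference design at a 2ns target:

```python
# Sketch: recompute Core Density #2 from layout_summary.rpt values
# quoted earlier (reference design, 2ns clock target).

stdcell_minus_physical = 3769.182  # um^2, standard cells minus physical cells
core_area = 5300.316               # um^2, total area of the core

core_density_2 = 100.0 * stdcell_minus_physical / core_area
print(f"{core_density_2:.3f}%")  # prints "71.112%"
```

A check like this is useful when comparing layouts generated with different floorplans, since the raw core area alone does not reveal how much of it is actually filled with logic.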