Attention

This document was last updated Nov 25 24 at 21:59

Optimizing Area and Timing

Important

The purpose of this lecture is as follows.

To explore the impact of optimization on design cost factors (gate count, performance, layout area) for a concrete design
To apply two techniques that help in building high-speed hardware
To apply two techniques that help in building low-area hardware
To discuss Verilog coding and testbench design for all these cases

Important

The designs discussed in this lecture are on https://github.com/wpi-ece574-f24/ex-layout-avg64

Moving Average Application

We will study a concrete design consisting of a moving-average filter application with a 64-bit wordlength. The filter adds up the last 4 samples that have been entered at the din input. When two 64-bit numbers are added, the carry-out bit is thrown out, and only the lower 64 bits are kept.

This implementation exhibits the following characteristics.

The inputs and outputs are produced at a rate of 1 data element per clock cycle. Thus, the data introduction interval of the reference implementation is one clock cycle.
There is a combinational path that runs from input to output through three 64-bit additions. The latency of the implementation, counted in clock cycles, is 0 cycles. That is, if the input is applied at $t=0$ (ns), then the output will be ready at $t=T_{add}$ (ns), with $T_{add}$ the time of the combinational path through three 64-bit additions. Obviously, to avoid a timing path violation on a clock period of $T_c$ , we must ensure that $T_{add} < T_c$ . If the input delay constraint on din is $T_{din}$ , and the output delay constraint on dout is $T_{dout}$ , then we must ensure that $T_{add} + T_{din} + T_{dout} < T_c$ .
The input is captured into a register which forms part of a register delay chain. The combinational delay along the chain is very small, and is determined by the clock-to-Q delay of the register. Therefore, it’s reasonable to assume that the slowest path in the overall design will be dominated by the three sequential additions. Nevertheless, one must keep in mind that the chain of 64-bit adders allows considerable parallelism, as the adder chain sums up four independent numbers which are all available at the start of the cycle (plus the input delay for din, plus the clock-to-Q delay for the registers).

module movavg(
    input logic clk,
    input logic reset,
    input logic [63:0] din,
    output logic [63:0] dout
);

   logic [63:0]         tap1, tap1_next;
   logic [63:0]         tap2, tap2_next;
   logic [63:0]         tap3, tap3_next;

   always_ff @(posedge clk) begin
      if (reset) begin
         tap1 <= 64'h0;
         tap2 <= 64'h0;
         tap3 <= 64'h0;
      end else begin
         tap1 <= tap1_next;
         tap2 <= tap2_next;
         tap3 <= tap3_next;
      end
   end

   logic [63:0] doutreg;

   always_comb begin
      doutreg   = din + tap1 + tap2 + tap3;
      tap1_next = din;
      tap2_next = tap1;
      tap3_next = tap2;
   end

   assign dout = doutreg;

endmodule
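The parallelism noted above can be made explicit in the RTL. The following is a minimal sketch and not part of the course repository; the module name movavg_tree and the signals sum_lo and sum_hi are illustrative assumptions. It groups the four operands as a balanced tree, so the longest combinational path crosses two adders instead of three. Because 64-bit addition with a discarded carry-out is associative, the output is identical to that of the reference design.

```systemverilog
// Hypothetical variant of the reference design (movavg_tree is an illustrative name).
module movavg_tree(
    input  logic        clk,
    input  logic        reset,
    input  logic [63:0] din,
    output logic [63:0] dout
);
   logic [63:0] tap1, tap2, tap3;

   // Same three-register delay line as the reference design.
   always_ff @(posedge clk) begin
      if (reset) begin
         tap1 <= 64'h0;
         tap2 <= 64'h0;
         tap3 <= 64'h0;
      end else begin
         tap1 <= din;
         tap2 <= tap1;
         tap3 <= tap2;
      end
   end

   // Two independent additions that can be evaluated side by side.
   logic [63:0] sum_lo, sum_hi;

   always_comb begin
      sum_lo = din  + tap1;
      sum_hi = tap2 + tap3;
   end

   // One final addition combines the partial sums.
   assign dout = sum_lo + sum_hi;

endmodule
```

A synthesis tool may perform this restructuring on its own, so any timing benefit should be confirmed in the timing report rather than assumed.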

The correctness of the design is evaluated using a testbench that feeds in random values. The testbench performs a parallel computation while feeding random values, so that the correctness of the module implementation is verified.

In the following testbench, pay special attention to the timing of the testbench. Since the DUT has no data-ready signal, the testbench is sensitive to the time when the output DUT.dout is read.

(line 7) We apply a 10ns clock. At this point, we are looking for functional timing verification, so the period is an arbitrary but convenient number.
(line 33) We use a one-cycle reset sequence and start applying inputs from cycle 2. The reference values chk_tap1, chk_tap2, chk_tap3 are reset simultaneously with the module.
(Line 42) Each clock cycle, a new vector is applied. The input is applied to the DUT with a tiny delay after the clock edge. This ensures that the inputs are not changing exactly at the clock edge. While an RTL simulation suffers no hold timing effects, non-deterministic behavior may still occur when multiple variables change at the same simulation timestamp. In this case, we have to make sure that the registers in the DUT are updated before din changes value.
(Line 54) We assert the outputs at the end of the clock period, just before the clock edge.
(Line 60) We update the delay line after the output assertion is complete. Then, we await the next clock edge and move to the next input test vector.

`timescale 1ns/1ps
`define CLOCKPERIOD 10

module tb;
   logic clk, reset;

   always
     begin
       clk = 1'b0;
       #(`CLOCKPERIOD/2);
       clk = 1'b1;
       #(`CLOCKPERIOD/2);
     end

   logic [63:0] din;
   logic [63:0] dout;

   movavg DUT(.clk(clk),
              .reset(reset),
              .din(din),
              .dout(dout)
              );

   logic [63:0] chk_tap3;
   logic [63:0] chk_tap2;
   logic [63:0] chk_tap1;
   logic [63:0] chk_din;
   logic [63:0] chk_dout;

   initial
     begin
        reset = 1'b1;
        @(negedge clk);
        reset = 1'b0;
        chk_tap1 = 0;
        chk_tap2 = 0;
        chk_tap3 = 0;
        din      = 0;

        @(posedge clk);

        while (1)
          begin

             #1; // making sure we assign new inputs just after the clock edge

             din[31:0]  = $random;
             din[63:32] = $random;

             chk_din = din;

             #(`CLOCKPERIOD - 1);

             chk_dout = chk_din + chk_tap3 + chk_tap2 + chk_tap1;

             $display("%t out %h expected %h OK %d", $time, dout, chk_dout, dout == chk_dout);
             // $display("%t DUT in %h taps %h %h %h",  $time, DUT.din, DUT.tap1, DUT.tap2, DUT.tap3);
             // $display("%t CHK in %h taps %h %h %h",  $time, chk_din, chk_tap1, chk_tap2, chk_tap3);

             chk_tap3 = chk_tap2;
             chk_tap2 = chk_tap1;
             chk_tap1 = chk_din;
             @(posedge clk);

          end
     end

   initial
     begin
        repeat(256)
          @(posedge clk);
        $finish;
     end

endmodule
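The testbench above prints an OK flag for every vector, which then has to be inspected by hand. As a sketch of one possible alternative, the following variant counts mismatches and prints a single verdict at the end. The names tb_selfcheck and errors are assumptions, and the sketch assumes that the reference movavg module is compiled into the same simulation. It keeps the same timing discipline: inputs change 1 ns after the clock edge and outputs are checked 1 ns before the next edge.

```systemverilog
`timescale 1ns/1ps

// Sketch of a self-checking testbench (tb_selfcheck is an illustrative name).
// Assumes the reference movavg module is compiled alongside this file.
module tb_selfcheck;
   logic        clk = 1'b0;
   logic        reset;
   logic [63:0] din, dout;

   always #5 clk = ~clk;              // 10 ns clock period

   movavg DUT(.clk(clk), .reset(reset), .din(din), .dout(dout));

   logic [63:0] chk_tap1, chk_tap2, chk_tap3, chk_dout;
   int          errors = 0;

   initial begin
      reset    = 1'b1;
      din      = '0;
      chk_tap1 = '0;
      chk_tap2 = '0;
      chk_tap3 = '0;
      @(posedge clk);                 // one active edge with reset asserted
      #1 reset = 1'b0;

      repeat (256) begin
         din = {$urandom(), $urandom()};   // applied 1 ns after the clock edge
         #8;                               // 1 ns before the next clock edge
         chk_dout = din + chk_tap1 + chk_tap2 + chk_tap3;
         if (dout !== chk_dout) begin      // !== also flags X and Z values
            errors++;
            $error("dout %h, expected %h", dout, chk_dout);
         end
         chk_tap3 = chk_tap2;              // update the reference delay line
         chk_tap2 = chk_tap1;
         chk_tap1 = din;
         @(posedge clk);
         #1;
      end

      if (errors == 0) $display("PASS");
      else             $display("FAIL: %0d mismatches", errors);
      $finish;
   end
endmodule
```

The case-inequality operator !== is used so that an undefined output is counted as a failure instead of producing an X comparison result.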

FAST design strategy #1: Pipelining

The first design strategy to improve performance of the reference design is to introduce pipeline registers. Because the design does not have a latency requirement, an arbitrary number of pipeline registers can be inserted.

The main challenge for pipelining is to ensure that pipeline registers are inserted consistently, i.e. according to the rules of retiming as discussed in our lecture on Timing. Because the reference design has no feedback loops, the insertion of pipeline registers can be done by leveling of the circuit graph. The expected critical path of the design runs through three sequential adders. Therefore, these adders are isolated by two pipeline registers.

We do not insert a pipeline register at the input (din) or the output (dout) because we are assuming that the input delay and the output delay are both 0. If a different constraint were chosen, an additional pipeline register at din or dout might be needed.

Furthermore, we do not insert a pipeline register within the adder, although it is perfectly feasible to pipeline the adder itself as well.
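As an illustration of what pipelining the adder itself could look like, here is a minimal sketch that splits a 64-bit addition into two 32-bit halves. The module name add64_pipe and its ports are hypothetical and do not appear in the course repository. The low half and its carry are registered, and the high half is completed in the following cycle, so the adder gains one cycle of latency while its carry chain is roughly halved.

```systemverilog
// Hypothetical pipelined 64-bit adder (add64_pipe is an illustrative name).
// One pipeline register stage: the sum appears one clock cycle after a and b.
module add64_pipe(
    input  logic        clk,
    input  logic [63:0] a,
    input  logic [63:0] b,
    output logic [63:0] sum
);
   logic [32:0] lo_q;             // low-half sum; bit 32 holds the carry-out
   logic [31:0] a_hi_q, b_hi_q;   // high-half operands, delayed by one cycle

   always_ff @(posedge clk) begin
      lo_q   <= {1'b0, a[31:0]} + {1'b0, b[31:0]};
      a_hi_q <= a[63:32];
      b_hi_q <= b[63:32];
   end

   // Second stage: add the high halves plus the registered carry.
   // The carry-out of bit 63 is discarded, as in the reference design.
   assign sum = {a_hi_q + b_hi_q + lo_q[32], lo_q[31:0]};

endmodule
```

The adder still accepts new operands every clock cycle. Using it inside the averager would require delaying every parallel path by the same extra cycle, in line with the retiming rules.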

In the overall pipelined design, notice how the insertion of a pipeline level implies the insertion of multiple 64-bit registers. For example, the first pipeline level leads to three registers: pipe1a at the output of the first adder, pipe1b at the input of the second adder, and pipe1c at the output of the third tap.

We can thus rewrite the Verilog description while inserting these additional registers in the computation flow.

`define WL 63
`define WL1 64

module movavg(input wire         clk,
              input wire         reset,
              input wire [`WL:0]  din,
              output wire [`WL:0] dout);

   reg [`WL:0]                    tap1, tap1_next;
   reg [`WL:0]                    tap2, tap2_next;
   reg [`WL:0]                    tap3, tap3_next;

   reg [`WL:0]         pipe1a, pipe1a_next;
   reg [`WL:0]         pipe1b, pipe1b_next;
   reg [`WL:0]         pipe1c, pipe1c_next;
   reg [`WL:0]         pipe2a, pipe2a_next;
   reg [`WL:0]         pipe2b, pipe2b_next;

   always @(posedge clk)
     begin
        if (reset)
          begin
             tap1 <= `WL1'h0;
             tap2 <= `WL1'h0;
             tap3 <= `WL1'h0;
             pipe1a <= `WL1'h0;
             pipe1b <= `WL1'h0;
             pipe1c <= `WL1'h0;
             pipe2a <= `WL1'h0;
             pipe2b <= `WL1'h0;
          end
        else
          begin
             tap1   <= tap1_next;
             tap2   <= tap2_next;
             tap3   <= tap3_next;
             pipe1a <= pipe1a_next;
             pipe1b <= pipe1b_next;
             pipe1c <= pipe1c_next;
             pipe2a <= pipe2a_next;
             pipe2b <= pipe2b_next;
          end
     end // always @ (posedge clk)

   reg [`WL:0] doutreg;

   always @(*)
     begin
        pipe1a_next = din + tap1;
        pipe1b_next = tap2;
        pipe1c_next = tap3;
        pipe2a_next = pipe1a + pipe1b;
        pipe2b_next = pipe1c;
        doutreg     = pipe2a + pipe2b;
        tap1_next   = din;
        tap2_next   = tap1;
        tap3_next   = tap2;
     end

   assign dout = doutreg;

endmodule

Attention

The delay per result of the pipelined design will be given by its critical path (now expected to run through a single 64-bit addition), since a new result is computed every clock cycle.

For a pipelined design, we have to adjust the testbench so that it deals with the increased latency of the design. With two pipeline registers, the result is available two clock cycles later. Additionally, the testbench has to take into account that a pipelined design will complete several sums at the same time.

(Line 57) The testbench keeps track of the two previous results, pp_chk_dout and p_chk_dout, while computing the current result, chk_dout.
(Line 62) The result is tested from the first cycle. This implies that we can expect the first two comparisons to fail, because the expected value is still undefined at that point. So, we would check OK only from the third result produced.

`timescale 1ns/1ps
`define CLOCKPERIOD 10

module tb;
   logic clk, reset;

   always
     begin
       clk = 1'b0;
       #(`CLOCKPERIOD/2);
       clk = 1'b1;
       #(`CLOCKPERIOD/2);
     end

   logic [63:0] din;
   logic [63:0] dout;

   movavg DUT(.clk(clk),
              .reset(reset),
              .din(din),
              .dout(dout)
              );

   logic [63:0] chk_tap3;
   logic [63:0] chk_tap2;
   logic [63:0] chk_tap1;
   logic [63:0] chk_din;
   logic [63:0] chk_dout;
   logic [63:0] p_chk_dout;
   logic [63:0] pp_chk_dout;

   initial
     begin
        reset = 1'b1;
        @(negedge clk);
        reset = 1'b0;
        chk_tap1 = 0;
        chk_tap2 = 0;
        chk_tap3 = 0;
        din      = 0;

        @(posedge clk);

        while (1)
          begin

             #1; // making sure we assign new inputs just after the clock edge

             din[31:0]  = $random;
             din[63:32] = $random;

             chk_din = din;

             #(`CLOCKPERIOD - 1);

             // This delay ensures the tested output matches the pipelined output (2 pipe stages)
             pp_chk_dout = p_chk_dout;
             p_chk_dout = chk_dout;
             chk_dout = chk_din + chk_tap3 + chk_tap2 + chk_tap1;

             $display("%t out %h expected %h OK %d", $time, dout, pp_chk_dout, dout == pp_chk_dout);
             $display("%t CHK in %h taps %h %h %h -> %h",  $time, chk_din, chk_tap1, chk_tap2, chk_tap3, chk_dout);

             chk_tap3 = chk_tap2;
             chk_tap2 = chk_tap1;
             chk_tap1 = chk_din;
             @(posedge clk);

          end
     end

   initial
     begin
        repeat(256)
          @(posedge clk);
        $finish;
     end

endmodule

Here are the first few lines of the testbench output demonstrating the latency effect. Note that the third result b9594e719d53863a, i.e. the output after three inputs, is only checked when the fifth input is read in at timestamp 65000.

out 0000000000000000 expected xxxxxxxxxxxxxxxx OK x
CHK in c0895e8112153524 taps 0000000000000000 0000000000000000 0000000000000000 -> c0895e8112153524
out 0000000000000000 expected xxxxxxxxxxxxxxxx OK x
CHK in b1f056638484d609 taps c0895e8112153524 0000000000000000 0000000000000000 -> 7279b4e4969a0b2d
out c0895e8112153524 expected c0895e8112153524 OK 1
CHK in 46df998d06b97b0d taps b1f056638484d609 c0895e8112153524 0000000000000000 -> b9594e719d53863a
out 7279b4e4969a0b2d expected 7279b4e4969a0b2d OK 1
CHK in 89375212b2c28465 taps 46df998d06b97b0d b1f056638484d609 c0895e8112153524 -> 4290a08450160a9f
out b9594e719d53863a expected b9594e719d53863a OK 1
CHK in 06d7cd0d00f3e301 taps 89375212b2c28465 46df998d06b97b0d b1f056638484d609 -> 88df0f103ef4b87c

FAST design strategy #2: Unfolding

An alternate strategy to improve design performance is to exploit the parallelism of hardware and remove the input/output constraint of a single data item per cycle. This strategy is called unfolding. Unfolding is more than simply parallelizing. For example, simply doubling the averager is not a correct implementation of the averaging algorithm, because that implementation would compute the average on two independent streams. In unfolding, we perform computations on a single stream of data items which are delivered in groups of n data items at a time.

The following example illustrates this concept. A simple counter circuit counts at a rate of one per clock cycle. The 2x unfolded version of this counter will produce two outputs per clock cycle, n and n+1. We achieve this by doubling the combinational hardware. However, the state variables of the unfolded circuit are identical to those of the original circuit: there is still one single register.
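The counter example can be written down as a minimal sketch. The module names counter and counter_unfold2, the port names, and the 8-bit width are all assumptions chosen for illustration. The unfolded version contains two incrementers but still only one state register, and it delivers the same stream of values two at a time.

```systemverilog
// Original counter: one value per clock cycle (0, 1, 2, 3, ...).
module counter(
    input  logic       clk,
    input  logic       reset,
    output logic [7:0] n
);
   always_ff @(posedge clk)
     if (reset) n <= 8'd0;
     else       n <= n + 8'd1;
endmodule

// 2x unfolded counter: two values per clock cycle ((0,1), (2,3), ...).
module counter_unfold2(
    input  logic       clk,
    input  logic       reset,
    output logic [7:0] nA,      // first value of the pair
    output logic [7:0] nB       // second value of the pair
);
   logic [7:0] state;           // still one single state register
   logic [7:0] inc1, inc2;

   always_comb begin
      inc1 = state + 8'd1;      // first incrementer
      inc2 = inc1  + 8'd1;      // second incrementer
   end

   always_ff @(posedge clk)
     if (reset) state <= 8'd0;
     else       state <= inc2;  // advance by two positions in the stream

   assign nA = state;
   assign nB = inc1;
endmodule
```

Note that the two incrementers are chained, so the combinational path of the unfolded counter is longer than that of the original.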

The unfolding transformation is general, and applies to any circuit. For example, assume we have a counter circuit with two back-to-back registers. If the initial state of the registers is zero, then the sequence of values produced at the output will be 1, 1, 2, 2, 3, 3, 4, 4, and so forth. The 2x unfolded version of this circuit is shown below that. It uses two incrementer circuits, and the state variables of the original circuit are distributed over the unfolded circuit. It’s easy to see that this circuit produces the same outputs as the original circuit, but in pairs of two: (1,1), (2,2), (3,3), ...

It is possible to derive systematic unfolding rules for arbitrary circuits, but since the averager is sufficiently small and simple, we can attempt to derive a 2x unfolded version by hand. The unfolded version of the averager reads in pairs of data (dinA, dinB) and produces pairs of output data (doutA, doutB). Each output is the sum of the last four data items in the stream. For example, if the stream consists of the tuples (dinA1, dinB1), (dinA2, dinB2), (dinA3, dinB3) (with dinA1 the most recent), then the outputs are defined by (doutA = dinA1 + dinB1 + dinA2 + dinB2) and (doutB = dinB1 + dinA2 + dinB2 + dinA3). Graphically, this leads to the unfolded circuit we aim to create. Notice that the number of state variables is the same as the original circuit (3 registers) while the logic has doubled.

module movavg(input wire         clk,
              input wire     reset,
              input wire [63:0]  dinA,
              input wire [63:0]  dinB,
              output wire [63:0] doutA,
              output wire [63:0] doutB);

   reg [63:0]             tap1, tap1_next;
   reg [63:0]             tap2, tap2_next;
   reg [63:0]             tap3, tap3_next;

   always @(posedge clk)
     begin
    if (reset)
      begin
         tap1 <= 64'h0;
         tap2 <= 64'h0;
         tap3 <= 64'h0;
      end
    else
      begin
         tap1 <= tap1_next;
         tap2 <= tap2_next;
         tap3 <= tap3_next;
      end
     end // always @ (posedge clk)

   reg [63:0] doutregA;
   reg [63:0] doutregB;

   always @(*)
     begin
       doutregA   = dinA + dinB + tap1 + tap2;
       doutregB   = dinB + tap1 + tap2 + tap3;
       tap1_next  = dinA;
       tap2_next  = dinB;
       tap3_next  = tap1;
     end

   assign doutA = doutregA;
   assign doutB = doutregB;

endmodule

Attention

The delay of the unfolded design will be given by the critical path divided by the unfolding factor (2). Each clock cycle, two new results are computed.

The testbench of this design will have to reflect the double-rate nature of the hardware, and hence verify two outputs per iteration. The latency of the design is still 0 clock cycles, as with the original design. The testbench verification does not implement the unfolded design, but rather implements the original design working at twice the speed. That is, each loop iteration, two new items are inserted into the delay line and two outputs are computed.

(line 61-62): the first (oldest) element is inserted into the delay line
(line 66-67): Two consecutive outputs of the moving averager are computed

`timescale 1ns/1ps
`define CLOCKPERIOD 10

module tb;
   logic clk, reset;

   always
     begin
       clk = 1'b0;
       #(`CLOCKPERIOD/2);
       clk = 1'b1;
       #(`CLOCKPERIOD/2);
     end

   logic [63:0] dinA;
   logic [63:0] doutA;
   logic [63:0] dinB;
   logic [63:0] doutB;

   movavg DUT(.clk(clk),
              .reset(reset),
              .dinA(dinA),
              .dinB(dinB),
              .doutA(doutA),
              .doutB(doutB)
              );

   logic [63:0] chk_tap4;
   logic [63:0] chk_tap3;
   logic [63:0] chk_tap2;
   logic [63:0] chk_tap1;
   logic [63:0] chk_dinA;
   logic [63:0] chk_doutA;
   logic [63:0] chk_dinB;
   logic [63:0] chk_doutB;

   initial
     begin
        reset = 1'b1;
        @(negedge clk);
        reset = 1'b0;
        chk_tap1 = 0;
        chk_tap2 = 0;
        chk_tap3 = 0;
        dinA     = 0;
        dinB     = 0;

        @(posedge clk);

        while (1)
          begin

             #1; // making sure we assign new inputs just after the clock edge

             dinA[31:0]  = $random;
             dinA[63:32] = $random;

             dinB[31:0]  = $random;
             dinB[63:32] = $random;

             chk_dinB = dinB;
             chk_dinA = dinA;

             #(`CLOCKPERIOD - 1);

             chk_doutA = chk_dinA + chk_dinB + chk_tap1 + chk_tap2;
             chk_doutB = chk_dinB + chk_tap1 + chk_tap2 + chk_tap3;

             $display("A %t out %h expected %h OK %d", $time, doutA, chk_doutA, doutA == chk_doutA);
             $display("B %t out %h expected %h OK %d", $time, doutB, chk_doutB, doutB == chk_doutB);
             // $display("%t DUT in %h taps %h %h %h",  $time, DUT.din, DUT.tap1, DUT.tap2, DUT.tap3);
             // $display("%t CHK in %h taps %h %h %h",  $time, chk_din, chk_tap1, chk_tap2, chk_tap3);

             chk_tap3 = chk_tap1;
             chk_tap2 = chk_dinB;
             chk_tap1 = chk_dinA;

             @(posedge clk);

          end
     end

   initial
     begin
        repeat(256)
          @(posedge clk);
        $finish;
     end

endmodule
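The idea of checking the unfolded hardware against the original design running at twice the speed can be made very literal. The following sketch keeps the three-tap delay line of the original averager and steps it twice per clock cycle, first with dinB (the older sample of the pair) and then with dinA. The names tb_unfold_ref, ref_step, ref_tap1 to ref_tap3, expA and expB are assumptions, and the sketch assumes that the 2x unfolded movavg module is compiled into the same simulation.

```systemverilog
`timescale 1ns/1ps

// Sketch: checks the 2x unfolded movavg against the single-stream recurrence,
// stepped twice per clock cycle (tb_unfold_ref is an illustrative name).
module tb_unfold_ref;
   logic        clk = 1'b0;
   logic        reset;
   logic [63:0] dinA, dinB, doutA, doutB;

   always #5 clk = ~clk;              // 10 ns clock period

   movavg DUT(.clk(clk), .reset(reset), .dinA(dinA), .dinB(dinB),
              .doutA(doutA), .doutB(doutB));

   // Delay line of the original (non-unfolded) averager.
   logic [63:0] ref_tap1, ref_tap2, ref_tap3;

   // One step of the original design: sum the last four samples,
   // then shift the new sample into the delay line.
   function automatic logic [63:0] ref_step(input logic [63:0] sample);
      logic [63:0] sum;
      sum      = sample + ref_tap1 + ref_tap2 + ref_tap3;
      ref_tap3 = ref_tap2;
      ref_tap2 = ref_tap1;
      ref_tap1 = sample;
      return sum;
   endfunction

   logic [63:0] expA, expB;
   int          errors = 0;

   initial begin
      reset    = 1'b1;
      dinA     = '0;
      dinB     = '0;
      ref_tap1 = '0;
      ref_tap2 = '0;
      ref_tap3 = '0;
      @(posedge clk);                 // one active edge with reset asserted
      #1 reset = 1'b0;

      repeat (256) begin
         dinA = {$urandom(), $urandom()};
         dinB = {$urandom(), $urandom()};
         #8;                          // 1 ns before the next clock edge
         expB = ref_step(dinB);       // older sample of the pair goes first
         expA = ref_step(dinA);
         if (doutA !== expA || doutB !== expB) begin
            errors++;
            $error("A %h (exp %h) B %h (exp %h)", doutA, expA, doutB, expB);
         end
         @(posedge clk);
         #1;
      end

      if (errors == 0) $display("PASS");
      else             $display("FAIL: %0d mismatches", errors);
      $finish;
   end
endmodule
```

Because the reference model never mentions pairs, a mistake in the hand-derived unfolded equations would show up as a mismatch instead of being mirrored in the checker.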

SMALL design strategy #1: Sequentializing

To reduce the area of a hardware design, we have to reuse hardware elements over multiple clock cycles. An obvious candidate for reuse is the 64-bit adder. The flow graph of the averager is partitioned into clusters that contain similar operations. At the edge of a cluster, a register is placed to carry signals over to the next clock cycle.

(Line 48-75) We use a 5-state FSM that reads the input, accumulates all the taps, and finally shifts the taps and produces the output.
The level of area reduction depends on two things. First, the synthesis tools have to realize that the 64-bit addition can be reused over different FSM states. Second, the area saved by sharing the adder hardware must be larger than the area overhead added by the sharing support hardware. This includes the finite state machine, the accumulator register (line 11), and the multiplexers (invisible in the RTL!). This is no small requirement, and we will see that this overhead is significant.

module movavg(input logic         clk,
              input logic         reset,
              input logic [63:0]  din,
              output logic [63:0] dout,
              output logic        read
              );

   typedef enum         logic [3:0] {
                                     S0 = 4'b0000,
                                     S1 = 4'b0001,
                                     S2 = 4'b0010,
                                     S3 = 4'b0011,
                                     S4 = 4'b0100
                                     } state_t;

   logic [63:0]  tap1, tap1_next;
   logic [63:0]  tap2, tap2_next;
   logic [63:0]  tap3, tap3_next;

   state_t state, state_next;
   logic [63:0]  acc, acc_next;

   always_ff @(posedge clk)
     begin
        if (reset)
          begin
             tap1 <= 64'h0;
             tap2 <= 64'h0;
             tap3 <= 64'h0;
             acc  <= 64'h0;
             state <= S0;
          end
        else
          begin
             tap1  <= tap1_next;
             tap2  <= tap2_next;
             tap3  <= tap3_next;
             acc   <= acc_next;
             state <= state_next;
          end
     end

   logic [63:0] doutreg;
   logic        readreg;

   always @(*)
     begin
        state_next = state;
        acc_next   = acc;
        tap1_next  = tap1;
        tap2_next  = tap2;
        tap3_next  = tap3;
        doutreg    = 64'h0;
        readreg    = 1'b0;
        case (state)
          S0:
            begin
               acc_next   = din;
               readreg    = 1'b1;
               state_next = S1;
            end
          S1:
            begin
               acc_next   = acc + tap1;
               state_next = S2;
            end
          S2:
            begin
               acc_next   = acc + tap2;
               state_next = S3;
            end
          S3:
            begin
               acc_next   = acc + tap3;
               state_next = S4;
            end
          S4:
            begin
               doutreg    = acc;
               tap1_next  = din;
               tap2_next  = tap1;
               tap3_next  = tap2;
               state_next = S0;
            end
          default:
            state_next = S0;
        endcase
     end

   assign dout = doutreg;
   assign read = readreg;

endmodule
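Whether the adder is actually shared in the design above is left to the synthesis tool. As a sketch of an alternative coding style, the sharing can be written explicitly: a single adder is instantiated once, and an operand multiplexer selects which tap it accumulates in each state. The module name movavg_shared and the signals tap_sel and acc_sum are assumptions; the interface and the state sequence follow the design above.

```systemverilog
// Hypothetical variant with explicit adder sharing (movavg_shared is an illustrative name).
module movavg_shared(
    input  logic        clk,
    input  logic        reset,
    input  logic [63:0] din,
    output logic [63:0] dout,
    output logic        read
);
   typedef enum logic [2:0] {S0, S1, S2, S3, S4} state_t;

   state_t      state;
   logic [63:0] tap1, tap2, tap3, acc;
   logic [63:0] tap_sel, acc_sum;

   // Operand multiplexer, now visible in the RTL.
   always_comb begin
      case (state)
        S1:      tap_sel = tap1;
        S2:      tap_sel = tap2;
        default: tap_sel = tap3;
      endcase
   end

   // The only 64-bit adder in the design.
   assign acc_sum = acc + tap_sel;

   always_ff @(posedge clk) begin
      if (reset) begin
         tap1  <= '0;
         tap2  <= '0;
         tap3  <= '0;
         acc   <= '0;
         state <= S0;
      end else begin
         case (state)
           S0: begin acc <= din;     state <= S1; end
           S1: begin acc <= acc_sum; state <= S2; end
           S2: begin acc <= acc_sum; state <= S3; end
           S3: begin acc <= acc_sum; state <= S4; end
           S4: begin
              tap1  <= din;          // shift the delay line
              tap2  <= tap1;
              tap3  <= tap2;
              state <= S0;
           end
           default: state <= S0;
         endcase
      end
   end

   assign dout = (state == S4) ? acc : 64'h0;  // output is valid in S4 only
   assign read = (state == S0);

endmodule
```

This style makes the multiplexer cost visible to the designer, but it does not by itself guarantee a smaller result. The area report after synthesis remains the deciding measurement.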

Attention

The delay of the sequential design equals the critical path times the number of clock cycles required to compute the output dout (5).

The testbench of the sequential design is derived from the original testbench and simply increases the latency of the design by 4 cycles. We have also introduced an additional synchronization signal, read, which indicates when the FSM is in state S0. In that state, the new input is loaded into the accumulator. The same input is shifted into the delay line in state S4, so the testbench holds din stable for the entire computation.

`timescale 1ns/1ps
`define CLOCKPERIOD 10

module tb;
   logic clk, reset;

   always
     begin
       clk = 1'b0;
       #(`CLOCKPERIOD/2);
       clk = 1'b1;
       #(`CLOCKPERIOD/2);
     end

   logic [63:0] din;
   logic [63:0] dout;
   logic        read;

   movavg DUT(.clk(clk),
              .reset(reset),
              .din(din),
              .dout(dout),
              .read(read)
              );

   logic [63:0] chk_tap3;
   logic [63:0] chk_tap2;
   logic [63:0] chk_tap1;
   logic [63:0] chk_din;
   logic [63:0] chk_dout;

   initial
     begin
        reset = 1'b1;
        @(negedge clk);
        reset = 1'b0;
        chk_tap1 = 0;
        chk_tap2 = 0;
        chk_tap3 = 0;
        din      = 0;

        while (read == 1'b0) // wait until computation can start
          @(posedge clk);

        while (1)
          begin

             #1; // making sure we assign new inputs just after the clock edge

             din[31:0]  = $random;
             din[63:32] = $random;

             chk_din = din;

             repeat(4)
               @(posedge clk); // wait until FSM complete

             #(`CLOCKPERIOD - 2);

             chk_dout = chk_din + chk_tap3 + chk_tap2 + chk_tap1;

             $display("%t out %h expected %h OK %d", $time, dout, chk_dout, dout == chk_dout);
//           $display("%t DUT in %h taps %h %h %h",  $time, DUT.din, DUT.tap1, DUT.tap2, DUT.tap3);
//           $display("%t CHK in %h taps %h %h %h",  $time, chk_din, chk_tap1, chk_tap2, chk_tap3);

             chk_tap3 = chk_tap2;
             chk_tap2 = chk_tap1;
             chk_tap1 = chk_din;
             @(posedge clk);

          end
     end

   initial
     begin
        repeat(256)
          @(posedge clk);
        $finish;
     end

endmodule

SMALL design strategy #2: Bit-serializing

The second strategy towards small area is to take the hardware reuse concept to the extreme, and transform the 64-bit averager to a bit-serial design. The basic components of the design are easy to transform to bit-serial circuits:

A 64-bit adder becomes a 1-bit bit-serial adder
A 64-bit tap register becomes a 64-bit shift register (a 64-position 1-bit FIFO)
Parallel inputs and outputs are created using appropriate parallel to serial and serial to parallel registers

A challenge of bit-serial design is the controller, which needs to create a sync signal with a duty cycle of 1/64. The phase of the sync signal changes with the position of the bit-serial operator in the overall design.

The advantage of bit-serialization over sequentializing is that we can get rid of most multiplexers by ensuring that the data arrives always just-in-time. The latency of this design is a little more complicated; every 64 clock cycles, a new result is produced. However, the output is offset by 4 clock cycles from the input due to the phase delay inserted during bit-serial computation.

The controller is implemented as a 6-bit cyclic counter. The phase signals $\phi_0, \phi_1, \phi_2, \phi_3$ are derived from this counter by decoding the proper counter state. An alternate implementation would be a 64-bit circular shift register, which is probably larger than this solution due to the relatively large size of a flip-flop cell.
Three additional modules are used to create this implementation: a parallel-to-serial converter, a serial-to-parallel converter and a bit-serial adder.
The tap registers from earlier implementations are reused as 64x1 FIFO registers.
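For comparison, the alternate controller mentioned above, a 64-bit circular shift register, could be sketched as follows. The module name sync_ring and the port phi are assumptions, and this controller is not the one used in the design of this lecture. A single 1 circulates through the register, and each phase signal is simply one of the register bits, so no decoding logic is needed.

```systemverilog
// Hypothetical one-hot ring controller (sync_ring is an illustrative name).
// phi[i] is high during cycle i of every 64-cycle period.
module sync_ring(
    input  logic       clk,
    input  logic       reset,
    output logic [3:0] phi
);
   logic [63:0] ring;

   always_ff @(posedge clk)
     if (reset) ring <= 64'd1;                   // a single 1 at position 0
     else       ring <= {ring[62:0], ring[63]};  // rotate left by one position

   assign phi = ring[3:0];                       // phase 0 to phase 3

endmodule
```

This version trades 6 flip-flops plus decoders for 64 flip-flops, which is why the counter-based controller is expected to be the smaller of the two.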

 module movavg(input wire         clk,
               input wire     reset,
               input wire [63:0]  din,
               output wire [63:0] dout);

    reg [63:0]             tap1, tap1_next;
    reg [63:0]             tap2, tap2_next;
    reg [63:0]             tap3, tap3_next;

    reg [5:0]              ctr, ctr_next;

    always @(posedge clk)
      begin
        if (reset)
          begin
             tap1 <= 64'h0;
             tap2 <= 64'h0;
             tap3 <= 64'h0;
             ctr <= 6'd0;
          end
        else
          begin
             tap1 <= tap1_next;
             tap2 <= tap2_next;
             tap3 <= tap3_next;
             ctr  <= ctr_next;
          end
       end // always @ (posedge clk)

    // controller
    reg ctl0, ctl1, ctl2, ctl3, ctl4;
    always @(*)
      begin
        ctr_next = ctr + 6'b1;
        ctl0 = (ctr == 6'd0);
        ctl1 = (ctr == 6'd1);
        ctl2 = (ctr == 6'd2);
        ctl3 = (ctr == 6'd3);
        ctl4 = (ctr == 6'd4);
      end

    // first stage: parallel to serial conversion
    wire din_s;
    ps stage1(din, ctl0, clk, din_s);

    // second stage: delay line
    always @(*)
      begin
        tap1_next = {din_s, tap1[63:1]};
        tap2_next = {tap1[0], tap2[63:1]};
        tap3_next = {tap2[0], tap3[63:1]};
      end

    // third stage: first set of adders
    wire a1_s, a2_s;
    serialadd a1(din_s,    tap1[0], a1_s, ctl1, clk);
    serialadd a2(tap2[0],  tap3[0], a2_s, ctl1, clk);

    // fourth stage: second adder
    wire a3_s;
    serialadd a3(a1_s, a2_s, a3_s, ctl2, clk);

    // final stage: s/p converter
    sp stage9(a3_s, ctl3, clk, dout);

 endmodule

 module ps(input wire[63:0] a,
           input wire  sync,
           input wire  clk,
           output wire as);

    reg [63:0]         ra;

    always @(posedge clk)
      if (sync)
        ra <= a;
      else
        ra <= {1'b0, ra[63:1]};

    assign as = ra[0];

 endmodule // ps

 module sp(input wire as,
           input wire         sync,
           input wire         clk,
           output wire [63:0] a);

    reg [63:0]            ra;

    always  @(posedge clk)
      ra <= {as, ra[63:1]};

    assign a  = sync ? ra : 64'b0;

 endmodule // sp

 module serialadd(input wire a, input wire b, output wire s,
                  input wire sync, input wire clk);

    reg              carry, q;

    always @(posedge clk)
      if (sync)
        begin
           carry <= a & b;
           q     <= a ^ b;
        end
      else
        begin
           q     <= a ^ b ^ carry;
           carry <= (a & b) | (b & carry) | (carry & a);
        end

    assign s = q;

 endmodule
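The bit-serial adder is small enough to be verified on its own before it is placed in the averager. The following sketch feeds two random 64-bit words into serialadd one bit per cycle, least significant bit first, with sync asserted on the first bit, and reassembles the serial output into a word. The names tb_serialadd, x, y and sum are assumptions, and the sketch assumes that the serialadd module above is compiled into the same simulation.

```systemverilog
`timescale 1ns/1ps

// Sketch of a unit test for the bit-serial adder (tb_serialadd is an illustrative name).
module tb_serialadd;
   logic clk = 1'b0;
   logic a, b, sync, s;

   always #5 clk = ~clk;              // 10 ns clock period

   serialadd DUT(.a(a), .b(b), .s(s), .sync(sync), .clk(clk));

   logic [63:0] x, y, sum;
   int          errors = 0;

   initial begin
      a    = 1'b0;
      b    = 1'b0;
      sync = 1'b0;

      repeat (16) begin
         x = {$urandom(), $urandom()};
         y = {$urandom(), $urandom()};

         for (int i = 0; i < 64; i++) begin
            @(negedge clk);           // drive inputs away from the active edge
            a    = x[i];
            b    = y[i];
            sync = (i == 0);          // sync clears the carry on the LSB
            @(posedge clk);           // the DUT registers sum bit i
            #1 sum[i] = s;            // sample after the register has updated
         end

         if (sum !== x + y) begin     // carry-out of bit 63 is discarded
            errors++;
            $error("%h + %h gave %h", x, y, sum);
         end
      end

      if (errors == 0) $display("PASS");
      else             $display("FAIL: %0d mismatches", errors);
      $finish;
   end
endmodule
```

The test also shows the one-cycle delay of each serial adder: bit i of the sum becomes visible only after the clock edge that consumes bit i of the operands.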

Attention

The delay of this design is equal to the critical path times 64. Note that the latency is higher (64 + 64 + 4) because the result must be converted back to parallel format.
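The delay statements given for the five designs can be collected into a single expression. The symbols $N$ (clock cycles per iteration), $k$ (results per iteration) and $T_{result}$ (time per result) are introduced here for illustration and are not used elsewhere in this lecture. $T_c$ is the clock period that each design achieves, which is bounded by its own critical path.

$$
T_{result} = \frac{N \cdot T_c}{k}
$$

$$
\begin{array}{lccc}
\text{design} & N & k & T_{result} \\
\hline
\text{reference} & 1 & 1 & T_c \\
\text{pipelined} & 1 & 1 & T_c \\
\text{unfolded (2x)} & 1 & 2 & T_c / 2 \\
\text{sequentialized} & 5 & 1 & 5 \, T_c \\
\text{bit-serial} & 64 & 1 & 64 \, T_c
\end{array}
$$

The value of $T_c$ differs from one design to the next, so the rows can only be compared after the critical path of each design is known from the timing reports.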

The testbench of this design is more complex than the sequentialized design. This is because the outputs are delivered at cycle (4 mod 64) while inputs need to be inserted at cycle (0 mod 64). Due to the extra serial-to-parallel conversion at the output, the DUT produces the result of the previous average computation rather than the current average computation.

(Line 9) The LATENCY is set at 63 (i.e. 63 cycles longer than a single-cycle design)
(Line 78) We delay the expected result by one for-loop iteration to make it line up with the bit-serial computation
(Line 80) At cycle 4 of the 64-cycle period, the output is valid and captured from the parallel output dout.
(Line 83) Then we count off the remaining cycles of the 64-cycle period before verifying the expected result.

`timescale 1ns/1ps

// The Data Introduction Interval (DII) specifies the number of clock cycles
// between successive data inputs.
`define DII 1

// The Latency specifies the number of clock cycles between the entry of the first
// input, and the first output.
`define LATENCY 63

// The CLOCKPERIOD defines the number of time units in a clock period
`define CLOCKPERIOD 10

module movavgtb;

   reg clk, reset;

   always
     begin
       clk = 1'b0;
       #(`CLOCKPERIOD/2);
       clk = 1'b1;
       #(`CLOCKPERIOD/2);
     end

   initial
     begin
       reset = 1'b1;
       #(`CLOCKPERIOD);
       reset = 1'b0;
     end


   reg [63:0] din, snapdout, expdout, prev_expdout;
   wire [63:0] dout;

   movavg DUT(.clk(clk),
              .reset(reset),
              .din(din),
              .dout(dout));

   reg [63:0]  chk_in;
   reg [63:0]  chk_tap1;
   reg [63:0]  chk_tap2;
   reg [63:0]  chk_tap3;

   integer     n;

   initial
     begin
      $dumpfile("trace.vcd");
      $dumpvars(0, movavgtb);

      #(`CLOCKPERIOD/2); // reset delay

      chk_in   = 64'b0;
      chk_tap1 = 64'b0;
      chk_tap2 = 64'b0;
      chk_tap3 = 64'b0;
      prev_expdout = 64'd0;

      for (n=0; n < 1024; n = n + 1)
            begin

           din[31: 0] = $random;
           din[63:32] = $random;

           chk_tap3 = chk_tap2;
           chk_tap2 = chk_tap1;
           chk_tap1 = chk_in;
           chk_in   = din;

           // the testbench will check the *previous* data output
           // because it takes 64 cycles (+ latency) before a
           // bitserial result is converted back to bitparallel

           prev_expdout = expdout;
           expdout  = chk_in+chk_tap1+chk_tap2+chk_tap3;

           repeat (4) @(posedge clk);
           snapdout = dout;

           repeat (`LATENCY - 4) @(posedge clk);

           #(`CLOCKPERIOD - 1);

           $display("%d din %x dout %x exp %x OK %d",
                n,
                din,
                snapdout,
                prev_expdout,
                snapdout == prev_expdout);

           @(posedge clk);
      end

    $finish;

     end

endmodule
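
The expected value that the testbench derives from chk_in and the three chk_tap registers can also be modeled outside the simulator. The following Python sketch is one possible golden model and is not part of the course repository; the names MASK64 and moving_sum are hypothetical. It keeps only the lower 64 bits of every sum, which mirrors what the 64-bit registers in the testbench do.

```python
MASK64 = (1 << 64) - 1  # keep only the lower 64 bits, like a 64-bit reg


def moving_sum(samples):
    """For each input sample, return the sum of the last four samples modulo 2**64."""
    taps = [0, 0, 0, 0]  # models chk_in, chk_tap1, chk_tap2, chk_tap3 after reset
    results = []
    for sample in samples:
        # Shift the delay line by one position and insert the new sample.
        taps = [sample & MASK64] + taps[:3]
        # The carry-out of the 64-bit additions is discarded.
        results.append(sum(taps) & MASK64)
    return results


if __name__ == "__main__":
    print(moving_sum([1, 2, 3, 4, 5]))  # prints [1, 3, 6, 10, 14]
```

Such a model can be used to cross-check the values in the exp column printed by the testbench, assuming the same input sequence is fed to both.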

Implementation Flow

All of the above designs go through a standard block-level layout flow. We discuss the steps that you would take, along with the command lines and parameters for each step.

rtl/: The RTL directory in each folder holds one or more SystemVerilog (or Verilog) files. The standard convention is to call these files module1.sv for a module named module1. Further, one single (System)Verilog file holds a single module.
sim/: The RTL simulation directory uses xcelium to execute an RTL testbench on the design in the rtl/ directory. Before any synthesis and layout steps, the first activity is to verify the correctness of your design using RTL simulation.

make sim      # runs the simulation
make simgui   # runs the simulation in the GUI

constraints/: This directory holds the synthesis constraints. At a minimum, this includes identifying the clock signal, the clock period, and the input delay and output delay constraints.
syn/: The RTL synthesis directory converts your RTL design to a gate level netlist in a chosen technology. In the Makefile, you can edit important synthesis parameters including
- BASENAME Name used for report files, usually corresponds to the top-level module
- CLOCKPERIOD Clock period constraint used for the global clk
- TIMINGPATH Location of the synthesis .lib files
- TIMINGLIB Specific .lib file to use (depending on mode and constraints)
- VERILOG A list of (System)Verilog files that are to be synthesized

The synthesis script relies on a constraint file in the sdc/ directory. The constraint filename is hardcoded in the synthesis script.

After synthesis, the following outputs are generated.

reports/.._qor.rpt: Quality of results report

reports/.._timing.rpt: Post-synthesis timing analysis

reports/.._power.rpt: Post-synthesis power estimation

reports/.._area.rpt: Post-synthesis area estimation

outputs/.._netlist.v: Resulting netlist

outputs/.._delays.sdf: Post-synthesis delay file (for gate-level simulation)

outputs/.._constraints.sdc: Post-synthesis constraints file

make syn     # runs the synthesis

sta/: The static timing analysis on the resulting synthesis output uses Tempus with a standard script that produces the following analysis files. Additional analysis can be performed by modifying the tempus script (application and case dependent).
- late.rpt: Shows the three slowest paths from a setup timing perspective
- early.rpt: Shows the three most critical (shortest) paths from a hold timing perspective
- allpaths.rpt: Shows the slowest path for each primary starting point in four different groups: reg-reg, input-reg, reg-output, input-output

make sta     # runs STA

chip/: This directory contains layout-level constraints on input/output pins and pads.
- chip.io specifies the order and location of pins
- For full-chip design including pads, this directory will also include the pad frame and its connection to the top-level module.
layout/: This directory contains the backend layout script and supporting files. The layout is created in a two-step process, starting with synthesis and followed by backend layout. The synthesis run here is essentially identical to the one of step 4, but it produces a design database that is directly used by the layout tool. The synthesis and layout scripts, run_genus.tcl and run_innovus.tcl, require adaptation depending on the case you are implementing. The following parameters, set in the Makefile, play a role in the process.
- BASENAME Name of the top-level module
- CLOCKPERIOD Clock period constraint used for the global clk
- TIMINGPATH Location of the synthesis .lib files
- TIMINGLIB Specific .lib file to use (depending on mode and constraints)
- VERILOG A list of (System)Verilog files that are to be synthesized
- LEF Layout views of the standard cells, hard macros and pad cells used in the design
- QRC Technology file used for parasitic extraction

make syn    # runs synthesis
make layout # run layout

After the layout is complete, the following design information is produced.

syndb and synout contain outputs of the synthesis process

reports/.._qor.rpt Quality of results report file

reports/.._area.rpt (Synthesis) Area report

reports/.._timing.rpt (Synthesis) Timing report

reports/.._power.rpt (Synthesis) Power estimation report

reports/..check_timing_intent.rpt presents the timing constraints as understood for this design

reports/..ccopt_skew_groups.rpt reports on the skew in the clock tree

reports/report_clock_trees.rpt describes the clock trees from the layout

reports/layout_summary.rpt describes physical data on the layout

reports/layout_check_drc.rpt describes design rule violations found in the layout

reports/layout_check_connectivity.rpt describes connectivity violations found in the layout

reports/STA/... contains an extensive collection of post-layout timing analysis reports.

out/design_default_rc.spec contains post-layout parasitics, used for post-layout STA

out/design.v contains the post-layout netlist

out/final_route.db contains the post-layout database

glsim/: This directory performs gate-level simulation using the delay file (sdf) created during layout. Other SDF files can be used by modifying the testbench.

make sim  # run simulation

glsta/: This directory performs post-layout static timing analysis using the sdf (delay) and spef (parasitics) files created during place and route. This static timing analysis has a higher accuracy than the one from sta/. It produces the same output reports.
- late.rpt: Shows the three slowest paths from a setup timing perspective
- early.rpt: Shows the three most critical (shortest) paths from a hold timing perspective
- allpaths.rpt: Shows the slowest path for each primary starting point in four different groups: reg-reg, input-reg, reg-output, input-output

make sta  # Perform post-layout timing analysis

Attention

The normal design flow consists of performing steps 1 through 8, as described above. Each of the five designs discussed so far can be processed along these steps.
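
The make commands listed in the steps above can be chained by a small driver script. The following Python sketch is only an illustration: it assumes that each design folder contains the sim/, syn/, sta/, layout/, glsim/ and glsta/ subdirectories with their own Makefile, and the names FLOW_STEPS, flow_commands and run_flow are hypothetical. It relies on the standard `make -C <dir>` option to run make inside a subdirectory.

```python
import subprocess

# (subdirectory, make targets) for every step that runs a tool, in flow order.
# The targets are the ones listed in the flow description above.
FLOW_STEPS = [
    ("sim", ["sim"]),
    ("syn", ["syn"]),
    ("sta", ["sta"]),
    ("layout", ["syn", "layout"]),
    ("glsim", ["sim"]),
    ("glsta", ["sta"]),
]


def flow_commands(design_dir="."):
    """Build the list of make invocations for one design folder."""
    commands = []
    for subdir, targets in FLOW_STEPS:
        for target in targets:
            commands.append(["make", "-C", f"{design_dir}/{subdir}", target])
    return commands


def run_flow(design_dir=".", dry_run=True):
    """Print every command; execute it only when dry_run is False."""
    for command in flow_commands(design_dir):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the flow as soon as one step fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_flow(dry_run=True)
```

Stopping at the first failing step is a deliberate choice here, since every later step consumes the outputs of an earlier one.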

Performance Metrics

Reporting the performance of a design requires selecting a set of metrics that will be used to capture the main features and cost factors such as area and performance. There are many possible ways of processing the data produced by the layout construction script. We will use one ‘standard’ way of collecting metrics, so that we are able to compare results.

Post-synthesis Metrics

Post-synthesis metrics are extracted from step 4 (syn) and step 5 (sta) in the flow. They show a first-order approximation of the area cost and the performance of the design.

Post-synthesis Area: The area of the design is the active area of standard cells used to implement the digital block, as found in syn/reports/.._report_area.rpt. For example, the following block uses 2750.364 square micron.

Instance Module  Cell Count  Cell Area  Net Area   Total Area   Wireload
--------------------------------------------------------------------------
movavg                 1054   2750.364     0.000     2750.364 <none> (D)

Post-synthesis Timing: The performance of the design is the worst-case slack on the setup time of the design as reported in static timing analysis, in the report sta/late.rpt. For example, the following block has a worst-case slack of -61 picoseconds (-0.061 ns).

Path 1: VIOLATED Late External Delay Assertion
Endpoint:   dout[31]       (^) checked with  leading edge of 'clk'
Beginpoint: tap2_reg[12]/Q (v) triggered by  leading edge of 'clk'
Path Groups: {clk}
Other End Arrival Time          0.000
- External Delay                0.000
+ Phase Shift                   2.000
= Required Time                 2.000
- Arrival Time                  2.061
= Slack Time                   -0.061
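
The arithmetic in this report can be reproduced by hand. The Python sketch below follows the signs printed in the report (required time is the other end arrival time, minus the external delay, plus the phase shift); the function name setup_slack is hypothetical and not a tool command.

```python
def setup_slack(other_end_arrival, external_delay, phase_shift, arrival_time):
    """Return the setup slack in ns, following the lines of the timing report."""
    required_time = other_end_arrival - external_delay + phase_shift
    return required_time - arrival_time


# Values copied from the report above (all in ns).
slack_ns = setup_slack(
    other_end_arrival=0.000,
    external_delay=0.000,
    phase_shift=2.000,
    arrival_time=2.061,
)

if __name__ == "__main__":
    print(f"slack = {slack_ns:.3f} ns ({slack_ns * 1000:.0f} ps)")
    # prints: slack = -0.061 ns (-61 ps)
```

A negative result means the data arrives after the required time, which is why the path is marked VIOLATED.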

Post-layout Metrics

Post-layout metrics are extracted from step 6 (layout) of the flow. They show an improved estimation of the area cost and the performance of the design.

Post-layout Area: The area of the design is the total area of the core after place and route, as reported in layout/reports/layout_summary.rpt. For example, the total area of the core is 5300.316 square micron. If you report the core density, report the density of the core without the physical cells. For example, the core density of the following report is 71.112% (Core Density #2).

Total area of Standard cells: 5300.316 um^2
Total area of Standard cells(Subtracting Physical Cells): 3769.182 um^2
Total area of Macros: 0.000 um^2
Total area of Blockages: 0.000 um^2
Total area of Pad cells: 0.000 um^2
Total area of Core: 5300.316 um^2
Total area of Chip: 6878.304 um^2
Effective Utilization: 1.0000e+00
Number of Cell Rows: 62
% Pure Gate Density #1 (Subtracting BLOCKAGES): 100.000%
% Pure Gate Density #2 (Subtracting BLOCKAGES and Physical Cells): 71.112%
% Pure Gate Density #3 (Subtracting MACROS): 100.000%
% Pure Gate Density #4 (Subtracting MACROS and Physical Cells): 71.112%
% Pure Gate Density #5 (Subtracting MACROS and BLOCKAGES): 100.000%
% Pure Gate Density #6 (Subtracting MACROS and BLOCKAGES for insts are not placed): 71.112%
% Core Density (Counting Std Cells and MACROs): 100.000%
% Core Density #2(Subtracting Physical Cells): 71.112%
% Chip Density (Counting Std Cells and MACROs and IOs): 77.058%
% Chip Density #2(Subtracting Physical Cells): 54.798%
# Macros within 5 sites of IO pad: No
Macro halo defined?: No
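
The density figures in this report appear to be simple ratios of the areas listed at the top. The Python sketch below recomputes three of them under that assumption; the variable and function names are hypothetical.

```python
# Areas copied from the layout summary above (um^2).
std_cell_area = 5300.316          # Total area of Standard cells
std_cell_area_no_phys = 3769.182  # ... subtracting physical cells
core_area = 5300.316              # Total area of Core
chip_area = 6878.304              # Total area of Chip


def density_percent(cell_area, region_area):
    """Fraction of a region that is covered by cells, in percent."""
    return 100.0 * cell_area / region_area


core_density_2 = density_percent(std_cell_area_no_phys, core_area)
chip_density = density_percent(std_cell_area, chip_area)
chip_density_2 = density_percent(std_cell_area_no_phys, chip_area)

if __name__ == "__main__":
    print(f"Core Density #2: {core_density_2:.3f}%")
    print(f"Chip Density:    {chip_density:.3f}%")
    print(f"Chip Density #2: {chip_density_2:.3f}%")
```

The results agree with the Core Density #2, Chip Density and Chip Density #2 lines of the report to three decimals, which supports reading Core Density #2 as the share of the core occupied by functional (non-physical) cells.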

Post-layout Timing: The performance of the design is the Worst Negative Slack (WNS) of the design as reported in layout/reports/STA/..._postRoute.summary.gz.

A .gz file can be printed with the zcat command. For example, in the following design the WNS is -10ps. Note that the Cadence tools will also report a positive WNS to indicate a design that meets timing.

+--------------------+---------+---------+---------+
|     Setup mode     |   all   | reg2reg | default |
+--------------------+---------+---------+---------+
|           WNS (ns):| -0.010  |  1.514  | -0.010  |
|           TNS (ns):| -0.010  |  0.000  | -0.010  |
|    Violating Paths:|    1    |    0    |    1    |
|          All Paths:|   271   |   128   |   263   |
+--------------------+---------+---------+---------+
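
When many runs have to be compared, the WNS can be extracted from this table with a script instead of by eye. The Python sketch below parses the table shown above; it assumes the summary keeps exactly this layout, and the names read_summary and parse_setup_summary are hypothetical. The standard gzip module plays the role of zcat.

```python
import gzip

SUMMARY = """\
+--------------------+---------+---------+---------+
|     Setup mode     |   all   | reg2reg | default |
+--------------------+---------+---------+---------+
|           WNS (ns):| -0.010  |  1.514  | -0.010  |
|           TNS (ns):| -0.010  |  0.000  | -0.010  |
|    Violating Paths:|    1    |    0    |    1    |
|          All Paths:|   271   |   128   |   263   |
+--------------------+---------+---------+---------+
"""


def read_summary(path):
    """Return the text of a gzip-compressed report (same effect as zcat)."""
    with gzip.open(path, "rt") as handle:
        return handle.read()


def parse_setup_summary(text):
    """Map each row label to a {path group: value} dictionary."""
    groups = []
    table = {}
    for line in text.splitlines():
        if not line.startswith("|"):
            continue  # skip the +----+ separator lines
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        label, values = cells[0], cells[1:]
        if label == "Setup mode":
            groups = values  # header row names the path groups
        else:
            table[label.rstrip(":")] = dict(zip(groups, map(float, values)))
    return table


summary = parse_setup_summary(SUMMARY)

if __name__ == "__main__":
    print("WNS (all paths):", summary["WNS (ns)"]["all"], "ns")
```

For a real run, the text passed to parse_setup_summary would come from read_summary applied to the summary file of that run.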

Hence, we can report the quality of a block with four numbers. These numbers are necessarily a strong simplification, but they are sufficient to reasonably compare apples to apples when you trade one design solution against another. The following is a summary table for the example we discussed. Note that it is crucial to also mention the main clock constraint that applies to these results. In this case, we used a CLOCK_PERIOD of 2, which corresponds to 500MHz. A single combined metric, the area-delay product, can then be computed by multiplying the area with the delay, where delay is (CLOCK_PERIOD - Timing Slack).

Perf at 500MHz	Area (sqmu)	Timing (ns)	Area x Delay
Post-synthesis	2750.364	-0.061	5668
Post-layout	5300.316	-0.010	10654
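
The area-delay product in this summary follows directly from the definition delay = CLOCK_PERIOD - Timing Slack. The following Python sketch recomputes both entries; the function name area_delay_product is hypothetical.

```python
CLOCK_PERIOD_NS = 2.0  # 500MHz clock constraint used for these results


def area_delay_product(area_sqmu, slack_ns, clock_period_ns=CLOCK_PERIOD_NS):
    """Area times delay, where delay = clock period - timing slack."""
    delay_ns = clock_period_ns - slack_ns  # negative slack lengthens the delay
    return area_sqmu * delay_ns


post_syn_axd = area_delay_product(2750.364, -0.061)
post_layout_axd = area_delay_product(5300.316, -0.010)

if __name__ == "__main__":
    print(f"post-synthesis AxD: {post_syn_axd:.1f}")    # prints 5668.5
    print(f"post-layout AxD:    {post_layout_axd:.1f}")  # prints 10653.6
```

Both values agree with the summary entries up to rounding. Note that a design with positive slack gets a delay shorter than the clock period, so it is rewarded in this metric.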

Analysis for Ref, Small1, Small2, Fast1, Fast2

The following data shows the relevant metrics for each of the five designs, implemented at four different clock targets.

Within a given target, it’s relatively easy to make comparisons. For example, for a 2ns clock period target (500MHz), the variation in post-layout area is substantial, from 3488 sqmu (for the bitserial design, small2) to 8897 sqmu (for the pipelined design, fast1). Likewise, the slack shows considerable variation, from comfortably positive (bitserial) to a marginal violation (for the reference). The AxD column of this comparison shows that the bitserial design has an area-delay product that is roughly one sixth of that of the pipelined design. However, that doesn’t mean that the bitserial design is the best. What the AxD column doesn’t show is the latency of the design. The reference needs a single-cycle budget, small1 uses 4 clock cycles, small2 uses 64 cycles, fast1 uses 1/3 of a cycle (because of the two pipeline registers), and fast2 uses 1/2 of a cycle (because of the unfolding).

Design	Target (ns)	PostSyn Slack (ns)	PostSyn Area (sqmu)	PostRoute Slack (ns)	PostRoute Area (sqmu)	PostRoute A x D (ns.sqmu)
ref	1.50	-0.026	3161.7	-0.219	11979.6	37,877,886
small1	1.50	-0.042	5152.2	0.006	16070.6	82,799,228
small2	1.50	0.429	2349.9	0.336	3500.0	8,223,477
fast1	1.50	-0.047	5559.6	0.003	11656.4	64,804,249
fast2	1.50	-0.048	5165.4	-0.285	14411.9	74,447,434
ref	1.75	-0.023	2803.0	-0.057	10690.9	29,967,600
small1	1.75	-0.024	4957.3	0.005	10752.5	53,303,108
small2	1.75	0.651	2348.5	0.676	3488.4	8,190,198
fast1	1.75	-0.025	5499.4	0.002	9766.5	53,710,111
fast2	1.75	-0.028	4283.2	-0.052	14483.7	62,037,453
ref	2.00	-0.061	2750.4	-0.010	5300.3	14,577,851
small1	2.00	-0.019	4897.8	0.026	9028.8	44,220,859
small2	2.00	0.944	2348.2	0.836	3488.4	8,188,447
fast1	2.00	-0.040	5159.8	0.002	8897.1	45,906,984
fast2	2.00	-0.052	3879.0	-0.008	8790.8	34,099,741
ref	2.25	-0.017	2723.3	0.018	4601.6	12,531,693
small1	2.25	0.009	3212.1	0.016	8265.5	26,549,041
small2	2.25	1.194	2348.2	1.189	3500.0	8,214,506
fast1	2.25	-0.041	4965.5	0.010	8428.6	41,852,063
fast2	2.25	0.000	3651.2	0.016	7155.7	26,126,596

Hence, if we evaluate the design efficiency on the actual throughput achieved, the AxD number has to be multiplied with the appropriate cycle latency of each design. The following table lists the designs sorted from smallest AxDxlatency (best) to highest (worst). There are some interesting insights available from the table.

The least constrained designs (with the highest clock period target) are the best ones in terms of AxDxlatency. Remember what AxDxlatency really reflects: how efficiently the gates of the design are used to achieve the purpose of the application. When designs are highly multiplexed (such as small1 and small2), or when the design implementation is highly stressed (such as with a small target clock period), additional gates are required to achieve the same goal. For multiplexed designs, the reuse of computational gates requires additional multiplexers and registers. For highly stressed designs, the additional performance requirements become especially demanding on the hardware needed. Either of these effects results in an inferior AxDxlatency.
The design that appeared outstanding when ignoring the latency (i.e. small2) is actually the worst of all. Bitserial designs are not very efficient from a gate utilization perspective. True, the resulting design has the smallest area of all designs, but this comes at an enormous performance overhead. Moreover, a register cell is significantly larger than a logic cell such as NAND2. If we tried to multiplex two NAND2 gates into a single NAND2 and a register, the resulting design would actually grow in size. The table does not show the power metrics, but you can reasonably expect that highly multiplexed designs are rich in additional bit transitions, which cause the power to increase significantly.

Target	Design	AxD	Latency	AxDxLatency
2.25	ref	12,531,693	1.000	12,531,693
2.25	fast2	26,126,595	0.500	13,063,297
2.25	fast1	41,852,062	0.333	13,950,686
2.00	ref	14,577,851	1.000	14,577,851
2.00	fast1	45,906,984	0.333	15,302,326
2.00	fast2	34,099,740	0.500	17,049,870
1.75	fast1	53,710,111	0.333	17,903,368
1.50	fast1	64,804,249	0.333	21,601,414
1.75	ref	29,967,600	1.000	29,967,600
1.75	fast2	62,037,452	0.500	31,018,726
1.50	fast2	74,447,434	0.500	37,223,717
1.50	ref	37,877,885	1.000	37,877,885
2.25	small1	26,549,041	4.000	106,196,165
2.00	small1	44,220,859	4.000	176,883,437
1.75	small1	53,303,107	4.000	213,212,431
1.50	small1	82,799,227	4.000	331,196,911
2.00	small2	8,188,446	64.000	524,060,601
1.75	small2	8,190,198	64.000	524,172,677
2.25	small2	8,214,506	64.000	525,728,397
1.50	small2	8,223,476	64.000	526,302,514
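
The ranking in this table can be reproduced by multiplying each AxD entry with the latency of the design and sorting the result. The Python sketch below does this for the five designs at the 2.25ns target, using the AxD values listed earlier for that target; the dictionary and function names are hypothetical.

```python
# Latency per design, in clock cycles per result, as discussed in the text.
LATENCY_CYCLES = {
    "ref": 1.0,
    "small1": 4.0,
    "small2": 64.0,
    "fast1": 1.0 / 3.0,
    "fast2": 1.0 / 2.0,
}

# AxD values at the 2.25ns clock target, copied from the results table.
AXD_AT_2_25 = {
    "ref": 12_531_693,
    "small1": 26_549_041,
    "small2": 8_214_506,
    "fast1": 41_852_063,
    "fast2": 26_126_596,
}


def rank_by_axd_latency(axd, latency):
    """Return (design, AxD x latency) pairs sorted from best (smallest) to worst."""
    scored = [(name, axd[name] * latency[name]) for name in axd]
    return sorted(scored, key=lambda pair: pair[1])


ranked = rank_by_axd_latency(AXD_AT_2_25, LATENCY_CYCLES)

if __name__ == "__main__":
    for name, score in ranked:
        print(f"{name:8s} {score:15,.0f}")
```

The resulting order is ref, fast2, fast1, small1, small2, which matches the relative order of the 2.25 rows in the table. Small differences in the last digits are expected, because the table was presumably computed from unrounded values.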

Layout Analysis for Ref @ T=2ns

The layout itself is also an excellent source for additional analysis. The following figures illustrate some interesting aspects for a single design, the reference design at 500MHz (2ns clock).

_images/reflayout1.png — The full layout, showing 64 inputs coming in from the left, and producing 64 output bits on the right. Clock and reset ports are on top.

_images/reflayout2.png — The view with only the standard cells shows that the design has a relatively high utilization, with an empty area at the top. This could be optimized by modifying the aspect ratio or by rearranging the I/O ports.

_images/reflayout3.png — This shows the clock tree. At the leaf position of each branch is a flip-flop.

_images/reflayout4.png — The clock tree is remarkably well balanced. The y axis in this graph shows clock delay, and the horizontal position enumerates flip-flops. Almost every flip-flop falls within a window of 0.05ns.

_images/reflayout5.png — The layout uses a standard power grid, and no effort was made to create a mesh structure. This is very likely to cause problems for the standard cells in the middle of a row, which are too far from the power ring around the core. Hence, additional work is needed to design the power network of this layout. Separate analysis (signal integrity tools) can be used to determine whether a power grid was well designed.

_images/reflayout6.png — A pin density map shows which areas of the layout require the most connections. A high pin density is correlated with routing congestion, as many wires need to converge in a single area of the chip.