Lecture 13: Area/Time Trade Off

5:00 Introduction

Practical issues for the coming few lectures:
- Today I will finish class exceptionally at 5:45PM.
  Additional material will be available online (video).
- Next week (3/26, 3/28) I am traveling.
  I will post the lectures online; they will include
  tool demonstrations

Laying out the topics going forward:

So far, we have concentrated on creating functionally correct circuits in
Verilog. In the next few lectures, we will focus on optimization
and trade-offs, while further deepening our expertise in
hardware design:

Optimization:
- L13: Area/Time tradeoffs
- L17: Pipelining and Retiming
- L18: Unfolding

Further into hardware design:
- L14: Hardware synthesis of Memories
- L15: Timing Simulation
- L16: Working with IP modules

5:05 Area/Time tradeoff

The product of a circuit's area and its delay (latency) is called the
AREA-DELAY product. It is a popular complexity metric for judging the
quality of a digital circuit.

Say you have a design with 400 Gates, and it has a latency of 20
cycles to compute an output from an input, then that design has an
area-delay of 400*20 = 8000 gate-cycles.

The area-delay product represents a cost metric. It expresses how well
the hardware is used to compute a result.

The area-delay product can be used to compare one design to another
one.  In general, a lower area-delay product is better.
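
As a quick illustration, here is a minimal Python sketch of the computation.
The first design uses the numbers from the example above; the second design
and its numbers are hypothetical, chosen only to have something to compare
against.

```python
def area_delay_product(area_gates, delay_cycles):
    """Return the area-delay product in gate-cycles."""
    return area_gates * delay_cycles

# Design from the text: 400 gates, 20 cycles of latency.
design_a = area_delay_product(400, 20)
# Hypothetical alternative (numbers invented for illustration).
design_b = area_delay_product(250, 40)

print(design_a)             # 8000
print(design_b)             # 10000
print(design_a < design_b)  # True: the first design has the lower product
```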

When you OPTIMIZE a design, you would try to reduce the area-delay
product.  When you evaluate a TRADE-OFF of a design, you would try to
exchange a better area for a worse delay, or vice versa. Hence, a
trade-off does not have to change the area-delay product. Then, how do
we compare the area-delay product of such designs?

We need to capture these designs in a single design-space, the area-delay
design space:

                | Area
                |
                |    X Design 1
                |          X Design 2
                |                      X Design 3
                |
                |                             Delay
                +----------------------------------

For these three designs:

        Area (Design 1) > Area (Design 2) > Area (Design 3)
        Delay (Design 1) < Delay (Design 2) < Delay (Design 3)

It is common to introduce design constraints when comparing designs.
Such as:
        C1: Your design must be smaller than area A1
        C2: Your design must be faster than delay D1

Once the constraints are set, we then can select what designs we
should pick.

For the constraint C1 (area), we must find a design such that 
Area (Design) < A1. Note that it makes no sense to pick the smallest
design. As soon as we have a design which meets the constraint, we are done.

For the constraint C2 (delay), we must find a design such that Delay
(Design) < D1. Again, it makes no sense to pick the fastest design.
One that is a little bit faster than D1 is OK.

                | Area               C2
                |                     |
                |    X Design 1       |
                +---------------------+--------- C1
                |          X Design 2 |
                |                     |X Design 3
                |                     |
                |                     |       Delay
                +---------------------+------------

 Design 1 and Design 2 meet C2
 Design 2 and Design 3 meet C1

Design 2 is most desirable to meet C1
(closest to A1 but still meeting the constraint)

Design 2 is most desirable to meet C2
(closest to D1 but still meeting the constraint)

Design 2 is also the only design that can meet BOTH C1 and C2.

Now, consider suboptimal designs, as shown by the following examples:


                | Area
                |                 X Design 4
                |    X Design 1            
                |          X Design 2        X Design 5
                |                      X Design 3
                |
                |                             Delay
                +----------------------------------

Design 4 is a suboptimal design, even though it is faster than Design
3. The reason is that there is a design, Design 2, which is BOTH faster
AND smaller than Design 4. Hence, if we were to pick Design 4 under any
constraint, then Design 2 would be a better choice. Unlike a true
trade-off, Design 4 gives up both area and delay relative to Design 2
and gains nothing in return.

Similarly, Design 5 is a suboptimal design, even though it is smaller
than Design 1. The reason is that there is a design, Design 3, that is
BOTH faster AND smaller than Design 5. Hence, Design 3 is always
preferable over Design 5.

You can identify optimal designs by imagining each design to be the corner of
an infinite square (a quadrant extending toward larger area and larger delay),
and evaluating which other designs are 'outshadowed' by that square. This
demonstrates that Design 1, Design 2 and Design 3 are optimal, and Design 4
and Design 5 are outshadowed.

                | Area
                |    |     |           |
                |    |     |      X Design 4
                |    X-----|           |  
                |    1     X-----------|  X Design 5
                |          2           X--------------
                |                      3
                |                             Delay
                +----------------------------------
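
The 'outshadowing' test can also be written as a small program. The Python
sketch below assumes made-up area and delay coordinates that only mimic the
relative positions in the figure; the helper names dominates and
pareto_optimal are hypothetical.

```python
def dominates(p, q):
    """True if design p is at least as good as q in both area and delay,
    and strictly better in at least one of them."""
    return (p["area"] <= q["area"] and p["delay"] <= q["delay"]
            and (p["area"] < q["area"] or p["delay"] < q["delay"]))

def pareto_optimal(designs):
    """Return the names of the designs that no other design dominates."""
    return [name for name, d in designs.items()
            if not any(dominates(other, d)
                       for other_name, other in designs.items()
                       if other_name != name)]

# Hypothetical coordinates, chosen only to mimic the sketch above.
designs = {
    "Design 1": {"area": 50, "delay": 10},
    "Design 2": {"area": 40, "delay": 20},
    "Design 3": {"area": 30, "delay": 40},
    "Design 4": {"area": 60, "delay": 30},
    "Design 5": {"area": 40, "delay": 50},
}

print(pareto_optimal(designs))  # ['Design 1', 'Design 2', 'Design 3']
```

With these coordinates, Design 4 is outshadowed by Design 2 and Design 5 by
Design 3, as in the discussion above.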

5:20 Area/Time tradeoff through resource sharing

The most common method to generate design variants which have
different area-delay products is to create implementations with
different degrees of resource sharing.

Assume the following design, made using a module with combinational
logic module_f1:

DESIGN 1:

        --> module_f1 --> register --> module_f1 --> register -->

If module_f1 is large, we can try to resource-share it as follows:

DESIGN 2:

        --> MUX --> module_f1 --> register ---+--->
             ^                                |
             +--------------------------------+

*** AREA

The area of the first design is: 
                2 x Area(module_f1) + 2 x Area(register)

The area of the second design is: 
                Area(MUX) + Area(module_f1) + Area(register)

The second design will be smaller if: 
                Area(MUX) < Area(module_f1) + Area(register)

In practical cases, the second design will be smaller.
However! This assumes that the controller that drives
the MUX will be simple. That is not always true.

*** LATENCY

Both designs have a latency of two cycles.

*** THROUGHPUT

The first design accepts one new input per cycle.
The second design accepts one new input per two cycles.

In practical cases, the first design will be faster (higher throughput).
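
The area and throughput comparison can be tried out with numbers. The Python
sketch below uses hypothetical gate counts for module_f1, the register and the
MUX, and it ignores the area of the controller that drives the MUX.

```python
def area_unshared(a_f1, a_reg):
    # DESIGN 1: two copies of module_f1 and two registers
    return 2 * a_f1 + 2 * a_reg

def area_shared(a_f1, a_reg, a_mux):
    # DESIGN 2: one module_f1, one register and one input MUX
    # (controller area is not modeled here)
    return a_mux + a_f1 + a_reg

# Hypothetical areas, in gates.
A_F1, A_REG, A_MUX = 200, 32, 48

print(area_unshared(A_F1, A_REG))       # 464
print(area_shared(A_F1, A_REG, A_MUX))  # 280
print(A_MUX < A_F1 + A_REG)             # True: sharing pays off in area

# Throughput in new inputs accepted per clock cycle.
throughput = {"DESIGN 1": 1 / 1, "DESIGN 2": 1 / 2}
print(throughput)
```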

To express the ability of a hardware design to do resource
sharing, we use the HARDWARE SHARING FACTOR (HSF).

If you have a design that reads inputs at a rate fin
and that produces outputs at a rate fout
and that is working at a clock frequency fclk

then the hardware sharing factor is defined as:

                 fclk
        HSF = -----------------
               max(fin, fout)

The HSF expresses how many clock cycles we have available
to produce each output or consume each input.

A low hardware sharing factor means that hardware resources
are shared little, or not at all, across clock cycles.

In the above example, HSF(DESIGN 1) = 1 and HSF(DESIGN 2) = 2
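
As a minimal Python sketch of the definition, assuming a hypothetical 100 MHz
clock (the clock value is not from the lecture; only the ratios matter):

```python
def hsf(f_clk, f_in, f_out):
    """Hardware sharing factor: clock cycles available per input or output."""
    return f_clk / max(f_in, f_out)

F_CLK = 100e6  # hypothetical clock frequency in Hz

# DESIGN 1 accepts one input and produces one output every cycle.
print(hsf(F_CLK, F_CLK, F_CLK))          # 1.0
# DESIGN 2 accepts one input and produces one output every two cycles.
print(hsf(F_CLK, F_CLK / 2, F_CLK / 2))  # 2.0
```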

The HSF is an indicator for the architectural style that a hardware
design has to use:

        HSF = 1         Non-shared, maximum-throughput hardware

        HSF = 10        Lightly multiplexed hardware, uses FSM-based control

        HSF = 100       Medium complexity multiplexed hardware,
                        uses more sophisticated control such as a small
                        microcontroller or a sequencer

        HSF = 1000      Highly multiplexed hardware,
                        requires sophisticated microcoded control

        HSF = infinite  Microprocessor!
                        requires a compiler to develop control (=software!)

5:30 Introducing the multiplier example

To get a feel for area-time trade-offs when writing Verilog, we will
study the example of a 16*16 multiplier with a 32-bit output.

Last week, we mentioned that the FPGA has dedicated DSP resources with
hard-macro multipliers. For this example, however, we will force the
tools to use only 6-input lookup tables as available in the Cyclone-V
fabric (ALMs).

We will discuss 4 different versions of the multiplier:

        multstyle.v   toplevel, for synthesis
        mymult1.v     first version, nonmultiplexed
        mymult2.v     second version, nonmultiplexed
        mymult3.v     third version, 16-time multiplexed
        mymult4.v     fourth version, 8-time multiplexed
        mymulttb.v    testbench for any version

To simulate these designs you would take the following steps:

1/ Select the version to use (mymult1 to mymult4) and instantiate it in
   multstyle.v and mymulttb.v

2/ Compile the Verilog for simulation

        vlib work
        vlog *.v

3/ Simulate

        vsim -c mymulttb -do "run 10000ns; exit"

4/ Synthesize

        quartus_sh --flow compile multstyle

If you run the synthesis in Quartus, you can easily extract the 10
most critical timing paths (which have minimum slack).

We have to compare the four designs from the following points of view:

        - number of cycles needed to complete a multiplication
        - minimum clock period of the design
          (or maximum clock frequency, or minimum slack)
        - number of ALMs and registers 

        The Delay will be: cycles x clock period
        The Area will be: #ALMs

We don't include the registers in the area because each
ALM has logic as well as registers.

Obviously the area metric is an approximation.


5:40 Let's start with MULT1.v:

===============================================
module mymult1(input  wire        clk,
               input wire         reset,
               input wire [15:0]  a,
               input wire [15:0]  b,
               input wire         start,
               output wire [31:0] result,
               output wire        done);
   
   (* multstyle = "logic" *) reg [31:0] r, next_r;
   reg                            d, next_d;
   
   always @(posedge clk)
     begin
       r <= (reset) ? 32'h0 : next_r;
       d <= (reset) ? 1'd0  : next_d;  
     end
   
   // IMPLEMENTATION 1
   always @(*)
     begin
       next_r = {16'b0,a} * {16'b0,b};
       next_d = start; 
     end
   
   assign done   = d;
   assign result = r;
   
endmodule
===============================================

This design reads in two arguments a and b,
multiplies them and stores the result in
a register. We use the register to break the critical
path.

Note how the multiplication is written:
                next_r = {16'b0,a} * {16'b0,b};

This ensures that we will get a 32-bit multiplier, not
a 16-bit multiplier.

There is a start/done control, which is simple because
this is a single-cycle design.

Also, note how we tell the tools to map this design
in ALMs, and not in a hard-macro DSP multiplier:

   (* multstyle = "logic" *) reg [31:0] r, next_r;

(the (* ... *) construct is called a 'synthesis attribute')
See https://www.intel.com/content/www/us/en/programmable/quartushelp/17.0/index.htm#hdl/vlog/vlog_file_dir_multstyle.htm

Consult Quartus -> help (and then search for 'multstyle')

                DESIGN 1    DESIGN 2    DESIGN 3    DESIGN 4
------------------------------------------------------------
ALM               130
Registers          64
Cycles              1
Min CLKP (ns)       8.052

AREA (ALMs)       130
DELAY (ns)          8

5:50 Next, consider MULT2.v

In order to express the sharing, we need to rewrite

                next_r = {16'b0,a} * {16'b0,b};

because the multiplication is written as a single operation, which
exposes nothing that could be shared.

We can do this by recalling how a multiplication is computed.
Here is an example of a 4-bit x 4-bit multiply:

                      a       1       0       1       1 
                      b       1       1       0       1
                     ----------------------------------
                              1       0       1       1     b[0] x a
                      0       0       0       0             b[1] x a
              1       0       1       1                     b[2] x a
      1       0       1       1                             b[3] x a
      -------------------------
  1   0       0       0       1       1       1       1 

So we compute the partial products (each bit of b multiplied with all the
bits of a), and then add up all of those to compute the result.
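
The same arithmetic can be checked with a few lines of Python. This is only a
sketch of the arithmetic (the function name partial_products is hypothetical),
not a model of the hardware.

```python
def partial_products(a, b, width):
    """Return the shifted partial products b[j] * a * 2**j for j = 0..width-1."""
    return [(a << j) if (b >> j) & 1 else 0 for j in range(width)]

a, b = 0b1011, 0b1101  # the 4-bit operands from the example (11 and 13)
pp = partial_products(a, b, 4)

for j, p in enumerate(pp):
    print(f"b[{j}] x a = {p:08b}")
print(f"sum      = {sum(pp):08b}")  # 10001111, i.e. 143 = 11 * 13
```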

We can write this behavior directly in Verilog:

===============================================
module mymult2( input  wire             clk,
                input wire                      reset,
                input wire [15:0]       a,
                input wire [15:0]       b,
                input wire                      start,
                output wire [31:0]      result,
                output wire             done);
   
   (* multstyle = "logic" *) reg [31:0] r, next_r;
   reg                            d, next_d;
   
   always @(posedge clk)
     begin
       r <= (reset) ? 32'h0 : next_r;
       d <= (reset) ? 1'b0  : next_d;  
     end
   
   // IMPLEMENTATION 2
   reg [31:0] tmp[15:0];
   reg [31:0] tmpa;
   integer    j;        
   always @(*)
     begin
       tmp[0] = b[0] ? {16'b0, a} : 32'b0;
       for (j = 1; j<16; j=j+1)
       begin
         tmpa = {16'b0, a} << j;
         tmp[j] = tmp[j-1] + (b[j] ? tmpa : 32'b0);
       end
       next_r = tmp[15];
       next_d = start; 
     end
   
   assign result = r;
   assign done   = d;
   
endmodule
===============================================

Let's look at it in more detail.

- tmp is an array of 32-bit partial sums; tmp[j] holds the sum of the
  first j+1 partial products.
- the partial products are created using tmpa, which is nothing
  more than a shifted version of a
- the partial products are given by:

                        b[j] ? tmpa : 32'b0

- the entire multiplication is written in a loop. Keep in mind
  that Verilog synthesis implements this loop as combinational
  logic, i.e. all iterations evaluate within a single clock cycle.

                tmp[0] = b[0] ? {16'b0, a} : 32'b0;
                for (j = 1; j<16; j=j+1)
                begin
                  tmpa = {16'b0, a} << j;
                  tmp[j] = tmp[j-1] + (b[j] ? tmpa : 32'b0);
                end

  What gets synthesized is the UNROLLED version of this loop.
- Finally, to create a proper done signal, we create
  a delayed version of start. The latency of this multiplier
  is still one clock cycle.

                DESIGN 1    DESIGN 2    DESIGN 3    DESIGN 4
------------------------------------------------------------
ALM               130         159
Registers          64          64
Cycles              1           1
Min CLKP (ns)       8.052      10.047

AREA (ALMs)       130         159
DELAY (ns)          8          10

6:00 Next, consider MULT3.v

Rewriting the multiplier in MULT2.v also gives an idea of how we
can share the accumulator: we should execute each loop
iteration in its own clock cycle. Then, a single accumulator can
be reused over 16 clock cycles to compute a result.

However, there are additional issues to pay attention to:
- the shifted version of a has to be created. We could do
  that using a shift register, each time shifting a up one
  position.
- different bits of b have to be indexed. We should do that
  as well using a shift register, each time shifting b
  down one position and looking at bit 0 of the
  shftb register
- to load the shift registers, drive the multiplication,
  and extract the result, we introduce a state machine

===============================================
module mymult3(input  wire        clk,
               input wire         reset,
               input wire [15:0]  a,
               input wire [15:0]  b,
               input wire         start,
               output wire [31:0] result,
               output wire        done);
     
   (* multstyle = "logic" *) reg [31:0] r, next_r;
   reg                                  d, next_d;
   
   always @(posedge clk)
     begin
       r <= (reset) ? 32'h0 : next_r;
       d <= (reset) ? 1'b0  : next_d;  
     end
   
   // IMPLEMENTATION 3
   reg [31:0] tmp, next_tmp;
   reg [31:0] tmpa, next_tmpa;
   reg [15:0] shftb, next_shftb;   
   
   reg [1:0]  state, next_state;
   localparam S0 = 0, S1 = 1, S2 = 2;
   
   always @(posedge clk)
     begin
       tmp   <= (reset) ? 32'b0 : next_tmp;
       tmpa  <= (reset) ? 32'b0 : next_tmpa;
       shftb <= (reset) ? 16'b0 : next_shftb;  
       state <= (reset) ? S0    : next_state;
     end
   
   always @(*)
     begin
       next_tmpa  = tmpa;
       next_shftb = shftb;
       next_tmp   = tmp;
       next_state = state;
       next_d     = 1'b0;
       next_r     = r;
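        case(state)
          // S0: idle; on start, load the operands and clear the accumulator
          S0: if (start)
            begin
               next_tmpa  = {16'b0, a};
               next_shftb = b;
               next_tmp   = 32'b0;
               next_state = S1;
            end
          // S1: add one partial product per cycle while shftb has 1-bits left
          S1: if (|shftb)
            begin
               next_tmpa  = tmpa << 1;
               next_shftb = shftb >> 1;
               next_tmp   = tmp + (shftb[0] ? tmpa : 32'b0);
            end
          else
            next_state = S2;
          // S2: copy the accumulator into the result register and pulse done
          S2: begin
               next_d     = 1'b1;
               next_r     = tmp;
               next_state = S0;
          end
          default:
            next_state = S0;
        endcase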
     end
   
   assign result = r;
   assign done   = d;
   
endmodule
===============================================

While this design looks more complicated, it is in fact smaller
than the previous multiplier, because it needs only one
32-bit adder, while MULT2.v uses 15 32-bit adders (one per loop iteration).

A caveat with this design is that it has variable latency.
It will iterate in state S1 as long as there are non-zero bits
in the b shift-register:

          S1: if (|shftb)
                        ...
                  else
                    next_state = S2;

In simulation, you can find that this design takes between 4 and 19
cycles depending on the position of the most significant 1-bit in b.
We will take the worst-case latency as the DELAY metric.
You can see that this design can run at a much higher clock
frequency than DESIGN 1 and DESIGN 2, because the critical
path is much shorter.
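
The variable latency can be reproduced with a small behavioral model. The
Python sketch below steps through the S0/S1/S2 state machine one clock edge
at a time. It assumes that start is high on the first edge and that cycles are
counted from that edge up to and including the edge that sets done; the
counting convention and the function name mymult3_model are assumptions, not
taken from the testbench.

```python
MASK32 = 0xFFFFFFFF

def mymult3_model(a, b):
    """Behavioral model of the MULT3 state machine.
    Returns (result, cycles) for 16-bit operands a and b."""
    state, tmp, tmpa, shftb = "S0", 0, 0, 0
    r, d, cycles = 0, 0, 0
    while not d:
        cycles += 1
        if state == "S0":      # start assumed high: load operands
            tmpa, shftb, tmp, state = a & 0xFFFF, b & 0xFFFF, 0, "S1"
        elif state == "S1":
            if shftb != 0:     # accumulate one partial product
                if shftb & 1:
                    tmp = (tmp + tmpa) & MASK32
                tmpa = (tmpa << 1) & MASK32
                shftb >>= 1
            else:              # no 1-bits left in shftb
                state = "S2"
        else:                  # S2: register the result and raise done
            r, d, state = tmp, 1, "S0"
    return r, cycles

print(mymult3_model(3, 1))            # (3, 4): most significant 1-bit of b at position 0
print(mymult3_model(0xFFFF, 0x8000))  # 19 cycles: most significant 1-bit at position 15
```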

                DESIGN 1    DESIGN 2    DESIGN 3    DESIGN 4
------------------------------------------------------------
ALM               130         159          67
Registers          64          64         147
Cycles              1           1       4 .. 19
Min CLKP (ns)       8.052      10.047      4.063

AREA (ALMs)       130         159          67
DELAY (ns)          8          10          77

6:10 Finally, let's look at MULT4.v, which is a trade-off design

We can now consider designs that accumulate multiple partial products
per clock cycle. In MULT4, we accumulate two partial products per
clock cycle. This design is a simple transformation from MULT3:

===============================================
module mymult4( input  wire             clk,
                input wire              reset,
                input wire [15:0]       a,
                input wire [15:0]       b,
                input wire              start,
                output wire [31:0]      result,
                output wire             done);
   
   (* multstyle = "logic" *) reg [31:0] r, next_r;
   reg                            d, next_d;
   
   always @(posedge clk)
     begin
       r <= (reset) ? 32'h0 : next_r;
       d <= (reset) ? 1'b0  : next_d;  
     end
   
   // IMPLEMENTATION 4
   reg [31:0] tmp, next_tmp;
   reg [31:0] tmpa, next_tmpa;
   reg [15:0] shftb, next_shftb;   
   
   reg [1:0]  state, next_state;
   localparam S0 = 0, S1 = 1, S2 = 2;
   
   always @(posedge clk)
     begin
       tmp   <= (reset) ? 32'b0 : next_tmp;
       tmpa  <= (reset) ? 32'b0 : next_tmpa;
       shftb <= (reset) ? 16'b0 : next_shftb;  
       state <= (reset) ? S0    : next_state;
     end
   
   always @(*)
     begin
       next_tmpa  = tmpa;
       next_shftb = shftb;
       next_tmp   = tmp;
       next_state = state;
       next_d     = 1'b0;
       next_r     = r;
        case(state)
          S0: if (start)
            begin
               next_tmpa  = {16'b0, a};
               next_shftb = b;
               next_tmp   = 32'b0;
               next_state = S1;
            end
          S1: if (|shftb)
            begin
               next_tmpa  = tmpa << 2;
               next_shftb = shftb >> 2;
               next_tmp   = tmp +      (shftb[0] ?  tmpa      : 32'b0);
               next_tmp   = next_tmp + (shftb[1] ? (tmpa << 1): 32'b0);
            end
          else
            next_state = S2;
          S2: begin
                next_d = 1'b1;
                next_r = tmp;
                next_state = S0;
          end
          default:
            next_state = S0;
        endcase
     end
   
   assign result = r;
   assign done   = d;
   
endmodule
===============================================

So now, the shift registers move by two bits per clock cycle, and
there are two accumulations per clock cycle:

               next_tmpa  = tmpa << 2;
               next_shftb = shftb >> 2;
               next_tmp   = tmp +      (shftb[0] ?  tmpa      : 32'b0);
               next_tmp   = next_tmp + (shftb[1] ? (tmpa << 1): 32'b0);

This design will be larger than DESIGN 3, and a little slower
in terms of max clock frequency.
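
To see how the number of accumulate steps changes with the number of partial
products handled per cycle, here is a Python sketch of the datapath only (no
state machine). The function name shift_add and the bits_per_cycle parameter
are hypothetical; a value of 1 mimics MULT3 and a value of 2 mimics MULT4.

```python
def shift_add(a, b, bits_per_cycle):
    """Multiply two 16-bit values, consuming bits_per_cycle bits of b per
    iteration. Returns (product, number of accumulate iterations)."""
    tmp, tmpa, shftb, iterations = 0, a, b, 0
    while shftb != 0:
        for k in range(bits_per_cycle):
            if (shftb >> k) & 1:
                tmp += tmpa << k  # partial product for bit k of this group
        tmpa <<= bits_per_cycle
        shftb >>= bits_per_cycle
        iterations += 1
    return tmp & 0xFFFFFFFF, iterations

for n in (1, 2):
    product, iterations = shift_add(0xFFFF, 0xFFFF, n)
    print(n, product == 0xFFFF * 0xFFFF, iterations)
# prints:
# 1 True 16
# 2 True 8
```

The worst-case cycle counts reported for these designs (19 and 11) are each 3
more than these iteration counts, which is consistent with the load, exit and
done steps of the state machine.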

                DESIGN 1    DESIGN 2    DESIGN 3    DESIGN 4
------------------------------------------------------------
ALM               130         159          67          81
Registers          64          64         147         147
Cycles              1           1       4 .. 19     4 .. 11
Min CLKP (ns)       8.052      10.047      4.063       5.961

AREA (ALMs)       130         159          67          81
DELAY (ns)          8          10          77          66
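
The AREA and DELAY rows follow from the rows above them (delay = worst-case
cycles x minimum clock period). A short Python sketch that redoes the
arithmetic from the table values; the area-delay product column is an extra
computed here for illustration and is not part of the table.

```python
# (ALMs, worst-case cycles, minimum clock period) copied from the table above.
designs = {
    "DESIGN 1": (130, 1, 8.052),
    "DESIGN 2": (159, 1, 10.047),
    "DESIGN 3": (67, 19, 4.063),
    "DESIGN 4": (81, 11, 5.961),
}

def metrics(alms, cycles, clk_period):
    """Return (area, delay, area-delay product) for one design."""
    delay = cycles * clk_period
    return alms, delay, alms * delay

for name, row in designs.items():
    area, delay, adp = metrics(*row)
    print(f"{name}: area={area:4d}  delay={delay:7.3f}  area-delay={adp:9.1f}")
```

Rounded to whole units, the delays come out as 8, 10, 77 and 66, matching the
DELAY row.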

6:15 Now, let's compare these four designs:

            Area  |
                  |         2
                  |
                  |    1
                  |                            
                  |                            4
                  |                                    3
                  |
                  |
                  +---------------------------------------- Delay

(these positions are approximate)

You can see that:
- Design 2 never outperforms Design 1; compared to Design 1,
  Design 2 is suboptimal. Nevertheless, we did this transformation
  to enable the resource sharing that led to Design 3 and Design 4,
  so even though Design 2 is a 'throw-away' design, it was not useless.

- Designs 1, 3, and 4 are each optimal within the design space
  considered. Depending on how you select the area and delay
  constraint, you may choose Design 1, 3 or 4 as the most appropriate
  solution.
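
As an illustration of how the choice depends on the constraint, the Python
sketch below picks, for a given delay constraint, the smallest design that
meets it. Taking the smallest qualifying design is only one possible selection
policy, and the constraint values tried are hypothetical; the area and delay
numbers are the AREA and DELAY rows of the final table.

```python
# (area in ALMs, delay) per design, from the AREA and DELAY rows.
designs = {
    "Design 1": (130, 8),
    "Design 2": (159, 10),
    "Design 3": (67, 77),
    "Design 4": (81, 66),
}

def smallest_meeting_delay(designs, max_delay):
    """Among designs with delay < max_delay, return the name of the one with
    the least area, or None if no design meets the constraint."""
    ok = {name: ad for name, ad in designs.items() if ad[1] < max_delay}
    return min(ok, key=lambda name: ok[name][0]) if ok else None

for max_delay in (5, 10, 70, 100):  # hypothetical delay constraints
    print(max_delay, smallest_meeting_delay(designs, max_delay))
# prints:
# 5 None
# 10 Design 1
# 70 Design 4
# 100 Design 3
```

Design 2 is never selected under this policy, in line with it being
suboptimal compared to Design 1.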