Lecture 17 - 4/2/2019 - Pipelining and Retiming

1. Basics
      Latency
      Throughput
      Critical Path
2. Pipelined Adder
   Pipelined Multiplier (Wallace Tree Multiplier)
3. Retiming
   Rules for Retiming
4. Retiming in Quartus

--------------------------------------------------------

5:00PM Pipelining and Retiming

A pipeline partitions a long combinational computation into multiple small
stages of combinational logic. Each stage is separated from the next one by a
pipeline register.

Instead of computing

    output = comb_logic(input)

we compute (example of a three-stage pipeline):

    pipe1  = comb_logic_1(input);    // pipeline stage 1
    pipe2  = comb_logic_2(pipe1);    // pipeline stage 2
    output = comb_logic_3(pipe2);    // pipeline stage 3

such that comb_logic ~ comb_logic_3(comb_logic_2(comb_logic_1))

A pipelined circuit has the following properties, compared to the original
circuit.

1/ Increased latency.
      Original circuit  = 1 cycle (a combinational circuit, but we need to
                          allocate at least one clock cycle so that it can
                          complete)
      Pipelined circuit = N cycles (for N pipeline stages)

2/ Decreased critical path.
      Original circuit  = K
      Pipelined circuit = between K/N and K

   In a perfectly balanced pipeline, the delay of each pipeline stage is
   identical. This is the optimal case, since it results in the shortest
   possible critical path, K/N. In a real pipeline, the delays are not
   balanced over the pipeline stages. In the worst case, one pipeline stage
   dominates all others, with a combinational delay close to that of the
   original, non-pipelined circuit.

3/ Increased throughput.
      In the original circuit, we get one result per clock cycle.
      In the pipelined circuit, we also get one result per clock cycle.
      However, ideally, the clock frequency of the pipeline is N times higher
      than the clock frequency of the original circuit. Therefore, the
      throughput is N times higher.

The operation of a pipeline is easy to express in terms of clock cycles and
pipeline stages:

               Cycle1   Cycle2   Cycle3
             ---------------------------
    stage-1      X
    stage-2               X
    stage-3                        X

This structure is called a RESERVATION TABLE; it describes how the pipeline
stages are utilized. A reservation table helps to illustrate how the pipeline
operates:

               Cycle1   Cycle2   Cycle3   Cycle4   Cycle5   Cycle6
             ------------------------------------------------------
    stage-1      1        2        3        4        ..
    stage-2               1        2        3        4        ..
    stage-3                        1        2        3        4

In a LINEAR pipeline, each pipeline stage is used exactly once for every data
item that is introduced into the pipeline. In a NONLINEAR pipeline, some
pipeline stages are REUSED during processing. Nonlinear pipelines have a lower
throughput than linear pipelines. If we call G the maximum number of times a
pipeline stage is reused, then the throughput of the nonlinear pipeline will
be 1/G of the throughput of the linear pipeline.

Example of a nonlinear pipeline with G = 2:

               Cycle1   Cycle2   Cycle3   Cycle4
             ------------------------------------
    stage-1      X
    stage-2               X                  X
    stage-3                        X

In this example, stage-2 is used during Cycle 2 and Cycle 4.

5:20 Pipelined Adders

How do we design circuits with pipelining? We will start with two datapath
components - adders and multipliers - and study how to apply pipelining to
their implementation.

Consider the basic addition algorithm.

    Operand A     A2 A1 A0
    Operand B     B2 B1 B0
    Result C      C2 C1 C0

Consider a full-adder cell

    (Cq, Q) = FA(A, B, Ci)

Then

    (Cq0, C0) = FA(A0, B0, 0)
    (Cq1, C1) = FA(A1, B1, Cq0)
    (Cq2, C2) = FA(A2, B2, Cq1)

This circuit is slow - it has a critical path delay of three full-adder cells.
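As a point of reference, here is a minimal structural Verilog sketch of this
un-pipelined ripple-carry adder. The module names fa and rca3 are invented for
illustration; the lecture only gives the equations above.

    // Full-adder cell: (Cq, Q) = FA(A, B, Ci)
    module fa(input A, input B, input Ci, output Q, output Cq);
        assign Q  = A ^ B ^ Ci;                     // sum bit
        assign Cq = (A & B) | (A & Ci) | (B & Ci);  // carry out
    endmodule

    // 3-bit ripple-carry adder built from three FA cells.
    module rca3(input [2:0] A, input [2:0] B, output [2:0] C, output Cq2);
        wire Cq0, Cq1;
        // The carry chain Cq0 -> Cq1 -> Cq2 is the critical path:
        // three full-adder delays from (A0, B0) to (Cq2, C2).
        fa fa0 (.A(A[0]), .B(B[0]), .Ci(1'b0), .Q(C[0]), .Cq(Cq0));
        fa fa1 (.A(A[1]), .B(B[1]), .Ci(Cq0),  .Q(C[1]), .Cq(Cq1));
        fa fa2 (.A(A[2]), .B(B[2]), .Ci(Cq1),  .Q(C[2]), .Cq(Cq2));
    endmodule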
Namely, there is a dependency from the first FA to the second (through Cq0)
and from the second FA to the third (through Cq1). Therefore, even though the
combinational circuit completes all three bit positions within a single clock
cycle, the full-adder cells still evaluate sequentially, one after the other,
as the carry ripples through the chain. Can we use pipelining to make this
faster? Yes! We can design a structure so that there is only one FA between
any two registers. However, you have to do the pipelining taking ALL data
dependencies into account.
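As an illustration, here is a behavioral Verilog sketch of such a pipelined
3-bit adder: three stages, with one FA's worth of logic per stage, and delay
registers that keep the input and output bits aligned. All names (add3_pipe,
a1_d, c0_d2, ...) are invented for illustration; the diagram that follows
shows the same structure graphically.

    module add3_pipe(
        input            clk,
        input      [2:0] A, B,
        output reg [2:0] C,
        output reg       Cq2
    );
        // stage-1 registers
        reg a1_d, b1_d, a2_d1, b2_d1;   // delayed input bits
        reg c0_d1, cq0_r;
        // stage-2 registers
        reg a2_d2, b2_d2;
        reg c0_d2, c1_d1, cq1_r;

        always @(posedge clk) begin
            // Stage 1: add bit 0; delay the higher-order input bits.
            {cq0_r, c0_d1} <= A[0] + B[0];
            a1_d  <= A[1];  b1_d  <= B[1];
            a2_d1 <= A[2];  b2_d1 <= B[2];

            // Stage 2: add bit 1 with the registered carry; keep delaying.
            {cq1_r, c1_d1} <= a1_d + b1_d + cq0_r;
            a2_d2 <= a2_d1; b2_d2 <= b2_d1;
            c0_d2 <= c0_d1;

            // Stage 3: add bit 2; align all result bits at the output.
            {Cq2, C[2]} <= a2_d2 + b2_d2 + cq1_r;
            C[1] <= c1_d1;
            C[0] <= c0_d2;
        end
    endmodule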
In circuit form, the pipelined adder looks like this (R denotes a pipeline
register; the dotted lines mark the pipeline register cuts):

      A2 B2      A1 B1      A0 B0
       |  |       |  |       |  |
       |  |       |  |      [FA0]----+ C0
       |  |       |  |    Cq0 |      |
    ...R..R.......R..R........R......R...    <- pipeline register cut 1
       |  |       |  |        |      |
       |  |      [FA1]--------+      |
       |  |    Cq1|  |C1             |
    ...R..R.......R..R...............R...    <- pipeline register cut 2
       |  |       |  |               |
      [FA2]-------+  |               |
       |  |          |               |
     Cq2  C2         C1              C0

Note how the pipeline registers cut multiple wires. All of these wires need to
be stored in pipeline registers. In addition, this structure can also be used
to build high-speed, long-wordlength additions out of smaller adders. E.g.,
with 8 pipelined 16-bit adder stages, you can create a 128-bit adder with the
speed (throughput) of a 16-bit adder, but with 8 times higher latency.

5:20 Pipelined Multiplier

Similar to an adder, we can pipeline a multiplier. This will, however, reveal
a challenge with our approach to pipelining: the exorbitant cost of storing
intermediate results in pipeline registers.

Consider

    Operand A     A2 A1 A0
    Operand B     B2 B1 B0

and we compute A x B:

                      A2    A1    A0
                x     B2    B1    B0
               ----------------------
                    B0A2  B0A1  B0A0
              B1A2  B1A1  B1A0
        B2A2  B2A1  B2A0
    ----------------------------------
          L4    L3    L2    L1    L0

The partial products themselves can all be computed in parallel. However, the
partial products then have to be accumulated in 'lanes':

    L0 = B0A0                        - does not generate a carry (A0 and B0 are bits)
    L1 = B0A1 + B1A0                 - may generate carry cy1
    L2 = B0A2 + B1A1 + B2A0 + cy1    - may generate carry cy2
    L3 = B1A2 + B2A1 + cy2           - may generate carry cy3
    L4 = B2A2 + cy3                  - may generate carry cy4

Again, there is a dependency through the carry chain. Moreover, if we assume
that we have only a full-adder cell available, then we will again need
multiple clock cycles:
    Cycle 0    (cy1,  C1) = B0A1 + B1A0
    Cycle 1    (cy2a, t0) = B0A2 + B1A1 + cy1
    Cycle 2    (cy2b, C2) = t0 + B2A0
               (cy3a, t1) = B1A2 + B2A1 + cy2a
    Cycle 3    (cy3b, C3) = cy2b + t1
               (cy4a, t2) = B2A2 + cy3a
    Cycle 4    (cy4b, C4) = cy3b + t2
    Cycle 5    (_,    C5) = cy4a + cy4b

Unfortunately, this formulation creates a lot of intermediate results that
have to be carried along in the pipeline registers (see figure).

A better solution is to consider the addition of partial products per lane.
Consider again the 'lanes' of additions:

    L0 = B0A0                        - does not generate a carry (A0 and B0 are bits)
    L1 = B0A1 + B1A0                 - may generate carry cy1
    L2 = B0A2 + B1A1 + B2A0 + cy1    - may generate carry cy2
    L3 = B1A2 + B2A1 + cy2           - may generate carry cy3
    L4 = B2A2 + cy3                  - may generate carry cy4

Then the carry-independent parts of L1, L2 and L3 can all be done in parallel:

    Cycle 0    (cy1, C1) = B0A1 + B1A0
               (cy2, v2) = B0A2 + B1A1 + B2A0
               (cy3, v3) = B1A2 + B2A1

This structure is called a Wallace tree, named after its inventor, Wallace
(1964). The final accumulation of the partial products is still sequential,
and still uses a ripple-carry adder.

5:40 Retiming

The question then arises - can we compute the distribution of pipeline
registers automatically, so as to balance the stages of a pipeline?

    Logic1 -->|pipe|--> Logic2 -->|pipe|--> Logic3

If the critical paths of Logic1, Logic2 and Logic3 are very different, what
can we do? The solution is 'retiming' - the redistribution of pipeline
registers. Retiming consists of a set of netlist transformations that are
based on simple equivalence rules.

Rule 1: A register can be moved across a logic circuit which has no loops,
        without changing the functionality of the resulting design.

    --> register --> logic -->    ==    --> logic --> register -->

Rule 2: A register can be moved over a fork by duplicating it.

    --> register ----+----> logic1            -------+---> register --> logic1
                     |                  ==           |
                     +----> logic2                   +---> register --> logic2

Rule 3: A register can be merged into a fork (the reverse of Rule 2).

    -------+---> register --> logic1          --> register ----+----> logic1
           |                            ==                     |
           +---> register --> logic2                           +----> logic2

That's it! Let's try to retime a few sample circuits. Try to reduce the
critical path in each case.

EXAMPLE 1:

    --> R  R ---> logic1 ---> logic2 ---> output

    Before:  critical path = path(logic1) + path(logic2)

    Retimed result:

    --> R ---> logic1 --- R ---> logic2 ---> output

    After:   critical path = max(path(logic1), path(logic2))

EXAMPLE 2:

    --> R --> |
              | logic1 --------> |
    --> R --> |                  | logic2 --> output
                                 |
    --> R ---------------------> |

    Before:  critical path = path(logic1) + path(logic2)

    Retimed result:

    --------> |
              | logic1 --- R --> |
    --------> |                  | logic2 --> output
                                 |
    ----------------- R -------> |

    After:   critical path = max(path(logic1), path(logic2))

EXAMPLE 3:

    --> R --> |
              | logic1 --------> logic2 ---+
    +-> R --> |                            |
    |                                      |
    +--------------------------------------+

    Before:  critical path = path(logic1) + path(logic2)

    Retimed result:

    --------> |
              | logic1 --- R --> logic2 ---+
    +-------> |                            |
    |                                      |
    +--------------------------------------+

    After:   critical path = path(logic1) + path(logic2)

EXAMPLE 4:

    --> R --> |
              | logic1 --- R --> logic2 ---+
    +-------> |                            |
    |                                      |
    +--------------------------------------+

    Before:  critical path = path(logic1) + path(logic2)

    Retimed result (no register move reduces the critical path; the circuit
    stays as it is):

    --> R --> |
              | logic1 --- R --> logic2 ---+
    +-------> |                            |
    |                                      |
    +--------------------------------------+

    After:   critical path = path(logic1) + path(logic2)

This demonstrates that feedback loops cannot be retimed away: the number of
delays (registers) found in a feedback loop must stay constant, so the
combinational delay around the loop cannot be reduced!
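To make the register-moving rules concrete, here is a small Verilog sketch of
EXAMPLE 1, assuming logic1 and logic2 are simple 8-bit operations chosen only
for illustration (an add and a multiply); the module and signal names are
invented. Both modules produce the same output sequence, but the retimed one
has the shorter register-to-register path max(path(logic1), path(logic2)).

    module example1_before(input clk, input [7:0] x, output reg [7:0] out);
        reg [7:0] r1, r2;
        always @(posedge clk) begin
            r1  <= x;
            r2  <= r1;                    // two registers back to back
            out <= (r2 + 8'd3) * 8'd5;    // logic1 (+3) and logic2 (*5) in one path
        end
    endmodule

    module example1_retimed(input clk, input [7:0] x, output reg [7:0] out);
        reg [7:0] r1, r2;
        always @(posedge clk) begin
            r1  <= x;
            r2  <= r1 + 8'd3;             // logic1 before the moved register
            out <= r2 * 8'd5;             // logic2 after the moved register
        end
    endmodule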
6:00 Retiming in Quartus

Discuss example circuit:

    `define TAP 16     // number of filter taps (assumed; acc[15] below implies 16)

    module fir(
        input         CLOCK_50,
        input  [3:0]  KEY,
        output [9:0]  LEDR,
        input  [9:0]  SW
    );

        reg [9:0] sysin, sysin_next, sysout, sysout_next;
        reg [9:0] tap[`TAP-1:0], tap_next[`TAP-1:0];
        reg [9:0] acc[`TAP-1:0];
        integer i;

        always @(posedge CLOCK_50) begin
            sysin  <= KEY[0] ? sysin_next  : 10'b0;
            sysout <= KEY[0] ? sysout_next : 10'b0;
            for (i = 0; i < `TAP; i = i + 1)
                tap[i] <= KEY[0] ? tap_next[i] : 10'b0;
        end

        always @(*) begin
            sysin_next  = SW[9:0];
            tap_next[0] = sysin;
            for (i = 1; i < `TAP; i = i + 1)
                tap_next[i] = tap[i-1];
            acc[0] = tap[0];
            for (i = 1; i < `TAP; i = i + 1)
                acc[i] = acc[i-1] + tap[i];
            sysout_next = acc[15];
        end

        assign LEDR = sysout;

    endmodule

Demonstrate:
    - Synthesis results
    - Use of retiming options and their effect on the critical path
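One way to give a retiming pass something to work with in this design (a
sketch, not part of the lecture handout) is to add an extra register in series
at the output, inside the fir module above. By itself this only adds one cycle
of latency, but it leaves a register that retiming can move backward into the
long acc[] adder chain (Rule 1 applied in reverse), splitting the chain and
shortening the critical path. The name sysout2 is invented for illustration.

    // Extra output register, declared and clocked inside the fir module above.
    reg [9:0] sysout2;

    always @(posedge CLOCK_50) begin
        sysout2 <= sysout;          // one more register for retiming to move
    end

    // Drive the LEDs from the re-registered value instead of sysout:
    // assign LEDR = sysout2;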