Lecture 18 - 4/4/2019 - Unfolding

1.  Example of Retiming in Quartus
2.  Unfolding
2.1 Example of stacked counter
2.2 Formalized Model
2.3 Loop Bound and Iteration Bound
2.4 Unfolding Algorithm
2.5 Examples

--------------------------------------------------------

Timeline

This lecture:
  Retiming Demo
  Unfolding

Next lecture 4/9:
  Review

Lecture after next 4/11:
  Exam II


5:00PM Pipelining

Recap of key ideas:

  - Pipelining improves throughput

  - A combinational delay of C
    is split into N parts by using (N-1) pipeline registers,
    where each part has a delay of C/N (ideally)

    Hence the latency:
      Original Circuit:  1 (cycle) x C    = C (s)
      Pipelined Circuit: N (cycle) x C/N  = C (s)

    Hence the throughput:
      Original Circuit:  C^-1 (s^-1)
      Pipelined Circuit: N x C^-1 (s^-1)

    The latency stays the same; the throughput increases by a factor of N

  - Retiming is used to automatically redistribute the registers
    in a network

    Three rules:

    Rule 1: A register can be moved across a logic block

        --> register --> logic -->   ==   --> logic --> register -->

    Rule 2: A register can be moved over a fork by duplicating it

        --> register ----+----> logic1         -------+---> register --> logic1
                         |                ==          |
                         +----> logic2                +---> register --> logic2

    Rule 3: Registers on all inputs of a join can be merged into one register after the join

         --> register -----> logic --->         ---> logic --> register --->
                              |          ==            |
         --> register --------+                 -------+

    Retiming can reduce the critical path of a design, generally at the expense
    of area.
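
The latency and throughput figures from the pipelining recap can be checked
with a few lines of Python. This is only a sketch of the ideal case (equal
stage delays, no register overhead); the 20ns value for C and the function
name are assumptions made for the illustration.

```python
def pipeline_metrics(c_seconds, n_stages):
    """Ideal latency and throughput when a delay C is split into N equal stages."""
    clock_period = c_seconds / n_stages   # each stage has a delay of C/N
    latency = n_stages * clock_period     # N cycles x C/N = C
    throughput = 1.0 / clock_period       # one result per cycle = N/C
    return latency, throughput


C = 20e-9  # assumed combinational delay of 20 ns
for n in (1, 2, 4):
    latency, throughput = pipeline_metrics(C, n)
    print(f"N={n}: latency = {latency * 1e9:.1f} ns, "
          f"throughput = {throughput / 1e6:.0f} M results/s")
```

The latency column stays at C for every N, while the throughput column scales
with N.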

EXAMPLE 1:

        --> R --> |
                  logic1 --------> |
        --> R --> |                logic2 --> output
                                   |
        --> R -------------------> |


Before: critical path = path(logic1) + path(logic2)

Retimed result:

        --------> |
                  logic1 -- R ---> |
        --------> |                logic2 --> output
                                   |
        ------------------- R ---> |

After: critical path = max(path(logic1), path(logic2))

EXAMPLE 2:

        --> R --> |
                  logic1 --------> logic2 ---+
        +-> R --> |                          |
        |                                    |
        +------------------------------------+     
                                   
Before: critical path = path(logic1) + path(logic2)

Retimed result:

        --------> |
                  logic1 --- R --> logic2 ---+
        +-------> |                          |
        |                                    |
        +------------------------------------+     
     
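After: critical path = path(logic1) + path(logic2)
(unchanged: the loop holds only one register, so logic2 followed by logic1
still forms a single register-to-register path)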
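
The before/after claims of Examples 1 and 2 can be reproduced with a small
model: a circuit is a list of connections, each carrying a register count,
and the critical path is the longest path that crosses no register. This is a
sketch only; the delays (6ns for logic1, 4ns for logic2) and the input names
a, b, c are assumptions, not values from the lecture.

```python
def critical_path(delay, edges):
    """Longest register-free path, in ns.

    delay: {node: combinational delay in ns}
    edges: [(src, dst, registers_on_the_connection)]
    Assumes every loop holds at least one register.
    """
    succ = {node: [] for node in delay}
    for src, dst, regs in edges:
        if regs == 0:                 # a register ends the combinational path
            succ[src].append(dst)

    memo = {}

    def longest_from(node):
        if node not in memo:
            memo[node] = delay[node] + max(
                (longest_from(s) for s in succ[node]), default=0)
        return memo[node]

    return max(longest_from(node) for node in delay)


# Assumed delays (ns); the inputs a, b, c are modeled as zero-delay nodes.
delay = {"a": 0, "b": 0, "c": 0, "logic1": 6, "logic2": 4}

ex1_before = [("a", "logic1", 1), ("b", "logic1", 1),
              ("logic1", "logic2", 0), ("c", "logic2", 1)]
ex1_after = [("a", "logic1", 0), ("b", "logic1", 0),
             ("logic1", "logic2", 1), ("c", "logic2", 1)]

ex2_before = [("a", "logic1", 1), ("logic2", "logic1", 1),
              ("logic1", "logic2", 0)]
ex2_after = [("a", "logic1", 0), ("logic2", "logic1", 0),
             ("logic1", "logic2", 1)]

for name, edges in [("ex1 before", ex1_before), ("ex1 after", ex1_after),
                    ("ex2 before", ex2_before), ("ex2 after", ex2_after)]:
    print(name, critical_path(delay, edges), "ns")
```

With these assumed delays, Example 1 drops from the sum to the maximum of the
two delays, while Example 2 keeps the sum because the feedback loop has a
single register.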

5:10 Automatic retiming in Quartus

Discuss example circuit (draw it):

// Number of taps. Assumed to be 16 here, matching the original acc[15];
// the repository may define TAP elsewhere.
`ifndef TAP
`define TAP 16
`endif

module fir(
    input                       CLOCK_50,
    input            [3:0]      KEY,
    output           [9:0]      LEDR,
    input            [9:0]      SW
);

    reg [9:0] sysin, sysin_next, sysout, sysout_next;
    reg [9:0] tap[`TAP-1:0], tap_next[`TAP-1:0];
    reg [9:0] acc[`TAP-1:0];
    integer i;

    // Registers; KEY[0] low acts as a synchronous reset
    always @(posedge CLOCK_50)
    begin
      sysin  <= KEY[0] ? sysin_next : 10'b0;
      sysout <= KEY[0] ? sysout_next : 10'b0;
      for (i = 0; i < `TAP; i = i + 1)
        tap[i] <= KEY[0] ? tap_next[i] : 10'b0;
    end

    // Next-state logic
    always @(*)
    begin
      sysin_next = SW[9:0];

      // tap delay line
      tap_next[0] = sysin;
      for (i = 1; i < `TAP; i = i + 1)
         tap_next[i] = tap[i-1];

      // adders: acc[i] is the sum of tap[0] .. tap[i]
      acc[0] = tap[0];
      for (i = 1; i < `TAP; i = i + 1)
         acc[i] = acc[i-1] + tap[i];

      sysout_next = acc[`TAP-1];
    end

    assign LEDR = sysout;

endmodule

We will do timing analysis of this design.
- When using default optimization
- When using retiming and register duplication

Note: when doing retiming, we have to specify the delay that comes
with system inputs (and optionally system outputs).

For example:

set_input_delay -add_delay  -clock [get_clocks {CLOCK_50}]  0.000 [get_ports {KEY[0]}]

Do the following steps:

1/ git clone https://github.com/vt-ece4514-s19/fir

2/ open the design and synthesize

  170 registers
  65 ALM
  TimeQuest 146.2 MHz

3/ open the TimeQuest timing analyzer
  scroll to 'report timing'
  select from-clock CLOCK_50
        to-clock CLOCK_50 

  Find critical path:
     SW[5] -> tap[0][5]
     SW[3] -> tap[0][3]
     ..

  This means that the critical path is in fact not in the adder tree but in
  the routing from the I/O pin to the logic

  Right-click on a path and select
  'locate path in chip planner'
  'locate path in technology map viewer'

  This gives a physical meaning to the path

  Examine the impact of routing and interconnect (IC in TimeQuest)

4/ add delay constraints 

   For KEY and SW, add an input delay of 0ns relative to clock CLOCK_50

   Close the TimeQuest timing analyzer and update the fir.sdc file.

5/ In Quartus, open fir.sdc and verify the new constraints

6/ Open project settings select Compiler Settings
   Optimize for performance
   Enable register retiming
   Click Fitter Advanced Settings
       Allow register duplication
       Allow register retiming
   Run synthesis

   470 registers
   136 ALM
   140.55MHz
   
   Adding two pipeline registers at the input:

   154 MHz
   141 ALM
   493 registers

5:30 Unfolding

Thought example:

What is the critical path of the following circuit?
            10ns
      +--> (+1) --->  reg ---> reg ---+
      |                               |
      +-------------------------------+

The critical path is 10ns. 

What does this circuit do?

It increments.

cycle   R1  R2
0       0   0
1       1   0
2       1   1
3       2   1
4       2   2
5       3   2
6       3   3
        ...

So this circuit has two independent 'streams': odd and even clock cycles.
The two registers work like two interleaved counters.
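
The table above can be regenerated with a short cycle-by-cycle model. This is
a sketch: R1 is taken to be the register right after the (+1) block and R2
the register that feeds back, and both are assumed to reset to 0.

```python
def simulate(cycles):
    """Return (cycle, R1, R2) rows for the loop (+1) -> R1 -> R2 -> (+1)."""
    r1, r2 = 0, 0
    rows = [(0, r1, r2)]
    for cycle in range(1, cycles + 1):
        # Both registers load on the same clock edge:
        # R1 takes R2 + 1, and R2 takes the old value of R1.
        r1, r2 = r2 + 1, r1
        rows.append((cycle, r1, r2))
    return rows


print("cycle   R1  R2")
for cycle, r1, r2 in simulate(6):
    print(f"{cycle:<7} {r1:<3} {r2}")
```

Reading R1 on even cycles only, or on odd cycles only, gives a plain counter
in each case, which is the 'two streams' observation.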
We could, therefore, have obtained the same functionality as follows:

            10ns
      +--> (+1) --->  reg ------------+
      |                               |
      +-------------------------------+

            10ns
      +--> (+1) --->  reg ------------+
      |                               |
      +-------------------------------+

Two parallel counters, each with a critical path of 10ns, each produce
an independent count. If we would interleave the outputs of these two circuits,
we would obtain the same as for the original circuit.

What we just demonstrated is UNFOLDING: we have unfolded the top circuit into
two circuits, one which represents the even clock cycles, and another which
represents the odd clock cycles.

In the process, we have improved the throughput of the overall solution.
In the original circuit, the critical path is 10ns and each cycle produces
one result. In the unfolded circuit, the critical path is still 10ns, but
each cycle produces two results.
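
A quick way to convince yourself that the two parallel counters are
equivalent to the original loop is to interleave their outputs and compare.
The sketch below assumes the output is observed at the second register (R2)
and that all registers reset to 0; both choices are assumptions for the
illustration.

```python
def original_outputs(cycles):
    """R2 of the two-register loop, one value per clock cycle."""
    r1, r2 = 0, 0
    out = []
    for _ in range(cycles):
        out.append(r2)
        r1, r2 = r2 + 1, r1
    return out


def unfolded_outputs(cycles):
    """Two independent counters; each clock cycle yields two values."""
    even, odd = 0, 0
    out = []
    for _ in range(cycles):
        out.extend((even, odd))   # interleave: even stream first, then odd
        even, odd = even + 1, odd + 1
    return out


# 4 cycles of the unfolded circuit cover 8 cycles of the original.
print(original_outputs(8))
print(unfolded_outputs(4))
```

Both calls print the same sequence, but the unfolded version needs half as
many clock cycles to produce it.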

The question then is: can we SYSTEMATICALLY unfold digital circuits in order
to improve throughput?

The answer is - yes.

5:40 Definitions

We are defining circuits with combinational modules U, V, ...
which have a given critical path.

For example

    --> U --> V -->
       10ns  10ns

means: U is connected to V, and each has a critical path of 10ns,
therefore the overall critical path is 20ns.

We also have registers (delays) which can be inserted on every interconnection.

    --> U --R--> V -->
       10ns     10ns

means: U is connected to R which is connected to V.
The critical path of the overall design is 10ns.

We can now draw circuits and we allow loops as well.
For example:

    +--> U --R--R--R--> V ---+
    |                        |
    +------------------------+

We can have multiple loops, such as:

    +----------------------------------+
    +-->                               |
    +--> U --R--R--R--> V ---+---> W --+
    |                        |
    +------------------------+

We define the LOOP BOUND of a loop as the sum of the combinational delays
in that loop divided by the number of registers in that loop.

Assuming U, V have 10ns delay, then the loop bound is:

    +--> U --R--R--R--> V ---+
    |                        |
    +------------------------+

    Loop Bound:  20 / 3 = 6.66

In the following circuit there are two loops, and hence two loop bounds:

    +----------------------------------+
    +-->                               |
    +--> U --R--R--R--> V ---+---> W --+
    |                        |
    +------------------------+

Assuming U, V, W are 10ns then

    Loop Bound 1:  20 / 3 = 6.66
    Loop Bound 2:  30 / 3 = 10

We can also define the ITERATION BOUND.

ITERATION BOUND = MAX(LOOP BOUNDS)

For the above two circuits, the ITERATION BOUND for the first is 6.66
while for the second it is 10.
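
The loop bound and iteration bound definitions translate directly into code.
The sketch below lists the loops of a small circuit with a depth-first search
and applies the two formulas. The edge-list representation and the function
names are assumptions for the illustration, and every loop is assumed to hold
at least one register.

```python
def simple_loops(edges):
    """Yield each loop once, as a list of (src, dst, registers) connections."""
    succ = {}
    for edge in edges:
        succ.setdefault(edge[0], []).append(edge)

    def walk(start, node, path, visited):
        for edge in succ.get(node, []):
            dst = edge[1]
            if dst == start:
                yield path + [edge]
            elif dst not in visited and dst > start:
                # Only visit names after 'start', so that each loop is
                # reported once, from its alphabetically first module.
                yield from walk(start, dst, path + [edge], visited | {dst})

    for start in sorted(succ):
        yield from walk(start, start, [], {start})


def loop_bounds(delay, edges):
    """Loop bound = sum of module delays / number of registers, per loop."""
    bounds = []
    for loop in simple_loops(edges):
        total_delay = sum(delay[src] for src, _, _ in loop)
        registers = sum(regs for _, _, regs in loop)
        bounds.append(total_delay / registers)
    return bounds


def iteration_bound(delay, edges):
    return max(loop_bounds(delay, edges))


delay = {"U": 10, "V": 10, "W": 10}                     # ns
one_loop = [("U", "V", 3), ("V", "U", 0)]
two_loops = one_loop + [("V", "W", 0), ("W", "U", 0)]

print(loop_bounds(delay, one_loop), iteration_bound(delay, one_loop))
print(loop_bounds(delay, two_loops), iteration_bound(delay, two_loops))
```

For the two circuits above this gives 20/3 ns for the single-loop circuit and
10 ns for the two-loop circuit.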

What is the meaning of the iteration bound? The iteration bound is the
smallest achievable time per output, so its inverse is the best achievable
throughput of the circuit. Because of how the combinational delays are placed
between the registers, the circuit as drawn may not achieve it.

For example, the following circuit has a critical path of 20ns (V feeds U
without a register in between), while the iteration bound is 6.66ns:

    +--> U --R--R--R--> V ---+
    |                        |
    +------------------------+

However, through UNFOLDING, it is possible to build circuits for which
the critical path and the iteration bound approximate each other.

The COST of unfolding is that we have to duplicate combinational
logic. Thus instead of a single logic U, we will introduce multiple
of them: U0, U1, U2, ...

For a J-fold unfolding, we have U -> U0, U1, ..., U(J-1)

5:50 Systematic rules for unfolding

=============================================================
  When unfolding J times:

  1/ duplicate each module J times

    U -> U0, U1, U2, ...

  2/ For each connection with w registers:

        U --R-->  V

    make J connections (i = 0, 1, ..., J-1) from

        U(i) to V((i+w) % J)   with  floor((i+w)/J)  delays
=============================================================
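
The two rules in the box fit in a few lines of Python. This is a minimal
sketch, assuming a circuit is given as a list of (source, destination,
registers) connections; the representation and the name unfold are choices
made for the illustration.

```python
def unfold(edges, J):
    """Apply the unfolding rules to a list of (src, dst, w) connections.

    Copies of a module are named (module, i) for i = 0 .. J-1.
    """
    unfolded = []
    for src, dst, w in edges:
        for i in range(J):
            target = (dst, (i + w) % J)   # U(i) connects to V((i+w) % J)
            delays = (i + w) // J         # with floor((i+w)/J) registers
            unfolded.append(((src, i), target, delays))
    return unfolded


# U --R--R--R--> V, with V fed back to U without registers
circuit = [("U", "V", 3), ("V", "U", 0)]

for J in (2, 3):
    print(f"J = {J}")
    for src, dst, delays in unfold(circuit, J):
        print(f"  {src[0]}{src[1]} -> {dst[0]}{dst[1]}  ({delays} delays)")
```

One useful sanity check: the total number of registers is the same before and
after unfolding (3 in this example, for any J).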


Example. Let's unfold this circuit three times:

    +--> U --R--R--R--> V ---+
    |                        |
    +------------------------+

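    U0 goes to V((0+3)%3) = V0
    and #delays = floor((0 + 3)/3) = 1

    U1 goes to V((1+3)%3) = V1
    and #delays = floor((1 + 3)/3) = 1

    U2 goes to V((2+3)%3) = V2
    and #delays = floor((2 + 3)/3) = 1

    V0, V1, V2 go to U0, U1, U2  (no delays)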

    +--------------------+
    |                    |
    +-- U0 -- R --> V0 --+

    +--------------------+
    |                    |
    +-- U1 -- R --> V1 --+

    +--------------------+
    |                    |
    +-- U2 -- R --> V2 --+

Each copy has a critical path of 20ns (V feeds U without a register), but the
overall design produces three outputs every 20ns, or one output every 6.66ns
(the iteration bound).

We could also unfold the circuit only two times:

    +--> U --R--R--R--> V ---+
    |                        |
    +------------------------+

    U0 goes to V((0+3)%2) = V1
    and #delays = floor((0 + 3)/2) = 1

    U1 goes to V((1+3)%2) = V0
    and #delays = floor((1 + 3)/2) = 2

    V0 goes to U0  (no delays)
    V1 goes to U1  (no delays)

    +------------------------+
    |                        |
    +--> U0    ---R--R- V0 --+
            \ /
             X
    +--> U1 / \-----R-- V1 --+
    |                        |
    +------------------------+

In this case the critical path is still 20ns. The circuit produces two outputs
every 20ns, or one output every 10ns. This is better than before (one output
every 20ns) but not as good as the iteration bound (6.66ns).
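
The comparison between no unfolding, 2-fold and 3-fold unfolding can be
tabulated with a short script. This sketch assumes the clock period equals
the critical path and that a J-fold circuit delivers J outputs per clock; the
helper names and the edge-list representation are made up for the
illustration.

```python
def unfold(edges, J):
    """U(i) -> V((i+w) % J) with floor((i+w)/J) delays, for i = 0 .. J-1."""
    return [((src, i), (dst, (i + w) % J), (i + w) // J)
            for src, dst, w in edges for i in range(J)]


def critical_path(delay, edges):
    """Longest register-free path; nodes are (module, copy) pairs."""
    nodes = {node for src, dst, _ in edges for node in (src, dst)}
    succ = {node: [] for node in nodes}
    for src, dst, regs in edges:
        if regs == 0:
            succ[src].append(dst)

    memo = {}

    def longest_from(node):
        if node not in memo:
            memo[node] = delay[node[0]] + max(
                (longest_from(s) for s in succ[node]), default=0)
        return memo[node]

    return max(longest_from(node) for node in nodes)


delay = {"U": 10, "V": 10}                      # ns
circuit = [("U", "V", 3), ("V", "U", 0)]


def time_per_output(J):
    """Clock period divided by the J outputs produced per clock."""
    return critical_path(delay, unfold(circuit, J)) / J


for J in (1, 2, 3):
    cp = critical_path(delay, unfold(circuit, J))
    print(f"J={J}: critical path {cp} ns, "
          f"one output every {time_per_output(J):.2f} ns")
```

Only J = 3 reaches the iteration bound of 20/3 ns here, because 3 is the
number of registers in the loop.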

There are no automatic tools for unfolding. This is something you would do by
transforming the Verilog by hand. For example, when dealing with
ultra-high-speed input/output, a serial-to-parallel conversion unit is used
to make the processing feasible. We then use unfolding to convert an algorithm
that processes one input at a time into an algorithm that processes N
inputs at a time.

Unfolding will increase the area cost.

  high speed 
  serial
  -------+                                                 +---->
         |                                                 |
         +--->                                        >----+
         +--->                                        >----+
         +--->                                        >----+
         +---> parallel   unfolded circuit   parallel >----+
         +--->                                        >----+
         +--->                                        >----+
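
The data flow in the drawing above can be mimicked in software. The sketch
below groups a serial stream into words of J samples, hands each word to a
stand-in for the unfolded circuit, and serializes the results again. The
stand-in (J copies of a +1 operation) is purely hypothetical and has no state;
a real unfolded design would also carry registers between the copies.

```python
def serial_to_parallel(samples, J):
    """Group a serial stream into words of J samples, one word per slow clock.

    A trailing partial word is dropped in this sketch.
    """
    usable = len(samples) - len(samples) % J
    return [samples[k:k + J] for k in range(0, usable, J)]


def parallel_to_serial(words):
    """Flatten the words back into a serial stream, preserving order."""
    return [sample for word in words for sample in word]


def unfolded_circuit(word):
    """Hypothetical stand-in: one copy of a (+1) module per parallel lane."""
    return [sample + 1 for sample in word]


J = 6                                   # six parallel lanes, as in the drawing
stream = list(range(12))
words = serial_to_parallel(stream, J)
result = parallel_to_serial([unfolded_circuit(word) for word in words])
print(words)
print(result)
```

The processing block sees one word per clock, so its clock can run J times
slower than the serial sample rate.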

6:00 Summary of area-time trade-off


           Area  --- Resource sharing
             |
             |
             |     (verilog)-- designer --(tools)
             |
             |
             |   Unfolding   Pipelining   Retiming
             |       |           |           |
             +------------------------------------ Delay