Lecture 18 - 4/4/2019 - Unfolding

1. Example of Retiming in Quartus
2. Unfolding
   2.1 Example of stacked counter
   2.2 Formalized Model
   2.3 Loop Bound and Iteration Bound
   2.4 Unfolding Algorithm
   2.5 Examples

--------------------------------------------------------

Timeline

   This lecture:            Retiming Demo
                            Unfolding
   Next lecture 4/9:        Review
   Next Next lecture 4/11:  Exam II

5:00PM Pipelining

Recap of key ideas:

- Pipelining improves throughput

- A combinational delay of C is split in N parts by using (N-1) pipeline registers,
  where each part has a delay of C/N (ideally)

  Hence the latency:
     Original Circuit:   1 (cycle) x C   = C (s)
     Pipelined Circuit:  N (cycle) x C/N = C (s)

  Hence the throughput:
     Original Circuit:   C^-1 (s^-1)
     Pipelined Circuit:  N . C^-1 (s^-1)

  The latency stays the same, the throughput increases by a factor of N

- Retiming is used to automatically redistribute the registers in a network

  Three rules:

  Rule 1: A register can be moved across a logic block

     --> register --> logic -->   ==   --> logic --> register -->

  Rule 2: A register can be moved over a fork by duplicating it

     --> register ----+----> logic1        -------+---> register --> logic1
                      |               ==          |
                      +----> logic2               +---> register --> logic2

  Rule 3: Registers on all inputs of a join can be merged into one register at its output

     --> register -----> logic --->        ---> logic --> register --->
                           |          ==          |
     --> register ---------+               -------+

  Retiming can reduce the critical path of a design, generally at the expense of area.
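As a quick numeric check of the latency and throughput figures in the pipelining recap, here is a minimal Python sketch. The function name pipeline_metrics and the 20ns / 4-stage values are assumptions chosen for illustration, and the model is the ideal one from the recap (it ignores register setup and clock-to-output delays).

```python
def pipeline_metrics(c_ns, n_stages):
    """Ideal N-stage pipelining of a combinational delay C (in ns).

    Returns (clock period in ns, latency in ns, results per ns).
    """
    period = c_ns / n_stages       # each stage has delay C/N
    latency = n_stages * period    # N cycles of C/N each
    throughput = 1.0 / period      # one result per clock cycle
    return period, latency, throughput


# Hypothetical numbers: 20ns of logic, unpipelined versus split into 4 stages.
original = pipeline_metrics(20.0, 1)
pipelined = pipeline_metrics(20.0, 4)

print("original :", original)
print("pipelined:", pipelined)
```

Under these assumptions the latency is the same in both cases, while the throughput of the pipelined version is N times higher.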
EXAMPLE 1:

   --> R --> |
             | logic1 --------> |
   --> R --> |                  | logic2 --> output
                                |
   --> R ---------------------> |

   Before: critical path = path(logic1) + path(logic2)

   Retimed result:

   --------> |
             | logic1 -- R ---> |
   --------> |                  | logic2 --> output
                                |
   ------------------- R -----> |

   After: critical path = max(path(logic1), path(logic2))

EXAMPLE 2:

   --> R --> |
             | logic1 --------> logic2 ---+
   +-> R --> |                            |
   |                                      |
   +--------------------------------------+

   Before: critical path = path(logic1) + path(logic2)

   Retimed result:

   --------> |
             | logic1 --- R --> logic2 ---+
   +-------> |                            |
   |                                      |
   +--------------------------------------+

   After: critical path = path(logic1) + path(logic2)

   Retiming does not help here: the loop still contains a single register,
   so the path through logic2 and logic1 remains combinational.

5:10 Automatic retiming in Quartus

Discuss example circuit (draw it):

   module fir(
      input        CLOCK_50,
      input  [3:0] KEY,
      output [9:0] LEDR,
      input  [9:0] SW
      );

      reg [9:0] sysin, sysin_next, sysout, sysout_next;
      reg [9:0] tap[`TAP-1:0], tap_next[`TAP-1:0];
      reg [9:0] acc[`TAP-1:0];
      integer   i;

      // registers; KEY[0] low clears them synchronously
      always @(posedge CLOCK_50)
        begin
           sysin  <= KEY[0] ? sysin_next  : 10'b0;
           sysout <= KEY[0] ? sysout_next : 10'b0;
           for (i = 0; i < `TAP; i = i + 1)
             tap[i] <= KEY[0] ? tap_next[i] : 10'b0;
        end

      always @(*)
        begin
           sysin_next  = SW[9:0];
           // delay line
           tap_next[0] = sysin;
           for (i = 1; i < `TAP; i = i + 1)
             tap_next[i] = tap[i-1];
           // adder chain over all taps
           acc[0] = tap[0];
           for (i = 1; i < `TAP; i = i + 1)
             acc[i] = acc[i-1] + tap[i];
           // hard-coded index: this is the last partial sum only if `TAP is 16
           sysout_next = acc[15];
        end

      assign LEDR = sysout;

   endmodule

We will do timing analysis of this design.

- When using default optimization
- When using retiming and register duplication

Note: when doing retiming, we have to specify the delay that comes with
system inputs (and optionally system outputs).
For example:

   set_input_delay -add_delay -clock [get_clocks {CLOCK_50}] 0.000 [get_ports {KEY[0]}]

Do the following steps:

1/ git clone https://github.com/vt-ece4514-s19/fir

2/ open the design and synthesize

      170 registers
      65 ALM
      TimeQuest 146.2 MHz

3/ open the TimeQuest timing analyzer
   scroll to 'report timing'
   select from-clock CLOCK_50 to-clock CLOCK_50

   Find critical path:

      SW[5] -> tap[0][5]
      SW[3] -> tap[0][3]
      ..

   This means that the critical path is in fact not in the adder tree
   but in the routing from the IO pin to the logic

   Right-click on a path and select
      'locate path in chip planner'
      'locate path in technology map viewer'

   This gives a physical meaning to the path
   Examine the impact of routing and interconnect (IC in TimeQuest)

4/ add delay constraints
   KEY and SW
   add a delay of 0ns for clock CLOCK_50

   Close the TimeQuest timing analyzer and update the fir.sdc file.

5/ In Quartus, verify and open fir.sdc

6/ Open project settings
   select Compiler Settings
      Optimize for performance
      Enable register retiming
   Click Fitter Advanced Settings
      Allow register duplication
      Allow register retiming

   Run synthesis

      470 registers
      136 ALM
      140.55 MHz

   Adding two pipeline registers at the input:

      154 MHz
      141 ALM
      493 registers

5:30 Unfolding

Thought example: What is the critical path of the following circuit?

        10ns
   +--> (+1) ---> reg ---> reg ---+
   |                              |
   +------------------------------+

The critical path is 10ns.
What does this circuit do? It increments.

   cycle   R1   R2
     0      0    0
     1      1    0
     2      1    1
     3      2    1
     4      2    2
     5      3    2
     6      3    3
     ...

So this circuit has two independent 'streams': odd and even clock cycles.
The two registers work like two interleaved counters.
We could, therefore, have obtained the same functionality as follows:

        10ns
   +--> (+1) ---> reg ------------+
   |                              |
   +------------------------------+

        10ns
   +--> (+1) ---> reg ------------+
   |                              |
   +------------------------------+

Two parallel counters, each with a critical path of 10ns, each produce an independent count.
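The cycle table and the two-counter circuit can be checked with a small Python sketch. This is only a behavioral model under stated assumptions: both registers start at 0, the output of the original circuit is taken at R2, and the function names stacked_counter and unfolded_counters are made up for this illustration.

```python
def stacked_counter(cycles):
    """Model +1 -> R1 -> R2 -> back to +1; returns (R1, R2) for each cycle."""
    r1 = r2 = 0
    trace = [(r1, r2)]
    for _ in range(cycles):
        # Both registers load on the same clock edge.
        r1, r2 = r2 + 1, r1
        trace.append((r1, r2))
    return trace


def unfolded_counters(steps):
    """Model two independent counters and interleave their outputs."""
    a = b = 0
    outputs = []
    for _ in range(steps):
        outputs.extend([a, b])   # one result from each counter per cycle
        a, b = a + 1, b + 1
    return outputs


trace = stacked_counter(6)
r2_stream = [r2 for _, r2 in trace]
interleaved = unfolded_counters(4)

print("cycle R1 R2")
for cycle, (r1, r2) in enumerate(trace):
    print(f"{cycle:5d} {r1:2d} {r2:2d}")
print("R2 stream  :", r2_stream)
print("interleaved:", interleaved)
```

The R2 stream of the original circuit needs seven cycles to produce what the two parallel counters deliver in four cycles.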
If we interleave the outputs of these two circuits, we obtain the same
sequence as for the original circuit.

What we just demonstrated is UNFOLDING: we have unfolded the top circuit into two circuits,
one which represents the even clock cycles, and another which represents the odd clock cycles.
In the process, we have improved the throughput of the overall solution. In the original
circuit, the critical path was never smaller than 10ns. In the unfolded circuit, the critical
path is still 10ns but there are twice as many results per cycle.

The question then is: can we SYSTEMATICALLY unfold digital circuits in order to improve
throughput? The answer is: yes.

5:40 Definitions

We are defining circuits with combinational modules U, V, ... which have a given critical path.
For example

   --> U --> V -->
      10ns  10ns

means: U is connected to V, and each has a critical path of 10ns,
therefore the overall critical path is 20ns.

We also have registers (delays) which can be inserted on every interconnection.

   --> U --R--> V -->
      10ns     10ns

means: U is connected to R which is connected to V.
The critical path of the overall design is 10ns.

We can now draw circuits and we allow loops as well. For example:

   +--> U --R--R--R--> V ---+
   |                        |
   +------------------------+

We can have multiple loops, such as:

   +---------------------------------------+
   |                                       |
   +--> |                                  |
        | U --R--R--R--> V ---+---> W -----+
   +--> |                     |
   |                          |
   +--------------------------+

We define the LOOP BOUND as the sum of the combinational delays divided by
the number of registers for a given loop.

Assuming U, V have 10ns delay, then the loop bound is:

   +--> U --R--R--R--> V ---+
   |                        |
   +------------------------+

   Loop Bound: 20 / 3 = 6.66ns

In the following circuit there are two loop bounds:

   +---------------------------------------+
   |                                       |
   +--> |                                  |
        | U --R--R--R--> V ---+---> W -----+
   +--> |                     |
   |                          |
   +--------------------------+

Assuming U, V, W are 10ns then

   Loop Bound 1 (U, V):    20 / 3 = 6.66ns
   Loop Bound 2 (U, V, W): 30 / 3 = 10ns

We can also define the ITERATION BOUND.
   ITERATION BOUND = MAX(LOOP BOUNDS)

For the above two circuits, the ITERATION BOUND for the first is 6.66ns
while for the second it is 10ns.

What is the meaning of the iteration bound? The iteration bound is a lower bound on the
time per output, so it expresses the optimum achievable throughput of the circuit.
Because of combinational delays, we may not be able to achieve it.
For example, the following circuit as drawn has a critical path of 20ns (V feeds U
without a register in between), while the iteration bound is 6.66ns:

   +--> U --R--R--R--> V ---+
   |                        |
   +------------------------+

However, through UNFOLDING, it is possible to build circuits for which the time per
output implied by the critical path approaches the iteration bound.

The COST of unfolding is that we have to duplicate combinational logic.
Thus instead of a single logic U, we will introduce multiple of them: U0, U1, U2, ...

For a J-fold unfolding, we have U -> U0, U1, ... UJ-1

5:50 Systematic rules for unfolding

=============================================================
When unfolding J times:

1/ duplicate each module J times
      U -> U0, U1, U2, ...

2/ For each connection with w registers:
      U --R--> V
   make J connections (i = 0, 1, ..., J-1)
   from U(i) to V((i+w)%J) with floor((i+w)/J) delays
=============================================================

Example.
Let's unfold this circuit three times:

   +--> U --R--R--R--> V ---+
   |                        |
   +------------------------+

   U0 goes to V((0+3)%3) = V0 and #delays = floor((0 + 3)/3) = 1
   U1 goes to V((1+3)%3) = V1 and #delays = floor((1 + 3)/3) = 1
   U2 goes to V((2+3)%3) = V2 and #delays = floor((2 + 3)/3) = 1
   V0, V1, V2 go to U0, U1, U2 (no delays)

   +---------------------+
   |                     |
   +--> U0 -- R --> V0 --+

   +---------------------+
   |                     |
   +--> U1 -- R --> V1 --+

   +---------------------+
   |                     |
   +--> U2 -- R --> V2 --+

The circuit still has a critical path of 20ns, but the overall design produces three outputs
every 20ns, or one output every 6.66ns (the iteration bound).

We could also unfold the circuit only two times:

   +--> U --R--R--R--> V ---+
   |                        |
   +------------------------+

   U0 goes to V((0+3)%2) = V1 and #delays = floor((0 + 3)/2) = 1
   U1 goes to V((1+3)%2) = V0 and #delays = floor((1 + 3)/2) = 2
   V0 goes to U0 (no delays)
   V1 goes to U1 (no delays)

   +------------------------------+
   |                              |
   +--> U0 ---\   /--R--R--> V0 --+
               \ /
                X
               / \
   +--> U1 ---/   \-----R--> V1 --+
   |                              |
   +------------------------------+

In this case the critical path is still 20ns. The circuit produces two outputs every 20ns.
This is better than before (one output every 20ns) but not as good as the iteration bound.

There are no automatic tools for unfolding. This is something you would do using Verilog
transformations. For example, when dealing with ultra-high-speed input/output, a
serial-to-parallel conversion unit is used to make the processing feasible. We then use
unfolding to convert an algorithm that processes one input at a time into an algorithm that
processes N inputs at a time. Unfolding will increase the area cost.

   high speed
   serial ------+                           +---->
                |                           |
                +--->                  >----+
                +--->                  >----+
                +--->     unfolded     >----+
                +--->     circuit      >----+
                +--->                  >----+
                +--->                  >----+
              parallel               parallel

6:00 Summary of area-time trade-off

   Area
    |
    |              Resource sharing
    |                     |
    |       (Verilog)-- designer --(tools)
    |           |                     |
    |       Unfolding        Pipelining, Retiming
    |
    +------------------------------------ Delay
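Returning to the systematic unfolding rules from the 5:50 section: they are mechanical enough to sketch in a few lines of Python. This is an illustrative sketch, not a tool. The circuit is assumed to be given as a list of (source, destination, w) connections, and the names unfold and iteration_bound are made up for this example.

```python
def unfold(edges, J):
    """Apply the J-fold unfolding rule to a list of (src, dst, w) connections.

    Each connection with w registers becomes J connections, from src(i)
    to dst((i + w) % J) with floor((i + w) / J) registers.
    """
    unfolded = []
    for src, dst, w in edges:
        for i in range(J):
            unfolded.append((f"{src}{i}", f"{dst}{(i + w) % J}", (i + w) // J))
    return unfolded


def iteration_bound(loops):
    """loops: list of (total combinational delay in ns, number of registers)."""
    return max(delay / regs for delay, regs in loops)


# The single-loop example: U --R--R--R--> V, and V --> U without registers.
loop = [("U", "V", 3), ("V", "U", 0)]

three_fold = unfold(loop, 3)
two_fold = unfold(loop, 2)

for name, result in (("J=3", three_fold), ("J=2", two_fold)):
    print(name)
    for src, dst, delays in result:
        print(f"  {src} -> {dst} with {delays} delay(s)")

# U and V are assumed to be 10ns each, as in the notes.
print("iteration bound (ns):", iteration_bound([(20, 3)]))
```

In both unfoldings the total number of registers stays at three, the same as in the original loop. Only their distribution over the copies changes.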