Lecture 17 - 4/2/2019 - Pipelining and Retiming

1. Basics
      Latency
      Throughput
      Critical Path
2. Pipelined Adder
   Pipelined Multiplier (Wallace Tree Multiplier)
3. Retiming
   Rules for Retiming
4. Retiming in Quartus

--------------------------------------------------------

5:00PM Pipelining and Retiming

A pipeline partitions a long combinational computation into multiple small
stages of combinational logic. Each stage is separated from the next one by a
pipeline register.

Instead of computing

    output = comb_logic(input)

we compute (example of a three-stage pipeline):

    pipe1  = comb_logic_1(input);    // pipeline stage 1
    pipe2  = comb_logic_2(pipe1);    // pipeline stage 2
    output = comb_logic_3(pipe2);    // pipeline stage 3

such that comb_logic ~ comb_logic_3(comb_logic_2(comb_logic_1))

A pipelined circuit has the following properties, compared to the original
circuit.

1/ Increased latency.
      Original circuit  = 1 cycle (a combinational circuit, but we need to
                          allocate at least one clock cycle so that it can
                          complete)
      Pipelined circuit = N cycles (for N pipeline stages)

2/ Decreased critical path.
      Original circuit  = K
      Pipelined circuit = between K/N and K

   In a perfectly balanced pipeline, the delay of each pipeline stage is
   identical. This is the optimal case, since it results in the shortest
   possible critical path, K/N. In a real pipeline, the delays are not
   balanced over the pipeline stages. In the worst case, one pipeline stage
   dominates all others, with a combinational delay close to that of the
   original, non-pipelined circuit.

3/ Increased throughput.
      In the original circuit, we get one result per clock cycle.
      In the pipelined circuit, we also get one result per clock cycle.
      However, ideally, the clock frequency of the pipeline is N times higher
      than the clock frequency of the original circuit. Therefore, the
      throughput is N times higher.

The operation of a pipeline is easy to express in terms of clock cycles and
pipeline stages:

               Cycle1   Cycle2   Cycle3
             ---------------------------
    stage-1      X
    stage-2               X
    stage-3                        X

This structure is called a RESERVATION TABLE; it describes how the pipeline
stages are utilized. A reservation table helps to illustrate how the pipeline
operates:

               Cycle1   Cycle2   Cycle3   Cycle4   Cycle5   Cycle6
             ------------------------------------------------------
    stage-1      1        2        3        4        ..
    stage-2               1        2        3        4        ..
    stage-3                        1        2        3        4

In a LINEAR pipeline, each pipeline stage is used exactly once for every data
item that is introduced into the pipeline. In a NONLINEAR pipeline, some
pipeline stages are REUSED during processing. Nonlinear pipelines have a lower
throughput than linear pipelines. If we call G the maximum number of times a
pipeline stage is reused, then the throughput of the nonlinear pipeline will
be 1/G of the throughput of the linear pipeline.

Example of a nonlinear pipeline with G = 2:

               Cycle1   Cycle2   Cycle3   Cycle4
             ------------------------------------
    stage-1      X
    stage-2               X                  X
    stage-3                        X

In this example, stage-2 is used during Cycle 2 and Cycle 4.

5:20 Pipelined Adders

How do we design circuits with pipelining? We will start with two datapath
components - adders and multipliers - and study how to apply pipelining to
their implementation.

Consider the basic addition algorithm.

    Operand A     A2 A1 A0
    Operand B     B2 B1 B0
    Result C      C2 C1 C0

Consider a full-adder cell

    (Cq, Q) = FA(A, B, Ci)

Then

    (Cq0, C0) = FA(A0, B0, 0)
    (Cq1, C1) = FA(A1, B1, Cq0)
    (Cq2, C2) = FA(A2, B2, Cq1)

This circuit is slow - it has a critical path delay of three full-adder cells.
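As a point of reference, here is a minimal structural Verilog sketch of this
un-pipelined ripple-carry adder. The module names fa and rca3 are invented for
illustration; the lecture only gives the equations above.

    // Full-adder cell: (Cq, Q) = FA(A, B, Ci)
    module fa(input A, input B, input Ci, output Q, output Cq);
        assign Q  = A ^ B ^ Ci;                     // sum bit
        assign Cq = (A & B) | (A & Ci) | (B & Ci);  // carry out
    endmodule

    // 3-bit ripple-carry adder built from three FA cells.
    module rca3(input [2:0] A, input [2:0] B, output [2:0] C, output Cq2);
        wire Cq0, Cq1;
        // The carry chain Cq0 -> Cq1 -> Cq2 is the critical path:
        // three full-adder delays from (A0, B0) to (Cq2, C2).
        fa fa0 (.A(A[0]), .B(B[0]), .Ci(1'b0), .Q(C[0]), .Cq(Cq0));
        fa fa1 (.A(A[1]), .B(B[1]), .Ci(Cq0),  .Q(C[1]), .Cq(Cq1));
        fa fa2 (.A(A[2]), .B(B[2]), .Ci(Cq1),  .Q(C[2]), .Cq(Cq2));
    endmodule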
Namely, there is a dependency from the first FA to the second (through Cq0)
and from the second FA to the third (through Cq1). Therefore, even though the
combinational circuit completes all three bit positions within a single clock
cycle, the full-adder cells still evaluate sequentially, one after the other,
as the carry ripples through the chain. Can we use pipelining to make this
faster? Yes! We can design a structure so that there is only one FA between
any two registers. However, you have to do the pipelining taking ALL data
dependencies into account.
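As an illustration, here is a behavioral Verilog sketch of such a pipelined
3-bit adder: three stages, with one FA's worth of logic per stage, and delay
registers that keep the input and output bits aligned. All names (add3_pipe,
a1_d, c0_d2, ...) are invented for illustration; the diagram that follows
shows the same structure graphically.

    module add3_pipe(
        input            clk,
        input      [2:0] A, B,
        output reg [2:0] C,
        output reg       Cq2
    );
        // stage-1 registers
        reg a1_d, b1_d, a2_d1, b2_d1;   // delayed input bits
        reg c0_d1, cq0_r;
        // stage-2 registers
        reg a2_d2, b2_d2;
        reg c0_d2, c1_d1, cq1_r;

        always @(posedge clk) begin
            // Stage 1: add bit 0; delay the higher-order input bits.
            {cq0_r, c0_d1} <= A[0] + B[0];
            a1_d  <= A[1];  b1_d  <= B[1];
            a2_d1 <= A[2];  b2_d1 <= B[2];

            // Stage 2: add bit 1 with the registered carry; keep delaying.
            {cq1_r, c1_d1} <= a1_d + b1_d + cq0_r;
            a2_d2 <= a2_d1; b2_d2 <= b2_d1;
            c0_d2 <= c0_d1;

            // Stage 3: add bit 2; align all result bits at the output.
            {Cq2, C[2]} <= a2_d2 + b2_d2 + cq1_r;
            C[1] <= c1_d1;
            C[0] <= c0_d2;
        end
    endmodule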
In circuit form, the pipelined adder looks like this (R denotes a pipeline
register; the dotted lines mark the pipeline register cuts):

      A2 B2      A1 B1      A0 B0
       |  |       |  |       |  |
       |  |       |  |      [FA0]----+ C0
       |  |       |  |    Cq0 |      |
    ...R..R.......R..R........R......R...    <- pipeline register cut 1
       |  |       |  |        |      |
       |  |      [FA1]--------+      |
       |  |    Cq1|  |C1             |
    ...R..R.......R..R...............R...    <- pipeline register cut 2
       |  |       |  |               |
      [FA2]-------+  |               |
       |  |          |               |
     Cq2  C2         C1              C0

Note how the pipeline registers cut multiple wires. All of these wires need to
be stored in pipeline registers. In addition, this structure can also be used
to build high-speed, long-wordlength additions out of smaller adders. E.g.,
with 8 pipelined 16-bit adder stages, you can create a 128-bit adder with the
speed (throughput) of a 16-bit adder, but with 8 times higher latency.

5:20 Pipelined Multiplier

Similar to an adder, we can pipeline a multiplier. This will, however, reveal
a challenge with our approach to pipelining: the exorbitant cost of storing
intermediate results in pipeline registers.

Consider

    Operand A     A2 A1 A0
    Operand B     B2 B1 B0

and we compute A x B:

                      A2    A1    A0
                x     B2    B1    B0
               ----------------------
                    B0A2  B0A1  B0A0
              B1A2  B1A1  B1A0
        B2A2  B2A1  B2A0
    ----------------------------------
          L4    L3    L2    L1    L0

The partial products themselves can all be computed in parallel. However, the
partial products then have to be accumulated in 'lanes':

    L0 = B0A0                        - does not generate a carry (A0 and B0 are bits)
    L1 = B0A1 + B1A0                 - may generate carry cy1
    L2 = B0A2 + B1A1 + B2A0 + cy1    - may generate carry cy2
    L3 = B1A2 + B2A1 + cy2           - may generate carry cy3
    L4 = B2A2 + cy3                  - may generate carry cy4

Again, there is a dependency through the carry chain. Moreover, if we assume
that we have only a full-adder cell available, then we will again need
multiple clock cycles:
    Cycle 0    (cy1,  C1) = B0A1 + B1A0
    Cycle 1    (cy2a, t0) = B0A2 + B1A1 + cy1
    Cycle 2    (cy2b, C2) = t0 + B2A0
               (cy3a, t1) = B1A2 + B2A1 + cy2a
    Cycle 3    (cy3b, C3) = cy2b + t1
               (cy4a, t2) = B2A2 + cy3a
    Cycle 4    (cy4b, C4) = cy3b + t2
    Cycle 5    (_,    C5) = cy4a + cy4b

Unfortunately, this formulation creates a lot of intermediate results that
have to be carried along in the pipeline registers (see figure).

A better solution is to consider the addition of partial products per lane.
Consider again the 'lanes' of additions:

    L0 = B0A0                        - does not generate a carry (A0 and B0 are bits)
    L1 = B0A1 + B1A0                 - may generate carry cy1
    L2 = B0A2 + B1A1 + B2A0 + cy1    - may generate carry cy2
    L3 = B1A2 + B2A1 + cy2           - may generate carry cy3
    L4 = B2A2 + cy3                  - may generate carry cy4

Then the carry-independent parts of L1, L2 and L3 can all be done in parallel:

    Cycle 0    (cy1, C1) = B0A1 + B1A0
               (cy2, v2) = B0A2 + B1A1 + B2A0
               (cy3, v3) = B1A2 + B2A1

This structure is called a Wallace tree, named after its inventor, Wallace
(1964). The final accumulation of the partial products is still sequential,
and still uses a ripple-carry adder.

5:40 Retiming

The question then arises - can we compute the distribution of pipeline
registers automatically, so as to balance the stages of a pipeline?

    Logic1 -->|pipe|--> Logic2 -->|pipe|--> Logic3

If the critical paths of Logic1, Logic2 and Logic3 are very different, what
can we do? The solution is 'retiming' - the redistribution of pipeline
registers. Retiming consists of a set of netlist transformations that are
based on simple equivalence rules.

Rule 1: A register can be moved across a logic circuit which has no loops,
        without changing the functionality of the resulting design.

    --> register --> logic -->    ==    --> logic --> register -->

Rule 2: A register can be moved over a fork by duplicating it.

    --> register ----+----> logic1            -------+---> register --> logic1
                     |                  ==           |
                     +----> logic2                   +---> register --> logic2

Rule 3: A register can be merged into a fork (the reverse of Rule 2).

    -------+---> register --> logic1          --> register ----+----> logic1
           |                            ==                     |
           +---> register --> logic2                           +----> logic2

That's it! Let's try to retime a few sample circuits. Try to reduce the
critical path in each case.

EXAMPLE 1:

    --> R  R ---> logic1 ---> logic2 ---> output

    Before:  critical path = path(logic1) + path(logic2)

    Retimed result:

    --> R ---> logic1 --- R ---> logic2 ---> output

    After:   critical path = max(path(logic1), path(logic2))

EXAMPLE 2:

    --> R --> |
              | logic1 --------> |
    --> R --> |                  | logic2 --> output
                                 |
    --> R ---------------------> |

    Before:  critical path = path(logic1) + path(logic2)

    Retimed result:

    --------> |
              | logic1 --- R --> |
    --------> |                  | logic2 --> output
                                 |
    ----------------- R -------> |

    After:   critical path = max(path(logic1), path(logic2))

EXAMPLE 3:

    --> R --> |
              | logic1 --------> logic2 ---+
    +-> R --> |                            |
    |                                      |
    +--------------------------------------+

    Before:  critical path = path(logic1) + path(logic2)

    Retimed result:

    --------> |
              | logic1 --- R --> logic2 ---+
    +-------> |                            |
    |                                      |
    +--------------------------------------+

    After:   critical path = path(logic1) + path(logic2)

EXAMPLE 4:

    --> R --> |
              | logic1 --- R --> logic2 ---+
    +-------> |                            |
    |                                      |
    +--------------------------------------+

    Before:  critical path = path(logic1) + path(logic2)

    Retimed result (no register move reduces the critical path; the circuit
    stays as it is):

    --> R --> |
              | logic1 --- R --> logic2 ---+
    +-------> |                            |
    |                                      |
    +--------------------------------------+

    After:   critical path = path(logic1) + path(logic2)

This demonstrates that feedback loops cannot be retimed away: the number of
delays (registers) found in a feedback loop must stay constant, so the
combinational delay around the loop cannot be reduced!
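To make the register-moving rules concrete, here is a small Verilog sketch of
EXAMPLE 1, assuming logic1 and logic2 are simple 8-bit operations chosen only
for illustration (an add and a multiply); the module and signal names are
invented. Both modules produce the same output sequence, but the retimed one
has the shorter register-to-register path max(path(logic1), path(logic2)).

    module example1_before(input clk, input [7:0] x, output reg [7:0] out);
        reg [7:0] r1, r2;
        always @(posedge clk) begin
            r1  <= x;
            r2  <= r1;                    // two registers back to back
            out <= (r2 + 8'd3) * 8'd5;    // logic1 (+3) and logic2 (*5) in one path
        end
    endmodule

    module example1_retimed(input clk, input [7:0] x, output reg [7:0] out);
        reg [7:0] r1, r2;
        always @(posedge clk) begin
            r1  <= x;
            r2  <= r1 + 8'd3;             // logic1 before the moved register
            out <= r2 * 8'd5;             // logic2 after the moved register
        end
    endmodule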
6:00 Retiming in Quartus

Discuss example circuit:

    `define TAP 16     // number of filter taps (assumed; acc[15] below implies 16)

    module fir(
        input         CLOCK_50,
        input  [3:0]  KEY,
        output [9:0]  LEDR,
        input  [9:0]  SW
    );

        reg [9:0] sysin, sysin_next, sysout, sysout_next;
        reg [9:0] tap[`TAP-1:0], tap_next[`TAP-1:0];
        reg [9:0] acc[`TAP-1:0];
        integer i;

        always @(posedge CLOCK_50) begin
            sysin  <= KEY[0] ? sysin_next  : 10'b0;
            sysout <= KEY[0] ? sysout_next : 10'b0;
            for (i = 0; i < `TAP; i = i + 1)
                tap[i] <= KEY[0] ? tap_next[i] : 10'b0;
        end

        always @(*) begin
            sysin_next  = SW[9:0];
            tap_next[0] = sysin;
            for (i = 1; i < `TAP; i = i + 1)
                tap_next[i] = tap[i-1];
            acc[0] = tap[0];
            for (i = 1; i < `TAP; i = i + 1)
                acc[i] = acc[i-1] + tap[i];
            sysout_next = acc[15];
        end

        assign LEDR = sysout;

    endmodule

Demonstrate:
    - Synthesis results
    - Use of retiming options and their effect on the critical path
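One way to give a retiming pass something to work with in this design (a
sketch, not part of the lecture handout) is to add an extra register in series
at the output, inside the fir module above. By itself this only adds one cycle
of latency, but it leaves a register that retiming can move backward into the
long acc[] adder chain (Rule 1 applied in reverse), splitting the chain and
shortening the critical path. The name sysout2 is invented for illustration.

    // Extra output register, declared and clocked inside the fir module above.
    reg [9:0] sysout2;

    always @(posedge CLOCK_50) begin
        sysout2 <= sysout;          // one more register for retiming to move
    end

    // Drive the LEDs from the re-registered value instead of sysout:
    // assign LEDR = sysout2;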