Lecture 13: Area/Time Trade Off

5:00 Introduction

Practical issues for the coming few lectures:
- Today I will exceptionally finish class at 5:45PM. Additional material
  will be available online (video).
- Next week (3/26, 3/28) I am traveling. I will post the lectures online;
  they will include tool demonstrations.

Laying out the topics going forward:

So far, we concentrated on creating functionally correct circuits in Verilog.
In the next few lectures, we will focus on optimization and trade-offs,
in addition to further deepening our expertise in hardware design:

Optimization:
- L13: Area/Time tradeoffs
- L17: Pipelining and Retiming
- L18: Unfolding

Further into hardware design:
- L14: Hardware synthesis of Memories
- L15: Timing Simulation
- L16: Working with IP modules

5:05 Area/Time tradeoff

The product of the area of a circuit and its delay (latency) is the
AREA-DELAY product. It is a popular complexity metric to study the
quality of a digital circuit.

Say you have a design with 400 gates, and it has a latency of 20 cycles
to compute an output from an input. Then that design has an area-delay
product of 400*20 = 8000 gate-cycles.

The area-delay product represents a cost metric. It expresses how well the
hardware is used to compute a result. The area-delay product can be used to
compare one design to another one. In general, a lower area-delay product
is better.

When you OPTIMIZE a design, you try to reduce the area-delay product.

When you evaluate a TRADE-OFF of a design, you try to exchange a better
area for a worse delay, or vice versa. Hence, a trade-off does not have
to change the area-delay product.

Then, how do we compare the area-delay product of such designs?
We need to capture these designs in a single design space, the area-delay
design space:

 | Area
 |
 |     X Design 1
 |
 |         X Design 2
 |
 |                        X Design 3
 |
 |                               Delay
 +----------------------------------

For these three designs:

   Area (Design 1)  > Area (Design 2)  > Area (Design 3)
   Delay (Design 1) < Delay (Design 2) < Delay (Design 3)

It is common to introduce design constraints when comparing designs.
Such as:

   C1: Your design must be smaller than area A1
   C2: Your design must be faster than delay D1

Once the constraints are set, we can then select which design to pick.

For the constraint C1 (area), we must find a design such that
Area (Design) < A1. Note that it makes no sense to pick the smallest
design. As soon as we have a design which meets the constraint, we are done.

For the constraint C2 (delay), we must find a design such that
Delay (Design) < D1. Again, it makes no sense to pick the fastest design.
One that is a little bit faster than D1 is OK.

 | Area                 C2
 |                      |
 |     X Design 1       |
 +----------------------+---------- C1
 |         X Design 2   |
 |                      | X Design 3
 |                      |
 |                      |        Delay
 +----------------------+------------

Design 1 and Design 2 meet C2
Design 2 and Design 3 meet C1

Design 2 is most desirable to meet C1
   (closest to A1 but still meeting the constraint)
Design 2 is most desirable to meet C2
   (closest to D1 but still meeting the constraint)
Design 2 is also the only design that can meet BOTH C1 and C2.

Now, consider suboptimal designs, as shown by the following examples:

 | Area
 |               X Design 4
 |  X Design 1
 |      X Design 2                  X Design 5
 |                    X Design 3
 |
 |                                      Delay
 +------------------------------------------

Design 4 is a suboptimal design, even though it is faster than Design 3.
The reason is that there is a design, Design 2, which is BOTH faster AND
smaller than Design 4. Hence, if we would pick Design 4 under any
constraint, then Design 2 would be a better choice. Design 2 avoids making
a trade-off that would be made by Design 4.

Similarly, Design 5 is a suboptimal design, even though it is smaller than
Design 1.
The reason is that there is a design, Design 3, that is BOTH faster AND
smaller than Design 5. Hence, Design 3 is always preferable over Design 5.

You can identify optimal designs by imagining each design to be the corner
of an infinite square that extends up and to the right, and evaluating
which other designs are 'outshadowed' by the square. This demonstrates that
Design 1, Design 2 and Design 3 are optimal, and Design 4 and Design 5 are
outshadowed.

 | Area
 |   |
 |   |          X Design 4
 |   X-------+
 |   1       |
 |           X-----------+          X Design 5
 |           2           |
 |                       X------------------
 |                       3
 |                                     Delay
 +------------------------------------------

5:20 Area/Time tradeoff through resource sharing

The most common method to generate design variants with a different
area-delay product is to create implementations with different resource
sharing.

Assume the following design, made using a combinational logic module
module_f1:

DESIGN 1:

  --> module_f1 --> register --> module_f1 --> register -->

If module_f1 is large, we can try to resource-share it as follows:

DESIGN 2:

  --> MUX --> module_f1 --> register ---+--->
       ^                                |
       +--------------------------------+

*** AREA

The area of the first design is:
   2 x Area(module_f1) + 2 x Area(register)

The area of the second design is:
   Area(MUX) + Area(module_f1) + Area(register)

The second design will be smaller if:
   Area(MUX) < Area(module_f1) + Area(register)

In practical cases, the second design will be smaller.
However! This assumes that the controller that drives the MUX will be
simple. That is not always true.

*** LATENCY

Both designs have a latency of two cycles.

*** THROUGHPUT

The first design accepts one new input per cycle.
The second design accepts one new input per two cycles.

In practical cases, the first design will be faster (higher throughput).

To express the ability of a hardware design to do resource sharing, we use
the HARDWARE SHARING FACTOR (HSF).
If you have a design
   that reads inputs at a rate fin
   and that produces outputs at a rate fout
   and that is working at a clock frequency fclk
then the hardware sharing factor is defined as:

                fclk
   HSF = -----------------
          max(fin, fout)

The HSF expresses how many clock cycles we have available to produce each
output or consume each input. A low hardware sharing factor means that
hardware resources are not shared over clock cycles.

In the above example, HSF(DESIGN 1) = 1 and HSF(DESIGN 2) = 2.

The HSF is an indicator for the architectural style that a hardware design
has to use:

  HSF = 1         Non-shared, maximum-throughput hardware
  HSF = 10        Lowly multiplexed hardware, uses FSM-based control
  HSF = 100       Medium complexity multiplexed hardware, uses more
                  sophisticated control such as a small microcontroller
                  or a sequencer
  HSF = 1000      Highly multiplexed hardware, requires sophisticated
                  microcoded control
  HSF = infinite  Microprocessor! Requires a compiler to develop
                  control (=software!)

5:30 Introducing the multiplier example

To get a feel for area-time trade-offs when writing Verilog, we will study
the example of a 16*16 multiplier with a 32-bit output.

Last week, we mentioned that the FPGA has dedicated DSP resources with
hard-macro multipliers. For this example, however, we will force the tools
to use only the 6-input lookup tables available in the Cyclone-V fabric
(ALMs).
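Before turning to the multiplier versions, the HSF definition given earlier in this section can be checked numerically. The following is a minimal sketch in Python and is not part of the lecture files. The function name hardware_sharing_factor and the 100 MHz clock are assumptions chosen for illustration; only the ratio between the clock rate and the I/O rate matters.

```python
def hardware_sharing_factor(fclk, fin, fout):
    """Clock cycles available per input consumed or output produced."""
    return fclk / max(fin, fout)


# Hypothetical clock frequency (assumption, not from the lecture).
FCLK = 100e6

# DESIGN 1 accepts one new input per cycle.
hsf_design1 = hardware_sharing_factor(FCLK, fin=FCLK, fout=FCLK)

# DESIGN 2 accepts one new input per two cycles.
hsf_design2 = hardware_sharing_factor(FCLK, fin=FCLK / 2, fout=FCLK / 2)

print(hsf_design1, hsf_design2)
```

The two values reproduce HSF(DESIGN 1) = 1 and HSF(DESIGN 2) = 2 from the resource-sharing example.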
We will discuss 4 different versions of the multiplier:

  multstyle.v    toplevel, for synthesis
  mymult1.v      first version, nonmultiplexed
  mymult2.v      second version, nonmultiplexed
  mymult3.v      third version, 16-time multiplexed
  mymult4.v      fourth version, 8-time multiplexed
  mymulttb.v     testbench for any version

To simulate these designs you would take the following steps:

1/ Select the version to use (mymult1 to mymult4) and add it to
   multstyle.v and mymulttb.v

2/ Compile the Verilog for simulation

     vlib work
     vlog *.v

3/ Simulate

     vsim -c mymulttb -do "run 10000ns; exit"

4/ Synthesize

     quartus_sh --flow compile multstyle

If you run the synthesis in Quartus, you can easily extract the 10 most
critical timing paths (the ones with minimum slack).

We have to compare the four designs from the following points of view:
- number of cycles needed to complete a multiplication
- minimum clock period of the design
  (or maximum clock frequency, or minimum slack)
- number of ALMs and registers

The Delay will be: cycles x clock period
The Area will be:  #ALMs

We don't include the registers in the area because each ALM has logic as
well as registers. Obviously the area metric is an approximation.

5:40 Let's start with MULT1.V:

===============================================
module mymult1(input  wire        clk,
               input  wire        reset,
               input  wire [15:0] a,
               input  wire [15:0] b,
               input  wire        start,
               output wire [31:0] result,
               output wire        done);

   // force the multiplier into ALM logic instead of a DSP block
   (* multstyle = "logic" *) reg [31:0] r, next_r;
   reg        d, next_d;

   // result register and done flag
   always @(posedge clk)
     begin
        r <= (reset) ? 32'h0 : next_r;
        d <= (reset) ? 1'd0  : next_d;
     end

   // IMPLEMENTATION 1
   always @(*)
     begin
        next_r = {16'b0,a} * {16'b0,b};
        next_d = start;
     end

   assign done   = d;
   assign result = r;

endmodule
===============================================

This design reads in two arguments a and b, multiplies them and stores the
result in a register. We use the register to break the critical path.
Note how the multiplication is written:

   next_r = {16'b0,a} * {16'b0,b};

This ensures that we will get a 32-bit multiplier, not a 16-bit multiplier.

There is a start/done control, which is simple because this is a
single-cycle design.

Also, note how we tell the tools to map this design in ALMs, and not in a
hard-macro DSP multiplier:

   (* multstyle = "logic" *) reg [31:0] r, next_r;

(that comment is called a 'synthesis attribute')

See https://www.intel.com/content/www/us/en/programmable/quartushelp/17.0/index.htm#hdl/vlog/vlog_file_dir_multstyle.htm
Consult Quartus -> help (and then search for 'multstyle')

             DESIGN 1   DESIGN 2   DESIGN 3   DESIGN 4
--------------------------------------------------------
ALM               130
Registers          64
Cycles              1
Min CLKP        8.052
AREA              130
DELAY               8

5:50 Next, consider MULT2.v

In order to express the sharing, we need to rewrite

   next_r = {16'b0,a} * {16'b0,b};

because the multiplication is written as a single operation, which leaves
nothing to share.

We can do this by realizing how a multiplication is computed. Here is an
example of a 4-bit x 4-bit multiply:

         a          1 0 1 1
         b          1 1 0 1
   ----------------------------------
                    1 0 1 1     b[0] x a
                  0 0 0 0       b[1] x a
                1 0 1 1         b[2] x a
              1 0 1 1           b[3] x a
   -------------------------
            1 0 0 0 1 1 1 1

So we compute the partial products (a multiply of one bit of b with all the
bits of a), and next add up all of those to compute the result.

We can write this behavior directly in Verilog:

===============================================
module mymult2(input  wire        clk,
               input  wire        reset,
               input  wire [15:0] a,
               input  wire [15:0] b,
               input  wire        start,
               output wire [31:0] result,
               output wire        done);

   (* multstyle = "logic" *) reg [31:0] r, next_r;
   reg        d, next_d;

   always @(posedge clk)
     begin
        r <= (reset) ? 32'h0 : next_r;
        d <= (reset) ? 1'b0  : next_d;
     end

   // IMPLEMENTATION 2
   reg [31:0] tmp[15:0];   // running sums of the partial products
   reg [31:0] tmpa;        // a shifted left by j positions
   integer    j;

   always @(*)
     begin
        tmp[0] = b[0] ? {16'b0, a} : 32'b0;
        for (j = 1; j<16; j=j+1)
          begin
             tmpa   = {15'b0, a} << j;
             tmp[j] = tmp[j-1] + (b[j] ?
                                  tmpa : 32'b0);
          end
        next_r = tmp[15];
        next_d = start;
     end

   assign result = r;
   assign done   = d;

endmodule
===============================================

Let's look at it in more detail.

- tmp is an array of 32-bit running sums that accumulates the partial
  products.

- the partial products are created using tmpa, which is nothing more than
  a shifted version of a

- the partial products are given by:  b[j] ? tmpa : 32'b0

- the entire multiplication is written in a loop. Keep in mind that
  Verilog synthesis implements this loop as unrolled combinational logic,
  i.e. in a single clock cycle.

     tmp[0] = b[0] ? {16'b0, a} : 32'b0;
     for (j = 1; j<16; j=j+1)
       begin
          tmpa   = {15'b0, a} << j;
          tmp[j] = tmp[j-1] + (b[j] ? tmpa : 32'b0);
       end

  What gets synthesized is the UNROLLED version of this loop.

- Finally, to create a proper done signal, we create a delayed version of
  start. The latency of this multiplier is still one clock cycle.

             DESIGN 1   DESIGN 2   DESIGN 3   DESIGN 4
--------------------------------------------------------
ALM               130        159
Registers          64         64
Cycles              1          1
Min CLKP        8.052     10.047
AREA              130        159
DELAY               8         10

6:00 Next, consider MULT3.v

Rewriting the multiplier in MULT2.v also gives an idea how we can share the
accumulator: we should try to synthesize each loop iteration in a single
clock cycle. Then, the accumulator can be used for 16 clock cycles to
compute a result.

However, there are additional issues to pay attention to:

- the shifted version of a has to be created. We could do that using a
  shift register, each time shifting a up one position.

- different bits of b have to be indexed.
  We should do that as well using a shift register, each time shifting b
  down one position and looking at bit zero of the shftb register

- to load the shift registers, drive the multiplication, and extract the
  result, we introduce a state machine

===============================================
module mymult3(input  wire        clk,
               input  wire        reset,
               input  wire [15:0] a,
               input  wire [15:0] b,
               input  wire        start,
               output wire [31:0] result,
               output wire        done);

   (* multstyle = "logic" *) reg [31:0] r, next_r;
   reg        d, next_d;

   always @(posedge clk)
     begin
        r <= (reset) ? 32'h0 : next_r;
        d <= (reset) ? 1'b0  : next_d;
     end

   // IMPLEMENTATION 3
   reg [31:0] tmp, next_tmp;       // accumulator
   reg [31:0] tmpa, next_tmpa;     // a, shifted up one position per cycle
   reg [15:0] shftb, next_shftb;   // b, shifted down one position per cycle
   reg [1:0]  state, next_state;
   localparam S0 = 0, S1 = 1, S2 = 2;

   always @(posedge clk)
     begin
        tmp   <= (reset) ? 32'b0 : next_tmp;
        tmpa  <= (reset) ? 32'b0 : next_tmpa;
        shftb <= (reset) ? 16'b0 : next_shftb;
        state <= (reset) ? S0    : next_state;
     end

   always @(*)
     begin
        // defaults: hold all registers
        next_tmpa  = tmpa;
        next_shftb = shftb;
        next_tmp   = tmp;
        next_state = state;
        next_d     = 1'b0;
        next_r     = r;
        case(state)
          // S0: wait for start, load the shift registers
          S0: if (start)
                begin
                   next_tmpa  = {16'b0, a};
                   next_shftb = b;
                   next_tmp   = 32'b0;
                   next_state = S1;
                end
          // S1: accumulate one partial product per cycle
          S1: if (|shftb)
                begin
                   next_tmpa  = tmpa << 1;
                   next_shftb = shftb >> 1;
                   next_tmp   = tmp + (shftb[0] ? tmpa : 32'b0);
                end
              else
                next_state = S2;
          // S2: deliver the result
          S2: begin
                 next_d     = 1'b1;
                 next_r     = tmp;
                 next_state = S0;
              end
          default:
            next_state = S0;
        endcase
     end

   assign result = r;
   assign done   = d;

endmodule
===============================================

While this design looks more complicated, it is in fact smaller than the
previous multiplier, because it needs only one 32-bit adder, while MULT2.v
uses 15 32-bit adders (one per loop iteration).

A caveat with this design is that it has variable latency. It will iterate
in state S1 as long as there are non-zero bits in the b shift register:

     S1: if (|shftb)
           ...
         else
           next_state = S2;

In simulation, you can find that this design takes between 4 and 19 cycles
depending on the position of the msbit in b.
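The data-dependent iteration count of state S1 can be illustrated with a small behavioral model. This is a sketch in Python under stated assumptions, not the lecture's testbench: the function name shift_add_multiply is hypothetical, and the model counts only the accumulate iterations in S1, not the extra cycles spent in S0, in S2, and on leaving S1.

```python
def shift_add_multiply(a, b):
    """Behavioral model of the S1 loop in mymult3 (one partial product per pass).

    Returns (product, iterations), where iterations is the number of
    accumulate passes through S1. FSM overhead cycles are not modeled.
    """
    tmpa = a & 0xFFFF        # shifted copy of a (a 32-bit register in the Verilog)
    shftb = b & 0xFFFF       # remaining bits of b
    tmp = 0                  # accumulator
    iterations = 0
    while shftb != 0:        # corresponds to `if (|shftb)`
        if shftb & 1:        # corresponds to `shftb[0] ? tmpa : 32'b0`
            tmp = (tmp + tmpa) & 0xFFFFFFFF
        tmpa = (tmpa << 1) & 0xFFFFFFFF
        shftb >>= 1
        iterations += 1
    return tmp, iterations


# The 4-bit example from the MULT2 discussion: 1011 x 1101
print(shift_add_multiply(0b1011, 0b1101))
# Worst case: the most significant bit of b is set
print(shift_add_multiply(0xFFFF, 0x8000))
```

In this model the number of passes equals the position of the most significant set bit of b plus one, which is why the hardware latency depends on b.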
We will take the worst-case latency as the DELAY metric.

You can see that this design can run at a much higher clock frequency than
DESIGN 1 and DESIGN 2, because the critical path is much shorter.

             DESIGN 1   DESIGN 2   DESIGN 3   DESIGN 4
--------------------------------------------------------
ALM               130        159         67
Registers          64         64        147
Cycles              1          1    4 .. 19
Min CLKP        8.052     10.047      4.063
AREA              130        159         67
DELAY               8         10         77

6:10 Finally, let's look at MULT4.v, which is a trade-off design

We can now consider designs that accumulate multiple partial products per
clock cycle. In MULT4, we accumulate two partial products per clock cycle.
This design is a simple transformation from MULT3:

===============================================
module mymult4(input  wire        clk,
               input  wire        reset,
               input  wire [15:0] a,
               input  wire [15:0] b,
               input  wire        start,
               output wire [31:0] result,
               output wire        done);

   (* multstyle = "logic" *) reg [31:0] r, next_r;
   reg        d, next_d;

   always @(posedge clk)
     begin
        r <= (reset) ? 32'h0 : next_r;
        d <= (reset) ? 1'b0  : next_d;
     end

   // IMPLEMENTATION 4
   reg [31:0] tmp, next_tmp;       // accumulator
   reg [31:0] tmpa, next_tmpa;     // a, shifted up two positions per cycle
   reg [15:0] shftb, next_shftb;   // b, shifted down two positions per cycle
   reg [1:0]  state, next_state;
   localparam S0 = 0, S1 = 1, S2 = 2;

   always @(posedge clk)
     begin
        tmp   <= (reset) ? 32'b0 : next_tmp;
        tmpa  <= (reset) ? 32'b0 : next_tmpa;
        shftb <= (reset) ? 16'b0 : next_shftb;
        state <= (reset) ? S0    : next_state;
     end

   always @(*)
     begin
        // defaults: hold all registers
        next_tmpa  = tmpa;
        next_shftb = shftb;
        next_tmp   = tmp;
        next_state = state;
        next_d     = 1'b0;
        next_r     = r;
        case(state)
          S0: if (start)
                begin
                   next_tmpa  = {16'b0, a};
                   next_shftb = b;
                   next_tmp   = 32'b0;
                   next_state = S1;
                end
          // S1: accumulate two partial products per cycle
          S1: if (|shftb)
                begin
                   next_tmpa  = tmpa << 2;
                   next_shftb = shftb >> 2;
                   next_tmp   = tmp + (shftb[0] ? tmpa : 32'b0);
                   next_tmp   = next_tmp + (shftb[1] ?
                                            (tmpa << 1) : 32'b0);
                end
              else
                next_state = S2;
          S2: begin
                 next_d     = 1'b1;
                 next_r     = tmp;
                 next_state = S0;
              end
          default:
            next_state = S0;
        endcase
     end

   assign result = r;
   assign done   = d;

endmodule
===============================================

So now, the shift registers move by two bits per clock cycle, and there are
two accumulations per clock cycle:

     next_tmpa  = tmpa << 2;
     next_shftb = shftb >> 2;
     next_tmp   = tmp + (shftb[0] ? tmpa : 32'b0);
     next_tmp   = next_tmp + (shftb[1] ? (tmpa << 1) : 32'b0);

This design will be larger than DESIGN 3, and a little slower in terms of
max clock frequency.

             DESIGN 1   DESIGN 2   DESIGN 3   DESIGN 4
--------------------------------------------------------
ALM               130        159         67         81
Registers          64         64        147        147
Cycles              1          1    4 .. 19    4 .. 11
Min CLKP        8.052     10.047      4.063      5.961
AREA              130        159         67         81
DELAY               8         10         77         66

6:15 Now, let's compare these four designs:

 Area
  |
  |    2
  |
  |  1
  |
  |                             4
  |                                  3
  |
  +---------------------------------------- Delay
   (these positions are approximate)

You can see that:

- Design 2 never outperforms Design 1; compared to Design 1, Design 2 is
  suboptimal. Nevertheless, we did this transformation to enable the
  resource sharing that led to Design 3 and Design 4. So even though
  Design 2 is a 'throw-away' design, it was not useless.

- Design 1, 3, and 4 are each optimal within the design space considered.
  Depending on how you select the area and delay constraint, you may choose
  Design 1, 3 or 4 as the most appropriate solution.
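
The comparison above can be reproduced from the numbers in the final table. The following is a minimal sketch in Python and is not part of the lecture files. The dictionary layout and the helper names area, delay and dominates are assumptions for illustration; the ALM counts, worst-case cycle counts and minimum clock periods are copied from the table.

```python
# Values copied from the final results table (worst-case cycle counts).
designs = {
    "Design 1": {"alm": 130, "cycles": 1,  "min_clkp": 8.052},
    "Design 2": {"alm": 159, "cycles": 1,  "min_clkp": 10.047},
    "Design 3": {"alm": 67,  "cycles": 19, "min_clkp": 4.063},
    "Design 4": {"alm": 81,  "cycles": 11, "min_clkp": 5.961},
}


def area(d):
    """Area metric: number of ALMs."""
    return d["alm"]


def delay(d):
    """Delay metric: cycles x clock period."""
    return d["cycles"] * d["min_clkp"]


def dominates(p, q):
    """True if p is no worse than q in both metrics and better in at least one."""
    no_worse = area(p) <= area(q) and delay(p) <= delay(q)
    better = area(p) < area(q) or delay(p) < delay(q)
    return no_worse and better


# A design is optimal if no other design 'outshadows' (dominates) it.
optimal = [name for name, d in designs.items()
           if not any(dominates(other, d) for other in designs.values())]

area_delay = {name: area(d) * delay(d) for name, d in designs.items()}

print(optimal)
for name, product in area_delay.items():
    print(name, round(product, 1))
```

With these numbers, only Design 2 is dominated (by Design 1), which matches the conclusion that Designs 1, 3 and 4 are the candidates to choose from once the area and delay constraints are set.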