ECE 4514 - Performance - 2/21/19

Performance evaluation - understanding the speed of hardware

5:00 Performance factors in hardware

Outline
  1. Performance Factors
     - Latency
     - Throughput
     - Delay = Clock Period * Cycle Count
  2. The minimum clock period

There are two common definitions used to describe the performance
of a digital hardware design at the system level:

1/ Latency = the time it takes to compute an output starting from an input value
   The unit of latency is time (seconds)

2/ Throughput = the rate at which outputs are produced
   The unit of throughput is time^-1 (seconds^-1)

=============================================
PERFORMANCE = THROUGHPUT or 1/LATENCY

Depending on the application, you may need to optimize one or the other.
In both cases, we will use the term DELAY
=============================================

The delay of a design depends on the clock cycle count AND the clock period

   Latency = Cycles from input to output * Clock_period

   eg. 10 cycles at 50MHz clock
       Latency = 10 * 20ns = 200ns

   Throughput = 1 / (Cycles between outputs * Clock_period)

   eg. 5 cycles at 100MHz clock
       Throughput = 1 / (5 * 10ns) = 20MHz (20 million outputs per second)

When you design an FSMD in Verilog:
 - Cycle count is defined by the Verilog code
 - Clock period is determined by the synthesis tools

But the overall delay is the product of these two:

=============================================
DELAY = CYCLE COUNT * CLOCK PERIOD
=============================================

So we need to understand what determines the smallest CLOCK PERIOD
that a given design can use.

5:15 Minimum clock period

Assume a block of logic is connected between two registers R1 and R2.
So we have:

   R2 = LogicFunction(R1)

Then the smallest clock period at which R1 and R2 can be clocked
(which sets the fastest clock frequency, 1/Tclk,min) is given by

   Tclk,min = Tclk-Q + Tlogic + Trouting + Tsetup + Tskew

where
   Tclk-Q   = time needed for the output of R1 to adopt a new value
              after a clock edge
   Tlogic   = time needed for the logic to compute the output
   Trouting = time needed for the electrical signals to travel through
              the interconnect wires from R1 to Logic to R2
   Tsetup   = setup time for R2
   Tskew    = worst case variation between clock edges

5:20 Slack and Timing Violation

The difference between the actual clock period, Tclk, and the
minimum possible clock period, Tclk,min, is called the SLACK:

   SLACK = Tclk - Tclk,min

SLACK must not be negative. When the SLACK is negative, the system
experiences a TIMING VIOLATION: R2 will capture its input before the
next result is available at the output of the logic function.

For a given technology, Tclk-Q and Tsetup are fixed.
Tlogic and Trouting, on the other hand, are defined by the design.

5:25 Timing Analysis

- In an overall design, there are multiple paths:
   - between any two registers
   - between any input and a register
   - between a register and any output
   - between any input and any output (purely combinational)

- The Critical path of a design is the longest of these paths

- Example I: Multiplexer

   module mux(input wire d0,
              input wire d1,
              input wire c,
              output wire z);
     wire and1, and2, not1;
     assign z    = and1 | and2;
     assign and1 = d0 & not1;
     assign and2 = d1 & c;
     assign not1 = ~c;
   endmodule

   delay components: tand, tor, tnot

   path 1: delay from d0 to z
           t1 = tand + tor
   path 2: delay from d1 to z
           t2 = tand + tor
   path 3: delay from c to z (through the inverter)
           t3 = tnot + tand + tor

   overall delay: t = max(t1, t2, t3)

   Note that this is a pessimistic timing analysis, which ignores whether
   a path would actually occur, ie. whether a transition on an input
   would actually cause the output to change.

(5:35) Example II: Accumulator

   // note: we ignore reset in this example
   module accum(input wire i,
                input wire c,
                input wire clk,
                output wire q);
     reg r, rnext;
     always @(posedge clk)
       r <= rnext;
     always @(*)
       rnext = c ? i : r;
     assign q = r;   // drive the output from the register
   endmodule

   Note that the ? : operator expands to a mux

   delay components: tmux, tcq, tsetup
   with tmux   = propagation delay of the mux
        tcq    = clock to q delay
        tsetup = flip-flop setup time

   path 1: delay from i to input of register
           t1 = tmux + tsetup
   path 2: delay from c to input of register
           t2 = tmux + tsetup
   path 3: delay from clk to input of register
           t3 = tcq + tmux + tsetup

   overall delay: t = max(t1, t2, t3)

(5:45) Example III: FIR

   module fir(input wire clk,
              input wire [3:0] i,
              output wire [3:0] y);
     reg [3:0] tap1, tap1next;
     reg [3:0] tap2, tap2next;
     reg [3:0] c1, c2, c3;
     reg [3:0] a1, a2;

     always @(posedge clk) begin
       tap1 <= tap1next;
       tap2 <= tap2next;
     end

     always @(*) begin
       c1 = i * 5;
       c2 = tap1 * 3;
       c3 = tap2 * 2;
       a1 = c1 + c2;
       a2 = a1 + c3;
       tap1next = i;
       tap2next = tap1;
     end

     assign y = a2;
   endmodule

   delay components: tmul, tadd, tcq, tsetup
   with tmul   = propagation delay of a multiplier
        tadd   = propagation delay of an adder
        tcq    = clock to q delay
        tsetup = flip-flop setup time

   path 1: delay from i to register
           t1 = tsetup
   path 2: delay from i to y
           t2 = tmul + tadd + tadd
   path 3: delay from tap1 output to tap2 input
           t3 = tcq + tsetup
   path 4: delay from tap1 output to y
           t4 = tcq + tmul + tadd + tadd
   path 5: delay from tap2 output to y
           t5 = tcq + tmul + tadd

   overall delay: t = max(t1, t2, t3, t4, t5)

5:55 Timequest timing analyzer

Study the tonegen design in Timequest analyzer
 - Compile design
 - Open Timequest Timing Analyzer
 - Go to Custom Reports .. Report Timing
 - Create a report on 10 paths from CLOCK_50 to CLOCK_50

The slowest reported path is the following (edited for clarity):

** This is a summary. It describes the critical path between the From Node and the To Node.
** The critical path travels from dut1, which is i2cgenerator,
** to dut2, which is bitxmit (both inside i2cmasterwrite)
** The launch flop is an output of the shift register (shift[2])
** The latch flop is the state.PRE flop of the dut2 state register

+--------------------------------------------------------------------+
; Path Summary                                                       ;
+--------------------+-----------------------------------------------+
; Property           ; Value                                         ;
+--------------------+-----------------------------------------------+
; From Node          ; i2cmasterwrite:I2C|i2cgenerator:dut1|shift[2] ;
; To Node            ; i2cmasterwrite:I2C|bitxmit:dut2|state.PRE     ;
; Launch Clock       ; CLOCK_50                                      ;
; Latch Clock        ; CLOCK_50                                      ;
; Data Arrival Time  ; 8.656                                         ;
; Data Required Time ; 23.562                                        ;
; Slack              ; 14.906                                        ;
+--------------------+-----------------------------------------------+

+--------------------------------------------------------------------------+
; Data Arrival Path                                                        ;
+---------+---------+------+-----------------------------------------------+
; Total   ; Incr    ; Type ; Element                                       ;
+---------+---------+------+-----------------------------------------------+
** The data arrival path starts with the clock path from the clock pin
** (CLOCK_50) to the 'launch flop' (R1 in the example above)
; 0.000   ; 0.000   ;      ; launch edge time                              ;
; 3.799   ; 3.799   ;      ; clock path                                    ;
; 0.000   ; 0.000   ;      ; source latency                                ;
; 0.000   ; 0.000   ;      ; CLOCK_50                                      ;
; 0.000   ; 0.000   ; IC   ; CLOCK_50~input|i                              ;
; 0.661   ; 0.661   ; CELL ; CLOCK_50~input|o                              ;
; 1.008   ; 0.347   ; IC   ; CLOCK_50~inputCLKENA0|inclk                   ;
; 1.281   ; 0.273   ; CELL ; CLOCK_50~inputCLKENA0|outclk                  ;
; 3.346   ; 2.065   ; IC   ; I2C|dut1|shift[2]|clk                         ;
; 3.799   ; 0.453   ; CELL ; i2cmasterwrite:I2C|i2cgenerator:dut1|shift[2] ;
** The above is the clock network delay (interconnect and cell)
** from the CLOCK_50 pin to the launch flop
** Next follows the delay through the datapath
** Tcq for an output of the shift variable
; 8.656   ; 4.857   ;      ; data path                                     ;
; 3.799   ; 0.000   ; uTco ; i2cmasterwrite:I2C|i2cgenerator:dut1|shift[2] ;
; 3.799   ; 0.000   ; CELL ; I2C|dut1|shift[2]|q                           ;
** Used in a comparator
**    if (shift == 8'b0) begin
**       rcommand = `CMDWAIT;
; 5.313   ; 1.514   ; IC   ; I2C|dut1|Equal0~0|datad                       ;
; 5.667   ; 0.354   ; CELL ; I2C|dut1|Equal0~0|combout                     ;
; 6.210   ; 0.543   ; IC   ; I2C|dut1|Selector2~2|datae                    ;
; 6.471   ; 0.261   ; CELL ; I2C|dut1|Selector2~2|combout                  ;
** Assigned command, which is an input of DUT2
; 6.958   ; 0.487   ; IC   ; I2C|dut2|state~14|dataa                       ;
; 7.417   ; 0.459   ; CELL ; I2C|dut2|state~14|combout                     ;
; 7.946   ; 0.529   ; IC   ; I2C|dut2|state~19|datab                       ;
; 8.437   ; 0.491   ; CELL ; I2C|dut2|state~19|combout                     ;
** determines a write into the state.PRE flip flop.
** note that the synthesis tool has selected a one-hot encoding
** for this state machine, which means that every state has its own flop
**    if (command != `CMDWAIT)
**       statenext = PRE;
; 8.437   ; 0.000   ; IC   ; I2C|dut2|state.PRE|d                          ;
; 8.656   ; 0.219   ; CELL ; i2cmasterwrite:I2C|bitxmit:dut2|state.PRE     ;
+---------+---------+------+-----------------------------------------------+

** The data required path describes the path from the clock pin (CLOCK_50)
** to the 'latch flop' (R2 in the example above)

+-----------------------------------------------------------------------+
; Data Required Path                                                    ;
+----------+---------+------+-------------------------------------------+
; Total    ; Incr    ; Type ; Element                                   ;
+----------+---------+------+-------------------------------------------+
; 20.000   ; 20.000  ;      ; latch edge time                           ;
; 23.732   ; 3.732   ;      ; clock path                                ;
** It's basically routing delay from the clock pin to the state.PRE flop
; 20.000   ; 0.000   ;      ; source latency                            ;
; 20.000   ; 0.000   ;      ; CLOCK_50                                  ;
; 20.000   ; 0.000   ; IC   ; CLOCK_50~input|i                          ;
; 20.661   ; 0.661   ; CELL ; CLOCK_50~input|o                          ;
; 20.989   ; 0.328   ; IC   ; CLOCK_50~inputCLKENA0|inclk               ;
; 21.241   ; 0.252   ; CELL ; CLOCK_50~inputCLKENA0|outclk              ;
; 22.933   ; 1.692   ; IC   ; I2C|dut2|state.PRE|clk                    ;
; 23.340   ; 0.407   ; CELL ; i2cmasterwrite:I2C|bitxmit:dut2|state.PRE ;
; 23.732   ; 0.392   ;      ; clock pessimism removed                   ;
; 23.562   ; -0.170  ;      ; clock uncertainty                         ;
; 23.562   ; 0.000   ; uTsu ; i2cmasterwrite:I2C|bitxmit:dut2|state.PRE ;
+----------+---------+------+-------------------------------------------+

** The effective slack is the difference between the data required time
** and the data arrival time: 23.562 - 8.656 = 14.906

+--------------------------------------------------------------------+
; Path Summary                                                       ;
+--------------------+-----------------------------------------------+
; Property           ; Value                                         ;
+--------------------+-----------------------------------------------+
; From Node          ; i2cmasterwrite:I2C|i2cgenerator:dut1|shift[2] ;
; To Node            ; i2cmasterwrite:I2C|bitxmit:dut2|state.PRE     ;
; Launch Clock       ; CLOCK_50                                      ;
; Latch Clock        ; CLOCK_50                                      ;
; Data Arrival Time  ; 8.656                                         ;
; Data Required Time ; 23.562                                        ;
; Slack              ; 14.906                                        ;
+--------------------+-----------------------------------------------+

6:10 Summary

- We discussed the concept of timing analysis
- Main terms:
    latency
    throughput
    delay = cycles * clock period
    timing violation
    slack
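
Worked check of the main terms (not part of the lecture): the short Python sketch below recomputes the latency and throughput examples, the slack of the reported path, and the critical path of the multiplexer. The helper names are made up for this illustration, and the gate delays for tand, tor and tnot are assumed values, since the notes give none. The minimum clock period is obtained by rearranging SLACK = Tclk - Tclk,min, so it applies to this one reported path only.

```python
# Illustrative sketch only. Helper names and the gate delays used for the
# mux are assumptions for this example, not values from the notes or a tool.

def latency_ns(cycles, clock_period_ns):
    """Latency = cycles from input to output * clock period (in ns)."""
    return cycles * clock_period_ns

def throughput_mhz(cycles_between_outputs, clock_period_ns):
    """Throughput = 1 / (cycles between outputs * clock period), in MHz."""
    return 1000.0 / (cycles_between_outputs * clock_period_ns)

def slack_ns(data_required_ns, data_arrival_ns):
    """Slack as reported: data required time - data arrival time."""
    return data_required_ns - data_arrival_ns

def min_clock_period_ns(clock_period_ns, slack):
    """Tclk,min, rearranged from SLACK = Tclk - Tclk,min."""
    return clock_period_ns - slack

def critical_path(path_delays):
    """Return (name, delay) of the longest path in a dict of path delays."""
    name = max(path_delays, key=path_delays.get)
    return name, path_delays[name]

# Examples from the notes
print(latency_ns(10, 20.0))      # 10 cycles at 50MHz  -> 200.0 (ns)
print(throughput_mhz(5, 10.0))   # 5 cycles at 100MHz  -> 20.0 (MHz)

# Reported path: CLOCK_50 has a 20 ns period
slack = slack_ns(23.562, 8.656)
tclk_min = min_clock_period_ns(20.0, slack)
print(round(slack, 3), round(tclk_min, 3))

# Mux example with ASSUMED gate delays (ns)
tand, tor, tnot = 0.30, 0.25, 0.20
mux_paths = {
    "d0->z": tand + tor,
    "d1->z": tand + tor,
    "c->z": tnot + tand + tor,
}
print(critical_path(mux_paths))
```

With any positive inverter delay, the path from c to z is the critical path of the mux, because it is the only one that passes through three gates.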