Attention

This document was last updated Nov 25 24 at 21:59

Timing Analysis

Important

The purpose of this lecture is the following.

  • To define timing constraints of a synchronous digital circuit

  • To describe timing analysis of a synchronous digital circuit

  • To describe the delay modeling techniques used in hardware ASIC

  • To describe the impact of the environment on timing analysis

  • To describe the use of a static timing analysis tool

Attention

The following references are relevant background to this lecture.

Attention

The example discussed in this lecture is available on https://github.com/wpi-ece574-f24/ex-sta

Timing Requirements for Synchronous Circuit Operation

Correct timing in synchronous logic generally means that every flip-flop meets its basic timing requirements which include setup and hold timing (for synchronous inputs) and removal and recovery timing (for asynchronous inputs).

Setup, Hold, Recovery, Removal Time

The following figure shows an edge-triggered flip-flop with an asynchronous set input. The synchronous input D is read and stored on the synchronous output Q. Four clock cycles demonstrate the meaning of setup, hold, removal and recovery.

_images/violation.jpg
  • The setup time t_{setup} is the duration when a synchronous input must be stable before a clock edge. The hold time t_{hold} is the duration when a synchronous input must be stable after a clock edge. When a synchronous input changes during the setup/hold window, a timing violation occurs which may leave the flip-flop in an undefined state. For example, the input D changes just before the third clock edge, creating a timing violation. The flip-flop may remain at 0, or it may switch to 1, or it may switch to 1 after a long time (e.g. half a clock cycle). Physically, any of these cases are possible as the flip-flop is in a metastable state.

  • The recovery time t_{recovery} is the duration that an asynchronous that an asynchronous input must be de-asserted before a clock edge, in order to recover normal synchronous operation. The removal time t_
{removal} is the duration that an asynchronous input must remain asserted after a clock edge, in order to ensure the asynchronous command has effect. Just before the fourth clock edge, the asynchronous S input is asserted, which will force the Q output to take 1 (asserted). The S input remains asserted longer than t_{removal} after the clock edge, thus ensuring the flip-flop returns to a stable 1.

Stable Operation

Let’s first establish the conditions for a synchronous circuit has correct timing. We will assume only synchronous inputs in the following examples.

_images/clocktopo.jpg

A central clock is distributed to two flip-flops with a data path in between. The launch path is the route followed by the clock signal (potentially including clock buffers) up until the clock input of the launch flip-flop. The capture path is the route followed by the clock signal up until the clock input of the capture flip-flop. The flip-flops are characterized by setup time T_{setup}, hold time T_{hold} and clock-to-q time T_{cq}. The latter is the time needed for the synchronous output Q to be updated after a clock edge. The propagation delay for the datapath T_{logic} is the time needed for the data path output to be stable after its input is updated. Finally, the launch path and capture path may result in clock skew between the two flip-flops, meaning that the clock edge at each flip-flop does not arrive at exactly the same moment.

_images/clocktiming.jpg

We can now derive conditions towards the smallest possible clock period T that will ensure stable circuit operation.

Clock Period Lower Bound

The data arriving at the capture flip flop must be stable before the capture clock edge. The slack is the margin available between the data arrival and the setup boundary at the capture flip-flop.

T + T_{skew} = T_{clk-Q} + T_{dp} + T_{slack} + T_{setup}

We want T_{slack} to be positive, and therefore the following condition must hold.

T > T_{clk-Q} + T_{dp} + T_{setup} + T_{skew}

Therefore, the setup time, clk-Q time and skew all potentially reduce the time available for datapath computations. Note that in this formula, the sign of T_{skew} was not flipped when moving over from one side of the equation to the other. That is because skew relates to clock uncertainty. In the figure shown above, the skew from CLK1 to CLK2 actually helps with the minimum allowed clock period. CLK2 is delayed and therefore gives the datapath a little bit more time to settle. However, it’s a fallacy to think that this helps our cause; simply reversing the role of CLK2 and CLK1 in the launch path and the capture path will make the skew work against the datapath. So, in actual timing analysis, the skew is modeled as an uncertainty on the clock, and the uncertainty is counted as a retricting the result.

Important

The minimum clock period in a synchronous digital circuit is constrained by the setup time of the flip-flops, the clk-Q time of the flip-flops, the slowest datapath, and the clock skew between launch and capture flip-flop.

T_{CLK,min} > T_{skew} + T_{dp} + T_{clk-Q} + T_{setup}

Clock Skew Upper Bound

Clock skew has to be limited and ideally, for a properly implemented clock distribution circuit, it is as close as possible to zero. However, clock skew has an upperbound too. Data arriving at the capture flip-flop cannot change in the hold time window of the capture flop-flop.

T_{skew} + T_{hold} < T_{clk-Q} + T_{dp}

Therefore, positive skew cannot be larger than the time it takes for data to travel from the launch flip-flop through the datapath. If the skew condition is violated, than the capture flip-flop will grab data on the current clock edge rather than the next clock edge. In simulation, this appears as if the circuit skipping is a cycle, and the capture flip-flop has no effect.

T_{skew} < T_{clk-Q} + T_{dp} - T_{hold}

When skew is negative, of course, then this condition is easy to meet. It’s also possible to look at this relation in terms of the minimum propagation delay required on a datapath for stable operation.

Important

The minimum propagation delay of a datapath is constrained by the hold time of the flip-flops, the clock skew between launch and capture flip-flop, and the clock-Q time of the flip-flops.

T_{dp,min} > T_{hold} + T_{skew} - T_{clk-Q}

It’s easy to see why the ideal skew must by 0, and not negative. Assume that you have a circuit with feedback, with two flip-flops in the feedback loop. That is, register R2 is updated with a result computed from R1, and register R1 is updated with a result computed from R2. In that case, a positive skew from R1 to R2 will become a negative skew from R2 to R1, and vice versa. The ideal solution is therefore a zero skew.

Modern hardware design for ASIC takes the skew problem specifically into account, and will generate a clock tree for a specific circuit. The following figure shows the example of a so-called H-tree which distributes a clock physically to 8 flip-flops in such a way that the skew over all flip-flops becomes close to 0. We will discuss the problem of creating a circuit to distribute the clock signal with low skew in further detail once we discuss layout generation.

_images/htree.jpg

Delay Modeling

In order to verify the timing of a complex digital circuit, we must have a way to quickly compute T_{dp}. The combinational delay of a digital circuit depends not only on the (active) logic gates that evaluate the combinational function, but also on the (passive) interconnect between the gates. In an ASIC layout, the latter is defined by the placement of gates on the layout as well as the length of the interconnect running between individual gates. The fundamental concepts of Delay Modeling follow from the operation of CMOS logic, such as the CMOS inverter shown next.

_images/invertertiming.png

The CMOS inverter uses a p-channel and a n-channel transistor that will exclusively turned on, depending on the input voltage level at the common gate A. The equivalent circuit of such a transistor is a switch controlled by the voltage level on the common gate, as well as an internal resistor R_p or R_n. The inverter gate drives circuit interconnect and other gate inputs which are modeled - in first order - as a loading capacitor C_L. Hence, when a CMOS gate switches, it doesn’t make an immediate transition. Instead, it charges (or discharges) a loading capacitor using an exponential curve with a time constant propertional to \tau_{0 \rightarrow
1} = R_p C_L. Because a logic ‘1’ or a logic ‘0’ requires the voltage to be higher (resp. lower) than a given threshold, it will therefore take time before the output of a gate has turned from 0 to 1 or vice versa.

The time constant \tau can increase because of two factors.

  • When R_p or R_n increases, the gate is slower. The resistance will increase when a weaker gate with smaller transistors is used, i.e. when the drive strength of the gate decreases. In addition, when a gate needs to drive a long wire, there is a non-negligible series resistance that must be added to R_p or R_n to find the effective resistance.

  • When C_L increases, the gate is slower. The loading capacitance is determined by the type and number of gates that are driven with this gate. The fanout is the number of gates driven, and the higher the fanout, the slower the gate.

Circuit timing estimators do not directly compute exponential charging and discharging of wires. Instead, they approximate the exponential-shaped waveforms using linear or piecewise linear models. The following shows an example approximation into two popular circuit delay modeling techniques, NLDM (Non Linear Delay Modeling) and CCS (Composite Current Source).

_images/delaymodel.jpg

The key difference between an NLDM and CCS model is the level of sophistication by which the loading capacitance C_L of a gate is modeled. NLDM assumes a constant loading capacitance, and is adequate for gate technologies above 90 nm (including the SKY130 PDK that we will be using for experiments). CCS assumes a variable loading capacitance, and is used with gate technologies below 90 nm. In the following discussion, we focus exclusively on NLDM modeling.

NLDM Delay Modeling

_images/propagation.jpg

Assume an inverter circuit that drives several other gates. The NLDM model can capture delays in this circuit following the waveforms shown below the circuit.

Each transition is modeled with a given steepness or slew rate. The slew rate, which is typically measured as the time between the 10% and 90% of the signal’s transition voltage, is a first order approximation of the non-linear (exponential) charging/discharging of loading capacitance. Slew rate is modeled in terms of a transition delay. Rather than expressing the steepness of the transition, we capture the time of the transition. The propagation delay of a gate is measured as the response time between an input and a corresponding output, measured as the time between the halfway points of the transition at the input and at the output.

Each input on the network represents an equivalent loading capacitance. On a high-fanout net, many inputs are connected in parallel and thus the loading capacitance rises.

The delay computation tool now has to compute the propagation delay of every cell, as well as the transition delay for every output. The delay computation tool distinguishes rising transitions from falling transitions, and each of these transitions may take a different transition time (on a net) or a different propagation delay (on a cell).

The objective of the delay computation tool is to determine the propagation delay of every gate, as well as the transition time for every output. The propagation delay of a gate depends on the loading capacitance of the gate output, and the transition time at the input. Similarly, the transition time of an output depends on the loading capacitance of the gate output, and the transition time at the input.

When a gate has multiple inputs, there can be multiple factors for the propagation delay and the transition time of the output. The delay computation tool will consider all of them, and then select the worst possible one to reflect the propagation and transition delay of an output. The causal dependence between the input and the output of a gate in terms of delay is called a timing arc. Thus, you can imagine each gate pin as a node in a graph, and the timing arcs as the edges of that graph. Then, the delay computation of the overall circuit can be expressed in terms of finding the slowest path in that graph. Initially, most of the propagation delays and transition delays are unknown, except for those at the input. Then, the delay computation tool starts resolving all the unknowns using a per-gate lookup table, until all delays are known. Finally, the delay of the overall circuit can be determined.

Interconnect Delay Modeling

Delay is also affected by interconnect. The basic delay model of a wire accounts for its resistance and capacitance to ground, which are both proportional to the wire length.

The circuit delay through a wire modeled using an equivalent resistor R and a capacitor C is given by the Elmore delay, which sets the delay at R.C. To account for the distributed (rather than lumped) nature of wire resistance and wire capacitance, the elmore delay of a wire is rather R.C/2. Hence, the total delay experienced by a signal edge traveling through the circuit below is R_{wire} * (\frac{C_{wire}}{2} + C_{gate}).

_images/elmore.jpg

Interconnect delay is introduced in the delay estimation through two mechanisms: through a wire load model or through parasitics modeling.

Wire Load Models

In advanced technology nodes, wire delay is dominating the delay of the overall circuit. Delay computation tools deal with the delay of wires by estimating the length of the wire. The length of a wire implies an equivalent resistance and capacitance for the wire, and thus, a delay. The loading capacitance is also affecting the transition delay of the gate that drives the wire.

As long as the circuit is not physically constructed into a layout, the length of the wires and the metal layers used for each wire, is not yet known. This creates a chicken-and-egg problem, since without known the precise delays, it’s not possible to accurately size the circuit and select the proper drive strength for each gate. To get around this problem, delay computation tools (and synthesis tools) use a wire load model, an initial estimate of the resistance and capacitance of each wire based on high level circuit factors such as fanout and circuit area.

When the detailed layout of a circuit is known, the wire load model is replaced with more precise estimates for each wire in terms of resistance and capacitance. Thus, as we add more design details to the implementation, the timing estimates computed by the delay computation tool will be more precise. Typically, you will find that the timing of the circuit worsens after layout. However, it should not become so much worst that the post-layout design introduces a significant amount of timing violations. If that would occur, it could indicate that the wire load model that was chosen initially at synthesis, as not a good estimate of the actual circuit area and layout.

The Liberty Format

The delay information of a gate is stored in a specific(human-readable) format called the Liberty Format. Originally devised by Synopsys, the Liberty Format is now an open standard (http://www.opensourceliberty.org/) that is shared by other EDA vendors and cell library providers.

Attention

The Liberty files (or lib files for short) of the 130nm high-density standard cells are stored under /opt/skywater/libraries/sky130_fd_sc_hd/latest/timing/. There are multiple lib files in that directory, corresponding to different operating corners (temperature, voltage, process).

The Liberty files of CDS45 standard cells are stored under /opt/cadence/libraries/gsclib045_all_v4.7/gsclib045/timing/. Similarly, the different lib files in that directory correspond to different operating corners.

For every gate in the library, the .lib file will capture the load model (the input capacitance of each gate) as well as the drive model (the gate delay and output transition delay). As an example, let’s look at the drive-strength-1 AND2 gate of the CDS45 library. The model is slightly shortened by removing information not related to timing, such as power.

cell (AND2X1) {
  area : 1.368;
  pin (Y) {
    direction : "output";
    function : "(A B)";
    max_capacitance : 0.25;
    timing () {
      related_pin : "A";
      timing_sense : positive_unate;
      timing_type : combinational;
      cell_rise (delay_template_2x2) {
        index_1 ("0.008, 0.28");
        index_2 ("0.01, 0.25");
        values ( \
          "0.221347, 2.96281", \
          "0.349336, 3.09097" \
        );
      }
      rise_transition (delay_template_2x2) {
        index_1 ("0.008, 0.28");
        index_2 ("0.01, 0.25");
        values ( \
          "0.220913, 5.14177", \
          "0.22198, 5.14177" \
        );
      }
      cell_fall (delay_template_2x2) {
        index_1 ("0.008, 0.28");
        index_2 ("0.01, 0.25");
        values ( \
          "0.18609, 3.40759", \
          "0.309193, 3.53088" \
        );
      }
      fall_transition (delay_template_2x2) {
        index_1 ("0.008, 0.28");
        index_2 ("0.01, 0.25");
        values ( \
          "0.259895, 6.18195", \
          "0.260511, 6.18195" \
        );
      }
    }
    timing () {
      related_pin : "B";
      timing_sense : positive_unate;
      timing_type : combinational;
      cell_rise (delay_template_2x2) {
        index_1 ("0.008, 0.28");
        index_2 ("0.01, 0.25");
        values ( \
          "0.230092, 2.97154", \
          "0.366278, 3.10769" \
        );
      }
      rise_transition (delay_template_2x2) {
        index_1 ("0.008, 0.28");
        index_2 ("0.01, 0.25");
        values ( \
          "0.220922, 5.14177", \
          "0.221819, 5.14177" \
        );
      }
      cell_fall (delay_template_2x2) {
        index_1 ("0.008, 0.28");
        index_2 ("0.01, 0.25");
        values ( \
          "0.190622, 3.41312", \
          "0.316799, 3.53945" \
        );
      }
      fall_transition (delay_template_2x2) {
        index_1 ("0.008, 0.28");
        index_2 ("0.01, 0.25");
        values ( \
          "0.260612, 6.18249", \
          "0.261192, 6.18251" \
        );
      }
    }
  pin (A) {
    direction : "input";
    max_transition : 0.28;
    capacitance : 0.000205583;
    rise_capacitance : 0.000205583;
    rise_capacitance_range (0.000202183, 0.000210011);
    fall_capacitance : 0.000164011;
    fall_capacitance_range (0.000143976, 0.0001853);
  }
  pin (B) {
    direction : "input";
    max_transition : 0.28;
    capacitance : 0.000208516;
    rise_capacitance : 0.000208516;
    rise_capacitance_range (0.000205733, 0.000210818);
    fall_capacitance : 0.000186718;
    fall_capacitance_range (0.000182961, 0.000191844);
  }
}

This model captures information related to each pin of the gate: A, B, and Y. The input pin models are the shortest. The model encodes input capacitance for rising and falling transitions separately. It also capture the capacitance range which is needed for more advanced CCS based delay calculations. The max_transition attribute is a constraint and represents the slowest transition that is acceptable for the given cell characterization.

The output pin description is more sophisticated and states the function (A B) and several timing blocks. Each timing block encodes a timing arc, and there is a timing block for changes related to the A input as well as changes related to the B input. The propagation delay for the gate is encoded as the cell_fall/cell_rise to capture delay from changes at the corresponding input. The transition delay for the output is encoded as fall_transition/rise_transition to capture output transition time stemming from changes at the corresponding input. Each timing arc has a unateness, which indicate the direction of the output transition in function of the input transition. Positive unateness means the output rises when the input rises, and negative unateness means the output falls when the input rises. An output pin can also be non-unate, when there is no logic connection between input and output transition. For example, the Q output on a flip-flop is non-unate with respect to the clock input signal.

The delays are encoded in lookup tables. For example, the lookup table delay_template_2x2 is used to describe the propagation delay for a rising output triggered from input B:

timing () {
  related_pin : "B";
  timing_sense : positive_unate;
  timing_type : combinational;
  cell_rise (delay_template_2x2) {
    index_1 ("0.008, 0.28");
    index_2 ("0.01, 0.25");
    values ( \
      "0.230092, 2.97154", \
      "0.366278, 3.10769" \
    );
  }
  ...

These encoding used for these lookup tables is stored in the .lib file as well. For example, for delay_template_2x2 we find:

lu_table_template (delay_template_2x2) {
  variable_1 : input_net_transition;
  variable_2 : total_output_net_capacitance;
  index_1 ("0.008, 0.28");
  index_2 ("0.01, 0.3");
}

Thus, the first index of cell_rise represents the input transition delay, and the second index of cell_rise represents the output net capacitance. The transition delay table for cell_rise is thus encoded as follows. (index1, index2) map to (row, column).

cell_rise

(x1) 0.01

(x2) 0.3

(y1) 0.008

(t1) 0.230092

(t2) 2.97154

(y2) 0.28

(t3) 0.366278

(t4) 3.10769

Using this table, the timing simulator can determine the right gate propagation delay from a given capacitive load and a given input transition. For example, if the input transition (on B) is 0.28ns and the capacitive load at the output is 0.01 fF, then the gate propagation delay for a rising transition is 0.366278ns. Then the index value cannot be directly matched (to either input transition or output capacitance), the output propagation delay will be found using linear interpolation. For the following table:

t   &= x01 * y01 * t4 + x20 * y01 * t3 + x01 * y20 * t2 +  x20 * y20 * t1 \\
x01 &= (x0 - x1) / (x2 - x1) \\
x20 &= (x2 - x0) / (x2 - x1) \\
y01 &= (y0 - y1) / (y2 - y1) \\
y20 &= (y2 - y0) / (y2 - y1)

Standard Delay Format

When additional design information becomes available (such as the fanout and type of gates used after synthesis, or the exact topology of interconnect after layout), the delay estimates of gates and wires can be refined. The standard delay format (sdf) is a data format that is used by synthesis and layout tools to provide such updated delay information. The data provided through sdf can override the delays listed in the lib file. Standard delay format can be used, for example, to initialize a gate-level timing simulation. The library provides a functional view of the gates in Verilog, and the SDF captures the delay effects of the gate-level netlist. Keep in mind that typical 4-value HDL simulation (0,1,X,Z) cannot simulate slew or transition delay.

The following is an example of an entry in an sdf file, produced by the synthesis tool. It describes the delays of a gate g828__3680.

(CELL
   (CELLTYPE "ADDFX1")
   (INSTANCE g828__3680)
   (DELAY
      (ABSOLUTE
        (PORT A (::0.000))
        (PORT B (::0.000))
        (PORT CI (::0.000))
        (IOPATH CI CO (::0.186) (::0.172))
        (IOPATH A CO (::0.200) (::0.173))
        (IOPATH B CO (::0.198) (::0.173))
        (IOPATH CI S (::0.286) (::0.268))
        (IOPATH A S (::0.281) (::0.300))
        (IOPATH B S (::0.291) (::0.294))
      )
   )
)
  • (CELLTYPE "ADDFX1") specifies the type of the cell, which is “ADDFX1” in this case.

  • (INSTANCE g828__3680) specifies the instance name of the cell, which is “g828__3680.”

  • (DELAY ...) defines the delay characteristics of the cell instance.

    • (ABSOLUTE ...) indicates that the delay values provided are absolute delays.

    • (PORT A (::0.000)) specifies that the input port “A” has an absolute delay of 0.000 time units. Similar specifications hold for B and CI.

    • (IOPATH CI CO (::0.186) (::0.172)) defines an input-to-output path from “CI” to “CO” with an absolute delay of 0.186 time units and a slew (transition) delay of 0.172 time units. Similar specifications hold for other IO paths of the gate.

Static Timing Analysis

We are now ready to perform timing analysis of a complete circuit. Through delay modeling, we can annotate each gate (and possibly wire) with a given propagation delay. Now, given an arbitrarily complex circuit, we would like to verify if the setup delay constraints of flip-flops in the design are met.

First, we demonstrate how slack can be computed for any node in an arbtitrary combinational circuit. Next, we then expand this method to the static timing analysis algorithm to verify the setup/hold time of every flip-flop in the circuit, as well as the arrival time of every system output.

Actual Arrival Time in a combinational circuit

Using a straightforward graph algorithm, the propagation delay of a network of combinational gates is computed as follows. We use the following example as an illustration. A three-input network has a propagation delay of 1 ns for every gate. All inputs except C are available at the start of the clock cycle, while C is only ready after 0.5ns. Such an input delay is a typical constraint encountered in actual timing analysis calculations, as we will illustrate later.

_images/stacircuit.png

To compute the propagation delay of this circuit, we need to determine the actual arrival time (AAT), the time when output Q is available. All gates are mapped as nodes in a graph, and interconnections are mapped as edges. Inputs and outputs map to their own node. Each node is then annotated with a propagation delay, and the AAT is compute from inputs to outputs, by identifying the latest time each node output is available.

_images/staat.png

Next, we compute the required arrival time (RAT) for each node in the graph under the assumption of a given clock period. While AAT is computed from input to output, the RAT is computed from output to inputs, each time identifying the latest possible time that a node input must be ready in order to meet the overall RAT.

_images/starat.png

Finally, the slack for the circuit output, and every other node in the graph, is the difference between the RAT and the AAT. The slack must always be positive. A negative slack (for any node in the graph) indicates a timing violation and thus a circuit that does not meet its timing constraints. Also, the critical path in the graph becomes visible by highlighting all nodes with minimum slack. In the case of the example, we see that the critical path runs through input C (the slowest input) and the path that cross a maximumm overall gate delay.

_images/staslack.png

Applying static timing analysis in sequential logic

The perform static timing analysis on sequential logic, the slack computation algorithm is applied to every block of combinational logic in a given module. There are four possible paths in such a circuit. These four paths enumerate the four possible paths between two kinds of inputs (module input and register output) and two kinds of output (module output and register input). The four paths are:

  • Combinational logic from module input to module output

  • Combinational logic from module input to register input

  • Combinational logic from register output to module output

  • Combinational logic from register output to register input

We compute the slack for each of these paths, and verify that each slack is positive. Static analysis tools use the term path group to mark such similar paths. In designs with multiple clocks, there can be more than four possible path groups, and the static timing analysis tool will verify to slack in every group.

Input and Output Delay

In principle, the required arrival time is equal to the chosen clock period at which the tming analysis is performed. However, there are practical arguments to verify against a tighter constraint, in particular because the module inputs and module outputs can introduce their own timing constraint.

Each input can be associated with an input delay, the time after the clock edge at which the input is valid. Furthermore, each output can be associated with an output delay, the time before the clock edge at which the output must be available. The ideal input delay, as well as the ideal output delay, is zero, meaning that the entire clock period can be used for combinational computation. However, registers cannot meet that constraint. Every register has a non-zero clk-Q time, which determines the ‘input delay’ for the paths starting at a register output. Also, every register has a positive setup time, which determines the ‘output delay’ for paths ending at a register input.

In addition, external interfaces (such as the connections to memory modules) can introduce additional input/output constraints. The following example shows how one could select an input delay in an interconnect strategy that puts registers at the output. Below that, how one coudl select an output delay constraint in an interconnect strategy that puts registers at the input.

_images/iodelay.png

Timing Analysis with Tempus

Refer to the materials posted on Canvas (Timing Paths and SDC Constraints)

Example

We will discuss static timing analysis on the following multiply-accumulate example. Notice that this module has all four types of timing paths we discussed above: input to output, input to register, register to register and register to output.

module mac(
    input logic [7:0] x1,
    input logic [7:0] x2,
    output logic [9:0] y,
    output logic [9:0] m,
    input logic reset,
    input logic clk
);

   logic [9:0] y_next;
   logic [15:0] m16;

   always_ff @(posedge clk)
       if (reset)
         y <= 10'b0;
       else
         y <= y_next;

   always_comb
     begin
        m16 = x1 * x2;
        m = m16[15:6];
        y_next = y + m;
     end

endmodule

First, perform RTL synthesis in the syn directory.

cd syn
make

The resulting module is implemented using 163 standard cells for a 4 ns clock constraint. The synthesis timing report states a slack of 633ps, with the critical path running from the input (x2[0]) to the accumulator register msb (y[9]). The critical path of the design thus includes the multiply-accumulate operation, and within those operations the path includes the complete carry chain from LSB to MSB.

#---------------------------------------------------------------------------------------------------------
#          Timing Point           Flags    Arc   Edge   Cell     Fanout Load Trans Delay Arrival Instance
#                                                                       (fF)  (ps)  (ps)   (ps)  Location
#---------------------------------------------------------------------------------------------------------
  x2[0]                           -       -      R     (arrival)      7  2.6     0     0       0    (-,-)
  mul_21_11_g847__4319/Y          -       A->Y   F     NAND2X1        2  0.8    76    61      61    (-,-)
  mul_21_11_g817__5107/Y          -       A->Y   R     NOR2X1         2  1.2    84   100     160    (-,-)
  mul_21_11_g806__4733/Y          -       B0->Y  F     AOI21X1        1  0.5    77    69     230    (-,-)
  mul_21_11_cdnfadd_003_0__2883/S -       CI->S  R     ADDFX1         2  0.8    46   307     537    (-,-)
  mul_21_11_g802__1881/Y          -       A1->Y  F     OAI221X1       1  0.4   154   166     703    (-,-)
  mul_21_11_g798__7098/Y          -       A1->Y  R     AOI22X1        2  0.8    87   157     860    (-,-)
  mul_21_11_g796__8246/Y          -       C->Y   F     NOR3BX1        1  0.4    43    72     931    (-,-)
  mul_21_11_g794__1705/Y          -       A1->Y  R     OAI22X1        1  0.8   100    87    1018    (-,-)
  mul_21_11_g793__2802/CO         -       B->CO  R     ADDFX1         1  0.6    39   230    1249    (-,-)
  mul_21_11_g792__1617/CO         -       CI->CO R     ADDFX1         1  0.6    39   186    1435    (-,-)
  mul_21_11_g791__3680/CO         -       CI->CO R     ADDFX1         1  0.6    39   186    1621    (-,-)
  mul_21_11_g790__6783/CO         -       CI->CO R     ADDFX1         1  0.6    39   186    1807    (-,-)
  mul_21_11_g789__5526/CO         -       CI->CO R     ADDFX1         1  0.6    39   186    1993    (-,-)
  mul_21_11_g788__8428/CO         -       CI->CO R     ADDFX1         1  0.6    39   186    2179    (-,-)
  mul_21_11_g787__4319/CO         -       CI->CO R     ADDFX1         1  0.6    39   186    2365    (-,-)
  mul_21_11_g786__6260/CO         -       CI->CO R     ADDFX1         1  0.6    39   186    2551    (-,-)
  mul_21_11_g785__5107/S          -       CI->S  F     ADDFX1         2  0.7    45   271    2822    (-,-)
  g707__4319/CO                   -       B->CO  F     ADDFX1         1  0.4    37   172    2994    (-,-)
  g705__5107/Y                    -       B->Y   R     XNOR2X1        1  0.2    22   163    3157    (-,-)
  g703__2398/Y                    -       AN->Y  R     NOR2BX1        1  0.3    39    86    3243    (-,-)
  y_reg[9]/D                      <<<     -      R     DFFHQX1        1    -     -     0    3243    (-,-)
#---------------------------------------------------------------------------------------------------------

For a better insight into the timing behavior of the module, we run the timing analysis again under sta.

cd ta
make

The static timing analysis makes use of the following timing analysis script. The following indicate particular activities.

  • The script inputs are the gate-level netlist, the standard cell library, and the constraint file as well as the SDF generated by the synthesis. Thus, the STA results from this run may not be identical to those produced by the synthesis because the constraints/sdf is not identical.

  • The script produces three files. The first file, late.rpt shows the three longest paths in the system as they pertain to register setup tests. The second file, early.rpt shows the three longest paths in the system as they pertain to register hold checks. The third file, allpaths.rpt, shows a summary of the longest path starting at each input, output, register input and register output. This third file gives an understanding how the worst case path (from x2[0] to y[9]) relates to other slow paths in the module.

  • The script uses Tempus commands; consult the Tempus command reference manual (on Cadence support) for variations of these commands. Some of them, such as report_timing, support many variations and are worth exploring.

read_lib /opt/cadence/libraries/gsclib045_all_v4.7/gsclib045/timing/slow_vdd1v0_basicCells.lib

read_verilog ../syn/outputs/mac_netlist.v
set_top_module mac

read_sdc ../syn/outputs/mac_constraints.sdc
read_sdf ../syn/outputs/mac_delays.sdf

report_timing -late -max_paths 3 > late.rpt
report_timing -early -max_paths 3 > early.rpt

report_timing  -from [all_inputs] -to [all_outputs] -max_paths 12 -path_type summary  > allpaths.rpt
report_timing  -from [all_inputs] -to [all_registers] -max_paths 12 -path_type summary  >> allpaths.rpt
report_timing  -from [all_registers] -to [all_registers] -max_paths 12 -path_type summary >> allpaths.rpt
report_timing  -from [all_registers] -to [all_outputs] -max_paths 12 -path_type summary >> allpaths.rpt
exit

First, let’s look at the worst path from late.rpt. That path turns out to be worse than the path reported by the synthesis tool although it’s quite close. The differences are cause by the use of SDF.

Path 1: MET Setup Check with Pin y_reg[9]/CK
Endpoint:   y_reg[9]/D (^) checked with  leading edge of 'clk'
Beginpoint: x2[0]      (^) triggered by  leading edge of 'clk'
Path Groups: {clk}
Other End Arrival Time          0.000
- Setup                         0.124
+ Phase Shift                   4.000
= Required Time                 3.876
- Arrival Time                  3.291
= Slack Time                    0.585
     Clock Rise Edge                      0.000
     + Input Delay                        0.000
     = Beginpoint Arrival Time            0.000
      -------------------------------------------------------------------------------
      Instance                       Arc           Cell      Delay  Arrival  Required
                                                                    Time     Time
      -------------------------------------------------------------------------------
      -                              x2[0] ^       -         -      0.000    0.585
      mul_21_11_g847__4319           A ^ -> Y v    NAND2X1   0.061  0.061    0.646
      mul_21_11_g817__5107           A v -> Y ^    NOR2X1    0.100  0.161    0.746
      mul_21_11_g806__4733           B0 ^ -> Y v   AOI21X1   0.069  0.230    0.815
      mul_21_11_cdnfadd_003_0__2883  CI v -> S ^   ADDFX1    0.307  0.537    1.122
      mul_21_11_g802__1881           A1 ^ -> Y v   OAI221X1  0.166  0.703    1.288
      mul_21_11_g798__7098           A1 v -> Y ^   AOI22X1   0.157  0.860    1.445
      mul_21_11_g796__8246           C ^ -> Y v    NOR3BX1   0.072  0.932    1.517
      mul_21_11_g794__1705           A1 v -> Y ^   OAI22X1   0.087  1.019    1.604
      mul_21_11_g793__2802           B ^ -> CO ^   ADDFX1    0.230  1.249    1.834
      mul_21_11_g792__1617           CI ^ -> CO ^  ADDFX1    0.186  1.435    2.020
      mul_21_11_g791__3680           CI ^ -> S ^   ADDFX1    0.290  1.725    2.310
      g725__1881                     B ^ -> CO ^   ADDFX1    0.203  1.928    2.513
      g722__7098                     CI ^ -> CO ^  ADDFX1    0.186  2.114    2.699
      g719__5122                     CI ^ -> CO ^  ADDFX1    0.186  2.300    2.885
      g716__2802                     CI ^ -> CO ^  ADDFX1    0.186  2.486    3.071
      g713__3680                     CI ^ -> CO ^  ADDFX1    0.186  2.672    3.257
      g710__5526                     CI ^ -> CO ^  ADDFX1    0.186  2.858    3.443
      g707__4319                     CI ^ -> CO ^  ADDFX1    0.184  3.042    3.627
      g705__5107                     B ^ -> Y ^    XNOR2X1   0.163  3.205    3.790
      g703__2398                     AN ^ -> Y ^   NOR2BX1   0.086  3.291    3.876
      y_reg[9]                       D ^           DFFHQX1   0.000  3.291    3.876
      -------------------------------------------------------------------------------

Next, look at the fastest path with respect to hold time in early.rpt. This path runs from reset to a register input. The overall module uses syncrhonous reset, so this reset signal is a simple system input. Because the system constraint uses an input delay of 0, the STA assumes that reset can change directly after the upgoing clock edge, thereby potentially affecting y_reg[7]. However, as the report shows, there is no hold time violation.

Path 1: MET Hold Check with Pin y_reg[7]/CK
Endpoint:   y_reg[7]/D (v) checked with  leading edge of 'clk'
Beginpoint: reset      (^) triggered by  leading edge of 'clk'
Path Groups: {clk}
Other End Arrival Time          0.000
+ Hold                          0.006
+ Phase Shift                   0.000
= Required Time                 0.006
  Arrival Time                  0.020
  Slack Time                    0.014
     Clock Rise Edge                      0.000
     + Input Delay                        0.000
     = Beginpoint Arrival Time            0.000
      ---------------------------------------------------------
      Instance    Arc         Cell     Delay  Arrival  Required
                                              Time     Time
      ---------------------------------------------------------
      -           reset ^     -        -      0.000    -0.014
      g709__8428  B ^ -> Y v  NOR2BX1  0.020  0.020    0.006
      y_reg[7]    D v         DFFHQX1  0.000  0.020    0.006
      ---------------------------------------------------------

Finally, look at the path summary of the overall module in allpath.rpt. This results show that the slack is minimal indeed for the input-to-register path. Note also the significant variation of the slack within a path group (e.g. input-to-register) as well as between apth groups. In this case, the x2[0 to y_reg[9] path is so much longer than anything else, that it dominates the delay. Hence, any timing optimization on this module should start by considering that specific part of the design. However, in other case, other parts of the design may be equally slow so that the critical path tends to jump around as soon as we start making small changes/optimzations. The use of STA (and tempus, in particular) can help us understand how design changes can affect the critical path. In the quiz, you will explore this aspect further.

#  Command:           report_timing  -from [all_inputs] -to [all_outputs] -max_paths 12 -path_type summary  > allpaths.rpt
      --------------------------------------------------------------------------------------------------------------------------------------
       Path No.   Begin Point         End Point            Slack      Arrival        Required   Phase                  Other Phase
      --------------------------------------------------------------------------------------------------------------------------------------
      1          x2[0] ^             m[8] ^               1.159      2.841          4.000      clk(D)(P) *            clk(C)(P)
      2          x2[0] ^             m[9] ^               1.265      2.735          4.000      clk(D)(P) *            clk(C)(P)
      3          x2[0] ^             m[7] ^               1.345      2.655          4.000      clk(D)(P) *            clk(C)(P)
      4          x2[0] ^             m[6] ^               1.531      2.469          4.000      clk(D)(P) *            clk(C)(P)
      5          x2[0] ^             m[5] ^               1.717      2.283          4.000      clk(D)(P) *            clk(C)(P)
      6          x2[0] ^             m[4] ^               1.903      2.097          4.000      clk(D)(P) *            clk(C)(P)
      7          x2[0] ^             m[3] ^               2.089      1.911          4.000      clk(D)(P) *            clk(C)(P)
      8          x2[0] ^             m[2] ^               2.275      1.725          4.000      clk(D)(P) *            clk(C)(P)
      9          x2[0] ^             m[1] ^               2.463      1.537          4.000      clk(D)(P) *            clk(C)(P)
      10         x2[3] ^             m[0] v               2.647      1.353          4.000      clk(D)(P) *            clk(C)(P)
      --------------------------------------------------------------------------------------------------------------------------------------

#  Command:           report_timing  -from [all_inputs] -to [all_registers] -max_paths 12 -path_type summary  >> allpaths.rpt
      --------------------------------------------------------------------------------------------------------------------------------------
       Path No.   Begin Point         End Point            Slack      Arrival        Required   Phase                  Other Phase
      --------------------------------------------------------------------------------------------------------------------------------------
      1          x2[0] ^             y_reg[9]/D ^         0.585      3.291          3.876      clk(D)(P) *            clk(C)(P)
      2          x2[0] ^             y_reg[8]/D ^         0.643      3.233          3.876      clk(D)(P) *            clk(C)(P)
      3          x2[0] ^             y_reg[7]/D ^         0.829      3.047          3.876      clk(D)(P) *            clk(C)(P)
      4          x2[0] ^             y_reg[6]/D ^         1.015      2.861          3.876      clk(D)(P) *            clk(C)(P)
      5          x2[0] ^             y_reg[5]/D ^         1.201      2.675          3.876      clk(D)(P) *            clk(C)(P)
      6          x2[0] ^             y_reg[4]/D ^         1.387      2.489          3.876      clk(D)(P) *            clk(C)(P)
      7          x2[0] ^             y_reg[3]/D ^         1.573      2.303          3.876      clk(D)(P) *            clk(C)(P)
      8          x2[0] ^             y_reg[2]/D ^         1.767      2.109          3.876      clk(D)(P) *            clk(C)(P)
      9          x2[0] ^             y_reg[1]/D ^         1.964      1.912          3.876      clk(D)(P) *            clk(C)(P)
      10         x2[3] ^             y_reg[0]/D ^         2.348      1.511          3.859      clk(D)(P) *            clk(C)(P)
      --------------------------------------------------------------------------------------------------------------------------------------

#  Command:           report_timing  -from [all_registers] -to [all_registers] -max_paths 12 -path_type summary >> allpaths.rpt
      --------------------------------------------------------------------------------------------------------------------------------------
       Path No.   Begin Point         End Point            Slack      Arrival        Required   Phase                  Other Phase
      --------------------------------------------------------------------------------------------------------------------------------------
      1          y_reg[0]/Q ^        y_reg[9]/D ^         1.821      2.055          3.876      clk(D)(P) *            clk(C)(P)
      2          y_reg[0]/Q ^        y_reg[8]/D ^         1.879      1.997          3.876      clk(D)(P) *            clk(C)(P)
      3          y_reg[0]/Q ^        y_reg[7]/D ^         2.065      1.811          3.876      clk(D)(P) *            clk(C)(P)
      4          y_reg[0]/Q ^        y_reg[6]/D ^         2.251      1.625          3.876      clk(D)(P) *            clk(C)(P)
      5          y_reg[0]/Q ^        y_reg[5]/D ^         2.437      1.439          3.876      clk(D)(P) *            clk(C)(P)
      6          y_reg[0]/Q ^        y_reg[4]/D ^         2.623      1.253          3.876      clk(D)(P) *            clk(C)(P)
      7          y_reg[0]/Q ^        y_reg[3]/D ^         2.809      1.067          3.876      clk(D)(P) *            clk(C)(P)
      8          y_reg[0]/Q ^        y_reg[2]/D ^         2.995      0.881          3.876      clk(D)(P) *            clk(C)(P)
      9          y_reg[0]/Q ^        y_reg[1]/D ^         3.187      0.689          3.876      clk(D)(P) *            clk(C)(P)
      10         y_reg[0]/Q v        y_reg[0]/D ^         3.506      0.353          3.859      clk(D)(P) *            clk(C)(P)
      --------------------------------------------------------------------------------------------------------------------------------------

#  Command:           report_timing  -from [all_registers] -to [all_outputs] -max_paths 12 -path_type summary >> allpaths.rpt
      --------------------------------------------------------------------------------------------------------------------------------------
       Path No.   Begin Point         End Point            Slack      Arrival        Required   Phase                  Other Phase
      --------------------------------------------------------------------------------------------------------------------------------------
      1          y_reg[1]/Q v        y[1] v               3.798      0.202          4.000      clk(D)(P) *            clk(C)(P)
      2          y_reg[2]/Q v        y[2] v               3.798      0.202          4.000      clk(D)(P) *            clk(C)(P)
      3          y_reg[3]/Q v        y[3] v               3.798      0.202          4.000      clk(D)(P) *            clk(C)(P)
      4          y_reg[4]/Q v        y[4] v               3.798      0.202          4.000      clk(D)(P) *            clk(C)(P)
      5          y_reg[5]/Q v        y[5] v               3.798      0.202          4.000      clk(D)(P) *            clk(C)(P)
      6          y_reg[6]/Q v        y[6] v               3.798      0.202          4.000      clk(D)(P) *            clk(C)(P)
      7          y_reg[7]/Q v        y[7] v               3.798      0.202          4.000      clk(D)(P) *            clk(C)(P)
      8          y_reg[8]/Q v        y[8] v               3.798      0.202          4.000      clk(D)(P) *            clk(C)(P)
      9          y_reg[0]/Q v        y[0] v               3.804      0.196          4.000      clk(D)(P) *            clk(C)(P)
      10         y_reg[9]/Q v        y[9] v               3.806      0.194          4.000      clk(D)(P) *            clk(C)(P)
      --------------------------------------------------------------------------------------------------------------------------------------