Attention

This document was last updated Nov 25 24 at 21:59

RTL Synthesis

Important

The purpose of this lecture is as follows.

To describe the role of synthesis in the digital hardware design
To provide examples of hardware inference in SystemVerilog
To experiment with an RTL synthesis tool

Attention

The example discussed in this lecture is available on https://github.com/wpi-ece574-f24/ex-expressions

Synthesis and Optimization

In the introductory lecture, we spoke about the powerful concept of modeling to abstract design activities to a higher level. There are three abstraction levels that are important to us for this lecture.

System-level: Captures a behavioral description of a hardware design. Behavior, in this context, doesn’t have to be represented as a sequential C program. Think for example on Models of Compution, which allow to express abstract, complex events such as distributed, data-driven processing (dataflow), or interacting parallel processes (process networks).
Register-transfer level: Captures a mixed behavioral/structural description of a hardware design in terms of activities of a single clock cycle, a Register Transfer. RTL is the golden model in hardware design for the past 40 years: it expresses detailed, single-bit operations while at the same time it can also capture complex decision making logic such as with if-then-else statements.
Gate-level: Captures a structural description of a hardware design in terms of low-level primitives such as a gate or a standard cell. Most of our examples have been using SKY130 standard cells. The gate-level abstraction is technology specific and therefore very well suited for performance assessments such as area, critical path delay, power.

A fourth abstraction level, the transistor-level, is of great importance to analog and mixed-signal design, but it is less commonly used in digital hardware design. We will focus on the first three.

Defining Synthesis and Optimization

The design process of a digital hardware design requires a systematic mapping and refinement from higher abstraction levels to lower abstraction levels. Synthesis and Optimization playa critical role in this process.

Attention

Synthesis in digital hardware design involves methods that offer: a systematic (and often, automated) translation from a higher abstraction level to a lower abstraction level. The appeal of synthesis is that, once its automated, it can claim to create designs that are correct by construction.
Optimization in digital hardware design involves methods that: transform a given abstraction level into a more optimal representation at the same abstraction level. The optimality criterium depends on the objective of the optimization. An important observation is that the abstraction level does not change.

In digital hardware design, there is Synthesis and Optimization at any abstraction level, as the following examples illustrate.

System-level synthesis will generate RTL automatically from a system-level description. For example, we can expand a dataflow diagram automatically into an RTL description. System-level optimization will transform a system-level description into a different, more optimal one. For example, we could transform a process network by instantiating multiple parallel copies of a process to increase the processing parallellism.
RTL synthesis is used to create a gate-level netlist automatically from an RTL description. For example, we can convert a Finite State Machine Description into a gate-level netlist by selecting a state encoding, and performing logic synthesis for each state transition. RTL optimization is used to create better designs. For example, we can optimize an FSM at the RTL by trying to reduce the number of states it uses. The optimization knob of RTL lies in non-technological factors, such as reducing a clock cycle on a schedule, sharing an operator, rewriting a complex expression, and so on.
Gate-level synthesis is used to convert a gate-level netlist into a layout (ie. a transistor netlist with spatial dimensions). The term gate-level synthesis is not used; instead, terms like place-and-route and clock-tree-synthesis are used. Gate-level optimization is used to improve the performance of the design, and automatic retiming would serve as a good example. The optimization know of gate-level optimization is technological, and involves transformations that reduce area, critical path, power, and so forth.

The Hardware Design Flow

We thus can define a hardware design flow that combines optimization and synthesis as shown in the next figure. This is generic as it tries to cover multiple many cases of hardware design. At each level, a design goes through multiple iterations of optimization, and finally expands towards a lower level through synthesis. These steps are largely automated at the lower layers of abstraction, while they remain largely manual at the higher layers of abstraction. One exception would be high-level synthesis, which can map a subset of system-level behaviors into RTL.

The verification flow uses simulation or symbolic (formal) techniques to check the correctness of a design. Ideally, one would use a single verification technique throughout the design but in practice, only the RTL and gate-level verification are well integrated in simulation.

The first few steps in system level synthesis and optimization are, in general, extremely hard, and they require lots of design experience. Just consider the following problem as an illustration: given a set of design constraints (such as type of computations and real-time performance) how do you choose between an FPGA, a programmable processor, a customized processor (e.g DSP), a multi-core design, or a full dedicated ASIC? In a few extreme cases, the choice may be obvious, but in general, you will find that you can always make it work with any of these components, or a combination of them. Yet, that is the choice faced by a system designer. We will revisit this problem later in the course.

Register Transfer Level Design in Verilog

We discussed RTL modeling in our previous handson assignments. In the following, we will highlight some of the key ideas and point out various pitfalls of RTL coding. A first important observation is that hardware synthesis from RTL uses two distinct mechanisms.

Hardware instantiation is used when a lower level primitive is referenced structurally in the high level design. Hardware instantiation is seen, for example, in the gate-level netlists produced during RTL synthesis. The following example instantiates a module with name _2_ of type sky130_fd_sc_hd__dfxtp_1 – which is of course a flip-flop of the standard cell library. There is nothing special about instantiation, but it is often uses when complex primitives such as RAM modules must be integrated directly into the design.

sky130_fd_sc_hd__dfxtp_1 _2_ (
  .CLK(clk),
  .D(_0_),
  .Q(q)
);

Hardware inference is used when a lower level primitive will replace an RTL expression or construction. Hardware inference is a selective process, and not every Verilog construction has a corresponding hardware inference. This is an important observation, as it implies that not every Verilog construction in simulation will also lead to hardware. In fact, we will see examples where the hardware inference can lead to an implementation that differs from the simulation.

RTL coding Guidelines

RTL coding guidelines are vendor-specific; constructions that are supported by one vendor (or technology) may not be supported by another vendor. Also, the target technology may be slightly different from one vendor to the next (FPGA versus standard cells, for example). Therefore, it is crucial to rely on vendor documentation for details. Several links to such RTL Coding Guidelines are listed on top of this lecture among the Reading links.

Important

The fundamental insight of effective RTL coding is the following. It’s not helpful to think of RTL synthesis as a magical box that converts RTL code into a gate netlist. That strategy doesn’t work well on large systems. Sooner or later, you’ll end up with a buggy design (e.g., a design that mixes latches with flip-flops) and it’ll be hard to explain went wrong.

The right way of thinking about RTL inference is to think about your hardware implementation first, and then map that back to RTL code. Thus, you use RTL just as a shorthand for something that, at least in your mind, is already very structured. Hardware design is fundamentally a bottom-up process,

Visualizing the Synthesis Result

We will make use of the Cadence Genus tool for RTL synthesis. Refer to canvas for a set of slides that discuss the typical flow in cadence. Here, we will focus on the application of Genus on to implement small example.

Attention

You can download the code examples on the class design server using git.

git clone git@github.com:wpi-ece574-f24/ex-expressions.git

Once you download the code, inspect the following two files first: syn/genus_script.tcl and constraints/constraints_comb.sdc

The synthesize the example code, run genus with the graphical user interface option enabled (genus -gui). Then, follow along below as we synthesize each example.

There are three levels that are relevant for visualization.

RTL. After first parsing a Verilog file, the RTL is broken down into fundamental statements and register transfers. While this is not quite a netlist yet, we can represent the RTL operations as nodes in a graph, with edges carrying dependencies between variables.
Generic Gates. After synthesis, the RTL is converted into logic gates. Most RTL synthesis tools, including yosys, map the RTL to a generic technology, meaning a technology of gates that does not specify a specific feature size.
Technology Gates. After technology mapping, the netlist of generic logic gates is converted into concrete cells from a standard cell library.

Let’s look at the following example of a basic XOR gate.

module ex1(output logic q, input logic a, input logic b);
 assign q = (a & ~b) | (~a & b);
endmodule

To synthesize this code, run the following commands in the genus prompt. These commands can be found as well in the genus_script.tcl file.

# these are initialization commands that have to be run once.
# we will run synthesis against cadence 45nm standard cells

set_db init_lib_search_path /opt/cadence/libraries/gsclib045_all_v4.7/gsclib045/timing/
read_libs slow_vdd1v0_basicCells.lib


# this sets the 'effort' of the synthesis tool: how hard it tries to find a good solution
set_db syn_generic_effort medium
set_db syn_map_effort medium
set_db syn_opt_effort medium

Now we are ready to read in the code and synthesize it.

# read in RTL of the first example
read_hdl -language sv ../rtl/ex1.sv
elaborate

After these commands, the tool has constructed an RTL schematic of the code. You can visualize the RTL schematic in the GUI using ‘show schematic’.

_images/genus_ex1_rtl.png — RTL netlist of ex1

# map to generic gates
set_top_module ex1
read_sdc ../constraints/constraints_comb.sdc
syn_generic

Next, we map this code to generic gates. The first two commands set_top_module and read_sdc do housekeeping by selecting the top module in the synthesis as well as the synthesis constraints. syn_generic does the actual synthesis work. We will discuss the construction of synthesis constraints in more detail while we discuss timing. For now, it’s sufficient to understand that these constraints are used to express the desired speed of the design. In this case, we design the implementation to operate at 100 MHz.

# contents of constraint_comb.sdc
set non_clock_inputs [all_inputs]
set_input_delay  0 -clock clk $non_clock_inputs
set_output_delay 0 -clock clk [all_outputs]

After syn_generic completes, you’ll notice that the schematic looks different. The generic schematic uses less cells, presumably because these generic gates can implement a more sophisticated logic mapping.

The following steps map the generic netlist to 130nm standard cells, and finally to an optimized netlist of 130nm standard cells.

syn_map  # map to a concrete technology
syn_opt  # optimize the implementation

After syn_opt completes, you’ll notice the resulting design consists of a single xor cell. Because of the simplicity of this design, the syn_opt has not practical impact: the final netlist is the same as the mapped netlist.

_images/genus_ex1_45.png — CDS45 netlist of ex1

Genus allows you to various properties of the design. The report command, in particular, returns many useful statistics. For example, report_area shows the active area (standard cell area) of the final design. report_timing shows the speed (critical path) of the final design.

@genus:design:ex1 29> report_area
============================================================
Generated by:           Genus(TM) Synthesis Solution 21.19-s055_1
Generated on:           Sep 14 2024  10:54:49 am
Module:                 ex1
Technology library:     slow_vdd1v0 1.0
Operating conditions:   PVT_0P9V_125C (balanced_tree)
Wireload mode:          enclosed
Area mode:              timing library
============================================================

Instance Module  Cell Count  Cell Area  Net Area   Total Area   Wireload
--------------------------------------------------------------------------
ex1                       1      2.736     0.000        2.736 <none> (D)

@genus:design:ex1 30> report_timing
============================================================
Generated by:           Genus(TM) Synthesis Solution 21.19-s055_1
Generated on:           Sep 14 2024  10:56:02 am
Module:                 ex1
Operating conditions:   PVT_0P9V_125C (balanced_tree)
Wireload mode:          enclosed
Area mode:              timing library
============================================================

Path 1: MET (9844 ps) Late External Delay Assertion at pin q
Group: clk
Startpoint: (F) b
Clock: (R) clk
Endpoint: (R) q
Clock: (R) clk

Capture       Launch
Clock Edge:+   10000            0
Drv Adjust:+       0            0
Src Latency:+       0            0
Net Latency:+       0 (I)        0 (I)
Arrival:=   10000            0

Output Delay:-       0
Required Time:=   10000
Launch Clock:-       0
Input Delay:-       0
Data Path:-     156
Slack:=    9844

Exceptions/Constraints:
input_delay               0              constraints_comb.sdc_line_6_1_1
output_delay              0              constraints_comb.sdc_line_7

#---------------------------------------------------------------------------------------
# Timing Point   Flags   Arc   Edge   Cell     Fanout Load Trans Delay Arrival Instance
#                                                     (fF)  (ps)  (ps)   (ps)  Location
#---------------------------------------------------------------------------------------
b              -       -     F     (arrival)      1  0.2     0     0       0    (-,-)
g18__2398/Y    -       A->Y  R     XOR2XL         1  0.0    15   156     156    (-,-)
q              -       -     R     (port)         -    -     -     0     156    (-,-)
#---------------------------------------------------------------------------------------

We will discuss the meaning of these reports in detail in later lectures. For the timing being, we will focus on three important quality metrics.

area The active area of all standard cells in the design. In this case, the active area is 2.736 square micron
cell count The number of cells in the design. In this case, the cell count is 1.
slack The margin of the design compared to the clock period. In this case, the clock period is 10000 ps, and the computation is read at 156 ps. Therefore, the slack is 9844 ps, or 9.844 ns. Slack must always be positive, otherwise your design does not meet the timing constraints; i.e. when slack is negative, your design is too slow.

In the remainder of this demonstration, we focus only on the schematics and the link from SystemVerilog to gate-level netlist.

Complex Combinational Logic

A good way of creating complex combinational logic is to write expressions.

module ex2(output logic [3:0] q, input logic [1:0] a, b, c);
    assign q = a + b * c;
endmodule

This directly maps into a multiplier followed by an adder. Note how the expression b*c has a wordlength of twice the input, as we would expect of a multiplier. The wordlength is kept very short to keep the resulting gate-level netlist compact enough for display.

_images/genus_ex2_rtl.png — RTL netlist of ex2

The gate-level netlist graphics uses a black bar to ungroup wires from a bus. The following is the synthesis result: a 2x2 multiplier (with 4-bit output), to which a 2-bit number is added.

_images/genus_ex2_45.png — CDS45 netlist of ex2

Attention

In case you wonder about the functionality of a standard cell with an odd or unknown name, you’d typically look into (a) the standard cell library databook, or else the functional Verilog view of the standard cell. For example, for OAI2BB1X1, you can find a functional Verilog view in /opt/cadence/libraries/gsclib045_all_v4.7/gsclib045/verilog/slow_vdd1v0_basicCells.v on the class design server.

The RTL synthesis will use a host of tricks to make the resulting netlist as small as possible. One of these is constant propagation, where the constant inputs to a gate are used to simplify the netlist. For example, if we hardcode one input of the multiplier to the value 2, the resulting netlist can be drastically simplified. The gate level netlist implements the following design as two half-adder standard cells.

module ex3(output logic [3:0] q, input logic [1:0] a, b);
    assign q = a + b * 2;
endmodule

_images/genus_ex3_45.png — CDS45 netlist of ex3

Keep in mind that some verilog operators are very expensive to implement. For example, the inoccuous ‘modulo’ operator in Verilog becomes a large implementation, even at a short wordlength. Again, observe that for certain values of b (namely, at every power of 2 minus 1), the modulo operation becomes trivial to implement. The complexity of the design stems from the fact that we want to compute the modulo operation for any value that a two-bit number b can take. We will have a separate lecture where we discuss the implementation of arithmetic in detail.

module ex4(output logic [3:0] q, input logic [3:0] a, b);
    assign q = a % b;
endmodule

_images/genus_ex4_45.png — CDS45 netlist of ex4

To map combinational logic in a procedural manner, always use an always_comb block. The initial block is not supported, although some synthesis tools (such as for FPGA) may refer to the initial block to determine the initial value of flip-flops. But for combinational logic, initial remains unused. For example, the following piece of RTL will result in a pure inverter.

module ex5(output logic q, input logic a);
    initial q = 1;
    always_comb begin
        q = ~a;
    end
endmodule

Another important point of concern with combinational logic captured using always_comb blocks, is that the process sensitivity list must be complete.

module ex6(output logic q, input logic a, b);
    logic t;

    always @(a) begin
        t = a ^ 1;
        q = b ^ t;
    end
endmodule

This module maps into an XNOR gate. However, looking closely at the Verilog, this is not what was specified. If b changes, the always block will not evaluate, thereby leaving q unmodified. Hence, in this case, the resulting netlist is different from the Verilog specification.

_images/genus_ex6_45.png — CDS45 netlist of ex6

There is an easy workaround that will make sure that this problem is always avoided: use a wildcard for the sensitivity list of a combinational process (always) or just use always_comb with an empty sensitivity netlist. In other words, write the previous example as follows. This will not have impact on the synthesis outcome. However, it will make sure that the RTL behaves identical to the gate-level netlist during simulation.

module ex6(output logic q, input logic a, b);
    logic t;

    always_comb begin
        t = a ^ 1;
        q = b ^ t;
    end
endmodule

Priority Logic

Priority logic is used in expressing conditions. Verilog expresses procedural priority logic using if statements or else using case statements. Both of these require special attention. Consider the following design. A two-bit vector a selects either the value of b or its complement.

module ex7(output logic q, input logic [1:0] a, input logic b);
    always_comb begin
        if (a[1])
            q = b;
        else if (a[0])
            q = ~b;
    end
endmodule

The logic generated from this code contains a latch, which is driven by the bits of a.

_images/genus_ex7_45.png — CDS45 netlist of ex7

There is a problem with the Verilog description. The netlist includes a latch as the cell sky130_fd_sc_hd__dlxtn_1 (we intend to describe combinational logic, not stateful logic!). That problem is solved through a default assignment, a value that will be assigned to q when a combination of the input values does not evaluate to a concrete update for q. In this case, when a is 00, the value of q is left unspecified.

module ex8(output logic q, input logic [1:0] a, input logic b);
    always_comb begin
        q = 0;
        if (a[1])
            q = b;
        else if (a[0])
            q = ~b;
    end
endmodule

This generates the following logic. The latch has disappeared.

_images/genus_ex8_45.png — CDS45 netlist of ex8

Note also the impact of the priority logic. In a nested if-then case, the outer if receives priority. In the example above, bit a[1] has priority over a[0], meaning that if the value a is 1, then the function of this module is to copy b.

We can swap the nesting of the if-then-else statement, if we need a different priority. In the following example, bit a[0] has priority over a[1]. When a is 11, the function of this module is to copy the complement of b.

module ex9(output logic q, input logic [1:0] a, input logic b);
    always_comb begin
        q = 0;
        if (a[0])
            q = ~b;
        else if (a[1])
            q = b;
    end
endmodule

_images/genus_ex9_45.png — CDS45 netlist of ex9

Unlike if statements, expressing priorities with case statements is harder, because a case statement tests only a single condition for all the branches. However, a case statement is better at handling complex logic, such as for example with the next-state logic of finite state machines. A case statement has the same issue as if statements and can infer latches if an uncovered case condition is left. The use of a default branch ensures that the case logic is always complete.

module ex10(output logic q, input logic [1:0] a, input logic b);
    always_comb begin
        case (a)
            1: q = ~b;
            2: q = b;
            default: q = 0;
        endcase
    end
endmodule

_images/genus_ex10_45.png — CDS45 netlist of ex10

Loops

Verilog supports the use of loops in procedural blocks. These loops have no meaning for synthesis and will be expanded as the Verilog is translated to gates. If you want to implement a control contruction, you will have to build it using a finite state machine. The following example show how to compute the Hamming weight of a 5-bit number.

module ex11(output logic [2:0] q, input logic [4:0] a);
    logic [2:0] tmp;
    integer i;

    always_comb begin
        tmp = 0;
        for (i = 0; i < 5; i = i + 1)
            tmp = tmp + a[i];
        q = tmp;
    end
endmodule

This module results in pure combinational code. The compiler will unroll the loop before RTL synthesis. Note also that we can assign and reassign the value of tmp. However, for synthesis, all the iterations in a loop will instantly execute. tmp does not play the role of a hardware storage element.

_images/genus_ex11_45.png — CDS45 netlist of ex11

For example, the following modules, ex12 and ex13, will generate gates as ex11.

module ex12(output logic [2:0] q, input logic [4:0] a);
    logic [2:0] tmp;
    integer i;

    always_comb begin
        tmp = 0;
        tmp = tmp + a[0];
        tmp = tmp + a[1];
        tmp = tmp + a[2];
        tmp = tmp + a[3];
        tmp = tmp + a[4];
        q = tmp;
    end
endmodule

module ex13(output logic [2:0] q, input logic [4:0] a);
    always_comb begin
        q = 0 + a[0] + a[1] + a[2] + a[3] + a[4];
    end
endmodule

Flip-flops

Hardware registers are modeled in a separate always block which is driven by the clock signal. The assignment on flip-flops always proceeds using a non-blocking assignment. A non-blocking assignment is necessary to allow simultaneous assignments on all flip-flops. The following example shows two flop-flops that are connected back-to-back. The use an asynchronous reset.

module ex14(
    input wire d,
    input wire reset,
    input wire clk,
    output wire q1,
    output wire q2
);
    logic q1r, q2r;

    always_ff @(posedge clk or posedge reset) begin
        if (reset) begin
            q1r <= 1'b0;
            q2r <= 1'b0;
        end else begin
            q1r <= q2r;
            q2r <= q1r ^ d;
        end
    end

    assign q1 = q1r;
    assign q2 = q2r;
endmodule

_images/genus_ex14_45.png — CDS45 netlist of ex14

A flip-flop with synchronous reset is achieved by a simple change to the process sensitivity list.

module ex15(
    input logic d,
    input logic reset,
    input logic clk,
    output logic q1,
    output logic q2
);
    logic q1r, q2r;

    always_ff @(posedge clk) begin
        if (reset) begin
            q1r <= 1'b0;
            q2r <= 1'b0;
        end else begin
            q1r <= q2r;
            q2r <= q1r ^ d;
        end
    end

    assign q1 = q1r;
    assign q2 = q2r;
endmodule

_images/genus_ex15_45.png — CDS45 netlist of ex15

While asynchronous reset inputs are a common feature on flip-flop cells in standard cell libraries, synchronous reset are usually realized by multiplexing additional logic at the flip-flop inputs.

Gated clocks on flip-flops are to be done with caution. It’s not a good idea to directly manipulated the clock as follows. This will result in difficult skew problems.

module ex16(
    output logic q,
    input logic enable,
    input logic clk,
    input logic d
);
    logic qr;
    wire gatedclk;

    // An example of how NOT to gate a clock!
    assign gatedclk = clk & enable;

    always_ff @(posedge gatedclk) begin
        qr <= d;
    end

    assign q = qr;
endmodule

_images/genus_ex16_45.png — CDS45 netlist of ex16

A better way to implement a gate clock is to add an enable single inside of the synchronous update as follows. This design will ensure that each flip-flop receives a primitive clock signal.

module ex17(
    output logic q,
    input logic enable,
    input logic clk,
    input logic d
);
    logic qr;

    always_ff @(posedge clk) begin
        if (enable)
            qr <= d;
    end

    assign q = qr;
endmodule

_images/genus_ex17_45.png — CDS45 netlist of ex17

Finite State Machines

We already briefly discussed the design of finite state machines. We will discuss the implementation aspects by means of an example.

Let’s create an FSM that reencodes a bitstring as follows. The first 1 in a string of 1s is passed as a 1. Otherwise, the output is 0. The following shows an example input string with its corresponding output string.

input string:      1 1 1 0 0 1 0 0 1 1 1 0 1 1
output string:     1 0 0 0 0 1 0 0 1 0 0 0 1 0

The first step in such a design is to design a state transition diagram. The following is the design of a Moore FSM that implements this FSM.

We will discuss three different Verilog implementations for this design: a three-process style (recommended), a two-process style and a single-process style. The last two provide a more compact style, but may be a little harder to develop.

Important

Keep the following in mind when writing Verilog for FSM.

Always draw your state transition diagram before coding. The Verilog representation of a state transition diagram is messy and error prone.
Always use symbolic state encoding. The synthesis tools will choose a state encoding for you which will result in the fastest/most compact design.
Always model state transition logic as a case statement. Don’t forget a default assignment.
Always make sure that your state transition logic is complete. Add a default state transition that will bring back the finite state machine to its initial (or another known) state.

Here is a three-process model of the finite state machine. The symbolic states are s0, s1, s2. We have assigned them a specific value for RTL simulation purposes. However, the RTL synthesis tool will choose a state encoding that results in the smallest possible area. Both the next-state block and the output block have default assignments (for the next state and for the output, respectively).

module ex19(
    output logic q,
    input logic i,
    input logic clk,
    input logic reset
);
    logic rq;

    logic [1:0] state, state_next;

    localparam s0 = 2'b00, s1 = 2'b01, s2 = 2'b10;

    always_ff @(posedge clk or posedge reset) begin
        if (reset)
            state <= s0;
        else
            state <= state_next;
    end

    always_comb begin
        state_next = s0;
        case (state)
            s0:
                if (i == 1'b1)
                    state_next = s1;
                else
                    state_next = s0;
            s1:
                if (i == 1'b1)
                    state_next = s2;
                else
                    state_next = s0;
            s2:
                if (i == 1'b1)
                    state_next = s2;
                else
                    state_next = s0;
        endcase
    end

    always_comb begin
        rq = 1'b0;
        case (state)
            s0:  rq = 1'b0;
            s1:  rq = 1'b1;
            s2:  rq = 1'b0;
        endcase
    end

    assign q = rq;
endmodule

This implementation yields a design as shown.

To verify that this is a correct implementation of the spec, we recompute the state transition diagram from the schematic. Note that DFF0 is the rightmost flip-flop.

DFF1	DFF0	i	q	Next DFF1	Next DFF0
0	0	0	0	0	0
0	0	1	0	0	1
0	1	0	1	0	0
0	1	1	1	1	0
0	0	0	0	0	0
0	0	1	0	1	0
1	1	0	0	0	0
1	1	1	0	0	0

From the transition table, we recreate the state transition diagram which closely matches the original specification.

Note, however, that the state 11, which did not appear in the logical state machine, now does occur in the physical implementation. Indeed, there is nothing that prevents both flip-flops from storing 1 at the same time. This would occur, of course, as the result of a fault. The important element is that this illegal state is effectively avoided, because it will transition into state 00.

Finally, we illustrate two more compact notations for this FSM, which are created by collapsing several processes into a single process. All of the following Verilog descriptions map into the same identical netlist. The first transformation is to merge the output encoding with the next-state logic, which is implemented by merging two always blocks into a single one.

module ex20(
    output logic q,
    input logic i,
    input logic clk,
    input logic reset
);
    logic rq;

    logic [1:0] state, state_next;

    localparam s0 = 2'b00, s1 = 2'b01, s2 = 2'b10;

    always_ff @(posedge clk or posedge reset) begin
        if (reset)
            state <= s0;
        else
            state <= state_next;
    end

    always_comb begin
        state_next = s0;
        rq = 1'b0;
        case (state)
            s0: begin
                rq = 1'b0;
                if (i == 1'b1)
                    state_next = s1;
                else
                    state_next = s0;
            end
            s1: begin
                rq = 1'b1;
                if (i == 1'b1)
                    state_next = s2;
                else
                    state_next = s0;
            end
            s2: begin
                rq = 1'b0;
                if (i == 1'b1)
                    state_next = s2;
                else
                    state_next = s0;
            end
        endcase
    end

    assign q = rq;
endmodule

The second transformation is to make it even more compact, by collapsing the synchronous always process that takes care of register update, with the combinational always process. This merging is tricky, since a synchronous always process can only contain assignments to registers. Thus, we are forced to push the output encoding out of the always process into a dataflow statement.

module ex21(
    output logic q,
    input logic i,
    input logic clk,
    input logic reset
);
    logic [1:0] state, state_next;

    localparam s0 = 2'b00, s1 = 2'b01, s2 = 2'b10;

    always_ff @(posedge clk or posedge reset) begin
        if (reset)
            state <= s0;
        else begin
            state <= s0;
            case (state)
                s0: begin
                    if (i == 1'b1)
                        state <= s1;
                    else
                        state <= s0;
                end
                s1: begin
                    if (i == 1'b1)
                        state <= s2;
                    else
                        state <= s0;
                end
                s2: begin
                    if (i == 1'b1)
                        state <= s2;
                    else
                        state <= s0;
                end
            endcase
        end
    end

    assign q = (state == s1);
endmodule