.. ECE 574

.. attention::

   This document was last updated |today|

.. _04synthesis:

RTL Synthesis
=============

.. important::

   The purpose of this lecture is as follows.

   * To describe the role of synthesis in digital hardware design
   * To provide examples of hardware inference in SystemVerilog
   * To experiment with an RTL synthesis tool

.. attention::

   The example discussed in this lecture is available on https://github.com/wpi-ece574-f24/ex-expressions

Synthesis and Optimization
--------------------------

In the introductory lecture, we spoke about the powerful concept of modeling to abstract design activities to a higher level. There are three abstraction levels that are important to us for this lecture.

1. System-level: Captures a behavioral description of a hardware design. Behavior, in this context, doesn't have to be represented as a sequential C program. Think, for example, of Models of Computation, which can express abstract, complex events such as distributed, data-driven processing (dataflow) or interacting parallel processes (process networks).

2. Register-transfer level: Captures a mixed behavioral/structural description of a hardware design in terms of the activities of a single clock cycle, a *Register Transfer*. RTL has been the golden model in hardware design for the past 40 years: it expresses detailed, single-bit operations, while it can also capture complex decision-making logic such as if-then-else statements.

3. Gate-level: Captures a structural description of a hardware design in terms of low-level primitives such as a gate or a standard cell. Most of our examples have been using SKY130 standard cells. The gate-level abstraction is technology specific and therefore very well suited for performance assessments such as area, critical path delay, and power.

A fourth abstraction level, the transistor-level, is of great importance to analog and mixed-signal design, but it is less commonly used in digital hardware design.
We will focus on the first three.

Defining Synthesis and Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The design process of a digital hardware design requires a systematic mapping and refinement from higher abstraction levels to lower abstraction levels. *Synthesis* and *Optimization* play a critical role in this process.

.. attention::

   *Synthesis* in digital hardware design involves methods that offer a systematic (and often automated) translation from a higher abstraction level to a lower abstraction level. The appeal of synthesis is that, once it's automated, it can claim to create designs that are *correct by construction*.

   *Optimization* in digital hardware design involves methods that transform a given abstraction level into a more optimal representation at the same abstraction level. The optimality criterion depends on the objective of the optimization. An important observation is that the abstraction level does not change.

In digital hardware design, there is Synthesis *and* Optimization at any abstraction level, as the following examples illustrate.

1. System-level synthesis will generate RTL automatically from a system-level description. For example, we can expand a dataflow diagram automatically into an RTL description.

   System-level optimization will transform a system-level description into a different, more optimal one. For example, we could transform a process network by instantiating multiple parallel copies of a process to increase the processing parallelism.

2. RTL synthesis is used to create a gate-level netlist automatically from an RTL description. For example, we can convert a Finite State Machine description into a gate-level netlist by selecting a state encoding and performing logic synthesis for each state transition.

   RTL optimization is used to create better designs. For example, we can optimize an FSM at the RTL by trying to reduce the number of states it uses. The optimization knob of RTL lies in non-technological factors, such as removing a clock cycle from a schedule, sharing an operator, rewriting a complex expression, and so on.

3. Gate-level synthesis is used to convert a gate-level netlist into a layout (i.e., a transistor netlist with spatial dimensions). The term *gate-level synthesis* is not used in practice; instead, terms like place-and-route and clock-tree-synthesis are used.

   Gate-level optimization is used to improve the performance of the design, and automatic retiming would serve as a good example. The optimization knob of gate-level optimization is technological, and involves transformations that reduce area, critical path, power, and so forth.

The Hardware Design Flow
^^^^^^^^^^^^^^^^^^^^^^^^

We can thus define a hardware design flow that combines optimization and synthesis, as shown in the next figure. The flow is generic, as it tries to cover many cases of hardware design. At each level, a design goes through multiple iterations of optimization, and finally expands towards a lower level through synthesis. These steps are largely automated at the lower layers of abstraction, while they remain largely manual at the higher layers of abstraction. One exception would be *high-level synthesis*, which can map a subset of system-level behaviors into RTL.

The verification flow uses simulation or symbolic (formal) techniques to check the correctness of a design. Ideally, one would use a single verification technique throughout the design, but in practice only RTL and gate-level verification are well integrated in simulation.

.. figure:: img/flow.jpg
   :figwidth: 400px
   :align: center

The first few steps in system-level synthesis and optimization are, in general, extremely hard, and they require lots of design experience.
Just consider the following problem as an illustration: given a set of design constraints (such as type of computations and real-time performance), how do you choose between an FPGA, a programmable processor, a customized processor (e.g., a DSP), a multi-core design, or a fully dedicated ASIC? In a few extreme cases, the choice may be obvious, but in general, you will find that you can always make it work with any of these components, or a combination of them. Yet, that is the choice faced by a system designer. We will revisit this problem later in the course.

Register Transfer Level Design in Verilog
-----------------------------------------

We discussed RTL modeling in our previous hands-on assignments. In the following, we will highlight some of the key ideas and point out various pitfalls of RTL coding.

A first important observation is that hardware synthesis from RTL uses two distinct mechanisms.

* Hardware *instantiation* is used when a lower-level primitive is referenced structurally in the higher-level design. Hardware instantiation is seen, for example, in the gate-level netlists produced during RTL synthesis. The following example instantiates a cell of type ``sky130_fd_sc_hd__dfxtp_1``, a flip-flop of the standard cell library, under the instance name ``_2_``. There is nothing special about instantiation, but it is often used when complex primitives such as RAM modules must be integrated directly into the design.

  .. code::

     sky130_fd_sc_hd__dfxtp_1 _2_ (
       .CLK(clk),
       .D(_0_),
       .Q(q)
     );

* Hardware *inference* is used when a lower-level primitive will replace an RTL expression or construction. Hardware inference is a selective process, and not every Verilog construction has a corresponding hardware inference. This is an important observation, as it implies that *not every Verilog construction in simulation will also lead to hardware*. In fact, we will see examples where the hardware inference can lead to an implementation that differs from the simulation.
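The two mechanisms can be contrasted on a single flip-flop. The following is a minimal sketch rather than code from the example repository: the module names ``dff_inferred`` and ``dff_instantiated`` are hypothetical, and the second module assumes that a Verilog model of the ``sky130_fd_sc_hd__dfxtp_1`` cell is available to the tool.

.. code::

   // Inference: the RTL describes the behavior of a register;
   // the synthesis tool selects a suitable flip-flop cell.
   module dff_inferred(output logic q,
                       input  logic d,
                       input  logic clk);
     always_ff @(posedge clk)
       q <= d;
   endmodule

   // Instantiation: the designer names the library cell explicitly.
   // (assumes the SKY130 cell model is visible to the tool)
   module dff_instantiated(output logic q,
                           input  logic d,
                           input  logic clk);
     sky130_fd_sc_hd__dfxtp_1 ff0 (
       .CLK(clk),
       .D(d),
       .Q(q)
     );
   endmodule

The inferred version stays portable across technologies, while the instantiated version is tied to one cell library.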
RTL Coding Guidelines
^^^^^^^^^^^^^^^^^^^^^

RTL coding guidelines are vendor-specific; constructions that are supported by one vendor (or technology) may not be supported by another vendor. Also, the target technology may be slightly different from one vendor to the next (FPGA versus standard cells, for example). Therefore, it is crucial to rely on vendor documentation for details. Several links to such *RTL Coding Guidelines* are listed at the top of this lecture among the Reading links.

.. important::

   The fundamental insight of effective RTL coding is the following. It's not helpful to think of RTL synthesis as a magical box that converts RTL code into a gate netlist. That strategy doesn't work well on large systems. Sooner or later, you'll end up with a buggy design (e.g., a design that mixes latches with flip-flops) and it'll be hard to explain what went wrong.

   The right way of thinking about RTL inference is to think about your hardware implementation *first*, and then map that back to RTL code. Thus, you use RTL just as a shorthand for something that, at least in your mind, is already very structured. Hardware design is fundamentally a bottom-up process.

Visualizing the Synthesis Result
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We will make use of the Cadence Genus tool for RTL synthesis. Refer to Canvas for a set of slides that discuss the typical flow in Cadence. Here, we will focus on the application of Genus to a small example.

.. attention::

   You can download the code examples on the class design server using git.

   ``git clone git@github.com:wpi-ece574-f24/ex-expressions.git``

   Once you download the code, inspect the following two files first: ``syn/genus_script.tcl`` and ``constraints/constraints_comb.sdc``

To synthesize the example code, run genus with the graphical user interface option enabled (``genus -gui``). Then, follow along below as we synthesize each example.

There are three levels that are relevant for visualization.

1. *RTL.* After first parsing a Verilog file, the RTL is broken down into fundamental statements and register transfers. While this is not quite a netlist yet, we can represent the RTL operations as nodes in a graph, with edges carrying dependencies between variables.

2. *Generic Gates.* After synthesis, the RTL is converted into logic gates. Most RTL synthesis tools, including ``yosys``, map the RTL to a *generic* technology, meaning a technology of gates that is not tied to a specific feature size.

3. *Technology Gates.* After technology mapping, the netlist of generic logic gates is converted into concrete cells from a standard cell library.

Let's look at the following example of a basic XOR gate.

.. code::

   module ex1(output logic q, input logic a, input logic b);
     assign q = (a & ~b) | (~a & b);
   endmodule

To synthesize this code, run the following commands in the genus prompt. These commands can also be found in the ``genus_script.tcl`` file.

.. code::

   # these are initialization commands that have to be run once.
   # we will run synthesis against cadence 45nm standard cells
   set_db init_lib_search_path /opt/cadence/libraries/gsclib045_all_v4.7/gsclib045/timing/
   read_libs slow_vdd1v0_basicCells.lib

   # this sets the 'effort' of the synthesis tool: how hard it tries to find a good solution
   set_db syn_generic_effort medium
   set_db syn_map_effort medium
   set_db syn_opt_effort medium

Now we are ready to read in the code and synthesize it.

.. code::

   # read in RTL of the first example
   read_hdl -language sv ../rtl/ex1.sv
   elaborate

After these commands, the tool has constructed an RTL schematic of the code. You can visualize the RTL schematic in the GUI using 'show schematic'.

.. figure:: img/genus_ex1_rtl.png
   :align: center
   :scale: 100%

   RTL netlist of ex1

.. code::

   # map to generic gates
   set_top_module ex1
   read_sdc ../constraints/constraints_comb.sdc
   syn_generic

Next, we map this code to generic gates.
The first two commands, ``set_top_module`` and ``read_sdc``, do housekeeping by selecting the top module for synthesis as well as the synthesis constraints. ``syn_generic`` does the actual synthesis work.

We will discuss the construction of synthesis constraints in more detail when we discuss timing. For now, it's sufficient to understand that these constraints are used to express the desired speed of the design. In this case, we design the implementation to operate at 100 MHz.

.. code::

   # contents of constraints_comb.sdc
   set non_clock_inputs [all_inputs]
   set_input_delay 0 -clock clk $non_clock_inputs
   set_output_delay 0 -clock clk [all_outputs]

After ``syn_generic`` completes, you'll notice that the schematic looks different. The generic schematic uses fewer cells, presumably because these generic gates can implement a more sophisticated logic mapping.

The following steps map the generic netlist to 45nm standard cells, and finally to an optimized netlist of 45nm standard cells.

.. code::

   syn_map   # map to a concrete technology
   syn_opt   # optimize the implementation

After ``syn_opt`` completes, you'll notice the resulting design consists of a single XOR cell. Because of the simplicity of this design, ``syn_opt`` has no practical impact: the final netlist is the same as the mapped netlist.

.. figure:: img/genus_ex1_45.png
   :align: center
   :scale: 100%

   CDS45 netlist of ex1

Genus allows you to query various properties of the design. The ``report`` commands, in particular, return many useful statistics. For example, ``report_area`` shows the active area (standard cell area) of the final design. ``report_timing`` shows the speed (critical path) of the final design.

.. code::

   @genus:design:ex1 29> report_area
   ============================================================
     Generated by:           Genus(TM) Synthesis Solution 21.19-s055_1
     Generated on:           Sep 14 2024  10:54:49 am
     Module:                 ex1
     Technology library:     slow_vdd1v0 1.0
     Operating conditions:   PVT_0P9V_125C (balanced_tree)
     Wireload mode:          enclosed
     Area mode:              timing library
   ============================================================

   Instance Module  Cell Count  Cell Area  Net Area   Total Area  Wireload
   --------------------------------------------------------------------------
   ex1                       1      2.736     0.000        2.736  (D)

.. code::

   @genus:design:ex1 30> report_timing
   ============================================================
     Generated by:           Genus(TM) Synthesis Solution 21.19-s055_1
     Generated on:           Sep 14 2024  10:56:02 am
     Module:                 ex1
     Operating conditions:   PVT_0P9V_125C (balanced_tree)
     Wireload mode:          enclosed
     Area mode:              timing library
   ============================================================

   Path 1: MET (9844 ps) Late External Delay Assertion at pin q
             Group: clk
        Startpoint: (F) b
             Clock: (R) clk
          Endpoint: (R) q
             Clock: (R) clk

                        Capture       Launch
           Clock Edge:+   10000            0
           Drv Adjust:+       0            0
          Src Latency:+       0            0
          Net Latency:+       0 (I)        0 (I)
              Arrival:=   10000            0

         Output Delay:-       0
        Required Time:=   10000
         Launch Clock:-       0
          Input Delay:-       0
            Data Path:-     156
                Slack:=    9844

   Exceptions/Constraints:
     input_delay              0            constraints_comb.sdc_line_6_1_1
     output_delay             0            constraints_comb.sdc_line_7

   #---------------------------------------------------------------------------------------
   # Timing Point   Flags   Arc   Edge   Cell     Fanout  Load Trans Delay Arrival Instance
   #                                                      (fF)  (ps)  (ps)   (ps)  Location
   #---------------------------------------------------------------------------------------
     b              -       -     F      (arrival)     1   0.2     0     0       0    (-,-)
     g18__2398/Y    -       A->Y  R      XOR2XL        1   0.0    15   156     156    (-,-)
     q              -       -     R      (port)        -     -     -     0     156    (-,-)
   #---------------------------------------------------------------------------------------

We will discuss the meaning of these reports in detail in later lectures.
For the time being, we will focus on three important quality metrics.

* *area* The active area of all standard cells in the design. In this case, the active area is 2.736 square micron.
* *cell count* The number of cells in the design. In this case, the cell count is 1.
* *slack* The margin of the design compared to the clock period. In this case, the clock period is 10000 ps, and the computation is ready at 156 ps. Therefore, the slack is 10000 - 156 = 9844 ps, or 9.844 ns. Slack must not be negative; when slack is negative, your design is too slow and does not meet the timing constraints.

In the remainder of this demonstration, we focus only on the schematics and the link from SystemVerilog to gate-level netlist.

Complex Combinational Logic
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A good way of creating complex combinational logic is to write expressions.

.. code::

   module ex2(output logic [3:0] q, input logic [1:0] a, b, c);
     assign q = a + b * c;
   endmodule

This directly maps into a multiplier followed by an adder. Note how the expression ``b*c`` has a wordlength of twice the input, as we would expect of a multiplier. The wordlength is kept very short to keep the resulting gate-level netlist compact enough for display.

.. figure:: img/genus_ex2_rtl.png
   :width: 75%

   RTL netlist of ex2

The gate-level netlist graphic uses a black bar to ungroup wires from a bus. The following is the synthesis result: a 2x2 multiplier (with 4-bit output), to which a 2-bit number is added.

.. figure:: img/genus_ex2_45.png
   :width: 75%

   CDS45 netlist of ex2

.. attention::

   In case you wonder about the functionality of a standard cell with an odd or unknown name, you'd typically look into (a) the standard cell library databook, or (b) the functional Verilog view of the standard cell. For example, for ``OAI2BB1X1``, you can find a functional Verilog view in /opt/cadence/libraries/gsclib045_all_v4.7/gsclib045/verilog/slow_vdd1v0_basicCells.v on the class design server.
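Returning to the wordlengths in ``ex2``: the sketch below spells out the intermediate product explicitly. It is an illustration under assumptions, not part of the example repository; the module name ``ex2_explicit`` and the signal ``prod`` are hypothetical. It is intended to describe the same function as ``ex2``.

.. code::

   module ex2_explicit(output logic [3:0] q, input logic [1:0] a, b, c);
     // A 2-bit by 2-bit unsigned product needs 4 bits (maximum 3 * 3 = 9).
     logic [3:0] prod;
     assign prod = b * c;
     // The largest sum is 3 + 9 = 12, which still fits in the 4-bit output.
     assign q = a + prod;
   endmodule

Writing the intermediate signal with its width makes it easier to check that no bits are lost before looking at the gate-level netlist.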
RTL synthesis will use a host of tricks to make the resulting netlist as small as possible. One of these is *constant propagation*, where the constant inputs to a gate are used to simplify the netlist. For example, if we hardcode one input of the multiplier to the value 2, the resulting netlist can be drastically simplified. The gate-level netlist implements the following design as two half-adder standard cells.

.. code::

   module ex3(output logic [3:0] q, input logic [1:0] a, b);
     assign q = a + b * 2;
   endmodule

.. figure:: img/genus_ex3_45.png
   :width: 75%

   CDS45 netlist of ex3

Keep in mind that some Verilog operators are very expensive to implement. For example, the innocuous 'modulo' operator in Verilog becomes a large implementation, even at a short wordlength. Again, observe that for certain constant values of ``b`` (namely, at every power of 2), the modulo operation becomes trivial to implement, because it only selects the low-order bits of ``a``. The complexity of the design stems from the fact that we want to compute the modulo operation for *any* value that a four-bit number ``b`` can take. We will have a separate lecture where we discuss the implementation of arithmetic in detail.

.. code::

   module ex4(output logic [3:0] q, input logic [3:0] a, b);
     assign q = a % b;
   endmodule

.. figure:: img/genus_ex4_45.png
   :width: 75%

   CDS45 netlist of ex4

To map combinational logic in a procedural manner, always use an ``always_comb`` block. The ``initial`` block is not supported for this purpose, although some synthesis tools (such as those for FPGA) may refer to the ``initial`` block to determine the initial value of flip-flops. For combinational logic, ``initial`` remains unused. For example, the following piece of RTL will result in a pure inverter.

.. code::

   module ex5(output logic q, input logic a);
     initial q = 1;
     always_comb begin
       q = ~a;
     end
   endmodule

Another important point of concern with combinational logic captured using plain ``always`` blocks is that the process sensitivity list must be complete.

.. code::

   module ex6(output logic q, input logic a, b);
     logic t;
     always @(a) begin
       t = a ^ 1;
       q = b ^ t;
     end
   endmodule

This module maps into an XNOR gate. However, looking closely at the Verilog, this is not what was specified. If ``b`` changes, the ``always`` block will not evaluate, thereby leaving ``q`` unmodified. Hence, in this case, the resulting netlist is *different* from the Verilog specification.

.. figure:: img/genus_ex6_45.png
   :width: 75%

   CDS45 netlist of ex6

There is an easy workaround that makes sure this problem is always avoided: use a wildcard for the sensitivity list of a combinational process (``always @*``), or just use ``always_comb``, which takes no sensitivity list. In other words, write the previous example as follows. This will have no impact on the synthesis outcome. However, it will make sure that the RTL behaves identically to the gate-level netlist during simulation.

.. code::

   module ex6(output logic q, input logic a, b);
     logic t;
     always_comb begin
       t = a ^ 1;
       q = b ^ t;
     end
   endmodule

Priority Logic
^^^^^^^^^^^^^^

Priority logic is used in expressing conditions. Verilog expresses procedural priority logic using ``if`` statements or else using ``case`` statements. Both of these require special attention.

Consider the following design. A two-bit vector ``a`` selects either the value of ``b`` or its complement.

.. code::

   module ex7(output logic q, input logic [1:0] a, input logic b);
     always_comb begin
       if (a[1])
         q = b;
       else if (a[0])
         q = ~b;
     end
   endmodule

The logic generated from this code contains a latch, which is driven by the bits of ``a``.

.. figure:: img/genus_ex7_45.png
   :width: 75%

   CDS45 netlist of ex7

There is a problem with the Verilog description. The netlist includes a latch as the cell ``sky130_fd_sc_hd__dlxtn_1`` (we intend to describe combinational logic, not stateful logic!).
That problem is solved through a *default assignment*, a value that will be assigned to ``q`` when a combination of the input values does not evaluate to a concrete update for ``q``. In this case, when ``a`` is ``00``, the value of ``q`` is left unspecified.

.. code::

   module ex8(output logic q, input logic [1:0] a, input logic b);
     always_comb begin
       q = 0;
       if (a[1])
         q = b;
       else if (a[0])
         q = ~b;
     end
   endmodule

This generates the following logic. The latch has disappeared.

.. figure:: img/genus_ex8_45.png
   :width: 75%

   CDS45 netlist of ex8

Note also the impact of the priority logic. In a nested ``if-then`` case, the outer ``if`` receives priority. In the example above, bit ``a[1]`` has priority over ``a[0]``, meaning that if the value of ``a`` is ``11``, then the function of this module is to copy ``b``.

We can swap the nesting of the if-then-else statement if we need a different priority. In the following example, bit ``a[0]`` has priority over ``a[1]``. When ``a`` is ``11``, the function of this module is to copy the complement of ``b``.

.. code::

   module ex9(output logic q, input logic [1:0] a, input logic b);
     always_comb begin
       q = 0;
       if (a[0])
         q = ~b;
       else if (a[1])
         q = b;
     end
   endmodule

.. figure:: img/genus_ex9_45.png
   :width: 75%

   CDS45 netlist of ex9

Expressing priorities with ``case`` statements is harder than with ``if`` statements, because a ``case`` statement tests only a single condition for all the branches. However, a ``case`` statement is better at handling complex logic, such as the next-state logic of finite state machines.

A ``case`` statement has the same issue as ``if`` statements and can infer latches if a case condition is left uncovered. The use of a ``default`` branch ensures that the case logic is always complete.

.. code::

   module ex10(output logic q, input logic [1:0] a, input logic b);
     always_comb begin
       case (a)
         1: q = ~b;
         2: q = b;
         default: q = 0;
       endcase
     end
   endmodule

.. figure:: img/genus_ex10_45.png
   :width: 75%

   CDS45 netlist of ex10

Loops
^^^^^

Verilog supports the use of loops in procedural blocks. These loops do not describe sequential control in hardware: they have *no meaning* for synthesis and will be expanded as the Verilog is translated to gates. If you want to implement a control construction, you will have to build it using a finite state machine.

The following example shows how to compute the Hamming weight of a 5-bit number.

.. code::

   module ex11(output logic [2:0] q, input logic [4:0] a);
     logic [2:0] tmp;
     integer i;
     always_comb begin
       tmp = 0;
       for (i = 0; i < 5; i = i + 1)
         tmp = tmp + a[i];
       q = tmp;
     end
   endmodule

This module results in pure combinational logic. The compiler will *unroll* the loop before RTL synthesis. Note also that we can assign and reassign the value of ``tmp``. However, for synthesis, all the iterations in a loop execute instantly. ``tmp`` does not play the role of a hardware storage element.

.. figure:: img/genus_ex11_45.png
   :width: 75%

   CDS45 netlist of ex11

For example, the following modules, ``ex12`` and ``ex13``, will generate the same gates as ``ex11``.

.. code::

   module ex12(output logic [2:0] q, input logic [4:0] a);
     logic [2:0] tmp;
     integer i;
     always_comb begin
       tmp = 0;
       tmp = tmp + a[0];
       tmp = tmp + a[1];
       tmp = tmp + a[2];
       tmp = tmp + a[3];
       tmp = tmp + a[4];
       q = tmp;
     end
   endmodule

.. code::

   module ex13(output logic [2:0] q, input logic [4:0] a);
     always_comb begin
       q = 0 + a[0] + a[1] + a[2] + a[3] + a[4];
     end
   endmodule

Flip-flops
^^^^^^^^^^

Hardware registers are modeled in a separate ``always`` block which is driven by the clock signal. The assignment to flip-flops always uses a *non-blocking* assignment. A non-blocking assignment is necessary to allow simultaneous assignments on all flip-flops.

The following example shows two flip-flops that are connected back-to-back. They use an *asynchronous* reset.

.. code::

   module ex14(
     input  wire d,
     input  wire reset,
     input  wire clk,
     output wire q1,
     output wire q2
   );
     logic q1r, q2r;

     always_ff @(posedge clk or posedge reset) begin
       if (reset) begin
         q1r <= 1'b0;
         q2r <= 1'b0;
       end else begin
         q1r <= q2r;
         q2r <= q1r ^ d;
       end
     end

     assign q1 = q1r;
     assign q2 = q2r;
   endmodule

.. figure:: img/genus_ex14_45.png
   :width: 75%

   CDS45 netlist of ex14

A flip-flop with synchronous reset is achieved by a simple change to the process sensitivity list: the reset signal is removed from it, so that the reset is only evaluated at the clock edge.

.. code::

   module ex15(
     input  logic d,
     input  logic reset,
     input  logic clk,
     output logic q1,
     output logic q2
   );
     logic q1r, q2r;

     always_ff @(posedge clk) begin
       if (reset) begin
         q1r <= 1'b0;
         q2r <= 1'b0;
       end else begin
         q1r <= q2r;
         q2r <= q1r ^ d;
       end
     end

     assign q1 = q1r;
     assign q2 = q2r;
   endmodule

.. figure:: img/genus_ex15_45.png
   :width: 75%

   CDS45 netlist of ex15

While *asynchronous reset* inputs are a common feature on flip-flop cells in standard cell libraries, *synchronous resets* are usually realized by multiplexing additional logic at the flip-flop inputs.

Gated clocks on flip-flops must be handled with caution. It's not a good idea to directly manipulate the clock as follows. This will result in difficult skew problems.

.. code::

   module ex16(
     output logic q,
     input  logic enable,
     input  logic clk,
     input  logic d
   );
     logic qr;
     wire gatedclk;

     // An example of how NOT to gate a clock!
     assign gatedclk = clk & enable;

     always_ff @(posedge gatedclk) begin
       qr <= d;
     end

     assign q = qr;
   endmodule

.. figure:: img/genus_ex16_45.png
   :width: 75%

   CDS45 netlist of ex16

A better way to implement a gated clock is to add an enable signal inside of the synchronous update as follows. This design will ensure that each flip-flop receives a primitive clock signal.

.. code::

   module ex17(
     output logic q,
     input  logic enable,
     input  logic clk,
     input  logic d
   );
     logic qr;

     always_ff @(posedge clk) begin
       if (enable)
         qr <= d;
     end

     assign q = qr;
   endmodule

.. figure:: img/genus_ex17_45.png
   :width: 75%

   CDS45 netlist of ex17

Finite State Machines
^^^^^^^^^^^^^^^^^^^^^

We already briefly discussed the design of finite state machines. We will discuss the implementation aspects by means of an example.

Let's create an FSM that reencodes a bitstring as follows. The first 1 in a string of 1s is passed as a 1. Otherwise, the output is 0. The following shows an example input string with its corresponding output string.

.. code::

   input string:  1 1 1 0 0 1 0 0 1 1 1 0 1 1
   output string: 1 0 0 0 0 1 0 0 1 0 0 0 1 0

The first step in such a design is to draw a state transition diagram. The following is the design of a Moore FSM that implements this specification.

.. figure:: img/fsm.jpg
   :width: 75%

We will discuss three different Verilog implementations for this design: a three-process style (recommended), a two-process style and a single-process style. The last two provide a more compact style, but may be a little harder to develop.

.. important::

   Keep the following in mind when writing Verilog for an FSM.

   * Always draw your state transition diagram before coding. The Verilog representation of a state transition diagram is messy and error prone.
   * Always use symbolic state encoding. The synthesis tools will choose a state encoding for you which will result in the fastest/most compact design.
   * Always model state transition logic as a case statement. Don't forget a default assignment.
   * Always make sure that your state transition logic is complete. Add a default state transition that will bring the finite state machine back to its initial (or another known) state.

Here is a three-process model of the finite state machine. The symbolic states are ``s0``, ``s1``, ``s2``. We have assigned them a specific value for RTL simulation purposes. However, the RTL synthesis tool will choose a state encoding that results in the smallest possible area.
Both the next-state block and the output block have default assignments (for the next state and for the output, respectively).

.. code::

   module ex19(
     output logic q,
     input  logic i,
     input  logic clk,
     input  logic reset
   );
     logic rq;
     logic [1:0] state, state_next;
     localparam s0 = 2'b00, s1 = 2'b01, s2 = 2'b10;

     always_ff @(posedge clk or posedge reset) begin
       if (reset)
         state <= s0;
       else
         state <= state_next;
     end

     always_comb begin
       state_next = s0;
       case (state)
         s0: if (i == 1'b1) state_next = s1; else state_next = s0;
         s1: if (i == 1'b1) state_next = s2; else state_next = s0;
         s2: if (i == 1'b1) state_next = s2; else state_next = s0;
       endcase
     end

     always_comb begin
       rq = 1'b0;
       case (state)
         s0: rq = 1'b0;
         s1: rq = 1'b1;
         s2: rq = 1'b0;
       endcase
     end

     assign q = rq;
   endmodule

This implementation yields a design as shown.

.. figure:: img/genus_ex19_45.png
   :width: 75%

To verify that this is a correct implementation of the spec, we recompute the state transition diagram from the schematic. Note that DFF0 is the rightmost flip-flop.
+-----------+------------+-----------+----------+-----------+------------+
| DFF1      | DFF0       | i         | q        | Next DFF1 | Next DFF0  |
+===========+============+===========+==========+===========+============+
| 0         | 0          | 0         | 0        | 0         | 0          |
+-----------+------------+-----------+----------+-----------+------------+
| 0         | 0          | 1         | 0        | 0         | 1          |
+-----------+------------+-----------+----------+-----------+------------+
| 0         | 1          | 0         | 1        | 0         | 0          |
+-----------+------------+-----------+----------+-----------+------------+
| 0         | 1          | 1         | 1        | 1         | 0          |
+-----------+------------+-----------+----------+-----------+------------+
| 1         | 0          | 0         | 0        | 0         | 0          |
+-----------+------------+-----------+----------+-----------+------------+
| 1         | 0          | 1         | 0        | 1         | 0          |
+-----------+------------+-----------+----------+-----------+------------+
| 1         | 1          | 0         | 0        | 0         | 0          |
+-----------+------------+-----------+----------+-----------+------------+
| 1         | 1          | 1         | 0        | 0         | 0          |
+-----------+------------+-----------+----------+-----------+------------+

From the transition table, we recreate the state transition diagram, which closely matches the original specification. Note, however, that the state ``11``, which did not appear in the logical state machine, does occur in the physical implementation. Indeed, there is nothing that prevents both flip-flops from storing 1 at the same time. This would occur, of course, as the result of a fault. The important element is that this illegal state is effectively recovered from, because it will transition into state ``00``.

Finally, we illustrate two more compact notations for this FSM, which are created by collapsing several processes into a single process. All of the following Verilog descriptions map into the same netlist.

The first transformation is to merge the output encoding with the next-state logic, which is implemented by merging the two ``always_comb`` blocks into a single one.

.. code::

   module ex20(
     output logic q,
     input  logic i,
     input  logic clk,
     input  logic reset
   );
     logic rq;
     logic [1:0] state, state_next;
     localparam s0 = 2'b00, s1 = 2'b01, s2 = 2'b10;

     always_ff @(posedge clk or posedge reset) begin
       if (reset)
         state <= s0;
       else
         state <= state_next;
     end

     always_comb begin
       state_next = s0;
       rq = 1'b0;
       case (state)
         s0: begin
           rq = 1'b0;
           if (i == 1'b1) state_next = s1; else state_next = s0;
         end
         s1: begin
           rq = 1'b1;
           if (i == 1'b1) state_next = s2; else state_next = s0;
         end
         s2: begin
           rq = 1'b0;
           if (i == 1'b1) state_next = s2; else state_next = s0;
         end
       endcase
     end

     assign q = rq;
   endmodule

The second transformation makes the description even more compact, by collapsing the synchronous ``always_ff`` process that takes care of the register update with the combinational ``always_comb`` process. This merging is tricky, since a synchronous ``always_ff`` process can only contain assignments to registers. Thus, we are forced to push the output encoding out of the process into a dataflow statement.

.. code::

   module ex21(
     output logic q,
     input  logic i,
     input  logic clk,
     input  logic reset
   );
     logic [1:0] state, state_next;
     localparam s0 = 2'b00, s1 = 2'b01, s2 = 2'b10;

     always_ff @(posedge clk or posedge reset) begin
       if (reset)
         state <= s0;
       else begin
         state <= s0;
         case (state)
           s0: begin
             if (i == 1'b1) state <= s1; else state <= s0;
           end
           s1: begin
             if (i == 1'b1) state <= s2; else state <= s0;
           end
           s2: begin
             if (i == 1'b1) state <= s2; else state <= s0;
           end
         endcase
       end
     end

     assign q = (state == s1);
   endmodule
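A quick way to gain confidence that these FSM descriptions match the specification is to replay the example input string from the beginning of this section and compare against the example output string. The following self-checking testbench is a minimal sketch and not part of the example repository: the module name ``tb_ex19``, the clock period, and the reset timing are assumptions. It also assumes that, because this is a Moore machine, the output for a given input bit is observed just after the clock edge that samples that bit.

.. code::

   module tb_ex19;
     localparam int N = 14;

     // Example strings from the text; index 0 is the first (leftmost) bit.
     logic [0:N-1] stim     = 14'b11100100111011;
     logic [0:N-1] expected = 14'b10000100100010;

     logic clk = 1'b0;
     logic reset, i, q;
     int   errors = 0;

     // Device under test; ex20 or ex21 can be substituted here.
     ex19 dut(.q(q), .i(i), .clk(clk), .reset(reset));

     // Free-running clock with a period of 10 time units (assumed).
     always #5 clk = ~clk;

     initial begin
       reset = 1'b1;
       i     = 1'b0;
       #12 reset = 1'b0;               // release reset between clock edges
       for (int k = 0; k < N; k++) begin
         @(negedge clk) i = stim[k];   // drive the input away from the rising edge
         @(posedge clk);               // the state register updates here
         #1;                           // let the Moore output settle
         if (q !== expected[k]) begin
           errors++;
           $display("bit %0d: i=%b q=%b expected=%b", k, stim[k], q, expected[k]);
         end
       end
       $display("%0d mismatches", errors);
       $finish;
     end
   endmodule

If the design follows the state transition diagram, the testbench should report no mismatches. Because the three descriptions are intended to be equivalent, the same testbench can be reused for each of them by changing only the instantiated module name.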