.. ECE 574

.. attention::

   This document was last updated |today|
   
.. _project:

Project: A hardware-accelerated MAC function
============================================

.. contents::

This course includes a design project that follows the ASIC design
flow we studied. You will complete the project in your
designated/chosen team. To find out which team you belong to, go to
Canvas, and check under 'People' and then 'Team'. Every team member
will share the same grade for the project. Team Size will not
influence the grade that a project can attain.

The project target is to build a hardware acceleration of a reference
implementation of Poly1305, described in `IETF RFC 7539
<https://datatracker.ietf.org/doc/html/rfc7539>`_. This algorithm
implements a *Message Authentication Code*. Given a message, such as
*Cryptographic Forum Research Group*, and a 128-bit secret key, such
as
85:d6:be:78:57:55:6d:33:7f:44:52:fe:42:d5:06:a8:01:03:80:8a:fb:0d:b2:fd:4a:bf:f6:af:41:49:f5:1b,
Poly1305 computes a 64-bit *tag*. For the example given the tag is
a8:06:1d:c1:30:51:36:c6:c2:2b:8b:af:0c:01:27:a9. The tag authenticates
the message, which means that the tag demonstrates that it must be
produced by someone who knows the message and the secret key. MAC
codes are used in a wide variety of internet security applications,
but also for example in electronic door lock applications. A door lock
emits a challenge (a random message), and the wireless key responds
with the MAC of the message. If both the door lock and the key are
programmed with a matching secret key, the door lock can now verify if
the received tag matches the tag that it computes locally.

The project's objective will be for you to implement a hardware
version of the Poly1305 algorithm that can be integrated in an
SoC. Your starting point will be a reference implementation of an SoC
that implements the Poly1305 algorithm in hardware. Hence, the focus
of the design will be on thinking about hardware strategies to map
(and accelerate) Poly1305 as a coprocessor.

The project spans four phases. In each phase, you have to prepare a
small report or presentation and meet 1-1 with the instructor. Each
phase will be graded separately, with the final phase culminating in a
class presentation of your work. As there are a large number of teams
(12), your time to report your results in 1-1 sessions is
limited. Please come prepared!


Project Timeline
----------------

The project runs over 5 weeks (including Thanksgiving week).

+--------------------+-------------------------+-----------------------------+-----------------+
| Week               | Topic                   |         Deliverable         | Location        |
+====================+=========================+=============================+=================+
| 13 November        | Project Definition      | Team Meeting Presentation   | AK 301 or Video |
+--------------------+-------------------------+-----------------------------+-----------------+
| 20 November        | Project Design          | Team Meeting Presentation   | AK 301 or Video |
+--------------------+-------------------------+-----------------------------+-----------------+
| 4 December         | Project Implementation  | Team Meeting Presentation   | AK 301 or Video |
+--------------------+-------------------------+-----------------------------+-----------------+
| 11 December        | Project Presentation    | Class Presentation          | AK 232 or Video |
+--------------------+-------------------------+-----------------------------+-----------------+

Project meetings will be between team members and the instructor in AK
301 during the initial stages of the project. Project meetings
allocate 15 minutes per team. The project members prepare for the
meeting by means a a short presentation to explain their plans and/or
progress. The instructor will ask for clarification or enhancements
where needed. The team members will have a chance to explain and
motivate their plans.

For students in the **synchronous** version of the course, the
meetings will be scheduled in AK 301 during class time on
Wednesdays. The meeting schedule will be posted on Canvas, and
students will be able to sign up for a slot according to their
preference.

Students in the **asynchronous** version of the course must prepare a
video presentation and share it with the instructor. Students will
receive feedback by email. Alternately, online students may sign up
for a synchronous zoom slot during class time on Wednesdays. In the
latter case, the meeting schedule will be posted on Canvas, and
students will be able to sign up for a slot according to their
preference.

The final lecture will consist of an in-class presentation by each
team in the **synchronous** version of the course. Project
presentations must include a formal presentation with slides, and will
be made in a conference style format with moderated Q&A. Students in
the **asynchronous** version of the course must prepare a video
presentation, but will have a chance to present their work in the
final synchronous lecture as well.


Reference Implementation
------------------------

.. attention::
   
   To understand the following, it may be useful to revisit the :ref:`02systemonchip` lecture.
   To start, you will need to accept the final assignment to receive the reference implementation
   at the `following Github Classroom link <https://classroom.github.com/a/B6SxDiIU>`_.
   This will give you a repository final-project-userid which you can clone to the
   class server.

The reference implementation is an IBEX based SoC with a 32-bit RISCV
core, 2MB of RAM, a UART, a timer, a GPIO port, and SPI. The following are
key design files that you should study:

* ``custom/rtl/top_rtl.sv`` Chip top-level of the SoC. The SoC only uses clock, reset, a GPIO input and a GPIO output port.

* ``custom/rtl/custom_ibex.sv`` SoC system architecture that instantiates the IBEX, a bus, and various peripherals.

* ``custom/rtl/myreg.sv`` Example of a custom user-defined peripheral, in this case a collection of two registers which can be read from/ written to from software running on IBEX.

* ``Makefile`` to drive the design process: compiling software, the simulation, the hardware, and so on. 

* ``sw/c/demo/poly1305`` Reference software implementation of the Poly1305 algorithm.

This initial repository will emphasize RTL design and simulation using
Verilator. In later stages of the project, we will introduce a
hardware synthesis flow to map the hardware to a standard cell
netlist. The following steps demonstrate a basic walkthrough of the
reference design.

Environment setup
^^^^^^^^^^^^^^^^^

You will need ``fusesoc`` and ``Verilator`` so make sure you setup the environment correctly with

.. code::

   pyenv activate fusesoc
   scl enable devtoolset-9 bash

If needed revisit the notes of :ref:`02systemonchip`

Inspecting and Compiling software
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's start with the Poly1305 software implementation. This lab assignment does not explain you how poly1305 works. Instead, refer to `IETF RFC 7539
<https://datatracker.ietf.org/doc/html/rfc7539>`_. Writing cryptographic algorithms involves many bits, and it's important that you understand the notations used, such as *octet string*, *number* and *message*. All these conventions are explained clearly in the RFC.

First, inspect the files in ``sw/c/demo/poly1305``.

* ``poly1305.{c,h}`` contains the reference implementation in software.
* ``main.c`` is the main driver program that computes the tag for a reference message, while measuring clock cycles
* ``rfc7539.txt`` is FYI, the text of the RFC.
* ``polygen.c`` is a test program to compute MACs for test messages. This program will not run on the IBEX, you can simply compile with with gcc. It will not be further discussed below.

Now, let's take a look at the main function in ``main.c``. A MAC is a
function that can, in principle, process *arbitrary-length*
messages. Hence, for a given key, Poly1305 can compute a tag no matter
how long or how short the message really is. The software
implementation gets around this flexibility requirement by
partitioning the MAC computation in phases. Typically, there is an
initialization phase that sets up the internal MAC state, followed by
one or more update phases that feed in a block of message bytes while
updating the MAC state, and a terminating phase that extracts the tag
from the MAC state. This implemenentation of Poly1305 is no different, and
you can spot ``poly1305Init``, ``poly1305Update`` and ``poly1305Final``.

Besides the MAC processing, you can also spot the use of the timer,
which is called just before and after calling a Poly1305 phase. That
will give us an idea of the implementation cost (execution time).

.. code::

   void main() {

    Poly1305Context c;

    uint64_t init_cycles;
    uint64_t block_cycles;
    uint64_t final_cycles;
    uint64_t stamp;

    timer_init();

    puts("--key:\n");
    stamp = timer_read();
    poly1305Init(&c, key);
    init_cycles = timer_read() - stamp;
    showPoly(&c);

    puts("--block1:\n");
    stamp = timer_read();
    poly1305Update(&c, block1, block1len);
    block_cycles = timer_read() - stamp;
    showPoly(&c);

    puts("--block2:\n");
    poly1305Update(&c, block2, block2len);
    showPoly(&c);

    puts("--block3:\n");
    poly1305Update(&c, block3, block3len);
    showPoly(&c);

    puts("--final:\n");
    stamp = timer_read();
    poly1305Final(&c, tag);
    final_cycles = timer_read() - stamp;
    showPoly(&c);

    puts("--tag:\n");
    showBlock(tag, 16);

    puts("--poly1305 performance:\n");
    puts("Init:  ");
    puthex((uint32_t) (init_cycles >> 32));
    puthex((uint32_t) (init_cycles));
    putchar('\n');
    puts("Block: ");
    puthex((uint32_t) (block_cycles >> 32));
    puthex((uint32_t) (block_cycles));
    putchar('\n');
    puts("Final: ");
    puthex((uint32_t) (final_cycles >> 32));
    puthex((uint32_t) (final_cycles));
    putchar('\n');

    // delay the simulator before halting so all output is flushed
    stamp = timer_read();
    while ((timer_read() - stamp) < 700000) ;

    sim_halt();
  }


To compile the software, use the following command from the main entry point
of your final project repo.

.. code::

   make swprep

If you make a change to the software and you want to force recompilation, use the
-B command line parameter:

.. code::

   make -B swprep

   
Inspecting and Compiling hardware
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The reference hardware is based on the Custom IBEX example discussed in Lecture 2.
The toplevel of the design is defined in ``custom/rtl/top_rtl.sv``, and the SOC itself
is defined in ``custom/rtl/custom_ibex.sv``.

The following steps show you how to construct the simulator for this system.
The resulting simulation is put in ``build/custom_ibex_0/sim-verilator/``

.. code::

   make simprep

Running the simulation
^^^^^^^^^^^^^^^^^^^^^^
When we run the simulation, we will load a compiled binary of an IBEX firmware
application into the RAM, and then execute the Verilator executable.
This simulation also simulates the UART which is used for various output operations
(``putchar`` and ``puts`` in the C program). So, you will not see the UART output
appear on screen. Instead, the output is dumped in a file called ``uart0.log``.

The easiest is to open a *second* terminal, and monitor this file as follows.
The -f flag on tail ensures that any new updates to the file will be displayed on the
console of the second terminal.

.. code::

   tail -f uart0.log

In the original (first) terminal, you can now run the simulation with:

.. code::

   make run
   
In the second terminal, you will note the following output:

.. code::
   
   tail: uart0.log: file truncated					
   --key:								
   R  as a number 0806D540 0E52447C 036D5554 08BED685		
   S  as a number 1BF54941 AFF6BF4A FDB20DFB 8A800301		
   A* as a number 00000000 00000000 00000000 00000000 00000000	
   Buffer 0000000000000000000000000000000000 ()			
   --block1:							
   R  as a number 0806D540 0E52447C 036D5554 08BED685		
   S  as a number 1BF54941 AFF6BF4A FDB20DFB 8A800301		
   A* as a number 00000002 C88C7784 9D64AE91 47DDEB88 E69C83FC	
   Buffer 43727970746F6772617068696320466F01 (Cryptographic Fo)	
   --block2:							
   R  as a number 0806D540 0E52447C 036D5554 08BED685		
   S  as a number 1BF54941 AFF6BF4A FDB20DFB 8A800301		
   A* as a number 00000002 D8ADAF23 B0337FA7 CCCFB4EA 344B30DE	
   Buffer 72756D2052657365617263682047726F01 (rum Research Gro)	
   --block3:							
   R  as a number 0806D540 0E52447C 036D5554 08BED685		
   S  as a number 1BF54941 AFF6BF4A FDB20DFB 8A800301		
   A* as a number 00000002 D8ADAF23 B0337FA7 CCCFB4EA 344B30DE	
   Buffer 75706D2052657365617263682047726F01 (upm Research Gro)	
   --final:							
   R  as a number 00000000 00000000 00000000 00000000		
   S  as a number 00000000 00000000 00000000 00000000		
   A* as a number 00000000 00000000 00000000 00000000 00000000	
   Buffer 7570010000000000000000000000000000 (up)			
   --tag:								
   A8061DC1305136C6C22B8BAF0C0127A9				
   --poly1305 performance:						
   Init:  00000000000001BB						
   Block: 0000000000000D3A						
   Final: 000000000000111A

While in the first terminal, you will see:

.. code::

   [pschaumont@arc-schaumont-class-vm final-project-patrickschaumont]$ make run
   Simulation of Custom Ibex System						     
   ==============================							     
   										     
   Tracing can be toggled by sending SIGUSR1 to this process:			     
   $ kill -USR1 17838								     
   										     
   UART: Created /dev/pts/9 for uart0. Connect to it with any terminal program, e.g.    
   $ screen /dev/pts/9								     
   UART: Additionally writing all UART output to 'uart0.log'.			     
   										     
   Simulation running, end by pressing CTRL-c.					     
   Terminating simulation by software request.					     
   - ../src/lowrisc_ibex_sim_shared_0/./rtl/sim/simulator_ctrl.sv:93: Verilog $finish   
   Received $finish() from Verilog, shutting down simulation.			     
   										     
   Simulation statistics								     
   =====================								     
   Executed cycles:  5926399							     
   Wallclock time:   12.932 s							     
   Simulation speed: 458274 cycles/s (458.274 kHz)					     
   										     
   Performance Counters								     
   ====================								     
   Cycles:                     5926395						     
   Instructions Retired:       3135341						     
   LSU Busy:                   1208436						     
   Fetch Wait:                 1035040						     
   Loads:                      1146947						     
   Stores:                     61489						     
   Jumps:                      34858						     
   Conditional Branches:       552483						     
   Taken Conditional Branches: 512180
   

Take a moment to observe this data. The output of the C simulation should match the example
of Poly1305 in the RFC standard. In the main simulation window, you notice various performance
metrics. Note the very high simulation speed (>400KHz) thanks to cycle-accurate compiled simulation
*without any VCD/FST tracing*.

There is an alternate version of the run command called ``runfst`` which creates a signal
trace file of all signals in the simulation. This will obviously slow down the simulation
significantly (to 40KHz, so 10x slower). Also the fst file is very large (280MB).

.. code::

   make runfst

The fst file can be opened with gtkwave. Once we will design custom hardware for Poly1305, this
waveform feature will become a useful extension.

Project Definition Phase
------------------------

Your assignment for this week in the final project is as follows.

1. Set up the final project design environment as outlined above.
2. Study the Poly1305 algorithm, in particular by carefully reading the RFC document (Section 2.5)
3. Do research on the fundamental operation performed by Poly1305: (r * a) % p. In particular, p is a prime number with a special form: 2^130 - 5. Find out how this leads to an efficient implementation in hardware and software.
4. Make a short presentation (5 slides max) on your findings. As the final slide, present your ideas on how to accelerate Poly1305 in hardware.

Project Design Phase
--------------------

Your assignment for this week in the final project is as follows.

1. Make a detailed block diagram of a Poly1305 hardware accelerator.
   Your block diagram should show every register used in your hardware
   accelerator, as well as individual blocks for datapath logic. Your
   block diagram should enumerate all I/O ports used at the top-level.
   Finite state machines (with embedded state register) may be drawn
   as boxes as well. Your block diagram must reflect a potential RTL
   design with as much detail as needed, such that the RTL coding becomes
   easy.
   
2. Describe the hardware/software interface between IBEX and your
   hardware accelerator in detail. For example, in the *myreg.sv* example
   hardware module (found in your final project github under ``custom/rtl/myreg.sv``),
   two registers are defined: ``reg1`` and ``reg2``. For each such
   memory-mapped register in your hardware/software interface,
   define its length, direction (sw to hw, or hw to sw, or bidrectional),
   and functionality. Provide pseudocode that shows how your C application on the IBEX
   will communicate with your hardware accelerator.

4. Make a short presentation (5 slides max) with your design. As the final slide,
   make a back-of-the-envelope calculation that shows how much faster your hardware
   design will be over a pure software implementation.
   
The following slides (:download:`PDF <pdf/poly1305.pdf>`) capture
the main idea of Poly1305 hardware implementation.  However, while
the Verilog code as shown is correct, it is *inefficient* for
hardware implementation because of the massive hardware
multiplier. Your task is to come up with a hardware implementation
strategy that is much more compact than this design, while still
ensuring reasonable hardware acceleration performance.

.. attention::

   A few groups asked about the full testbench shown in the slides. The complete testbench
   is listed below. You can use ``polygen.c`` and ``poly1305.c`` to generate additional
   testvectors that can be read by this testbench.

.. code::

  module tb;
   reg [127:0]  r;
   reg [127:0]  s;
   reg [127:0]  m;
   reg          fb;
   reg          ld;
   reg          first;
   wire [127:0] p;
   wire         rdy;
   reg          reset;
   reg          clk;

   poly1305 dut(.reset(reset),
                .clk(clk),
                .r(r),
                .s(s),
                .m(m),
                .fb(fb),
                .ld(ld),
                .first(first),
                .p(p),
                .rdy(rdy));

   always
     begin
        clk = 1'b0;
        #5;
        clk = 1'b1;
        #5;
     end

   reg [514:0] inputvector;
   integer     inputfile;

   reg         tv_firstblock;
   reg         tv_lastblock;
   reg         tv_fullblock;
   reg [127:0] tv_data;
   reg [255:0] tv_key;
   reg [127:0] tv_tag;
   reg [127:0] tv_tag_swap;

   integer     i;

   initial
     begin
        $dumpfile("trace.vcd");
        $dumpvars(0, tb);
        inputfile = $fopen("vectors.txt","r");

        r = 128'h0;
        s = 128'h0;
        m = 128'h0;
        fb = 1'b0; // fulblock
        ld = 1'b0;
        first = 1'b0;

        reset = 1'b1;
        repeat(3)
          @(posedge clk);

        reset = 1'b0;
        @(posedge clk);

        while (!$feof(inputfile))
          begin

             $fscanf(inputfile, "%b", inputvector);

             {tv_firstblock,
              tv_lastblock,
              tv_fullblock,
              tv_data,
              tv_key,
              tv_tag} = inputvector;

             r = 128'b0;
             s = 128'b0;
             for (i = 0; i < 16; i = i + 1)
               begin
                  s = (s << 8) | tv_key[7:0];
                  tv_key = (tv_key >> 8);
               end
             for (i = 0; i < 16; i = i + 1)
               begin
                  r = (r << 8) | tv_key[7:0];
                  tv_key = (tv_key >> 8);
               end

             m = 128'b0;
             for (i = 0; i < 16; i = i + 1)
               begin
                  m = (m << 8) | tv_data[7:0];
                  tv_data = (tv_data >> 8);
               end

             first = tv_firstblock;
             fb = tv_fullblock;

             ld = 1'b1;
             @(posedge clk);

             while (rdy == 1'b0)
               begin
                  ld = 1'b0;
                  @(posedge clk);
               end

             tv_tag_swap = 128'b0;
             for (i = 0; i < 16; i = i + 1)
               begin
                  tv_tag_swap = (tv_tag_swap << 8) | tv_tag[7:0];
                  tv_tag = (tv_tag >> 8);
               end

             if ((tv_lastblock == 1'b1) && (p != tv_tag_swap))
               $display("Expected tag %x but computed %x", tv_tag_swap, p);
             else if (tv_lastblock == 1'b1)
               $display("Tag OK: %x", p);
          end


        $finish;
     end

  endmodule


.. code::

   module processblock(input          reset,
                    input          clk,
                    input [127:0]  r,
                    input [128:0]  m, // {1, 128 message bits}
                    input [129:0]  a_in, // {acc < P = 2^130 - 5}
                    output [129:0] a_out,
                    input          start,
                    output         done
                    );

   wire [130:0]                    m1;
   wire [258:0]                    m2; // 128 bits * 131 bits = 259 bits
   wire [131:0]                    m3; // first reduction leaves 2 extra bits
   wire [129:0]                    m4;
   wire [2:0]                      five;

   assign five = 5;
   assign m1 = m + a_in;
   assign m2 = m1 * r;
   assign m3 = m2[129:0] + m2[258:130] * five;  // first reduction
   assign a_out = m3[129:0] + m3[131:130] * five;  // second reduction
   assign done = start;

  endmodule

.. code::

   module poly1305(input          reset,
                input          clk,
                input [127:0]  r,
                input [127:0]  s,
                input [127:0]  m,
                input          fb,
                input          ld,
                input          first,
                output [127:0] p,
                output         rdy);
   // fb   ld    first
   // x    x     1      marks the first block of a sequence of blocks
   // 1    1     0      m contains 128 bits of message (and does not include separator byte)
   // 0    1     0      m contains <128 bits of message (and includes separator byte)
   // 0    0     x      processing cycle

   reg [129:0]                 acc;
   wire [129:0]                acc_out;
   wire [129:0]                acc_in;
   wire                        block_start;
   wire                        block_done;

   wire [128:0]                msep;
   assign msep = fb ? {1'b1, m} : m;

   assign acc_in = first ? 130'b0 : acc;

   wire [127:0]                rclamp;

   assign rclamp = r & 128'h0FFF_FFFC_0FFF_FFFC_0FFF_FFFC_0FFF_FFFF;

   processblock single(.reset(reset),
                       .clk  (clk),
                       .r    (rclamp),
                       .m    (msep),
                       .a_in (acc_in),
                       .a_out(acc_out),
                       .start(block_start),
                       .done (block_done)
                       );

   always @(posedge clk)
     if (reset)
       acc <= 130'h0;
     else
       acc <= block_done ? acc_out : acc;

   assign block_start = ld;

   assign p = acc_out + s;
   assign rdy = block_done;

  endmodule


Project Implementation Phase
----------------------------

The final phase of the project is the implementation into RTL. Your
objective is to build a systemverilog implementation of Poly1305 that
can be integrated as a (memory-mapped) coprocessor core for the IBEX
core.

.. attention::

   In the following, we provide guidelines on the implementation of the
   ``poly1305`` coprocessor. You are free to follow your own path.
   However, you must deliver at least the following final results.
   First, you must provide an SoC level testbench, part of
   your ``final-project-..`` repository.
   Next, you must provide a netlist (and possibly layout) of the
   coprocessor implemented by itself. This requires a separate repository,
   as explained below. Do not work on synthesis/layout until you have
   a completely correct and working simulation in ``final-project-..``.

Study how ``myreg`` works
^^^^^^^^^^^^^^^^^^^^^^^^^

Refer to the ``myreg`` demonstration coprocessor to see how
memory-mapped registers can be created. In the following, the most
important features of ``myreg.sv`` are reviewed.

First, the coprocessor needs to provide a standard bus interface.
This bus interface will be identical for your Poly1305 coprocessor.
It includes the following signals.

* ``clk_i``, the clock input
* ``rst_ni``, the negative-asserted reset input
* ``device_req_i``, the device request input, asserted at valid bus cycles
* ``device_addr_i``, the relative address input (i.e. starts at 0 for this module)
* ``device_we_i``, the write control input, asserted during write cycles
* ``device_be_i``, the byte-assert input, one bit per byte of a word
* ``device_wdata_i``, the write data input (32 bit)
* ``device_rvalid_o``, the read assertion output, asserted when ``device_rdata_o`` valid
* ``device_rdata_o``, the read data output (32 bit)

.. code::

   module myreg #(
      parameter int unsigned AddrWidth = 32,	  
      parameter int unsigned RegAddr   = 8		  
    ) (						  
      input logic                 clk_i,		  
      input logic                 rst_ni,		  
    						  
      input logic                 device_req_i,	  
      input logic [AddrWidth-1:0] device_addr_i,	  
      input logic                 device_we_i,	  
      input logic [3:0]           device_be_i,	  
      input logic [31:0]          device_wdata_i,	  
      output logic                device_rvalid_o,	  
      output logic [31:0]         device_rdata_o	  
    );                                                


In a coprocessor, you would first create a read and write strobe for
each memory mapped register. For each such register, you also choose a
relative memory address, starting at 0. It's a good practice to make
memory-mapped registers both readable and writable from software.
Even if you would use that register only to transfer data from
software to hardware, it's still useful for debugging if your system
software can access that register.

.. code::

   localparam int              unsigned MYREG_REG1 = 32'h0;
   localparam int              unsigned MYREG_REG2 = 32'h4;

   logic [RegAddr-1:0]         reg_addr;
   logic                       reg1_wr, reg1_rd, reg1_rd_d;
   logic                       reg2_wr, reg2_rd, reg2_rd_d;

   logic [31:0]                reg1_data;
   logic [31:0]                reg2_data;


   // Decode write and read requests.
   assign reg_addr          = device_addr_i[RegAddr-1:0];
   assign reg1_wr           = device_req_i &  device_we_i & (reg_addr == MYREG_REG1[RegAddr-1:0]);
   assign reg1_rd           = device_req_i & ~device_we_i & (reg_addr == MYREG_REG1[RegAddr-1:0]);
   assign reg2_wr           = device_req_i &  device_we_i & (reg_addr == MYREG_REG2[RegAddr-1:0]);
   assign reg2_rd           = device_req_i & ~device_we_i & (reg_addr == MYREG_REG2[RegAddr-1:0]);

Next, you have to develop two pieces of logic, that show (a) how to
write into the memory mapped registers, and (b) how to read from the
memory mapped registers. The challenge of writing into the
memory-mapped registers, is that the software-based access must be
combined with the access coming from the custom hardware. In this case,
the ``myreg`` example shows something straightforward: if you write
into ``reg1``, then ``reg2`` will be updated at the same time with the
sum of (the previous) ``reg1`` and ``reg2``. Conversely, if you write into
``reg2``, then ``reg1`` will be updated at the same time with the
sum of (the previous) ``reg1`` and ``reg2``. This results in the following code.

In the following code, also pay special attention to the creation of *delayed*
write strobes ``reg1_rd_d`` and ``reg2_rd_d``. These are needed to support reading
from the memory mapped registers from software.

.. code::

     always @(posedge clk_i or negedge rst_ni) begin
      if (!rst_ni) begin
       reg1_data <= 32'b0;
       reg2_data <= 32'b0;
       reg1_rd_d <= 1'b0;
       reg2_rd_d <= 1'b0;
      end else begin
       if (reg1_wr)
         begin
            reg1_data[7:0]   <= {device_be_i[0] ? device_wdata_i[7:0] : reg1_data[7:0]};
            reg1_data[15:8]  <= {device_be_i[1] ? device_wdata_i[15:8] : reg1_data[15:8]};
            reg1_data[23:16] <= {device_be_i[2] ? device_wdata_i[23:16] : reg1_data[23:16]};
            reg1_data[31:24] <= {device_be_i[3] ? device_wdata_i[31:24] : reg1_data[31:24]};

            // when writing into reg1, will add its (old) contents to reg2
            reg2_data <= reg2_data + reg1_data;
         end
        if (reg2_wr)
         begin
            reg2_data[7:0]   <= {device_be_i[0] ? device_wdata_i[7:0] : reg2_data[7:0]};
            reg2_data[15:8]  <= {device_be_i[1] ? device_wdata_i[15:8] : reg2_data[15:8]};
            reg2_data[23:16] <= {device_be_i[2] ? device_wdata_i[23:16] : reg2_data[23:16]};
            reg2_data[31:24] <= {device_be_i[3] ? device_wdata_i[31:24] : reg2_data[31:24]};

            // when writing into reg2, will add its (old) contents to reg1
            reg1_data <= reg1_data + reg2_data;
         end
       device_rvalid_o <= device_req_i;
       reg1_rd_d <= reg1_rd;
       reg2_rd_d <= reg2_rd;
    end
  end
   
How should you handle the case when both the hardware and the software
can potentially update the same memory mapped register in the same
clock cycle? In that case, you will need to decide how to resolve the
access during concurrent access. Typically, the hardware will get
priority over software -- but you can decide on you own solution.
In hardware, you would write something as follows. ``hardware_writes_regs``
is a hardware-level signal that asserts if the current clock cycle should
update ``reg1`` with a coprocessor-generated value.

.. code::

   if (hardware_writes_reg1)
     begin
       reg1_data <= coprocessor_generated_value;
     end
   else if (reg1_wr)
     begin
       reg1_data[7:0]   <= {device_be_i[0] ? device_wdata_i[7:0] : reg1_data[7:0]};
       reg1_data[15:8]  <= {device_be_i[1] ? device_wdata_i[15:8] : reg1_data[15:8]};
       reg1_data[23:16] <= {device_be_i[2] ? device_wdata_i[23:16] : reg1_data[23:16]};
       reg1_data[31:24] <= {device_be_i[3] ? device_wdata_i[31:24] : reg1_data[31:24]};
     end

Finally, you also have to ensure that the memory-mapped registers can
be read from software. In the ``myreg`` example, this looks as follows.

.. code::

   always_comb begin
      if (reg1_rd_d)
        device_rdata_o = reg1_data;
      else if (reg2_rd_d)
        device_rdata_o = reg2_data;
      else
        device_rdata_o = 32'b0;
    end 

Note that we drive the output data bus ``device_rdata_o`` on the *delayed* read
strobes (see above).

Now, where should you instantiate your own coprocessor, ``poly1305``? Simply, make
it a submodule of the memory-mapped interface, and hook up its inputs and outputs
to the memory-mapped coprocess.

Design Verification In Verilator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following description depends on how your coprocessor is realized.
I will make a generic description based on the following set of
memory-mapped coprocessor registers. However, your specific
implementation can vary -- you are not required to define the same set
of memory-mapped registers! In the following, I made the assumption
that the top-level module name of the coprocessor is called
``poly1305top``. If you start to modify ``myreg``, of course, your
toplevel name may be different. In my case, I created a new coprocessor
next to ``myreg``.

The ``poly1305top`` design has 14 memory-mapped registers.

+-------+----------+---------------------------------+
| Name  | Address  | Function                        |
+=======+==========+=================================+
| R0    | 0        | lowest word of 128-bit R        |
+-------+----------+---------------------------------+
| R1    | 4        | next word of 128-bit R          |
+-------+----------+---------------------------------+
| R2    | 8        | next word of 128-bit R          |
+-------+----------+---------------------------------+
| R3    | 12       | highest word of 128-bit R       |
+-------+----------+---------------------------------+
| S0    | 16       | lowest word of 128-bit S        |
+-------+----------+---------------------------------+
| S1    | 20       | next word of 128-bit S          |
+-------+----------+---------------------------------+
| S2    | 24       | next word of 128-bit S          |
+-------+----------+---------------------------------+
| S3    | 28       | highest word of 128-bit S       |
+-------+----------+---------------------------------+
| M0    | 32       | lowest word of 128-bit M        |
+-------+----------+---------------------------------+
| M1    | 36       | next word of 128-bit M          |
+-------+----------+---------------------------------+
| M2    | 40       | next word of 128-bit M          |
+-------+----------+---------------------------------+
| M3    | 44       | highest word of 128-bit M       |
+-------+----------+---------------------------------+
| P0    | 48       | lowest word of 128-bit P        |
+-------+----------+---------------------------------+
| P1    | 52       | next word of 128-bit P          |
+-------+----------+---------------------------------+
| P2    | 56       | next word of 128-bit P          |
+-------+----------+---------------------------------+
| P3    | 60       | highest word of 128-bit P       |
+-------+----------+---------------------------------+
| CTL   | 64       | Control Word                    |
+-------+----------+---------------------------------+
| STAT  | 68       | Status Word                     |
+-------+----------+---------------------------------+

Here is an example software driver for such a coprocessor.
In essence, this driver compares a software implementation
of poly1305 with the hardware-accelerated version of it.
The hardware-accelerated version is integrated by creating
custom versions of the old software versions. For example,
``poly1305Init_HW`` is the hardware version of ``poly1305Init``.
To write to coprocessor registers or read from them,
the macros ``DEV_WRITE`` and ``DEV_READ`` are used.
Keep in mind that this driver is just an example -- you are
not required to use this specific implementation, and the code
below is only provided as a guideline of how to develop the
software driver that will test your coprocessor design.

.. code::

   #include "poly1305.h"
   #include "demo_system.h"
   #include "timer.h"
   #include "dev_access.h"
   
   uint8_t key[32] = {0x85,0xd6,0xbe,0x78,0x57,0x55,0x6d,0x33,0x7f,0x44,0x52,0xfe,0x42,0xd5,0x06,0xa8,
                      0x01,0x03,0x80,0x8a,0xfb,0x0d,0xb2,0xfd,0x4a,0xbf,0xf6,0xaf,0x41,0x49,0xf5,0x1b};
   
   uint8_t block1[16] = {0x43,0x72,0x79,0x70,0x74,0x6f,0x67,0x72,0x61,0x70,0x68,0x69,0x63,0x20,0x46,0x6f};
   size_t  block1len  = 16;
   
   uint8_t block2[16] = {0x72,0x75,0x6d,0x20,0x52,0x65,0x73,0x65,0x61,0x72,0x63,0x68,0x20,0x47,0x72,0x6f};
   size_t  block2len  = 16;
   
   uint8_t block3[16]  = {0x75,0x70};
   size_t  block3len  = 2;
   
   uint8_t tag[16];
   
   void showBlock(const uint8_t *data, size_t length) {
     uint8_t n;
     uint8_t d;
     for (n=0; n<length; n++) {
       d = (data[n] >> 4) & 0xf;
       putchar((d > 9) ? d - 10 + 'A' : d + '0');
       d = data[n] & 0xf;
       putchar((d > 9) ? d - 10 + 'A' : d + '0');
     }
     putchar('\n');
   }
   
   #define POLY1305_R0 0x80006000
   #define POLY1305_R1 0x80006004
   #define POLY1305_R2 0x80006008
   #define POLY1305_R3 0x8000600C
   
   #define POLY1305_S0 0x80006010
   #define POLY1305_S1 0x80006014
   #define POLY1305_S2 0x80006018
   #define POLY1305_S3 0x8000601C
   
   #define POLY1305_M0 0x80006020
   #define POLY1305_M1 0x80006024
   #define POLY1305_M2 0x80006028
   #define POLY1305_M3 0x8000602C
   
   #define POLY1305_P0 0x80006030
   #define POLY1305_P1 0x80006034
   #define POLY1305_P2 0x80006038
   #define POLY1305_P3 0x8000603C
   
   #define POLY1305_CTL  0x80006040
   #define POLY1305_STAT 0x80006044
   
   #include <string.h>
   
   #define osMemcpy(dest, src, length) (void) memcpy(dest, src, length)
   
   #define MIN(a, b) ((a) < (b) ? (a) : (b))
   
   
   #define LOAD32LE(p) (                                           \
                         ((uint32_t)(((uint8_t *)(p))[0]) << 0)  | \
                         ((uint32_t)(((uint8_t *)(p))[1]) << 8)  | \
                         ((uint32_t)(((uint8_t *)(p))[2]) << 16) | \
                         ((uint32_t)(((uint8_t *)(p))[3]) << 24))
   
   #define STORE32LE(a, p) \
       ((uint8_t *)(p))[0] = ((uint32_t)(a) >> 0) & 0xFFU, \
       ((uint8_t *)(p))[1] = ((uint32_t)(a) >> 8) & 0xFFU, \
       ((uint8_t *)(p))[2] = ((uint32_t)(a) >> 16) & 0xFFU, \
       ((uint8_t *)(p))[3] = ((uint32_t)(a) >> 24) & 0xFFU
   
   typedef unsigned int uint_t;
   
   static int firstblock = 0;
   
   void poly1305Init_HW(Poly1305Context *p, uint8_t *key) {
     //The 256-bit key is partitioned into two parts, called r and s
     DEV_WRITE(POLY1305_R0, LOAD32LE(key +  0) & 0x0FFFFFFF);
     DEV_WRITE(POLY1305_R1, LOAD32LE(key +  4) & 0x0FFFFFFC);
     DEV_WRITE(POLY1305_R2, LOAD32LE(key +  8) & 0x0FFFFFFC);
     DEV_WRITE(POLY1305_R3, LOAD32LE(key + 12) & 0x0FFFFFFC);
   
     DEV_WRITE(POLY1305_S0, LOAD32LE(key + 16));
     DEV_WRITE(POLY1305_S1, LOAD32LE(key + 20));
     DEV_WRITE(POLY1305_S2, LOAD32LE(key + 24));
     DEV_WRITE(POLY1305_S3, LOAD32LE(key + 28));
   
     firstblock = 1;
     p->size = 0;
   }
   
   void poly1305ProcessBlock_HW(Poly1305Context *context);
   
   void poly1305Update_HW(Poly1305Context *context, const void *data, size_t length) {
     size_t n;
     while(length > 0) {
       n = MIN(length, 16 - context->size);
       osMemcpy(context->buffer + context->size, data, n);
       context->size += n;
       data = (uint8_t *) data + n;
       length -= n;
       if(context->size == 16) {
         poly1305ProcessBlock_HW(context);
         firstblock = 0;
         context->size = 0;
       }
     }
   }
   
   void poly1305ProcessBlock_HW(Poly1305Context *context) {
     uint_t n, fullblock, ctl;
   
     n = context->size;
     fullblock = (n == 16);
     context->buffer[n++] = 0x01;
   
     //If the resulting block is not 17 bytes long (the last block),
     //pad it with zeros
     while(n < 17)  {
       context->buffer[n++] = 0x00;
     }
   
     DEV_WRITE(POLY1305_M0, LOAD32LE(context->buffer +  0));
     DEV_WRITE(POLY1305_M1, LOAD32LE(context->buffer +  4));
     DEV_WRITE(POLY1305_M2, LOAD32LE(context->buffer +  8));
     DEV_WRITE(POLY1305_M3, LOAD32LE(context->buffer + 12));
   
     ctl = 2;
     if (fullblock)
       ctl |= 1;
     if (firstblock)
       ctl |= 4;
   
     DEV_WRITE(POLY1305_CTL, ctl);
     DEV_WRITE(POLY1305_CTL, 0);
   
     while (DEV_READ(POLY1305_STAT) != 1) ;
     DEV_WRITE(POLY1305_STAT, 0);
   
   }
   
   void poly1305Final_HW(Poly1305Context *context, uint8_t *tag) {
   
     //Process the last block
     if(context->size != 0)
       poly1305ProcessBlock_HW(context);
   
     STORE32LE(DEV_READ(POLY1305_P0), tag);
     STORE32LE(DEV_READ(POLY1305_P1), tag + 4);
     STORE32LE(DEV_READ(POLY1305_P2), tag + 8);
     STORE32LE(DEV_READ(POLY1305_P3), tag + 12);
   
     firstblock = 1;
     context->size = 0;
   }
   
   void showPoly(const Poly1305Context *p) {
     uint8_t n;
   
     puts("R  as a number ");
     puthex(p->r[3]); putchar(' ');
     puthex(p->r[2]); putchar(' ');
     puthex(p->r[1]); putchar(' ');
     puthex(p->r[0]); putchar('\n');
   
     puts("S  as a number ");
     puthex(p->s[3]); putchar(' ');
     puthex(p->s[2]); putchar(' ');
     puthex(p->s[1]); putchar(' ');
     puthex(p->s[0]); putchar('\n');
   
     puts("A* as a number ");
     puthex((uint32_t) (p->a[4])); putchar(' ');
     puthex((uint32_t) (p->a[3])); putchar(' ');
     puthex((uint32_t) (p->a[2])); putchar(' ');
     puthex((uint32_t) (p->a[1])); putchar(' ');
     puthex((uint32_t) (p->a[0])); putchar('\n');
   
     puts("Buffer ");
     for (n=0; n<17; n++) {
       char d;
       d = (p->buffer[n] >> 4) & 0xf;
       putchar((d > 9) ? d - 10 + 'A' : d + '0');
       d = p->buffer[n] & 0xf;
       putchar((d > 9) ? d - 10 + 'A' : d + '0');
     }
     putchar(' ');
     putchar('(');
     for (n=0; n<17; n++)
       putchar(p->buffer[n]);
     putchar(')');
     putchar('\n');
   
   }
   
   void main() {
   
     Poly1305Context c;
     uint64_t stamp;
     uint64_t hwcycles;
     uint64_t swcycles;
   
     Poly1305Context chw;
   
     puts("Running Hardware\n");
   
     stamp = timer_read();
   
     poly1305Init_HW  (&chw, key);
     poly1305Update_HW(&chw, block1, block1len);
     poly1305Update_HW(&chw, block2, block2len);
     poly1305Update_HW(&chw, block3, block3len);
     poly1305Final_HW (&chw, tag);
     hwcycles = timer_read() - stamp;
   
     puts("Done. TAG: ");
     showBlock        (tag,  16);
   
     puts("HW Cycles: ");
     puthex((uint32_t) (hwcycles >> 32));
     puthex((uint32_t) (hwcycles));
     putchar('\n');
   
   
     puts("Running Software\n");
   
     stamp = timer_read();
   
     poly1305Init  (&c,  key);
     poly1305Update(&c,  block1, block1len);
     poly1305Update(&c,  block2, block2len);
     poly1305Update(&c,  block3, block3len);
     poly1305Final (&c,  tag);
     swcycles = timer_read() - stamp;
   
     puts("Done. TAG: ");
     showBlock     (tag, 16);
   
     puts("SW Cycles: ");
     puthex((uint32_t) (swcycles >> 32));
     puthex((uint32_t) (swcycles));
     putchar('\n');
   
     // delay the simulator before halting so all output is flushed
     puts("Waiting\n");
     stamp = timer_read();
     while ((timer_read() - stamp) < 700000) ;
   
     sim_halt();
     }


The following is a checklist to help you with the integration of your coprocessor
in your final-project repository.

1. Add the RTL of your coprocessor to ``custom/rtl``. This can be one or more files.
   For example, in my case, I added ``processblock.v``, ``poly1305.v``, and
   ``poly1305top.sv``.
   
2. Update the toplevel description, ``custom_ibex.sv``, to include your coprocessor
   (in case you did not modify ``myreg`` but added a new coprocessor). The updates
   are as follows. It's straightforward if you follow the example of other
   devices attached to the IBEX bus.

   a. Define a base address, size and mask
   b. Update the ``bus_device_e`` enum
   c. Adjust (increment) the total ``NrDevices`` definition
   d. Add a ``cfg_device_addr_base`` and ``cfg_device_addr_mask``
   e. Instantiate a new coprocessor.

3. Update ``custom_ibex_core.core`` with the new files you have added in ``custom/rtl``.

4. Add a new software application in ``c/sw/demo``. Update
   ``CMakeLists.txt`` in the same directory so that your new software
   application is compiled.

5. In the top ``Makefile``, update the name of the application macro ``APP`` to point to your
   new software application.

6. Build the application with ``make simprep`` followed by ``make
   swprep`` followed by ``make run`` or ``make runfst``. If the build
   process breaks, inspect the error messages closely. The first error
   message usually reveals what went wrong.

The following figure shows a simulation that captures a complete
poly1305 operation in hardware as three message blocks. The timing
diagram monitors the bus interface (``device_..``) and the
memory-mapped registers. Next, we outline the timing behavior of
an IBEX write bus cycle and an IBEX read bus cycle.

.. figure:: img/pj_coprocessor-full.png
   :figwidth: 600px
   :align: center
   
The IBEX bus cycle
""""""""""""""""""

Writing to the coprocess takes two clock cycles. In the first cycle,
the IBEX asserts the address, device request signal, write control
signal, byte-enable signal, and value to write. The coprocessor must
decode the bus in one cycle and respond with an ``rvalid`` in the next
cycle. At the same time, the memory-mapped register is updated.
Eventually, the IBEX removes the address and other bus controls.
   
.. figure:: img/pj_writebus.png
   :figwidth: 600px
   :align: center

Reading from the coprocessor takes two cycles. In the first cycle, the
IBEX asserts the address, device request signal, de-asserts write
control signal, asserts the byte-enable signal.  The coprocessor must
decode the bus in one cycle and respond with an ``rvalid`` in the next
cycle. At that time, the output data bus must be dirven.  Eventually,
the IBEX removes the address and other bus controls.
   
.. figure:: img/pj_readbus.png
   :figwidth: 600px
   :align: center

	   
Synthesis and Layout
^^^^^^^^^^^^^^^^^^^^

.. important::

   Do not start working on this section until you have a completely
   working Verilator simulation. This means that you must first obtain
   a correct simulation that demonstrates that your hardware poly1305
   computes the same tag value as the poly1305 reference software.

   Once you have achieved that milestone, you can then move to
   coprocessor synthesis. You will receive a repository that is fully
   prepared to synthesize and layout your design, and where you only
   have to add RTL.

   To clone that additional repository, accept the following
   Github classroom link: https://classroom.github.com/a/0BlhqXyC

   Clone the resulting repository, which is called
   final-project-synthesis-userid You need at least one repo per team,
   although both team members can accept and clone the repository.

   The following starts after you have cloned the synthesis
   repository, which follows the same structure as the standard ECE574
   flow.


Step 1: Copy and configure RTL
""""""""""""""""""""""""""""""

Your first step is the copy the RTL of the coprocessor from your final-project
repo to the ``rtl`` subdirectory. The synthesis repo is configured for
a toplevel module name ``poly1305top``. If you change this, you will need to
adjust the Makefiles as needed.

For example, in my case, this directory looks as follows.

.. code::

   [pschaumont@arc-schaumont-class-vm rtl]$ ls
   poly1305.sv  poly1305top.sv  processblock.sv

Step 2: Run an RTL simulation
"""""""""""""""""""""""""""""

As a sanity check, you will run and RTL simulation in ``sim``. Make sure
to update the ``Makefile`` so that all files are included in the simulation.

Next, inspect the testbench ``tb.sv``, which runs an access sequence through
the coprocessor. You can use ``writebus`` and ``readbus`` tasks to quickly
generate bus cycles. The testbench runs through three message blocks and then
checks the resulting tag. Depending on how you have defined the memory-mapped
registers, small changes to ``tb.sv`` may be needed.

Afterwards you can just type ``make``. If the testbench completes successfully,
you will see: ``Test Vector Passes``.

Step 3: RTL synthesis
"""""""""""""""""""""

Next, go to the ``syn`` directory, update the Makefile to include all files
of your coprocessor and run logic synthesis. Inspect the synthesis results
in ``reports`` and find the area of your design.

.. important::

   1. If your design exceeds 25K cells, the layout of your coprocessor
   will be lengthy, and I recommend that you perform these runs at
   night only.

   2. You can relax the cycle constraint of your implementation
   for example from ``CLOCKPERIOD=4`` to ``CLOCKPERIOD=10``. A relaxed
   timing constraint will make synthesis complete faster. If your
   coprocessor has negative slack (``reports/poly1305top_report_timing.rpt``),
   you will need to relax the clock period.

   3. Make sure to check your coprocessor for latches (``grep LAT
   outputs/poly1305top_netlist.v``). No latches should exist in the final
   design.

Step 4: Static Timing Analysis
""""""""""""""""""""""""""""""

Next, go to the ``sta`` directory, and run a static timing analysis.
Make sure that your design is correct for setup timing. Since we have
not defined timing constraints for asynchronous inputs (reset), you
can ignore timing problems from reset.

Step 5: Gate Level Simulation (optional)
""""""""""""""""""""""""""""""""""""""""

Next, go to the ``glsim`` directory and perform gate-level simulation
using post-synthesis timing. If you modified the RTL testbench for
your coprocessor design, you may also need similar changes to the
gate level testbench.

.. code::

   make sim-postsyn
   or
   make sim-postsyn-gui

Step 6: Layout
""""""""""""""

Next, go to the ``layout`` directory and create the layout.
The chip IO pinout is already defined in ``chip/chip.io`` (you
may change it, if needed or desired).

Make sure that the ``Makefile`` reflects the complete set
of RTL files for your coprocessor. Then, build the layout.

.. code::

   make syn
   make layout

If you succeed with the layout, you should be able to produce a
final layout of your complete coprocessor design.
   
.. figure:: img/pj_layout.png
   :figwidth: 600px
   :align: center

Step 7: Post Layout Simulation (optional)
"""""""""""""""""""""""""""""""""""""""""

Next, go back to the ``glsim`` directory and perform gate-level simulation
using post-layout timing.

.. code::

   make sim-postlayout
   or
   make sim-postlayout-gui


Step 8: Collect Design Data
"""""""""""""""""""""""""""

If you have completed all design steps described above, congratulations, you
are done! During the design steps, a great number of design reports and
diagnostic information was generated. Go back and collect that data for your
presentation.

1. The ``syn`` directory contain area, timing and power estimates from logic synthesis

2. The ``sta`` directory contains the critical path definition for the selected clock period

3. The ``layout/reports`` amd ``layout/reports/STA`` contains post-layout design data
   related to area, power, timing, and clock tree.
   
Project Presentation
--------------------

This section summarizes the requirements for the project presentation.

If you participated in the 'Tiny Tapeout' projects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Your final presentation will consist of a 20 minute powerpoint
presentation on your design. Discuss the application in detail, since
the rest of the class did not hear about your application yet. Put
special emphasis on the following two aspects related to the design:

1. Dealing with area and performance constraints.
2. Dealing with design verification, either through simulation or else
   through prototyping.

Deliverables for your final project include the following.

* The github repository of your tiny tapeout submission
* Your slide deck must be submitted to the instructor before the presentation on 11 December.

If you participated in the 'poly1305' project in the in-person course
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The deliverables for your final project include the following. The
slide deck and excel sheet must be emailed to the instructor before
the start of your presentation. One email per team is sufficient. Make
sure to indicate the names of the students participating in the team.

* The gtihub repository for the IBEX Verilator based cosimulation

* The github repository for the synthesis of your coprocessor

* The slide deck of your final presentation

* The following excel sheet filled out with data statistics of your design: :download:`project-metrics.xlsx <pdf/project-metrics.xlsx>`
   
Your final presentation will consist of a 15-minute or 10-minute
powerpoint presentation on your design. The schedule will show if you
are going to talk for 15 or 10 minutes. In your talk, you have to
highlight the main design phases of your design, depending on how far
you were able to finish the design. There are four design phases:

1/ **RTL** coding corresponding to the 'RTL' section in the Projects Metrics.  You have made an initial architecture design for the accelerator, along with RTL coding. In your slides, demonstrate the design hierarchy and major design properties such as the number of data multipliers, and the clock cycle latency for a modular multiplication.

2/ **SIMULATION** which includes (a) simulation of the full coprocessor in the IBEX Verilator system simulation, and (b) stand-alone Verilog simulation of the coprocessor as it is moved to synthesis. Both simulations need to compute the correct Poly1305 Tag before moving futher into synthesis.

3/ **SYNTHESIS** which includes RTL synthesis of your design (the 'syn' step of the ECE574 flow). Report the major properties of your design including the achieved clock period, slack, leaf instance count (number of standard cells), the number of flip-flops, and the active cell area. You may report these numbers for multiple clock periods, if you want to showcase the area/delay tradeoff in your design. The design delay (critical path * design latency) and the active cell area will be used to plot your design in a global Pareto plot.

4/ **LAYOUT** which includes layout of your design (the 'layout' step of the ECE574 flow). Report the major properties of your design: the clock period, slack, leaf instance count including physical cells, total area of standard cells, and gate density. Consult the excel sheet for guidance on where to take all these measurements.

If you participated in the 'poly1305' project in the online class
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The deliverables for your final project include the following.  The
slide deck and excel sheet must be emailed to the instructor together
with the video. The submission deadline is Friday 12/13, 11:59PM. One
email per team is sufficient. Make sure to indicate the names of the
students participating in the team.

* The github repository for the IBEX Verilator based cosimulation
* The github repository for the synthesis of your coprocessor
* The slide deck of your final presentation
* The following excel sheet filled out with data statistics of your design: :download:`project-metrics.xlsx <pdf/project-metrics.xlsx>`

Your final presentation will consist of a 10-minute video presentation
on your design. In your talk, you have to highlight the main design
phases of your design, depending on how far you were able to finish
the design. Refer to the section above for a description of each
phase, including **RTL**, **SIMULATION**, **SYNTHESIS** and
**LAYOUT**.