Introduction

The codesign challenge is the final assignment for ECE 4530. This project is an exercise in performance optimization. You will start from a reference application on a Nios II platform, and you have to improve its performance as much as possible using the hardware/software codesign techniques covered in this course. Typically, you would design a hardware coprocessor. In addition, you would also optimize the driver software and/or modify the system architecture to remove communication bottlenecks.

The major constraints in the design are as follows.

  • The final result has to run on a DE1-SoC board
  • The final result has to be turned in before the submission deadline, Sunday 8 December 11:59 PM
  • The testbench consists of the classification of 100 images from the MNIST database, and your margin of error needs to be less than 10 misclassifications (defined further below).
  • The performance of the final solution is measured by a timer counting 50MHz clock cycles.
  • At the start of the testbench, the 100 images and the training coefficients are stored in an off-chip SDRAM.

Application

The MNIST database is an image database with handwritten digits 0 to 9. The database is used to demonstrate the capabilities of neural networks and to benchmark their performance - even though there are many newer, more complex databases available now.

The reference design is a four-layer neural network with 784 nodes in the input layer, two hidden layers of 16 nodes each, and 10 output nodes. The network of the reference design has been trained on the 60,000 images of the MNIST database.
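
To make this structure concrete, the sketch below shows the overall shape of the inference for one image: three fully-connected layers followed by a search for the largest output score. All names (dense, classify, W1, b1, and so on) are placeholders and do not match the reference library's API, and the ReLU activation shown is only an example; the reference network may use a different activation.

/* Illustrative sketch only -- not the reference code. */
static void dense(const float *W, const float *b, const float *x,
                  float *y, int n_in, int n_out)
{
    /* y = activation(W*x + b) */
    for (int o = 0; o < n_out; o++) {
        float acc = b[o];
        for (int i = 0; i < n_in; i++)
            acc += W[o * n_in + i] * x[i];      /* multiply-accumulate  */
        y[o] = (acc > 0.0f) ? acc : 0.0f;       /* ReLU, as an example  */
    }
}

static int classify(const float *image,          /* 784 input pixels    */
                    const float *W1, const float *b1,
                    const float *W2, const float *b2,
                    const float *W3, const float *b3)
{
    float h1[16], h2[16], out[10];

    dense(W1, b1, image, h1, 784, 16);   /* input layer -> hidden layer 1 */
    dense(W2, b2, h1,    h2,  16, 16);   /* hidden 1    -> hidden layer 2 */
    dense(W3, b3, h2,    out, 16, 10);   /* hidden 2    -> output scores  */

    int best = 0;                        /* predicted digit = index of    */
    for (int d = 1; d < 10; d++)         /* the largest output score      */
        if (out[d] > out[best])
            best = d;
    return best;
}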

This assignment is to accelerate the reference implementation as much as possible. This writeup does not elaborate further on neural network design and implementation. There are plenty of good introductions out there. Two good ones are the introduction on neural networks by 3blue1brown part 1 and part 2, and the introduction on accelerated computing of neural networks by Xilinx. You don’t have to be an expert in neural networks to succeed in the codesign challenge. The acceleration techniques you have learned in this course are sufficient to obtain a good improvement over the reference implementation.

The reference architecture is the same as the one we’ve discussed in Lecture 19 (example-nios-sdram). It contains the following components.

  • Nios II/e at 50MHz
  • JTAG uart, Timer 0 (system timer), Timer 1 (timestamp timer), PIO ports for HEX, LEDs, Switches
  • 64Kbyte on-chip RAM holding program instructions
  • 100MHz SDRAM controller interfacing to 64MByte of off-chip SDRAM holding program data

The reference implementation is based on a neural network library in C, which was ported to the DE1-SoC. The reference implementation as provided is not very fast, due to the complexity of the implementation on the one hand, and the design of the platform on the other. For example, the Nios II/e does not contain a floating-point multiplier. The reference implementation takes about 6.5 billion cycles to classify 100 reference images (that is about 1.3 seconds per image at 50MHz).

The reference testbench performs the following task. The .rwdata segment contains 100 images selected from the MNIST database. Each image is 28 by 28 pixels. The images are fed into a {784, 16, 16, 10} neural network for classification. The neural network was trained beforehand, and the testbench includes the training coefficients. Each image is classified into one of 10 categories corresponding to the digits 0 to 9. Finally, the classification is verified for correctness. The reference implementation makes 9 mistakes in 100 classifications. That is, for 9 out of 100 images, the identified number is different from the actual handwritten number. Overall, your accelerated design must make less than 10 misclassifications over the 100 test images.
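
In simplified form, the testbench loop amounts to the following: classify each image, compare the prediction against the stored label, and report cycle counts taken from the timestamp timer. This is a conceptual sketch only; the actual code is in mnist_infertest.c, and classify_image() and labels[] are placeholder names, not the library's API.

#include <stdio.h>
#include <sys/alt_timestamp.h>

extern int classify_image(int n);   /* placeholder: run image n through the network */
extern const int labels[100];       /* placeholder: expected digit for each image   */

int main(void)
{
    int errors = 0;
    unsigned long long total = 0;    /* 64-bit to hold billions of cycles */

    alt_timestamp_start();           /* timer1 counts 50MHz clock cycles  */

    for (int n = 0; n < 100; n++) {
        unsigned int t0 = (unsigned int) alt_timestamp();
        int predicted   = classify_image(n);
        total += (unsigned int) alt_timestamp() - t0;

        if (predicted != labels[n])
            errors++;
        printf("Lap cycles: %llu\n", total);
    }

    printf("Complete. %d total errors\n", errors);
    printf("Total cycles: %llu\n", total);
    return 0;
}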

Running the reference implementation

  • Start by accepting the Codesign Challenge assignment on GitHub Classroom. Clone the resulting repository to your laptop.

(Cygwin)

$ git clone https://github.com/vt-ece4530-f19/codesign-challenge-patrickschaumont
  • Go into the repository and compile the hardware. To modify the hardware, you will have to use Quartus and Platform Designer; the following command only handles the synthesis.

(Nios II Command Shell)

$ quartus_sh --flow compile exampleniossdram

When the compilation finishes, you have a bitstream exampleniossdram.sof. Download this bitstream to the DE1-SoC board.

(Nios II Command Shell)

$ nios2-configure-sof -d 2  exampleniossdram.sof

To compile the software, first prepare a board support package (BSP). Create a new BSP starting from platformniossdram.sopcinfo. Select the following options:

  • Connect the 50MHz timer0 to the system timer and timer1 to the timestamp timer.
  • Disable the alt_load facility (under Advanced/hal/linker)

(Nios II Command Shell)

nios2-bsp-editor
  • Compile the BSP. Note that the script genbsp.sh contains this command, so you can also simply run ./genbsp.sh.

(Nios II Command Shell)

cd software
nios2-bsp-generate-files  \
    --settings=hal_bsp/settings.bsp  \
    --bsp-dir=hal_bsp
  • Compile the reference testbench. Note the script genmake.sh, which contains the command below.

(Nios II Command Shell)

cd software
nios2-app-generate-makefile \
    --bsp-dir ../hal_bsp \
    --elf-name main.elf \
    --src-files dataset.c \
    --src-files functions.c \
    --src-files loss_functions.c \
    --src-files mnist_infertest.c \
    --src-files neuralnet.c \
    --src-files optimizer.c \
    --src-files tensor.c
  • Open a nios2-terminal and run the application

(Nios II Command Shell)

nios2-terminal

(Nios II Command Shell)

nios2-download main.elf --go

If everything runs as expected, you will see a series of digits with the predictions made by the network. For each digit, the testbench prints the number of elapsed cycles (lap cycles) so far. The HEX display on the board also reflects the progress: the first two digits count the test image (0 to 99), the next two digits show the network prediction (00 to 09), and the last two digits show the image label (00 to 09). The red LEDs count the number of errors (only the first 8 errors are counted).

O~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O
|                              |
|                              |
|                              |
|                              |
|                              |
|                              |
|                              |
|       ,x=+..                 |
|       XMMMMMxxxxxxxx=.       |
|       .-.-=XMXMMMMXMM+       |
|            ........MM-       |
|                   ,MX.       |
|                  .MM,        |
|                  +MM.        |
|                 .MM.         |
|                 +Mx.         |
|                .XM.          |
|                -Mx           |
|               .MM.           |
|              .XM=            |
|             .xMX.            |
|             .MM,             |
|            .XM-              |
|            +MM.              |
|           .MMM.              |
|           -MMX.              |
|           -MX.               |
|                              |
O~~~~~~~~~ Number: 7 ~~~~~~~~~~O
Test Number 0: Network prediction: 7  Expected: 7  Confidence: 99.728165%
Lap cycles: 63288046

The last output of the testbench is the Total Cycle Count, and that is the cycle count against which your speedup will be measured.

Complete. 9 total errors
Total cycles: 6529544770

Your accelerated design is correct if it has less than 10 total errors over the 100 test images in the testbench. The speedup of your design is given by the following expression, where C is the performance of your design measured in 50MHz clock cycles.

               6529544770
  speedup = --------------
                    C
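
For example, an accelerated design that classifies the 100 test images in C = 652,954,477 cycles achieves a speedup of 6,529,544,770 / 652,954,477 = 10. These numbers merely illustrate the formula; they are not a target.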

Ranking Criteria

All designs will be strictly ranked from best to worst. We will use the following metrics to evaluate the rank of your design.

  • Functional Correctness: This requirement is mandatory for all designs. Your design has to work (the testbench must pass), in order to be considered for ranking. If it does not work, you will be automatically ranked last.

  • Metric 1: The speedup of your design following the formula given above. Higher is better.

  • Metric 2: The area efficiency of the resulting design, expressed in ALMs. An ALM is an Adaptive Logic Module, a unit of hardware in a Cyclone V FPGA. A lower ALM cell count corresponds to a smaller design. You can find the ALM count in the ‘Flow Summary’ of Quartus.

  • Metric 3: The turn-in time of your design and report, as measured by the turn-in time on Scholar. Turning in the solution earlier is better. Note that, if you turn in the design multiple times, only the latest turn-in time will be used.

Two designs will be compared as follows, to determine their ranking order. If a design is functionally incorrect (i.e. fails the testbench), it will automatically be moved to the lowest rank. If multiple designs are not operational, they will all share the same lowest rank. Every functionally correct design will get a better and unique rank.

  • First, the speedup (Metric 1) will be compared. If there is a difference of more than 5% between the two designs (with the fastest design considered 100%), the fastest design will get the better rank. If, on the other hand, the difference is smaller than 5%, Metric 2 will be used as a tie-breaker (a numerical example follows this list).

  • Metric 2 will be used in a similar way to compare two designs. If the difference in area between the two designs is larger than 5%, then the smaller design gets the better rank. Otherwise, if they are separated by less than 5%, Metric 3 will be used as a tie-breaker.

  • Metric 3 will be used as a final metric in case the ranking decision cannot be made using Metric 1 and Metric 2 alone. In this case, the design turned-in earlier wins. The turn-in time is the time of your last commit on the git repository.
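
As a numerical example of the 5% rule in Metric 1: if the fastest of two designs has a speedup of 20.0 and the other one has a speedup of 19.2 (96% of the fastest), the difference is only 4% and Metric 2 decides; if the other design has a speedup of 18.0 (90% of the fastest), the difference is 10% and the faster design gets the better rank.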

The grade for your project is determined as follows.

  • 40 points of the grade are determined by the ranking of your project as described above. The best design gets 40 (out of 40) points, the worst design gets 0 (out of 40) points, and all other designs are linearly distributed between 0 and 40 points.

  • 60 points of the grade are determined by the quality of the written documentation you provide with the solution.

Acceleration Strategies

The main objective of this assignment is to make the testbench run as fast as possible, using hardware software codesign. There are several layers to this problem and it is worthwhile to carefully think them through.

  • The reference processor is a Nios II/e, which has a very high CPI (it is the slowest Nios II variant). Moving computations away from this processor into hardware is a good idea. However, keep in mind that the overall neural network is a fairly complex computation, so you'll have to do this cautiously: you cannot implement everything in hardware. Study the innermost loop of the neural network inference problem (the dense() sketch in the Application section shows this loop); you'll find that efficient matrix multiplication is the key to acceleration.

  • The data is stored off-chip in SDRAM. The 100 images and the training coefficients all together use about 130,000 bytes, which is very manageable. In other words, the reference design is likely not limited by the off-chip memory bandwidth. The design is limited by the computational bottleneck more than anything else.

  • The design is written using floating-point arithmetic, but it runs on a processor without floating-point hardware. Introducing floating-point hardware may improve performance considerably. This can be done using custom instructions or using stand-alone coprocessors; a sketch of the custom-instruction approach follows this list.

  • Always look for the bottleneck in the design, and look for opportunities to do things in parallel. Realize that the bottleneck shifts around as you start improving the design. For example, making a computation faster may result in a communication bottleneck.

  • Use the tools you’ve learned in this course! Use profiling, debugging, and clock cycle counting to diagnose your design. Use custom instructions, coprocessors, dedicated and customized memory organizations to improve the performance of the design.
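
To illustrate the floating-point bullet above, the fragment below shows how a floating-point multiply custom instruction could be invoked from C. The slot number FPMULT_N and the wrapper fpmult() are assumptions that depend on how you configure the custom instruction in Platform Designer; nios2-elf-gcc exposes custom instructions through builtins such as __builtin_custom_fnff() for a float = f(float, float) operation.

#define FPMULT_N 0   /* assumed custom-instruction slot; must match your
                        Platform Designer configuration                  */

static inline float fpmult(float a, float b)
{
    /* Hypothetical floating-point multiply custom instruction. */
    return __builtin_custom_fnff(FPMULT_N, a, b);
}

/* Inside the multiply-accumulate loop of a dense layer (see the sketch in
 * the Application section), the software-emulated product
 *     acc += W[o * n_in + i] * x[i];
 * would then become
 *     acc += fpmult(W[o * n_in + i], x[i]);
 */

A stand-alone coprocessor follows the same idea but moves the whole loop, not just the multiplication, into hardware; in that case the data transfer between the SDRAM and the coprocessor becomes the point to watch.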

There are several optimization mechanisms to consider. However, it’s vital to think and act strategically.

  • Avoid a random ’let’s try this and see’ strategy. Very likely you will forget something, or it will lead you into a local optimum while losing sight of the big picture. Always keep an objective in mind (e.g. a desired acceleration factor), and measure your progress towards that objective. Such a final objective should be derived through a reasonable back-of-the-envelope calculation.

  • Make a list of possible optimizations and start with the one that you think will give the most improvement. For example, you can try to increase the clock frequency to 100MHz, which gives an improvement factor of roughly two. However, once the system runs at 100MHz, architectural changes become harder and it is more likely that you will violate the critical path.

  • Take only a small step at a time, and verify, verify, verify. Make sure that you always have something that works and that you can fall back on.

  • The key to success in this assignment is NOT a single session of back-to-back all-nighters a few days before the deadline. The key is to work on this design in small steps over a longer period. Design is a creative process, and that takes time. Ideas take time to develop. Think about your strategy and discuss it with your colleagues. This is an open-ended assignment and you are graded competitively, so it is not to your advantage to disclose your ideas. However, a brainstorming session or two with your peers may do wonders to sharpen your insight into this assignment.

What to turn in

You have to deliver the following items for the project result. Everything has to be provided through GitHub.

  • A sof (bitstream) file of your final design, and a compiled executable (elf) of your final design.

  • A PDF document that describes your resulting design. Explain your design strategy, the architecture of your hardware/software solution, and your overall observations on the design. Note that the PDF document counts for 60% of the grade, so it is worth doing this carefully. Work on the PDF before you run out of time.

  • Push your repository back on github before the deadline. Make sure to add all new files.