Introduction

Homework 3 is a complete design of a hardware processor. It combines the ideas discussed in Lectures 2 to 8. Because the assignment is more complex than a regular homework, it carries double weight (40 points). In this homework, you will design a bit-transpose coprocessor. The coprocessor reads in an 8-by-8 bit matrix, computes the matrix transpose, and returns the result.

You will receive a reference implementation of the algorithm in software, implemented on an MSP430 core. Your job is to accelerate a C function call of that reference implementation using a hardware accelerator. There are many possible solutions to this design problem. A correct solution (a) is functionally correct and (b) runs in fewer clock cycles than the reference implementation. We will grade this homework by performance: the fastest design will receive the best score. Any design that is functionally correct will receive a passing score.

The Bit Transpose coprocessor

Figure: 8-by-8 bit transpose

The high-level functionality of the bit transpose coprocessor is illustrated in the figure above. The coprocessor accepts a single 64-bit argument, which represents a compacted version of an 8-by-8 bit matrix. It computes the transpose and returns a single 64-bit result.
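
One way to pin down the packing is to write the transpose directly from its definition. The sketch below assumes that the most significant byte of the 64-bit word is row 0 and that the most significant bit of each byte is column 0, which is the layout the reference software in main.c appears to use. The helper names get_bit and transpose_by_definition are illustrative and not part of the starter code.

```c
#include <stdint.h>

/* Assumed layout: row 0 is the most significant byte and column 0 is the
   most significant bit of that byte, so element (row, col) sits at bit
   position 63 - (8*row + col). */
static unsigned get_bit(uint64_t m, unsigned row, unsigned col) {
  return (unsigned)((m >> (63u - (8u * row + col))) & 1u);
}

/* Builds the result bit by bit: output (row, col) = input (col, row). */
static uint64_t transpose_by_definition(uint64_t a) {
  uint64_t r = 0;
  unsigned row, col;
  for (row = 0; row < 8; row++) {
    for (col = 0; col < 8; col++) {
      if (get_bit(a, col, row)) {
        r |= (uint64_t)1 << (63u - (8u * row + col));
      }
    }
  }
  return r;
}
```

Under this layout, the input 80C0E0F0F8FCFEFF maps to FF7F3F1F0F070301, and applying the function twice returns the original value. A model like this is slow, but it can serve as an independent check while you debug a faster design.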

The reference platform comes with an 8KB program memory and an 8KB data memory. Note that the program memory is larger than what we have used for previous examples discussed in class. You will not be able to run the reference software implementation of Homework 3 on one of these previous configurations because their program memory is too small.

Start by downloading the reference design from GitHub by following the Homework 3 starter link on GitHub Classroom.

git clone https://github.com/vt-ece4530-f19/homework-3-patrickschaumont

Compile the reference hardware and reference software, following the steps in the Quickstart section below.

A portion of the listing main.c in transpose is shown below.

// The reference implementation
unsigned long long transpose(unsigned long long a) {
  unsigned r1, r2, r3, r4, r5, r6, r7, r8;
  unsigned long long c, d;
  unsigned i;

  r1 = (a >> 56);
  r2 = (a >> 48) & 0xff;
  r3 = (a >> 40) & 0xff;
  r4 = (a >> 32) & 0xff;
  r5 = (a >> 24) & 0xff;
  r6 = (a >> 16) & 0xff;
  r7 = (a >>  8) & 0xff;
  r8 = (a      ) & 0xff;

  c = 0;
  for (i = 0; i<8; i++) {
    
    d = ((r1 & 1) << 7)
      | ((r2 & 1) << 6)
      | ((r3 & 1) << 5)
      | ((r4 & 1) << 4)
      | ((r5 & 1) << 3)
      | ((r6 & 1) << 2)
      | ((r7 & 1) << 1)
      | ((r8 & 1));
      
    c = (d << 56) + (c >> 8);

    r1 = r1 >> 1;
    r2 = r2 >> 1;
    r3 = r3 >> 1;
    r4 = r4 >> 1;
    r5 = r5 >> 1;
    r6 = r6 >> 1;
    r7 = r7 >> 1;
    r8 = r8 >> 1;

  }
  
  return c;
}

// The hardware-accelerated implementation
unsigned long long transpose_hw(unsigned long long a) {
  volatile unsigned long long k = transpose(a);
  return k;
}

The main program computes the bit transpose for every value between 0 and 32767, and it accumulates the resulting bit-transposed outputs as well as the cycle count needed to compute them. This testbench is used for the reference implementation as well as for the hardware-accelerated version.

int main() {

	...

    // performance comparison
    sw_tt = 0;
    sw_ct = 0;
    hw_tt = 0;
    hw_ct = 0;
    for (i=0; i<32768; i++) {
      de1soc_hexlo(i);
      
      TimerLap();
      t = transpose(i);
      c = TimerLap();
      sw_tt += t;
      sw_ct += c;
      
      TimerLap();
      t = transpose_hw(i);
      c = TimerLap();
      hw_tt += t;
      hw_ct += c;      
    }

    ...
}

The main program then sends the computed results and the cycle counts to the UART. If your hardware-accelerated version of transpose is correct, then we expect that sw_tt = hw_tt. Furthermore, since the hardware coprocessor should be faster than the software reference implementation, we expect that sw_ct > hw_ct.

Your job is to design a hardware coprocessor that can compute the same result as the reference implementation transpose. The coprocessor needs to be called from within transpose_hw. The file main.c has a line that says NO CHANGES ARE ALLOWED BELOW THIS LINE. This means that you must stick to the pre-defined testbench. Of course, during debugging, you can make changes as needed.

Designing a hardware-accelerated version of transpose will involve the following:

  • Design a set of memory-mapped registers that can hold the input argument a and the result.
  • Design the computational core in hardware, i.e., implement the equivalent functionality of the transpose function body in Verilog.
  • Integrate the memory-mapped registers and the computational core into a single module.
  • Integrate and interconnect the resulting module into the toplevel of the design (toplevel.v).
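
The MSP430 is a 16-bit processor, so a 64-bit argument has to be moved to and from the coprocessor in several accesses. A minimal sketch of one possible software side, assuming four consecutive 16-bit input registers and four consecutive 16-bit output registers: this register layout, the least-significant-word-first order, and the names write_arg and read_result are hypothetical choices, not part of the starter code.

```c
#include <stdint.h>

/* Writes the 64-bit argument as four 16-bit words, least significant
   word first. 'in' points at four consecutive 16-bit registers
   (hypothetical layout). */
static void write_arg(volatile uint16_t *in, unsigned long long a) {
  unsigned i;
  for (i = 0; i < 4; i++) {
    in[i] = (uint16_t)(a >> (16 * i));
  }
}

/* Reassembles a 64-bit result from four 16-bit words, least significant
   word first. 'out' points at four consecutive 16-bit registers. */
static unsigned long long read_result(const volatile uint16_t *out) {
  unsigned long long r = 0;
  unsigned i;
  for (i = 0; i < 4; i++) {
    r |= (unsigned long long)out[i] << (16 * i);
  }
  return r;
}
```

On the target, the pointers would refer to whatever peripheral addresses your module decodes. On a workstation, a plain array can stand in for the registers so that the packing can be tested before the hardware exists.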

Quickstart

Here are the steps to build and implement the reference implementation.

  • Download the repository
git clone https://github.com/vt-ece4530-f17/homework-3-patrickschaumont
  • Compile the software (in Cygwin). Note that the reference implementation is relatively big: 1958 bytes. The program memory can fit up to 8 Kbyte.
cd homework-3-patrickschaumont/software
make compile
cd ../../..
  • Compile the hardware (in Nios II Command Shell)
cd homework-3-patrickschaumont/msp430de/hardware/msp430de1soc
quartus_sh --flow compile msp430de1soc
cd ../../../..
  • Connect the board and download the application (in Nios II Command Shell). If you cut-and-paste this command onto one line, make sure to remove the backslashes (\).
cd homework-3-patrickschaumont/msp430de/loader
system-console --script connect.tcl \
               -sof ../hardware/msp430de1soc/msp430de1soc.sof \
               -bin ../software/transpose/transpose.bin
  • Run MobaXterm and telnet to localhost port 19800. You should see a stream of lines such as the following.
V 80C0E0F0F8FCFEFF FF7F3F1F0F070301
SW C0C0C0C0C0C0C000 0E3C8000 HW C0C0C0C0C0C0C000 0E500000

The first line is a verification that the software reference is a correct transposition. Refer to the first figure in this homework. 80C0E0F0F8FCFEFF is the input (8 bytes each representing one row of bits), while FF7F3F1F0F070301 is the output (8 bytes each representing one row of bits).
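
To compare such a value against the figure, it can help to print the eight bytes as rows of bits. The following is a small sketch for that purpose, assuming the most significant byte is the top row and the most significant bit is the leftmost column; format_matrix is an illustrative helper, not part of the starter code.

```c
#include <stdint.h>

/* Renders a packed matrix as 8 text rows of '0'/'1', most significant
   byte first, each row terminated by '\n'. 'out' needs 73 bytes:
   8 rows of 9 characters plus the terminating '\0'. */
static void format_matrix(uint64_t m, char out[73]) {
  unsigned row, col, k = 0;
  for (row = 0; row < 8; row++) {
    unsigned byte = (unsigned)((m >> (56u - 8u * row)) & 0xffu);
    for (col = 0; col < 8; col++) {
      out[k++] = ((byte >> (7u - col)) & 1u) ? '1' : '0';
    }
    out[k++] = '\n';
  }
  out[k] = '\0';
}
```

For 80C0E0F0F8FCFEFF this produces a lower-triangular pattern of ones (10000000 in the top row, 11111111 in the bottom row), and for FF7F3F1F0F070301 an upper-triangular one.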

The second line shows four numbers: a 64-bit checksum for the software transpose, a 32-bit cycle count for the software transpose, a 64-bit checksum for the hardware transpose, and a 32-bit cycle count for the hardware transpose. Initially the hardware transpose will be slower than the software transpose because we are emulating the hardware transpose by calling the software transpose. The overhead of this extra function call makes the hardware transpose slower.
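
The raw totals are easier to interpret per call. A small sketch of the arithmetic, assuming the totals cover the 32768 iterations of the testbench loop and that the hardware count is nonzero; cycles_per_call and speedup_percent are illustrative helpers, not part of the starter code.

```c
#include <stdint.h>

/* Average cycles per call, rounded down. */
static uint32_t cycles_per_call(uint32_t total_cycles, uint32_t calls) {
  return total_cycles / calls;
}

/* Software cycles relative to hardware cycles, as a percentage rounded
   down: 100 means equal, above 100 means the hardware version is faster.
   A 64-bit intermediate avoids overflow in the multiplication. */
static uint32_t speedup_percent(uint32_t sw_cycles, uint32_t hw_cycles) {
  return (uint32_t)(((uint64_t)sw_cycles * 100u) / hw_cycles);
}
```

With the sample line above, 0E3C8000 corresponds to 7289 cycles per software call and 0E500000 to 7328 cycles per call of the unmodified transpose_hw, so the wrapper costs about 39 cycles per call and the ratio is 99 percent.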

Rules of the implementation

To be counted as a valid solution, your implementation must satisfy the following properties.

  • You cannot change the main function. Thus, the testbench of the C program needs to remain identical. The objective of the homework is to create a hardware equivalent of the transpose function, no more, no less.
  • You can change the Makefile, the hardware, and the software above the DO NOT CHANGE ... line.
  • You have to design both the hardware and the hardware/software interface for your solution. The hardware has to be implemented in a single Verilog file transpose.v, which you will include as part of the project. The hardware/software interface has to be implemented in a single C function unsigned long long transpose_hw(unsigned long long a), which you will add in the file main.c.

What to turn in

  • A PDF that shows the listing of the following files: main.c and transpose.v. In addition, add a screenshot of a MobaXterm terminal that shows the output of your design in operation. Call the listing hw3_username.pdf (with username replaced by your GitHub username) and include it in the root of homework-3-username.

  • Make sure to add any files you created (including the PDF and transpose.v) to your repository with the git add file command.

  • Commit your changes to your repository and push them to GitHub.

Good Luck!