Introduction

In this lecture we will design an MSP430 coprocessor for 64-bit multiplications. We will investigate the speedup of the hardware implementation over software, and we will analyze the sources of overhead in the design of such a coprocessor.

The function that we’ll accelerate is the following. It is a 64-bit (unsigned) multiply. Since MSP430 registers are 16 bits wide, the 32-bit arguments a and b are each passed in 2 registers, and the 64-bit result occupies 4 registers. Inside the function, the cast widens the operands to 64 bits, so the multiply itself is a 64-bit times 64-bit multiply with a 64-bit result.

unsigned long long mymul(unsigned long a, unsigned long b) {
   unsigned long long r;
   r = (unsigned long long) a * b;
   return r;
}

Indeed, the assembly listing of this function looks as follows. The two arguments a and b are passed down as R13:R12 and R15:R14 respectively. The compiler turns these into two 64-bit arguments using zero extension as R11:R10:R9:R8 and R15:R14:R13:R12 respectively.

0000f828 <mymul>:
    f828:       0a 12           push    r10             ;
    f82a:       09 12           push    r9              ;
    f82c:       08 12           push    r8              ;
    f82e:       08 4c           mov     r12,    r8      ;
    f830:       0c 4e           mov     r14,    r12     ;
    f832:       09 4d           mov     r13,    r9      ;
    f834:       0d 4f           mov     r15,    r13     ;
    f836:       4e 43           clr.b   r14             ;
    f838:       0f 4e           mov     r14,    r15     ;
    f83a:       0a 4e           mov     r14,    r10     ;
    f83c:       0b 4e           mov     r14,    r11     ;
    f83e:       b0 12 ae fa     call    #-1362          ;#0xfaae
    f842:       30 40 fe fb     br      #0xfbfe         ;

0000faae <__mspabi_mpyll>:
    faae:       0a 12           push    r10             ;
...

Interestingly, the compiler inserts a call to __mspabi_mpyll which is, according to the MSP430 ABI, a signed 64-bit multiply. (I suspect this is a compiler bug; the unsigned 64-bit multiply, a different function in the MSP430 ABI, is never generated by the compiler and appears to be missing.)

In any case, our objective is to build a hardware version of a 32-bit by 32-bit unsigned multiply with a 64-bit result. Recalling the previous lecture, there are two aspects to the design of such a hardware multiplier. First, the arguments have to be moved from the software to the hardware coprocessor, and the result has to be retrieved from the hardware coprocessor back into software. Next, the coprocessor’s execution has to be synchronized with the execution of the software.

Designing the Hardware Function Call

Let’s start by considering the hardware interface to hold the arguments and the result.

Data     Type     Read Access   Write Access   # Address
                  (HW to SW)    (SW to HW)     Locations
a        32-bit   n             y              2
b        32-bit   n             y              2
retval   64-bit   y             n              4


The first design decision is to pick addresses for this set of variables. The most straightforward approach is to map everything to a different register address. This is especially useful for the 32-bit and 64-bit variables, since the compiler will automatically generate write and read operations to consecutive addresses for long variables. Furthermore, we will pick a different address for each of a, b, and the return value. This is wasteful of address space, but on the other hand it makes the design easier to understand for a software programmer. The resulting memory map is shown below. The addresses are organized in little-endian order, so that for example the least significant byte of a is at address 0x140 and the most significant byte of a is at 0x143.

Data Byte Address Word Address
a 0x140 - 0x143 0xA0 - 0xA1
b 0x144 - 0x147 0xA2 - 0xA3
retval 0x148 - 0x14F 0xA4 - 0xA7
ctl 0x150 0xA8


In addition, we need to design the control interface for the coprocessor. The control interface uses a very simple command set that fits in a single control register. When the lsb of that control register changes from 0 to 1, the multiplication is computed in hardware at exactly that clock cycle. To support this control functionality, we add a single register ctl that will be used to run the coprocessor. That register is mapped at address 0x150. Furthermore, we won’t add any status signals to the coprocessor, so there is no need for a status register.

We are now ready to design the coprocessor. The following is an example design that uses a single-cycle multiplication in hardware.

module  mymul (
               output [15:0] per_dout,
               input         mclk,
               input [13:0]  per_addr,
               input [15:0]  per_din,
               input         per_en,
               input [1:0]   per_we,
               input         puc_rst
               );

   reg [31:0]                hw_a;
   reg [31:0]                hw_b;
   reg [63:0]                hw_retval;
   reg                       hw_ctl;
   reg                       hw_ctl_old;
   wire [63:0]               mulresult;

   wire                      write_alo, write_ahi;
   wire                      write_blo, write_bhi;
   wire                      write_retval;
   wire                      write_ctl;

   always @(posedge mclk or posedge puc_rst)
     if (puc_rst)
       begin
          hw_a        <= 32'h0;
          hw_b        <= 32'h0;
          hw_retval   <= 64'h0;
          hw_ctl      <= 1'h0;
          hw_ctl_old  <= 1'h0;
       end
     else
       begin
          hw_a[15: 0] <= write_alo    ? per_din[15:0]   : hw_a[15: 0];
          hw_a[31:16] <= write_ahi    ? per_din[15:0]   : hw_a[31:16];
          hw_b[15: 0] <= write_blo    ? per_din[15:0]   : hw_b[15: 0];
          hw_b[31:16] <= write_bhi    ? per_din[15:0]   : hw_b[31:16];
          hw_retval   <= write_retval ? mulresult       : hw_retval;
          hw_ctl      <= write_ctl    ? per_din[0]      : hw_ctl;
          hw_ctl_old  <= hw_ctl;
       end

   assign mulresult = hw_a * hw_b;

   assign write_alo    = (per_en & (per_addr == 14'hA0) & (per_we == 2'h3));
   assign write_ahi    = (per_en & (per_addr == 14'hA1) & (per_we == 2'h3));
   assign write_blo    = (per_en & (per_addr == 14'hA2) & (per_we == 2'h3));
   assign write_bhi    = (per_en & (per_addr == 14'hA3) & (per_we == 2'h3));

   assign write_ctl    = (per_en & (per_addr == 14'hA8) & per_we[0] & per_we[1]);
   assign write_retval = ((hw_ctl == 1'h1) & (hw_ctl ^ hw_ctl_old));

   assign per_dout = (per_en & (per_addr == 14'hA4) & (per_we == 2'h0)) ? hw_retval[15: 0] :
                     (per_en & (per_addr == 14'hA5) & (per_we == 2'h0)) ? hw_retval[31:16] :
                     (per_en & (per_addr == 14'hA6) & (per_we == 2'h0)) ? hw_retval[47:32] :
                      (per_en & (per_addr == 14'hA7) & (per_we == 2'h0)) ? hw_retval[63:48] : 16'h0;

endmodule

The integration of this hardware coprocessor in the system is easy; the coprocessor module maps as a peripheral on the peripheral bus. As a recap, the integration includes the following steps.

  1. Copy the Verilog file of the coprocessor design as a new module in hardware/msp430de1soc/msp430.
  2. Modify the top-level design of the MSP430 system by adding the new coprocessor to hardware/msp430de1soc/msp430/toplevel.v.
  3. Add the new coprocessor Verilog file to the list of files in the Quartus project constraints file (.qsf file extension).

After the hardware is designed and integrated, we can develop a software driver for the coprocessor. The following illustrates a sample driver.

#define HW_A      (*(volatile unsigned long *)      0x140)
#define HW_B      (*(volatile unsigned long *)      0x144)
#define HW_RETVAL (*(volatile unsigned long long *) 0x148)
#define HW_CTL    (*(volatile unsigned *)           0x150)

unsigned long long mymul_hw(unsigned long a, unsigned long b) {
  HW_A = a;
  HW_B = b;
  HW_CTL = 1;
  HW_CTL = 0;
  return HW_RETVAL;
}

To verify that the right addresses are used, it’s a good idea to consult the assembly listing of the software driver. This listing also illustrates the ‘overhead’ of such a hardware driver. Whereas the hardware multiplication completes in a single clock cycle, moving the data around involves 11 move instructions, plus a function call and return.

0000f846 <mymul_hw>:
    f846:       82 4c 40 01     mov     r12,    &0x0140 ;
    f84a:       82 4d 42 01     mov     r13,    &0x0142 ;
    f84e:       82 4e 44 01     mov     r14,    &0x0144 ;
    f852:       82 4f 46 01     mov     r15,    &0x0146 ;
    f856:       3c 40 50 01     mov     #336,   r12     ;#0x0150
    f85a:       9c 43 00 00     mov     #1,     0(r12)  ;r3 As==01
    f85e:       8c 43 00 00     mov     #0,     0(r12)  ;r3 As==00
    f862:       1c 42 48 01     mov     &0x0148,r12     ;0x0148
    f866:       1d 42 4a 01     mov     &0x014a,r13     ;0x014a
    f86a:       1e 42 4c 01     mov     &0x014c,r14     ;0x014c
    f86e:       1f 42 4e 01     mov     &0x014e,r15     ;0x014e
    f872:       30 41           ret

Running the example

To run the example, proceed as follows.

First, clone the design example.

git clone https://github.com/vt-ece4530-f19/example-mul64mps430

Then, in a Cygwin window, compile the software

cd software
make compile

And in a NiosII command shell, compile the hardware

cd hardware/msp430de1soc
quartus_sh --flow compile msp430de1soc

Download the application (using a NiosII command shell)

cd loader
system-console -script connect.tcl -sof ../hardware/msp430de1soc/msp430de1soc.sof -bin ../software/mul64/mul64.bin

Finally, in a MobaXterm telnet session connected to localhost:19800, you should see the results appear. The line below says that the software implementation computes the same result as the hardware multiplier (for one particular test vector). Furthermore, it says that the software requires roughly 0x81D cycles (2,077 cycles) for a 64-bit multiply, while the hardware-accelerated version needs 0x54 cycles (84 cycles). That is a speedup of almost 25 times! The downside, however, is that this speedup comes at a considerable overhead. Our 64-bit multiplier requires only a single cycle to complete, but the software integration costs 84 cycles. Hence, we suffered an 84x integration overhead for only a 25x speedup!

SW 00000C3789AB6618 081D HW 00000C3789AB6618 0054
           |         |            |            |
      SW result  SW cycles    HW result    HW cycles

This requires further investigation.

Overhead factors in hardware acceleration

The ‘hardware function call’ model of acceleration replaces a software function with a hardware version of that function.

The Ideal Speedup

Assume that the original (software-only) cycle count of the program is as follows. The software function executes M times, and requires N1 cycles per call. The rest of the software (the non-accelerated part) takes N0 cycles. Then the complete task in software takes NT cycles, given by

NT = N0 + M.N1

The maximum (ideal) speedup is obtained when we reduce N1 to 0 cycles. In that case, the speedup of the optimized solution is given by

SU = (N0 + M.N1) / N0

Some observations we can make are as follows.

  1. The speedup SU is determined by the portion of the program that can be accelerated by the function. If, for example, M.N1 constitutes k = 0.9 (90%) of the execution time in the original program, then the maximum speedup is SU = 1 / (1-k) = 10. This insight is also known as Amdahl’s law.

  2. When M is large, the proportion M.N1 will quickly become the dominating portion of the program execution time. This means that the most intensively used software functions are the best candidates for speedup. The clock cycles of the software function effectively ‘weigh’ M times heavier in the overall cycle count: every clock cycle you can shave off that software function gains you M cycles overall.

  3. To design the most effective accelerator, we should be looking for the most intensively used function.

The Real Speedup

The real speedup is less than ideal. In reality, we are replacing one function in software with a hybrid version that runs partly in software (for the interfacing) and partly in hardware. So we are substituting N1sw cycles with N1hw cycles. We reasonably expect N1hw to be much smaller than N1sw. The speedup is then given by

SU = (N0sw + M.N1sw) / (N0sw + M.N1hw)

So the speedup is limited by our ability to accelerate N1sw into N1hw. If M is very large, the speedup is dominated by the ratio N1sw/N1hw. Let’s break out N1sw and N1hw into their respective components.

The execution time N1sw contains two components.

  1. N1stackops, the time needed to manipulate the stack and implement the function-call semantics. This includes for example
    • Caller Activity: prepare the arguments on the stack or in registers, preserve caller-saved storage, call the function.
    • Callee Activity: prepare the return arguments on the stack or in registers, preserve callee-saved storage, return from the function.
  2. N1swcompute, the time spent on the actual computations.

The execution time N1hw also contains two components.

  1. N1hwswcomms, the time needed to get the arguments to the hardware module and back. Note that part of this time goes to operations very similar to N1stackops: the hardware function call also needs to build a stack frame, and prepare all the arguments in registers (or on the stack), before they can be moved to the memory-mapped registers.

  2. N1hwcompute, the time spent on the actual computations.

Hence we find as speedup

SU = (N0sw + M.(N1stackops + N1swcompute)) / (N0sw + M.(N1hwswcomms + N1hwcompute))

For large M, we therefore find the following

SU = (N1stackops + N1swcompute) / (N1hwswcomms + N1hwcompute)

And here is the potential catch. When there are many arguments to move between software and hardware, N1hwswcomms becomes large, and potentially bigger than N1stackops. If that happens, the gain obtained from replacing N1swcompute with N1hwcompute can be lost!

Application

We’ll derive the factors N1stackops, N1swcompute, N1hwswcomms and N1hwcompute for the example of the hardware multiplier discussed above. We already know that (N1stackops + N1swcompute) equals 2,077 cycles, and (N1hwswcomms + N1hwcompute) equals 84 cycles.

Looking at the assembly code, it’s not always straightforward to find a clean partitioning between N1stackops and N1swcompute, especially when the compiler has already optimized the function. In this case we find

;    TimerLap();
;    sw_tt = mymul(arga, argb);
;    sw_ct = TimerLap();
;
; is implemented as:

    f922: 84 12         call  r4        ;         call TimerLap()

    f924: 18 41 0e 00   mov 14(r1), r8  ;0x0000e  prepare arg for __mspabi_mpyll
    f928: 19 41 10 00   mov 16(r1), r9  ;0x00010  prepare arg for __mspabi_mpyll
    f92c: 1c 41 12 00   mov 18(r1), r12 ;0x00012  prepare arg for __mspabi_mpyll
    f930: 1d 41 14 00   mov 20(r1), r13 ;0x00014  prepare arg for __mspabi_mpyll
    f934: 4e 43         clr.b r14   ;             prepare arg for __mspabi_mpyll
    f936: 0f 4e         mov r14,  r15 ;           prepare arg for __mspabi_mpyll
    f938: 0a 4e         mov r14,  r10 ;           prepare arg for __mspabi_mpyll
    f93a: 0b 4e         mov r14,  r11 ;           prepare arg for __mspabi_mpyll
    f93c: b0 12 ae fa   call  #-1362    ;#0xfaae  call __mspabi_mpyll

    f940: 08 4c         mov r12,  r8  ;           prepare return
    f942: 09 4d         mov r13,  r9  ;           prepare return
    f944: 0a 4e         mov r14,  r10 ;           prepare return
    f946: 81 4f 02 00   mov r15,  2(r1) ;         prepare return

    f94a: 84 12         call  r4    ;             call TimerLap()

In this case, we conclude that the work done between the two TimerLap() calls is almost exclusively related to the ‘compute’ part, which here consists of a call to __mspabi_mpyll. There are four instructions that handle the return value in R15:R14:R13:R12. So we conclude that N1stackops ~ (overhead of TimerLap() + 4 mov instructions). The 2,077 cycles for N1sw are dominated by N1swcompute, and we’ll approximate N1stackops by 20 clock cycles.

Next, we look at the hardware function call.

;    TimerLap();
;    hw_tt = mymul_hw(arga, argb);
;    hw_ct = TimerLap();
;
; is implemented as:

    f950: 84 12         call  r4    ;              call TimerLap()
    f952: 1c 41 0e 00   mov 14(r1), r12 ;0x0000e   prepare arg for mymul_hw
    f956: 1d 41 10 00   mov 16(r1), r13 ;0x00010   prepare arg for mymul_hw
    f95a: 1e 41 12 00   mov 18(r1), r14 ;0x00012   prepare arg for mymul_hw
    f95e: 1f 41 14 00   mov 20(r1), r15 ;0x00014   prepare arg for mymul_hw
    f962: b0 12 46 f8   call  #-1978    ;#0xf846   call mymul_hw
    f966: 81 4c 08 00   mov r12,  8(r1) ;          prepare return
    f96a: 81 4d 0a 00   mov r13,  10(r1)  ; 0x000a prepare return
    f96e: 81 4e 0c 00   mov r14,  12(r1)  ; 0x000c prepare return
    f972: 81 4f 00 00   mov r15,  0(r1) ;          prepare return
    f976: 84 12         call  r4    ;              call TimerLap()

0000f846 <mymul_hw>:
    f846: 82 4c 40 01   mov r12,  &0x0140 ;         write into memmap reg
    f84a: 82 4d 42 01   mov r13,  &0x0142 ;         write into memmap reg
    f84e: 82 4e 44 01   mov r14,  &0x0144 ;         write into memmap reg
    f852: 82 4f 46 01   mov r15,  &0x0146 ;         write into memmap reg
    f856: 3c 40 50 01   mov #336, r12 ;#0x0150
    f85a: 9c 43 00 00   mov #1, 0(r12)  ;r3 As==01  write into memmap reg
    f85e: 8c 43 00 00   mov #0, 0(r12)  ;r3 As==00  write into memmap reg
    f862: 1c 42 48 01   mov &0x0148,r12 ;0x0148     read from memmap reg
    f866: 1d 42 4a 01   mov &0x014a,r13 ;0x014a     read from memmap reg
    f86a: 1e 42 4c 01   mov &0x014c,r14 ;0x014c     read from memmap reg
    f86e: 1f 42 4e 01   mov &0x014e,r15 ;0x014e     read from memmap reg
    f872: 30 41         ret     

In implementing the hardware function call, the compiler decided not to inline the function but rather to generate a separate function call. This introduces a considerable amount of data movement. Between the two TimerLap() calls, there are 8 mov instructions and a function call. Inside mymul_hw, there are 10 additional mov operations to and from memory-mapped registers, a constant move, and a return instruction. All of these make up N1hwswcomms. The actual hardware execution, N1hwcompute, is a single clock cycle that runs in parallel with one of the software instructions; we can reasonably approximate it as 1 clock cycle.

Factor           Software   Hardware-accelerated   Remark
Communications   20         83
Computations     2057       1                      Ideal Speedup: 2057/1 = 2057x !
Total            2077       84                     Real Speedup: 2077/84 ~ 25x


We summarize the conclusions in the table above. The ideal speedup of the hardware-accelerated design is (2057/1) > 2000 times. However, the real speedup is only (2077/84) ~ 25 times! The integration of the hardware-accelerated function costs (83/20) ~ 4 times more than the integration of the original software function.

The start-done model of coprocessing

So far, we have concentrated the discussion on the overhead of moving data between the software and the hardware. The hardware used a very simple control model, in which a bit-flip is sufficient to trigger a computation. The software driver for such a simple trigger-driven coprocessor works as follows.

  1. Copy coprocessor input arguments from software to hardware
  2. Write a 1 into a control register for the coprocessor
  3. Read the coprocessor output result back into software

When the execution time of the hardware increases, this model no longer holds. We can make the software ‘wait’ for the hardware by adding a status flag: a memory-mapped register set by the hardware and read by the software to signal completion. This is the ‘start-done’ model of coprocessing. The software driver for such a coprocessor works as follows.

  1. Copy coprocessor input arguments from software to hardware
  2. Write a 1 into a control register for the coprocessor
  3. Wait for the status register of the coprocessor to turn 1
  4. Read the coprocessor output result back into software

The start-done model can be implemented in a variety of ways. The essence is that the software can create a pulse to signal the hardware to start processing, and conversely that the hardware can create a pulse to signal the software that the hardware processing is complete.

Some implementation alternatives for the start pulse and done pulse include the following.

Scheme               Hardware                                              Software
Dedicated Start Bit  Detect a 0->1 transition in a memory-mapped register  Write a 0 followed by a 1
Implicit Start       Decode a write to a specific address                  Write to a specific address
Dedicated Done Bit   Set a memory-mapped register bit to 1                 Spin on a read, waiting for a 1-bit
Done Interrupt       Pull the interrupt request line                       Complete processing in an Interrupt Service Routine

Conclusions