Lecture 8 - MSP430 Hardware Multiply Coprocessor
- Introduction
- Designing the Hardware Function Call
- Running the example
- Overhead factors in hardware acceleration
- The start-done model of coprocessing
- Conclusions
Introduction
In this lecture we will design an MSP430 coprocessor for 64-bit multiplication. We will investigate the speedup of the hardware implementation over software, and we will analyze the sources of overhead in the design of such a coprocessor.
The function that we’ll accelerate is shown below. It is a 64-bit (unsigned) multiply. The 32-bit arguments a and b are each passed in 2 registers, and the 64-bit result occupies 4 registers. Because of the cast, the compiler widens both arguments to 64 bits and performs a 64-bit times 64-bit multiply with a 64-bit result.
unsigned long long mymul(unsigned long a, unsigned long b) {
  unsigned long long r;
  r = (unsigned long long) a * b;
  return r;
}
Indeed, the assembly listing of this function looks as follows. The two arguments a and b are passed down as R13:R12 and R15:R14 respectively. The compiler turns these into two 64-bit arguments using zero extension as R11:R10:R9:R8 and R15:R14:R13:R12 respectively.
0000f828 <mymul>:
f828: 0a 12 push r10 ;
f82a: 09 12 push r9 ;
f82c: 08 12 push r8 ;
f82e: 08 4c mov r12, r8 ;
f830: 0c 4e mov r14, r12 ;
f832: 09 4d mov r13, r9 ;
f834: 0d 4f mov r15, r13 ;
f836: 4e 43 clr.b r14 ;
f838: 0f 4e mov r14, r15 ;
f83a: 0a 4e mov r14, r10 ;
f83c: 0b 4e mov r14, r11 ;
f83e: b0 12 ae fa call #-1362 ;#0xfaae
f842: 30 40 fe fb br #0xfbfe ;
0000faae <__mspabi_mpyll>:
faae: 0a 12 push r10 ;
...
Interestingly, the compiler inserts a call to __mspabi_mpyll which is, according to the MSP430 ABI, a signed 64-bit multiply (I suspect this is a compiler bug; the unsigned 64-bit multiply, a different function in the MSP430 ABI, is not generated by the compiler and appears to be missing).
In any case, our objective is to build a hardware version of a 32-bit by 32-bit unsigned multiply with a 64-bit result. Recalling the previous lecture, there are two aspects to the design of such a hardware multiplier. First, the arguments have to be moved from the software to the hardware coprocessor, and the result has to be retrieved from the hardware coprocessor back into the software. Next, the coprocessor’s execution has to be synchronized to the execution of the software.
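The cast inside mymul is essential: in C, multiplying two 32-bit operands yields a 32-bit product before any widening takes place. The following host-side sketch (using stdint types for clarity; these helper names are ours, not part of the example code) illustrates the difference:

```c
#include <stdint.h>
#include <assert.h>

/* Without the cast: the 32x32 multiply produces a 32-bit product,
   which is only widened to 64 bits afterwards. High bits are lost. */
uint64_t mul_truncating(uint32_t a, uint32_t b) {
    return a * b;
}

/* With the cast: one operand is widened first, so the multiply
   itself is a full 32x32 -> 64-bit operation. */
uint64_t mul_full(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}
```

For example, mul_truncating(0xFFFFFFFF, 2) returns 0xFFFFFFFE, while mul_full(0xFFFFFFFF, 2) returns the correct 0x1FFFFFFFE.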
Designing the Hardware Function Call
Let’s start by considering the hardware interface to hold the arguments and the result.
Data | Type | Read Access (HW to SW) | Write Access (SW to HW) | # Address Loc |
---|---|---|---|---|
a | 32-bit | n | y | 2 |
b | 32-bit | n | y | 2 |
retval | 64-bit | y | n | 4 |
The first design decision is to pick addresses for the above set of variables. The most straightforward way is to map each variable at its own register address. This is especially useful for the 32-bit and 64-bit registers, since the compiler will automatically generate write and read operations to consecutive addresses for long variables. Furthermore, we pick a different address value for each of a, b and the return value. This is wasteful of address space, but on the other hand it makes the design easier to understand for a software programmer. We therefore use the memory map shown below. The addresses are organized in little-endian order, so that, for example, the least significant byte of a will be at address 0x140 and the most significant byte of a will be at 0x143.
Data | Byte Address | Word Address |
---|---|---|
a | 0x140 - 0x143 | 0xA0 - 0xA1 |
b | 0x144 - 0x147 | 0xA2 - 0xA3 |
retval | 0x148 - 0x14F | 0xA4 - 0xA7 |
ctl | 0x150 | 0xA8 |
In addition, we need to design the control interface for the coprocessor. The control interface uses a very simple command set that fits in a single control register. When the lsb of that control register changes from 0 to 1, the multiplication is computed in hardware at exactly that clock cycle. To support this control functionality, we add a single register ctl that will be used to run the coprocessor. That register is mapped at address 0x150. Furthermore, we won’t add any status signals to the coprocessor, so there is no need for a status register.
We are now ready to design the coprocessor. The following is an example design that uses a single-cycle multiplication in hardware.
module mymul (
    output [15:0] per_dout,
    input         mclk,
    input  [13:0] per_addr,
    input  [15:0] per_din,
    input         per_en,
    input  [1:0]  per_we,
    input         puc_rst
);

  reg [31:0] hw_a;
  reg [31:0] hw_b;
  reg [63:0] hw_retval;
  reg        hw_ctl;
  reg        hw_ctl_old;

  wire [63:0] mulresult;
  wire write_alo, write_ahi;
  wire write_blo, write_bhi;
  wire write_retval;
  wire write_ctl;

  always @(posedge mclk or posedge puc_rst)
    if (puc_rst)
      begin
        hw_a       <= 32'h0;
        hw_b       <= 32'h0;
        hw_retval  <= 64'h0;
        hw_ctl     <= 1'h0;
        hw_ctl_old <= 1'h0;
      end
    else
      begin
        hw_a[15: 0] <= write_alo ? per_din[15:0] : hw_a[15: 0];
        hw_a[31:16] <= write_ahi ? per_din[15:0] : hw_a[31:16];
        hw_b[15: 0] <= write_blo ? per_din[15:0] : hw_b[15: 0];
        hw_b[31:16] <= write_bhi ? per_din[15:0] : hw_b[31:16];
        hw_retval   <= write_retval ? mulresult : hw_retval;
        hw_ctl      <= write_ctl ? per_din[0] : hw_ctl;
        hw_ctl_old  <= hw_ctl;
      end

  assign mulresult = hw_a * hw_b;

  assign write_alo = (per_en & (per_addr == 14'hA0) & (per_we == 2'h3));
  assign write_ahi = (per_en & (per_addr == 14'hA1) & (per_we == 2'h3));
  assign write_blo = (per_en & (per_addr == 14'hA2) & (per_we == 2'h3));
  assign write_bhi = (per_en & (per_addr == 14'hA3) & (per_we == 2'h3));
  assign write_ctl = (per_en & (per_addr == 14'hA8) & per_we[0] & per_we[1]);

  assign write_retval = ((hw_ctl == 1'h1) & (hw_ctl ^ hw_ctl_old));

  assign per_dout = (per_en & (per_addr == 14'hA4) & (per_we == 2'h0)) ? hw_retval[15: 0] :
                    (per_en & (per_addr == 14'hA5) & (per_we == 2'h0)) ? hw_retval[31:16] :
                    (per_en & (per_addr == 14'hA6) & (per_we == 2'h0)) ? hw_retval[47:32] :
                    (per_en & (per_addr == 14'hA7) & (per_we == 2'h0)) ? hw_retval[63:48] : 16'h0;

endmodule
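Since per_dout is a 16-bit bus, the software reads the 64-bit result as four consecutive words, least significant word first. The recombination can be sketched in host-side C (an illustration; the helper name is ours):

```c
#include <stdint.h>
#include <assert.h>

/* Model of reading the 64-bit result over the 16-bit peripheral bus:
   four word reads at consecutive word addresses, least significant
   word first (little-endian order). */
static uint64_t read_retval_words(const uint16_t w[4]) {
    return  (uint64_t)w[0]
         | ((uint64_t)w[1] << 16)
         | ((uint64_t)w[2] << 32)
         | ((uint64_t)w[3] << 48);
}
```

For instance, the words {0x6618, 0x89AB, 0x0C37, 0x0000} recombine into 0x00000C3789AB6618.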
The integration of this hardware coprocessor in the system is easy; the coprocessor module maps as a peripheral on the peripheral bus. As a recap, the integration includes the following steps.
- Copy the verilog file of the coprocessor design as a new module in hardware/msp430de1soc/msp430.
- Modify the top-level design of the MSP430 system by adding the new coprocessor to hardware/msp430de1soc/msp430/toplevel.v.
- Add the new coprocessor verilog file to the list of files in the Quartus project constraints file (.qsf file extension).
After the hardware is designed and integrated, we can develop a software driver for the coprocessor. The following illustrates a sample driver.
#define HW_A      (*(volatile unsigned long *)      0x140)
#define HW_B      (*(volatile unsigned long *)      0x144)
#define HW_RETVAL (*(volatile unsigned long long *) 0x148)
#define HW_CTL    (*(volatile unsigned *)           0x150)

unsigned long long mymul_hw(unsigned long a, unsigned long b) {
  HW_A   = a;
  HW_B   = b;
  HW_CTL = 1;
  HW_CTL = 0;
  return HW_RETVAL;
}
To verify that the right addresses are used, it’s a good idea to consult the assembly listing of the software driver. This assembly listing also illustrates the ‘overhead’ of such a hardware driver. Whereas the hardware multiplication completes in a single clock cycle, moving the data around involves 11 move instructions, a function call, and a return.
0000f846 <mymul_hw>:
f846: 82 4c 40 01 mov r12, &0x0140 ;
f84a: 82 4d 42 01 mov r13, &0x0142 ;
f84e: 82 4e 44 01 mov r14, &0x0144 ;
f852: 82 4f 46 01 mov r15, &0x0146 ;
f856: 3c 40 50 01 mov #336, r12 ;#0x0150
f85a: 9c 43 00 00 mov #1, 0(r12) ;r3 As==01
f85e: 8c 43 00 00 mov #0, 0(r12) ;r3 As==00
f862: 1c 42 48 01 mov &0x0148,r12 ;0x0148
f866: 1d 42 4a 01 mov &0x014a,r13 ;0x014a
f86a: 1e 42 4c 01 mov &0x014c,r14 ;0x014c
f86e: 1f 42 4e 01 mov &0x014e,r15 ;0x014e
f872: 30 41 ret
Running the example
To run the example, proceed as follows.
First, clone the design example.
git clone https://github.com/vt-ece4530-f19/example-mul64mps430
Then, in a Cygwin window, compile the software
cd software
make compile
And in a NiosII command shell, compile the hardware
cd hardware/msp430de1soc
quartus_sh --flow compile msp430de1soc
Download the application (using a NiosII command shell)
cd loader
system-console -script connect.tcl -sof ../hardware/msp430de1soc/msp430de1soc.sof -bin ../software/mul64/mul64.bin
Finally, in a MobaXterm telnet connected to localhost:19800, you should see the results appear. This line says that the software implementation computes the same result as the hardware multiplier (for one particular test vector). Furthermore, it says that the software requires roughly 0x81D cycles (2,077 cycles) for a 64-bit multiply, while the hardware-accelerated version needs 0x54 cycles (84 cycles). That is a speedup of almost 25 times! The downside, however, is that this speedup comes at a considerable overhead. Our 64-bit multiplier requires only a single cycle to complete, but the software integration costs 84 cycles. Hence, we suffered an 84x integration overhead for only a 25x speedup in performance!
SW 00000C3789AB6618 081D HW 00000C3789AB6618 0054
| | | |
SW result SW cycles HW result HW cycles
This requires further investigation.
Overhead factors in hardware acceleration
The ‘hardware function call’ model of acceleration replaces a software function with a hardware version of that function.
The Ideal Speedup
Assume that the original (software-only) cycle count of the program is as follows. The software function executes M times, and requires N1 cycles per call. The rest of the software (the part that will not be hardware-accelerated) takes N0 cycles. Then the complete task in software takes NT cycles, given by
NT = N0 + M.N1
The maximum (ideal) speedup that can be obtained, is when we could reduce N1 to 0 cycles. In that case the speedup for the optimized solution is given by
SU = (N0 + M.N1) / N0
Some observations we can make are as follows.
- The speedup SU is determined by the portion of the program that can be accelerated by the function. Assume, for example, that in the original program M.N1 constitutes k = 0.9 (90%) of the execution time; then the maximum speedup would be SU = 1 / (1-k) = 10. This insight is also known as Amdahl’s law.
- When M is large, the proportion M.N1 will quickly become the dominating portion of the program execution time. This means that the most intensively used software functions are the best candidates for speedup. The clock cycles of the software function effectively ‘weigh’ M times heavier in the overall cycle count: every clock cycle you shave off that software function gains you M cycles overall.
- To design the most effective accelerator, we should therefore be looking for the most intensively used function.
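The ideal-speedup formula is easy to evaluate numerically. The helper below is a sketch (the function name and parameters are ours, following the symbols N0, M, N1 from the text), reproducing the Amdahl example of k = 0.9:

```c
#include <assert.h>

/* Ideal speedup when the accelerated portion M.N1 drops to zero cycles:
   SU = (N0 + M.N1) / N0, equivalently SU = 1 / (1-k) with
   k = M.N1 / (N0 + M.N1). */
static double ideal_speedup(double n0, double m, double n1) {
    return (n0 + m * n1) / n0;
}
```

With N0 = 100 and M.N1 = 900 (so k = 0.9), ideal_speedup(100, 900, 1) evaluates to 10, matching the Amdahl example above.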
The Real Speedup
The real speedup is less than ideal. In reality, we are replacing one function in software with a hybrid version that runs partly in software (for the interfacing) and partly in hardware. So we are replacing N1sw cycles with N1hw cycles, where we reasonably expect N1hw to be much smaller than N1sw. The speedup is then given by
SU = (N0sw + M.N1sw) / (N0sw + M.N1hw)
So the speedup is limited by our ability to accelerate N1sw into N1hw. If M is very large, the speedup is dominated by the ratio of these two terms. Let’s break out each of N1sw and N1hw into their respective components.
The execution time N1sw contains two components.
- N1stackops, the time needed to manipulate the stack and implement the function call semantics. This includes, for example:
  - Caller activity: prepare the arguments on the stack or in registers, preserve caller-saved storage, call the function.
  - Callee activity: prepare the return arguments on the stack or in registers, preserve callee-saved storage, return from the function.
- N1swcompute, the time needed for the actual computations.
The execution time N1hw also contains two components.
- N1hwswcomms, the time needed to get the arguments to the hardware module and the result back. Note that some of this time goes to operations very similar to N1stackops: the hardware function call needs to build a stack frame as well, and prepare all the arguments in registers (or on the stack) before they can be moved to memory-mapped registers.
- N1hwcompute, the time needed for the actual computations.
Hence we find as speedup
SU = (N0sw + M.(N1stackops + N1swcompute)) / (N0sw + M.(N1hwswcomms + N1hwcompute))
For large M, we therefore find the following
SU = (N1stackops + N1swcompute) / (N1hwswcomms + N1hwcompute)
And here is the potential catch. When there are many arguments to move between software and hardware, N1hwswcomms becomes large, and potentially bigger than N1stackops. If that happens, the gain obtained from replacing N1swcompute with N1hwcompute can be lost!
Application
We’ll derive the factors N1stackops, N1swcompute, N1hwswcomms and N1hwcompute for the example of the hardware multiplier discussed above. We already know that (N1stackops + N1swcompute) equals 2,077 cycles, and (N1hwswcomms + N1hwcompute) equals 84 cycles.
Looking at the assembly code, it’s not always straightforward to find a clear partitioning between N1stackops and N1swcompute, especially if the compiler has already optimized the function. In this case we find
; TimerLap();
; sw_tt = mymul(arga, argb);
; sw_ct = TimerLap();
;
; is implemented as:
f922: 84 12 call r4 ; call TimerLap()
f924: 18 41 0e 00 mov 14(r1), r8 ;0x0000e prepare arg for __mspabi_mpyll
f928: 19 41 10 00 mov 16(r1), r9 ;0x00010 prepare arg for __mspabi_mpyll
f92c: 1c 41 12 00 mov 18(r1), r12 ;0x00012 prepare arg for __mspabi_mpyll
f930: 1d 41 14 00 mov 20(r1), r13 ;0x00014 prepare arg for __mspabi_mpyll
f934: 4e 43 clr.b r14 ; prepare arg for __mspabi_mpyll
f936: 0f 4e mov r14, r15 ; prepare arg for __mspabi_mpyll
f938: 0a 4e mov r14, r10 ; prepare arg for __mspabi_mpyll
f93a: 0b 4e mov r14, r11 ; prepare arg for __mspabi_mpyll
f93c: b0 12 ae fa call #-1362 ;#0xfaae call __mspabi_mpyll
f940: 08 4c mov r12, r8 ; prepare return
f942: 09 4d mov r13, r9 ; prepare return
f944: 0a 4e mov r14, r10 ; prepare return
f946: 81 4f 02 00 mov r15, 2(r1) ; prepare return
f94a: 84 12 call r4 ; call TimerLap()
In this case, we conclude that the work done between the two TimerLap() calls is almost exclusively related to the ‘compute’ part, which here consists of a call to __mspabi_mpyll. There are four instructions that prepare the return value from R12:R15. So we conclude that N1stackops ~ (overhead of TimerLap() + 4 mov instructions). The 2,077 cycles for N1sw are thus dominated by N1swcompute, and we’ll approximate N1stackops by 20 clock cycles.
Next, we look at the hardware function call.
; TimerLap();
; hw_tt = mymul_hw(arga, argb);
; hw_ct = TimerLap();
;
; is implemented as:
f950: 84 12 call r4 ; call TimerLap()
f952: 1c 41 0e 00 mov 14(r1), r12 ;0x0000e prepare arg for mymul_hw
f956: 1d 41 10 00 mov 16(r1), r13 ;0x00010 prepare arg for mymul_hw
f95a: 1e 41 12 00 mov 18(r1), r14 ;0x00012 prepare arg for mymul_hw
f95e: 1f 41 14 00 mov 20(r1), r15 ;0x00014 prepare arg for mymul_hw
f962: b0 12 46 f8 call #-1978 ;#0xf846 call mymul_hw
f966: 81 4c 08 00 mov r12, 8(r1) ; prepare return
f96a: 81 4d 0a 00 mov r13, 10(r1) ; 0x000a prepare return
f96e: 81 4e 0c 00 mov r14, 12(r1) ; 0x000c prepare return
f972: 81 4f 00 00 mov r15, 0(r1) ; prepare return
f976: 84 12 call r4 ; call TimerLap()
0000f846 <mymul_hw>:
f846: 82 4c 40 01 mov r12, &0x0140 ; write into memmap reg
f84a: 82 4d 42 01 mov r13, &0x0142 ; write into memmap reg
f84e: 82 4e 44 01 mov r14, &0x0144 ; write into memmap reg
f852: 82 4f 46 01 mov r15, &0x0146 ; write into memmap reg
f856: 3c 40 50 01 mov #336, r12 ;#0x0150
f85a: 9c 43 00 00 mov #1, 0(r12) ;r3 As==01 write into memmap reg
f85e: 8c 43 00 00 mov #0, 0(r12) ;r3 As==00 write into memmap reg
f862: 1c 42 48 01 mov &0x0148,r12 ;0x0148 read from memmap reg
f866: 1d 42 4a 01 mov &0x014a,r13 ;0x014a read from memmap reg
f86a: 1e 42 4c 01 mov &0x014c,r14 ;0x014c read from memmap reg
f86e: 1f 42 4e 01 mov &0x014e,r15 ;0x014e read from memmap reg
f872: 30 41 ret
In implementing the hardware function call, the compiler decided not to inline the function but rather to implement a separate function call. This has introduced a considerable amount of data movement. Between the two TimerLap() calls, there are 8 mov instructions and a function call. Inside of mymul_hw, there are 10 additional mov operations to and from memory-mapped registers, a constant move, and a return instruction. All of these make up N1hwswcomms. The actual hardware execution, N1hwcompute, is in fact a single clock cycle that runs in parallel with one of the software instructions. However, we can reasonably approximate it as 1 clock cycle.
We summarize the conclusions in the following table.

Factor | Software | Hardware-accelerated | Remark |
---|---|---|---|
Communications | 20 | 83 | |
Computations | 2057 | 1 | Ideal Speedup: 2057/1 = 2057x ! |
Total | 2077 | 84 | Real Speedup: 2077/84 ~ 25x … |
The ideal speedup of the hardware-accelerated design is (2057/1), more than 2000 times.
However, the real speedup is only (2077/84) ~ 25 times! The integration
of the hardware-accelerated function costs (83/20) ~ 4 times more than
the integration of the original software functionality.
The start-done model of coprocessing
So far, we concentrated the discussion on the overhead of moving data between the software and the hardware. The hardware used a very simple control model, in which a bit-flip is sufficient to trigger a computation. The software driver for such a simple trigger-driven coprocessor is as follows.
- Copy coprocessor input arguments from software to hardware
- Write a 1 into a control register for the coprocessor
- Read the coprocessor output result back into software
When the execution time of the hardware increases, this model no longer holds. We can make the software ‘wait’ for the hardware by adding a status flag - a memory-mapped register set by the hardware and read by the software to signal completion. This is the ‘start-done’ model of coprocessing. The software driver for such a coprocessor is as follows.
- Copy coprocessor input arguments from software to hardware
- Write a 1 into a control register for the coprocessor
- Wait for the status register of the coprocessor to turn 1
- Read the coprocessor output result back into software
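The steps above can be sketched in C. Since no real hardware is available here, plain variables stand in for the memory-mapped registers and a hw_step() function models the coprocessor ticking toward completion; all names are hypothetical:

```c
#include <stdint.h>
#include <assert.h>

/* Fake "hardware": globals stand in for memory-mapped registers. */
static uint8_t  sim_ctl, sim_status;
static uint32_t sim_a, sim_b;
static uint64_t sim_retval;
static int      sim_busy_cycles;     /* models a multi-cycle coprocessor */

/* One clock tick of the fake coprocessor: while started and not done,
   count down, then latch the result and raise the done flag. */
static void hw_step(void) {
    if (sim_ctl && !sim_status) {
        if (--sim_busy_cycles <= 0) {
            sim_retval = (uint64_t)sim_a * sim_b;
            sim_status = 1;          /* done flag raised by hardware */
        }
    }
}

/* Start-done driver: copy arguments, start, spin on the done flag,
   read the result back. A real driver would simply re-read a
   memory-mapped status register in the loop. */
static uint64_t mymul_startdone(uint32_t a, uint32_t b) {
    sim_a = a; sim_b = b;            /* 1. copy arguments to hardware  */
    sim_busy_cycles = 3;
    sim_status = 0;
    sim_ctl = 1;                     /* 2. start pulse                 */
    while (!sim_status)              /* 3. wait for done flag          */
        hw_step();
    sim_ctl = 0;
    return sim_retval;               /* 4. read back the result        */
}
```

The spin-wait in step 3 is the essential difference from the trigger-driven driver: the software no longer assumes the result is ready on the very next instruction.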
The start-done model can be implemented in a variety of ways. The essence is that the software can create a pulse to signal the hardware to start processing, and conversely that the hardware can create a pulse to signal the software that the hardware processing is complete.
Some implementation alternatives for the start pulse and done pulse include the following.
Scheme | Hardware | Software |
---|---|---|
Dedicated Start Bit | Detect a 0->1 transition in a memory-mapped register | Write a 0 followed by a 1 |
Implicit Start | Decode a write to a specific address | Write to a specific address |
Dedicated Done Bit | Set a memory-mapped register bit to 1 | Spin on a read operation waiting for a 1-bit |
Done Interrupt | Pull interrupt request | Complete processing in an Interrupt Service Routine |