Introduction

To a programmer, the easiest way to work with a coprocessor is through a synchronous model, which makes the coprocessor look like a function call. Let’s say we have a coprocessor that takes two input arguments and produces one result, all of them integers. You could model the execution of the coprocessor as a call to a function my_func. The function call takes the two input arguments and a pointer for the result, and it hides how the arguments are routed from the software to the hardware, how the coprocessor is run, and how the result is retrieved.

      main() {
         ...
         my_func(&out, in1, in2);   => build my_func as a hardware module
         ...
      }

If we break down such a function call into smaller steps, it’s clear that there are two aspects to the design of such a function call: a data aspect and a control aspect.

  • The data aspect is concerned with moving data. The arguments of the hardware function call have to be moved from software to hardware, typically by writing into memory-mapped registers. Furthermore, the results from the hardware function have to be retrieved back from hardware into software, by reading from memory-mapped registers.

  • The control aspect is concerned with ensuring that the sequence of operations between software and hardware is maintained. Indeed, from a system perspective, the function my_func is called at a specific point in the execution of main. A hardware module that implements this function call, on the other hand, runs fully in parallel with the processor (which runs the main function). There has to be a control signal from the processor to the hardware coprocessor so that the coprocessor starts to work at the right time. Furthermore, there has to be a status signal from the coprocessor back to the processor to indicate that the coprocessor has completed its operation. A sketch of both aspects follows this list.
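As a preview, here is a sketch of how both aspects could look in C for a hypothetical memory-mapped coprocessor. The register names, the addresses, and the start/done protocol are assumptions for illustration only; the real interface is the topic of the next lecture.

#include <stdint.h>

/* Hypothetical memory-mapped coprocessor registers; addresses and the
 * start/done protocol are assumptions for illustration only.          */
#define COPRO_IN1   (*(volatile uint16_t *) 0x0190)
#define COPRO_IN2   (*(volatile uint16_t *) 0x0192)
#define COPRO_OUT   (*(volatile uint16_t *) 0x0194)
#define COPRO_START (*(volatile uint16_t *) 0x0196)
#define COPRO_DONE  (*(volatile uint16_t *) 0x0198)

void my_func(int *out, int in1, int in2) {
    COPRO_IN1   = in1;          /* data aspect: move the arguments to hardware   */
    COPRO_IN2   = in2;
    COPRO_START = 1;            /* control aspect: tell the coprocessor to start */
    while (COPRO_DONE == 0)     /* control aspect: wait for the status signal    */
        ;
    *out = COPRO_OUT;           /* data aspect: retrieve the result              */
}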

In this lecture, we will study the semantics of a software function call. That is, we will study how software programmers solve the data aspect and the control aspect of a function call when software is calling software. This exercise will give us the following insights.

  1. First, we will learn how a compiler implements function calls. As a result, we will also better understand the overhead associated with a function call. The overhead is the additional work that the processor must perform in order to provide the comfort of a function call to the programmer.

  2. Second, once we understand how a software function call works, we will have a basis for building a hardware function call using memory-mapped registers. In the next lecture, we will define a data semantics and a control semantics for hardware coprocessors, such that we can call them from within a C program.

Now, let’s take a closer look at the implementation of software function calls on the MSP430.

The data semantics of function calls on MSP430

We’ll start by studying a (non-optimized) function call on the MSP430. The MSP430 Application Binary Interface contains important details regarding the calling conventions. The default register organization is shown in the following table. Registers R4-R10 are callee-saved, which means that functions must preserve them across a function call. Registers R12-R15 are used to pass arguments into a function and to retrieve return values. As each register is 16 bits, up to 64 bits can be passed in one function call: four 16-bit variables (unsigned), two 32-bit variables (unsigned long), or one 64-bit variable (unsigned long long).

Register  Alias  Callee Saved  Role
R0        PC                   Program Counter
R1        SP     yes           Stack Pointer
R2        SR                   Status Register
R3        CG                   Constant Generator
R4               yes
R5               yes
R6               yes
R7               yes
R8               yes
R9               yes
R10              yes
R11              no
R12              no            arg1 & return1
R13              no            arg2 & return2
R14              no            arg3 & return3
R15              no            arg4 & return4


The registers R12 to R15 are allocated in sequence. For example, a function that takes two integer arguments will use R12 for the first argument and R13 for the second. If the function takes two long arguments, then R13:R12 is used for the first argument and R15:R14 for the second; the lowest-numbered register of each pair holds the least significant word. When more than four registers would be needed for the arguments, the compiler builds a data structure called a stack frame.
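For example, the following hypothetical prototypes illustrate the register assignment; the prototypes are made up, but the assignment follows the ABI rules described above.

/* Hypothetical prototypes, illustrating the MSPABI argument-register allocation */
unsigned      add16(unsigned a, unsigned b);
/* a -> R12, b -> R13, return value -> R12                                        */

unsigned long add32(unsigned long a, unsigned long b);
/* a -> R13:R12 (R12 = low word), b -> R15:R14 (R14 = low word),
   return value -> R13:R12                                                        */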

Consider the following function which implements a 32-bit multiply from two 16-bit arguments. We’ll compile this program in two different ways - with and without optimization - and we’ll compare the resulting assembly code.

unsigned long mymul(unsigned a, unsigned b) {
  unsigned long r;
  r = (unsigned long) a * b;
  return r;
}

unsigned a, b;
unsigned long r;

int main(void) {
  
  a = 35;
  b = 60;

  r = mymul(a, b);
  
  return 0;
}

Here is the compilation result with the -Os optimization flag (optimization for size). All examples so far have been using this flag:

/cygdrive/c/ti/msp430-gcc/bin/msp430-elf-gcc  \
      -I../hal                                \
      -Wall                                   \
      -Os                                     \ # Optimization for size
      -mmcu=msp430c1111                       \
      -Ic:/ti/msp430-gcc/include              \
      -c main.c                               \
      -o main.o

The input arguments of this function are stored in R12 (a) and R13 (b). Since the MSP430 does not have a multiply instruction, the compiler replaces the multiplication with a call to __mspabi_mpyl, a helper function that computes a 32-bit multiplication (see Helper Function API, Chapter 6 in the MSP430 Application Binary Interface).

    int32 __mspabi_mpyl(int32 x, int32 y);

This internal helper function __mspabi_mpyl takes two 32-bit integers as arguments. Therefore, the two 16-bit arguments in R12 and R13 have to be reorganized as two 32-bit arguments in R13:R12 and R15:R14. This explains the format of the assembly code. Finally, the return value of __mspabi_mpyl is a 32-bit integer, which is the same as the return value of mymul. Therefore, mymul can return directly after the call to __mspabi_mpyl, since the return value is already stored in R13:R12 as expected.
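In other words, the compiler treats the body of mymul roughly as if it had been written as the following C code. This is only a conceptual sketch: the helper call is generated directly in assembly, and we assume here that the ABI's 32-bit integer type maps to unsigned long.

/* Conceptual view of the code the compiler generates for mymul */
extern unsigned long __mspabi_mpyl(unsigned long x, unsigned long y);

unsigned long mymul(unsigned a, unsigned b) {
  /* widen both 16-bit arguments to 32 bits (R13:R12 and R15:R14),
     then call the helper; its result arrives in R13:R12          */
  return __mspabi_mpyl((unsigned long) a, (unsigned long) b);
}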

0000fc3a <mymul>:
    fc3a:       0e 4d           mov     r13,    r14     ;
    fc3c:       0f 43           clr     r15             ;
    fc3e:       0d 43           clr     r13             ;
    fc40:       b0 12 60 fc     call    #-928           ;#0xfc60
    fc44:       30 41           ret

0000fc60 <__mspabi_mpyl>:
    fc60:       0a 12           push    r10             ;
    ...

If we compile the program without the -Os flag, the assembly is more complicated. Without optimization, the compiler falls back to a not-so-smart-but-always-correct code generation strategy that always builds a stack frame. Hence, studying the assembly code of mymul compiled without optimization helps us understand how a stack frame works.

/cygdrive/c/ti/msp430-gcc/bin/msp430-elf-gcc  \
      -I../hal                                \
      -Wall                                   \
      -mmcu=msp430c1111                       \
      -Ic:/ti/msp430-gcc/include              \
      -c main.c                               \
      -o main.o

Take a moment to look over this code. While the code is not optimal, every instruction, every line, can be logically explained. The stack frame is a data structure that organizes every local variable, including the arguments, on the stack. The push r10 at the beginning and the pop r10 at the end are an example of saving and restoring a callee-saved register. The sub #8, r1 at the beginning and the add #8, r1 at the end make room on the stack (and release that room again) by moving the stack pointer. Eight bytes is enough room to store the two 16-bit arguments (a and b) and the 32-bit return value (r). The use of a stack frame does not change how the compiler passes arguments. Hence, the function mymul still has to be called with R12 holding a and R13 holding b, and it still returns the result in R13:R12. That creates additional move instructions between the stack frame and the registers.

0000fc3a <mymul>:
    fc3a:       0a 12           push    r10             ;
    fc3c:       31 82           sub     #8,     r1      ;r2 As==11
    fc3e:       81 4c 02 00     mov     r12,    2(r1)   ;
    fc42:       81 4d 00 00     mov     r13,    0(r1)   ;
    fc46:       1a 41 02 00     mov     2(r1),  r10     ;
    fc4a:       0c 4a           mov     r10,    r12     ;
    fc4c:       0d 43           clr     r13             ;
    fc4e:       2a 41           mov     @r1,    r10     ;
    fc50:       0e 4a           mov     r10,    r14     ;
    fc52:       0f 43           clr     r15             ;
    fc54:       b0 12 92 fc     call    #-878           ;#0xfc92
    fc58:       81 4c 04 00     mov     r12,    4(r1)   ;
    fc5c:       81 4d 06 00     mov     r13,    6(r1)   ;
    fc60:       1c 41 04 00     mov     4(r1),  r12     ;
    fc64:       1d 41 06 00     mov     6(r1),  r13     ;
    fc68:       31 52           add     #8,     r1      ;r2 As==11
    fc6a:       3a 41           pop     r10             ;
    fc6c:       30 41           ret

0000fc92 <__mspabi_mpyl>:
    fc92:       0a 12           push    r10             ;
...

Here is the stack frame for the function mymul.

Address  Content             Offset
A        Return Address
A-2      r10 (callee saved)
A-4      r, high word        6(r1)
A-6      r, low word         4(r1)
A-8      argument a          2(r1)
A-10     argument b          0(r1), @r1


In general, when a function is called with more arguments than fit in the four argument registers, the stack is also used to pass the remaining arguments.
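For example, a hypothetical function with six 16-bit arguments would receive the first four in R12 through R15, while the remaining two are passed on the stack by the caller; the precise stack layout is defined in the MSPABI.

/* Hypothetical prototype with six 16-bit arguments                 */
unsigned sum6(unsigned a, unsigned b, unsigned c,
              unsigned d, unsigned e, unsigned f);
/* a..d -> R12..R15; e and f are placed on the stack by the caller  */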

The stack frame offers two advantages compared to using registers. First, it enables a function to use an arbitrary number of arguments. Second, it supports recursion.
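The second advantage is easy to see with a recursive function: every active invocation needs its own copy of its local state, which a fixed set of registers cannot provide, but a fresh stack frame per call can. A minimal example:

/* Each call to fact pushes a new stack frame, so every active
 * invocation keeps its own copy of n and its own return address. */
unsigned long fact(unsigned n) {
  if (n <= 1)
    return 1;
  return n * fact(n - 1);
}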

The control semantics of function calls on MSP430

Consider next how the main function calls mymul. First, it prepares the function arguments. Next, it calls the function: the call instruction pushes the PC on the stack and jumps to the function. When mymul executes the return instruction, the PC is retrieved from the stack, and control returns to the instruction after the function call. Hence, the control semantics of a function call is organized using the stack as well.

unsigned a, b;
unsigned long r;

int main(void) {
  a = 35;
  b = 60;
  r = mymul(a, b);
  return 0;
}

/* assembly generated without -Os flag:
 *
 * 0000fc6e <main>:
 *   fc6e:       b2 40 23 00     mov     #35,    &0x0206 ;#0x0023
 *   fc72:       06 02
 *   fc74:       b2 40 3c 00     mov     #60,    &0x0200 ;#0x003c
 *   fc78:       00 02
 *   fc7a:       1c 42 06 02     mov     &0x0206,r12     ;0x0206
 *   fc7e:       1d 42 00 02     mov     &0x0200,r13     ;0x0200
 *   fc82:       b0 12 3a fc     call    #-966           ;#0xfc3a
 *   fc86:       82 4c 02 02     mov     r12,    &0x0202 ;
 *   fc8a:       82 4d 04 02     mov     r13,    &0x0204 ;
 *   fc8e:       4c 43           clr.b   r12             ;
 *   fc90:       30 41           ret
 */

The semantics of a hardware function call

Does the stack frame make sense when we call a coprocessor? Not at all! A coprocessor does not use the stack to pass arguments from the main function to the ‘hardware function’. Instead, the coprocessor uses memory-mapped registers. So we will have to revise how to pass arguments to the coprocessor.

Furthermore, we will have to devise our own ‘control semantics’, since the call instruction only works when we call a software function.

/* original design 
 *
 *  unsigned long mymul(unsigned a, unsigned b) {
 *    unsigned long r;
 *    r = (unsigned long) a * b;
 *    return r;
 *  }
 *
 */

unsigned long mymul_in_hw(unsigned a, unsigned b) {
   MEMORY_MAPPED_A = a;
   MEMORY_MAPPED_B = b;
   /* do the hardware magic here .. */
   return MEMORY_MAPPED_RLONG;
}
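Here, MEMORY_MAPPED_A, MEMORY_MAPPED_B, and MEMORY_MAPPED_RLONG stand for memory-mapped registers of the coprocessor. A minimal sketch of how they could be defined, with made-up addresses, looks as follows; the actual register map is part of the coprocessor design covered in the next lecture.

#include <stdint.h>

/* Hypothetical addresses, for illustration only */
#define MEMORY_MAPPED_A     (*(volatile uint16_t *) 0x01A0)
#define MEMORY_MAPPED_B     (*(volatile uint16_t *) 0x01A2)
#define MEMORY_MAPPED_RLONG (*(volatile uint32_t *) 0x01A4)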

In the next lecture, we will look at a detailed implementation of a hardware multiplication coprocessor using this concept, and we will compare its performance to the performance of the software implementation.

Conclusions

We studied the function call semantics of the MSP430 ABI. There is overhead associated with each function call: that overhead comes from moving data arguments and from passing control from one function to the next. Without optimization, function calls involve quite a bit of extra code generated by the compiler. Luckily, the compiler is able to remove most of the overhead when you use the appropriate optimization flag.