Lecture 5 - Performance Evaluation using Timers

Introduction
Making exact measurements
The TimerA module in openMSP430
Using TimerA to build a TimeStamp counter
Time stamping long intervals
Dealing with uncertainty in the target
Running the Examples
Conclusions

Introduction

So far, we have discussed the hardware and software flow of the MSP-430 microcontroller. By now, you can configure the microcontroller on a DE1-SoC kit and run applications written in C.

In this lecture we discuss how to evaluate the execution time of a software application. Accurate performance evaluation is crucial when hardware acceleration is our objective. We will discuss performance evaluation using hardware timers. Measuring elapsed time is a well-known method for establishing performance in computer systems. Here are several examples.

Intel processors offer the rdtsc instruction, which reads a 64-bit, monotonically increasing cycle counter. See the Time Stamp Counter Section in Chapter 17 of the Software Developer Manual.
Texas Instruments adds Time Stamp Counters as peripherals in its microcontrollers. See, for example, the Timer32 peripheral in the MSP432P401 microcontroller. This is a 32-bit, monotonically decreasing timer.
Another example is the 24-bit down-counting SysTick counter integrated in many ARM Cortex-M4F based microcontrollers, including the TIVA microcontroller series from Texas Instruments.

Let’s assume the case of a hardware counter that is used to determine the execution time of a section of software. The canonical measurement of the execution time goes as follows.

Optionally, reset the counter
Start the counter
Execute the software to be measured
Stop the counter
Read the counter value, determine the elapsed cycle count or execution time

We will discuss how to implement this scheme on an MSP-430 using a Timer peripheral, and we will talk about the typical quirks of counter-based performance measurement.

Making exact measurements

Before we start making measurements with software, let’s recall the following basic definitions.

The resolution of the counter defines the smallest time step that can be measured. It equals one tick of the hardware counter, and corresponds to one clock period of that hardware in absolute time. Resolution should be as high as needed, but there is no point in making the resolution much smaller than the measured target. When measuring execution time, counting processor clock cycles is a good pick, since each instruction takes at least one processor clock cycle to execute.
The range of the counter defines the longest time interval that can be measured. It equals the full range of the hardware counter, i.e. 2^n-1 in the case of an n-bit counter. If the range of the hardware counter is too narrow to cover the period of interest, then it must be extended in software by detecting hardware counter overflows.
The accuracy refers to the largest possible error that can occur while making the measurement. When using hardware counters to measure software execution time, we are interested in determining the execution time of the software as accurately as possible. There are several sources of overhead at the level of software, such as the overhead of reading the hardware counter or starting and stopping the hardware counter. We have to determine and eliminate this overhead.
The precision refers to the ability of repeatedly making the same measurement with the same accuracy, the same range and the same resolution. When measuring software, the execution time varies from one run to the other because of input data dependencies, and because of architectural effects. In many cases, it may be possible to control the input data of the software, thereby ensuring that the software performs in exactly the same manner under every run. However, dealing with architectural effects is harder, because these effects are generally outside of the control of the software. For example, the load/store performance in the memory hierarchy is highly dependent on the context of the software execution. In microcontrollers, shared resources such as busses can cause unexpected execution delays that affect the precision. Therefore, we will make every measurement several times, to verify the precision of every measurement.
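
To make the resolution and range definitions concrete, here is a minimal host-side sketch of the arithmetic. The helper names counter_range_ticks and ticks_to_ns are hypothetical, and the 50 MHz clock used in the note below is only an assumed example value, not a property of the lecture's platform.

```c
#include <stdint.h>

/* Largest count an n-bit counter can hold: 2^n - 1 (valid for 1 <= n <= 32). */
uint64_t counter_range_ticks(unsigned n_bits) {
  return ((uint64_t)1 << n_bits) - 1;
}

/* Convert a tick count into nanoseconds for a counter clocked at clock_hz.
   With ticks below 2^32 the intermediate product fits in 64 bits. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t clock_hz) {
  return (ticks * 1000000000ull) / clock_hz;
}
```

Under the assumed 50 MHz clock, ticks_to_ns(1, 50000000) gives a resolution of 20 ns, and ticks_to_ns(counter_range_ticks(16), 50000000) gives a range of about 1.3 ms for a 16-bit counter.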

An important caveat of the counter-based performance measurement method is that it is a dynamic method. It only measures one specific execution instance of a software program at a time. In cases where the execution time is highly dependent on the input data, one may need to resort to the average execution time, or the worst-case execution time. Another alternative is to determine the execution time statically, by analyzing the instructions of each program. For example, we can determine the execution time of each instruction and compute the worst-case and best-case execution time for programs. This only works for very simple applications.

The TimerA module in openMSP430

The openMSP430 platform has a 16-bit timer module that sits on the peripheral bus. The block diagram of the timer is shown in the Figure below.

Figure: TimerA 16-bit timer module

The timer is configured using memory-mapped registers accessible through software. The most important registers for our purpose are the counter TAR and the timer control register TACTL.

The bits of the control register TACTL, shown below, enable the selection of a prescaler, the selection of the count-mode, and the selection of the interrupt capability. For performance measurement, we are interested in operating the timer in a continuous up-counting mode. We may also be interested in capturing timer overflows, so that the fast-paced hardware counter can be extended in software with a slow-paced interrupt counter. We will run the timer at the system clock, so that one tick of TimerA corresponds to one CPU cycle.

Table: TACTL relevant bits

Bit	MACRO	Meaning
9-8	TASSEL1,TASSEL0	`00` = TACLK, `01` = ACLK, `10` = SMCLK, `11` = INCLK
7-6	ID1,0	`00` = /1, `01` = /2, `10` = /4, `11` = /8
5-4	MC1,0	`00` = stop, `01` = up (to TACCR0), `10` = continuous, `11` = up/down
2	TACLR	Resets TAR, clock divider and count direction
1	TAIE	TimerA interrupt enable
0	TAIFG	TimerA interrupt flag
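
As a rough host-side illustration of how these fields combine, the sketch below rebuilds the control word for continuous counting on SMCLK with a counter clear. The mask values are the standard Timer_A bit positions from the table. On the target they come from the device header rather than being redefined, and the helper names tactl_timestamp_config and tactl_count_mode are hypothetical.

```c
#include <stdint.h>

/* Bit masks for the TACTL fields listed in the table above. */
#define TASSEL1 0x0200u /* bit 9: together with TASSEL0 = 0 selects SMCLK */
#define TASSEL0 0x0100u /* bit 8 */
#define MC1     0x0020u /* bit 5: together with MC0 = 0 selects continuous mode */
#define MC0     0x0010u /* bit 4 */
#define TACLR   0x0004u /* bit 2: resets TAR, clock divider and count direction */
#define TAIE    0x0002u /* bit 1: TimerA interrupt enable */

/* Control word for a free-running timestamp counter:
   SMCLK source, divider /1, continuous mode, counter cleared. */
uint16_t tactl_timestamp_config(void) {
  return (uint16_t)(TASSEL1 | MC1 | TACLR);
}

/* Extract the 2-bit count-mode field (bits 5-4) from a TACTL value. */
unsigned tactl_count_mode(uint16_t tactl) {
  return (tactl >> 4) & 0x3u;
}
```

The resulting value 0x0224 is the same constant that appears in the bis instruction at the start of main in the assembly dump further on.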

Using TimerA to build a TimeStamp counter

We start with a simple test program to illustrate the operation of the timer. This code measures the execution time of a single addition to a volatile variable. The timer is configured for continuous counting, using SMCLK (system clock). Each time TimerLap is called, the timer is stopped, and the elapsed time is obtained as the difference between the current timer value (TAR) and the previous timer value (count). Before leaving TimerLap, the global variable count is updated to the current timestamp, and the timer is restarted.

#include "omsp_de1soc.h"

unsigned count = 0;

unsigned TimerLap() {
  unsigned lap;
  TACTL &= ~(MC1 | MC0);
  lap = TAR - count;
  count = TAR;
  TACTL |= MC1;
  return lap;
}

int main(void) {
  int k;
  volatile int c;

  TACTL  |= (TASSEL1 | MC1 | TACLR);
  de1soc_init();

  while (1) {
    TimerLap();
    c = c + 1;
    k = TimerLap();
  }
  LPM0;

  return 0;
}

The use of unsigned arithmetic in TimerLap guarantees that wrap-around corrections are automatically handled. Consider a sequence of calls to TimerLap which return the following values.

TAR	Count	TimerLap return
0	0	0
5680	0	5680
48300	5680	42620
62150	48300	13850
1450	62150	4836 = (65536 - 60700)
12000	1450	10550

As long as two calls to TimerLap are less than 2^16 = 65,536 cycles apart from one another, we are guaranteed a correct time measurement regardless of the initial value of TAR. We will extend TimerLap later to count wrap-arounds, so that longer intervals can be measured.
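
The wrap-around behaviour can be checked on a host machine with a small sketch. It assumes a 16-bit counter modelled with uint16_t, and the function name lap16 is hypothetical. On the MSP-430 an unsigned is already 16 bits wide, so no cast is needed there. On a host with 32-bit int, the subtraction is done in int and the cast back to uint16_t performs the modulo-65536 reduction.

```c
#include <stdint.h>

/* Elapsed ticks between two readings of a free-running 16-bit counter.
   The result is correct modulo 2^16, so a single wrap-around between
   the two readings is handled without any special case. */
uint16_t lap16(uint16_t now, uint16_t prev) {
  return (uint16_t)(now - prev);
}
```

For the wrapped row of the table, lap16(1450, 62150) evaluates to 4836, and for the last row lap16(12000, 1450) evaluates to 10550.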

Figure: TimerLap output

When we run this on openmsp430de and print the value of the variable k, we find that the execution of a single addition c = c + 1 takes 15 cycles in every iteration of the while loop. Hence, the measurement technique as shown has a resolution of 1 clock cycle, and a precision of 1 clock cycle. The accuracy, on the other hand, seems off, as 15 clock cycles seems too much for a simple addition.

This suspicion is confirmed by measuring slightly longer sections of code. For example, by modifying the program to do 2 additions, then 3 additions, we find that the execution takes 19 cycles and 23 cycles respectively. Looking at this program in a differential manner, we then conclude that a memory-to-memory increment takes 4 clock cycles, which is what we expect for such an increment. The remaining 15 - 4 = 11 cycles are measurement overhead.
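
The differential reasoning amounts to fitting a straight line, lap = overhead + n × cycles_per_op, through two measurements. A minimal sketch follows. The struct lap_model and the function fit_lap_model are hypothetical names, and the model assumes every repetition of the operation costs the same number of cycles.

```c
#include <assert.h>

/* Linear cost model: lap = overhead + n * cycles_per_op. */
struct lap_model {
  long cycles_per_op; /* cost of one repetition of the measured operation */
  long overhead;      /* fixed cost added by the measurement itself */
};

/* Fit the model from two measurements (n1, lap1) and (n2, lap2),
   where n is the number of repetitions and lap the measured cycles. */
struct lap_model fit_lap_model(long n1, long lap1, long n2, long lap2) {
  struct lap_model m;
  assert(n1 != n2); /* two different repetition counts are required */
  m.cycles_per_op = (lap2 - lap1) / (n2 - n1);
  m.overhead = lap1 - n1 * m.cycles_per_op;
  return m;
}
```

With the measurements from the text, fit_lap_model(1, 15, 3, 23) yields 4 cycles per increment and 11 cycles of overhead.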

To clarify the 11 cycles of overhead, we have to analyze the assembly dump.

0000fc78 <TimerLap>:
    fc78:       b2 f0 cf ff     and     #-49,   &0x0160 ;#0xffcf
    fc7c:       60 01
    fc7e:       1c 42 70 01     mov     &0x0170,r12     ;0x0170
    fc82:       1c 82 00 02     sub     &0x0200,r12     ;0x0200
    fc86:       92 42 70 01     mov     &0x0170,&0x0200 ;0x0170
    fc8a:       00 02
    fc8c:       b2 d0 20 00     bis     #32,    &0x0160 ;#0x0020
    fc90:       60 01
    fc92:       30 41           ret

0000fc94 <main>:
    fc94:       0a 12           push    r10             ;
    fc96:       21 83           decd    r1              ;
    fc98:       b2 d0 24 02     bis     #548,   &0x0160 ;#0x0224
    fc9c:       60 01
    fc9e:       b0 12 b2 fc     call    #-846           ;#0xfcb2
    fca2:       3a 40 78 fc     mov     #-904,  r10     ;#0xfc78

0000fca6 <.L4>:
    fca6:       8a 12           call    r10             ;
    fca8:       91 53 00 00     inc     0(r1)           ;
    fcac:       8a 12           call    r10             ;
    fcae:       30 40 a6 fc     br      #0xfca6         ;

Hence, the true measurement of a single increment inc 0(r1) includes the following instructions. Based on our previous analysis, the accuracy of the measurement has an error of +11 cycles.

# ..re-enable counter
    fc92:       30 41           ret

# .. actual code to measure
    fca8:       91 53 00 00     inc     0(r1)           ;

# .. call timerlap    
    fcac:       8a 12           call    r10             ;

# .. stop the counter
    fc78:       b2 f0 cf ff     and     #-49,   &0x0160 ;#0xffcf
    fc7c:       60 01

Time stamping long intervals

When we have to measure long time intervals, the 16-bit range of the timestamp counter will run out, and we have to keep track of the number of wrap-around events.

Detecting timer wrap-around

To detect the wrap-around of the counter, we use the interrupt system of TimerA. There are two interrupts associated with the timer, which are wired to different events within the timer hardware. These two interrupts are associated with two different interrupt vectors.

Vector	Vector Name	Event
0xFFF2	TACCR0	Capture/Compare Channel 0
0xFFF0	TAIV	Overflow and all other Capture/Compare Channels

One timer register, TAIV, is used to handle the interrupt flags. Reading from that register clears a pending interrupt on the timer, and the value read indicates the interrupt source (refer to the TAIV documentation on page 11-23 in the MSP430x1x Family Users Guide).

The interrupt vector is implementation dependent. For the MSP430-C1111, after which openmsp430 is modeled, vectors 0xFFF0 and 0xFFF2 are used. A quick verification is found in the compiler include files for this device:

$ grep VECTOR /cygdrive/c/ti/msp430-gcc/include/msp430c1111.h
#define TIMER0_A1_VECTOR       TIMERA1_VECTOR /* Int. Vector: Timer A CC1-2, TA */
#define TIMER0_A0_VECTOR       TIMERA0_VECTOR /* Int. Vector: Timer A CC0 */
#define PORT1_VECTOR            ( 3)                     /* 0xFFE4 Port 1 */
#define PORT2_VECTOR            ( 4)                     /* 0xFFE6 Port 2 */
#define TIMERA1_VECTOR          ( 9)                     /* 0xFFF0 Timer A CC1-2, TA */
#define TIMERA0_VECTOR          (10)                     /* 0xFFF2 Timer A CC0 */
#define WDT_VECTOR              (11)                     /* 0xFFF4 Watchdog Timer */
#define COMPARATORA_VECTOR      (12)                     /* 0xFFF6 Comparator A */
#define NMI_VECTOR              (15)                     /* 0xFFFC Non-maskable */
#define RESET_VECTOR            ("reset")                /* 0xFFFE Reset [Highest Priority] */

We can also double-check the hardware interconnect in openmsp430, which ties the timer interrupt lines to the interrupt vector controller. Consulting the toplevel interconnect in toplevel.v, we find the following definitions. While the vector numbering differs slightly between the include file and the Verilog interconnect, the vector addresses (0xFFF0 and 0xFFF2) are the same.

   assign nmi        =  1'b0;
   assign irq_bus    = {1'b0,         // Vector 13  (0xFFFA)
                        1'b0,         // Vector 12  (0xFFF8)
                        1'b0,         // Vector 11  (0xFFF6)
                        1'b0,         // Vector 10  (0xFFF4) - Watchdog -
                        irq_ta0,      // Vector  9  (0xFFF2)
                        irq_ta1,      // Vector  8  (0xFFF0)
                        irq_uart_rx,  // Vector  7  (0xFFEE)
                        irq_uart_tx,  // Vector  6  (0xFFEC)
                        1'b0,         // Vector  5  (0xFFEA)
                        1'b0,         // Vector  4  (0xFFE8)
                        irq_port2,    // Vector  3  (0xFFE6)
                        irq_port1,    // Vector  2  (0xFFE4)
                        1'b0,         // Vector  1  (0xFFE2)
                        1'b0};        // Vector  0  (0xFFE0)

An interrupt service routine (ISR) for the timer, tied to interrupt vector 0xFFF0, is written as follows:

void __attribute__ ((interrupt(TIMERA1_VECTOR))) timerisr (void) {
   if (TAIV == 0xA) {   // TAIV reads 0x0A on a timer overflow
     // ...
   }
}

To configure the interrupt system, we also have to enable interrupts from the timer, and we have to enable interrupts globally for the MSP-430.

void main() {

  // enable timer interrupts
  TACTL  |= (TAIE);

  // disable timer interrupts
  TACTL  &= ~(TAIE);

  // globally enable interrupts
  _enable_interrupts();

  // globally disable interrupts
  _disable_interrupts();
}

Counting timer overflows

We can now rewrite the TimerLap routine to take into account the number of timer overflows. When the new value (TAR) is below the previous value (count), the first timer overflow has to be ignored, because the 16-bit unsigned difference TAR - count already accounts for it. This is illustrated in the following figure.

Figure: Counting Timer Wrapping

An adjusted timer measurement program now looks as follows. The program also shows an example measurement of a ‘long’ program. In this case, we are able to measure up to 2^32 clock cycles; hence, the range of the counter has increased by 16 bits. The resolution is still a single clock cycle. However, both the accuracy and the precision are off.

#include "omsp_de1soc.h"

unsigned long wrap = 0;
void __attribute__ ((interrupt(TIMERA1_VECTOR))) timerisr (void) {
  if (TAIV == 0xA)
    wrap++;
}

unsigned count   = 0;
unsigned long TimerLap() {
  unsigned long lap;
  TACTL &= ~(MC1 | MC0); // stop timer
  TACTL &= ~(TAIE);      // disable further IRQ
  if (TAR < count)
    lap = (unsigned) (TAR - count) + ((wrap - 1) << 16);
  else
    lap = (unsigned) (TAR - count) + (wrap << 16);
  wrap = 0;
  count = TAR;
  TACTL |= TAIE; // reenable IRQ
  TACTL |= MC1;  // reenable timer
  return lap;
}

int main(void) {
  unsigned long k;
  volatile int c;
  volatile long i;

  TACTL  |= (TASSEL1 | MC1 | TACLR | TAIE);
  de1soc_init();

  _enable_interrupts();
  while (1) {
    TimerLap();
    for (i=0; i<1860; i++) c = c + 1;
    k = TimerLap();
    de1soc_hexhi(k >> 16); // hi word
    de1soc_hexlo(k );      // lo word
    long_delay(500);
  }
  _disable_interrupts();
  LPM0;

  return 0;
}
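
The way TimerLap combines the hardware counter with the overflow count can be checked on a host with the following sketch. It models only the arithmetic, not the timer or the interrupt, and the function name lap32 is hypothetical. It assumes that wrap is at least 1 whenever tar is below count, which holds when every overflow has been counted by the ISR.

```c
#include <stdint.h>

/* Combine a 16-bit counter reading with a software overflow count.
   tar   : current counter value
   count : counter value at the previous lap
   wrap  : number of overflows seen since the previous lap */
uint32_t lap32(uint16_t tar, uint16_t count, uint32_t wrap) {
  /* Low 16 bits: modulo-2^16 difference, as in the 16-bit TimerLap. */
  uint16_t low = (uint16_t)(tar - count);
  /* If the counter value went down, the modulo difference already
     includes one wrap-around, so that overflow must not be added again. */
  if (tar < count)
    wrap -= 1;
  return (wrap << 16) + low;
}
```

For example, lap32(0x38c2, 0, 1) returns 80066, and lap32(100, 200, 1) returns 65436, which is the 65336 ticks from 200 up to the wrap plus the 100 ticks after it.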

Accuracy and Precision on interrupt-driven Timestamp counter

The measurements made by the previous program oscillate between 80,066 and 80,097 cycles (138c2 and 138e1 on the HEX displays). Hence, there is a precision loss of 31 cycles.

There are two sources of inaccuracy. First, the interrupt service routine causes overhead, which may potentially affect the precision (since we cannot predict when the ISR runs). Second, the TimerLap routine itself creates overhead, which may potentially affect the accuracy.

Since 80,066 cycles is 1.22 times the range of a 16-bit counter, we conclude that each measurement sequence will experience between 1 and 2 interrupts from the timer. Hence, the overhead of the timer ISR, including interrupt latency and ISR execution latency, is 31 cycles. This leads to precision loss. It’s easy to see that the uncertainty on any measurement within the 32-bit range will be one ISR call. Therefore, we conclude that extending the timer range using interrupts introduces a precision loss of 31 cycles.
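
The bound on the number of overflow interrupts can be sketched as follows. An interval of a given length, starting at an arbitrary counter phase, contains either the minimum or the maximum number of 16-bit overflows computed below, which is why the uncertainty is one ISR call. The names irq_bounds and overflow_irq_bounds are hypothetical.

```c
#include <stdint.h>

/* Smallest and largest number of 16-bit overflows that can occur
   during an interval of the given length, over all starting phases. */
struct irq_bounds {
  uint32_t min_irqs;
  uint32_t max_irqs;
};

struct irq_bounds overflow_irq_bounds(uint32_t cycles) {
  struct irq_bounds b;
  b.min_irqs = cycles >> 16; /* full counter periods inside the interval */
  /* A partial period may or may not contain the overflow point. */
  b.max_irqs = b.min_irqs + ((cycles & 0xFFFFu) != 0 ? 1u : 0u);
  return b;
}
```

For the 80,066-cycle measurement this gives a minimum of 1 and a maximum of 2 interrupts.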

To determine the overhead of TimerLap, we measure back-to-back calls of TimerLap to find an overhead of 29 cycles (1d on the HEX display). Hence, we conclude that the accuracy loss is 29 cycles (up from 11 cycles without interrupts!).

We confirm both of these numbers by inspecting the assembly listing. First, the timerisr introduces the interrupt latency and an additional 8 instructions. Second, TimerLap incurs the overhead of a call, three push instructions, a branch, three pop instructions, and a return instruction (9 instructions in total). This explains the precision loss and accuracy loss.

0000fc28 <timerisr>:
    fc28:       0c 12           push    r12             ;
    fc2a:       1c 42 2e 01     mov     &0x012e,r12     ;0x012e
    fc2e:       3c 90 0a 00     cmp     #10,    r12     ;#0x000a
    fc32:       04 20           jnz     $+10            ;abs 0xfc3c
    fc34:       92 53 02 02     inc     &0x0202         ;
    fc38:       82 63 04 02     adc     &0x0204         ;

0000fc3c <.L1>:
    fc3c:       3c 41           pop     r12             ;
    fc3e:       00 13           reti

0000fc40 <TimerLap>:
    fc40:       0a 12           push    r10             ;
    fc42:       09 12           push    r9              ;
    fc44:       08 12           push    r8              ;
    fc46:       b2 f0 cf ff     and     #-49,   &0x0160 ;#0xffcf
    // ... counter is halted after this instruction


    // ... counter is reenabled here
    fc90:       b2 d0 20 00     bis     #32,    &0x0160 ;#0x0020
    fc94:       60 01
    // ... function epilog
    fc96:       30 40 86 fd     br      #0xfd86         ;

0000fd86 <__mspabi_func_epilog_3>:
    fd86:       38 41           pop     r8              ;

0000fd88 <__mspabi_func_epilog_2>:
    fd88:       39 41           pop     r9              ;

0000fd8a <__mspabi_func_epilog_1>:
    fd8a:       3a 41           pop     r10             ;
    fd8c:       30 41           ret

So we summarize the results of our measurement technique in the table below. Despite these imprecisions, this is not too bad. The execution of the MSP-430 software is deterministic, and therefore we know that any imprecision is caused by the measurement, and not by the target.

Table: Accuracy, precision and resolution of measurement techniques

Range (bits)	Software IRQ	Accuracy (cycles)	Precision (cycles)	Resolution (cycles)
16	no	+11	1	1
32	yes	+29	+31	1

Dealing with uncertainty in the target

Often, it is not known whether the measurement target will always take the same number of clock cycles. That is, the measurement body indicated below has an unknown, and possibly variable, execution time.

   TimerLap();
   // measurement body
   ...
   k = TimerLap();

There are many possible causes.

The control flow of the software may depend on its input data. This can be handled by controlling the input data. For example, it may be possible to provide constant data, or ‘typical’ data. For such cases, we may seek an average execution time.
There may be architectural dependencies. In particular, caches, branch predictors and dynamic instruction schedulers may cause the execution time of the program to vary even under constant input data. For such cases, we may seek to perform each measurement several times (say 10 times) and select the median execution time as the expected execution time. Selecting the median gets rid of start-up effects such as initial cache misses when you start executing the code.
There may be external, uncontrolled factors. For example, the bus of a system-on-chip is a shared resource with a varying load. Peripherals that are unrelated to the program may cause unexpected traffic on the bus, thereby affecting system performance. Such uncontrolled factors show up as measurement ‘noise’, and are often hard to explain. Such cases should be handled by ensuring that the system is under a typical load, that sufficient measurements are taken, and that these measurements are combined into an average execution time.
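
Selecting the median of repeated measurements can be sketched as below. This is a host-side illustration under the assumption that the lap values have already been collected into an array. The function name median_cycles is hypothetical, and for an even number of samples it returns the lower of the two middle values.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort on unsigned long values. */
static int cmp_ulong(const void *a, const void *b) {
  unsigned long x = *(const unsigned long *)a;
  unsigned long y = *(const unsigned long *)b;
  return (x > y) - (x < y);
}

/* Sort the samples in place and return the median cycle count. */
unsigned long median_cycles(unsigned long *samples, size_t n) {
  assert(n > 0); /* at least one measurement is required */
  qsort(samples, n, sizeof samples[0], cmp_ulong);
  return samples[(n - 1) / 2];
}
```

For the samples {95, 80, 81, 80, 82}, where the first run is slowed down by a start-up effect, the median is 81, whereas the mean would be pulled upward by the outlier.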

Running the Examples

Here’s a summary of how to run each of the two timestamp counter examples we discussed in this lecture.

Start by cloning the example repository.

git clone https://github.com/vt-ece4530-f19/example-perf430

Open a Nios-II Command Shell and compile the hardware

cd example-perf430/hardware/msp430de1soc
quartus_sh --flow compile msp430de1soc

Open a Cygwin Shell and compile the software

cd example-perf430/software
make compile

Connect a DE1-SoC board to your laptop over USB (USB-Blaster connection). Power up the DE1-SoC board. Go to the Nios-II Command Shell and run each example. For example 1 (TimerLap with no interrupts), use the following command.

cd example-perf430/loader
system-console --script connect.tcl -bin ../software/timerperf1/timer.bin -sof ../hardware/msp430de1soc/msp430de1soc.sof

For example 2 (TimerLap with interrupts), use the following command.

cd example-perf430/loader
system-console --script connect.tcl -bin ../software/timerperf2/timer.bin -sof ../hardware/msp430de1soc/msp430de1soc.sof

To change the software (experimenting is encouraged!), go to the software source directory, make the change and recompile in a Cygwin shell.

cd example-perf430/software/timerperf1
# ... (make changes to main.c)
make compile

Then, re-download the changed binary. Once the FPGA bitstream is downloaded, you do not have to download it again as long as the board remains powered up. For example, after making a change in the software, you can recompile the software and re-run the connect script in the Nios-II Command Shell without the -sof argument, which skips the bitstream download.

cd example-perf430/loader
system-console --script connect.tcl -bin ../software/timerperf1/timer.bin

Conclusions

We discussed the four main characteristics of a measurement (range, resolution, accuracy and precision) and applied those to the problem of software performance evaluation using Timers.
We used TimerA as a 16-bit hardware timer and investigated the sources of accuracy and precision loss.
We used TimerA and timer overflow interrupts as a 32-bit hybrid hardware/software timer, and investigated the sources of accuracy and precision loss.

Our experiments illustrated that there exist sources of error in timer-based performance measurements, but that these sources can be determined and taken into account.