DSP Libraries

The purpose of this lecture is as follows.

  • To discuss the purpose and use of DSP libraries

  • To describe the contents of the ARM CMSIS library

  • To illustrate filter primitives for ARM CMSIS DSP

  • To describe how DMA I/O is implemented in our DSP applications

  • To demonstrate the performance gains available from using a DSP library

DSP Libraries: ARM CMSIS

Despite the need for optimized DSP code to achieve real-time performance, there is also relentless pressure to develop code faster. One solution to this conundrum is to make use of a library of optimized primitives.

The advantage of such a library is threefold. First, a library improves the speed of software development. Second, the library functions can be highly optimized for the underlying architecture. Third, the library’s application programmer’s interface (API) offers a portable design that enables the same application code to execute across multiple platforms.

To support the developer, ARM has published the Common Microcontroller Software Interface Standard (CMSIS). This is a collection of software functions that supports application development for ARM processors, regardless of the specific core used. Thus, ARM CMSIS aims to provide a consistent API for applications that run on the Cortex-M or Cortex-A series.

An important component of the ARM CMSIS library is the Digital Signal Processing Library. This library provides over 60 DSP functions for various data types: fixed-point (fractional q7, q15, q31) and single-precision floating-point (32-bit). Optimized implementations for the SIMD instruction set of Cortex-M4/M7/M33/M35P are included as well.

The ARM CMSIS DSP library is included in the MSP432 SimpleLink software library as source code. You can find it in source\thirdparty\CMSIS of the SimpleLink Installation Directory (typically c:\ti\simplelink_msp432p4_sdk_3_40_01_02). The library includes a large number of functions, which are documented in the ARM CMSIS DSP Library reference.

In this lecture, we will focus on the CMSIS DSP Library support for the DSP functions that we have discussed so far: FIR and IIR filters.

Data Types

The most common data types used by ARM DSP CMSIS are equivalent to the data types that we have used so far.

Type        Precision                         Min Value   Max Value
float32_t   Single-precision floating point   F32_MIN     F32_MAX
q31_t       Fixed point <32,31>               Q31_MIN     Q31_MAX
q15_t       Fixed point <16,15>               Q15_MIN     Q15_MAX
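To make the fixed-point rows concrete, the following sketch converts a floating-point value to the <16,15> format and back by hand. The helper names float_to_q15 and q15_to_float are hypothetical and exist only for this illustration; the local typedef assumes q15_t is a 16-bit signed integer. In a real application the conversion routines of the library itself would normally be used instead.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* assumption: q15_t is a 16-bit signed integer */

/* Hypothetical helper: convert a float in [-1, 1) to <16,15>, saturating
 * values that fall outside the representable range. */
q15_t float_to_q15(float v) {
    float scaled = v * 32768.0f;                  /* 2^15 steps per unit */
    if (scaled >= 32767.0f)  return INT16_MAX;    /* 0x7FFF, just below +1.0 */
    if (scaled <= -32768.0f) return INT16_MIN;    /* 0x8000, exactly -1.0 */
    return (q15_t) scaled;                        /* truncates toward zero */
}

/* Hypothetical helper: convert a <16,15> value back to float. */
float q15_to_float(q15_t q) {
    return (float) q / 32768.0f;
}
```

Note that +1.0 is not representable in <16,15>: float_to_q15(1.0f) saturates to 32767, while -1.0 maps exactly to -32768.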

In addition, ARM DSP CMSIS makes extensive use of records (structs) to set up the parameters related to a filter. For example, the following shows the definition of a data structure used for FIR filtering.

typedef struct {
  uint16_t numTaps;
  q15_t *pState;
  q15_t *pCoeffs;
} arm_fir_instance_q15;

In an arm_fir_instance_q15 record, we store the number of taps in the filter (numTaps), a pointer to an array that holds the filter state, i.e. the tap values (pState), and a pointer to an array with the filter coefficients (pCoeffs). This record is a simple data structure with pointers; it does not allocate actual storage. Memory allocation is application dependent and is therefore left to the user.

There are similar records for every major function type in ARM DSP CMSIS, and the online documentation can be used to explain each of the fields.

Filtering Operations

All of the filtering functions provided in ARM CMSIS DSP are block-based and work on a block of samples rather than a single sample at a time. For example, a FIR filter function may work on blocks of 8 samples at a time. This means that the ARM CMSIS DSP code for that FIR filter starts from a block of 8 inputs, which represents a stream of 8 consecutive input samples, and that the FIR filter creates a block of 8 outputs, which represents a stream of 8 consecutive output samples.

The following shows a minimal example.

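#define BLOCKSIZE 8
#define NUMTAPS 32
q15_t taps[NUMTAPS + BLOCKSIZE - 1];      // filter state (delay line)
q15_t coefficients[NUMTAPS] = { ... };    // filter coefficients
arm_fir_instance_q15 F;

// initialize the FIR (the instance is passed by pointer)
arm_fir_init_q15(&F, NUMTAPS, coefficients, taps, BLOCKSIZE);

// execute the FIR on one block; x and y are q15_t arrays of BLOCKSIZE samples
arm_fir_q15(&F, x, y, BLOCKSIZE);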

In this example, a 32-tap FIR is created. The FIR will filter blocks of 8 samples at a time. An array taps of 32 + 8 - 1 taps is required to hold all the data, as well as an array of 32 coefficients. The function arm_fir_init_q15 initializes the FIR data structure. The function arm_fir_q15 runs the actual filter on a block of x[BLOCKSIZE] samples.
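The size of the state array, NUMTAPS + BLOCKSIZE - 1, follows from what a block-based FIR has to remember: the NUMTAPS - 1 most recent samples of the previous block plus the BLOCKSIZE samples of the current block. The sketch below is a plain C illustration of that idea in floating point. It is not the CMSIS implementation; the function name block_fir and its argument order are hypothetical, and the coefficients are taken in natural order (coeffs[k] multiplies x[n-k]), which may differ from the order the library documentation prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical block FIR, for illustration only.
 * state must hold numTaps + blockSize - 1 floats and be zeroed before
 * the first call. coeffs[k] multiplies x[n-k]. */
void block_fir(const float *coeffs, size_t numTaps, float *state,
               const float *src, float *dst, size_t blockSize) {
    assert(numTaps > 0 && blockSize > 0);

    /* Append the new block behind the numTaps-1 samples kept from the
     * previous call. */
    memcpy(&state[numTaps - 1], src, blockSize * sizeof(float));

    for (size_t n = 0; n < blockSize; n++) {
        const float *newest = &state[numTaps - 1 + n];   /* x[n] */
        float acc = 0.0f;
        for (size_t k = 0; k < numTaps; k++) {
            acc += coeffs[k] * newest[-(ptrdiff_t) k];   /* h[k] * x[n-k] */
        }
        dst[n] = acc;
    }

    /* Keep the last numTaps-1 samples as history for the next block. */
    memmove(state, &state[blockSize], (numTaps - 1) * sizeof(float));
}
```

Because the history is carried over in the state array, feeding a signal in consecutive blocks produces the same output as filtering it one sample at a time.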

The reason why ARM CMSIS DSP uses block-based processing instead of sample-based processing is rooted in performance considerations. In many cases, a computer architecture becomes more efficient when operations are formulated in bulk. Think for example of the data elements in a cache line, which are all read out with a single burst memory read, or of a vector instruction, which computes on a vector of data values at a time. Hence, many of the ARM CMSIS DSP library functions are internally optimized to make use of the parallelism available in the computer architecture.

The consequence of block-based operation, of course, is that one can no longer call a filter one sample at a time. Hence, interrupt-driven operation, where samples are produced one sample period apart, cannot be used for block-based operation. We can reasonably expect that processing N samples may take up to N times as long as processing one sample.

We will introduce a new real-time input-output primitive to deal with blocks of samples. This new real-time input/output mechanism is called block-based input/output. We will use the term ‘DMA-driven input/output’ to describe this block-based I/O mechanism, mainly because there is a Direct Memory Access peripheral involved.

DMA-driven Input/Output

In a block-based processing system, a conversion from sample-based processing to block-based processing, and vice versa, is needed. We make use of a special peripheral called a Direct Memory Access (DMA) unit. A DMA module performs data movement on behalf of the processor. We can program it with a source address, a destination address, and a number of words to copy. When started, the DMA will then automatically copy the block of data. The DMA operation is fairly complex, and in this lecture we will only give a brief overview of it. We will discuss some of the internal details later when we care about detailed performance optimization.

_images/dmaoperation.jpg

The idea of DMA-driven I/O is to fill up a complete buffer with ADC samples before turning over the buffer to the processor. The processor can then compute on an entire buffer of samples. The following block diagram shows the sequence of operations under DMA-driven I/O. A periodic timer starts A/D conversions at a specified interval. When an A/D conversion finishes, the DMA module is triggered and the sample is forwarded through a DMA channel to a buffer in main memory. The DMA trigger is a hardware signal; no software is involved in storing a sample value in memory. The DMA makes use of a ping and a pong buffer, with the idea that the ARM is only allowed read access to the buffer which is currently not being filled. Thus, when DMA Channel 1 fills the Ping buffer, the ARM reads the Pong buffer, and when DMA Channel 2 fills the Pong buffer, the ARM reads the Ping buffer. The switching between buffers is controlled through a DMA Interrupt Service Routine, and is done behind the scenes. Finally, because access to the DAC8311 is more complex than access to the ADC14 (i.e. it requires a combination of GPIO and SPI), no DMA transfers are used to produce output samples.

_images/dmaio.jpg
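The ping/pong handover described above can be modeled in a few lines of plain C. The sketch below is a software simulation only: pingpong_t, pingpong_init, pingpong_push and copy_block are hypothetical names, BUFLEN is an arbitrary small value, and pingpong_push stands in for the hardware DMA transfer and its completion interrupt. On the real system the fill and the processing run concurrently; here they run one after the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUFLEN 4   /* arbitrary block size for this illustration */

typedef void (*block_cb_t)(const uint16_t in[BUFLEN], uint16_t out[BUFLEN]);

typedef struct {
    uint16_t   ping[BUFLEN];
    uint16_t   pong[BUFLEN];
    uint16_t   out[BUFLEN];
    int        fillingPing;   /* 1: "DMA" fills ping, CPU may read pong */
    size_t     count;         /* samples stored in the buffer being filled */
    block_cb_t cb;            /* block processing function */
} pingpong_t;

void pingpong_init(pingpong_t *p, block_cb_t cb) {
    assert(cb != NULL);
    p->fillingPing = 1;
    p->count = 0;
    p->cb = cb;
}

/* Stands in for one hardware DMA transfer of one ADC sample.
 * Returns 1 when a full block was handed to the callback. */
int pingpong_push(pingpong_t *p, uint16_t sample) {
    uint16_t *fill = p->fillingPing ? p->ping : p->pong;
    fill[p->count++] = sample;
    if (p->count < BUFLEN) return 0;

    /* Buffer full: this is the job of the DMA completion ISR. */
    p->fillingPing = !p->fillingPing;   /* next samples go to the other buffer */
    p->count = 0;
    p->cb(fill, p->out);                /* CPU processes the completed buffer */
    return 1;
}

/* Example callback: pass the block through unchanged. */
void copy_block(const uint16_t in[BUFLEN], uint16_t out[BUFLEN]) {
    for (size_t i = 0; i < BUFLEN; i++) out[i] = in[i];
}
```

After BUFLEN calls to pingpong_push, the callback has seen the ping buffer and subsequent samples land in the pong buffer, so the processor never reads a buffer that is still being filled.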

DMA Example

Here is a complete example of a DMA-driven input/output program. Compared to the interrupt-driven setup, the initialization function call includes an additional parameter: the size of the Ping/Pong buffer.

#include <stdio.h>
#include "xlaudio.h"

// copy each block of 32 input samples to the output unchanged
void processBuffer(uint16_t x[32], uint16_t y[32]) {
    uint16_t i;
    for (i = 0; i < 32; i++) {
        y[i] = x[i];
    }
}

int main(void) {
    WDT_A_hold(WDT_A_BASE);

    // BUFLEN_32 selects a Ping/Pong buffer size of 32 samples
    xlaudio_init_dma(FS_32000_HZ, XLAUDIO_J1_2_IN, BUFLEN_32, processBuffer);
    xlaudio_run();

    return 1;
}

FIR Designs using ARM CMSIS DSP

There are several variants of FIR filter implementations in ARM CMSIS DSP. We list a few examples that use single-precision floating-point data types. There are, of course, variants that use fixed-point data types as well. The full list can be found online. We will only discuss the standard block-based FIR.

arm_fir_f32               Standard block-based FIR
arm_fir_decimate_f32      Decimating block-based FIR
arm_fir_interpolate_f32   Interpolating block-based FIR
arm_fir_sparse_f32        Block-based FIR with sparse coefficients
arm_fir_lattice_f32       Lattice block-based FIR

The following is an example of a FIR with Q15 coefficients.

Important

void arm_fir_q15 ( const arm_fir_instance_q15 *  S,
                   const q15_t *   pSrc,
                   q15_t *   pDst,
                   uint32_t  blockSize
                 )

Parameters

  • [in] S points to an instance of the Q15 FIR filter structure

  • [in] pSrc points to the block of input data

  • [out] pDst points to the block of output data

  • [in] blockSize number of samples to process

Returns

none

Scaling and Overflow Behavior

The function is implemented using a 64-bit internal accumulator. Both coefficients and state variables are represented in 1.15 format and multiplications yield a 2.30 result. The 2.30 intermediate results are accumulated in a 64-bit accumulator in 34.30 format. There is no risk of internal overflow with this approach and the full precision of intermediate multiplications is preserved. After all additions have been performed, the accumulator is truncated to 34.15 format by discarding low 15 bits. Lastly, the accumulator is saturated to yield a result in 1.15 format.
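The scaling steps in this description can be traced in a small stand-alone C sketch. The function q15_dot_saturate is a hypothetical helper written for this illustration; it is not part of CMSIS DSP and it ignores the SIMD optimizations of the real library. It multiplies two 1.15 vectors, accumulates the 2.30 products in a 64-bit accumulator, discards the low 15 bits, and saturates to 1.15. The local typedef assumes q15_t is a 16-bit signed integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int16_t q15_t;   /* assumption: q15_t is a 16-bit signed integer */

/* Hypothetical helper: dot product of two 1.15 vectors with a 64-bit
 * accumulator, followed by truncation and saturation to 1.15. */
q15_t q15_dot_saturate(const q15_t *a, const q15_t *b, uint32_t n) {
    assert(a != NULL && b != NULL);
    int64_t acc = 0;                              /* 34.30 accumulator */
    for (uint32_t i = 0; i < n; i++) {
        acc += (int32_t) a[i] * (int32_t) b[i];   /* 1.15 * 1.15 -> 2.30 */
    }
    /* Discard the low 15 bits: 34.30 -> 34.15. Right-shifting a negative
     * value is implementation-defined in C; an arithmetic shift is assumed. */
    acc >>= 15;
    if (acc > INT16_MAX) acc = INT16_MAX;         /* saturate to 1.15 */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (q15_t) acc;
}
```

For example, two products of 0.5 times 0.5 add up to 0.5, which comes out as 16384. Two products of nearly 1.0 times nearly 1.0 add up to almost 2.0, which does not fit in 1.15 and saturates to 32767.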

The following application illustrates the use of arm_fir_q15.

#include <stdio.h>
#include "xlaudio.h"

#define BLOCKSIZE 8
#define NUMTAPS 32
q15_t taps[NUMTAPS + BLOCKSIZE - 1];
// first coefficient is the largest q15 value (just below 1.0), the rest are 0;
// note that (1 << 15) itself does not fit in a q15_t
q15_t coefficients[NUMTAPS] = { (q15_t) 0x7FFF };
arm_fir_instance_q15 F;

void processBuffer(uint16_t x[BLOCKSIZE], uint16_t y[BLOCKSIZE]) {
    q15_t xq[BLOCKSIZE], yq[BLOCKSIZE];
    xlaudio_adc14_to_q15_vec(x, xq, BLOCKSIZE);   // ADC samples to q15
    arm_fir_q15(&F, xq, yq, BLOCKSIZE);           // filter one block
    xlaudio_q15_to_dac14_vec(yq, y, BLOCKSIZE);   // q15 to DAC samples
}

int main(void) {
    WDT_A_hold(WDT_A_BASE);

    arm_fir_init_q15(&F, NUMTAPS, coefficients, taps, BLOCKSIZE);

    xlaudio_init_dma(FS_8000_HZ, XLAUDIO_J1_2_IN, BUFLEN_8, processBuffer);

    // measure the cycle count of one call to processBuffer
    uint32_t c = xlaudio_measurePerfBuffer(processBuffer);
    printf("Cycles: %lu\n", (unsigned long) c);

    xlaudio_run();

    return 1;
}
  • The processSample function is now replaced with a processBuffer function, which filters a block of samples. Note that XLAUDIO_LIB has functions to convert a vector of ADC samples to an internal q15, f32 or q31 datatype, and to convert a vector of that datatype back to DAC samples.

  • The main function creates a DMA block-based setup rather than an interrupt-driven sample-based setup. The key parameter is BLOCKSIZE, which describes the block size used by the DMA. We will discuss the detailed internal operation of the DMA mechanism in our next lecture. For now, the key point is that the DMA mechanism will collect a block of BLOCKSIZE samples (each one sample period apart) from the ADC, and then call processBuffer. After this function returns, the resulting BLOCKSIZE output samples are submitted to the DAC, one at a time and spaced one sample period apart.

This function completes in 4999 cycles for a block of 8 samples, which is about 625 cycles per sample. In contrast, a straightforward FIR filter using an interrupt-driven I/O routine needs 1556 cycles per sample. Hence, arm_fir_q15 provides a considerable performance improvement.
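The arithmetic behind these figures is simple enough to write down as a few helper functions. The names below are hypothetical and only serve as a worked example. The third function computes how many cycles are available per block for a given processor clock; the clock rate is an input that the caller must supply, and any value used with it (such as 48 MHz in the note below) is an assumption, not a measured figure from this lecture.

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per sample for one block, rounded to the nearest integer. */
uint32_t cycles_per_sample(uint32_t blockCycles, uint32_t blockSize) {
    assert(blockSize > 0);
    return (blockCycles + blockSize / 2) / blockSize;
}

/* Speedup of the block-based version over the sample-based version. */
double speedup(double sampleBasedCycles, double blockBasedCycles) {
    return sampleBasedCycles / blockBasedCycles;
}

/* Cycles available to process one block before the next block is complete.
 * cpuHz is an ASSUMED clock rate supplied by the caller. */
uint32_t block_cycle_budget(uint32_t cpuHz, uint32_t sampleRateHz,
                            uint32_t blockSize) {
    assert(sampleRateHz > 0);
    return (uint32_t) (((uint64_t) cpuHz * blockSize) / sampleRateHz);
}
```

With the numbers above, cycles_per_sample(4999, 8) gives 625 and speedup(1556, 625) is roughly 2.5. If the processor were clocked at 48 MHz (an assumption), a block of 8 samples at 8 kHz would leave a budget of 48000 cycles per block.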

IIR Designs using ARM CMSIS DSP

Just as with FIR, there are several IIR variants in ARM CMSIS DSP. We list a few examples that use single-precision floating-point data types. There are, of course, variants that use fixed-point data types as well. The full list can be found online. We will discuss Direct Form I Cascade and Direct Form II Transpose Cascade.

arm_biquad_cascade_df1_f32    Direct Form I Cascade block-based IIR
arm_biquad_cascade_df2T_f32   Direct Form II Transpose Cascade block-based IIR
arm_iir_lattice_f32           IIR Lattice Filter

The following is an example of an IIR Direct Form I with float32_t coefficients.

Important

void arm_biquad_cascade_df1_f32   (   const arm_biquad_casd_df1_inst_f32 *    S,
                                      const float32_t *   pSrc,
                                      float32_t *     pDst,
                                      uint32_t    blockSize
                                  )

Parameters

  • [in] S points to an instance of the float32_t biquad cascade structure

  • [in] pSrc points to the block of input data

  • [out] pDst points to the block of output data

  • [in] blockSize number of samples to process

Returns

none

The data structure that holds the coefficients, arm_biquad_casd_df1_inst_f32, is organized in cascade sections as follows (online reference):

typedef struct
{
  uint32_t numStages;
  float32_t *pState;
  const float32_t *pCoeffs;
} arm_biquad_casd_df1_inst_f32;

numStages holds the number of second-order sections. pState is an array with state elements, and there are 4 times numStages state elements. The order of state elements per stage is x[n-1], x[n-2], y[n-1], y[n-2]. pCoeffs is an array of 5 times numStages coefficients, with each group of 5 holding the coefficients of a single cascade stage. The order of coefficients is b0, b1, b2, a1, a2, repeated for each cascade stage.
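The layout of the coefficient and state arrays is easiest to see in a plain C reference version of the cascade. The sketch below is an illustration, not the library code: the function name biquad_df1_cascade and its argument order are hypothetical, and it processes the cascade sample by sample rather than stage by stage. It uses the same array layout as described above, and it adds the feedback terms, following the sign convention in the CMSIS DSP documentation. Coefficients from a design tool that subtracts the feedback terms would need a1 and a2 negated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference cascade, for illustration only.
 * coeffs: 5 per stage, ordered b0, b1, b2, a1, a2.
 * state:  4 per stage, ordered x[n-1], x[n-2], y[n-1], y[n-2], zeroed
 *         before the first call.
 * Feedback terms are ADDED:
 *   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] + a1*y[n-1] + a2*y[n-2] */
void biquad_df1_cascade(size_t numStages, const float *coeffs, float *state,
                        const float *src, float *dst, size_t blockSize) {
    assert(coeffs != NULL && state != NULL);
    for (size_t n = 0; n < blockSize; n++) {
        float x = src[n];
        for (size_t s = 0; s < numStages; s++) {
            const float *c = &coeffs[5 * s];
            float *st = &state[4 * s];
            float y = c[0] * x + c[1] * st[0] + c[2] * st[1]
                    + c[3] * st[2] + c[4] * st[3];
            st[1] = st[0];  st[0] = x;   /* shift the input history */
            st[3] = st[2];  st[2] = y;   /* shift the output history */
            x = y;                       /* output feeds the next stage */
        }
        dst[n] = x;
    }
}
```

As with the FIR, the state array carries the filter history from one block to the next, so the block size does not affect the filter output.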

The following application illustrates the use of arm_biquad_cascade_df1_f32. We assume a filter with two cascade stages, similar to the example discussed in the lecture on IIR design.

#include "xlaudio.h"

#define BLOCKSIZE 8
#define NUMSECTIONS 2
float32_t taps[4 * NUMSECTIONS];
float32_t coefficients[5 * NUMSECTIONS] = {1.0f, 0.0f, 1.0f, -0.707f, 0.25f,
                                           1.0f, 1.0f, 0.0f,  0.707f, 0.25f};
arm_biquad_casd_df1_inst_f32 F;

void processBuffer(uint16_t x[BLOCKSIZE], uint16_t y[BLOCKSIZE]) {
    float32_t xf[BLOCKSIZE], yf[BLOCKSIZE];
    xlaudio_adc14_to_f32_vec(x, xf, BLOCKSIZE);
    arm_biquad_cascade_df1_f32(&F, xf, yf, BLOCKSIZE);
    xlaudio_f32_to_dac14_vec(yf, y, BLOCKSIZE);
}

#include <stdio.h>

int main(void) {
    WDT_A_hold(WDT_A_BASE);

    arm_biquad_cascade_df1_init_f32(&F, NUMSECTIONS, coefficients, taps);
    xlaudio_init_dma(FS_8000_HZ, XLAUDIO_J1_2_IN, BUFLEN_8, processBuffer);

    xlaudio_run();

    return 1;
}