.. ECE 4703 

.. _lecture 6:

DSP Libraries
=============

.. contents:: 

The purpose of this lecture is as follows.

* To discuss the purpose and use of DSP libraries
* To describe the contents of the ARM CMSIS library
* To illustrate filter primitives for ARM CMSIS DSP
* To demonstrate the performance gains available from a using DSP library

.. attention::

   The examples for this lectures can be downloaded from https://github.com/wpi-ece4703-b24/blockbased
   
DSP Libraries: ARM CMSIS
------------------------

Despite the need for optimize DSP code to achieve real-time performance, there is also a relentless pressure to develop code faster. A solution for this conundrum is to make use of a library with optimized primitives.

The advantage of such a library is threefold. First, a library improves the speed of software development. Second, the library functions can be highly optimized for the underlying architecture. Third, the library's application programmer's interface (API) offers a portable design that enables the same application code to execute across multiple platforms.

To support the developer, `ARM has published the Common Microcontroller Interface Standard <https://developer.arm.com/tools-and-software/embedded/cmsis>`_. This is a collection of software functions that aim at supporting the application developer for ARM, regardless of the specific core used. Thus, ARM CMSIS aims at providing a consistent API (Application Programmer's Interface) for applications that run on the Cortex M-series or A-series.

An important component of the ARM CMSIS linrary is the Digital Signal Processing Lirary. This library provides over 60 DSP functions for various data types: fixed-point (fractional q7, q15, q31) and single precision floating-point (32-bit). Optimized implementations for the SIMD instruction set of Cortex-M4/M7/M33/M35P are included as well.

The ARM CMSIS DSP library is included in the `MSP432 SimpleLink software library <https://www.ti.com/tool/download/SIMPLELINK-MSP432-SDK/3.40.01.02>`_ as source code. You can find it in ``source\thirdparty\CMSIS`` of the SimpleLink Installation Directory (typically ``c:\ti\simplelink_msp432p4_sdk_3_40_01_02``). There are a large number of functions included in the library which are documented in the `ARM CMSIS DSP Library <https://arm-software.github.io/CMSIS_5/DSP/html/index.html>`_.

In this lecture, we will focus on the CMSIS DSP Library support for the DSP functions that we have discussed so far: FIR and IIR filters.

Data Types
^^^^^^^^^^

The most common data types used by ARM DSP CMSIS are equivalent to the data types
that we have used so far.

   +---------------+---------------------------------------+--------------+-------------+
   |  Type         |  Precision                            | Min Value    | Max Value   |
   +===============+=======================================+==============+=============+
   | ``float32_t`` | Single-precision Floating Point       | ``F32_MAX``  | ``F32_MIN`` |
   +---------------+---------------------------------------+--------------+-------------+
   | ``q31_t``     | Fixed Point <32,31>                   | ``Q31_MAX``  | ``Q31_MIN`` |
   +---------------+---------------------------------------+--------------+-------------+
   | ``q15_t``     | Fixed Point <16,15>                   | ``Q15_MAX``  | ``Q15_MIN`` |
   +---------------+---------------------------------------+--------------+-------------+

In addition, ARM DSP CMSIS makes extensive use of records (``structs``) to set up the parameters related to a filter. For example, the following shows the definition of a data structure used for FIR filtering.

.. code::

   typedef struct
   {
     uint16_t numTaps;         /**< number of filter coefficients in the filter. */
     q15_t *pState;            /**< points to the state variable array. The array is of length numTaps+blockSize-1. */
     q15_t *pCoeffs;           /**< points to the coefficient array. The array is of length numTaps.*/
   } arm_fir_instance_q15;


In a ``arm_fir_instance_q15`` record, we will store the number of taps in the filter (``numTaps``), an array with filter tap values (``pState``), and an array with filter coefficients (``pCoefs``). This record is a simple data structure with pointers which does not allocate actual storage. Memory allocation is application dependent, and is therefore left up to the user.

There are similar records for every major function type in ARM DSP CMSIS, and the `online documentation <https://arm-software.github.io/CMSIS_5/DSP/html/annotated.html>`_ can be used to explain each of the fields.

Filtering Operations
^^^^^^^^^^^^^^^^^^^^

All of the filtering functions provided in ARM CMSIS DSP are *block-based* and work on a block of samples rather than a single sample at a time. For example, a FIR filter function may work on blocks of 8 samples at a time. This means that the ARM CMSIS DSP code for that FIR filter will start from a block of 8 inputs which represents a stream of 8 consecutive inputs samples, and that the FIR filter will create a block of 8 outputs which represents a stream of 8 consecutive output samples.

The following shows a minimal example.

.. code::

   #define BLOCKSIZE 8
   #define NUMTAPS 32
   int16_t taps[NUMTAPS + BLOCKSIZE - 1];
   int coefficients[NUMTAPS] = { ... };
   arm_fir_instance_q15 F;

   // initialize the FIR   
   arm_fir_init_q15(F, NUMTAPS, coefficients, taps, BLOCKSIZE);

   // execute the FIR   
   arm_fir_q15(&F, x, y, BUFLEN_SZ);
  
In this example, a 32-tap FIR is created. The FIR will filter blocks of 8 samples at a time. An array ``taps`` of 32 + 8 - 1 taps is required to hold all the data, as well as an array of 32 coefficients. The function ``arm_fir_init_q15`` initializes the FIR data structure. The function ``arm_fir_q15`` runs the actual filter on a block of ``x[BLOCKSIZE]`` samples.

The reason why ARM CMSIS DSP uses block-based processing instead of sample-based processing is rooted in performance considerations. In many cases, a computer architecture becomes more efficient when operations are formulated in bulk. Think for example of the data elements in a cache line which are all read out with a burst memory-read, or of a vector instruction which computes on a vector of data values at a time. Hence, many of the ARM CMSIS DSP library functions are internally optimized to make use of internal computer architecture parallellism. 

The consequence of block-based operation, of course, is that one can no longer call a filter one sample at a time. Hence, the ARM CMSIS DSP function will need to be used in combination with a block-based real-time I/O scheme. We will use the DMA-based I/O scheme in XLAUDIO for thus purpose.


FIR Designs using ARM CMSIS DSP
-------------------------------

There are several variants of FIR filter implementations in ARM CMSIS DSP. We illustrate a few examples that use single-precision floating point data types. There, of course, variants that use fixed-point precision data types as well. The full list `can be found online <https://arm-software.github.io/CMSIS_5/DSP/html/group__groupFilters.html>`_. We will only discuss the standard block-based FIR.

+-----------------------------+-----------------------------------------------+
| ``arm_fir_f32``             | Standard block-based FIR                      |
+-----------------------------+-----------------------------------------------+
| ``arm_fir_decimate_f32``    | Decimating block-based FIR                    |
+-----------------------------+-----------------------------------------------+
| ``arm_fir_interpolate_f32`` | Interpolating block-based FIR                 |
+-----------------------------+-----------------------------------------------+
| ``arm_fir_sparse_f32``      | Block-based FIR with sparse coefficients      |
+-----------------------------+-----------------------------------------------+
| ``arm_fir_lattice_f32``     | Lattice block-based FIR                       |
+-----------------------------+-----------------------------------------------+

The following is an example of a FIR with Q15 coefficients.

.. important::

   .. code::

      void arm_fir_q15 ( const arm_fir_instance_q15 *  S,
                         const q15_t *   pSrc,
                         q15_t *   pDst,
                         uint32_t  blockSize 
                       )

   **Parameters**

     * [in]  ``S`` points to an instance of the Q15 FIR filter structure
     * [in]  ``pSrc``  points to the block of input data
     * [out] ``pDst``  points to the block of output data
     * [in]  ``blockSize`` number of samples to process

   **Returns**

   none

   **Scaling and Overflow Behavior**

   The function is implemented using a 64-bit internal accumulator. Both coefficients and state variables are represented in 1.15 format and multiplications yield a 2.30 result. The 2.30 intermediate results are accumulated in a 64-bit accumulator in 34.30 format. There is no risk of internal overflow with this approach and the full precision of intermediate multiplications is preserved. After all additions have been performed, the accumulator is truncated to 34.15 format by discarding low 15 bits. Lastly, the accumulator is saturated to yield a result in 1.15 format.

Let's go through an example by designing a FIR filter with a specification as follows: Low-pass, sample rate 8 KHz, passband cuttoff at 1KHz, stopband at 1.2KHz, passband ripple under 1dB and stopband suppression larger than 40dB. Using Matlab ``filterDesigner`` we obtain a 57th order FIR filter for an equiripple design. This filter thus creates a set of 58 filter tap coefficients.

.. figure:: images/firdmaspec.png
   :figwidth: 600px
   :align: center

The coefficients can be exported from Matlab to C. As a reminder, the ``q15`` data type has 15 fractional bits. During the generation of these C coefficients, the proper data type must be chosen to select the proper precision. The coefficient generation process results in the following output.

.. code::

   const int BL = 58;
   const int16_T B[58] = {
        -232,    -47,    109,    319,    462,    430,    205,   -104,   -314,
        -278,     -1,    343,    504,    323,   -133,   -580,   -681,   -285,
         426,    990,    938,    132,  -1056,  -1847,  -1447,    478,   3523,
        6642,   8610,   8610,   6642,   3523,    478,  -1447,  -1847,  -1056,
         132,    938,    990,    426,   -285,   -681,   -580,   -133,    323,
         504,    343,     -1,   -278,   -314,   -104,    205,    430,    462,
         319,    109,    -47,   -232
   };


We will now use these coefficients in a DMA-driven FIR filter. For comparison, we will use a direct C code implementation (interrupt driven), as well as an ARM CMSIS based implementation (dma driven) and compare the performance of both.

.. code::

   #include "xlaudio.h"
   #include "xlaudio_armdsp.h"
   
   #define BLOCKSIZE 8
   #define NUMTAPS 58
   
   q15_t taps[NUMTAPS + BLOCKSIZE - 1];
   arm_fir_instance_q15 F;
   
   typedef q15_t int16_T;
   
   #include "fdacoefs.h"
   
   void processBuffer(uint16_t x[BLOCKSIZE], uint16_t y[BLOCKSIZE]) {
       q15_t xq[BLOCKSIZE], yq[BLOCKSIZE];
       xlaudio_adc14_to_q15_vec(x, xq, BLOCKSIZE);
       arm_fir_q15(&F, xq, yq, BLOCKSIZE);
       xlaudio_q15_to_dac14_vec(yq, y, BLOCKSIZE);
   }
   
   // This is the interrupt-based (per-sample) filter operation for comparison
   // Note that you cannot simultaneously run the DMA-based scheme and interrupt-based scheme
   uint16_t processSample(uint16_t x) {
         q15_t v = xlaudio_adc14_to_q15(x);
         taps[0] = v;
         int i;
         q15_t q = 0;
         for (i=0; i<NUMTAPS; i++)
             q = q + (taps[i] * B[i] >> 15);
         for (i=NUMTAPS-1; i>0; i--)
             taps[i] = taps[i-1];
         return xlaudio_q15_to_dac14(q);
   }
   
   
   #include <stdio.h>
   
   int main(void) {
       WDT_A_hold(WDT_A_BASE);
   
       arm_fir_init_q15(&F, NUMTAPS, (int16_t *) B, taps, BLOCKSIZE);
   
       xlaudio_init_dma(FS_8000_HZ, XLAUDIO_J1_2_IN, BUFLEN_8, processBuffer);
       uint32_t c = xlaudio_measurePerfBuffer(processBuffer);
       printf("DMA Cycles: %d\n", c);
   
       // This is what you'd write in an interrupt-driven IO scheme:
       //
       //        xlaudio_init_intr(FS_8000_HZ, XLAUDIO_J1_2_IN, processSample);
       //        uint32_t c = xlaudio_measurePerfSample(processSample);
       //        printf("INT Cycles: %d\n", c);
   
       xlaudio_run();
   
       return 1;
   }

* The ``int16_T`` data type created by matlab can be equated with a ``q15_t``.

* The traditional implementation is shown in ``processSample``. Note the scaling of the multiplication by 15 bits, when we multiply q15*q15 and accumulate in a q15. The run the ``processSample`` version, uncomment the ``xlaudio_init_intr`` call in the main function (and comment out the ``xlaudio_init_dma`` call).

* In the DMA version, the ``processSample`` is replaced with a ``processBuffer`` function, which filters a block of samples. The :ref:`xlaudio_lib` has functions to convert a vector of samples from the ADC/to the DAC to internal q15, f32 or q31 datatype.

* The ``main`` function creates a DMA block-based setup rather than an interrupt-driven sample-based setup. The key parameter is ``BLOCKSIZE``, which describes the blocksize used by the DMA.

When this function is compiled with Global Optimization (level 2) at setting 3 for the speed-vs-size trade-off, we find that 7625 cycles are needed per DMA block (8 samples) while 1850 cycles are needed per interrupt (sample). Hence, the DMA version uses about 953 cycles per sample, which is a considerable improvement. The bulk of this improvement comes from the internal design of ``arm_fir_q15``. You can find the source code of ``arm_fir_q15`` on the source tree of the MSP432 software support package (simplelink_msp432p4_sdk_3_40_01_01/source/third_party/CMSIS/DSP_Lib/Source/FilteringFunctions).


IIR Designs using ARM CMSIS DSP
-------------------------------

Just as with FIR, there are several IIR variants in ARM CMSIS DSP. We illustrate a few examples that use single-precision floating point data types. There, of course, variants that use fixed-point precision data types as well. The full list `can be found online <https://arm-software.github.io/CMSIS_5/DSP/html/group__groupFilters.html>`_. We will discuss Direct Form  II Transpose Cascade. 

The following is an example of a IIR Direct Form II Transpose Cascade with float32_t coefficients.

.. important::

   .. code::

      void arm_biquad_cascade_df2T_f32   (   const arm_biquad_casd_df1_inst_f32 *    S,
                                            const float32_t *   pSrc,
                                            float32_t *     pDst,
                                            uint32_t    blockSize 
                                        )   

   **Parameters**

     * [in]  ``S`` points to an instance of the float32_t biquad cascade structure
     * [in]  ``pSrc``  points to the block of input data
     * [out] ``pDst``  points to the block of output data
     * [in]  ``blockSize`` number of samples to process

   **Returns**

   none

The data structure that holds the coefficients, ``arm_biquad_casd_df2T_inst_f32``, is organized in cascade sections as follows (`online reference <https://arm-software.github.io/CMSIS_5/DSP/html/structarm__biquad__cascade__df2T__instance__f32.html>`_):

.. code::

   typedef struct
   {
     uint32_t numStages;        
     float32_t *pState;           
     const float32_t *pCoeffs;          
   } arm_biquad_cascade_df2T_instance_f32;

``numStages`` holds the number of second-order-sections.
``pState`` is an array with state elements, and there are 2 times ``numStages`` state elements. The order of state elements is v[n-1], v[n-2].
``pCoeffs`` is an array of 5 times ``numStages`` coefficients, each holding the coefficients of a single cascade stage. The order of coefficients are b0, b1, b2, a1, a2, each time repeated per cascade stage.

Let's go through the example of an IIR design with the following specifications. The sample rate is 8KHz, the passband is 950Hz to 1050Hz with less than 1dB ripple. The lower stopband ends at 800Hz, and the higher stopband starts at 1100Hz. The stopband suppression must be better than 40dB. Matlab filterDesigner can translate these specs into an 8th-order IIR filter which can be decomposed into four second-order sections.

.. figure:: images/iirdmaspec.png
   :figwidth: 600px
   :align: center

As with the FIR filter design, we extract the filter coefficient from Matlab by generating a C header. In this case, the precision of the generated coefficients can be single-precision floating-point, as we target an ``arm_biquad_casd_df1_inst_f32`` filter which uses single-precision floating-point coefficients. This data structure has be reformatted somewhat to match the ARM CMSIS expectations for the ``arm_biquad_casd_df1_inst_f32`` data structure; we will include the translation as part of the program.

.. code::

   #include "xlaudio.h"
   #include "xlaudio_armdsp.h"
   
   // coefficients produced using Matlab filterDesigner
   
   typedef float32_t real32_T;
   
   #include "fdacoefs.h"
   
   // number of cascade sections
   #define NUMSECTIONS 4
   float32_t taps[2 * NUMSECTIONS];
   float32_t coefficients[5 * NUMSECTIONS];
   
   // shown as a function to illustrate how matlab format
   // maps to ARM CMSIS format. For the most compact
   // implementation, you would precompute this mapping
   // and initialize coefficients as a constant array
   
   void mat2armdsp() {
       coefficients[0]  =  NUM[0][0] * NUM[1][0];
       coefficients[1]  =  NUM[0][0] * NUM[1][1];
       coefficients[2]  =  NUM[0][0] * NUM[1][2];
       coefficients[3]  = -DEN[0][0] * DEN[1][1];
       coefficients[4]  = -DEN[0][0] * DEN[1][2];
       coefficients[5]  =  NUM[2][0] * NUM[3][0];
       coefficients[6]  =  NUM[2][0] * NUM[3][1];
       coefficients[7]  =  NUM[2][0] * NUM[3][2];
       coefficients[8]  = -DEN[2][0] * DEN[3][1];
       coefficients[9]  = -DEN[2][0] * DEN[3][2];
       coefficients[10] =  NUM[4][0] * NUM[5][0];
       coefficients[11] =  NUM[4][0] * NUM[5][1];
       coefficients[12] =  NUM[4][0] * NUM[5][2];
       coefficients[13] = -DEN[4][0] * DEN[5][1];
       coefficients[14] = -DEN[4][0] * DEN[5][2];
       coefficients[15] =  NUM[6][0] * NUM[7][0];
       coefficients[16] =  NUM[6][0] * NUM[7][1];
       coefficients[17] =  NUM[6][0] * NUM[7][2];
       coefficients[18] = -DEN[6][0] * DEN[7][1];
       coefficients[19] = -DEN[6][0] * DEN[7][2];
   }
   
   arm_biquad_cascade_df2T_instance_f32 F;
   
   #define BLOCKSIZE 8
   
   void processBuffer(uint16_t x[BLOCKSIZE], uint16_t y[BLOCKSIZE]) {
       float32_t xf[BLOCKSIZE], yf[BLOCKSIZE];
       xlaudio_adc14_to_f32_vec(x, xf, BLOCKSIZE);
       arm_biquad_cascade_df2T_f32(&F, xf, yf, BLOCKSIZE);
       xlaudio_f32_to_dac14_vec(yf, y, BLOCKSIZE);
   }
   
   #include <stdio.h>
   
   int main(void) {
       WDT_A_hold(WDT_A_BASE);
   
       mat2armdsp();
       arm_biquad_cascade_df2T_init_f32(&F, NUMSECTIONS, coefficients, taps);
   
       xlaudio_init_dma(FS_8000_HZ, XLAUDIO_J1_2_IN, BUFLEN_8, processBuffer);
   
       uint32_t c = xlaudio_measurePerfBuffer(processBuffer);
       printf("DMA Cycles: %d\n", c);
   
       xlaudio_run();
   
       return 1;
   }

* The matlab data type ``real32_T`` maps directly into ``float32_t`` as defined in ``xlaudio_armdsp.h``.

* The ``taps`` and ``coefficients`` data structures hold the taps and coefficients of each section. The ``mat2armdsp`` function illustrates how the matlab coefficients map into ARMCMSIS coefficients. Note the sign change for the feedback coefficients.

* The ``processBuffer`` function shows the processing of a DMA block. Similar as with the FIR design, conversion functions are used to convert a vector of ADC samples and DAC samples to ``float32_t`` and back.

* The ``main`` function includes an additional initialization call to ``arm_biquad_cascade_df1_init_f32`` to initialize the filter data stucture.

* At optimization level 2, setting '3', this filter requires 6350 cycles per block of 8 samples or around 793 cycles per sample. 

However, when the same IIR functionality is processed using an interrupt-driven (per-sample) scheme, the resulting filter only uses 402 cycles per sample at the same optimization level.  This relative inefficiency of the ARM CMSIS code is interesting but would require further study of the source code (simplelink_msp432p4_sdk_3_40_01_01/source/third_party/CMSIS/DSP_Lib/Source/FilteringFunctions). At least, this demonstrates that performance improvement by using external libraries is not unconditional. Performance evaluation is always needed!

Conclusions
-----------

We discussed the use of a DSP library with a standard API called ARM CMSIS DSP. Combined with a DMA-based I/O scheme, this library provides an optimized implementation of many filter constructions. We will use ARM CMSIS DSP regularly in the coming labs.