Fixed Point Arithmetic in DSP
The purpose of this lecture is as follows.
To describe fixed-point data representation of signals
To describe arithmetic using fixed-point data representation
To discuss the effects of fixed-point arithmetic on DSP
To discuss the influence of coefficient quantization on FIR and IIR
To demonstrate the use of filter coefficient quantization in Matlab
Fixed Point Data Representation
Floating-point representation is the default adopted in many scientific computations, as well as in the world of signal processing. Matlab, for example, computes filter coefficients by default in double-precision (64-bit) floating point.
On the other hand, compared to integer arithmetic, floating point arithmetic is complex. Since a floating point number is represented using a mantissa and an exponent, every arithmetic operation involving a floating point number implies operations on both the mantissa as well as the exponent. Furthermore, floating point numbers have to be aligned before every operation, and they have to be normalized after every operation. Floating point arithmetic has a high implementation cost, much larger than that of typical integer arithmetic.
In DSP processing applications, the increased complexity of floating point arithmetic will manifest itself in two areas.
If the target processor includes floating-point hardware (such as the Cortex-M4F that we’re using on our MSP432 experimentation board), then the use of floating-point arithmetic - as opposed to integer arithmetic - will increase the power consumption of the processor. Furthermore, given the same number of operations, a floating-point program will require more energy than an integer program.
If the target processor does not include floating-point hardware (for example, because it’s a smaller microcontroller), then floating-point operations will have to be emulated in software, which will lead to a significant performance hit.
Therefore, to optimize the performance or the power/energy footprint of a program, DSP programmers are often required to optimize the cost of the arithmetic by converting floating-point arithmetic to integer arithmetic. We will discuss a method to achieve this optimization systematically, by converting the data types in a DSP program from floating-point data representation to fixed-point data representation. The conversion process from a DSP program using floating-point data types into a DSP program using fixed-point data types is called fixed-point refinement.
Unsigned Fixed-point representation
Assume an unsigned integer datatype with $N$ bits $a_{N-1} a_{N-2} \ldots a_0$. The value of the unsigned integer number is defined as follows.

$$V_{int} = \sum_{i=0}^{N-1} a_i \, 2^i$$
In a fixed-point representation, the binary point shifts from the right-most position to a different position $k$. Hence, a fixed-point representation is defined by the number of bits $N$ as well as the position of the binary point $k$. For an unsigned number we adopt the notation ufix<N,k>, where ufix means ‘unsigned fixed-point’. An unsigned integer as defined above would be ufix<N,0>.
The value of a fixed-point number ufix<N,k> is computed as follows:

$$V_{ufix} = \sum_{i=0}^{N-1} a_i \, 2^{i-k}$$
A ufix<N,k> fits in the same number of bits as an N-bit unsigned integer. The only difference lies in our interpretation of what these bits mean. The following is an example for a ufix<6,4>. A given bit pattern can be evaluated with the formulas above to compute the integer value as well as the fixed-point value.

The value of an unsigned integer of N bits and the value of an unsigned fixed-point number ufix<N,k> are related through the following expressions.

$$V_{ufix} = \frac{V_{int}}{2^k}, \qquad V_{int} = V_{ufix} \cdot 2^k$$
Signed Fixed-point representation
Signed values in integer arithmetic are captured in one of three different ways. The most common method is two’s complement, and we’ll focus on this method. The two other methods are one’s complement and sign-magnitude representation.
In a two’s complement representation, the value of a signed integer int<N> is defined as follows.

$$V_{int} = -a_{N-1} \, 2^{N-1} + \sum_{i=0}^{N-2} a_i \, 2^i$$
Thus, the weight of the most significant bit is negative. Similarly, the value of a signed fixed-point number fix<N,k> is defined as follows.

$$V_{fix} = -a_{N-1} \, 2^{N-1-k} + \sum_{i=0}^{N-2} a_i \, 2^{i-k}$$

Thus the only difference between signed and unsigned representation is the weight of the most significant bit (positive for unsigned, negative for signed two’s complement). The relationship between the value of the integer and fixed-point representation is the same as above:

$$V_{fix} = \frac{V_{int}}{2^k}$$
Conversion from Floating Point to Fixed Point
N-bit fixed-point numbers can be conveniently represented as N-bit integers,
and in a C program we will use integer data types (int
, unsigned
) to
store fixed-point numbers. A floating-point number is converted to a
fixed-point number by proper scaling and conversion to an integer data type.
To convert a floating-point number to a fixed-point one, we first scale it up by $2^k$ so that the least significant fractional bit gets unit weight. This conversion can lead to precision loss (at the LSB side) and
overflow (at the MSB side). The following program illustrates the conversion
to a
fix<8,6>
datatype.
#include <stdio.h>

int main(void) {
    float taps[5] = {0.1, 1.2, -0.3, 3.4, -1.5};
    int taps_8_6[5]; // a fix<8,6> type
    unsigned i;
    for (i = 0; i < 5; i++)
        taps_8_6[i] = (int) (taps[i] * (1 << 6));
    for (i = 0; i < 5; i++)
        printf("%+4.2f %10d\n", taps[i], taps_8_6[i]);
    return 0;
}
The program generates the following output:
+0.10 6
+1.20 76
-0.30 -19
+3.40 217
-1.50 -96
The value of a fix<8,6>
lies between -2 (for pattern 10000000)
and 1.984375 (for pattern 01111111), which corresponds to integer values -128 to 127. Hence, the floating-point value +3.40 has suffered overflow during the conversion: its integer value, 217, does not fit in 8 bits. The C program does not suffer from this overflow effect because we’re emulating the fix<8,6>
in an int
.
In addition, some of the floating point numbers (like 0.1) cannot be expressed exactly. Indeed, the value 6 maps to the bit pattern 00000110, or 0.09375. Hence, that conversion suffered precision loss.
Note that negative numbers are printed as negative because of sign extension. For example, -19 corresponds to the bit pattern 1111…11101101. The lower 8 bits are 11101101, while the upper 24 bits are all 1, extending the sign bit of the fix<8,6>.
Overflow is a highly non-linear effect with dramatic impact on the results of DSP computation. It must be detected (and prevented).
Conversion from Fixed Point to Floating Point
After computations in fixed point precision are complete, we can convert
the result back to a floating-point representation, in order to obtain the numerical result. The conversion scales down the integer value of a fixed-point number by a factor $2^k$, so that the least significant bit gets its actual weight.
Special care has to be taken if the MSB of the fixed point representation is not located at the MSB of the integer representation. Indeed, to obtain the correct sign in the converted result, we have to replicate the MSB of the fixed point representation to get sign extension. The following program illustrates the conversion from a fix<8,6>
datatype.
#include <stdio.h>

int main(void) {
    int taps_8_6[5] = {6, 76, -19, 127, -96}; // a fix<8,6> type
    float taps[5]; // the reconstructed floating-point values
    unsigned i;
    for (i = 0; i < 5; i++)
        taps[i] = taps_8_6[i] * 1.0 / (1 << 6);
    for (i = 0; i < 5; i++)
        printf("%10d %+4.6f\n", taps_8_6[i], taps[i]);
    return 0;
}
The program generates the following output:
6 +0.093750
76 +1.187500
-19 -0.296875
127 +1.984375
-96 -1.500000
Note that the fixed-point conversion introduces quantization which prevents perfect reconstruction of the floating-point value. For example, float(0.1) can be converted to fix<8,6>(6), but the opposite conversion yields float(0.09375).
Fixed Point Arithmetic
Because fixed-point representation is strongly related to integer representation, we are able to express arithmetic operations on fixed-point numbers in terms of integer operations. Let’s first derive a few basic rules. We’ll make the derivation for unsigned numbers.
Addition
When we add two ufix<N,k> numbers, the result is a ufix<N+1,k> number. The extra bit at the MSB side is there to capture the carry bit, in case one is generated. The subtraction of two ufix<N,k> numbers uses the same rule, since the subtraction can be defined as the addition of a ufix<N,k> with the two’s complement version of the other ufix<N,k>.
When we add a ufix<N,k1> number to a ufix<N,k2> number, with k1 > k2, the two numbers have to be aligned first. This alignment increases the wordlength of the sum by k1 - k2 bits. However, if we desire to capture the sum as an N+1 bit number, then there are two possible alignments.
For a ufix<N+1,k1> sum, we will have to increase the number of fractional bits of the ufix<N,k2> operand with a left-shift before addition.
For a ufix<N+1,k2> sum, we will have to decrease the number of fractional bits of the ufix<N,k1> operand with a right-shift before addition.
Since the total sum has only N+1
bits, there is potential precision loss, as illustrated in the following figure. Left-shifting V2 may cause an overflow when MSB-side bits are lost, while right-shifting V1 may cause precision loss when LSB-side bits are lost. The bottom line is that, when combining numbers with a different number of fractional bits, an alignment must be done which may cause either precision loss or else overflow.

Multiplication
When we multiply two ufix<N,k> numbers, the result is a ufix<2N,2k> number. If the result has to be captured in a ufix<N,k>, then the result of the multiplication has to be right-shifted over k bits.
In this case, precision loss may occur both at the MSB side as well as the LSB side.

Example
The standard C operators use the following precision rules.
Adding two 32-bit integers will yield a 32-bit integer. Hence, overflow is possible.
Multiplying two 32-bit integers will yield a 32-bit product, which may cause overflow.
When we write a C program using integers to emulate fixed-point data types, the same precision rules will still apply. The following example illustrates how multiplication and addition in fixed-point representation works.
We assume a vector of fix<8,7> coefficients which is multiplied with fix<8,7> tap values. We will implement these data types using the C int datatype.
#include <stdio.h>

int main(void) {
    float c[5]    = {0.2, -0.4, 0.8, -0.4, 0.2};
    float taps[5] = {0.1, 0.2, 0.3, 0.4, 0.5};
    float result;
    int c_8_7[5];
    int taps_8_7[5];
    int result_16_14;
    int i;

    // convert c (float) to c_8_7 (fix<8,7>)
    for (i = 0; i < 5; i++)
        c_8_7[i] = (int) (c[i] * 128);

    // convert taps (float) to taps_8_7 (fix<8,7>)
    for (i = 0; i < 5; i++)
        taps_8_7[i] = (int) (taps[i] * 128);

    // multiply-accumulate in fixed point:
    // <8,7> * <8,7> -> <16,14>; the accumulation fits in a 32-bit int
    result_16_14 = 0;
    for (i = 0; i < 5; i++)
        result_16_14 += (c_8_7[i] * taps_8_7[i]);

    // the same multiply-accumulate on a float data type
    result = 0.0f;
    for (i = 0; i < 5; i++)
        result += (c[i] * taps[i]);

    printf("flp: %f fixp: %f\n", result, result_16_14 * 1.0f / (1 << 14));
    return 0;
}
Since the result of the calculation is a <16,14>
data type, we have to divide the result by 1 << 14
to find the equivalent real value. The output of the program is shown next.
flp: 0.120000 fixp: 0.115967
DSP with fixed-point arithmetic
When computing a digital filter using fixed-point arithmetic, systematic errors occur, called **quantization noise**. We’ll discuss the nature of quantization noise by means of an example using computations on a fix<3,2> datatype.

The blue staircase on the curve represents the conversion from floating-point to fixed-point representation. For a fix<3,2>, the smallest quantization step is 0.25, so the fixed-point value increments for every 0.25 step. The highest positive output value, ‘011’, corresponds to floating-point value 0.75. If we increase by one more quantization step, overflow will occur.
At the negative side, the staircase decreases a step for every 0.25 decrease.
An alternative to overflow is saturation, a technique that caps results at the most positive or most negative value that can be held in a fixed-point representation. This mitigates the highly non-linear overflow effect in DSP. The implementation of saturating arithmetic, however, is more complicated than plain integer arithmetic. In our implementations, we will rely on plain integer arithmetic, which overflows.
The red curve indicates the quantization error, the difference between the quantized value and the real (floating point) value. The quantization error for two’s complement fixed-point representation is always negative, which means that the value in the fixed-point representation always under-estimates the true value. The particular sawtooth shape of the error curve, however, demonstrates an important property of quantization noise: it is uniformly distributed over the range of one quantization step. In this case, the quantization error is uniformly distributed over the range [-0.25, 0].
Quantization noise degrades signal quality. Since it can be treated as a uniformly distributed random variable, quantization noise appears as wideband noise in the signal output.
Fixed-point quantization on the MSP-432 kit
So far, we have converted the 14-bit ADC values into floating point values before starting the filtering, and from floating point values back into 14-bit DAC values after filtering.
uint16_t processSample(uint16_t x) {
    float32_t input = adc14_to_f32(x);
    // ... processing
    return f32_to_dac14(input);
}
For fixed-point computations, we make use of one of two possible conversions to fixed point. adc14_to_q15() converts ADC samples into a fix<16,15> data type, while adc14_to_q31() converts ADC samples into a fix<32,31> data type.
The following figure shows the correspondence between analog values, DAC and ADC codes, floating point values, and fixed-point values.

When writing fixed-point implementations of a filter, it’s crucial to remember the data type of the samples: fix<16,15>
for a Q15, and fix<32,31>
for a Q31. For example, if you multiply these samples with coefficients of type Q15 (fix<16,15>
), then the result will be fix<32,30>
for Q15 * Q15, and fix<48,46>
for Q15 * Q31.
The former case, Q15 * Q15, requires downshifting before it can be sent to the DAC output. The latter case, Q15 * Q31, will require a 64-bit integer, since 48 bits do not fit into a standard 32-bit integer.
Let’s look at the fixed-point implementation of the following first-order low-pass filter:

$$y[n] = x[n] + 0.5 \, y[n-1]$$
A floating-point version of this filter is straightforward to capture:
float32_t lpfloat(float32_t x) {
    static float32_t state;
    float32_t y = x + state * 0.5;
    state = y;
    return y;
}
To quantize this design, we adopt a Q15 data type for the input x. In addition, the state variable and the coefficient 0.5 are represented as Q15 values as well. The complete quantized filter is given by the following code:
q15_t lpq15(q15_t x) {
    static int state;                           // accumulate as fix<16,15>
    const int coeff = (int) (0.5 * (1 << 15));  // coeff as <16,15>
    int mul = state * coeff;                    // mul is <32,30>
    state = x + (mul >> 15);                    // add <16,15> and <17,15>
    int y = state;                              // return output
    return y;
}
Fixed-point quantization in Matlab Filter Designer
In Matlab, filter coefficients can be generated directly in fixed-point
precision in filterDesigner
. Often, a filter is first designed
using floating-point precision in order to decide on the proper filter order and filter type. Next, the filter coefficients are quantized such that the floating point filter characteristic shows limited degradation. This entire process is highly automated by filterDesigner
, as the following example shows.
Let’s start from a FIR design for a lowpass filter with a passband ripple of 1 dB and a rejection of 80 dB. We choose a sample frequency of
and obtain the following characteristics for an equiripple FIR. As expected, the filter is symmetric and has 20 coefficients (taps). The overall filter is 19th order.
/*
* Discrete-Time FIR Filter (real)
* -------------------------------
* Filter Structure : Direct-Form FIR
* Filter Length : 20
* Stable : Yes
* Linear Phase : Yes (Type 2)
*/
/* General type conversion for MATLAB generated C-code */
#include "tmwtypes.h"
/*
* Expected path to tmwtypes.h
* C:\Program Files\MATLAB\R2020a\extern\include\tmwtypes.h
*/
const int BL = 20;
const real64_T B[20] = {
-0.003619634086643, -0.01225863817276, -0.01011137054999, 0.02046024442863,
0.048353670252, 0.009533024940125, -0.0706222147752, -0.04053698350706,
0.1791779657185, 0.4155023294505, 0.4155023294505, 0.1791779657185,
-0.04053698350706, -0.0706222147752, 0.009533024940125, 0.048353670252,
0.02046024442863, -0.01011137054999, -0.01225863817276,-0.003619634086643
};
To create a fixed-point precision version, we use the quantization parameters tab, which allows you to set the quantization of the inputs and the coefficients. We select quantization to a Q15 data type (fix<16,15>), as shown in the following screenshot.

The resulting coefficients are shown below. Note that the filter coefficient header lists additional details on the coefficient quantization. It specifies the input precision and coefficient (numerator) precision. The filter uses a Q15 input and Q15 coefficients, and produces a fix<32,30> output. The coefficients are quantized versions of the previous floating-point precision coefficients. For example, the -119 constant value is the quantized version of -0.003619634086643, since $-0.003619634086643 \times 2^{15} \approx -118.6$, which quantizes to $-119$.
/*
* Filter Coefficients (C Source) generated by the Filter Design and Analysis Tool
* Generated by MATLAB(R) 9.8 and DSP System Toolbox 9.10.
* Generated on: 14-Nov-2021 11:29:10
*/
/*
* Discrete-Time FIR Filter (real)
* -------------------------------
* Filter Structure : Direct-Form FIR
* Filter Length : 20
* Stable : Yes
* Linear Phase : Yes (Type 2)
* Arithmetic : fixed
* Numerator : s16,15 -> [-1 1)
* Input : s16,15 -> [-1 1)
* Filter Internals : Full Precision
* Output : s32,30 -> [-2 2) (auto determined)
* Product : s30,30 -> [-5.000000e-01 5.000000e-01) (auto determined)
* Accumulator : s32,30 -> [-2 2) (auto determined)
* Round Mode : No rounding
* Overflow Mode : No overflow
*/
/* General type conversion for MATLAB generated C-code */
#include "tmwtypes.h"
/*
* Expected path to tmwtypes.h
* C:\Program Files\MATLAB\R2020a\extern\include\tmwtypes.h
*/
const int BL = 20;
const int16_T B[20] = {
-119, -402, -331, 670, 1584, 312, -2314, -1328, 5871,
13615, 13615, 5871, -1328, -2314, 312, 1584, 670, -331,
-402, -119
};
filterDesigner
also allows you to investigate the difference between the quantized filter characteristic and the floating-point filter characteristic. The following plots show that the quantized version of the filter achieves less suppression in the stop band, and that the zeros of the quantized filter have moved slightly from the positions defined by the floating-point precision version.


Conclusions
We introduced fixed-point data representation as a technique to implement DSP programs using integer arithmetic. In applications where floating point hardware is unavailable, fixed-point implementations are crucial.
We discussed the representation of a fixed-point data type, as well as the rules for addition and multiplication using fixed-point data types. Crucially, in fixed-point arithmetic the programmer is responsible for data alignment of the expression operands. That alignment, in addition to ensuring that sufficient wordlengths are available (so as to prevent overflow), is the key challenge in using fixed-point arithmetic.
We discussed the impact of fixed-point arithmetic on DSP, and in particular the additional quantization noise generated by the fixed-point representation. Finally, we illustrated fixed-point arithmetic in a simple lowpass filter. A tool such as Matlab filterDesigner has built-in logic to quantize filter coefficients to a desired wordlength, while analyzing the impact of quantization on the filter characteristic.