#### Introduction

Most real-time signal processing algorithms have to operate on floating-point numbers. Algorithms implemented in high-level languages have native support for floating-point math operations. When such algorithms run on processors that support only integer math, significant overhead occurs in both memory and performance. For such processors, fixed-point numbers are used.

Q-format (a shorthand notation for fixed-point) represents decimal data in a specific format that accommodates a sign bit, integer bits, and fractional bits. There are two common ways to write it:

- Qm → where m is the number of fractional bits.
- Qx.y → where x is the number of integer bits + 1 (sign bit) and y is the number of fractional bits.

*(Figure: a generalized representation for 32 bits: 1 sign bit, (31 - m) integer bits, and m fractional bits.)*

Q-format can use any bit width (8, 16, 24, 32); remember that x + y in the Qx.y notation gives the total number of bits. For the Qm notation, the total number of bits must be stated separately.

Using this translation of decimal data, mathematical operations can be performed efficiently with only a small loss of accuracy. Fixed-point data types are widely used in digital signal processing applications where performance matters more than precision. For processors that have no floating-point unit, fixed-point implementations of algorithms provide improved performance. TI’s C5000 DSP family is an example of fixed-point processors.

#### Representation in Fixed Point

A Q-format number is stored as an integer: the decimal value scaled by a fixed factor. The scaling factor is chosen based on the range of values to be represented and the precision required.

The maximum and minimum values that can be represented by a particular Q-format are determined by the number of bits used for the integer and fractional parts of the number. The maximum positive number that can be represented in Q31 format (a 32-bit variable) is 0x7FFFFFFF, which corresponds to 1 - 2^-31 (≈ 0.9999999995). The minimum positive number in this format is 0x00000001, which corresponds to 2^-31. *(For the ranges of different Q-formats, please refer to the glossary.)*

*Example of converting float to fixed:*

Consider the floating-point number 0.5. To represent this number in Q31 format (31 bits for the fractional part), the number has to be multiplied by 2^31.

Hence, 0.5 can be represented in Q31 format as 0.5 * 2^31 = 1073741824.
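The conversion above can be sketched in C as follows. The helper names `float_to_q31` and `q31_to_float` are illustrative, not a standard API:

```c
#include <stdint.h>

/* Convert a float in [-1.0, 1.0) to Q31 (Q1.31). */
static int32_t float_to_q31(float x)
{
    /* Scale by 2^31 and truncate; production code would also round
     * and saturate out-of-range inputs. */
    return (int32_t)(x * 2147483648.0f);
}

/* Convert a Q31 value back to float by dividing by 2^31. */
static float q31_to_float(int32_t q)
{
    return (float)q / 2147483648.0f;
}
```

For example, `float_to_q31(0.5f)` yields 1073741824, matching the hand calculation above.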

#### Basic Operations

Unlike floating-point implementations, fixed-point implementations of algorithms have to be written for a specific Q-format. Based on the range and precision required, one has to customize the modules accordingly. In this section we discuss the basic math operations; working through them is a good way to build understanding.

**Addition**

Suppose you have two float numbers A and B. Represented in Q-format with scaling factor K, they become A*K and B*K respectively. Their sum is (A+B)*K, which is exactly the Q-format representation of the sum of the float values. Thus, it is mandatory that the two operands be in the same Q-format. There is, however, a possibility that the result overflows the number of bits that the data type can accommodate. Hence, before selecting the Q-format, one has to consider the data range of both inputs and outputs as well as the precision.

Let’s understand the overflow from the following example:

Consider A = 64 and B = 64 in signed **Q8.24** format (32-bit data type), whose **MAX_POSITIVE** integer value is 127. Then A + B results in 128, which is greater than **MAX_POSITIVE** and cannot be represented in **Q8.24** format; this is an overflow. Hence the output of every addition has to be checked, and the overflow handled either by saturating the result to 127 or by changing to a Q-format that can accommodate 128. The latter is done by converting each input from **Qx.y to Q(x+i).(y-i)** before the addition. Hence, at the cost of precision loss, the dynamic range is increased.
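The saturation option just described can be sketched as a small C helper (the name `q8_24_add_sat` is hypothetical; this assumes Q8.24 stored in a 32-bit two's-complement `int32_t`):

```c
#include <stdint.h>

/* Saturating addition for signed Q8.24 values held in int32_t.
 * The true sum is computed in 64 bits and then clamped to the
 * representable range (about -128.0 to +127.99999994). */
static int32_t q8_24_add_sat(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* widen so the sum cannot wrap */
    if (sum > INT32_MAX) return INT32_MAX;  /* clamp positive overflow */
    if (sum < INT32_MIN) return INT32_MIN;  /* clamp negative overflow */
    return (int32_t)sum;
}
```

With A = B = 64 (i.e. `64 << 24` in Q8.24), the clamped result is 0x7FFFFFFF, the largest representable Q8.24 value, instead of a wrapped negative number.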

In the example below, when we add **0.7 and 0.3** in **Q1.31**, the true result is 1.0, which can’t be represented in **Q1.31** (the maximum is 1 - 2^-31 ≈ 0.99999…). Thus, the result will be:

A = 0.7f → 1503238553 (0x59999999) Q1.31

B = 0.3f → 644245095 (0x26666667) Q1.31

A + B → 0x80000000 (without a Q-format change; the sum wraps to -1.0)

So instead of using **Q1.31**, we should use **Q2.30**, which prevents the overflow condition shown above.

A → 0x2CCCCCCC (Q2.30)

B → 0x13333333 (Q2.30)

A + B → 0x3FFFFFFF (Q2.30)
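Both the wraparound and the Q2.30 fix can be reproduced in C. The helper names below are hypothetical; this is a sketch assuming 32-bit two's-complement arithmetic:

```c
#include <stdint.h>

/* 32-bit addition with well-defined wraparound (signed overflow is
 * undefined in C, so the arithmetic is done on uint32_t). */
static int32_t add_wrap(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Requantize Q1.31 to Q2.30: trade one fractional bit for one
 * integer bit by shifting right. */
static int32_t q1_31_to_q2_30(int32_t q)
{
    return q >> 1;
}
```

Adding the Q1.31 patterns for 0.7 and 0.3 with `add_wrap` gives 0x80000000 (a negative value, i.e. overflow); after `q1_31_to_q2_30` on each operand, the same addition gives 0x3FFFFFFF, about 1.0 in Q2.30.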

**Subtraction**

Subtraction is similar to addition. Instead of subtracting the operands, processors take the two's complement of the second operand and add; this is one reason the multiply-accumulate (MAC) unit is so central in DSPs. Subtraction can therefore be understood as addition. With scaling factor K, the Q-format operands for A and B are A*K and B*K, and the result is (A-B)*K, again under the assumption that both numbers are in the same format.

Let’s look at the cases of overflow in subtraction:

**Case 1:** operand_1 is positive & operand_2 is negative in **A-B**

If argument one is greater than zero, argument two is less than zero, and the result of the subtraction is less than zero, then the result has overflowed: ((arg1 > 0) && (arg2 < 0) && (res < 0))

0.7f - (-0.4f): this case results in overflow (the true result, 1.1, exceeds the Q1.31 range).

0.7f ⇒ 0x59999999 (Q1.31)

-0.4f ⇒ 0xCCCCCCCD (Q1.31)

-(-0.4f) ⇒ 0x33333333 (in 2’s complement)

```
  0x59999999
+ 0x33333333
------------
  0x8CCCCCCC   result is less than zero, so this indicates an overflow
```

**Case 2:** operand_1 is negative & operand_2 is positive in **A-B**

If argument one is less than zero, argument two is greater than zero, and the result of the subtraction is greater than zero, then the result has overflowed: ((arg1 < 0) && (arg2 > 0) && (res > 0))

-0.7f ⇒ 0xA6666667 (Q1.31)

0.4f ⇒ 0x33333333 (Q1.31)

-(0.4f) ⇒ 0xCCCCCCCD (in 2’s complement)

```
  0xA6666667
+ 0xCCCCCCCD
------------
  0x73333334   result is greater than zero, so this indicates an overflow
                (the carry out of bit 31 is discarded)
```

In such cases, either a change of format or saturation of the data is required. So, before implementing a module in Q-format, the programmer has to be aware of the worst-case extreme values and leave sufficient room in the integer part, at the cost of some precision.
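The two sign-based checks above can be combined into one helper. The function name is hypothetical; a sketch assuming 32-bit two's-complement arithmetic:

```c
#include <stdint.h>
#include <stdbool.h>

/* Detect overflow in the fixed-point subtraction res = a - b using the
 * sign rules from the two cases above:
 *   case 1: a > 0, b < 0, res < 0  -> overflow
 *   case 2: a < 0, b > 0, res > 0  -> overflow */
static bool sub_overflows(int32_t a, int32_t b)
{
    /* Wrapping subtract via uint32_t, since signed overflow is UB in C. */
    int32_t res = (int32_t)((uint32_t)a - (uint32_t)b);
    if (a > 0 && b < 0 && res < 0) return true;
    if (a < 0 && b > 0 && res > 0) return true;
    return false;
}
```

Applied to the examples above: 0.7 - (-0.4) (case 1) and -0.7 - 0.4 (case 2) both report overflow, while 0.5 - 0.25 does not.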

**Multiplication**

The procedure for multiplying fixed-point numbers is the same as multiplying two binary numbers, except for keeping track of the binary point. This basic operation has its own advantages and disadvantages: the advantage is that there is no restriction that the two operands be in the same format; the disadvantage is that the result is twice the width of the operands.

Q6.10 * Q6.10 ⇒ Q12.20 (32-bit)

Qx1.y1 * Qx2.y2 ⇒ Q(x1+x2).(y1+y2)

Let’s follow the movement of the binary point (which separates integer and fraction):

A = 25 ⇒ 011001.0000000000

B = 25 ⇒ 011001.0000000000

A*B = 625 ⇒ 001001110001.00000000000000000000

While 25 is representable in **Q6.10** format, 625 is not, hence one has to know the range of inputs and outputs before setting the Q-format. If required, one can change the Q-format in the middle of the calculation, but then all later operations must account for the new format.
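A minimal sketch of the Q6.10 multiplication above (the function name is illustrative):

```c
#include <stdint.h>

/* Q6.10 * Q6.10 -> Q12.20: two 16-bit Q6.10 operands are widened and
 * multiplied; the fractional bits of the result are the sum of the
 * operands' fractional bits (10 + 10 = 20). */
static int32_t q6_10_mul(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;  /* result is Q12.20 */
}
```

For example, 25 in Q6.10 is `25 << 10`; multiplying it by itself gives `625 << 20`, i.e. 625 in Q12.20, matching the worked example above.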

In the above example, if the output has to be truncated from 32 to 16 bits, one has to either saturate the data or change the Q-format. Saturation is a dead end that is unacceptable in some applications, so it is often better to change the Q-format, or to scale the data back down to the input format level and undo the scaling at the end of the module.

For high-range inputs, where precision is not critical, another method can be used: after the multiplication, downscale the data by 2^6 (the six extra MSBs added by the multiply), so the result 625 becomes 625 / 2^6 = 9.765625, which is representable in Q6.10, and feed that to the next stage (applicable to linear functions or linear systems only). After processing, scale the output back up by the same amount at the end of the entire program. In this way, one avoids the overhead of Q-format changes.

**Identity multiplication:**

Like the multiplicative identity in math (A * 1 = A), Q-format has an identity multiplication in which the result Q-format is the same as the input Q-format.

**Q6.10 * Q1.15 ⇒ Q7.25**, but the upper two MSBs are duplicated sign bits, so one bit can be dropped, and the lower 15 bits can be rounded away.

This returns the result to the **Q6.10** format. Hence, using the **Q1.15** format makes the programmer’s life easy, as any **Q1.15 * Q1.15** product can be brought straight back into **Q1.15**.
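The Q1.15 case can be sketched as follows (truncating rather than rounding, for brevity; the function name is illustrative):

```c
#include <stdint.h>

/* Q1.15 * Q1.15 -> Q1.15: the full 32-bit product is Q2.30; shifting
 * right by 15 drops the redundant sign bit and the low fractional
 * bits, returning the result to Q1.15. */
static int16_t q1_15_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;  /* Q2.30 intermediate */
    return (int16_t)(prod >> 15);            /* back to Q1.15 */
}
```

For example, 0.5 in Q1.15 is 16384, and `q1_15_mul(16384, 16384)` gives 8192, i.e. 0.25 in Q1.15. (The one pathological input pair is -1.0 * -1.0, whose true result 1.0 is not representable and would need saturation.)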

**Multiplication followed by right-shift:**

A right shift is a division by a power of two, i.e. it increases the integer portion of the Q-format. After a multiplication, if one wants more integer bits in order to accommodate higher-range values, the data has to be right-shifted. This is the same as the procedure described above.

**Multiplication followed by left-shift:**

A left shift is a multiplication by a power of two, which means decreasing the integer portion of the Q-format. After a multiplication, if the integer portion does not need so many bits because the data values are small, a left shift can be used, which increases the precision.
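Both shift directions can be expressed as tiny helpers (hypothetical names; the caller must ensure the value fits the target range, since an overflowing signed left shift is undefined behavior in C):

```c
#include <stdint.h>

/* Right shift: Qx.y -> Q(x+i).(y-i). More integer range, less precision. */
static int32_t q_gain_range(int32_t q, int i)
{
    return q >> i;
}

/* Left shift: Qx.y -> Q(x-i).(y+i). More precision, less integer range.
 * Precondition: the value must fit in the narrower integer part. */
static int32_t q_gain_precision(int32_t q, int i)
{
    return q << i;
}
```

For example, 0.5 in Q1.31 (0x40000000) becomes 0x20000000 in Q2.30 after `q_gain_range(q, 1)`, and `q_gain_precision` reverses that step.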

**Glossary**


- Qm — format with m fractional bits; a stored integer x represents the value x * 2^-m. The total bit width must be stated separately.
- 1.15 — 16-bit fixed-point data type with 1 sign bit and 15 bits to the right of the binary point. The largest positive value 0x7fff is interpreted as 1.0 - 2^-15. The smallest negative value 0x8000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
- 9.23 — 32-bit fixed-point data type with a 9-bit integer part and 23 bits to the right of the binary point. The largest positive value 0x7fffffff is interpreted as 256.0 - 2^-23. The smallest negative value 0x80000000 is interpreted as -256.0. The value 0 is interpreted as 0.0.
- 1.23 — 24-bit fixed-point data type with 1 sign bit and 23 bits to the right of the binary point. The largest positive value 0x7fffff is interpreted as 1.0 - 2^-23. The smallest negative value 0x800000 is interpreted as -1.0. The value 0 is interpreted as 0.0. Since register halves hold 32 bits, not 24 bits, typical 24-bit fractional variables are 9.23. However, 24-bit fixed-point multiply instructions ignore the upper 8 bits, thereby treating them as 1.23.
- 1.31 — 32-bit fixed-point data type with 1 sign bit and 31 bits to the right of the binary point. The largest positive value 0x7fffffff is interpreted as 1.0 - 2^-31. The smallest negative value 0x80000000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
- 17.47 — 64-bit fixed-point data type with a 17-bit integer part and 47 bits to the right of the binary point. The largest positive value 0x7fff ffff ffff ffff is interpreted as 65536.0 - 2^-47. The smallest negative value 0x8000 0000 0000 0000 is interpreted as -65536.0. The value 0 is interpreted as 0.0.
- 1.63 — 64-bit fixed-point data type with 1 sign bit and 63 bits to the right of the binary point. The largest positive value 0x7fff ffff ffff ffff is interpreted as 1.0 - 2^-63. The smallest negative value 0x8000 0000 0000 0000 is interpreted as -1.0. The value 0 is interpreted as 0.0.
