#### BY: LIC. EZEQUIEL MORFI | TITANIO

## FLOATING-POINT CALCULATIONS

Today’s digital audio processing is accomplished entirely using FLOATING-POINT mathematical calculations: in a single-precision format of 32 bits (informally called “floats”), in a double-precision format of 64 bits (“double-floats”) or, in some cases, in the extended-precision format of 80 bits (“long-floats”).

Unlike the fixed-point format, floating-point numbers DO NOT have a set, defined dynamic range. Their dynamic range is not truly UNLIMITED, but for all practical purposes it is UNDEFINED. Signals can go above and below the fixed-point theoretical maximum of 0 dBFS to virtually any value without suffering any form of overloading, clipping or degradation. However, a specific dynamic range CAN be defined in advance by setting a relation between the ±1 mantissa range and the −126/+127 range of the exponent.
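To put a rough number on that headroom: assuming, purely for illustration, that a sample value of 1.0 corresponds to 0 dBFS, the range of normal IEEE 754 single-precision values can be sketched in a few lines of Python:

```python
import numpy as np

# Representable range of IEEE 754 single precision, in dB relative to a
# full-scale reference of 1.0 (taking 1.0 as 0 dBFS is an assumption made
# here only for illustration).
info = np.finfo(np.float32)
headroom_db = 20 * np.log10(float(info.max))   # largest normal value, ~ +770 dB
floor_db = 20 * np.log10(float(info.tiny))     # smallest normal value, ~ -759 dB
span_db = headroom_db - floor_db               # ~ 1,529 dB of representable range
```

The span between the largest and smallest normal values comes to roughly 1,500 dB, which is why the range is best treated as undefined rather than as a limit anyone could ever hit in audio work.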

There is no predefined dynamic range in a floating-point environment other than the boundaries each programmer chooses to impose per instance of processing, should they choose to at all, for example in order to set a determined, finite dynamic range inside a dynamics-processor plug-in such as a compressor or an expander.

The SIGNAL-TO-NOISE RATIO for a floating-point number, however, is a different story. A 32-bit floating-point word comprises a 1-bit sign, an 8-bit exponent and a 24-bit mantissa (23 explicitly stored). A 64-bit floating-point word comprises a 1-bit sign, an 11-bit exponent and a 53-bit mantissa (52 explicitly stored).
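These bit fields can be inspected directly by reinterpreting a value’s raw bits; a small sketch using Python’s standard struct module:

```python
import struct

def float32_fields(x):
    """Split a value into its IEEE 754 single-precision bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF         # the 23 explicitly stored mantissa bits
    return sign, exponent, mantissa

# 1.0 is +1.0 x 2^0: sign 0, biased exponent 127, empty stored mantissa
print(float32_fields(1.0))    # (0, 127, 0)
# -0.5 is -1.0 x 2^-1: sign 1, biased exponent 126
print(float32_fields(-0.5))   # (1, 126, 0)
```

The stored exponent is biased by 127, so a stored 127 means 2^0; the 24th mantissa bit is the implicit leading 1 and is never stored, which is why only 23 bits appear in the word.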

Since the mantissa consists of fixed-point math, a theoretical signal-to-noise ratio of roughly 144 dB for 32-bit/318 dB for 64-bit floating point is defined on a per-sample basis. However, this does not set a defined, constant signal-to-noise ratio for the entire system: the mantissa is permanently scaled by the exponent (256 possible values for 32 bits/2,048 possible values for 64 bits), which is what gives the floating-point architecture its undefined dynamic range. What it means instead is that for any given sample value, however loud or soft it may be (as scaled by the exponent), there is a defined signal-to-noise ratio.

For example, a sample value of 0 dBFS in the fixed-point environment would imply a noise floor from quantization error of -144 dBFS in a 32-bit floating-point architecture and -318 dBFS in a 64-bit floating-point architecture. However, since the mantissa and its fixed-point’s inherent noise floor are constantly scaled by the exponent, a sample value of -6 dBFS in the fixed-point environment would have a noise floor from quantization error of -150 dBFS and -324 dBFS in the 32-bit and 64-bit floating-point environments respectively.
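The worked example above can be checked numerically. The sketch below uses NumPy’s spacing() to measure the quantization step around a given sample value, taking half a step as the worst-case rounding error and 1.0 as the 0 dBFS reference (both conventions are assumptions made here for illustration):

```python
import numpy as np

def quantization_floor_dbfs(level_dbfs, dtype=np.float32):
    """dBFS level of the worst-case rounding error (half a quantization
    step) for a sample at the given level, with 0 dBFS taken as 1.0."""
    amplitude = 10 ** (level_dbfs / 20)
    step = float(np.spacing(np.asarray(amplitude, dtype=dtype)))
    return 20 * np.log10(step / 2)

print(quantization_floor_dbfs(0.0))    # ~ -144.5 dBFS
print(quantization_floor_dbfs(-6.0))   # ~ -150.5 dBFS
```

(The article rounds these to −144 and −150.) Moving the signal down 6 dB moves the quantization floor down by the same amount: this is the exponent scaling of the mantissa described above.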

In other words, as the noise floor of the mantissa is scaled by the exponent along with the signal, so is the architecture’s defined signal-to-noise ratio. In conclusion, the noise floor derived from quantization error is completely irrelevant, even in the worst-case scenario of the highest possible sample value of 0 dBFS in the fixed-point sense. Some claim that this “ever-evolving, modulated noise floor” inside a 32-bit floating-point architecture is audible and undesirable, and therefore express an open preference for 64-bit floating point, where the maximum possible level of noise can only be “as high as” −318 dBFS. This statement is, frankly, ridiculous: the supposedly perceivable, “audibly louder” noise floor of a 32-bit floating-point signal (−144 dBFS at its worst) lies well below the noise floor inherent to even the best analogue-to-digital converter (Johnson–Nyquist noise) and will therefore be masked by it.

Today, however, we are not merely digitizing analogue sources of sound; we are working entirely in the digital domain from beginning to end. Recording, mixing and mastering can all be undertaken in any Digital Audio Workstation by means of internal digital processing (plug-ins) as well as internal digital summing inside the DAW. The hidden risk here is the accumulation of rounding errors, inherent to any discrete system however accurate, and most especially in the 32-bit floating-point environment.

In a floating-point environment, gain manipulation of the signal is an essentially lossless process, since sample values are rescaled within the boundaries of the exponent rather than permanently truncated. For example, if a signal of any given amplitude is reduced by some number of decibels, e.g. 60 dB, and then raised 60 dB back again, the result is, for all practical purposes, a perfect reconstruction of the original signal. In a fixed-point architecture, by contrast, the same procedure would cause the original signal to lose a great deal of accuracy while reduced by 60 dB; this now-degraded signal would then simply be raised 60 dB along with its acquired noise floor, so no exact reconstruction is possible and the manipulation turns out to be destructive. This can be considered the principal advantage of working inside a floating-point architecture: the signal’s relative accuracy and signal-to-noise ratio are maintained throughout the entire dynamic range and preserved across all operations.
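The contrast can be sketched directly. The example below uses a power-of-two gain of 2^10 (≈ 60.2 dB) so that the floating-point leg is bit-exact, since a power-of-two factor changes only the exponent; for arbitrary gain factors the round trip is still accurate to the mantissa’s full precision, just not bit-identical. The 16-bit fixed-point path is a simplified model, assumed here for illustration:

```python
def to_fixed16(x):
    """Quantize a [-1.0, 1.0) sample to 16-bit fixed point."""
    return max(-32768, min(32767, round(x * 32768)))

def fixed_gain_roundtrip(x, shift=10):
    """Drop ~60 dB in 16-bit fixed point, then raise it back: the low
    `shift` bits are discarded on the way down and never come back."""
    attenuated = round(to_fixed16(x) / 2**shift)
    return (attenuated * 2**shift) / 32768

def float_gain_roundtrip(x, shift=10):
    """The same gain ride in floating point: a power-of-two factor only
    changes the exponent, so the round trip is bit-exact."""
    return (x * 2.0**-shift) * 2.0**shift

x = 0.123456789
assert float_gain_roundtrip(x) == x    # perfect reconstruction
print(fixed_gain_roundtrip(x))         # 0.125: most of the detail is gone
```

The fixed-point signal comes back as a coarse 0.125 instead of 0.123456789, while the floating-point signal returns identically.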

A larger 64-bit floating-point architecture is nevertheless more desirable, though solely for the purposes of internal processing inside audio applications or plug-ins, in order to further reduce the cumulative error in commonly used recursive algorithms, such as IIR filters, where small quantization errors can add up over the many iterations inside the code and raise the noise floor.

While this phenomenon does also occur in a 64-bit or 80-bit floating-point recursive algorithm, the error there is so small that it no longer represents a problem. Therefore, employing software applications and plug-ins that process at 64-bit or higher accuracy (as deemed suitable by the developer) is, ultimately, the only vital aspect of working with “double-floats” as opposed to plain “floats”.
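The build-up can be demonstrated with the degenerate recursive algorithm y[n] = y[n-1] + x[n], a bare accumulator; this is a simplified stand-in for a real IIR filter, assumed here for illustration, but it exposes the same mechanism of cumulative rounding error:

```python
import numpy as np

def accumulate(step, n, dtype):
    """Run y[n] = y[n-1] + x[n] at a fixed precision: the simplest
    recursive algorithm, accumulating one rounding error per iteration."""
    total = dtype(0.0)
    inc = dtype(step)
    for _ in range(n):
        total = total + inc
    return float(total)

n = 100_000
err32 = abs(accumulate(0.1, n, np.float32) - n * 0.1)
err64 = abs(accumulate(0.1, n, np.float64) - n * 0.1)
# err32 ends up many orders of magnitude larger than err64
```

After a hundred thousand iterations the single-precision total is visibly off, while the double-precision total remains accurate far below anything that could matter audibly.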

Manipulating audio files and clips (freezing, gluing, exporting, rendering, mixing down, etc.) at 64 bits, while doing no harm to the sound quality, yields no conceivable benefit to the recording, mixing or mastering process. It does, however, imply an unnecessary extra expense of CPU usage for the larger calculations and of storage for the larger audio files.

It should therefore be avoided by the operator, who would do better to let the DAW and plug-ins do their internal processing at the resolution chosen (hopefully wisely) by the developer, most likely 64 or even 80 bits, and to keep audio clips as 32-bit floating-point files. Because the floating-point architecture is effectively lossless for these conversions, going back and forth between 32, 64 and 80 bits, as when a 64-bit audio stream from the DAW passes into a 32-bit floating-point plug-in and back onto the 64-bit DAW buss, results in no audible degradation of the program material whatsoever and needs no further action from the operator (i.e. no dithering).
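One direction of that round trip is trivially verifiable: every single-precision value is exactly representable in double precision, so widening to 64 bits and narrowing back restores the identical bits. A sketch using only the standard library:

```python
import struct

def as_float32(x):
    """Round a value to the nearest IEEE 754 single-precision number."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# A sample stored at 32 bits, widened to 64 bits for processing headroom...
sample32 = as_float32(0.1234567)
widened = float(sample32)          # every float32 value is exact in float64
# ...and written back to 32 bits: bit-identical, no dithering required
assert as_float32(widened) == sample32
```

The same holds for any 32-bit sample value, positive or negative, loud or soft.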

So why would a DAW offer the user the ability to handle and produce 64-bit files? Maybe just for malign marketing purposes, as the accuracy of 32-bit floating point and of 64-bit floating point cannot be distinguished by the human ear as far as audio summing goes.

“The only thing to remember is that floating-point can present any and all values at the same accuracy whereas fixed-point loses accuracy as the values become smaller and eventually end up with no information at all when values become too small”

(Paul Frindle)

Part3 (Last Part), Coming Soon.

EZEQUIEL MORFI | TITANIO – morfi@titanioisart.com