32-bit Single-Precision Floating Point in Details

Modern programming languages tend to be as high-level as possible to make the programmer’s life a little easier. However, no matter how advanced a programming language is, its code still has to be converted to machine code, via compilation, interpretation, or a virtual machine such as the JVM. At that level the rules are different: the CPU works with addresses and registers and knows nothing about classes, and even «if» branches become conditional jumps. One of the most important aspects of this execution is arithmetic, and today we will talk about one of its «cornerstones»: floating-point numbers and how they may affect your code.

A brief introduction to the history

The need to process very large or very small values has been present since the earliest days of computing: even some early designs for Charles Babbage’s Analytical Engine included floating-point arithmetic alongside the usual integer arithmetic. For a long time, the floating-point format was used primarily in scientific research, especially in physics, because of the wide range of magnitudes in the data. It is extremely convenient that the distance between the Earth and the Sun can be expressed in the same number of bits, and with the same relative precision, as the distance between the hydrogen and oxygen atoms in a water molecule. Even better, values of different magnitudes can be freely multiplied without a large loss of precision.

Almost all early implementations of floating-point arithmetic were done in software because hardware implementations were so complex. Without a common standard, everybody had to come up with their own format: this is how the Microsoft Binary Format and the IBM Floating Point Architecture were born. The latter is still used in some fields such as weather forecasting, although it is extremely rare by now.

The Intel 8087 coprocessor, announced in 1980, also used its own format, called «x87». It was the first coprocessor specifically dedicated to floating-point arithmetic, with the aim of replacing slow library calls with machine instructions. Then, based on the x87 format, IEEE 754 was born as the first successful attempt to create a universal standard for floating-point calculations. Soon, Intel started to integrate IEEE 754 into its CPUs, and nowadays almost every system, except for some embedded ones, supports the IEEE 754 floating-point format.

Theory and experiments

In the IEEE 754 single-precision binary floating-point format, 32 bits are split into a 1-bit sign flag, an 8-bit exponent field, and a 23-bit fraction part, in that order (the sign bit is the leftmost bit). This information should be enough for us to start some experiments! Let us see what the number 1.0 looks like in this format using this simple C code:

union { float in; unsigned out; } converter;
converter.in = float_number;
unsigned bits = converter.out;

Of course, after getting the bits variable, we only need to print it. For instance, this way:

1.0 | 1 | S: 0 E: 01111111 M: 00000000000000000000000

Common sense says that 1 can be expressed in binary floating-point form as 1.0 * 2^0, so the exponent is 0 and the significand is 1, while in IEEE 754 the exponent field is 01111111 (127 in decimal) and the significand field is 0.

The mystery behind the exponent is simple: the stored exponent is shifted (biased) by 127. An exponent of 0 is stored as 127, an exponent of 1 is stored as 128, and so on. The maximum exponent would therefore be 255 - 127 = 128 and the minimum 0 - 127 = -127. However, the stored values 255 and 0 are reserved, so the actual range is -126…127. We will talk about those reserved values later.

The significand is even simpler to explain. A binary significand has one unique property: in normalized form, every significand except zero starts with 1 (this is only true for binary numbers). Conversely, if a number starts with 0, it is not normalized. For instance, the non-normalized 0.000101 * 10^101 is the same as the normalized 1.01 * 10^1 (all digits here are binary). Because of that, there is no need to store the initial 1 of a normalized number: we can just keep it in mind, saving space for one more significant bit. In our case, the actual significand is a 1 followed by 23 zeroes, but because the 1 is skipped, only the 23 zeroes are stored.
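The two rules just described, the biased exponent and the implicit leading 1, can be combined into a single formula for normalized numbers. The symbols are chosen here for illustration only: s is the sign bit, e is the stored 8-bit exponent, and m is the 23-bit fraction read as an unsigned integer. The second line applies the formula to the bit pattern of 1.0 shown earlier.

```latex
\[
  x = (-1)^{s} \times \left(1 + \frac{m}{2^{23}}\right) \times 2^{\,e - 127},
  \qquad 1 \le e \le 254
\]
\[
  s = 0,\quad e = 127,\quad m = 0
  \;\Longrightarrow\;
  x = (+1) \times (1 + 0) \times 2^{0} = 1.0
\]
```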

Let us try some different numbers in comparison with 1.

1.0 | 1 | S: 0 E: 01111111 M: 00000000000000000000000

-1.0 | -1 | S: 1 E: 01111111 M: 00000000000000000000000

2.0 | 2 | S: 0 E: 10000000 M: 00000000000000000000000

4.0 | 4 | S: 0 E: 10000001 M: 00000000000000000000000

1 / 8 | 0.125 | S: 0 E: 01111100 M: 00000000000000000000000

As we can see, a negative sign just inverts the sign flag without touching the rest (this seems obvious, but it is not always the case in computer science: for integers, negation is much more complex than flipping one bit!). Changing the exponent by trying different powers of two works as expected.

1.0 | 1 | S: 0 E: 01111111 M: 00000000000000000000000

3.0 | 3 | S: 0 E: 10000000 M: 10000000000000000000000

Special numbers

Remember that zero can never be written in normalized form because its binary representation does not contain any 1s? Zero is a special number.

0 | 0 | S: 0 E: 00000000 M: 00000000000000000000000

-0 | -0 | S: 1 E: 00000000 M: 00000000000000000000000

For zero, IEEE 754 uses an exponent value of 0 and a significand value of 0. In addition, as you can see, there are actually two zero values: +0 and -0. In a comparison, (0.0f == -0.0f) is actually true; the sign simply does not count. +0 and -0 loosely correspond to the mathematical concept of positive and negative infinitesimals.

Are there any other special numbers with an exponent value of 0? Yes. They are called «denormalized numbers». They can represent extremely small values, smaller than the minimum normalized number (which is 1 * 2^-126). Examples:

2^-126 | 1.17549e-38 | S: 0 E: 00000001 M: 00000000000000000000000

2^-127 | 5.87747e-39 | S: 0 E: 00000000 M: 10000000000000000000000

2^-128 | 2.93874e-39 | S: 0 E: 00000000 M: 01000000000000000000000

2^-149 | 1.4013e-45 | S: 0 E: 00000000 M: 00000000000000000000001

2^-150 | 0 | S: 0 E: 00000000 M: 00000000000000000000000

A denormalized number is treated as if its exponent field were 1 (that is, a factor of 2^-126), but it does not have the implicit leading 1. The consequence is that denormalized numbers quickly lose precision: to store numbers between 2^-128 and 2^-127, we are using only 21 digits of information instead of 23.

Conclusions

What can we learn from all the facts and experiments above? In any language operating with the floating-point data type, beware of the following:

– You should almost never compare two floating-point numbers directly unless you know what you are doing! A better way is to compare them with some tolerance.

if (a == b) – wrong!

if (fabsf(a - b) < epsilon) – correct!

– Floating-point numbers lose precision even when you are working with such seemingly harmless numbers as 0.2 or 71.3, because these have no exact binary representation. You should be extra careful when performing a large number of floating-point operations on the same data: errors may build up rather quickly. If you are getting unexpected results and you suspect rounding errors, try a different approach that minimizes them.

– In the world of floating-point arithmetic, multiplication is not associative: a * (b * c) is not always equal to (a * b) * c.

– Additional measures should be taken if you are working with extremely large values or with values extremely close to zero: in case of overflow or underflow those values will be transformed into +Infinity, -Infinity or 0. Numeric limits for single-precision floating-point numbers are approximately 1.175494e-38 to 3.402823e+38 (1.4013e-45 to 3.402823e+38 if we also count denormalized numbers).

– Beware if your system generates «quiet NaN» values. Sometimes this helps your application avoid a crash. Sometimes it corrupts the results of the program beyond recognition.

Nowadays, floating-point operations are extremely fast, with speed comparable to that of ordinary integer arithmetic: the number of floating-point operations per second, or FLOPS, is perhaps the most well-known measure of computer performance. The only downside is that the programmer should be aware of all the pitfalls regarding precision and special floating-point values.

About the Author

ByteScout Team of Writers

ByteScout has a team of professional writers proficient in different technical topics. We select the best writers to cover interesting and trending topics for our readers. We love developers and we hope our articles help you learn about programming and programmers.

32-bit Single-Precision Floating Point in Details - ByteScout (2024)

FAQs

What is 32-bit single precision floating point?

Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

What is the precision of float 32?

Numeric limits and precision

Floating Point Bitdepth	Largest value	Decimal digits of precision
32-bit Float	3.4028237 × 10³⁸	7.22
16-bit Float	6.55 × 10⁴	3.31
14-bit Float	6.55 × 10⁴	3.01
11-bit Float	6.50 × 10⁴	2.1

What is the floating-point range of 32-bit IEEE 754?

IEEE 754-1985

Level	Width	Range at full precision
Single precision	32 bits	±1.18×10⁻³⁸ to ±3.4×10³⁸
Double precision	64 bits	±2.23×10⁻³⁰⁸ to ±1.80×10³⁰⁸

How many bytes is a single precision floating point number?

Floating-point numbers use the IEEE (Institute of Electrical and Electronics Engineers) format. Single-precision values with float type have 4 bytes, consisting of a sign bit, an 8-bit excess-127 binary exponent, and a 23-bit mantissa.

Is 32-bit float worth it?

For ultra-high-dynamic-range recording, 32-bit float is an ideal recording format. The primary benefit of these files is their ability to record signals exceeding 0 dBFS. There is in fact so much headroom that from a fidelity standpoint, it doesn't matter where gains are set while recording.

How many digits is a 32-bit float?

32-bit single precision, with an approximate absolute normalized range of 0 and 10⁻³⁸ to 10³⁸ and with a precision of about 7 decimal digits.

How accurate is 32-bit floating-point?

The binary format of a 32-bit single-precision float variable is s-eeeeeeee-fffffffffffffffffffffff, where s=sign, e=exponent, and f=fractional part (mantissa). A single-precision float only has about 7 decimal digits of precision (actually the log base 10 of 2²³, or about 6.92 digits of precision).

How do you calculate single precision floating point?

In the single precision floating point representation of numbers according to the IEEE 754 standard, we use 24 bits for the mantissa part (23 bits + 1 implied bit). So the precision can be calculated from 2^24 = 10^x, where x is found by taking the logarithm of both sides: 24 log 2 = x log 10, so x ≈ 7.2, or about 7 decimal digits.

How many bytes is a float?

The length of a float is 32 bits, or 4 bytes. Floats are encoded using the IEEE standard for normalized single-precision floating-point numbers.

What does the IEEE 754 stand for?

The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE).

How accurate is float32?

It looks like float32 has a resolution of 1e-6 and the absolute value is valid down to as small as 1.2e-38. The relative error is on the order of 1e-8 for values above 1e-38, lower than the 1e-6 proposed by np.finfo, and the error is still acceptable even if the value is lower than the tiny value of np.

What is 32-bit binary representation?

Using 32 bits, we can represent positive integers from 0 up to 2³² minus 1. In terms of base-10 numbers, that means from 0 to 4,294,967,295.

What is the range of f32?

Primitive Type f32. A 32-bit floating point type (specifically, the “binary32” type defined in IEEE 754-2008). This type can represent a wide range of decimal numbers, like 3.5 , 27 , -113.75 , 0.0078125 , 34359738368 , 0 , -1 .

What is a single precision floating point?

What Is Single-Precision Floating-Point Format? Single-precision floating-point format uses 32 bits of computer memory and can represent a wide range of numerical values. Often referred to as FP32, this format is best used for calculations that won't suffer from a bit of approximation.

What is 32-bit per channel floating point?

A 32-bit floating point image can represent 4.3 billion values per channel, and requires roughly twice the disk space as a 16-bit image. Few programs support 32-bit images.

What is the difference between 24-bit and 32-bit WAV?

The main difference between 24-bit and 32-bit digital audio is the level of precision or resolution in the audio data. A 24-bit audio sample can represent up to 16.7 million levels of amplitude, while a 32-bit audio sample can represent over 4.2 billion levels of amplitude.
