Floating Point Types in C

After integer numeric types, let's now see how the C language handles real numbers.

In particular, we will see how the C language represents real numbers and which types it provides to manipulate them: float, double and long double.

Floating Point Types

Integer types are not suitable for all applications. Often, the need arises to use variables capable of representing and manipulating values that have a fractional part and, in general, real numbers.

In particular, the need might arise to represent very small or very large real numbers, such as the distances between stars or atomic dimensions.

For this reason, the C language, like the vast majority of languages, allows representing such values using the floating point format.

This type of representation takes its name from the fact that the position of the point is not fixed: it can move ("float") within the number according to an exponent, which allows the representation of both very large and very small values.

How floating point arithmetic works in depth is a complex topic that belongs to the field of numerical analysis. We refer to the Numerical Analysis course for further study on this topic.

Here we limit ourselves to a general overview of how floating point numbers work in the C language.

The C language provides three floating point types:

  • float: represents a single precision floating point number;
  • double: represents a double precision floating point number;
  • long double: represents an extended precision floating point number.

Typically, the float type is suitable for cases where the required precision is not critical, for example when one or two decimal digits are more than sufficient in calculations.

The double type, on the other hand, is the most used as it offers greater precision than the float type. In general, double is the type used by default when dealing with floating point numbers. Its disadvantages are that it typically occupies twice the memory of a float and that, on some hardware, calculations can be slower.

Finally, the long double type is the one that offers the greatest precision but is rarely used for reasons we will see shortly.

The C standard does not specify the exact precision of each floating point type; it only sets minimum requirements. This is because different processors might store floating point numbers differently.

Nowadays, however, most processors follow the IEEE 754 standard which was introduced to address this problem. It has now become very rare to find processors that do not follow this standard. Processors with x86 or ARM architecture, for example, all follow this standard.

IEEE 754 Standard

The IEEE 754 standard was developed by IEEE (Institute of Electrical and Electronics Engineers) to define a standard format for the representation of floating point numbers.

This standard defines two primary formats for number representation: single precision which uses 32 bits and double precision which uses 64 bits.

Numbers that comply with the IEEE 754 standard are represented in normalized scientific notation in base 2. This means that the number is divided into three parts:

  • Sign s: one bit that represents the sign of the number. 0 for positive numbers, 1 for negative numbers.
  • Exponent e: a series of bits that represents the exponent of the number.
  • Mantissa M: a series of bits that represents the significant digits of the number, stored as its fractional part.

Therefore, informally, a floating point number has a form like this:

(-1)^{s} \cdot M \cdot 2^{e}

The number of bits reserved for the exponent determines how large or small the number can be, while the mantissa bits determine, approximately, the precision.

In the case of single precision numbers, float, 8 bits are used for the exponent while 23 bits are used for the mantissa. The remaining bit is used for the sign.

For this reason, a single precision number can assume a maximum value of approximately:

2^{128} \approx 3.4 \times 10^{38}

and has a precision of approximately 6 decimal digits.

The standard also describes two other formats: single extended precision and double extended precision. However, it does not fix their exact size in bits, only minimum requirements, and their implementation is left to the discretion of the processor manufacturer.

For this reason, in practice the float and double types are the most used as they guarantee a standardized and portable representation across various processors.

double and float in C

As stated above, almost all C compilers and processors follow the IEEE 754 standard for floating point. On these platforms, the float type corresponds to a 32-bit single precision number while the double type corresponds to a 64-bit double precision number.

The following table shows the minimum, maximum values and characteristic precision of these two types:

| Type   | Minimum Positive Value   | Maximum Positive Value  | Precision         |
|--------|--------------------------|-------------------------|-------------------|
| float  | 1.17549 \times 10^{-38}  | 3.40282 \times 10^{38}  | 6 decimal digits  |
| double | 2.22507 \times 10^{-308} | 1.79769 \times 10^{308} | 15 decimal digits |

Table 1: Characteristics of C floating point types

We have omitted the long double type from the table as its size can vary from processor to processor.

Obviously, the table does not apply to processors that do not follow the IEEE 754 standard. In that case, the values might be different.

To recap:

Definition

Floating Point Types in C

The C language provides three floating point types:

  • float: represents a single precision floating point number;
  • double: represents a double precision floating point number;
  • long double: represents an extended precision floating point number.

If the compiler (and the processor) follow the IEEE 754 standard then the sizes of float and double are known and correspond to 32 and 64 bits respectively.

This does not apply to long double.

In Summary

In this lesson we studied that:

  • To represent real numbers in C, the floating point format is used;
  • The C language provides three floating point types: float, double and long double;
  • The float and double types are the most used as they follow the IEEE 754 standard now adopted by the vast majority of processors;
  • The float type is suitable for cases where precision is not critical, while the double type is the most used;
  • The long double type offers the greatest precision but is rarely used as its size can vary from processor to processor.

In the next lesson we will study how to write floating point literal constants in our programs.