Text Files and Binary Files in C

Key Takeaways

The C standard library supports two types of files: text files and binary files.
Text files are formed by sequences of readable characters, while binary files contain data in raw format.
A text file is divided into lines and might contain an end-of-file character.
Binary files are not divided into lines nor do they have a character that indicates the end of file.
In general, if no assumptions can be made about the type of file that a program will have to manipulate, it is always better to treat files as binary files.

Text Files and Binary Files

The standard library of the C language supports two types of files: text files and binary files.

In a text file, the bytes that make up the file represent characters, so a text file can be read by a human being through a text editor. The source code of a C program, for example, is contained in a file with the extension .c and is, to all intents and purposes, a text file.

In a binary file, on the other hand, the bytes can represent any data, from simple types like integers or floating-point numbers to complex data like structures and arrays. Binary files cannot be read meaningfully in a text editor, as their content is generally not printable. For example, an executable file is a binary file, as it contains the machine instructions to be executed by a processor.

Text files possess two fundamental characteristics:

Text files are divided into lines.

Each line of a text file normally ends with a sequence of characters that marks the end of the line. This sequence can vary depending on the operating system: for example, Unix/Linux systems use the newline character (\n), while Windows systems use the sequence \r\n (that is, carriage return plus newline).

Therefore, when writing or reading a text file in C, one must keep this difference in mind and handle the end-of-line characters correctly.

Text files can contain a special end-of-file (EOF) character.

Some operating systems require or allow a particular ASCII character to be used as an end-of-file indicator. For example, Windows systems use the character 0x1A, which corresponds to CTRL+Z (at the console it is entered by pressing CTRL+Z followed by the Enter key). Under Windows there is no obligation to insert the end-of-file character at the end of the file but, if it is present and the file is read in text mode, all bytes found after it are ignored. However, those bytes still count toward the size of the file.

CTRL+Z is a legacy of MS-DOS, which in turn borrowed this convention from an even older operating system: CP/M.

UNIX systems, on the other hand, as well as Linux, do not have a true end-of-file character. There is an end-of-file (EOF) condition, but it is a logical concept rather than a physical character stored in the file. It is signaled, for example, when reading from a device such as a terminal rather than from a file, so it is not necessary to include an end-of-file character in text files. In C, this condition is reported through the EOF macro, a negative int value returned by functions such as fgetc.
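The line-ending difference described above can be handled with a small helper. The following is only a sketch: the name strip_line_ending is hypothetical, and it assumes the line has already been read into a NUL-terminated buffer (for example with fgets). It removes one trailing \n or \r\n, so the same code copes with both conventions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes one trailing "\n" or "\r\n" from a
   NUL-terminated line, in place, and returns the new length. */
size_t strip_line_ending(char *line)
{
    assert(line != NULL);
    size_t len = strlen(line);

    if (len > 0 && line[len - 1] == '\n') {
        line[--len] = '\0';              /* drop the newline */
        if (len > 0 && line[len - 1] == '\r') {
            line[--len] = '\0';          /* drop the carriage return too */
        }
    }
    return len;
}
```

A helper like this is mostly useful when a file written on Windows is read on a Unix-like system, or when the file is opened in binary mode, since in those cases the \r is not removed by the library.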

Binary files are not divided into lines nor do they have a character that indicates the end of file. In them, all bytes are treated as raw data.
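To illustrate what "raw data" means in practice, here is a minimal sketch that counts the bytes of a file opened in binary mode. The function name count_bytes is hypothetical. In this mode, bytes such as 0x1A, \r and \n receive no special treatment, and the end of the file is detected through the value returned by fgetc, not through a byte stored in the file.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts the bytes of a file opened in binary mode.
   Returns -1 if the file cannot be opened. */
long count_bytes(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "rb");        /* "b" requests binary mode */
    if (fp == NULL) {
        return -1;
    }

    long count = 0;
    int ch;                              /* int, not char: must hold any byte value and EOF */
    while ((ch = fgetc(fp)) != EOF) {
        count++;                         /* 0x1A, '\r' and '\n' are counted like any other byte */
    }

    fclose(fp);
    return count;
}
```

Storing the result of fgetc in an int rather than a char matters here: with a char, a data byte such as 0xFF could be mistaken for EOF.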

When writing data to a file, one must choose whether to save such data in text format or in binary format. To understand the difference, let us suppose we want to save a number: 8192.

If we save it as textual data, a plausible choice might be to use ASCII encoding. In this case, the number 8192 would be converted into its textual representation, which is the string "8192". This string would be stored as:

+----------+----------+----------+----------+
| 00111000 | 00110001 | 00111001 | 00110010 |
+----------+----------+----------+----------+
    '8'         '1'        '9'        '2'

Our number would therefore occupy four bytes.

Conversely, if we choose to save it in binary, the number 8192 would first be converted into a sequence of 16 bits:

0010000000000000

After that, assuming we store the number in little-endian format, the bytes would be stored in the following order:

00000000 | 00100000

Therefore, in binary format, the number 8192 would occupy two bytes, half the size of its textual representation.
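The comparison above can be reproduced with a short sketch. The function names write_as_text and write_as_binary are hypothetical, and the code assumes the value fits in a uint16_t. Note that fwrite stores the bytes in the byte order of the machine running the program, so the little-endian layout shown above is obtained only on a little-endian host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Writes value as decimal ASCII digits.
   Returns the number of bytes written, or -1 on error. */
long write_as_text(const char *path, uint16_t value)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        return -1;
    }
    int n = fprintf(fp, "%u", (unsigned)value);
    if (fclose(fp) != 0 || n < 0) {
        return -1;
    }
    return n;
}

/* Writes the in-memory representation of value (host byte order).
   Returns the number of bytes written, or -1 on error. */
long write_as_binary(const char *path, uint16_t value)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL) {
        return -1;
    }
    size_t n = fwrite(&value, sizeof value, 1, fp);
    if (fclose(fp) != 0 || n != 1) {
        return -1;
    }
    return (long)sizeof value;
}
```

Calling write_as_text with 8192 produces a four-byte file containing the digits '8', '1', '9', '2', while write_as_binary produces a two-byte file.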

When writing a C program that writes and reads files, one must therefore take into account whether one is working with text files or binary files, since the methods of access and data management are different. A program that displays the content of a file on screen, for example, will probably deal with text files. A program that copies files, however, cannot assume that they are text files and must treat them as generic binary files. After all, a text file is still a binary file. If such a program assumed that it was handling text files, it would probably stop at a possible end-of-file character, and any bytes found after it would be lost in the copy.
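As an illustration of the file-copy case, the following is a minimal sketch of a copy routine that opens both files in binary mode. The name copy_file and the 4096-byte buffer size are arbitrary choices made for this example, not something prescribed by the standard library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copies src to dst byte for byte.
   Returns 0 on success, -1 on failure. */
int copy_file(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    FILE *in = fopen(src, "rb");         /* binary mode: no translation on read */
    if (in == NULL) {
        return -1;
    }
    FILE *out = fopen(dst, "wb");        /* binary mode: no translation on write */
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    unsigned char buf[4096];
    size_t n;
    int status = 0;

    /* Read a block, write exactly as many bytes as were read. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            status = -1;
            break;
        }
    }
    if (ferror(in)) {
        status = -1;                     /* the loop ended on a read error, not end of file */
    }

    fclose(in);
    if (fclose(out) != 0) {
        status = -1;                     /* flushing buffered data failed */
    }
    return status;
}
```

Because both streams are opened in binary mode, a 0x1A byte or a \r\n sequence in the source is copied unchanged to the destination.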

In general, if no assumptions can be made about the type of file that a program will have to manipulate, it is always better to treat files as binary files.