Files and Buffering in C

Key Takeaways

Buffering is a technique for improving the performance of input/output (I/O) operations in C.
The fflush() function forces a flush of the output buffer, writing any data still held in the buffer to the associated file.
The setvbuf() function lets you choose the buffering mode (full, line, or none) as well as the buffer and the buffer size used for I/O on a stream.
The setbuf() function is an older, simpler version of setvbuf(): it installs a custom buffer for a stream but cannot select its size or buffering mode.

Input/Output and the concept of Buffering

In previous lessons we saw that Input/Output (I/O) in C is built on streams, which represent flows of data in input or output. A stream can be associated with an I/O device such as the keyboard, the screen, a file on disk, or a socket (a network connection).

One of the main problems in I/O is the difference in speed between the processor, RAM, and I/O devices. For example, reading data from a file on disk is much slower than accessing data in RAM, and writing data to disk can take even longer.

As a consequence, it is inefficient for a program to access the device every time it needs to read or write a small amount of data.

To overcome this problem and obtain acceptable performance, operating systems and the C standard library use a mechanism called buffering. Buffering consists of reserving a portion of RAM, called the buffer, which temporarily stores data read from or written to a file.

When the buffer is full or the file is closed, the data stored in the buffer is actually written to the file. This operation is called flushing the buffer (in everyday English, to flush also means to flush the toilet 😄).

The read buffer (the input buffer) works in a similar way. When a program reads from a file, a block larger than the amount requested is read from the file and stored in the input buffer. Subsequent requests are then served directly from the buffer, which is much faster than going back to the file on disk. When the input buffer is empty, another block is read from the file and stored in the buffer.

Buffering significantly improves I/O performance, because reading or writing a single byte in memory takes two to three orders of magnitude less time than accessing the file on disk directly. Transferring a large block between disk and memory is of course still slower than a simple memory access, but it is far more efficient than transferring the same data one byte at a time.

The functions of the C standard library declared in the header <stdio.h> perform buffering transparently. In most cases, the programmer does not have to worry about how buffering works internally.
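As a minimal sketch of this transparency, the function below reads a file one character at a time with fgetc(). The name count_lines is a hypothetical helper introduced only for this illustration. The code never mentions a buffer, yet on a typical implementation most fgetc() calls are served from the stream's input buffer rather than from the file itself.

```c
#include <stdio.h>

/* Hypothetical helper: counts the newline characters in the file at `path`.
   Returns the count, or -1 if the file cannot be opened or a read fails. */
long count_lines(const char *path) {
    FILE *file = fopen(path, "r");
    if (file == NULL) {
        return -1;
    }

    long lines = 0;
    int c;
    /* Each fgetc() call normally takes its byte from the stream's buffer;
       the library refills the buffer from the file only when it runs out. */
    while ((c = fgetc(file)) != EOF) {
        if (c == '\n') {
            lines++;
        }
    }

    int failed = ferror(file); /* distinguish a read error from end of file */
    fclose(file);
    return failed ? -1 : lines;
}
```

Calling count_lines("example.txt") from main() would return the number of lines in that file; the file name is only an example.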

There are, however, situations in which the programmer wants explicit control over it. For this reason, the C standard library provides three functions to manage buffering:

fflush(): forces a flush of the output buffer, writing any data still held in the buffer to the associated file.
setbuf(): sets a custom buffer for a stream, letting the programmer choose where the buffer is located; its size is fixed at BUFSIZ.
setvbuf(): offers finer control over buffering, letting you specify the buffering mode (full, line, or none), the buffer, and its size.

Let's now look at each of these functions in detail.

The `fflush` function

To understand the behavior of the fflush() function, we must first introduce an important concept regarding buffering and input/output in general:

Definition

In C (as well as in UNIX) Input/Output is synchronous but not synchronized

In C (as well as in UNIX) Input/Output is synchronous. This means that when a program performs an I/O operation, such as reading or writing data to a file, it waits for the operation to complete before executing the instructions that follow.

We have not yet studied the file reading and writing functions, such as fread() and fwrite(), but we can already state that when a program calls one of these functions, the program waits until the read or write operation is completed. The control flow does not resume until the I/O operation is finished.

However, I/O in C is not synchronized. This means that when a write call returns, the data has not necessarily reached the file: it may still be sitting in a buffer, so it might not yet be visible to other processes or threads accessing the same file. This is because the C library and the operating system use buffering to optimize I/O, delaying the actual write until the buffer is full, the stream is flushed, or the file is closed.

In other words, when a program performs a write operation on a file, the data may be stored temporarily in the output buffer rather than written to the file immediately. The actual write is delayed because accessing the device is slow and expensive in terms of execution time.

To make this clearer, consider an analogy. Suppose we want to ship a package from Rome to Milan. Instead of carrying it there ourselves, we entrust it to a courier. The courier may decide not to deliver the package immediately, but to wait until there are more packages for the same area, to optimize the journey and reduce transportation costs. The delivery is delayed, but the package will still arrive at its destination eventually (hopefully). Calling a write function on a file in C is like handing the package to the courier: the data may not be written to the file immediately, but it is entrusted to the library and the operating system (the courier), which will take care of writing it at a later time.

As we have seen, closing a file with fclose() flushes the output buffer, so all buffered data is written to the file. But calling fclose() every time we want to make sure the data has been written is not always practical, especially if we need to keep writing to the file.

For this reason, the C standard library provides the fflush() function, which forces a flush of the output buffer, writing any buffered data to the associated file without closing it.

The syntax of the fflush() function is as follows:

int fflush(FILE *stream);

This function takes a pointer to an object of type FILE, which represents the stream whose output buffer we want to flush. The stream must be an output stream, or an update stream whose most recent operation was not an input operation; otherwise the behavior is undefined. The function returns an integer value: 0 if the flush was successful, or EOF (End Of File) in case of error.

It can be called in two ways:

Passing the pointer to an object of type FILE:
```
fflush(file);
```
In this case, the function performs the flush of the output buffer associated with the specified file.
Passing the value NULL as a parameter:
```
fflush(NULL);
```
In this case, the function performs the flush of all output buffers associated with all files opened by the program.

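The advantage of fflush() is that all data written to the stream so far is handed over to the operating system, where it becomes visible to other processes or threads reading the same file, without having to close the file. Note that fflush() empties only the C library's buffer: the operating system may still keep the data in its own cache for a while before it physically reaches the disk.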
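The effect can be observed with a small experiment, sketched here under a few assumptions: flush_demo and visible_size are hypothetical helper names, and the file is opened with the default full buffering. A second reader that opens the same file typically sees nothing before the call to fflush() and all of the data afterwards.

```c
#include <stdio.h>

/* Hypothetical helper: returns the number of bytes a separate reader
   currently finds in `path`, or -1 if the file cannot be opened. */
static long visible_size(const char *path) {
    FILE *reader = fopen(path, "rb");
    if (reader == NULL) {
        return -1;
    }
    long count = 0;
    while (fgetc(reader) != EOF) {
        count++;
    }
    fclose(reader);
    return count;
}

/* Writes `text` to `path`, flushes without closing, and records the sizes
   seen by a second reader before and after the flush.
   Returns 0 on success, -1 on failure. */
int flush_demo(const char *path, const char *text, long *before, long *after) {
    FILE *file = fopen(path, "w");
    if (file == NULL) {
        return -1;
    }
    if (fputs(text, file) == EOF) {
        fclose(file);
        return -1;
    }
    *before = visible_size(path); /* short text is usually still buffered */

    if (fflush(file) == EOF) {    /* fflush() reports failure with EOF */
        fclose(file);
        return -1;
    }
    *after = visible_size(path);  /* the flushed bytes are now in the file */

    return fclose(file) == 0 ? 0 : -1;
}
```

The value stored in *before depends on the implementation and on the length of the text, so it should be treated as an observation rather than a guarantee; the value stored in *after equals the number of bytes written.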

The `setvbuf` function

The fflush() function is useful for forcing a flush of the output buffer, but in some situations we want finer control over how buffering works. For this reason, the C standard library provides two additional functions: setbuf() and setvbuf().

The setvbuf() function allows you to specify how a stream should be buffered and the size and position of the buffer used for I/O operations on that file. Its syntax is as follows:

int setvbuf(FILE *stream, char *buffer, int mode, size_t size);

One of the most important parameters is mode, which specifies the type of buffering to use. There are three buffering modes:

_IOFBF: full buffering.

When reading, data is read from the file in blocks of up to size bytes and stored in the buffer. When writing, data is sent to the file only when the buffer is full or when a flush is performed (for example with fflush() or fclose()).

This mode is the default for streams that are not associated with interactive devices (such as the keyboard or screen). Therefore, when we open a file on disk, full buffering is the default mode.
_IOLBF: line buffering.

In this mode, suited mainly to text streams, data is handled one line at a time. When writing, the buffer is flushed every time a newline character ('\n') is written, or when the buffer fills up. When reading, data is read from the file until a newline character is encountered or until the buffer is full.
_IONBF: no buffering.

In this mode, no buffer is used: each read or write operation is passed on to the file as soon as possible.
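The difference between the three modes can be made visible with a sketch like the following. The name bytes_visible_before_close is a hypothetical helper, and the buffer size of 1024 is an arbitrary choice. The function writes one short line and then checks, through a second stream, how much of it has already reached the file while the writing stream is still open.

```c
#include <stdio.h>

/* Hypothetical helper: opens `path` for writing with buffering mode `mode`
   (_IOFBF, _IOLBF or _IONBF), writes the line "abc\n", and returns how many
   bytes a second reader finds in the file before the writer is flushed or
   closed. Returns -1 on error. */
long bytes_visible_before_close(const char *path, int mode) {
    FILE *file = fopen(path, "w");
    if (file == NULL) {
        return -1;
    }

    /* NULL buffer: the library supplies the storage itself. */
    if (setvbuf(file, NULL, mode, 1024) != 0) {
        fclose(file);
        return -1;
    }

    if (fputs("abc\n", file) == EOF) {
        fclose(file);
        return -1;
    }

    /* Look at the file through an independent stream. */
    long visible = -1;
    FILE *reader = fopen(path, "rb");
    if (reader != NULL) {
        visible = 0;
        while (fgetc(reader) != EOF) {
            visible++;
        }
        fclose(reader);
    }

    fclose(file); /* flushes anything still buffered */
    return visible;
}
```

On a typical POSIX implementation the result is 0 for _IOFBF, because the line is still in the buffer, and 4 for both _IOLBF and _IONBF. These figures describe common behavior rather than a guarantee of the C standard, and some platforms treat line buffering on files like full buffering.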

The second parameter buffer allows you to specify the address of a memory area that will be used as a buffer for I/O operations. This buffer can be an array of characters allocated automatically, statically or dynamically.

If the buffer is allocated automatically (for example, as a local variable within a function), it is deallocated automatically when the function returns. The stream must therefore be closed before that happens, because the buffer has to remain valid for as long as the stream is in use.

If, instead, the buffer is allocated statically (for example, as a global or static variable), it remains valid for the entire duration of the program. Dynamic allocation (for example, with malloc()) lets you choose the buffer size at run time, but the programmer is responsible for freeing the memory, and may do so only after the stream has been closed.

In general, a large buffer can improve the performance of I/O operations, but requires more memory. The choice of buffer size depends on the specific needs of the application and the available system resources.

Let's see an example of using the setvbuf() function to set a custom buffer for a file opened in write mode:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *file;
    char buffer[1024]; // Buffer of 1 KB

    // Open the file in write mode
    file = fopen("example.dat", "w");
    if (file == NULL) {
        fprintf(stderr, "Error opening file.\n");
        return EXIT_FAILURE;
    }

    // Set the custom buffer with full buffering,
    // before any other operation on the stream
    if (setvbuf(file, buffer, _IOFBF, sizeof(buffer)) != 0) {
        fprintf(stderr, "Error setting buffer.\n");
        fclose(file);
        return EXIT_FAILURE;
    }

    // Write data to file

    // ...

    // Close the file while the buffer is still valid
    fclose(file);
    return EXIT_SUCCESS;
}

In this example, we open a file called example.dat in write mode and create an automatic buffer of 1 KB, that is, one allocated on the stack. We then call setvbuf() to associate this buffer with the open file, specifying full buffering (_IOFBF), so that all write operations on the file use our buffer. Since the buffer is automatic, we do not have to deallocate it manually: it is released when main() returns, which happens only after fclose() has flushed and closed the stream.

When using the setvbuf() function, two important rules must be kept in mind:

Note

setvbuf must be called after opening the file but before any I/O operation

The setvbuf() function must be called immediately after opening the file with fopen(), but before performing any read or write operation on the file. If it is called after having already performed I/O operations, the behavior is undefined and could lead to unexpected results.

Note

The buffer must be deallocated only after closing the file

If using a dynamically allocated buffer (for example, with malloc()), it is important to ensure that the buffer remains valid for the entire duration of stream use. Therefore, the buffer must not be deallocated (for example, with free()) until the file is closed with fclose(). Deallocating the buffer before closing the file can lead to undefined behavior.
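Both rules can be seen together in the following sketch, which uses a buffer obtained with malloc(). The function name write_with_heap_buffer and the text it writes are assumptions made for the example. Note the order of operations: setvbuf() right after fopen(), and free() only after fclose().

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: writes `count` copies of a short line to `path`
   through a heap-allocated stream buffer of `buffer_size` bytes.
   Returns 0 on success, -1 on failure. */
int write_with_heap_buffer(const char *path, size_t buffer_size, int count) {
    char *buffer = malloc(buffer_size);
    if (buffer == NULL) {
        return -1;
    }

    FILE *file = fopen(path, "w");
    if (file == NULL) {
        free(buffer); /* no stream uses the buffer yet */
        return -1;
    }

    int status = 0;

    /* Rule 1: setvbuf() comes before any other operation on the stream. */
    if (setvbuf(file, buffer, _IOFBF, buffer_size) != 0) {
        status = -1;
    }

    for (int i = 0; status == 0 && i < count; i++) {
        if (fputs("line\n", file) == EOF) {
            status = -1;
        }
    }

    /* Rule 2: close first, because fclose() flushes what is left in
       `buffer`; only afterwards is it safe to release the memory. */
    if (fclose(file) != 0) {
        status = -1;
    }
    free(buffer);

    return status;
}
```

A call such as write_with_heap_buffer("example.dat", 4096, 100) would write one hundred lines through a 4 KB buffer; swapping the last two steps would leave the stream pointing at freed memory during the final flush.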

The setvbuf() function can also be invoked with the buffer parameter set to NULL. In this case, the library allocates the buffer itself and may use the size parameter to determine its size. For example, to request line buffering on a file opened in write mode, with a buffer of 1024 bytes, we can write:

setvbuf(file, NULL, _IOLBF, 1024); // Set line buffering

Furthermore, on many implementations the size parameter can also be zero in this case, and the library then chooses an appropriate buffer size on its own. The C standard does not spell out this case, so it is worth checking the return value of a call such as:

// Set full buffering with automatic buffer size
setvbuf(file, NULL, _IOFBF, 0);
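A possible use of a library-allocated buffer is a log file that should be updated one line at a time. The sketch below assumes a hypothetical helper named open_line_buffered_log and passes BUFSIZ as the size, which is only a reasonable default, not a requirement.

```c
#include <stdio.h>

/* Hypothetical helper: opens `path` for appending and requests line
   buffering with a buffer allocated by the library.
   Returns the open stream, or NULL on failure. */
FILE *open_line_buffered_log(const char *path) {
    FILE *log = fopen(path, "a");
    if (log == NULL) {
        return NULL;
    }

    /* NULL buffer: the library provides the storage, so there is nothing
       for the caller to keep alive or free. */
    if (setvbuf(log, NULL, _IOLBF, BUFSIZ) != 0) {
        fclose(log);
        return NULL;
    }

    return log;
}
```

The caller writes complete lines with functions such as fputs() or fprintf() and closes the stream with fclose() when done. No separate cleanup of the buffer is needed, which is the main convenience of passing NULL.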

In general, setvbuf() returns 0 if the operation was successful, or a nonzero value in case of error, which happens:

If the mode parameter is not one of the valid values (_IOFBF, _IOLBF, _IONBF);
If the request cannot be satisfied, for example due to insufficient system resources or an invalid buffer size.

The `setbuf` function

The setbuf() function is an older and simpler version of setvbuf(). It sets a custom buffer for a stream, but does not offer the flexibility of setvbuf() in terms of buffering mode and buffer size. The C standard still includes setbuf() and does not formally mark it as deprecated, but setvbuf() is generally the better choice.

The syntax of the setbuf() function is as follows:

void setbuf(FILE *stream, char *buffer);

Also in this case, the stream parameter is a pointer to an object of type FILE, which represents the file on which we want to set the custom buffer. The buffer parameter is a pointer to a memory area that will be used as a buffer for I/O operations on that file.

If the buffer parameter is NULL, the function disables buffering for the specified stream, equivalent to using the _IONBF mode with setvbuf(). In this case, all read or write operations on the file will be performed directly on the file without using any buffer.

If the buffer parameter is not NULL, the function sets the custom buffer for the specified stream, equivalent to using the _IOFBF mode with setvbuf(). In this case, all read or write operations on the file will use the specified buffer. Since setbuf() does not let you specify the buffer size, it assumes the buffer is BUFSIZ bytes long, so the array you pass must be at least that large. BUFSIZ is a macro defined in the header <stdio.h> and represents the default buffer size used by the C standard library for I/O operations.

Therefore, calls to setbuf() are equivalent to the following calls to setvbuf(), except that setbuf() returns no value:

If buffer is NULL:

setvbuf(stream, NULL, _IONBF, 0); // Disable buffering

If buffer is not NULL:

// Set full buffering with buffer of size BUFSIZ
setvbuf(stream, buffer, _IOFBF, BUFSIZ);
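For completeness, here is a minimal sketch of setbuf() with a static buffer of exactly BUFSIZ bytes. The function name write_with_setbuf is a hypothetical helper. Because the buffer is static, it outlives the stream, but for the same reason the function should serve only one stream at a time.

```c
#include <stdio.h>

/* Hypothetical helper: writes `text` to `path` through a static buffer
   installed with setbuf(). Returns 0 on success, -1 on failure. */
int write_with_setbuf(const char *path, const char *text) {
    /* Static storage lasts for the whole program, so the buffer stays
       valid for as long as the stream is open. */
    static char buffer[BUFSIZ];

    FILE *file = fopen(path, "w");
    if (file == NULL) {
        return -1;
    }

    /* setbuf() returns nothing, so a failure cannot be detected here;
       setvbuf() would report it through its return value. */
    setbuf(file, buffer);

    int status = (fputs(text, file) == EOF) ? -1 : 0;
    if (fclose(file) != 0) {
        status = -1;
    }
    return status;
}
```

Replacing the setbuf() line with a checked call to setvbuf(file, buffer, _IOFBF, sizeof buffer) gives the same buffering and adds error detection.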

Hint

For new programs, prefer setvbuf() to setbuf()

Although setbuf() remains part of the C standard library, it gives no control over the buffering mode or the buffer size and cannot report errors. For new programs, it is advisable to use setvbuf() instead, as it offers finer control over buffering.
