Source Files and Compilation Process in C

A program written in C language can be composed of hundreds or thousands of lines of code.

In the case of simple programs, writing all the code in a single source file may be fine. But as complexity increases, it is convenient to divide the program into multiple files.

C compilers allow a program to be divided into multiple source files. To take full advantage of this, however, we need a closer look at the compilation process.

In this lesson we start with how compilation works for a program made of a single file, then examine the process for programs made of multiple source files.

C Language and Source Files

So far in this guide, every C program we have written has lived in a single source file.

Each of those files contained both the main code of the program, inside the main function, and all the auxiliary functions the program used.

This approach is fine for simple programs, but it quickly becomes unmanageable as a program grows.

Complex programs often contain thousands of lines of code spread across dozens or hundreds of functions. Keeping all of that in one source file is impractical.

For this reason, the C language allows dividing a program into multiple source files, each of which contains a part of the program's code.

The approach of dividing a program into multiple source files has a series of advantages:

Modularity: the program is divided into independent modules, each of which performs a specific task. This makes the program easier to understand and maintain.

For example, suppose we want to create a program that reads an input file containing data that must be processed and then save the result in an output file.

We can divide the program into three modules:
- A first module that contains the code to read input data and write output results;
- A second module that contains the code to process the data;
- A third module that contains the main program, which ties the two previous modules together.

Reusability: modules can be reused in other programs.

Returning to the example above, the module that processes the data can be reused in other programs that need to process data in a similar way.

Parallelization: modules can be developed in parallel by multiple programmers.

Definition

A C program can be divided into multiple source files

Any C program can be decomposed into multiple source files, files with extension .c, each of which contains a part of the program's code.

The only constraint is that the definition of a function, that is, its code, must appear in exactly one source file.

To be able to divide a program into multiple source files, it is necessary to solve a series of fundamental problems. Before delving into these problems, however, it is necessary to review the compilation process of a program in C language.

Compilation Process Revisited

In previous lessons we mentioned the compilation process of a C program. To understand how dividing a program into multiple source files works, we need to examine this process in more detail.

Let's start first from the case where our program is composed of a single .c file.

In this case the compilation process consists of three fundamental steps:

Precompilation or Preprocessing:

The preprocessor takes the source file as input and processes the preprocessor directives, such as #include and #define.

The output of this phase is a modified source file in which the directives have been replaced by their results: the contents of included files are inserted and macros are expanded.

Compilation:

The compiler takes the preprocessed source file as input and generates an object file containing the machine code for that source file.

A fundamental detail is that the result of the compilation step is not yet an executable file. The object file contains only the machine code for the source file itself. It lacks the code of the C standard library functions, as well as the code of functions defined in other source files.

For example, if our source file uses the printf function, the object file will not contain the machine code of printf. It will contain only a reference to it.

Linking:

The object file produced by the previous step is passed to another program called the linker.

The linker's job is to resolve the references to functions, and other entities, that the object file uses but does not define.

For each such reference, the linker searches a set of predefined libraries for the definition and links it to the object file.

A typical case is the C standard library. If our source file uses printf, the linker finds the definition of this function in the C standard library and links it to the object file.

The result is, finally, an executable file that contains all the code needed to run the program.

Definition

Compilation Process in C Language

The compilation process of a program in C language consists of three fundamental steps:

Precompilation or Preprocessing: the preprocessor processes the precompilation directives present in the source file and generates a new file.
Compilation: the compiler takes the precompiled source file as input and generates an object file containing the machine code related to the source file.
Linking: the linker takes the object file as input and resolves references to functions and variables not defined in the object file, linking these entities to the object file.

Compilation Example with `gcc`

To see the whole process in action, let's use the open source compiler gcc on Linux to compile a C program made of a single source file.

Suppose we have the following source file, called hello.c:

/* hello.c */
#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

Normally, with gcc, we compile the program with a single command:

$ gcc hello.c -o hello

The gcc command acts as a driver: it automatically runs all three steps of the compilation process, invoking the preprocessor, the compiler and the linker in turn.

However, we can also invoke each of the three steps explicitly, to better understand what happens.

First, let's run the preprocessor on the source file hello.c. The actual preprocessor is called cpp, short for C preprocessor. However, gcc can run only the preprocessor when given the -E option:

$ gcc -E hello.c -o hello.i

As you can see, the -E option tells the compiler to execute only the preprocessor. The output of this phase is a modified source file, without precompilation directives.

The generated hello.i file will be much larger than the original source file, since it will contain all the code contained in the header files included with the #include directive.

At this point, we can compile the precompiled source file hello.i:

$ gcc -c hello.i -o hello.o

The -c option tells gcc to stop before linking: the file is compiled (and assembled) but not linked. The output of this phase is an object file containing the machine code for the source file.

The hello.o file is no longer a text file, as in the case of the source file, but a binary file that contains the partial machine code of our program.

We can examine the object file hello.o with the nm utility that allows viewing the symbols present in the object file. A symbol is, in essence, a label assigned to a program entity, such as a variable or a function.

$ nm hello.o

The result of the command will be something similar to:

0000000000000000 T main
                 U printf

The output tells us two fundamental things:

- In our object file there is a symbol main, which corresponds to the main function of our program. The letter T (Text) before the symbol indicates that it is defined in the object file and corresponds to actual machine code.
- In the file there is also a symbol printf, which corresponds to the printf function used in our program. This symbol is preceded by the letter U (Undefined), which indicates that it is not defined in the object file: the machine code of printf is not present. Depending on the gcc version and options, the undefined symbol may be listed as puts instead, because gcc can replace a printf call that prints a fixed string ending in a newline with an equivalent call to puts.

Therefore, the object file hello.o does not contain all the information necessary to execute the program. A fundamental piece is missing: the machine code related to the printf function.

To solve this problem, we must link the object file hello.o with the C standard library, which contains the definition of the printf function.

Although the actual linker is ld, invoking it directly is cumbersome, because the startup files and libraries to link must be listed by hand. For this reason, we keep using gcc as a front end to the linker.

$ gcc hello.o -o hello

In this case, gcc realizes that the file hello.o is an object file and invokes the linker to link the object file with the C standard library.

The result is an executable file hello that we can execute with the command:

$ ./hello

If everything went well, we will see the message Hello World! printed on the screen.

Recapping:

Definition

Program Compilation Process using gcc

The compilation process of a program in C language using the gcc compiler can be executed using a single command:

$ gcc file_name.c -o file_name

Or, decomposing the process into three steps:

Precompilation:
```
$ gcc -E file_name.c -o file_name.i
```
Compilation:
```
$ gcc -c file_name.i -o file_name.o
```
Linking:
```
$ gcc file_name.o -o file_name
```

The precompilation and compilation operations can be executed in a single step with the command:

$ gcc -c file_name.c -o file_name.o

Compilation of Multiple Source Files

The compilation process of a program composed of a single source file is quite straightforward, as we have seen.

When the program is composed of multiple source files, the compilation process becomes more complex.

Let's study another example using, as before, the gcc compiler and the Linux operating system.

Suppose we want to write a simple program that calculates the area of a circle, and that we split it into two source files:

- main.c, which contains the main function, the entry point of the program, together with the code that reads the circle's radius from the keyboard and prints the resulting area;
- circle.c, which contains the definition of the circle_area function that calculates the circle's area.

We can write the two source files like this:

/* main.c */
#include <stdio.h>

int main() {
    double radius;
    double area;

    printf("Insert the radius of the circle: ");
    scanf("%lf", &radius);

    area = circle_area(radius);

    printf("The area of the circle with radius %.2f is %.2f\n", radius, area);

    return 0;
}

/* circle.c */

double circle_area(double radius) {
    return 3.14159 * radius * radius;
}

Now, we must compile the two source files.

Let's start with circle.c, preprocessing and compiling it in a single step:

$ gcc -c circle.c -o circle.o

The result will be an object file circle.o that contains the machine code related to the circle_area function. In fact, if we analyze the object file with nm, we will see the symbol circle_area defined in the object file:

$ nm circle.o
0000000000000000 T circle_area

As can be observed, the object file circle.o contains exclusively the machine code related to the circle_area function.

At this point, we must compile the file main.c. However, the file main.c contains a call to the circle_area function defined in the file circle.c.

If we try to compile main.c as is, we get an error:

$ gcc -c main.c -o main.o
main.c: In function 'main':
main.c:11:12: error: implicit declaration of function 'circle_area' [-Wimplicit-function-declaration]
   11 |     area = circle_area(radius);
      |            ^~~~~~~~~~~~

The call to circle_area in main.c is an implicit declaration: the function is used without having been declared first (older versions of gcc report this only as a warning). The compiler is telling us that main.c calls a function about which it knows nothing. The compiler does not need to know how the function works internally, but it does need to know how many parameters it accepts, of which types, and what type of value it returns.

In other words, to generate an object file the compiler does not need the source code of the circle_area function, only its declaration. This concept is very important:

Definition

Object File and Function Declarations

During compilation, to generate an object file, the compiler does not need to know the source code of a function, only its declaration: its name, the types of its parameters and the type of its return value.

The machine code of the function will be needed only during the linking phase, when the linker will search for the function's definition in a library or in another object file.

Only the linker needs the machine code of the circle_area function, in order to link it with the object file main.o.

How can we, then, solve this problem?

A first naïve solution is to add the prototype of the circle_area function at the beginning of the file main.c:

/* main.c */
#include <stdio.h>

double circle_area(double radius);

int main() {
    double radius;
    double area;

    printf("Insert the radius of the circle: ");
    scanf("%lf", &radius);

    area = circle_area(radius);

    printf("The area of the circle with radius %.2f is %.2f\n", radius, area);
    return 0;
}

With main.c modified in this way, recompiling produces no errors:

$ gcc -c main.c -o main.o

In this case the compiler no longer complains about the lack of a declaration of the circle_area function and generates the object file main.o.

Let's try to analyze the object file main.o with nm:

$ nm main.o
0000000000000000 T main
                 U circle_area
                 U printf
                 U scanf

We can observe that in the object file there are:

- The symbol main, which corresponds to the main function of our program. The letter T (Text) before the symbol indicates that it corresponds to actual machine code.
- Three references to symbols not present in the object file: circle_area, printf and scanf. These symbols are preceded by the letter U (Undefined), which indicates that they are not defined in the object file.

At this point, we can link the two object files main.o and circle.o to obtain the executable file:

$ gcc main.o circle.o -o circle_area

The result is an executable file circle_area that we can execute with the command:

$ ./circle_area

In practice, when dealing with programs composed of multiple source files, only the precompilation and compilation steps are executed for each source file. The linking step is executed only once, at the end, to link all object files together.

Recapping:

Definition

Compilation of Programs with Multiple Source Files using gcc

To compile a program composed of multiple source files, we must follow these steps:

For each source file .c:

The file is compiled obtaining the corresponding object file .o:
```
$ gcc -c file_name.c -o file_name.o
```
Linking of object files:

Once all object files are obtained, they are linked together to obtain the executable file:
```
$ gcc file_name1.o file_name2.o ... -o file_name
```

These two steps can be executed in a single command:

$ gcc file_name1.c file_name2.c ... -o file_name

Issues Related to Compilation of Multiple Source Files

The way we compiled the circle_area program in the example above worked, but we had to make changes to the file main.c.

In particular, we had to add the prototype of the circle_area function at the beginning of the file main.c.

This approach presents problems.

Suppose we want to change the circle_area function so that, instead of working with doubles, it works with floats:

/* circle.c */

float circle_area(float radius) {
    return 3.14159f * radius * radius;
}

Having done this, we must also modify the prototype of circle_area at the beginning of main.c, together with the rest of the code that uses the function:

/* main.c */
#include <stdio.h>

float circle_area(float radius);

int main() {
    float radius;
    float area;

    printf("Insert the circle radius: ");
    scanf("%f", &radius);

    area = circle_area(radius);

    printf("The area of the circle with radius %.2f is %.2f\n", radius, area);
    return 0;
}

The problem with this approach is that it does not scale.

First, in this example only one source file calls circle_area. If several source files called it, we would have to update the prototype in every one of them.

Second, circle.c contains only one function. If the program grows and circle.c comes to contain more functions, every prototype must be kept up to date in every source file that uses those functions.

Each change thus triggers a cascade of manual edits that is error-prone and time-consuming, and because each source file is compiled separately, the compiler cannot detect a prototype that was left out of date.

For this reason, the C language provides a mechanism that allows solving these problems: header files.

Header files solve the problem of sharing function and variable declarations among multiple source files. We will study them in the next lesson.

In Summary

In this lesson we learned that:

- A program written in C can be composed of multiple source files. This allows dividing the program into independent modules, making it easier to understand and maintain.
- The compilation process of a C program consists of three fundamental steps: precompilation, compilation and linking.
- The precompilation step is executed by the preprocessor, the compilation step by the compiler and the linking step by the linker.
- Compiling a program composed of multiple source files requires executing the precompilation and compilation steps for each source file. The linking step is executed only once, at the end, to link all object files together.
- The compiler needs to know only the function declarations, not their source code, to generate an object file.
- Header files allow sharing function and variable declarations among multiple source files, solving the scalability and maintainability problems of programs composed of multiple source files.

In the next lesson we will begin the in-depth study of header files that simplify the writing of multi-source programs.
