Introduction to Strings in C

In this introductory lesson, we will begin studying the concept of strings in the C language. Strings are a fundamental part of programming and are used to represent sequences of characters.

We will focus on string literals, and how C internally represents a string.

Starting from the next lesson, we will see how to define and manipulate string variables.

String Literals

A string literal, in the C language, is a sequence of characters enclosed in double quotes. For example:

"Hello, how are you?"

We have already frequently encountered string literals throughout the various lessons on the C language. For instance, when we used the printf function to print a message to the screen, we passed a string literal as an argument:

printf("Hello, how are you?");

In this first lesson on strings, we will focus on string literals, meaning those defined directly in our code. Starting from the next lesson, we will see how to define and manipulate strings using variables.

Definition

String Literal

A string literal is a sequence of characters enclosed in double quotes.

A string literal can be defined directly in the source code. The syntax to define a string literal is as follows:

"string"

Escape Characters

Just like character constants, string literals can also contain escape characters (formally called escape sequences). Escape characters are character sequences that begin with the \ (backslash) character. For example, \n represents a newline and \t represents a tab. Here are some examples of using escape characters inside a string:

"This is a string with a newline character\n"
"This\nis\na\nmulti-line\nstring\n"
"This is a string with a tab character\t"

Escape characters are useful for formatting strings to make them more readable or to insert special characters that cannot be typed directly from the keyboard.

Definition

String literals can contain escape characters

A string literal can contain escape characters. Escape characters are character sequences that begin with the \ (backslash) character.

Split String Literals

When inserting string literals into your code, you may need to split them for readability purposes. For example, you might need to write a string literal across multiple lines.

One method, in C, for splitting a string literal across multiple lines is to end each line with a backslash \, which tells the compiler to join that line with the next one. For example:

printf("Midway upon the journey of our life\n\
I found myself within a forest dark\n\
For the straight way had been lost.");

The rule is that nothing may follow the backslash \ on its line: it must be the very last character, with not even a space or a tab after it. This way, the compiler knows the string continues on the next line.

The drawback of this rule is that the following line must begin immediately with the continuation of the string, because every character on that line, including leading whitespace, becomes part of the string. This can pose a problem. For instance, the following code does not do what we want:

/* PROBLEM: the indentation ends up inside the string */
printf("Midway upon the journey of our life\n\
        I found myself within a forest dark\n\
        For the straight way had been lost.");

In this case, we indented the code for readability, but the spaces used for indentation become part of the string and are printed at the start of the second and third lines of the output.

However, there is a second technique for splitting string literals. The compiler will treat multiple string literals as a single string if they are separated only by whitespace (spaces, tabs, or newlines). For example, the following two literals are considered one:

"Hello, "     "how are you?"

In this case, the compiler treats it as a single string:

"Hello, how are you?"

Therefore, we can rewrite the printf above using this technique and indent the code, keeping the \n at the end of each verse so that the output is unchanged:

printf("Midway upon the journey of our life\n"
       "I found myself within a forest dark\n"
       "For the straight way had been lost.");

This way, the string is split across multiple lines, but the compiler still treats it as a single string.

Definition

Split String Literals

In the C language, a string literal can be split in two ways:

  1. You can split a string literal across multiple lines by ending each line with a backslash \.

    The constraint is that nothing may follow the backslash \ on its line, and the next line must immediately continue the string.

    Syntax:

    "Part 1 of the string\
    Part 2 of the string\
    Part 3 of the string"
    
  2. You can split a string literal into multiple literals if they are separated solely by whitespace (spaces, tabs, or newlines).

    Syntax:

    "Part 1 of the string"     "Part 2 of the string"     "Part 3 of the string"

    In this case, the compiler treats the literals as a single string.

Internal Representation of Strings

So far, we’ve been using strings—particularly string literals—without giving much thought to what happens behind the scenes when we pass them to functions like printf and scanf. What we now want to understand is: what exactly does it mean to pass a string as an argument to a function?

To understand that, we first need to understand how the C language represents strings internally.

At its core, a string in C is an array of char characters. When the compiler encounters a string literal of length n, it allocates a memory area of length n + 1 bytes. This memory area will contain the characters of the string and a special character called the null terminator.

The purpose of the null terminator is to indicate the end of the string. In C, the null terminator is a binary zero, which can be represented with the escape sequence \0. This character is always present at the end of a string and is not considered part of the actual text of the string.

Definition

Internal Representation of Strings

In the C language, a string is internally represented as an array of char characters.

The array contains the characters of the string and a special character called the null terminator. The terminator is a binary zero, represented by the escape sequence \0.

Let’s clarify with an example. Suppose we define the string literal "Hello" composed of 5 characters. Internally, it will be stored as an array of 6 characters, like this:

+-----+-----+-----+-----+-----+-----+
| 'H' | 'e' | 'l' | 'l' | 'o' | '\0'|
+-----+-----+-----+-----+-----+-----+

From this, it follows that the empty string "" is composed of just one character—the null terminator \0:

+-----+
| '\0'|
+-----+

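String literals are stored as character arrays, and in most expressions the compiler converts an array into a pointer to its first element. A string literal used as a function argument therefore becomes a pointer of type char *. This is why functions like printf and scanf expect a pointer to the first character of a string (declared as const char * when the function only reads it), not the string itself.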

So, when we invoke printf like this:

printf("Hello, how are you?");

we’re actually passing to the function a pointer to the first character of the string literal "Hello, how are you?". The printf function reads characters starting from that pointer until it encounters the null terminator \0.

Definition

String literals are treated as pointers

String literals are stored as character arrays and, in most expressions, are converted to pointers of type char *. When we pass a string literal as an argument to a function, we are passing a pointer to the first character of the string.

Note

Do not confuse the null terminator '\0' with the character '0'

The null terminator '\0' is a binary zero, meaning it corresponds to ASCII code 0.

Do not confuse it with the character '0', which corresponds to ASCII code 48 and is a completely different character.

Operations on String Literals

In general, we can use string literals anywhere the C language expects a char * pointer.

In addition to passing string literals as arguments to functions, we can assign them to variables of type char *:

char *greeting = "Hello, how are you?";

In this case, the assignment does not copy the string literal into the variable greeting, but rather assigns to greeting the address of the first character of the string literal.

Since string literals are arrays of characters, we can also use indexing to access individual characters:

char c;
c = "Hello, how are you?"[3];  // c contains 'l'

Here, we’re assigning to the variable c the value of the fourth character of the string literal "Hello, how are you?", which is the character 'l'.

This property of string literals is not often used, but there are situations where it can be helpful. For example, suppose we want to write a function that, given an integer representing the day of the week, returns a character corresponding to the first letter of that day.

We could write such a function using a switch statement:

char weekday_initial(int day) {
    switch (day) {
        case 1: return 'M';
        case 2: return 'T';
        case 3: return 'W';
        case 4: return 'T';
        case 5: return 'F';
        case 6: return 'S';
        case 7: return 'S';
        default: return '?';
    }
}

Or, more compactly, we could use a string literal:

char weekday_initial(int day) {
    if (day < 1 || day > 7) return '?';
    return "MTWTFSS"[day - 1];
}

Note

A string literal is read-only

Although string literals are stored as character arrays, they must not be modified: attempting to do so results in undefined behavior, because compilers commonly place them in read-only memory.

For example, the following code is incorrect:

char *greeting = "Hello, how are you?";
// ERROR: attempt to modify a string literal
greeting[0] = 'C';

A program that tries to modify a string literal may crash or behave unpredictably.

To work with strings that can be modified, we need string variables, which are the subject of the next lesson.

String Literals vs. Character Constants

Let’s end this lesson with one final important note.

Often, inexperienced developers confuse string literals made up of a single character with character constants. For example, the string literal "A" is different from the character constant 'A'.

The string literal "A" is an array of two characters: 'A' and '\0'. In most expressions, it is converted to a pointer of type char * that points to its first character.

The character constant 'A', on the other hand, is an integer whose value is the code of the character A (65 in ASCII).

So, we must pay attention to whether a function expects a string (i.e., char *) or a single character (i.e., char).

For example, the following printf call is correct:

printf("A");

After all, printf expects a char * as its first parameter.

In contrast, the following code is incorrect, because it passes an integer where printf expects a pointer:

/* ERROR */
printf('A');

Note

Do not confuse single-character strings with character constants

Single-character strings are arrays of two characters: the character itself and the null terminator '\0'.

Character constants are integers representing the ASCII code of the character.

In Conclusion

This lesson serves as the starting point for learning about strings in the C language.

We learned that a string literal is a sequence of characters enclosed in double quotes. We saw that string literals can contain escape characters and can be split across multiple lines.

Most importantly, we learned that string literals are treated as pointers of type char * and are stored as arrays of characters with a null terminator '\0'.

However, we focused on string literals that we define directly in the code. In the next lesson, we’ll see how to define and manipulate string variables.