The purpose of this article is to make everyone (especially C programmers) say: “I do not know C”.
I want to show that the dark corners of C are much closer than it seems, and even trivial code lines may contain undefined behavior.
The article is organized as a set of questions and answers. All the examples are separate files of the source code.
1.
int i;
int i = 10;
Q: Is this code correct? (Will there occur an error related to the fact that the variable is defined twice? Reminding you that it’s a separate source file and not a part of the function body or compound statement)
A: Yes, this code is correct. The first line is the tentative definition that becomes the «definition» after the compiler processes the definition (the second line).
2.
extern void bar(void);
void foo(int *x)
{
int y = *x; /* (1) */
if(!x) /* (2) */
{
return; /* (3) */
}
bar();
return;
}
Q: Turns out, bar() is invoked even when x is the null pointer (and the program does not crash). Is it the optimizer’s error, or is everything correct?
A: Everything’s correct. If x is the null pointer, undefined behavior occurs in line (1), and no one owes anything the programmer: the program does not have to crash in line (1), or make a return in line (2) in case it has managed to execute line (1). If we talk about the rules the compiler has been guided by, it all happens the following way. After the analysis of line (1), the compiler thinks that x cannot be a null pointer, and eliminates (2) and (3) as the dead code. Variable y is removed as unused. Reading from memory is also eliminated, since the *x type is not qualified as volatile.
That’s how the unused variable has removed the check for the null pointer.
3.
There was a function:
#define ZP_COUNT 10
void func_original(int *xp, int *yp, int *zp)
{
int i;
for(i = 0; i < ZP_COUNT; i++)
{
*zp++ = *xp + *yp;
}
}
They wanted to optimize it this way:
void func_optimized(int *xp, int *yp, int *zp)
{
int tmp = *xp + *yp;
int i;
for(i = 0; i < ZP_COUNT; i++)
{
*zp++ = tmp;
}
}
Q: Is it possible to call the original function and the optimized one, so as to obtain different results in zp?
A: It is possible, let yp == zp.
4.
double f(double x)
{
assert(x != 0.);
return 1. / x;
}
Q: Can this function return inf? Assume that floating-point numbers are implemented according to IEEE 754 (most machines), and assert is enabled (NDEBUG is not defined).
A: Yes, it can. It’s enough to pass a denormalized x, like 1e-309.
5.
int my_strlen(const char *x)
{
int res = 0;
while(*x)
{
res++;
x++;
}
return res;
}
Q: The provided above function should return the length of the null-terminated line. Find a bug.
A: The use of the int type for storing sizes of objects is wrong, as it is not guaranteed that int will be able to store the size of any object. We should use size_t
6.
#include
#include
int main()
{
const char *str = "hello";
size_t length = strlen(str);
size_t i;
for(i = length - 1; i >= 0; i--)
{
putchar(str[i]);
}
putchar('\n');
return 0;
}
Q: The loop is infinite. How come?
A: size_t is the unsigned type. If i is unsigned, then i >= 0 is always true.
7.
#include
void f(int *i, long *l)
{
printf("1. v=%ld\n", *l); /* (1) */
*i = 11; /* (2) */
printf("2. v=%ld\n", *l); /* (3) */
}
int main()
{
long a = 10;
f((int *) &a, &a);
printf("3. v=%ld\n", a);
return 0;
}
This program is compiled by two different compilers and run on a little-endian machine. Two different results were obtained:
1. v=10 2. v=11 3. v=11
1. v=10 2. v=10 3. v=11
Q: How can you explain the second result?
A: The given program has undefined behavior. Namely, strict aliasing rules are violated. int is being changed in line (2). Therefore, we can assume that any long has not changed. (We cannot dereference the pointer that aliases another pointer of an incompatible type). That’s why the compiler can pass the same long (line (3)) that has been read during the execution of line (1).
8.
#include
int main()
{
int array[] = { 0, 1, 2 };
printf("%d %d %d\n", 10, (5, array[1, 2]), 10);
}
Q: Is this code correct? If there is no undefined behavior, what does it print then?
A: Yes, the comma operator is used here. First, the left argument of the comma is calculated and discarded. Then, the right argument is calculated and used as the value of the entire operator. The output is 10 2 10.
Note that the comma symbol in the function call (for example, f(a(), b())) is not the comma operator, and, therefore, it does not guarantee the order of calculations: a(), b() can be called in any order.
9.
unsigned int add(unsigned int a, unsigned int b)
{
return a + b;
}
Q: What is the result of add(UINT_MAX, 1)?
A: The overflow of unsigned numbers is defined, it is calculated by 2^(CHAR_BIT * sizeof(unsigned int)). The result is 0.
10.
int add(int a, int b)
{
return a + b;
}
Q: What is the result of add(INT_MAX, 1)?
A: The overflow of signed numbers – the undefined behavior.
11.
int neg(int a)
{
return -a;
}
Q: Is undefined behavior possible here? If so, under what arguments?
A: neg(INT_MIN). If the ECM represents negative numbers in the additional code (two's complement), the absolute value of INT_MIN is more by one than the absolute values of INT_MAX. In this case, -INT_MIN invokes the signed overflow, which is the undefined behavior.
12.
int div(int a, int b)
{
assert(b != 0);
return a / b;
}
Q: Is undefined behavior possible here? If so, under what arguments?
A: If the ECM represents negative numbers in the additional code, then div(INT_MIN, -1) – refer to the previous question.
— Dmitri Gribenko
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.