I Do Not Know C

Programming

The purpose of this article is to make everyone (especially C programmers) say: “I do not know C”.

I want to show that the dark corners of C are much closer than it seems, and even trivial code lines may contain undefined behavior.

The article is organized as a set of questions and answers. All the examples are separate files of the source code.

1.

int i;
int i = 10;

Q: Is this code correct? (Will there occur an error related to the fact that the variable is defined twice? Reminding you that it’s a separate source file and not a part of the function body or compound statement)

Answer

A: Yes, this code is correct. The first line is the tentative definition that becomes the «definition» after the compiler processes the definition (the second line).

2.

extern void bar(void);
void foo(int *x)
{
  int y = *x;  /* (1) */
  if(!x)       /* (2) */
  {
    return;    /* (3) */
  }
  bar();
  return;
}

Q: Turns out, bar() is invoked even when x is the null pointer (and the program does not crash). Is it the optimizer’s error, or is everything correct?

Answer

A: Everything’s correct. If x is the null pointer, undefined behavior occurs in line (1), and no one owes anything the programmer: the program does not have to crash in line (1), or make a return in line (2) in case it has managed to execute line (1). If we talk about the rules the compiler has been guided by, it all happens the following way. After the analysis of line (1), the compiler thinks that x cannot be a null pointer, and eliminates (2) and (3) as the dead code. Variable y is removed as unused. Reading from memory is also eliminated, since the *x type is not qualified as volatile.

That’s how the unused variable has removed the check for the null pointer.

3.

There was a function:

#define ZP_COUNT 10
void func_original(int *xp, int *yp, int *zp)
{
  int i;
  for(i = 0; i < ZP_COUNT; i++)
  {
    *zp++ = *xp + *yp;
  }
}

They wanted to optimize it this way:

void func_optimized(int *xp, int *yp, int *zp)
{
  int tmp = *xp + *yp;
  int i;
  for(i = 0; i < ZP_COUNT; i++)
  {
    *zp++ = tmp;
  }
}

Q: Is it possible to call the original function and the optimized one, so as to obtain different results in zp?

Answer

A: It is possible, let yp == zp.

4.

double f(double x)
{
  assert(x != 0.);
  return 1. / x;
}

Q: Can this function return inf? Assume that floating-point numbers are implemented according to IEEE 754 (most machines), and assert is enabled (NDEBUG is not defined).

Answer

A: Yes, it can. It’s enough to pass a denormalized x, like 1e-309.

5.

int my_strlen(const char *x)
{
  int res = 0;
  while(*x)
  {
    res++;
    x++;
  }
  return res;
}

Q: The provided above function should return the length of the null-terminated line. Find a bug.

Answer

A: The use of the int type for storing sizes of objects is wrong, as it is not guaranteed that int will be able to store the size of any object. We should use size_t

6.

#include 
#include 
int main()
{
  const char *str = "hello";
  size_t length = strlen(str);
  size_t i;
  for(i = length - 1; i >= 0; i--)
  {
    putchar(str[i]);
  }
  putchar('\n');
  return 0;
}

Q: The loop is infinite. How come?

Answer

A: size_t is the unsigned type. If i is unsigned, then i >= 0 is always true.

7.

#include 
void f(int *i, long *l)
{
  printf("1. v=%ld\n", *l); /* (1) */
  *i = 11;                  /* (2) */
  printf("2. v=%ld\n", *l); /* (3) */
}
int main()
{
  long a = 10;
  f((int *) &a, &a);
  printf("3. v=%ld\n", a);
  return 0;
}

This program is compiled by two different compilers and run on a little-endian machine. Two different results were obtained:

1. v=10    2. v=11    3. v=11
1. v=10    2. v=10    3. v=11

Q: How can you explain the second result?

Answer

A: The given program has undefined behavior. Namely, strict aliasing rules are violated. int is being changed in line (2). Therefore, we can assume that any long has not changed. (We cannot dereference the pointer that aliases another pointer of an incompatible type). That’s why the compiler can pass the same long (line (3)) that has been read during the execution of line (1).

8.

#include 
int main()
{
  int array[] = { 0, 1, 2 };
  printf("%d %d %d\n", 10, (5, array[1, 2]), 10);
}

Q: Is this code correct? If there is no undefined behavior, what does it print then?

Answer

A: Yes, the comma operator is used here. First, the left argument of the comma is calculated and discarded. Then, the right argument is calculated and used as the value of the entire operator. The output is 10 2 10.

Note that the comma symbol in the function call (for example, f(a(), b())) is not the comma operator, and, therefore, it does not guarantee the order of calculations: a(), b() can be called in any order.

9.

unsigned int add(unsigned int a, unsigned int b)
{
  return a + b;
}

Q: What is the result of add(UINT_MAX, 1)?

Answer

A: The overflow of unsigned numbers is defined, it is calculated by 2^(CHAR_BIT * sizeof(unsigned int)). The result is 0.

10.

int add(int a, int b)
{
  return a + b;
}

Q: What is the result of add(INT_MAX, 1)?

Answer

A: The overflow of signed numbers – the undefined behavior.

11.

int neg(int a)
{
  return -a;
}

Q: Is undefined behavior possible here? If so, under what arguments?

Answer

A: neg(INT_MIN). If the ECM represents negative numbers in the additional code (two’s complement), the absolute value of INT_MIN is more by one than the absolute values of INT_MAX. In this case, -INT_MIN invokes the signed overflow, which is the undefined behavior.

12.

int div(int a, int b)
{
  assert(b != 0);
  return a / b;
}

Q: Is undefined behavior possible here? If so, under what arguments?

Answer

A: If the ECM represents negative numbers in the additional code, then div(INT_MIN, -1) – refer to the previous question.

— Dmitri Gribenko

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Comments

  1. I know C. These really aren’t that hard. I would argue that a modern programming language shouldn’t require you to memorize such trivia (and I avoid C wherever possible), but it’s very much possible.
  2. I do know C.
  3. (Sorry if this comment appears twice.)

    I suggest using «int main(void)» rather than the old-style «int main()». Every C compiler I’ve seen will accept both, but I’ve argued that a strict reading of the C standard indicates that «int main()» has undefined behavior.

  4. I know C. 912 is good enough i think! :D
  5. uh, gave a few wrongs answers…
  6. wrongs
  7. Could you show the result of those question? I couldn’t believe that. so crazy.!!!
  8. Well, just as i suspected. I do not know c.
  9. I know C :P
  10. I know 1/4th C? Only 3 out of 12.
  11. I know C.
  12. Regarding Q9, the answer is indeed 0, but note the words of the C99 standard: “For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter).” I am not sure of the situation in C90.
  13. I don’t mean to complain that ‘Someone on the Internet is Wrong’ (http://xkcd.com/386/) but hopefully this is informative to someone. I do also mean to make the point that, yes, I agree: I do not know C, you do not know C, and we do not know C, as it would seem C is unknowable… but then, the same could be said of JavaScript and of many other languages.
  14. Reasonable and funky title, me either! I don’t know C.
  15. I don’t understand the answer for Q1.

    The two lines are an int definitions. One also has an initialize ruin, but they are both definitions all the same. A good compiler would yield a warning but it doesn’t have to. At any rate this will not link, because of the one definition rule.

  16. Well, these where some interesting facts, and yes… I don’t know C!
  17. Is Q3 defined in C++17? The new sequence before relationship ensures that
3,751

Ropes — Fast Strings

Most of us work with strings one way or another. There’s no way to avoid them — when writing code, you’re doomed to concatinate strings every day, split them into parts and access certain characters by index. We are used to the fact that strings are fixed-length arrays of characters, which leads to certain limitations when working with them. For instance, we cannot quickly concatenate two strings. To do this, we will at first need to allocate the required amount of memory, and then copy there the data from the concatenated strings.