How I Wrote a C++ Compiler 15 Years Ago


Fifteen years ago, there was no Facebook. There was no C++ compiler with diagnostic messages in Russian (my native language). Several C++ standards have been released since then, and development technology has taken a huge leap forward. Today, with so many frameworks available, it doesn't take long to write a code analyzer or even a programming language of your own.

This post is about how I started my career and reached an expert level through self-study and by writing a C++ compiler. Below you'll find the general implementation details and what came of it.


How It All Started

Back in 2001, when my parents bought me my first computer (a Duron 800 MHz, 128 MB RAM, 40 GB HDD), I began studying programming. Actually, at first I was wondering what to install: Red Hat Linux, FreeBSD, or Windows 98/Me? The Russian programming magazine “Hacker” was my benchmark at the time; its mocking tone and style of storytelling appealed to me a lot.

I really wanted to learn the entire stack they wrote about and hack the Pentagon (without the Internet). My internal struggle over whether to become a Linux fan or play games on Windows lasted until I got the Internet at home: a slow 56 kbit/s dial-up connection that needed half an hour to download a song. At a price of $0.10 per MB, one song cost around 40 to 50 cents during the daytime.

But the rates were completely different at night. I could surf all the websites from 11 p.m. to 6 a.m. without disabling images in the browser! So, I downloaded everything I could at night, and then read it during the daytime.

On the first day with the Internet at home, the technician who did the setup opened IE 5 and the Yandex search page, and then left. Wondering what to search for first, I typed something like “website for programmers”. I found rsdn.org, one of the new websites at the time, and spent a lot of time there. However, I was frustrated that I didn't understand much of it. Back then, the flagship and most popular language was C++. It was a challenge, and I had no choice but to reach the level of the people experienced in C++.

There was another interesting website — firststeps.ru. I still think that their method of knowledge delivery is the best one. I moved step by step, with small results. But I did it!

Buying up used books at a flea market, I wanted to grasp all the basics of programming. One of the first books I bought was “The Art of Computer Programming” by Donald Knuth. I don't really remember why I decided to buy that book. Perhaps the seller suggested it, but I began studying the first volume with the eagerness of a student and did all the exercises at the end of each chapter. I wasn't good at math at school. However, I made some progress with Knuth's mathematical analysis because I had a great desire and motivation to write programs and to do it the right way. Having mastered algorithms and data structures, I bought the third volume of The Art of Computer Programming: Sorting and Searching. It was the bomb! Heapsort, quicksort, binary search, trees and lists, stacks and queues. I wrote all of this on paper, working out the results in my head. I read at home, read at the beach, read everywhere. Solid theory without practice. I didn't know what a huge benefit this basic knowledge would give me in the future.

Interviewing developers today, I have yet to meet a person who can write an implementation of binary search or quicksort on a piece of paper. It's frustrating.

But let's get back to the subject of this post. Having worked through Knuth, I had to move on. I also attended Turbo Pascal classes, read Kernighan and Ritchie, and then “Teach Yourself C++ in 21 Days”. I didn't understand everything about C and C++, so I simply retyped code from the books. I couldn't google things or ask anyone, but I had plenty of time, as I had switched from regular school to night school, which did not demand much of it.

As a result, I studied from morning till night, learning new things every day. I could write a calculator or a simple WinAPI application, and I could also write something in Delphi 6. So, by the time I got my high school diploma, I was already prepared for the third or fourth year of university, and I knew exactly what kind of higher education to go for.

By the time I entered the Department of Computer Systems and Networks, I could handle C and C++ assignments of any level of complexity. However, when I visited, say, rsdn.org, I realized how much I still had to learn. It was a challenge for me. That lack of understanding, combined with a burning desire to know everything, led me to “Compilers: Principles, Techniques, and Tools” by Alfred V. Aho and Ravi Sethi, also known as the Dragon Book. That's when the most interesting things started to happen. Before it, I had read Herbert Schildt's “Expert C++”, which covered such advanced things as encryption, data compression and, most interestingly, writing a parser.

Studying the Dragon Book, I moved from lexical analysis to parsing, and finally to semantic analysis and code generation. Along the way, I made a landmark decision: to write a C++ compiler of my own.

‘Why not’, I asked myself. ‘Let’s do it’, replied the part of my brain that becomes more skeptical with age. So, I began the development of the compiler.

Preparation

Around that time I lost my dial-up Internet, as the telephone lines were being switched to digital ones. So, in advance, I downloaded the 1998 ISO C++ Standard. As my tool, I used the one I liked: Visual C++ 6.0.

Thus, the work boiled down to implementing what was written in the C++ Standard. I also relied on the Dragon Book during development. The starting point was the parser-calculator from Schildt's book. The puzzle came together, and development began.

Preprocessor

nrcpp/KPP_1.1

The second chapter of the 1998 ISO C++ Standard sets out the requirements for the preprocessor, as well as the lexical conventions. ‘Great!’ I thought, since it was the simplest part and could be implemented separately from the compiler itself. In other words, preprocessing runs first: it takes a C++ source file in the form you are used to seeing it, and produces a converted C++ file with comments stripped, files from #include inserted, macros from #define expanded, #pragma directives preserved, and conditional compilation (#if/#ifdef/#endif) processed.

Before Preprocessing:

#define MAX(a, b) \
    ((a) > (b) ? a : b)
#define STR(s)  #s

/*
 This is the entry point of program
 */
int main()
{
    printf("%s: %d", STR(This is a string), MAX(4, 5));
}

After Preprocessing:

int main()
{
    printf("%s: %d", "This is a string", ((4) > (5) ? 4 : 5));
}

In addition, the preprocessor performs plenty of other useful work, like evaluating constant expressions, concatenating adjacent string literals, and emitting #warning and #error messages. By the way, have you ever seen digraphs and trigraphs in C code? If not, you should know that they do exist!

An Example of Digraphs and Trigraphs

int a<:10:>; // the equivalent of int a[10];
if (x != 0) <% %> // the equivalent of if (x != 0) { }

// Example of a trigraph
??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
// is converted to
#define arraycheck(a,b) a[b] || b[a]

You'll find more details in the lexical conventions chapter of the Standard.

The main benefit of the C++ preprocessor is, of course, macro substitution and the insertion of files specified by #include.

What did I learn in the process of writing a C++ preprocessor?

  • How the language's vocabulary and grammar are built
  • C++ operator precedence and how to evaluate expressions in general (see the sketch after this list)
  • String and character constants, and constant suffixes
  • How to structure code
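
The preprocessor has to evaluate constant expressions in #if, which is exactly where operator precedence comes in. Here is a minimal sketch of the precedence-climbing idea for the four arithmetic operators; the names (ExprParser, parseBinary) are mine for illustration, not the actual KPP code.

// A minimal sketch of precedence-based expression evaluation, the kind of
// thing a preprocessor needs for #if constant expressions. Illustrative only.
#include <cctype>
#include <cstdio>

struct ExprParser
{
    const char* p;
    explicit ExprParser(const char* s) : p(s) {}

    void skipSpaces() { while (std::isspace((unsigned char)*p)) ++p; }

    // primary-expression: a number or a parenthesized expression
    long parsePrimary()
    {
        skipSpaces();
        if (*p == '(') {
            ++p;
            long v = parseBinary(0);
            skipSpaces();
            ++p;                        // skip ')'
            return v;
        }
        long v = 0;
        while (std::isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
        return v;
    }

    // precedence climbing: an operator only grabs operands of higher precedence
    long parseBinary(int minPrec)
    {
        long lhs = parsePrimary();
        for (;;) {
            skipSpaces();
            char op = *p;
            int prec = (op == '*' || op == '/') ? 2 :
                       (op == '+' || op == '-') ? 1 : 0;
            if (prec == 0 || prec < minPrec) return lhs;
            ++p;
            long rhs = parseBinary(prec + 1);
            switch (op) {
            case '+': lhs += rhs; break;
            case '-': lhs -= rhs; break;
            case '*': lhs *= rhs; break;
            case '/': lhs /= rhs; break;
            }
        }
    }
};

int main()
{
    ExprParser e("2 + 3 * (4 - 1)");
    std::printf("%ld\n", e.parseBinary(0));   // prints 11
}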

All in all, it took me about a month to write the preprocessor. It wasn't too difficult, but it isn't a trivial task either. At the same time, my fellow students were trying to write their first “Hello, world!”, and not all of them could get it right. As for me, the next chapters of the C++ Standard, and the compiler itself, were waiting.

Lexical Analyzer

nrcpp/LexicalAnalyzer.cpp

Everything is black and white here. The bulk of the lexical analysis had already been written for the preprocessor. The task of the lexical analyzer is to split the code into lexemes, or tokens, which are then consumed by the syntax analyzer. What was written at this stage?

  • A finite state machine for recognizing integer, real, and character constants (see the sketch after this list). Think it's all simple? Well, it is, once you've finished it.
  • Recognition of identifiers and C++ keywords
  • There was something else, but I can't remember it now.
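
To give an idea of what such a state machine looks like, here is a minimal sketch that recognizes decimal, octal, and hexadecimal integer constants. It is illustrative only and far simpler than the real LexicalAnalyzer.cpp, which also has to handle real and character constants and the various suffixes.

// A minimal sketch of a finite state machine for integer constants.
#include <cctype>
#include <cstdio>
#include <string>

enum State { Start, Zero, Dec, Oct, HexPrefix, Hex, Bad };

bool isIntegerConstant(const std::string& token)
{
    State st = Start;
    for (char ch : token) {
        unsigned char c = (unsigned char)ch;
        switch (st) {
        case Start:
            st = (c == '0') ? Zero : std::isdigit(c) ? Dec : Bad;
            break;
        case Zero:                               // "0" alone, "0x...", or octal
            if (c == 'x' || c == 'X')       st = HexPrefix;
            else if (c >= '0' && c <= '7')  st = Oct;
            else                            st = Bad;
            break;
        case Dec:
            if (!std::isdigit(c)) st = Bad;
            break;
        case Oct:
            if (c < '0' || c > '7') st = Bad;
            break;
        case HexPrefix:
        case Hex:
            st = std::isxdigit(c) ? Hex : Bad;
            break;
        case Bad:
            return false;
        }
    }
    // accepting states; a bare "0x" (HexPrefix) is not a constant
    return st == Zero || st == Dec || st == Oct || st == Hex;
}

int main()
{
    std::printf("%d %d %d\n",
                isIntegerConstant("0x1F"),    // 1
                isIntegerConstant("42"),      // 1
                isIntegerConstant("08"));     // 0: '8' is not an octal digit
}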

Parser

nrcpp/Parser.cpp

The task of the parser is to check that the lexemes produced during lexical analysis are arranged correctly.
My parser was based on the simple parser from Schildt's book, leveled up to handle C++ syntax and extended with a stack overflow check. If we write something like this:

(((((((((((((((((((((((((((((0))))))))))))))))))))))))))))))))); // there can be even more parentheses

My recursive analyzer will eat up the stack and report that the expression is too complex. A careful reader might ask: why reinvent the wheel when yacc and lex already existed? They did. But I wanted my own wheel, with full control over the code. Performance-wise, it was certainly inferior to the code generated by those tools, but technical excellence wasn't my goal. My goal was to understand everything.
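
Here is a minimal sketch of how such a guard can work in a recursive-descent parser: a depth counter grows with every nested construct, and parsing is aborted once an assumed limit is exceeded. The names and the limit are illustrative and not taken from the nrcpp sources.

// A minimal sketch of a recursion-depth guard in a recursive-descent parser.
#include <cstdio>
#include <stdexcept>
#include <string>

const int MAX_DEPTH = 256;          // an assumed limit, not the one nrcpp uses

struct Parser
{
    const char* p;
    int depth;

    explicit Parser(const char* s) : p(s), depth(0) {}

    // primary-expression: number | '(' expression ')'
    long parsePrimary()
    {
        if (++depth > MAX_DEPTH)
            throw std::runtime_error("expression is too complex");

        long v = 0;
        if (*p == '(') {
            ++p;
            v = parsePrimary();     // this recursion is what eats the stack
            if (*p != ')') throw std::runtime_error("')' expected");
            ++p;
        } else {
            while (*p >= '0' && *p <= '9') v = v * 10 + (*p++ - '0');
        }

        --depth;
        return v;
    }
};

int main()
{
    std::string nasty = std::string(100000, '(') + "0" + std::string(100000, ')');
    Parser parser(nasty.c_str());
    try {
        std::printf("%ld\n", parser.parsePrimary());
    } catch (const std::exception& e) {
        std::printf("error: %s\n", e.what());   // "expression is too complex"
    }
}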

Semantics

nrcpp/Checker.cpp nrcpp/Coordinator.cpp nrcpp/Overload.cpp

Semantics is covered in chapters 3 through 14 of the 1998 ISO C++ Standard. It's the most difficult part, and I'm sure that more than 90% of C++ developers don't know the rules described in these chapters. For instance, did you know that it's possible to declare the same function twice, like this:

void f(int x, int y = 7);
void f(int x = 5, int y);
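
If that looks suspicious, here is a small, self-contained illustration of how (as I read the default-argument rules) this pair of declarations behaves: within the same scope the defaults accumulate, so f can eventually be called with no arguments at all.

#include <cstdio>

void f(int x, int y = 7);       // first declaration: default for y
void f(int x = 5, int y);       // redeclaration adds a default for x

void f(int x, int y) { std::printf("%d %d\n", x, y); }

int main()
{
    f();        // prints "5 7"
    f(1);       // prints "1 7"
}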

There are also constructs like this for pointers:

const volatile int *const volatile *const p;

Here’s a pointer to a member function of class X:

void (X::*mf)(int &);
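
For completeness, here is a short, hypothetical usage example for such a declaration; the class X and the member name print are invented here for illustration.

#include <cstdio>

struct X
{
    void print(int& n) { std::printf("%d\n", n); }
};

int main()
{
    void (X::*mf)(int&) = &X::print;   // pointer to a member function of X
    X x;
    int n = 42;
    (x.*mf)(n);                        // calls x.print(n), prints "42"
}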

These are just the first things that came to mind. Needless to say, when I tested code samples from the Standard on Visual C++ 6, I often got an Internal Compiler Error. Development of the semantic analyzer took a year and a half. During that time, I was nearly kicked out of the university, as I wasn't good at any subject other than programming. Meanwhile, I kept developing the compiler and adding functionality to it.

Code Generator

nrcpp/Translator.cpp

At this stage, when my enthusiasm was already fading, a fairly working version of the compiler front end was ready. It is up to the developer to decide what to do with such a front end next: it can be shipped in its current form, used to write a code analyzer, or turned into a converter of your own, something like C++ -> C# or C++ -> C. At this point, we have a syntactically and semantically validated AST (abstract syntax tree).

That's when the developer realizes they have reached Zen, reached enlightenment, and can understand why code executes exactly the way it does without even looking at it. To reach my goal of creating a C++ compiler, I decided to finish by generating C code, which could then be translated into any existing assembly language or fed to an existing C compiler (just like Stroustrup did in the first versions of “C with classes”).
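
To illustrate the Cfront-style idea, here is a hand-written sketch (not actual nrcpp output): a C++ member function can be lowered to a plain C function that receives an explicit "this" pointer.

#include <cstdio>

// The C++ input:
struct Point
{
    int x, y;
    int sum() const { return x + y; }
};

// Roughly what the generated C could look like:
struct Point_c { int x, y; };

int Point_sum(const struct Point_c* self)
{
    return self->x + self->y;
}

int main()
{
    Point   p  = { 2, 3 };
    Point_c pc = { 2, 3 };
    std::printf("%d %d\n", p.sum(), Point_sum(&pc));   // p.sum() becomes Point_sum(&p)
}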

What nrcpp Lacks

  • Templates. C++ templates are so complicated to implement that I had to admit they would never work properly without reworking the parser and intertwining it with the semantic analysis.
  • namespace std. You cannot write a standard library without templates. Besides, it would take months, as the library takes up the lion's share of the Standard.
  • Internal compiler errors. When playing with code, you may see messages like: Internal compiler error: in.txt(20, 14): “theApp.IsDiagnostic()” –> (Translator.h, 484). The reason is either unimplemented functionality or semantic rules that were not taken into account.

Why Invent the Wheel?

To sum up, I'd like to explain why I wrote this post. Inventing a wheel of my own, even though it took me two years, still pays off for me today. That knowledge is priceless. It's the foundation that stays with you throughout your career as a developer. Techniques and frameworks will change, new languages will come onto the scene, but all of them rest on the same foundations from the past, and it won't take you long to understand and master them.

github.com/nrcpp/nrcpp — the compiler source code. You can play with it by changing the file in.txt and looking at the output in out.txt.

github.com/nrcpp/nrcpp/tree/master/KPP_1.1 — the preprocessor source code. It builds with Visual C++ 6.
