Slightly Improved C

2021-04-08

I’ve been programming in C for a long time, and the feeling is usually love-hate. It’s simple and powerful enough, but it can still get complex and hard to understand. You can easily shoot yourself in the foot, in so many ways.

As a method to learn something, I usually implement it myself. And no, I don’t mean I need to replicate everything, just enough to get a grip on the main idea. I think I have hundreds of projects which implement something that already exists. Usually poorly, and in a barely working manner.

Thus I thought writing my own C compiler would be fun. And even more fun would be extending it to support certain features.

Introducing Slightly Improved C - or SIC

At first it was just a sick joke. Implement your own C compiler! Various people have worked for years on similar things. I knew it was possible, but hard work. Still, I wanted to see how far I would get.

The idea was to first get a working C compiler, and then improve it to prevent certain pitfalls in standard C. Thus this dialect would not be compatible with the standard. That would give me some freedom. Before I started, I wrote down some principles:

  • Limit undefined behavior
    • This is the Pandora’s box of weird errors; try to avoid it even if it means slightly slower performance
  • Initialize variables
    • No need to remember to set everything to zero, since it’s done automatically!
  • Integer overflow cases defined
    • Always safe integer overflow and no signed int surprises
  • Integer sizes
    • Integer sizes are always the same, no matter which target you’re compiling for
  • Native string support
    • Built-in safe string support
  • Fixed point math
    • Built-in fixed point support
  • Improvements on fallthrough in switch-case
  • Rotate operator
    • Some architectures support rotate, but C has only shifts
  • Built-in swap command
    • For example, x86 has xchg, but there is no (easy) way to express it in C

There were many others, but these were the main features I wanted to implement. Anyone who has programmed in C will probably understand why these are on my list.
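To make a few of these items concrete, here is a minimal sketch of how rotate, swap and wrapping signed addition have to be spelled in standard C today. This is plain C, not SIC syntax, and the helper names (rotl32, swap_i32, wrapping_add_i32) are made up for illustration.

```c
#include <stdint.h>

/* Rotate left by n bits. Masking keeps both shift counts in 0..31, so
   there is never an undefined shift by 32, not even when n is 0. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Swap through a temporary; standard C has no swap operator. */
static void swap_i32(int32_t *a, int32_t *b)
{
    int32_t tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Signed addition that wraps instead of invoking undefined behavior.
   The sum is computed in unsigned arithmetic, which is defined to wrap.
   Converting an out-of-range value back to int32_t is implementation-defined
   in ISO C; GCC and Clang document it as two's complement wrapping.
   Assumes int is no wider than 32 bits, as on common targets. */
static int32_t wrapping_add_i32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}
```

With these helpers, rotl32(0x80000001u, 1) gives 0x00000003, and on GCC and Clang wrapping_add_i32(INT32_MAX, 1) gives INT32_MIN. Compilers commonly recognize the rotate idiom and emit a single rotate instruction, but that is an optimization rather than a language guarantee, which is part of why a dedicated operator is appealing.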

Since this was going to be an incompatible dialect of C, it needed a new name. Somehow Slightly Improved C - or SIC - sounded just right.

Design principles

This was not the first compiler project I dived into, so I had some gut feeling for how to proceed. Still, I checked and compared some existing small hobby C implementations as references.

I quickly noticed that some of them work and some of them simply do not. Most of them are limited in many ways. Many of my example apps just do not compile. There’s a list of features that are not implemented. Or they have implemented something in one way, without supporting the many other possible ways. Some of the code was unhackable, and just touching it horrified me…

My principles were:

  1. KISS - Keep It Simple Stupid
  2. Most of the tutorials focus on scanning, lexing and basic parsing - that is not the hard part
  3. Code generation and complex parsing are the hard part
  4. Follow standard C parsing, unless it causes a conflict with reality
  5. Don’t try to generate native assembly, but target some intermediate representation

I quickly got the first example working. That was just numbers and arithmetic:

$ cat tests/test_0001.sic
1 + 5 + 10 - 4;

My problem was: what to generate out of the parse tree? I thought through multiple possibilities and finally decided to go with LLVM. However, I didn’t take any of the libraries or fancy helpers. I decided to output LLVM IR in text format myself. I have no idea if that was a good or bad decision, but at least it saved me from some external dependencies. It might be that taking a library would have saved me from many miseries, but on the other hand not taking one might have saved me from many as well.
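Writing LLVM IR as text needs nothing more than fprintf. The sketch below is not SIC’s code generator: the node type and the function names are hypothetical, and unlike the real output shown later it uses integer literals directly as operands instead of storing them to globals first. It only illustrates the idea of walking a parse tree and printing one instruction per operator.

```c
#include <stdio.h>

/* Hypothetical node type for this sketch; not SIC's actual data structure. */
enum node_kind { NODE_INT, NODE_ADD, NODE_SUB };

struct node {
    enum node_kind kind;
    int value;               /* used by NODE_INT */
    const struct node *lhs;  /* used by NODE_ADD and NODE_SUB */
    const struct node *rhs;
};

/* Emit instructions for n to out (post-order: left, right, operator).
   The text naming the result, either a literal or "%N", is written to
   operand. next_reg holds the next unused temporary number. */
static void emit_expr(const struct node *n, FILE *out, int *next_reg,
                      char *operand, size_t operand_size)
{
    char lhs[32];
    char rhs[32];
    int reg;

    if (n->kind == NODE_INT) {
        snprintf(operand, operand_size, "%d", n->value);
        return;
    }
    emit_expr(n->lhs, out, next_reg, lhs, sizeof lhs);
    emit_expr(n->rhs, out, next_reg, rhs, sizeof rhs);
    reg = (*next_reg)++;
    fprintf(out, "    %%%d = %s i32 %s, %s\n", reg,
            n->kind == NODE_ADD ? "add" : "sub", lhs, rhs);
    snprintf(operand, operand_size, "%%%d", reg);
}

/* Wrap the expression in a main() that returns its value.
   Returns the number of instructions emitted for the expression. */
static int emit_main(const struct node *root, FILE *out)
{
    char result[32];
    int next_reg = 1; /* %0 is implicitly taken by the entry block */

    fprintf(out, "define i32 @main() {\n");
    emit_expr(root, out, &next_reg, result, sizeof result);
    fprintf(out, "    ret i32 %s\n", result);
    fprintf(out, "}\n");
    return next_reg - 1;
}
```

For a tree built as ((1 + 5) + 10) - 4, calling emit_main(root, stdout) prints:

    define i32 @main() {
        %1 = add i32 1, 5
        %2 = add i32 %1, 10
        %3 = sub i32 %2, 4
        ret i32 %3
    }

The one detail that is easy to get wrong by hand is numbering: unnamed temporaries have to be defined in strictly increasing order, which is why the counter is only bumped at the moment an instruction is printed.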

Can it do something?

I have a parser, an LLVM code generator, and scripts that call llvm-as and llc and finally run the linking phase. I can get a native app compiled and run it.

The first example:

$ build/sic --dump-tree tests/test_0001.sic -o build/test_0001.sic.ir
> LIST
L   -
L     +
L       +
L         INT: 1, 0 bits unsigned
R         INT: 5, 0 bits unsigned
R       INT: 10, 0 bits unsigned
R     INT: 4, 0 bits unsigned

$ cat build/test_0001.sic.ir
; Init - __global_context
; Int literal: 1
@G1048576 = global i32 1, align 4
; Int literal: 5
@G1048577 = global i32 5, align 4
; Int literal: 10
@G1048578 = global i32 10, align 4
; Int literal: 4
@G1048579 = global i32 4, align 4

; Pre - __global_context
define dso_local i32 @__global_context() #0 {

    ; Data - __global_context
    store i32 1, i32* @G1048576, align 4 ; store_int 1
    store i32 5, i32* @G1048577, align 4 ; store_int 5
    %1 = load i32, i32* @G1048576, align 4; gen_load_int
    %2 = load i32, i32* @G1048577, align 4; gen_load_int
    %3 = add i32 %1, %2
    store i32 10, i32* @G1048578, align 4 ; store_int 10
    %4 = load i32, i32* @G1048578, align 4; gen_load_int
    %5 = add i32 %3, %4
    store i32 4, i32* @G1048579, align 4 ; store_int 4
    %6 = load i32, i32* @G1048579, align 4; gen_load_int
    %7 = sub i32 %5, %6

    ; Post - __global_context
    ; F2 int, 32 bits, ptr 0, signed, 2
    ret i32 %7 ; RET1
}

; Pre - main
define dso_local i32 @main() #0 {

    ; Init - main

    ; Data - main
    %1 = call i32 @__global_context()
    ret i32 %1 ; faked

    ; Post - main
}

$ llvm-as build/test_0001.sic.ir
$ llc -O0 -relocation-model=pic -filetype=obj build/test_0001.sic.ir.bc -o build/test_0001.ir.o
$ cc build/test_0001.ir.o -o build/test_0001.ir.bin -lm
$ build/test_0001.ir.bin ; echo $? || true
12

So it works, at least on some level. But it is not optimal or fancy.

The struggle

I was stubborn and decided to go forward with it. I implemented support for different C features one after another, rewrote many parts, had to fix the parser, and got more complex features working. It grew and grew.

I created a monster.

For now I have 72 test apps, all of which pass. I felt that the further I went, the more complex things there were to support. I’m not kidding, one of the most complex tests is this one:

#include <stdio.h>

int main()
{
    printf("Hello world\n");
    return 0;
}

Most people would just say: but isn’t that the first thing you do? Yes, it is. And that’s one of the problems. Most toy compilers just implement their own simple libc as well, or at least provide their own headers. This simplifies the work a lot, since the headers provided by a full-blown libc are complex.

But no, I wanted to use whatever is available in the system. For me it was glibc, and that stdio.h comes from the system. I call the system’s cpp (gcc) to do the preprocessing, and then compile the result with sic. It works with a couple of tricks. But getting there required me to implement many obscure features I would not otherwise have implemented yet.
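Which features those were is not listed here, so the following is only an assumption about the kind of thing involved. Preprocessed glibc and GCC headers are full of GNU spellings that are not in the C standard, such as __extension__, __restrict, __attribute__ lists and __builtin_va_list. The sketch below is written for illustration and is not copied from any header; all demo_ names are made up. A compiler has to at least parse constructs like these before it ever reaches the user’s main.

```c
/* __extension__ marks the following declaration as an accepted GNU
   extension so that pedantic warnings are suppressed. */
__extension__ typedef long long int demo_int64;

/* A declaration carrying the __restrict qualifier and a trailing
   __attribute__ list, in the style system headers declare functions. */
extern int demo_length(const char *__restrict text)
    __attribute__((__nothrow__));

int demo_length(const char *__restrict text)
{
    int n = 0;
    while (text[n] != '\0')
        n++;
    return n;
}

/* Variadic functions such as printf rely on a va_list type that is
   built into the compiler rather than defined in any header. */
typedef __builtin_va_list demo_va_list;

/* Sum `count` int arguments using the compiler built-ins directly,
   which is what the va_start/va_arg/va_end macros expand to on GCC. */
static demo_int64 demo_sum(int count, ...)
{
    demo_va_list ap;
    demo_int64 total = 0;
    int i;

    __builtin_va_start(ap, count);
    for (i = 0; i < count; i++)
        total += __builtin_va_arg(ap, int);
    __builtin_va_end(ap);
    return total;
}
```

This compiles with GCC and Clang. A toy compiler that ships its own headers never sees any of it, which is the shortcut described above.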

Of course I have some more complex and longer test cases, and support for more than just “hello world”. In those I’m using arrays, type conversions, structs, unions, access to values in multi-level structs/unions, sizeof, function calls with different parameter types, for loops, while loops, do-while loops, conditions, the ternary operator, typedef, etc.

I have almost created a working C compiler.

I would claim it’s better than most of the tiny simple example compilers. It can compile real example apps, and thanks to LLVM the binary can be quite nicely optimized. It even has some of the non-standard features I planned, like initializing all the values.

But it’s still not complete. There are basic things that are NOT working. Things that are hard to get working. One difficult thing is function typedefs and passing them as parameters. I have an urge to rewrite the type system, or at least improve it a lot. Some things are just a big mess, like pointers and how they are handled. There are a lot of hardcoded things and exceptions.
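For readers who have not met function typedefs, here is a small standard C example of the forms a compiler has to handle. The names are made up for illustration. The tricky part for a type system is that the typedef names a function type, not a pointer, yet a parameter declared with that type is silently adjusted to a pointer.

```c
/* A typedef naming a function type (not a pointer to a function). */
typedef int binop_fn(int, int);

static int add(int a, int b) { return a + b; }
static int sub(int a, int b) { return a - b; }

/* The typedef can declare a function, but not define one. */
binop_fn mul;
int mul(int a, int b) { return a * b; }

/* Parameter written as an explicit pointer to the function type. */
static int apply(binop_fn *op, int a, int b)
{
    return op(a, b);
}

/* Parameter written with the function type itself. C adjusts it to
   binop_fn *, so this has exactly the same type as apply above. */
static int apply_adjusted(binop_fn op, int a, int b)
{
    return op(a, b);
}
```

Both apply(add, 2, 3) and apply_adjusted(add, 2, 3) return 5, and the function name add decays to a pointer at each call site without any explicit address-of operator.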

As I said, I created a monster. It has its own will. By touching one corner you might break another.

Post mortem

I haven’t done anything on this project for about 5 months now. Getting here took about a year. I had the idea that I’d publish it on GitHub once I got it “good enough”. I never got there.

Now my nice and fancy compiler is gathering dust on my hard drive. I haven’t had the motivation to continue. At the same time I feel it would not be that much work to finish. But would that be lying to myself? What if it never ends?

This is not a fun project anymore.

I never got to the most fun - or interesting - part: improving the C language for real.

  • Did I learn a lot? Yes.
  • Would I do something differently? Yes.
  • Do I regret doing this? No.

Maybe some day I’ll finish it. For now I’m just publishing it as-is without thinking too much: SIC