Writing a C Compiler

Programming

I wrote a small C compiler called wcc!

It's not much at the moment, but it's capable of compiling a small hello world program and I think it's already super neat :) below is the generated assembly (which can be compiled with FASM and linked with the c standard library) because I think the generated code is very interesting. The IR is stack-based which makes the output have a lot of pops and pushes.

format ELF64

public main

extrn printf
main:
  push rbp ; STACK_FRAME
  mov rbp, rsp
  push printf ; PUSH_GLOBAL_LABEL
  push _str_0 ; PUSH_STRING "Hello, World!\n"
  pop rdi ; CALL 1
  pop r11
  mov eax, 0
  call r11
  push rax
  add rsp, 8 ; POP
  push 0 ; PUSH_INT
  pop rax ; RET
  pop rbp
  ret

In the future I plan on making it fully self-hosting which will be super exciting! Then I want to add a new frontend for my own syntax and a slightly different language with slightly different features (I have plans but nothing concrete now, except that I like the name ozone).

I also want to write multiple backends and specifically intend on writing a backend for the uxn system, which will hopefully be self-hosting on there too :) If I can successfully compile the compiler to uxn, it will be the first C compiler to run on the uxn/varvara virtual machine! I also intend on adding some more normal backends, with the most interesting to me being the 6502 CPU, WASM, and LLVM.

Project inspiration and history

When I first started writing this compiler, I just wanted a fun self-hosting c compiler, but then I found the uxn virtual machine as featured on tsoding's livestreams. After discovering this, I started reading about it myself, and ended up coming across the hundred rabbit collective's website, which linked to one member's site, both sites have a ton of very cool information! They seem like really neat people with really good ideals, I'd definitally recomend poking around those sites for a while, especially if you're a programmer nerd like me :) (but even otherwise, my partner who isn't a programmer also found it a fascinating blog). After coming across all this, it inspired me to write some code for it, but I honestly didn't feel like learning concatinative programming. So I searched for a C compiler that targeted uxn, and the closest I found was this project which wasn't self-hosting. I figured it'd be a fun project to try hack that compiler into a shape where it could be self-hosting. The original project was, at the time of the fork, already self-hosting capable, and since then many more features had been added. After much tinkering, I got the whole thing to compile itself! Of course, I wasn't sure whether or not it would work once ran but at least I had assembly. But when I assembled it, I got a paculiar error: the assembly generated was too large to fit in memory! The uxn-based varvara system has 64K of generic memory, like the 6502's address range, but somehow the compiler didn't fit into that space. I'm not exactly sure if the assembler just choked on a label or it actually had 65,536 instructions, but either way no compiler would accept the input. It was at that point I realized I would likely need to write my own C compiler to get it to work.

Luckily, I already had one! Well, all I had was a lexer and an expression parser, but that was a good enough start to where I could start hacking together some other pieces. Of course, with this new motivation I had to change up my approach slightly. While before I was allocating memory for expressions and sub-expressions, my new code allocates memory very sparingly, with the goal of storing as little as possible during compilation. This also means that my compiler compiles code as it reads it! I think this is a super cool property that contemporary C compilers haven't had for a while now. One interesting challenge that came with this approach was parsing. To parse a C file, the compiler generally converts the linear source file into a tree structure, which is then stored and walked to generate the final code. I wanted to be able to have multiple backends for the compiler, so I didn't want to generate code as I parsed, but I also wanted to avoid storign this large tree. The approach I settled on uses a lil structure with "visitor" functions that are called when the parser enters and exits different types of AST node. It's honestly remarkibly simple :) The one issue I encountered is that due to the way that expressions are parsed, it would be effectively impossible to call expression_enter for each expression before parsing the left half of a binary expression. This is because when parsing the left half, we do not yet no how many different operators will follow. Consider the snipets 1 + 2 and 1 + 2 + 3, from the perspective of the parser, they are identical up until the moment the second plus is encountered. Thankfully, it's not neccesary to generate function calls when entering expressions, only when exiting them, so I added a new visitor function and all was well. I also chose to implement the compiler with an IR, even though the IR is never stored, I found it useful for both debugging and adding multiple backends to the compiler.