Numbers, Unary Operations, Variables

Lets Write a Compiler!

Our goal is to write a compiler which is a function:

In 131 TargetProgram is going to be a binary executable.

Lets write our first Compilers

SourceProgram will be a sequence of four tiny “languages”

  1. Numbers
  • e.g. 7, 12, 42
  1. Numbers + Increment
  • e.g. add1(7), add1(add1(12)), …
  1. Numbers + Increment + Decrement
  • e.g. add1(7), add1(add1(12)), sub1(add1(42))
  1. Numbers + Increment + Decrement + Local Variables
  • e.g. let x = add1(7), y = add1(x) in add1(y)

Recall: What does a Compiler look like?

Compiler Pipeline
Compiler Pipeline

An input source program is converted to an executable binary in many stages:

  • Parsed into a data structure called an Abstract Syntax Tree
  • Checked to make sure code is well-formed (and well-typed)
  • Simplified into some convenient Intermediate Representation
  • Optimized into (equivalent) but faster program
  • Generated into assembly x86
  • Linked against a run-time (usually written in C)

Simplified Pipeline

Goal: Compile source into executable that, when run, prints the result of evaluating the source.

Approach: Lets figure out how to write

  1. A compiler from the input string into assembly,
  2. A run-time that will let us do the printing.
Simplified Compiler Pipeline with Runtime
Simplified Compiler Pipeline with Runtime

Next, lets see how to do (1) and (2) using our sequence of adder languages.

Adder-1

  1. Numbers
  • e.g. 7, 12, 42

The “Run-time”

Lets work backwards and start with the run-time.

Here’s what it looks like as a C program main.c

  • main just calls our_code and prints its return value,
  • our_code is (to be) implemented in assembly,
    • Starting at label our_code_label,
    • With the desired return value stored in register EAX
    • per, the C calling convention

Test Systems in Isolation

Key idea in SW-Engg:

Decouple systems so you can test one component without (even implementing) another.

Lets test our “run-time” without even building the compiler.

Testing the Runtime: A Really Simple Example

Given a SourceProgram

42

We want to compile the above into an assembly file forty_two.s that looks like:

For now, lets just

  • write that file by hand, and test to ensure
  • object-generation and then
  • linking works

On a Mac use -f macho instead of -f aout

We can now run it:

Hooray!

The “Compiler”

Recall, that compilers were invented to avoid writing assembly by hand

First Step: Types

To go from source to assembly, we must do:

Simplified Pipeline
Simplified Pipeline

Our first step will be to model the problem domain using types.

Simplified Pipeline with Types
Simplified Pipeline with Types

Lets create types that represent each intermediate value:

  • Text for the raw input source
  • Expr for the AST
  • Asm for the output x86 assembly

Defining the Types: Text

Text is raw strings, i.e. sequences of characters

Defining the Types: Expr

We convert the Text into a tree-structure defined by the datatype

Note: As we add features to our language, we will keep adding cases to Expr.

Defining the Types: Asm

Lets also do this gradually as the x86 instruction set is HUGE!

Recall, we need to represent

An Asm program is a list of instructions each of which can:

  • Create a Label, or
  • Move a Arg into a Register
  • Return back to the run-time.

Where we have

Second Step: Transforms

Ok, now we just need to write the functions:

Pretty straightforward:

Where instr is a Text representation of each Instruction

Brief digression: Typeclasses

Note that above we have four separate functions that crunch different types to the Text representation of x86 assembly:

Remembering names is hard.

We can write an overloaded function, and let the compiler figure out the correct implementation from the type, using Typeclasses.

The following defines an interface for all those types a that can be converted to x86 assembly:

class ToX86 a where
  asm :: a -> Text

Now, to overload, we say that each of the types Asm, Instruction, Arg and Register implements or has an instance of ToX86

Note in each case above, the compiler figures out the correct implementation, from the types…

Adder-2

Well that was easy! Lets beef up the language!

  1. Numbers + Increment
  • e.g. add1(7), add1(add1(12)), …

Repeat our Recipe

  1. Build intuition with examples,
  2. Model problem with types,
  3. Implement compiler via type-transforming-functions,
  4. Validate compiler via tests.

1. Examples

First, lets look at some examples.

Example 1

How should we compile?

add1(7)

In English

  1. Move 7 into the eax register
  2. Add 1 to the contents of eax

In ASM

Aha, note that add is a new kind of Instruction

Example 2

How should we compile

add1(add1(12))

In English

  1. Move 12 into the eax register
  2. Add 1 to the contents of eax
  3. Add 1 to the contents of eax

In ASM

Compositional Code Generation

Note correspondence between sub-expressions of source and assembly

Compositional Compilation
Compositional Compilation

We will write compiler in compositional manner

  • Generating Asm for each sub-expression (AST subtree) independently,
  • Generating Asm for super-expression, assuming the value of sub-expression is in EAX

2. Types

Next, lets extend the types to incorporate new language features

Extend Type for Source and Assembly

Source Expressions

Assembly Instructions

Examples Revisited

3. Transforms

Now lets go back and suitably extend the transforms:

Lets do the easy bits first, namely parse and asm

Parse

Asm

To update asm just need to handle case for IAdd

Note

  1. GHC will tell you exactly which functions need to be extended (Types, FTW!)
  2. We will not discuss parse and asm any more…

Compile

Finally, the key step is

Examples Revisited

Lets check that compile behaves as desired:

Adder-3

You do it!

  1. Numbers + Increment + Double
  • e.g. add1(7), twice(add1(12)), twice(twice(add1(42)))

Adder-4

  1. Numbers + Increment + Decrement + Local Variables
  • e.g. let x = add1(7), y = add1(x) in add1(y)

Local Variables

Local variables make things more interesting

Repeat our Recipe

  1. Build intuition with examples,
  2. Model problem with types,
  3. Implement compiler via type-transforming-functions,
  4. Validate compiler via tests.

Step 1: Examples

Lets look at some examples

Example: let1

Need to store 1 variable – x

Example: let2

Need to store 3 variable – x, y, z

Example: let3

Need to store 3 variables – a, b, c – but at most 2 at a time

  • First a, b, then a, c
  • Don’t need b and c simultaneously

Registers are Not Enough

A single register eax is useless:

  • May need 2 or 3 or 4 or 5 … values.

There is only a fixed number (say, N) of registers:

  • And our programs may need to store more than N values, so
  • Need to dig for more storage space!

Memory: Code, Globals, Heap and Stack

Here’s what the memory – i.e. storage – looks like:

Memory Layout
Memory Layout

Focusing on “The Stack”

Lets zoom into the stack region, which when we start looks like this:

Stack Layout
Stack Layout

The stack grows downward (i.e. to smaller addresses)

We have lots of 4-byte slots on the stack at offsets from the “stack pointer” at addresses:

  • [EBP - 4 * 1], [EBP - 4 * 2], …,

How to compute mapping from variables to slots ?

The i-th stack-variable lives at address [EBP - 4 * i]

Required A mapping

  • From source variables (x, y, z …)
  • To stack positions (1, 2, 3 …)

Solution The structure of the lets is stack-like too…

  • Maintain an Env that maps Id |-> StackPosition
  • let x = e1 in e2 adds x |-> i to Env
    • where i is current height of stack.

Example: Let-bindings and Stacks

QUIZ

At what position on the stack do we store variable c ?

A. 1 B. 2 C. 3 D. 4 E. not on stack!

haskell "ENVIRONMENT" -- [] let a = 1 -- [a |-> 1] , b = 2 -- [b |-> 2, a |-> 1] , c = -- let b = add1(a) -- [b |-> 3, b -> 2, a |-> 1] in add1(b) -- -- [b |-> 2, a |-> 1] in add1(b)

Strategy

At each point, we have env that maps (previously defined) Id to StackPosition

Variable Use

To compile x given env

  1. Move [ebp - 4 * i] into eax

(where env maps x |-> i)

Variable Definition

To compile let x = e1 in e2 we

  1. Compile e1 using env (i.e. resulting value will be stored in eax)
  2. Move eax into [ebp - 4 * i]
  3. Compile e2 using env'

(where env' be env with x |-> i i.e. push x onto env at position i)

Example: Let-bindings to Asm

Lets see how our strategy works by example:

Example: let1

Convert let1 to Assembly
Convert let1 to Assembly

QUIZ: let2

When we compile

The assembly looks like

What .asm instructions shall we fill in for ???

Example: let3

Lets compile

Lets figure out what the assembly looks like!

Step 2: Types

Now, we’re ready to move to the implementation!

Lets extend the types for Source Expressions

Lets enrich the Instruction to include the register-offset [ebp - 4*i]

Environments

Lets create a new Env type to track stack-positions of variables

API:

  • Push variable onto Env (returning its position),
  • Lookup variable’s position in Env

Step 3: Transforms

Ok, now we’re almost done. Just add the code formalizing the above strategy

Code

Variable Use

Variable Definition

Step 4: Tests

Lets take our adder compiler out for a spin!

Recap: We just wrote our first Compilers

SourceProgram will be a sequence of four tiny “languages”

  1. Numbers
  • e.g. 7, 12, 42
  1. Numbers + Increment
  • e.g. add1(7), add1(add1(12)), …
  1. Numbers + Increment + Decrement
  • e.g. add1(7), add1(add1(12)), sub1(add1(42))
  1. Numbers + Increment + Decrement + Local Variables
  • e.g. let x = add1(7), y = add1(x) in add1(y)

Using a Recipe

  1. Build intuition with examples,
  2. Model problem with types,
  3. Implement compiler via type-transforming-functions,
  4. Validate compiler via tests.

Will iterate on this till we have a pretty kick-ass language.