This post is about compiler design, it’s phases and it’s working :

Let’s have a look on simple C program of adding two numbers.

int main() {
  int a = 5;
  int b = 3;
  int c = a + b;
  printf("Sum = %d\n",c);
  return 0;

Now this C program what we call as “High level language program“. It can be C. java. python and so on. Why they called as high level language because they are written in more abstract manner and we can directly put our thoughts directly into them. Say if you want to add two number one is in float type and another is in int type, so what you do, you just write your thoughts like

float a = 3.25;int b = 10;

A normal computer cannot understand that high level language. I it can understand is called as “Machine Code“. It is damm difficult to write our thought in Machine Code which is usually in 0’s and 1’s. There is an intermediate which is one level up to machine code which we call it as “Assembly level language”.

Compiler :

Compiler is nothing, it is a software or a program whose input is a high level program and output can be an Assembly Level Program then assembler convert it into Machine Code or it can directly give us what we call as Machine Code. In short we can say that “Compiler is a program that can convert hight level language into machine code.

We can now say that machine code is much faster than assembly level program and assembly level program is much faster than high level program.

Machine Code > Assembly level program > High level program

What functionality should present in compiler :

  1. Correctness :
    • Correct output on execution.
    • Report error correctly if the programmer is not following the language syntax.
  2. Efficiency :
    • How efficiently the compiler is generating the code.
    • How efficient the generated code is.
  3. Debugging/Usability
    if you write say : floatt a = 5;
    undefined variable ” floatt “. It should tell that, that thing is undefined and line number where possible error should be and file name.

Stages of compiler

Front-end of compiler :

  1. Lexical Analysis :
    It act like a scanner, which means when we write any high level code and when it move through this phase it read the file a text stream and tries to convert it into tokens. Rules for defining these tokens are predefined for a given language. If a say
    float a = 5;
    For this statement

    " Float = Keyword "
    " a = indetifier / Variable "
    " = = operator / Assignment operator "
    " 5 = Decimal Number "
    " ; = Delimiter. It can be new line, can be comma, semi-colon and so "
  2. Syntax Analysis :
    After lexical analyser next stage comes syntax analysis. We get list of tokens from lexical analyser. So syntax analyser (also called parser) accept that tokens.  What it does it tires to follow the rules of given language / grammar and tries to create abstract syntax tree or parse tree.What is Grammar 
    Grammar is the set of non-terminals, terminals, rules or productions and starting of non-terminal .
    G = (V, T, P,S)
    Graph start from S i.e starting of non-terminal then comes set of non-terminals then the leaf nodes are set of terminals.
    Syntax checker tries to read the rules of grammar and tries to create parse tree. It also removes ambiguity by following the association and presidency rule.
  3. Semantic Analysis :
    Also called semantic checker. This phase of the compiler tries to follows the semantic rules of given language. If I say that  :
    var a;
    If a variable is used but it is not defined then it is called as semantic error. Synatically it is correct like x = 5; but if x is not defined then it is semantic error.
    or if I create

    int function(int);
    int function (int) {
    char c;
    return c;

    In this i’m creating a function which has to return a int value but i’m returning char value, which is semantically wrong. Whether I have to change char to int or another way is to type case it.

Back-end of compiler :

  1. Intermediate code generator :
    Intermediate code is a machine independent code which is more close to machine instruction. Different kind of compiler use different type of ICG.
    Some people use :

    • AST
    • Control and data flow graph
    • Three address codeMost widely used is three address code. How three address code looks like say if I have
      c = a +b
      then three address code should be : ” add a b c “. It has three variable a and b, three variable where it stores the result. And add is the operator. So it also know as quadruple.Some languages like java has its own intermediate code generator capability. Like java has specific kind of code called Byte code. What it do exactly, we install JVM or java virtual machines which take byte code as input and runs the program.
  2. Register Allocation :
    Suppose I have intermediate representation like this

    add a   b   c
    sub c   b   d
    add b   a   c

    Now problem comes out how to store value in registers. As there are limited number of register in our computer. So we use specific algorithm, one of the algorithm called as Graph colouring algorithm.

    Graph colouring algorithm :-

    If I have graph with three vertices say V1, V2, V3 and arranged them in the triangle shape such that V1 is adjacent to V2 and V2 is adjacent to V3 and V3 to V1. So what is states that colour the vertices of a graph such that no two adjacent vertices are of same colour.

  3. Machine code generation :
    Given the intermediate code we use the translator that takes input as intermediate code and output is assembly level code. This assembly level code is machine dependent. Job of the compiler is to decide the architecture  ( whether unix, windows or else ) then generates the assembly level code.
  4. Assembly and Linking :
    Finally, after creating assembly level code. Assembler and linker takes input as assembly code and generates the binary representation for execution. That is why we called final out as binary.