Week-5: Compilers: Lab

FYI: The Environmental Impact Per Page ≈ 10.2 L water, 2 g CO2 and 2 g wood.

The environmental effects of paper production include deforestation, the use of enormous amounts of energy and water as well as air pollution and waste problems. Paper accounts for around 26% of total waste at landfills.

Therefore, please print only if this is really necessary.

Introduction

The Compilers lab is designed for to experience different programming languages (C and Python), how they compile and subsequent inputs and outputs. You will develop basic understanding and appreciation of how high level code compiles down to machine code.

INFORMATION

Throughout this lab you will can refer to documentation for the source code and the compiled code:

RECAP

Registers

A CPU register is a small ( a word-sized storage unit), fast storage location within the central processing unit (CPU) of a computer. These registers are used to store temporary data, intermediate results, and memory addresses during program execution. CPU registers play a crucial role in the processing of instructions and data by providing quick and direct access to the CPU. For example, the Intel CPU has a total of 16 registers for storing 64-bit data: %rsp, %rbp, %rax, %rbx, %rcx, %rdx, %rdi, %rsi, %rsp, %rbp ..........

  • stack: stores local variables
  • heap: dynamic memory for programmer to allocate
  • data: stores global variables, separated into initialized and uninitialised
  • text: stores the code being executed

Task-1: Produce Assembely(ASM) code (30 mins)

There are two ways to produce ASM code as follows:

A. Using GCC Compiler:

  1. Open this https://classroom.github.com/a/yb9UqZlV, accept the task, and then open the accepted repo using Codespaces. (More info on how to do that in Weeks 1,3, & 4 labs (revisit your previous labs))

  2. Create a file named lab.c and open it in a text editor. Add the following content:

    #include <stdio.h>
    int main() {
        printf("Hello, World!\n");
        return 0;
    }
    
  3. Generate the corresponding assembly code using gcc. Open a terminal and navigate to the directory containing lab.c. Run the following command to compile the C code:

    gcc -O -S -fomit-frame-pointer lab.c
    

    This command tells gcc to optimize (-O), generate assembly code (-S), and omit frame pointers (-fomit-frame-pointer) to simplify the output slightly. It creates an assembly code file named lab.s.

    • Now, view the ASM code by either opening the file lab.s, or using the cat command in the terminal:
    cat lab.s 
    

    FYI: The cat command is a command-line utility in Unix and Unix-like operating systems. It is short for "concatenate" and is primarily used for displaying the contents of files.

  4. Now, you can review the assembly code.

B. Using Godbolt Simulator

The procedure above contains a lot of information, and some details might be confusing. To simplify the process, we recommend using Godbolt. This interactive compiler exploration website supports various programming languages and allows you to instantly visualise the compiled code in real-time.

Note: Throughout the rest of the lab you will refer to documentation of C and Python (on your own time) for the soruce code and the compiled code.

Task-2: Getting started with the Simulator (30 mins)

Go to https://godbolt.org/ to access a blank script in C.

Familiarise yourself with the interface, and nce you have loaded this you should see three windows like shown below.

  • The first window is where you select your language code type in your source code,
  • The second window shows you the complied source code, to the compiler of your choice,
  • And the third window shows output to console/terminal.

Compiler Explorer IDE

C is one the of foundational code languages upon which many other languages draw lineage from. C like C# uses the same keywords and you will see common syntax. For those who have not experienced C# do not worry, we are going to go step by step.

In order for any program to run C must have a main() function, for this is the entry point for script. All code gets executed in the main, you can create other functions outside of main() and call them inside.

Now you will need to write in the following code as seen below into the first window of Compiler Explorer that says C Source.

int main()
{
    return 0;
}

A quick break down of what we are seeing here:

  • Line 1 has the functions name and a returnable keyword, int. This means that when main() executes and reaches the end of the script it will return an integer to indicate it has finished to the hosts environment.
  • Line 3 introduces the keyword return which will return a value, in this ase 0, an int
  • Lines 2 an 4 have the identifiers { }, this encapsulates the code into a block. The compiler needs this syntax to understand where a block of code starts and finishes.

As you were entering the source code into the window 1, you will have seen the second window compiling the code in real-time. Your assembly code should look like below.

main:
  pushq %rbp
  movq %rsp, %rbp
  movl $0, %eax
  popq %rbp
  ret

The code looks very similar to that of the lecture, keywords/instructions; pushq, movq, movl, popq and ret. So nothing should be new here, even the commands to left of the keywords are very same. However, the follwing what each one does:

  • pushq %rbp: This instruction pushes the value of the base pointer (%rbp) onto the stack. This is a common practice at the beginning of a function to save the current stack frame. The base pointer is later used to access local variables and parameters within the function. In Simple language, imagine you have a stack of papers (the computer's memory). This instruction is like putting a sticky note (base pointer) on top of the current page to remember where you are.

  • movq %rsp, %rbp: This instruction moves the value of the stack pointer (%rsp) into the base pointer (%rbp). It establishes a new frame pointer for the current function. In simple language, Now, you're moving the sticky note to the current page. This helps you keep track of where your data is in the stack

  • movl $0, %eax: This instruction moves the immediate value 0 into the %eax register. In the context of a main function, setting %eax to 0 often represents a successful execution where the program is indicating it has finished without errors. %eax is a general-purpose register commonly used to store function return values in x86-64 assembly.

  • popq %rbp: This instruction pops the previously saved base pointer value from the stack, restoring the previous frame pointer. It is part of the function epilogue, responsible for cleaning up the stack frame before returning from the function. In simple language,you're removing the sticky note from the top of the page and putting it back where it was. This means you're done with the current task and going back to what you were doing before.

  • ret: This instruction performs the function return. It pops the return address from the stack (which was pushed during the call to the function) and transfers control back to that address, effectively returning from the main function. In simple language, this is like closing a book and going back to the person who asked you to do something. It tells the computer to finish what it was doing and go back to where it was called from, using the value you wrote on the paper (0 in this case) as the result.

  • And the ending letter l (longs) and q (quadword) denotes a 32-bit or 64-bit (64 bits (8 bytes) number respectively.

    rbp register is for frame pointer: It is used to access the area of the stack that belongs to the current function, including local variables and arguments passed into the current function.

    rsp register is for stack pointer. It is used to create a new stack frame for a function about to be called, including storing the return address and passing arguments beyond the first six.

Task-3: Now lets see when we add two integers together.(30 mins)

Edit the source code the look like the following:

int main()
{
    int a = 5;
    int b = 5;
    int sum = a + b;
    return 0;
}

Remember just like C#, we declare the data we want our variables to be in this instance int on lines 3 to 5. C is strong typed. We are provide the keyword with an identifier, a, b, and sum and those identifiers with the values 5 & 5 and the result of 5 + 5.

Now we can see that the compiled code has changed and should look like below:

main:
  pushq %rbp
  movq %rsp, %rbp
  movl $5, -4(%rbp)
  movl $5, -8(%rbp)
  movl -4(%rbp), %edx
  movl -8(%rbp), %eax
  addl %edx, %eax
  movl %eax, -12(%rbp)
  movl $0, %eax
  popq %rbp
  ret

So now we have an extra 6 lines of code to examine,YAYYYY specifically lines 4 to 10.

Firstly, looking at lines 4 and 5, we can see that the value 5 is stored in memory addresses that are 4 and 8 bytes away from the %rbp, recall from the lecture about memory spacing.

  movl $5, -4(%rbp)
  movl $5, -8(%rbp)

Next we can see that lines 6 and 7 reference the addresses of the two values by accessing the addresses located 4 and 8 bytes away from the %rdp and get ready to perform artimetic operations by storing using %edx and %eax.

  movl -4(%rbp), %edx
  movl -8(%rbp), %eax

Ok so finally we can see that lines 8, 9 and 10 deal with summation of the variables a and b.

  addl %edx, %eax
  movl %eax, -12(%rbp)
  movl $0, %eax

Notice we have a new keyword, addl, we are now adding the two values stored in the addresses pointed to by %edx and %eax.

The code then tempoarily stores the result in %eax and moves it memory address 12 bytes away from the %rdp. Finally, %eax is zeroed off so it does not point to anything.


Task-4: CALL TO ACTION (30 mins)

So you add different ways and this can change the output of the compiler. Modify the code so that it looks like the various examples shown below and note the differences in the compiled code:

int main()
{
    int a = 5;
    int sum = a + 5;
    return 0;
}
int main()
{
    int sum = 5 + 5;
    return 0;
}

Lets do something that includes window 3, the console/terminal. As long as you code has been compiling fine the output displayed should be...

Program returned: 0

So what about if we want to return the result of the summation we have been doing to the console/terminal? Make your source code look like the below.

#include <stdio.h>
int main()
{
    int sum = 5 + 5;
    printf("%d = 5 + 5", sum);
    return 0;
}

Ok so lets look at the first line, this is new, in order to output information generated by the programmer to the terminal, C needs to "include" a library of functions/code/operations. So we will include 'Standard Input Output' library.

#include <stdio.h>

The file extension for this library is .h, in C and C++ libraries and imported code are stored in header files, hence the .h.

If you don't include this library with the source code you will see the error, like below, in window three and if you read the message you are told to include the missing library.

Missing Library

Lets look at line 6, we can see we are using a new function that comes from the stdio.h library, printf().

printf("%d = 5 + 5",sum);

So breaking this down we can see that printf() function currently has two arguments or inputs.

Firstly, we have a string, this is a data type that lets you store a sequence of characters.

The second argument is the identifier sum, remember this is the result of the summation the script performs.

Finally, we have a format specifier in the string %d this means at this position in the string you are promising to provide an variable, in this case an integer(d), sum.

You can see more about format specifiers here, https://www.tutorialspoint.com/format-specifiers-in-c.

Now if we look in the compiler window we can see more code as expected, and certainly some new things.

.LC0:
  .string "%d = 5 + 5"
main:
  pushq %rbp
  movq %rsp, %rbp
  subq $16, %rsp
  movl $10, -4(%rbp)
  movl -4(%rbp), %eax
  movl %eax, %esi
  movl $.LC0, %edi
  movl $0, %eax
  call printf
  movl $0, %eax
  leave
  ret

Looking over this code, we can see 5 new lines. The first line .LC0: means local constant, e.g string literal, constants are values that cannot be changed. Line 2, shows string which means here is a string from printf()...

.LC0:
  .string "%d = 5 + 5"

The next new line is 6, reserves 16 bytes of memory for local variables, we have a string now.

  subq $16, %rsp

The fourth new line,9, points the address of the result of the sum to %esi. %esi links to the printf() function where it passes the value of this address into %d format specifier.

movl %eax, %esi

The fifth new line is line 10, where the local constant .LC0 address is pointed to bb %edi.

  movl $.LC0, %edi

Finally, line 12 use a keyword call to the function printf, that is linked to call printf

  call printf

Task-5: CALL TO ACTION (20 mins)

Modify the code to increment the sum by itself and print to console:

#include <stdio.h>
int main()
{
    int sum = 5 + 5;
    printf("%d = 5 + 5\n",sum);
    int sumsum = sum+sum;
    printf("%d = %d + %d\n",sumsum,sum,sum);
    return 0;
}

Did you know that if you hover over the lines in the source code related compiler code is highlighted.

What differences can you see, in the compiled code?

Task-6: Let's enhance our C program by adding a simple loop.(40 mins)

  • Modify (or create a new one) the code in the Godbolt Simulator to use a loop to calculate the sum of numbers from 1 to 5.

    Try to write the code yourself; it should be very straightforward.

    Otherwise, see the code below:

    Click for Expected Output

    #include <stdio.h>
    int main() {
        int sum = 0;
        for (int i = 1; i <= 5; ++i) {
            sum += i;
        }
        printf("Sum: %d\n", sum);
        return 0;
    }
    
  • After compiling it using Godbolt Compiler, you should be able to get the following:

    .LC0:
            .string "%d\n"
    main:
            push    rbp
            mov     rbp, rsp
            sub     rsp, 16
            mov     DWORD PTR [rbp-4], 0
            mov     DWORD PTR [rbp-8], 1
            jmp     .L2
    .L3:
            mov     eax, DWORD PTR [rbp-8]
            add     DWORD PTR [rbp-4], eax
            add     DWORD PTR [rbp-8], 1
    .L2:
            cmp     DWORD PTR [rbp-8], 5
            jle     .L3
            mov     eax, DWORD PTR [rbp-4]
            mov     esi, eax
            mov     edi, OFFSET FLAT:.LC0
            mov     eax, 0
            call    printf
            mov     eax, 0
            leave
            ret
    

Now let's break into section and try to understand the whole ASM code.

  • *String Constant Section (.LC0)

    .LC0:
    .string "%d\n"
    

    This section defines a string constant that will be used as the format specifier for printf. The %d serves as a placeholder for an integer value, and \n represents a newline character."

  • Main Function Section (main):

    main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
    
  • What do these parameters represent? You should know this by now.

    Check your understanding and compare it to this.

    Click for Expected Output

    save the base pointer (rbp), set the base pointer to the current stack pointer (rsp), and allocate space on the stack for local variables.
  • How about this:, any idea what is DWORD is ?

        mov     DWORD PTR [rbp-4], 0
        mov     DWORD PTR [rbp-8], 1
    
    • Initialize local variables: int sum = 0 (DWORD PTR [rbp-4]) and int i = 1 (DWORD PTR [rbp-8]).

    • DWORD stands for Double Word, and it is a data type used in computing. In the context of x86 and x86-64 architectures, a DWORD is a 32-bit unsigned integer, which means it can represent values from 0 to 4,294,967,295.

  • Now, this: Could you work it out?
    jmp     .L2

  • Now let discuss the loop's body:
.L3:
    mov     eax, DWORD PTR [rbp-8]
    add     DWORD PTR [rbp-4], eax
    add     DWORD PTR [rbp-8], 1
  • DWORD PTR [rbp-4], 0 ; Initialise a variable at [rbp-4] to 0 (accumulated sum
  • mov DWORD PTR [rbp-8], 1 ; Initialize another variable at [rbp-8] to 1 (loop counter)
  • Loop condition
    .L2:
        cmp     DWORD PTR [rbp-8], 5
        jle     .L3
    
    Loop condition: Compare i with 5 and jump to .L3 if less than or equal.
  • After existing the loop

        mov     eax, DWORD PTR [rbp-4]
        mov     esi, eax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
    

After the loop, load the final value of sum into the eax register, set esi to this value, load the address of the string constant into edi, and call printf to print the result.

  • How about this, can you work it out?
    mov     eax, 0
    leave
    ret

Check Answer:

Click for Expected Output

Set the return value to 0, clean up the stack frame, and return from the function.

[Optional] Compiler for Python Script [for those who'e intereted in Python]

Now you have a look two basic operations in C and analysed the compile code, lets look at Python which you may find easier to understand.

Unlike C and other languages, Python doesn't need you to define data types, ints, floats, chars, string etc. and therefore is consider loose typed.

Select Python from the drop down list in the window 1, top right. Or go here...https://godbolt.org/z/7dbb69Gxh

You will probably notice several major differences, and maybe the compiler looks easier to understand?

So we make an easier comparison between C and Python source code and compilers, lets use the same code as before, modify your to look like the below.

Line 1 the keyword,def, defines a function whose name is, main(). Line 2 pass is essientally a placeholder.

def main():
    pass

CALL TO ACTION

Python is particular about formatting, if code belong inside a function, or block then it must be indented by 4 spaces or tab.

If you look at the IDE there are four faint when you highlight the white space between pass and the left of the word.

So lets look at the compiled code for this function.

LOAD_CONST      1 (|'|main|')|
MAKE_FUNCTION   0
STORE_NAME      0 (main)
LOAD_CONST      2 (None)
RETURN_VALUE

So we can see that we have completely different assembly code..., well that isn't a bad thing. Each code language compiles differently. Sometimes these assembly languages are easier to understand as they are written using meaningful words.

Line 1, loads the script as a constant, remember constant is something that isn't allowed to change.

Line 2, takes the name of the first function and loads it in to memory too.

Then the function is made at line 3, MAKE_FUNCTION.

STORE_NAME on line 4, creates a name for the function main so that it the function can be reference later.

Remember pass, was a placeholder, well line 6 shows the nothing is to be loaded.

This is much easier to understand right?

Now lets modify the source code again.

def main():
    a = 5
    b = 5
    c = 5 + 5   

So simple enough, three variables that contain whole numbers. This is the same as our C example earlier, though we are not using the identifier sum as we did in C, because in Python sum is a keyword.

We will now look at the compiled code, but only from line 9, as the previous lines are the same as before.

LOAD_CONST      1 (5)
STORE_FAST      0 (a)

LOAD_CONST      1 (5)
STORE_FAST      1 (b)

LOAD_CONST      2 (10)
STORE_FAST      2 (c)
LOAD_CONST      0 (None)
RETURN_VALUE

Lines 9, 12 and 15 show that the values of the variables are loaded into memory.

Lines 10, 13 and 16 stores the variable name into the top of the stack (memory).

Much like C's movl $0, %eax python's compiler resets the memory pointer.

Okay, so we are almost there, but we haven't sent anything to the terminal yet, modify your code to look like this.

def main():
    a = 5
    b = 5
    c = 5 + 5
    print(c," = ", a, " + ",b)
main()

So only two new lines, 5 and 6.

On line 5 the printf() function takes each variable (a,b,c) separated by string to produce 10 = 5 + 5.

Line 7, is where we call the function main which will cause all of the code to be executed the output appear in the console/terminal (window 3).

Now lets look at the additional code that has appeared in the compile window.

Line 6 to 10 shows how the the function main is called in line 7 of the source code.

LOAD_NAME             0 (main)
CALL_FUNCTION         0
POP_TOP
LOAD_CONST            2 (None)
RETURN_VALUE

You can see that the function is loaded by its name, LOAD_NAME, which was originally stored in memory with STORE_NAME earlier.

The function is then called CALL_FUNCTION and subsequently removed STORE_NAME from the top of the stack (memory).

Godbolt Simulator resources:

On your own time, take a look at the Godbolt Compiler's Github Repository, where you can find documentation and tutorials, etc .