Week-5: Compilers: Lab
FYI: The Environmental Impact
Per Page ≈ 10.2 L water, 2 g CO2 and 2 g wood.
The environmental effects of paper production include deforestation, the use of enormous amounts of energy and water as well as air pollution and waste problems. Paper accounts for around 26%
of total waste at landfills.
Therefore, please print only if this is really necessary.
Introduction
The Compilers lab is designed for to experience different programming languages (C
and Python
), how they compile and subsequent inputs and outputs. You will develop basic understanding and appreciation of how high level code compiles down to machine code.
INFORMATION
Throughout this lab you will can refer to documentation for the source code and the compiled code:
- C -> https://devdocs.io/c/
- GCC -> https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Preface
- Python -> https://blog.finxter.com/python-cheat-sheet/
- Python 3.9 Bytebase -> https://docs.python.org/3/library/dis.html#python_bytecode_instructions
- GNAT -> https://gcc.gnu.org/onlinedocs/gcc-4.7.4/gnat_ugn_unw.pdf
RECAP
Registers
A CPU register is a small ( a word-sized storage unit), fast storage location within the central processing unit (CPU) of a computer. These registers are used to store temporary data, intermediate results, and memory addresses during program execution. CPU registers play a crucial role in the processing of instructions and data by providing quick and direct access to the CPU. For example, the Intel CPU
has a total of 16
registers for storing 64-bit
data: %rsp
, %rbp
, %rax
, %rbx
, %rcx
, %rdx
, %rdi
, %rsi
, %rsp
, %rbp
..........
stack
: stores local variablesheap
: dynamic memory for programmer to allocatedata
: stores global variables, separated into initialized and uninitialisedtext
: stores the code being executed
Task-1: Produce Assembely(ASM) code (30 mins)
There are two ways to produce ASM code as follows:
A. Using GCC Compiler:
-
Open this https://classroom.github.com/a/yb9UqZlV, accept the task, and then open the accepted repo using Codespaces. (More info on how to do that in Weeks 1,3, & 4 labs (revisit your previous labs))
-
Create a file named
lab.c
and open it in a text editor. Add the following content:#include <stdio.h> int main() { printf("Hello, World!\n"); return 0; }
-
Generate the corresponding assembly code using
gcc
. Open a terminal and navigate to the directory containinglab.c
. Run the following command to compile the C code:gcc -O -S -fomit-frame-pointer lab.c
This command tells
gcc
to optimize (-O
), generate assembly code (-S
), and omit frame pointers (-fomit-frame-pointer
) to simplify the output slightly. It creates an assembly code file namedlab.s
.- Now, view the ASM code by either opening the file
lab.s
, or using thecat
command in the terminal:
cat lab.s
FYI: The
cat
command is a command-line utility in Unix and Unix-like operating systems. It is short for "concatenate" and is primarily used for displaying the contents of files. - Now, view the ASM code by either opening the file
-
Now, you can review the assembly code.
B. Using Godbolt Simulator
The procedure above contains a lot of information, and some details might be confusing. To simplify the process, we recommend using Godbolt. This interactive compiler exploration website supports various programming languages and allows you to instantly visualise the compiled code in real-time.
Note: Throughout the rest of the lab you will refer to documentation of
C
andPython
(on your own time) for the soruce code and the compiled code.
Task-2: Getting started with the Simulator (30 mins)
Go to https://godbolt.org/ to access a blank script in C.
Familiarise yourself with the interface, and nce you have loaded this you should see three windows like shown below.
- The
first window
is where you select your language code type in your source code, - The
second window
shows you the complied source code, to the compiler of your choice, - And the
third window
shows output to console/terminal.
C
is one the of foundational code languages upon which many other languages draw lineage from. C
like C#
uses the same keywords and you will see common syntax. For those who have not experienced C# do not worry, we are going to go step by step.
In order for any program to run C
must have a main()
function, for this is the entry point for script. All code gets executed in the main, you can create other functions outside of main()
and call them inside.
Now you will need to write in the following code as seen below into the first window of Compiler Explorer that says C Source.
int main()
{
return 0;
}
A quick break down of what we are seeing here:
- Line 1 has the functions name and a returnable keyword,
int
. This means that whenmain()
executes and reaches the end of the script it will return an integer to indicate it has finished to the hosts environment. - Line 3 introduces the keyword
return
which will return a value, in this ase 0, anint
- Lines 2 an 4 have the identifiers
{ }
, this encapsulates the code into a block. The compiler needs this syntax to understand where a block of code starts and finishes.
As you were entering the source code into the window 1, you will have seen the second window compiling the code in real-time. Your assembly code should look like below.
main:
pushq %rbp
movq %rsp, %rbp
movl $0, %eax
popq %rbp
ret
The code looks very similar to that of the lecture, keywords/instructions; pushq
, movq
, movl
, popq
and ret
. So nothing should be new here, even the commands to left of the keywords are very same. However, the follwing what each one does:
-
pushq %rbp: This instruction pushes the value of the base pointer (
%rbp
) onto the stack. This is a common practice at the beginning of a function to save the current stack frame. The base pointer is later used to access local variables and parameters within the function. In Simple language,imagine you have a stack of papers (the computer's memory). This instruction is like putting a sticky note (base pointer) on top of the current page to remember where you are.
-
movq %rsp, %rbp: This instruction moves the value of the stack pointer
(%rsp)
into the base pointer(%rbp)
. It establishes a new frame pointer for the current function. In simple language,Now, you're moving the sticky note to the current page. This helps you keep track of where your data is in the stack
-
movl $0, %eax: This instruction moves the immediate value 0 into the
%eax
register. In the context of a main function, setting%eax
to 0 often represents a successful execution where the program is indicating it has finished without errors.%eax
is a general-purpose register commonly used to store function return values in x86-64 assembly. -
popq %rbp: This instruction pops the previously saved base pointer value from the stack, restoring the previous frame pointer. It is part of the function epilogue, responsible for cleaning up the stack frame before returning from the function. In simple language,
you're removing the sticky note from the top of the page and putting it back where it was. This means you're done with the current task and going back to what you were doing before.
-
ret:
This instruction performs the function return. It pops the return address from the stack (which was pushed during the call to the function) and transfers control back to that address, effectively returning from the main function. In simple language,this is like closing a book and going back to the person who asked you to do something. It tells the computer to finish what it was doing and go back to where it was called from, using the value you wrote on the paper (0 in this case) as the result.
-
And the ending letter
l
(longs) andq
(quadword) denotes a 32-bit or 64-bit (64 bits (8 bytes) number respectively.rbp
register is for frame pointer: It is used to access the area of the stack that belongs to the current function, including local variables and arguments passed into the current function.rsp
register is for stack pointer. It is used to create a new stack frame for a function about to be called, including storing the return address and passing arguments beyond the first six.
Task-3: Now lets see when we add two integers together.(30 mins)
Edit the source code the look like the following:
int main()
{
int a = 5;
int b = 5;
int sum = a + b;
return 0;
}
Remember just like C#
, we declare the data we want our variables to be in this instance int
on lines 3 to 5. C
is strong typed. We are provide the keyword with an identifier, a
, b
, and sum
and those identifiers with the values 5
& 5
and the result of 5 + 5
.
Now we can see that the compiled code has changed and should look like below:
main:
pushq %rbp
movq %rsp, %rbp
movl $5, -4(%rbp)
movl $5, -8(%rbp)
movl -4(%rbp), %edx
movl -8(%rbp), %eax
addl %edx, %eax
movl %eax, -12(%rbp)
movl $0, %eax
popq %rbp
ret
So now we have an extra 6 lines of code to examine,YAYYYY
specifically lines 4 to 10.
Firstly, looking at lines 4 and 5, we can see that the value 5 is stored in memory addresses that are 4 and 8 bytes away from the %rbp
, recall from the lecture about memory spacing.
movl $5, -4(%rbp)
movl $5, -8(%rbp)
Next we can see that lines 6 and 7 reference the addresses of the two values by accessing the addresses located 4 and 8 bytes away from the %rdp
and get ready to perform artimetic operations by storing using %edx
and %eax
.
movl -4(%rbp), %edx
movl -8(%rbp), %eax
Ok so finally we can see that lines 8, 9 and 10 deal with summation of the variables a
and b
.
addl %edx, %eax
movl %eax, -12(%rbp)
movl $0, %eax
Notice we have a new keyword, addl
, we are now adding the two values stored in the addresses pointed to by %edx
and %eax
.
The code then tempoarily stores the result in %eax
and moves it memory address 12 bytes away from the %rdp
. Finally, %eax
is zeroed off so it does not point to anything.
Task-4: CALL TO ACTION (30 mins)
So you add different ways and this can change the output of the compiler. Modify the code so that it looks like the various examples shown below and note the differences in the compiled code:
int main()
{
int a = 5;
int sum = a + 5;
return 0;
}
int main()
{
int sum = 5 + 5;
return 0;
}
Lets do something that includes window 3, the console/terminal. As long as you code has been compiling fine the output displayed should be...
Program returned: 0
So what about if we want to return the result of the summation we have been doing to the console/terminal? Make your source code look like the below.
#include <stdio.h>
int main()
{
int sum = 5 + 5;
printf("%d = 5 + 5", sum);
return 0;
}
Ok so lets look at the first line, this is new, in order to output information generated by the programmer to the terminal, C needs to "include"
a library of functions/code/operations. So we will include 'Standard Input Output' library.
#include <stdio.h>
The file extension for this library is .h
, in C and C++ libraries and imported code are stored in header files, hence the .h
.
If you don't include this library with the source code you will see the error, like below, in window three and if you read the message you are told to include the missing library.
Lets look at line 6, we can see we are using a new function that comes from the stdio.h
library, printf()
.
printf("%d = 5 + 5",sum);
So breaking this down we can see that printf()
function currently has two arguments or inputs.
Firstly, we have a string
, this is a data type that lets you store a sequence of characters.
The second argument is the identifier sum
, remember this is the result of the summation the script performs.
Finally, we have a format specifier in the string %d
this means at this position in the string you are promising to provide an variable, in this case an integer(d), sum
.
You can see more about format specifiers here, https://www.tutorialspoint.com/format-specifiers-in-c.
Now if we look in the compiler window we can see more code as expected, and certainly some new things.
.LC0:
.string "%d = 5 + 5"
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $10, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
Looking over this code, we can see 5 new lines. The first line .LC0:
means local constant, e.g string literal, constants are values that cannot be changed. Line 2, shows string
which means here is a string from printf()
...
.LC0:
.string "%d = 5 + 5"
The next new line is 6, reserves 16 bytes of memory for local variables, we have a string now.
subq $16, %rsp
The fourth new line,9, points the address of the result of the sum to %esi
. %esi
links to the printf()
function where it passes the value of this address into %d
format specifier.
movl %eax, %esi
The fifth new line is line 10, where the local constant .LC0
address is pointed to bb %edi
.
movl $.LC0, %edi
Finally, line 12 use a keyword call to the function printf, that is linked to call printf
call printf
Task-5: CALL TO ACTION (20 mins)
Modify the code to increment the sum by itself and print to console:
#include <stdio.h>
int main()
{
int sum = 5 + 5;
printf("%d = 5 + 5\n",sum);
int sumsum = sum+sum;
printf("%d = %d + %d\n",sumsum,sum,sum);
return 0;
}
Did you know that if you hover over the lines in the source code related compiler code is highlighted.
What differences can you see, in the compiled code?
Task-6: Let's enhance our C program by adding a simple loop.(40 mins)
-
Modify (or create a new one) the code in the Godbolt Simulator to use a loop to calculate the sum of numbers from 1 to 5.
Try to write the code yourself; it should be very straightforward.
Otherwise, see the code below:
Click for Expected Output
#include <stdio.h> int main() { int sum = 0; for (int i = 1; i <= 5; ++i) { sum += i; } printf("Sum: %d\n", sum); return 0; }
-
After compiling it using
Godbolt
Compiler, you should be able to get the following:.LC0: .string "%d\n" main: push rbp mov rbp, rsp sub rsp, 16 mov DWORD PTR [rbp-4], 0 mov DWORD PTR [rbp-8], 1 jmp .L2 .L3: mov eax, DWORD PTR [rbp-8] add DWORD PTR [rbp-4], eax add DWORD PTR [rbp-8], 1 .L2: cmp DWORD PTR [rbp-8], 5 jle .L3 mov eax, DWORD PTR [rbp-4] mov esi, eax mov edi, OFFSET FLAT:.LC0 mov eax, 0 call printf mov eax, 0 leave ret
Now let's break into section and try to understand the whole ASM code.
-
*String Constant Section (.LC0)
.LC0: .string "%d\n"
This section defines a string constant that will be used as the format specifier for printf. The
%d
serves as a placeholder for an integer value, and \n represents a newline character."
-
Main Function Section (main):
main: push rbp mov rbp, rsp sub rsp, 16
-
What do these parameters represent? You should know this by now.
Check your understanding and compare it to this.
Click for Expected Output
save the base pointer (rbp), set the base pointer to the current stack pointer (rsp), and allocate space on the stack for local variables.
-
How about this:, any idea what is
DWORD
is ?mov DWORD PTR [rbp-4], 0 mov DWORD PTR [rbp-8], 1
-
Initialize local variables: int sum = 0 (DWORD PTR [rbp-4]) and int i = 1 (DWORD PTR [rbp-8]).
-
DWORD stands for Double Word, and it is a data type used in computing. In the context of x86 and x86-64 architectures, a DWORD is a 32-bit unsigned integer, which means it can represent values from 0 to 4,294,967,295.
-
- Now, this: Could you work it out?
jmp .L2
- Now let discuss the loop's body:
.L3:
mov eax, DWORD PTR [rbp-8]
add DWORD PTR [rbp-4], eax
add DWORD PTR [rbp-8], 1
DWORD PTR [rbp-4], 0
; Initialise a variable at[rbp-4]
to0
(accumulated summov DWORD PTR [rbp-8]
,1
; Initialize another variable at[rbp-8]
to1
(loop counter)
- Loop condition
Loop.L2: cmp DWORD PTR [rbp-8], 5 jle .L3
condition
: Compare i with 5 and jump to .L3 if less than or equal.
-
After existing the loop
mov eax, DWORD PTR [rbp-4] mov esi, eax mov edi, OFFSET FLAT:.LC0 mov eax, 0 call printf
After the loop, load the final value of sum into the eax register, set esi
to this value, load the address of the string constant into edi, and call printf to print the result.
- How about this, can you work it out?
mov eax, 0
leave
ret
Check Answer:
[Optional] Compiler for Python Script [for those who'e intereted in Python
]
Now you have a look two basic operations in C and analysed the compile code, lets look at Python which you may find easier to understand.
Unlike C and other languages, Python doesn't need you to define data types, ints, floats, chars, string etc. and therefore is consider loose typed.
Select Python from the drop down list in the window 1, top right. Or go here...https://godbolt.org/z/7dbb69Gxh
You will probably notice several major differences, and maybe the compiler looks easier to understand?
So we make an easier comparison between C and Python source code and compilers, lets use the same code as before, modify your to look like the below.
Line 1 the keyword,def
, defines a function whose name is, main()
. Line 2 pass
is essientally a placeholder.
def main():
pass
CALL TO ACTION
Python is particular about formatting, if code belong inside a function, or block then it must be indented by 4 spaces or tab.
If you look at the IDE there are four faint when you highlight the white space between pass and the left of the word.
So lets look at the compiled code for this function.
LOAD_CONST 1 (|'|main|')|
MAKE_FUNCTION 0
STORE_NAME 0 (main)
LOAD_CONST 2 (None)
RETURN_VALUE
So we can see that we have completely different assembly code..., well that isn't a bad thing. Each code language compiles differently. Sometimes these assembly languages are easier to understand as they are written using meaningful words.
Line 1, loads the script as a constant, remember constant is something that isn't allowed to change.
Line 2, takes the name of the first function and loads it in to memory too.
Then the function is made at line 3, MAKE_FUNCTION
.
STORE_NAME
on line 4, creates a name for the function main so that it the function can be reference later.
Remember pass
, was a placeholder, well line 6 shows the nothing is to be loaded.
This is much easier to understand right?
Now lets modify the source code again.
def main():
a = 5
b = 5
c = 5 + 5
So simple enough, three variables that contain whole numbers. This is the same as our C example earlier, though we are not using the identifier sum as we did in C, because in Python sum
is a keyword.
We will now look at the compiled code, but only from line 9, as the previous lines are the same as before.
LOAD_CONST 1 (5)
STORE_FAST 0 (a)
LOAD_CONST 1 (5)
STORE_FAST 1 (b)
LOAD_CONST 2 (10)
STORE_FAST 2 (c)
LOAD_CONST 0 (None)
RETURN_VALUE
Lines 9, 12 and 15 show that the values of the variables are loaded into memory.
Lines 10, 13 and 16 stores the variable name into the top of the stack (memory).
Much like C's movl $0, %eax
python's compiler resets the memory pointer.
Okay, so we are almost there, but we haven't sent anything to the terminal yet, modify your code to look like this.
def main():
a = 5
b = 5
c = 5 + 5
print(c," = ", a, " + ",b)
main()
So only two new lines, 5 and 6.
On line 5 the printf()
function takes each variable (a,b,c) separated by string to produce 10 = 5 + 5.
Line 7, is where we call the function main
which will cause all of the code to be executed the output appear in the console/terminal (window 3).
Now lets look at the additional code that has appeared in the compile window.
Line 6 to 10 shows how the the function main
is called in line 7 of the source code.
LOAD_NAME 0 (main)
CALL_FUNCTION 0
POP_TOP
LOAD_CONST 2 (None)
RETURN_VALUE
You can see that the function is loaded by its name, LOAD_NAME
, which was originally stored in memory with STORE_NAME
earlier.
The function is then called CALL_FUNCTION
and subsequently removed STORE_NAME
from the top of the stack (memory).
Godbolt Simulator resources:
On your own time, take a look at the Godbolt Compiler's Github Repository, where you can find documentation and tutorials, etc .