What is a compiler?
A compiler is a specialized program that translates source code written in a high-level programming language into machine code, bytecode, or another programming language. This process is essential for executing programs written in languages like Java, C++, or Swift, as computers can only understand low-level machine code.
Functions of a Compiler
The primary function of a compiler is to convert source code into a format that can be executed by a computer's processor. The process involves several key stages:
Lexical Analysis: The compiler breaks down the source code into tokens, which are the smallest units of meaning, such as keywords and identifiers.
Syntax Analysis: The compiler checks the syntax of the code to ensure it follows the rules of the programming language. This stage often involves constructing a parse tree that represents the code's structure.
Semantic Analysis: Here, the compiler verifies the logical correctness of the code, ensuring that operations are valid and that variables are properly declared and used.
Intermediate Representation (IR) Generation: The compiler creates an intermediate representation of the code, which simplifies the subsequent translation into machine code.
Optimization: The compiler optimizes the IR to improve performance and efficiency, applying various techniques to enhance the generated code.
Code Generation: Finally, the compiler produces the machine code or bytecode that can be executed by the target system.
Types of Compilers
Compilers can be categorized based on their output and functionality:
Native Compilers: Generate machine code for the same architecture on which they run.
Cross Compilers: Produce machine code for a different architecture than the one on which they are executed.
Just-In-Time (JIT) Compilers: Compile code at runtime, often used in environments like the Java Virtual Machine (JVM) to enhance performance by compiling bytecode into machine code just before execution.
What is an interpreter?
An interpreter is a type of computer program that directly executes instructions written in a high-level programming language without requiring them to be compiled into machine code first. Unlike compilers, which translate the entire source code into machine code before execution, interpreters process the code line-by-line or statement-by-statement.
Key Functions of an Interpreter
Execution of Code: An interpreter reads the source code and executes it immediately, translating each instruction into machine code on the fly.
Error Detection: Because interpreters execute code line by line, they can provide immediate feedback on errors, allowing developers to debug their programs more interactively.
Flexibility: Interpreters allow for dynamic execution of code, which can be useful in environments where code needs to be modified or executed in real-time, such as in scripting languages like Python or JavaScript.
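The statement-by-statement model above can be sketched in a few lines of Python. The tiny assignment-only language and the interpret function here are purely illustrative, not any real interpreter's API:

```python
# A minimal sketch of an interpreter: it executes one statement at a
# time, translating each into an action immediately. The language
# (simple assignments only) is made up for illustration.
def interpret(lines):
    env = {}                      # runtime variable store
    for line in lines:
        name, _, expr = line.partition("=")
        # evaluate the right-hand side against the current variables
        env[name.strip()] = eval(expr, {}, env)
    return env

program = [
    "x = 2",
    "y = x * 10",
]
print(interpret(program))  # {'x': 2, 'y': 20}
```

Note how an error in the second statement would only surface once the first had already executed, which is exactly the interactive, line-by-line behavior described above.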
Types of Interpreters
Interpreted Languages: Languages like Python, Ruby, and JavaScript are typically interpreted, allowing for rapid development and testing.
Hybrid Approaches: Some languages, like Java, use a combination of both compilation and interpretation. Java code is first compiled into bytecode, which is then interpreted by the Java Virtual Machine (JVM).
Processor architectures can be classified into various types based on their design and operational principles. Here are the primary categories:
RISC (Reduced Instruction Set Computer):
Focuses on a small set of highly optimized instructions.
Each instruction is designed to execute in a single clock cycle, promoting efficiency and speed.
Commonly used in mobile devices and embedded systems, with ARM architecture being a notable example.
CISC (Complex Instruction Set Computer):
Contains a larger set of instructions that can execute complex tasks in fewer lines of assembly code.
Instructions may take multiple cycles to execute, which can enhance performance for specific tasks.
The x86 architecture, used in most personal computers, is a prominent example of CISC.
The compilation process converts high-level source code into low-level machine code that the target machine can execute. An essential role of compilers is also to inform the developer about errors in the code, especially syntax-related ones.
The compilation process consists of several phases:
Lexical analysis
Syntax analysis
Semantic analysis
Intermediate code generation
Optimization
Code generation
The first stage of the compilation process is lexical analysis. During this phase, the compiler splits source code into fragments called lexemes. A lexeme is an abstract unit of a specific language’s lexical system. Let’s analyze a simple example:
String greeting = "hello";
In the above statement, we have five lexemes:
String
greeting
=
"hello"
;
After splitting code into lexemes, a sequence of tokens is created. The sequence of tokens is a final product of lexical analysis. Thus, lexical analysis is also often called tokenization. A token is an object that describes a lexeme. It gives information about the lexeme’s purpose, such as whether it’s a keyword, variable name, or string literal. Moreover, it stores the lexeme’s source code location data.
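A tokenizer for the statement above can be sketched with a handful of regular-expression rules. The token kinds and patterns here are illustrative; a real lexer for Java has far more rules:

```python
import re

# Illustrative token patterns, tried in order at each position.
TOKEN_SPEC = [
    ("STRING_LITERAL", r'"[^"]*"'),
    ("IDENTIFIER",     r"[A-Za-z_]\w*"),
    ("ASSIGN",         r"="),
    ("SEMICOLON",      r";"),
    ("SKIP",           r"\s+"),
]

def tokenize(code):
    tokens = []
    pos = 0
    while pos < len(code):
        for kind, pattern in TOKEN_SPEC:
            m = re.match(pattern, code[pos:])
            if m:
                if kind != "SKIP":
                    # a token records the lexeme's kind, text, and location
                    tokens.append((kind, m.group(), pos))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character at {pos}")
    return tokens

for tok in tokenize('String greeting = "hello";'):
    print(tok)
```

Each tuple is a token: the lexeme's purpose, its text, and its position in the source, exactly the information the compiler needs in the next stage.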
During syntax analysis, the compiler uses a sequence of tokens generated in the first stage. Tokens are used to create a structure called an abstract syntax tree (AST), which is a tree that represents the logical structure of a program.
In this phase, the compiler checks the grammatical structure of the source code and its syntactic correctness. Any syntax errors result in a compilation error, and the compiler informs the developer about them.
In brief, syntax analysis is responsible for two tasks:
It checks the source code for syntax errors.
It generates an abstract syntax tree that the next stage uses.
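Python's built-in parser illustrates both tasks on a small scale (using Python syntax rather than Java): valid code yields an AST, while invalid code stops compilation with a diagnostic. This is only an illustration of the idea, not how a Java compiler is implemented:

```python
import ast

# Valid code produces an abstract syntax tree we can inspect.
tree = ast.parse("greeting = 'hello'")
assign = tree.body[0]
print(type(assign).__name__)   # Assign
print(assign.targets[0].id)    # greeting

# Invalid code is rejected with a syntax error.
try:
    ast.parse("greeting = = 'hello'")
except SyntaxError:
    print("syntax error reported")
```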
In this stage, the compiler uses an abstract syntax tree to detect any semantic errors, for example:
assigning the wrong type to a variable
declaring variables with the same name in the same scope
using an undeclared variable
using a language keyword as a variable name
Semantic analysis can be divided into three steps:
Type checking – inspects type match in assignment statements, arithmetic operations, functions, and method calls.
Flow control checking – investigates if flow control structures are used correctly and if classes and objects are correctly accessed.
Label checking – validates the use of labels and identifiers.
To achieve all the above goals, during semantic analysis, the compiler makes a complete traversal of the abstract syntax tree. Semantic analysis finally produces an annotated AST that describes the values of its attributes.
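One of the checks listed above, using an undeclared variable, can be sketched as a traversal of the AST. This is a heavily simplified version (it ignores scopes and types, which real compilers must track), again using Python's ast module for illustration:

```python
import ast

# Walk each statement's AST: report names that are read before any
# statement has assigned to them (a simplified "undeclared variable"
# check; real semantic analysis also tracks scopes and types).
def undeclared_names(source):
    declared, errors = set(), []
    for stmt in ast.parse(source).body:
        # first, check reads against what's declared so far
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id not in declared:
                    errors.append(node.id)
        # then, record the names this statement declares
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                declared.add(node.id)
    return errors

print(undeclared_names("x = 1\ny = x + z"))  # ['z']
```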
During the compilation process, a compiler can generate one or more intermediate code forms.
Intermediate code is machine-independent.
Thus, there is no need for unique compilers for every different machine. Besides, optimization techniques are easier to apply to intermediate code than machine code.
Intermediate code has two forms of representation:
High-Level – similar to the source language. This form makes it easy to apply source-level optimizations, but it's less suited for optimizing for the target machine.
Low-Level – close to the target machine's instruction set. It's preferred for making machine-related optimizations.
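A common intermediate form is three-address code, where every operation gets its own temporary and no instruction depends on a particular machine. The sketch below lowers the expression a + b * c; the temporary-naming scheme and instruction format are made up for illustration:

```python
import ast

# Lower an arithmetic expression to machine-independent three-address
# code: one operation per instruction, each result in a fresh temporary.
def lower_expr(source):
    code, counter = [], iter(range(1000))
    def lower(node):
        if isinstance(node, ast.BinOp):
            left, right = lower(node.left), lower(node.right)
            temp = f"t{next(counter)}"
            op = {ast.Add: "+", ast.Mult: "*"}[type(node.op)]
            code.append(f"{temp} = {left} {op} {right}")
            return temp
        return node.id          # a bare variable reference
    lower(ast.parse(source, mode="eval").body)
    return code

for line in lower_expr("a + b * c"):
    print(line)
# t0 = b * c
# t1 = a + t0
```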
In the optimization phase, the compiler uses a variety of ways to enhance the efficiency of the code.
Certainly, the optimization process should follow three important rules:
The resulting code can’t change the original meaning of the program.
Optimization should focus on consuming fewer resources and speeding up the operation of the software.
The optimization process shouldn’t significantly impact the overall time of compilation.
Let’s see a few examples of optimization techniques:
Function inlining – replacing the function call with its body.
Dead code elimination – the compiler removes code that is never executed, or whose result is never used.
Loop fusion – merging adjacent loops that have the same iteration conditions into a single loop.
Instruction combining – instructions performing similar operations are combined into one; for example, x = x + 10; x = x - 7; could be replaced with x = x + 3;
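The instruction-combining example can be sketched as a tiny peephole pass over textual instructions. The instruction format is made up for illustration, and the pass only handles adjacent additions and subtractions to the same variable:

```python
import re

# Fold adjacent "v = v + n" / "v = v - n" instructions on the same
# variable into a single instruction with the combined constant.
def combine(instructions):
    pattern = r"(\w+) = \1 ([+-]) (\d+)"
    out = []
    for inst in instructions:
        m = re.match(pattern, inst)
        if m and out:
            p = re.match(pattern, out[-1])
            if p and p.group(1) == m.group(1):
                delta = lambda g: int(g(3)) if g(2) == "+" else -int(g(3))
                total = delta(p.group) + delta(m.group)
                sign = "+" if total >= 0 else "-"
                out[-1] = f"{m.group(1)} = {m.group(1)} {sign} {abs(total)}"
                continue
        out.append(inst)
    return out

print(combine(["x = x + 10", "x = x - 7"]))  # ['x = x + 3']
```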
Finally, the compiler converts the optimized intermediate code to the machine code dedicated to the target machine. The final code should have the same meaning as source code and be efficient in terms of memory and CPU resource usage. Furthermore, the code generation process must also be efficient.
An assembler is a computer program that translates assembly language, a low-level programming language, into machine code, which is the binary code that a computer's processor can execute directly. This translation is essential because assembly language uses human-readable mnemonics and symbolic names for operations and memory addresses, making it easier for programmers to write code compared to using raw machine code.
To illustrate how an assembler works, consider a simple assembly language program:
START: MOV R1, #5 ; Move the value 5 into register R1
MOV R2, #10 ; Move the value 10 into register R2
ADD R3, R1, R2 ; Add the values in R1 and R2, store the result in R3
END ; End of the program
In this example:
START is a label that marks the beginning of the program.
MOV is an operation code (opcode) that moves data into a register.
R1, R2, and R3 are symbolic register names.
The numbers (5 and 10) are immediate values.
A typical assembler works in two passes:
Pass 1: The assembler scans the code to create a symbol table, which maps labels (like START) to memory addresses. It also assigns a memory address to each instruction.
Pass 2: The assembler translates each mnemonic into its corresponding machine code. For example, the MOV instruction might translate to a specific binary pattern that the CPU understands.
The final output might look something like this in machine code (the actual binary representation will depend on the architecture):
0001 0101 0000 0000 0000 0000 0000 0101 ; MOV R1, #5
0001 0101 0000 0000 0000 0000 0000 1010 ; MOV R2, #10
0001 0011 0000 0001 0000 0010 0000 0011 ; ADD R3, R1, R2
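The two passes can be sketched for the program above. The opcode values and the address scheme (one word per instruction, addresses as instruction indexes) are made up; real assemblers use the target architecture's actual encodings and byte addresses:

```python
# A sketch of a two-pass assembler for a made-up instruction set.
def assemble(lines):
    # Pass 1: strip comments, collect labels into a symbol table, and
    # assign an address (here, an instruction index) to each instruction.
    symbols, instructions = {}, []
    for line in lines:
        line = line.split(";")[0].strip()
        if not line:
            continue
        if ":" in line:
            label, line = line.split(":", 1)
            symbols[label.strip()] = len(instructions)
            line = line.strip()
        if line:
            instructions.append(line)
    # Pass 2: translate each mnemonic, with the symbol table available
    # to resolve any label references.
    opcodes = {"MOV": "0001", "ADD": "0011"}
    machine = [(opcodes[inst.split()[0]], inst) for inst in instructions]
    return symbols, machine

symbols, machine = assemble([
    "START: MOV R1, #5   ; Move the value 5 into register R1",
    "MOV R2, #10         ; Move the value 10 into register R2",
    "ADD R3, R1, R2      ; Add R1 and R2, store the result in R3",
])
print(symbols)          # {'START': 0}
print(machine[0][0])    # 0001
```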
Assemblers can be classified into two main types:
Single-Pass Assembler: Translates the entire program in one pass through the source code.
Multi-Pass Assembler: Requires multiple passes to resolve all symbols and generate the machine code.
Assemblers play a crucial role in bridging the gap between high-level programming languages and machine code, enabling efficient programming and better control over hardware operations.
Before we start with linking methods, let’s first understand the process of generating the executable of any piece of code. Formally, a computer program is a sequence of statements in a programming language instructing the CPU what to do.
To execute any program, we convert the source code to machine code. We do this by first compiling the code to an intermediate level and then converting it to assembly-level code. After that, we link the assembly code with other libraries or modules it uses.
So, linking is the process of combining external programs with our program to execute them successfully. After linking, we finally get our executable:
Linking
The assembler translates our compiled code to machine code and stores the result in an object file. It’s a binary representation of our program. The assembler gives a memory location to each object and instruction. Some of these memory locations are virtual, i.e., they’re offsets relative to the base address of the first instruction.
That’s the case with references to external libraries. The linker resolves them when combining the object file with the libraries.
Overall, we have two mechanisms for linking:
Static
Dynamic
Static Linking
In static linking, the system linker copies the dependencies into the final executable. At the time of linking an external library, the linker finds all dependencies that are defined in that library. And it replaces them with the corresponding functions from the library to resolve dependencies in our code. Afterward, the linker generates the final executable file that we can execute on the underlying machine.
For example, let’s say our application calls the function print() from an external library named Library. The assembler generates the object file with all native symbols resolved to their memory addresses, but it cannot resolve the external reference print(). The linker loads the library, finds the definition of print() in it, and maps print() to a memory location, thus resolving the dependency:
So, a statically linked file contains our program’s code as well as the code of all the libraries it invokes. Since complete libraries are copied in, the resulting file may be very large, and it needs correspondingly more space both on disk and in main memory.
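The resolution step can be simulated in a few lines. The symbol names follow the print() example above, and the dictionaries standing in for object files and libraries are illustrative, not a real object-file format:

```python
# A sketch of static linking: each unresolved symbol in the object
# file is replaced by a copy of the library's definition, so the final
# "executable" is self-contained.
def static_link(object_code, libraries):
    executable = dict(object_code["defined"])
    for symbol in object_code["undefined"]:
        definition = next(
            (lib[symbol] for lib in libraries if symbol in lib), None)
        if definition is None:
            raise RuntimeError(f"undefined reference to {symbol!r}")
        executable[symbol] = definition      # the code is copied in
    return executable

our_object = {
    "defined":   {"main": "<machine code for main>"},
    "undefined": ["print"],                  # resolved by the linker
}
library = {"print": "<machine code for print>"}

exe = static_link(our_object, [library])
print(sorted(exe))  # ['main', 'print']
```

This also makes the size trade-off visible: every statically linked executable carries its own copy of print().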
Advantages of static linking:
Faster program startup and execution, as all libraries are already linked
Simpler distribution to diverse environments, as the program has everything it needs to run
Less chance of compatibility issues
Disadvantages of static linking:
Larger executable file size, as all required libraries are included
Difficult to update libraries, as the entire executable needs to be recompiled if any changes are made to external libraries
Higher memory usage, as each statically linked program contains copies of the same system library functions
Dynamic Linking
Dynamic linking, on the other hand, loads external libraries at runtime when the program is executed. The executable contains only the names of the required libraries and their unresolved symbols.
When the program encounters an unresolved symbol, the operating system loads the corresponding library into memory. This allows multiple processes to share a single copy of a library, reducing memory usage.
Advantages of dynamic linking:
Smaller executable file size, as libraries are linked at runtime
Easier to update libraries, as changes propagate to all programs using the library
More memory efficient, as libraries are loaded only once and shared among multiple processes
Disadvantages of dynamic linking:
Slightly slower program startup due to the additional linking process at runtime
Potential for version conflicts (DLL hell problem), where a program expects a specific library version that is different from the one loaded by the OS
Here are some guidelines for choosing between the two mechanisms:
Use static linking when distributing binaries to multiple environments or OSes, as the program has everything it needs to run
Use static linking if the program makes a high volume of library calls or many small calls to library routines, as it may perform better
Use dynamic linking when multiple programs use the same set of libraries, as it can reduce memory usage and improve performance
Use dynamic linking when frequent library updates are expected, as changes can propagate to all programs without recompilation
A loader is a crucial component of an operating system responsible for loading executable files and their associated libraries into memory for execution. It prepares the program for execution by allocating memory space, resolving addresses, and setting up the execution environment.
Key Functions of a Loader
Loading Programs: The loader reads the executable file and loads it into memory. This includes both the program code and any required libraries.
Memory Allocation: It allocates memory for the program and its data, ensuring that there is enough space for execution.
Address Resolution: The loader resolves symbolic addresses used in the program to actual physical addresses in memory.
Linking: If the program uses dynamic linking, the loader links the necessary libraries at runtime, allowing the program to use shared code.
Execution Setup: It prepares the execution environment by initializing registers and setting up the stack.
When a user runs a program, the operating system invokes the loader, which performs the following steps:
Read the Executable: The loader reads the program's executable file from disk.
Allocate Memory: It allocates memory for the program's code and data segments.
Load the Program: The loader copies the program's code and data into the allocated memory space.
Resolve Dependencies: If the program requires external libraries, the loader locates and loads these libraries into memory.
Transfer Control: Finally, the loader transfers control to the program, allowing it to begin execution.
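The steps above can be sketched as a small simulation. Everything here, the executable layout, the memory model, and the entry-point call, is made up for illustration; a real OS loader works with binary formats and virtual memory:

```python
# Simulate the loader's steps: read the executable's segments into
# allocated memory, resolve its library dependencies, then transfer
# control to the entry point.
def load_and_run(executable, shared_libraries):
    memory = {}                                   # allocated address space
    memory["code"] = executable["code"]           # load the code segment
    memory["data"] = executable["data"]           # load the data segment
    for lib in executable["needs"]:               # resolve dependencies
        memory[lib] = shared_libraries[lib]
    entry = executable["entry"]                   # transfer control
    return memory["code"][entry]()

program = {
    "code":  {"main": lambda: "running"},
    "data":  {},
    "needs": ["libc"],
    "entry": "main",
}
print(load_and_run(program, {"libc": "<libc image>"}))  # running
```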
In summary, the loader is essential for the execution of programs in a computer system, managing the loading process and ensuring that all necessary components are in place for the program to run successfully.
A computer program is a sequence of statements in a programming language that instructs the CPU to achieve a particular result.
To execute our program, we convert the source code to machine code. So, first, we compile the code to an intermediate level and then convert it to assembly-level code. Then, we link this assembly code with other external libraries or components it uses. Finally, we load it in memory and execute the code.