#### Code Generation for Data Processing Lecture 1: Introduction and Interpretation

#### Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

# Module "Code Generation for Data Processing"

#### Learning Goals

Getting from an intermediate code representation to machine code
 Designing and implementing IRs and machine code generators
 Apply for: JIT compilation, guery compilation, ISA emulation

# Module "Code Generation for Data Processing"

#### Learning Goals

Getting from an intermediate code representation to machine code
 Designing and implementing IRs and machine code generators
 Apply for: JIT compilation, guery compilation, ISA emulation

#### Prerequisites

- Computer Architecture, Assembly
- Databases, Relational Algebra

ERA, GRA/ASP GDB

Beneficial: Compiler Construction, Modern DBs

#### Introduction

- Introduction and Interpretation
- Compiler Front-end

#### Introduction

- Introduction and Interpretation
- Compiler Front-end

#### Intermediate Representations

- IR Concepts and Design
- LLVM-IR
- Analyses and Optimizations

#### Introduction

- Introduction and Interpretation
- Compiler Front-end

#### Intermediate Representations

- IR Concepts and Design
- LLVM-IR
- Analyses and Optimizations

#### Compiler Back-end

- Instruction Selection
- Register Allocation
- Linker, Loader, Debuginfo

#### Introduction

- Introduction and Interpretation
- Compiler Front-end

#### Intermediate Representations

- IR Concepts and Design
- LLVM-IR
- Analyses and Optimizations

#### Compiler Back-end

- Instruction Selection
- Register Allocation
- Linker, Loader, Debuginfo

#### Applications

- JIT-compilation + Sandboxing
- Query Compilation
- Binary Translation

### Lecture Organization

- Lecturer: Dr. Alexis Engelke engelke@in.tum.de
- ▶ Time slot: Thu 10-14, 02.11.018
- Material: https://db.in.tum.de/teaching/ws2324/codegen/

#### Exam

- Written exam, 90 minutes, no retake, date TBD
- (Might change to oral on very low registration count)

#### Exercises

- Regular homework, often with programming exercise
- Submission via e-mail: engelke+cghomework@in.tum.de
  - Grading with  $\{*, +, \sim, -\}$ , feedback on best effort
- Exercise session modes:
  - Present and discuss homework solutions
  - Hands-on programming or analysis of systems (needs laptop)

#### Grade Bonus

 Requirement: N - 2 "sufficiently working" homework submissions and at least 2 presentations of homework in class (depends on submission count)
 Bonus: grades in [1.3; 4.0] improved by 0.3/0.4

# Why study compilers?

- Critical component of every system, functionality and performance
   Compiler mostly *alone* responsible for using hardware well
- Brings together many aspects of CS:
  - Theory, algorithms, systems, architecture, software engineering, (ML)
- New developments/requirements pose new challenges
  - New architectures, environments, language concepts, ....
- High complexity!

### Compiler Lectures @ TUM

Compiler Construction IN2227, SS, THEO

Front-end, parsing, semantic analyses, types Program Optimization IN2053, WS, THEO

Analyses, transformations, abstract interpretation

Virtual Machines IN2040, SS, THEO

Mapping programming paradigms to IR/bytecode

Programming Languages CIT3230000, WS

Implementation of advanced language features Code Generation CIT3230001, WS

Back-end, machine code generation, JIT comp.

# Why study code generation?

### Why study code generation?

- Frameworks (LLVM, ...) exist and are comparably good, but often not good enough (performance, features)
  - Many systems with code gen. have their own back-end
  - E.g.: V8, WebKit FTL, .NET RyuJIT, GHC, Zig, QEMU, Umbra, ...

### Why study code generation?

 Frameworks (LLVM, ...) exist and are comparably good, but often not good enough (performance, features)

- Many systems with code gen. have their own back-end
- E.g.: V8, WebKit FTL, .NET RyuJIT, GHC, Zig, QEMU, Umbra, ...
- Machine code is not the only target: bytecode
  - Often used for code execution
  - E.g.: V8, Java, .NET MSIL, BEAM (Erlang), Python, MonetDB, eBPF, ...
  - Allows for flexible design
  - But: efficient execution needs machine code generation

# Proebsting's Law



"Compiler advances double computing power every 18 years."

- Todd Proebsting, 1998<sup>1</sup>



#### "Compiler advances double computing power every 18 years."

- Todd Proebsting, 1998<sup>1</sup>

Still optimistic; depends on number of abstractions

<sup>1</sup>http://proebsting.cs.arizona.edu/law.html

### Motivational Example: Brainfuck

Turing-complete esoteric programming language, 8 operations

- Input/output: . ,
- Moving pointer over infinite array: < >
- Increment/decrement: + -
- Jump to matching bracket if (not) zero: []

```
++++++[->+++++<]>.
```



$$\mathsf{Program} \longrightarrow \mathsf{???} \longrightarrow \mathsf{Result}$$

# $\mathsf{Program} \longrightarrow \mathsf{Hardware} \longrightarrow \mathsf{Result}$



#### Programs

- High flexibility (possibly)
- Many abstractions (typically)
- Several paradigms



#### Programs

- High flexibility (possibly)
- Many abstractions (typically)
- Several paradigms

#### Hardware/ISA

- Low-level interface
- Few operations, imperative
- "Not easy" to write

#### Motivational Example: Brainfuck – Interpretation

Write an interpreter!

```
unsigned char state[10000];
unsigned ptr = 0, pc = 0;
while (prog[pc])
  switch (prog[pc++]) {
  case '.': putchar(state[ptr]); break;
  case ',': state[ptr] = getchar(); break;
  case '>': ptr++; break;
  case '<': ptr--; break;</pre>
  case '+': state[ptr]++; break;
  case '-': state[ptr]--; break;
  case '[': state[ptr] || (pc = matchParen(pc, prog)); break;
  case ']': state[ptr] && (pc = matchParen(pc, prog)); break;
  }
```

#### Compiler

$$\mathsf{Program} \to \mathsf{Compiler} \quad \to \mathsf{Program}$$

Translate program to other lang.Might optimize/improve program

#### Compiler

$$\mathsf{Program} \to \mathsf{Compiler} \to \mathsf{Program}$$

- Translate program to other lang.Might optimize/improve program
- C, C++, Rust → machine code
   Python, Java → bytecode

13

#### Compiler

- Translate program to other lang.Might optimize/improve program
- C, C++, Rust → machine code
   Python, Java → bytecode

Interpreter

 $\mathsf{Program} \to \mathsf{Interpreter} \longrightarrow \mathsf{Result}$ 

- Directly execute program
- Computes program result

#### Compiler

$$\mathsf{Program} \to \mathsf{Compiler} \to \mathsf{Program}$$

- Translate program to other lang.Might optimize/improve program
- C, C++, Rust → machine code
   Python, Java → bytecode

#### Interpreter

 $\mathsf{Program} \to \mathsf{Interpreter} \longrightarrow \mathsf{Result}$ 

- Directly execute program
- Computes program result
- Shell scripts, Python bytecode, machine code (conceptually)

#### Compiler

- Translate program to other lang.Might optimize/improve program
- C, C++, Rust → machine code
   Python, Java → bytecode

#### Interpreter

 $\mathsf{Program} \to \mathsf{Interpreter} \longrightarrow \mathsf{Result}$ 

- Directly execute program
- Computes program result
- Shell scripts, Python bytecode, machine code (conceptually)

Multiple compilation steps can precede the "final interpretation"



► Targets: machine code, bytecode, or other source language

### Compilers

- Targets: machine code, bytecode, or other source language
- Typical goals: better language usability and performance
  - Make lang. usable at all, faster, use less resources, etc.

### Compilers

Targets: machine code, bytecode, or other source language
 Typical goals: better language usability and performance
 Make lang. usable at all, faster, use less resources, etc.

Constraints: specs, resources (comp.-time, etc.), requirements (perf., etc.)



### Compilers

► Targets: machine code, bytecode, or other source language

- Typical goals: better language usability and performance
  - Make lang. usable at all, faster, use less resources, etc.

Constraints: specs, resources (comp.-time, etc.), requirements (perf., etc.)

#### Examples:

- "Classic" compilers source  $\rightarrow$  machine code
- ▶ JIT compilation of JavaScript, WebAssembly, Java bytecode, ...
- Database query compilation
- ISA emulation/binary translation

#### Compiler Structure: Monolithic



### Compiler Structure: Monolithic



Inflexible architecture, hard to retarget
Compiler Structure: Two-phase architecture



### Compiler Structure: Two-phase architecture



#### Front-end

- Parses source code
- Detect syntax/semantical errors
- Emit intermediate representation encode semantics/knowledge
- Typically:  $\mathcal{O}(n)$  or  $\mathcal{O}(n \log n)$

### Compiler Structure: Two-phase architecture



#### Front-end

- Parses source code
- Detect syntax/semantical errors
- Emit intermediate representation encode semantics/knowledge
- Typically:  $\mathcal{O}(n)$  or  $\mathcal{O}(n \log n)$

### Back-end

- Translate IR to target architecture
- ► Can assume valid IR (~→ no errors)
- Possibly one back-end per arch.
- Contains  $\mathcal{NP}$ -complete problems

Compiler Structure: Three-phase architecture



Compiler Structure: Three-phase architecture



Optimizer: analyze/transform/rewrite program inside IR

Compiler Structure: Three-phase architecture



Optimizer: analyze/transform/rewrite program inside IR

Conceptual architecture: real compilers typically much more complex

- Several IRs in front-end and back-end, optimizations on different IRs
- Multiple front-ends for different languages
- Multiple back-ends for different architectures

1. Tokenizer: recognize words, numbers, operators, etc.

▶ Example:  $a+b*c \rightarrow ID(a)$  PLUS ID(b) TIMES ID(c)

 $\mathcal{R}e$ 

1. Tokenizer: recognize words, numbers, operators, etc.

• Example:  $a+b*c \rightarrow ID(a)$  PLUS ID(b) TIMES ID(c)

- 2. Parser: build (abstract) syntax tree, check for syntax errors CFG
  - Syntax Tree: describe grammatical structure of complete program Example: expr("a", op("+"), expr("b", op("\*"), expr("c"))
  - Abstract Syntax Tree: only relevant information, more concise Example: plus("a", times("b", "c"))

 $\mathcal{R}_{\mathcal{P}}$ 

- 1. Tokenizer: recognize words, numbers, operators, etc.
  - Example:  $a+b*c \rightarrow ID(a)$  PLUS ID(b) TIMES ID(c)
- 2. Parser: build (abstract) syntax tree, check for syntax errors CFG
  - Syntax Tree: describe grammatical structure of complete program Example: expr("a", op("+"), expr("b", op("\*"), expr("c"))
     Abstract Syntax Tree: only relevant information, more concise
    - Example: plus("a", times("b", "c"))
- 3. Semantic Analysis: check types, variable existence, etc.

 $\mathcal{R}_{\mathcal{P}}$ 

- 1. Tokenizer: recognize words, numbers, operators, etc.
  - Example:  $a+b*c \rightarrow ID(a)$  PLUS ID(b) TIMES ID(c)
- 2. Parser: build (abstract) syntax tree, check for syntax errors CFG
  - Syntax Tree: describe grammatical structure of complete program Example: expr("a", op("+"), expr("b", op("\*"), expr("c"))
     Abstract Syntax Tree: only relevant information, more concise Example: plus("a", times("b", "c"))
- 3. Semantic Analysis: check types, variable existence, etc.
- 4. IR Generator: produce IR for next stage
  - This might be the AST itself

 $\mathcal{R}_{\mathcal{P}}$ 

- 1. Instruction Selection: map IR operations to target instructions
  - ► Use target features: special insts., addressing modes, ...
  - Still using virtual/unlimited registers

- 1. Instruction Selection: map IR operations to target instructions
  - Use target features: special insts., addressing modes, ...
  - Still using virtual/unlimited registers
- 2. Instruction Scheduling: optimize order for target arch.
  - Start memory/high-latency earlier, etc.
  - Requires knowledge about micro-architecture

- 1. Instruction Selection: map IR operations to target instructions
  - Use target features: special insts., addressing modes, ...
  - Still using virtual/unlimited registers
- 2. Instruction Scheduling: optimize order for target arch.
  - Start memory/high-latency earlier, etc.
  - Requires knowledge about micro-architecture
- 3. Register Allocation: map values to fixed register set/stack
  - Use available registers effectively, minimize stack usage

### Motivational Example: Brainfuck - Front-end



### Motivational Example: Brainfuck – AST Interpretation

```
AST can be interpreted recursively
```

```
struct node { char kind; int cldCnt; struct node* cld; };
struct state { unsigned char* arr; size_t ptr; };
void donode(struct node* n, struct state* s) {
 switch (n->kind) {
 case '+': s->arr[s->ptr]++; break;
 11 ...
 case '[': while (s->arr[s->ptr]) children(n); break;
 case 0: children(n); break; // root
 }
}
void children(struct node* n. struct state* s) {
 for (int i = 0; i < n > cldCnt; i + +) donode(n - > cld + i, s);
}
```

- ▶ Inefficient sequences of +/-/</>> can be combined
  - Trivially done when generating IR

- Inefficient sequences of +/-/</> can be combined
  - Trivially done when generating IR
- Fold patterns into more high-level operations

### Fold offset into operation

- right(2) add(1) = addoff(2, 1) right(2)
- Also possible with loops

#### Fold offset into operation

- right(2) add(1) = addoff(2, 1) right(2)
- Also possible with loops
- Analysis: does loop move pointer?
  - Loops that keep position intact allow more optimizations
  - Maybe distinguish "regular loops" from arbitrary loops?
- ▶ Get rid of all "effect-less" pointer movements

### Fold offset into operation

- right(2) add(1) = addoff(2, 1) right(2)
- Also possible with loops
- Analysis: does loop move pointer?
  - Loops that keep position intact allow more optimizations
  - Maybe distinguish "regular loops" from arbitrary loops?
- Get rid of all "effect-less" pointer movements
- Combine arithmetic operations, disambiguate addresses, etc.

## Motivational Example: Brainfuck – Bytecode

### Motivational Example: Brainfuck – Bytecode

Tree is nice, but rather inefficient ~> flat and compact bytecode
 Avoid pointer dereferences/indirections; keep code size small

### Motivational Example: Brainfuck – Bytecode

- Tree is nice, but rather inefficient ~> flat and compact bytecode
   Avoid pointer dereferences/indirections; keep code size small
- Superinstructions: combine common sequences to one instruction
- Maybe dispatch two instructions at once?
  - switch (ops[pc] | ops[pc+1] « 8)

### Motivational Example: Brainfuck – Threaded Interpretation

### Motivational Example: Brainfuck – Threaded Interpretation

- Simple switch-case dispatch has lots of branch misses
- > Threaded interpretation: at end of a handler, jump to next op

```
struct op { char op; char data; };
struct state { unsigned char* arr; size_t ptr; };
void threadedInterp(struct op* ops, struct state* s) {
    static const void* table[] = { &&CASE_ADD, &&CASE_RIGHT, };
#define DISPATCH do { goto *table[(++pc)->op]; } while (0)
```

```
struct op* pc = ops;
DISPATCH;
```

```
CASE_ADD: s->arr[s->ptr] += pc->data; DISPATCH;
CASE_RIGHT: s->arr += pc->data; DISPATCH;
}
```

Key technique to "avoid" compilation to machine code

Key technique to "avoid" compilation to machine code

Preprocess program into efficiently executable bytecode

- Easily identifiable opcode, homogeneous structure
- Can be linear (fast to execute), but trees also work

Key technique to "avoid" compilation to machine code

- Preprocess program into efficiently executable bytecode
  - Easily identifiable opcode, homogeneous structure
  - Can be linear (fast to execute), but trees also work
- Perhaps optimize if it's worth the benefit
  - ► Fold constants, combine instructions, ...
  - Consider superinstructions for common sequences

Key technique to "avoid" compilation to machine code

- Preprocess program into efficiently executable bytecode
  - Easily identifiable opcode, homogeneous structure
  - Can be linear (fast to execute), but trees also work
- Perhaps optimize if it's worth the benefit
  - ► Fold constants, combine instructions, ...
  - Consider superinstructions for common sequences
- ► For very cold code: avoid transformations at all
- Use threaded-interpretation to avoid branch misses

### Compiler: Surrounding – Compile-time

► Typical environment for a C/C++ compiler:



Calling Convention: interface with other objects/libraries

- Build systems, dependencies, debuggers, etc.
- Compilation target machine (hardware, VM, etc.)

## Compiler: Surrounding – Run-time

- ► OS interface (I/O, ...)
- Memory management (allocation, GC, ...)
- Parallelization, threads, ...
- ▶ VM for execution of virtual assembly (JVM, ...)
- Run-time type checking
- Error handling: exception unwinding, assertions, ...
- Reflection, RTTI

## Motivational Example: Brainfuck - Runtime Environment

Motivational Example: Brainfuck - Runtime Environment

▶ Needs I/O for . and ,
Motivational Example: Brainfuck – Runtime Environment

 $\blacktriangleright$  Needs I/O for . and ,

Memory management: infinitely sized array

Allocate on demand (easy?)

### Motivational Example: Brainfuck – Runtime Environment

 $\blacktriangleright$  Needs I/O for . and ,

- Memory management: infinitely sized array
- Allocate on demand (easy?)
  - What if main memory or address space is insufficient?
- Deallocation of empty pages?

## Motivational Example: Brainfuck - Runtime Environment

 $\blacktriangleright$  Needs I/O for . and ,

- Memory management: infinitely sized array
- Allocate on demand (easy?)
  - What if main memory or address space is insufficient?
- Deallocation of empty pages?
- Error handling: unmatched brackets

## Compilation point: AoT vs. JIT

## Compilation point: AoT vs. JIT

Ahead-of-Time (AoT)

- All code has to be compiled
- No dynamic optimizations
- Compilation-time secondary concern

#### Just-in-Time (JIT)

- Compilation-time is critical
- Code can be compiled on-demand
  - Incremental optimization, too
- Handle cold code fast
- Dynamic specializations possible
- Allows for eval()

## Compilation point: AoT vs. JIT

Ahead-of-Time (AoT)

- All code has to be compiled
- No dynamic optimizations
- Compilation-time secondary concern

#### Just-in-Time (JIT)

- Compilation-time is critical
- Code can be compiled on-demand
  - Incremental optimization, too
- Handle cold code fast
- Dynamic specializations possible
- Allows for eval()

Various hybrid combinations possible

#### Introduction and Interpretation – Summary

- Compilation vs. interpretation and combinations
- Compilers are key to usable/performant languages
- Target language typically machine code or bytecode
- Three-phase architecture widely used
- Interpretation techniques: bytecode, threaded interpretation, ....
- ► JIT compilation imposes different constraints

#### Introduction and Interpretation – Questions

- What is typically compiled and what is interpreted? Why?
  - PostScript, C, JavaScript, HTML, SQL
- What are typical types of output languages of compilers?
- How does a compiler IR differ from the source input?
- What is the impact of the language paradigm on optimizations?
- What are important factors for an efficient interpreter?
- ▶ What are key differences between AoT and JIT compilation?

Code Generation for Data Processing Lecture 2: Compiler Front-end

Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

## Compiler Front-end



Typical architecture: separate lexer, parser, and context analysis

- Allows for more efficient lexical analysis
- Smaller components, easier to understand, etc.
- Some languages: preprocessor and macro expansion



- Convert stream of chars to stream of words (tokens)
- Detect/classify identifiers, numbers, operators, ....
- Strip whitespace, comments, etc.



Convert stream of chars to stream of words (tokens)

- Detect/classify identifiers, numbers, operators, ...
- Strip whitespace, comments, etc.

```
a+b*c \rightarrow ID(a) PLUS ID(b) TIMES ID(c)
```

Typically representable as regular expressions

# Typical Token Kinds

# Typical Token Kinds

| <ul> <li>Punctuators</li> </ul>                                   | ()[]{};=++=                                                         |
|-------------------------------------------------------------------|---------------------------------------------------------------------|
| <ul> <li>Identifiers</li> </ul>                                   | abc123 main                                                         |
| <ul> <li>Keywords</li> </ul>                                      | void intasm                                                         |
| Numeric constants                                                 | 123 Oxab1 5.7e3 Ox1.8p1                                             |
| <ul> <li>Char constants</li> </ul>                                | 'a' u'œ'                                                            |
| <ul> <li>String literals</li> </ul>                               | "abc\x12\n"                                                         |
| <ul> <li>Internal</li> <li>Comments might be useful fo</li> </ul> | EOF COMMENT UNKNOWN INDENT DEDENT rannotations, e.g. // fallthrough |

#### Lexer Implementation

```
def nextToken(inp: str) -> tuple[str, str, str]:
   # Get next token, return (kind, value, remainder)
   inp = inp.lstrip()
   if not inp:
       return "EOF", "", inp
   if inp[0].isdigit():
       m = re.match(r, [1-9][0-9]*|0([0-7]+|x[0-9a-fA-F]+|)), inp)
       return "NUM", m[0], inp[m.end():]
   if inp[0].isalpha():
       m = re.match(r'[a-zA-Z][a-zA-Z0-9]*', inp)
       if m[0] in KEYWORDS: return m[0], m[0], inp[m.end():]
       return "IDENT", m[0], inp[m.end():]
   if inp[:2] == "+=": return "PLUSEQ", inp[:2], inp[2:]
   if inp[:1] == "+": return "PLUS", inp[:1], inp[1:]
    . . .
   raise Exception()
```

## Lexing C??=

```
main() <%
    // yay, this is C99??/
    puts("hi_world!");
    puts("what's_up??!");
    %>
```

Output:

## Lexing C??=

```
main() <%
  // yay, this is C99??/
  puts("hi_world!");
  puts("what's_up??!");
%>
```

Output: what's up|

## Lexing C??=

```
main() <%
    // yay, this is C99??/
    puts("hi_world!");
    puts("what's_up??!");
%>
```

Output: what's up|

Trigraphs for systems with more limited encodings/char sets
Digraphs to provide a more readable alternative...

#### Lexer Implementation

Essentially a DFA (for most languages)

- Set of regexes  $\rightarrow$  NFA  $\rightarrow$  DFA
- Respect whitespace/separators for operators, e.g. + and +=
- Automatic tools (e.g., flex) exist; most compilers do their own
- Keywords typically parsed as identifiers first
  - Check identifier if it is a keyword; can use perfect hashing
- Other practical problems
  - UTF-8 homoglyphs; trigraphs; pre-processing directives

## Parsing

Convert stream of tokens into (abstract) syntax tree
 Most programming languages are context-sensitive
 Variable declarations, argument count, type match, etc.
 ~> separated into semantic analysis
 Svntactically valid: void foo = doesntExist / "abc";

Grammar usually specified as CFG

## Context-Free Grammar (CFG)

- Terminals: basic symbols/tokens
- Non-terminals: syntactic variables
- Start symbol: non-terminal defining language
- ▶ Productions: non-terminal  $\rightarrow$  series of (non-)terminals

$$stmt \rightarrow whileStmt \mid breakStmt \mid exprStmt$$

$$whileStmt \rightarrow while (expr) stmt$$

$$breakStmt \rightarrow break ;$$

$$exprStmt \rightarrow expr ;$$

$$expr \rightarrow expr + expr \mid expr * expr \mid expr = expr \mid (expr) \mid number$$

#### Hand-written Parsing – First Try

- One function per non-terminal
- Check expected structure
- Return AST node

def parseBreakStmt(...):
 matchToken("break")
 matchToken("SEMICOLON")
 return ("breakStmt",)

```
def parseWhileStmt(...):
    matchToken("while")
    matchToken("LPAREN")
    expr = parseExpr(...)
    matchToken("RPAREN")
    stmt = parseStmt(...)
    return ("whileStmt", expr, stmt)
```

```
def parseStmt(...):
    # whoops!
```

#### Hand-written Parsing – First Try

- One function per non-terminal
- Check expected structure
- Return AST node
- Need look-ahead!

```
def parseBreakStmt(...):
    matchToken("break")
    matchToken("SEMICOLON")
    return ("breakStmt",)
```

```
def parseWhileStmt(...):
    matchToken("while")
    matchToken("LPAREN")
    expr = parseExpr(...)
    matchToken("RPAREN")
    stmt = parseStmt(...)
    return ("whileStmt", expr, stmt)
```

```
def parseStmt(...):
    # whoops!
```

## Hand-written Parsing – Second Try

- Need look-ahead to distinguish production rules
- Consequences for grammar:
  - No left-recursion
  - First *n* terminals must allow distinguishing rules
  - LL(n) grammar; n typically 1
  - $\Rightarrow$  Not all CFGs (easily) parseable (but most programming langs. are)

```
def parseBreakStmt(...):
    ... # as before
def parseWhileStmt(...):
    ... # as before
```

```
def parseStmt(...):
   tok = peekToken()
   if tok == "break":
        return parseBreakStmt(...)
   if tok == "while":
        return parseWhileStmt(...)
   expr = parseExpr(...)
   matchToken("SEMICOLON")
   return ("exprStmt", expr)
```

## Hand-written Parsing – Second Try

- Need look-ahead to distinguish production rules
- Consequences for grammar:
  - No left-recursion
  - First *n* terminals must allow distinguishing rules
  - ► LL(n) grammar; n typically 1
     ⇒ Not all CFGs (easily) parseable (but most programming langs. are)

Now... expressions

```
def parseBreakStmt(...):
    ... # as before
def parseWhileStmt(...):
    ... # as before
```

```
def parseStmt(...):
   tok = peekToken()
   if tok == "break":
        return parseBreakStmt(...)
   if tok == "while":
        return parseWhileStmt(...)
   expr = parseExpr(...)
   matchToken("SEMICOLON")
   return ("exprStmt", expr)
```

Ambiguity

## $expr \rightarrow expr + expr | expr * expr | expr = expr | ( expr ) | number$ Input: 4 + 3 \* 2

Ambiguity

$$expr \rightarrow expr + expr | expr * expr | expr = expr | (expr) | number$$
  
Input: 4 + 3 \* 2



#### Ambiguity – Rewrite Grammar?

# $\begin{array}{rcl} \textit{primary} & \rightarrow & (\textit{expr}) \mid \textit{number} \\ & expr & \rightarrow & \textit{primary} + expr \mid \textit{primary} * expr \mid \textit{primary} = expr \mid \textit{primary} \end{array}$

## Ambiguity – Rewrite Grammar?

primary 
$$\rightarrow$$
 (expr) | number  
expr  $\rightarrow$  primary + expr | primary \* expr | primary = expr | primary

Input: 4 + 3 \* 2 Input: 4 \* 3 + 2





## Ambiguity – Precedence

Input: 4 ★ 3 () . . .





## Ambiguity – Precedence

Input: 4 ★ 3 ○ . . .



prec(○) > prec(★)
 Equal prec. and ★ is right-associative



- ▶  $prec(\bigcirc) < prec(\bigstar)$
- ► Equal prec. and ★ is left-associative

#### Hand-written Parsing – Expression Parsing

- Start with basic expr.:
- Number, variable, etc.
- ► Parenthesized expr.
  - Parse full expressionNext token must be )
- Unary expr: followed by expr. with higher prec.
  - < unary < []/->

```
def parsePrimaryExpr(...):
    # handle numbers, unary operators,
    # variables, parenthesized expr.
    ... # trivial ;)
def parseExpr(..., minPrec=0):
    lhs = parsePrimaryExpr(...)
    ... # (next slide)
```

## Hand-written Parsing - Expression Parsing

- Only allow ops. with higher prec. on the right child
- Operator precedence
  - ▶  $* \rightarrow (3, \text{left-assoc})$ ▶  $+ \rightarrow (2, \text{left-assoc})$ ▶  $= \rightarrow (1, \text{right-assoc})$
- ► Right-assoc.: allow same prec.

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

## Hand-written Parsing - Expression Parsing

- Only allow ops. with higher prec. on the right child
- Operator precedence
  - ▶  $* \rightarrow (3, \text{left-assoc})$ ▶  $+ \rightarrow (2, \text{left-assoc})$ ▶  $= \rightarrow (1, \text{right-assoc})$
- ▶ Right-assoc.: allow same prec.
  - Assignment, ternary

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

#### Hand-written Parsing – Expression Parsing

<u>4</u>+3\*2\*5+1

| tok | minPrec | lhs |
|-----|---------|-----|
|-----|---------|-----|

0 4

```
OPERATORS = {
    "*": (3, False),
    "+": (2, False),
    "=": (1, True),
}
```

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

#### Hand-written Parsing – Expression Parsing

4<u>+</u>3\*2\*5+1

| tok        | minPre  | c lhs   |
|------------|---------|---------|
| +          | 0       | 4       |
|            |         |         |
|            |         |         |
|            |         |         |
| OPERA      | TORS =  | {       |
| "*         | k": (3, | False), |
| Ч4         | ·": (2, | False), |
| 9 <u>-</u> | =": (1, | True),  |

ን

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```
4+<u>3</u>\*2\*5+1

| tok           | minPrec   | lhs    |  |
|---------------|-----------|--------|--|
| +             | 0         | 4      |  |
|               | 3         | 3      |  |
|               |           |        |  |
|               |           |        |  |
| OPERATORS = { |           |        |  |
|               | *": (3, F | alse), |  |
| 0.            | +": (2, F | alse), |  |
| 0,            | =": (1, T | rue),  |  |
| }             |           |        |  |

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

4+3<u>\*</u>2\*5+1

| tok              | minPre          | c lhs   |  |
|------------------|-----------------|---------|--|
| +                | 0               | 4       |  |
| *                | 3               | 3       |  |
|                  |                 |         |  |
|                  |                 |         |  |
| OPERATORS = $\{$ |                 |         |  |
| Π,               | <b>*</b> ": (3, | False), |  |
| 0.               | +": (2,         | False), |  |
| и.               | =": (1,         | True),  |  |
| }                |                 |         |  |

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

4+3\*<u>2</u>\*5+1

| tok            | minPred         | : lhs   |  |
|----------------|-----------------|---------|--|
| +              | 0               | 4       |  |
| *              | 3               | 3       |  |
|                | 4               | 2       |  |
|                |                 |         |  |
| OPERATORS = {  |                 |         |  |
| и,             | <b>*</b> ": (3, | False), |  |
| 0.             | +": (2,         | False), |  |
| и <sub>1</sub> | =": (1,         | True),  |  |
| }              |                 |         |  |

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

4+3\*2<u>\*</u>5+1

| tok              | minPrec   | lhs     |  |
|------------------|-----------|---------|--|
| +                | 0         | 4       |  |
| *                | 3         | 3*2     |  |
|                  |           |         |  |
|                  |           |         |  |
| OPERATORS = $\{$ |           |         |  |
| п,               | ∗": (3, F | False), |  |
| 0.               | +": (2, F | False), |  |
| и.<br>-          | =": (1, 7 | [rue),  |  |
| }                |           |         |  |

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

4+3\*2\*<u>5</u>+1

| tok           | minPre          | c <b>lhs</b> |  |
|---------------|-----------------|--------------|--|
| +             | 0               | 4            |  |
| *             | 3               | 3*2          |  |
|               | 4               | 5            |  |
|               |                 |              |  |
| OPERATORS = { |                 |              |  |
| и,            | <b>*</b> ": (3, | False),      |  |
| 0.            | +": (2,         | False),      |  |
| и,            | =": (1,         | True),       |  |
| }             |                 |              |  |

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

4+3\*2\*5<u>+</u>1

| tok | minPrec | lhs     |
|-----|---------|---------|
| +   | 0       | 4       |
| *   | 3       | (3*2)*5 |

```
OPERATORS = {
    "*": (3, False),
    "+": (2, False),
    "=": (1, True),
}
```

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

4+3\*2\*5<u>+</u>1

tok minPrec lhs + 0 4+((3\*2)\*5)

```
OPERATORS = {
    "*": (3, False),
    "+": (2, False),
    "=": (1, True),
}
```

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

4+3\*2\*5+<u>1</u>

| tok | minPrec | lhs         |
|-----|---------|-------------|
| +   | 0       | 4+((3*2)*5) |
|     | 3       | 1           |

```
OPERATORS = {
    "*": (3, False),
    "+": (2, False),
    "=": (1, True),
}
```

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

4+3\*2\*5+<u>1</u>

tok minPrec lhs + 0 (4+((3\*2)\*5))+1

```
OPERATORS = {
    "*": (3, False),
    "+": (2, False),
    "=": (1, True),
}
```

```
def parsePrimaryExpr(...):
 # handle numbers, unary operators,
 # variables, parenthesized expr.
  ... # trivial :)
def parseExpr(..., minPrec=0):
 lhs = parsePrimaryExpr(...)
 while True:
   tok = nextToken()
   prec, rassoc = OPERATORS[tok]
   if prec < minPrec:</pre>
     return lhs
   # XXX: handling for: (, [, ?:
   newPrec = prec if rassoc else prec+1
   rhs = parseExpr(..., newPrec)
   lhs = ("expr", tok, lhs, rhs)
```

# Top-down vs. Bottom-up Parsing

#### Top-down Parsing

- Start with top rule
- Every step: choose expansion
- LL(1) parser
  - Left-to-right, Leftmost Derivation
- "Easily" writable by hand
- Error handling rather simple
- Covers many prog. languages

#### Bottom-up Parsing

- Start with text
- Reduce to non-terminal
- LR(1) parser
  - Left-to-right, Rightmost Derivation
  - Strict super-set of LL(1)
- ► Often: uses parser generator
- Error handling more complex
- Covers nearly all prog. languages

#### Parser Generators

Writing parsers by hand can be large effort

Parser generators can simplify parser writing a lot
 Yacc/Bison, PLY, ANTLR, ...

Automatic generation of parser/parsing tables from CFG

But: lexer often written by hand either way

#### Parser Generators

Writing parsers by hand can be large effort

Parser generators can simplify parser writing a lot
 Yacc/Bison, PLY, ANTLR, ...

- Automatic generation of parser/parsing tables from CFG
  - But: lexer often written by hand either way

Used heavily in practice (unless error handling is important)

#### Bison Example – part 1

```
%define api.pure full
%define api.value.type {ASTNode*}
%param { Lexer* lexer }
%code{
static int yylex(ASTNode** lvalp, Lexer* lexer);
}
%token NUMBER
%token WHILE "while"
%token BREAK "break"
// precedence and associativity
%right '='
%left '+'
%left '*'
%%
```

#### Bison Example – part 2

```
%%
stmt : WHILE '(' expr ')' stmt { $$ = mkNode(WHILE, $1, $2); }
     | BREAK ':'
                                { $$ = mkNode(BREAK, NULL, NULL); }
                               \{ \$\$ = \$1; \}
     expr ';'
     ;
                          \{ \$\$ = mkNode('+', \$1, \$2); \}
expr : expr '+' expr
                            \{ \$\$ = mkNode('*', \$1, \$2); \}
     | expr '*' expr
     expr '=' expr
                        \{ \$\$ = mkNode('=', \$1, \$2); \}
     | '(' expr ')'
                              \{ \$\$ = \$1; \}
     I NUMBER
     :
%%
static int yvlex(ASTNode** lvalp, Lexer* lexer) {
    /* return next token, or YYEOF/... */ }
```

Some use parser generators, e.g. Python some use hand-written parsers, e.g. GCC, Clang

- Some use parser generators, e.g. Python some use hand-written parsers, e.g. GCC, Clang
- Optimization of grammar for performance
  - Rewrite rules to reduce states, etc.

- Some use parser generators, e.g. Python some use hand-written parsers, e.g. GCC, Clang
- Optimization of grammar for performance
  - Rewrite rules to reduce states, etc.
- Useful error-handling: complex!
  - Try skipping to next separator, e.g.; or ,

- Some use parser generators, e.g. Python some use hand-written parsers, e.g. GCC, Clang
- Optimization of grammar for performance
  - Rewrite rules to reduce states, etc.
- Useful error-handling: complex!
  - Try skipping to next separator, e.g.; or ,
- Programming languages are not always context-free
  - C: foo\* bar;
  - May need to break separation between lexer and parser





C++ is not context-free (inherited from C): T \* a;

# Parsing C++

C++ is not context-free (inherited from C): T \* a;

C++ is ambiguous: Type (a), b;

Can be a declaration or a comma expression

# Parsing C++

C++ is not context-free (inherited from C): T \* a;

- C++ is ambiguous: Type (a), b;
  - Can be a declaration or a comma expression
- C++ templates are Turing-complete<sup>2</sup>

# Parsing C++

C++ is not context-free (inherited from C): T \* a;

- C++ is ambiguous: Type (a), b;
  - Can be a declaration or a comma expression
- C++ templates are Turing-complete<sup>2</sup>
- C++ parsing is hence undecidable<sup>3</sup>
  - Template instantiation combined with C T \* a ambiguity

<sup>2</sup>TL Veldhuizen. C++ templates are Turing complete. 2003. .
 <sup>3</sup>J Haberman. Parsing C++ is literally undecidable. 2013. .

## Semantic Analysis

► Syntactical correctness ⇒ correct program void foo = doesntExist / ++"abc";

# Semantic Analysis

Syntactical correctness ⇒ correct program void foo = doesntExist / ++"abc";

Needs context-sensitive analysis:

- Variable existence, storage, accessibility, ....
- Function existence, arguments, ....
- Operator type compatibility
- Attribute allowance
- Additional type complexity: inference, polymorphism, ....

# Semantic Analysis: Scope Checking with AST Walking

- Idea: walk through AST (in DFS-order) and validate on the way
- Keep track of scope with declared variables
  - Scope = (Map[Name → Type] names, Scope parent)
  - Might need to keep track of defined types separately
- ► For identifiers: check existence and get type
- ► For expressions: check types and derive result type
- ▶ For assignment: check lvalue-ness of left side
- Might be possible during AST creation
- Needs care with built-ins and other special constructs

# Semantic Analysis and Post-Parsing Transformations

- Check for error-prone code patterns
  - Completeness of switch, out-of-range constants, unused variables, ...
- Check method calls, parameter types
- Duplicate code for templates
- Make implicit value conversions explicit
- ► Handle attributes: visibility, warnings, etc.
- ▶ Mangle names, split functions (OpenMP), ABI-specific setup, ...
- Last step: generate IR code

- ► Not necessarily: normal compilers
  - Some languages (e.g., Rust) need unbounded time for parsing

- Not necessarily: normal compilers
  - Some languages (e.g., Rust) need unbounded time for parsing
- ► Somewhat: JIT compilers
  - Start-up time is generally noticable

- Not necessarily: normal compilers
  - Some languages (e.g., Rust) need unbounded time for parsing
- Somewhat: JIT compilers
  - Start-up time is generally noticable
- Somewhat more: Developer tools
  - Imagine: waiting for seconds just for updated syntax highlighting
  - Often uses tricks like incremental updates to parse tree

# Data Types

- Important part of programming languages
- Might have large variety and compatibility
  - Numbers, Strings, Arrays, Compound Types (struct/union), Enum, Templates, Functions, Pointers, ...
  - Class hierarchy, Interfaces, Abstract Classes, ....
  - Integer/float compatibility, promotion, ...
- Might have implicit conversions

Simple class/struct: trivial, just bunch of fields

Methods take (pointer to) this as implicit parameter

▶ Single inheritance: also trivial – extend struct at end

Simple class/struct: trivial, just bunch of fields
 Methods take (pointer to) this as implicit parameter
 Single inheritance: also trivial – extend struct at end

Virtual methods: store vtable in object representation

- vtable = table of function pointers for virtual methods
- Each sub-class has their own vtable

Simple class/struct: trivial, just bunch of fields
 Methods take (pointer to) this as implicit parameter
 Single inheritance: also trivial – extend struct at end

Virtual methods: store vtable in object representation

- vtable = table of function pointers for virtual methods
- Each sub-class has their own vtable
- Multiple inheritance is much more involved

Simple class/struct: trivial, just bunch of fields
 Methods take (pointer to) this as implicit parameter

Single inheritance: also trivial – extend struct at end

Virtual methods: store vtable in object representation

- vtable = table of function pointers for virtual methods
- Each sub-class has their own vtable
- Multiple inheritance is much more involved
- Dynamic casts: needs run-time type information (RTTI)
#### Recommended Lectures

AD IN2227 "Compiler Constructions" covers parsing/analysis in depth

AD CIT3230000 "Programming Languages" covers dispatching/mixins/...

# Compiler Front-end – Summary

#### Lexer splits input into tokens

- Essentially Regex-Matching + Keywords; rather simple
- Parser constructs (abstract) syntax tree from tokens
  - Top-down vs. bottom-up parsing
  - Typical: top-down for control flow; bottom-up for expressions
  - Respect precedence and associativity for operators
- Semantic analysis ensures meaningful program
- Some data structures are complex to implement
- Some programming languages are more difficult to parse

## Compiler Front-end – Questions

- ▶ What are typical components of a compiler front-end?
- What output does the lexer produce?
- ► How does a parser disambiguate rules?
- What is the typical way to handle operator precedence?
- Why are not all programming languages describable using CFGs?
- ▶ How to implement classes with virtual functions?

#### Code Generation for Data Processing Lecture 3: Intermediate Representations

#### Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

- So far: program parsed into AST
- $+\,$  Great for language-related checks
- + Easy to correlate with original source code (e.g., errors)
- Hard for analyses/optimizations due to high complexity
  - variable names, control flow constructs, etc.
  - Data and control flow implicit
- Highly language-specific



Question: how to optimize? Is x+1 redundant?



Question: how to optimize? Is x+1 redundant?  $\rightarrow$  hard to tell  $\Rightarrow$ 

| $x_1$                 | $\leftarrow$ | 5                     | + | 3     |
|-----------------------|--------------|-----------------------|---|-------|
| $y_1$                 | $\leftarrow$ | $x_1$                 | + | 1     |
| <i>x</i> <sub>2</sub> | $\leftarrow$ | 12                    |   |       |
| $Z_1$                 | $\leftarrow$ | <i>x</i> <sub>2</sub> | + | 1     |
| $tmp_1$               | $\leftarrow$ | $z_1$                 | — | $y_1$ |
| return                |              | $tmp_1$               |   |       |

Question: how to optimize? Is x+1 redundant?

| $x_1$                 | $\leftarrow$ | 5       | + | 3     |
|-----------------------|--------------|---------|---|-------|
| $y_1$                 | $\leftarrow$ | $x_1$   | + | 1     |
| <i>x</i> <sub>2</sub> | $\leftarrow$ | 12      |   |       |
| $z_1$                 | $\leftarrow$ | $x_2$   | + | 1     |
| $tmp_1$               | $\leftarrow$ | $Z_1$   | — | $y_1$ |
| return                |              | $tmp_1$ |   |       |

Question: how to optimize? Is x+1 redundant?  $\rightsquigarrow$  No!  $\because$ 

#### Intermediate Representations

Definitive program representation inside compiler

- During compilation, only the (current) IR is considered
- Goal: simplify analyses/transformations
  - Technically, single-step compilation is possible for, e.g., C ... but optimizations are hard without proper IRs
- Compilers *design* IRs to support frequent operations
   IR design can vary strongly between compilers
   Typically based on graphs or linear instructions (or both)

# Compiler Design: Effect of Languages - Imperative

- Step-by-step execution of program modification of state
- Close to hardware execution model
- Direct influence of result
- Tracking of state is complex
- Dynamic typing: more complexity
- Limits optimization possibilities

```
void addvec(int* a, const int* b) {
  for (unsigned i = 0; i < 4; i++)
    a[i] += b[i]; // vectorizable?
}</pre>
```

```
func:
  mov [rdi], rsi
  mov [rdi+8], rdx
  mov [rdi], 0 // redundant?
  ret
```

# Compiler Design: Effect of Languages – Declarative

- Describes execution target
- Compiler has to derive good mapping to imperative hardware
- Allows for more optimizations
- Mapping to hardware non-trivial
  - Might need more stages
  - Preserve semantic info for opt!
- Programmer has less "control"

```
let rec fac = function
| 0 | 1 -> 1
| n -> n * fac (n - 1)
```

## Graph IRs: Abstract Syntax Tree (AST)

- Code representation close to the source
- Representation of types, constants, etc. might differ
- Storage might be problematic for large inputs



# Graph IRs: Control Flow Graph (CFG)

Motivation: model control flow between different code sections

- Graph nodes represent basic blocks
  - Basic block: sequence of branch-free code (modulo exceptions)
  - Typically represented using a linear IR



function ret. type name arguments В

Idea: Keep track of current insert block while walking through AST

fn. prologue

















stmt\_while condition В



































L.....











# Build CFG from AST: Switch

## Build CFG from AST: Switch

Linear search

 $t \leftarrow \exp$ if t == 3: goto B<sub>3</sub> if t == 4: goto B<sub>4</sub> if t == 7: goto B<sub>7</sub> if t == 9: goto B<sub>9</sub> goto B<sub>D</sub>

#### Binary search

goto  $B_D$ 

```
t \leftarrow \exp
if t == 7: goto B_7
elif t > 7:
if t == 9: goto B_9
else:
if t == 3: goto B_3
if t == 4: goto B_4
```

Jump table

 $extsf{t} \leftarrow extsf{exp} \ extsf{if} \ 0 \leq extsf{t} < 10: \ extsf{goto table[t]} \ extsf{goto B}_D$ 

```
table = {

B_D, B_D, B_D, B_3,

B_4, B_D, ... }
```
# Build CFG from AST: Switch

Linear search

 $t \leftarrow \exp$ if t == 3: goto  $B_3$ if t == 4: goto  $B_4$ if t == 7: goto  $B_7$ if t == 9: goto  $B_9$ goto  $B_D$ 

+ Trivial

- Slow, lot of code

#### Binary search

```
t \leftarrow \exp
if t == 7: goto B_7
elif t > 7:
if t == 9: goto B_9
else:
if t == 3: goto B_3
if t == 4: goto B_4
goto B_D
```

+ Good: sparse values

Even more code

Jump table

```
	extsf{t} \leftarrow \exp \\ 	extsf{if} \quad 0 \leq 	extsf{t} < 10: \\ 	extsf{goto table[t]} \\ 	extsf{goto } B_D \end{array}
```

```
table = {

B_D, B_D, B_D, B_3,

B_4, B_D, ... }
```

#### + Fastest

 Table can be large, needs ind. jump

## Build CFG from AST: Break, Continue, Goto

break/continue: trivial

Keep track of target block, insert branch

goto: also trivial

- Split block at target label, if needed
- But: may lead to irreducible control flow graph

#### CFG: Formal Definition

- Flow graph: G = (N, E, s) with a digraph (N, E) and entry  $s \in N$ 
  - Each node is a basic block, *s* is the entry block
  - $(n_1, n_2) \in E$  iff  $n_2$  might be executed immediately after  $n_1$
  - All  $n \in N$  shall be reachable from s (unreachable nodes can be discarded)
  - Nodes without successors are end points

## Graph IRs: Call Graph

 Graph showing (possible) call relations between functions



# Graph IRs: Call Graph

- Graph showing (possible) call relations between functions
- Useful for interprocedural optimizations
  - Function ordering

. . .

Stack depth estimation

 $\begin{array}{c} \underset{\downarrow}{\text{main}} & & \\ \underset{parseArgs}{\downarrow} & \xrightarrow{} printf & \text{fibonacci} \\ \\ \underset{\downarrow}{\downarrow} & \underset{\forall}{\downarrow} & \underset{\forall}{\forall} \\ \\ \text{strtol} & \text{write} \end{array}$ 

## Graph IRs: Relational Algebra

- Higher-level representation of query plans
  - Explicit data flow



## Graph IRs: Relational Algebra

- Higher-level representation of query plans
  - Explicit data flow
- Allow for optimization and selection actual implementations
  - Elimination of common sub-trees
  - Joins: ordering, implementation, etc.



## Graph IRs: Relational Algebra

- Higher-level representation of query plans
  - Explicit data flow
- Allow for optimization and selection actual implementations
  - Elimination of common sub-trees
  - Joins: ordering, implementation, etc.



## Linear IRs: Stack Machines

\_

| Operands stored on a stack                                                                                                     | push 5<br>push 3                                     |
|--------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------|
| <ul> <li>Operations pop arguments from top<br/>and push result</li> <li>Typically accompanied with variable storage</li> </ul> | add<br>pop x<br>push x<br>push 1                     |
| <ul> <li>Generating IR from AST: trivial</li> <li>Often used for bytecode, e.g. Java, Python</li> </ul>                        | add<br>pop y<br>push 1:<br>pop x<br>push x<br>push 1 |

h 3 х h x h 1 у h 12 х h x h 1 add pop z

## Linear IRs: Stack Machines

| Operands stored on a stack                                      | push 5<br>push 3           |
|-----------------------------------------------------------------|----------------------------|
| Operations pop arguments from top                               | add                        |
| and push result                                                 | pop x                      |
| <ul> <li>Typically accompanied with variable storage</li> </ul> | pusn x<br>push 1<br>add    |
| Generating IR from AST: trivial                                 | pop y                      |
| Often used for bytecode, e.g. Java, Python                      | push 12<br>pop x<br>push x |
| + Compact code, easy to generate and implement                  | push 1<br>add              |
| <ul> <li>Performance, hard to analyze</li> </ul>                | pop z                      |

## Linear IRs: Register Machines

- Operands stored in registers
- Operations read and write registers
- ► Typically: infinite number of registers
- ► Typically: three-address form
  - dst = src1 op src2
- Generating IR from AST: trivialE.g., GIMPLE, eBPF, Assembly



## Example: High GIMPLE

```
int fac (int n)
                            gimple_bind < // <-- still has lexical scopes</pre>
                             int D.1950:
                             int res;
int foo(int n) {
  int res = 1:
                             gimple_assign <integer_cst, res, 1, NULL, NULL>
                             gimple_goto <<D.1947>>
  while (n) {
                             gimple label <<D.1948>>
    res *= n * n:
                             gimple_assign <mult_expr, _1, n, n, NULL>
    n -= 1:
                             gimple_assign <mult_expr, res, res, _1, NULL>
  }
                             gimple_assign <plus_expr, n, n, -1, NULL>
                             gimple_label <<D.1947>>
  return res;
                             gimple_cond <ne_expr, n, 0, <D.1948>, <D.1946>>
}
                             gimple_label <<D.1946>>
                             gimple_assign <var_decl, D.1950, res, NULL, NULL>
                             gimple_return <D.1950>
                            >
```

\$ gcc -fdump-tree-gimple-raw -c foo.c

## Example: Low GIMPLE

}

}

```
int fac (int n)
                           Ł
                             int res:
                             int D.1950;
int foo(int n) {
                             gimple_assign <integer_cst, res, 1, NULL, NULL>
  int res = 1;
                             gimple_goto <<D.1947>>
                             gimple_label <<D.1948>>
  while (n) {
                             gimple_assign <mult_expr, _1, n, n, NULL>
    res *= n * n;
                             gimple_assign <mult_expr, res, res, _1, NULL>
    n -= 1:
                             gimple_assign <plus_expr, n, n, -1, NULL>
                             gimple_label <<D.1947>>
                             gimple_cond <ne_expr, n, 0, <D.1948>, <D.1946>>
  return res:
                             gimple_label <<D.1946>>
                             gimple_assign <var_decl, D.1950, res, NULL, NULL>
                             gimple_goto <<D.1951>>
                             gimple_label <<D.1951>>
                             gimple_return <D.1950>
                           ን
```

\$ gcc -fdump-tree-lower-raw -c foo.c

#### Example: Low GIMPLE with CFG

```
int fac (int n) {
                             int res:
                             int D.1950:
                             <hh 2> ·
                             gimple_assign <integer_cst, res, 1, NULL, NULL>
                             goto <bb 4>; [INV]
int foo(int n) {
                             <bb 3> :
  int res = 1:
                             gimple_assign <mult_expr, _1, n, n, NULL>
                             gimple_assign <mult_expr, res, res, _1, NULL>
  while (n) {
                             gimple_assign <plus_expr, n, n, -1, NULL>
    res *= n * n:
                             <bb 4> :
    n = 1;
                             gimple_cond <ne_expr, n, 0, NULL, NULL>
  }
                               goto <bb 3>: [INV]
                             else
  return res;
                               goto <bb 5>; [INV]
}
                             <bb 5> :
                             gimple_assign <var_decl, D.1950, res, NULL, NULL>
                             <bb 6> :
                           gimple_label <<L3>>
                             gimple_return <D.1950>
                           }
```

\$ gcc -fdump-tree-cfg-raw -c foo.c

#### Linear IRs: Register Machines

Problem: no clear def-use information
Is x + 1 the same?
Hard to track actual values!
How to optimize?

## Linear IRs: Register Machines

Problem: no clear def-use information
Is x + 1 the same?
Hard to track actual values!
How to optimize?

 $\Rightarrow$  Disallow mutations of variables

| X       | $\leftarrow$ | 5  | +     | 3 |
|---------|--------------|----|-------|---|
| у       | $\leftarrow$ | x  | +     | 1 |
| X       | $\leftarrow$ | 12 |       |   |
| Ζ       | $\leftarrow$ | x  | +     | 1 |
| $tmp_1$ | $\leftarrow$ | Ζ  | _     | у |
| return  |              | tm | $o_1$ |   |

#### Single Static Assignment: Introduction

- Idea: disallow mutations of variables, value set in declaration
- Instead: create new variable for updated value
- SSA form: every computed value has a unique definition
   Equivalent formulation: each name describes result of one operation

#### Single Static Assignment: Introduction

- Idea: disallow mutations of variables, value set in declaration
- Instead: create new variable for updated value
- SSA form: every computed value has a unique definition
   Equivalent formulation: each name describes result of one operation

#### Single Static Assignment: Control Flow

How to handle diverging values in control flow?

entry :  $x \leftarrow \dots$ if (x > 2) goto cont  $\rightarrow$ then :  $x \leftarrow x * 2$ cont : return x

#### Single Static Assignment: Control Flow

How to handle diverging values in control flow?

entry :  $x \leftarrow \dots$ if (x > 2) goto cont  $\rightarrow$ then :  $x \leftarrow x * 2$ cont : return x  $\begin{array}{rcl} \textit{entry}: & v_1 & \leftarrow & \dots \\ & & \text{if } (v_1 > 2) \text{ goto } \textit{cont} \\ \textit{then}: & v_2 & \leftarrow & v_1 & * & 2 \\ \textit{cont}: & \text{return} & \ref{local}? \ref{local} \end{array}$ 

#### Single Static Assignment: Control Flow

- How to handle diverging values in control flow?
- **>** Solution: Φ-nodes to merge values depending on predecessor
  - Value depends on edge used to enter the block
  - > All Φ-nodes of a block execute concurrently (ordering irrelevant)

antru .

 $: v_1, then : v_2$ )

#### Example: GIMPLE in SSA form

```
int fac (int n) { int res, D.1950, _1, _6;
                             <bb 2> :
                             gimple_assign <integer_cst, res_4, 1, NULL, NULL>
                             goto <bb 4>; [INV]
                             <bb 3> :
                             gimple_assign <mult_expr, _1, n_2, n_2, NULL>
int foo(int n) {
                             gimple_assign <mult_expr, res_8, res_3, _1, NULL>
  int res = 1:
                             gimple_assign <plus_expr, n_9, n_2, -1, NULL>
                             <bb 4> :
  while (n) {
                             # gimple_phi <n_2, n_5(D)(2), n_9(3)>
    res *= n * n:
                             # gimple_phi <res_3, res_4(2), res_8(3)>
   n -= 1:
                             gimple_cond <ne_expr, n_2, 0, NULL, NULL>
  }
                               goto <bb 3>: [INV]
                             else
  return res;
                               goto <bb 5>; [INV]
}
                             <bb 5> :
                             gimple_assign <ssa_name, _6, res_3, NULL, NULL>
                             <bb 6> :
                            gimple_label <<L3>>
                             gimple_return <_6>
                            }
```

\$ gcc -fdump-tree-ssa-raw -c foo.c





| Code          |                          |                                                           |                                             |                 |                  |  |                                  | IR                                  |                                           | ` | Variable Mapping |                                       |                                  |  |
|---------------|--------------------------|-----------------------------------------------------------|---------------------------------------------|-----------------|------------------|--|----------------------------------|-------------------------------------|-------------------------------------------|---|------------------|---------------------------------------|----------------------------------|--|
| $\rightarrow$ | x<br>y<br>x<br>z<br>tmp1 | $\uparrow \uparrow \uparrow \uparrow \uparrow \downarrow$ | 5<br><i>x</i><br>12<br><i>x</i><br><i>z</i> | +++++           | 3<br>1<br>1<br>v |  | v <sub>1</sub><br>v <sub>2</sub> | $\stackrel{\leftarrow}{\leftarrow}$ | add 5, 3<br>add <i>v</i> <sub>1</sub> , 1 |   | x<br>y           | $\stackrel{\rightarrow}{\rightarrow}$ | v <sub>1</sub><br>v <sub>2</sub> |  |
|               | return                   |                                                           | tmp                                         | $\mathcal{O}_1$ | 5                |  |                                  |                                     |                                           |   |                  |                                       |                                  |  |

| Code                                                                                                                                                     |                                                   | SSA IR                                                                                                                                   | Variable Mapping                                                     |
|----------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
| $egin{array}{cccc} x & \leftarrow & \ y & \leftarrow & \  ightarrow & x & \leftarrow & \ z & \leftarrow & \ tmp_1 & \leftarrow & \ return & \end{array}$ | 5 + 3<br>x + 1<br>12<br>x + 1<br>z - y<br>$tmp_1$ | $egin{array}{rcl} v_1 &\leftarrow & 	ext{add 5, 3} \ v_2 &\leftarrow & 	ext{add } v_1, 1 \ v_3 &\leftarrow & 	ext{const 12} \end{array}$ | $\begin{array}{cccc} x & \to & v_3 & ! \\ y & \to & v_2 \end{array}$ |

| Code                                                                                                                                                     |                                                   | SSA IR                                                                                                                                                                         | Variable Mapping                                                                                  |  |  |  |
|----------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|--|--|--|
| $egin{array}{cccc} x & \leftarrow & \ y & \leftarrow & \ x & \leftarrow & \  ightarrow & z & \leftarrow & \ tmp_1 & \leftarrow & \ return & \end{array}$ | 5 + 3<br>x + 1<br>12<br>x + 1<br>z - y<br>$tmp_1$ | $egin{array}{rcl} v_1 &\leftarrow & 	ext{add 5, 3} \ v_2 &\leftarrow & 	ext{add } v_1, 1 \ v_3 &\leftarrow & 	ext{const 12} \ v_4 &\leftarrow & 	ext{add } v_3, 1 \end{array}$ | $egin{array}{ccc} x &  ightarrow & v_3 \ y &  ightarrow & v_2 \ z &  ightarrow & v_4 \end{array}$ |  |  |  |

| Code          | Code    |              |    |       |   | S | SA                    | IR           |                            | Variable Mapping |               |                       |  |
|---------------|---------|--------------|----|-------|---|---|-----------------------|--------------|----------------------------|------------------|---------------|-----------------------|--|
|               | x       | $\leftarrow$ | 5  | +     | 3 |   | $v_1$                 | $\leftarrow$ | add 5, 3                   | x                | $\rightarrow$ | <i>V</i> 3            |  |
|               | У       | $\leftarrow$ | X  | +     | 1 |   | $v_2$                 | $\leftarrow$ | add $v_1$ , $1$            | У                | $\rightarrow$ | <i>V</i> <sub>2</sub> |  |
|               | x       | $\leftarrow$ | 12 |       |   |   | <i>V</i> 3            | $\leftarrow$ | const 12                   | Z                | $\rightarrow$ | <i>V</i> 4            |  |
|               | Ζ       | $\leftarrow$ | X  | +     | 1 |   | <i>v</i> <sub>4</sub> | $\leftarrow$ | add $v_3$ , 1              | $tmp_1$          | $\rightarrow$ | <i>V</i> 5            |  |
| $\rightarrow$ | $tmp_1$ | $\leftarrow$ | Ζ  | —     | у |   | $V_5$                 | $\leftarrow$ | sub <i>v</i> 4, <i>v</i> 2 |                  |               |                       |  |
|               | return  |              | tm | $p_1$ |   |   |                       |              |                            |                  |               |                       |  |

| Code          |                                                |                                                         |                               |                  |                         | SSA                                                                    | IR                                             |                                                                                                                                                                          | Variable Mapping                |                                                                                         |                      |  |
|---------------|------------------------------------------------|---------------------------------------------------------|-------------------------------|------------------|-------------------------|------------------------------------------------------------------------|------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|-----------------------------------------------------------------------------------------|----------------------|--|
| $\rightarrow$ | x<br>y<br>x<br>z<br>tmp <sub>1</sub><br>return | $\uparrow \uparrow \uparrow \uparrow \uparrow \uparrow$ | 5<br>x<br>12<br>x<br>z<br>tmp | +<br>+<br>+<br>_ | 3<br>1<br>1<br><i>y</i> | <ul> <li>V1</li> <li>V2</li> <li>V3</li> <li>V4</li> <li>V5</li> </ul> | $\uparrow \uparrow \uparrow \uparrow \uparrow$ | add 5, 3<br>add <i>v</i> <sub>1</sub> , 1<br>const 12<br>add <i>v</i> <sub>3</sub> , 1<br>sub <i>v</i> <sub>4</sub> , <i>v</i> <sub>2</sub><br>ret <i>v</i> <sub>5</sub> | x<br>y<br>z<br>tmp <sub>1</sub> | $\begin{array}{c} \rightarrow \\ \rightarrow \\ \rightarrow \\ \rightarrow \end{array}$ | V3<br>V2<br>V4<br>V5 |  |
|               |                                                |                                                         |                               |                  |                         |                                                                        |                                                |                                                                                                                                                                          |                                 |                                                                                         |                      |  |

| Code | e                                              |                                                         |                               |                  |                         |                            | IR                                                        |                                                                                                                                                                          | Variable Mapping                |                                                                                         |                      |  |
|------|------------------------------------------------|---------------------------------------------------------|-------------------------------|------------------|-------------------------|----------------------------|-----------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|-----------------------------------------------------------------------------------------|----------------------|--|
|      | x<br>y<br>x<br>z<br>tmp <sub>1</sub><br>return | $\uparrow \uparrow \uparrow \uparrow \uparrow \uparrow$ | 5<br>x<br>12<br>x<br>z<br>tm; | +<br>+<br>+<br>- | 3<br>1<br>1<br><i>y</i> | V1<br>V2<br>V3<br>V4<br>V5 | $\uparrow \uparrow \uparrow \uparrow \uparrow \downarrow$ | add 5, 3<br>add <i>v</i> <sub>1</sub> , 1<br>const 12<br>add <i>v</i> <sub>3</sub> , 1<br>sub <i>v</i> <sub>4</sub> , <i>v</i> <sub>2</sub><br>ret <i>v</i> <sub>5</sub> | x<br>y<br>z<br>tmp <sub>1</sub> | $\begin{array}{c} \rightarrow \\ \rightarrow \\ \rightarrow \\ \rightarrow \end{array}$ | V3<br>V2<br>V4<br>V5 |  |
|      |                                                |                                                         |                               |                  |                         |                            |                                                           |                                                                                                                                                                          |                                 |                                                                                         |                      |  |

#### SSA Construction – Across Blocks

- SSA construction with control flow is non-trivial
- ▶ Key problem: find value for variable in predecessor
- Naive approach: Φ-nodes for all variables everywhere
  - Create empty Φ-nodes for variables, populate variable mapping
  - Fill blocks (as on last slide)
  - Fill Φ-nodes with last value of variable in predecessor

#### SSA Construction – Across Blocks

- SSA construction with control flow is non-trivial
- ▶ Key problem: find value for variable in predecessor

Naive approach: Φ-nodes for all variables everywhere

- Create empty Φ-nodes for variables, populate variable mapping
- Fill blocks (as on last slide)
- > Fill Φ-nodes with last value of variable in predecessor
- Why is this a bad idea?

 $\Rightarrow$  don't do this!

• Extremely inefficient, code size explosion, many dead  $\Phi$ 

## SSA Construction – Across Blocks ("simple"<sup>4</sup>)

► Key problem: find value in predecessor

- Idea: seal block once all direct predecessors are known
  - For acyclic constructs: trivial
  - ▶ For loops: seal header once loop block is generated
- Current block not sealed: add Φ-node, fill on sealing
- Single predecessor: recursively query that
- Multiple preds.: add Φ-node, fill now

SSA Construction – Example

func  $foo(v_1)$ 

```
int foo(int n) {
    int res = 1;
    while (n) {
        res *= n * n;
        n -= 1;
    }
    return res;
}
```

SSA Construction – Example

 $\begin{array}{ll} \mbox{func} & foo \; (v_1) \\ \mbox{entry:} & \mbox{sealed; varmap: } n \rightarrow v_1 \end{array}$ 

```
int foo(int n) {
    int res = 1;
    while (n) {
        res *= n * n;
        n -= 1;
    }
    return res;
}
```

SSA Construction – Example

```
\begin{array}{ll} & \text{func } foo\left(v_{1}\right) \\ \text{entry:} & \text{sealed; varmap: } n \rightarrow v_{1}, \, \texttt{res} \rightarrow v_{2} \\ & v_{2} \leftarrow 1 \end{array}
```

```
int foo(int n) {
    int res = 1;
    while (n) {
        res *= n * n;
        n -= 1;
    }
    return res;
}
```
}

}

```
func foo(v_1)
                                   entry:
                                              sealed; varmap: n \rightarrow v_1, res\rightarrow v_2
                                               v_2 \leftarrow 1
                                 header: NOT sealed; varmap: \emptyset
int foo(int n) {
  int res = 1;
  while (n) {
                                    body:
                                              NOT sealed; varmap: \emptyset
    res *= n * n;
    n -= 1:
  return res;
```

int foo(int n) {

int res = 1;
while (n) {

n -= 1:

return res;

}

}

res \*= n \* n;

func  $foo(v_1)$ entry: sealed; varmap:  $n \rightarrow v_1$ , res $\rightarrow v_2$  $v_2 \leftarrow 1$ header: NOT sealed: varmap:  $\emptyset$  $v_3 \leftarrow \text{equal } ???, 0$ body: NOT sealed; varmap:  $\emptyset$ 

int foo(int n) {
 int res = 1;
 while (n) {
 res \*= n \* n;
 n -= 1;

return res;

}

}

|         | func $foo(v_1)$                                               |
|---------|---------------------------------------------------------------|
| entry:  | sealed; varmap: n $ ightarrow v_1$ , res $ ightarrow v_2$     |
|         | $v_2 \leftarrow 1$                                            |
| header: | NOT sealed; varmap: $\mathtt{n}  ightarrow \phi_{\mathtt{l}}$ |
|         | $\phi_{1} \leftarrow \phi$ incomplete, for <code>n</code>     |
|         | $v_{3} \leftarrow 	ext{equal } \phi_{1}, 	ext{ 0}$            |
| body:   | NOT sealed; varmap: $\emptyset$                               |
|         |                                                               |
|         |                                                               |

int foo(int n) {
 int res = 1;
 while (n) {
 res \*= n \* n;
 n -= 1;

return res;

}

}

|         | func $foo(v_1)$                                                                                                                                          |
|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| entry:  | sealed; varmap: $n \rightarrow v_1$ , res $\rightarrow v_2$                                                                                              |
|         | $v_2 \leftarrow 1$                                                                                                                                       |
| header: | NOT sealed; varmap: $n \rightarrow \phi_1$<br>$\phi_1 \leftarrow \phi$ incomplete, for n<br>$v_3 \leftarrow$ equal $\phi_1$ , 0<br>br $v_3$ , cont, body |
| body:   | NOT sealed; varmap: $\emptyset$                                                                                                                          |
|         |                                                                                                                                                          |

int foo(int n) {
 int res = 1;
 while (n) {
 res \*= n \* n;
 n -= 1;

return res;

}

}

|         | func $foo(v_1)$                                                                 |
|---------|---------------------------------------------------------------------------------|
| entry:  | sealed; varmap: $\mathtt{n} { ightarrow} v_1$ , $\mathtt{res} { ightarrow} v_2$ |
|         | $v_2 \leftarrow 1$                                                              |
| header: | NOT sealed; varmap: $\mathtt{n}  ightarrow \phi_{\mathtt{l}}$                   |
|         | $\phi_{1} \leftarrow \phi$ incomplete, for <code>n</code>                       |
|         | $v_{3} \leftarrow 	ext{equal } \phi_{1}, 	ext{ 0}$                              |
|         | br $v_3$ , cont, body                                                           |
| body:   | sealed; varmap: $\emptyset$                                                     |
|         |                                                                                 |
|         |                                                                                 |
|         |                                                                                 |

|         | func $foo(v_1)$                                                                                       |
|---------|-------------------------------------------------------------------------------------------------------|
| entry:  | sealed; varmap: $\mathtt{n} { ightarrow} v_{\mathtt{l}}$ , $\mathtt{res} { ightarrow} v_{\mathtt{2}}$ |
|         | $v_2 \leftarrow 1$                                                                                    |
| header: | NOT sealed; varmap: $n \rightarrow \phi_1$                                                            |
|         | $\phi_1 \leftarrow \phi$ incomplete, for <code>n</code>                                               |
|         | $v_{3} \leftarrow 	ext{equal } \phi_{1}, 	ext{ 0}$                                                    |
|         | br $v_3$ , cont, body                                                                                 |
| body:   | sealed; varmap: Ø                                                                                     |
|         | $v_4 \leftarrow \texttt{mul} ???, ???$                                                                |
|         |                                                                                                       |
|         |                                                                                                       |

```
int foo(int n) {
    int res = 1;
    while (n) {
        res *= n * n;
        n -= 1;
    }
    return res;
}
```

cont: NOT sealed; varmap:  $\emptyset$ 

int foo(int n) {
 int res = 1;
 while (n) {
 res \*= n \* n;
 n -= 1;

return res;

}

}

|         | func $foo(v_1)$                                           |
|---------|-----------------------------------------------------------|
| entry:  | sealed; varmap: n $ ightarrow v_1$ , res $ ightarrow v_2$ |
|         | $v_2 \leftarrow 1$                                        |
| header: | NOT sealed; varmap: $n \rightarrow \phi_1$                |
|         | $\phi_{1} \leftarrow \phi$ incomplete, for <code>n</code> |
|         | $v_{3} \leftarrow 	ext{equal } \phi_{1}, 	ext{ 0}$        |
|         | br $v_3$ , cont, body                                     |
| body:   | sealed; varmap: $\mathtt{n}{	o}\phi_{\mathtt{1}}$         |
|         | $v_4 \gets \texttt{mul} \ \phi_1$ , $\phi_1$              |
|         |                                                           |
|         |                                                           |

int foo(int n) {
 int res = 1;
 while (n) {
 res \*= n \* n;
 n -= 1;

return res;

}

}

| entry:  | func foo( $v_1$ )<br>sealed; varmap: $n \rightarrow v_1$ , res $\rightarrow v_2$<br>$v_2 \leftarrow 1$                                                   |
|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| header: | NOT sealed; varmap: $n \rightarrow \phi_1$<br>$\phi_1 \leftarrow \phi$ incomplete, for n<br>$v_3 \leftarrow$ equal $\phi_1$ , 0<br>br $v_3$ , cont, body |
| body:   | sealed; varmap: $n \rightarrow \phi_1$<br>$v_4 \leftarrow mul \phi_1, \phi_1$<br>$v_5 \leftarrow mul ???, v_4$                                           |

int foo(int n) {

int res = 1:

res \*= n \* n; n -= 1:

while (n) {

return res:

}

}

func  $foo(v_1)$ entry: sealed; varmap:  $n \rightarrow v_1$ , res $\rightarrow v_2$  $v_2 \leftarrow 1$ header: NOT sealed: varmap:  $n \rightarrow \phi_1$ . res $\rightarrow \phi_2$  $\phi_1 \leftarrow \phi$  incomplete. for n  $\phi_2 \leftarrow \phi$  incomplete, for res  $v_3 \leftarrow \text{equal } \phi_1, 0$ br  $v_3$ , cont, body bodv: sealed: varmap:  $n \rightarrow \phi_1$ , res $\rightarrow v_5$  $v_4 \leftarrow \text{mul } \phi_1, \phi_1$  $v_5 \leftarrow \text{mul } \phi_2, v_4$ 

**cont:** NOT sealed; varmap:  $\emptyset$ 

int foo(int n) {

int res = 1:

res \*= n \* n; n -= 1:

while (n) {

return res:

}

}

func  $foo(v_1)$ entry: sealed; varmap:  $n \rightarrow v_1$ , res $\rightarrow v_2$  $v_2 \leftarrow 1$ header: NOT sealed: varmap:  $n \rightarrow \phi_1$ . res $\rightarrow \phi_2$  $\phi_1 \leftarrow \phi$  incomplete. for n  $\phi_2 \leftarrow \phi$  incomplete, for res  $v_3 \leftarrow \text{equal } \phi_1, 0$ br  $v_3$ , cont, body bodv: sealed: varmap:  $n \rightarrow v_6$ , res $\rightarrow v_5$  $v_4 \leftarrow \text{mul } \phi_1, \phi_1$  $v_5 \leftarrow \text{mul } \phi_2, v_4$  $v_6 \leftarrow \text{sub } \phi_1$ . 1

**cont:** NOT sealed; varmap:  $\emptyset$ 

int foo(int n) {

int res = 1:

res \*= n \* n; n -= 1:

while (n) {

return res:

}

}

func  $foo(v_1)$ entry: sealed; varmap:  $n \rightarrow v_1$ , res $\rightarrow v_2$  $v_2 \leftarrow 1$ header: NOT sealed: varmap:  $n \rightarrow \phi_1$ , res $\rightarrow \phi_2$  $\phi_1 \leftarrow \phi$  incomplete, for n  $\phi_2 \leftarrow \phi$  incomplete, for res  $v_3 \leftarrow \text{equal } \phi_1, 0$ br  $v_3$ , cont, body bodv: sealed: varmap:  $n \rightarrow v_6$ , res $\rightarrow v_5$  $v_4 \leftarrow \text{mul } \phi_1, \phi_1$  $v_5 \leftarrow \text{mul } \phi_2, v_4$  $v_6 \leftarrow \text{sub } \phi_1$ . 1 br header

int foo(int n) {

int res = 1:

res \*= n \* n; n -= 1:

while (n) {

return res:

}

}

func  $foo(v_1)$ entry: sealed; varmap:  $n \rightarrow v_1$ , res $\rightarrow v_2$  $v_2 \leftarrow 1$ header: sealed; varmap:  $n \rightarrow \phi_1$ , res $\rightarrow \phi_2$  $\phi_1 \leftarrow \phi$  incomplete. for n  $\phi_2 \leftarrow \phi$  incomplete, for res  $v_3 \leftarrow \text{equal } \phi_1, 0$ br  $v_3$ , cont, body bodv: sealed: varmap:  $n \rightarrow v_6$ , res $\rightarrow v_5$  $v_4 \leftarrow \text{mul } \phi_1, \phi_1$  $v_5 \leftarrow \text{mul } \phi_2, v_4$  $v_6 \leftarrow \text{sub } \phi_1$ . 1 br header

func  $foo(v_1)$ entry: sealed; varmap:  $n \rightarrow v_1$ , res $\rightarrow v_2$  $v_2 \leftarrow 1$ header: sealed: varmap:  $n \rightarrow \phi_1$ , res $\rightarrow \phi_2$  $\phi_1 \leftarrow \phi(\text{entry: } v_1, \text{ body: } v_6)$  $\phi_2 \leftarrow \phi(\text{entry: } v_2, \text{ body: } v_5)$  $v_3 \leftarrow \text{equal } \phi_1, 0$ br  $v_3$ , cont, body bodv: sealed: varmap:  $n \rightarrow v_6$ , res $\rightarrow v_5$  $v_4 \leftarrow \text{mul } \phi_1, \phi_1$  $v_5 \leftarrow \text{mul } \phi_2, v_4$  $v_6 \leftarrow \text{sub } \phi_1$ . 1 br header

**cont:** NOT sealed; varmap:  $\emptyset$ 

int res = 1; while (n) { res \*= n \* n; n -= 1; } return res;

}

int foo(int n) {

func  $foo(v_1)$ entry: sealed; varmap:  $n \rightarrow v_1$ , res $\rightarrow v_2$  $v_2 \leftarrow 1$ header: sealed: varmap:  $n \rightarrow \phi_1$ , res $\rightarrow \phi_2$  $\phi_1 \leftarrow \phi(\text{entry: } v_1, \text{ body: } v_6)$  $\phi_2 \leftarrow \phi(\text{entry: } v_2, \text{ body: } v_5)$  $v_3 \leftarrow \text{equal } \phi_1, 0$ br  $v_3$ , cont, body bodv: sealed: varmap:  $n \rightarrow v_6$ , res $\rightarrow v_5$  $v_4 \leftarrow \text{mul } \phi_1, \phi_1$  $v_5 \leftarrow \text{mul } \phi_2, v_4$  $v_6 \leftarrow \text{sub } \phi_1$ . 1 br header

**cont:** sealed; varmap:  $\emptyset$ 

int res = 1; while (n) { res \*= n \* n; n -= 1; } return res;

}

int foo(int n) {

func  $foo(v_1)$ entry: sealed; varmap:  $n \rightarrow v_1$ , res $\rightarrow v_2$  $v_2 \leftarrow 1$ header: sealed: varmap:  $n \rightarrow \phi_1$ , res $\rightarrow \phi_2$  $\phi_1 \leftarrow \phi(\text{entry: } v_1, \text{ body: } v_6)$  $\phi_2 \leftarrow \phi(\text{entry: } v_2, \text{ body: } v_5)$  $v_3 \leftarrow \text{equal } \phi_1, 0$ br  $v_3$ , cont, body bodv: sealed: varmap:  $n \rightarrow v_6$ , res $\rightarrow v_5$  $v_4 \leftarrow \text{mul } \phi_1, \phi_1$  $v_5 \leftarrow \text{mul } \phi_2, v_4$  $v_6 \leftarrow \text{sub } \phi_1$ . 1 br header cont: sealed; varmap: res $\rightarrow \phi_2$ ret  $\phi_2$ 

int foo(int n) {
 int res = 1;
 while (n) {
 res \*= n \* n;
 n -= 1;
 }
 return res;
}

### SSA Construction – Pruned/Minimal Form

- Resulting SSA is *pruned* all  $\phi$  are used
- **>** But not *minimal*  $\phi$  nodes might have single, unique value

<sup>5</sup>M Braun et al. "Simple and efficient construction of static single assignment form". In: CC. 2013, pp. 102–122.

<sup>6</sup>R Cytron et al. "Efficiently computing static single assignment form and the control dependence graph". In: TOPLAS 13.4 (1991), pp. 451–490.

### SSA Construction – Pruned/Minimal Form

- Resulting SSA is *pruned* all  $\phi$  are used
- But not minimal  $\phi$  nodes might have single, unique value
- When filling  $\phi$ , check that multiple real values exist
  - Otherwise: replace  $\phi$  with the single value
  - $\blacktriangleright$  On replacement, update all  $\phi$  using this value, they might be trivial now, too
- Sufficient?

<sup>5</sup>M Braun et al. "Simple and efficient construction of static single assignment form". In: CC. 2013, pp. 102–122.

<sup>6</sup>R Cytron et al. "Efficiently computing static single assignment form and the control dependence graph". In: TOPLAS 13.4 (1991), pp. 451–490.

### SSA Construction – Pruned/Minimal Form

- Resulting SSA is *pruned* all  $\phi$  are used
- **>** But not *minimal*  $\phi$  nodes might have single, unique value
- When filling  $\phi$ , check that multiple real values exist
  - Otherwise: replace  $\phi$  with the single value
  - > On replacement, update all  $\phi$  using this value, they might be trivial now, too
- Sufficient? Not for irreducible CFG
  - Needs more complex algorithms<sup>5</sup> or different construction method<sup>6</sup>

#### AD IN2053 "Program Optimization" covers this more formally

<sup>5</sup>M Braun et al. "Simple and efficient construction of static single assignment form". In: CC. 2013, pp. 102–122. @.

<sup>6</sup>R Cytron et al. "Efficiently computing static single assignment form and the control dependence graph". In: TOPLAS 13.4 (1991), pp. 451–490.

### SSA: Implementation

Value is often just a pointer to instruction

 $\blacktriangleright \phi$  nodes placed at beginning of block

▶ They execute "concurrently" and on the edges, after all

### SSA: Implementation

- Value is often just a pointer to instruction
- $\blacktriangleright \phi$  nodes placed at beginning of block
  - ▶ They execute "concurrently" and on the edges, after all
- > Variable number of operands required for  $\phi$  nodes
- Storage format for instructions and basic blocks
  - Consecutive in memory: hard to modify/traverse
  - Array of pointers:  $\mathcal{O}(n)$  for a single insertion...
  - Linked List: easy to insert, but pointer overhead

# Is SSA a graph IR?

## Is SSA a graph IR?

Only if instructions have no side effects, consider load, store, call, ...

These can be solved using explicit dependencies as SSA values, e.g. for memory

### Intermediate Representations – Summary

- An IR is an internal representation of a program
- Main goal: simplify analyses and transformations
- ► IRs typically based on graphs or linear instructions
- Graph IRs: AST, Control Flow Graph, Relational Algebra
- Linear IRs: stack machines, register machines, SSA
- Single Static Assignment makes data flow explicit
- SSA is extremely popular, although non-trivial to construct

### Intermediate Representations – Questions

- Who designs an IR? What are design criteria?
- Why is an AST not suited for program optimization?
- ▶ How to convert an AST to another IR?
- What are the benefits/drawbacks of stack/register machines?
- > What benefits does SSA offer over a normal register machine?
- How do  $\phi$ -instructions differ from normal instructions?

Code Generation for Data Processing Lecture 4: LLVM and IR Design

Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

## LLVM<sup>7</sup>

#### LLVM "Core" Library

- Optimizer and compiler back-end
- "Set of compiler components"
  - ► IRs: LLVM-IR, SelDag, MIR
  - Analyses and Optimizations
  - Code generation back-ends
- Started from Chris Lattner's master's thesis
- Used for C, C++, Swift, D, Julia, Rust, Haskell, ...

<sup>&</sup>lt;sup>7</sup>C Lattner and V Adve. "LLVM: A compilation framework for lifelong program analysis & transformation". In: CGO. 2004, pp. 75–86. 🚱.

## LLVM<sup>7</sup>

### LLVM "Core" Library

- Optimizer and compiler back-end
- "Set of compiler components"
  - ► IRs: LLVM-IR, SelDag, MIR
  - Analyses and Optimizations
  - Code generation back-ends
- Started from Chris Lattner's master's thesis
- Used for C, C++, Swift, D, Julia, Rust, Haskell, ...

#### LLVM Project

- Umbrella for several projects related to compilers/toolchain
  - LLVM Core
  - Clang: C/C++ front-end for LLVM
  - libc++, compiler-rt: runtime support
  - LLDB: debugger
  - LLD: linker
  - MLIR: experimental IR framework

<sup>7</sup>C Lattner and V Adve. "LLVM: A compilation framework for lifelong program analysis & transformation". In: CGO. 2004, pp. 75–86. @.

### LLVM: Overview



Independent front-end derives LLVM-IR, LLVM does opt. and code gen.
 LTO: durant LLVM ID into abject file, antimize at link time.

LTO: dump LLVM-IR into object file, optimize at link-time

## LLVM-IR: Overview

- SSA-based IR, representations textual, bitcode, in-memory
- Hierarchical structure
  - Module
  - Functions, global variables
  - Basic blocks
  - Instructions

Strongly/strictly typed

```
define dso_local i32 @foo(i32 %0) {
  %2 = icmp eq i32 %0, 0
  br i1 %2, label %10, label %3
```

```
3: ; preds = %1, %3
    %4 = phi i32 [ %7, %3 ], [ 1, %1 ]
    %5 = phi i32 [ %8, %3 ], [ %0, %1 ]
    %6 = mul nsw i32 %5, %5
    %7 = mul nsw i32 %6, %4
    %8 = add nsw i32 %5, -1
    %9 = icmp eq i32 %8, 0
    br i1 %9, label %10, label %3
10: ; preds = %3, %1
```

}

```
%11 = phi i32 [ 1, %1 ], [ %7, %3 ]
ret i32 %11
```

## LLVM-IR: Data types

- First class types:
  - ▶ i<N> arbitrary bit width integer, e.g. i1, i25, i1942652
  - ptr/ptr addrspace(1) pointer with optional address space
  - float/double/half/bfloat/fp128/...
  - <N x ty> vector type, e.g. <4 x i32>
- Aggregate types:
  - [N x ty] constant-size array type, e.g. [32 x float]
  - { ty, ... } struct (can be packed/opaque), e.g. {i32, float}

#### Other types:

- ▶ ty (ty, ...) function type, e.g. {i32, i32} (ptr, ...)
- void
- label/token/metadata

### LLVM-IR: Modules

- ▶ Top-level entity, one compilation unit akin to C/C++
- Contains global values, specified with linkage type

```
    Global variable declarations/definitions
        @externInt = external global i32, align 4
        @globVar = global i32 4, align 4
        @staticPtr = internal global ptr null, align 8

    Function declarations/definitions
        declare i32 @readPtr(ptr)
        define i32 @return1() {
            ret i32 1
```

}

Global named metadata (discarded during compilation)

Functions definitions contain all code, not nestable

- Single return type (or void), multiple parameters, list of basic blocks
   No basic blocks ⇒ function declaration
- Specifiers for callconv, section name, other attributes
  - E.g.: noinline/alwaysinline, noreturn, readonly
- Parameter and return can also have attributes
  - E.g.: noalias, nonnull, sret(<ty>)

### LLVM-IR: Basic Block

#### Sequence of instructions

- $\blacktriangleright \phi$  nodes come first
- Regular instructions come next
- Must end with a terminator
- First block in function is entry block
   Entry block cannot be branch target

### LLVM-IR: Instructions – Control Flow and Terminators

Terminators end a block/modify control flow

- ret <ty> <val>/ret void
- br label <dest>/br i1 <cond>, label <then>, label <else>
- switch/indirectbr
- unreachable
- Few others for exception handling
- Not a terminator: call

### LLVM-IR: Instructions – Arithmetic-Logical

- add/sub/mul/udiv/sdiv/urem/srem
  - Arithmetic uses two's complement
  - Division corner cases are undefined behavior
- fneg/fadd/fsub/fmul/fdiv/frem
- shl/lshr/ashr/and/or/xor
  - Out-of-range shifts have an undefined result
- icmp <pred>/fcmp <pred>/select <cond>, <then>, <else>
- trunc/zext/sext/fptrunc/fpext/fptoui/fptosi/uitofp/sitofp
- bitcast
  - Cast between equi-sized datatypes by reinterpreting bits

### LLVM-IR: Instructions – Memory and Pointer

- alloca <ty> allocate addressable stack slot
- load <ty>, ptr <ptr>/store <ty> <val>, ptr <ptr>
  - May be volatile (e.g., MMIO) and/or atomic
- cmpxchg/atomicrmw similar to hardware operations
- ptrtoint/inttoptr
- getelementptr address computation on ptr/structs/arrays




Equivalent in C: &((int\*) p)[3]



Equivalent in C: &((int\*) p)[3]



Also works with nested structs and arrays

undef – unspecified value, compiler may choose any value

▶ %b = add i32 %a, i32 undef  $\rightarrow$  i32 undef

 $\blacktriangleright$  %c = and i32 %a, i32 undef  $\rightarrow$ 

undef – unspecified value, compiler may choose any value

- ▶ %b = add i32 %a, i32 undef  $\rightarrow$  i32 undef
- $\blacktriangleright$  %c = and i32 %a, i32 undef  $\rightarrow$  i32 %a
- $\blacktriangleright$  %d = xor i32 %b, i32 %b  $\rightarrow$

undef – unspecified value, compiler may choose any value

- ▶ %b = add i32 %a, i32 undef  $\rightarrow$  i32 undef
- $\blacktriangleright$  %c = and i32 %a, i32 undef  $\rightarrow$  i32 %a
- ▶ %d = xor i32 %b, i32 %b  $\rightarrow$  i32 undef
- $\blacktriangleright$  br i1 undef, label %p, label %q  $\rightarrow$

undef – unspecified value, compiler may choose any value

- ▶ %b = add i32 %a, i32 undef  $\rightarrow$  i32 undef
- $\blacktriangleright$  %c = and i32 %a, i32 undef  $\rightarrow$  i32 %a
- ▶ %d = xor i32 %b, i32 %b  $\rightarrow$  i32 undef
- ▶ br i1 undef, label %p, label %q  $\rightarrow$  undefined behavior

undef – unspecified value, compiler may choose any value

- ▶ %b = add i32 %a, i32 undef  $\rightarrow$  i32 undef
- $\blacktriangleright$  %c = and i32 %a, i32 undef  $\rightarrow$  i32 %a
- ▶ %d = xor i32 %b, i32 %b  $\rightarrow$  i32 undef
- ▶ br i1 undef, label %p, label %q  $\rightarrow$  undefined behavior

poison – result of erroneous operations

- Delay undefined behavior on illegal operation until actually relevant
- Allows to speculatively "execute" instructions in IR
- ▶ %d = shl i32 %b, i32 34  $\rightarrow$  i32 poison

## LLVM-IR: Intrinsics

Not all operations provided as instructions

- Intrinsic functions: special functions with defined semantics
  - Replaced during compilation, e.g., with instruction or lib call
- Benefit: no changes needed for parser/bitcode/... on addition

#### Examples:

- declare iN @llvm.ctpop.iN(iN <src>)
- declare {iN, i1} @llvm.sadd.with.overflow.iN(iN %a, iN %b)
- memcpy, memset, sqrt, returnaddress, ...

#### clang can emit LLVM-IR bitcode clang -0 -emit-llvm -c test.c -o test.bc

clang can emit LLVM-IR bitcode
 clang -0 -emit-llvm -c test.c -o test.bc

Ilvm-dis disassembles bitcode to textual LLVM-IR clang -0 -emit-llvm -c test.c -o - | llvm-dis

clang can emit LLVM-IR bitcode clang -0 -emit-llvm -c test.c -o test.bc

Ilvm-dis disassembles bitcode to textual LLVM-IR clang -0 -emit-llvm -c test.c -o - | llvm-dis

Ilc compiles LLVM-IR (textual or bitcode) to assembly clang -0 -emit-llvm -c test.c -o - | llc clang -0 -emit-llvm -c test.c -o - | llvm-dis | llc

Example Listings omitted - they would span several slides

```
define dso_local <4 x float> @foo2(<4 x float> %0, <4 x float> %1) {
    %3 = alloca <4 x float>, align 16
    %4 = alloca <4 x float>, align 16
    store <4 x float> %0, ptr %3, align 16
    store <4 x float> %1, ptr %4, align 16
    %5 = load <4 x float>, ptr %3, align 16
    %6 = load <4 x float>, ptr %4, align 16
    %7 = fadd <4 x float> %5, %6
    ret <4 x float> %7
}
```

```
define dso_local i32 @foo3(i32 %0, i32 %1) {
  %3 = tail call { i32, i1 } @llvm.smul.with.overflow.i32(i32 %0, i32 %1)
  %4 = extractvalue { i32, i1 } %3, 1
  %5 = extractvalue { i32, i1 } %3, 0
  %6 = select i1 %4, i32 -2147483648, i32 %5
  ret i32 %6
}
```

## LLVM-IR: Example

```
define dso_local i32 @sw(i32 %0) {
  switch i32 %0, label %4 [
   i32 4, label %5
   i32 5, label %2
   i32 8, label %3
   i32 100, label %5
  ]
2: ; preds = \%1
 br label %5
3: ; preds = \%1
 br label %5
4: ; preds = \%1
 br label %5
5: ; preds = \%1, \%1, \%4, \%3, \%2
 %6 = phi i32 [ %0, %4 ], [ 9, %3 ], [ 32, %2 ], [ 12, %1 ], [ 12, %1 ]
 ret i32 %6
}
```

#### LLVM-IR: Example

```
@switch.table.sw = private unnamed_addr constant [7 x i32] [i32 12, i32 32, i32 12,
                                        i32 12, i32 9, i32 12, i32 12], align 4
define dso_local i32 @sw(i32 %0) {
 \%2 = add i 32 \%0, -4
 %3 = icmp ult i32 \%2, 7
 br i1 %3, label %4, label %13
4: : preds = %1
 \%5 = trunc \ i32 \ \%2 \ to \ i8
 \%6 = 1shr i8 83, \%5
 \%7 = \text{and } i8 \%6. 1
 %8 = icmp eq i8 \%7, 0
 br i1 %8, label %13, label %9
9: ; preds = \%4
 %10 = sext i 32 \%2 to i 64
 %11 = getelementptr inbounds [7 x i32], ptr @switch.table.sw, i64 0, i64 %10
 %12 = load i32, ptr %11, align 4
 br label %13
13: ; preds = \%1, \%4, \%9
 %14 = phi i 32 [\%12, \%9], [\%0, \%4], [\%0, \%1]
 ret i32 %14
}
```

#### LLVM-IR API

- LLVM offers two APIs: C++ and C
  - C++ is the full API, exposing nearly all internals
  - C API is more limited, but more stable
- Nearly all major versions have breaking changes
- Some support for multi-threading:
  - All modules/types/... associated with an LLVMContext
  - Different contexts may be used in different threads

#### LLVM-IR C++ API: Basic Example

}

```
#include <llvm/IR/IRBuilder.h>
int main(void) {
    llvm::LLVMContext ctx;
    auto modUP = std::make_unique<llvm::Module>("mod", ctx);
```

```
llvm::IRBuilder<> irb(entryBB);
llvm::Value* add = irb.CreateAdd(fn->getArg(0), irb.getInt64(1));
irb.CreateRet(add);
modUP->print(llvm::outs(), nullptr);
return 0;
```

```
LLVM-IR API: Almost Everything is a Value... (excerpt)
                                                             \begin{array}{c} \mathsf{Value} & \overbrace{\mathsf{PoisonValue}}^{\mathsf{Argument}} \\ \mathsf{BasicBlock} \\ \mathsf{User} & \overbrace{\mathsf{Constant}}^{\mathsf{ConstantData}} \\ & \overbrace{\mathsf{ConstantPointerNull}}^{\mathsf{ConstantPointerNull}} \\ & \overbrace{\mathsf{UndefValue}}^{\mathsf{Value}} \\ & \overbrace{\mathsf{PoisonValue}}^{\mathsf{Value}} \\ & \overbrace{\mathsf{ConstantPointerNull}}^{\mathsf{Value}} \\ & \overbrace{\mathsf{Con
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        \stackrel{\downarrow}{\rightarrow} ConstantExpr \longrightarrow \cdots \\ GlobalValue &  GlobalAlias \\ GlobalObject &  GlobalVariable \\ GlobalVariable \\
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               \overrightarrow{\mathsf{PHINode}} \\ \overrightarrow{\mathsf{BranchInst}} \\ \overrightarrow{\mathsf{BinaryOperator}} \\ \overrightarrow{\mathsf{CallBase}} \rightarrow \overrightarrow{\mathsf{CallInst}} \rightarrow \overrightarrow{\mathsf{IntrinsicInst}} \rightarrow \cdots \\ \overrightarrow{\mathsf{StoreInst}} \\ \overrightarrow{\mathsf{UnaryInstruction}} \rightarrow \overrightarrow{\mathsf{AllocaInst}} \\ \overrightarrow{\mathsf{CastInst}} \\ \overrightarrow{\mathsf{LoadInst}} \\ \cdots \\ \overrightarrow{\mathsf{CastInst}} \\ \overrightarrow{\mathsf{LoadInst}} \\ \cdots \\ \overrightarrow{\mathsf{CastInst}} \\ \overrightarrow{\mathsf{LoadInst}} \\ \cdots \\ \overrightarrow{\mathsf{CastInst}} \\
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ↓ Instruction 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             ↓...
```

## LLVM-IR API: Programming Environment

LLVM implements custom RTTI

isa<>, cast<>, dyn\_cast<>

LLVM implements a multitude of specialized data structures

- E.g.: SmallVector<T, N> to keep N elements stack-allocated
- Custom vectors, sets, maps; see manual<sup>8</sup>

Preferably uses ArrayRef, StringRef, Twine for references

- LLVM implements custom streams instead of std streams
  - outs(), errs(), dbgs()

## LLVM-IR API: Use Tracking

#### Values track their users

```
llvm::Value* v = /* ... */;
for (llvm::User* u : v->users())
    if (auto i = llvm::dyn_cast<llvm::Instruction>(u))
        // ...
```

- Simplifies implementation of analyses
- ► Allows for easy replacement:
  - inst->replaceAllUsesWith(replVal);

# LLVM IR Implementation: Value/User

| llvm::Value | Туре  | UseList | subclassID | flags    |
|-------------|-------|---------|------------|----------|
|             | Type* | Use*    | unsigned   | unsigned |

PHINode additionally stores *n* BasicBlock\* after the operands, but aren't users of blocks.

# LLVM IR Implementation: Value/User



PHINode additionally stores n BasicBlock\* after the operands, but aren't users of blocks.

# LLVM IR Implementation: Value/User



PHINode additionally stores n BasicBlock\* after the operands, but aren't users of blocks.

## LLVM IR Implementation: Use



Instruction and BasicBlock have pointers to parent and next/prev

- Linked list updated on changes and used for iteration
- Instructions have cached order (integer) for fast "comes before"

Instruction and BasicBlock have pointers to parent and next/prev

- Linked list updated on changes and used for iteration
- Instructions have cached order (integer) for fast "comes before"

BasicBlock successors: blocks used by terminator

Instruction and BasicBlock have pointers to parent and next/prev

- Linked list updated on changes and used for iteration
- Instructions have cached order (integer) for fast "comes before"
- BasicBlock successors: blocks used by terminator
- BasicBlock predecessors:
  - Iterate over users of block these are terminators (and blockaddress)
  - Ignore non-terminators, parent of using terminator is predecessor
  - ► Same predecessor might be duplicated (~> getUniquePredecessor())

Instruction and BasicBlock have pointers to parent and next/prev

- Linked list updated on changes and used for iteration
- Instructions have cached order (integer) for fast "comes before"
- BasicBlock successors: blocks used by terminator
- BasicBlock predecessors:
  - Iterate over users of block these are terminators (and blockaddress)
  - Ignore non-terminators, parent of using terminator is predecessor
  - ► Same predecessor might be duplicated (~> getUniquePredecessor())



# LLVM and IR Design

# LLVM and IR Design

- LLVM provides a decent general-purpose IR for compilers
- But: not ideal for all purposes
  - ▶ High-level optimizations difficult, e.g. due to lost semantics
  - Several low-level operations only exposed as intrinsics
  - ► IR rather complex, high code complexity
  - High compilation times

# LLVM and IR Design

- LLVM provides a decent general-purpose IR for compilers
- But: not ideal for all purposes
  - ▶ High-level optimizations difficult, e.g. due to lost semantics
  - Several low-level operations only exposed as intrinsics
  - ▶ IR rather complex, high code complexity
  - High compilation times
- Thus: heavy trend towards custom IRs


#### Define purpose!

Structure: SSA vs. something else; control flow

- Control flow: basic blocks/CFG vs. structured control flow
- Remember: SSA can be considered as a DAG, too
- SSA is easy to analyse, but non-trivial to construct/leave

#### Define purpose!

Structure: SSA vs. something else; control flow

- Control flow: basic blocks/CFG vs. structured control flow
- Remember: SSA can be considered as a DAG, too
- SSA is easy to analyse, but non-trivial to construct/leave

Broader integration: keep multiple stages in single IR?

- Example: create IR with high-level operations, then incrementally lower
- Model machine instructions in same IR?
- Can avoid costly transformations, but adds complexity

### Data types

- Simple type structure vs. complex/aggregate types?
- Keep relation to high-level types vs. low-level only?
- Virtual data types, e.g. for flags/memory?

### Data types

- Simple type structure vs. complex/aggregate types?
- Keep relation to high-level types vs. low-level only?
- Virtual data types, e.g. for flags/memory?

### Instruction format

- Single vs. multiple results?
- Strongly typed vs. more generic result/operand types?
- Operand number fixed vs. dynamic?

- Allow instruction side effects?
  - E.g.: memory, floating-point arithmetic, implicit control flow

- Allow instruction side effects?
  - E.g.: memory, floating-point arithmetic, implicit control flow
- Operation complexity and abstraction
  - E.g.: CheckBounds, GetStackPtr, HashInt128
  - E.g.: load vs. MOVQconstidx4

Extensibility for new operations (e.g., new targets, high-level ops)

### IR Design: Implementation

### Maintain user lists?

- Simplifies optimizations, but adds considerable overhead
- Replacement can use copy and lazy canonicalization
- User count might be sufficient alternative
- Storage layout: operation size and locations
  - ▶ For performance: reduce heap allocations, small data structures
- Special handling for arguments vs. all-instructions?
- Metadata for source location, register allocation, etc.
- SSA:  $\phi$  nodes vs. block arguments?

# IR Example: Go SSA

- Strongly typed
  - Structured types decomposed
- Explicit memory side-effects
- Also High-level operations
  - IsInBounds, VarDef
- Only one type of value/instruction
  - Const64, Arg, Phi
- ▶ No user list, but user count
- ► Also used for arch-specific repr.

```
env GOSSAFUNC=fac go build test.go
```

```
b1:
```

```
v1 (?) = InitMem <mem>
    v2 (?) = SP <uintptr>
    v5 (?) = LocalAddr <*int> {~r1} v2 v1
    v6 (7) = Arg < int > \{n\} (n[int])
    v8 (?) = Const64 \langle int \rangle [1] (res[int])
    v9 (?) = Const64 <int> [2] (i[int])
Plain -> b2 (+9)
b2: <- b1 b4
    v10 (9) = Phi <int> v9 v17 (i[int])
    v23 (12) = Phi <int> v8 v15 (res[int])
    v12 (+9) = Less64 < bool> v10 v6
If v12 -> b4 b5 (likelv) (9)
b4 \cdot < - b2
    v15 (+10) = Mul64 <int> v23 v10 (res[int])
    v17 (+9) = Add64 < int > v10 v8 (i[int])
Plain \rightarrow b2 (9)
b5: <- b2
    v20 (12) = VarDef <mem> {~r1} v1
    v21 (+12) = Store <mem> {int} v5 v23 v20
Ret v21 (+12)
```

## LLVM and IR Design – Summary

- LLVM is a modular compiler framework
- Extremely popular and high-quality compiler back-end
- Primarily provides optimizations and a code generator
- Main interface is the SSA-based LLVM-IR
  - Easy to generate, friendly for writing front-ends/optimizations
- IR design depends on purpose and integration constraints
- Structurally similar IRs can strongly differ in capabilities

### LLVM and IR Design – Questions

- What is the structure of an LLVM-IR module/function?
- Which LLVM-IR data types exist? How do they relate to the target architecture?
- How do semantically invalid operations in LLVM-IR behave?
- What is special about intrinsic functions?
- ▶ How to derive LLVM-IR from C code using Clang?
- How does LLVM's replaceAllUsesWith work? How could this work without building/maintaining user lists?
- How can an SSA-based IR make side effects explicit?
- ▶ How would you design an IR for optimizing Brainfuck?

### Code Generation for Data Processing Lecture 5: Analyses and Transformations

#### Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

"User code" is often not very efficient

- "User code" is often not very efficient
- ▶ Also: no need to, compiler can (often?) optimize better
  - More knowledge: e.g., data layout, constants after inlining, etc.
- Allows for more pragmatic/simple code

- "User code" is often not very efficient
- ▶ Also: no need to, compiler can (often?) optimize better
  - More knowledge: e.g., data layout, constants after inlining, etc.
- Allows for more pragmatic/simple code
- Generating "better" IR code on first attempt is expensive
   What parts are actually used? How to find out?
- Transformation to "better" code must be done somewhere

- "User code" is often not very efficient
- ▶ Also: no need to, compiler can (often?) optimize better
  - More knowledge: e.g., data layout, constants after inlining, etc.
- Allows for more pragmatic/simple code
- Generating "better" IR code on first attempt is expensive
   What parts are actually used? How to find out?
- > Transformation to "better" code must be done *somewhere*
- Optimization is a misnomer: we don't know whether it improves code!
   Many transformations are driven by heuristics

- "User code" is often not very efficient
- ▶ Also: no need to, compiler can (often?) optimize better
  - More knowledge: e.g., data layout, constants after inlining, etc.
- Allows for more pragmatic/simple code
- Generating "better" IR code on first attempt is expensive
   What parts are actually used? How to find out?
- ► Transformation to "better" code must be done *somewhere*
- Optimization is a misnomer: we don't know whether it improves code!
   Many transformations are driven by heuristics
- Many types of optimizations are well-known<sup>9</sup>

### Dead Block Elimination

- CFG not necessarily connected
- E.g., consequence of optimization
  - $\blacktriangleright \ \ Conditional \ \ branch \rightarrow unconditional \ \ branch$

### Dead Block Elimination

- CFG not necessarily connected
- E.g., consequence of optimization
  - $\blacktriangleright \ \ Conditional \ \ branch \rightarrow unconditional \ \ branch$
- Removing dead blocks is trivial
  - 1. DFS traversal of CFG from entry, mark visited blocks
  - 2. Remove unmarked blocks

### **Optimization Example 1**

```
define i32 @fac(i32 %0) {
 br label %for.header
for.header: ; preds = %for.body, %1
 %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
 %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
 %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
 %cond = icmp sle i32 %i, %0
 br i1 %cond, label %for.body, label %exit
for.body: : preds = %for.header
 %a.new = mul i32 %a, %i
 %b.new = add i32 %b, %i
 \%i.new = add i32 %i. 1
 br label %for.header
exit: ; preds = %for.header
 absum = add i32 \%a, %b
 ret i32 %a
}
```

# Simple Dead Code Elimination (DCE)

Look for trivially dead instructions

- No users or side-effects
- Calls *might* be removed
- 1. Add all instructions to work queue
- 2. While work queue not empty:
  - 2.1 Check for deadness (zero users, no side-effects)
  - 2.2 If dead, remove and add all operands to work queue

Warning: Don't implement it this naively, this is inefficient

# Applying Simple DCE

```
define i32 @fac(i32 %0) {
eff cf br label %for.header
        for.header: ; preds = %for.body, %1
users: 2 %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
users: 2 %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
users: 4 %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
users: 1 %cond = icmp sle i32 %i, %0
eff.: cf br i1 % cond, label % for.body, label % exit
        for.body: ; preds = %for.header
users: 1 \%a.new = mul i32 %a. %i
users: 1 %b.new = add i32 %b. %i
users: 1 %i.new = add i32 %i, 1
eff.: cf br label %for.header
        exit: ; preds = %for.header
users: 0 \%absum = add i32 \%a, \%b
eff.: cf ret i32 %a
```

# Applying Simple DCE

```
define i32 @fac(i32 %0) {
eff cf br label %for.header
        for.header: ; preds = %for.body, %1
users: 1 %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
users: 1 %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
users: 4 %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
users: 1 %cond = icmp sle i32 %i, %0
eff.: cf br i1 % cond, label % for.body, label % exit
        for.body: ; preds = %for.header
users: 1 \%a.new = mul i32 %a. %i
users: 1 %b.new = add i32 %b. %i
users: 1 %i.new = add i32 %i, 1
eff cf br label %for.header
        exit: ; preds = %for.header
eff.: cf ret i32 %a
```

### Dead Code Elimination

Problem: unused value cycles

### Dead Code Elimination

- Problem: unused value cycles
- Idea: find "value sinks" and mark all needed values as live unmarked values can be removed
  - Sink: instruction with side effects (e.g., store, control flow)
- 1. Only mark instrs. with side effects as live
- 2. Populate work list with newly added live instrs.
- 3. While work list not empty:
  - 3.1 Mark dead operand instructions as live and add to work list
- 4. Remove instructions not marked as live

```
define i32 @fac(i32 %0) {
 br<sub>1</sub> label %for.header
for.header: ; preds = %for.body, %1
 %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
 %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
 %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
 %cond = icmp sle i32 %i, %0
 br<sub>2</sub> i1 %cond, label %for.body, label %exit
for.body: ; preds = %for.header
 %a.new = mul i32 %a. %i
 %b.new = add i32 %b, %i
 %i.new = add i32 %i, 1
 br<sub>3</sub> label %for.header
exit: ; preds = %for.header
 absum = add i32 \%a, %b
 ret i32 %a
}
```

Work list (stack)

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
      %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
      %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
      %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
      %a.new = mul i32 %a. %i
      %b.new = add i32 %b, %i
      %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

Work list (stack) br<sub>1</sub> br<sub>2</sub> br<sub>3</sub> ret

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
live %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
      %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
      %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
      %a.new = mul i32 %a. %i
      %b.new = add i32 %b, %i
      %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

Work list (stack) br<sub>1</sub> br<sub>2</sub> br<sub>3</sub> %a

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
live %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
      %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
      %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live \%a.new = mul i32 \%a.\%i
      %b.new = add i32 %b, %i
      %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

```
Work list (stack)
br<sub>1</sub>
br<sub>2</sub>
br<sub>3</sub>
%a.new
```

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
     %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
live
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
     %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
live
      %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live \%a.new = mul i32 \%a.\%i
      %b.new = add i32 %b, %i
      %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

Work list (stack) br<sub>1</sub> br<sub>2</sub> br<sub>3</sub> %i

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
     %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
live
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
     %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
live
      %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live
     %a.new = mul i32 %a, %i
     %b.new = add i32 %b, %i
live %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

```
Work list (stack)
br<sub>1</sub>
br<sub>2</sub>
br<sub>3</sub>
%i.new
```

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
     %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
live
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
     %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
live
      %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live
     %a.new = mul i32 %a, %i
     %b.new = add i32 %b, %i
live %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

Work list (stack) br<sub>1</sub> br<sub>2</sub> br<sub>3</sub>

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
     %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
live
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
     %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
live
      %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live \%a.new = mul i32 \%a.\%i
     %b.new = add i32 %b, %i
live %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

Work list (stack) br<sub>1</sub> br<sub>2</sub>
```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
     %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
live
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
     %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
live
live %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live
     %a.new = mul i32 %a, %i
      %b.new = add i32 %b, %i
live %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

Work list (stack) br<sub>1</sub> %cond

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
     %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
live
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
     %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
live
live %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live
     %a.new = mul i32 %a, %i
      %b.new = add i32 %b, %i
live %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

Work list (stack) br<sub>1</sub>

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
     %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
live
      %b = phi i32 [ 0, %1 ], [ %b.new, %for.body ]
     %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
live
live %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live
     %a.new = mul i32 %a, %i
      %b.new = add i32 %b, %i
live %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
    exit: ; preds = %for.header
      absum = add i32 \%a, %b
live ret i32 %a
     }
```

#### Work list (stack)

```
define i32 @fac(i32 %0) {
live br<sub>1</sub> label %for.header
    for.header: ; preds = %for.body, %1
     %a = phi i32 [ 1, %1 ], [ %a.new, %for.body ]
live
live
     %i = phi i32 [ 0, %1 ], [ %i.new, %for.body ]
live
    %cond = icmp sle i32 %i, %0
live br<sub>2</sub> i1 %cond, label %for.body, label %exit
    for.body: ; preds = %for.header
live
     %a.new = mul i32 %a. %i
```

```
live %i.new = add i32 %i, 1
live br<sub>3</sub> label %for.header
exit: ; preds = %for.header
```

```
live ret i32 %a
```

#### Work list (stack)

## **Optimization Example 2**

```
define i32 @foo(i32 %0, ptr %1, ptr %2) {
  %4 = zext i32 %0 to i64
  %5 = getelementptr inbounds i32, ptr %1, i64 %4
  %6 = load i32, ptr %5, align 4
  %7 = zext i32 %0 to i64
  %8 = getelementptr inbounds i32, ptr %2, i64 %7
  %9 = load i32, ptr %8, align 4
  %10 = add nsw i32 %6, %9
  ret i32 %10
}
```

# Common Subexpression Elimination (CSE) – Attempt 1

Idea: find/eliminate redundant computation of same value

## Common Subexpression Elimination (CSE) – Attempt 1

Idea: find/eliminate redundant computation of same value

Keep track of previously seen values in hash map

- Iterate over all instructions
  - If found in map, remove and replace references
  - Otherwise add to map

#### Easy, right?

define i32 @foo(i32 %0, ptr %1, ptr %2) {
 %4 = zext i32 %0 to i64
 %5 = getelementptr inbounds i32, ptr %1, i64 %4
 %6 = load i32, ptr %5, align 4
 %7 = zext i32 %0 to i64
 %8 = getelementptr inbounds i32, ptr %2, i64 %7
 %9 = load i32, ptr %8, align 4
 %10 = add nsw i32 %6, %9
 ret i32 %10
}

define i32 @foo(i32 %0, ptr %1, ptr %2) { %4 = zext i32 %0 to i64  $\rightarrow$  ht %5 = getelementptr inbounds i32, ptr %1, i64 %4 %6 = load i32, ptr %5, align 4  $\%7 = \text{zext} \ \text{i32} \ \%0 \ \text{to} \ \text{i64}$ %8 = getelementptr inbounds i32, ptr %2, i64 %7 %9 = load i32, ptr %8, align 4 %10 = add nsw i32 %6, %9ret i32 %10 }

 $\rightarrow$  ht

 $\rightarrow$  ht

define i32 @foo(i32 %0, ptr %1, ptr %2) { %4 = zext i 32 %0 to i 64 $\rightarrow$  ht %5 = getelementptr inbounds i32, ptr %1, i64 %4  $\rightarrow$  ht %6 = load i32, ptr %5, align 4  $\%7 = \text{zext} \ \text{i32} \ \%0 \ \text{to} \ \text{i64}$ %8 = getelementptr inbounds i32, ptr %2, i64 %7 %9 = load i32, ptr %8, align 4 %10 = add nsw i32 %6, %9ret i32 %10 }

Obsolete instr. can be killed immediately, or in a later DCE

```
define i32 @square(i32 %a, i32 %b) {
entry:
%cmp = icmp slt i32 %a, %b
 br i1 %cmp, label %if.then, label %if.end
if.then: ; preds = %entry
 %add1 = add i32 %a. %b
 br label %if.end
if.end: ; preds = %if.then, %entry
 %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
 %add2 = add i32 %a, %b
 %res = add i32 %condvar, %add2
ret i32 %res
}
```

 $\rightarrow$  ht

```
define i32 @square(i32 %a, i32 %b) {
entry:
% cmp = icmp slt i32 % a, % b
br i1 %cmp, label %if.then, label %if.end
if.then: ; preds = %entry
 %add1 = add i32 %a. %b
 br label %if.end
if.end: ; preds = %if.then, %entry
 %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
 %add2 = add i32 %a, %b
 %res = add i32 %condvar, %add2
ret i32 %res
}
```

 $\rightarrow$  ht  $\rightarrow$  ht

```
define i32 @square(i32 %a, i32 %b) {
entry:
% cmp = icmp slt i32 % a, % b
br i1 %cmp, label %if.then, label %if.end
if.then: ; preds = %entry
 %add1 = add i32 %a. %b
 br label %if.end
if.end: ; preds = %if.then, %entry
 %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
 %add2 = add i32 %a, %b
 %res = add i32 %condvar, %add2
ret i32 %res
}
```

 $\rightarrow$  ht  $\rightarrow$  ht

 $\rightarrow$  ht

```
define i32 @square(i32 %a, i32 %b) {
entry:
% cmp = icmp slt i32 % a, % b
br i1 %cmp, label %if.then, label %if.end
if.then: ; preds = %entry
 %add1 = add i32 %a. %b
br label %if.end
if.end: ; preds = %if.then, %entry
 %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
 %add2 = add i32 %a, %b
 %res = add i32 %condvar, %add2
ret i32 %res
}
```

 $\rightarrow$  ht  $\rightarrow$  ht

 $\rightarrow$  ht  $\rightarrow$  ht

```
define i32 @square(i32 %a, i32 %b) {
entry:
%cmp = icmp slt i32 %a, %b
br i1 %cmp, label %if.then, label %if.end
if.then: ; preds = %entry
%add1 = add i32 %a. %b
br label %if.end
if.end: ; preds = %if.then, %entry
 %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
 %add2 = add i32 %a, %b
 %res = add i32 %condvar, %add2
ret i32 %res
}
```

```
define i32 @square(i32 %a, i32 %b) {
            entry:
             %cmp = icmp slt i32 %a, %b
\rightarrow ht
\rightarrow ht
             br i1 %cmp, label %if.then, label %if.end
            if.then: ; preds = %entry
             %add1 = add i32 %a. %b
\rightarrow ht
\rightarrow ht
             br label %if.end
            if.end: ; preds = %if.then, %entry
             %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
\rightarrow ht
             %add2 = add i32 %a, %b
             %res = add i32 %condvar, %add2
             ret i32 %res
            }
```

```
define i32 @square(i32 %a, i32 %b) {
            entry:
             %cmp = icmp slt i32 %a, %b
\rightarrow ht
\rightarrow ht
             br i1 %cmp, label %if.then, label %if.end
            if.then: ; preds = %entry
             %add1 = add i32 %a. %b
\rightarrow ht
\rightarrow ht
             br label %if.end
            if.end: ; preds = %if.then, %entry
             %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
\rightarrow ht
            \%add2 = add i32 %a, %b
dup %add1
             %res = add i32 %condvar, %add2
             ret i32 %res
            }
```

```
define i32 @square(i32 %a, i32 %b) {
            entry:
             %cmp = icmp slt i32 %a, %b
\rightarrow ht
\rightarrow ht
             br i1 %cmp, label %if.then, label %if.end
            if.then: ; preds = %entry
             %add1 = add i32 %a. %b
\rightarrow ht
\rightarrow ht
             br label %if.end
            if.end: ; preds = %if.then, %entry
             %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
\rightarrow ht
            \%add2 = add i32 %a, %b
dup %add1
             %res = add i32 %condvar, %add1
             ret i32 %res
            }
```

```
define i32 @square(i32 %a, i32 %b) {
            entry:
             %cmp = icmp slt i32 %a, %b
\rightarrow ht
\rightarrow ht
              br i1 %cmp, label %if.then, label %if.end
            if.then: ; preds = %entry
\rightarrow ht
             \%add1 = add i32 %a, %b
\rightarrow ht
             br label %if.end
             if.end: ; preds = %if.then, %entry
             %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
\rightarrow ht
dup %add1
            \%add2 = add i32 %a, %b
             %res = add i32 %condvar, %add1
\rightarrow ht
             ret i32 %res
             }
```

```
define i32 @square(i32 %a, i32 %b) {
             entry:
              %cmp = icmp slt i32 %a, %b
\rightarrow ht
\rightarrow ht
              br i1 %cmp, label %if.then, label %if.end
             if.then: ; preds = %entry
\rightarrow ht
              \%add1 = add i32 %a, %b
\rightarrow ht
              br label %if.end
             if.end: ; preds = %if.then, %entry
              %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
\rightarrow ht
dup %add1
             \%add2 = add i32 %a, %b
              %res = add i32 %condvar, %add1
\rightarrow ht
\rightarrow ht
              ret i32 %res
             }
```

```
define i32 @square(i32 %a, i32 %b) {
            entry:
             %cmp = icmp slt i32 %a, %b
\rightarrow ht
\rightarrow ht
             br i1 %cmp, label %if.then, label %if.end
            if.then: ; preds = %entry
             %add1 = add i32 %a, %b
\rightarrow ht
\rightarrow ht
             br label %if.end
            if.end: ; preds = %if.then, %entry
             %condvar = phi i32 [ %add1, %if.then ], [ %a, %entry ]
\rightarrow ht
dup %add1
            %add2 = add i32 %a. %b
            %res = add i32 %condvar, %add1
\rightarrow ht
\rightarrow ht
             ret i32 %res
             ን
```

Instruction does not dominate all uses!
error: input module is broken!

## World Domination

- ▶ Remember: CFG G = (N, E, s) with digraph (N, E) and entry  $s \in N$
- Dominate: d dom n iff every path from s to n contains d
  - Dominators of *n*:  $DOM(n) = \{d | d \text{ dom } n\}$
- ▶ Strictly dominate: d sdom  $n \Leftrightarrow d$  dom  $n \land d \neq n$
- Immediate dominator:

idom (n) = d: d sdom  $n \land \not \exists d'.d$  sdom  $d' \land d'$  sdom n

- ▶ Remember: CFG G = (N, E, s) with digraph (N, E) and entry  $s \in N$
- Dominate: *d* dom *n* iff every path from *s* to *n* contains *d* 
  - Dominators of *n*:  $DOM(n) = \{d | d \text{ dom } n\}$
- ▶ Strictly dominate: d sdom  $n \Leftrightarrow d$  dom  $n \land d \neq n$

#### Immediate dominator:

idom (n) = d: d sdom  $n \land \not \exists d'.d$  sdom  $d' \land d'$  sdom n

 $\Rightarrow\,$  All strict dominators are always executed before the block

- ▶ Remember: CFG G = (N, E, s) with digraph (N, E) and entry  $s \in N$
- Dominate: *d* dom *n* iff every path from *s* to *n* contains *d* 
  - Dominators of *n*:  $DOM(n) = \{d | d \text{ dom } n\}$
- ▶ Strictly dominate: d sdom  $n \Leftrightarrow d$  dom  $n \land d \neq n$

#### Immediate dominator:

idom (n) = d: d sdom  $n \land \not \exists d'.d$  sdom  $d' \land d'$  sdom n

- $\Rightarrow\,$  All strict dominators are always executed before the block
- $\Rightarrow$  All values from dominators available/usable

- ▶ Remember: CFG G = (N, E, s) with digraph (N, E) and entry  $s \in N$
- Dominate: *d* dom *n* iff every path from *s* to *n* contains *d* 
  - Dominators of *n*:  $DOM(n) = \{d | d \text{ dom } n\}$
- ▶ Strictly dominate: d sdom  $n \Leftrightarrow d$  dom  $n \land d \neq n$

#### Immediate dominator:

idom (n) = d: d sdom  $n \land \not \exists d'.d$  sdom  $d' \land d'$  sdom n

- $\Rightarrow\,$  All strict dominators are always executed before the block
- $\Rightarrow$  All values from dominators available/usable
- $\Rightarrow\,$  All values not from dominators not usable

### Dominator Tree

- Tree of immediate dominators
- Allows to iterate over blocks in pre-order/post-order
- Answer a sdom b quickly



#### Dominator Tree

- Tree of immediate dominators
- Allows to iterate over blocks in pre-order/post-order
- Answer a sdom b quickly


### Dominator Tree: Construction

Naive: inefficient (but reasonably simple)<sup>10</sup>

- ► For each block: find a path from the root superset of dominators
- Remove last block on path and check for alternative path
- If no alternative path exists, last block is idom

<sup>10</sup>ES Lowry and CW Medlock. "Object code optimization". In: CACM 12.1 (1969), pp. 13–22. .

### Dominator Tree: Construction

Naive: inefficient (but reasonably simple)<sup>10</sup>

- ► For each block: find a path from the root superset of dominators
- Remove last block on path and check for alternative path
- If no alternative path exists, last block is idom
- Lengauer–Tarjan: more efficient methods<sup>11</sup>
  - Simple method in  $\mathcal{O}(m \log n)$ ; sophisticated method in  $\mathcal{O}(m \cdot \alpha(m, n))$

 $(\alpha(m,n)$  is the inverse Ackermann function, grows extremely slowly)

Used frequently in compilers<sup>12</sup>

<sup>10</sup>ES Lowry and CW Medlock. "Object code optimization". In: CACM 12.1 (1969), pp. 13–22. .

11T Lengauer and RE Tarjan. "A fast algorithm for finding dominators in a flowgraph". In: TOPLAS 1.1 (1979), pp. 121–141. 🔕

<sup>12</sup>Example: https://github.com/WebKit/WebKit/blob/aabfacb/Source/WTF/wtf/Dominators.h

> Per node store: *idom*, idom-children, DFS pre-order/post-order number

> Per node store: *idom*, idom-children, DFS pre-order/post-order number

► Get immediate dominator:

Per node store: *idom*, idom-children, DFS pre-order/post-order number

► Get immediate dominator: ...lookup *idom* 

Iterate over all dominators/dominated by:

Per node store: *idom*, idom-children, DFS pre-order/post-order number

► Get immediate dominator: ...lookup *idom* 

Iterate over all dominators/dominated by: ...trivial

Check whether a sdom b

Per node store: *idom*, idom-children, DFS pre-order/post-order number

- ► Get immediate dominator: ...lookup *idom*
- Iterate over all dominators/dominated by: ...trivial
- Check whether a sdom  $b^{13}$ 
  - ▶ a.preNum < b.preNum ∧ a.postNum > b.postNum
  - After updates, numbers might be invalid: recompute or walk tree

Per node store: *idom*, idom-children, DFS pre-order/post-order number

- ► Get immediate dominator: ...lookup *idom*
- Iterate over all dominators/dominated by: ...trivial
- Check whether a sdom  $b^{13}$ 
  - ▶ a.preNum < b.preNum ∧ a.postNum > b.postNum
  - After updates, numbers might be invalid: recompute or walk tree

#### ▶ Problem: dominance of unreachable blocks ill-defined ~→ special handling

### CSE Attempt 2

- ► Option 1:
  - ► For identical instructions, store all
  - Add dominance check before replacing
  - Visit nodes in reverse post-order (i.e., topological order)
- ▶ Option 2:<sup>14</sup>
  - Do a DFS over dominator tree
  - Use scoped hashmap to track available values

#### Does this work?

### CSE Attempt 2

- ► Option 1:
  - ► For identical instructions, store all
  - Add dominance check before replacing
  - Visit nodes in reverse post-order (i.e., topological order)
- ▶ Option 2:<sup>14</sup>
  - Do a DFS over dominator tree
  - Use scoped hashmap to track available values

#### Does this work? Yes.

# CSE: Hashing an Instruction (and Beyond)

- Needs hash function and "relaxed" equality
- Idea: combine opcode and operands/constants into hash value
  - Use pointer or index for instruction result operands

# CSE: Hashing an Instruction (and Beyond)

Needs hash function and "relaxed" equality

Idea: combine opcode and operands/constants into hash value
 Use pointer or index for instruction result operands

- Canonicalize commutative operations
  - Order operands deterministically, e.g., by address
- Identities: a+(b+c) vs. (a+b)+c

## Global Value Numbering - or: advanced CSE

Hash-based approach only catches trivially removable duplicates

# Global Value Numbering - or: advanced CSE

Hash-based approach only catches trivially removable duplicates

- Alternative: partition values into *congruence classes* Congruent values are guaranteed to always have the same value
  Optimistic approach: values are congruent unless proven otherwise
  Pessimistic approach: values are not congruent unless proven
- Combinable with: reassociation, DCE, constant folding
- Rather complex, but can be highly beneficial<sup>15</sup>

- Estimate whether inlining is beneficial
  - Savings of avoided call/computations/branches; cost of increased size

Estimate whether inlining is beneficial

- Savings of avoided call/computations/branches; cost of increased size
- Copy original function in place of the call
  - Split basic block containing function call

Estimate whether inlining is beneficial

- Savings of avoided call/computations/branches; cost of increased size
- Copy original function in place of the call
  - Split basic block containing function call
- $\blacktriangleright$  Replace returns with branches and  $\phi$ -node to/at continuation point

Estimate whether inlining is beneficial

- Savings of avoided call/computations/branches; cost of increased size
- Copy original function in place of the call
  - Split basic block containing function call
- $\blacktriangleright$  Replace returns with branches and  $\phi$ -node to/at continuation point
- Move alloca to beginning or save stack pointer
  - Prevent unbounded stack growth in loops
  - LLVM provides stacksave/stackrestore intrinsics
- Exceptions may need special treatment

# Simple Transformations: Mem2Reg and SROA

- Mem2reg: promote alloca to SSA values/phis
  - Condition: only load/store, no address taken
  - Essentially just SSA construction
- ► SROA: scalar replacement of aggregate
  - Separate structure fields into separate variables
  - Also promote them to SSA

- Entry: block with an edge from outside of L
- Header h: first entry found (might be ambiguous)
- Loop nested in L: loop in subgraph  $L \setminus \{h\}$

- Entry: block with an edge from outside of L
- Header h: first entry found (might be ambiguous)
- ▶ Loop nested in *L*: loop in subgraph  $L \setminus \{h\}$



- Entry: block with an edge from outside of L
- Header h: first entry found (might be ambiguous)
- ▶ Loop nested in *L*: loop in subgraph  $L \setminus \{h\}$



- Entry: block with an edge from outside of L
- Header h: first entry found (might be ambiguous)
- ▶ Loop nested in *L*: loop in subgraph  $L \setminus \{h\}$



- Entry: block with an edge from outside of L
- Header h: first entry found (might be ambiguous)
- ▶ Loop nested in *L*: loop in subgraph  $L \setminus \{h\}$



- Entry: block with an edge from outside of L
- Header h: first entry found (might be ambiguous)
- ▶ Loop nested in *L*: loop in subgraph  $L \setminus \{h\}$





- Entry: block with an edge from outside of L
- Header h: first entry found (might be ambiguous)
- ▶ Loop nested in *L*: loop in subgraph  $L \setminus \{h\}$



# Natural Loops

- Natural Loop: loop with single entry
  - $\Rightarrow$  Header is unique
  - $\Rightarrow$  Header dominates all block
  - $\Rightarrow$  Loop is reducible
- Backedge: edge from block to header
- Predecessor: block with edge into loop
- Preheader: unique predecessor

#### **Formal Definition**

Loop *L* is reducible iff  $\exists h \in L : \forall n \in L : h \text{ dom } n$ CFG is reducible iff all loops are reducible



# Finding Natural Loops

Modified version<sup>17</sup> of Tarjan's algorithm<sup>18</sup>

- Iterate over dominator tree in post order
- Each block: find predecessors dominated by the block
  - ▶ None ~→ no loop header, continue
  - ► Any ~→ loop header, these edges *must* be backedges
- Walk through predecessors until reaching header again
  - All blocks on the way must be part of the loop body
  - Might encounter nested loops, update loop parent

<sup>17</sup>G Ramalingam. "Identifying loops in almost linear time". In: TOPLAS 21.2 (1999), pp. 175–188. .
 <sup>18</sup>R Tarjan. "Testing flow graph reducibility". In: STOC. 1973, pp. 96–107. .

Control Flow Graph

Dominator Tree





Control Flow Graph

Dominator Tree





Control Flow Graph

Dominator Tree





Control Flow Graph

Dominator Tree





Control Flow Graph

Dominator Tree





Control Flow Graph

Dominator Tree

Loop Info





Loop A: {c} header: c; parent: NULL

Control Flow Graph

Dominator Tree

Loop Info





Loop A: {c} header: c; parent: NULL

Control Flow Graph

Dominator Tree

Loop Info





Loop A: {c} header: c; parent: NULL
Control Flow Graph

Dominator Tree

Loop Info





Loop A: {c} header: c; parent: NULL

Control Flow Graph

Dominator Tree

Loop Info





Loop A: {c} header: c; parent: NULL

Loop **B**: {f,g} header: f; parent: NULL

Control Flow Graph

Dominator Tree

Loop Info





Loop A: {c} header: c; parent: NULL

Loop **B**: {f,g} header: f; parent: NULL

Control Flow Graph

Dominator Tree

Loop Info





Loop A: {c} header: c; parent: NULL

Loop **B**: {f,g} header: f; parent: NULL

Control Flow Graph

а b g d е

а h f g d e

Dominator Tree

Loop Info

Loop A: {c} header: c; parent: NULL

Loop B: {f,g} header: f; parent: C

Loop C: {b,f,g} header: b; parent: NULL

Control Flow Graph



Dominator Tree



Loop Info

Loop A: {c} header: c; parent: NULL

Loop B: {f,g} header: f; parent: C

Loop C: {b,f,g} header: b; parent: NULL

Control Flow Graph



Dominator Tree



Loop Info

Loop A: {c} header: c; parent: NULL

Loop B: {f,g} header: f; parent: C

Loop C: {b,f,g} header: b; parent: NULL

Control Flow Graph



Dominator Tree



Loop Info

Loop A: {c} header: c; parent: D

Loop **B**: {f,g} header: f; parent: C

Loop C: {b,f,g} header: b; parent: D

Loop D: {a,b,c,d,e,f,g} header: a; parent: NULL

# Loop Invariant Code Motion (LICM)

Analyze loops, iterate over loop tree in post-order

► I.e., visit inner loops first

# Loop Invariant Code Motion (LICM)

Analyze loops, iterate over loop tree in post-order

I.e., visit inner loops first

<sup>↑</sup> Hoist:<sup>19</sup> iterate over blocks of loop in reverse post-order

- For each movable inst., check for loop-defined operands
- If not, move to preheader (create one, if not existent)
- Otherwise, add inst. to set of values defined inside loop

# Loop Invariant Code Motion (LICM)

Analyze loops, iterate over loop tree in post-order

I.e., visit inner loops first

<sup>↑</sup> Hoist:<sup>19</sup> iterate over blocks of loop in reverse post-order
 For each movable inst., check for loop-defined operands
 If not, move to preheader (create one, if not existent)
 Otherwise, add inst. to set of values defined inside loop

- $\downarrow\,$  Sink: Iterate over blocks of loop in post-order
  - ► For each movable inst., check for users inside loop
  - If none, move to unique exit (if existent)

## Transformations and Analyses in LLVM: Passes

- Transformations and analyses organized in passes
- Pass can operate on Module/(CGSCC)/Function/Loop

## Transformations and Analyses in LLVM: Passes

- Transformations and analyses organized in passes
- Pass can operate on Module/(CGSCC)/Function/Loop
- Analysis pass: takes input IR and returns analysis result
  - May also use results of other analyses; results are cached
- ▶ Transformation pass: takes input IR and returns preserved analyses
  - Can use analyses, which are re-run when outdated

## Transformations and Analyses in LLVM: Passes

- Transformations and analyses organized in passes
- Pass can operate on Module/(CGSCC)/Function/Loop
- ► Analysis pass: takes input IR and returns analysis result
  - May also use results of other analyses; results are cached
- ▶ Transformation pass: takes input IR and returns preserved analyses
  - Can use analyses, which are re-run when outdated
- Pass manager executes passes on same granularity
  - Otherwise, use adaptor: createFunctionToLoopPassAdaptor (and preferably combine multiple smaller passes into a separate pass manager)

# Using LLVM (New) Pass Manager

}

```
void optimize(llvm::Function* fn) {
 llvm::PassBuilder pb;
 llvm::LoopAnalysisManager lam{};
 llvm::FunctionAnalysisManager fam{};
 llvm::CGSCCAnalysisManager cgam{};
 llvm::ModuleAnalysisManager mam{};
 pb.registerModuleAnalyses(mam);
 pb.registerCGSCCAnalyses(cgam);
 pb.registerFunctionAnalyses(fam);
 pb.registerLoopAnalyses(lam);
 pb.crossRegisterProxies(lam, fam, cgam, mam);
```

```
llvm::FunctionPassManager fpm{};
fpm.addPass(llvm::DCEPass());
fpm.addPass(llvm::createFunctionToLoopPassAdaptor(llvm::LoopRotatePass()));
fpm.run(*fn, fam);
```

#### Writing a Pass for LLVM's New PM – Part 1

```
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Passes/PassPlugin.h"
```

11 ...

```
class TestPass : public llvm::PassInfoMixin<TestPass> {
public:
  llvm::PreservedAnalyses run(llvm::Function &F.
                            llvm::FunctionAnalvsisManager &AM) {
   // Do some magic
   llvm::DominatorTree *DT = &AM.getResult<llvm::DominatorTreeAnalvsis>(F);
   11 ...
   llvm::errs() << F.getName() << "\n";</pre>
   return llvm::PreservedAnalyses::all();
 }
};
```

# Writing a Pass for LLVM's New PM – Part 2

```
extern "C" ::llvm::PassPluginLibraryInfo LLVM_ATTRIBUTE_WEAK
llvmGetPassPluginInfo() {
 return { LLVM_PLUGIN_API_VERSION, "TestPass", "v1",
   [] (llvm::PassBuilder &PB) {
     PB.registerPipelineParsingCallback(
       [] (llvm::StringRef Name, llvm::FunctionPassManager &FPM,
          llvm::ArravRef<llvm::PassBuilder::PipelineElement>) {
         if (Name == "testpass") {
          FPM.addPass(TestPass());
          return true:
         }
         return false:
       }):
   } }:
}
c++ -shared -o testpass.so testpass.cc -lLLVM -fPIC
opt -load-pass-plugin=$PWD/testpass.so -passes=testpass input.ll | llvm-dis
```

164

# Analyses and Transformations – Summary

Program Transformation critical for performance improvement

- Code not necessarily better
- Analyses are important to drive transformations
  - Dominator tree, loop detection, value liveness
- Important optimizations
  - Dead code elimination, common sub-expression elimination, loop-invariant code motion
- Compilers often implement transformations as passes
- Analyses may be invalidated by transformations, needs tracking

## Analyses and Transformations – Questions

- ▶ Why is "optimization" a misleading name for a transformation?
- How to find unused code sections in a function's CFG?
- ▶ Why is a liveness-based DCE better than a simple, user-based DCE?
- What is a dominator tree useful for?
- ▶ What is the difference between an irreducible and a natural loop?
- How to find natural loops in a CFG?
- How does the algorithm handle irreducible loops?
- Why is sinking a loop-invariant inst. harder than hoisting?

#### Code Generation for Data Processing Lecture 6: Vectorization

#### Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

## Parallel Data Processing

Sequential execution has inherently limited performance

- Clock rate, data path lengths, speed of light, ...
- > Parallelism is the key to substantial and scalable perf. improvements
- Modern systems have many levels of parallelism:

## Parallel Data Processing

Sequential execution has inherently limited performance

- Clock rate, data path lengths, speed of light, ...
- ▶ Parallelism is the key to substantial and scalable perf. improvements
- Modern systems have many levels of parallelism:
  - Multiple nodes/systems, connected via network
  - ▶ Different compute units (CPU, GPU, etc.), connected via PCIe
  - Multiple CPU sockets, connected via QPI (Intel) or HyperTransport (AMD)
  - Multiple CPU cores
  - Multiple threads per core
  - Instruction-level parallelism (superscalar out-of-order execution)
  - Data parallelism (SIMD)

# Single Instruction, Multiple Data (SIMD)

Idea: perform same operations on multiple data in parallel

► First computer with SIMD operations:

# Single Instruction, Multiple Data (SIMD)

Idea: perform same operations on multiple data in parallel

- ▶ First computer with SIMD operations: MIT Lincoln Labs TX-2, 1957<sup>20</sup>
- ▶ Wider use in HPC in 1970s with vector processors (Cray et al.)
  - Ultimately replaced by much more scalable distributed machines

# Single Instruction, Multiple Data (SIMD)

Idea: perform same operations on multiple data in parallel

- ► First computer with SIMD operations: MIT Lincoln Labs TX-2, 1957<sup>20</sup>
- Wider use in HPC in 1970s with vector processors (Cray et al.)
  - Ultimately replaced by much more scalable distributed machines
- SIMD-extensions for multimedia processing from 1990s onwards
   Often include comparial instructions for increase (video (and in processing))
  - Often include very special instructions for image/video/audio processing
- Shift towards HPC and data processing around 2010
- Extensions for machine learning/AI in late 2010s

## SIMD: Idea

Multiple data elements are stored in vectors

- Size of data may differ, vector size is typically constant
- Single elements in vector referred to as *lane*
- (Vertical) Operations apply the same operation to all lanes



Horizontal operations work on neighbored elements

# SIMD ISAs: Design

- Vectors are often implemented as fixed-size wide registers
  - ► Examples: ARM NEON 32×128-bit, Power QPX 32×256-bit
  - Data types and element count is defined by instruction
- Some ISAs have dynamic vector sizes: ARM VFP, ARM SVE, RISC-V V
  - Problematic for compilers: variable spill size, less constant folding

# SIMD ISAs: Design

- Vectors are often implemented as fixed-size wide registers
  - ► Examples: ARM NEON 32×128-bit, Power QPX 32×256-bit
  - Data types and element count is defined by instruction
- Some ISAs have dynamic vector sizes: ARM VFP, ARM SVE, RISC-V V
  - ▶ Problematic for compilers: variable spill size, less constant folding
- Data types vary, e.g. i8/i16/i32/i64/f16/bf16/f32/f64/f128
  - Sometimes only conversion, sometime with saturating arithmetic

# SIMD ISAs: Design

- Vectors are often implemented as fixed-size wide registers
  - ► Examples: ARM NEON 32×128-bit, Power QPX 32×256-bit
  - Data types and element count is defined by instruction
- Some ISAs have dynamic vector sizes: ARM VFP, ARM SVE, RISC-V V
  - Problematic for compilers: variable spill size, less constant folding
- Data types vary, e.g. i8/i16/i32/i64/f16/bf16/f32/f64/f128
  - Sometimes only conversion, sometime with saturating arithmetic
- Masking allows to suppress operations for certain lanes
  - Dedicated mask registers (AVX-512, SVE, RVV) allow for hardware masking
  - Can also apply for memory operations, optionally suppressing faults
  - Otherwise: software masking with another vector register



Dense linear algebra: vector/matrix operations

Implementations: Intel MKL, OpenBLAS, ATLAS, ....

Dense linear algebra: vector/matrix operations

Implementations: Intel MKL, OpenBLAS, ATLAS, ...

Sparse linear algebra

Needs gather/scatter instructions

Dense linear algebra: vector/matrix operations

- Implementations: Intel MKL, OpenBLAS, ATLAS, ....
- Sparse linear algebra
  - Needs gather/scatter instructions
- Image and video processing, manipulation, encoding

Dense linear algebra: vector/matrix operations

- Implementations: Intel MKL, OpenBLAS, ATLAS, ...
- Sparse linear algebra
  - Needs gather/scatter instructions
- Image and video processing, manipulation, encoding
- String operations
  - Implemented, e.g., in glibc, simdjson

Dense linear algebra: vector/matrix operations

- Implementations: Intel MKL, OpenBLAS, ATLAS, ...
- Sparse linear algebra
  - Needs gather/scatter instructions
- Image and video processing, manipulation, encoding
- String operations
  - Implemented, e.g., in glibc, simdjson



# SIMD ISAs: Usage Considerations
## SIMD ISAs: Usage Considerations

#### Very easy to implement in hardware

- Simple replication of functional units and larger vector registers
- ► Too large vectors, however, also cause problems (AVX-512)

## SIMD ISAs: Usage Considerations

#### Very easy to implement in hardware

- Simple replication of functional units and larger vector registers
- ► Too large vectors, however, also cause problems (AVX-512)
- Offer significant speedups for certain applications
  - ▶ With 4x parallelism, speed-ups of ~3x are achievable



## SIMD ISAs: Usage Considerations

#### Very easy to implement in hardware

- Simple replication of functional units and larger vector registers
- ▶ Too large vectors, however, also cause problems (AVX-512)
- Offer significant speedups for certain applications
   With 4x parallelism, speed-ups of ~3x are achievable
- Caveat: non-trivial to program
  - Optimized routines provided by libraries
  - Compilers try to auto-vectorize, but often need guidance

# SIMD Programming: (Inline) Assembly

► Idea: SIMD is too complicated, let programmer handle this

- Programmer specifies exact code (instrs, control flow, and registers)
- Inline assembly allows for integration into existing code
  - Specification of register constraints and clobbers needed
- "Popular" for optimized libraries

# SIMD Programming: (Inline) Assembly

► Idea: SIMD is too complicated, let programmer handle this

- Programmer specifies exact code (instrs, control flow, and registers)
- Inline assembly allows for integration into existing code
  - Specification of register constraints and clobbers needed
- "Popular" for optimized libraries
- + Allows for best performance
- Very tedious to write, manual register allocation, non-portable
- No optimization across boundaries

## SIMD Programming: Intrinsics

Idea: deriving a SIMD schema is complicated, delegate to programmer

Intrinsic functions correspond to hardware instructions

\_\_m128i \_mm\_add\_epi32 (\_\_m128i a, \_\_m128i b)

 Programmer explicitly specifies vector data processing instructions compiler supplements registers, control flow, and scalar processing

## SIMD Programming: Intrinsics

Idea: deriving a SIMD schema is complicated, delegate to programmer

- Intrinsic functions correspond to hardware instructions
  - \_\_m128i \_mm\_add\_epi32 (\_\_m128i a, \_\_m128i b)
- Programmer explicitly specifies vector data processing instructions compiler supplements registers, control flow, and scalar processing
- $+\,$  Allows for very good performance, still exposes all operations
- + Compiler can to some degree optimize intrinsics
  - GCC does not; Clang/LLVM does intrinsics often lowered to LLVM-IR vectors
- Tedious to write, non-portable

## SIMG Programming: Intrinsics – Example

```
float sdot(size_t n, const float x[n], const float y[n]) {
   size_t i = 0;
   __m128 sum = _mm_set_ps1(0);
   for (i = 0; i < (n & ~3ul); i += 4) {
     __m128 xl = _mm_loadu_ps(&x[i]);
     __m128 yl = _mm_loadu_ps(&y[i]);
     sum = _mm_add_ps(sum, _mm_mul_ps(xl, yl));
   }
   // ... take care of tail (i..<n) ...</pre>
```

### Intrinsics for Unknown Vector Size

Size not known at compile-time, but can be queried at runtime

- SVE: instruction incd adds number of vector lanes to register
- ▶ In C: behave like an incomplete type, except for parameters/returns
- ► Flexible code often slower than with assumed constant vector size

Consequences:

### Intrinsics for Unknown Vector Size

Size not known at compile-time, but can be queried at runtime

- SVE: instruction incd adds number of vector lanes to register
- ▶ In C: behave like an incomplete type, except for parameters/returns
- Flexible code often slower than with assumed constant vector size
- Consequences:
  - Cannot put such types in structures, arrays, sizeof
  - Stack spilling implies variably-sized stack
- Instructions to set mask depending on bounds: whilelt, ...
  - No loop peeling for tail required

## SIMD Programming: Target-independent Vector Extensions

Idea: vectorization still complicated, but compiler can choose instrs.

- Programmer still specifies exact operations, but in target-independent way
- Often mixable with target-specific intrinsics
- Compiler maps operations to actual target instructions
- If no matching target instruction exists, use replacement code
  - Inherent danger: might be less efficient than scalar code
- Often relies on explicit vector size

#### GCC Vector Extensions

```
#include <stdint.h>
```

```
typedef uint32_t uint32x4_t
__attribute__((vector_size(16)));
```

```
uint32x4_t
addvec(uint32x4_t a, uint32x4_t b) {
    return a + b;
}
```

```
uint32x4_t
modvec(uint32x4_t a, uint32x4_t b) {
    return a % b;
}
```

#### GCC Vector Extensions

addvec: paddd xmm0, xmm1 ret

#include <stdint.h>

```
typedef uint32_t uint32x4_t
__attribute__((vector_size(16)));
```

```
uint32x4_t
addvec(uint32x4_t a, uint32x4_t b) {
    return a + b;
}
```

```
uint32x4_t
modvec(uint32x4_t a, uint32x4_t b) {
    return a % b;
}
```

#### GCC Vector Extensions

```
#include <stdint.h>
```

```
typedef uint32_t uint32x4_t
__attribute__((vector_size(16)));
```

```
uint32x4_t
addvec(uint32x4_t a, uint32x4_t b) {
    return a + b;
}
```

```
uint32x4_t
modvec(uint32x4_t a, uint32x4_t b) {
    return a % b;
}
```

```
addvec:
paddd xmm0, xmm1
ret
<sup>modvec:</sup>
movd ecx, xmm1
movd eax, xmm0
```

```
xor edx, edx
pextrd edi, xmm1, 1
div ecx
pextrd eax, xmm0, 1
pextrd ecx, xmm1, 2
mov esi, edx
xor edx. edx
div edi
pextrd eax, xmm0, 2
mov r8d. edx
xor edx. edx
div ecx
pextrd ecx, xmm1, 3
pextrd eax, xmm0, 3
movd xmm0. esi
pinsrd xmm0, r8d, 1
mov edi. edx
xor edx. edx
div ecx
movd xmm1, edi
pinsrd xmm1, edx, 1
```

## LLVM-IR: Vectors

▶ <N x ty> – fixed-size vector type, e.g. <4 x i32>

- Valid element type: integer, floating-point, pointers
- Memory layout: densely packed (i.e., <8 x i2>  $\approx$  i16)

### LLVM-IR: Vectors

- N x ty> fixed-size vector type, e.g. <4 x i32>
  - Valid element type: integer, floating-point, pointers
  - Memory layout: densely packed (i.e., <8 x i2>  $\approx$  i16)
- <vscale x N x ty> scalable vector, e.g. <vscale x 4 x i32>
  - Vector with a multiple of N elements
  - Intrinsic @llvm.vscale.i32() get runtime value of vscale

## LLVM-IR: Vectors

- N x ty> fixed-size vector type, e.g. <4 x i32>
  - Valid element type: integer, floating-point, pointers
  - Memory layout: densely packed (i.e., <8 x i2>  $\approx$  i16)
- <vscale x N x ty> scalable vector, e.g. <vscale x 4 x i32>
  - Vector with a multiple of N elements
  - Intrinsic @llvm.vscale.i32() get runtime value of vscale
- Most arithmetic operations can also operate on vectors
- insertelement/extractelement: modify single element
  - Example: %4 = insertelement <4 x float> %0, float %1, i32 %2
  - Index can be non-constant value

#### LLVM-IR: shufflevector

- Instruction to reorder values and resize vectors
- shufflevector <n x ty> %x, <n x ty> %y, <m x i32> %mask
  - %x, %y values to shuffle, must have same size
  - %mask element indices for result (0..<n refer to %x, n..<2n to %y)</p>
  - Result is of type <m x ty>

#### LLVM-IR: shufflevector

- Instruction to reorder values and resize vectors
- shufflevector <n x ty> %x, <n x ty> %y, <m x i32> %mask
  - %x, %y values to shuffle, must have same size
  - %mask element indices for result (0..<n refer to %x, n..<2n to %y)</p>
  - Result is of type <m x ty>

shufflevector <4 x i32> %x, <4 x i32> %y, <3 x i32> <i32 1, i32 7, i32 7>



#### LLVM-IR: Lowering Intrinsics

Intrinsics translated to native LLVM-IR if possible

```
#include <immintrin.h>
m128 func( m128 a, m128 b) {
 _m128 \text{ rev} = _mm_shuffle_epi32(a + b, 0x1b);
 return _mm_round_ps(rev, _MM_FROUND_TO_NEG_INF);
}
define <4 x float> @func(<4 x float> %0, <4 x float> %1) {
 %3 = fadd < 4 x float > %0, %1
 %4 = shufflevector <4 x float> %3, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
 %5 = tail call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %4, i32 1)
 ret <4 x float> %5
}
declare <4 x float> @llvm.x86.sse41.round.ps(<4 x float>, i32 immarg)
```

## LLVM-IR: Lowering Intrinsics

- Intrinsics translated to native LLVM-IR if possible
- + Allows optimizations
- Intent of programmer might get lost

```
#include <immintrin.h>
__m128 func(__m128 a, __m128 b) {
 _m128 \text{ rev} = _mm_shuffle_epi32(a + b, 0x1b);
 return _mm_round_ps(rev, _MM_FROUND_TO_NEG_INF);
}
define <4 x float> @func(<4 x float> %0, <4 x float> %1) {
 %3 = fadd < 4 x float > %0, %1
 %4 = shufflevector <4 x float> %3, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
 %5 = tail call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %4, i32 1)
 ret <4 x float> %5
}
declare <4 x float> @llvm.x86.sse41.round.ps(<4 x float>, i32 immarg)
```

# SIMD Programming: Single Program, Multiple Data (SPMD)

- So far: manual vectorization
- Observation: same code is executed on multiple elements
- Idea: tell compiler to vectorize handling of single element
  - Splice code for element into separate function
  - Tell compiler to generate vectorized version of this function
  - Function called in vector-parallel loop
- Needs annotation of variables
  - Varying: variables that differ between lanes
  - Uniform: variables that are guaranteed to be the same (basically: scalar values that are broadcasted if necessary)

```
#pragma omp declare simd
int foo(int x, int y) {
  return x + y;
}
```

foo: add edi, esi mov eax, edi ret

\_ZGVxN4vv\_foo: paddd xmm0, xmm1 ret

```
#pragma omp declare simd
int foo(int x, int y) {
  return x + y;
}
```

 Compiler generates version that operates on vector foo: add edi, esi mov eax, edi ret

\_ZGVxN4vv\_foo: paddd xmm0, xmm1 ret

```
#pragma omp declare simd uniform(y)
int foo(int x, int y) {
  return x + y;
}
```

```
foo:
   add edi, esi
   mov eax, edi
   ret
```

\_ZGVxN4vu\_foo: movd xmm1, eax pshufd xmm2, xmm1, 0 paddd xmm0, xmm2 ret

```
#pragma omp declare simd uniform(y)
int foo(int x, int y) {
  return x + y;
}
```

Uniform: always same value

foo:
 add edi, esi
 mov eax, edi
 ret

\_ZGVxN4vu\_foo: movd xmm1, eax pshufd xmm2, xmm1, 0 paddd xmm0, xmm2 ret

## SPMD: Example (OpenMP) - if/else

```
#pragma omp declare simd
int foo(int x, int y) {
    int res;
    if (x > y) res = x;
    else res = y - x;
    return res;
}
```

foo:

mov eax, esi
sub eax, edi
cmp edi, esi
cmovg eax, edi
ret

# SPMD: Example (OpenMP) - if/else

```
#pragma omp declare simd
int foo(int x, int y) {
    int res;
    if (x > y) res = x;
    else res = y - x;
    return res;
}
```

foo:

mov eax, esi sub eax, edi cmp edi, esi cmovg eax, edi ret

\_ZGVxN4vv\_foo: movdqa xmm2, xmm0 pcmpgtd xmm0, xmm1 psubd xmm1, xmm2 pblendvb xmm1, xmm2, xmm0 movdqa xmm0, xmm1 ret

# SPMD: Example (OpenMP) - if/else

```
#pragma omp declare simd
int foo(int x, int y) {
    int res;
    if (x > y) res = x;
    else res = y - x;
    return res;
}
```

 Diverging control flow: all paths are executed foo:

mov eax, esi sub eax, edi cmp edi, esi cmovg eax, edi ret

\_ZGVxN4vv\_foo: movdqa xmm2, xmm0 pcmpgtd xmm0, xmm1 psubd xmm1, xmm2 pblendvb xmm1, xmm2, xmm0 movdqa xmm0, xmm1 ret

## SPMD to SIMD: Handling if/else

Control flow solely depending on uniforms:

# SPMD to SIMD: Handling if/else

Control flow solely depending on uniforms: nothing different

- Otherwise: control flow may diverge
  - Different lanes may choose different execution paths
  - ▶ But: CPU has only one control flow, so all paths must execute
- Condition becomes mask, mask determines result
- ► After insertion of masks, linearize control flow
  - Relevant control flow now encoded in data through masks



## SPMD to SIMD: Handling if/else

Control flow solely depending on uniforms: nothing different

- Otherwise: control flow may diverge
  - Different lanes may choose different execution paths
  - ▶ But: CPU has only one control flow, so all paths must execute
- Condition becomes mask, mask determines result
- ► After insertion of masks, linearize control flow
  - Relevant control flow now encoded in data through masks
- Problem: side-effects prevent vectorization

## SPMD to SIMD: Handling Loops

- Uniform loops: nothing different
- Otherwise: need to retain loop structure
  - "active" mask added to all loop iterations
  - Loop only terminates once all lanes terminate (active is zero)
  - Lanes that terminated early need their values retained
- Approach also works for nested loops/conditions
- Irreducible loops need special handling<sup>21</sup>

## SPMD Implementations on CPUs

#### OpenMP SIMD functions

Need to be combined with #pragma omp simd loops

- Intel ispc<sup>22</sup> (Implicit SPMD Program Compiler)
  - Extension of C with keywords uniform, varying
  - Still active and interesting history<sup>23</sup>

#### OpenCL on CPU

- Very similar programming model
- But: higher complexity for communicating with rest of application

<sup>22</sup>M Pharr and WR Mark. "ispc: A SPMD compiler for high-performance CPU programming". In: InPar. 2012, pp. 1–13.
<sup>23</sup>https://pharr.org/matt/blog/2018/04/30/ispc-all

## SIMD Programming: SPMD on CPUs

Semi-explicit vectorization

Programmer chooses level of vectorization

E.g., inner vs. outer loop

Compiler does actual work

# SIMD Programming: SPMD on CPUs

Semi-explicit vectorization

Programmer chooses level of vectorization

E.g., inner vs. outer loop

- Compiler does actual work
- $+\,$  Allows simple formulation of complex control flow
- Compilers often fail at handling complex control flow well
  - Loops are particularly problematic
# SIMD Programming: Auto-vectorization

Idea: programmer is too incompetent/busy, let compiler do vectorization

► Inherently difficult and problematic, after decades of research

# SIMD Programming: Auto-vectorization

Idea: programmer is too incompetent/busy, let compiler do vectorization

- ► Inherently difficult and problematic, after decades of research
  - Recognizing and matching lots of patterns
  - Instruction selection becomes more difficult
  - Compiler lacks domain knowledge about permissible transformations
- Executive summary of the state of the art:
  - Auto-vectorization works well for very simple cases
  - For "medium complexity", code is often suboptimal
  - In many cases, auto-vectorization fails on unmodified code

## Auto-vectorization Strategies

# Auto-vectorization Strategies

#### Loop Vectorization

- Try to transform loop body into vectors with n lanes
- Often needs tail loop for remainder that doesn't fill a vector
- Extremely common

# Auto-vectorization Strategies

#### Loop Vectorization

- Try to transform loop body into vectors with n lanes
- Often needs tail loop for remainder that doesn't fill a vector
- Extremely common
- Superword-level Parallelism (SLP)
  - Vectorize constructs outside of loops
  - Detect neighbored stores, try to fold operations into vectors

Only consider innermost loop (at first)

Only consider innermost loop (at first)

1. Check legality: is vectorization possible at all?

- Only vectorizable data types and operations used
- No loop-carried dependencies, overlapping memory regions, etc.

Only consider innermost loop (at first)

- 1. Check legality: is vectorization possible at all?
  - Only vectorizable data types and operations used
  - ▶ No loop-carried dependencies, overlapping memory regions, etc.
- 2. Check profitability: is vectorization benefitial?
  - Consider: runtime checks, gather/scatter, masked operations, etc.
  - Needs information about target architecture

Only consider innermost loop (at first)

- 1. Check legality: is vectorization possible at all?
  - Only vectorizable data types and operations used
  - ▶ No loop-carried dependencies, overlapping memory regions, etc.
- 2. Check profitability: is vectorization benefitial?
  - Consider: runtime checks, gather/scatter, masked operations, etc.
  - Needs information about target architecture
- 3. Perform transformation

# Outer Loop Vectorization

Vectorizing the innermost loop not always beneficial

- Example 1: inner loop has only few iterations
- Example 2: inner loop has loop-carried dependencies
- ► Thus: need to consider outer loops as well
  - Also: vectorization on multiple levels might be beneficial

# Outer Loop Vectorization

Vectorizing the innermost loop not always beneficial

- Example 1: inner loop has only few iterations
- Example 2: inner loop has loop-carried dependencies
- ► Thus: need to consider outer loops as well
  - Also: vectorization on multiple levels might be beneficial

Very limited support in compilers, if any

#### Auto-vectorization is Hard

- Biggest problem: data dependencies
  - Resolving loop-carried dependencies is difficult
- Memory aliasing
  - Overlapping arrays, or worse loop counter
- Which loop level to vectorize? Multiple?
- Loop body *might* impact loop count
- Function calls, e.g. for math functions
- Strided memory access (e.g., only every n-th element)
- Choosing vectorization level (outer loop *might* be better)

#### Auto-vectorization is Hard

- Biggest problem: data dependencies
  - Resolving loop-carried dependencies is difficult
- Memory aliasing
  - Overlapping arrays, or worse loop counter
- Which loop level to vectorize? Multiple?
- Loop body *might* impact loop count
- Function calls, e.g. for math functions
- Strided memory access (e.g., only every n-th element)
- Choosing vectorization level (outer loop *might* be better)
- Is vectorization profitable at all?
- Often black box to programmer, preventing fine-grained tuning

# Vectorization – Summary

- SIMD is an easy way to improve performance numbers of CPUs
- Most general-purpose ISAs have one or more SIMD extensions
- Recent trend: variably-length vectors
- ▶ Inline Assembly: easiest for compiler, but extremely tedious
- Intrinsics: best trade-off towards performance and usability
- Target-independent operations: slightly increase portability
- SPMD: strategy dominant for GPU programming
- > Auto-vectorization: very hard, unsuited for complex code

#### Vectorization – Questions

- Why do modern CPUs provide SIMD extensions?
- Why come variable-length SIMD extensions with higher runtime costs?
- ► How are SIMD intrinsics lowered to LLVM-IR?
- What is the downside of target-independent vector operations?
- How can if/else/for constructs be vectorized?
- > What is the difference between a uniform and a varying variable?
- Why is auto-vectorization often sub-par to manual optimization?

Code Generation for Data Processing Lecture 7: Instruction Selection

Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

Instruction Selection

- ► Map IR to assembly
- Keep code shape and storage; change operations

#### Instruction Selection

- Map IR to assembly
- Keep code shape and storage; change operations
- Instruction Scheduling
  - Optimize order to hide latencies
  - Keep operations, may increases demand for registers

#### Instruction Selection

- Map IR to assembly
- Keep code shape and storage; change operations
- Instruction Scheduling
  - Optimize order to hide latencies
  - Keep operations, may increases demand for registers

#### Register Allocation

- Map virtual to architectural registers and stack
- Adds operations (spilling), changes storage

#### Instruction Selection (ISel) – Overview

Find machine instructions to implement abstract IR

- Typically separated from scheduling and register allocation
- ▶ Input: IR code with abstract instructions
- Output: lower-level IR code with target machine instructions

```
i64 %10 = add %8, %9
i8 %11 = trunc %10
i64 %12 = const 24
i64 %13 = add %7, %12
store %11, %13
```

i64 %10 = ADD %8, %9 STRB %10, [%7+24]

Target offers multiple ways to implement operations
 imul x, 2, add x, x, shl x, 1, lea x, [x+x]

Target offers multiple ways to implement operations

imul x, 2, add x, x, shl x, 1, lea x, [x+x]

Target operations have more complex semantics

- E.g., combine truncation and offset computation into store
- Can have multiple outputs, e.g. value+flags, quotient+remainder

Target offers multiple ways to implement operations

imul x, 2, add x, x, shl x, 1, lea x, [x+x]

Target operations have more complex semantics

- E.g., combine truncation and offset computation into store
- Can have multiple outputs, e.g. value+flags, quotient+remainder
- ► Target has multiple register sets, e.g. GP and FP/SIMD
  - Important to consider even before register allocation

Target offers multiple ways to implement operations

imul x, 2, add x, x, shl x, 1, lea x, [x+x]

Target operations have more complex semantics

E.g., combine truncation and offset computation into store

- Can have multiple outputs, e.g. value+flags, quotient+remainder
- ► Target has multiple register sets, e.g. GP and FP/SIMD
  - Important to consider even before register allocation
- Target requires specific instruction sequences
  - E.g., for macro fusion
  - Often represented as pseudo-instructions until assembly writing

#### **Optimal ISel**

▶ Find most performant instruction sequence with same semantics (?)

- ▶ I.e., there no program with better "performance" exists
- Performance = instructions associated with specific costs

#### **Optimal ISel**

▶ Find most performant instruction sequence with same semantics (?)

- I.e., there no program with better "performance" exists
- Performance = instructions associated with specific costs

Problem: optimal code generation is undecidable

#### **Optimal ISel**

▶ Find most performant instruction sequence with same semantics (?)

- I.e., there no program with better "performance" exists
- Performance = instructions associated with specific costs
- Problem: optimal code generation is undecidable
- Alternative: optimal *tiling* of IR with machine code instrs
  - IR as dataflow graph, instr. tiles to optimally cover graph
     NP-complete<sup>24</sup>

# Avoiding ISel Altogether

# Avoiding ISel Altogether

Use an interpreter

- + Fast "compilation time", easy to implement
- Slow execution time
- Best if code is executed once

Expand each IR operation with corresponding machine instrs

Oldest approach, historically also does register allocation

Also possible by walking AST

- Oldest approach, historically also does register allocation
   Also possible by walking AST
- $+\,$  Very fast, linear time, simple to implement, easy to port
- Inefficient and large output code

- Oldest approach, historically also does register allocation
   Also possible by walking AST
- $+\,$  Very fast, linear time, simple to implement, easy to port
- Inefficient and large output code
- Used by, e.g., LLVM FastISel, Go, GCC

#### Peephole Optimization

Plain macro expansion leads to suboptimal results

Idea: replace inefficient instruction sequences<sup>25</sup>

Originally: physical window over assembly code

- Replace with more efficient instructions having same effects
- Possibly with allocated registers
- Extension: do expansion before register allocation<sup>26</sup>
  - Expand IR into Register Transfer Lists (RTL) with temporary registers
  - ▶ While *combining*, ensure that each RTL can be implemented as single instr.

<sup>25</sup>WM McKeeman. "Peephole optimization". In: CACM 8.7 (1965), pp. 443–444. 🝥.
#### Peephole Optimization

- Originally covered only adjacent instructions
- Can also use logical window of data dependencies
  - Problem: instructions with multiple uses
  - Needs more sophisticated matching schemes for data deps.
    - $\Rightarrow$  Tree-pattern matching

### Peephole Optimization

- Originally covered only adjacent instructions
- Can also use logical window of data dependencies
  - Problem: instructions with multiple uses
  - $\blacktriangleright$  Needs more sophisticated matching schemes for data deps.  $\Rightarrow$  Tree-pattern matching
- $+\,$  Fast, also allows for target-specific sequences
- Pattern set grows large, limited potential

### Peephole Optimization

- Originally covered only adjacent instructions
- Can also use logical window of data dependencies
  - Problem: instructions with multiple uses
  - $\blacktriangleright$  Needs more sophisticated matching schemes for data deps.  $\Rightarrow$  Tree-pattern matching
- $+\,$  Fast, also allows for target-specific sequences
- Pattern set grows large, limited potential
- Widely used today at different points during compilation

ISel as Graph Covering – High-level Intuition

Idea: represent program as data flow graph

#### ISel as Graph Covering – High-level Intuition

Idea: represent program as data flow graph

Tree: expression, comb. of single-use SSA instructions (local ISel)
 DAG: data flow in basic block, e.g. SSA block (local ISel)
 Graph: data flow of entire function, e.g. SSA function (global ISel)

#### ISel as Graph Covering – High-level Intuition

Idea: represent program as data flow graph

- Tree: expression, comb. of single-use SSA instructions (local ISel)
- DAG: data flow in basic block, e.g. SSA block
- Graph: data flow of entire function, e.g. SSA function
- ▶ ISA "defines" *pattern set* of trees/DAGs/graphs for instrs.
- Cover data flow tree/DAG/graph with least-cost combination of patterns
  - Patterns in data flow graph may overlap

(local ISel)

(global ISel)

#### SSA form: %4 = shl %1, 4 %5 = add %2, %4

- %6 = add %3, %4
- %7 = load %5

live-out: %6, %7

SSA form: %4 = shl %1, 4 %5 = add %2, %4 %6 = add %3, %4 %7 = load %5

live-out: %6, %7

Data flow graph:



SSA form: %4 = shl %1, 4 %5 = add %2, %4 %6 = add %3, %4 %7 = load %5 live-out: %6, %7

► Data flow graph:



 Method 1: Edge Splitting







SSA form: %4 = shl %1, 4 %5 = add %2, %4 %6 = add %3, %4 %7 = load %5 live-out: %6, %7

► Data flow graph:



 Method 1: Edge Splitting







 Method 2: Node Duplication



## Tree Covering: Patterns

|          | Pattern                                                                                                   | Cost | Instruction                                 |
|----------|-----------------------------------------------------------------------------------------------------------|------|---------------------------------------------|
| $P_0$    | ${\it GP_{R1}}  ightarrow {\it \ll}({\it GP_{R2}}, {\it K_1})$                                            | 1    | lsl $R_1$ , $R_2$ , # $K_1$                 |
| $P_1$    | $GP_{R1} \rightarrow + (GP_{R2}, GP_{R3})$                                                                | 1    | add $R_1$ , $R_2$ , $R_3$                   |
| $P_2$    | ${\it GP}_{R1}  ightarrow$ +( ${\it GP}_{R2}$ , «( ${\it GP}_{R3}$ , ${\it K}_1$ )                        | 2    | add $R_1$ , $R_2$ , $R_3$ , lsl # $K_1$     |
| $P_3$    | ${\it GP}_{R1}  ightarrow$ +(«( ${\it GP}_{R2}$ , ${\it K}_1$ ), ${\it GP}_{R2}$ )                        | 2    | add $R_1$ , $R_3$ , $R_2$ , 1sl # $K_1$     |
| $P_4$    | ${\it GP_{R1}}  ightarrow { m ld}({\it GP_{R2}})$                                                         | 2    | ldr $R_1$ , $[R_2]$                         |
| $P_5$    | $GP_{R1}  ightarrow 	extsf{ld}(	extsf{+}(GP_{R2},\ GP_{R3}))$                                             | 2    | ldr $R_1$ , [ $R_2$ , $R_3$ ]               |
| $P_6$    | ${\it GP}_{R1}  ightarrow { m ld}({ m +}({\it GP}_{R2},{ m 	imes}({\it GP}_{R3},{\it K}_1))$              | 3    | ldr $R_1$ , [ $R_2$ , $R_3$ , lsl # $K_1$ ] |
| $P_7$    | $GP_{R1}  ightarrow 	extsf{ld}(	extsf{+}(	extsf{	extsf{	iny R2}},	extsf{	iny K_1}),	extsf{	iny GP_{R3}})$ | 3    | ldr $R_1$ , [ $R_3$ , $R_2$ , lsl # $K_1$ ] |
| $P_8$    | ${\it GP}_{R1}  ightarrow * ({\it GP}_{R2}, {\it GP}_{R3})$                                               | 3    | madd $R_1$ , $R_2$ , $R_3$ , xzr            |
| $P_9$    | $GP_{R1}  ightarrow$ +(*( $GP_{R2}$ , $GP_{R3}$ ), $GP_{R4}$ )                                            | 3    | madd $R_1$ , $R_2$ , $R_3$ , $R_4$          |
| $P_{10}$ | ${\it GP_{R1}}  ightarrow {\it K_1}$                                                                      | 1    | mov $R_1$ , $K_1$                           |
| ÷        | :                                                                                                         | :    | :<br>:                                      |

## Tree Covering: Greedy/Maximal Munch

- Top-down always take largest pattern
- Repeat for sub-trees, until everything is covered

+ Easy to implement, fast

## Tree Covering: Greedy/Maximal Munch

- Top-down always take largest pattern
- Repeat for sub-trees, until everything is covered
- + Easy to implement, fast
- Result might be non-optimum





Matching Patterns:

 $\blacktriangleright$  +:  $P_1$  - cost 1 - covered nodes: 1

▶ +:  $P_2$  - cost 2 - covered nodes: 3





- ▶ +:  $P_1$  cost 1 covered nodes: 1
- ▶ +:  $P_2$  cost 2 covered nodes: 3
- ▶ +:  $P_9$  cost 3 covered nodes: 2



- ▶ +:  $P_1$  cost 1 covered nodes: 1
- ▶ +:  $P_2$  cost 2 covered nodes: 3 best



- ▶ +:  $P_1$  cost 1 covered nodes: 1
- ▶ +:  $P_2$  cost 2 covered nodes: 3 best
- ▶ +:  $P_9$  cost 3 covered nodes: 2



- ▶ +:  $P_1$  cost 1 covered nodes: 1
- ▶ +:  $P_2$  cost 2 covered nodes: 3 best
- ▶ +:  $P_9$  cost 3 covered nodes: 2
- ▶ \*:  $P_8$  cost 3 covered nodes: 1 best



Matching Patterns:

- ▶ +:  $P_1$  cost 1 covered nodes: 1
- ▶ +:  $P_2$  cost 2 covered nodes: 3 best
- ▶ +:  $P_9$  cost 3 covered nodes: 2
- ▶ \*:  $P_8$  cost 3 covered nodes: 1 best

Total cost: 5



Matching Patterns:

- ▶ +:  $P_1$  cost 1 covered nodes: 1
- ▶ +:  $P_2$  cost 2 covered nodes: 3 best

▶ \*:  $P_8$  - cost 3 - covered nodes: 1 - best

Total cost: 5

madd %1, %a, %b, xzr
add %2, %1, %c, lsl #2

### Tree Covering: with LR-Parsing?

Can we use (LR-)parsing for instruction selection?

### Tree Covering: with LR-Parsing

- Can we use (LR-)parsing for instruction selection? Yes!<sup>27</sup>
  - Pattern set = grammar; IR (in prefix notation) = input



## Tree Covering: with LR-Parsing

- Can we use (LR-)parsing for instruction selection? Yes!<sup>27</sup>
  - Pattern set = grammar; IR (in prefix notation) = input

#### Advantages

Disadvantages

- Possible in linear time
- Can be formally verified
- Implementation can be generated automatically

## Tree Covering: with LR-Parsing

- Can we use (LR-)parsing for instruction selection? Yes!<sup>27</sup>
  - Pattern set = grammar; IR (in prefix notation) = input

#### Advantages

- Possible in linear time
- Can be formally verified
- Implementation can be generated automatically

#### Disadvantages

- Constraints must map to non-terminals
  - Constant ranges, reg types, . . .
- CISC: handle all operand combinations
  - Large grammar (impractical)
  - Refactoring into non-terminals
- Ambiguity hard to handle optimally

### Tree Covering: Dynamic Programming<sup>28</sup>

Step 1: compute cost matrix, bottom-up for all nodes

- Matrix: tree node × non-terminal (different patterns might yield different non-terminals)
- Cost is sum of pattern and sum of children costs
- Always store cheapest rule and cost
- Step 2: walk tree top-down using rules in matrix
  - Start with goal non-terminal, follow rules in matrix
- ► Time linear w.r.t. tree size



Node: Pattern: Pat. Cost: Cost Sum:

2

|    | Node            | +        | *        | «        | 2        |
|----|-----------------|----------|----------|----------|----------|
| GΡ | Cost<br>Pattern | $\infty$ | $\infty$ | $\infty$ | $\infty$ |



Node:2Pattern: $P_{10}$ :  $GP \rightarrow K_1$ Pat. Cost:1Cost Sum:1

|    | Node    | +        | *        | «        | 2        |
|----|---------|----------|----------|----------|----------|
| GΡ | Cost    | $\infty$ | $\infty$ | $\infty$ | 1        |
|    | Pattern |          |          |          | $P_{10}$ |



Node: « Pattern: Pat. Cost: Cost Sum:

|    | Node            | +        | *        | «        | 2          |
|----|-----------------|----------|----------|----------|------------|
| GΡ | Cost<br>Pattern | $\infty$ | $\infty$ | $\infty$ | $1 P_{10}$ |



Node: « Pattern:  $P_?: GP \rightarrow (GP, GP)$ Pat. Cost: 1 Cost Sum: 2

|    | Node    | +        | *        | «       | 2        |
|----|---------|----------|----------|---------|----------|
| GΡ | Cost    | $\infty$ | $\infty$ | 2       | 1        |
|    | Pattern |          |          | $P_{?}$ | $P_{10}$ |



Node: « Pattern:  $P_1$ :  $GP \rightarrow \ll (GP, K_1)$ Pat. Cost: 1 Cost Sum: 1

|    | Node    | +        | *        | «     | 2        |
|----|---------|----------|----------|-------|----------|
| GΡ | Cost    | $\infty$ | $\infty$ | 1     | 1        |
|    | Pattern |          |          | $P_1$ | $P_{10}$ |

\_



Node: Pattern: Pat. Cost: Cost Sum:

\*

|    | Node            | +        | *        | «                                       | 2                    |
|----|-----------------|----------|----------|-----------------------------------------|----------------------|
| GΡ | Cost<br>Pattern | $\infty$ | $\infty$ | $\begin{array}{c} 1 \\ P_1 \end{array}$ | 1<br>P <sub>10</sub> |



Node: \* Pattern:  $P_8: GP \rightarrow *(GP, GP)$ Pat. Cost: 3 Cost Sum: 3

|    | Node    | +        | *     | «     | 2        |
|----|---------|----------|-------|-------|----------|
| GΡ | Cost    | $\infty$ | 3     | 1     | 1        |
|    | Pattern |          | $P_8$ | $P_1$ | $P_{10}$ |

\_



Node: + Pattern: Pat. Cost: Cost Sum:

|    | Node    | +        | *     | «     | 2        |
|----|---------|----------|-------|-------|----------|
| GΡ | Cost    | $\infty$ | 3     | 1     | 1        |
|    | Pattern |          | $P_8$ | $P_1$ | $P_{10}$ |



Node: + Pattern:  $P_1: GP \rightarrow +(GP, GP)$ Pat. Cost: 1 Cost Sum: 5

|    | Node            | +                   | *                   | «                                       | 2                    |
|----|-----------------|---------------------|---------------------|-----------------------------------------|----------------------|
| GP | Cost<br>Pattern | 5<br>P <sub>1</sub> | 3<br>P <sub>8</sub> | $\begin{array}{c} 1 \\ P_1 \end{array}$ | 1<br>P <sub>10</sub> |
#### Tree Covering: Dynamic Programming – Example



Node: + Pattern:  $P_2$ :  $GP \rightarrow +(GP, \ll(GP, K_1))$ Pat. Cost: 2 Cost Sum: 5

|    | Node    | +     | *     | «     | 2        |
|----|---------|-------|-------|-------|----------|
| GΡ | Cost    | 5     | 3     | 1     | 1        |
|    | Pattern | $P_1$ | $P_8$ | $P_1$ | $P_{10}$ |

#### Tree Covering: Dynamic Programming – Example



Node: + Pattern:  $P_9: GP \rightarrow +(*(GP, GP), GP)$ Pat. Cost: 3 Cost Sum: 4

|    | Node    | +     | *     | *     | 2        |
|----|---------|-------|-------|-------|----------|
| GP | Cost    | 4     | 3     | 1     | 1        |
|    | Pattern | $P_9$ | $P_8$ | $P_1$ | $P_{10}$ |

#### Tree Covering: Dynamic Programming – Off-line Analysis

- Cost analysis can actually be precomputed<sup>29</sup>
- Idea: annotate each node with a state based on child states
- Lookup node label from precomputed table (one per non-terminal)
- Significantly improves compilation time
- But: Tables can be large, need to cover all possible (sub-)trees
- ▶ Variation: dynamically compute and cache state tables<sup>30</sup>

<sup>29</sup>A Balachandran, DM Dhamdhere, and S Biswas. "Efficient retargetable code generation using bottom-up tree pattern matching".
 In: Computer Languages 15.3 (1990), pp. 127–140.

<sup>30</sup>MA Ertl, K Casey, and D Gregg. "Fast and flexible instruction selection with on-demand tree-parsing automata". In: *PLDI* 41.6 (2006), pp. 52–60.

# Tree Covering

#### Tree Covering

- $+\,$  Efficient: linear time to find local optimum
- + Better code than pure macro expansion
- $+\,$  Applicable to many ISAs

# Tree Covering

- $+\,$  Efficient: linear time to find local optimum
- + Better code than pure macro expansion
- + Applicable to many ISAs
- $-\,$  Common sub-expressions cannot be represented
  - ► Need either edge split (prevents using complex instructions) or node duplication (redundant computation ⇒ inefficient code)
- Cannot make use of multi-output instructions (e.g., divmod)

#### DAG Covering

Idea: lift restriction of trees, operate on data flow DAG
 Reminder: an SSA basic block already forms a DAG

► Trivial approach: split into trees 🔆

#### DAG Covering

Idea: lift restriction of trees, operate on data flow DAG
 Reminder: an SSA basic block already forms a DAG

► Trivial approach: split into trees 🔆

▶ Least-cost covering is *NP*-complete<sup>31</sup>

# DAG Covering: Adapting Dynamic Programming I<sup>32</sup>

- Step 1: compute cost matrix, bottom-up for all nodes
  - As before; make sure to visit each node once
- Step 2: iterate over DAG top-down
  - Respect that multiple roots exist: start from all roots
  - Mark visited node/non-terminal combinations: avoid redundant emit

# DAG Covering: Adapting Dynamic Programming I<sup>32</sup>

Step 1: compute cost matrix, bottom-up for all nodes

- As before; make sure to visit each node once
- Step 2: iterate over DAG top-down
  - Respect that multiple roots exist: start from all roots
  - Mark visited node/non-terminal combinations: avoid redundant emit

+ Linear time

- Generally not optimal, only for specific grammars



Node: Pattern: Pat. Cost: Cost Sum:

|    | Node            | +2       | +1       | *        |
|----|-----------------|----------|----------|----------|
| GP | Cost<br>Pattern | $\infty$ | $\infty$ | $\infty$ |

\*



| Node:      | *                                   |
|------------|-------------------------------------|
| Pattern:   | $P_8: \ GP \rightarrow *(GP, \ GP)$ |
| Pat. Cost: | 3                                   |
| Cost Sum:  | 3                                   |
|            |                                     |
|            |                                     |

|    | Node            | +2       | +1       | *       |
|----|-----------------|----------|----------|---------|
| GP | Cost<br>Pattern | $\infty$ | $\infty$ | 3<br>Pa |
|    | i attern        |          |          | 18      |



Node: +1 Pattern: Pat. Cost: Cost Sum:

|    | Node            | +2       | +1       | *                   |
|----|-----------------|----------|----------|---------------------|
| GΡ | Cost<br>Pattern | $\infty$ | $\infty$ | 3<br>P <sub>8</sub> |



|    | Node    | +2       | +1    | *     |
|----|---------|----------|-------|-------|
| GP | Cost    | $\infty$ | 4     | 3     |
|    | Pattern |          | $P_1$ | $P_8$ |



Node:  $+_1$ Pattern:  $P_9$ :  $GP \rightarrow +(*(GP, GP), GP)$ Pat. Cost: 3 Cost Sum: 3

|    | Node    | +2       | +1    | *     |
|----|---------|----------|-------|-------|
| GP | Cost    | $\infty$ | 3     | 3     |
|    | Pattern |          | $P_9$ | $P_8$ |



Node: +<sub>2</sub> Pattern: Pat. Cost: Cost Sum:

|    | Node    | +2       | +1    | *     |
|----|---------|----------|-------|-------|
| GP | Cost    | $\infty$ | 3     | 3     |
|    | Pattern |          | $P_9$ | $P_8$ |



|    | Node    | +2    | +1    | *     |
|----|---------|-------|-------|-------|
| GP | Cost    | 4     | 3     | 3     |
|    | Pattern | $P_1$ | $P_9$ | $P_8$ |



Node:  $+_2$ Pattern:  $P_9$ :  $GP \rightarrow +(*(GP, GP), GP)$ Pat. Cost: 3 Cost Sum: 3

|    | Node    | +2    | +1    | *     |
|----|---------|-------|-------|-------|
| GP | Cost    | 3     | 3     | 3     |
|    | Pattern | $P_9$ | $P_9$ | $P_8$ |

Total cost: 6



| madd | %1, | %b, | %с, | %a |
|------|-----|-----|-----|----|
| madd | %2, | %Ъ, | %с, | %d |

|    | Node    | +2    | +1    | *     |
|----|---------|-------|-------|-------|
| GΡ | Cost    | 3     | 3     | 3     |
|    | Pattern | $P_9$ | $P_9$ | $P_8$ |

Total cost: 6



madd %1, %b, %c, %a madd %2, %b, %c, %d

Optimal cost: 5  $\rightarrow$  non-optimal result

|    | Node    | +2    | +1    | *     |
|----|---------|-------|-------|-------|
| GP | Cost    | 3     | 3     | 3     |
|    | Pattern | $P_9$ | $P_9$ | $P_8$ |

# DAG Covering: Adapting Dynamic Programming II<sup>33</sup>

- Step 1: compute cost matrix, bottom-up (as before)
- Step 2: iterate over DAG top-down (as before)
- Step 3: identify overlaps and check whether split is beneficial
  - Mark nodes which should not be duplicated as *fixed*
- Step 4: as step 1, but skip patterns that *include* fixed nodes
- Step 5: as step 2

# DAG Covering: Adapting Dynamic Programming II<sup>33</sup>

- Step 1: compute cost matrix, bottom-up (as before)
- Step 2: iterate over DAG top-down (as before)
- > Step 3: identify overlaps and check whether split is beneficial
  - Mark nodes which should not be duplicated as *fixed*
- Step 4: as step 1, but skip patterns that *include* fixed nodes
- Step 5: as step 2
- + Probably fast? "Near-optimal"?
- Generally not optimal, superlinear time

#### DAG Covering: ILP<sup>34</sup>

Idea: model ISel as integer linear programming (ILP) problem
 P is set of patterns with cost and edges, V are DAG nodes
 Variables: M<sub>p,v</sub> is 1 iff a pattern p is rooted at v

minimize 
$$\sum_{p,v} p.cost \cdot M_{p,v}$$
  
subject to  $\forall r \in roots. \sum_{p} M_{p,r} \geq 1$   
 $\forall p, v, e \in p.edges(v). M_{p,v} - \sum_{p'} M_{p',e} \leq 0$   
 $M_{p,v} \in \{0,1\}$ 

Minimize cost for all matched patterns s.t. every root has a match and every input of a match has a match.

# DAG Covering: ILP<sup>34</sup>

Idea: model ISel as integer linear programming (ILP) problem
 P is set of patterns with cost and edges, V are DAG nodes
 Variables: M<sub>p,v</sub> is 1 iff a pattern p is rooted at v

minimize 
$$\sum_{p,v} p.cost \cdot M_{p,v}$$
  
subject to  $\forall r \in roots. \sum_{p} M_{p,r} \geq 1$   
 $\forall p, v, e \in p.edges(v). M_{p,v} - \sum_{p'} M_{p',e} \leq 0$   
 $M_{p,v} \in \{0,1\}$ 

Minimize cost for all matched patterns s.t. every root has a match and every input of a match has a match.

+ Optimal result

- Practicability beyond small programs questionable (at best)

34 DR Koes and SC Goldstein. "Near-optimal instruction selection on DAGs". In: CGO. 2008, pp. 45-54.

# DAG Covering: Greedy/Maximal Munch

Top-down, start at roots, always take largest pattern
Repeat for remaining roots until whole graph is covered

# DAG Covering: Greedy/Maximal Munch

Top-down, start at roots, always take largest pattern
 Repeat for remaining roots until whole graph is covered

- + Easy to implement, reasonably fast
- Result often non-optimal

# DAG Covering: Greedy/Maximal Munch

- Top-down, start at roots, always take largest pattern
   Repeat for remaining roots until whole graph is covered
- + Easy to implement, reasonably fast
- Result often non-optimal
- Used by: LLVM SelectionDAG

# Graph Covering

Idea: lift limitation of DAGs, cover entire function graphs

- Better handling of predication and VLIW bundling
  - E.g., hoisting instructions from a conditional block
- Allows to handle instructions that expand to multiple blocks
  - switch, select, etc.

# Graph Covering

Idea: lift limitation of DAGs, cover entire function graphs

- Better handling of predication and VLIW bundling
  - E.g., hoisting instructions from a conditional block
- Allows to handle instructions that expand to multiple blocks
  - switch, select, etc.

May need new IR to model control flow in addition to data flow

# Graph Covering

Idea: lift limitation of DAGs, cover entire function graphs

- Better handling of predication and VLIW bundling
  - E.g., hoisting instructions from a conditional block
- Allows to handle instructions that expand to multiple blocks

switch, select, etc.

- May need new IR to model control flow in addition to data flow
- In practice: only used by adapting methods showed for DAGs
- Used by: Java HotSpot Server, LLVM GlobalISel (all tree-covering)

# Flawed Assumptions

#### Flawed Assumptions

- Cost model is fundamentally flawed
- $\Rightarrow$  "Optimal" ISel doesn't really mean anything

#### Flawed Assumptions

- Cost model is fundamentally flawed
- $\Rightarrow$  "Optimal" ISel doesn't really mean anything
- Out-of-order execution: costs are not linear
  - Instructions executed in parallel, might execute for free
  - Possible contention of functional units
- Register allocator will modify instructions
- "Bad" instructions boundaries increase register requirements
  - ▶ More stack spilling ~→ much slower code!

#### LLVM Back-end: Overview

#### LLVM Back-end: Overview

▶ LLVM-IR  $\rightarrow$  Machine IR: instruction selection + scheduling

- MIR is SSA-representation of target instructions
- Selectors: SelectionDAG, FastISel, GlobalISel
- ▶ Also selects register bank (GP/FP/...) required for instruction
- Annotates registers: calling convention, encoding restrictions, etc.

#### LLVM Back-end: Overview

▶ LLVM-IR  $\rightarrow$  Machine IR: instruction selection + scheduling

- MIR is SSA-representation of target instructions
- Selectors: SelectionDAG, FastISel, GlobalISel
- ▶ Also selects register bank (GP/FP/...) required for instruction
- Annotates registers: calling convention, encoding restrictions, etc.
- ▶ MIR: minor (peephole) optimizations
- ► MIR: register allocation
- MIR: prolog/epilog insertion (stack frame, callee-saved regs, etc.)
### LLVM Back-end: Overview

▶ LLVM-IR  $\rightarrow$  Machine IR: instruction selection + scheduling

- MIR is SSA-representation of target instructions
- Selectors: SelectionDAG, FastISel, GlobalISel
- ▶ Also selects register bank (GP/FP/...) required for instruction
- Annotates registers: calling convention, encoding restrictions, etc.
- ▶ MIR: minor (peephole) optimizations
- MIR: register allocation
- MIR: prolog/epilog insertion (stack frame, callee-saved regs, etc.)
- $\blacktriangleright$  MIR  $\rightarrow$  MC: translation to machine code

# LLVM MIR Example

```
define i64 @fn(i64 %a,i64 %b,i64 %c) {
  %shl = shl i64 %c, 2
  %mul = mul i64 %a, %b
  %add = add i64 %mul, %shl
  ret i64 %add
}
```

```
# YAML with name, registers, frame info
body: |
 bb.0 (%ir-block.0):
   liveins: $x0, $x1, $x2
   %2:gpr64 = COPY $x2
   %1:gpr64 = COPY $x1
   \%0:gpr64 = COPY $x0
   %3:gpr64 = MADDXrrr %0, %1, $xzr
   %4:gpr64 = ADDXrs killed %3, %2, 2
   x0 = COPY \%4
   RET_ReallvLR implicit $x0
```

llc -march=aarch64 -stop-after=finalize-isel

#### FastISel

- Uses macro expansion
- Low compile-time
- Code quality poor
- Only common cases
- Otherwise: fallback to SelectionDAG
- Default for -00

#### FastISel

- Uses macro expansion
- Low compile-time
- Code quality poor
- Only common cases
- Otherwise: fallback to SelectionDAG
- Default for -00

#### SelectionDAG

- Converts each block into separate DAGs
- Greedy tree matching
- Slow, but good code
- Handles all cases
- No cross-block opt. (done in DAG building)
- Default

#### FastISel

- Uses macro expansion
- Low compile-time
- Code quality poor
- Only common cases
- Otherwise: fallback to SelectionDAG
- Default for -00

#### SelectionDAG

- Converts each block into separate DAGs
- Greedy tree matching
- Slow, but good code
- Handles all cases
- No cross-block opt. (done in DAG building)

#### Default

#### GloballSel

- Conv. to generic-MIR then legalize to MIR
- Reuses SD patterns
- Faster than SelDAG
- Few architectures
- Handles many cases, SelDAG-fallback

#### LLVM SelectionDAG: IR to ISelDAG

Construct DAG for basic block

EntryToken as ordering chain



# LLVM SelectionDAG: IR to ISelDAG

- Construct DAG for basic block
  - EntryToken as ordering chain
- Legalize data types
  - Integers: promote or expand into multiple
  - Vectors: widen or split (or scalarize)



# LLVM SelectionDAG: IR to ISelDAG

- Construct DAG for basic block
  - EntryToken as ordering chain
- Legalize data types
  - Integers: promote or expand into multiple
  - Vectors: widen or split (or scalarize)
- Legalize operations
  - E.g., conditional move, etc.



# I IVM Selection DAG. IR to ISelDAG

- Construct DAG for basic block
  - EntryToken as ordering chain
- Legalize data types
  - Integers: promote or expand into multiple
  - Vectors: widen or split (or scalarize)
- Legalize operations
  - E.g., conditional move, etc.
- Optimize DAG, e.g. some pattern matching. removing unneeded sign/zero extensions

llc -march=aarch64 -view-isel-dags Note: needs LLVM debug build



#### LLVM SelectionDAG: ISelDAG to DAG

- Mainly pattern matching
- Simple patterns specified in TableGen
  - Matching/selection compiled into bytecode
  - SelectionDAGISel::SelectCodeCommon()
- ► Complex selections done in C++
- Scheduling: linearization of graph

llc -march=aarch64 -view-sched-dags
Note: needs LLVM debug build



# Instruction Selection – Summary

- Instruction Selection: transform generic into arch-specific instructions
- Often focus on optimizing tiling costs
- ► Target instructions often more complex, e.g., multi-result
- Macro Expansion: simple, fast, but inefficient code
- Peephole optimization on sequences/trees to optimize
- Tree Covering: allows for better tiling of instructions
- $\blacktriangleright$  DAG Covering: support for multi-res instrs., but  $\mathcal{NP}\text{-complete}$
- Graph Covering: mightiest, but also most complex, rarely used

#### Instruction Selection – Questions

- ▶ What is the (nowadays typical) input and output IR for ISel?
- Why is good instruction selection important for performance?
- ▶ Why is peephole optimization beneficial for nearly all ISel approaches?
- How can peephole opt. be done more effectively than on neighboring instrs.?
- ▶ What are options to transform an SSA-IR into data flow trees?
- Why is a greedy strategy not optimal for tree pattern matching?
- ▶ When is DAG covering beneficial over tree covering?
- Which ISel strategies does LLVM implement? Why?

Code Generation for Data Processing Lecture 8: Register Allocation

Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

- Assign a register to every value
  - Outputs get a (new) register, input operands often require registers

- Assign a register to every value
  - Outputs get a (new) register, input operands often require registers
- When running out of registers, move values to stack
  - Stack spilling save value register from to stack memory

- Assign a register to every value
  - Outputs get a (new) register, input operands often require registers
- When running out of registers, move values to stack
  - Stack spilling save value register from to stack memory
- $\blacktriangleright$   $\phi$ -nodes: ensure all inputs are assigned to same location

- Assign a register to every value
  - Outputs get a (new) register, input operands often require registers
- When running out of registers, move values to stack
  - Stack spilling save value register from to stack memory
- $\blacktriangleright$   $\phi$ -nodes: ensure all inputs are assigned to same location
- ► Goal: produce correct code, minimize extra load/stores
  - Regalloc affects performance in orders of magnitude

# Register Allocation: Overview Example

```
gauss(%0) {
  %2 = SUBXri %0, 1
  %3 = MADDXrrr %0, %2, 0
  %4 = MOVXconst 2
  %5 = SDIVrr %3, %4
  ret %5
}
```

```
gauss(%0 : X0) {
  %2 = SUBXri %0, 1 : X1
  %3 = MADDXrrr %0, %2, 0 : X0
  %4 = MOVXconst 2 : X1
  %5 = SDIVrr %3, %4 : X0
  ret %5
}
```

May also insert copy and stack spilling instructions

Idea: allocate a one stack slot for every SSA variable/argument

Idea: allocate a one stack slot for every SSA variable/argument

- Load all instruction operands into registers right before
- Perform instruction
- Write result back to stack slot for that SSA variable

Idea: allocate a one stack slot for every SSA variable/argument

- Load all instruction operands into registers right before
- Perform instruction
- Write result back to stack slot for that SSA variable
- $+\,$  Simple, always works, debugging easy
- Extremely inefficient in time and space

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

#### gauss(%0)

%3 = MADDXrrr %0, %2, 0

%4 = MOVXconst 2

```
gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5
```

```
gauss(%0 : X0)
```

```
%2 = SUBXri %0, 1
```

3 = MADDXrrr %0, %2, 0

%4 = MOVXconst 2

```
gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5
```

```
gauss(%0 : X0)
%spills = alloca 8
STRXi %0, %spills, 0
```

%2 = SUBXri %0, 1

%3 = MADDXrrr %0, %2, 0

%4 = MOVXconst 2

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)
%spills = alloca 8
STRXi %0, %spills, 0
%10 = LDRXi %spills, 0 : X0
%2 = SUBXri %10, 1

%3 = MADDXrrr %0, %2, 0

%4 = MOVXconst 2

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)
%spills = alloca 8
STRXi %0, %spills, 0
%10 = LDRXi %spills, 0 : X0
%2 = SUBXri %10, 1 : X0

%3 = MADDXrrr %0, %2, 0

%4 = MOVXconst 2

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

- gauss(%0 : X0)
   %spills = alloca 16
   STRXi %0, %spills, 0
   %l0 = LDRXi %spills, 0 : X0
   %2 = SUBXri %l0, 1 : X0
   STRXi %2, %spills, 8
  - %3 = MADDXrrr %0, %2, 0
  - %4 = MOVXconst 2

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)
 %spills = alloca 16
 STRXi %0, %spills, 0
 %l0 = LDRXi %spills, 0 : X0
 %2 = SUBXri %10, 1 : X0
 STRXi %2, %spills, 8
 %l1 = LDRXi %spills, 0 : X0
 %l2 = LDRXi %spills, 8 : X1
 %3 = MADDXrrr %0, %2, 0

%4 = MOVXconst 2

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)
 %spills = alloca 16
 STRXi %0, %spills, 0
 %l0 = LDRXi %spills, 0 : X0
 %2 = SUBXri %l0, 1 : X0
 STRXi %2, %spills, 8
 %l1 = LDRXi %spills, 0 : X0
 %l2 = LDRXi %spills, 8 : X1
 %3 = MADDXrrr %l1, %l2, 0 : X0

%4 = MOVXconst 2

%5 = SDIVrr %3, %4

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)
 %spills = alloca 24
 STRXi %0, %spills, 0
 %10 = LDRXi %spills, 0 : X0
 %2 = SUBXri %10, 1 : X0
 STRXi %2, %spills, 8
 %11 = LDRXi %spills, 0 : X0
 %12 = LDRXi %spills, 8 : X1
 %3 = MADDXrrr %11, %12, 0 : X0
 STRXi %3, %spills, 16
 %4 = MOVXconst 2

%5 = SDIVrr %3, %4

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)
 %spills = alloca 24
 STRXi %0, %spills, 0
 %10 = LDRXi %spills, 0 : X0
 %2 = SUBXri %10, 1 : X0
 STRXi %2, %spills, 8
 %11 = LDRXi %spills, 0 : X0
 %12 = LDRXi %spills, 8 : X1
 %3 = MADDXrrr %11, %12, 0 : X0
 STRXi %3, %spills, 16
 %4 = MOVXconst 2 : X0

%5 = SDIVrr %3, %4

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)
 %spills = alloca 32
 STRXi %0, %spills, 0
 %10 = LDRXi %spills, 0 : X0
 %2 = SUBXri %10, 1 : X0
 STRXi %2, %spills, 8
 %11 = LDRXi %spills, 0 : X0
 %12 = LDRXi %spills, 8 : X1
 %3 = MADDXrrr %11, %12, 0 : X0
 STRXi %3, %spills, 16
 %4 = MOVXconst 2 : X0
 STRXi %4,i %spills, 24

%5 = SDIVrr %3, %4

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)%spills = alloca 32 STRXi %0, %spills, 0 %10 = LDRXi %spills, 0 : X0 %2 = SUBXri %10, 1 : X0 STRXi %2, %spills, 8 %11 = LDRXi %spills, 0 : X0 %12 = LDRXi %spills, 8 : X1 %3 = MADDXrrr %11, %12, 0 : X0 STRXi %3, %spills, 16 %4 = MOVX const 2 : XOSTRXi %4,i %spills, 24 %13 = LDRXi %spills, 16 : X0 %14 = LDRXi %spills, 24 : X1 %5 = SDIVrr %13, %14
gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)%spills = alloca 32 STRXi %0, %spills, 0 %10 = LDRXi %spills, 0 : X0 %2 = SUBXri %10, 1 : X0 STRXi %2, %spills, 8 %11 = LDRXi %spills, 0 : X0 %12 = LDRXi %spills, 8 : X1 %3 = MADDXrrr %11, %12, 0 : X0 STRXi %3, %spills, 16 %4 = MOVX const 2 : XOSTRXi %4,i %spills, 24 %13 = LDRXi %spills, 16 : X0 %14 = LDRXi %spills, 24 : X1 %5 = SDIVrr %13, %14 : X0

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)%spills = alloca 40 STRXi %0, %spills, 0 %10 = LDRXi %spills, 0 : X0 %2 = SUBXri %10, 1 : X0 STRXi %2, %spills, 8 %11 = LDRXi %spills, 0 : X0 %12 = LDRXi %spills, 8 : X1 %3 = MADDXrrr %11, %12, 0 : X0 STRXi %3, %spills, 16 %4 = MOVX const 2 : XOSTRXi %4.i %spills, 24 %13 = LDRXi %spills, 16 : X0 %14 = LDRXi %spills, 24 : X1 %5 = SDIVrr %13, %14 : X0STRXi %5, %spills, 32

gauss(%0)
%2 = SUBXri %0, 1
%3 = MADDXrrr %0, %2, 0
%4 = MOVXconst 2
%5 = SDIVrr %3, %4
ret %5

gauss(%0 : X0)%spills = alloca 40 STRXi %0, %spills, 0 %10 = LDRXi %spills, 0 : X0 %2 = SUBXri %10, 1 : X0 STRXi %2, %spills, 8 %11 = LDRXi %spills, 0 : X0 %12 = LDRXi %spills, 8 : X1 %3 = MADDXrrr %11, %12, 0 : X0 STRXi %3, %spills, 16 %4 = MOVX const 2 : XOSTRXi %4.i %spills, 24 %13 = LDRXi %spills, 16 : X0 %14 = LDRXi %spills, 24 : X1 %5 = SDIVrr %13, %14 : X0STRXi %5, %spills, 32 %15 = LDRXi %spills, 32 : X0 ret %15

- $\blacktriangleright$   $\phi$ -node needs to become register or stack slot
  - Simplest thing that could possibly work: PHI becomes stack slot
- ▶ Remember:  $\phi$ -nodes are executed on the edge

# Handling PHI Nodes

- $\blacktriangleright$   $\phi$ -node needs to become register or stack slot
  - Simplest thing that could possibly work: PHI becomes stack slot
- ▶ Remember:  $\phi$ -nodes are executed on the edge
- Idea: predecessors write their value to that location at the end
  - First pass: define/allocate storage for  $\phi$ -node, but ignore inputs
  - Second pass: insert move operations at end of predecessors

```
identity(%0)
  br %2
2:
 %3 = phi [ 0, %1 ], [ %4, %2 ]
 %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %2, %6
6:
 ret %3
```

identity(%0)

%5 = CMPXrr\_BLS %4, %0 br %5, %2, %6 6: ret %3

identity(%0) br %2 2: %3 = phi [ 0, %1 ], [ %4, %2 ] %4 = ADDXri %3, 1  $\%5 = CMPXrr_BLS \%4, \%0$ br %5, %2, %6 6: ret %3

identity(%0 : X0)
 %spills = alloca 8
 STRXi %0, %spills, 0
 br %2
2:%3 = phi [ 0, %1 ], [ %4, %2 ]
 %4 = ADDXri %3, 1

%5 = CMPXrr\_BLS %4, %0 br %5, %2, %6 6: ret %3

identity(%0) br %2 2: %3 = phi [ 0, %1 ], [ %4, %2 ] %4 = ADDXri %3, 1  $\%5 = CMPXrr_BLS \%4, \%0$ br %5, %2, %6 6: ret %3

identity(%0 : X0)
 %spills = alloca 16
 STRXi %0, %spills, 0
 br %2
2:%3 = phi [ 0, %1 ], [ %4, %2 ]
 %4 = ADDXri %3, 1

%5 = CMPXrr\_BLS %4, %0 br %5, %2, %6 6: ret %3

identity(%0) br %2 2: %3 = phi [ 0, %1 ], [ %4, %2 ] %4 = ADDXri %3, 1  $\%5 = CMPXrr_BLS \%4, \%0$ br %5, %2, %6 6: ret %3

# identity(%0 : X0) %spills = alloca 16 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %2 ] %10 = LDRXi %spills, 8 : X0 %4 = ADDXri %10, 1 : X0

%5 = CMPXrr\_BLS %4, %0 br %5, %2, %6 6: ret %3

identity(%0) br %2 2: %3 = phi [ 0, %1 ], [ %4, %2 ] %4 = ADDXri %3, 1  $\%5 = CMPXrr_BLS \%4, \%0$ br %5, %2, %6 6: ret %3

# identity(%0 : X0) %spills = alloca 24 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %2 ] %10 = LDRXi %spills, 8 : X0 %4 = ADDXri %10, 1 : X0 STRXi %4, %spills, 16

%5 = CMPXrr\_BLS %4, %0 br %5, %2, %6 6: ret %3

identity(%0) br %2 2: %3 = phi [ 0, %1 ], [ %4, %2 ] %4 = ADDXri %3, 1  $\%5 = CMPXrr_BLS \%4, \%0$ br %5, %2, %6 6: ret %3

identity(%0 : X0) %spills = alloca 24 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %2 ] %10 = LDRXi %spills, 8 : X0 %4 = ADDXri %10, 1 : X0STRXi %4, %spills, 16 %11 = LDRXi %spills, 16 : X0 %12 = LDRXi %spills, 0 : X1 %5 = CMPXrr\_BLS %11, %12 br %5, %2, %6 6: ret %3

identity(%0) br %2 2: %3 = phi [ 0, %1 ], [ %4, %2 ] %4 = ADDXri %3, 1  $\%5 = CMPXrr_BLS \%4, \%0$ br %5, %2, %6 6: ret %3

#### identity(%0 : X0) %spills = alloca 24 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %2 ] %10 = LDRXi %spills, 8 : X0 %4 = ADDXri %10, 1 : X0STRXi %4, %spills, 16 %11 = LDRXi %spills, 16 : X0 %12 = LDRXi %spills, 0 : X1 %5 = CMPXrr\_BLS %11, %12 br %5, %2, %6 6:%13 = LDRXi %spills, 8 : X0 ret %13

identity(%0) br %2 2: %3 = phi [ 0, %1 ], [ %4, %2 ] %4 = ADDXri %3, 1  $\%5 = CMPXrr_BLS \%4, \%0$ br %5, %2, %6 6: ret %3

#### identity(%0 : X0) %spills = alloca 24 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %2 ] %10 = LDRXi %spills, 8 : X0 %4 = ADDXri %10, 1 : X0STRXi %4, %spills, 16 %11 = LDRXi %spills, 16 : X0 %12 = LDRXi %spills, 0 : X1 %5 = CMPXrr\_BLS %11, %12 br %5, %2, %6 6:%13 = LDRXi %spills, 8 : X0 ret %13

```
identity(%0)
  br %2
2:
  %3 = phi [ 0, %1 ], [ %4, %2 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %2, %6
6:
  ret %3
```

```
identity(%0 : X0)
 %spills = alloca 24
  STRXi %0, %spills, 0
 %c0 = MOVXconst 0 : X0
 STRXi %c, %spills, 8
 br %2
2:%3 = phi [ 0, %1 ], [ %4, %2 ]
  %10 = LDRXi %spills, 8 : X0
 %4 = ADDXri %10, 1 : X0
  STRXi %4, %spills, 16
  %11 = LDRXi %spills, 16 : X0
 %12 = LDRXi %spills, 0 : X1
  %5 = CMPXrr_BLS %11, %12
  br %5, %2, %6
6:%13 = LDRXi %spills, 8 : X0
 ret %13
```

```
identity(%0)
    br %2
2:
    %3 = phi [ 0, %1 ], [ %4, %2 ]
    %4 = ADDXri %3, 1
    %5 = CMPXrr_BLS %4, %0
    br %5, %2, %6
6:
    ret %3
```

```
identity(%0 : X0)
 %spills = alloca 24
  STRXi %0, %spills, 0
 %c0 = MOVXconst 0 : X0
 STRXi %c, %spills, 8
 br %2
2:%3 = phi [ 0, %1 ], [ %4, %2 ]
  %10 = LDRXi %spills, 8 : X0
 %4 = ADDXri %10, 1 : X0
  STRXi %4, %spills, 16
  %14 = LDRXi %spills, 16 : X0
  STRXi %14, %spills, 8
  %11 = LDRXi %spills, 16 : X0
  %12 = LDRXi %spills, 0 : X1
 %5 = CMPXrr_BLS %11, %12
  br %5, %2, %6
6:%13 = LDRXi %spills, 8 : X0
 ret %13
```

```
identity(%0)
br %2
2:
    %3 = phi [ 0, %1 ], [ %4, %2 ]
    %4 = ADDXri %3, 1
    %5 = CMPXrr_BLS %4, %0
    br %5, %2, %6
6:
    ret %3
```

```
identity(%0 : X0)
  %spills = alloca 24
  STRXi %0, %spills, 0
  %c0 = MOVXconst 0 : X0
  STRXi %c, %spills, 8
  br %2
2:%3 = phi [ 0, %1 ], [ %4, %2 ]
  %10 = LDRXi %spills, 8 : X0
  %4 = ADDXri %10, 1 : X0
  STRXi %4, %spills, 16
  %14 = LDRXi %spills, 16 : X0
  STRXi %14. %spills. 8
  %11 = LDRXi %spills, 16 : X0
  %12 = LDRXi %spills, 0 : X1
  \%5 = CMPXrr_BLS \%11, \%12
  br %5, %2, %6
6:%13 = LDRXi %spills, 8 : X0
  ret %13
```

```
identity(%0)
br %2
2:
    %3 = phi [ 0, %1 ], [ %4, %2 ]
    %4 = ADDXri %3, 1
    %5 = CMPXrr_BLS %4, %0
    br %5, %2, %6
6:
    ret %3
```

Original value lost in %6!

```
identity(%0 : X0)
  %spills = alloca 24
  STRXi %0, %spills, 0
 %c0 = MOVXconst 0 : X0
 STRXi %c, %spills, 8
 br %2
2:%3 = phi [ 0, %1 ], [ %4, %2 ]
  %10 = LDRXi %spills, 8 : X0
  %4 = ADDXri %10, 1 : X0
  STRXi %4, %spills, 16
  %14 = LDRXi %spills, 16 : X0
  STRXi %14. %spills. 8
  %11 = LDRXi %spills, 16 : X0
  %12 = LDRXi %spills, 0 : X1
 %5 = CMPXrr_BLS %11, %12
  br %5, %2, %6
6:%13 = LDRXi %spills, 8 : X0
  ret %13
```

# Critical Edges

Critical edge: edge from block with mult. succs. to block with mult. preds.

- Problem: cannot place move on such edges
  - When placing in predecessor, they would also execute for other successor ⇒ unnecessary and – worse – incorrect



# Critical Edges

Critical edge: edge from block with mult. succs. to block with mult. preds.

- Problem: cannot place move on such edges
  - When placing in predecessor, they would also execute for other successor ⇒ unnecessary and – worse – incorrect



Break critical edges: insert an empty block

# Critical Edges

Critical edge: edge from block with mult. succs. to block with mult. preds.

- Problem: cannot place move on such edges
  - ▶ When placing in predecessor, they would also execute for other successor ⇒ unnecessary and – worse – incorrect



Break critical edges: insert an empty block

```
identity(%0)
  br %2
2:
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
  ret %3
                                            7:
```

identity(%0)

```
br %2
2:%3 = phi [ 0, %1 ], [ %4, %6 ]
 %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:br %2
```

ret %3

```
identity(%0)
  br %2
2:
                                               br %2
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
                                             6:br %2
  ret %3
                                             7:
                                               ret %3
```

# identity(%0 : X0) %spills = alloca 8 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %6 ] %4 = ADDXri %3, 1

```
%5 = CMPXrr_BLS %4, %0
br %5, %6, %7
5:br %2
7:
ret %3
```

```
identity(%0)
  br %2
2:
                                               br %2
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
                                             6:br %2
  ret %3
                                             7:
                                               ret %3
```

# identity(%0 : X0) %spills = alloca 16 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %6 ] %4 = ADDXri %3, 1

```
%5 = CMPXrr_BLS %4, %0
br %5, %6, %7
5:br %2
7:
ret %3
```

```
identity(%0)
  br %2
2:
                                               br %2
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
                                             6:br %2
  ret %3
                                             7:
```

```
identity(%0 : X0)
  %spills = alloca 16
  STRXi %0, %spills, 0
  br %2
2:%3 = phi [ 0, %1 ], [ %4, %6 ]
  %10 = LDRXi %spills, 8 : X0
  %4 = ADDXri %10, 1 : X0
```

```
%5 = CMPXrr_BLS %4, %0
br %5, %6, %7
5:br %2
7:
ret %3
```

```
identity(%0)
  br %2
2:
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
  ret %3
                                             7:
```

identity(%0 : X0)
 %spills = alloca 24
 STRXi %0, %spills, 0
 br %2
2:%3 = phi [ 0, %1 ], [ %4, %6 ]
 %10 = LDRXi %spills, 8 : X0
 %4 = ADDXri %10, 1 : X0
 STRXi %4, %spills, 16

```
%5 = CMPXrr_BLS %4, %0
br %5, %6, %7
6:br %2
7:
ret %3
```

```
identity(%0)
  br %2
2:
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
  ret %3
```

identity(%0 : X0) %spills = alloca 24 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %6 ] %10 = LDRXi %spills, 8 : X0 %4 = ADDXri %10, 1 : X0STRXi %4, %spills, 16 %11 = LDRXi %spills, 16 : X0 %12 = LDRXi %spills, 0 : X1  $\%5 = CMPXrr_BLS \%11, \%12$ br %5, %6, %7 6:br %2 7: ret %3

```
identity(%0)
  br %2
2:
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
  ret %3
```

identity(%0 : X0) %spills = alloca 24 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %6 ] %10 = LDRXi %spills, 8 : X0 %4 = ADDXri %10, 1 : X0STRXi %4, %spills, 16 %11 = LDRXi %spills, 16 : X0 %12 = LDRXi %spills, 0 : X1  $\%5 = CMPXrr_BLS \%11, \%12$ br %5, %6, %7 6:br %2 7:%13 = LDRXi %spills, 8 : X0 ret %13

```
identity(%0)
  br %2
2:
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
  ret %3
```

identity(%0 : X0) %spills = alloca 24 STRXi %0, %spills, 0 br %2 2:%3 = phi [ 0, %1 ], [ %4, %6 ] %10 = LDRXi %spills, 8 : X0 %4 = ADDXri %10, 1 : X0STRXi %4, %spills, 16 %11 = LDRXi %spills, 16 : X0 %12 = LDRXi %spills, 0 : X1  $\%5 = CMPXrr_BLS \%11, \%12$ br %5, %6, %7 6:br %2 7:%13 = LDRXi %spills, 8 : X0 ret %13

```
identity(%0)
  br %2
2:
  %3 = phi [ 0, %1 ], [ %4, %6 ]
  %4 = ADDXri %3, 1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
  ret %3
```

```
identity(%0 : X0)
  %spills = alloca 24
  STRXi %0, %spills, 0
  %c0 = MOVXconst 0 : X0
  STRXi %c, %spills, 8
  br %2
2:%3 = phi [ 0, %1 ], [ %4, %6 ]
  %10 = LDRXi %spills, 8 : X0
  %4 = ADDXri %10, 1 : X0
  STRXi %4, %spills, 16
  %11 = LDRXi %spills, 16 : X0
  %12 = LDRXi %spills, 0 : X1
  \%5 = CMPXrr_BLS \%11, \%12
  br %5, %6, %7
6:br %2
7:%13 = LDRXi %spills, 8 : X0
  ret %13
```

```
identity(%0)
  br %2
2:
  %3 = phi [0, %1], [%4, %6]
  %4 = ADDXri %3.1
  \%5 = CMPXrr_BLS \%4, \%0
  br %5, %6, %7
6:
  br %2
7:
  ret %3
Pass 2
```

```
identity(%0 : X0)
  %spills = alloca 24
  STRXi %0, %spills, 0
 %c0 = MOVXconst 0 : X0
 STRXi %c, %spills, 8
 br %2
2:\%3 = phi [0, \%1], [\%4, \%6]
  %10 = LDRXi %spills, 8 : X0
 %4 = ADDXri %10, 1 : X0
  STRXi %4, %spills, 16
 %11 = LDRXi %spills, 16 : X0
 %12 = LDRXi %spills, 0 : X1
 %5 = CMPXrr_BLS %11, %12
  br %5, %6, %7
6:%14 = LDRXi %spills, 16 : X0
  STRXi %14. %spills. 8
 br %2
7:%13 = LDRXi %spills, 8 : X0
 ret %13
```

### Breaking Edges

Insert new block for moves



### Breaking Edges

- Insert new block for moves
- $+ \,$  Simple, no analyses needed
- Bad performance in loops



### Breaking Edges

- Insert new block for moves
- + Simple, no analyses needed
- Bad performance in loops

### Copy Used Values

Move values still used to new reg.



$$\begin{array}{c}
\mathbf{r1} \leftarrow \mathbf{0} \\
\mathbf{r2} \leftarrow \mathbf{r1} + \mathbf{1} \\
\mathbf{rT} \leftarrow \mathbf{r1} \\
\mathbf{r1} \leftarrow \mathbf{r2} \\
\mathbf{r3} \leftarrow \mathbf{rT} + \mathbf{x}
\end{array}$$

### Breaking Edges

- Insert new block for moves
- + Simple, no analyses needed
- Bad performance in loops



### Copy Used Values

- Move values still used to new reg.
- + Performance might be better
- Needs more registers

$$r1 \leftarrow 0$$

$$r2 \leftarrow r1 + 1$$

$$rT \leftarrow r1$$

$$r1 \leftarrow r2$$

$$r3 \leftarrow rT + x$$

odd(%0)
 br %2
2:
 %3 = phi [ %0, %1 ], [ %8, %7 ]
 %4 = phi [ 1, %1 ], [ %5, %7 ]
 %5 = phi [ 0, %1 ], [ %4, %7 ]
 %6 = CBNZX(%3)
 br %6, %7, %9
7:
 %8 = SUBXri %3, 1
 br %2
9:
 ret %4

```
odd(%0 : X0)
%spills = alloca 40
STRXi %0, %spills, 0
```

br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
%4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
%5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
%10 = LDRXi %spills, 8 : X0
%6 = CBNZX(%10)
br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
%8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

br %2
9:%12 = LDRXi %spills, 24 : X0
ret %12

odd(%0)
 br %2
2:
 %3 = phi [ %0, %1 ], [ %8, %7 ]
 %4 = phi [ 1, %1 ], [ %5, %7 ]
 %5 = phi [ 0, %1 ], [ %4, %7 ]
 %6 = CBNZX(%3)
 br %6, %7, %9
7:
 %8 = SUBXri %3, 1
 br %2
9:
 ret %4

```
odd(%0 : X0)
%spills = alloca 40
STRXi %0, %spills, 0
```

br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
%4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
%5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
%10 = LDRXi %spills, 8 : X0
%6 = CBNZX(%10)
br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
%8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

br %2
9:%12 = LDRXi %spills, 24 : X0
ret %12
```
odd(%0)
    br %2
2:
    %3 = phi [ %0, %1 ], [ %8, %7 ]
    %4 = phi [ 1, %1 ], [ %5, %7 ]
    %5 = phi [ 0, %1 ], [ %4, %7 ]
    %6 = CBNZX(%3)
    br %6, %7, %9
7:
    %8 = SUBXri %3, 1
    br %2
9:
    ret %4
```

```
odd(%0 : X0)

%spills = alloca 40

STRXi %0, %spills, 0

%13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
```

br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
%4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
%5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
%10 = LDRXi %spills, 8 : X0
%6 = CBNZX(%10)
br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
%8 = SUBKri %12, 1 : X0; STRXi %8, %spills, 32
%14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

```
br %2
9:%12 = LDRXi %spills, 24 : X0
ret %12
```

```
odd(%0)
    br %2
2:
    %3 = phi [ %0, %1 ], [ %8, %7 ]
    %4 = phi [ 1, %1 ], [ %5, %7 ]
    %5 = phi [ 0, %1 ], [ %4, %7 ]
    %6 = CBNZX(%3)
    br %6, %7, %9
7:
    %8 = SUBXri %3, 1
    br %2
9:
    ret %4
```

```
odd(%0 : X0)
%spills = alloca 40
STRXi %0, %spills, 0
%13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
```

2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10) br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

```
odd(%0)
    br %2
2:
    %3 = phi [ %0, %1 ], [ %8, %7 ]
    %4 = phi [ 1, %1 ], [ %5, %7 ]
    %5 = phi [ 0, %1 ], [ %4, %7 ]
    %6 = CBNZX(%3)
    br %6, %7, %9
7:
    %8 = SUBXri %3, 1
    br %2
9:
    ret %4
```

```
odd(\%0 : X0)
 %spills = alloca 40
 STRXi %0, %spills, 0
 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16
 br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
 %10 = LDRXi %spills, 8 : X0
 \%6 = CBNZX(\%10)
 br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32
 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8
 %15 = LDRXi %spills, 24 : X0; STRXi %15, %spills, 16
 br %2
9:%12 = LDRXi %spills, 24 : X0
```

```
ret %12
```

odd(%0)
 br %2
2:
 %3 = phi [ %0, %1 ], [ %8, %7 ]
 %4 = phi [ 1, %1 ], [ %5, %7 ]
 %5 = phi [ 0, %1 ], [ %4, %7 ]
 %6 = CBNZX(%3)
 br %6, %7, %9
7:
 %8 = SUBXri %3, 1
 br %2
9:
 ret %4

odd(%0 : X0)%spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8 %15 = LDRXi %spills, 24 : X0; STRXi %15, %spills, 16 br %2 9:%12 = LDRXi %spills, 24 : X0

ret %12

odd(%0)
 br %2
2:
 %3 = phi [ %0, %1 ], [ %8, %7 ]
 %4 = phi [ 1, %1 ], [ %5, %7 ]
 %5 = phi [ 0, %1 ], [ %4, %7 ]
 %6 = CBNZX(%3)
 br %6, %7, %9
7:
 %8 = SUBXri %3, 1
 br %2
9:
 ret %4

odd(%0 : X0)%spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8 %15 = LDRXi %spills, 24 : X0; STRXi %15, %spills, 16 %16 = LDRXi %spills, 16 : X0; STRXi %16, %spills, 24 br %2 9:%12 = LDRXi %spills, 24 : X0 ret %12

odd(%0)
 br %2
2:
 %3 = phi [ %0, %1 ], [ %8, %7 ]
 %4 = phi [ 1, %1 ], [ %5, %7 ]
 %5 = phi [ 0, %1 ], [ %4, %7 ]
 %6 = CBNZX(%3)
 br %6, %7, %9
7:
 %8 = SUBXri %3, 1
 br %2
9:
 ret %4

 $\blacktriangleright$  Value of  $\phi$  node lost!

```
odd(\%0 : X0)
  %spills = alloca 40
  STRXi %0, %spills, 0
  %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16
 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16
  br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
  %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
  %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
  %10 = LDRXi %spills, 8 : X0
 \%6 = CBNZX(\%10)
  br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
  %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32
  %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8
  %15 = LDRXi %spills, 24 : X0; STRXi %15, %spills, 16
  %16 = LDRXi %spills, 16 : X0; STRXi %16, %spills, 24
  br %2
9:%12 = LDRXi %spills, 24 : X0
  ret %12
```

# PHI Cycles

Problem: \u03c6-nodes can depend on each other

- Can be chains (ordering matters) or cycles (need to be broken)
- ▶ Note: only  $\phi$ -nodes defined in same block are relevant/problematic



1. Compute number of other  $\phi$  nodes reading other  $\phi$  on same edge

- 1. Compute number of other  $\phi$  nodes reading other  $\phi$  on same edge
- 2. For each  $\phi$  with 0 readers: handle node/chain
  - ▶ No readers ~→ start of chain
  - Handling node may unblock next element in chain

- 1. Compute number of other  $\phi$  nodes reading other  $\phi$  on same edge
- 2. For each  $\phi$  with 0 readers: handle node/chain
  - No readers ~> start of chain
  - Handling node may unblock next element in chain
- 3. For all remaining  $\phi$ -nodes: must be cycles, reader count always 1
  - Choose arbitrary node, load to temporary register, unblock value
  - Handle just-created chain
  - Write temporary register to target

- 1. Compute number of other  $\phi$  nodes reading other  $\phi$  on same edge
- 2. For each  $\phi$  with 0 readers: handle node/chain
  - No readers ~> start of chain
  - Handling node may unblock next element in chain
- 3. For all remaining  $\phi$ -nodes: must be cycles, reader count always 1
  - Choose arbitrary node, load to temporary register, unblock value
  - Handle just-created chain
  - Write temporary register to target

Resolving  $\phi$  cycles requires an extra register (or stack slot)

odd(%0 : X0) %spills = alloca 40 STRXi %0, %spills, 0

Edge %1  $\rightarrow$  %2

Critical  $\phi$ :

#### br %2

2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
%4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
%5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
%10 = LDRXi %spills, 8 : X0
%6 = CBNZX(%10)
br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
%8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

odd(%0 : X0) %spills = alloca 40 STRXi %0, %spills, 0

Edge %1  $\rightarrow$  %2

Critical  $\phi$ :

br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
%4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
%5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
%10 = LDRXi %spills, 8 : X0
%6 = CBNZX(%10)
br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
%8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

Edge %1  $\rightarrow$  %2

Critical  $\phi$ :

```
odd(%0 : X0)

%spills = alloca 40

STRXi %0, %spills, 0

%13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
```

```
br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
%4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
%5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
%10 = LDRXi %spills, 8 : X0
%6 = CBNZX(%10)
br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
%8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32
```

Edge %1  $\rightarrow$  %2

Critical  $\phi$ :

```
odd(%0 : X0)

%spills = alloca 40

STRXi %0, %spills, 0

%13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
```

br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
%4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
%5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
%10 = LDRXi %spills, 8 : X0
%6 = CBNZX(%10)
br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
%8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

Edge %1  $\rightarrow$  %2

Critical  $\phi$ :

```
odd(%0 : X0)
%spills = alloca 40
STRXi %0, %spills, 0
%13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
%c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16
```

#### br %2

2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
 %10 = LDRXi %spills, 8 : X0
 %6 = CBNZX(%10)
 br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

Edge %1  $\rightarrow$  %2

Critical  $\phi$ :

```
odd(%0 : X0)
%spills = alloca 40
STRXi %0, %spills, 0
%13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
%c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16
```

#### br %2

2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
%4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
%5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
%10 = LDRXi %spills, 8 : X0
%6 = CBNZX(%10)
br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
%8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

Edge %1  $\rightarrow$  %2

Critical  $\phi$ :

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

▶ %4

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

▶ %4

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :



odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :



odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

- ▶ %4 #readers: 1
- ▶ %5 #readers: 1

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

%4 #readers: 1 - broken
 %5 #readers: 1

Action: break %4

```
odd(\%) : X0)
 %spills = alloca 40
 STRXi %0, %spills, 0
 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8
 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16
 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16
 br %2
2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8
 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16
 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24
 %10 = LDRXi %spills, 8 : X0
 \%6 = CBNZX(\%10)
 br %6, %7, %9
7:%11 = LDRXi %spills, 8 : X0
 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32
 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8
```

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

- ▶ %4 #readers: 1 broken
- ▶ %5 #readers: 0

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8 %15 = LDRXi %spills, 24 : X1

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

- ▶ %4 #readers: 1 broken
- ▶ %5 #readers: 0

odd(%) : X0) %spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8 %15 = LDRXi %spills, 24 : X1

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

- %4 #readers: 0 broken
- ▶ %5 #readers: 0

odd(%0 : X0)%spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8 %15 = LDRXi %spills, 24 : X1 %16 = LDRXi %spills, 16 : X0; STRXi %16, %spills, 24

Edge %7  $\rightarrow$  %2

Critical  $\phi$ :

- ▶ %4 #readers: 0 broken
- ▶ %5 #readers: 0

odd(%0 : X0)%spills = alloca 40 STRXi %0, %spills, 0 %13 = LDRXi %spills, 0 : X0; STRXi %13, %spills, 8 %c0 = MOVXconst 1 : X0; STRXi %c0, %spills, 16 %c1 = MOVXconst 0 : X0; STRXi %c1, %spills, 16 br %2 2:%3 = phi [ %0, %1 ], [ %8, %7 ] // spills+8 %4 = phi [ 1, %1 ], [ %5, %7 ] // spills+16 %5 = phi [ 0, %1 ], [ %4, %7 ] // spills+24 %10 = LDRXi %spills, 8 : X0 %6 = CBNZX(%10)br %6, %7, %9 7:%11 = LDRXi %spills, 8 : X0 %8 = SUBXri %12, 1 : X0; STRXi %8, %spills, 32 %14 = LDRXi %spills, 40 : X0; STRXi %14, %spills, 8 %15 = LDRXi %spills, 24 : X1 %16 = LDRXi %spills, 16 : X0; STRXi %16, %spills, 24 STRXi %15, %spills, 16 br %2 9:%12 = LDRXi %spills, 24 : X0 ret %12



► Goal: keep as many values in registers as possible

• Less stack spilling  $\Rightarrow$  better performance

Goal: keep as many values in registers as possible

- Less stack spilling  $\Rightarrow$  better performance
- Problem: register count (severely) limited
- ~ Are there enough registers? (otherwise: spilling)
- $\rightsquigarrow$  Which register to choose?
- $\rightsquigarrow$  Which register to kill and put on the stack?

Goal: keep as many values in registers as possible

- Less stack spilling  $\Rightarrow$  better performance
- Problem: register count (severely) limited
- ~ Are there enough registers? (otherwise: spilling)
- $\rightsquigarrow$  Which register to choose?
- $\rightsquigarrow$  Which register to kill and put on the stack?
- Needs information when value is actually needed
- ► *Tons* of papers exist
- ► Papers are academic

- ► *Tons* of papers exist
- Papers often skip over important details
  - E.g., when spilling using the value needs another register
  - E.g., temporary register for shuffling values
- Additional (ISA) constraints in practice: (incomplete list)
  - 2-address instructions with destructive source
  - Fixed registers for specific instructions
  - Computing the stack address may need yet another register
  - Different register classes, often just handled independently

- ► *Tons* of papers exist
- Papers often skip over important details
  - E.g., when spilling using the value needs another register
  - E.g., temporary register for shuffling values
- Additional (ISA) constraints in practice: (incomplete list)
  - 2-address instructions with destructive source
  - Fixed registers for specific instructions
  - Computing the stack address may need yet another register
  - Different register classes, often just handled independently

Implementations even of simple algorithms tend to be large and complex

## Liveness Analysis – Definitions

- Live: value still used afterwards
  - ▶ After last (possible) use in program flow, the value becomes dead
- Live ranges: set of ranges in program where value is live
  Not necessarily contiguous, e.g. in case of branches
- Live interval: over-approximation of live ranges without holes
  - Depends on block order, reverse post-order often a good choice
- Live-in/Live-out: values live at begin/end of basic block
  - For φ nodes: φ is live-in, operands are live-out in predecessors (Note: different literature uses different definitions)





















### Liveness Analysis – Example – Live Ranges vs. Live Intervals



Live intervals are substantially worse, but easier to compute

## Liveness Analysis – Algorithm<sup>36</sup>

Iterate over blocks in post-order

- ▶ *live*  $\leftarrow \cup s$ *.liveIn*  $\setminus s$ *.phis*,  $s \in b$ *.successors*
- live  $\leftarrow$  live  $\cup \{\phi.input(b) | \phi \in b.successors.phis\}$
- ▶ b.liveOut  $\leftarrow$  live
- $\blacktriangleright \forall v \in live : ranges[v].add(b.start, b.end)$
- For each non- $\phi$  instruction *inst* in reverse order
  - $\blacktriangleright \textit{ live} \leftarrow (\textit{live} \cup \textit{inst.ops}) \setminus \{\textit{inst}\}$
  - ranges[inst].setStart(inst)
  - $\blacktriangleright \quad \forall op \in inst.ops : ranges[op].add(b.start, inst)$
- $\blacktriangleright$  b.liveln  $\leftarrow$  live  $\cup$  b.phis

Repeat until convergence<sup>35</sup>

<sup>&</sup>lt;sup>35</sup>Reducible graphs: expanding *liveln* of loop headers to the entire loop suffices

<sup>36</sup> Adapted from C Wimmer and M Franz. "Linear scan register allocation on SSA form". In: CGO. 2010, pp. 170-179.

## Register Allocation Decisions (Outline)

Question: are there enough registers for all values?

- Register pressure = number of values live at some point
- Register pressure > # registers  $\Rightarrow$  move some values to stack (spilling)

## Register Allocation Decisions (Outline)

Question: are there enough registers for all values?

- Register pressure = number of values live at some point
- Register pressure > # registers  $\Rightarrow$  move some values to stack (spilling)

Question: when spilling, which values and where to store/reload?

Spilling is expensive, so avoid spilling frequently used values

## Register Allocation Decisions (Outline)

Question: are there enough registers for all values?

- Register pressure = number of values live at some point
- Register pressure > #registers  $\Rightarrow$  move some values to stack (spilling)

Question: when spilling, which values and where to store/reload?
 Spilling is expensive, so avoid spilling frequently used values

Question: for unspilled values, which register to assign?

Also: respect register constraints, etc.

## Register Allocation Strategies

#### Scan-based

- Iterate over the program
- Decide locally what to do
- Greedily assign registers

## Register Allocation Strategies

### Scan-based

- Iterate over the program
- Decide locally what to do
- Greedily assign registers

#### **Graph-based**

- Compute interference graph
  - Nodes are values
  - Edge  $\Rightarrow$  live ranges overlap
- Holistic approach

# Register Allocation Strategies

### Scan-based

- Iterate over the program
- Decide locally what to do
- Greedily assign registers

- $+\,$  Fast, good for straight code
- Code quality often bad
- ▶ Used for -00 and JIT comp.

### **Graph-based**

- Compute interference graph
  - Nodes are values
  - Edge  $\Rightarrow$  live ranges overlap
- Holistic approach
- + Often generate good code
- Expensive, superlinear runtime
- Used for optimized code

### Linear Scan Register Allocation<sup>37</sup>

Idea: treat whole function as single block

- Block order affects quality (but not correctness)
- Only consider live intervals without holes
- Iterate over instructions from top to bottom
- > For operands of instruction in their last use: mark register as free
- Assign instruction result to new free register
  - If no free register available: move some value to the stack
  - Heuristic: value whose liveness ends furthest in future

- + low compile-time, simple
- very suboptimal code, live intervals grossly over-approximated
- What's missing?

- + low compile-time, simple
- very suboptimal code, live intervals grossly over-approximated
- What's missing?
  - Registers to load spilled values
  - Shuffling of values between blocks
  - Register constraints (e.g., for instructions or function calls)

+ low compile-time, simple

- very suboptimal code, live intervals grossly over-approximated
- What's missing?
  - Registers to load spilled values
  - Shuffling of values between blocks
  - Register constraints (e.g., for instructions or function calls)
- > Other disadvantage: once a value is spilled, it is spilled everywhere
  - Some other approaches based on lifetime splitting<sup>38</sup>
- Function calls: clobber lots of registers

Scan-based Register Allocation<sup>41</sup>

Iterate over basic blocks<sup>39</sup>

- Start with register assignment from predecessor
  - Multiple predecessors: choose assignment from any one
  - $\phi$ -nodes can either reside in registers or on the stack
- Iterate over instructions top-down
  - Ensure all instruction operands are in registers
    - When out of registers: move any value to stack
  - ► For operands in their last use: mark register as free
  - Assign instruction result to new free register
- Shuffle values back into registers where successor expects them<sup>40</sup>

<sup>&</sup>lt;sup>39</sup>Typically: reverse post-order, so most predecessors are seen before successors, except for loops.

<sup>&</sup>lt;sup>40</sup>Without critical edges, only relevant for blocks with one successor — others are visited afterwards by RPO definition.

<sup>&</sup>lt;sup>41</sup>Mostly following Go: https://github.com/golang/go/blob/5f7abe/src/cmd/compile/internal/ssa/regalloc.go

What to spill?

What to spill?

- Spill value with furthest use in future<sup>42</sup>
  - ► Frees register for longest time
  - Requires information on next use to be stored during analysis
  - But: avoid spilling values computed inside loops (esp. loop-carried dependencies), reloads are fine<sup>43</sup>
  - Downside: super-linear runtime

Where to store?

<sup>42</sup>C Wimmer and H Mössenböck. "Optimized interval splitting in a linear scan register allocator". In: VEE. 2005, pp. 132–141.
 <sup>43</sup>Intel Optimization Reference Manual (Aug. 2023), Assembly/Compiler Coding Rules 38 and 45

What to spill?

- Spill value with furthest use in future<sup>42</sup>
  - ► Frees register for longest time
  - Requires information on next use to be stored during analysis
  - But: avoid spilling values computed inside loops (esp. loop-carried dependencies), reloads are fine<sup>43</sup>
  - Downside: super-linear runtime

Where to store?

- Stack, period.
- Spilling to FP/vector registers...occasionally proposed, not used in practice

<sup>42</sup>C Wimmer and H Mössenböck. "Optimized interval splitting in a linear scan register allocator". In: VEE. 2005, pp. 132–141.

<sup>43</sup>Intel Optimization Reference Manual (Aug. 2023), Assembly/Compiler Coding Rules 38 and 45

Where to insert store?

Where to insert store?

- Option 1: spill exactly where required
  - Downside: multiple spills of same value, many reloads
- ▶ Option 2: spill once, immediately after computation
  - Later "spills" to the stack are less costly
  - May lead to spills on code paths that don't need it
- Option 3: compute best place using dominator tree
  - Spill store must dominate all subsequent loads

Scan-based Register Allocation – Register Assignment


Scan-based Register Allocation – Register Assignment

Merge blocks: choose predecessor with most values in registers

- High likelihood of reducing the number of stores
- Re-loads are pushed into predecessors

#### Scan-based Register Allocation – Register Assignment

Merge blocks: choose predecessor with most values in registers

- High likelihood of reducing the number of stores
- Re-loads are pushed into predecessors
- Propagate register constraints bottom-up as hints first
  - E.g.: call parameters, instruction constraints, assignment for merge block
  - Reduces number of moves

# Graph Coloring Approaches

# Graph Coloring Approaches

 $\ + \$  Considerably better results than greedy algorithms

# Graph Coloring Approaches

 $\ + \$  Considerably better results than greedy algorithms

- High run-time, even with heuristics
- ► Graph coloring in general is *NP*-complete
- Often used in compilers (e.g., GCC, WebKit)

AD IN2053 "Program Optimization" covers this more formally

- Optionally setup frame pointer
  - Required for variably-sized stack frame
     Otherwise: cannot access spilled variables or stack parameters
- Optionally re-align stack pointer

- Optionally setup frame pointer
  - Required for variably-sized stack frame
     Otherwise: cannot access spilled variables or stack parameters
- Optionally re-align stack pointer
- Save callee-saved registers, maybe also link register
- Optionally add code for stack canary

- Optionally setup frame pointer
  - Required for variably-sized stack frame
     Otherwise: cannot access spilled variables or stack parameters
- Optionally re-align stack pointer
- Save callee-saved registers, maybe also link register
- Optionally add code for stack canary
- Compute stack frame size and adjust stack pointer
  - Mainly size of allocas, but needs to respect alignment
  - Ensure sufficient space for parameters passed on the stack
  - Ensure stack pointer is sufficiently aligned
- Stack pointer adjustment may be omitted for leaf functions
  - Some ABIs guarantee a *red zone*

# Block Ordering

- Order blocks to make use of fall-through in machine code
- Avoid sequences of b.cond; b
  - Sometimes cannot be avoided: conditional branches often have shorter range
- Block ordering has implications for branch prediction
  - Forward branches default to not-taken, backward taken
  - Unlikely blocks placed "out of the way" of the main execution path
  - Indirect branches are predicted as fall-through

# Register Allocation – Summary

Map unlimited virtual registers to restricted register set

#### Responsible for:

- Assigning registers to values
- Deciding which registers to spill to stack
- Deciding when to spill/unspill values
- $\blacktriangleright$   $\phi$ -nodes require extra care, esp. for chains and cycles
- Liveness information is key information for register allocation
- Scan-based approaches are fast, but lead to suboptimal code
- Graph coloring yields better results, but is much slower
- Register allocation/spilling heavily relies on heuristics in practice

# Register Allocation – Questions

- Why is register allocation a difficult problem?
- ▶ How are *φ*-nodes handled during register allocation?
- ▶ What are the two main problems when destructing *φ*-nodes?
- Why are critical edges problematic and how to deal with them?
- ▶ What are practical constraints for register allocation?
- How to detect whether a value is still needed at some point?
- ▶ How to compute the live ranges of values in an SSA-based IR?
- What is the idea of linear scan and what are its practical problems?

#### Code Generation for Data Processing Lecture 9: Object Files, Linker, and Loader

#### Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

Compiler emits object file

Somehow? Some format?

- Compiler emits object file
  - Somehow? Some format?
- Linker merges object files and determines required shared libraries
  - Somehow resolves missing symbols?

- Compiler emits object file
  - Somehow? Some format?
- Linker merges object files and determines required shared libraries
  - Somehow resolves missing symbols?
- Linker creates executable file
  - Somehow? Some format the OS understands?

- Compiler emits object file
  - Somehow? Some format?
- Linker merges object files and determines required shared libraries
  - Somehow resolves missing symbols?
- Linker creates executable file
  - Somehow? Some format the OS understands?
- Kernel loads executable file into memory
- Someone loads shared libraries

- Code Model = address constraints
- Allows for better code
  - Long addrs/offsets = more instrs.
- Exact constraints arch/ABI-specific

- Code Model = address constraints
- Allows for better code
  - Long addrs/offsets = more instrs.
- Exact constraints arch/ABI-specific
- ► x86-64 SysV ABI:
  - Small: code and data max. 2 GiB
  - Medium: code max. 2 GiB
  - Large: no restrictions

- Code Model = address constraints
- Allows for better code
  - Long addrs/offsets = more instrs.
- Exact constraints arch/ABI-specific
- ► x86-64 SysV ABI:
  - Small: code and data max. 2 GiB
  - Medium: code max. 2 GiB
  - Large: no restrictions

- non-PIC: absolute addresses fixed at link-time
  - Addrs can be encoded directly
  - Sometime slightly faster
  - Not possible for shared libs

- Code Model = address constraints
- Allows for better code
  - Long addrs/offsets = more instrs.
- Exact constraints arch/ABI-specific
- ► x86-64 SysV ABI:
  - Small: code and data max. 2 GiB
  - Medium: code max. 2 GiB
  - Large: no restrictions

- non-PIC: absolute addresses fixed at link-time
  - Addrs can be encoded directly
  - Sometime slightly faster
  - Not possible for shared libs
- PIC: address random at load time
  - Offsets need be PC-relative
  - Addresses need fixup at load time (e.g., in jump tables)

- Code Model = address constraints
- Allows for better code
  - Long addrs/offsets = more instrs.
- Exact constraints arch/ABI-specific
- ► x86-64 SysV ABI:
  - Small: code and data max. 2 GiB
  - Medium: code max. 2 GiB
  - Large: no restrictions

- non-PIC: absolute addresses fixed at link-time
  - Addrs can be encoded directly
  - Sometime slightly faster
  - Not possible for shared libs
- ▶ PIC: address random at load time
  - Offsets need be PC-relative
  - Addresses need fixup at load time (e.g., in jump tables)

Compiler needs to know code model

Section 18

**Object Files** 

# Executable and Linkable Format (ELF)

- Widely used format for code
  - REL: relocatable/object file
  - EXEC: executable (non-PIE)
  - DYN: shared library/PIE
  - CORE: coredump
- ELF header: general information
- Program headers: used for execution
- Section headers: used for linking

| ELF Header          |
|---------------------|
| Program Headers     |
| (not for REL)       |
| .text               |
| .rodata             |
| .data               |
|                     |
| e.g., symtab, debug |
| Section Headers     |
| (primarily for REL) |

#### ELF Header

```
// from glibc's elf.h
typedef struct {
 unsigned char e_ident[EI_NIDENT]; /* Magic number and other info */
  Elf64_Half e_type: /* Object file type */
  Elf64_Half e_machine; /* Architecture */
  Elf64_Word e_version; /* Object file version */
  Elf64_Addr e_entry: /* Entry point virtual address */
  Elf64_Off e_phoff: /* Program header table file offset */
  Elf64_Off e_shoff: /* Section header table file offset */
  Elf64_Word e_flags; /* Processor-specific flags */
  Elf64_Half e_ehsize; /* ELF header size in bytes */
  Elf64_Half e_phentsize; /* Program header table entry size */
  Elf64_Half e_phnum; /* Program header table entry count */
  Elf64_Half e_shentsize; /* Section header table entry size */
  Elf64_Half e_shnum: /* Section header table entry count */
  Elf64_Half e_shstrndx; /* Section header string table index */
} Elf64 Ehdr:
```

- Structures content of object files for linker
  - Linker later merges content sections of same "type"
- Some sections have "meta" information (e.g., symbols)



- Structures content of object files for linker
  - Linker later merges content sections of same "type"
- Some sections have "meta" information (e.g., symbols)
- .text program text/code, executable
- 🕨 .rodata -

- Structures content of object files for linker
  - Linker later merges content sections of same "type"
- Some sections have "meta" information (e.g., symbols)
- .text program text/code, executable
- .rodata read-only data
- 🕨 .data -

- Structures content of object files for linker
  - Linker later merges content sections of same "type"
- Some sections have "meta" information (e.g., symbols)
- .text program text/code, executable
- .rodata read-only data
- .data initialized data, writable
- 🕨 .bss -

- Structures content of object files for linker
  - Linker later merges content sections of same "type"
- Some sections have "meta" information (e.g., symbols)
- .text program text/code, executable
- .rodata read-only data
- .data initialized data, writable
- .bss zero-initialized data, no storage, writable
  - Name history: block started by symbol
- .strtab -

- Structures content of object files for linker
  - Linker later merges content sections of same "type"
- Some sections have "meta" information (e.g., symbols)
- .text program text/code, executable
- .rodata read-only data
- .data initialized data, writable
- .bss zero-initialized data, no storage, writable
  - Name history: block started by symbol
- .strtab string table for symbol names
- .symtab -

- Structures content of object files for linker
  - Linker later merges content sections of same "type"
- Some sections have "meta" information (e.g., symbols)
- .text program text/code, executable
- .rodata read-only data
- .data initialized data, writable
- .bss zero-initialized data, no storage, writable
  - Name history: block started by symbol
- .strtab string table for symbol names
- symtab symbol table, references string table for names
- .shstrtab -

- Structures content of object files for linker
  - Linker later merges content sections of same "type"
- Some sections have "meta" information (e.g., symbols)
- .text program text/code, executable
- .rodata read-only data
- .data initialized data, writable
- .bss zero-initialized data, no storage, writable
  - Name history: block started by symbol
- .strtab string table for symbol names
- symtab symbol table, references string table for names
- shstrtab string table for section header names

# ELF String Table

- Sequence of NUL-terminated character sequences
- String identified by byte offset
- Must start with a NUL byte:

# ELF String Table

Sequence of NUL-terminated character sequences

String identified by byte offset

Must start with a NUL byte: string 0 always empty string

Must end with a NUL byte: all strings are terminated

Example .strtab:

\0 v a r n a m e \0 f o o \0String 0String 1String 4String 9

# ELF String Table

Sequence of NUL-terminated character sequences

String identified by byte offset

Must start with a NUL byte: string 0 always empty string

Must end with a NUL byte: all strings are terminated

Example .strtab: \0 e ∖0 f \0 v а r n а m 0 0 String 0 String 4 String 1 String 9 .....
#### ELF String Table

Sequence of NUL-terminated character sequences

String identified by byte offset

Must start with a NUL byte: string 0 always empty string

Must end with a NUL byte: all strings are terminated

Example .strtab: \0 e ∖0 f \0 v а r n а m 0 0 String 4 String 0 String 1 String 9 ..... "varname"

#### ELF String Table

Sequence of NUL-terminated character sequences

String identified by byte offset

Must start with a NUL byte: string 0 always empty string

Must end with a NUL byte: all strings are terminated

Example .strtab: \0 \0 f \0 v а r n а m е 0 0 String 0 String 1 String 4 String 9 ..... "varname" "name"

#### ELF String Table

Sequence of NUL-terminated character sequences

String identified by byte offset

Must start with a NUL byte: string 0 always empty string

Must end with a NUL byte: all strings are terminated

Example .strtab: \0 \0 f \0 v а r n а m е 0 0 String 0 String 1 String 4 String 9 ..... "varname" "name" "foo"

#### ELF Section Header

```
typedef struct {
  Elf64_Word sh_name; /* Section name (string tbl index) */
  Elf64_Word sh_type; /* Section type */
  // SHT_{NULL, PROGBITS, SYMTAB, STRTAB, RELA, HASH, NOBITS, ...}
  Elf64_Xword sh_flags; /* Section flags */
  // SHF_{WRITE, ALLOC, EXECINSTR, MERGE, STRINGS, ... }
  Elf64_Addr sh_addr; /* Section virtual addr at execution */
  Elf64 Off sh offset: /* Section file offset */
  Elf64_Xword sh_size; /* Section size in bytes */
  Elf64 Word sh link: /* Link to another section */
  Elf64_Word sh_info; /* Additional section information */
  Elf64_Xword sh_addralign; /* Section alignment */
  Elf64_Xword sh_entsize; /* Entry size if section holds table */
} Elf64 Shdr:
```

// first section is always undefined/SHT\_NULL

#### Example: Section Headers

```
void external(void);
static void bar(void) {}
void foo(void) { bar(); }
void func(void) {
foo(); external(); }
```

#### Section Headers:

| [Nr] | Name               | Туре     | ES | Flg | Lk | Inf | Al |
|------|--------------------|----------|----|-----|----|-----|----|
| 0]   |                    | NULL     | 00 |     | 0  | 0   | 0  |
| [ 1] | .text              | PROGBITS | 00 | AX  | 0  | 0   | 1  |
| 2]   | .rela.text         | RELA     | 18 | I   | 10 | 1   | 8  |
| 3]   | .data              | PROGBITS | 00 | WA  | 0  | 0   | 1  |
| [ 4] | .bss               | NOBITS   | 00 | WA  | 0  | 0   | 1  |
| 5]   | .comment           | PROGBITS | 01 | MS  | 0  | 0   | 1  |
| 6]   | .note.GNU-stack    | PROGBITS | 00 |     | 0  | 0   | 1  |
| 7]   | .note.gnu.property | NOTE     | 00 | Α   | 0  | 0   | 8  |
| 8]   | .eh_frame          | PROGBITS | 00 | Α   | 0  | 0   | 8  |
| 9]   | .rela.eh_frame     | RELA     | 18 | I   | 10 | 8   | 8  |
| 10]  | .symtab            | SYMTAB   | 18 |     | 11 | 4   | 8  |
| [11] | .strtab            | STRTAB   | 00 |     | 0  | 0   | 1  |
| 12]  | .shstrtab          | STRTAB   | 00 |     | 0  | 0   | 1  |

#### Symbol Table

Describes symbolic reference to object/function

Names in associated string table, referenced by byte offset

```
typedef struct {
  Elf64_Word st_name; /* Symbol name (string tbl index) */
  unsigned char st_info; /* Symbol type and binding */
  unsigned char st_other; /* Symbol visibility */
  Elf64_Section st_shndx; /* Section index */
  Elf64_Addr st_value; /* Symbol value */
  Elf64_Xword st_size; /* Symbol size */
} Elf64_Sym;
```

#### Example: Symbol Table

```
void external(void);
static void bar(void) {}
void foo(void) { bar(); }
void func(void) {
 foo(); external(); }
```

- ► Nax=UND: undefined
  - value is zero
- Ndx=ABS: no section base
  - value is absolute
- Ndx=num: section idx.
  - value is offset into sec.
  - later refers to address

| Section | n Headers: |            |          |    |     |    |             |    |
|---------|------------|------------|----------|----|-----|----|-------------|----|
| [Nr]    | Name       | Туре       | Size     | ES | Flg | Lk | ${\tt Inf}$ | Al |
| [ 0]    |            | NULL       | 000000   | 00 |     | 0  | 0           | 0  |
| [ 1]    | .text      | PROGBITS   | 00001a   | 00 | AX  | 0  | 0           | 1  |
| 11 .    |            |            |          |    |     |    |             |    |
| [10]    | .symtab    | SYMTAB     | 0000a8   | 18 |     | 11 | 4           | 8  |
|         | siz        | zeof(Elf64 | 4_Sym) - | /  |     |    | - I         |    |
|         |            | link to a  | strtab - |    |     | /  | - I         |    |
|         | first      | t non-loca | al sym - |    |     |    | /           |    |
| [11]    | .strtab    | STRTAB     | 00001f   | 00 |     | 0  | 0           | 1  |
| [12]    | .shstrtab  | STRTAB     | 00006c   | 00 |     | 0  | 0           | 1  |

| Symbol | l tabl | le '.s | symtab' | contains | s 7 entri | es: |                 |
|--------|--------|--------|---------|----------|-----------|-----|-----------------|
| Num:   | Val    | Size   | Туре    | Bind     | Vis       | Ndx | Name            |
| 0:     | 000    | 0      | NOTYPE  | LOCAL    | DEFAULT   | UND |                 |
| 1:     | 000    | 0      | FILE    | LOCAL    | DEFAULT   | ABS | <stdin></stdin> |
| 2:     | 000    | 0      | SECTION | LOCAL    | DEFAULT   | 1   | .text           |
| 3:     | 000    | 1      | FUNC    | LOCAL    | DEFAULT   | 1   | bar             |
| 4:     | 001    | 6      | FUNC    | GLOBAL   | DEFAULT   | 1   | foo             |
| 5:     | 007    | 19     | FUNC    | GLOBAL   | DEFAULT   | 1   | func            |
| 6:     | 000    | 0      | NOTYPE  | GLOBAL   | DEFAULT   | UND | externa         |

```
void external(void);
static void bar(void) {}
void foo(void) { bar(); }
void func(void) {
 foo(); external(); }
```

000000000000000 <bar>: 0: c3 ret 000000000000001 <foo>: e8 ?? ?? ?? ?? call ??? 1: 6: c3 ret 000000000000007 <func>: 48 83 ec 08 7: sub rsp,0x8 b: e8 ?? ?? ?? ?? call ??? 10: e8 ?? ?? ?? ?? call ??? 15: 48 83 c4 08 add rsp,0x8 19: cЗ ret

```
void external(void);
static void bar(void) {}
void foo(void) { bar(); }
void func(void) {
 foo(); external(); }
```

000000000000000 <bar>: 0: c3 ret 000000000000001 <foo>: e8 fa ff ff ff call 1: 0 < bar >6: c3 ret 000000000000007 <func>: 48 83 ec 08 7: sub rsp,0x8 b: e8 ?? ?? ?? ?? call ??? 10: e8 ?? ?? ?? ?? call ??? 15: 48 83 c4 08 add rsp,0x8 19: cЗ ret

```
void external(void);
static void bar(void) {}
void foo(void) { bar(); }
void func(void) {
foo(); external(); }
```

000000000000000 <bar>: 0: c3 ret 000000000000001 <foo>: 1: e8 fa ff ff ff call 0 < bar>c3 6: ret 000000000000007 <func>: 7: 48 83 ec 08 sub rsp,0x8 10 < func+0x9>h: e8 00 00 00 00 call c: R\_X86\_64\_PC32<sup>a</sup>  $f_{00} = 0x4$ 10: e8 ?? ?? ?? ?? call ??? 15: 48 83 c4 08 add rsp,0x8 19: c3 ret

<sup>a</sup>Recent GAS emits R\_X86\_64\_PLT32, which is equivalent for local symbols.

```
void external(void);
static void bar(void) {}
void foo(void) { bar(); }
void func(void) {
foo(); external(); }
```

000000000000000 <bar>: 0: c3 ret 000000000000001 < foo>: 1. e8 fa ff ff ff call 0 < bar >6: c3 ret 000000000000007 < func>:7: 48 83 ec 08 rsp,0x8 sub h٠ e8 00 00 00 00 call 10 < func+0x9>c: R X86 64 PC32<sup>a</sup>  $f_{00} = 0x4$ 10. e8 00 00 00 00 call 15 < func+0xe>11: R X86 64 PLT32 external-0x4 15: 48 83 c4 08 add rsp,0x8 19: c3 ret

<sup>a</sup>Recent GAS emits R\_X86\_64\_PLT32, which is equivalent for local symbols.

```
void external(void);
static void bar(void) {}
void foo(void) { bar(); }
void func(void) {
foo(); external(); }
```

- Symbol may be unknown
- Linker needs to resolve offset later
- $\rightsquigarrow$  Relocations

000000000000000 <bar>: 0: c3 ret 000000000000001 < foo>: 1. e8 fa ff ff ff call 0 < bar >6: c3 ret 00000000000007 <func>: 7: 48 83 ec 08 rsp,0x8 sub h٠ e8 00 00 00 00 call 10 < func+0x9>c: R X86 64 PC32<sup>a</sup>  $f_{00} = 0x4$ 10. e8 00 00 00 00 call 15 < func+0xe>11: R X86 64 PLT32 external-0x4 15: 48 83 c4 08 add rsp,0x8 19. c3 ret

<sup>a</sup>Recent GAS emits R\_X86\_64\_PLT32, which is equivalent for local symbols.

#### Relocations

- Problem: symbol values unknown before linking
  - External symbols: unavailable; other section: distance unknown
- Idea: store *relocations*  $\Rightarrow$  linker patches code/data

#### Relocations

- Problem: symbol values unknown before linking
  - External symbols: unavailable; other section: distance unknown
- Idea: store *relocations*  $\Rightarrow$  linker patches code/data
- Relocation: quadruple of (offset in sec., type, symbol idx, addend)
   Contained in REL/RELA/RELR sections

#### Relocations

- Problem: symbol values unknown before linking
  - External symbols: unavailable; other section: distance unknown
- Idea: store *relocations*  $\Rightarrow$  linker patches code/data
- Relocation: quadruple of (offset in sec., type, symbol idx, addend)
   Contained in REL/RELA/RELR sections

# Static RelocationET\_RELDynamic Relocation ET\_EXEC/ET\_DYNFor static linker (1d)For dynamic linker/loaderEither: resolve or emit dyn. relocShall be fast, outside code

#### Types and meaning defined by psABI<sup>44</sup>

P: address of place being relocated; S: symbol address; L: PLT addr. for symbol; Z: sym. size; A: addend; B: dynamic base address of shared obj.; G: GOT offset; GOT: GOT address

| Name Field Calculation | Name | Field Calculation |
|------------------------|------|-------------------|
|------------------------|------|-------------------|

#### Types and meaning defined by psABI<sup>44</sup>

P: address of place being relocated; S: symbol address; L: PLT addr. for symbol; Z: sym. size; A: addend; B: dynamic base address of shared obj.; G: GOT offset; GOT: GOT address

| Name          | Field | Calculation | Name | Field | Calculation |
|---------------|-------|-------------|------|-------|-------------|
| R_X86_64_PC32 | 32    | S + A - P   |      |       |             |

#### Types and meaning defined by psABI<sup>44</sup>

**P**: address of place being relocated; **S**: symbol address; **L**: PLT addr. for symbol; **Z**: sym. size; **A**: addend; **B**: dynamic base address of shared obj.; **G**: GOT offset; **GOT**: GOT address

| Name           | Field | Calculation | Name | Field | Calculation |
|----------------|-------|-------------|------|-------|-------------|
| R_X86_64_PC32  | 32    | S + A - P   |      |       |             |
| R_X86_64_PLT32 | 32    | L + A - P   |      |       |             |

#### Types and meaning defined by psABI<sup>44</sup>

P: address of place being relocated; S: symbol address; L: PLT addr. for symbol; Z: sym. size;
A: addend; B: dynamic base address of shared obj.; G: GOT offset; GOT: GOT address

| Name                 | Field | Calculation     | Name                | Field | Calculation     |
|----------------------|-------|-----------------|---------------------|-------|-----------------|
| R_X86_64_64          | 64    | S + A           | R_X86_64_32         | 32    | S + A (zext)    |
| R_X86_64_PC32        | 32    | S + A - P       | R_X86_64_32S        | 32    | S + A (sext)    |
| R_X86_64_GOT32       | 32    | G + A           | R_X86_64_GOTOFF64   | 64    | S + A - GOT     |
| R_X86_64_PLT32       | 32    | L + A - P       | R_X86_64_GOTPC32    | 32    | GOT + A - P     |
| R_X86_64_GLOB_DAT    | addr  | S               | R_X86_64_G0T64      | 64    | G + A           |
| R_X86_64_JUMP_SLOT   | addr  | 5               | R_X86_64_GOTPCREL64 | 64    | G + GOT + A - P |
| R_X86_64_RELATIVE    | addr  | B + A           | R_X86_64_GOTPC64    | 64    | GOT + A - P     |
| R_X86_64_GOTPCREL    | 32    | G + GOT + A - P | R_X86_64_PLT0FF64   | 64    | L - GOT + A     |
| R_X86_64_GOTPCRELX   |       |                 | R_X86_64_SIZE32     | 32    | Z + A           |
| R_X86_64_REX_GOTPCRE | LX    |                 | R_X86_64_SIZE64     | 64    | Z + A           |

#### **Relocation Section**

Section Headers:

| [Nr] Name                               | Туре       | Size     | ES | Flg | Lk | Inf | Al |
|-----------------------------------------|------------|----------|----|-----|----|-----|----|
| [ 1] .text                              | PROGBITS   | 00001a   | 00 | AX  | 0  | 0   | 1  |
| [ 2] .rela.text                         | RELA       | 000030   | 18 | I   | 10 | 1   | 8  |
| siz                                     | eof(Elf64  | _Rela) - | /  | - I |    | - I |    |
| I: info                                 | is section | n link – |    | /   | 1  | 1   |    |
|                                         | link to a  | symtab - |    |     | /  | 1   |    |
| <pre>target sec. for relocations/</pre> |            |          |    |     |    |     |    |
| [10] .symtab                            | SYMTAB     | 0000a8   | 18 |     | 11 | 4   | 8  |

 Info
 Type
 Symbol's Name + Addend

 00000000000000
 0000000000002
 R\_X86\_64\_PC32
 foo - 4

 000000000000011
 000000000004
 R\_X86\_64\_PLT32
 external - 4

#### Relocations on RISC Architectures

- RISC architectures typically have more relocation types
  - ► Example: AArch64<sup>45</sup> has >50 relocations
- Building a 64-bit address requires several instructions (AArch64: one for bits 0–15, 16–31, ...)
  - Each instruction needs a different relocation to patch in the bits! movz x0, #:abs\_g0\_nc:globalVariable movk x0, #:abs\_g1\_nc:globalVariable movk x0, #:abs\_g2\_nc:globalVariable movk x0, #:abs\_g3:globalVariable
- Often: page-granular address with added offset for low bits
  - ▶ adrp for  $\pm$ 4 GiB range, add or load offset for low bits
  - Scaled load offsets require different relocations for each scale

- Branches (often) have limited range; compiler must assume max. distance
- ▶ x86-64:  $\pm 2$  GiB range, if larger use mov and indirect jump
- ► AArch64: ±128 MiB range ~→ executable sections must be <127 MiB linker will insert veneer between different .text sections</p>
  - Veneer allowed to clobber inter-procedural scratch registers x16/x17

- Branches (often) have limited range; compiler must assume max. distance
- ▶ x86-64:  $\pm 2$  GiB range, if larger use mov and indirect jump
- ► AArch64: ±128 MiB range ~→ executable sections must be <127 MiB linker will insert veneer between different .text sections</p>
  - Veneer allowed to clobber inter-procedural scratch registers x16/x17
- ▶ badly designed ISA:  $\pm 1$  MiB range  $\rightsquigarrow$  needs ind. jump often

Branches (often) have limited range; compiler must assume max. distance

▶ x86-64:  $\pm 2$  GiB range, if larger use mov and indirect jump

- ► AArch64: ±128 MiB range ~→ executable sections must be <127 MiB linker will insert veneer between different .text sections</p>
  - Veneer allowed to clobber inter-procedural scratch registers x16/x17
- badly designed ISA: ±1 MiB range → needs ind. jump often
   Construct 20 bit first with auipc, insert low 12 bit in jalr

Branches (often) have limited range; compiler must assume max. distance

▶ x86-64:  $\pm 2$  GiB range, if larger use mov and indirect jump

► AArch64: ±128 MiB range ~→ executable sections must be <127 MiB linker will insert veneer between different .text sections</p>

Veneer allowed to clobber inter-procedural scratch registers x16/x17

▶ badly designed ISA:  $\pm 1$  MiB range  $\rightsquigarrow$  needs ind. jump often

- Construct 20 bit first with auipc, insert low 12 bit in jalr
- Add new relax reloc: linker optimizes/relaxes code if possible

Branches (often) have limited range; compiler must assume max. distance

▶ x86-64:  $\pm 2$  GiB range, if larger use mov and indirect jump

► AArch64: ±128 MiB range ~→ executable sections must be <127 MiB linker will insert veneer between different .text sections</p>

Veneer allowed to clobber inter-procedural scratch registers x16/x17

▶ badly designed ISA:  $\pm 1$  MiB range  $\rightsquigarrow$  needs ind. jump often

- Construct 20 bit first with auipc, insert low 12 bit in jalr
- Add new relax reloc: linker optimizes/relaxes code if possible
- Changes code size, all relative offsets now need relocations, too

Branches (often) have limited range; compiler must assume max. distance

▶ x86-64:  $\pm 2$  GiB range, if larger use mov and indirect jump

► AArch64: ±128 MiB range ~→ executable sections must be <127 MiB linker will insert veneer between different .text sections</p>

Veneer allowed to clobber inter-procedural scratch registers x16/x17

▶ badly designed ISA:  $\pm 1$  MiB range  $\rightsquigarrow$  needs ind. jump often

- Construct 20 bit first with auipc, insert low 12 bit in jalr
- Add new relax reloc: linker optimizes/relaxes code if possible
- Changes code size, all relative offsets now need relocations, too
- Alignment guarantees need new, special align relocations

#### Section 19

#### Executable Files



► Goal: combine multiple input files (.o/.so/.a) into executable or shared lib.

1. Find and load all input files

- 1. Find and load all input files
- 2. Scan input, store symbols, resolve symbols on-the-fly

- 1. Find and load all input files
- 2. Scan input, store symbols, resolve symbols on-the-fly
- 3. Create synthetic section (GOT, PLT, relocations for output file)

- 1. Find and load all input files
- 2. Scan input, store symbols, resolve symbols on-the-fly
- 3. Create synthetic section (GOT, PLT, relocations for output file)
- 4. Process relocations: create PLT/GOT entry and dynamic reloc.

- 1. Find and load all input files
- 2. Scan input, store symbols, resolve symbols on-the-fly
- 3. Create synthetic section (GOT, PLT, relocations for output file)
- 4. Process relocations: create PLT/GOT entry and dynamic reloc.
- 5. Optimize and deduplicate sections

- 1. Find and load all input files
- 2. Scan input, store symbols, resolve symbols on-the-fly
- 3. Create synthetic section (GOT, PLT, relocations for output file)
- 4. Process relocations: create PLT/GOT entry and dynamic reloc.
- 5. Optimize and deduplicate sections
- 6. Write section to output file
  - Apply relocations which are now known; compress sections; etc.

► Goal: combine multiple input files (.o/.so/.a) into executable or shared lib.

- 1. Find and load all input files
- 2. Scan input, store symbols, resolve symbols on-the-fly
- 3. Create synthetic section (GOT, PLT, relocations for output file)
- 4. Process relocations: create PLT/GOT entry and dynamic reloc.
- 5. Optimize and deduplicate sections
- 6. Write section to output file
  - Apply relocations which are now known; compress sections; etc.

7. Profit!
Entry in ELF header: entry address of the program

- Typically provided by libc to call \_\_libc\_start\_main
- Program headers: instructions for loading the program

Entry in ELF header: entry address of the program

- Typically provided by libc to call \_\_libc\_start\_main
- Program headers: instructions for loading the program

▶ PT\_PHDR: described program headers

Entry in ELF header: entry address of the program

- Typically provided by libc to call \_\_libc\_start\_main
- Program headers: instructions for loading the program
- ▶ PT\_PHDR: described program headers
- PT\_LOAD: loadable segment
  - Specifies virtual address, file offset, file size/memory size, permission
  - vaddr&(pgsize-1)==offset&(pgsize-1) kernel will just mmap the file
  - memory size > file size  $\Rightarrow$  filled up with zeros (for .bss)

Entry in ELF header: entry address of the program

- Typically provided by libc to call \_\_libc\_start\_main
- Program headers: instructions for loading the program
- ▶ PT\_PHDR: described program headers
- PT\_LOAD: loadable segment
  - Specifies virtual address, file offset, file size/memory size, permission
  - vaddr&(pgsize-1)==offset&(pgsize-1) kernel will just mmap the file
  - memory size > file size  $\Rightarrow$  filled up with zeros (for .bss)
- ▶ PT\_INTERP/PT\_DYNAMIC: when PIE or with shared libraries
- PT\_GNU\_STACK: permissions indicate whether stack is non-executable

#### Example: Program Headers

Program Headers:

```
Offset VirtAddr FileSiz MemSiz
                                             Flg Align
 Type
 I.OAD
             0x000000 0x00400000 0x0a0d5e 0x0a0d5e B E 0x1000
 LOAD
              0x0a17d8 0x004a27d8 0x005ab8 0x00b2e8 RW
                                                0 \times 1000
      offset in file -/
     virtual address -----/
bytes provided in file -----/
  segment size in mem -----/
 (memsz > filesz = zero-filled)
     mmap protection -----
 11 ...
 GNU_STACK
             0x000000 0x0000000 0x000000 0x000000 RW
                                                0x10
```

Note: the kernel always maps full pages from the file cache
 Note: first segment includes ELF header and program headers

Load ELF header and program header

- Load ELF header and program header
- ▶ If ET\_DYN ( $\rightsquigarrow$  PIE), set random base added to all addresses

- Load ELF header and program header
- ▶ If ET\_DYN (~→ PIE), set random base added to all addresses
- Look if PT\_INTERP is present
  - If present, load interpreter using same algorithm (but no nested interpreters)

- Load ELF header and program header
- ▶ If ET\_DYN (~→ PIE), set random base added to all addresses
- Look if PT\_INTERP is present

If present, load interpreter using same algorithm (but no nested interpreters)

- Iterate over PT\_LOAD and mmap segments
  - May needs zeroing of last page and mapping extra zero pages

- Load ELF header and program header
- ▶ If ET\_DYN ( $\rightsquigarrow$  PIE), set random base added to all addresses
- Look if PT\_INTERP is present
  - If present, load interpreter using same algorithm (but no nested interpreters)
- Iterate over PT\_LOAD and mmap segments
  - May needs zeroing of last page and mapping extra zero pages
- Setup initial stack frame and auxiliary vector (e.g., with phdr address)

- Load ELF header and program header
- ▶ If ET\_DYN (~→ PIE), set random base added to all addresses
- Look if PT\_INTERP is present
  - If present, load interpreter using same algorithm (but no nested interpreters)
- Iterate over PT\_LOAD and mmap segments
  - May needs zeroing of last page and mapping extra zero pages
- Setup initial stack frame and auxiliary vector (e.g., with phdr address)
- Start execution at (the interpreter's) entry

This is the kernel's job

# Section 20

# Linker Optimizations

# Eliminating Duplicate Strings/Constants

- Sections in different object may contain same data, e.g. strings
  - Critical for debug info (file names, function names, etc.)
- ► Idea: linker finds and deduplicates strings and other constant data
- Precondition: relative order of entries irrelevant

# Eliminating Duplicate Strings/Constants

- Sections in different object may contain same data, e.g. strings
  - Critical for debug info (file names, function names, etc.)
- ▶ Idea: linker finds and deduplicates strings and other constant data
- Precondition: relative order of entries irrelevant
- SHF\_MERGE fixed-size entries, size stored in header
  - Collect all entries in hash map; afterwards emit all keys

# Eliminating Duplicate Strings/Constants

- Sections in different object may contain same data, e.g. strings
  - Critical for debug info (file names, function names, etc.)
- Idea: linker finds and deduplicates strings and other constant data
- Precondition: relative order of entries irrelevant
- SHF\_MERGE fixed-size entries, size stored in header
  - Collect all entries in hash map; afterwards emit all keys
- SHF\_MERGE|SHF\_STRINGS NUL-terminated strings, entsize is char width
  - Precondition: strings must not contain NUL-byte
  - ▶ Tail merging:  $foobar \setminus 0 + bar \setminus 0 \rightsquigarrow foobar \setminus 0$
  - Sort strings from tail (e.g., radix sort), deduplicate neighbors

Problem: objects may contain unused functions

Compiler can't know whether function is used

Problem: objects may contain unused functions

- Compiler can't know whether function is used
- ► Idea: put all function into separate sections, drop unused sections
- Sections are considered as inseparable units
- ► GC roots:

Problem: objects may contain unused functions

- Compiler can't know whether function is used
- ► Idea: put all function into separate sections, drop unused sections
- Sections are considered as inseparable units
- ▶ GC roots: exported symbols, init functions, ...
- Iteratively mark all referenced sections, drop unmarked sections

#### Downside:

Problem: objects may contain unused functions

- Compiler can't know whether function is used
- Idea: put all function into separate sections, drop unused sections
- Sections are considered as inseparable units
- ▶ GC roots: exported symbols, init functions, ....
- Iteratively mark all referenced sections, drop unmarked sections
- ▶ Downside: may need longer relocations ~→ possibly less efficient code
- GCC/Clang -ffunction-sections, ld --gc-sections

# Identical Code Folding

Problem: objects may contain duplicate code

Same function compiled in many objs, e.g. template instantiation

# Identical Code Folding

Problem: objects may contain duplicate code

- Same function compiled in many objs, e.g. template instantiation
- Idea: deduplicate read-only sections (same flags, contents, relocations(!))
- Hash all sections and their relocations, remove duplicates
- Repeat until convergence
  - Only after folding foo1 and foo2, these become equivalent:

int funcA(void) { foo1(); } int funcB(void) { foo2(); }

- Caution: function pointers may be guaranteed to be different
- LLD has more aggressive deduplication

Problem: Compilers still suck

Problem: no optimizations across object files

- Inlining, constant propagation+cloning, specialized call conv., ...
- Optimization across language boundaries

Problem: no optimizations across object files

- Inlining, constant propagation+cloning, specialized call conv., ...
- Optimization across language boundaries
- Idea 1: glue all source code together, compile with -fwhole-program
   Downside: single core, problematic with same-name static functions

Problem: no optimizations across object files

- Inlining, constant propagation+cloning, specialized call conv., ...
- Optimization across language boundaries
- Idea 1: glue all source code together, compile with -fwhole-program
   Downside: single core, problematic with same-name static functions
   Idea 2: Use static binary optimization during linking (severely limited)

Problem: no optimizations across object files

- Inlining, constant propagation+cloning, specialized call conv., ...
- Optimization across language boundaries
- Idea 1: glue all source code together, compile with -fwhole-program
   Downside: single core, problematic with same-name static functions
   Idea 2: Use static binary optimization during linking (severely limited)
   Idea 3: dump IR into object, glue IR together (-flto)
   Done as very first step at link-time

LTO is widely used and highly effective

# Section 21

Static Libraries

#### Static Libraries

- Archive of relocatable object files
- Header often contains index mapping symbol to object file
- Linker takes only object files that are needed
- Code/data copied into final executable

#### Static Libraries

- Archive of relocatable object files
- Header often contains index mapping symbol to object file
- Linker takes only object files that are needed
- Code/data copied into final executable

 $+\,$  Simple and fast, no ABI problems, no extra library needed at run-time

#### Static Libraries

- Archive of relocatable object files
- Header often contains index mapping symbol to object file
- Linker takes only object files that are needed
- Code/data copied into final executable
- $+\,$  Simple and fast, no ABI problems, no extra library needed at run-time
- Larger executable files, library changes need relinking

### Section 22

# Shared Libraries

#### Shared Libraries

Problem: code duplication, large executables, recompile needed for changes
 Idea: *share* code between different executables

#### Shared Libraries

- Problem: code duplication, large executables, recompile needed for changes
- Idea: share code between different executables
- Executable references functions/objects in shared library
  - Shared libraries can refer to other shared libraries, too
  - Linker needs to retain dynamic relocations and symbols (dynamic symbol = externally visible symbol)
- Run-time loader links executable and libraries program start
  - Find and load libraries from different paths, resolve all relocations

# Shared Libraries: Changes in Compiler

# Shared Libraries: Changes in Compiler

# None :: (almost)
## Shared Libraries: Changes in Compiler

# None :: (almost)

▶ When building a shared library, code must be position-independent

Relocations to symbols in shared libraries must be retained

Store dynamic relocations and symbols in separate sections (.dynsym, .rela.dyn)

- Relocations to symbols in shared libraries must be retained
  - Store dynamic relocations and symbols in separate sections (.dynsym, .rela.dyn)
- Create table (GOT) for pointers to external function/objects
  - Allocate space where loader puts addresses, add relocations

- Relocations to symbols in shared libraries must be retained
  - Store dynamic relocations and symbols in separate sections (.dynsym, .rela.dyn)
- Create table (GOT) for pointers to external function/objects
  Allocate space where loader puts addresses, add relocations
- Create stub functions for external functions (PLT)
  - Compiler still creates near call, which gets redirected to stub
  - Stub jumps to address stored in table

- Relocations to symbols in shared libraries must be retained
  - Store dynamic relocations and symbols in separate sections (.dynsym, .rela.dyn)
- Create table (GOT) for pointers to external function/objects
  Allocate space where loader puts addresses, add relocations
- Create stub functions for external functions (PLT)
  - Compiler still creates near call, which gets redirected to stub
  - Stub jumps to address stored in table
- Emit PT\_DYNAMIC segment with info for loader
  - Point loader to needed libs, relocations, symtab, strtab, ...

# Global Offset Table (GOT) and Procedure Linkage Table (PLT)

Global Offset Table: pointer table filled by loader

- Linker emits dynamic relocations for GOT; loader fills addresses
- Often subject to RELRO: after relocations are applied, GOT becomes read-only

# Global Offset Table (GOT) and Procedure Linkage Table (PLT)

Global Offset Table: pointer table filled by loader

- Linker emits dynamic relocations for GOT; loader fills addresses
- Often subject to RELRO: after relocations are applied, GOT becomes read-only

Procedure Linkage Table: stubs that perform jump using GOT 00401030 <func@plt>: 401030: ff 25 8a 2f 00 00 jmp QWORD PTR [rip+0x2f8a] # GOT slot

# Global Offset Table (GOT) and Procedure Linkage Table (PLT)

Global Offset Table: pointer table filled by loader

- Linker emits dynamic relocations for GOT; loader fills addresses
- Often subject to RELRO: after relocations are applied, GOT becomes read-only
- Procedure Linkage Table: stubs that perform jump using GOT 00401030 <func@plt>: 401030: ff 25 8a 2f 00 00 jmp QWORD PTR [rip+0x2f8a] # GOT slot
- PLT can be disabled (-fno-plt): indirect jump is duplicated
  - Compiler emits indirect calls/jumps instead of near calls to PLT
  - Linker cannot convert into near jump if target is in same DSO

#### PT\_DYNAMIC segment

► Loader needs to know needed libraries, flags, locations of relocations, etc.

- Sections headers might be unavailable and more info is needed
- ▶ Info for loader stored in dynamic section

| Туре       | Name/Value                  |
|------------|-----------------------------|
| (NEEDED)   | Shared library: [libm.so.6] |
| (NEEDED)   | Shared library: [libc.so.6] |
| (GNU_HASH) | 0x4003c0                    |
| (STRTAB)   | 0x4004b8                    |
| (SYMTAB)   | 0x4003e0                    |
| (STRSZ)    | 259 (bytes)                 |
| (SYMENT)   | 24 (bytes)                  |
| //         |                             |
| (NULL)     | 0x0                         |

Symbol lookup using linear search + strcmp is slow

Symbol lookup using linear search + strcmp is slow

- Idea: linker creates hash table
  - Hash symbol names and store them in hash table
  - Dynamic symbols grouped by hash bucket
  - > Additional bloom filter to avoid useless walks for absent symbols

Symbol lookup using linear search + strcmp is slow

- Idea: linker creates hash table
  - Hash symbol names and store them in hash table
  - Dynamic symbols grouped by hash bucket
  - Additional bloom filter to avoid useless walks for absent symbols
- Lookup:
  - Compute hash of target symbol string
  - Check bloom filter, if absent: abort
  - Iterate through symbols in bucket, compare names (and version)

Symbol lookup using linear search + strcmp is slow

- Idea: linker creates hash table
  - Hash symbol names and store them in hash table
  - Dynamic symbols grouped by hash bucket
  - Additional bloom filter to avoid useless walks for absent symbols
- Lookup:
  - Compute hash of target symbol string
  - Check bloom filter, if absent: abort
  - Iterate through symbols in bucket, compare names (and version)

#### Documentation unfortunately sparse<sup>47</sup>

- Purpose of all these dynamic entries
- Symbols: versioning and visibility

- Purpose of all these dynamic entries
- Symbols: versioning and visibility
- Constructors/destructors: called at load/unload of DSO
- Indirect functions (ifunc)
  - Function to dynamically determine actual address of symbol
  - Used e.g. for determining memcpy variant based on CPU features

- Purpose of all these dynamic entries
- Symbols: versioning and visibility
- Constructors/destructors: called at load/unload of DSO
- Indirect functions (ifunc)
  - Function to dynamically determine actual address of symbol
  - Used e.g. for determining memcpy variant based on CPU features
- Dynamic loading of DSOs (dlopen)

#### Object Files, Linker, and Loader - Summary

- Compiler needs to know code model to emit proper asm code/relocations
- ELF format used for relocatable files, executables and shared libraries
- ELF relocatables structured in sections and have static relocations
- ELF dynamic executables grouped in segments and have dynamic relocations

Need dynamic loader to resolve dynamic relocations and shared libraries
 Linker combines relocatable files into executables or shared libraries
 Linker can perform further optimizations

#### Object Files, Linker, and Loader - Questions

- Which ELF file types exist? What is different?
- What are typical sections found in an ELF relocatable file?
- What information is contained in a symbol table?
- What information is required for a relocation?
- What are typical differences between static and dynamic relocations?
- Which steps and possible optimization does a linker perform?
- How does the OS load a binary into memory?
- What is the difference between static and shared libraries?
- ▶ How are symbols from other shared libraries resolved?

Code Generation for Data Processing Lecture 10: Unwinding and Debuginfo

Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

Machine code suffices for execution

Machine code suffices for execution

 $\rightarrow$  not true

Needs program headers and entry point

Linking with shared libraries needs dynamic symbols and interpreter

#### Machine code suffices for execution

 $\rightarrow$  not true

- Needs program headers and entry point
- Linking with shared libraries needs dynamic symbols and interpreter
- Stack unwinding needs information about the stack
  - Size of each stack frame, destructors to be called, etc.
  - ▶ Vital for C++ exceptions, even for non-C++ code
- Stack traces require stack information to find return addresses
  - ► Use cases: coredumps, debuggers, profilers
- Debugging experience enhanced by variables, files, lines, statements, etc.

# -g

# -g -fexceptions

#### -g -fexceptions -fasynchronous-unwind-tables

-g supports different formats and levels (and GNU extensions)

- Exceptions must work without debuginfo
- Unwinding through code without exception-support must work

## Stack Unwinding

Needed for exceptions (\_Unwind\_RaiseException) or forced unwinding

## Stack Unwinding

Needed for exceptions (\_Unwind\_RaiseException) or forced unwinding

- Search phase: walk through the stack, check whether to stop at each frame
  - May depend on exception type, ask personality function
  - Personality function needs extra language-specific data
  - Stop once an exception handler is found

# Stack Unwinding

Needed for exceptions (\_Unwind\_RaiseException) or forced unwinding

- Search phase: walk through the stack, check whether to stop at each frame
  - May depend on exception type, ask personality function
  - Personality function needs extra language-specific data
  - Stop once an exception handler is found
- Cleanup phase: walk again, do cleanup and stop at handler
  - > Personality function indicates whether handler needs to be called
  - Can be for exception handler or for calling destructors
  - If yes: personality function sets up registers/sp/pc for landing pad
  - Non-matching handler or destructor-only: landing pad calls \_Unwind\_Resume

# Stack Unwinding: Requirements

Given: current register values in unwind function

- Need: iterate through stack frames
  - Get address of function of the stack frame
  - Get pc and sp for this function
  - Find personality function and language-specific data
  - Maybe get some registers from the stack frame
  - Update some registers with exception data

# Stack Unwinding: Requirements

Given: current register values in unwind function

- Need: iterate through stack frames
  - Get address of function of the stack frame
  - Get pc and sp for this function
  - Find personality function and language-specific data
  - Maybe get some registers from the stack frame
  - Update some registers with exception data

Increased difficulty: stepping through signal handler

# Stack Unwinding: setjmp/longjmp

Simple idea – all functions that run code during unwinding do:

- Register their handler at function entry
- Deregister their handler at function exit
- Personality function sets jmpbuf to landing pad
- Unwinder does longjmp
## Stack Unwinding: setjmp/longjmp

Simple idea – all functions that run code during unwinding do:

- Register their handler at function entry
- Deregister their handler at function exit
- Personality function sets jmpbuf to landing pad
- Unwinder does longjmp
- + Needs no extra information
- High overhead in non-exceptional case

#### Stack Unwinding: Frame Pointer

Frame pointers allow for fast unwinding

fp points to stored caller's fp

Return address stored adjacent to frame pointer

+ Fast and simple, also without exception

```
x86_64:
 push rbp
 mov rbp, rsp
 // ...
 mov rsp, rbp
 pop rbp
 ret
aarch64:
 stp x29, x30, [sp, -32]!
 mov x29, sp
 // ...
 ldp x29, x30, [sp], 32
 ret
```

#### Stack Unwinding: Frame Pointer

Frame pointers allow for fast unwinding

- fp points to stored caller's fp
- Return address stored adjacent to frame pointer
- $+\,$  Fast and simple, also without exception
- Not all programs have frame pointers
  - Overhead of creating full stack frame
  - Causes loss of one register (esp. x86)

```
x86_64:
 push rbp
 mov rbp, rsp
 // ...
 mov rsp, rbp
 pop rbp
 ret
aarch64:
  stp x29, x30, [sp, -32]!
 mov x29, sp
 // ...
 ldp x29, x30, [sp], 32
 ret
```

#### Stack Unwinding: Frame Pointer

Frame pointers allow for fast unwinding

- fp points to stored caller's fp
- Return address stored adjacent to frame pointer
- $+\,$  Fast and simple, also without exception
- Not all programs have frame pointers
  - Overhead of creating full stack frame
  - Causes loss of one register (esp. x86)
- Still needs to find meta-information
- Need to distinguish prologue with wrong info

```
x86_64:
 push rbp
 mov rbp, rsp
 // ...
 mov rsp, rbp
 pop rbp
 ret
aarch64:
  stp x29, x30, [sp, -32]!
 mov x29, sp
 // ...
 ldp x29, x30, [sp]. 32
 ret
```

#### Stack Unwinding: Without Frame Pointer

- Given: pc and sp (bottom of stack frame/call frame)
  - ▶ In parent frames:  $retaddr 1 \sim pc$  and  $CFA \sim sp$
- Need to map pc to stack frame size
  - sp+framesize = CFA (canonical frame address sp at call)
  - Stack frame size varies throughout function, e.g. prologue

#### Stack Unwinding: Without Frame Pointer

- Given: pc and sp (bottom of stack frame/call frame)
  - ▶ In parent frames:  $retaddr 1 \sim pc$  and  $CFA \sim sp$
- Need to map pc to stack frame size
  - sp+framesize = CFA (canonical frame address sp at call)
  - Stack frame size varies throughout function, e.g. prologue
- Case 1: some register used as frame pointer CFA constant offset to fp
  - E.g., for variable stack frame size
- Case 2: no frame pointer: CFA is constant offset to sp

#### Stack Unwinding: Without Frame Pointer

- Given: pc and sp (bottom of stack frame/call frame)
  - ▶ In parent frames:  $retaddr 1 \sim pc$  and  $CFA \sim sp$
- Need to map pc to stack frame size
  - sp+framesize = CFA (canonical frame address sp at call)
  - Stack frame size varies throughout function, e.g. prologue
- Case 1: some register used as frame pointer CFA constant offset to fp
  - E.g., for variable stack frame size
- Case 2: no frame pointer: CFA is constant offset to sp
- → Unwinding *must* restore register values
  - Other reg. can act as frame pointer, register saved in other register, ....
  - Need to know where return address is stored

#### Call Frame Information

> Table mapping each instr. to info about registers and CFA

- CFA: register with signed offset (or arbitrary expression)
- Register:
  - Undefined unrecoverable (default for caller-saved reg)
  - Same unmodified (default for callee-saved reg)
  - Offset(N) stored at address CFA+N
  - Register(reg) stored in other register
  - or arbitrary expressions

|      |          |     | CFA | rip | rbx | rbp |  |
|------|----------|-----|-----|-----|-----|-----|--|
|      | foo:     |     |     |     |     |     |  |
| 0x0: | push rbx |     |     |     |     |     |  |
| 0x1: | mov ebx, | edi |     |     |     |     |  |
| 0x3: | call bar |     |     |     |     |     |  |
| 0x8: | mov eax, | ebx |     |     |     |     |  |
| Oxa: | pop rbx  |     |     |     |     |     |  |
| Oxb: | ret      |     |     |     |     |     |  |

|      |          |     | CFA      | rip | rbx | rbp |  |
|------|----------|-----|----------|-----|-----|-----|--|
|      | foo:     |     |          |     |     |     |  |
| 0x0: | push rbx |     | rsp+0x08 |     |     |     |  |
| 0x1: | mov ebx, | edi |          |     |     |     |  |
| 0x3: | call bar |     |          |     |     |     |  |
| 0x8: | mov eax, | ebx |          |     |     |     |  |
| Oxa: | pop rbx  |     |          |     |     |     |  |
| Oxb: | ret      |     |          |     |     |     |  |

|      |              | CFA      | rip        | rbx | rbp |  |
|------|--------------|----------|------------|-----|-----|--|
|      | foo:         |          |            |     |     |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] |     |     |  |
| 0x1: | mov ebx, edi |          |            |     |     |  |
| 0x3: | call bar     |          |            |     |     |  |
| 0x8: | mov eax, ebx |          |            |     |     |  |
| Oxa: | pop rbx      |          |            |     |     |  |
| Oxb: | ret          |          |            |     |     |  |

|      |              | CFA      | rip        | rbx  | rbp |  |
|------|--------------|----------|------------|------|-----|--|
|      | foo:         |          |            |      |     |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same |     |  |
| 0x1: | mov ebx, edi |          |            |      |     |  |
| 0x3: | call bar     |          |            |      |     |  |
| 0x8: | mov eax, ebx |          |            |      |     |  |
| Oxa: | pop rbx      |          |            |      |     |  |
| Oxb: | ret          |          |            |      |     |  |

|      |              | CFA      | rip        | rbx  | rbp  |
|------|--------------|----------|------------|------|------|
|      | foo:         |          |            |      |      |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same | same |
| 0x1: | mov ebx, edi |          |            |      |      |
| 0x3: | call bar     |          |            |      |      |
| 0x8: | mov eax, ebx |          |            |      |      |
| Oxa: | pop rbx      |          |            |      |      |
| 0xb: | ret          |          |            |      |      |

|      |              | CFA      | rip        | rbx  | rbp  |
|------|--------------|----------|------------|------|------|
|      | foo:         |          |            |      |      |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same | same |
| 0x1: | mov ebx, edi | rsp+0x10 |            |      |      |
| 0x3: | call bar     |          |            |      |      |
| 0x8: | mov eax, ebx |          |            |      |      |
| Oxa: | pop rbx      |          |            |      |      |
| Oxb: | ret          |          |            |      |      |

|      |              | CFA      | rip        | rbx  | rbp  |
|------|--------------|----------|------------|------|------|
|      | foo:         |          |            |      |      |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same | same |
| 0x1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] |      |      |
| 0x3: | call bar     |          |            |      |      |
| 0x8: | mov eax, ebx |          |            |      |      |
| Oxa: | pop rbx      |          |            |      |      |
| Oxb: | ret          |          |            |      |      |

|      |              | CFA      | rip        | rbx        | rbp  |  |
|------|--------------|----------|------------|------------|------|--|
|      | foo:         |          |            |            |      |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same       | same |  |
| 0x1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] | [CFA-0x10] |      |  |
| 0x3: | call bar     |          |            |            |      |  |
| 0x8: | mov eax, ebx |          |            |            |      |  |
| Oxa: | pop rbx      |          |            |            |      |  |
| Oxb: | ret          |          |            |            |      |  |

|      |              | CFA      | rip        | rbx        | rbp  |  |
|------|--------------|----------|------------|------------|------|--|
|      | foo:         |          |            |            |      |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same       | same |  |
| 0x1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x3: | call bar     |          |            |            |      |  |
| 0x8: | mov eax, ebx |          |            |            |      |  |
| 0xa: | pop rbx      |          |            |            |      |  |
| Oxb: | ret          |          |            |            |      |  |

|      |              | CFA      | rip        | rbx        | rbp  |  |
|------|--------------|----------|------------|------------|------|--|
|      | foo:         |          |            |            |      |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same       | same |  |
| 0x1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x3: | call bar     | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x8: | mov eax, ebx |          |            |            |      |  |
| Oxa: | pop rbx      |          |            |            |      |  |
| Oxb: | ret          |          |            |            |      |  |

|      |              | CFA      | rip        | rbx        | rbp  |  |
|------|--------------|----------|------------|------------|------|--|
|      | foo:         |          |            |            |      |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same       | same |  |
| 0x1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x3: | call bar     | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x8: | mov eax, ebx | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| Oxa: | pop rbx      |          |            |            |      |  |
| Oxb: | ret          |          |            |            |      |  |

|      |              | CFA      | rip        | rbx        | rbp  |  |
|------|--------------|----------|------------|------------|------|--|
|      | foo:         |          |            |            |      |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same       | same |  |
| 0x1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x3: | call bar     | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x8: | mov eax, ebx | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| Oxa: | pop rbx      | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| Oxb: | ret          |          |            |            |      |  |

|      |              | CFA      | rip        | rbx        | rbp  |  |
|------|--------------|----------|------------|------------|------|--|
|      | foo:         |          |            |            |      |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same       | same |  |
| 0x1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x3: | call bar     | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x8: | mov eax, ebx | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| Oxa: | pop rbx      | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| Oxb: | ret          | rsp+0x08 |            |            |      |  |

|      |              | CFA      | rip        | rbx        | rbp  |  |
|------|--------------|----------|------------|------------|------|--|
|      | foo:         |          |            |            |      |  |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same       | same |  |
| Ox1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x3: | call bar     | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| 0x8: | mov eax, ebx | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| Oxa: | pop rbx      | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |  |
| Oxb: | ret          | rsp+0x08 | [CFA-0x08] |            |      |  |

|      |              | CFA      | rip        | rbx        | rbp  |
|------|--------------|----------|------------|------------|------|
|      | foo:         |          |            |            |      |
| 0x0: | push rbx     | rsp+0x08 | [CFA-0x08] | same       | same |
| 0x1: | mov ebx, edi | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |
| 0x3: | call bar     | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |
| 0x8: | mov eax, ebx | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |
| Oxa: | pop rbx      | rsp+0x10 | [CFA-0x08] | [CFA-0x10] | same |
| Oxb: | ret          | rsp+0x08 | [CFA-0x08] | same       | same |

|       |            |     | CFA | rip | rbx | rbp |  |
|-------|------------|-----|-----|-----|-----|-----|--|
|       | foo:       |     |     |     |     |     |  |
| 0x0:  | push rbp   |     |     |     |     |     |  |
| 0x1:  | mov rbp, 1 | rsp |     |     |     |     |  |
| 0x4:  | shl rdi, 4 | 4   |     |     |     |     |  |
| 0x8:  | sub rsp, 1 | rdi |     |     |     |     |  |
| Oxb:  | mov rdi, 1 | rsp |     |     |     |     |  |
| Oxe:  | call bar   |     |     |     |     |     |  |
| 0x13: | leave      |     |     |     |     |     |  |
| 0x14: | ret        |     |     |     |     |     |  |

|       |              | CFA      | rip        | rbx  | rbp  |  |
|-------|--------------|----------|------------|------|------|--|
|       | foo:         |          |            |      |      |  |
| 0x0:  | push rbp     | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x1:  | mov rbp, rsp |          |            |      |      |  |
| 0x4:  | shl rdi, 4   |          |            |      |      |  |
| 0x8:  | sub rsp, rdi |          |            |      |      |  |
| Oxb:  | mov rdi, rsp |          |            |      |      |  |
| Oxe:  | call bar     |          |            |      |      |  |
| 0x13: | leave        |          |            |      |      |  |
| 0x14: | ret          |          |            |      |      |  |

|       |              | CFA      | rip        | rbx  | rbp        |  |
|-------|--------------|----------|------------|------|------------|--|
|       | foo:         |          |            |      |            |  |
| 0x0:  | push rbp     | rsp+0x08 | [CFA-0x08] | same | same       |  |
| 0x1:  | mov rbp, rsp | rsp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x4:  | shl rdi, 4   |          |            |      |            |  |
| 0x8:  | sub rsp, rdi |          |            |      |            |  |
| Oxb:  | mov rdi, rsp |          |            |      |            |  |
| Oxe:  | call bar     |          |            |      |            |  |
| 0x13: | leave        |          |            |      |            |  |
| 0x14: | ret          |          |            |      |            |  |

|       |              | CFA      | rip        | rbx  | rbp        |  |
|-------|--------------|----------|------------|------|------------|--|
|       | foo:         |          |            |      |            |  |
| 0x0:  | push rbp     | rsp+0x08 | [CFA-0x08] | same | same       |  |
| 0x1:  | mov rbp, rsp | rsp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x4:  | shl rdi, 4   | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x8:  | sub rsp, rdi | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| Oxb:  | mov rdi, rsp | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| Oxe:  | call bar     | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x13: | leave        |          |            |      |            |  |
| 0x14: | ret          |          |            |      |            |  |

|       |              | CFA      | rip        | rbx  | rbp        |  |
|-------|--------------|----------|------------|------|------------|--|
|       | foo:         |          |            |      |            |  |
| 0x0:  | push rbp     | rsp+0x08 | [CFA-0x08] | same | same       |  |
| 0x1:  | mov rbp, rsp | rsp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x4:  | shl rdi, 4   | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x8:  | sub rsp, rdi | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| Oxb:  | mov rdi, rsp | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| Oxe:  | call bar     | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x13: | leave        | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x14: | ret          |          |            |      |            |  |

|       |              | CFA      | rip        | rbx  | rbp        |  |
|-------|--------------|----------|------------|------|------------|--|
|       | foo:         |          |            |      |            |  |
| 0x0:  | push rbp     | rsp+0x08 | [CFA-0x08] | same | same       |  |
| 0x1:  | mov rbp, rsp | rsp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x4:  | shl rdi, 4   | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x8:  | sub rsp, rdi | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| Oxb:  | mov rdi, rsp | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| Oxe:  | call bar     | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x13: | leave        | rbp+0x10 | [CFA-0x08] | same | [CFA-0x10] |  |
| 0x14: | ret          | rsp+0x08 | [CFA-0x08] | same | same       |  |

|       |               | CFA | rip | rbx | rbp |  |
|-------|---------------|-----|-----|-----|-----|--|
|       | foo:          |     |     |     |     |  |
| 0x0:  | sub rsp, 8    |     |     |     |     |  |
| 0x4:  | test edi, edi |     |     |     |     |  |
| 0x6:  | js 0x12       |     |     |     |     |  |
| 0x8:  | call positive |     |     |     |     |  |
| Oxd:  | add rsp, 8    |     |     |     |     |  |
| Ox11: | ret           |     |     |     |     |  |
| 0x12: | call negative |     |     |     |     |  |
| 0x17: | add rsp, 8    |     |     |     |     |  |
| 0x1a: | ret           |     |     |     |     |  |

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi |          |            |      |      |  |
| 0x6:  | js 0x12       |          |            |      |      |  |
| 0x8:  | call positive |          |            |      |      |  |
| Oxd:  | add rsp, 8    |          |            |      |      |  |
| 0x11: | ret           |          |            |      |      |  |
| 0x12: | call negative |          |            |      |      |  |
| 0x17: | add rsp, 8    |          |            |      |      |  |
| Ox1a: | ret           |          |            |      |      |  |

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x6:  | js Ox12       |          |            |      |      |  |
| 0x8:  | call positive |          |            |      |      |  |
| Oxd:  | add rsp, 8    |          |            |      |      |  |
| Ox11: | ret           |          |            |      |      |  |
| 0x12: | call negative |          |            |      |      |  |
| 0x17: | add rsp, 8    |          |            |      |      |  |
| 0x1a: | ret           |          |            |      |      |  |

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x6:  | js Ox12       | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x8:  | call positive |          |            |      |      |  |
| Oxd:  | add rsp, 8    |          |            |      |      |  |
| Ox11: | ret           |          |            |      |      |  |
| 0x12: | call negative |          |            |      |      |  |
| 0x17: | add rsp, 8    |          |            |      |      |  |
| 0x1a: | ret           |          |            |      |      |  |

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x6:  | js Ox12       | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x8:  | call positive | rsp+0x10 | [CFA-0x08] | same | same |  |
| Oxd:  | add rsp, 8    |          |            |      |      |  |
| 0x11: | ret           |          |            |      |      |  |
| 0x12: | call negative |          |            |      |      |  |
| 0x17: | add rsp, 8    |          |            |      |      |  |
| Ox1a: | ret           |          |            |      |      |  |

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x6:  | js Ox12       | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x8:  | call positive | rsp+0x10 | [CFA-0x08] | same | same |  |
| Oxd:  | add rsp, 8    | rsp+0x10 | [CFA-0x08] | same | same |  |
| Ox11: | ret           |          |            |      |      |  |
| 0x12: | call negative |          |            |      |      |  |
| 0x17: | add rsp, 8    |          |            |      |      |  |
| Ox1a: | ret           |          |            |      |      |  |

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x6:  | js Ox12       | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x8:  | call positive | rsp+0x10 | [CFA-0x08] | same | same |  |
| Oxd:  | add rsp, 8    | rsp+0x10 | [CFA-0x08] | same | same |  |
| Ox11: | ret           | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x12: | call negative |          |            |      |      |  |
| 0x17: | add rsp, 8    |          |            |      |      |  |
| 0x1a: | ret           |          |            |      |      |  |
# Call Frame Information – Example 3

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x6:  | js Ox12       | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x8:  | call positive | rsp+0x10 | [CFA-0x08] | same | same |  |
| Oxd:  | add rsp, 8    | rsp+0x10 | [CFA-0x08] | same | same |  |
| Ox11: | ret           | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x12: | call negative | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x17: | add rsp, 8    |          |            |      |      |  |
| 0x1a: | ret           |          |            |      |      |  |

# Call Frame Information – Example 3

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x6:  | js Ox12       | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x8:  | call positive | rsp+0x10 | [CFA-0x08] | same | same |  |
| Oxd:  | add rsp, 8    | rsp+0x10 | [CFA-0x08] | same | same |  |
| Ox11: | ret           | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x12: | call negative | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x17: | add rsp, 8    | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x1a: | ret           |          |            |      |      |  |

# Call Frame Information – Example 3

|       |               | CFA      | rip        | rbx  | rbp  |  |
|-------|---------------|----------|------------|------|------|--|
|       | foo:          |          |            |      |      |  |
| 0x0:  | sub rsp, 8    | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x4:  | test edi, edi | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x6:  | js Ox12       | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x8:  | call positive | rsp+0x10 | [CFA-0x08] | same | same |  |
| Oxd:  | add rsp, 8    | rsp+0x10 | [CFA-0x08] | same | same |  |
| Ox11: | ret           | rsp+0x08 | [CFA-0x08] | same | same |  |
| 0x12: | call negative | rsp+0x10 | [CFA-0x08] | same | same |  |
| 0x17: | add rsp, 8    | rsp+0x10 | [CFA-0x08] | same | same |  |
| Ox1a: | ret           | rsp+0x08 | [CFA-0x08] | same | same |  |

### Call Frame Information: Encoding

- Expanded table can be huge
- Contents change rather seldomly
  - Mainly in prologue/epilogue, but mostly constant in-between

### Call Frame Information: Encoding

- Expanded table can be huge
- Contents change rather seldomly
  - Mainly in prologue/epilogue, but mostly constant in-between
- Idea: encode table as bytecode
- Bytecode has instructions to create a now row
  - Advance machine code location
- Bytecode has instructions to define CFA value
- Bytecode has instructions to define register location
- Bytecode has instructions to remember and restore state

|    |              | CFA | rip | rbx |  |
|----|--------------|-----|-----|-----|--|
|    | foo:         |     |     |     |  |
| 0: | push rbx     |     |     |     |  |
| 1: | mov ebx, edi |     |     |     |  |
| 3: | call bar     |     |     |     |  |
| 8: | mov eax, ebx |     |     |     |  |
| a: | pop rbx      |     |     |     |  |
| b: | ret          |     |     |     |  |

DW\_CFA\_def\_cfa: RSP +8 DW\_CFA\_offset: RIP -8 DW\_CFA\_advance\_loc: 1 DW\_CFA\_def\_cfa\_offset: +16 DW\_CFA\_offset: RBX -16 DW\_CFA\_advance\_loc: 10 DW\_CFA\_def\_cfa\_offset: +8

|    |              | CFA   | rip | rbx | => DW_CFA_def_cfa: RSP +8  |
|----|--------------|-------|-----|-----|----------------------------|
|    | foo:         |       |     |     | DW_CFA_offset: RIP -8      |
| 0: | push rbx     | rsp+8 |     |     | DW_CFA_advance_loc: 1      |
| 1: | mov ebx, edi |       |     |     | DW_CFA_def_cfa_offset: +16 |
| 3: | call bar     |       |     |     | DW_CFA_offset: RBX -16     |
| 8: | mov eax, ebx |       |     |     | DW_CFA_advance_loc: 10     |
| a: | pop rbx      |       |     |     | DW_CFA_def_cfa_offset: +8  |
| b: | ret          |       |     |     |                            |

|    |              | CFA   | rip     | rbx | DW_CFA_def_cfa: RSP +8     |
|----|--------------|-------|---------|-----|----------------------------|
|    | foo:         |       |         |     | => DW_CFA_offset: RIP -8   |
| 0: | push rbx     | rsp+8 | [CFA-8] |     | DW_CFA_advance_loc: 1      |
| 1: | mov ebx, edi |       |         |     | DW_CFA_def_cfa_offset: +16 |
| 3: | call bar     |       |         |     | DW_CFA_offset: RBX -16     |
| 8: | mov eax, ebx |       |         |     | DW_CFA_advance_loc: 10     |
| a: | pop rbx      |       |         |     | DW_CFA_def_cfa_offset: +8  |
| b: | ret          |       |         |     |                            |

|    |              | CFA   | rip     | rbx |
|----|--------------|-------|---------|-----|
|    | foo:         |       |         |     |
| 0: | push rbx     | rsp+8 | [CFA-8] |     |
| 1: | mov ebx, edi | rsp+8 | [CFA-8] |     |
| 3: | call bar     |       |         |     |
| 8: | mov eax, ebx |       |         |     |
| a: | pop rbx      |       |         |     |
| b: | ret          |       |         |     |

|    | DW_CFA_def_cfa: RSP +8 |     |
|----|------------------------|-----|
|    | DW_CFA_offset: RIP -8  |     |
| => | DW_CFA_advance_loc: 1  |     |
|    | DW_CFA_def_cfa_offset: | +16 |
|    | DW_CFA_offset: RBX -16 |     |
|    | DW_CFA_advance_loc: 10 |     |
|    | DW_CFA_def_cfa_offset: | +8  |

|    |              | CFA    | rip     | rbx |   |
|----|--------------|--------|---------|-----|---|
|    | foo:         |        |         |     |   |
| 0: | push rbx     | rsp+8  | [CFA-8] |     |   |
| 1: | mov ebx, edi | rsp+16 | [CFA-8] |     | = |
| 3: | call bar     |        |         |     |   |
| 8: | mov eax, ebx |        |         |     |   |
| a: | pop rbx      |        |         |     |   |
| b: | ret          |        |         |     |   |

|    | DW_CFA_def_cfa: RSP +8 |     |
|----|------------------------|-----|
|    | DW_CFA_offset: RIP -8  |     |
|    | DW_CFA_advance_loc: 1  |     |
| => | DW_CFA_def_cfa_offset: | +16 |
|    | DW_CFA_offset: RBX -16 |     |
|    | DW_CFA_advance_loc: 10 |     |
|    | DW_CFA_def_cfa_offset: | +8  |

|    |              | CFA    | rip     | rbx      |
|----|--------------|--------|---------|----------|
|    | foo:         |        |         |          |
| 0: | push rbx     | rsp+8  | [CFA-8] |          |
| 1: | mov ebx, edi | rsp+16 | [CFA-8] | [CFA-16] |
| 3: | call bar     |        |         |          |
| 8: | mov eax, ebx |        |         |          |
| a: | pop rbx      |        |         |          |
| b: | ret          |        |         |          |

|    | DW_CFA_def_cfa: RSP +8 |     |
|----|------------------------|-----|
|    | DW_CFA_offset: RIP -8  |     |
|    | DW_CFA_advance_loc: 1  |     |
|    | DW_CFA_def_cfa_offset: | +16 |
| => | DW_CFA_offset: RBX -16 |     |
|    | DW_CFA_advance_loc: 10 |     |
|    | DW_CFA_def_cfa_offset: | +8  |

|    |              | CFA    | rip     | rbx      |
|----|--------------|--------|---------|----------|
|    | foo:         |        |         |          |
| 0: | push rbx     | rsp+8  | [CFA-8] |          |
| 1: | mov ebx, edi | rsp+16 | [CFA-8] | [CFA-16] |
| 3: | call bar     | rsp+16 | [CFA-8] | [CFA-16] |
| 8: | mov eax, ebx | rsp+16 | [CFA-8] | [CFA-16] |
| a: | pop rbx      | rsp+16 | [CFA-8] | [CFA-16] |
| b: | ret          | rsp+16 | [CFA-8] | [CFA-16] |

|    | DW_CFA_def_cfa: RSP +8 |     |
|----|------------------------|-----|
|    | DW_CFA_offset: RIP -8  |     |
|    | DW_CFA_advance_loc: 1  |     |
|    | DW_CFA_def_cfa_offset: | +16 |
|    | DW_CFA_offset: RBX -16 |     |
| => | DW_CFA_advance_loc: 10 |     |
|    | DW_CFA_def_cfa_offset: | +8  |

|    |              | CFA    | rip     | rbx      | DW_CFA_def_cfa: RSP +8       |
|----|--------------|--------|---------|----------|------------------------------|
|    | foo:         |        |         |          | DW_CFA_offset: RIP -8        |
| 0: | push rbx     | rsp+8  | [CFA-8] |          | DW_CFA_advance_loc: 1        |
| 1: | mov ebx, edi | rsp+16 | [CFA-8] | [CFA-16] | DW_CFA_def_cfa_offset: +16   |
| 3: | call bar     | rsp+16 | [CFA-8] | [CFA-16] | DW_CFA_offset: RBX -16       |
| 8: | mov eax, ebx | rsp+16 | [CFA-8] | [CFA-16] | DW_CFA_advance_loc: 10       |
| a: | pop rbx      | rsp+16 | [CFA-8] | [CFA-16] | => DW_CFA_def_cfa_offset: +8 |
| b: | ret          | rsp+8  | [CFA-8] | [CFA-16] |                              |

|     |              | CFA | rip | rbp | DW_CFA_def_cfa: RSP +8       |
|-----|--------------|-----|-----|-----|------------------------------|
|     | foo:         |     |     |     | DW_CFA_offset: RIP -8        |
| 0:  | push rbp     |     |     |     | DW_CFA_advance_loc: 1        |
| 1:  | mov rbp, rsp |     |     |     | DW_CFA_def_cfa_offset: +16   |
| 4:  | shl rdi, 4   |     |     |     | DW_CFA_offset: RBP -16       |
| 8:  | sub rsp, rdi |     |     |     | DW_CFA_advance_loc: 3        |
| b:  | mov rdi, rsp |     |     |     | DW_CFA_def_cfa_register: RBP |
| e:  | call bar     |     |     |     | DW_CFA_advance_loc: 16       |
| 13: | leave        |     |     |     | DW_CFA_def_cfa: RSP +8       |
| 14: | ret          |     |     |     |                              |

|     |              | CFA   | rip | rbp | => DW_CFA_def_cfa: RSP +8    |
|-----|--------------|-------|-----|-----|------------------------------|
|     | foo:         |       |     |     | DW_CFA_offset: RIP -8        |
| 0:  | push rbp     | rsp+8 |     |     | DW_CFA_advance_loc: 1        |
| 1:  | mov rbp, rsp | -     |     |     | DW_CFA_def_cfa_offset: +16   |
| 4:  | shl rdi, 4   |       |     |     | DW_CFA_offset: RBP -16       |
| 8:  | sub rsp, rdi |       |     |     | DW_CFA_advance_loc: 3        |
| b:  | mov rdi, rsp |       |     |     | DW_CFA_def_cfa_register: RBP |
| e:  | call bar     |       |     |     | DW_CFA_advance_loc: 16       |
| 13: | leave        |       |     |     | DW_CFA_def_cfa: RSP +8       |
| 14: | ret          |       |     |     |                              |

|     |              | CFA   | rip     | rbp | _  | DW_CFA_def_cfa: RSP +8       |
|-----|--------------|-------|---------|-----|----|------------------------------|
|     | foo:         |       |         |     | => | DW_CFA_offset: RIP -8        |
| 0:  | push rbp     | rsp+8 | [CFA-8] |     |    | DW_CFA_advance_loc: 1        |
| 1:  | mov rbp, rsp | _     |         |     |    | DW_CFA_def_cfa_offset: +16   |
| 4:  | shl rdi, 4   |       |         |     |    | DW_CFA_offset: RBP -16       |
| 8:  | sub rsp, rdi |       |         |     |    | DW_CFA_advance_loc: 3        |
| b:  | mov rdi, rsp |       |         |     |    | DW_CFA_def_cfa_register: RBP |
| e:  | call bar     |       |         |     |    | DW_CFA_advance_loc: 16       |
| 13: | leave        |       |         |     |    | DW_CFA_def_cfa: RSP +8       |
| 14: | ret          |       |         |     |    |                              |

|     |              | CFA   | rip     | rbp | _  | DW_CFA_def_cfa: RSP +8       |
|-----|--------------|-------|---------|-----|----|------------------------------|
|     | foo:         |       |         |     |    | DW_CFA_offset: RIP -8        |
| 0:  | push rbp     | rsp+8 | [CFA-8] |     | => | DW_CFA_advance_loc: 1        |
| 1:  | mov rbp, rsp | rsp+8 | [CFA-8] |     |    | DW_CFA_def_cfa_offset: +16   |
| 4:  | shl rdi, 4   | -     |         |     |    | DW_CFA_offset: RBP -16       |
| 8:  | sub rsp, rdi |       |         |     |    | DW_CFA_advance_loc: 3        |
| b:  | mov rdi, rsp |       |         |     |    | DW_CFA_def_cfa_register: RBP |
| e:  | call bar     |       |         |     |    | DW_CFA_advance_loc: 16       |
| 13: | leave        |       |         |     |    | DW_CFA_def_cfa: RSP +8       |
| 14: | ret          |       |         |     |    |                              |

|                            |                                                                                | CFA             | rip                | rbp | DW_CFA_def_cfa: RSP +8                                                                                                                                                                                           |
|----------------------------|--------------------------------------------------------------------------------|-----------------|--------------------|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0:<br>1:<br>4:<br>8:<br>b: | foo:<br>push rbp<br>mov rbp, rsp<br>shl rdi, 4<br>sub rsp, rdi<br>mov rdi, rsp | rsp+8<br>rsp+16 | [CFA-8]<br>[CFA-8] | 200 | <pre>DW_CFA_def_cfa: RSP +8<br/>DW_CFA_offset: RIP -8<br/>DW_CFA_advance_loc: 1<br/>=&gt; DW_CFA_def_cfa_offset: +16<br/>DW_CFA_offset: RBP -16<br/>DW_CFA_advance_loc: 3<br/>DW_CFA_def_cfa_register: RBP</pre> |
| e:                         | call bar                                                                       |                 |                    |     | DW_CFA_advance_loc: 16                                                                                                                                                                                           |
| 13:                        | leave                                                                          |                 |                    |     | DW_CFA_dei_cia: RSP +8                                                                                                                                                                                           |
| 14:                        | ret                                                                            |                 |                    |     |                                                                                                                                                                                                                  |

|                                  |                                                                                            | CFA             | rip                | rbp      | DW_CFA_def_cfa: RSP +8                                                                                                                                                                                                                        |
|----------------------------------|--------------------------------------------------------------------------------------------|-----------------|--------------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0:<br>1:<br>4:<br>8:<br>b:<br>e: | foo:<br>push rbp<br>mov rbp, rsp<br>shl rdi, 4<br>sub rsp, rdi<br>mov rdi, rsp<br>call bar | rsp+8<br>rsp+16 | [CFA-8]<br>[CFA-8] | [CFA-16] | <pre>_ DW_CFA_def_cfa: RSP +8<br/>DW_CFA_offset: RIP -8<br/>DW_CFA_advance_loc: 1<br/>DW_CFA_def_cfa_offset: +16<br/>=&gt; DW_CFA_offset: RBP -16<br/>DW_CFA_advance_loc: 3<br/>DW_CFA_def_cfa_register: RBP<br/>DW_CFA_advance_loc: 16</pre> |
| 13:<br>14:                       | leave<br>ret                                                                               |                 |                    |          | DW_CFA_def_cfa: RSP +8                                                                                                                                                                                                                        |

|                                         |                                                                                                     | CFA                       | rip                           | rbp                  | _ DW_CFA_def_cfa: RSP +8                                                                                                                                                                                                                                               |
|-----------------------------------------|-----------------------------------------------------------------------------------------------------|---------------------------|-------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0:<br>1:<br>4:<br>8:<br>b:<br>e:<br>13: | foo:<br>push rbp<br>mov rbp, rsp<br>shl rdi, 4<br>sub rsp, rdi<br>mov rdi, rsp<br>call bar<br>leave | rsp+8<br>rsp+16<br>rsp+16 | [CFA-8]<br>[CFA-8]<br>[CFA-8] | [CFA-16]<br>[CFA-16] | <pre>DW_CFA_def_cfa: RSP +8<br/>DW_CFA_offset: RIP -8<br/>DW_CFA_advance_loc: 1<br/>DW_CFA_def_cfa_offset: +16<br/>DW_CFA_offset: RBP -16<br/>=&gt; DW_CFA_advance_loc: 3<br/>DW_CFA_def_cfa_register: RBP<br/>DW_CFA_advance_loc: 16<br/>DW_CFA_def_cfa: RSP +8</pre> |
| 14:                                     | ret                                                                                                 |                           |                               |                      |                                                                                                                                                                                                                                                                        |

|                                         |                                                                                                     | CFA                       | rip                           | rbp                  | DW_CFA_def_cfa: RSP +8                                                                                                                                                                                                                                                 |
|-----------------------------------------|-----------------------------------------------------------------------------------------------------|---------------------------|-------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0:<br>1:<br>4:<br>8:<br>b:<br>e:<br>13: | foo:<br>push rbp<br>mov rbp, rsp<br>shl rdi, 4<br>sub rsp, rdi<br>mov rdi, rsp<br>call bar<br>leave | rsp+8<br>rsp+16<br>rbp+16 | [CFA-8]<br>[CFA-8]<br>[CFA-8] | [CFA-16]<br>[CFA-16] | <pre>DW_CFA_def_cfa: RSP +8<br/>DW_CFA_offset: RIP -8<br/>DW_CFA_advance_loc: 1<br/>DW_CFA_def_cfa_offset: +16<br/>DW_CFA_offset: RBP -16<br/>DW_CFA_advance_loc: 3<br/>=&gt; DW_CFA_def_cfa_register: RBP<br/>DW_CFA_advance_loc: 16<br/>DW_CFA_def_cfa: RSP +8</pre> |
| 10.                                     | ret                                                                                                 |                           |                               |                      |                                                                                                                                                                                                                                                                        |

|     |              | CFA    | rip     | rbp      | DW_CFA_def_cfa: RSP +8       |
|-----|--------------|--------|---------|----------|------------------------------|
|     | foo:         |        |         |          | DW_CFA_offset: RIP -8        |
| 0:  | push rbp     | rsp+8  | [CFA-8] |          | DW_CFA_advance_loc: 1        |
| 1:  | mov rbp, rsp | rsp+16 | [CFA-8] | [CFA-16] | DW_CFA_def_cfa_offset: +16   |
| 4:  | shl rdi, 4   | rbp+16 | [CFA-8] | [CFA-16] | DW_CFA_offset: RBP -16       |
| 8:  | sub rsp, rdi | rbp+16 | [CFA-8] | [CFA-16] | DW_CFA_advance_loc: 3        |
| b:  | mov rdi, rsp | rbp+16 | [CFA-8] | [CFA-16] | DW_CFA_def_cfa_register: RBP |
| e:  | call bar     | rbp+16 | [CFA-8] | [CFA-16] | => DW_CFA_advance_loc: 16    |
| 13: | leave        | rbp+16 | [CFA-8] | [CFA-16] | DW_CFA_def_cfa: RSP +8       |
| 14: | ret          | rbp+16 | [CFA-8] | [CFA-16] |                              |

|     |              | CFA    | rip     | rbp      | DW_CFA_def_cfa: RSP +8       |
|-----|--------------|--------|---------|----------|------------------------------|
|     | foo:         |        |         |          | DW_CFA_offset: RIP -8        |
| 0:  | push rbp     | rsp+8  | [CFA-8] |          | DW_CFA_advance_loc: 1        |
| 1:  | mov rbp, rsp | rsp+16 | [CFA-8] | [CFA-16] | DW_CFA_def_cfa_offset: +16   |
| 4:  | shl rdi, 4   | rbp+16 | [CFA-8] | [CFA-16] | DW_CFA_offset: RBP -16       |
| 8:  | sub rsp, rdi | rbp+16 | [CFA-8] | [CFA-16] | DW_CFA_advance_loc: 3        |
| b:  | mov rdi, rsp | rbp+16 | [CFA-8] | [CFA-16] | DW_CFA_def_cfa_register: RBP |
| e:  | call bar     | rbp+16 | [CFA-8] | [CFA-16] | DW_CFA_advance_loc: 16       |
| 13: | leave        | rbp+16 | [CFA-8] | [CFA-16] | => DW_CFA_def_cfa: RSP +8    |
| 14: | ret          | rsp+8  | [CFA-8] | [CFA-16] |                              |

|     |               | CFA | rip |
|-----|---------------|-----|-----|
|     | foo:          |     |     |
| 0:  | sub rsp, 8    |     |     |
| 4:  | test edi, edi |     |     |
| 6:  | js Ox12       |     |     |
| 8:  | call positive |     |     |
| d:  | add rsp, 8    |     |     |
| 11: | ret           |     |     |
| 12: | call negative |     |     |
| 17: | add rsp, 8    |     |     |
| 1a: | ret           |     |     |

DW\_CFA\_def\_cfa: RSP +8 DW CFA offset: RIP -8 DW CFA advance loc: 4 DW CFA def cfa offset: +16 DW CFA advance loc: 13 DW\_CFA\_remember\_state: DW CFA def cfa offset: +8 DW\_CFA\_advance\_loc: 1 DW\_CFA\_restore\_state: DW\_CFA\_advance\_loc: 9 DW\_CFA\_def\_cfa\_offset: +8

|     |               | CFA   | rip |
|-----|---------------|-------|-----|
|     | foo:          |       |     |
| 0:  | sub rsp, 8    | rsp+8 |     |
| 4:  | test edi, edi |       |     |
| 6:  | js Ox12       |       |     |
| 8:  | call positive |       |     |
| d:  | add rsp, 8    |       |     |
| 11: | ret           |       |     |
| 12: | call negative |       |     |
| 17: | add rsp, 8    |       |     |
| 1a: | ret           |       |     |

| => | DW_CFA_def_cfa: RSP +8 |     |
|----|------------------------|-----|
|    | DW_CFA_offset: RIP -8  |     |
|    | DW_CFA_advance_loc: 4  |     |
|    | DW_CFA_def_cfa_offset: | +16 |
|    | DW_CFA_advance_loc: 13 |     |
|    | DW_CFA_remember_state: |     |
|    | DW_CFA_def_cfa_offset: | +8  |
|    | DW_CFA_advance_loc: 1  |     |
|    | DW_CFA_restore_state:  |     |
|    | DW_CFA_advance_loc: 9  |     |
|    | DW_CFA_def_cfa_offset: | +8  |

|     |               | CFA   | rip     |
|-----|---------------|-------|---------|
|     | foo:          |       |         |
| 0:  | sub rsp, 8    | rsp+8 | [CFA-8] |
| 4:  | test edi, edi |       |         |
| 6:  | js Ox12       |       |         |
| 8:  | call positive |       |         |
| d:  | add rsp, 8    |       |         |
| 11: | ret           |       |         |
| 12: | call negative |       |         |
| 17: | add rsp, 8    |       |         |
| 1a: | ret           |       |         |

DW\_CFA\_def\_cfa: RSP +8 => DW CFA offset: RIP -8 DW CFA advance loc: 4 DW CFA def cfa offset: +16 DW CFA advance loc: 13 DW\_CFA\_remember\_state: DW CFA def cfa offset: +8 DW CFA advance loc: 1 DW\_CFA\_restore\_state: DW\_CFA\_advance\_loc: 9 DW\_CFA\_def\_cfa\_offset: +8

|     |               | CFA   | rip     |
|-----|---------------|-------|---------|
|     | foo:          |       |         |
| 0:  | sub rsp, 8    | rsp+8 | [CFA-8] |
| 4:  | test edi, edi | rsp+8 | [CFA-8] |
| 6:  | js 0x12       |       |         |
| 8:  | call positive |       |         |
| d:  | add rsp, 8    |       |         |
| 11: | ret           |       |         |
| 12: | call negative |       |         |
| 17: | add rsp, 8    |       |         |
| 1a: | ret           |       |         |

DW\_CFA\_def\_cfa: RSP +8 DW CFA offset: RIP -8  $\Rightarrow$  DW CFA advance loc: 4 DW CFA def cfa offset: +16 DW CFA advance loc: 13 DW\_CFA\_remember\_state: DW CFA def cfa offset: +8 DW\_CFA\_advance\_loc: 1 DW\_CFA\_restore\_state: DW\_CFA\_advance\_loc: 9 DW\_CFA\_def\_cfa\_offset: +8

|     |               | CFA    | rip     |
|-----|---------------|--------|---------|
|     | foo:          |        |         |
| 0:  | sub rsp, 8    | rsp+8  | [CFA-8] |
| 4:  | test edi, edi | rsp+16 | [CFA-8] |
| 6:  | js 0x12       |        |         |
| 8:  | call positive |        |         |
| d:  | add rsp, 8    |        |         |
| 11: | ret           |        |         |
| 12: | call negative |        |         |
| 17: | add rsp, 8    |        |         |
| 1a: | ret           |        |         |

|  | DW_ | _CFA_ | _def_ | _cfa: | RSP | +8 |
|--|-----|-------|-------|-------|-----|----|
|--|-----|-------|-------|-------|-----|----|

DW\_CFA\_offset: RIP -8

DW\_CFA\_advance\_loc: 4

=> DW\_CFA\_def\_cfa\_offset: +16 DW\_CFA\_advance\_loc: 13 DW\_CFA\_remember\_state: DW\_CFA\_def\_cfa\_offset: +8 DW\_CFA\_advance\_loc: 1 DW\_CFA\_restore\_state: DW\_CFA\_advance\_loc: 9 DW\_CFA\_def\_cfa\_offset: +8

|     |               | CFA    | rip     |
|-----|---------------|--------|---------|
|     | foo:          |        |         |
| 0:  | sub rsp, 8    | rsp+8  | [CFA-8] |
| 4:  | test edi, edi | rsp+16 | [CFA-8] |
| 6:  | js 0x12       | rsp+16 | [CFA-8] |
| 8:  | call positive | rsp+16 | [CFA-8] |
| d:  | add rsp, 8    | rsp+16 | [CFA-8] |
| 11: | ret           | rsp+16 | [CFA-8] |
| 12: | call negative |        |         |
| 17: | add rsp, 8    |        |         |
| 1a: | ret           |        |         |

DW\_CFA\_def\_cfa: RSP +8

DW\_CFA\_offset: RIP -8

DW\_CFA\_advance\_loc: 4

- DW\_CFA\_def\_cfa\_offset: +16
- => DW\_CFA\_advance\_loc: 13 DW\_CFA\_remember\_state:
  - DW\_CFA\_def\_cfa\_offset: +8
  - DW\_CFA\_advance\_loc: 1
  - DW\_CFA\_restore\_state:
  - DW\_CFA\_advance\_loc: 9
  - DW\_CFA\_def\_cfa\_offset: +8

|     |               | CFA    | rip     |
|-----|---------------|--------|---------|
|     | foo:          |        |         |
| 0:  | sub rsp, 8    | rsp+8  | [CFA-8] |
| 4:  | test edi, edi | rsp+16 | [CFA-8] |
| 6:  | js Ox12       | rsp+16 | [CFA-8] |
| 8:  | call positive | rsp+16 | [CFA-8] |
| d:  | add rsp, 8    | rsp+16 | [CFA-8] |
| 11: | ret           | rsp+16 | [CFA-8] |
| 12: | call negative |        |         |
| 17: | add rsp, 8    |        |         |
| 1a: | ret           |        |         |

DW\_CFA\_def\_cfa: RSP +8 DW\_CFA\_offset: RIP -8

DW\_CFA\_advance\_loc: 4

- DW\_CFA\_def\_cfa\_offset: +16 DW\_CFA\_advance\_loc: 13
- => DW\_CFA\_remember\_state: DW\_CFA\_def\_cfa\_offset: +8 DW\_CFA\_advance\_loc: 1 DW\_CFA\_restore\_state: DW\_CFA\_advance\_loc: 9 DW\_CFA\_def\_cfa\_offset: +8

Remember stack: {CFA=rsp+16; rip=[CFA-8]}

|     |               | CFA    | rip     |
|-----|---------------|--------|---------|
|     | foo:          |        |         |
| 0:  | sub rsp, 8    | rsp+8  | [CFA-8] |
| 4:  | test edi, edi | rsp+16 | [CFA-8] |
| 6:  | js Ox12       | rsp+16 | [CFA-8] |
| 8:  | call positive | rsp+16 | [CFA-8] |
| d:  | add rsp, 8    | rsp+16 | [CFA-8] |
| 11: | ret           | rsp+8  | [CFA-8] |
| 12: | call negative |        |         |
| 17: | add rsp, 8    |        |         |
| 1a: | ret           |        |         |

DW\_CFA\_def\_cfa: RSP +8 DW\_CFA\_offset: RIP -8 DW\_CFA\_advance\_loc: 4 DW\_CFA\_def\_cfa\_offset: +16 DW\_CFA\_advance\_loc: 13 DW\_CFA\_remember\_state: => DW\_CFA\_def\_cfa\_offset: +8 DW\_CFA\_advance\_loc: 1 DW\_CFA\_restore\_state:

- DW\_CFA advance loc: 9
- DW\_CFA\_def\_cfa\_offset: +8

Remember stack: {CFA=rsp+16; rip=[CFA-8]}

|     |               | CFA    | rip     |
|-----|---------------|--------|---------|
|     | foo:          |        |         |
| 0:  | sub rsp, 8    | rsp+8  | [CFA-8] |
| 4:  | test edi, edi | rsp+16 | [CFA-8] |
| 6:  | js 0x12       | rsp+16 | [CFA-8] |
| 8:  | call positive | rsp+16 | [CFA-8] |
| d:  | add rsp, 8    | rsp+16 | [CFA-8] |
| 11: | ret           | rsp+8  | [CFA-8] |
| 12: | call negative | rsp+8  | [CFA-8] |
| 17: | add rsp, 8    |        |         |
| 1a: | ret           |        |         |

DW\_CFA\_def\_cfa: RSP +8 DW\_CFA\_offset: RIP -8 DW\_CFA\_advance\_loc: 4 DW\_CFA\_def\_cfa\_offset: +16 DW\_CFA\_advance\_loc: 13 DW\_CFA\_def\_cfa\_offset: +8 => DW\_CFA\_def\_cfa\_offset: +8 => DW\_CFA\_advance\_loc: 1 DW\_CFA\_restore\_state: DW\_CFA\_advance\_loc: 9 DW\_CFA\_def\_cfa\_offset: +8

Remember stack: {CFA=rsp+16; rip=[CFA-8]}

|     |               | CFA    | rip     |
|-----|---------------|--------|---------|
|     | foo:          |        |         |
| 0:  | sub rsp, 8    | rsp+8  | [CFA-8] |
| 4:  | test edi, edi | rsp+16 | [CFA-8] |
| 6:  | js 0x12       | rsp+16 | [CFA-8] |
| 8:  | call positive | rsp+16 | [CFA-8] |
| d:  | add rsp, 8    | rsp+16 | [CFA-8] |
| 11: | ret           | rsp+8  | [CFA-8] |
| 12: | call negative | rsp+16 | [CFA-8] |
| 17: | add rsp, 8    |        |         |
| 1a: | ret           |        |         |

DW\_CFA\_def\_cfa: RSP +8 DW CFA offset: RIP -8 DW CFA advance loc: 4 DW CFA def cfa offset: +16 DW CFA advance loc: 13 DW CFA remember state: DW CFA def cfa offset: +8 DW\_CFA\_advance\_loc: 1 => DW\_CFA\_restore\_state: DW\_CFA\_advance\_loc: 9 DW\_CFA\_def\_cfa\_offset: +8

Remember stack: {}

|      |               | CFA    | rip     |
|------|---------------|--------|---------|
| foo: |               |        |         |
| 0:   | sub rsp, 8    | rsp+8  | [CFA-8] |
| 4:   | test edi, edi | rsp+16 | [CFA-8] |
| 6:   | js 0x12       | rsp+16 | [CFA-8] |
| 8:   | call positive | rsp+16 | [CFA-8] |
| d:   | add rsp, 8    | rsp+16 | [CFA-8] |
| 11:  | ret           | rsp+8  | [CFA-8] |
| 12:  | call negative | rsp+16 | [CFA-8] |
| 17:  | add rsp, 8    | rsp+16 | [CFA-8] |
| 1a:  | ret           | rsp+16 | [CFA-8] |

DW\_CFA\_def\_cfa: RSP +8 DW CFA offset: RIP -8 DW CFA advance loc: 4 DW CFA def cfa offset: +16 DW CFA advance loc: 13 DW CFA remember state: DW CFA def cfa offset: +8 DW CFA advance loc: 1 DW\_CFA\_restore\_state: => DW\_CFA\_advance\_loc: 9 DW\_CFA\_def\_cfa\_offset: +8

Remember stack: {}

|     |               | CFA    | rip     |
|-----|---------------|--------|---------|
|     | foo:          |        |         |
| 0:  | sub rsp, 8    | rsp+8  | [CFA-8] |
| 4:  | test edi, edi | rsp+16 | [CFA-8] |
| 6:  | js 0x12       | rsp+16 | [CFA-8] |
| 8:  | call positive | rsp+16 | [CFA-8] |
| d:  | add rsp, 8    | rsp+16 | [CFA-8] |
| 11: | ret           | rsp+8  | [CFA-8] |
| 12: | call negative | rsp+16 | [CFA-8] |
| 17: | add rsp, 8    | rsp+16 | [CFA-8] |
| 1a: | ret           | rsp+8  | [CFA-8] |

DW\_CFA\_def\_cfa: RSP +8 DW CFA offset: RIP -8 DW CFA advance loc: 4 DW CFA def cfa offset: +16 DW CFA advance loc: 13 DW CFA remember state: DW CFA def cfa offset: +8 DW\_CFA\_advance\_loc: 1 DW\_CFA\_restore\_state: DW\_CFA\_advance\_loc: 9 => DW\_CFA\_def\_cfa\_offset: +8

Remember stack: {}

#### Call Frame Information: Bytecode

- ► DWARF<sup>48</sup> specifies bytecode for call frame information
- Self-contained section .eh\_frame (or .debug\_frame)
- Series of entries; two possible types distinguished using header
- Frame Description Entry (FDE): description of a function
  - Code range, instructions, pointer to CIE, language-specific data
- Common Information Entry (CIE): shared information among multiple FDEs
  - Initial instrs. (prepended to all FDE instrs.), personality function, alignment factors (constants factored out of instrs.), ...
- readelf --debug-dump=frames <file> llvm-dwarfdump --debug-frame <file>


- Problem: linear search over possibly many FDEs is slow
- ► Idea:

- Problem: linear search over possibly many FDEs is slow
- ▶ Idea: create binary search table over FDEs at link-time
- Ordered list of all function addresses and their FDE
- Unwinder does binary search to find matching FDE

- Problem: linear search over possibly many FDEs is slow
- Idea: create binary search table over FDEs at link-time
- Ordered list of all function addresses and their FDE
- Unwinder does binary search to find matching FDE
- Separate program header entry: PT\_GNU\_EH\_FRAME
- Unwinder needs loader support to find these
  - \_dl\_find\_object or dl\_iterate\_phdr
- FDEs and indices are cached to avoid redundant lookups

#### Call Frame Information: Assembler Directives

#### Compilers produces textual CFI

- Assembler encodes CFI into binary format
  - Allows for integration of annotated inline assembly
  - Inline-asm also needs CFI directives
- Register numbers specified by psABI
- Wrap function with .cfi\_startproc/.cfi\_endproc
- Many directives map straight to DWARF instructions
  - .cfi\_def\_cfa\_offset 16; .cfi\_offset %rbp, -16; .cfi\_def\_cfa\_register %rbp

#### Call Frame Information: Assembler Directives - Example

foo:

```
int bar(int*);
int foo(unsigned long x) {
  int arr[x * 4];
  return bar(arr);
}
```

```
gcc -O -S foo.c
```

```
.globl foo
.type foo, @function
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
shl rdi, 4
sub rsp, rdi
mov rdi, rsp
call bar
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.size foo, .-foo
```

Unwinding depends strongly on OS and architecture

Unwinding depends strongly on OS and architecture

Linux uses DWARF

- Apple has modified version
- Windows has SEH with kernel-support for unwinding
- IBM AIX has their own format
- AArch32 has another custom format

Unwinding depends strongly on OS and architecture

Linux uses DWARF

- Apple has modified version
- Windows has SEH with kernel-support for unwinding
- IBM AIX has their own format
- AArch32 has another custom format

Additionally: minor differences for return address, stack handling, ...

Unwinding depends strongly on OS and architecture

Linux uses DWARF

- Apple has modified version
- Windows has SEH with kernel-support for unwinding
- IBM AIX has their own format
- AArch32 has another custom format

Additionally: minor differences for return address, stack handling, ....

Needs to work reliably for exception handling

Get back trace

- ► Map address to source file/line
- Show global and local variables
  - Local variables need scope information, e.g. shadowing
  - Data type information, e.g. int, string, struct, enum
- Set break point at line/function
  - Might require multiple actual breakpoints: inlining, template expansion
- Step through program by line/statement

Get back trace

► Map address to source file/line

- Show global and local variables
  - Local variables need scope information, e.g. shadowing
  - Data type information, e.g. int, string, struct, enum
- Set break point at line/function
  - Might require multiple actual breakpoints: inlining, template expansion
- Step through program by line/statement

 $\rightsquigarrow CFI$ 

#### Line Table

- Map instruction to: file/line/column; start of stmt; start of basic block; is prologue/epilogue; ISA mode
- Table can be huge; idea:

#### Line Table

- Map instruction to: file/line/column; start of stmt; start of basic block; is prologue/epilogue; ISA mode
- Table can be huge; idea: encode as bytecode
- Extracted information are bytecode registers
- Conceptually similar to CFI encoding
- Ilvm-dwarfdump -v --debug-line or readelf -wlL

Get back trace

► Map address to source file/line

- Show global and local variables
  - Local variables need scope information, e.g. shadowing
  - Data type information, e.g. int, string, struct, enum
- Set break point at line/function
  - Might require multiple actual breakpoints: inlining, template expansion
- Step through program by line/statement

 $\rightarrow$  CFI

Get back trace

Map address to source file/line

 $\rightarrow$  CFI  $\rightarrow$  Line Table

- Show global and local variables
  - Local variables need scope information, e.g. shadowing
  - Data type information, e.g. int, string, struct, enum
- Set break point at line/function

 $\rightsquigarrow$  Line Table/??

 $\rightarrow$  Line Table

- Might require multiple actual breakpoints: inlining, template expansion
- Step through program by line/statement

#### DWARF: Hierarchical Program Description

Extensible, flexible, Turing-complete<sup>49</sup> format to describe program

- Forest of Debugging Information Entries (DIEs)
  - Tag: indicates what the DIE describes
  - Set of attributes: describe DIE (often constant, range, or arbitrary expression)
  - Optionally children

#### DWARF: Hierarchical Program Description

Extensible, flexible, Turing-complete<sup>49</sup> format to describe program

- Forest of Debugging Information Entries (DIEs)
  - Tag: indicates what the DIE describes
  - Set of attributes: describe DIE (often constant, range, or arbitrary expression)
  - Optionally children

Rough classification:

- ▶ DIEs for types: base types, typedef, struct, array, enum, union, ...
- DIEs for data objects: variable, parameter, constant
- ▶ DIEs for program scope: compilation unit, function, block, ....

<sup>&</sup>lt;sup>49</sup>J Oakley and S Bratus. "Exploiting the Hard-Working DWARF: Trojan and Exploit Techniques with No Native Executable Code". In: WOOT. 2011. @.

#### DWARF: Data Types

```
DW_TAG_base_type [Ox4a]
DW_AT_byte_size (Ox04)
DW_AT_encoding (DW_ATE_signed)
DW_AT_name ("int")
```

## DWARF: Data Types

```
DW_TAG_structure_type [0x2e]
 DW_AT_byte_size (0x08)
 DW_AT_sibling (0x4a)
 DW_TAG_member [0x37]
   DW_AT_name ("x")
   DW_AT_type (0x4a "int")
   DW_AT_data_member_location (0x00)
 DW_TAG_member [0x40]
   DW_AT_name ("y")
   DW_AT_type (0x4a "int")
   DW AT data member location (0x04)
```

```
DW_TAG_base_type [0x4a]
DW_AT_byte_size (0x04)
DW_AT_encoding (DW_ATE_signed)
DW_AT_name ("int")
```

## DWARF: Data Types

```
DW_TAG_structure_type [0x2e]
 DW_AT_byte_size (0x08)
 DW_AT_sibling (0x4a)
 DW_TAG_member [0x37]
   DW_AT_name ("x")
   DW_AT_type (0x4a "int")
   DW_AT_data_member_location (0x00)
 DW_TAG_member [0x40]
   DW AT name ("v")
   DW_AT_type (0x4a "int")
   DW AT data member location (0x04)
```

```
DW_TAG_base_type [Ox4a]
DW_AT_byte_size (Ox04)
DW_AT_encoding (DW_ATE_signed)
DW_AT_name ("int")
```

```
DW_TAG_pointer_type [0xb1]
DW_AT_byte_size (8)
DW_AT_type (0xb6 "char *")
```

```
DW_TAG_pointer_type [Oxb6]
DW_AT_byte_size (8)
DW_AT_type (Oxbb "char")
```

DW\_TAG\_base\_type [Oxbb] DW\_AT\_byte\_size (OxO1) DW\_AT\_encoding (DW\_ATE\_signed\_char) DW\_AT\_name ("char")

#### **DWARF:** Variables

```
DW_TAG_formal_parameter [0x7f]
DW_AT_name ("argc")
// ...
```

#### DWARF: Expressions

Very general way to describe location of value:

#### DWARF: Expressions

Very general way to describe location of value: bytecode

- Stack machine, evaluates to location or value of variable
  - Simple case: register or stack slot
  - But: complex expression to recover original value after optimization e.g., able to recover *i* from stored i 1
  - Unbounded complexity!
- Can contain control flow
- Can dereference memory, registers, etc.

▶ Used for: CFI locations, variable locations, array sizes, ...

## DWARF: Program Structure

Follows structure of code

- ► Top-level: compilation unit
- Entries for namespaces, subroutines (functions)
  - Functions can contain inlined subroutines
- Lexical blocks to group variables
- Call sites and parameters
- Each node annotated with pc-range and source location

Get back trace

Map address to source file/line

 $\rightarrow$  CFI  $\rightarrow$  Line Table

- Show global and local variables
  - Local variables need scope information, e.g. shadowing
  - Data type information, e.g. int, string, struct, enum
- Set break point at line/function

 $\rightsquigarrow$  Line Table/??

 $\rightarrow$  Line Table

- Might require multiple actual breakpoints: inlining, template expansion
- Step through program by line/statement

Get back trace

► Map address to source file/line

Show global and local variables

 $\rightsquigarrow$  CFI  $\rightsquigarrow$  Line Table

 $\rightarrow$  DIF tree

 $\rightarrow$  Line Table

Local variables need scope information, e.g. shadowing

- Data type information, e.g. int, string, struct, enum
- ► Set break point at line/function → Line Table/DIE tree
  - Might require multiple actual breakpoints: inlining, template expansion
- Step through program by line/statement

## Other Debuginfo Formats

- DWARF is big despite compression
- Cannot run in time-constrained environments
  - Unsuited for in-kernel backtrace generation
- Historically: STABS string based encoding
   Complexity increased significantly over time
   Microsoft: PDB for PE
- Linux kernel: CTF for simple type information
- ► Linux kernel: BTF for BPF programs

## Unwinding and Debuginfo – Summary

- Some languages/setups must be able to unwind the stack
- Needs meta-information on call frames
- DWARF encodes call frame information is bytecode program
- Runtime must efficiently find relevant information
- Stack unwinding typically done in two phases
- Functions have associated personality function to steer unwinding
- DWARF encodes debug info in tree structure of DIEs
- DWARF info can become arbitrarily complex

## Unwinding and Debuginfo – Questions

- What are alternatives to stack unwinding?
- What are the benefits of stack unwinding through metadata?
- What are the two phases of unwinding? Why is this separated?
- How to construct a CFI table for a given assembly code?
- How to construct DWARF ops for a CFI table?
- ▶ How to find the correct CFI table line for a given address?
- What is the general structure of DWARF debug info?

#### Code Generation for Data Processing Lecture 11: JIT Compilation and Sandboxing

#### Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

## JIT Compilation

Ahead-of-Time compilation not always possible/sufficient

## JIT Compilation

Ahead-of-Time compilation not always possible/sufficient

- "Dynamic source" code: pre-compilation not possible
  - JavaScript, eval(), database queries
  - Binary translation of highly-dynamic/JIT-compiled code
- Additional verification/analysis or increased portability desired
  - (e)BPF, WebAssembly
- Dynamic optimization on common types/values
  - Run-time sampling of frequent code paths, allows dynamic speculation
  - Relevant for highly dynamic languages otherwise prefer PGO<sup>50</sup>

342

# JIT Compilation: Simple Approach

# JIT Compilation: Simple Approach

Use standard compiler, write shared library
Can write compiler IR, or plain source code
dlopen + dlsym to find compiled function
### JIT Compilation: Simple Approach

Use standard compiler, write shared library
Can write compiler IR, or plain source code
dlopen + dlsym to find compiled function

Example: libgccjit

### JIT Compilation: Simple Approach

Use standard compiler, write shared library
Can write compiler IR, or plain source code
dlopen + dlsym to find compiled function

Example: libgccjit

- + Simple, fairly easy to debug
- Very high overhead, needs IO



malloc() – memory often non-executable

> alloca()

- malloc() memory often non-executable
- alloca() memory often non-executable
- mmap(PROT\_READ|PROT\_WRITE|PROT\_EXEC)

- malloc() memory often non-executable
- alloca() memory often non-executable
- ▶ mmap(PROT\_READ|PROT\_WRITE|PROT\_EXEC)  $W \oplus X$  may prevent this
  - $W \oplus X$ : a page must never be writable and executable at the same time
  - Some OS's (e.g. OpenBSD) and CPUs (Apple Silicon) strictly enforce this
- ► For code generation: map pages read-write
- ▶ Before execution: change protection to (read–)execute

- malloc() memory often non-executable
- alloca() memory often non-executable
- ▶ mmap(PROT\_READ|PROT\_WRITE|PROT\_EXEC)  $W \oplus X$  may prevent this
  - $W \oplus X$ : a page must never be writable and executable at the same time
  - Some OS's (e.g. OpenBSD) and CPUs (Apple Silicon) strictly enforce this
- ► For code generation: map pages read-write
  - NetBSD needs special argument to allow remapping the page as executable
- ▶ Before execution: change protection to (read–)execute

#### Adjust page-level protections: mprotect

- OS will adjust page tables
- Typically incurs TLB shootdown
- Other steps might be needed, highly OS-dependent
  - Read manual

#### Flush instruction cache

- Flush DCache to unification point (last-level cache)
- Invalidate ICache in all cores for virtual address range
  - > After local flush, kernel might move thread to other core with old ICache

#### Flush instruction cache

- Flush DCache to unification point (last-level cache)
- Invalidate ICache in all cores for virtual address range
  - After local flush, kernel might move thread to other core with old ICache
- x86: coherent ICache/DCache hierarchy hardware detects changes
  - ► Also includes: transparent (but expensive) detection of self-modifying code
- AArch64, MIPS, SPARC, ... (Linux): user-space instructions
- ▶ ARMv7, RISC-V<sup>51</sup> (Linux), all non-x86 (Darwin): system call

#### Flush instruction cache

- Flush DCache to unification point (last-level cache)
- Invalidate ICache in all cores for virtual address range
  - After local flush, kernel might move thread to other core with old ICache
- x86: coherent ICache/DCache hierarchy hardware detects changes
  - ► Also includes: transparent (but expensive) detection of self-modifying code
- ► AArch64, MIPS, SPARC, ... (Linux): user-space instructions
- ▶ ARMv7, RISC-V<sup>51</sup> (Linux), all non-x86 (Darwin): system call

#### Skipping ICache flush: spurious, hard-to-debug problems

#### Code Generation: Differences AoT vs. JIT

## Code Generation: Differences AoT vs. JIT

|                                      | Ahead-of-Time                               | JIT Compilation                                                                                              |
|--------------------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| Code Model<br>Relocations<br>Symbols | Arbitrary<br>Linker/Loader<br>Linker/Loader | Large (or PIC with custom PLT)<br>JIT compiler/linker<br>JIT compiler/linker<br>may need application symbols |
| Memory Mapping<br>EHFrame            | OS/Loader<br>Compiler/Linker/Loader         | JIT compiler/linker<br>JIT compiler/linker                                                                   |
| Debuginfo                            | Compiler/Linker/Debugger                    | JIT compiler<br>register with debugger                                                                       |



#### JIT: Code Model

Code can be located anywhere in address space

Cannot rely on linker to put in, e.g., lowest 2 GiB

Large code model: allows for arbitrarily-sized addresses

Small-PIC: possible for relocations inside object

Needs new PLT/GOT for other symbols

Overhead trade-off: wide immediates vs. extra indirection (PLT)

Further restrictions may apply (ISA/OS)

#### JIT: Relocations and Symbols

#### JIT compiler must take care of relocations

- Can try to directly process relocations during machine code gen.
- Not always possible: cyclic dependencies
- Option: behave like normal compiler with separate runtime linker

### JIT: Relocations and Symbols

#### JIT compiler must take care of relocations

- Can try to directly process relocations during machine code gen.
- Not always possible: cyclic dependencies
- Option: behave like normal compiler with separate runtime linker
- Code may need to access functions/global variables from application
  - Option: JIT compiler "hard-codes" relevant symbols
  - Option: application registers relevant symbols
  - Option: application linked with --export-dynamic and use dlsym

### JIT: Memory Layout

Never place code and (writable) data on same page

- $W \oplus X$ ; and writes near code can trigger self-modifying code detection
- Avoid many small allocations with one page each
- But: editing existing code pages is problematic

### JIT: Memory Layout

Never place code and (writable) data on same page

- $W \oplus X$ ; and writes near code can trigger self-modifying code detection
- Avoid many small allocations with one page each
- But: editing existing code pages is problematic
- Choose suitable alignment for code
  - Page alignment is too large: poor cache utilization
  - ICache cache line size not too relevant, decode buffer size is typical value: 16 bytes
  - Some basic blocks (e.g., hot loop entries) can benefit from 16-byte alignment

JIT: .eh\_frame Registration (required for C++)

Unwinder finds .eh\_frame

#### JIT: .eh\_frame Registration (required for C++)

- Unwinder finds .eh\_frame using program headers
- > Problem: JIT-compiled code has no program headers
- Idea: JIT compiler registers new code with runtime
- libc provides \_\_register\_frame and \_\_deregister\_frame
  - Call with address of first Frame Description Entry (FDE)
  - Historically also called by init code

### JIT: GDB Debuginfo Registration (optional)

- GDB finds debug info from section headers of DSOs
- Problem: JIT-compiled code has no DSO

## JIT: GDB Debuginfo Registration (optional)

- GDB finds debug info from section headers of DSOs
- Problem: JIT-compiled code has no DSO
- Idea: JIT compiler registers new code with debugger
- Define function \_\_jit\_debug\_register\_code and global var. \_\_jit\_debug\_descriptor
  - Call function on update; GDB places breakpoint in function
  - Prevent function from being inlined
- Descriptor is linked list of in-memory object files
  - Needs relocations applied, also for debug info

▶ Users: LLVM, Wasmtime, HHVM, ...; consumers: GDB, LLDB

perf tracks binary through backing file of mmap

- perf tracks binary through backing file of mmap
- ▶ Problem 1: JIT-compiled code has no backing file for its mmap region
- Problem 2: after tracing, JIT-compiled code is gone

- perf tracks binary through backing file of mmap
- Problem 1: JIT-compiled code has no backing file for its mmap region
- Problem 2: after tracing, JIT-compiled code is gone
- ► Goal 1: map instructions to functions
- ► Goal 2: keep JIT-compiled code for detailed analysis

- perf tracks binary through backing file of mmap
- Problem 1: JIT-compiled code has no backing file for its mmap region
- Problem 2: after tracing, JIT-compiled code is gone
- ► Goal 1: map instructions to functions
- ► Goal 2: keep JIT-compiled code for detailed analysis
- Approach 1: dump function limits to /tmp/perf-<PID>.map<sup>52</sup>
   Text file: format: startaddr size name\n
- ► Approach 2: needs an extra slide

- ► JIT-compiler dumps function name/address/size/code<sup>53</sup>
  - ► JITDUMP file: record list for each function, may contain debuginfo
  - File name must be jit-<PID>.dump

- ► JIT-compiler dumps function name/address/size/code<sup>53</sup>
  - ▶ JITDUMP file: record list for each function, may contain debuginfo
  - File name must be jit-<PID>.dump
- ▶ JIT-compiler mmaps part of the file as executable somewhere
  - Only use: perf keeps track of executable mappings ~>> mapping is JIT marker, s.t. perf can find the file later
- ▶ Need to run perf report with -k 1 to use monotonic clock

- ► JIT-compiler dumps function name/address/size/code<sup>53</sup>
  - ► JITDUMP file: record list for each function, may contain debuginfo
  - File name must be jit-<PID>.dump
- ▶ JIT-compiler mmaps part of the file as executable somewhere
  - Only use: perf keeps track of executable mappings ~>> mapping is JIT marker, s.t. perf can find the file later
- ▶ Need to run perf report with -k 1 to use monotonic clock
- ▶ After profiling: perf inject --jit -i perf.data -o jit.data
  - Extracts functions from JITDUMP, each into its own ELF file
  - Changes mappings of profile to refer to newly created files

- ► JIT-compiler dumps function name/address/size/code<sup>53</sup>
  - ▶ JITDUMP file: record list for each function, may contain debuginfo
  - File name must be jit-<PID>.dump
- ▶ JIT-compiler mmaps part of the file as executable somewhere
  - Only use: perf keeps track of executable mappings ~> mapping is JIT marker, s.t. perf can find the file later
- ▶ Need to run perf report with -k 1 to use monotonic clock
- After profiling: perf inject --jit -i perf.data -o jit.data
  - Extracts functions from JITDUMP, each into its own ELF file
  - Changes mappings of profile to refer to newly created files

#### perf report -i jit.data - Profit!

### Compilation Time

Problem: code generation takes time

- Especially high-complexity frameworks like GCC or LLVM
- Compilation time of JIT compilers often matters
  - Example: website needing JavaScript on page load
  - Example: compiling database query
- Functions executed once are not worth optimizing
- But: often not known in advance

## Compilation Time

Problem: code generation takes time

- Especially high-complexity frameworks like GCC or LLVM
- Compilation time of JIT compilers often matters
  - Example: website needing JavaScript on page load
  - Example: compiling database query
- Functions executed once are not worth optimizing
- But: often not known in advance
- Idea: adaptive compilation
- Incrementally spend more time on optimization

# Compilation Time: Simple Approach

#### Compilation Time: Simple Approach

# Caching

#### Compilation Time: Simple Approach

# Caching


## Adaptive Execution

Execution tiers have different compile-time/run-time tradeoffs

- Bytecode interpreter: very fast/slow
- Fast compiler: medium/medium
- Optimizing compiler: slow/fast

# Adaptive Execution

Execution tiers have different compile-time/run-time tradeoffs

- Bytecode interpreter: very fast/slow
- Fast compiler: medium/medium
- Optimizing compiler: slow/fast
- Start with interpreter, profile execution
  - E.g., collect stats on execution frequency, dynamic types, ...
- For program worth optimizing, switch no next tier
  - > Depends on profile information, e.g. only optimize hot code
  - Compile in background, switch when ready

Switching only possible at compiler-defined points

Needs to serialize relevant state for other tier

Switching only possible at compiler-defined points

- Needs to serialize relevant state for other tier
- Simple approach: only switch at function boundaries

Switching only possible at compiler-defined points

- Needs to serialize relevant state for other tier
- Simple approach: only switch at function boundaries
  - Simple, well-defined boundaries; unable to switch inside loop

- Switching only possible at compiler-defined points
  - Needs to serialize relevant state for other tier
- Simple approach: only switch at function boundaries
  - Simple, well-defined boundaries; unable to switch inside loop
- Complex approach: allow switching at loop headers/everywhere
  - Needs tracking of much more meta-information
  - All entry points need well-defined interface
  - All exit points need info to recover complete state
  - Severely limits optimizations; all loops become irreducible
- Using LLVM is possible, but not a good fit

# Adaptive Execution: Partial Compilation and Speculation

- Observation: even in hot functions, many branches are rarely used
- Optimizing cold code is wasted time(/energy)

## Adaptive Execution: Partial Compilation and Speculation

- Observation: even in hot functions, many branches are rarely used
- Optimizing cold code is wasted time(/energy)
- Observation (JS): functions often get called with same data type
- Specializing on structure allows removing string lookup for fields

# Adaptive Execution: Partial Compilation and Speculation

- Observation: even in hot functions, many branches are rarely used
- Optimizing cold code is wasted time(/energy)
- Observation (JS): functions often get called with same data type
  Specializing on structure allows removing string lookup for fields
- Idea: speculate on common path using profiling data
- Add check whether speculation holds; if not, use side-exit
  - Side-exit can be patched later with actual code
- Side-exit must serialize all relevant state for lower tier
  - "Deoptimization"





- Executing untrusted code without additional measures may harm system
- Untrusted input may expose vulnerabilities

# Sandboxing

- Executing untrusted code without additional measures may harm system
- Untrusted input may expose vulnerabilities
- ► Goal 1: execute untrusted code without impacting security
  - Code in higher-level representation allows for further analyses but needs JIT compilation for performance
- ► Goal 2: limit impact potential of new vulnerabilities

# Sandboxing

Executing untrusted code without additional measures may harm system

- Untrusted input may expose vulnerabilities
- ► Goal 1: execute untrusted code without impacting security
  - Code in higher-level representation allows for further analyses but needs JIT compilation for performance
- ► Goal 2: limit impact potential of new vulnerabilities
- Other goals: portability, resource usage, performance, usability, language flexibility

# Approach: Sandbox Operating System as-is

- Idea: put entire operating system in sandbox ("virtual machine")
- Widely used in practice
- Virtualization needs hardware and OS support
  - CPU has hypervisor mode which controls guest OS; offers nested paging, hypercalls from guest OS to hypervisor

# Approach: Sandbox Operating System as-is

- Idea: put entire operating system in sandbox ("virtual machine")
- Widely used in practice
- Virtualization needs hardware and OS support
  - CPU has hypervisor mode which controls guest OS; offers nested paging, hypercalls from guest OS to hypervisor
- $+ \,$  Good usability and performance
- + Strong isolation
- Rather high overhead on resource usage: completely new OS
- Inflexible and high start latency (seconds)

- Restrict system calls: seccomp
  - Filter program for system calls depending on arguments

- Restrict system calls: seccomp
  - Filter program for system calls depending on arguments
- Separate namespaces: network, PID, user, mount, ...
  - Isolate program from rest of the system
  - Need to allow access to permitted resources

- Restrict system calls: seccomp
  - Filter program for system calls depending on arguments
- Separate namespaces: network, PID, user, mount, ...
  - Isolate program from rest of the system
  - Need to allow access to permitted resources
- Limit resource usage: memory, CPU, ... cgroups

Frequently and widely used ("container")

Frequently and widely used ("container")

- + Good usability and performance, low latency (milliseconds)
- + Finer grained control of resources
- $\sim\,$  Resource usage: often completely new user space
- Weak isolation: OS+CPU often bad at separation
  - Kernel has a fairly large interface, not hardened against bad actors
  - Privilege escalation happens not rarely

# Approach: Sandbox Native Code with Modification

#### Idea: enforce limitations on machine code

- > Define restrictions on machine code, e.g. no unbounded memory access
- Modify compiler to comply with restrictions
- Verify program at load time

## Approach: Sandbox Native Code with Modification

#### Idea: enforce limitations on machine code

- > Define restrictions on machine code, e.g. no unbounded memory access
- Modify compiler to comply with restrictions
- Verify program at load time
- ► Google Native Client<sup>54</sup>, originally x86-32, ported to x86-64 and ARM
- Designed as browser extension
- Native code shipped to browser, executed after validation

Problem: dynamic code not verifiable

Problem: dynamic code not verifiable

- $\Rightarrow$  No self-modifying/dynamically generated code
- Problem: overlapping instructions

- Problem: dynamic code not verifiable
  - $\Rightarrow$  No self-modifying/dynamically generated code
- Problem: overlapping instructions
  - $\Rightarrow$  All "valid" instructions must be reachable in linear disassembly
  - $\Rightarrow$  Direct jumps must target valid instructions
  - $\Rightarrow$  No instruction may cross 32-byte boundary
  - $\Rightarrow$  Indirect jumps/returns must be and eax, -32; jmp eax
- Problem: arbitrary memory access inside virtual memory

- Problem: dynamic code not verifiable
  - $\Rightarrow$  No self-modifying/dynamically generated code
- Problem: overlapping instructions
  - $\Rightarrow$  All "valid" instructions must be reachable in linear disassembly
  - $\Rightarrow$  Direct jumps must target valid instructions
  - $\Rightarrow$  No instruction may cross 32-byte boundary
  - $\Rightarrow$  Indirect jumps/returns must be and eax, -32; jmp eax
- Problem: arbitrary memory access inside virtual memory
  - $\Rightarrow$  Separate process, use segmentation restrict accessible memory
- Problem: program can run arbitrary CPU instructions

- Problem: dynamic code not verifiable
  - $\Rightarrow$  No self-modifying/dynamically generated code
- Problem: overlapping instructions
  - $\Rightarrow$  All "valid" instructions must be reachable in linear disassembly
  - $\Rightarrow$  Direct jumps must target valid instructions
  - $\Rightarrow$  No instruction may cross 32-byte boundary
  - $\Rightarrow$  Indirect jumps/returns must be and eax, -32; jmp eax
- Problem: arbitrary memory access inside virtual memory
  - $\Rightarrow$  Separate process, use segmentation restrict accessible memory
- Problem: program can run arbitrary CPU instructions
  - $\Rightarrow$  Blacklist "dangerous" instructions

### NaCl on non-i386 Systems

- ▶ Other architectures<sup>55</sup> use base register instead of segment offsets
  - Additional verification required

### NaCl on non-i386 Systems

▶ Other architectures<sup>55</sup> use base register instead of segment offsets

- Additional verification required
- Deprecated in 2017 in favor of WebAssembly

#### NaCl on non-i386 Systems

▶ Other architectures<sup>55</sup> use base register instead of segment offsets

- Additional verification required
- Deprecated in 2017 in favor of WebAssembly
- + Nice idea, high performance (5–15% overhead)
- $\sim\,$  Instruction blacklist not a good idea
- Not portable, severe restrictions on emitted code
- High verification complexity, error-prone

Idea: compile code to bytecode, JIT-compile on host

Benefit:

Idea: compile code to bytecode, JIT-compile on host

- ▶ Benefit: verification easy all code generated by trusted compiler
- Benefit: more portable

Idea: compile code to bytecode, JIT-compile on host

- Benefit: verification easy all code generated by trusted compiler
- Benefit: more portable
- Java applets
- PNaCl: bytecode version of NaCl

Idea: compile code to bytecode, JIT-compile on host

- Benefit: verification easy all code generated by trusted compiler
- Benefit: more portable
- Java applets
- PNaCl: bytecode version of NaCl
- + Fairly high performance, portable
- $\sim~$  Heavy runtime environment
  - Especially criticized for Java applets
- Very high complexity and attack surface

#### Approach: Subset of JavaScript: asm.js

- Situation: fairly fast JavaScript JIT-compilers present
- Idea: use subset of JavaScript known to be compilable to efficient code
  - All browsers/JS engines support execution without further changes
#### Approach: Subset of JavaScript: asm.js

Situation: fairly fast JavaScript JIT-compilers present

- Idea: use subset of JavaScript known to be compilable to efficient code
  - All browsers/JS engines support execution without further changes
- ▶ asm.js<sup>56</sup>: strictly, statically typed JS subset; single array as heap
- ► JS code generated by compilers, e.g. Emscripten
- JavaScript has single numeric type, but asm.js supports int/float/double
  - Coercion to integer: x|0
  - Coercion to double: +x
  - Coercion to float: Math.fround(x)

#### asm.js Example

```
var log = stdlib.Math.log;
var values = new stdlib.Float64Array(buffer);
function logSum(start, end) {
  start = start|0; // parameter type int
  end = end|0; // parameter type int
```

```
var sum = 0.0, p = 0, q = 0;
```

```
// asm.js forces byte addressing of the heap by requiring shifting by 3
for (p = start << 3, q = end << 3; (p|0) < (q|0); p = (p + 8)|0) {
    sum = sum + +log(values[p>>3]);
}
```

return +sum;

}

Example taken from the specification

Parsing costs time, type restrictions increase code size
 Idea: encode asm.js source as bytecode

Parsing costs time, type restrictions increase code size
 Idea: encode asm.js source as bytecode

First attempt: encode abstract syntax tree in pre-order

Parsing costs time, type restrictions increase code size
 Idea: encode asm.js source as bytecode

First attempt: encode abstract syntax tree in pre-order
 Second attempt: encode abstract syntax tree in post-order

Parsing costs time, type restrictions increase code size
 Idea: encode asm.js source as bytecode

First attempt: encode abstract syntax tree in pre-order
 Second attempt: encode abstract syntax tree in post-order
 Third attempt: encode as stack meaking

Third attempt: encode as stack machine

Parsing costs time, type restrictions increase code size
 Idea: encode asm.js source as bytecode

First attempt: encode abstract syntax tree in pre-order
 Second attempt: encode abstract syntax tree in post-order

- Third attempt: encode as stack machine
- ▶ ... and WebAssembly was born

# Approach: Using Bytecode – WebAssembly

- Strictly-typed bytecode format encoding a stack machine
- Global variables and single, global array as memory

# Approach: Using Bytecode – WebAssembly

- Strictly-typed bytecode format encoding a stack machine
- Global variables and single, global array as memory
- Functions have local variables
  - Parameters pre-populated in first local variables
  - ▶ No dynamic/addressable stack space! ~→ part of global memory used as stack
- Operations use implicit stack
  - Stack has well-defined size and types at each point in program

# Approach: Using Bytecode – WebAssembly

- Strictly-typed bytecode format encoding a stack machine
- Global variables and single, global array as memory
- Functions have local variables
  - Parameters pre-populated in first local variables
  - ▶ No dynamic/addressable stack space! ~→ part of global memory used as stack
- Operations use implicit stack
  - Stack has well-defined size and types at each point in program
- Structured control flow
  - Blocks to skip instructions, loop to repeat, if-then-else
  - ► No irreducible control flow representable

### Approach: Use Verifiable Bytecode – eBPF

Problem: want to ensure termination within certain time frame

Problem: need to make sure nothing can go wrong – no sandbox!

### Approach: Use Verifiable Bytecode – eBPF

- Problem: want to ensure termination within certain time frame
- Problem: need to make sure nothing can go wrong no sandbox!
- Idea: disallow loops and undefined register values, e.g. due to branch
   Combinatorial explosion of possible paths, all need to be analyzed
   No longer Turing-complete

## Approach: Use Verifiable Bytecode – eBPF

Problem: want to ensure termination within certain time frame

- Problem: need to make sure nothing can go wrong no sandbox!
- ▶ Idea: disallow loops and undefined register values, e.g. due to branch
  - Combinatorial explosion of possible paths, all need to be analyzed
  - No longer Turing-complete
- ▶ eBPF: allow user-space to hook into various Linux kernel parts
  - E.g. network, perf sampling, ...
- Strongly verified register machine
- JIT-compiled inside kernel

# JIT Compilation and Sandboxing – Summary

- ► JIT compilation required for dynamic source code or bytecode
- Bytecode allows for simpler verification than machine code, but is more compact
- Producing JIT-compiled code needs CPU, OS, and runtime support
- JIT compilers can do/need to do different kinds of optimizations adaptive execution is key technique to hide compilation latency
- Sandboxing can be done at various levels and granularities
- Virtualization and containers widely used for whole applications
- Bytecode formats popular for ad-hoc distribution of programs

### JIT Compilation and Sandboxing – Questions

- When is JIT-compilation beneficial over Ahead-of-Time compilation?
- How can JIT-compilation be realized using standard compilers?
- How can code be made executable after writing it to memory?
- ▶ Why do some architectures require a system call for ICache flushing?
- How can JIT compilers trade between compilation latency and performance?
- Why is sandboxing important?
- What methods of deploying code for sandboxed execution are widely used?

Code Generation for Data Processing Lecture 12: Binary Translation

Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

#### Motivation

Run program on other architecture

#### Motivation

- Run program on other architecture
- Use-case: application compatibility
  - Other architecture with incompatible instruction encoding
  - Applications using unavailable ISA extensions<sup>57</sup>

#### Motivation

Run program on other architecture

- Use-case: application compatibility
  - Other architecture with incompatible instruction encoding
  - Applications using unavailable ISA extensions<sup>57</sup>
- ► Use-case: architecture research
  - Development of new ISA extensions without existing hardware

Simplest approach: interpreting machine code

Simulate individual instructions, don't generate new code

#### Simplest approach: interpreting machine code

Simulate individual instructions, don't generate new code

Frequently used approach before JIT-compilation became popular

- Simplest approach: interpreting machine code
  - Simulate individual instructions, don't generate new code
- Frequently used approach before JIT-compilation became popular
- + Simple, works almost anywhere, high correctness

- Simplest approach: interpreting machine code
  - Simulate individual instructions, don't generate new code
- Frequently used approach before JIT-compilation became popular
- + Simple, works almost anywhere, high correctness
- Very inefficient

### **Binary Translation**

- Idea: translate guest machine code to host machine code
- Replace interpretation overhead with translation overhead
- Difficult: very rigid semantics, but few code constraints imposed
  - Self-modifying code, overlapping instructions, indirect jumps
  - Exceptions with well-defined states, status flags



Warning for same-ISA translation: passing all instructions through as-is is a bad idea! Behavior might differ.

# Static vs. Dynamic Binary Translation

#### Static BT

- Translate guest executable into host executable
- Do translation before execution

#### Dynamic BT

- Translate code on-the-fly during program execution
- Host code just lives in memory

# Static vs. Dynamic Binary Translation

#### Static BT

- Translate guest executable into host executable
- Do translation before execution
- + Low runtime overhead
- Binaries tend to be huge
- Cannot handle all cases
  - ► E.g., JIT-compiled code

#### Dynamic BT

- Translate code on-the-fly during program execution
- Host code just lives in memory
- + Allows for high correctness
- $\sim~{\rm Can}$  use JIT optimizations
- Translation overhead at run-time

▶ Goal: create new binary for host with same functionality

Program may access its own code/data in various ways
 Guest binary must be retained as-is in-place

- Program may access its own code/data in various ways
  - Guest binary must be retained as-is in-place
- Indirect jumps problematic
  - Need prediction of all possible targets
  - Keeping lots of dynamically possible entries prohibits optimizations

- Program may access its own code/data in various ways
  - Guest binary must be retained as-is in-place
- Indirect jumps problematic
  - Need prediction of all possible targets
  - Keeping lots of dynamically possible entries prohibits optimizations
- JIT-compiled/self-modifying code impossible to handle

- Program may access its own code/data in various ways
  - Guest binary must be retained as-is in-place
- Indirect jumps problematic
  - Need prediction of all possible targets
  - Keeping lots of dynamically possible entries prohibits optimizations
- JIT-compiled/self-modifying code impossible to handle
- Purely static translation impossible for the general case

**Translator Process** 





 Iteratively translate code chunks on-demand

Typically basic blocks



- Iteratively translate code chunks on-demand
  - Typically basic blocks



- Iteratively translate code chunks on-demand
  - Typically basic blocks



- Iteratively translate code chunks on-demand
  - Typically basic blocks


- Iteratively translate code chunks on-demand
  - Typically basic blocks



- Iteratively translate code chunks on-demand
  - Typically basic blocks
- Store new code in-memory for execution and later re-use
- Code executed in same address space as original
  - Guest code/data must be accessible



- Iteratively translate code chunks on-demand
  - Typically basic blocks
- Store new code in-memory for execution and later re-use
- Code executed in same address space as original
  - Guest code/data must be accessible



- Iteratively translate code chunks on-demand
  - Typically basic blocks
- Store new code in-memory for execution and later re-use
- Code executed in same address space as original
  - Guest code/data must be accessible

# Dynamic Binary Translation: Code Fragment

RISC-V Code

```
400560: slli a0, a0, 2
400564: jalr x0, ra, 0 // ret
```

Semantical representation

```
uintptr_t trans_400560(uint64_t* regs) {
  regs[10] = regs[10] << 2;
  return regs[1];
}</pre>
```

Translation Engine

```
void emulate(uintptr_t pc) {
    uint64_t* regs = init();
    while (true)
        pc = translate(pc)(regs);
}
```

# Dynamic Binary Translation: Code Fragment

RISC-V Code

```
400560: slli a0, a0, 2
400564: jalr x0, ra, 0 // ret
```

Translation Engine

```
void emulate(uintptr_t pc) {
  uint64_t* regs = init();
  while (true)
    pc = translate(pc)(regs);
}
```

Semantical representation

```
uintptr_t trans_400560(uint64_t* regs) {
  regs[10] = regs[10] << 2;
  return regs[1];
}</pre>
```

```
// or with tail call:
_Noreturn void trans_400560(uint64_t* regs) {
  regs[10] = regs[10] << 2;
  translate(regs[1])(regs);
  // unreachable
}</pre>
```

- Registers: general-purpose, floating-point, vector, ...
- Flags, control registers, system registers, segments, TLS base

Guest CPU state must be completely emulated

- Registers: general-purpose, floating-point, vector, ...
- Flags, control registers, system registers, segments, TLS base

Memory – user-space emulation: use host address space

- Registers: general-purpose, floating-point, vector, ...
- Flags, control registers, system registers, segments, TLS base
- Memory user-space emulation: use host address space
  - + no overhead through additional indirection
  - no isolation between emulator and guest

- Registers: general-purpose, floating-point, vector, ...
- Flags, control registers, system registers, segments, TLS base
- Memory user-space emulation: use host address space
  - + no overhead through additional indirection
  - no isolation between emulator and guest
- Memory system emulation: need software/hardware paging support

- Registers: general-purpose, floating-point, vector, ...
- Flags, control registers, system registers, segments, TLS base
- Memory user-space emulation: use host address space
  - + no overhead through additional indirection
  - no isolation between emulator and guest
- Memory system emulation: need software/hardware paging support
  - Software implementation: considerable performance overhead
  - Hardware implementation: guest and host need same page size

#### Guest Interface

► User-space emulation: OS interface needs to be emulated

- Mainly system calls, but also vDSO, memory maps, ...
- ► Host libraries are hard to use: ABI differences (e.g. struct padding)
- Syscall emulation tedious: different flag numbers, arguments, orders structs have different fields, alignments, padding bytes

### Guest Interface

► User-space emulation: OS interface needs to be emulated

- Mainly system calls, but also vDSO, memory maps, ...
- ▶ Host libraries are hard to use: ABI differences (e.g. struct padding)
- Syscall emulation tedious: different flag numbers, arguments, orders structs have different fields, alignments, padding bytes
- System-level emulation: CPU interface for operating systems
  - Many system/control registers
  - Different execution modes, memory configurations, etc.
  - Emulation of hardware components

# Dynamic Binary Translation: Optimizations

Fully correct emulation of CPU (and OS) is slow

- Every memory access is a potential page fault
- Signals can be delivered at any instruction boundary
- many other traps...
- ▶ But: these "special" features are used extremely rarely
- ► Idea: optimize for common case
- Aggressively trade correctness for performance

### Translation Granularity

Larger translation granules allow for more optimization

E.g., omit status flag computation; fold immediate construction

### Translation Granularity

- Larger translation granules allow for more optimization
  - E.g., omit status flag computation; fold immediate construction
- Instruction: great for debugging
- Basic block: allows for some important opt.
  - Easy to detect (up to next branch), easy to translate (no control flow)
- Superblock: up to next unconditional jump
  - Reduces transfers between blocks in fallthrough case
  - Translated code not necessarily executed
- Function: follow all conditional control flow
  - Allows most optimizations, e.g. for loop induction variables
  - Complex codegen, ind. jumps problematic, lot of code never executed

# Chaining

Observation: many basic blocks have constant successors

- Often conditional branches with fallthrough and constant offset
- (Hash)map lookup and indirect jump after everyblock expensive

# Chaining

Observation: many basic blocks have constant successors

- Often conditional branches with fallthrough and constant offset
- (Hash)map lookup and indirect jump after everyblock expensive
- Idea: after successor is translated, patch end to jump directly to that code
   First execution is expensive, later executions are fast

```
// Initially generated code
// ...
mov rdi, 0x40068c
lea rsi, [rip+1f]
jmp translate_and_dispatch
1:.byte ... // store patch information
```

```
// After patching
// ...
jmp trans_40068c
// (garbage remains)
```

First execution still slow, patching adds overhead

- Can speculatively translate continuations
- Translation of possibly unneeded code adds overhead

First execution still slow, patching adds overhead

- Can speculatively translate continuations
- Translation of possibly unneeded code adds overhead
- Does not work for indirect jumps
  - ▶ Not necessarily predictable, esp. when considering a single basic block
  - Occur fairly often: function returns

First execution still slow, patching adds overhead

- Can speculatively translate continuations
- Translation of possibly unneeded code adds overhead
- Does not work for indirect jumps
  - ▶ Not necessarily predictable, esp. when considering a single basic block
  - Occur fairly often: function returns

Removing translated functions from code cache becomes harder

- Arbitrary other code may directly branch to translated chunk
- Often solved by limiting chaining to same page or memory region

### Return Address Prediction

Observation: function calls very often return ordinarily

- Return is an indirect jump, but highly predictable
- But: even for "normal" code, this is not always the case: setjmp/longjmp, exceptions

### Return Address Prediction

Observation: function calls very often return ordinarily

- Return is an indirect jump, but highly predictable
- But: even for "normal" code, this is not always the case: setjmp/longjmp, exceptions

► Hardware has return address stack keeping track of call stack

- call pushes next address to stack, ret predicted to pop
- Usually implemented as 16/32 entry ring buffer

### Return Address Prediction

Observation: function calls very often return ordinarily

- Return is an indirect jump, but highly predictable
- But: even for "normal" code, this is not always the case: setjmp/longjmp, exceptions

Hardware has return address stack keeping track of call stack

- call pushes next address to stack, ret predicted to pop
- Usually implemented as 16/32 entry ring buffer

Idea: similarly optimize for common case of ordinary return

- Can be implemented as ring buffer, too
- Pop from stack needs verification of actual guest return address

- Can be implemented as ring buffer, too
- Pop from stack needs verification of actual guest return address
- Doesn't use host hardware return address prediction

- Can be implemented as ring buffer, too
- Pop from stack needs verification of actual guest return address
- Doesn't use host hardware return address prediction
- Option 2: use host stack as shadow stack
  - Allows using host call/ret instructions
  - Verification before/after return still required

- Can be implemented as ring buffer, too
- Pop from stack needs verification of actual guest return address
- Doesn't use host hardware return address prediction
- Option 2: use host stack as shadow stack
  - Allows using host call/ret instructions
  - Verification before/after return still required
  - Can degenerate, need to bound shadow stack (guest might repeatedly call, discard return address, but never return)

### Status Flags

- Observation: many status flags are rarely used
- But: eager computation can be expensive
  - ► E.g., x86 parity (PF) or auxiliary carry (AF)

## Status Flags

- Observation: many status flags are rarely used
- But: eager computation can be expensive
  - E.g., x86 parity (PF) or auxiliary carry (AF)
- Idea: compute flags only when needed
- > On flag computation, store operands needed for flag computation
- Flag usage in same block allows for optimizations
  - E.g., use idiomatic branches (jle, ...)
- Flag usage in different block: compute flags from operands
  - More expensive, but happens seldomly

- ► Goal 1: precise emulation application works properly
- ► Goal 2: stealthness/isolation application can't compromise DBT

- ► Goal 1: precise emulation application works properly
- Goal 2: stealthness/isolation application can't compromise DBT
- Problem: CPU and OS have huge and very-well-specified interfaces
   ... and even if unspecified, software often depends on it

- ► Goal 1: precise emulation application works properly
- Goal 2: stealthness/isolation application can't compromise DBT
- Problem: CPU and OS have huge and very-well-specified interfaces
   ... and even if unspecified, software often depends on it
   Increased difficulty: different guest/host architectures
   E.g., different page size or memory semantics

- ► Goal 1: precise emulation application works properly
- Goal 2: stealthness/isolation application can't compromise DBT
- Problem: CPU and OS have huge and very-well-specified interfaces

   ... and even if unspecified, software often depends on it

   Increased difficulty: different guest/host architectures

   E.g., different page size or memory semantics

   Increased difficulty for user-space: different guest/host OS
  - Depending on syscall interface, nearly impossible (see WSL1)
## **POSIX Signals**

- POSIX specifies signals, which can interrupt program at any point
- Kernel pushes signal frame to stack with user context and calls signal handler
- Signal handler can read/modify user context and continue execution

# **POSIX Signals**

- POSIX specifies signals, which can interrupt program at any point
- Kernel pushes signal frame to stack with user context and calls signal handler
- Signal handler can read/modify user context and continue execution
- Synchronous signals: e.g., SIGSEGV, SIGBUS, SIGFPE, SIGILL
  - ► For example, due to page fault or FP exception
  - Delivered in response to "error" in current thread
- Asynchronous signals: e.g., SIGINT, SIGTERM, SIGCHILD
  - Delivered externally, e.g. using kill
  - Can be delivered to any thread at any time
  - (usually a bad idea to use them)

## Correct DBT: Signals

DBT must register signal handler and propagate signals

Synchronous signals

# Correct DBT: Signals

DBT must register signal handler and propagate signals

#### Synchronous signals

- Delivered at "constrainable" points in program
- Must recover fully consistent guest architectural state
- JIT-compiled code must be sufficiently annotated for this

Asynchronous signals

# Correct DBT: Signals

DBT must register signal handler and propagate signals

#### Synchronous signals

- Delivered at "constrainable" points in program
- Must recover fully consistent guest architectural state
- JIT-compiled code must be sufficiently annotated for this

#### Asynchronous signals

- Can really be delivered at any time
- Must not be immediately delivered to guest
- $\rightsquigarrow$  Usually delivered when convenient
- But: real-time signals have special semantics

Option: emulating paging in software (slow, but works)

- Every memory accesses becomes a hash table lookup
- Shared memory still problematic: host OS might have larger pages

Using host paging is much faster, but problematic for correctness

- Every memory accesses becomes a hash table lookup
- Shared memory still problematic: host OS might have larger pages
- Using host paging is much faster, but problematic for correctness
- Host OS might have larger pages

- Every memory accesses becomes a hash table lookup
- Shared memory still problematic: host OS might have larger pages
- Using host paging is much faster, but problematic for correctness
- Host OS might have larger pages
- Every memory access can cause a page fault (see signal handling)

- Every memory accesses becomes a hash table lookup
- Shared memory still problematic: host OS might have larger pages
- Using host paging is much faster, but problematic for correctness
- Host OS might have larger pages
- Every memory access can cause a page fault (see signal handling)
- Guest can access/modify arbitrary addresses in its address space... including the DBT and its code cache

- Every memory accesses becomes a hash table lookup
- Shared memory still problematic: host OS might have larger pages
- Using host paging is much faster, but problematic for correctness
- Host OS might have larger pages
- Every memory access can cause a page fault (see signal handling)
- Guest can access/modify arbitrary addresses in its address space... including the DBT and its code cache
- Tracking read/write/execute permissions, e.g. check X before translation

## Correct DBT: Memory Ordering

CPUs (aggressively) reorder memory operations

- ▶ x86: total store ordering stores can be reordered after loads
- Most others: weak ordering everything can be reordered
- Relevant for multi-core systems: other thread can observe ordering
- Atomic operations and fences limit reordering (e.g., acq/rel/seqcst)
- Emulating weak memory on TSO:

## Correct DBT: Memory Ordering

- CPUs (aggressively) reorder memory operations
  - ▶ x86: total store ordering stores can be reordered after loads
  - Most others: weak ordering everything can be reordered
- Relevant for multi-core systems: other thread can observe ordering
- Atomic operations and fences limit reordering (e.g., acq/rel/seqcst)
- Emulating weak memory on TSO: easy
- Emulating TSO on weak memory:

## Correct DBT: Memory Ordering

- CPUs (aggressively) reorder memory operations
  - ▶ x86: total store ordering stores can be reordered after loads
  - Most others: weak ordering everything can be reordered
- Relevant for multi-core systems: other thread can observe ordering
- Atomic operations and fences limit reordering (e.g., acq/rel/seqcst)
- Emulating weak memory on TSO: easy
- Emulating TSO on weak memory: hard
  - Can try to make all operations atomic
  - Atomic operations often need alignment guarantees (not on x86)
  - Only viable solution so far: insert fences everywhere

- Idea: before translation, remap as read-only
- ▶ On page fault (SIGSEGV), remove relevant parts from code cache
  - Requires code cache segmentation and mapping of code to original page

- Idea: before translation, remap as read-only
- ► On page fault (SIGSEGV), remove relevant parts from code cache
  - Requires code cache segmentation and mapping of code to original page
- ▶ When executing possibly modifiable code: every store can change code!

- Idea: before translation, remap as read-only
- ► On page fault (SIGSEGV), remove relevant parts from code cache
  - Requires code cache segmentation and mapping of code to original page
- ▶ When executing possibly modifiable code: every store can change code!
- Doesn't easily work for shared memory, need to track this, too
  - Might be impossible when shared with other process

# Correct DBT: Floating-point

Floating-point arithmetic is standardized in IEE-754

## Correct DBT: Floating-point

Floating-point arithmetic is standardized in IEE-754

- …except for some details and non-standard operations
- > x86 maxsd: if one operand is NaN, result is second operand
- RISC-V fmax.d: if one operand is NaN, result is non-NaN operand
- AArch64 fmax: if one operand is NaN, result is NaN operand
  - Unless configured differently in fpcr

## Correct DBT: Floating-point

Floating-point arithmetic is standardized in IEE-754

- …except for some details and non-standard operations
- > x86 maxsd: if one operand is NaN, result is second operand
- RISC-V fmax.d: if one operand is NaN, result is non-NaN operand
- AArch64 fmax: if one operand is NaN, result is NaN operand
  - Unless configured differently in fpcr

Correctness typically requires software emulation (e.g., QEMU does this)

## Correct DBT: OS and CPU Specifics

Emulating all syscalls correctly is hard

- Version-specifics, structure layouts, feature support
- Huge interface
- /proc/self/\* how to emulate?
  - Catch all file system accesses? Follow all possible symlinks?
  - What if procfs is mounted somewhere else?
- cpuid how to emulate?
  - Cache sizes, processor model, ...
  - Application can do timing experiment to detect DBT

## Binary Translation – Summary

- ISA emulation often used for cross-ISA program execution
- Binary Translation allows for more performance than interpretation
- Static Binary Translation handles whole program ahead-of-time
- Dynamic Binary Translation translates code on-demand
- ISA often highly restricts optimization possibilities
- Optimizations typically very low-level
- Correct emulation of CPU/OS challenging due to large interface

#### Binary Translation – Questions

- What are use cases of binary translation?
- ▶ What is the difference between static and dynamic binary translation?
- Why is static BT strictly less powerful than dynamic BT?
- What are typical translation granularities for DBT?
- How to optimize control flow handling in DBT?
- Why is correct binary translation hard to optimize?
- What problem can occur when not emulating paging for user-space emulation?

Code Generation for Data Processing Lecture 13: Query Compilation

Alexis Engelke

Chair of Data Science and Engineering (125) School of Computation, Information, and Technology Technical University of Munich

Winter 2023/24

Databases are often used in latency-critical situations

Mostly transactional workload

Databases are often used in latency-critical situations

Mostly transactional workload

Databases are often used for analyzing large data sets

- Mostly analytical workload; queries can be complex
- Latency not that important, but through-put is

Databases are often used in latency-critical situations

Mostly transactional workload

Databases are often used for analyzing large data sets
Mostly analytical workload; queries can be complex
Latency not that important, but through-put is

- Databases are also used for storing data streams
  - Streaming databases, e.g. monitoring sensors
  - Throughput is important; but queries often simple

#### Data Representation

Relational algebra: set/bag of tuples

- Tuple is sequence of data with different types
- All tuples in one relation have same schema
- Order does not matter
- Duplicates might be possible (bags)
- ▶ Might have special values, e.g. NULL
- Values might be variably-sized, e.g. strings
- ▶ But: databases have *high* degree of freedom wrt. data representation



Query often specified in "standardized format" (SQL)



- Query often specified in "standardized format" (SQL)
- SQL is transformed into (logical) query plan

## Query Plan

Query often specified in "standardized format" (SQL)

- SQL is transformed into (logical) query plan
- Logical query plan is optimized
  - E.g., selection push down, transforming cross products to joins, join ordering

# Query Plan

Query often specified in "standardized format" (SQL)

- SQL is transformed into (logical) query plan
- Logical query plan is optimized
  - E.g., selection push down, transforming cross products to joins, join ordering
- Physical query plan
  - Selection of actual implementation for operators
  - Determine use index structures, access paths, etc.

#### Query Plan: Subscripts

Query plan strongly depends on query

Operators have query-dependent subscripts

E.g., selection/join predicate, aggregation function, attributes

Implementation of these also depends on schema

Can include arbitrarily complex expressions

► Examples: 
$$\bowtie_{s.matrnr=h.matrnr}^{HJ}$$
,  $\sigma_{a.x<5\cdot(b.y-a.z)}$ 





#### Subscripts: Execution

► Option: keep as tree, interpret
- Option: keep as tree, interpret
  - + Simple, flexible
  - Slow



- Option: keep as tree, interpret
  - + Simple, flexible
  - Slow
- ► Option: compile to bytecode

- Option: keep as tree, interpret
  - + Simple, flexible
  - Slow
- Option: compile to bytecode
  - $+ \ \, {\rm More} \ \, {\rm efficient}$
  - More effort to implement, some compile-time



- Option: keep as tree, interpret
  - + Simple, flexible
  - Slow
- Option: compile to bytecode
  - + More efficient
  - More effort to implement, some compile-time
- Option: compile to machine code

- Option: keep as tree, interpret
  - + Simple, flexible
  - Slow
- Option: compile to bytecode
  - + More efficient
  - More effort to implement, some compile-time
- Option: compile to machine code
  - Code can be complex to accurately represent semantics
  - + Most efficient
  - Most effort to implement, may need short compile-times

## SQL Expressions

Arithmetic expressions are fairly simple

- Need to respect data type and check for errors (e.g., overflow)
- Numbers in SQL are (fixed-point) decimals

String operations can be more complex

# SQL Expressions

Arithmetic expressions are fairly simple

- Need to respect data type and check for errors (e.g., overflow)
- Numbers in SQL are (fixed-point) decimals
- String operations can be more complex
  - like expressions
  - Regular expressions strongly benefit from optimized execution
  - But: full-compilation may not be worth the effort often, calling runtime functions is beneficial
  - Support Unicode for increased complexity

#### Query Execution: Simplest Approach

 $\bowtie_{s.matrnr=h.matrnr}^{HJ}$ studenten hoeren

# Query Execution: Simplest Approach



- Execute operators individually
- Materialize all results after each operator
- "Full Materialization"

# Query Execution: Simplest Approach



- Execute operators individually
- Materialize all results after each operator
- "Full Materialization"
- + Easy to implement
- + Can dynamicnamically adjust plan
- Inefficient, intermediate results can be big

## Iterator Model<sup>58</sup>

- Idea: stream tuples through operators
- Every operator implements set of functions:
  - open(): initialization, configure with child operators
  - next(): return next tuple (or indicate end of stream)
  - close(): free resources

## Iterator Model<sup>58</sup>

- Idea: stream tuples through operators
- Every operator implements set of functions:
  - open(): initialization, configure with child operators
  - next(): return next tuple (or indicate end of stream)
  - close(): free resources
- Current tuple can be pass as pointer or held in global data space
   Possible: only single tuple is processed at a time

#### Iterator Model: Example

```
struct TableScan : Iter {
 Table* table:
 Table::iterator it;
 void open() { it = table.begin(); }
 Tuple* next() {
   if (it != table.end())
     return *it++;
   return nullptr;
 } };
struct Select : Iter {
 Predicate p;
 Iter base:
 void open() { base.open(); }
 Tuple* next() {
   while (Tuple* t = base.next())
     if (p(t))
       return t;
   return nullptr;
 } }:
```

## Iterator Model: Example

```
struct TableScan : Iter {
 Table* table:
 Table::iterator it:
 void open() { it = table.begin(); }
 Tuple* next() {
   if (it != table.end())
     return *it++:
   return nullptr;
 } }:
struct Select : Iter {
 Predicate p:
 Iter base:
 void open() { base.open(); }
 Tuple* next() {
   while (Tuple* t = base.next())
     if (p(t))
       return t;
   return nullptr;
 } }:
```

```
struct Cross : Iter {
  Iter left, right;
 Tuple* curLeft = nullptr;
 void open() { left.open(); }
 Tuple* next() {
   while (true) {
     if (!curLeft) {
       if (!(curLeft = left.next()))
         return nullptr;
       right.open();
     3
     if (Tuple* tr = right.next())
       return concat(curLeft, tr);
     curLeft = nullptr;
 }
1:
```

HashJoin builds hash table on first read; materialization might be useful

#### Iterator Model

- "Pull-based" approach
- Widely used (e.g., Postgres)
- Often have separate function for first() or rewind

#### Iterator Model

- "Pull-based" approach
- Widely used (e.g., Postgres)
- Often have separate function for first() or rewind
- + Fairly straight-forward to implement
- $+\,$  Avoids data copies, no dynamic compilation
- Only single tuple processed at a time, bad locality
- Huge amount virtual function calls

## Push-based Model<sup>59</sup>

Idea: operators push tuples through query plan bottom-up

Every operator implements set of functions:

- open(): initialization, store parents
- produce(): produce items
  - Table scan calls consume() of parents
  - Others call produce() of their child
- consume(): consume items from children, push them to parents



### Push-based Model: Example

```
struct TableScan {
 Table table;
 Consumer cons;
 void produce() {
   for (Tuple* t : table)
     cons.consume(t, this);
 }
};
struct Select {
 Predicate p;
  Producer prod;
 Consumer cons;
 void produce() { prod.produce(); }
 void consume(Tuple* t, Producer src) {
   if(p(t))
     cons.consume(t)
 }
};
```

### Push-based Model: Example

```
struct TableScan {
 Table table:
 Consumer cons;
 void produce() {
   for (Tuple* t : table)
     cons.consume(t, this);
 }
};
struct Select {
 Predicate p:
 Producer prod:
 Consumer cons;
 void produce() { prod.produce(); }
 void consume(Tuple* t, Producer src) {
   if (p(t))
     cons.consume(t)
 }
};
```

```
struct Cross {
 Producer left, right;
 Consumer cons:
 Tuple* curLeft = nullptr;
 void produce() { left.produce(); }
 // Materializing one side might be better
 void consume(Tuple* t, Producer src) {
   if (src == left) {
     curLeft = t:
     right.produce();
   } else { // src == right
     cons.consume(concat(curLeft, t));
 }
1:
```

#### Push-based Model

"Push-based" approach

More recent approach

#### Push-based Model

- "Push-based" approach
- More recent approach
- + Fairly straight-forward, but less intuitive than iterator
- + Avoids data copies, no dynamic compilation
- $-\,$  Only single tuple processed at a time, bad locality
- Huge amount virtual function calls

## Pull-based Model vs. Push-based Model<sup>60</sup>

- Two fundamentally different approaches
- Push-based approach can handle DAG plans better
  - Pull-model: needs explicit materialization or redundant iteration
  - Push-model: simply call multiple consumers



<sup>&</sup>lt;sup>60</sup>A Shaikhha, M Dashti, and C Koch. "Push versus pull-based loop fusion in query engines". In: Journal of Functional Programming 28 (2018).

## Pull-based Model vs. Push-based Model<sup>60</sup>

- Two fundamentally different approaches
- Push-based approach can handle DAG plans better
  - Pull-model: needs explicit materialization or redundant iteration
  - Push-model: simply call multiple consumers
- Performance: nearly identical
  - Push-based model needs handling for limit operations otherwise table scan would not stop, even all tuples are dropped
- But:

## Pull-based Model vs. Push-based Model<sup>60</sup>

- Two fundamentally different approaches
- Push-based approach can handle DAG plans better
  - Pull-model: needs explicit materialization or redundant iteration
  - Push-model: simply call multiple consumers
- Performance: nearly identical
  - Push-based model needs handling for limit operations otherwise table scan would not stop, even all tuples are dropped
- But: push-based code is nice after inlining

<sup>60</sup>A Shaikhha, M Dashti, and C Koch. "Push versus pull-based loop fusion in query engines". In: Journal of Functional Programming 28 (2018).

# Pipelining

Some operators need materialized data for their operation

- Pipeline breaker: operator materializes input
- ► Full pipeline breaker: operator materializes complete input before producing
- Other operators can be *pipelined* (i.e., no materialization)

# Pipelining

Some operators need materialized data for their operation

- Pipeline breaker: operator materializes input
- ► Full pipeline breaker: operator materializes complete input before producing
- Other operators can be *pipelined* (i.e., no materialization)

#### Aggregations

- Join needs one side materialized (pipeline breaker on one side)
- Sorting needs all data (full pipeline breaker)

# Pipelining

Some operators need materialized data for their operation

- Pipeline breaker: operator materializes input
- ► Full pipeline breaker: operator materializes complete input before producing
- Other operators can be *pipelined* (i.e., no materialization)

#### Aggregations

- Join needs one side materialized (pipeline breaker on one side)
- Sorting needs all data (full pipeline breaker)

System needs to take care of semantics, e.g. for memory management

#### Code Generation for Push-Based Model

Inlining code in push-based model yields nice code

- No virtual function calls
- Producer iterates over materialized tuples and loads relevant data
  - Tight loop over base table data locality
- Operators of parent operators are applied inside the loop
- Pipeline breaker materializes result (e.g., into hash table)

## Code Generation: Example



#### Code Generation: Example

```
\sigma_{s.matrnr=h.matrnr}
|
\times
\checkmark
studenten hoeren
```

```
struct Query {
 Output out;
 Table tabLeft, tabRight;
 Tuple* curLeft = nullptr;
 void produce() {
   for (Tuple* tl : tabLeft) {
     curLeft = tl:
     for (Tuple* tr : tabRight) {
       Tuple* t = concat(curLeft, tr);
       if (t.s_matrnr == t.h_matrnr)
         out.write(t);
     }
}:
```

#### How to Generate Code

Code generator executes produce/consume methods

- Method bodies don't do actual operations, but construct code
- E.g., call IRBuilder
- Call to helper functions for complex operations e.g. hash table insert/lookup, string operations, memory allocation, etc.
- Resulting code doesn't contain produce/consume methods only loops that iterate over data
  - No overhead of function calls

#### How to Generate Code

Code generator executes produce/consume methods

- Method bodies don't do actual operations, but construct code
- E.g., call IRBuilder
- Call to helper functions for complex operations e.g. hash table insert/lookup, string operations, memory allocation, etc.
- Resulting code doesn't contain produce/consume methods only loops that iterate over data
  - No overhead of function calls
- Generate (at most) one function per pipeline
  - Allows for parallel execution of different pipelines

Code generation allows for substantial performance increase

- Fairly popular, even in commercial systems, despite engineering effort
- Competence in compiler engineering is a problem, though

Code generation allows for substantial performance increase

- Fairly popular, even in commercial systems, despite engineering effort
- Competence in compiler engineering is a problem, though
- ► Bytecode
  - Extremely popular: fairly simple, portable, and flexible

Code generation allows for substantial performance increase

- Fairly popular, even in commercial systems, despite engineering effort
- Competence in compiler engineering is a problem, though

#### Bytecode

- Extremely popular: fairly simple, portable, and flexible
- ▶ Machine code through programming language (C, C++, Scala, ...)
  - Also popular: no compiler knowledge required, but compile-times are bad
- Machine code through compiler IR (mostly LLVM)
- Machine code through specialized IR (Umbra only)


# Case Study: Amazon Redshift<sup>61</sup>

"Redshift generates C++ code specific to the query plan and the schema being executed. The generated code is then compiled and the binary is shipped to the compute nodes for execution [12, 15, 17]. Each compiled file, called a segment, consists of a pipeline of operators, called steps. Each segment (and each step within it) is part of the physical query plan. Only the last step of a segment can break the pipeline."

### Case Study: Amazon Redshift<sup>62</sup>



"Figure 7(a) illustrates [...] from an out-of-box TPC-H 30TB dataset [...]. The TPC-H benchmark workload runs on this instance every 30 minutes and we measure the end-to-end runtime. Over time, more and more optimizations are automatically applied reducing the total work- load runtime. After all recommendations have been applied, the workload runtime is reduced by 23% (excluding the first execution that is higher due to compilation).

# Compile Times: Umbra



Problem: still only process single tuple at a time

Problem: still only process single tuple at a time

- Doesn't utilize vector extensions of CPUs
- Idea: process multiple tuples at once
  - Also allows eliminating data-dependent branches, which not well-predictable
  - ▶ Esp. relevant when selectivity is between 10–90%
- ► Use of SIMD instructions requires column-wise store
  - Row-wise store would require gather operation for each load
  - Gather is very expensive

#### Vectorized Execution: SIMD Instructions

Obvious candidate:

#### Vectorized Execution: SIMD Instructions

Obvious candidate: initial selection over tables

- Load vector of elements, use SIMD operations for comparison
- Write back compressed result to temporary location for use in subsequent operations
- Special compress instructions (AVX-512, SVE) highly beneficial
- Other operations much more difficult to vectorize
  - Initial hash table lookup requires gather; collisions difficult
  - When many elements are masked out, performance suffers

- Bytecode interpretation substantially benefits from vectorized execution
- Key benefit: less dispatch overhead
- ▶ Typically much larger "vectors" (>1000)

- Bytecode interpretation substantially benefits from vectorized execution
- Key benefit: less dispatch overhead
- ▶ Typically much larger "vectors" (>1000)
- Comparison with non-vectorized machine code generation:
  - Vectorization often beneficial for initial scan
  - Code generation is faster than bytecode-interpred vec. execution
  - But: a good vectorized engine is not necessarily slow
- Vectorized execution probably more popular than code generation

# Query Compilation – Summary

- Databases have trade-off between low latency and high throughput
- Evaluation needed for operators and subscripts
- Subscripts easy to compile
- > Operator execution: full materialization vs. pipelined execution
- Pull-based vs. push-based execution
- Push-based allows for good code generation
- Bytecode and programming languages are widely used in practice
- Vectorized execution improves performance without native code gen.

### Query Compilation – Questions

- Why are low compile times important for databases?
- What is the difference between push-based and pull-based execution?
- > Why does push-based execution allow for higher performance?
- How to generate code for a query?
- How does vectorized execution improve performance?
- Why do many database engines not use machine code generation?