Code Generation for Data Processing

Contents

Code generation is a key technique for efficient program execution and data processing. This lecture will cover the following topics from a practical perspective with accompanying hands-on exercises:

  • Execution models of programs (interpretation, bytecode, machine code generation, etc.)
  • Program representations (source code, intermediate representations (IRs), different forms of bytecode)
  • Classical techniques of code generation
    • SSA and optimization techniques, exemplary described on LLVM-IR
    • Machine code generation: instruction selection and register allocation
  • Execution of programs in virtual machines (e.g., WebAssembly, BPF, JavaScript)
    • Sandboxing and optimizations for JIT compilation
  • Execution of database queries (e.g., SQL, data frame API)
    • Execution models and code representations
  • Execution of machine code/binary translation (e.g., RISC-V)
    • Specifics when translating machine code

Organization

  • Lecture with integrated exercises: Thu 10–14 (c.t., with break) in 02.11.018
  • Exercises will include hands-on programming tasks
    • (Basic) C++ knowledge is required
  • Language: English
  • Module: CIT3230001, 6 ECTS, Bachelor/Master elective
    • Area "Databases and Information Systems" for Informatics, Wirtschaftsinformatik/Information Systems, Informatics: Games Engineering, Biomedical Engineering
    • B1.2 "Advanced Topics in Data Engineering" for Data Engineering and Analystics
  • Written exam (90 minutes), might change to oral on low registration count. Exam from winter 2022/23: [exam2223.pdf].
  • Zulip stream for this lecture; private contact via e-mail

Prerequisites

The course is aimed at bachelor/master students who have taken the following (or similar) courses:

  • IN0004 Introduction to Computer Architecture
  • IN0008 Fundamentals of Databases

Material

Material and exercises will be regularly provided throughout the semester.

Script (updated regularly with new lecture content, includes content of slides and additional information and comments): [book.pdf]

Note: The schedule below is preliminary and is likely to change during the semester.

DateLecture TopicHomework
17.10. Overview, Motivation, Interpretation Techniques [lec01.pdf]
Exercise: (no exercise session)
[hw1.txt] [hw1-sol.c]
24.10. Compiler Front-end [lec02.pdf]
Exercise: discussion of homework
[hw2.txt] [hw2-sol.cc]
31.10. IR Concepts, Control Flow Graph, SSA Construction [lec03.pdf] [prs03.pdf]
Exercise: IR design [ex03.txt]
07.11. LLVM-IR [lec04.pdf]
Exercise: getting started with LLVM [ex04.txt]
[hw3.txt] [hw3-sol.cc]
14.11. Analyses and Transformations [lec05.pdf] [prs05.pdf]
Exercise: writing an LLVM-IR pass [ex05.txt]
21.11. Vectorization (canceled)
Exercise: auto-vectorization
28.11. Instruction Selection [lec07.pdf] [prs07.pdf]
Exercise: discussion of hw2 and hw3
[hw4.txt] [hw4-sol.cc]
05.12. (no lecture/exercise, Dies Academicus)
12.12. Register Allocation [lec08.pdf] [prs08.pdf]
Exercise: SSA destruction
[hw5.txt]
19.12. Object Files, Linker, Loader [lec09.pdf]
Exercise: ELF file construction [ex09.txt]
09.01. Unwinding and Debuginfo
Exercise: unwind info generation
16.01. JIT-compilation and Sandboxing
Exercise: WebAssembly hands-on
23.01. Query Compilation
Exercise: SQLite hands-on
30.01. Binary Translation tentively: Vectorization
Exercise: QEMU hands-on
06.02. TBD (possible guest lecture)
Exercise: TBD/wrap up