Apertus: Improving Coding Capabilities
Context
The Apertus project from EPFL and ETH Zurich focuses on developing a Swiss-based large language model (LLM) with strong multilingual capabilities. While the model performs competitively on general language tasks, it currently struggles with structured coding tasks such as:
- Long-horizon reasoning over multiple files
- Code refactoring and abstraction
- Repository-level understanding
- Debugging and test-driven development
- Reliable code generation under constraints
Goal
The goal of this project is to improve Apertus’ performance on coding tasks through:
- Advanced masking strategies (see the sketch at the end of this section)
- Dataset augmentation and transformation
- Code refactoring pipelines
- Reinforcement Learning (RL)-based optimization
- Tool-calling and execution-aware training
Students will explore both academic and practical approaches to improving LLM-based code generation systems.
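As an illustration of what an "advanced masking strategy" can look like, here is a minimal sketch of fill-in-the-middle (FIM) span masking, a common objective for training code models to infill missing code. The sentinel tokens `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are placeholders for illustration, not Apertus-specific vocabulary.

```python
import random

def to_fim_example(code: str, rng: random.Random) -> str:
    """Rewrite a code snippet into prefix-suffix-middle order so the
    model learns to infill the missing middle span."""
    # Pick two cut points that split the code into prefix / middle / suffix.
    a, b = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    # The model is trained to generate `middle` after seeing prefix + suffix.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

rng = random.Random(0)
print(to_fim_example("def add(a, b):\n    return a + b\n", rng))
```

Training on examples reordered this way teaches the model to condition on both the code before and after a gap, which directly supports editing and refactoring workloads.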
Tasks
- Design and implement data augmentation pipelines for code (see the first sketch after this list)
- Develop evaluation benchmarks for coding capabilities
- Experiment with reinforcement learning from human or AI feedback (RLHF/RLAIF) for programming tasks
- Integrate tool usage into training loops (e.g., compilers, linters; see the second sketch after this list)
- Analyze performance improvements via automated testing
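To make the data augmentation task concrete, below is a minimal sketch of one semantics-preserving transformation, identifier renaming via Python's standard `ast` module. A real pipeline would compose many such transformations and handle scoping edge cases (shadowing, nested functions) that this sketch ignores.

```python
import ast

class RenameArgs(ast.NodeTransformer):
    """Rename function arguments (and their uses) to anonymised names."""
    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        mapping = {arg.arg: f"v{i}" for i, arg in enumerate(node.args.args)}
        for arg in node.args.args:
            arg.arg = mapping[arg.arg]
        # Rewrite every Name node in the body that refers to an argument.
        for child in ast.walk(node):
            if isinstance(child, ast.Name) and child.id in mapping:
                child.id = mapping[child.id]
        return node

src = "def add(a, b):\n    return a + b\n"
tree = RenameArgs().visit(ast.parse(src))
print(ast.unparse(tree))  # prints the function with args renamed to v0, v1
```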
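For the tool-integration task, the sketch below shows one way to turn compiler and linter feedback into a scalar reward usable in an RL loop. The three-level reward scheme and the use of `pyflakes` (assumed to be installed) are illustrative choices, not prescribed by the project.

```python
import os
import subprocess
import sys
import tempfile

def tool_reward(code: str) -> float:
    """Score generated code: 1.0 = compiles and lints clean,
    0.5 = compiles but has lint findings, 0.0 = syntax error."""
    try:
        compile(code, "<generated>", "exec")  # cheap syntax/compile check
    except SyntaxError:
        return 0.0
    # Write the snippet to a temp file so an external linter can inspect it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        lint = subprocess.run(
            [sys.executable, "-m", "pyflakes", path],
            capture_output=True, text=True,
        )
        return 1.0 if lint.returncode == 0 else 0.5
    finally:
        os.unlink(path)

print(tool_reward("def f(x):\n    return x + 1\n"))  # 1.0 (clean)
print(tool_reward("def f(x) return x"))              # 0.0 (syntax error)
```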
Requirements
- Strong programming skills (Python required)
- Basic understanding of Machine Learning
- Interest in LLMs and generative models
- Experience with PyTorch, HuggingFace Transformers, or testing frameworks is a plus