Apertus: Improving Coding Capabilities
Context
The Apertus project from EPFL and ETH Zurich focuses on developing a Swiss-based large language model (LLM) with strong multilingual capabilities. While the model performs competitively on general language tasks, it currently struggles with structured coding tasks such as:
- Long-horizon reasoning over multiple files
- Code refactoring and abstraction
- Repository-level understanding
- Debugging and test-driven development
- Reliable code generation under constraints
Goal
The goal of this project is to improve Apertus’ performance on coding tasks through:
- Advanced masking strategies (see the sketch at the end of this section)
- Dataset augmentation and transformation
- Code refactoring pipelines
- Reinforcement Learning (RL)-based optimization
- Tool-calling and execution-aware training
Students will explore both academic and practical approaches to improving LLM-based code generation systems.
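As an illustration of what an "advanced masking strategy" can look like, here is a minimal sketch of fill-in-the-middle (FIM) span masking, a common objective for training code models to infill missing code. The sentinel tokens `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are placeholders for illustration, not Apertus-specific vocabulary.

```python
import random

def to_fim_example(code: str, rng: random.Random) -> str:
    """Rewrite a code snippet into prefix-suffix-middle order so the
    model learns to infill the missing middle span."""
    # Pick two cut points that split the code into prefix / middle / suffix.
    a, b = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    # The model is trained to generate `middle` after seeing prefix + suffix.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

rng = random.Random(0)
print(to_fim_example("def add(a, b):\n    return a + b\n", rng))
```

Training on examples reordered this way teaches the model to condition on both the code before and after a gap, which directly supports editing and refactoring workloads.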
Tasks
- Design and implement data augmentation pipelines for code (see the first sketch after this list)
- Develop evaluation benchmarks for coding capabilities
- Experiment with reinforcement learning from human or AI feedback (RLHF/RLAIF) for programming tasks
- Integrate tool usage into training loops (e.g., compilers, linters; see the second sketch after this list)
- Analyze performance improvements via automated testing
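To make the data augmentation task concrete, below is a minimal sketch of one semantics-preserving transformation, identifier renaming via Python's standard `ast` module. A real pipeline would compose many such transformations and handle scoping edge cases (shadowing, nested functions) that this sketch ignores.

```python
import ast

class RenameArgs(ast.NodeTransformer):
    """Rename function arguments (and their uses) to anonymised names."""
    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        mapping = {arg.arg: f"v{i}" for i, arg in enumerate(node.args.args)}
        for arg in node.args.args:
            arg.arg = mapping[arg.arg]
        # Rewrite every Name node in the body that refers to an argument.
        for child in ast.walk(node):
            if isinstance(child, ast.Name) and child.id in mapping:
                child.id = mapping[child.id]
        return node

src = "def add(a, b):\n    return a + b\n"
tree = RenameArgs().visit(ast.parse(src))
print(ast.unparse(tree))  # prints the function with args renamed to v0, v1
```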
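For the tool-integration task, the sketch below shows one way to turn compiler and linter feedback into a scalar reward usable in an RL loop. The three-level reward scheme and the use of `pyflakes` (assumed to be installed) are illustrative choices, not prescribed by the project.

```python
import os
import subprocess
import sys
import tempfile

def tool_reward(code: str) -> float:
    """Score generated code: 1.0 = compiles and lints clean,
    0.5 = compiles but has lint findings, 0.0 = syntax error."""
    try:
        compile(code, "<generated>", "exec")  # cheap syntax/compile check
    except SyntaxError:
        return 0.0
    # Write the snippet to a temp file so an external linter can inspect it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        lint = subprocess.run(
            [sys.executable, "-m", "pyflakes", path],
            capture_output=True, text=True,
        )
        return 1.0 if lint.returncode == 0 else 0.5
    finally:
        os.unlink(path)

print(tool_reward("def f(x):\n    return x + 1\n"))  # 1.0 (clean)
print(tool_reward("def f(x) return x"))              # 0.0 (syntax error)
```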
Requirements
- Strong programming skills (Python required)
- Basic understanding of Machine Learning
- Interest in LLMs and generative models
- Experience with PyTorch, HuggingFace Transformers, or testing frameworks is a plus