RL-based Training for Code in LLMs

Context

Large Language Models (LLMs) have shown strong performance in code generation, completion, and repair tasks. However, supervised pretraining on massive code corpora is limited by data quality, the lack of explicit feedback, and the inability to capture correctness beyond next-token prediction. Recent research has therefore explored Reinforcement Learning (RL)-based training approaches to refine LLMs for code. By leveraging feedback signals such as compilation success, test-case execution, or static-analysis warnings, models can be trained to better align with correctness and developer intent.
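As a concrete illustration, the sketch below shows one way test execution could be turned into such a feedback signal: a candidate program is first checked for syntactic validity, then run together with its unit tests in a subprocess, and the outcome is mapped to a scalar reward. The function name run_tests_reward, the specific reward values, and the example snippet are illustrative assumptions, not part of any particular framework.

```python
"""Hedged sketch: turning test execution into a scalar reward signal."""
import os
import subprocess
import sys
import tempfile
import textwrap


def run_tests_reward(candidate_code: str, test_code: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the candidate passes the tests, 0.1 if it merely parses,
    and 0.0 otherwise (a simple correctness-oriented reward)."""
    program = candidate_code + "\n\n" + test_code
    # Cheap static gate: does the code even parse?
    try:
        compile(program, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return 1.0 if result.returncode == 0 else 0.1
    except subprocess.TimeoutExpired:
        return 0.1  # parsed but did not finish within the time budget
    finally:
        os.unlink(path)


if __name__ == "__main__":
    candidate = textwrap.dedent("""
        def add(a, b):
            return a + b
    """)
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    print(run_tests_reward(candidate, tests))  # -> 1.0
```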

Motivation

Supervised learning provides only implicit training signals (predicting the next token), which may result in syntactically valid but semantically incorrect code. Reinforcement learning enables training LLMs with explicit correctness-oriented rewards, making them more reliable for real-world coding tasks. This seminar project investigates how different RL strategies (e.g., reward shaping, self-play, policy optimization) influence the quality and reliability of code generation. Special attention will be given to integrating execution-based rewards (unit tests, runtime checks) and static analysis rewards (linting, type-checking).
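The sketch below illustrates one way reward shaping could combine an execution-based reward with a static-analysis-style penalty. The AST checks are a toy stand-in for a real linter or type checker, and the weights, thresholds, and function names are assumptions made purely for illustration.

```python
"""Hedged sketch of reward shaping: execution reward minus a static penalty."""
import ast


def static_penalty(candidate_code: str) -> float:
    """Count a few easily detectable code smells as a toy proxy for linter warnings."""
    try:
        tree = ast.parse(candidate_code)
    except SyntaxError:
        return 1.0  # maximal penalty: the code does not even parse
    findings = 0
    for node in ast.walk(tree):
        # Bare `except:` clauses swallow all errors.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings += 1
        # Calls to eval/exec are commonly flagged by linters.
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            findings += 1
    return min(1.0, 0.25 * findings)


def shaped_reward(exec_reward: float, candidate_code: str,
                  static_weight: float = 0.3) -> float:
    """Blend the test-execution reward with the static-analysis penalty."""
    return exec_reward - static_weight * static_penalty(candidate_code)


if __name__ == "__main__":
    snippet = (
        "def safe_div(a, b):\n"
        "    try:\n"
        "        return a / b\n"
        "    except:\n"
        "        return None\n"
    )
    print(shaped_reward(1.0, snippet))  # 1.0 - 0.3 * 0.25 = 0.925
```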

Goal

The objective is to explore and evaluate reinforcement learning approaches for improving LLM code generation. Various RL-based training setups will be compared in terms of the correctness, quality, and reliability of the generated code.

The project aims to assess whether RL fine-tuning can significantly improve LLM coding abilities beyond supervised pretraining.
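To make such a training setup concrete, the sketch below shows a minimal REINFORCE-style fine-tuning loop: sample a completion from the current policy, score it with a correctness-oriented reward, and increase the likelihood of rewarded completions. It assumes a Hugging Face causal LM ("gpt2" stands in for a code-specialised model) and a toy, unsandboxed execution reward; practical setups typically use more robust policy-optimization methods such as PPO with a value baseline and a KL penalty toward the supervised model, which this sketch deliberately omits.

```python
"""Minimal REINFORCE-style fine-tuning loop (a sketch, not a full setup)."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative stand-in for a code-specialised LLM
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "def add(a, b):"
test = "assert add(2, 3) == 5"


def toy_reward(candidate: str, test_code: str) -> float:
    """Toy execution-based reward (unsandboxed exec; illustration only)."""
    try:
        scope: dict = {}
        exec(candidate, scope)   # define the candidate function
        exec(test_code, scope)   # run the unit test against it
        return 1.0
    except Exception:
        return 0.0


for step in range(3):
    inputs = tok(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # 1) Sample a completion from the current policy.
    with torch.no_grad():
        sample = model.generate(
            **inputs, do_sample=True, max_new_tokens=32,
            pad_token_id=tok.eos_token_id,
        )
    completion = tok.decode(sample[0, prompt_len:], skip_special_tokens=True)

    # 2) Score it with the correctness-oriented reward.
    reward = toy_reward(prompt + completion, test)

    # 3) Recompute log-probs of the sampled tokens with gradients enabled.
    logits = model(sample).logits[:, :-1, :]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, sample[:, 1:].unsqueeze(-1)).squeeze(-1)
    completion_logp = token_logp[:, prompt_len - 1:].sum()

    # 4) REINFORCE update: increase the likelihood of rewarded completions.
    loss = -reward * completion_logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: reward={reward:.2f}, loss={loss.item():.3f}")
```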

Requirements

Contact