Diffusion-based Code Language Model
Context
While standard code models generate text one token at a time (autoregressively), Diffusion Language Models (DLMs) generate and refine an entire block of code simultaneously: starting from a fully masked sequence, the model iteratively denoises every position in parallel. This lets it look ahead and fix structural errors in a non-linear fashion, turning inference into an iterative, whole-sequence optimization problem rather than a single left-to-right pass.
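To make that decoding loop concrete, here is a minimal, self-contained sketch of masked-diffusion sampling with confidence-based unmasking, in the style of LLaDA's sampler. Everything is illustrative: `ToyDenoiser` stands in for a real pretrained model, and the constants and names are assumptions, not a library API.

```python
# Minimal sketch of masked-diffusion decoding with confidence-based unmasking.
# ToyDenoiser is a stand-in for a pretrained model; all names and constants
# are illustrative.
import torch

VOCAB, MASK_ID, LENGTH, STEPS = 1000, 0, 32, 8

class ToyDenoiser(torch.nn.Module):
    """Placeholder for a pretrained masked-diffusion LM (e.g. LLaDA)."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 64)
        self.head = torch.nn.Linear(64, VOCAB)

    def forward(self, ids):              # ids: (batch, len) -> (batch, len, vocab)
        return self.head(self.emb(ids))

@torch.no_grad()
def diffusion_decode(model, length=LENGTH, steps=STEPS):
    ids = torch.full((1, length), MASK_ID)   # start from an all-[MASK] sequence
    for step in range(steps):
        logits = model(ids)                  # predict every position in parallel
        conf, pred = logits.softmax(-1).max(-1)  # greedy token + its probability
        # Commit only the most confident fraction this step; everything else
        # stays masked so later steps can revise it with more context.
        k = length * (step + 1) // steps
        keep = conf.topk(k, dim=-1).indices
        next_ids = torch.full_like(ids, MASK_ID)
        next_ids.scatter_(1, keep, pred.gather(1, keep))
        ids = next_ids
    return ids

print(diffusion_decode(ToyDenoiser()))       # (1, 32) tensor of sampled token ids
```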
Goal
The student will fine-tune a discrete diffusion model such as LLaDA-Instruct for reasoning about code execution, using a custom execution-tracing dataset for Python code.
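For the fine-tuning step, the sketch below shows the supervised objective used by masked-diffusion models such as LLaDA, adapted to a tracing example: mask a random fraction of the response tokens (the execution trace), leave the prompt clean, and compute cross-entropy only on the masked positions. The model, token ids, and `MASK_ID` are placeholders for the student's actual setup; only the loss structure is the point.

```python
# Hedged sketch of masked-diffusion supervised fine-tuning on one trace example.
# Model, tokenizer, and ids are placeholders: corrupt the response, keep the
# prompt intact, score only the masked positions.
import torch
import torch.nn.functional as F

MASK_ID = 0

def sft_loss(model, prompt_ids, response_ids):
    """prompt_ids: (P,) code + question; response_ids: (R,) execution trace."""
    t = torch.rand(())                            # masking rate ~ U(0, 1)
    mask = torch.rand(response_ids.shape) < t     # mask each trace token w.p. t
    if not mask.any():                            # guard the degenerate t ~ 0 draw
        mask[torch.randint(mask.numel(), (1,))] = True
    noisy = torch.where(mask, torch.full_like(response_ids, MASK_ID), response_ids)
    ids = torch.cat([prompt_ids, noisy])          # the prompt is never masked
    logits = model(ids.unsqueeze(0))[0, prompt_ids.numel():]  # trace positions only
    # Cross-entropy on masked positions; the 1/t weighting follows the
    # masked-diffusion ELBO (up to normalization).
    return F.cross_entropy(logits[mask], response_ids[mask]) / t.clamp_min(1e-2)
```

In practice this loss would run over batches drawn from the tracing dataset, with each trace serialized as the response to a prompt containing the program and its inputs.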
Requirements
- Knowledge of Machine Learning and PyTorch.
- Passion for state-of-the-art machine learning models, especially diffusion models.
Pointers
- LLaDA: Large Language Diffusion Models (Nie et al., 2025) [https://arxiv.org/pdf/2502.09992]
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (Lou et al., 2023) [https://arxiv.org/pdf/2310.16834]