Diffusion-based Code Language Model
Context
While standard code models generate text one token at a time (autoregressively), Diffusion Language Models (DLMs) generate and refine an entire block of code simultaneously: starting from a fully masked sequence, the model iteratively denoises every position in parallel. This lets it look ahead and fix structural errors in a non-linear fashion, turning inference into an iterative, whole-sequence optimization problem rather than a single left-to-right pass.
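To make that decoding loop concrete, here is a minimal, self-contained sketch of masked-diffusion sampling with confidence-based unmasking, in the style of LLaDA's sampler. Everything is illustrative: `ToyDenoiser` stands in for a real pretrained model, and the constants and names are assumptions, not a library API.

```python
# Minimal sketch of masked-diffusion decoding with confidence-based unmasking.
# ToyDenoiser is a stand-in for a pretrained model; all names and constants
# are illustrative.
import torch

VOCAB, MASK_ID, LENGTH, STEPS = 1000, 0, 32, 8

class ToyDenoiser(torch.nn.Module):
    """Placeholder for a pretrained masked-diffusion LM (e.g. LLaDA)."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 64)
        self.head = torch.nn.Linear(64, VOCAB)

    def forward(self, ids):              # ids: (batch, len) -> (batch, len, vocab)
        return self.head(self.emb(ids))

@torch.no_grad()
def diffusion_decode(model, length=LENGTH, steps=STEPS):
    ids = torch.full((1, length), MASK_ID)   # start from an all-[MASK] sequence
    for step in range(steps):
        logits = model(ids)                  # predict every position in parallel
        conf, pred = logits.softmax(-1).max(-1)  # greedy token + its probability
        # Commit only the most confident fraction this step; everything else
        # stays masked so later steps can revise it with more context.
        k = length * (step + 1) // steps
        keep = conf.topk(k, dim=-1).indices
        next_ids = torch.full_like(ids, MASK_ID)
        next_ids.scatter_(1, keep, pred.gather(1, keep))
        ids = next_ids
    return ids

print(diffusion_decode(ToyDenoiser()))       # (1, 32) tensor of sampled token ids
```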
Goal
The student will fine-tune a discrete diffusion model such as LLaDA-Instruct for reasoning about code execution, using a custom execution-tracing dataset for Python code.
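For the fine-tuning step, the sketch below shows the supervised objective used by masked-diffusion models such as LLaDA, adapted to a tracing example: mask a random fraction of the response tokens (the execution trace), leave the prompt clean, and compute cross-entropy only on the masked positions. The model, token ids, and `MASK_ID` are placeholders for the student's actual setup; only the loss structure is the point.

```python
# Hedged sketch of masked-diffusion supervised fine-tuning on one trace example.
# Model, tokenizer, and ids are placeholders: corrupt the response, keep the
# prompt intact, score only the masked positions.
import torch
import torch.nn.functional as F

MASK_ID = 0

def sft_loss(model, prompt_ids, response_ids):
    """prompt_ids: (P,) code + question; response_ids: (R,) execution trace."""
    t = torch.rand(())                            # masking rate ~ U(0, 1)
    mask = torch.rand(response_ids.shape) < t     # mask each trace token w.p. t
    if not mask.any():                            # guard the degenerate t ~ 0 draw
        mask[torch.randint(mask.numel(), (1,))] = True
    noisy = torch.where(mask, torch.full_like(response_ids, MASK_ID), response_ids)
    ids = torch.cat([prompt_ids, noisy])          # the prompt is never masked
    logits = model(ids.unsqueeze(0))[0, prompt_ids.numel():]  # trace positions only
    # Cross-entropy on masked positions; the 1/t weighting follows the
    # masked-diffusion ELBO (up to normalization).
    return F.cross_entropy(logits[mask], response_ids[mask]) / t.clamp_min(1e-2)
```

In practice this loss would run over batches drawn from the tracing dataset, with each trace serialized as the response to a prompt containing the program and its inputs.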
Requirements
- Knowledge of Machine Learning and PyTorch.
- Passion for state-of-the-art machine learning models, especially diffusion models.
Pointers
- LLaDA: Large Language Diffusion Models (Nie et al., 2025) [https://arxiv.org/pdf/2502.09992]
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (Lou et al., 2023) [https://arxiv.org/pdf/2310.16834]