Competitive Training of LLMs
Context
LLMs increasingly rely on synthetic data for continued improvement, as most publicly available sources of training data (e.g., public GitHub repositories) have already been mined extensively for existing models. However, ensuring the correctness and usefulness of synthetic code remains a major challenge.
Motivation
This project proposes a competitive training framework inspired by generative adversarial networks (GANs):
- One model generates synthetic code samples
- Another model evaluates and tests correctness
- Feedback from the evaluator is used to iteratively improve generation quality (a minimal loop sketch follows this list)
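To make the loop structure concrete, here is a minimal, self-contained sketch of one generator–evaluator round. All function names here (generate_code, evaluate_code, update_generator) are hypothetical stand-ins introduced for illustration; a real system would back them with a code LLM, an execution harness, and a policy-gradient update.

```python
# Sketch of a generator-evaluator feedback loop. All functions are
# hypothetical placeholders, not an existing API.
import random

def generate_code(prompt: str) -> str:
    # Hypothetical generator: stands in for sampling from a code LLM.
    body = random.choice(["return a + b", "return a - b"])
    return f"def solve(a, b):\n    {body}\n"

def evaluate_code(code: str) -> float:
    # Hypothetical evaluator: stands in for a critic model plus unit tests.
    namespace: dict = {}
    exec(code, namespace)  # run the candidate in an isolated namespace
    return 1.0 if namespace["solve"](2, 3) == 5 else 0.0

def update_generator(prompt: str, code: str, reward: float) -> None:
    # Placeholder for the policy update (e.g., a REINFORCE or PPO step).
    print(f"reward={reward:.1f} for sample:\n{code}")

prompt = "Write solve(a, b) that returns the sum of a and b."
for _ in range(3):  # a few generator-evaluator rounds
    sample = generate_code(prompt)
    reward = evaluate_code(sample)
    update_generator(prompt, sample, reward)
```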
Goal
The long-term objective is to build a scalable system for generating verified, high-quality synthetic code for future LLM training.
Tasks
- Implement generator–evaluator training loops
- Develop automatic code testing pipelines
- Create reward functions based on execution success (see the reward sketch after this list)
- Benchmark models trained on synthetic data against models trained on real data
- Investigate adversarial or cooperative training dynamics
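As one possible starting point for an execution-based reward, the sketch below runs a candidate solution against a unit-test file in a subprocess and returns a binary reward. The file names, the pytest invocation (pytest is assumed to be installed), and the 1.0/0.0 scoring are illustrative assumptions, not a prescribed design.

```python
# Sketch of an execution-based reward: run generated code against tests in
# a subprocess and score on exit status. Names and scoring are illustrative.
import subprocess
import sys
import tempfile
from pathlib import Path

def execution_reward(solution: str, test_code: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the tests pass, 0.0 on failure or timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "test_solution.py", "-q"],
                cwd=tmp, capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # infinite loops and hangs score zero
        return 1.0 if result.returncode == 0 else 0.0

solution = "def add(a, b):\n    return a + b\n"
tests = "from solution import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
print(execution_reward(solution, tests))
```

Since generated code is untrusted, a production pipeline would want stronger isolation than a bare subprocess, for example containers or a sandboxed runtime.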
Requirements
- Strong programming skills
- Familiarity with machine learning concepts
- Interest in synthetic data generation
- Experience with PyTorch / HuggingFace recommended