Competitive Training of LLMs
Context
LLMs increasingly rely on synthetic data for continued improvement, as most publicly available sources of training data (e.g., public GitHub repositories) have already been mined extensively for existing models. However, ensuring the correctness and usefulness of synthetic code remains a major challenge.
Motivation
This project proposes a competitive training framework inspired by generative adversarial networks (GANs):
- One model generates synthetic code samples
- Another model evaluates and tests correctness
- Feedback from the evaluator is used to iteratively improve generation quality (a minimal loop sketch follows this list)
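To make the loop structure concrete, here is a minimal, self-contained sketch of one generator–evaluator round. All function names here (generate_code, evaluate_code, update_generator) are hypothetical stand-ins introduced for illustration; a real system would back them with a code LLM, an execution harness, and a policy-gradient update.

```python
# Sketch of a generator-evaluator feedback loop. All functions are
# hypothetical placeholders, not an existing API.
import random

def generate_code(prompt: str) -> str:
    # Hypothetical generator: stands in for sampling from a code LLM.
    body = random.choice(["return a + b", "return a - b"])
    return f"def solve(a, b):\n    {body}\n"

def evaluate_code(code: str) -> float:
    # Hypothetical evaluator: stands in for a critic model plus unit tests.
    namespace: dict = {}
    exec(code, namespace)  # run the candidate in an isolated namespace
    return 1.0 if namespace["solve"](2, 3) == 5 else 0.0

def update_generator(prompt: str, code: str, reward: float) -> None:
    # Placeholder for the policy update (e.g., a REINFORCE or PPO step).
    print(f"reward={reward:.1f} for sample:\n{code}")

prompt = "Write solve(a, b) that returns the sum of a and b."
for _ in range(3):  # a few generator-evaluator rounds
    sample = generate_code(prompt)
    reward = evaluate_code(sample)
    update_generator(prompt, sample, reward)
```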
Goal
The long-term objective is to build a scalable system for generating verified, high-quality synthetic code for future LLM training.
Tasks
- Implement generator–evaluator training loops
- Develop automatic code testing pipelines
- Create reward functions based on execution success (see the reward sketch after this list)
- Benchmark models trained on synthetic data against models trained on real data
- Investigate adversarial or cooperative training dynamics
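As one possible starting point for an execution-based reward, the sketch below runs a candidate solution against a unit-test file in a subprocess and returns a binary reward. The file names, the pytest invocation (pytest is assumed to be installed), and the 1.0/0.0 scoring are illustrative assumptions, not a prescribed design.

```python
# Sketch of an execution-based reward: run generated code against tests in
# a subprocess and score on exit status. Names and scoring are illustrative.
import subprocess
import sys
import tempfile
from pathlib import Path

def execution_reward(solution: str, test_code: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the tests pass, 0.0 on failure or timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "test_solution.py", "-q"],
                cwd=tmp, capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # infinite loops and hangs score zero
        return 1.0 if result.returncode == 0 else 0.0

solution = "def add(a, b):\n    return a + b\n"
tests = "from solution import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
print(execution_reward(solution, tests))
```

Since generated code is untrusted, a production pipeline would want stronger isolation than a bare subprocess, for example containers or a sandboxed runtime.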
Requirements
- Strong programming skills
- Familiarity with machine learning concepts
- Interest in synthetic data generation
- Experience with PyTorch / HuggingFace recommended