Competitive Training of LLMs

Context

LLMs increasingly rely on synthetic data for continued improvement, because most publicly available code sources (e.g., GitHub) have already been used extensively to train existing models. However, ensuring that synthetic code is correct and useful remains a major challenge.

Motivation

This project proposes a competitive training framework inspired by GAN-like (generative adversarial network) systems.

Goal

The long-term objective is to build a scalable system for generating verified, high-quality synthetic code for future LLM training.

Tasks

Requirements

Contact