Data Contamination of LLMs
Context
A major challenge in evaluating modern LLMs is determining whether a model has already seen benchmark data during training; if it has, benchmark scores reflect memorization rather than genuine capability. This project focuses on detecting and mitigating such training data contamination.
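For concreteness, one common heuristic for contamination detection (not necessarily the method this project will adopt) is measuring n-gram overlap between a benchmark item and a training corpus. The sketch below is illustrative; the function names and the n-gram size are assumptions, not part of the project specification.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Set of word-level n-grams; a common unit for contamination checks."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, training_corpus: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus.
    High overlap suggests the item may have been seen during training."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_corpus, n)) / len(item_grams)
```

In practice the corpus side would be an index over billions of tokens rather than a single string, but the per-item statistic is the same.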
Motivation
Students will develop frameworks that:
- Refactor or transform code samples
- Generate semantically equivalent but unseen variants (a minimal sketch follows this list)
- Evaluate model generalization capabilities
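As one concrete illustration of such a transform, the sketch below uses Python's ast module to rename function arguments and local variables: the variant behaves identically but no longer matches the training text verbatim. The class name and the rename mapping are illustrative assumptions; a real pipeline would also need scope analysis to avoid renaming globals or builtins.

```python
import ast  # ast.unparse requires Python 3.9+

class RenameLocals(ast.NodeTransformer):
    """Semantics-preserving transform: rewrite identifiers per a mapping."""
    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

source = """
def total_price(items):
    total = 0
    for item in items:
        total += item
    return total
"""

tree = ast.parse(source)
variant = RenameLocals({"items": "xs", "total": "acc", "item": "x"}).visit(tree)
print(ast.unparse(variant))  # equivalent behavior, different surface form
```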
Goals
- Implement code augmentation pipelines
- Measure performance on transformed datasets
- Distinguish memorization from generalization (see the sketch after this list)
- Analyze contamination risks in benchmarks
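To make the memorization-vs-generalization goal concrete, a simple diagnostic is the mean per-item score drop between original and transformed benchmark items: a near-zero gap is consistent with generalization, while a large positive gap points to memorization. This sketch assumes per-item scores (e.g., pass@1) have already been computed; the placeholder numbers are illustrative, not real results.

```python
from statistics import mean

def memorization_gap(original_scores: list[float], variant_scores: list[float]) -> float:
    """Mean score drop when each benchmark item is replaced by a
    semantically equivalent variant. Large positive values suggest
    the model memorized the original items."""
    assert len(original_scores) == len(variant_scores)
    return mean(o - v for o, v in zip(original_scores, variant_scores))

# Placeholder per-item scores for illustration only (not measured data).
original = [1.0, 1.0, 0.0, 1.0]
variants = [0.0, 1.0, 0.0, 0.0]
print(f"memorization gap: {memorization_gap(original, variants):+.2f}")
```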
Requirements
- Strong programming skills
- Interest in evaluation and benchmarking
- Familiarity with ML tools is a plus