Model Contamination in the Wild

Context

Modern ML models are pre-trained on extremely large datasets; this is how, for example, today's LLMs come about. Pre-training corpora often include large collections of GitHub repositories, sometimes approaching the whole of public GitHub. This becomes a problem at evaluation time: to judge whether a model generalizes to unseen data, the training and test datasets must be kept separate, ideally with zero overlap between their samples. However, it has been shown that models are sometimes trained on the very benchmarks and datasets on which they are later evaluated.
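To make the overlap problem concrete, here is a minimal sketch of one common contamination check: measuring how many word-level n-grams of a test sample also appear in the training data. The function names and the toy strings are illustrative, not taken from any of the referenced papers.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(train_doc: str, test_doc: str, n: int = 8) -> float:
    """Fraction of the test document's n-grams that also occur in the training data."""
    test = ngrams(test_doc, n)
    if not test:
        return 0.0
    return len(test & ngrams(train_doc, n)) / len(test)

train = "def add(a, b): return a + b  # a tiny helper used everywhere"
test_clean = "def mul(a, b): return a * b  # unrelated snippet"
test_leaked = "def add(a, b): return a + b  # a tiny helper used everywhere"

print(contamination_rate(train, test_clean, n=4))   # 0.0: no shared 4-grams
print(contamination_rate(train, test_leaked, n=4))  # 1.0: fully contaminated
```

Real deduplication pipelines work the same way in spirit, just at corpus scale (hashed n-grams, suffix arrays, or MinHash instead of Python sets).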

Goal

In this project, the student will try to replicate results from the recent papers ConStat and Min-K%++, and potentially extend them to benchmarks popular in software engineering (such as Defects4J). Further experimentation could involve machine unlearning, e.g. unlearning a benchmark from a model without degrading its overall quality.
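As a rough intuition for the Min-K% family of contamination detectors: score a candidate text by the average log-probability of its k% least likely tokens; texts the model has memorized rarely contain very improbable tokens, so a higher (less negative) score is evidence of training membership. The sketch below uses hand-picked stand-in log-probabilities rather than a real model, so the numbers are illustrative only (Min-K%++ additionally normalizes each token's score, which is omitted here).

```python
def min_k_percent_score(token_logprobs, k=0.2):
    """Min-K% Prob: average log-probability of the k% lowest-probability tokens.
    Higher (less negative) scores suggest the text was seen during training."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the k% least likely tokens
    return sum(lowest) / n

# Stand-in per-token log-probs a model might assign (not from a real LLM):
seen = [-0.1, -0.3, -0.2, -0.4, -0.1, -0.2, -0.3, -0.2, -0.1, -0.5]
unseen = [-0.1, -2.5, -0.2, -3.1, -0.1, -4.0, -0.3, -0.2, -2.8, -0.5]

print(min_k_percent_score(seen))    # -0.45: few surprising tokens
print(min_k_percent_score(unseen))  # -3.55: several very unlikely tokens
```

In practice the per-token log-probabilities would come from the model under test (e.g. via a forward pass in PyTorch), and the score is thresholded or compared across candidate texts.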

Requirements

Some familiarity with ML and NLP and with coding in PyTorch/TensorFlow is preferable and advantageous, but can also be picked up during the project. A strict requirement is solid hands-on experience with Python. Experience with Git is also recommended, since the work will build on tools from other GitHub repositories.

Pointers

ConStat
Defects4J
Min-K%++

Contact

Roman Machacek, roman.machacek@unibe.ch