Feature Engineering for Classification-Based Merge Conflict Resolution
Context
Merge conflict resolution remains a significant challenge in Git-based software development, as resolving conflicts manually slows down collaboration and reduces developer productivity. However, empirical research suggests that the vast majority of chunk resolutions found in practice can be derived from a fixed set of conflict resolution patterns, each of which combines the ours, theirs, and base parts of a conflicting chunk in a predefined way. These findings make it possible to frame merge conflict resolution as a classification problem, and thus to predict conflict resolutions with traditional machine learning.
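To make the classification framing concrete, the following minimal sketch maps a resolved chunk to a pattern label. The pattern set shown here (ours, theirs, base, the two concatenations, empty) is an illustrative assumption and not necessarily the set used in the referenced studies; the function name `label_resolution` and the sample lines are hypothetical.

```python
from typing import Callable, Dict, List

Lines = List[str]

# Hypothetical pattern set: each label maps to a function that builds a
# candidate resolution from the three parts of a conflicting chunk.
PATTERNS: Dict[str, Callable[[Lines, Lines, Lines], Lines]] = {
    "ours": lambda ours, theirs, base: ours,
    "theirs": lambda ours, theirs, base: theirs,
    "base": lambda ours, theirs, base: base,
    "ours+theirs": lambda ours, theirs, base: ours + theirs,
    "theirs+ours": lambda ours, theirs, base: theirs + ours,
    "empty": lambda ours, theirs, base: [],
}


def label_resolution(ours: Lines, theirs: Lines, base: Lines, resolved: Lines) -> str:
    """Return the first pattern that reproduces the resolution, else 'other'."""
    for label, build in PATTERNS.items():
        if build(ours, theirs, base) == resolved:
            return label
    return "other"


# Toy chunk (made-up content, for illustration only).
ours = ["x = compute(a)"]
theirs = ["log(x)"]
base = ["x = a"]

label_keep_both = label_resolution(ours, theirs, base, ["x = compute(a)", "log(x)"])
label_custom = label_resolution(ours, theirs, base, ["x = compute(a, b)"])
print(label_keep_both, label_custom)
```

Labels produced this way would serve as the target classes of a classifier; resolutions that match no pattern fall into a residual class.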
Motivation
In a preliminary study, we collected a large dataset by extracting conflicts and their resolutions from the evolution of thousands of open-source projects. This dataset can be used to train conflict resolution classifiers, and traditional classifiers (e.g., logistic regression, random forests, and support vector machines) have been shown to perform reasonably well on it. However, they struggle to generalize across repositories and programming languages, largely due to suboptimal feature representations: with more than 50 features, the representation contains redundancy and noise that must be pruned using established feature selection techniques.
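One simple way to detect redundancy of this kind is to drop features that are almost perfectly correlated with a feature already kept. The sketch below assumes numeric features in a pandas DataFrame; the column names, the synthetic data, and the 0.95 threshold are hypothetical choices for illustration, not properties of the actual dataset.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop every column whose absolute Pearson correlation with an earlier
    column exceeds the threshold; the earlier column is kept."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


rng = np.random.default_rng(0)
n = 200
chunk_size = rng.integers(1, 100, size=n).astype(float)
demo = pd.DataFrame({
    # Hypothetical feature names; the second column nearly duplicates the first.
    "ours_lines": chunk_size,
    "ours_tokens": chunk_size * 8 + rng.normal(0, 1, n),
    "num_authors": rng.integers(1, 10, size=n).astype(float),
})
pruned = drop_correlated(demo)
print(list(pruned.columns))
```

Pairwise correlation only captures linear redundancy between two features, so it would typically be a first pass before the selection techniques named above.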
Goal
The goal of this work is to evaluate classical classification algorithms on a merge conflict dataset after systematic feature engineering and selection. We aim to determine the most informative subset of features from the original 50+ variables and to compare classifier performance using metrics such as accuracy, F1-score, and AUC. Based on these results, we will recommend an optimal feature set for predicting merge conflict resolutions.
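The intended comparison could be organized roughly as follows. This is a sketch under stated assumptions: the data is a synthetic stand-in generated with scikit-learn, and the choice of selector (mutual information, k=15), the two classifiers, and 5-fold cross-validation are placeholders rather than the evaluation protocol of this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the conflict dataset: 50 features, few informative.
X, y = make_classification(
    n_samples=600, n_features=50, n_informative=8, n_redundant=10,
    n_classes=3, random_state=0,
)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scoring = ["accuracy", "f1_macro", "roc_auc_ovr"]

results = {}
for name, clf in classifiers.items():
    # Selection sits inside the pipeline so it is refit on each training fold
    # and never sees the held-out fold.
    model = make_pipeline(StandardScaler(), SelectKBest(mutual_info_classif, k=15), clf)
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    results[name] = {metric: cv[f"test_{metric}"].mean() for metric in scoring}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Keeping feature selection inside the cross-validated pipeline avoids leaking information from the test folds into the selected feature subset, which would otherwise inflate the reported scores.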
Requirements
The student should have:
- Familiarity with traditional classification methods (e.g., logistic regression, decision trees, random forests, SVM, KNN).
- Expertise in feature selection (e.g., filter methods: chi-squared, mutual information; wrapper methods: RFE, forward/backward selection; embedded methods: LASSO, tree-based importance) and feature engineering (e.g., PCA, polynomial features, scaling).
- Proficiency in handling tabular data with Python (e.g., scikit-learn, pandas) or R for redundancy detection and model evaluation.
- Understanding of Git/version control and software engineering for context.
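As a rough illustration of the three feature selection families named in the requirements, the sketch below applies one filter, one wrapper, and one embedded method to synthetic data and compares the selected subsets. The dataset, the target of 10 features, and the specific estimators are assumptions made for the example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 50 features, only a few of them informative.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=6, n_redundant=10, random_state=0,
)
X = StandardScaler().fit_transform(X)

selectors = {
    # Filter: score each feature independently of any model.
    "filter_mutual_info": SelectKBest(mutual_info_classif, k=10),
    # Wrapper: repeatedly refit a model and eliminate the weakest features.
    "wrapper_rfe": RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=5),
    # Embedded: use importances learned while fitting a tree ensemble;
    # threshold=-inf means only max_features limits the selection.
    "embedded_forest": SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        max_features=10, threshold=-np.inf,
    ),
}

selected = {
    name: set(np.flatnonzero(selector.fit(X, y).get_support()))
    for name, selector in selectors.items()
}
common = set.intersection(*selected.values())

for name, indices in selected.items():
    print(name, sorted(int(i) for i in indices))
print("selected by all three:", sorted(int(i) for i in common))
```

Features chosen by all three families are natural candidates for a recommended subset, while disagreements point to features whose usefulness depends on the model.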
Pointers
- https://dl.acm.org/doi/abs/10.1145/3661167.3661197
- https://www.sciencedirect.com/science/article/pii/S0950584923001878