
A Meta-Analysis of Overfitting in Machine Learning

Last updated on April 6, 2021, 4:21 p.m. by sakshi4

Summary

This paper conducts a large meta-analysis of overfitting due to test set reuse in the machine-learning community. The holdout method is central to empirical progress in machine learning: competitions, benchmarks, and large-scale hyperparameter searches all rely on splitting a data set into multiple pieces to separate model training from evaluation. However, when practitioners repeatedly reuse holdout data, the danger of overfitting to the holdout data arises. The paper presents an empirical study of holdout reuse at a significantly larger scale, analyzing data from 120 machine learning competitions on the popular Kaggle platform.
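
To make the danger of holdout reuse concrete, here is a minimal sketch (not from the paper, assuming only NumPy): purely random "models" are adaptively selected on a reused holdout set, and the chosen model then looks better on the holdout data than it really is.

```python
# Minimal sketch: adaptively picking the model with the best holdout accuracy
# inflates the holdout estimate relative to an untouched test set.
import numpy as np

rng = np.random.default_rng(0)

# Pure-noise binary labels: no predictor can genuinely exceed 50% accuracy.
n_holdout, n_fresh = 2_000, 2_000
y_holdout = rng.integers(0, 2, n_holdout)
y_fresh = rng.integers(0, 2, n_fresh)

# 500 candidate "models", each just a fixed random labelling of both sets.
n_models = 500
holdout_preds = rng.integers(0, 2, (n_models, n_holdout))
fresh_preds = rng.integers(0, 2, (n_models, n_fresh))

# Reuse the holdout set to pick the best-looking model ...
holdout_accs = (holdout_preds == y_holdout).mean(axis=1)
best = int(holdout_accs.argmax())

# ... then check the chosen model on data never used for selection.
print(f"holdout accuracy of chosen model:  {holdout_accs[best]:.3f}")  # usually around 0.53
print(f"fresh-data accuracy of same model: {(fresh_preds[best] == y_fresh).mean():.3f}")  # around 0.50
```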

This paper focuses on adaptive overfitting, which is overfitting caused by test set reuse.

The structure of Kaggle competitions makes MetaKaggle a useful dataset for investigating overfitting empirically at a large scale.

The analysis proceeds at three levels. The first level considers all submissions in a competition and checks for systematic overfitting that would affect a substantial number of submissions. The second level then zooms into the top 10% of submissions and conducts a similar comparison of public to private scores. The third level takes a mainly quantitative approach and computes the probabilities of the observed public vs. private accuracy differences under an ideal null model.
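
As a rough illustration of the second and third analysis levels, the sketch below filters the top decile of submissions by public accuracy and asks how likely a given public vs. private gap would be if both splits measured the same underlying accuracy. The column names, split sizes, and the two-proportion z-test are assumptions for the example, not the paper's MetaKaggle schema or its exact null model.

```python
# Hedged sketch of the top-decile comparison and a simple null-model check.
import numpy as np
import pandas as pd
from scipy import stats

def top_decile_gap(submissions: pd.DataFrame) -> float:
    """Mean public-private accuracy gap among the top 10% of submissions
    (ranked by public accuracy), where adaptation to the test set is most likely."""
    cutoff = submissions["public_score"].quantile(0.9)
    top = submissions[submissions["public_score"] >= cutoff]
    return float((top["public_score"] - top["private_score"]).mean())

def gap_p_value(public_acc, private_acc, n_public, n_private) -> float:
    """Two-proportion z-test: probability of a gap at least this large if the
    public and private splits measured the same underlying accuracy."""
    pooled = (public_acc * n_public + private_acc * n_private) / (n_public + n_private)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_public + 1 / n_private))
    return 2 * stats.norm.sf(abs(public_acc - private_acc) / se)

# Toy example: 1,000 simulated submissions whose public and private scores
# fluctuate around the same true accuracy (i.e., no adaptive overfitting).
rng = np.random.default_rng(0)
true_acc = rng.uniform(0.7, 0.9, 1_000)
subs = pd.DataFrame({
    "public_score": rng.binomial(30_000, true_acc) / 30_000,
    "private_score": rng.binomial(70_000, true_acc) / 70_000,
})
print(f"mean public-private gap in the top decile: {top_decile_gap(subs):+.4f}")
print(f"p-value for a 0.912 vs 0.908 gap: {gap_p_value(0.912, 0.908, 30_000, 70_000):.3f}")
```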

The paper applies these three analysis methods to four accuracy competitions. These are the accuracy competitions with the largest number of submissions and serve as representative examples of a typical accuracy competition in the MetaKaggle dataset.

The survey of 120 Kaggle competitions, covering a wide range of classification tasks, found little to no sign of adaptive overfitting. The results cast doubt on the standard narrative that adaptive overfitting is a significant danger in the common machine learning workflow, and the findings call into question whether common practices such as limiting test set reuse increase the reliability of machine learning.

Important points:

 

  • The holdout method is central to empirical progress in the machine learning community. Competitions, benchmarks, and large-scale hyperparameter search all rely on splitting a data set into multiple pieces to separate model training from the evaluation.

 

  • In this paper, we empirically study holdout reuse at a significantly larger scale by analyzing data from 120 machine learning competitions on the popular Kaggle platform.

 

  • “Overfitting” is often used as an umbrella term to describe any unwanted performance drop of a machine learning model. Here, we focus on adaptive overfitting, which is overfitting caused by test set reuse.

 

  • Kaggle is the most widely used platform for machine learning competitions, currently hosting 1,461 active and completed competitions.

 

  • Considering the danger of overfitting to the test set in a competitive environment, Kaggle subdivides each test set into public and private components. 

 

  • The two subsets are randomly shuffled together and the entire test set is released without labels, so participants do not know which test samples belong to which split. Hence participants submit predictions for the entire test set.

 

  • The Kaggle server then internally evaluates each submission on both the public and private splits and updates the public competition leaderboard only with the score on the public split (a small sketch of this split-and-score mechanism appears after this list).

 

  • Kaggle has released the MetaKaggle dataset, which contains detailed information about competitions, submissions, etc. on the Kaggle platform.

 

  • The first level considers all submissions in a competition and checks for systematic overfitting that would affect a substantial number of submissions (e.g., if public and private scores diverge early in the competition).

 

  • The second level then zooms into the top 10% of submissions (measured by public accuracy) and conducts a similar comparison of public to private scores. The goal here is to understand whether there is more overfitting among the best submissions since they are likely most adapted to the test set.

 

  • The third analysis level then takes a mainly quantitative approach and computes the probabilities of the observed public vs. private accuracy differences under an ideal null model. This allows us to check if the observed gaps are larger than purely random fluctuations. 

 

  • In the following subsections, we apply these three analysis methods to investigate four accuracy competitions. These four competitions are the accuracy competitions with the largest number of submissions and serve as representative examples of a typical accuracy competition in the MetaKaggle dataset. We then complement these analyses with a quantitative look at all competitions before summarizing our findings for accuracy competitions.

 

  • The preceding subsections provided an increasingly fine-grained analysis of the Kaggle competitions evaluated with classification accuracy.

 

  • Our results cast doubt on the standard narrative that adaptive overfitting is a significant danger in the common machine learning workflow.

 

  • Our findings call into question whether common practices such as limiting test set re-use increase the reliability of machine learning.

 

  • Our analysis here focused on classification competitions, but Kaggle hosts many regression competitions
...
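
The public/private split mechanism described in the list above can be sketched in a few lines. This is a hedged illustration, not Kaggle's actual implementation: the 30% public fraction, the binary labels, and the accuracy metric are assumptions made only for the example.

```python
# Sketch of the split-and-score mechanics: the test set is randomly partitioned,
# a full-test-set submission is scored on both parts, and only the public score
# is shown on the leaderboard during the competition.
import numpy as np

rng = np.random.default_rng(42)

n_test = 10_000
public_fraction = 0.3                      # assumed split ratio for illustration
y_true = rng.integers(0, 2, n_test)        # hidden test labels

# Randomly assign each test row to the public or private split.
is_public = np.zeros(n_test, dtype=bool)
is_public[rng.choice(n_test, int(public_fraction * n_test), replace=False)] = True

def score_submission(predictions):
    """Return (public_accuracy, private_accuracy) for a full-test-set submission."""
    public_acc = (predictions[is_public] == y_true[is_public]).mean()
    private_acc = (predictions[~is_public] == y_true[~is_public]).mean()
    return public_acc, private_acc

# A participant submits predictions for the entire test set, not knowing the split.
submission = rng.integers(0, 2, n_test)
public_acc, private_acc = score_submission(submission)
print(f"leaderboard (public) accuracy: {public_acc:.3f}")   # shown during the competition
print(f"private accuracy:              {private_acc:.3f}")  # revealed only at the end
```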

by sakshi4
