Evaluating Train-Test Split Strategies in Machine Learning: Beyond the Basics | by Federico Rucci | Sep, 2024
With this article, I want to examine a question often overlooked both by those who ask it and by those who answer it: “How do you partition a dataset into training and test sets?”
When approaching a supervised problem, it is common practice to split the dataset into (at least) two parts: the training set and the test set. The training set is used for studying the phenomenon, while the test set is used to verify whether the learned information can be replicated on “unknown” data, i.e., data not present in the previous phase.
Many people typically follow standard, obvious approaches to make this decision. The common, unexciting answer is: “I randomly partition the available data, reserving 20% to 30% for the test set.”
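That standard answer can be sketched in a few lines. The snippet below is illustrative, using toy data and scikit-learn's `train_test_split` with `test_size=0.2` to reserve 20% of the samples for testing; the variable names are my own, not from the article:

```python
# A minimal sketch of the "unexciting" baseline: a purely random
# 80/20 split with scikit-learn (toy data for illustration).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)                 # 100 toy samples, one feature
y = np.random.RandomState(0).randint(0, 2, 100)    # toy binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42           # 20% held out at random
)
print(len(X_train), len(X_test))  # 80 20
```

Fixing `random_state` makes the random partition reproducible, which is usually what you want when comparing models.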
Those who go further add the concept of stratified random sampling: that is, sampling randomly while maintaining fixed proportions with one or more variables. Imagine we are in a binary classification context and have a target variable with a prior probability of 5%. Random sampling stratified on the target variable means obtaining a training set and a test set that maintain the 5% proportion on the target variable’s prior.
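A sketch of that stratified scenario, assuming toy data with a 5% positive class and scikit-learn's `stratify` argument (the data and names here are made up for illustration):

```python
# Sketch of stratified random sampling on an imbalanced binary target:
# the ~5% positive prior is preserved in both the training and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))     # 1000 toy samples, 3 features
y = np.zeros(1000, dtype=int)
y[:50] = 1                          # 50 positives -> 5% prior

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both 0.05
```

Without `stratify=y`, a plain random split of a rare class can easily leave the test set with noticeably more or fewer positives than 5%, distorting evaluation metrics.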
Reasoning of this kind is sometimes necessary, for example in the case of classification in a very imbalanced context, but it doesn’t add much excitement to the…