Your model’s metrics aren’t always correct

So you have data about a problem you want to solve with ML. You go ahead and clean it, build models over it, and for all your hard work you get awesome metric values, and you are happy about them. I would have been happy too, until I understood this very cheeky, hidden concept of data leakage, and I really wish I had known about it earlier.

AshutoshGhattikar
Analytics Vidhya

--

HAPPY ABOUT THAT GOOD ACCURACY?

The story starts with one agenda: we have to create a model that generalizes very well over ‘unseen’ data, and by ‘unseen’ I literally mean unseen. When we split the data into train and test sets after all that amazing EDA, we might be tricked into thinking our model will generalize well over the unseen (test) data. But some techniques carry a hidden flaw that leads to data leakage, and the accuracy we achieve can be a false signal that our model generalizes well.

POSSIBLE AREAS OF DATA LEAKAGE

Consider this usual scenario: you have missing values and decide to impute them with the mean, median, or mode. Notice that this mean, median, or mode is calculated over the entire dataset. So if you impute the missing values first and split the dataset into train and test afterwards, the data is already contaminated: the test data clearly shares information with the training data, because the missing values were filled with a statistic that was in turn computed from all of the data.
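Here is a minimal sketch of the difference, using a tiny made-up single-column dataset (the column name and values are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature with missing values, for illustration only
df = pd.DataFrame({"age": [25, 32, np.nan, 47, 51, np.nan, 38, 29]})

# LEAKY: the mean is computed over ALL rows, so imputed values
# in the eventual test split carry information from the train split
df_leaky = df.fillna(df["age"].mean())

# SAFER: split first, then learn the statistic from the train split alone
train, test = train_test_split(df, test_size=0.25, random_state=42)
train_mean = train["age"].mean()   # learned from train data only
train = train.fillna(train_mean)
test = test.fillna(train_mean)     # reused on test, never recomputed
```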

Another possible area where leakage might occur is while standardizing/normalizing the data. If the data is standardized/normalized over all of the data and only then split into train and test, the information has been shared, the contamination has already happened, and the test data is not ‘unseen’ anymore. The model then performs well on the test data too, giving us inflated metrics.
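As a small sketch with scikit-learn’s StandardScaler (the dataset here is synthetic, just to show where fit and transform belong):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SAFER: fit the scaler on the training split only ...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ... and reuse its train-derived mean/std on the test split
X_test_scaled = scaler.transform(X_test)

# LEAKY (avoid): StandardScaler().fit(X) on the full data before
# splitting would bake test-set statistics into the transformation
```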

HOW TO AVOID DATA LEAKAGE?

Avoiding this requires some maturity in understanding why the training and testing data must be kept totally separate. There are a few good practices to follow during EDA that ensure little to no data leakage overall:

1. Split the entire data into training and testing sets beforehand.

2. Create pipelines or user-defined functions with all the EDA steps, fit them on the train data only, and merely apply them to the test data (for instance, the mean/median/mode learned from the train data is used to impute the missing values in both splits, so the test data never influences the statistic).

3. Use the k-fold cross-validation technique; the sketch after this list shows it combined with a pipeline so that each fold’s preprocessing is fit on the training folds only.
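Putting these practices together, here is a minimal sketch (again on a synthetic dataset) of a scikit-learn Pipeline evaluated with 5-fold cross-validation. Because the imputer and scaler live inside the pipeline, they are re-fit on the training folds of every split and only applied to the held-out fold:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# Every preprocessing step lives inside the pipeline, so during
# k-fold CV each step is re-fit on the training folds only.
# (This synthetic data has no missing values; the imputer is
# included just to show the pattern.)
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)  # 5-fold cross-validation
print(scores.mean())
```

The point of the pipeline is that cross_val_score never lets a preprocessing step see a held-out fold while fitting, which is exactly the separation the list above asks for.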

And that was a brief idea of what data leakage is and how it can be avoided. As a beginner this generally will not be your area of concern, but it is critical and important all the same, and understanding and following the above practices can get us legitimate, more generalized models that better reflect real-world situations.

Hope this was informative. Thank you! 🙏🏻
