Mr. Wolf Fools the Data Science Team Again — Data Leakage Scam 🐺

Shaurya Uppal
4 min read · Mar 1, 2022

Mr. Wolf earlier p-hacked an A/B-testing experiment, and now he is back with another mischief: data leakage.

Over the years, I have witnessed some evil practices from Mr. Wolf (an expert data scientist, and a dummy character). Earlier he was p-hacking; now he has come up with another trick to fool the team. To stop Mr. Wolf, I use my power of writing to unmask him and blow the whistle on all of his evil tricks.

What did Mr. Wolf do wrong?

Mr. Wolf built a data science model that over-performed because of data leakage. Careless handling of data can sabotage your data science models: the metrics Mr. Wolf shared looked very appealing, but in production the model broke.

What was Mr. Wolf working on?

Mr. Wolf was working on a CTR prediction model that used features such as average session time and average clicks per user. However, while computing those averages he also included data from the period he was trying to predict, which caused data leakage.

Explanation with an example: if you want to predict CTR for day X, at inference time you only know the average clicks up to day X-1 (not the current day). While training the model, however, Mr. Wolf computed the average with day X included as well, which caused data leakage.
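A minimal sketch of the difference, assuming a hypothetical per-user, per-day click log in a pandas DataFrame (the column names and numbers here are made up for illustration):

import pandas as pd

# Hypothetical click log: one row per user per day
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2022-01-01", "2022-01-02", "2022-01-03",
        "2022-01-01", "2022-01-02", "2022-01-03",
    ]),
    "clicks": [3, 5, 4, 10, 8, 12],
})
df = df.sort_values(["user_id", "date"])

# Leaky feature: the expanding mean includes the current day X,
# i.e. information that is not available at prediction time.
df["avg_clicks_leaky"] = (
    df.groupby("user_id")["clicks"]
      .transform(lambda s: s.expanding().mean())
)

# Leak-free feature: shift by one day first, so the average for day X
# only uses clicks up to day X-1.
df["avg_clicks_ok"] = (
    df.groupby("user_id")["clicks"]
      .transform(lambda s: s.shift(1).expanding().mean())
)

print(df)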

After training the model, Mr. Wolf shared the test-data metrics: an AUC-ROC above 0.8, which was too good to be true.

Celebration

Mr. Wolf Rejoiced and Celebrated!!! 🎊 🎉

This happiness did not last long: once deployed, Mr. Wolf’s model broke in production and performed very poorly. Later, after reviewing and thoroughly checking Mr. Wolf’s codebase, I found the data leakage issue and pointed it out to him. I felt bad for him because he had been so happy earlier; it reminds me of a quote. 😂

Upholding data hygiene is of paramount importance when carrying out a data science task. There is a lack of awareness of one immense threat to data hygiene, data leakage, and even seasoned data scientists can fall into this trap unknowingly.

Data leakage is a phenomenon that occurs when information from outside the training data is used to train a model. It essentially violates the independence of the training data, allowing it to be contaminated by information from external sources.

In simple terms, data leakage occurs when the data used in the training process contains information about what the model is trying to predict. The training data needs to be completely independent of the testing data: the values in the testing set should have no bearing on the values in the training set.

If you fail to identify data leakage in such situations, you can be tricked into thinking that your model is robust, only to find out that it is completely unreliable after deploying it, just like in the case of Mr. Wolf.

Thus, it is important to ensure that you don’t inadvertently cause any data leakage as you process your data.

Bonus: another classic data leakage case I have experienced is an incorrect use of StandardScaler.

Wrong Method:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Leakage: the scaler is fit on ALL of X, so its mean and standard
# deviation already contain information from the future test rows.
X = sc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Correct Method:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
sc = StandardScaler()
# Fit the scaler on the training split only...
X_train = sc.fit_transform(X_train)
# ...and reuse the training mean/std to transform the test split.
X_test = sc.transform(X_test)
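An even safer pattern, sketched below with a generic classifier and toy data of my own choosing (not from the original post), is to wrap the scaler in a scikit-learn Pipeline, so that cross-validation refits the scaler on each training fold and the held-out fold can never influence the scaling parameters:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# The pipeline refits StandardScaler on every training fold,
# so scaling never sees the evaluation rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())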

To understand the difference between fit(), transform(), and fit_transform(), refer to my StackOverflow answer.
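In short, here is a minimal sketch with made-up numbers:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

sc = StandardScaler()
sc.fit(X_train)                          # fit(): learns mean and std from X_train only
X_train_scaled = sc.transform(X_train)   # transform(): applies the learned mean/std
X_test_scaled = sc.transform(X_test)     # same training parameters reused on new data
# fit_transform(X_train) is simply fit(X_train) followed by transform(X_train)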

I hope you learned something new. If you liked it, do subscribe, hit 👍 or ❤️, and share this with others. Stay tuned for the next one!

Connect, Follow or Endorse me on LinkedIn if you found this read useful. To learn more about me visit: Here

To Connect with me (1:1 conversation): Block my Calendar for Consultation / Advice.
