The Journey of Handling Missing Data as a Data Scientist
Understanding and Addressing Missing Data in Data Systems
In data systems, missing data most often comes from system integration problems, as data is passed from one platform to another. You wonder how, right …? 🤔
Of course, ensuring data integrity and field compatibility between those platforms is essential for reliable information.
Schema validation, appropriate data type conversion, proper error handling in ETL pipelines, automated monitoring (tracking missing values across pipelines), and standardisation can prevent much of the missing data that creeps in when data moves between pipelines and external applications. Oh, and “ETL” stands for Extract, Transform, Load 🙂
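To make that a bit more concrete, here is a minimal sketch of a validation and type conversion step, assuming a pandas-based pipeline. The schema, the column names and the validate_and_convert helper are all hypothetical, made up for illustration only. The idea is simply to report values that get lost during conversion instead of letting them disappear silently.

```python
import pandas as pd

# Hypothetical schema for illustration: column name -> (kind, required?)
SCHEMA = {
    "order_id": ("number", True),
    "amount": ("number", True),
    "delivery_date": ("datetime", False),
}


def validate_and_convert(df, schema):
    """Coerce columns to the expected kind and collect problems in a list."""
    problems = []
    out = pd.DataFrame(index=df.index)
    for col, (kind, required) in schema.items():
        if col not in df.columns:
            problems.append(f"{col}: column absent from source")
            continue
        before = df[col].isna().sum()
        # errors="coerce" turns unparseable values into NaN/NaT instead of raising
        if kind == "datetime":
            converted = pd.to_datetime(df[col], errors="coerce")
        else:
            converted = pd.to_numeric(df[col], errors="coerce")
        after = converted.isna().sum()
        if after > before:
            problems.append(f"{col}: {after - before} value(s) lost in type conversion")
        if required and after > 0:
            problems.append(f"{col}: {after} missing value(s) in a required column")
        out[col] = converted
    return out, problems


# Toy batch as it might arrive from an upstream system (everything as text)
raw = pd.DataFrame({
    "order_id": ["1", "2", "x"],
    "amount": ["9.5", None, "3"],
    "delivery_date": ["2024-01-05", "not a date", None],
})

clean, problems = validate_and_convert(raw, SCHEMA)
for line in problems:
    print(line)
```

In a real pipeline you would probably log these messages or fail the batch past a threshold. That choice depends on your own setup.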
When data are missing (a very common and often unavoidable occurrence), you can use data visualisations (e.g. heatmaps with the lightweight “missingno” package by Aleksey Bilogur in Python, or naniar/VIM in R), or just simple summary tables, to highlight the gaps.
Heatmaps, bar charts, and matrix plots can clearly highlight clusters of missing values (for instance, a bar chart shows which columns have the most missing values).
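If you are curious about the kind of numbers a missingness heatmap can display, here is a small pandas-only sketch. It computes how strongly the "is missing" flags of two columns move together. The DataFrame and its column names are invented for the example, and this is my own illustration, not the internals of any plotting package.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "delivery_date": ["2024-01-03", None, None, "2024-01-09", None, "2024-01-12"],
    "courier": ["A", None, None, "B", None, "A"],
    "discount": [0.1, np.nan, 0.2, np.nan, 0.0, 0.3],
})

is_missing = df.isna()

# Keep only columns where the missing flag actually varies;
# a column with no gaps (or only gaps) has no defined correlation
varying = is_missing.loc[:, is_missing.nunique() > 1]

# Correlation between the 0/1 missing flags of each pair of columns
nullity_corr = varying.astype(int).corr()
print(nullity_corr.round(2))
```

A value close to 1 means two columns tend to be missing on the same rows, which often points to a shared upstream cause.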
Data visualisation helps identify trends and correlations. For instance, you might discover that all recorded orders from last month have missing delivery dates, simply because the visualisation makes it obvious 🙂
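The same pattern can also be checked numerically. Below is a minimal sketch that computes the share of missing delivery dates per order month. The orders table and its column names are hypothetical, chosen to mirror the example above.

```python
import pandas as pd

# Toy orders table, made up for illustration
orders = pd.DataFrame({
    "order_date": pd.to_datetime([
        "2024-03-02", "2024-03-15", "2024-03-28",
        "2024-04-03", "2024-04-11", "2024-04-25",
    ]),
    "delivery_date": pd.to_datetime([
        "2024-03-05", "2024-03-18", None,
        None, None, None,
    ]),
})

# Label each order with its month, e.g. "2024-04"
month = orders["order_date"].dt.strftime("%Y-%m")

# Share of orders per month that have no delivery date
missing_share = orders["delivery_date"].isna().groupby(month).mean()
print(missing_share)

# Months where every single order lacks a delivery date
fully_missing = missing_share[missing_share == 1.0].index.tolist()
print(fully_missing)
```

Plotting missing_share as a bar chart would give the visual version of the same insight.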
Both data visualisation and summarisation can (and should) be used as techniques for exploring and understanding missing data. One is visual, the other numerical.
Visualisation techniques help identify patterns (such as clusters or correlations), while summarisation techniques are used to quantify and act on the missing data.
Together, they form part of exploratory data analysis (EDA) strategies for handling missing data.
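On the summarisation side, a small table with counts and percentages per column is often enough to decide where to look first. Here is a minimal sketch with pandas. The missing_summary helper and the toy data are my own, for illustration only.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return one row per column with the count and percentage of missing values."""
    is_missing = df.isna()
    summary = pd.DataFrame({
        "n_missing": is_missing.sum(),
        "pct_missing": (is_missing.mean() * 100).round(1),
    })
    # Worst columns first
    return summary.sort_values("n_missing", ascending=False)


# Toy data, made up for illustration
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": ["x", None, "y", "z"],
    "c": [10, 20, 30, 40],
})

summary = missing_summary(df)
print(summary)
```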
You can quickly check for missing data with a simple function, either in Python (efficiently with a pandas DataFrame or NumPy's isnan()) or in R (e.g. with dplyr from the tidyverse 🙂).
We can use summarisation like… df.isnull().sum() in Python 🙃

    def check_missing(df):
        # Count the missing values in each column
        return df.isnull().sum()

In R, use colSums(is.na(df)), or equivalently:

    sapply(df, function(x) sum(is.na(x)))
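A quick word on the NumPy route mentioned earlier: np.isnan() works on numeric arrays, but it does not accept object arrays that mix strings, None and NaN, while pd.isna() handles both. Here is a small sketch with made-up arrays to show the difference.

```python
import numpy as np
import pandas as pd

# Numeric array: np.isnan is enough
values = np.array([1.5, np.nan, 3.0, np.nan])
n_missing_numeric = int(np.isnan(values).sum())

# Object array mixing a string, None and NaN
mixed = np.array(["a", None, np.nan], dtype=object)

try:
    np.isnan(mixed)
    isnan_supported = True
except TypeError:
    # np.isnan is not defined for object arrays
    isnan_supported = False

# pd.isna treats both None and NaN as missing
n_missing_mixed = int(pd.isna(mixed).sum())

print(n_missing_numeric, isnan_supported, n_missing_mixed)
```

So for anything that is not purely numeric, I would lean on the pandas functions.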


