Unsupervised Learning: Outlier Detection
Why Outliers Matter and How to Find Them with K-Nearest Neighbours
Unsupervised learning is about finding structure in data without labels. In clustering, we group observations that are similar to each other. Outlier detection is the flip side: instead of finding what is common, we look for what does not fit.
Why do outliers matter?
Outliers are observations that deviate strongly from the rest. They are not just noise. Often, they are the most interesting part of the data.
Examples:
Fraudulent credit card transactions
New types of network intrusions
Sudden ecological disturbances
Unusual patterns in medical data
Unlike supervised anomaly detection, where you already know the anomaly type and train a classifier, unsupervised detection is about finding the unknown.
The statistical view
From a statistical perspective, “normal” data is generated by some distribution or mechanism. Outliers are data points that are unlikely under this process.
A common approach is parametric. For example, assume the data is Gaussian: fit the mean and standard deviation, then flag points more than three standard deviations from the mean as outliers (the 3σ rule).
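As a quick sketch of the 3σ rule in base R (the synthetic data and the planted extreme value are purely illustrative):

# Minimal sketch of the 3-sigma rule on illustrative data
set.seed(42)
x <- c(rnorm(100, mean = 10, sd = 2), 25)  # 100 "normal" values plus one planted extreme value

m <- mean(x)
s <- sd(x)

# Flag anything more than three standard deviations from the mean
z <- abs(x - m) / s
x[z > 3]

Even in this toy example, the planted value inflates the fitted mean and standard deviation, which hints at the limitations below.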
This approach is simple, but limited:
It assumes a distribution that may not hold in practice.
Outliers themselves distort the fit.
It works poorly in high dimensions.
Non-parametric approaches
One of the simplest is K-Nearest Neighbours (KNN) outlier detection.
Compute the distance from each point to its k-th nearest neighbour.
Points inside clusters have small distances (low outlier score).
Points far from any cluster have large distances (high outlier score).
The intuition is straightforward: outliers live in sparse regions of the data.
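As a minimal sketch in base R, using a full distance matrix (fine for small datasets; the synthetic two-dimensional data and the choice of k are illustrative):

# Distance to the k-th nearest neighbour as an outlier score
set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2),  # one dense cluster around the origin
           c(6, 6))                       # one isolated point

k <- 5
D <- as.matrix(dist(X))  # pairwise Euclidean distances
diag(D) <- Inf           # ignore each point's distance to itself

# For every point, take the k-th smallest distance to any other point
knn_score <- apply(D, 1, function(d) sort(d)[k])

which.max(knn_score)  # the isolated point gets the highest score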
Why should you care?
Outlier detection is everywhere. It is part of risk systems in finance, cyber security monitoring, health screening, and more.
Even if modern ML automates much of this, understanding the core ideas keeps you in control. It helps you explain why a model flagged a transaction as fraud, or why an unusual reading in medical data deserves further investigation.
Detecting Outliers with KNN
Another core concept in data science is detecting unusual data points, or outliers. Outliers matter because they can skew models, highlight errors, or reveal genuinely interesting anomalies.
One useful method is the K-Nearest Neighbours (KNN) outlier score. The idea is simple:
For each observation, calculate the distance to its nearest neighbours.
The farther away an observation is from its neighbours, the more likely it is an outlier.
In R, the dbscan package provides the function kNNdist() to do this.
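A minimal sketch, assuming the dbscan package is installed and using synthetic data for illustration:

# KNN outlier scores with dbscan::kNNdist()
library(dbscan)

set.seed(123)
X <- rbind(matrix(rnorm(200), ncol = 2),           # a dense cluster
           matrix(rnorm(6, mean = 8), ncol = 2))   # a few far-away points

scores <- kNNdist(X, k = 5)  # distance from each point to its 5th nearest neighbour

# The largest scores point to the outlier candidates
head(order(scores, decreasing = TRUE))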
Why this matters
This is more than just an academic exercise.
For example, outlier detection shows up across domains:
In finance, it could flag fraudulent transactions.
In healthcare, it might reveal abnormal test results.
In IoT or security, it can help detect unusual patterns or intrusions.
By combining KNN scores with visualisation, you get both quantitative evidence and an intuitive view of your data.
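For instance, a quick base R plot that scales point size by the KNN score (reusing the illustrative data from the sketch above) makes the sparse-region points stand out immediately:

# Visualise KNN outlier scores (same illustrative data as above)
library(dbscan)

set.seed(123)
X <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(6, mean = 8), ncol = 2))
scores <- kNNdist(X, k = 5)

# Scale point size by score so sparse-region points stand out
plot(X, pch = 19, col = "steelblue",
     cex = 1 + 3 * scores / max(scores),
     main = "KNN outlier scores (larger = more unusual)")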
KNN is one of the simplest approaches, but it gets you thinking about distance, density, and what it really means for a data point to be “different.”
Takeaway
Clustering tells you what is normal. Outlier detection tells you what is unusual. Both are crucial in data science.
Spend a short block of time this week experimenting with a simple KNN outlier detection implementation. Run it on a dataset and see what it flags. That hands-on intuition will stick with you more than reading any definition.