Why should treat outliers with Nearest Neighbor and Local Outlier Factor?

3 min readOct 8, 2022

Local Outlier Factor is an unsupervised outlier determination algorithm. Computes the local density deviation of a given data point with respect to its neighbors. Proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000.

As can be seen from the diagram above, the points that are far from the clusters are outliers. To determine these, Local Outlier Factor uses k-distance.

In this article, I will skip to coding Local Outlier Factor rather that theory and formulas .

For more detailed information, see the resources links.

Here’s how Local Outlier Factor is:

First, a dataset is selected. Data is fitting to Local Outlier Factor from within Scikit Learn.

Then the scores are determined with negative_outlier_factor_ attribution. Scores are sorting and observed.

Then the hyperparameter is selected by an outside intervention. As a threshold value. Then this threshold value is assigned to all outliers.

That’s all.

It’s actually a simple process, but we can ask the following question when we look more closely: Why do we assign same value (threshold value) to all outliers.

Looking at the image below, does it make sense to assign far neighbor’s values to the outlier A?

Wouldn’t it be better to find the nearest neighbor of an outlier and assign it to it?

That’s what we’re going to do today.

You can assign the outliers values to the nearest neighbor in the Local Outlier Factor process by following the codes below step by step.

Enjoy!

Now we come to the point. I need to inform a little bit about the Nearest Neighbors function before going into the codes.

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of its training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).