Last update: 26.03.2023. All opinions are my own.

1. Overview

This blog post provides a tutorial on implementing the K Nearest Neighbors algorithm using Python and NumPy. We will set up a simple class object, implement relevant methods to perform the prediction, and illustrate how it works on a toy dataset.

Why are we implementing KNN from scratch if the algorithm is already available in scikit-learn? First, coding something from scratch is the best way to understand it. You may know many ML algorithms, but being able to write one down from scratch shows that you have really mastered it. Second, implementing algorithms from scratch is a common task in ML interviews at tech companies, which makes it a useful skill for a job candidate to practice. Last but not least, it's a fun exercise, right? :)

This post is part of the "ML from Scratch" series, where we implement established ML algorithms in Python. Check out the other posts to see further implementations.

2. How KNN works

Before we jump to the implementation, let's quickly refresh our minds. How does KNN work?

KNN is one of the so-called lazy algorithms, which means that there is no actual training step. Instead, KNN memorizes the training data by storing the feature values of training examples. Given a new example to be predicted, KNN calculates distances between the new example and each of the examples in the training set. The prediction returned by the KNN algorithm is simply the average value of the target variable across the K nearest neighbors of the new example.

P.S. If you need a more detailed summary of how KNN works, check out this Wiki page.
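
To make this concrete, here is a tiny worked example with made-up numbers: three one-dimensional training points with labels 0, 0, and 1, and a query point whose prediction is the average label of its K = 2 nearest neighbors.

import numpy as np

X_train = np.array([[1.0], [2.0], [10.0]])  # three 1-D training examples
y_train = np.array([0, 0, 1])               # their labels
query   = np.array([3.0])                   # new example to predict

# Euclidean distances to the query: [2.0, 1.0, 7.0]
dists = np.sqrt(((X_train - query) ** 2).sum(axis = 1))

# with K = 2, the two nearest neighbors are the first two points,
# so the prediction is the average of their labels: (0 + 0) / 2 = 0.0
k = 2
prediction = y_train[np.argsort(dists)[:k]].mean()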

3. Implementing KNN

Let's start the implementation! The only library we need to import is numpy:

import numpy as np

In line with object-oriented programming practices, we will implement KNN as a class with a set of methods. We will need the following five:

  1. __init__(): initialize the class object.
  2. fit(): memorize the training data and store it as a class variable.
  3. predict(): predict labels for new examples.
  4. get_distance(): helper function to calculate distance between two examples.
  5. get_neighbors(): helper function to find and rank neighbors by distance.

The last two functions are optional: we could implement their logic directly inside the predict() method, but splitting the steps into helpers will make the code easier to follow.

Let's sketch a class template. Since we implement the functions as class methods, we include the self argument in each method:

class KNN:
    
    def __init__(self):
        pass
    
    def fit(self):
        """
        Memorize training data
        """
        pass
    
    def predict(self):
        """
        Predict labels
        """
        pass

    def get_distance(self):
        """
        Calculate distance between two examples
        """
        pass    
    
    def get_neighbors(self):
        """
        Find nearest neighbors
        """
        pass

Now let's go through each method one by one.

The __init__() method is run once when we initialize the KNN class object. The only thing we need to do at the initialization step is to store the meta-parameters of our algorithm. For KNN, there is only one key meta-parameter to specify: the number of neighbors. We will save it as self.num_neighbors:

def __init__(self, num_neighbors: int = 5):
    self.num_neighbors = num_neighbors

Next, let's implement the fit() method. As we mentioned above, at the training stage KNN simply memorizes the training data. To simplify further calculations, we will provide the input data as two numpy arrays: features saved as self.X and labels saved as self.y:

def fit(self, X: np.array, y: np.array):
    """
    Memorize training data
    """
    self.X = X
    self.y = y

Now, let's write a helper function to calculate the distance between two examples, each represented as a numpy array of feature values. For simplicity, we will assume that all features are numeric. One of the most commonly used distance metrics is the Euclidean distance, which is calculated as the square root of the sum of squared differences between feature values. If the last sentence sounds complicated, here is how simple it looks in Python:

def get_distance(self, a: np.array,  b: np.array):
    """
    Calculate Euclidean distance between two examples
    """
    return np.sum((a - b) ** 2) ** 0.5
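
As a quick sanity check, our get_distance() should match NumPy's built-in norm; here is a small illustration with made-up numbers:

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

np.sum((a - b) ** 2) ** 0.5  # 5.0
np.linalg.norm(a - b)        # 5.0, the same Euclidean distance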

Now we are getting to the most difficult part of the KNN implementation! Below, we will write a helper function that finds the nearest neighbors of a given example. For that, we will take the following steps:

  1. Calculate distance between the provided example and each example in the memorized dataset self.X.
  2. Sort examples in self.X by their distances to the provided example.
  3. Return indices of the nearest neighbors based on the self.num_neighbors meta-parameter.

For step 1, we will leverage the get_distance() function defined above. The trick to implementing step 2 is to save a tuple (example ID, distance) while going through the training data. This will allow us to sort the examples by distance and return the relevant IDs at the same time:

def get_neighbors(self, example: np.array):
    """
    Find and rank nearest neighbors of example
    """
    
    # placeholder
    distances = []

    # calculate distances as tuples (id, distance)
    for i in range(len(self.X)):
        distances.append((i, self.get_distance(self.X[i], example)))

    # sort by distance
    distances.sort(key = lambda x: x[1])
    
    # return IDs and distances of the top neighbors
    return distances[:self.num_neighbors]
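
As a side note, the same neighbor search can be vectorized with NumPy broadcasting and np.argsort. The sketch below lives outside the class and uses hypothetical names X_train, new_example, and k; it is an alternative to the loop above, not part of our implementation:

# hypothetical vectorized neighbor search: broadcasting computes all
# distances at once, np.argsort ranks them by distance
dists        = np.sqrt(((X_train - new_example) ** 2).sum(axis = 1))
neighbor_ids = np.argsort(dists)[:k]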

The final step is the prediction! For this purpose, we implement the predict() method, which expects a new dataset as a numpy array and returns an array with predictions. For each example in the new dataset, the method finds the nearest neighbors using the get_neighbors() helper and averages the labels across those neighbors. That's it!

def predict(self, X: np.array):
    """
    Predict labels
    """
    
    # placeholder
    predictions = []
    
    # go through examples
    for idx in range(len(X)):
        example     = X[idx]
        k_neighbors = self.get_neighbors(example)
        k_y_values  = [self.y[item[0]] for item in k_neighbors]
        prediction  = sum(k_y_values) / self.num_neighbors
        predictions.append(prediction)

    # return predictions
    return np.array(predictions)

Putting everything together, this is what our implementation looks like:

### END-TO-END KNN CLASS

class KNN:
    
    def __init__(self, num_neighbors: int = 5):
        self.num_neighbors = num_neighbors
        
        
    def fit(self, X: np.array, y: np.array):
        """
        Memorize training data
        """
        self.X = X
        self.y = y
        
        
    def get_distance(self, a: np.array,  b: np.array):
        """
        Calculate Euclidean distance between two examples
        """
        return np.sum((a - b) ** 2) ** 0.5


    def get_neighbors(self, example: np.array):
        """
        Find and rank nearest neighbors of example
        """
        
        # placeholder
        distances = []

        # calculate distances as tuples (id, distance)
        for i in range(len(self.X)):
            distances.append((i, self.get_distance(self.X[i], example)))

        # sort by distance
        distances.sort(key = lambda x: x[1])
        
        # return IDs and distances of the top neighbors
        return distances[:self.num_neighbors]


    def predict(self, X: np.array):
        """
        Predict labels
        """
        
        # placeholder
        predictions = []
        
        # go through examples
        for idx in range(len(X)):
            example     = X[idx]
            k_neighbors = self.get_neighbors(example)
            k_y_values  = [self.y[item[0]] for item in k_neighbors]
            prediction  = sum(k_y_values) / self.num_neighbors
            predictions.append(prediction)

        # return predictions
        return np.array(predictions)

4. Testing the implementation

Now that we have our implementation, let's check whether it actually works. We will generate toy data using numpy. The gen_data() function below uses the np.random module to draw feature values from a Normal distribution and assigns a 0/1 label.

### HELPER FUNCTION

def gen_data(
    mu: float = 0, 
    sigma: float = 1, 
    y: int = 0,
    size: tuple = (1000, 10),
):
    """
    Generate random data
    """
    X = np.random.normal(loc = mu, scale = sigma, size = size)
    y = np.repeat(y, repeats = size[0])
    
    return X, y

To simulate a simple ML problem, we will generate a dataset consisting of two samples:

  1. 30 examples with a mean feature value of 1 and a label of 0.
  2. 20 examples with a mean feature value of 5 and a label of 1.

### TOY DATA GENERATION

X0, y0 = gen_data(mu = 1, sigma = 3, y = 0, size = (30, 10))
X1, y1 = gen_data(mu = 5, sigma = 3, y = 1, size = (20, 10))
X = np.concatenate((X0, X1), axis = 0)
y = np.concatenate((y0, y1), axis = 0)

Now, let's instantiate our KNN class, fit it on the training data and provide predictions for some new examples!

To see if the algorithm works properly, we will generate four new examples as X_new, gradually increasing the feature values from 1 to 5. We expect the label predicted by KNN to increase from 0 to 1, since we are getting closer to examples in X1. Let's check!

### PREDICTION

# fit KNN
clf = KNN(num_neighbors = 5)
clf.fit(X, y)

# generate new examples
X_new = np.stack((
    np.repeat(1, 10),
    np.repeat(2, 10),
    np.repeat(4, 10),
    np.repeat(5, 10),
))

# predict new examples
clf.predict(X_new)
array([0. , 0.2, 0.8, 1. ])

Yay! Everything works as expected. Our KNN algorithm provides four predictions for the new examples, and the predicted value goes up as the feature values increase. Our job is done!
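
If you want an extra check, you could compare the output with scikit-learn's KNeighborsRegressor, which also averages the labels of the nearest neighbors. Assuming scikit-learn is installed, a sketch along these lines should produce the same predictions as above:

# optional cross-check against scikit-learn (assumes scikit-learn is installed)
from sklearn.neighbors import KNeighborsRegressor

sk_clf = KNeighborsRegressor(n_neighbors = 5)
sk_clf.fit(X, y)
sk_clf.predict(X_new)  # expected to match our predictions above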

5. Closing words

This is it! I hope this tutorial helps you refresh your memory on how KNN works and gives you a good idea of how to implement it yourself. You are now well-equipped to do this exercise on your own!

If you liked this tutorial, feel free to share it on social media and buy me a coffee :) Don't forget to check out other posts in the "ML from Scratch" series. Happy learning!