Building K Nearest Neighbors from Scratch

Aug 21, 2024

Introduction

Hey guys! So I had this random thought today: why not try to build a machine learning algorithm from scratch? I figured it'd be cool to implement K-Nearest Neighbors (KNN) and see how it goes. In this blog post I'm gonna walk you through my attempt. If you spot any errors in my explanation or code, please let me know :)

So yeah, KNN is a simple yet powerful machine learning algorithm used for both classification and regression tasks.

PS: I picked KNN because it's relatively simple but still super useful.

Core idea of the algorithm

The core idea behind KNN is straightforward:

  1. When given a new data point to classify or predict, KNN looks at the 'K' closest points in the training dataset.
  2. For classification, it assigns the most common class among these K neighbors.
  3. For regression, it calculates the average value of these K neighbors.
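
To make that concrete, here's both cases on some made-up neighbor labels and values:

from collections import Counter
import numpy as np

neighbor_labels = ['cat', 'cat', 'dog']  # labels of the K = 3 nearest neighbors
print(Counter(neighbor_labels).most_common(1)[0][0])  # classification: 'cat' wins the vote

neighbor_values = np.array([2.0, 3.0, 4.0])  # values of the K = 3 nearest neighbors
print(neighbor_values.mean())  # regression: predicts 3.0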

Python Implementation

Alright, let's dive in and build our KNN from scratch. Let's get coding!

First things first, we need some built-in Python modules and numpy:

import math                      # sqrt for our distance function
import numpy as np               # array math
from collections import Counter  # majority vote over neighbor labels

Calculating the Euclidean distance:

def euclidean_distance(x1, x2):
    # Straight-line distance between two points given as NumPy arrays
    return math.sqrt(np.sum((x1 - x2) ** 2))

This function takes two points (as NumPy arrays, so the subtraction works element-wise) and tells us how far apart they are.
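
Quick sanity check using the classic 3-4-5 right triangle:

a = np.array([0, 0])
b = np.array([3, 4])
print(euclidean_distance(a, b))  # 5.0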

KNN class

class KNN:
    def __init__(self, k=3):
        self.k = k

We start by initializing our class. The k parameter is how many neighbors we want to look at. We're setting a default of 3, but feel free to change it! (Tip: an odd k avoids ties when you have two classes.)

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

The fit method just stores our training data. KNN is a "lazy" learner: there's no real training step, and all the work happens at prediction time.

    def _predict(self, x):
        # Compute distances between x and all examples in the training set
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]

        # Sort by distance and keep the indices of the k nearest neighbors
        k_idx = np.argsort(distances)[:self.k]

        # Extract the labels of the k nearest neighbor training samples
        knn_labels = [self.y_train[i] for i in k_idx]

        # Return the most common class label
        most_common = Counter(knn_labels).most_common(1)
        return most_common[0][0]

Let me walk through this step by step:

  1. We calculate the distance between our point and all training points.
  2. We find the k closest points.
  3. We look at the labels of these k points.
  4. We return the most common label. That's our prediction! (More on that Counter trick just below.)
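
If the most_common bit looks cryptic, here's exactly what it returns, and why we index with [0][0]:

from collections import Counter

votes = Counter([1, 1, 0]).most_common(1)
print(votes)        # [(1, 2)] -- a list of (label, count) pairs
print(votes[0][0])  # 1 -- the winning label, which is what _predict returns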
Finally, the public predict method just calls _predict on every sample and collects the results:

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)
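
Let's give it a quick end-to-end spin. The points and labels below are just toy data I made up: two little clusters, class 0 near the origin and class 1 near (5, 5).

X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNN(k=3)
knn.fit(X_train, y_train)

X_test = np.array([[0.5, 0.5], [5.5, 5.5]])
print(knn.predict(X_test))  # [0 1]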

And there we have it! Pretty cool, huh?

Conclusion

This KNN implementation might not be as optimized as the ones you'll find in libraries (scikit-learn, for instance, can swap our brute-force distance scan for KD-trees or ball trees), but it gets the job done and, more importantly, helps you understand what's really going on under the hood. You can find the full implementation on GitHub.

Feel free to clone it, play around with it, and make it your own. Happy coding! :)