Machine learning from scratch (part 1)

Erik Munkby
8 min read · Oct 18, 2022


Gain a deeper understanding by looking beyond your go-to ML libraries.

Figure 1: Image by Allison Horst, the three mascots of the data in this blog series.

In this day and age, applying machine learning (ML) means installing the best library for the job and experimenting, rather than coding the algorithms yourself. And with libraries such as scikit-learn backed by 2484 contributors, or TensorFlow backed by 3196 contributors, why wouldn't you? The answer to that question, and the reason this series of articles exists, is to gain a better understanding of fundamental concepts! I will take you on a journey through many of the most important ML concepts by walking through these basic algorithms. With a heavy emphasis on visualizations together with the underlying code of the algorithms, you will finish this article series with more confidence in the field of ML. To make things interesting, I'll do this using only Python and numpy, removing my reliance on the obvious ML libraries.

This series is directed at those who have some experience interacting with machine learning libraries, but wish to learn more about how they actually work.

Algorithms (ML models) and concepts in this article

The blog series is split into multiple parts to keep the number of concepts per article digestible. In this article we will build the following algorithms from scratch:

  • k-NN (k-Nearest Neighbors)
  • k-means

Taking us through this journey will be our three penguin friends visualized in Figure 1. These three penguin species come from the dataset Palmer Archipelago (Antarctica) penguin data, which is a (subjectively) cuter alternative to the Iris flower dataset. It is built to work for both classification and regression problems. In the figure below you can see a brief example of the data, and in this post we will focus on the culmen (beak) features together with the species. For more data exploration of the dataset check out this notebook. Let’s get into it!

Figure 2: Image by Author, sample data for each penguin species.
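If you want to follow along, loading the culmen features could look like the sketch below. The file name penguins_size.csv and the exact column names are assumptions based on the Kaggle dataset, and pandas is used here purely for loading; the algorithms themselves will stay numpy-only.

import numpy as np
import pandas as pd  # used only for loading; the algorithms stay numpy-only

# Assumed local copy of the Kaggle "Palmer Archipelago (Antarctica) penguin data";
# file name and column names are assumptions, adjust them to your local copy.
penguins = (
    pd.read_csv("penguins_size.csv")
      .dropna(subset=["culmen_length_mm", "culmen_depth_mm", "species"])
)

X = penguins[["culmen_length_mm", "culmen_depth_mm"]].to_numpy()  # features, shape (n, 2)
y = penguins["species"].to_numpy()  # labels: "Adelie", "Chinstrap", "Gentoo"
print(X.shape, np.unique(y))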

k-NN (k-Nearest Neighbors)

Our first goal will be to build a model that can predict the correct penguin species label using only the culmen features. Framing the task like this is what's called a supervised classification problem: using previously known data points (penguin observations and measurements), we build a model that can predict the class of unseen data points. To start off, let's plot the distribution of the penguin species data points using culmen_length_mm as our x-axis, and culmen_depth_mm as our y-axis:

Figure 4: Image by Author, distribution of the three penguin species using culmen features.
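A scatter plot like the one above takes only a few lines of matplotlib (plotting sits outside the from-scratch constraint, which only covers the algorithms). A rough sketch, reusing the X and y arrays from the loading snippet:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for name in np.unique(y):  # one colored group per species
    subset = X[y == name]
    ax.scatter(subset[:, 0], subset[:, 1], label=name, alpha=0.7)
ax.set_xlabel("culmen_length_mm")
ax.set_ylabel("culmen_depth_mm")
ax.legend()
plt.show()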

In the figure above we can see a fairly clear separation between species based on the culmen features, although there is still some overlap around the center. k-NN works by looking at, you guessed it, the nearest neighbors of a data point among the already classified data points. Before we can do that, we need to define what a neighbor is. This is accomplished by selecting your preferred distance metric.

Distance metrics

The first building block of the algorithms we will work with in this article is the euclidean distance metric. In a nutshell, euclidean distance is the shortest distance between two points, calculated using the Pythagorean theorem. Aside from the euclidean distance, two other frequently used distance metrics are the manhattan (city block) distance and the cosine distance. Manhattan distance only travels parallel to the dimensional axes. It is usually preferred when working with a large number of features (dimensions), as it suffers less from the curse of dimensionality. Cosine distance considers the angle between two points in an N-dimensional space, ranging from [0, 1] where 0 denotes no difference in angle. Cosine distance is preferred when the similarity (distribution) of the feature values is more important than the actual values. The figure below visualizes these three distance metrics, with each distance shown in parentheses in the legend.

Figure 3: Image by Author, the three most common distance metrics.
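As a quick illustration (with two arbitrary example points of my own, not the ones in the figure), all three metrics can be computed in plain numpy:

import numpy as np

a, b = np.array([3.0, 2.0]), np.array([1.0, 5.0])  # two arbitrary example points

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))  # only travels parallel to the axes
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1 - cos(angle between a and b)

print(euclidean, manhattan, cosine)  # approximately 3.61, 5.0, 0.29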

The Python implementation of the euclidean distance metric can take many forms, but for slightly cleaner code, and to fit the k-NN algorithm, in this article we will calculate the euclidean distance in a one-vs-all fashion: one new point against all existing points.
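The original post embeds this as a gist; the snippet below is a minimal reconstruction rather than the author's exact code, and the function name is my own.

import numpy as np

def euclidean_distance(point: np.ndarray, others: np.ndarray) -> np.ndarray:
    """Distance from a single point, shape (d,), to every row of others, shape (n, d)."""
    squared_diffs = (others - point) ** 2  # element-wise squared differences, shape (n, d)
    return np.sqrt(squared_diffs.sum(axis=1))  # sum over the feature axis, then take the root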

If we were to add a new point at position [3, 2] in Figure 3, the first step of the euclidean calculation (the element-wise squared differences) would look like:
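Writing the two reference points in Figure 3 as (x1, y1) and (x2, y2), since their exact coordinates are not reproduced here, that first step gives the squared differences (xi - 3)^2 and (yi - 2)^2 for each point, and the full distances then follow as:

di = sqrt((xi - 3)^2 + (yi - 2)^2), for i = 1, 2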

Calculating the euclidean distance between point [3, 2] and the two points in Figure 3.

Building the model

In contrast to most machine learning algorithms, k-NN does not require any training. Instead, the algorithm retains all the data points within the model and uses them to predict the class of new data points. k-NN can be summarized as: for a previously unseen data point, find the k closest neighbors, then take the most frequent class among those neighbors and assign that class to the new data point. In Python this could look like the sketch below.
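This is a reconstruction rather than the post's embedded gist, assuming the euclidean_distance function from the earlier sketch is available; the names are my own.

from collections import Counter
import numpy as np

def knn_predict(new_point: np.ndarray, X: np.ndarray, y: np.ndarray, k: int) -> str:
    """Predict the class of new_point from its k nearest neighbors in (X, y)."""
    distances = euclidean_distance(new_point, X)  # one-vs-all distances, shape (n,)
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    return Counter(y[nearest]).most_common(1)[0][0]  # most frequent class among them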

The only thing left now is to select the algorithm’s namesake, our hyperparameter k. Selecting the correct k for the k-NN is not entirely trivial. In the figure below, we have selected a point (marked as a black diamond) and attempt to predict its class using the k-NN algorithm above. Taking a look at the animated figure’s title, we see that selecting a k>6 changes our predicted class from Chinstrap (purple) to Adélie (orange).

Figure 5: Image by Author, animated class prediction of black diamond using k-NN.

The most straightforward approach to selecting your k is to iterate through a range of possible k values and use each k to predict classes for all data points, then compare the accuracy of each k on the original dataset. This is done by sequentially going through all data points in the dataset, temporarily removing them one at a time, and attempting to predict their original class. Additionally, it is usually bad practice to select even values for k, since that leaves the possibility of a tie between classes among the neighbors. In the figure below we see that the best possible k for classification based on the culmen size features is k=5 or k=11. Following Occam's razor, we should pick k=5.

Figure 6: Image by Author, accuracy for each k ranging from 3 to 13.
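A sketch of this leave-one-out sweep, assuming the knn_predict function and the X and y arrays from the earlier snippets (not the post's exact code):

import numpy as np

def knn_accuracy(X: np.ndarray, y: np.ndarray, k: int) -> float:
    """Leave-one-out accuracy: predict each point after temporarily removing it."""
    correct = 0
    for i in range(len(X)):
        X_rest = np.delete(X, i, axis=0)  # the dataset without the i-th point
        y_rest = np.delete(y, i)
        correct += knn_predict(X[i], X_rest, y_rest, k) == y[i]
    return correct / len(X)

accuracies = {k: knn_accuracy(X, y, k) for k in range(3, 14, 2)}  # odd k from 3 to 13
best_k = max(accuracies, key=accuracies.get)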

To summarize, we built a k-NN algorithm based on the euclidean distance metric. k-NN is a supervised classification algorithm, since we use previous knowledge of existing data points' classes to predict the class of previously unseen data points. We selected k=5 since it had the highest accuracy while using fewer neighbors than k=11, which achieved the same accuracy.

k-means

Using the building blocks created for the k-NN algorithm, and adding a few more, we end up with another very popular machine learning algorithm: k-means. While the two share a similar name and a similar mathematical approach, their use cases are very different. k-means is a clustering algorithm, i.e. we try to separate data points into different groups based on their attributes (features) rather than a label. Clustering algorithms are used when we expect multiple groups of entities (data points) but do not yet have them separated by existing labels. In our case we do have this separation via the species feature, but for the purpose of this article we will pretend that we do not. These properties categorize k-means as an unsupervised algorithm, in contrast to k-NN. The shared prefix k in the two algorithms refers to their primary hyperparameter. For k-means, k refers to how many clusters we expect the dataset to contain. In our case we expect three clusters, since there are three penguin species. In pseudo code, k-means can be summarized as:

randomly assign each data point a cluster label in [0, k)
for n iterations:
    calculate the cluster center points
    reassign each data point to the label of its closest cluster center
    if the cluster centers did not move, we have converged

We continue this loop until we run out of iteration steps, or until we have converged. Converging, in this implementation of k-means, means that upon recalculation the cluster centers no longer differ from the previous iteration. How a cluster center is calculated depends on the chosen distance metric; for now we will stick with the euclidean distance metric, in which case the position of each cluster center (or cluster mean) is simply the average (mean) of that cluster's x- and y-axis positions. In the figure below, we start by randomly assigning each data point a cluster label, denoted by its color. Following that, we add the initially calculated cluster center points. After that, we iteratively reassign each data point to the class (label) of its closest cluster center. For visualization purposes (not part of the algorithm) we have also calculated the decision boundaries, shown as the color-filled areas. This continues until we have converged on a final cluster setup, or until we reach the maximum number of iterations. The final frame of the animation changes each point back to its true class color.

Figure 7: Image by Author, animated visualization of k-means iterations. Final frame shows points in their true class colors.
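With the euclidean metric the cluster centers are just per-cluster means. A small helper for this, reused in the full implementation further below (the name cluster_centers is my own):

import numpy as np

def cluster_centers(X: np.ndarray, labels: np.ndarray, k: int) -> np.ndarray:
    """Cluster centers as the feature-wise mean of the points in each cluster."""
    # Note: an empty cluster would produce a NaN center; not handled in this sketch.
    return np.array([X[labels == c].mean(axis=0) for c in range(k)])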

As seen in the last frame of the animation the final clusters are not ideal. We seem to be able to separate the orange class (Adélie) from the rest, but green (Gentoo) and purple (Chinstrap) classes have too similar culmen sizes to properly be separated using k-means. Taking a look at the distribution of species in Figure 4, a potential improvement could be to calculate clusters using the cosine distance metric instead of the euclidean.

Below is the code for the final implementation of the k-means algorithm, with each step explained via the comments.
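Since the embedded gist is not reproduced here, the following is a reconstruction that follows the pseudocode above and reuses euclidean_distance and cluster_centers from the earlier sketches; the function and parameter names are my own, not necessarily the author's.

import numpy as np

def k_means(X: np.ndarray, k: int, n_iterations: int = 100, seed: int = 42) -> np.ndarray:
    """Cluster X into k groups, returning one cluster label per row."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))  # randomly assign each point a cluster label
    centers = cluster_centers(X, labels, k)  # initial cluster center points

    for _ in range(n_iterations):
        # Reassign each point to the label of its closest cluster center.
        labels = np.array([np.argmin(euclidean_distance(x, centers)) for x in X])
        new_centers = cluster_centers(X, labels, k)
        if np.allclose(new_centers, centers):  # centers did not move -> converged
            break
        centers = new_centers

    return labels

cluster_labels = k_means(X, k=3)  # three clusters, one per penguin species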

Conclusion

In this post we analyzed the Palmer penguins dataset using the supervised classification algorithm k-NN and the unsupervised clustering algorithm k-means. Both algorithms were implemented using the euclidean distance metric. Using the k-NN algorithm we managed to achieve an accuracy of 92% with the hyperparameter k=5. In the k-means attempt we reached convergence after three iterations; however, we also discovered that, given the chosen features and the euclidean distance metric, we could not sufficiently separate all three species. A potential improvement could be to introduce more features, or to change the distance metric to cosine distance.

All code used in this post, including both algorithms and the plotting code, can be found in my GitHub repo. In the next post in this series we will, among other things, dig into regression, cost functions and gradient descent. Follow me on Medium so you don't miss the next part!


Erik Munkby

ML Engineer and Data Driven Culture Champion | Writing about ML, Data Science and other data related things | Co-founder of Data Dao