Machine learning from scratch (part 2)

Erik Munkby
14 min readNov 3, 2022

--

Gain a deeper understanding by looking beyond your go-to ML libraries.

Figure 1: Image by Allison Horst, the three mascots of the data in this blog series.

In this day and age, applying machine learning (ML) means installing the best library for the job, rather than coding the algorithms yourself. And when we have libraries such as scikit-learn backed by 2484 contributors, or tensorflow backed by 3196 contributors, why wouldn’t you? The answer to the question above — and the reason this series of articles exist — is to have a better understanding of fundamental concepts! I will take you through a journey of many of the most important ML concepts, by walking through these basic algorithms. With a heavy emphasis on visualizations together with the underlying code of the algorithms, you will finish this article series with more confidence in the field of ML. To make things interesting, I’ll do this only using python and numpy, removing my reliance on the obvious ML libraries.

This series is directed at those who have some experience interacting with machine learning libraries, but wish to learn more about how they actually work.

Algorithms (ML models) and concepts in this article

In the previous installment of this blog series, Machine Learning from Scratch (part 1), we took a look at the algorithms k-NN and k-means. Both of these algorithms’ primary objective is to predict/identify classes within the data. In this post we will instead attempt to build a model able to predict continuous values — predicting penguin weights using their culmen length.

Imagine we hiked all the way out the Biscoe islands only to realize — we forgot the scale. However for some reason our data entry program will only allow entries that includes both culmen length and weight. Luckily we brought our measurement tape and computer. In order to not waste this entire trip we quickly build an algorithm using previous entries, that allows us to predict the penguins’ weight using the culmen length measurement.

We will approach this prediction as a linear regression problem, and build the model using three algorithms:

  • Ordinary Least Squares (OLS)
  • Gradient Descent (GD)
  • Stochastic Gradient Descent (SGD)

Taking us through this journey will be our three penguin friends visualized in Figure 1. These three penguin species come from the dataset Palmer Archipelago (Antarctica) penguin data, which is a (subjectively) cuter alternative to the Iris flower dataset. It is built to work for both classification and regression problems. In the figure below you can see a brief example of the data, and in this post we will focus on the culmen (beak) features together with the body mass. For a full data analysis check this notebook. Let’s get into it!

Figure 2: Image by Author, sample data for each penguin species.

Can we build a linear model?

The first question we should ask ourselves is: can we build a linear model for the task, predict penguin weight using culmen length? The first step to answer this is — like most data problems — plot the data we are working with.

Figure 3: Image by Author, distribution of penguin species by weight & culmen length.

Studying the graph above, we can see that there is more linear correlation within each species than among all three species as a whole. More specifically Chinstrap (purple) seems to deviate from the other two species in the relation between culmen length and body mass. Since we are focusing on a regression problem estimating a continuous variable, it can in situations like this be preferable to isolate each class into their own linear model. As such we will further on limit ourselves to build the initial model using only entries from the Gentoo (green) penguin species.

Figure 4: Image by Author, the distribution of Gentoo penguins by weight & culmen length.

While somewhat noisy, there still seems to be a linear correlation throughout the data entries we have.

Ordinary Least Squares (OLS)

The first algorithm we will use in our attempt to find a linear explanation of the culmen length and weight relationship is ordinary least squares (OLS). One could argue that OLS is more statistics than machine learning due to its non-iterative calculations, but in my opinion they are nothing but two different flavors of the same delicious dessert. If you do want to separate the two the somewhat consensual definition is in what you use it for — to draw conclusions about data (statistics), or to generalize and predict unseen data (machine learning).

The algorithm built as matrix operations for OLS can be seen in Equation 1 below. X represents a matrix (input data) with one column per feature, and y represents the output of the function f(X). This equation gives us what we will refer to as beta, which is the coefficients including bias b0 [b0, b1, …, bN]. We can then use this beta to predict the y of previously unseen values of X.

Equation 1: Ordinary Least Squares (OLS)

I know this is still vague for you who have not encountered OLS before, let’s apply it to our data to make it more understandable. In our use case we are only working with two dimensions, x (culmen length) and y (body mass). We are trying to find the optimal coefficients for the linear equation y = kx + m. In other words the output of the OLS equation above will be an array beta with two values [b0, b1] or synonymously [m, k] in regards to the linear equation y = kx + m.

In the linear equation y = kx + m formula, the term m is referred to as the intercept i.e. where the line crosses the y-axis. And the term k is referred to as the slope i.e. the ratio of vertical change as we move along the x-axis.

In python+numpy OLS can be implemented as seen below. The implementation is split up into individual steps together with comments for more readable code. The initial X feature matrix is extended with a unit vector np.ones(len(X)) in order to make it possible to calculate the intercept. The @ symbol is shorthand for the dot product in numpy.

Applied to our data we get the coefficients beta=[-12.7, 107.3], or as the linear equation: y=107.3x — 12.7. We can visualize this line by calculating the y for x=35 and x=65, then plot the line between these points as seen in the figure below.

Figure 5: Image by Author, linear regression applied to our data using OLS.

There we go, we now have the best possible fitted linear representation of our data! But what does best possible mean? When using Ordinary Least Squares best possible means— referenced by the name — that we have the line with the least amount of squared errors. These errors are calculated as the distance of the errors parallel to the y-axis, the residuals. In the figure below we can see the residuals for a subset of data points, each with the predicted y (the line) measured against its true y value.

Figure 6: Image by Author, residuals (distance) plotted between data points and OLS line. A smaller subset of datapoints have been sampled to make the residuals more clearly visible.

We then square each distance (residual) for two reasons: to keep residuals positive, and to punish each error more the further away the predicted values are from the true value. Finally we take the mean (average) of all these squared errors. The name of the equation described above is Mean Squared Error (MSE). From the perspective of OLS, the lowest possible output of MSE is the best possible fit. The equation for MSE can be seen below, followed by the python code for predicting the y values and then calculating the MSE.

Mean squared error belong to a group of functions referred to as cost, or loss functions. A cost function’s purpose is to provide a measurement on the difference between the estimated (predicted) and true values. One important feature of cost functions is that they are differentiable.

Equation 2: Mean Squared Error (MSE)

Calculating the cost using MSE for our linear representation leaves us with a cost of 137465. For those who have studied loss graphs before, a cost of above a hundred thousand seems pretty high! The high cost value is due to the combination of high values on the y-axis, which lives within the thousands, and the squared nature of MSE. This will lead to some issues which we will discuss in the next segment.

Gradient Descent

We will now calculate the beta for the same dataset again, using a versatile algorithm called gradient descent (GD) instead of OLS. But why solve it in another way, when we just went straight to the optimal using OLS? There are two primary reasons for using GD over OLS:

  • OLS is slow compared to GD when the number of features (columns) is large
  • Ordinary least squares is limited to finding the best solution in regards to the squared residuals. GD on the other hand can use any differentiable cost function, e.g. categorical cross-entropy which you might have stumbled upon within machine learning.

For educational purposes however, we will not introduce another cost function and instead stick to MSE. Via the GD approach we will iteratively optimize our beta in order to minimize the cost function. The way to do this is quite straight forward for those of you who remember school math: we calculate the derivative of our cost function (this is why it must be differentiable!). Then we take a small step in the negative direction of the derivative of the cost function, which we from now on will refer to as the gradient. The size of the step in respect to the gradient is called learning rate. The learning rate we want to keep small in order to not overshoot our optimal beta, while not too small since then we might not get anywhere.

In fact, when we say train a model — any machine learning model — we just mean that we iteratively minimize the cost according to the chosen cost function.

The most common visualization of gradient descent is usually describe as a bowl in as seen in Figure 7 below. Here we iteratively take steps towards what we hope to be the global minima. In this scenario we have not changed the learning rate: the length of the steps simply becomes a function of both the steepness of the curve as well as the learning rate. Each step of this iterative approach is summarized by Equation 3 below.

Figure 7: Image by Author, gradient descent towards the minima.
Equation 3: step-wise adjustment of beta in gradient descent.

Using the Equation 3 above, each step in the iteration we take our current beta (line coefficients) and subtract (take a negative step) towards the derivative of our cost function multiplied by our learning rate (alpha). The derivative of MSE in respect to our beta can be summarized in the steps seen in Equation 4 below. Due to the chain rule and partial derivatives we end up multiplying our sum of errors by xij where j represents the coefficient index in our beta (remember in this article we only have two beta coefficients [m, k]).

Equation 4: Derivative of MSE in respect to beta where j is the beta coefficient index

Changing our calculations to matrix operations, and realizing that y hat is equal to the dot product of X and beta, we can rewrite the equation into what we see in Equation 5.

Equation 5: Derivative (gradient) of MSE as matrix operations.

Below you find the implementation of above equation, broken up into commented segments.

Now we just need to run this function step by step until we seem to have reached convergence. This is indicated by studying our graph of losses (cost function outputs) per step. This iterative approach, our complete gradient descent algorithm, can be implemented as seen below.

Now let’s see it in action!

Applying Gradient Descent

In the beginning of applying gradient descent to a dataset it is a good idea to start with a low learning rate, we will set ours to 0.001. Running the algorithm for 10 steps we can plot the loss per iteration (when we have iterated over an entire dataset once we refer to that as one epoch).

Figure 8: Loss per epoch using gradient descent using learning rate 0.001

Uh-oh, the loss over time is supposed to trend downwards, what is happening in the graph above is the complete opposite. Even though we set a relatively low learning rate, it seems that the learning rate was still high enough to overshoot the gradients. What has happened is that since the very first step of our supposed “descent” we took a step too far, making us end up on the opposite side of the gradient descent bowl. More than that, we took a step to the other side, but also higher up landing in a greater steepness. Thus our step number 2 and each subsequent steps will be overshooting the bowl even more. In the two graphs below we apply the beta received out of each epoch step and plot:

  • Graph 1: The GDS line, built by the GDS beta, next to our previously calculated OLS line and the original data.
  • Graph 2: The GDS bowl as we saw in Figure 7, however since we have two coefficients in our beta, this bowl is seen from “above”, with colors indicating the cost (or steepness). On the x-axis we have the range of possible values for the intercept (m), and on the y-axis the range of possible values for the slope (k). Thanks to us previously calculating the beta via OLS, we also know the location of the global minima we are looking for.
Figure 9: Image by Author, top graph describes GDS line over epochs. Bottom graph describes GDS algorithm trying to find the global minima of linear equation.

Our first change will be execute our gradient descent again but with a lower learning rate: 0.0001.

Figure 10: Image by Author, gradient descent using learning rate 0.0001

Now our loss trends downwards like just like we want, and it seems to have converged (found the global minima) somewhere around epoch 60–80.

Figure 11: Image by Author, top graph describes GDS line over epochs. Bottom graph describes GDS algorithm trying to find the global minima of linear equation.

In contrary to most real life cases, in our example we actually know what the ideal linear fit looks like. Looking at the top graph in Figure 11 we see that we do not properly reach convergence until around epoch #130… And in addition to that, the bottom graph in Figure 11 shows that we never reached the optimal beta at all. The reason for this is due to our “birds eye view bowl” looking more like a half pipe than a bowl. If you take a look at the top graph of the figure above and make a note of the values of x- and y-axis, the values in the y-axis are roughly one hundred times higher. The effect of this we see in the bottom graph — it does not matter which intercept value we choose in regards to the cost. There is luckily a way to fix this called feature scaling.

Feature Scaling

Feature scaling is the method of manipulating each feature individually in order to globally put them on a more equal playing field. The type of feature scaling we will use in this post is called feature standardization. Standardizing our features will cause them to have zero-mean and unit-variance. This is achieved by subtracting each value with the mean of its feature, and then dividing by the standard deviation. In order to be able to flip back to its original values, we need to store the calculated mean and standard deviation in some kind of data structure. Below is an implementation of such a feature scaler built to mimic scikit-learn syntax:

Using these newly scaled features we can plot them side by side, and see that the shape of the original distribution is equal to what the distribution looked like previously.

Figure 12: Image by Author, distribution comparison of original and standardized features.

Since our values are far smaller now, we also have to adjust the learning rate. In Figure 13 below we now in the bottom graph see the “birds eye view bowl” we have been looking for. In the top graph we have rescaled the culmen length and body mass back to its original values before plotting.

Figure 13: Image by Author, top graph describes GDS line over epochs. Bottom graph describes GDS algorithm trying to find the global minima of linear equation.

It does however take us almost 200 epochs to reach the optimal beta. Which for those who have been dabbling in neural networks, is a lot of epochs. What we can do is instead calculating the gradient by looking at the entire dataset, we can optimize our beta using every single datapoint by themselves! This approach to gradient descent is called Stochastic Gradient Descent.

Stochastic Gradient Descent (SGD)

Yes you read the previous sentences correct, we can update our beta once on every single data point. The effect of using stochastic gradient descent means we can now reach convergence much faster since we update our beta many times each epoch. The drawback is that we might not take a step directly towards the global minima each update, but the assumption (and effect) is that overall we will head towards the global minima. In Figure 14 below we can see that the road looks more bumpy than before, but we reach our lowest possible loss after only around 5 epochs!

Figure 14: Image by Author, loss over epochs using Stochastic Gradient Descent.
Figure 15: Image by Author, top graph describes GDS line over epochs. Bottom graph describes GDS algorithm trying to find the global minima of linear equation.

Taking a look at the animated graphs above we can see why the loss graph looks bumpy. We reach a point close to the optimal beta, but as we continuously iterate we instead hop further away from the optimal point. There are two primary ways to approach this issue:

  • Early stopping criterium: we either store a model per epoch and retrospectively select the one with the lowest loss. Or we can set a threshold which if crossed prematurely stops the algorithm.
  • Decaying learning rate, some more advanced optimization algorithms adjust the learning rate over time. Thus we can reduce the amount of descent “hopping” as we progress closer to the optimum.

The implementation required for going from Gradient Descent to Stochastic Gradient Descent is very minor. We only need a function to randomly (without replacement) select indices in our data as we go through the dataset. In effect we get a loop of loops.

Conclusion

In this post we have built a model in order to predict (or estimate) the weight of Gentoo penguins based on their culmen length. We started off using Ordinary Least Squares (OLS) which gave us the optimal beta from the point of view of Mean Squared Error (MSE). Following that we found the optimal beta via the iterative algorithm Gradient Descent instead. Gradient Descent allows us to use cost functions other than MSE. This approach taught us the importance of feature scaling, which made us standardize our features. Finally in order to reduce the number of iterations required we implemented the Stochastic Gradient Descent algorithm, which reduced the number of required iterations for convergence from more than 130 epochs to roughly 5 epochs.

All code used within this post, including both algorithms and plot code, can be found in my github repo. In the next post of this series, we will implement a neural network from scratch, stay tuned and follow!

--

--

Erik Munkby
Erik Munkby

Written by Erik Munkby

ML Engineer and Data Driven Culture Champion | Writing about ML, Data Science and other data related things | Co-founder of Data Dao

No responses yet