Anomaly detection made easy

Erik Munkby
9 min read · Apr 21, 2023


Establish data trust using automated anomaly detection!

Photo by Florentine Pautet on Unsplash

Trusting your data is an essential prerequisite for making decisions based on analysis. The struggle is how to start trusting your data in the first place. The risk of working with data that is neither tested nor monitored is that errors slip into downstream applications and analyses without anyone ever noticing. We end up in territory we can call unknown unknowns: the data is incorrect, but we don't know it. Finding these faulty data points (anomalies) is difficult, since anomalies are by definition things that have not been seen before.

In this blog post I will give you some ideas and tools to turn these unknown unknowns into known unknowns: you will learn how to build your own anomaly detection. We will do this in building blocks: first constructing an expected value, then using that expected value to build outer boundaries that decide when to send an anomaly alert. Finally, we will also take weekly seasonality into account.

Setting the stage

In this blog post we will look at BeerioCart, an e-commerce site selling beer. BeerioCart gathers a bunch of different data points about its online visitors in order to improve and optimize the user experience of its website. Lately, two primary metrics have been mistrusted due to previously unseen behaviour: the number of visitors on the site, and how long they spend there. To make matters worse, the bug reports came from downstream consumers of the data long after the issues first occurred. The data team of BeerioCart wants to reduce the Time-to-Detect (TTD) of faulty data, ideally catching it before anyone has time to make decisions based on it. To achieve this they build an automated anomaly detection system that is triggered any time new data arrives.

In order to figure out which data points are potential anomalies, we need two primary components: an expected value for where we think a data point should be, together with boundaries for how far away from that value we are still comfortable trusting the data. If we take a look at the number of daily visitors of BeerioCart together with these boundaries, a detected anomaly can look something like below:

Image by Author: An anomaly example, a data point existing outside of its boundaries.

In the above visualization we have a trend line (green) together with an expected area, with the anomaly highlighted in orange where the trend line steps outside the expected area. The filled-in area has an upper and a lower boundary, within which we expect the values to stay. Now that you have a grasp of where we want to go, let's start figuring out how to get there!

Building the estimated value

When starting out with anomaly detection, my recommendation is to opt for an easy-to-implement, easy-to-understand algorithm. A "simpler" algorithm might lead to more false positives at first, but this can be fine-tuned, as explained later in the blog post. In addition, you will have more control over how and when anomaly flags are triggered, which in turn will help you build a better understanding of the behaviour of your data!

I will work with three different expected value estimation algorithms, two averaging algorithms and one regression algorithm:

  • Simple Moving Average (SMA)
  • Exponential Moving Average (EMA)
  • Ordinary Least Squares (OLS)

For all of these algorithms we will work with the 15 previous data points, in order to build an estimate of what data point number 16 should be. Let's first zoom out a bit and take a look at what the number of visitors per day looks like for BeerioCart over a longer time span.

Image by Author: BeerioCart website’s number of visitors per day.

In the visualization above, the number of visitors moves between ~6k and ~10k, with a slight upwards trend as seen in the trend line. The number of BeerioCart visitors also seems to follow a weekly pattern, with recurring spikes and valleys every 7th day.

In the visualization below we have chosen an arbitrary date, the 7th of April, as the target for our expected value. We use the 15 previous dates to figure out the trend: at least two seasonal cycles (i.e. two weeks, as suggested by the visualization above), though this number can also be changed and optimized. Using Ordinary Least Squares (OLS) we fit the trend and get an estimate of what value date number 16 (the 7th of April) is supposed to be at (if you want to learn more about Ordinary Least Squares, check out my Machine Learning from scratch (part 2) blog post).

Image by Author: Expected value estimation using Ordinary Least Squares (OLS).

In comparison to OLS, Simple Moving Average (SMA) and Exponential Moving Average (EMA) build an average of previous dates instead of fitting a trend. Both are moving averages: SMA weighs all included values equally (i.e. a regular simple average), while EMA gives higher weight to more recent values. The difference lies in how much impact older and newer data points have. Applying the same methodology to all three algorithms (using the 15 previous data points to estimate the next) we can compare the estimated values against the true values over time in the visualization below.

Image by Author: Comparison of different value estimation algorithms marked in dashed lines.

In the visualization above we see that SMA is the value estimation with the least variance. OLS seems to lag behind by one day, a result of fitting trend lines when there are big short-term movements up or down. Finally, EMA seems to be a mix of the two: slightly more adaptive than SMA, but without the lag effect of OLS. As a good in-between, going forward we will use EMA to build our expected values.
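For concreteness, here is a minimal sketch of the three estimators in Python. The function names, the EMA smoothing factor, and the sample visitor counts are my own illustrative choices, not values from BeerioCart's actual data:

```python
import numpy as np

def sma_estimate(window):
    """Simple Moving Average: every point in the window weighs equally."""
    return float(np.mean(window))

def ema_estimate(window, alpha=0.2):
    """Exponential Moving Average: more recent points weigh more.
    alpha (the smoothing factor) is an illustrative choice."""
    estimate = float(window[0])
    for value in window[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate

def ols_estimate(window):
    """Ordinary Least Squares: fit a straight line through the window
    and extrapolate it one step ahead."""
    x = np.arange(len(window))
    slope, intercept = np.polyfit(x, np.asarray(window, dtype=float), deg=1)
    return float(slope * len(window) + intercept)

# Estimate data point 16 from 15 (made-up) daily visitor counts
history = [6400, 7100, 8300, 9000, 7600, 6800, 6200,
           6500, 7300, 8500, 9200, 7700, 6900, 6300, 6600]
print(sma_estimate(history), ema_estimate(history), ols_estimate(history))
```

Note how `ols_estimate` extrapolates the fitted line, which is exactly why it overreacts to short-term swings, while the two averages stay within the range of the window.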

Building the boundaries

The most straightforward option for building the boundaries is to base them on how much variance we find within the data. The empirical rule says that, given normally distributed data, we will find 68% of the data within 1 standard deviation (sigma, σ) of the mean, 95% within 2σ, and 99.7% within 3σ. Choosing the correct distance from the estimated value is a trade-off between true and false positives. We want to be sure that all true positives (anomalies) are alerted on, without the hassle of investigating false alerts too frequently.

The best way is to start off with narrower boundaries and then iteratively widen them as you get to know your data better (you can also use fractional coefficients, e.g. 1.3σ). Going forward in this blog post, we will use a 2σ distance from the EMA estimate to build the boundaries.


Let’s apply it on some data!

Now that we have built our boundaries, placing the upper and lower limits 2σ (standard deviations) away from our EMA (exponential moving average) estimate, we can see if any anomalies appear.

Image by Author: Anomaly detection for number of visitors.

In the visualization above we detect a potential anomaly on Thursday, March 17. Great! The next step is one more tool for pin-pointing actual anomalies and reducing the number of false positives.
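Putting expected value and boundaries together, a minimal sketch of this detection step in pandas might look as follows. The function and column names are my own; the `shift(1)` makes sure a data point never influences its own boundaries:

```python
import pandas as pd

def detect_anomalies(values, window=15, n_sigma=2.0):
    """Flag values outside EMA ± n_sigma * rolling std, where both
    statistics are computed from the preceding points only."""
    s = pd.Series(values, dtype="float64")
    expected = s.shift(1).ewm(span=window).mean()  # EMA of previous points
    sigma = s.shift(1).rolling(window).std()       # spread of previous points
    out = pd.DataFrame({
        "value": s,
        "expected": expected,
        "lower": expected - n_sigma * sigma,
        "upper": expected + n_sigma * sigma,
    })
    out["anomaly"] = (out["value"] < out["lower"]) | (out["value"] > out["upper"])
    return out

# Illustration: a flat series with one sudden spike; only the spike
# should fall outside the boundaries
report = detect_anomalies([8000.0] * 20 + [12000.0])
```

The first `window` rows get no boundaries (their rolling statistics are undefined), which is the honest thing to do before enough history exists.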

Approaching seasonality

Seasonality is the term for recurring, expected changes in data over time, e.g. shopping-focused holidays such as Christmas, or market-specific seasons such as the winning combo of ice cream and summer. Seasonality can also appear on a smaller scale, as recurring fluctuations week by week. The question is: which of these do we care about? The answer depends on context. If we want to make sure our warehouse has enough stock, maybe we ignore the weekly fluctuations since we don't get daily deliveries, but we definitely want extra beer on our shelves before Christmas. For anomaly detection, on the other hand, we most likely care more about the weekly fluctuations. If we get an anomaly alert a few times per year on major holidays, we already have a hunch as to why they happen. But if we get an alert every Tuesday, because for some reason people love buying beer on Tuesdays, it will quickly become annoying.

One approach to handling weekly seasonality is to bake it into the model. The simpler approach is to attack it directly, by comparing apples to apples. Studying the longer pattern of number of visitors earlier in the post, we saw something that looked like weekly seasonality. So let's stop treating different weekdays the same and instead look at them individually, visualized in the scatter plot below:

Image by Author: Scatter plot of number of visitors broken up by each weekday.

In the visualization above we see that Thursdays and Sundays in particular behave differently, with generally lower and higher numbers of visitors respectively. It turns out that maybe people want to get that online order of beer in before the week starts. If we apply the same method as before (using the 15 previous data points to estimate the next) but exclusively look at the same weekday (i.e. to build boundaries for Sunday 16, we use Sundays 1–15), we get the new visualization seen below.

Image by Author: Weekday based anomaly detection.

Suddenly we have slimmer boundaries, and our detected anomaly changed date! It turns out our previous anomaly, Thursday the 17th, was not abnormal from the perspective of being a Thursday, a day with generally fewer visitors. Instead we now have a new anomaly on Friday the 18th! Additionally, you can see that the colored area (the boundaries) is significantly tighter than before. This is because a lot of the variance can be explained by the weekly seasonality, i.e. each weekday is much more consistent when compared to itself.
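A sketch of this weekday-aware variant, again in pandas. Grouping by weekday before applying the rolling statistics is all that changes; the `min_periods` value and the synthetic illustration data are my own assumptions:

```python
import pandas as pd

def detect_weekday_anomalies(df, window=15, n_sigma=2.0):
    """Like before, but each day is only compared against previous
    occurrences of the same weekday (Sundays vs. past Sundays, etc.)."""
    df = df.sort_values("date").copy()
    grouped = df.groupby(df["date"].dt.dayofweek)["visitors"]
    df["expected"] = grouped.transform(
        lambda s: s.shift(1).ewm(span=window).mean())
    sigma = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=4).std())
    df["lower"] = df["expected"] - n_sigma * sigma
    df["upper"] = df["expected"] + n_sigma * sigma
    df["anomaly"] = (df["visitors"] < df["lower"]) | (df["visitors"] > df["upper"])
    return df

# Illustration: 16 weeks where each weekday has its own stable level,
# plus one spike on the final day
dates = pd.date_range("2023-01-02", periods=112, freq="D")
visitors = [6000.0 + 500 * d.dayofweek for d in dates]
visitors[-1] += 3000  # the anomaly
result = detect_weekday_anomalies(
    pd.DataFrame({"date": dates, "visitors": visitors}))
```

Because each weekday is compared only to itself, the per-group spread (and therefore the boundaries) shrinks, exactly the effect seen in the visualization above.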

Conclusion

In this blog post you have learnt how to build your very first anomaly detection on time series data. The approach combines a value estimation algorithm with boundaries built as a multiple of the standard deviation of the historical sample. Additionally, if your data exhibits strong seasonality, we have covered simple measures to handle that as well.
Regarding choice of metrics: when applying anomaly detection to multiple metrics from the same dataset, make sure to select complementary rather than additive metrics. I.e. in addition to number of visitors, don't run anomaly detection on total time spent on the website, because these two will most likely alert at the same time. Instead, as a time-spent metric, choose mean time spent per user! A bonus for curious readers can be found below, where I showcase an alternative solution using ML (Prophet)!

All code used for this blog post can be found on my github, and follow me on medium for more data-oriented posts! Happy anomaly detecting!

Bonus: Machine Learning approach using Prophet

On my quest towards a go-to anomaly detection approach, I have in the past also experimented with Meta's Prophet forecasting model. Prophet is a popular ML library for forecasting on time series data. For demonstration purposes I have taken a similar approach to the rest of the blog post, except that I increased the number of historical dates from 15 to 45 in order to predict the next day. The primary Prophet outputs we are interested in are the ones called yhat, yhat_lower and yhat_upper, i.e. the forecasted value together with its lower and upper boundaries. Applying this iterative 1-day forecasting to our data we get the following plot:

Image by Author: Anomaly boundaries built using Prophet.

In the plot above the boundaries are far tighter than in our previous approaches. As such, using the Prophet output as-is would give us an improbable number of anomaly alerts. As stated earlier in the blog post, I still believe the best approach is to first build your own algorithm, in order to both learn from your data and keep the most control over how and when alerts happen. However, with some tweaking, or by only using the estimated value yhat together with boundaries built from a multiple of standard deviations, it might be something you want to try out!
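One way to do that tweaking without retraining anything is to keep Prophet's yhat as the expected value but widen its interval before flagging. The column names ds/y/yhat/yhat_lower/yhat_upper follow Prophet's conventions (they come out of `model.predict(...)` after `Prophet().fit(history)`), but the helper itself and its `widen` parameter are my own sketch, demonstrated here on made-up frames:

```python
import pandas as pd

def flag_against_forecast(actuals, forecast, widen=1.0):
    """Join actuals (columns ds, y) with a Prophet-style forecast frame
    (columns ds, yhat, yhat_lower, yhat_upper) and flag actuals outside
    the interval, optionally widened by a factor."""
    merged = actuals.merge(
        forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]], on="ds")
    half_low = (merged["yhat"] - merged["yhat_lower"]) * widen
    half_high = (merged["yhat_upper"] - merged["yhat"]) * widen
    merged["anomaly"] = ((merged["y"] < merged["yhat"] - half_low) |
                         (merged["y"] > merged["yhat"] + half_high))
    return merged

# Made-up actuals and forecast for two days; the second day spikes
days = pd.to_datetime(["2023-04-06", "2023-04-07"])
actuals = pd.DataFrame({"ds": days, "y": [8000.0, 12000.0]})
forecast = pd.DataFrame({
    "ds": days,
    "yhat": [8100.0, 8200.0],
    "yhat_lower": [7600.0, 7700.0],
    "yhat_upper": [8600.0, 8700.0],
})
flagged = flag_against_forecast(actuals, forecast)
```

Increasing `widen` trades alert volume for sensitivity, which is the same true-positive/false-positive trade-off as choosing the σ multiplier earlier.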



Written by Erik Munkby

ML Engineer and Data Driven Culture Champion | Writing about ML, Data Science and other data related things | Co-founder of Data Dao