Ch Ch Ch Ch Changepoints

No, I didn’t just forget the words to David Bowie’s Changes! In today’s post we’re going to be talking about changepoints. In this brief introduction to changepoint analysis we’ll be covering what it actually is, how it is useful and when we can apply it. At the end of this post, I’ll also be sharing some code resources, which you can use to carry out your own changepoint analysis!

Changepoint analysis is a really well-established area of Statistics. It dates back as early as the 1950s, and since then has been the focus for LOTS of interesting and important research.

Changepoint detection looks at time series data. A time series is a series of data points which are indexed in time order. Usually, a time series is a sequence of discrete measurements, taken at equally spaced points in time. This could be the number of viewers for a particular TV show taken at one minute intervals over the course of an hour, or maybe the heights of ocean tides taken every hour throughout the day.

As the name suggests, the aim of changepoint detection is to identify the points in time at which the probability distribution of a time series changes. We can think of this as follows:

Let’s say we have some time series data given by y1, y2, …, yn, where yi is the measurement taken at time i. Then, if a changepoint exists at time τ, this means that the measurements y1, y2, …, yτ differ from the measurements yτ+1, …, yn in some way, for example in their mean or their variance.
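To make that definition a bit more concrete, here is a minimal Python sketch (my own illustration, not taken from any particular package) that looks for a single change in mean. It tries every possible split point and keeps the one where the two segments are best described by their own means. The function name best_single_changepoint and the toy data are made up for this example.

```python
def best_single_changepoint(y):
    """Return (tau, cost) for the best single split of y into y[:tau] and y[tau:].

    tau counts the observations before the change, so it matches the
    1-indexed tau in the text (y1, ..., y_tau versus y_tau+1, ..., yn).
    The cost is the total squared deviation of each segment from its own mean.
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    best_tau, best_cost = None, float("inf")
    for tau in range(1, n):
        left, right = y[:tau], y[tau:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        cost = sum((v - mean_left) ** 2 for v in left)
        cost += sum((v - mean_right) ** 2 for v in right)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau, best_cost


# Toy data (invented): the mean jumps from about 1 to about 5 after 4 points.
y = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
tau, cost = best_single_changepoint(y)
print(tau, cost)
```

Note that this sketch always returns some split, even if nothing has really changed. Deciding whether the change is genuine needs an extra step, such as a statistical test or a penalty on adding a changepoint.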

If we are performing a changepoint analysis, there are some key questions that we’d like to consider:

  • Has a change occurred?
  • If yes, where is the change?
  • What is the probability that a change has occurred?
  • How certain are we of the location of the changepoint?
  • What is the statistical nature of this change?

Online v Offline Detection

Changepoint detection can either be online or offline. Imagine that we have access to some data, which describes the temperature taken at Lancaster University at 12pm every day over the course of a month. We then want to look for changepoints in this data, to see whether there were any freak increases or dips in the mean temperature, or maybe periods with very high variance. This type of analysis would require offline changepoint detection methods, because we have access to the complete time series data. That is, we are looking at the data after all of it has been collected.

On the other hand, imagine that The Great British Bake Off is on TV right now. The number of viewers tuned in for the programme is being streamed to us live every second, and we want to look for changepoints in the number of viewers now, as the programme is being aired. This type of analysis would require us to use online changepoint detection methods, which run concurrently with the process that they are monitoring.

Let’s recap that. In offline changepoint detection …

  • Live streaming data is not used.
  • The complete time series is required for statistical analysis.
  • All data is received and processed at the same time.
  • We are interested in detecting all changes in the data, and not just the most recent.
  • We usually end up with more accurate results, as the entire time series has been analysed.
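Because an offline method can see the whole series, it can go back and forth over the data looking for every change. As a rough illustration, here is a simplified Python sketch of the binary segmentation idea for changes in mean: find the best single split, keep it only if it improves the fit by more than a penalty, then repeat on each half. The function names, the penalty value and the toy data are all assumptions made for this example, and this is not how any particular R package implements it.

```python
def segment_cost(y):
    """Squared deviation of a segment from its own mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def binary_segmentation(y, penalty, offset=0):
    """Greedy search for multiple changes in mean.

    Returns a sorted list of changepoint positions, where each position
    is the number of observations before the change. `penalty` is a
    hypothetical tuning value: a split is kept only if it reduces the
    cost by more than this amount.
    """
    n = len(y)
    if n < 2:
        return []
    no_split_cost = segment_cost(y)
    best_tau, best_cost = None, no_split_cost
    for tau in range(1, n):
        cost = segment_cost(y[:tau]) + segment_cost(y[tau:])
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    # Stop if no split helps enough to justify an extra changepoint.
    if best_tau is None or no_split_cost - best_cost <= penalty:
        return []
    left = binary_segmentation(y[:best_tau], penalty, offset)
    right = binary_segmentation(y[best_tau:], penalty, offset + best_tau)
    return left + [offset + best_tau] + right


# Toy data (invented): the mean is about 0, then about 4, then about 1.
y = [0.0, 0.1, -0.1, 0.0, 4.0, 4.1, 3.9, 4.0, 1.0, 1.1, 0.9, 1.0]
changepoints = binary_segmentation(y, penalty=1.0)
print(changepoints)
```

The penalty controls how cautious the search is. A larger value means fewer changepoints are reported, and in practice it would be chosen with more care than the round number used here.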

Whereas in online changepoint detection …

  • The algorithm runs concurrently with the process that it is monitoring.
  • Each data point is processed as it becomes available.
  • Speed is of the essence! The goal is to detect a changepoint as soon as possible after it occurs, ideally before the arrival of the next data point!
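To give a feel for what processing each data point as it arrives looks like, here is a small Python sketch of a two-sided CUSUM detector, a classic monitoring scheme from quality control. It is only one of many possible online methods, and it assumes we already know the usual level of the series. The class name, the parameter values (target_mean, slack, threshold) and the toy stream are all made up for illustration.

```python
class CusumDetector:
    """Two-sided CUSUM for shifts away from a known target mean.

    Assumptions for this sketch: target_mean is known in advance,
    slack is how much drift we are happy to ignore, and threshold is
    how much accumulated evidence we need before raising an alarm.
    """

    def __init__(self, target_mean, slack, threshold):
        self.target_mean = target_mean
        self.slack = slack
        self.threshold = threshold
        self.upper = 0.0  # evidence for an upward shift
        self.lower = 0.0  # evidence for a downward shift

    def update(self, value):
        """Process one new observation; return True if an alarm is raised."""
        deviation = value - self.target_mean
        self.upper = max(0.0, self.upper + deviation - self.slack)
        self.lower = max(0.0, self.lower - deviation - self.slack)
        return self.upper > self.threshold or self.lower > self.threshold


def first_alarm(stream, detector):
    """Feed observations one at a time; return the 1-indexed alarm time, or None."""
    for t, value in enumerate(stream, start=1):
        if detector.update(value):
            return t
    return None


# Toy stream (invented): the level jumps from about 10 to about 15 at time 6.
stream = [10, 11, 9, 10, 10, 14, 15, 14, 16]
alarm_time = first_alarm(stream, CusumDetector(target_mean=10.0, slack=1.0, threshold=5.0))
print(alarm_time)
```

Each update only touches two running totals, so the work per data point is tiny, which is exactly what we want when the next observation is about to arrive. Raising the threshold makes false alarms less likely, but it also makes the detector slower to react.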

Examples

Let’s consider a fitness tracker that can tell when you are walking, running, climbing stairs … you get the idea. Maybe your mobile phone does this. One way in which devices can tell what activity you were performing at a particular point during the day is by using offline changepoint detection!

Online changepoint detection is often used in areas like quality control, or for monitoring systems. For example, a broadband provider might receive live data that details the performance of their broadband network at some site. Detection of a changepoint in this scenario might indicate that there is an issue with the network! This brings us to another required feature for a good online changepoint detection method: alongside the need for speed, it is also important that we have a method that is robust to noise and outliers, so that it raises as few false alarms as possible. This makes sense, as the broadband provider doesn’t want to send out an engineer if there isn’t actually anything wrong with the network!

Now that we have covered what changepoint detection is, and the differences between offline and online detection methods, can you think of any other scenarios where we would want to use offline changepoint detection methods? What about online detection methods?

Further Reading

Sadly there is only so much I can write in one blog post, so I have included plenty of further reading resources for you if you enjoyed today’s post!

  • Offline changepoint detection and implementation: this post provides a great place to start if you want to know more about the types of changepoint detection methods available, and if you want to have a go at applying some of the methods to data in R.
  • Online changepoint detection and implementation: this post describes one possible method of online changepoint detection. I feel it gives a great intuitive understanding, and also explains how to code up the method if this is something that you would like to try!
  • PELT method: this paper provides a more mathematical explanation of how one of the most popular offline changepoint detection methods works. I’d recommend reading this if you are looking for a deeper understanding of how the method works.
  • Binary Segmentation method: this post provides an introduction to another popular offline changepoint detection method, and also gives some code that you can use to implement the algorithm yourself. I find that this gives a better understanding than simply calling one of the available R packages.