Welcome to my first blog post! I’m Luke, an MRes student at the STOR-i (Statistics and Operational Research in partnership with Industry) Centre for Doctoral Training. As part of this one-year course, which prepares us for our PhD with STOR-i, we are tasked with the upkeep of this blog as a way to discuss and consolidate the wealth of ideas we come across throughout our time in academia, at a level approachable for a first-year STEM student, as well as anything else we may have found interesting since the last blog post. As a STEM student reading this blog, you may be interested in applying to the STOR-i Summer Internship during your penultimate year, which gives a good taste of life at the CDT (note: I have not been bribed to say this; it’s a very useful experience for anyone interested in this type of research). As such, for this first post, I would like to discuss the topic I looked into during my own internship with STOR-i, Extreme Value Theory, to give a flavour of some research-level statistics. My internship was completed under the supervision of Stan Tendjick, who was a first-year PhD student at the time. The final presentation can be found here, but I will give a gentler introduction in this post.
The first task of Extreme Value Theory is to model only those values in a data set which are considered to be “extreme”, and it provides classes of distributions suited to modelling them. Specifically, for my internship, I modelled the extreme values of oceanic wave data, such as wave speed and height, near an oil rig, to help ensure that the facility would remain safe under extreme scenarios. So what is an extreme value? There are many answers, but my study focused on the “Threshold Model”, which defines an extreme value to simply be a value that surpasses some predefined high threshold. Theoretically, as this threshold is increased, the distribution of the amounts by which values exceed the threshold converges, for almost any underlying distribution, to one called the Generalised Pareto Distribution, or GPD (don’t worry about the jargon, that’s just its name). A “good” threshold to choose is the lowest threshold such that the data above it look like a good match for the GPD (we must, of course, actually have some data above that threshold, which often means we need a good amount of data overall). Once we have a good model for these data, we can use it to answer certain questions, such as the probability of seeing a wave above a certain height, or the wave speed we would expect to be exceeded on average once every 1000 years (known as a return level). Knowing this, we can build structures designed to weather these kinds of extreme events.
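To make the threshold model a little more concrete, here is a minimal sketch in Python. It is not the code from my internship: the data are synthetic (drawn from a Gumbel distribution purely as a stand-in for wave heights), the threshold is simply set at the 95% quantile rather than chosen by checking the fit, and the function names are made up for illustration. It fits a GPD to the excesses over the threshold with `scipy.stats.genpareto`, then uses the standard formulas for an exceedance probability and for the level exceeded on average once every `m` observations.

```python
import numpy as np
from scipy.stats import genpareto


def fit_gpd_tail(data, threshold):
    """Fit a GPD to the excesses over `threshold` (illustrative helper)."""
    data = np.asarray(data, dtype=float)
    excesses = data[data > threshold] - threshold
    # Fix the location at 0 so only shape and scale are estimated.
    shape, _, scale = genpareto.fit(excesses, floc=0)
    rate = excesses.size / data.size  # estimated P(value > threshold)
    return shape, scale, rate


def exceedance_probability(x, threshold, shape, scale, rate):
    """Estimated P(value > x) for some x above the threshold."""
    return rate * genpareto.sf(x - threshold, shape, loc=0, scale=scale)


def return_level(threshold, shape, scale, rate, m):
    """Level exceeded on average once every `m` observations."""
    if abs(shape) < 1e-8:  # limiting (exponential) case of the GPD
        return threshold + scale * np.log(m * rate)
    return threshold + (scale / shape) * ((m * rate) ** shape - 1.0)


# Synthetic stand-in for wave heights; real data would replace this.
rng = np.random.default_rng(0)
heights = rng.gumbel(loc=2.0, scale=0.5, size=20_000)

u = np.quantile(heights, 0.95)  # assumed threshold, not a tuned choice
shape, scale, rate = fit_gpd_tail(heights, u)
level = return_level(u, shape, scale, rate, m=100_000)
print(f"shape={shape:.3f}, scale={scale:.3f}, return level={level:.2f}")
```

To turn "once every `m` observations" into "once every 1000 years", `m` would be 1000 times the number of observations recorded per year, which depends on how the data were collected.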
This describes extreme value theory for a single variable, such as looking at wave speed or height on its own. However, we may want to look at the extremes of these variables together, and at how an extreme value of one variable affects the value of another (the jargon for this being “extremal dependence”). This turns out to be a more involved problem, which we can tackle using the Heffernan-Tawn model (Jon Tawn being the Director of STOR-i), which can be considered an extension of the threshold model discussed earlier. Fitting this model can be broken down into a few steps:
– “Transform” each variable onto a more suitable common distribution (while this distorts the distribution of the data, it is reversible and, importantly, preserves the dependence structure)
– Condition on one of the variables being above some suitably extreme threshold
– Find a curve of best fit through these extreme data, allowing the spread of the points about the curve to change with the size of the extreme variable
– Check that the residuals (how the data differ from the fitted curve, once rescaled by that spread) are independent of the extreme variable. If they are, our model is done!
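The steps above can be sketched in code under some assumptions that go beyond this post. The sketch uses the common form of the model in which, given an extreme X = x, the other variable behaves like Y = a·x + x^b·Z, with Z the residual. It also assumes Laplace margins, a rank-based transform, and a Gaussian working likelihood for fitting `a` and `b`; these are choices often made in the literature, not necessarily the ones from my internship. The data are synthetic and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata


def to_laplace(values):
    """Step 1: rank-transform a sample onto standard Laplace margins."""
    u = rankdata(values) / (len(values) + 1.0)  # approx. uniform on (0, 1)
    return np.where(u < 0.5, np.log(2.0 * u), -np.log(2.0 * (1.0 - u)))


def fit_conditional_model(x, y, threshold):
    """Steps 2-4: fit Y = a*x + x**b * Z to the points with x > threshold."""
    keep = x > threshold  # step 2: condition on x being extreme
    xe, ye = x[keep], y[keep]

    def neg_log_lik(params):
        # Gaussian working assumption: Z ~ Normal(mu, sigma^2).
        a, b, mu, log_sigma = params
        mean = a * xe + mu * xe**b
        sd = np.exp(log_sigma) * xe**b  # spread changes with x
        return np.sum(np.log(sd) + 0.5 * ((ye - mean) / sd) ** 2)

    result = minimize(
        neg_log_lik,
        x0=[0.5, 0.3, 0.0, 0.0],
        bounds=[(-1.0, 1.0), (-2.0, 0.99), (None, None), (-5.0, 5.0)],
        method="L-BFGS-B",
    )
    a, b, _, _ = result.x
    residuals = (ye - a * xe) / xe**b  # step 4: residuals to be checked
    return a, b, residuals, xe


# Synthetic stand-in for two dependent wave variables.
rng = np.random.default_rng(1)
raw = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=5000)
x_lap, y_lap = to_laplace(raw[:, 0]), to_laplace(raw[:, 1])

threshold = -np.log(2 * (1 - 0.9))  # 90% quantile of the Laplace distribution
a, b, residuals, x_ext = fit_conditional_model(x_lap, y_lap, threshold)

# A crude independence check: correlation between residuals and x.
print(f"a={a:.2f}, b={b:.2f}, corr={np.corrcoef(x_ext, residuals)[0, 1]:.3f}")
```

A correlation near zero is only a rough first check; in practice one would also plot the residuals against the extreme variable and look for any remaining pattern.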
With the model finished, we can use it to simulate extreme data, using the fitted curve, the fitted spread about the curve, and the distribution of the residuals. This is especially useful where we don’t have enough observed extremes to draw conclusions directly. These simulated points can then be used to estimate quantities such as probabilities and return levels, as discussed for the single-variable case.
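As a rough illustration of the simulation step, the sketch below assumes the model form Y = a·x + x^b·Z on Laplace margins, where the part of X above a positive threshold follows a standard exponential distribution. The values of `a_hat` and `b_hat` and the residual sample are placeholders invented for the example, not fitted results, and `simulate_conditional` is a hypothetical helper.

```python
import numpy as np


def simulate_conditional(a, b, residuals, threshold, n_sim, rng):
    """Simulate (x, y) pairs given that x exceeds `threshold`."""
    # On Laplace margins, x - threshold given x > threshold is Exp(1).
    x = threshold + rng.exponential(scale=1.0, size=n_sim)
    # Resample the observed residuals rather than assuming their shape.
    z = rng.choice(residuals, size=n_sim, replace=True)
    y = a * x + x**b * z
    return x, y


rng = np.random.default_rng(2)

# Placeholder inputs standing in for a previously fitted model.
a_hat, b_hat = 0.5, 0.3
residuals_hat = rng.normal(loc=0.0, scale=1.0, size=500)
threshold = -np.log(2 * (1 - 0.95))  # 95% quantile of the Laplace distribution

x_sim, y_sim = simulate_conditional(
    a_hat, b_hat, residuals_hat, threshold, n_sim=100_000, rng=rng
)

# Estimated probability that y is also extreme, given that x is extreme.
p_both = np.mean(y_sim > threshold)
print(f"P(y > threshold | x > threshold) is roughly {p_both:.3f}")
```

The simulated points live on the transformed scale, so they would be mapped back through the reverse of the original transformation before being read as wave heights or speeds.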
I hope this has been a useful introduction to one of the many topics we specialise in here at STOR-i, and a good taste of the kind of problem we tackle at the research level. Over the coming months, I will be making similar posts about more of the interesting areas and topics we come across during our Master’s year, to give a more representative picture of the range of problems covered at the Centre.