Virtual Labs – Breaking down barriers in Environmental Data Science.



Complex statistical and data science methods allow environmental scientists to gain useful information from an ever-increasing volume and variety of datasets. However, there are barriers to accessing these methods (e.g. changepoint detection or extreme value theory), and applying them usually requires collaboration between a number of different experts. Such efforts typically involve sharing large volumes of data, and sharing code between often very different computational environments. Furthermore, the results are usually interpreted by domain experts, some of whom may not have a coding background. Virtual labs provide a cloud-based collaborative platform that eases access to complex statistical methods and offers a coherent environment in which to open up environmental data science. This blog post introduces a case study that highlights some of the latest developments in the virtual labs (Hollaway et al. 2020) and how they were used to develop a new method to evaluate complex numerical models.

The Challenge

Numerical models are heavily relied on within the environmental sciences to predict how the dynamic natural environment will respond to key pressures and drivers (e.g. climate change). Over time, these models have become more complex, with ever-increasing numbers of parameters, and thus require robust evaluation to see how well they capture reality. One such example is assessing how well a high-resolution meteorological reanalysis model dataset captures key events over the Greenland Ice Sheet. Typically, global statistics (mean, correlation, Nash-Sutcliffe Index) are used to determine model performance on average across the whole time series. However, models that perform well on such global measures can still fall down when trying to represent individual events in time (so-called local-scale events). Therefore, new approaches are necessary to see how well these local-scale events are captured. In this example, changepoint detection was used to see how well a dataset derived from a complex numerical model represented changepoints detected in corresponding observations over Greenland. In short, a changepoint is a point in a time series where the statistical properties of that time series (e.g. mean, variance or trend) undergo significant change. Developing the new evaluation approach required input from a number of experts from various domains. Such work is typically done by passing data and code files from one person to another, which often results in many different points of entry, non-coherent computational environments, multiple copies of the same dataset and no end-to-end record of the assumptions made in the analysis. In this case, virtual labs were able to offer a solution to some of these problems and enabled development of the new method.
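To make the idea of a changepoint concrete, the following is a minimal sketch in Python that finds a single shift in the mean of a series by brute force. It is purely illustrative and is not the algorithm used in the case study (which was written in R); the function names, the `min_seg` parameter and the toy series are all assumptions made for this example.

```python
from statistics import fmean


def sum_sq_dev(values):
    """Sum of squared deviations of a segment from its own mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def single_mean_changepoint(series, min_seg=2):
    """Find the single split that best separates the series into two mean levels.

    Returns (index, gain): index is the position of the first value of the
    second segment, gain is the reduction in squared error achieved by
    fitting two means instead of one.
    """
    n = len(series)
    if n < 2 * min_seg:
        raise ValueError("series too short for the chosen min_seg")
    total = sum_sq_dev(series)
    best_idx, best_cost = None, float("inf")
    # Try every admissible split and keep the one with the lowest cost.
    for k in range(min_seg, n - min_seg + 1):
        cost = sum_sq_dev(series[:k]) + sum_sq_dev(series[k:])
        if cost < best_cost:
            best_idx, best_cost = k, cost
    return best_idx, total - best_cost


# Made-up toy series: the mean shifts from roughly 0 to roughly 5.
toy = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 5.1, 4.8, 5.2, 4.9, 5.0, 5.3]
cp_index, gain = single_mean_changepoint(toy)
print(cp_index, round(gain, 2))
```

Running the same function on both the observations and the model output, and comparing where the detected indices fall, gives a flavour of the local-scale evaluation described above. Practical changepoint methods handle multiple changepoints and penalise over-fitting, which this sketch does not attempt.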

The DataLabs solution

A virtual lab was developed to tackle this challenge. In short, a cloud-based environment provides a storage volume that sits below data processing, analysis and visualisation resources. A Jupyter notebook environment is provided to document the process of reading in the data, processing the data into a format suitable for the changepoint detection algorithm and visualising the results. Finally, an R Shiny application sits above the notebook code to allow users without a coding background to explore the results, and to execute the code beneath in order to explore changepoints in the dataset and evaluate how well the numerical model captures them. Most importantly, the Shiny app and the code all run in the same environment, using exactly the same version of the dataset and analytical workflow (see Figure 1).

Figure 1: Overview of the changepoint case study DataLab, demonstrating the different levels of abstraction. (A): The raw R code for computing changepoint locations, (B): The Jupyter notebook demonstrating the method, (C): The R Shiny app that allows exploration of the changepoints at different sites across Greenland. Source: Hollaway et al. (2020).

Sharing the results through Virtual Labs

Through this workflow, changepoint detection was used in combination with fuzzy logic to see how well a model dataset captured changepoints over the Greenland Ice Sheet whilst factoring in uncertainty in the changepoint location (Hollaway et al. 2021). It was shown that the model dataset was able to pick up observed changepoints in temperature records with varying degrees of success (Figure 2). This provides useful insight: if there are common local-scale events (in this case changepoints) that the model is failing to capture, they could highlight areas to focus on in future model development. The outcome of this application is a demonstration of the end-to-end nature of the workflow, which here takes on an iterative and cyclical approach facilitated by the Virtual Labs environment. Domain experts are able to come in at different stages of the analysis to develop the method. For example, the data engineers and environmental science experts come in at the data ingress and interpretation stage, and the statisticians and data scientists come in at the method development and execution stage. Finally, the environmental scientists visualise and interpret the results to understand the outputs within the domain of application (in this case ice sheet melt). The whole process takes on a 'well, we showed that, but what if we tried this approach?' character, which is simplified by harnessing the power of notebook and dashboard technologies within the virtual labs.

Figure 2: Evaluation of numerical model (blue) at capturing changepoints when compared to observations (red). The triangles represent uncertainty in the changepoint locations with greater overlap (grey shading) indicating a better representation of reality by the model. Source: Hollaway et al. (2021).
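The triangle overlap shown in Figure 2 can be illustrated with a small sketch. The code below is not the formulation from Hollaway et al. (2021); it simply assumes that each changepoint is given a symmetric triangular membership function of the same half-width, and scores agreement as the shared area divided by the area of one triangle. The names `triangular_membership`, `overlap_score`, `half_width` and `step` are hypothetical.

```python
def triangular_membership(t, centre, half_width):
    """Membership in [0, 1]: 1 at the centre, falling linearly to 0 at +/- half_width."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return max(0.0, 1.0 - abs(t - centre) / half_width)


def overlap_score(obs_cp, model_cp, half_width, step=0.01):
    """Shared area of two equal-width triangles, as a fraction of one triangle's area.

    1.0 means the changepoints coincide; 0.0 means the triangles do not touch.
    The shared area is approximated numerically with the midpoint rule.
    """
    lo = min(obs_cp, model_cp) - half_width
    span = abs(obs_cp - model_cp) + 2 * half_width
    n_steps = int(round(span / step))
    shared = 0.0
    for i in range(n_steps):
        t = lo + (i + 0.5) * step
        shared += step * min(
            triangular_membership(t, obs_cp, half_width),
            triangular_membership(t, model_cp, half_width),
        )
    # A triangle of height 1 and base 2 * half_width has area half_width.
    return shared / half_width


# Made-up example: changepoints 5 time steps apart, uncertainty of +/- 10 steps.
score = overlap_score(obs_cp=100, model_cp=105, half_width=10)
print(round(score, 3))
```

A score close to 1 would correspond to heavy grey shading in the figure, and a score of 0 to a model changepoint that falls entirely outside the uncertainty window of the observed one.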

Harnessing the power of Virtual Labs to share data and methods



The use of notebook technologies also provides a detailed narrative of the workflow, covering everything from data ingress through to the visualisation of the final results in the Shiny app. The combination of code and narrative that notebooks provide allows different users to understand the assumptions made when a particular piece of code was written, and gives context to any decisions made when it was run. In addition, as the notebooks are stored in the common datastore (with the computational environment made available through package management systems such as conda), a new user is able to come along and run the method on their own dataset or adapt it for use in another workflow (e.g. by using a different changepoint detection technique). Finally, the virtual labs are cloud-based, which allows the user to scale up an analysis when the volume of data increases or when they wish to run a more complex statistical model (using parallel computing libraries such as Dask for Python and Spark for R).
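As a rough illustration of the scaling pattern mentioned above, the sketch below runs a per-site analysis for several sites concurrently. It deliberately uses Python's standard-library `concurrent.futures` as a stand-in for Dask so that it runs without extra packages; the site names, the data and the `best_split` and `run_all` helpers are all made up for this example and are not part of the case study code.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def best_split(series):
    """Index that best splits a series into two mean levels (least squared error)."""

    def sse(segment):
        m = fmean(segment)
        return sum((v - m) ** 2 for v in segment)

    return min(
        range(1, len(series)),
        key=lambda k: sse(series[:k]) + sse(series[k:]),
    )


def run_all(site_series, max_workers=4):
    """Apply best_split to every site concurrently and return {site: index}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = pool.map(best_split, site_series.values())
        return dict(zip(site_series.keys(), results))


# Hypothetical sites with an obvious step change in each series.
sites = {
    "site_a": [0.0] * 5 + [3.0] * 5,
    "site_b": [2.0] * 3 + [7.0] * 7,
}
print(run_all(sites))
```

Threads keep the example simple, but they give little speed-up for CPU-bound pure-Python work. In practice the same map-over-sites structure would be handed to a process pool or to a library such as Dask, which can distribute the tasks across the cloud resources the virtual lab provides.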

Overall, this case study demonstrates that virtual labs play a key role in breaking down barriers in environmental data science and hopefully can provide support for the ever increasing need for collaborative and open data science approaches.

The R Shiny app produced for the case study is available at the following URL:

https://dsne-fuzzycpteval.datalabs.ceh.ac.uk/

References:

Hollaway et al., Tackling the Challenges of 21st-Century Open Science and Beyond: A Data Science Lab Approach, Patterns (2020), https://doi.org/10.1016/j.patter.2020.100103

Hollaway, M.J., Henrys, P.A., Killick, R., Leeson, A., Watkins, J., Evaluating the ability of numerical models to capture important shifts in environmental time series: A Fuzzy changepoint approach, Environmental Modelling and Software, 139, 104993 (2021), https://doi.org/10.1016/j.envsoft.2021.104993

Disclaimer

The opinions expressed by our bloggers and those providing comments are personal, and may not necessarily reflect the opinions of Lancaster University. Responsibility for the accuracy of any of the information contained within blog posts belongs to the blogger.

