Within the Data Science Institute, we aim to improve the reproducibility and replicability of research by improving the reusability, sustainability, and quality of research software developed across the University. We are currently funded by the N8CIR and work closely with our partner institutions.
What is Research Software Engineering?
Research Software Engineering combines academic research with good software engineering principles.
Combining good software engineering principles with high academic standards allows us to produce open, reproducible, and replicable research of the highest quality. However, a lack of appreciation for software as a research output means that software is often treated as a secondary concern, leading to lower-quality code that is not reproducible, or even shareable.
Research Software Engineers (RSEs) combine the skills needed to work with researchers and develop both software and research of the highest quality.
Training Courses
We are no longer able to offer in-person training. Many of the resources we use can serve as self-teaching guides, and links to them are provided below.
Various types of infrastructure are available to researchers at Lancaster to support them in computationally intensive research or to meet their data needs.
The High-End Computing (HEC) Cluster is a centrally-run service to support researchers and research students at Lancaster who require high-performance and high-throughput computing. This includes computing workloads with requirements that can’t be met by the Interactive Unix Service (IUS), desktop PCs or Virtual Desktops (MyDesktop and MyLab).
The Data Immersion Suite, and its associated Control Room, are dedicated teaching and research facilities at the University enabling educators to create complex scenarios which might unfold over hours or even days.
A SafePod is a standardised safe setting that provides the necessary security for a researcher to access sensitive datasets from participating Data Centres across the UK. The SafePod at Lancaster University is part of the SafePod Network (SPN). A total of 25 SafePods will make up the SafePod Network, spread geographically across the UK. This network will remove the need for long distance travel to a dedicated safe setting provided by a data centre.
MyLab gives you online access to software on your device via a virtual PC Lab computer (just like you'd get in the Library and PC Labs on campus) from anywhere using your web browser.
Bede is the N8 GPU-accelerated Supercomputer. It is housed at Durham University and accessible to members of the N8, which includes Lancaster University.
The Interactive UNIX Service (IUS) provides a UNIX facility in the areas of study, research, development, and teaching for researchers whose needs are not best served by the University's Microsoft Windows-based services.
Community
Many researchers use software heavily as part of their research, and many of the problems we need to solve with software can be common across disciplines. By bringing researchers together we can create a collaborative community to share tools and libraries that we use and provide support in writing code.
To do this we have a Microsoft Teams group where researchers can ask for help and advice, or post useful tools and information related to research software. Follow this link to join the Research Software Network Teams Group.
DSI is pleased to announce the N8CIR internships programme for 2025
This programme is aimed at 2nd and 3rd year undergraduates interested in exploring research software engineering as a career. We are offering up to two 8-week positions starting on 18th June, with a £3,500 stipend for the period. Interested students should review the projects listed below and contact the relevant supervisors ahead of the nomination **deadline of 17th April**. Candidates will then be invited to an interview, at which the recipients of the internships will be selected.
Please email dsi-enquiries@lancaster.ac.uk if you have questions.
1. Optimising on-disc storage of Monte Carlo output for machine learning problems
In statistics, machine learning, and AI, models commonly lead to complex multi-dimensional probability distributions as the focus of interest. These probability distributions are often represented as a collection of random numbers – “Monte Carlo samples” – which must be stored for further processing and analysis. In many situations, the number and complexity of these Monte Carlo samples means that they must be streamed out of RAM to non-volatile storage, for example a binary file, column database, or object storage. However, the order in which we might write such samples is often incompatible with efficient read access in the future.
This project will explore efficient storage patterns across a variety of modern on-disk and cloud storage formats. A typical Monte Carlo sampler will be used as a test case, with different storage formats being investigated on a range of hardware from personal computers to HPC to cloud. Storage formats include, but are not limited to, HDF5, Zarr, Parquet, and TileDB. The project will culminate in a model that enables researchers (or even a machine) to choose the best storage format for their Monte Carlo samples, with the chance to design a library that abstracts the storage formats to provide a consistent interface for the user.
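As a rough illustration of this trade-off, the sketch below streams batches of samples to a chunked on-disk array (using the Python zarr and numpy packages and a hypothetical file name); the chunk shape chosen at write time determines which later read patterns are cheap.

```python
import numpy as np
import zarr

n_iterations, n_params, batch = 100_000, 50, 1_000

# Chunking by iteration blocks (rows) favours appending during sampling;
# chunking by parameter (columns) favours later per-parameter trace reads.
store = zarr.open(
    "mc_samples.zarr", mode="w",
    shape=(n_iterations, n_params),
    chunks=(batch, n_params),          # write-friendly layout
    dtype="f8",
)

rng = np.random.default_rng(0)
for start in range(0, n_iterations, batch):
    samples = rng.normal(size=(batch, n_params))   # stand-in for a real Monte Carlo kernel
    store[start:start + batch, :] = samples        # streamed out of RAM batch by batch

# Reading a single parameter's trace later touches every chunk with this layout;
# a (n_iterations, 1) chunk shape would make this read cheap instead.
trace = store[:, 0]
```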
2. Distributed Search Space Reduction for Program Synthesis
Supervisor: Barry Porter <b.f.porter@lancaster.ac.uk> (SCC)
RSE Mentor: John Vidler (SCC)
Genetic programming (GP) is an approach to synthesising new programs for novel problems by searching through theoretical program space following a reward signal. Compared to large language models, this approach allows the synthesis of novel programs for entirely unseen problems. GP typically starts from an empty program and navigates outwards in various directions to try to find improved candidates. In this project you will develop an alternative approach, in which distributed parallel computing is used to incrementally narrow a search space. Your project will use an existing, novel program search space framework as a starting point; this framework is able to represent all of program search space as a regular rectangle, and operates part of its search process across GPUs. Your system will start by splitting the total theoretical search space into a number of equally-sized regions for parallel distributed search, and sampling random points from within each region. The most promising of these regions will then be selected as the area of focus, and will itself be split into a number of equally-sized regions for further parallel distributed search. Starting from our existing framework, you will focus specifically on the distributed systems aspect and the implementation of parallel search sampling. The resulting framework should be driveable and observable via a REST API.
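As a rough sketch of the narrowing loop only (not the existing framework), the Python below splits a hypothetical two-dimensional search space into equal regions, samples random points from each, keeps the best-scoring region, and recurses; the `score` function is a stand-in for the real reward signal.

```python
import random

def score(point):
    # Hypothetical reward signal: higher is better. In the real framework this
    # would evaluate a candidate program decoded from the point's coordinates.
    x, y = point
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)

def narrow(region, depth, n_splits=4, n_samples=64):
    """Recursively split `region` ((xmin, xmax), (ymin, ymax)) and keep the
    sub-region whose random samples score best."""
    if depth == 0:
        return region
    (xmin, xmax), (ymin, ymax) = region
    xs = [xmin + i * (xmax - xmin) / n_splits for i in range(n_splits + 1)]
    ys = [ymin + j * (ymax - ymin) / n_splits for j in range(n_splits + 1)]
    best_region, best_score = None, float("-inf")
    for i in range(n_splits):
        for j in range(n_splits):
            sub = ((xs[i], xs[i + 1]), (ys[j], ys[j + 1]))
            # In a distributed setting each sub-region would be sampled on a
            # separate worker; here they are evaluated sequentially.
            s = max(score((random.uniform(*sub[0]), random.uniform(*sub[1])))
                    for _ in range(n_samples))
            if s > best_score:
                best_region, best_score = sub, s
    return narrow(best_region, depth - 1, n_splits, n_samples)

print(narrow(((0.0, 1.0), (0.0, 1.0)), depth=5))
```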
3. gemlib: a python library for epidemic modelling
Come and contribute to the development of gemlib, an open-source Python library for simulating and calibrating epidemic models to real-world outbreak data! Epidemic models are used to understand and predict how infections spread in different settings. During the COVID-19 pandemic, real-time modelling was used to improve understanding of the pathogen, forecast disease dynamics, and evaluate interventions. Epidemic models are fundamentally Markov state-transition models, whereby a population of individuals is divided into mutually exclusive disease states and individuals move between these states according to time-varying transition rates. Such models become complex quickly, often including spatial features and individual interactions and stratifying the population by demographic characteristics at various scales. The parameters that govern these models are often unknown and need to be estimated. Bayesian inference methods such as MCMC and SMC are often used to account for censored data (such as unknown infection times) and estimate parameters of interest. These methods are computationally complex, and their implementation is technically challenging and time-consuming.
gemlib presents a unified framework for expressing and simulating models, as well as automatic generation of probability functions for parameter inference. The library enables researchers to rapidly spin up epidemic models during emerging outbreaks in a robust, reproducible manner. gemlib is based on the machine learning library TensorFlow, allowing complex models to be optimised on a GPU when needed. This project will enable an intern to contribute to open-source software development, implementing Bayesian inference algorithms as new classes in the library. There will be opportunities to expand skills in functional programming, high-performance computing, and functional testing.
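Purely as an illustration of the underlying state-transition idea (a generic sketch in plain numpy, not gemlib's API), a minimal discrete-time stochastic SIR model might look like this:

```python
import numpy as np

def simulate_sir(beta, gamma, s0, i0, r0, n_days, seed=0):
    """Discrete-time stochastic SIR simulation: individuals move S -> I -> R
    according to time-varying, density-dependent transition rates."""
    rng = np.random.default_rng(seed)
    s, i, r = s0, i0, r0
    n = s0 + i0 + r0
    history = [(s, i, r)]
    for _ in range(n_days):
        # Per-capita event probabilities over one time step.
        p_infect = 1.0 - np.exp(-beta * i / n)
        p_recover = 1.0 - np.exp(-gamma)
        new_infections = rng.binomial(s, p_infect)
        new_recoveries = rng.binomial(i, p_recover)
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return np.array(history)

print(simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, n_days=100)[-1])
```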
4. Pipeline for fitting thermal responses of mosquito traits
Mechanistic models for the impact of climate on transmission of vector-borne diseases rely on thermal responses that describe how vector and pathogen life history traits respond to temperature. This project will train a student intern in developing a data analysis pipeline to fit these thermal responses using a dataset of mosquito trait data digitised from previously published lab experiments. Over 8 weeks, the student will work with the project lead to: 1) Perform basic quality checks on the trait data and collect relevant metadata from the original articles, such that the completed dataset can be uploaded to the VecTraits database; 2) Write a pipeline in R to allow anyone to retrieve this data from VecTraits using the ohvbd package; 3) Fit a series of thermal performance curves (TPCs) to the data using the rTPC package; and 4) Visualise and interpret trends in these TPC fits using ggplot2 and conduct an appropriate statistical analysis. The training will emphasise writing pipelines that are open, reproducible, and flexible. The final pipeline will be hosted on GitHub.
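The pipeline itself will be written in R using rTPC and ohvbd, but purely as an illustration of what fitting a thermal performance curve involves, here is a minimal Python sketch that fits a Brière-1 curve to hypothetical digitised trait data:

```python
import numpy as np
from scipy.optimize import curve_fit

def briere1(temp, a, t_min, t_max):
    """Brière-1 thermal performance curve: zero outside (t_min, t_max)."""
    temp = np.asarray(temp, dtype=float)
    rate = a * temp * (temp - t_min) * np.sqrt(np.clip(t_max - temp, 0, None))
    return np.where((temp > t_min) & (temp < t_max), rate, 0.0)

# Hypothetical digitised trait data: mosquito development rate vs temperature (deg C).
temps = np.array([12, 16, 20, 24, 28, 30, 32, 34], dtype=float)
rates = np.array([0.01, 0.05, 0.09, 0.13, 0.16, 0.17, 0.15, 0.05])

params, _ = curve_fit(briere1, temps, rates, p0=[1e-4, 10.0, 36.0], maxfev=10_000)
print(dict(zip(["a", "t_min", "t_max"], params)))
```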
5. Integrated data plotting for quantum electronics experiments
Supervisor: Edward Laird <e.a.laird@lancaster.ac.uk> (Physics)
RSE mentor: John Fozard (SMS)
Versatile and easy-to-use measurement software is essential for experiments in quantum electronics, which is one of the most rapidly growing areas of physics. For data acquisition, this need is now met by the open-source QCoDeS framework, which has been adopted by most of the groups in the field, including mine. However, data inspection must be done outside this framework, either by exporting data step-by-step to an analysis program such as Matlab, or by writing ad-hoc plotting programs that need to be changed with each experiment.
In this project, the intern will develop a set of generalised plotting routines that interface with the existing generalised sweep routines that are part of QCoDeS. The aim will be to automatically plot all measurement results in matplotlib, regardless of what is being measured; for example, we should be able to live-plot the transition intensity, frequency, and coherence time of a qubit regardless of what parameter(s) we are sweeping. All of these things can be done with existing libraries, but these libraries are not well interfaced with the code that actually runs experiments, which makes on-the-fly evaluation difficult.
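As a sketch of the general idea only (a generic matplotlib helper, not the QCoDeS interface the project will actually extend), a live-plotting routine that works regardless of what is being measured might look like the following:

```python
import matplotlib.pyplot as plt
import numpy as np

def live_plot(sweep_values, measure, xlabel="swept parameter", ylabel="measurement"):
    """Plot measurement results as they arrive, regardless of what `measure` returns.
    `measure` is any callable mapping a set-point to a scalar reading."""
    plt.ion()
    fig, ax = plt.subplots()
    line, = ax.plot([], [], "o-")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    xs, ys = [], []
    for x in sweep_values:
        xs.append(x)
        ys.append(measure(x))
        line.set_data(xs, ys)
        ax.relim()
        ax.autoscale_view()
        fig.canvas.draw_idle()
        plt.pause(0.01)        # let the GUI event loop refresh the figure
    plt.ioff()
    return fig

# Example "measurement": a noisy Lorentzian resonance, standing in for a qubit readout.
live_plot(np.linspace(-5, 5, 100),
          lambda f: 1.0 / (1.0 + f**2) + 0.02 * np.random.randn())
plt.show()
```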
The intern will be embedded in my research group of seven experimentalists and will immediately be able to see successive versions of their code in use. The output of the project should be submitted for inclusion in the QCoDeS library and I predict that it will be widely used in my group and ultimately by quantum electronics researchers worldwide.
6. Modelling photosynthesis
Supervisor: Samuel Taylor <s.taylor19@lancaster.ac.uk> (LEC)
RSE Mentor: Dr. Supreeta Vijaykumar (LEC)
Models of photosynthesis are important tools for understanding plant responses to global change and are increasingly used to predict opportunities for targeted engineering of core metabolic processes like photosynthesis, in support of improved agricultural productivity or carbon storage. In the project PhotoBoost (https://www.photoboost.org/), a digital twin of photosynthetic metabolism, e-Photosynthesis, is used to explore opportunities for engineering next-level photosynthesis in potato and rice. An objective is to evaluate molecular biology interventions that could enhance photosynthetic carbon assimilation, in order to fuel improved crop yields and more resilient crop growth while taking account of key environmental controls on photosynthesis, including light, water, and atmospheric carbon dioxide.

By participating, you will learn about the fundamentals of widely used leaf-level models applicable not only to simulations in crop biology, but also to ecology and global change modelling, using this understanding to apply quality control and parameterise non-linear models specific to the target crops. A key goal of the internship will be to evaluate novel data describing photosynthetic responses to light and carbon dioxide. These data will be used to test simulations produced by an advanced version of e-Photosynthesis that models metabolic regulation affecting the central carbon-fixing enzyme Rubisco. You will develop your skills in the use of R and MATLAB for programming, data analysis, and visualisation. You will work alongside an experienced researcher mentor on a day-to-day basis, with weekly support from your supervisor and weekly small-group team meetings where skills in data science are shared.
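To give a flavour of the kind of non-linear leaf-level model involved (a generic textbook example, not the PhotoBoost or e-Photosynthesis code), the sketch below fits a non-rectangular hyperbola light-response curve to hypothetical assimilation data in Python:

```python
import numpy as np
from scipy.optimize import curve_fit

def nrh(q, phi, a_max, theta, r_d):
    """Non-rectangular hyperbola: net CO2 assimilation A as a function of light Q."""
    b = phi * q + a_max
    return (b - np.sqrt(b**2 - 4.0 * theta * phi * q * a_max)) / (2.0 * theta) - r_d

# Hypothetical light-response measurements: PPFD (umol m-2 s-1) vs net assimilation.
q_obs = np.array([0, 50, 100, 200, 400, 800, 1200, 1800], dtype=float)
a_obs = np.array([-1.0, 1.5, 3.5, 6.0, 9.0, 11.5, 12.5, 13.0])

# Bounds keep the curvature parameter theta in (0, 1] so the square root stays real.
params, _ = curve_fit(nrh, q_obs, a_obs,
                      p0=[0.05, 15.0, 0.7, 1.0],
                      bounds=(1e-6, [1.0, 50.0, 1.0, 10.0]))
print(dict(zip(["phi", "a_max", "theta", "r_d"], params)))
```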
7. Automated pest detection for companion planting

As our global climate changes, our reliance on pesticides to grow crops is ever-increasing. One method to reduce the overuse of pesticides may be the age-old tradition of “companion planting”, where a second plant is grown alongside the first to attract predators that eat the pests. To determine if this approach can be applied at scale we have partnered with RHS Wisley. If shown to be successful, this method can be applied to fruit and vegetable crops to improve both food security and food sustainability.
However, to determine if this approach effectively reduces pest infestation we must track how pest numbers change across the growing season. Traditionally this requires regular, by-hand ‘bug counting days’: a highly inefficient technique that is not applicable at scale. To optimise this process for wide-scale use we will combine high-resolution photography with source-detection techniques developed in astronomy.
In this internship we will apply a set of AI algorithms, with a training set built from labels determined by the general public (“citizen science”), to automate the detection of invasive pests. The astrophysics group at Lancaster has already developed and applied such techniques to a diverse range of challenges spanning global security, healthcare, and catastrophe management. In this 8-week study, we will develop the framework required to store, analyse, and interpret each image collected, and develop machine learning algorithms that will automate the detection of pests. If successful, this technique will be rolled out across the RHS to quantify the importance of companion planting across all types of crop.
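To illustrate the source-detection idea borrowed from astronomy (a generic sketch using scipy, not the group's actual pipeline or the machine learning stage), the Python below thresholds an image and returns the centroids of connected bright regions:

```python
import numpy as np
from scipy import ndimage

def detect_sources(image, n_sigma=5.0, min_pixels=4):
    """Threshold an image at background + n_sigma * noise and return the
    centroids of connected bright regions ("sources", here candidate pests)."""
    background = np.median(image)
    noise = np.std(image)
    mask = image > background + n_sigma * noise
    labels, n_found = ndimage.label(mask)
    # Drop tiny regions that are likely noise spikes rather than real objects.
    sizes = ndimage.sum(mask, labels, index=range(1, n_found + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_pixels]
    return ndimage.center_of_mass(image, labels, keep)

# Synthetic test image: flat noisy background with two bright blobs.
rng = np.random.default_rng(1)
img = rng.normal(0.0, 1.0, size=(200, 200))
img[50:56, 50:56] += 20.0
img[120:126, 80:86] += 20.0
print(detect_sources(img))
```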
8. Applying new methods for historical spelling normalisation to Early English Books Online
Early Modern English (EModE, c. 1500–1700) is the earliest period of the English language which can be analysed at a large scale, thanks to the introduction of the printing press by Caxton in 1476. The Early English Books Online (EEBO) project set out to digitise, and then transcribe, every printed book available during the EModE period, resulting in a dataset of c. 1.1 billion words. Dealing with a corpus of this size presents challenges for Digital Humanities. Previous efforts include the Linguistic DNA project, and UCREL has processed an instance on CQPWeb.
Spelling variation, prevalent in the EModE period due to the lack of language standardisation, is an issue for any linguistic analysis of EModE texts: word frequencies are split between spelling variants, and key tasks such as part-of-speech tagging have considerably reduced accuracy. The standard approach is to introduce a spelling normalisation step within the processing pipeline, which normalises spellings to a modern form to improve the accuracy of downstream tasks.
This project aims to develop and evaluate new methods for this spelling normalisation step, particularly focusing on translation models, which have previously shown success on the task. The result will be to establish which methods are most appropriate for this task without introducing spurious normalisations, which add artificial noise. The best method will then be applied to EEBO, providing an enriched version which can be processed for further linguistic analysis.
The first stage of the research will be to evaluate existing translation methods previously used on historical texts, Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) using bidirectional LSTMs, and also to apply a newer approach: fine-tuning a Transformer-based language model such as BiBERT or DistilBERT. This evaluation will focus only on historical English, using previously prepared corpora: ICAMET, Shakespeare, Newsbooks, and CEEC. The intern will implement these models, utilising existing codebases, and apply them to the listed corpora, reporting which method performs best for each corpus in terms of normalisations made and the number of spurious normalisations. Provided the first stage is successful, the second stage will be to apply the best-performing normalisation method to the very large EEBO dataset. This will involve the intern building a pipeline to process the texts efficiently, performing the normalisation, and outputting in a format amenable to further analysis, with original spellings and normalised forms aligned.
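Purely as an illustration of the evaluation described above (a generic sketch with hypothetical tokens, not tied to any particular model), counting correct, missed, and spurious normalisations against gold-standard modern forms might look like this:

```python
def evaluate_normalisation(originals, gold, predicted):
    """Compare a normaliser's output against gold-standard modern forms,
    counting correct changes, missed changes, wrong changes, and spurious
    normalisations (words changed when they should have been left alone)."""
    correct = missed = wrong = spurious = 0
    for orig, g, p in zip(originals, gold, predicted):
        needs_change = orig != g
        changed = orig != p
        if needs_change and p == g:
            correct += 1
        elif needs_change and not changed:
            missed += 1
        elif needs_change and changed:        # changed, but to the wrong form
            wrong += 1
        elif not needs_change and changed:
            spurious += 1
    return {"correct": correct, "missed": missed,
            "wrong": wrong, "spurious": spurious}

# Hypothetical EModE tokens, their gold modern forms, and a system's predictions.
originals = ["loue", "haue", "king", "vpon", "said"]
gold      = ["love", "have", "king", "upon", "said"]
predicted = ["love", "haue", "king", "upon", "sad"]
print(evaluate_normalisation(originals, gold, predicted))
```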