Within the Data Science Institute, we aim to improve the reproducibility and replicability of research by improving the reusability, sustainability, and quality of research software developed across the University. We are currently funded by the N8CIR and work closely with our partner institutions.
What is Research Software Engineering?
Research Software Engineering combines academic research with good software engineering principles.
Combining good software engineering principles with high academic standards allows us to produce open, reproducible, and replicable research of the highest quality. However, a lack of appreciation for software as a research output means that software is often treated as a secondary concern, leading to lower-quality code that is not reproducible, or even shareable.
Research Software Engineers (RSEs) combine the skills needed to work with researchers and develop both software and research of the highest quality.
Training Courses
We are no longer able to offer in-person training. Many of the resources we use can serve as self-teaching guides, and links to them are provided below.
Various types of infrastructure are available to researchers at Lancaster to support them in computationally intensive research or to meet their data needs.
The High-End Computing (HEC) Cluster is a centrally-run service to support researchers and research students at Lancaster who require high-performance and high-throughput computing. This includes computing workloads with requirements that can’t be met by the Interactive Unix Service (IUS), desktop PCs or Virtual Desktops (MyDesktop and MyLab).
The Data Immersion Suite, and its associated Control Room, are dedicated teaching and research facilities at the University enabling educators to create complex scenarios which might unfold over hours or even days.
A SafePod is a standardised safe setting that provides the necessary security for a researcher to access sensitive datasets from participating Data Centres across the UK. The SafePod at Lancaster University is part of the SafePod Network (SPN). A total of 25 SafePods will make up the SafePod Network, spread geographically across the UK. This network will remove the need for long distance travel to a dedicated safe setting provided by a data centre.
MyLab gives you online access to software on your device via a virtual PC Lab computer (just like you'd get in the Library and PC Labs on campus) from anywhere using your web browser.
Bede is the N8 GPU-accelerated Supercomputer. It is housed at Durham University and accessible to members of the N8, which includes Lancaster University.
The Interactive UNIX Service (IUS) provides a UNIX facility in the areas of study, research, development, and teaching for researchers whose needs are not best served by the University's Microsoft Windows-based services.
Community
Many researchers use software heavily as part of their research, and many of the problems we need to solve with software can be common across disciplines. By bringing researchers together we can create a collaborative community to share tools and libraries that we use and provide support in writing code.
To do this we have a Microsoft Teams group where researchers can ask for help and advice, or post useful tools and information related to research software. Follow this link to join the Research Software Network Teams Group.
DSI is pleased to announce the N8CIR internships programme for 2025
This programme is aimed at 2nd and 3rd year undergraduates interested in exploring research software engineering as a career. We are offering up to two 8-week positions starting on 18th June, with a £3,500 stipend for the period. Interested students should review the projects listed below and contact the relevant supervisors ahead of the nomination **deadline of 17th April**. Candidates will then be invited to an interview, at which the recipients of the internships will be selected.
Please email dsi-enquiries@lancaster.ac.uk if you have questions.
1. Optimising on-disc storage of Monte Carlo output for machine learning problems
In statistics, machine learning, and AI, models commonly lead to complex multi-dimensional probability distributions as the focus of interest. These probability distributions are often represented as a collection of random numbers – “Monte Carlo samples” – which must be stored for further processing and analysis. In many situations, the number and complexity of these Monte Carlo samples means that they must be streamed out of RAM to non-volatile storage, for example a binary file, column database, or object storage. However, the order in which we might write such samples is often incompatible with efficient read access in the future.
This project will explore efficient storage patterns across a variety of modern on-disk and cloud storage formats. A typical Monte Carlo sampler will be used as a test case, with different storage formats being investigated on a range of hardware from personal computers to HPC to cloud. Storage formats include, but are not limited to, HDF5, Zarr, Parquet, and TileDB. The project will culminate in a model that enables researchers (or even a machine) to choose the best storage format for their Monte Carlo samples, with the chance to design a library that abstracts the storage formats to provide a consistent interface for the user.
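As a rough illustration of this trade-off, the sketch below streams batches of samples to a chunked on-disk array (using the Python zarr and numpy packages and a hypothetical file name); the chunk shape chosen at write time determines which later read patterns are cheap.

```python
import numpy as np
import zarr

n_iterations, n_params, batch = 100_000, 50, 1_000

# Chunking by iteration blocks (rows) favours appending during sampling;
# chunking by parameter (columns) favours later per-parameter trace reads.
store = zarr.open(
    "mc_samples.zarr", mode="w",
    shape=(n_iterations, n_params),
    chunks=(batch, n_params),          # write-friendly layout
    dtype="f8",
)

rng = np.random.default_rng(0)
for start in range(0, n_iterations, batch):
    samples = rng.normal(size=(batch, n_params))   # stand-in for a real Monte Carlo kernel
    store[start:start + batch, :] = samples        # streamed out of RAM batch by batch

# Reading a single parameter's trace later touches every chunk with this layout;
# a (n_iterations, 1) chunk shape would make this read cheap instead.
trace = store[:, 0]
```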
2. Distributed Search Space Reduction for Program Synthesis
Supervisor: Barry Porter <b.f.porter@lancaster.ac.uk> (SCC)
RSE Mentor: John Vidler (SCC)
Genetic programming (GP) is an approach to synthesising new programs for novel problems by searching through theoretical program space following a reward signal. Compared to large language models, this approach allows the synthesis of novel programs for entirely unseen problems. GP typically starts from an empty program and navigates outwards in various directions to try to find improved candidates. In this project you will develop an alternative approach, in which distributed parallel computing is used to incrementally narrow a search space. Your project will use an existing, novel program search space framework as a starting point; this framework is able to represent all of program search space as a regular rectangle, and operates part of its search process across GPUs. Your system will start by splitting the total theoretical search space into a number of equally-sized regions for parallel distributed search, and sampling random points from within each region. The most promising of these regions will then be selected as the area of focus, and will itself be split into a number of equally-sized regions for further parallel distributed search. Starting from our existing framework, you will focus specifically on the distributed systems aspect and the implementation of parallel search sampling. The resulting framework should be driveable and observable via a REST API.
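As a rough sketch of the narrowing loop only (not the existing framework), the Python below splits a hypothetical two-dimensional search space into equal regions, samples random points from each, keeps the best-scoring region, and recurses; the `score` function is a stand-in for the real reward signal.

```python
import random

def score(point):
    # Hypothetical reward signal: higher is better. In the real framework this
    # would evaluate a candidate program decoded from the point's coordinates.
    x, y = point
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)

def narrow(region, depth, n_splits=4, n_samples=64):
    """Recursively split `region` ((xmin, xmax), (ymin, ymax)) and keep the
    sub-region whose random samples score best."""
    if depth == 0:
        return region
    (xmin, xmax), (ymin, ymax) = region
    xs = [xmin + i * (xmax - xmin) / n_splits for i in range(n_splits + 1)]
    ys = [ymin + j * (ymax - ymin) / n_splits for j in range(n_splits + 1)]
    best_region, best_score = None, float("-inf")
    for i in range(n_splits):
        for j in range(n_splits):
            sub = ((xs[i], xs[i + 1]), (ys[j], ys[j + 1]))
            # In a distributed setting each sub-region would be sampled on a
            # separate worker; here they are evaluated sequentially.
            s = max(score((random.uniform(*sub[0]), random.uniform(*sub[1])))
                    for _ in range(n_samples))
            if s > best_score:
                best_region, best_score = sub, s
    return narrow(best_region, depth - 1, n_splits, n_samples)

print(narrow(((0.0, 1.0), (0.0, 1.0)), depth=5))
```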
3. gemlib: a python library for epidemic modelling
Come and contribute to the development of gemlib, an open-source Python library for simulating and calibrating epidemic models to real-world outbreak data! Epidemic models are used to understand and predict how infections spread in different settings. During the COVID-19 pandemic, real-time modelling was used to improve understanding of the pathogen, forecast disease dynamics, and evaluate interventions. Epidemic models are fundamentally Markov state-transition models, whereby a population of individuals is divided into mutually exclusive disease states and individuals move between these states according to time-varying transition rates. Such models become complex quickly, often including spatial features and individual interactions and stratifying the population by demographic characteristics at various scales. The parameters that govern these models are often unknown and need to be estimated. Bayesian inference methods such as MCMC and SMC are often used to account for censored data (such as unknown infection times) and estimate parameters of interest. These methods are computationally complex, and their implementation is technically challenging and time-consuming.
gemlib presents a unified framework for expressing and simulating models, as well as automatic generation of probability functions for parameter inference. The library enables researchers to rapidly spin up epidemic models during emerging outbreaks in a robust, reproducible manner. gemlib is based on the machine learning library TensorFlow, allowing complex models to be optimised on a GPU when needed. This project will enable an intern to contribute to open-source software development, implementing Bayesian inference algorithms as new classes in the library. There will be opportunities to expand skills in functional programming, high-performance computing, and functional testing.
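Purely as an illustration of the underlying state-transition idea (a generic sketch in plain numpy, not gemlib's API), a minimal discrete-time stochastic SIR model might look like this:

```python
import numpy as np

def simulate_sir(beta, gamma, s0, i0, r0, n_days, seed=0):
    """Discrete-time stochastic SIR simulation: individuals move S -> I -> R
    according to time-varying, density-dependent transition rates."""
    rng = np.random.default_rng(seed)
    s, i, r = s0, i0, r0
    n = s0 + i0 + r0
    history = [(s, i, r)]
    for _ in range(n_days):
        # Per-capita event probabilities over one time step.
        p_infect = 1.0 - np.exp(-beta * i / n)
        p_recover = 1.0 - np.exp(-gamma)
        new_infections = rng.binomial(s, p_infect)
        new_recoveries = rng.binomial(i, p_recover)
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return np.array(history)

print(simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, n_days=100)[-1])
```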
4. Pipeline for fitting thermal responses of mosquito traits
Mechanistic models for the impact of climate on transmission of vector-borne diseases rely on thermal responses that describe how vector and pathogen life history traits respond to temperature. This project will train a student intern in developing a data analysis pipeline to fit these thermal responses using a dataset of mosquito trait data digitised from previously published lab experiments. Over 8 weeks, the student will work with the project lead to: 1) Perform basic quality checks on the trait data and collect relevant metadata from the original articles, such that the completed dataset can be uploaded to the VecTraits database; 2) Write a pipeline in R to allow anyone to retrieve this data from VecTraits using the ohvbd package; 3) Fit a series of thermal performance curves (TPCs) to the data using the rTPC package; and 4) Visualise and interpret trends in these TPC fits using ggplot2 and conduct an appropriate statistical analysis. The training will emphasise writing pipelines that are open, reproducible, and flexible. The final pipeline will be hosted on GitHub.
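The pipeline itself will be written in R using rTPC and ohvbd, but purely as an illustration of what fitting a thermal performance curve involves, here is a minimal Python sketch that fits a Brière-1 curve to hypothetical digitised trait data:

```python
import numpy as np
from scipy.optimize import curve_fit

def briere1(temp, a, t_min, t_max):
    """Brière-1 thermal performance curve: zero outside (t_min, t_max)."""
    temp = np.asarray(temp, dtype=float)
    rate = a * temp * (temp - t_min) * np.sqrt(np.clip(t_max - temp, 0, None))
    return np.where((temp > t_min) & (temp < t_max), rate, 0.0)

# Hypothetical digitised trait data: mosquito development rate vs temperature (deg C).
temps = np.array([12, 16, 20, 24, 28, 30, 32, 34], dtype=float)
rates = np.array([0.01, 0.05, 0.09, 0.13, 0.16, 0.17, 0.15, 0.05])

params, _ = curve_fit(briere1, temps, rates, p0=[1e-4, 10.0, 36.0], maxfev=10_000)
print(dict(zip(["a", "t_min", "t_max"], params)))
```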
5. Integrated data plotting for quantum electronics experiments
Supervisor: Edward Laird <e.a.laird@lancaster.ac.uk> (Physics)
RSE mentor: John Fozard (SMS)
Versatile and easy-to-use measurement software is essential for experiments in quantum electronics, which is one of the most rapidly growing areas of physics. For data acquisition, this need is now met by the open-source QCoDeS framework, which has been adopted by most of the groups in the field, including mine. However, data inspection must be done outside this framework, either by exporting data step-by-step to an analysis program such as Matlab, or by writing ad-hoc plotting programs that need to be changed with each experiment.
In this project, the intern will develop a set of generalised plotting routines that interface with the existing generalised sweep routines that are part of QCoDeS. The aim will be to automatically plot all measurement results in matplotlib, regardless of what is being measured; for example, we should be able to live-plot the transition intensity, frequency, and coherence time of a qubit regardless of what parameter(s) we are sweeping. All of these things can be done with existing libraries, but these libraries are not well interfaced with the code that actually runs experiments, which makes on-the-fly evaluation difficult.
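As a sketch of the general idea only (a generic matplotlib helper, not the QCoDeS interface the project will actually extend), a live-plotting routine that works regardless of what is being measured might look like the following:

```python
import matplotlib.pyplot as plt
import numpy as np

def live_plot(sweep_values, measure, xlabel="swept parameter", ylabel="measurement"):
    """Plot measurement results as they arrive, regardless of what `measure` returns.
    `measure` is any callable mapping a set-point to a scalar reading."""
    plt.ion()
    fig, ax = plt.subplots()
    line, = ax.plot([], [], "o-")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    xs, ys = [], []
    for x in sweep_values:
        xs.append(x)
        ys.append(measure(x))
        line.set_data(xs, ys)
        ax.relim()
        ax.autoscale_view()
        fig.canvas.draw_idle()
        plt.pause(0.01)        # let the GUI event loop refresh the figure
    plt.ioff()
    return fig

# Example "measurement": a noisy Lorentzian resonance, standing in for a qubit readout.
live_plot(np.linspace(-5, 5, 100),
          lambda f: 1.0 / (1.0 + f**2) + 0.02 * np.random.randn())
plt.show()
```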
The intern will be embedded in my research group of seven experimentalists and will immediately be able to see successive versions of their code in use. The output of the project should be submitted for inclusion in the QCoDeS library and I predict that it will be widely used in my group and ultimately by quantum electronics researchers worldwide.
6. Modelling photosynthesis
Supervisor: Samuel Taylor <s.taylor19@lancaster.ac.uk> (LEC)
RSE Mentor: Dr. Supreeta Vijaykumar (LEC)
Models of photosynthesis are important tools for understanding plant responses to global change and are increasingly used to predict opportunities for targeted engineering of core metabolic processes like photosynthesis, in support of improved agricultural productivity or carbon storage. In the project PhotoBoost (https://www.photoboost.org/), a digital twin of photosynthetic metabolism, e-Photosynthesis, is used to explore opportunities for engineering next-level photosynthesis in potato and rice. An objective is to evaluate molecular biology interventions that could enhance photosynthetic carbon assimilation, in order to fuel improved crop yields and more resilient crop growth while taking account of key environmental controls on photosynthesis, including light, water, and atmospheric carbon dioxide.

By participating, you will learn about the fundamentals of widely used leaf-level models applicable not only to simulations in crop biology, but also to ecology and global change modelling, using this understanding to apply quality control and parameterise non-linear models specific to the target crops. A key goal of the internship will be to evaluate novel data describing photosynthetic responses to light and carbon dioxide. These data will be used to test simulations produced by an advanced version of e-Photosynthesis that models metabolic regulation affecting the central carbon-fixing enzyme Rubisco. You will develop your skills in the use of R and MATLAB for programming, data analysis, and visualisation. You will work alongside an experienced researcher mentor on a day-to-day basis, with weekly support from your supervisor and weekly small-group team meetings where skills in data science are shared.
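To give a flavour of the kind of non-linear leaf-level model involved (a generic textbook example, not the PhotoBoost or e-Photosynthesis code), the sketch below fits a non-rectangular hyperbola light-response curve to hypothetical assimilation data in Python:

```python
import numpy as np
from scipy.optimize import curve_fit

def nrh(q, phi, a_max, theta, r_d):
    """Non-rectangular hyperbola: net CO2 assimilation A as a function of light Q."""
    b = phi * q + a_max
    return (b - np.sqrt(b**2 - 4.0 * theta * phi * q * a_max)) / (2.0 * theta) - r_d

# Hypothetical light-response measurements: PPFD (umol m-2 s-1) vs net assimilation.
q_obs = np.array([0, 50, 100, 200, 400, 800, 1200, 1800], dtype=float)
a_obs = np.array([-1.0, 1.5, 3.5, 6.0, 9.0, 11.5, 12.5, 13.0])

# Bounds keep the curvature parameter theta in (0, 1] so the square root stays real.
params, _ = curve_fit(nrh, q_obs, a_obs,
                      p0=[0.05, 15.0, 0.7, 1.0],
                      bounds=(1e-6, [1.0, 50.0, 1.0, 10.0]))
print(dict(zip(["phi", "a_max", "theta", "r_d"], params)))
```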
7. Automated pest detection for companion planting

As our global climate changes, our reliance on pesticides to grow crops is ever-increasing. One method to reduce the overuse of pesticides may be the age-old tradition of “companion planting”, where a second plant is grown alongside the first to attract predators that eat the pests. To determine if this approach can be applied at scale we have partnered with RHS Wisley. If shown to be successful, this method can be applied to fruit and vegetable crops to improve both food security and food sustainability.
However, to determine if this approach effectively reduces pest infestation we must track how pest numbers change across the growing season. Traditionally this requires regular, by-hand ‘bug counting days’: a highly inefficient technique that is not applicable at scale. To optimise this process for wide-scale use we will combine high-resolution photography with source-detection techniques developed in astronomy.
In this internship we will apply a set of AI algorithms, with a training set built from labels determined by the general public (“citizen science”), to automate the detection of invasive pests. The astrophysics group at Lancaster has already developed and applied such techniques to a diverse range of challenges spanning global security, healthcare, and catastrophe management. In this 8-week study, we will develop the framework required to store, analyse, and interpret each image collected, and develop machine learning algorithms that will automate the detection of pests. If successful, this technique will be rolled out across the RHS to quantify the importance of companion planting across all types of crop.
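To illustrate the source-detection idea borrowed from astronomy (a generic sketch using scipy, not the group's actual pipeline or the machine learning stage), the Python below thresholds an image and returns the centroids of connected bright regions:

```python
import numpy as np
from scipy import ndimage

def detect_sources(image, n_sigma=5.0, min_pixels=4):
    """Threshold an image at background + n_sigma * noise and return the
    centroids of connected bright regions ("sources", here candidate pests)."""
    background = np.median(image)
    noise = np.std(image)
    mask = image > background + n_sigma * noise
    labels, n_found = ndimage.label(mask)
    # Drop tiny regions that are likely noise spikes rather than real objects.
    sizes = ndimage.sum(mask, labels, index=range(1, n_found + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_pixels]
    return ndimage.center_of_mass(image, labels, keep)

# Synthetic test image: flat noisy background with two bright blobs.
rng = np.random.default_rng(1)
img = rng.normal(0.0, 1.0, size=(200, 200))
img[50:56, 50:56] += 20.0
img[120:126, 80:86] += 20.0
print(detect_sources(img))
```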
8. Applying new methods for historical spelling normalisation to Early English Books Online
Early Modern English (EModE, c. 1500–1700) is the earliest period of the English language which can be analysed at a large scale, thanks to the introduction of the printing press by Caxton in 1476. The Early English Books Online (EEBO) project set out to digitise, and then transcribe, every printed book available during the EModE period, resulting in a dataset of c. 1.1 billion words. Dealing with a corpus of this size presents challenges for Digital Humanities. Previous efforts include the Linguistic DNA project, and UCREL has processed an instance on CQPWeb.
Spelling variation, prevalent in the EModE period due to the lack of language standardisation, is an issue for any linguistic analysis of EModE texts: word frequencies are split between spelling variants, and key tasks such as part-of-speech tagging have considerably reduced accuracy. The standard approach is to introduce a spelling normalisation step within the processing pipeline, which normalises spellings to a modern form to improve the accuracy of downstream tasks.
This project aims to develop and evaluate new methods for this spelling normalisation step, particularly focusing on translation models, which have previously shown success on the task. The result will be to establish which methods are most appropriate for this task without introducing spurious normalisations, which add artificial noise. The best method will then be applied to EEBO, providing an enriched version which can be processed for further linguistic analysis.
The first stage of the research will be to evaluate existing translation methods previously used on historical texts, Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) using bidirectional LSTMs, and also to apply a newer approach: fine-tuning a Transformer-based language model such as BiBERT or DistilBERT. This evaluation will focus only on historical English, using previously prepared corpora: ICAMET, Shakespeare, Newsbooks, and CEEC. The intern will implement these models, utilising existing codebases, and apply them to the listed corpora, reporting which method performs best for each corpus in terms of normalisations made and the number of spurious normalisations. Provided the first stage is successful, the second stage will be to apply the best-performing normalisation method to the very large EEBO dataset. This will involve the intern building a pipeline to process the texts efficiently, performing the normalisation, and outputting in a format amenable to further analysis, with original spellings and normalised forms aligned.
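Purely as an illustration of the evaluation described above (a generic sketch with hypothetical tokens, not tied to any particular model), counting correct, missed, and spurious normalisations against gold-standard modern forms might look like this:

```python
def evaluate_normalisation(originals, gold, predicted):
    """Compare a normaliser's output against gold-standard modern forms,
    counting correct changes, missed changes, wrong changes, and spurious
    normalisations (words changed when they should have been left alone)."""
    correct = missed = wrong = spurious = 0
    for orig, g, p in zip(originals, gold, predicted):
        needs_change = orig != g
        changed = orig != p
        if needs_change and p == g:
            correct += 1
        elif needs_change and not changed:
            missed += 1
        elif needs_change and changed:        # changed, but to the wrong form
            wrong += 1
        elif not needs_change and changed:
            spurious += 1
    return {"correct": correct, "missed": missed,
            "wrong": wrong, "spurious": spurious}

# Hypothetical EModE tokens, their gold modern forms, and a system's predictions.
originals = ["loue", "haue", "king", "vpon", "said"]
gold      = ["love", "have", "king", "upon", "said"]
predicted = ["love", "haue", "king", "upon", "sad"]
print(evaluate_normalisation(originals, gold, predicted))
```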