Programme

Registration, lunches and the poster session will be in Management School breakout area 2 (outside LT2 and LT3). The talks will take place in Management School LT3.

Day 1: Monday 12th September 2016

12:00 - 13:00 Registration and Lunch
13:15 - 14:45 Session 1
Mark Girolami
Shreena Patel
Coffee Break
15:30 - 17:00 Session 2
Tim Park
Chris Williams
17:00 - 18:30 Poster Session
Lancaster University Management School

Day 2: Tuesday 13th September 2016

09:30 - 11:00 Session 3
Patrick Rubin-Delanchy
Nick Heard
Coffee Break
11:30 - 12:15 Session 4
Brian McWilliams
Lunch Break
14:00 - 15:30 Session 5
Rob Johnson
Sumeetpal Singh
Coffee Break
16:00 - 17:30 Session 6
Jak Marshall
Phillipa Spencer
19:30 - Workshop Dinner
Lancaster House Hotel

Day 3: Wednesday 14th September 2016

09:30 - 11:00 Session 7
Christine Currie
John Reid
Coffee Break
11:30 - 12:15 Session 8
Magnus Rattray
12:15 - Lunch and close

Mark Girolami

Title: Probabilistic Numerical Computation: A New Concept?

Abstract: The vast amounts of data in many different forms becoming available to politicians, policy makers, technologists, and scientists of every hue present tantalising opportunities for making advances never before considered feasible. Yet with these apparent opportunities has come an increase in the complexity of the mathematics required to exploit this data. These sophisticated mathematical representations are much more challenging to analyse, and more and more computationally expensive to evaluate. This is a particularly acute problem for many tasks of interest, such as making predictions, since these will require the extensive use of numerical solvers for linear algebra, optimization, integration or differential equations. These methods will tend to be slow, due to the complexity of the models, and this will potentially lead to solutions with high levels of uncertainty. This talk will introduce our contributions to an emerging area of research defining a nexus of applied mathematics, statistical science and computer science, called “probabilistic numerics”. The aim is to consider numerical problems from a statistical viewpoint, and as such provide numerical methods for which numerical error can be quantified and controlled in a probabilistic manner. This philosophy will be illustrated on problems ranging from predictive policing via crime modelling to computer vision, where probabilistic numerical methods provide a rich and essential quantification of the uncertainty associated with such models and their computation.
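
As a small illustration of the idea (my own sketch, not taken from the talk), Bayesian quadrature is one flavour of probabilistic numerics: place a Gaussian process prior on an integrand, condition on a few evaluations, and read off a Gaussian posterior over the value of the integral, so the numerical error comes with a probabilistic error bar. All choices below (kernel, lengthscale, test integrand) are illustrative.

import numpy as np
from scipy.special import erf

def rbf(a, b, ell=0.2):
    # Squared-exponential kernel between two 1-D point sets
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def kernel_mean(x, ell=0.2):
    # z_i = integral_0^1 k(x, x_i) dx, available in closed form for the RBF kernel
    return ell * np.sqrt(np.pi / 2) * (erf((1 - x) / (np.sqrt(2) * ell))
                                       + erf(x / (np.sqrt(2) * ell)))

f = lambda x: np.sin(3 * x) + x ** 2           # toy integrand on [0, 1]
x = np.linspace(0.05, 0.95, 8)                 # a handful of function evaluations
y = f(x)

K = rbf(x, x) + 1e-10 * np.eye(len(x))
z = kernel_mean(x)

post_mean = z @ np.linalg.solve(K, y)          # posterior mean of the integral
g = np.linspace(0, 1, 400)
kk = rbf(g, g).mean()                          # grid approximation of the double integral of k over [0,1]^2
post_var = kk - z @ np.linalg.solve(K, z)      # posterior variance of the integral

truth = (1 - np.cos(3)) / 3 + 1 / 3
print(f"estimate {post_mean:.4f} +/- {np.sqrt(max(post_var, 0)):.4f}, truth {truth:.4f}")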

Christine Currie

Title: A worldwide investigation of tuberculosis epidemics

Abstract: Tuberculosis (TB) is a worldwide problem but the epidemiology varies depending on country characteristics. In this presentation I will discuss the statistical analysis of TB epidemics in 211 countries with a view to proposing more efficient and targeted TB control strategies. Countries are classified by how their TB case notification rates have evolved over time and by the age distribution of those suffering from active TB disease in 2008. Further analysis of key statistics associated with each of the countries shows the impact of different indicators. As expected, HIV is a key driver of TB epidemics and affects their age distribution and their scale. The level of development of a country, its wealth and immigration are also found to influence the shape and severity of a country’s TB epidemic. Results of the analysis can be used to recommend how countries might prioritise their control efforts. Joint work with Kathryn Hoad (Warwick Business School).
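
To give a flavour of the kind of classification described above (a hypothetical sketch with synthetic data, not the talk's actual analysis), one can cluster countries by the shape of their notification-rate trajectories after normalising out scale:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
years = np.arange(1990, 2009)
t = (years - years[0]) / (years[-1] - years[0])

declining = 100 * np.exp(-1.5 * t)              # steadily falling epidemics
resurgent = 60 + 40 * np.sin(np.pi * t)         # rise-then-fall trajectories
flat = np.full_like(t, 80.0)                    # static epidemics

# Synthetic notification rates for 75 illustrative "countries"
rates = np.vstack([shape + rng.normal(0, 5, len(t))
                   for shape in [declining] * 30 + [resurgent] * 25 + [flat] * 20])

# Normalise each trajectory so clustering reflects shape, not scale
shapes = (rates - rates.mean(axis=1, keepdims=True)) / rates.std(axis=1, keepdims=True)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(shapes)
for k in range(3):
    print(f"cluster {k}: {np.sum(labels == k)} countries")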

Jak Marshall

Title: Massively Multiplayer Data: Challenges in Mobile Game Analytics

Abstract: This year, mobile gaming will become the majority revenue driver for the entire games industry for the first time in history, surpassing revenues generated by PC gaming, and raking in a projected $52.5bn in revenue worldwide. The continuing success of individual companies in this field is largely dependent on their ability to derive additional value from their player base by exploiting the wealth of data available to them, particularly in server-based titles. As well as facing the unique challenges of this diverse range of products, game developers are also missing out immensely on the existing knowledge base contained in the academic world. This talk aims to serve as a primer for those interested in working with mobile games and a call to arms for enterprising academics to make an impact in a rapidly growing global market.

Nick Heard

Title: Modelling structure within computer network traffic data

Abstract: NetFlow data are aggregated summaries (meta-data) of the traffic passing around a computer network from one internet protocol (IP) address to another, and lie at the heart of much of the statistical research in cyber-security. Understanding the NetFlow data generated by an IP address requires modelling several layers of dependency: some NetFlow events are automated traffic, some are generated by humans; some are a mixture of the two, with human events triggering automated ones; and the events typically arise in bursts within a higher level diurnal seasonal pattern. Furthermore, the human events are themselves typically generated from a mixture of individuals or behavioural facets. However, a typical enterprise computer network, for example, comprises tens or hundreds of thousands of IP addresses, each with their own specific characteristics and vulnerabilities. And so against the need for complex models to capture the complex behaviour of each IP address is the overarching need for scalable analytics for performing statistical cyber-security monitoring in real time. This talk discusses some approaches which seek to strike a balance between these conflicting requirements by capturing some of the highest level dependencies in NetFlow data.
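
The layered structure described above can be made concrete with a toy simulation (my own illustration, not the speaker's model): a single IP address whose events are a mixture of automated polling at a fixed period and human activity following a diurnal inhomogeneous Poisson process, simulated by thinning.

import numpy as np

rng = np.random.default_rng(1)
hours_per_day, n_days = 24.0, 3
T = hours_per_day * n_days

# Automated traffic: a beacon every 30 minutes with small jitter
auto = np.arange(0.0, T, 0.5) + rng.normal(0, 0.02, int(T / 0.5))

# Human traffic: diurnal intensity peaking mid-afternoon (events per hour)
lam = lambda t: 8.0 * np.exp(-0.5 * ((t % 24 - 15) / 3.0) ** 2)
lam_max = 8.0

# Lewis-Shedler thinning: propose homogeneous events, keep with probability lam(t)/lam_max
proposals = np.cumsum(rng.exponential(1.0 / lam_max, size=int(3 * lam_max * T)))
proposals = proposals[proposals < T]
human = proposals[rng.random(len(proposals)) < lam(proposals) / lam_max]

events = np.sort(np.concatenate([auto, human]))
print(f"{len(auto)} automated and {len(human)} human events over {n_days} days")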

Rob Johnson

Title: Identifying Fraud Networks

Abstract: Fraud is a fast-moving area where fraudsters are constantly seeking to identify weaknesses in a bank’s systems. Once a loophole has been identified, fraudsters are very quick to disseminate the technique in order to maximise their monetary gains through convincing other fraudsters in their network to do the same. This work endeavours to identify links between fraudsters so the bank can predict where the next attack is likely to come from and prevent the fraud spreading. Using a network where each node represents a person, an autologistic model then determines who is most likely to be fraudulent based on a node’s neighbours.
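
In an autologistic model of this kind, the probability that a node is fraudulent given its neighbours follows a logistic function of the number of fraudulent neighbours. The sketch below (my own minimal version, not the speaker's implementation, with an arbitrary random graph and made-up parameters) imputes unknown labels by Gibbs sampling while holding confirmed fraudsters fixed.

import numpy as np

rng = np.random.default_rng(2)
n = 40
A = (rng.random((n, n)) < 0.08).astype(int)
A = np.triu(A, 1); A = A + A.T                         # symmetric adjacency, no self-loops

alpha, beta = -2.0, 1.2                                # baseline log-odds, neighbour effect
known_fraud = [0, 1, 2]                                # nodes already confirmed as fraudsters

y = np.zeros(n, dtype=int)
y[known_fraud] = 1
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

samples = np.zeros(n)
n_iter, burn = 2000, 500
for it in range(n_iter):
    for i in range(n):
        if i in known_fraud:
            continue                                   # keep confirmed labels fixed
        p = sigmoid(alpha + beta * A[i] @ y)           # autologistic full conditional
        y[i] = rng.random() < p
    if it >= burn:
        samples += y

post_prob = samples / (n_iter - burn)
post_prob[known_fraud] = -1                            # exclude already-known fraudsters
print("highest-risk nodes:", np.argsort(-post_prob)[:5])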

Patrick Rubin-Delanchy

Title: Large-scale network data analysis with applications in cyber-security

Abstract: Network data is ubiquitous in cyber-security applications. Accurately modelling such data allows discovery of anomalous edges, subgraphs or paths, and is key to many signature-free cyber-security analytics. On the other hand, a number of features of the data, e.g. disassortativity, information on edges, scale, make many standard methods of analysis inadequate. Starting with exchangeability assumptions, we present a generic Bayesian framework for modelling such data. Under further simplifications, the approach can be reduced to a latent space network model where, crucially, the latent space is sometimes pseudo-Euclidean, as with space-time in special relativity. We derive a consistent spectral estimate which can be deployed at massive scales. Consistent hypothesis tests and asymptotic confidence intervals for the stochastic block model and mixed membership stochastic blockmodel are derived as a by-product. Results are illustrated on network flow data collected on an enterprise computer network.
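
A hedged sketch of the spirit of the spectral estimate (not the paper's exact estimator): an adjacency spectral embedding of a simulated two-block, disassortative stochastic block model. Keeping the eigenvalues of largest magnitude, including negative ones, is what yields a pseudo-Euclidean rather than Euclidean latent space.

import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 2
z = rng.integers(0, 2, n)                       # block memberships
B = np.array([[0.05, 0.20],                     # disassortative: more edges between blocks
              [0.20, 0.05]])
P = B[z][:, z]                                  # edge probabilities
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                  # symmetric adjacency, no self-loops

evals, evecs = np.linalg.eigh(A)
idx = np.argsort(-np.abs(evals))[:d]            # largest-magnitude eigenvalues, sign kept
X = evecs[:, idx] * np.sqrt(np.abs(evals[idx])) # spectral embedding of each node
signature = np.sign(evals[idx])                 # a mix of +1 and -1 signals a pseudo-Euclidean space

print("eigenvalue signature:", signature)
print("embedding of first 3 nodes:\n", X[:3].round(3))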

Chris Williams

Title: Input-Output Non-Linear Dynamical Systems applied to Physiological Condition Monitoring

Abstract: We present a non-linear dynamical system for modelling the effect of drug infusions on the vital signs of patients admitted to Intensive Care Units (ICUs). More specifically we are interested in modelling the effect of a widely used anaesthetic drug called Propofol on a patient's monitored depth of anaesthesia and haemodynamics. We compare our approach with one from the Pharmacokinetics/Pharmacodynamics (PK/PD) literature and show that we can provide significant improvements in performance without requiring the incorporation of expert physiological knowledge in our system. Joint work with Konstantinos Georgatzis and Chris Hawthorne.
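
For orientation, the PK/PD models compared against are input-output dynamical systems of roughly the following form (a generic one-compartment illustration with made-up parameter values, not the speaker's model): an infusion drives drug concentration through a linear ODE, and a sigmoidal Emax curve maps concentration to the monitored vital sign.

import numpy as np

dt, T = 1.0, 600                                  # seconds
t = np.arange(0, T, dt)
infusion = np.where((t > 60) & (t < 300), 1.0, 0.0)   # mg/s Propofol-like input

V, k_elim = 20.0, 0.01                            # distribution volume (L), elimination rate (1/s)
E0, Emax, EC50, gamma = 95.0, 80.0, 2.0, 3.0      # illustrative PD parameters

conc = np.zeros_like(t)
for i in range(1, len(t)):
    # Euler step of dC/dt = infusion/V - k_elim * C
    conc[i] = conc[i - 1] + dt * (infusion[i - 1] / V - k_elim * conc[i - 1])

# Sigmoidal Emax link to a depth-of-anaesthesia-style index
effect = E0 - Emax * conc ** gamma / (EC50 ** gamma + conc ** gamma)
print("peak concentration %.2f mg/L, minimum index %.1f" % (conc.max(), effect.min()))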

John Reid

Title: Estimating pseudo-times from time series

Abstract: When studying dynamic systems, biologists will often assay gene expression at several fixed capture times. Unfortunately most experimental protocols are destructive and hence longitudinal time courses cannot be generated. Cross-sectional time courses can be achieved but are more difficult to analyse, especially when there is uncertainty around how far each sample has progressed through the system under study (its pseudo-time). This uncertainty confounds subsequent downstream analysis of the gene expression profiles generated from the data. We present a novel Gaussian process latent variable model to estimate these confounders by sharing statistical strength between time series. Improvements in single-cell sequencing techniques are generating ever larger gene expression data sets. We demonstrate how low-rank Gaussian process approximations can be used so the model is applicable to big data sets. Our model accurately estimates phases of the cell cycle in single-cell data from prostate cancer cells. Additionally it recovers known precocious cells from a single-cell analysis of lipopolysaccharide-stimulated mouse dendritic cells.
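
The low-rank approximation mentioned above can be illustrated in its simplest setting (a sketch under my own assumptions, applied to plain GP regression on a toy expression-versus-pseudo-time curve rather than the full latent variable model): inducing points reduce the cost from O(n^3) to O(n m^2).

import numpy as np

def rbf(a, b, ell=0.5, var=1.0):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(4)
n, m, noise = 2000, 30, 0.1                      # n cells, m inducing points
x = np.sort(rng.uniform(0, 10, n))               # pseudo-times
y = np.sin(x) + 0.5 * np.cos(3 * x) + rng.normal(0, noise, n)   # toy expression profile

u = np.linspace(0, 10, m)                        # inducing point locations
Kuu = rbf(u, u) + 1e-8 * np.eye(m)
Kuf = rbf(u, x)

# Subset-of-regressors posterior mean: solve an m x m system instead of n x n
S = Kuu + Kuf @ Kuf.T / noise ** 2
alpha = np.linalg.solve(S, Kuf @ y) / noise ** 2

x_test = np.linspace(0, 10, 5)
mean_test = rbf(x_test, u) @ alpha               # approximate GP predictive mean
print(mean_test.round(3))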

Magnus Rattray

Title: Probabilistic modelling of dynamic processes in biological systems using Gaussian processes

Abstract: Biological systems are highly dynamic and must respond rapidly to external stimuli and an array of feedback systems at different scales. We are using probabilistic models to help model dynamics at different scales. Many of the models we have developed are based on Gaussian processes which are convenient non-parametric models that can represent time-varying functions with diverse characteristics. The advantage of using Gaussian processes lies both in their flexibility as models and their tractability when carrying out inference from data. We are using smooth models for data averaged over large ensembles of cells and stochastic processes for modelling single-cell time course data from microscopy experiments. I will give some examples of our recent work including: modelling delays in transcription dynamics from high-throughput sequencing time course data, identifying a sequence of perturbations in two-sample time course data, modelling bifurcations in high-dimensional single-cell expression data and uncovering periodicity from single-cell microscopy time course data.
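
As one concrete example of the periodicity question (my own toy sketch, not the group's method), a GP marginal likelihood under a periodic kernel can be compared against an aperiodic squared-exponential kernel on a circadian-like time course:

import numpy as np

def se_kernel(t, ell=2.0):
    return np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ell ** 2)

def periodic_kernel(t, period=24.0, ell=1.0):
    d = np.pi * np.abs(t[:, None] - t[None, :]) / period
    return np.exp(-2.0 * np.sin(d) ** 2 / ell ** 2)

def log_marginal(y, K, noise=0.1):
    # Standard GP log marginal likelihood via a Cholesky factorisation
    C = K + noise ** 2 * np.eye(len(y))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(5)
t = np.linspace(0, 72, 60)                                        # hours
y = np.sin(2 * np.pi * t / 24.0) + rng.normal(0, 0.2, len(t))     # 24-hour rhythm plus noise

print("periodic kernel  :", round(log_marginal(y, periodic_kernel(t)), 1))
print("aperiodic kernel :", round(log_marginal(y, se_kernel(t)), 1))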

Brian McWilliams

Title: Preserving differential privacy between features in distributed estimation

Abstract: Privacy is crucial in many applications of machine learning. Legal, ethical and societal issues restrict the sharing of sensitive data, making it difficult to learn from datasets that are partitioned between many parties. The differential privacy framework guarantees preserving anonymity in a large dataset and can provide a strong alternative to current best practices and legal guidelines. However, in the distributed setting very few approaches exist for private data sharing. To this end, we propose a scalable framework for distributed estimation where each party communicates perturbed sketches of their locally held features, ensuring differentially private data sharing. For L2 penalized supervised learning problems, our proposed method has bounded estimation error compared with the optimal estimates obtained without privacy constraints in the non-distributed setting. We confirm this empirically on real-world and synthetic datasets.
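
A schematic of the flavour of approach described above (not the authors' exact algorithm, and with no calibrated privacy accounting): each party holds a block of features, shares only a randomly projected and Gaussian-perturbed sketch of that block, and a coordinator fits an L2-penalised regression on the combined sketches.

import numpy as np

rng = np.random.default_rng(6)
n, p1, p2, k = 500, 20, 30, 100                     # samples, features per party, sketch size
X1, X2 = rng.normal(size=(n, p1)), rng.normal(size=(n, p2))
w_true = rng.normal(size=p1 + p2)
y = np.concatenate([X1, X2], axis=1) @ w_true + rng.normal(0, 0.5, n)

S = rng.normal(size=(k, n)) / np.sqrt(k)            # shared random projection over samples
sigma = 0.05                                        # perturbation scale governing the privacy level

def private_sketch(X):
    # Each party releases only its projected, noise-perturbed feature block
    return S @ X + rng.normal(0, sigma, size=(k, X.shape[1]))

Z = np.concatenate([private_sketch(X1), private_sketch(X2)], axis=1)
y_sk = S @ y                                        # sketched response

lam = 1.0                                           # ridge penalty
w_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y_sk)
print("relative error:", round(np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true), 3))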

Tim Park

Title: Inventory Optimisation for Offshore Platforms

Abstract: Shell operates oil and gas platforms all around the world. Keeping these platforms running requires a huge amount of repair and maintenance work and relies on spare parts being available sometimes at very short notice. As a result Shell keeps a large stock of spares on the platforms themselves. The negative effect of this is to tie up an estimated $1B of capital which could be invested elsewhere. The aim of this project is to assess if we can reduce these stock levels while still meeting maintenance targets. In general the demand for a material is intermittent with the time between orders not following any standard distribution. We therefore make use of bootstrap resampling techniques to model the demand within a given time window. The length of this window is defined by the time it takes to reorder a material, known as the lead time. This can be highly uncertain and outliers are common. We model lead times using Bayesian techniques, taking into account prior information, as well as Extreme Value distributions to account for heavily delayed deliveries. In this talk I will also discuss how the results of this analysis are communicated to the end user via a web-based tool.
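
A simplified sketch of the bootstrap idea described above (all numbers are illustrative, not Shell data): resample historical weekly demand for a part over a resampled lead time, and read off the stock level that meets a target service level.

import numpy as np

rng = np.random.default_rng(7)

# Intermittent historical demand (most weeks zero, occasional orders)
history = np.array([0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 0, 2, 0])

# Lead time in weeks: uncertain, with occasional heavy delays
lead_times = np.array([4, 5, 4, 6, 5, 4, 12, 5, 4, 5])

n_boot, target_service = 10000, 0.95
demand_over_lead = np.empty(n_boot)
for b in range(n_boot):
    L = rng.choice(lead_times)                               # resample a lead time
    demand_over_lead[b] = rng.choice(history, size=L, replace=True).sum()

stock = np.quantile(demand_over_lead, target_service)
print(f"stock {stock:.0f} units covers lead-time demand with ~{target_service:.0%} probability")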

Shreena Patel

Title: Data science at Dunnhumby

Abstract: Dunnhumby receives and analyses data from over 500 million customers worldwide, 16 million of whom are Tesco Clubcard users. Transactions from these customers are used to inform various decisions around Tesco, including store ranging, detection of out-of-stock items and customer segmentations for targeted marketing campaigns. In this talk, we look at models developed for Tesco Online. ‘Have You Forgotten’ provides personalised recommendations to customers at the point of checkout based on purchase history. A key part of this process is aligning offline metrics for a given model’s performance with live results after the model has been implemented. In addition to accurately predicting which items a customer has omitted from their basket, the model must balance probability of conversion with spend. At the point of delivery, a customer may be offered a substitute for an unavailable item in their order. ‘Self-Learning Substitutes’ is a technique for identifying suitable and affordable product substitutions in this situation, based on historical rejection rates.

Sumeetpal Singh

Title: Blocking Strategies and Stability of Particle Gibbs Samplers

Abstract: Sampling from the posterior probability distribution of the latent states of a Hidden Markov Model (HMM) is a non-trivial problem even in the context of Markov Chain Monte Carlo. To address this Andrieu et al. (2010) proposed a way of using a Particle Filter to construct a Markov kernel which leaves this posterior distribution invariant. Recent theoretical results establish the uniform ergodicity of this Markov kernel and show that the mixing rate does not deteriorate provided the number of particles grows at least linearly with the number of latent states. However, this gives rise to a cost per application of the kernel that is quadratic in the number of latent states, which can be prohibitive for long observation sequences. Using blocking strategies, we devise samplers which have a stable mixing rate for a cost per iteration that is linear in the number of latent states and which are furthermore easily parallelizable. We then extend our method to sample from the posterior distribution of a HMM in the setting where the state transition model cannot be evaluated but can be simulated from; this class of HMMs is said to have an intractable transition density.
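
For readers unfamiliar with the building block involved, the sketch below shows a plain bootstrap particle filter for a toy linear-Gaussian HMM; particle Gibbs (Andrieu et al., 2010) turns a conditional version of this filter into a Markov kernel on the latent states. The model, parameters and resampling scheme here are my own illustrative choices, and the blocking strategies from the talk are not shown.

import numpy as np

rng = np.random.default_rng(8)
T, N = 200, 500                                  # time steps, particles
phi, sig_x, sig_y = 0.9, 1.0, 0.5                # AR(1) state and observation noise

# Simulate data from the HMM: x_t = phi x_{t-1} + v_t, y_t = x_t + w_t
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal(0, sig_x)
y = x + rng.normal(0, sig_y, T)

particles = rng.normal(0, sig_x, N)
filt_mean = np.zeros(T)
for t in range(T):
    if t > 0:
        particles = phi * particles + rng.normal(0, sig_x, N)    # propagate through the transition
    logw = -0.5 * ((y[t] - particles) / sig_y) ** 2               # weight by the observation likelihood
    w = np.exp(logw - logw.max()); w /= w.sum()
    filt_mean[t] = w @ particles
    particles = rng.choice(particles, size=N, p=w)                # multinomial resampling

print("RMSE of filtered mean:", round(np.sqrt(np.mean((filt_mean - x) ** 2)), 3))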