George Bolt - STOR-i

PhD Project

Learning to Group Research Profiles Through Online Academic Services

Introduction and Motivation

Elsevier are a company which specialise in the provision of online content and information to researchers. They are responsible for a large portfilio of products, including:
- Mendeley - a reference manager;
- ScienceDirect - a library of peer-reviewed literature;
- Scopus - an abstract and citation database;
- SciVal - a searchable database of research performance metrics, at institutional and national levels.
This by no means exhaustive, and a full list of their platforms can be explored here.

To better serve researchers, Elsevier are committed to the continuous development and improvement of the user experience. Key to this endeavour is the use of data science, which not only informs decisions on platform improvements, but also forms the basis of new features.

For example, Elsevier analyse their Scopus citation database to provide a tool within SciVal which gives users an overview of prominent or thriving research topics, be that globally, nationally or at a specific institution. Another example is their recommender system in Mendeley, Mendeley Suggest , which suggests relevant papers for users to read.

As a joint venture between the STOR-i Centre of Doctoral Training at Lancaster University and Elsevier, this PhD project looks to develop and apply tools from network analysis to make sense of their often high dimensional but structured datasets. Of particular interest is usage data for their various platforms, which lends itself naturally to a network-based representation. Successful analysis of this data would allow Elsevier to better understand how its platforms are being used, thus contributing to the platform improvemnt process.

Representing the Data as a Network

A network, analogous to a mathematical graph, is a tool often used in data science to represent complex interconected data. In the broadest sense a network consists of a set of nodes and edges, where often the nodes represent entities within some system and edges correspond to relations between these entities. Two examples are given below, where the left/right show undirected/directed networks.

Networks are used to represent data in a diverse range of settings, examples include:

Sociology - social networks and physical interaction networks;
Biology - protein-protein interaction networks;
Neuroscience/Psychology - brain connectiviy networks, known as connectomes;
Operations Research - transportation networks or networks of qeues.

In a similar vein, we are going to make use of a network representation to analyse Elsevier's online usage data. In this case nodes will correspond to platforms/products and edges will represent movements between products. If we then look to compare users based on this data, we will be left with a network for each user. This leaves us with multiple network data, a situation becoming more prevelant in the wider literature thanks in most part to improvements in data collection technologies.

Given this network representation, our research will be focussed on the development and implementation of network models which are applicable to the problem at hand. Certain features of the data provide new challenges which traditional network models often fail to meet, hence novel approaches must be developed.

Project Challenges and Aims

As alluded to above, this project throws up certain challenges which will require novel approaches. These include:

Volume of data, which itself brings forth two challenges,
- Multiple network data, since usage data is available for individual users;
- Computation, particularly when conducting inference for formulated models.
Temporal aspect of the data - often usage data comes with timestamps, which can be informative of network dynamics.

In the face of these challenges the project has the following initial aims:

Develop a statistical model for multiple network data, applicable to the Elsevier's usage data. This model should:

Respect features of the data;
Facilitate informative inference;
Be as flexible as possible.

Develop effective inference procedures for the proposed model, noting that with users in the millions these procedures would ideally be scaleable.

Last updated 08/12/2019

PhD Project

Learning to Group Research Profiles Through Online Academic Services

Introduction and Motivation

Representing the Data as a Network

Project Challenges and Aims