A phenomenon in cloud datacentres called ‘stragglers’ significantly slows down online activities ranging from everyday internet searches to fundamental scientific analysis.
Huge, complex modern cloud computing datacentres break tasks down into small pieces so that the information can be crunched more quickly across thousands of server nodes. However, a small number of these pieces can take much longer to process (hence the term ‘stragglers’), slowing down the whole task.
These delays, known as the ‘Long-tail Problem’, can hold up cloud-based tasks by anything from seconds to minutes, frustrating users and producing expensive, energy-intensive inefficiencies at giant datacentres. As datacentres grow larger, so does the problem of stragglers. The problem cannot be fixed by simply adding more server nodes, nor can it be diagnosed in a straightforward manner, because it stems from a multitude of possible causes, including failures, data size, system usage and even temperature.
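The effect of a single straggler on a whole task can be illustrated with a minimal Python sketch. It is not taken from the research project: the function names, the example durations and the 1.5x-median cut-off are all assumptions chosen purely for illustration. It assumes every piece of a task runs in parallel, so the task finishes only when its slowest piece does.

```python
from statistics import median


def job_completion_time(piece_durations):
    """Return when a parallel task finishes: the time of its slowest piece."""
    return max(piece_durations)


def find_stragglers(piece_durations, factor=1.5):
    """Return indices of pieces slower than `factor` times the median.

    The 1.5x-median cut-off is an assumed, illustrative threshold,
    not a definition used by the project described in this article.
    """
    cutoff = factor * median(piece_durations)
    return [i for i, d in enumerate(piece_durations) if d > cutoff]


# Hypothetical task: 99 pieces take 1 second each, one piece takes 10 seconds.
durations = [1.0] * 99 + [10.0]

total_time = job_completion_time(durations)       # dominated by the one slow piece
stragglers = find_stragglers(durations)           # index of the slow piece
time_without = job_completion_time(durations[:99])  # same task with no straggler
```

In this made-up case a single slow piece out of 100 makes the whole task take ten times longer than it otherwise would, which is why adding more nodes alone does not remove the problem.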
Stragglers could also hold back the full potential of emerging technologies such as driverless vehicles that would rely on cloud connectivity for real-time navigation and collision avoidance, where cloud communication delays could prove fatal.
A new two-year research project – ‘Pin the Tail: Understanding Straggler Manifestation in Internet-based Distributed Systems’ – will look closely at the computing conditions that cause stragglers. Researchers aim to identify the ‘perfect storm’ that causes stragglers to manifest, and to determine the optimal conditions for avoiding them.
Dr Peter Garraghan, Lecturer in Distributed Systems at Lancaster University’s School of Computing and Communications, said: “We know that stragglers can cause significant delays to distributed systems, such as the Internet of Things and cloud datacentres, causing a myriad of problems such as increasing operating costs and energy consumption of computing systems. This is a problem that the US tech giants have been scratching their heads over for quite a while.
“We don’t exactly know what conditions are likely to cause stragglers and so system managers are unsure of how to avoid their occurrence. They simply ‘live with it’, which going forward is becoming increasingly infeasible.
“Our work, in collaboration with leading industrial partners with massive-scale distributed systems, represents a significant step towards solving the long-tail problem and will provide direct benefits to the user experience, the operational costs of service providers, and will enhance the competitiveness of the UK digital economy.”
This research is being funded with £120,000 by the Engineering and Physical Sciences Research Council (EPSRC), and involves collaboration with Microsoft, the Science and Technology Facilities Council (STFC) and CIATEQ, a public research centre in Mexico.