On the 9th and 10th of January I got the opportunity to attend my first conference: the STOR-i annual conference. This served as an excellent opportunity to explore a range of contemporary research topics and start thinking about research areas which I may be interested in investigating further. Topics presented included: optimisation of air traffic scheduling (Alex Jacquillat, MIT), Bayesian optimisation (Henry Moss, STOR-i Lancaster University), multi-armed bandits (Ciara Pike-Burke, Universitat Pompeu Fabra), time series (Richard Davis, Columbia University), machine learning (Tom Flowerdew, Featurespace) and credit risk assessment (Veronica Viniciotti, Brunel University).
Of particular interest to me was the presentation by STOR-i alumna Ciara Pike-Burke, now of Universitat Pompeu Fabra in Barcelona, titled “Multi-armed bandit problems with history dependent rewards”. Multi-armed bandits are an area of much research interest at Lancaster University. They take their name from an analogy with the traditional one-armed bandit gambling machine (pictured to the right): a game which pays out random rewards to the player according to some unknown probability distribution, so called because of the large metal arm that had to be pulled on earlier machines to start the game. A multi-armed bandit is similar, except that rather than a single arm there are multiple arms, each with a different unknown distribution of possible outcomes, representing a number of different actions that could be taken in a given scenario with unknown payoff. Multi-armed bandit problems therefore deal with choosing which arm to pull so as to maximise reward, given that the only way to learn about each arm's reward distribution is by pulling it.
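To make this exploration–exploitation trade-off concrete, here is a small sketch (my own illustration, not from the talk) of the classic epsilon-greedy strategy on a simulated bandit whose arms pay out 1 or 0 with fixed unknown probabilities. All names and the parameter values are invented for the example.

```python
import random

def epsilon_greedy(arm_probs, n_pulls=10000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy play against Bernoulli arms.

    With probability epsilon we explore (pull a random arm); otherwise
    we exploit the arm with the highest estimated mean reward so far.
    """
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)   # times each arm has been pulled
    means = [0.0] * len(arm_probs)  # running mean reward per arm
    total = 0
    for _ in range(n_pulls):
        if rng.random() < epsilon:  # explore
            arm = rng.randrange(len(arm_probs))
        else:                       # exploit current best estimate
            arm = max(range(len(arm_probs)), key=lambda a: means[a])
        reward = 1 if rng.random() < arm_probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # update running mean
        total += reward
    return total, counts

# Three arms with (unknown to the player) success probabilities 0.2, 0.5, 0.8:
total, counts = epsilon_greedy([0.2, 0.5, 0.8])
```

After enough pulls the strategy concentrates on the best arm, but the small exploration probability is what lets it discover that arm in the first place.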
Normally, multi-armed bandit problems assume that the reward of each action is independent of all past rewards; however, this is not always the case. The example given in the presentation was web advertising: an advertiser may suggest a product to a user, and there is some chance that the user will buy it. For certain products, though, once a user has purchased the product (that is, the advertiser has “won” the game and been rewarded), the same user may be unlikely to purchase another. A user who has just bought a new oven, for example, is unlikely to want a second one, so the reward distribution for suggesting an oven to that user (pulling the “suggest oven” arm) has changed. On the other hand, the reward distribution for suggesting cookery equipment may also have changed, making that the arm the advertiser should now be pulling instead. The research being done by Ciara involves developing statistical methods to determine which arms have history-dependent rewards and which do not.
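The oven example can be sketched as a toy history-dependent arm (again my own illustration, not Ciara's method, with made-up probabilities): the chance of a sale collapses after the first success, so the arm's reward distribution depends on its own past rewards.

```python
import random

def pull_oven_arm(state, rng):
    """Toy history-dependent arm: once the user has bought an oven
    (state['bought'] is True), the chance of another sale collapses."""
    p = 0.02 if state["bought"] else 0.3  # assumed, purely illustrative, probabilities
    reward = 1 if rng.random() < p else 0
    if reward:
        state["bought"] = True  # the reward distribution changes from here on
    return reward

rng = random.Random(1)
state = {"bought": False}
rewards = [pull_oven_arm(state, rng) for _ in range(1000)]
```

A standard bandit algorithm that assumes independent rewards would keep crediting this arm for its early success, while in reality almost all of its value was spent on the first sale.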