A long running part of the STOR-i MRes is learning to program in a handful of different programming languages. Namely, we learn to program in R (most familiar to Statistics students), Python, and C++. On top of this, we also learn how to write R packages in C++, giving us the ability to use powerful specialist code in a flexible data analytical environment. In our year, the assessments for Python, C++ and R packages have all shared a common thread: the Stable Marriage problem.
The Stable Marriage problem dates back to the 1960s with Gale and Shapley’s original paper “College admissions and the stability of marriage”. We consider two groups, one of men and one of women for example purposes, of equal size. Each person has a ranked list of who they most want to marry in the other group, which can be represented as “preference tables”. We then say that a “matching” is a pairing-up of each member of one group with another. However, a matching is “unstable” if a man and woman mutually prefer each other over their current partner (that is, they have an affair). The goal of the problem therefore is to find a matching where this is not the case, even if it is not ideal for each member of the problem. Outside of this example scenario, this problem and its generalisations have many applications to the real world such as matching students to schools, medical graduates to hospitals, organs to transplant patients, and so on. Work on this problem even received a Nobel Prize in Economic Sciences in 2012. A more modern application is the matching of billions of internet users to different servers.
Another concept we learn throughout the programming component of MRes is known as “the 5 R’s”, which are a set of five nested concepts to be considered when writing code at different levels. These aren’t necessarily concepts that must be implemented all the time; the original article “Re-run, Repeat, Reproduce, Reuse, Replicate” by Benureau and Rougier demonstrates how even the simplest piece of code can become increasingly verbose as more of the concepts are implemented. On one extreme, you might have a toy piece of code that just checks to see if an idea works. It would be a complete waste of time to make this one piece of code massively over-engineered. On the other hand, if a piece of code has potential to be used by hundreds or thousands of other researchers, or is key to replicating results in a paper, or needs to be integrated into a wider system, or needs to be future-proofed, it may be necessary to incorporate more and more of these “5 R’s”.
The first is “re-runnable”, meaning we can re-run the code in some form whenever we want to. This doesn’t mean that it necessarily has to be re-run in any environment, but that it can be re-run in its original environment. The important thing to do here is to document version numbers of everything you use, and make sure you don’t use any outdated or deprecated features of whichever language you’re using. The second is “repeatable”, meaning the program runs the same every time around. In short, if your program implements some kind of randomness, it is important the the source of randomness is controlled. This is usually done by setting a “seed” at the start of your code.
Thirdly, we have “reproducible”, which is possibly the biggest “jump” in terms of rigour. Firstly, this entails including even more detail about the environment in which we originally ran our code. Outside of the version number of the language itself and the packages, this may also involve the Operating System version, hardware specifications, version of the code or “version control” using git, and so on. Generally, erring on the side of “too much” information is better than too little. Also to be included are “unit-tests”, which are small tests to make sure that the code is running as originally intended, and any results files should be accompanied by a comprehensive list of inputs to the program. On top of all this, the code should always be available in some regard. Even after you individually stop caring about some code, it should still be available to those who wish to use it and keep it up to date.
Fourthly, we have “reusability”. This is less of a drastic increase in quality than the previous step, and pertains more to ensuring that the code can be easily understood by anybody who wishes to use this. This generally means including thorough documentation of the code, possibly including everything from working examples and description of parameters to a function, to the specifics of how the code works. It also implies a reasonable amount of commenting throughout the source code, to aid the understanding of others, and yourself. “Quirks” should also be accounted for in documentation. Finally, we have “replicability”. This is slightly separate from the code itself, and is more a property of the origin of the code, such as an algorithm or academic paper. In short, these origins should include enough details about the algorithm or pseudocode that, if followed strictly by another researcher, they can produce their own code that will produce results exactly the same as yours.
Now, we can talk about the three implementations of the Stable Marriage problem. Our first assignment was to do this in Python. As a language, Python is far easier to read and write for a new programmer, making it ideal for converting a written algorithm directly into code. Python also comes with a wide range of built-in data structures for representing and manipulating your problem, as well as having an abundance of packages for adding more capability to your software. However, the user-friendliness comes at a cost: Python deals with a lot of the finnicky details under the hood, coming with a performance cost compared to programming languages which are closer to the metal. This is where C++ comes in. C++ leaves the programmer to fend for themselves in many regards, and much prefers to crash than do things for you. C++ has a “Standard Library” that incorporates many features and data structures, but there is a lot more functionality and sanity-checking that you’ll have to do yourself. Luckily, in the case of this problem, C++ and Python shared enough data structures and the like for the transferring of the problem from one to the other to be relatively smooth.
The more tricky jump is the one the takes the C++ code, and wishes to turn it into an R package. Interfacing between programming languages is no easy task, but luckily, a package called “Rcpp” exists which helps to deal with a lot of the “boiler plate code” (technical gibberish to mere mortals like us) that is necessary for using the languages together. The core process of building the package is then made quite easy: simply build the package skeleton, drop the C++ source code in the right folder, and then have R build the package for you. Simple! However, this ignores the fact that it’s not so easy to take R data and shove it straight into a C++ function. In fact, we lack some of the commonality of data structures between Python and C++ as mentioned earlier, specifically with “unordered maps”. These are data structures which allow you to input a “key”, and returns a “value”. In the case of the stable marriage problem, we used these maps to represent the preference tables; the “key” was the name of a person, and the “value” was the ordered list of people to whom they were attracted. R, however, does not directly have the ability to create such a structure, and certainly not in a way that can be easily passed to C++, which is far more strict about the “types” of an object that it is dealing with. Instead, we had to make do with an R representation of preference tables that could also be passed to C++ (we used named lists of vectors of names), and also add additional C++ code to convert this back into the unordered map representation. However, with all of that dealt with, we were left with a functioning R package written in C++, and hosted on GitHub for all to use. Additionally, this included some documentation about version numbers and how to use the provided functions, helping to meet some of those five R’s.
And with that, the MRes can take you from somebody who dabbled a bit in R here and there, into a person who can write R packages in C++ for all to use. In research, academic or otherwise, this is a very useful skill that allows you to provide facilities to other researchers to both verify your work and build upon it. It’s worth reiterating that you don’t always need to do all of this. In fact, my results found that the Python and Rcpp versions of the same algorithm performed about as well, despite the latter primarily being written in a (generally) faster language. Indeed, the scale of the problem isn’t too immense, so it isn’t massively surprising. The use of C++ would give better results for far more demanding tasks such as simulation, but here we simply dealt with a relatively simple algorithm and a few built-in data structures. This is about choosing the right tool for the job. A simple bit of data analysis, a few plots, or a trial run of a new method on a small set of test data may as well be done in the language you’re most familiar with. The time spent writing and debugging the same thing in a faster language may be overkill or a waste of time for a slight increase in performance. On the other hand, if you need something to scale up well, and you and other people are going to be using it often, it may be worth getting your hands dirty with some C++.