Reprodown: An R package for reproducible data analysis



Replicability and reproducibility

Reproducibility and replicability play important roles in science: they confirm new discoveries and extend our understanding of the natural world. Both are about obtaining consistent results. Reproducibility means obtaining them with essentially the same data, methods and code, while replicability means obtaining them across studies that aim to answer the same scientific question with different data and methods (National Academies of Sciences, Engineering, and Medicine, 2019).

Concern about reproducibility and replicability has grown sharply over the last decade, as several promising results in medicine and other research areas could not be replicated or reproduced. In particular, Baker (2015) highlighted the difficulty of replicating results in studies that use antibodies (“Y-shaped proteins that bind to specified biomolecules and used to flag their presence in a sample”). The main problems in these applications were that the antibodies were not adequately validated and that standardized information about their quality was not provided.

These concerns are shared across research areas. A Nature survey of 1576 researchers (in chemistry, biology, physics and engineering, earth and environment, and other fields) reported that 70% of them had tried and failed to reproduce another researcher’s experiment (Baker, 2016). The respondents thought that “more robust experiment design”, “better statistics” and “better mentorship” could help to improve reproducibility.

Reproducibility can be considered a minimum acceptable standard of research, because replicability cannot always be achieved: it depends on the size of the study, the available budget, time and other factors (Peng, 2015). Reproducibility helps to validate a data analysis, improve collaboration and detect errors or bad practices in the analysis.

Reprodown

Reprodown is an R package that helps to improve reproducibility by using:

  • Blogdown: An R package that integrates rmarkdown with Hugo to create a website.
  • GNU make: A GNU utility that determines which pieces of a program need to be recompiled and runs the commands to rebuild them. It relies on a file called Makefile, where the dependencies are defined.
  • scholar-docs: A custom Hugo theme for the website.

The workflow of reprodown is as follows. First, write the .Rmd files containing the data analysis inside a sub-folder (e.g. scripts). Then, the function reprodown::makefile() reads the .Rmd files and creates the Makefile automatically. Finally, the outputs are rendered to HTML files by running the make utility in the terminal.
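To make the role of make in this workflow more concrete, here is a minimal sketch in R of the rule make applies when deciding whether an output has to be rebuilt: a target is out of date if it is missing or older than one of its prerequisites. The function name needs_rebuild is hypothetical and is not part of reprodown or GNU make; it also simplifies how make treats rules with several targets.

```r
# Hypothetical helper (not part of reprodown): mimics the timestamp rule
# that make uses to decide whether a target is out of date.
needs_rebuild <- function(targets, prerequisites) {
  # A missing target must always be built.
  if (!all(file.exists(targets))) {
    return(TRUE)
  }
  # With no prerequisites, an existing target counts as up to date.
  if (length(prerequisites) == 0) {
    return(FALSE)
  }
  # make itself would stop here with a "No rule to make target" error.
  missing <- prerequisites[!file.exists(prerequisites)]
  if (length(missing) > 0) {
    stop("Missing prerequisite(s): ", paste(missing, collapse = ", "))
  }
  # Rebuild when any prerequisite is newer than the oldest target.
  max(file.mtime(prerequisites)) > min(file.mtime(targets))
}

# Usage with temporary files standing in for data/cleaned/data.rds
# and data/processed/data.rds.
input <- tempfile(fileext = ".rds")
output <- tempfile(fileext = ".rds")
saveRDS(1:3, input)
needs_rebuild(output, input)  # the output does not exist yet
```

Because only stale outputs are rebuilt, editing one .Rmd file or one input dataset should trigger just the steps that depend on it.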

Reprodown example

An example of a website built with reprodown can be found at https://erickchacon.gitlab.io/project-web. You can explore the source code at https://gitlab.com/ErickChacon/project-web.


Reprodown tutorial

Requirements

We need to install the R packages blogdown and reprodown. For now, my custom fork of blogdown is required, because I have submitted a pull request that adds functionality to the function blogdown:::build_rmds. Hopefully, it will be accepted in the future.

remotes::install_github("ErickChacon/blogdown")
remotes::install_github("ErickChacon/reprodown")

In addition, we need the GNU make utility, which is included in, or easily installed on, most GNU/Linux distributions.
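Before going further, it can be useful to confirm from R that make is available on the search path. The sketch below uses only base R; the helper name find_make is hypothetical and not part of reprodown.

```r
# Hypothetical helper: returns the path to the make executable,
# or NA if none is found on the PATH.
find_make <- function() {
  path <- unname(Sys.which("make"))
  if (!nzchar(path)) {
    return(NA_character_)
  }
  path
}

make_path <- find_make()
if (is.na(make_path)) {
  message("make was not found; install it with your system package manager.")
} else {
  message("make found at: ", make_path)
}
```

Running make --version in a terminal shows whether the executable that was found is GNU make.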

Getting started

  • Create the structure of the project (optional): The function reprodown::create_proj() can be used to create the folders of the project. By default, it creates the folders data, docs, scripts and src. However, you can pass the argument yaml_file with the path to a yaml file that describes a custom structure.
  • Create the website-related files: The function blogdown::new_site() creates these files and also downloads the website theme. You can use your own theme or any other Hugo theme from https://themes.gohugo.io/. You do not need to work directly with most of these files because blogdown takes care of them. The file docs/config.toml defines the metadata of your website; review it and update the details accordingly.
reprodown::create_proj()
blogdown::new_site('docs', theme = 'ErickChacon/scholar-docs', sample = FALSE)


Create your custom scripts

Create your .Rmd files inside the scripts folder. Take into consideration the following:

  • The file _index.Rmd inside the scripts folder controls the homepage. You can define the title in a yaml header. See, for example, the scripts/_index.Rmd file of the website https://erickchacon.gitlab.io/project-web.
  • The dependencies between files are defined in the yaml header of the .Rmd files. For example, the yaml header below, taken from the file scripts/30-process/process.Rmd, indicates that this file needs the file data/cleaned/data.rds as input and produces the file data/processed/data.rds as output.
---
title: "Transform covariate"
prerequisites:
  - data/cleaned/data.rds
targets:
  - data/processed/data.rds
---
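To illustrate how a header like this can be turned into a Makefile rule, here is a rough sketch in base R. It is not the implementation used by reprodown::makefile(): the helpers read_header_list and make_rule are hypothetical, the parser only handles simple "key:" lists like the one above, and the recipe line that calls rmarkdown::render is an assumption made for the example.

```r
# Hypothetical helper: returns the items listed under `key:` in the
# yaml header of an .Rmd file given as a character vector of lines.
read_header_list <- function(rmd_lines, key) {
  fences <- which(trimws(rmd_lines) == "---")
  if (length(fences) < 2) {
    return(character(0))
  }
  # Lines strictly between the two "---" delimiters.
  header <- rmd_lines[fences[1] + seq_len(fences[2] - fences[1] - 1)]
  start <- grep(paste0("^", key, ":\\s*$"), header)
  if (length(start) == 0) {
    return(character(0))
  }
  items <- character(0)
  # Collect the "- item" lines that directly follow the key.
  for (line in header[-seq_len(start[1])]) {
    if (!grepl("^\\s*-\\s+", line)) break
    items <- c(items, trimws(sub("^\\s*-\\s+", "", line)))
  }
  items
}

# Hypothetical helper: builds "targets: script prerequisites" plus a
# tab-indented recipe line (the recipe is an assumed example).
make_rule <- function(rmd_path, rmd_lines) {
  prerequisites <- read_header_list(rmd_lines, "prerequisites")
  targets <- read_header_list(rmd_lines, "targets")
  if (length(targets) == 0) {
    return(character(0))
  }
  c(
    paste0(
      paste(targets, collapse = " "), ": ",
      paste(c(rmd_path, prerequisites), collapse = " ")
    ),
    paste0("\tRscript -e 'rmarkdown::render(\"", rmd_path, "\")'")
  )
}

example_rmd <- c(
  "---",
  "title: \"Transform covariate\"",
  "prerequisites:",
  "  - data/cleaned/data.rds",
  "targets:",
  "  - data/processed/data.rds",
  "---",
  "",
  "Analysis text and code chunks go here."
)
rule <- make_rule("scripts/30-process/process.Rmd", example_rmd)
cat(rule, sep = "\n")
```

The .Rmd file itself is listed as a prerequisite so that editing the script also marks its outputs as out of date.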

Render the web

The website can be rendered as follows:

  1. Create the Makefile with the R code reprodown::makefile(). This generates the Makefile for the project.
  2. Render the .Rmd files with the R code system("make") or by running make in your terminal.
  3. Serve the site using setwd("docs"); blogdown::serve_site(); setwd("..")
  4. Stop serving with servr::daemon_stop().
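The first three steps can be wrapped in a single function, sketched below. The name render_and_serve and its arguments are hypothetical; the sketch assumes that reprodown::makefile() is called from the project root and that the site lives in docs, as in this tutorial. It uses on.exit() so that the working directory is restored even if a step fails.

```r
# Hypothetical wrapper around the steps above; not part of reprodown.
render_and_serve <- function(project_dir = ".", site_dir = "docs") {
  old_dir <- setwd(project_dir)
  # Restore the original working directory when the function exits.
  on.exit(setwd(old_dir), add = TRUE)

  # Step 1: create the Makefile from the yaml headers of the .Rmd files.
  reprodown::makefile()

  # Step 2: run make and stop if it reports a failure.
  status <- system2("make")
  if (status != 0) {
    stop("make exited with status ", status)
  }

  # Step 3: serve the site from the website folder.
  setwd(site_dir)
  blogdown::serve_site()
}

# Run from the project root:
# render_and_serve()
```

Stopping the server is left as a separate manual step with servr::daemon_stop().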

Publish and automatically update your website

You can host your project in a remote repository to make your website available. An easy way is to host it on GitLab. Use reprodown::create_gitlab_ci() to create the file .gitlab-ci.yml, which defines the workflow that builds the website.

In addition, I suggest not pushing the contents of the data folder, so that you avoid publishing confidential data or running into issues with large files. The same applies to the docs/public folder, since the GitLab workflow creates it automatically. This can be done by creating a .gitignore file with the following content:

# Public content
/docs/public

# Data sub-folders content
/data/raw/*
!/data/raw/.gitkeep
/data/modelled/*
!/data/modelled/.gitkeep
/data/processed/*
!/data/processed/.gitkeep
/data/cleaned/*
!/data/cleaned/.gitkeep

Push the contents of your project folder to a GitLab repository and you will have a website that is updated automatically each time you push a commit.

References:

  1. Baker, M. (2016). A Nature Survey Lifts the Lid on How Researchers View the ‘Crisis’ Rocking Science and What They Think Will Help. Nature, 3.
  2. Baker, M. (2015). Blame it on the antibodies. Nature, 521(7552), 274.
  3. National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. National Academies Press.
  4. Peng, R. (2015). The reproducibility crisis in science: A statistical counterattack. Significance, 12(3), 30-32.
