R is a mature scripting language for statistical computations and data processing. An important advantage of R is that every step of data processing can be programmed in scripts, so the whole analysis can be re-executed after any change in the data or the processing steps.
There are several useful packages for R to obtain repeatability of
statistical computations, such as knitr and
rmarkdown. These tools allow writing R scripts that
generate reports combining text with tables and figures generated from
data.
However, if analyses grow in complexity, manual re-execution of the whole process may become tedious, prone to errors, and very demanding computationally. Complex analyses typically involve:
It is inefficient to re-run all pre-processing steps repeatedly to
refresh the final report after any change. A caching mechanism provided
by knitr is helpful but limited to a single report.
Splitting complex analyses into several parts and saving intermediate
results into files is rational, but brings another challenge:
management of dependencies between inputs, outputs, and
underlying scripts.
This is where Make comes in. Make is a tool that
controls the generation of files from source data and script files by
reading dependencies from a Makefile and comparing
timestamps to determine which files need to be refreshed.
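The timestamp rule can be illustrated with a short R sketch. This is not how Make or rmake is implemented; it only shows the idea of comparing modification times. The helper name is_outdated() is hypothetical, and missing prerequisites are deliberately not handled.

```r
# Hypothetical helper: TRUE when `target` is missing or older than any prerequisite.
is_outdated <- function(target, prerequisites) {
  if (!file.exists(target)) {
    return(TRUE)
  }
  any(file.mtime(prerequisites) > file.mtime(target))
}

# Demonstration with temporary files and explicitly set modification times.
dir <- tempfile("make-demo-")
dir.create(dir)
input  <- file.path(dir, "data.csv")
output <- file.path(dir, "sums.csv")
file.create(input, output)

now <- Sys.time()
Sys.setFileTime(input,  now - 60)  # input was modified a minute before the output
Sys.setFileTime(output, now)
fresh <- is_outdated(output, input)  # output is newer than its input

Sys.setFileTime(input, now + 60)     # input changed after the output was built
stale <- is_outdated(output, input)  # output now needs to be refreshed
```

Make applies the same comparison to every target listed in the Makefile, which is why only the affected files are rebuilt after a change.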
The rmake package provides tools for easy generation of
Makefiles for statistical and data manipulation tasks in R.
Among the main features of rmake are the %>>% pipeline operator and templating.

R allows the development of repeatable statistical
analyses. However, when analyses grow in complexity, manual re-execution
on any change may become tedious and error-prone. Make
is a widely accepted tool for managing the generation of resulting files
from source data and script files. rmake makes it easy to
generate Makefiles for R analytical projects.
To install rmake from CRAN:
Alternatively, install the development version from GitHub:
Load the package:
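The installation and loading steps above might look as follows in an R session. This is a sketch: the GitHub repository path beerda/rmake and the use of the remotes package are assumptions, so the development install is left commented out.

```r
# Install the released version from CRAN only if it is not yet available.
if (!requireNamespace("rmake", quietly = TRUE)) {
  install.packages("rmake")
}

# Development version (assumed repository path; requires the remotes package):
# remotes::install_github("beerda/rmake")

# Attach the package for the current session.
library(rmake)
```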
The package requires the R_HOME environment variable to
be properly set. This variable indicates the directory where R is
installed and is automatically set when running from within R or
RStudio.
When running make from the command line (outside of R),
you may need to set R_HOME manually.
To find the correct value for your system, run this in R:
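In an R session, the base function R.home() reports the installation directory. A minimal sketch (the variable name r_home is only illustrative):

```r
# R.home() returns the directory in which the running R is installed.
r_home <- R.home()
print(r_home)
```

The printed path is the value to use for R_HOME when make is started outside of R.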
You can also check the current values of R environment variables:
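One way to inspect these values from R is Sys.getenv(). The sketch below assumes that only variables whose names start with R_ are of interest; the names env and r_vars are illustrative.

```r
# The single variable that rmake relies on.
print(Sys.getenv("R_HOME"))

# All environment variables whose names start with "R_".
env <- Sys.getenv()
r_vars <- env[grepl("^R_", names(env))]
print(r_vars)
```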
On Linux/macOS:
On Windows (Command Prompt):
On Windows (PowerShell):
For a permanent setup, add the export command to your shell
configuration file (.bashrc, .zshrc, etc.) on Unix-like systems, or
define R_HOME among the system environment variables on Windows.
For more information on R environment variables, see the official R documentation.
To start a new project with rmake:
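A minimal sketch of the initialization call. It assumes that rmakeSkeleton() takes the target directory as its first argument and that the current working directory is the project root.

```r
library(rmake)

# Create the skeleton files in the current working directory.
rmakeSkeleton(".")
```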
This creates two files:
- Makefile.R: R script to generate the Makefile
- Makefile: the generated Makefile (initially minimal)
The initial Makefile.R contains:
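The generated script is expected to look roughly like the sketch below: an empty job list passed to makefile(). The exact contents written by your installed version may differ, so treat this as an approximation.

```r
library(rmake)

# No rules yet: the job is an empty list.
job <- list()

# Write the (still minimal) Makefile next to this script.
makefile(job, "Makefile")
```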
Let’s walk through a simple example. Suppose we have:
- data.csv: input data file
- script.R: R script to process the data
- sums.csv: output file with the computed results
Create data.csv:
```
ID,V1,V2
a,2,8
b,9,1
c,3,3
```
Create script.R:
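One possible script.R, offered as a sketch rather than the original script. It assumes that the "sums" are row-wise totals of V1 and V2, and the output column name sum is illustrative. The first statement is not part of the processing; it only recreates the example input from the text so that the snippet runs standalone.

```r
# Recreate the example input if it is missing (values copied from data.csv above).
if (!file.exists("data.csv")) {
  writeLines(c("ID,V1,V2", "a,2,8", "b,9,1", "c,3,3"), "data.csv")
}

# Read the input, compute one total per row, and save the result.
d <- read.csv("data.csv")
sums <- data.frame(ID = d$ID, sum = d$V1 + d$V2)  # assumed: row-wise sum of V1 and V2
write.csv(sums, "sums.csv", row.names = FALSE)
```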
Edit Makefile.R:
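A sketch of the edited Makefile.R with a single rule. The rRule() call mirrors the one used in the report example of this guide.

```r
library(rmake)

# One rule: run script.R to build sums.csv from data.csv.
job <- list(
  rRule(target = "sums.csv", script = "script.R", depends = "data.csv")
)

# Regenerate the Makefile from the rule list.
makefile(job, "Makefile")
```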
The %>>% pipe operator makes rule definitions more
readable:
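A sketch of the piped form, built from the same three file names as the single-rule example:

```r
library(rmake)

# Read left to right: data.csv is processed by script.R to produce sums.csv.
job <- "data.csv" %>>% rRule("script.R") %>>% "sums.csv"

makefile(job, "Makefile")
```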
This is equivalent to the previous example but more concise.
Let’s extend our example to create a PDF report. Create
analysis.Rmd:
````markdown
---
title: "Analysis"
output: pdf_document
---

# Sums of data rows

```{r, echo=FALSE, results='asis'}
sums <- read.csv('sums.csv')
knitr::kable(sums)
```
````

Update Makefile.R:
```r
library(rmake)
job <- list(
  rRule(target = "sums.csv", script = "script.R", depends = "data.csv"),
  markdownRule(target = "analysis.pdf", script = "analysis.Rmd",
               depends = "sums.csv")
)
makefile(job, "Makefile")
```

Or using pipes:
```r
library(rmake)
job <- "data.csv" %>>%
  rRule("script.R") %>>%
  "sums.csv" %>>%
  markdownRule("analysis.Rmd") %>>%
  "analysis.pdf"
makefile(job, "Makefile")
```

Run make again:
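A sketch of the build call from within R. It assumes that the make executable is on the PATH and that the Makefile generated by Makefile.R is in the working directory; the file.exists() guard is only there so the snippet does nothing in an uninitialized directory.

```r
library(rmake)

# Run make in the working directory; targets that are already up to date are skipped.
if (file.exists("Makefile")) {
  make()
}
```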
Visualize the dependency graph:
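A sketch of the visualization call, assuming that visualize() accepts the list of rules directly and that the graphics package it depends on is installed. The job is the same chain as in the report example.

```r
library(rmake)

# The rule chain from the report example.
job <- "data.csv" %>>%
  rRule("script.R") %>>%
  "sums.csv" %>>%
  markdownRule("analysis.Rmd") %>>%
  "analysis.pdf"

# Draw the dependency graph of the job.
visualize(job)
```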
This creates an interactive graph showing:
- Squares: Data files
- Diamonds: Script files
- Ovals: Rules
- Arrows: Dependencies
Handle complex dependencies:
```r
chain1 <- "data1.csv" %>>% rRule("preprocess1.R") %>>% "intermed1.rds"
chain2 <- "data2.csv" %>>% rRule("preprocess2.R") %>>% "intermed2.rds"
chain3 <- c("intermed1.rds", "intermed2.rds") %>>%
  rRule("merge.R") %>>% "merged.rds" %>>%
  markdownRule("report.Rmd") %>>% "report.pdf"
job <- c(chain1, chain2, chain3)
```

Alternatively, you can define all chains directly without intermediate variables:
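A sketch of that variant, obtained by inlining the three chains into a single c() call. The file names are the ones used in the chained example; nothing else is assumed.

```r
library(rmake)

# Each argument of c() is one chain; together they form the whole job.
job <- c(
  "data1.csv" %>>% rRule("preprocess1.R") %>>% "intermed1.rds",
  "data2.csv" %>>% rRule("preprocess2.R") %>>% "intermed2.rds",
  c("intermed1.rds", "intermed2.rds") %>>%
    rRule("merge.R") %>>% "merged.rds" %>>%
    markdownRule("report.Rmd") %>>% "report.pdf"
)
```

Named intermediate chains are easier to reuse and inspect, while the inline form keeps the whole pipeline in one expression.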
rmake provides several pre-defined rule types:
- rRule(): Execute R scripts
- markdownRule(): Render R Markdown documents
- knitrRule(): Process knitr documents
- copyRule(): Copy files
- offlineRule(): Manual tasks with reminders

For detailed documentation on all rule types including
depRule(), subdirRule(), and custom rules, see
the Build Rules vignette.
For more information on specific topics, see these vignettes:
Key takeaways:
1. Use rmakeSkeleton() to initialize projects
2. Define rules in Makefile.R
3. Use %>>% for readable rule chains
4. Run make() to execute the build process
5. Use visualize() to understand dependencies
For package help, run ?rmake in an R session.