Piggyback Data atop your GitHub Repository!

Carl Boettiger

2023-07-10

Why piggyback?

piggyback grew out of the needs of students both in my classroom and in my research group, who frequently need to work with data files somewhat larger than one can conveniently manage by committing directly to GitHub. As we frequently want to share and run code that depends on >50MB data files on each of our own machines, on continuous integration, and on larger computational servers, data sharing quickly becomes a bottleneck.

GitHub allows repositories to attach files of up to 2 GB each to releases as a way to distribute large files associated with the project source code. There is no limit on the number of files or bandwidth to deliver them.

Installation

Install the latest release from CRAN using:

install.packages("piggyback")

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("ropensci/piggyback")

Authentication

No authentication is required to download data from public GitHub repositories using piggyback. Nevertheless, piggyback recommends setting a token when possible to avoid rate limits. To upload data to any repository, or to download data from private repositories, you will need to authenticate first.

To do so, add your GitHub token to an environment variable, e.g. in a .Renviron file in your home directory or project directory (any private place you won’t upload); see usethis::edit_r_environ(). For one-off use you can also set your token from the R console using:

Sys.setenv(GITHUB_PAT="xxxxxx")

But try to avoid putting Sys.setenv() in any R scripts – remember, the goal here is to avoid writing your private token in any file that might be shared, even privately.
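
Before attempting an upload or a private download, it can help to confirm that a token is actually visible to the R session. The following is a minimal base R sketch; has_github_token() is a hypothetical helper name, not part of piggyback, and it only checks the GITHUB_PAT variable used above.

```r
# Hypothetical helper (not part of piggyback): reports whether a
# GitHub token is visible to this R session via an environment variable.
has_github_token <- function(var = "GITHUB_PAT") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_github_token()) {
  message("No GITHUB_PAT found; see usethis::edit_r_environ() to add one.")
}
```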

For more information, please see the usethis guide to GitHub credentials.

Downloading data

Download the latest version or a specific version of the data:

library(piggyback)
pb_download("iris2.tsv.gz", 
            repo = "cboettig/piggyback-tests",
            tag = "v0.0.1",
            dest = tempdir())

Note: Whenever you are working from a location inside a git repository corresponding to your GitHub repo, you can simply omit the repo argument and it will be detected automatically. Likewise, if you omit the release tag, pb_download() will pull data from the most recent release (latest). Third, you can omit dest = tempdir() if you are using an RStudio Project (.Rproj file) in your repository, and then the download location will be relative to the Project root. tempdir() is used throughout the examples only to meet CRAN policies and is unlikely to be the choice you actually want here.

Lastly, simply omit the file name to download all assets connected with a given release.

pb_download(repo = "cboettig/piggyback-tests",
            tag = "v0.0.1",
            dest = tempdir())

These defaults mean that in most cases, it is sufficient to simply call pb_download() without additional arguments to pull in any data associated with a project on a GitHub repo that is too large to commit to git directly.

pb_download() will skip the download of any file that already exists locally if the timestamp on the local copy is more recent than the timestamp on the GitHub copy. pb_download() also includes arguments to control the timestamp behavior, progress bar, whether existing files should be overwritten, or if any particular files should not be downloaded. See function documentation for details.
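
The timestamp rule described above can be pictured with a small base R sketch. This is an illustration only: needs_download() is a hypothetical helper, not piggyback's internal implementation, and the remote timestamp is simply passed in rather than fetched from GitHub.

```r
# Illustration only: a hypothetical helper mirroring the rule described
# above, not piggyback's internal implementation.
# Returns TRUE when the file should be fetched: it is missing locally,
# or the local copy is not newer than the remote copy.
needs_download <- function(local_path, remote_time) {
  if (!file.exists(local_path)) {
    return(TRUE)
  }
  local_time <- file.mtime(local_path)
  !(local_time > remote_time)
}

# Example: a freshly written local file compared against two
# made-up remote timestamps.
path <- tempfile(fileext = ".tsv")
writeLines("x\ty", path)
needs_download(path, remote_time = Sys.time() - 3600)  # remote copy is older
needs_download(path, remote_time = Sys.time() + 3600)  # remote copy is newer
```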

Sometimes it is preferable to have a URL from which the data can be read in directly, rather than downloading the data to a local file. For example, such a URL can be embedded directly into another R script, avoiding any dependence on piggyback (provided the repository is already public). To get a list of URLs rather than actually downloading the files, use pb_download_url():

pb_download_url("data/mtcars.tsv.gz", 
                repo = "cboettig/piggyback-tests",  
                tag = "v0.0.1") 
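
For orientation, public release assets on GitHub are served from URLs of the form https://github.com/OWNER/REPO/releases/download/TAG/FILE. The sketch below builds such a URL by hand; release_asset_url() is a hypothetical helper, not a piggyback function, and for real use pb_download_url() remains the safer choice since it queries the actual release.

```r
# Hypothetical helper (not part of piggyback): assembles the public
# download URL of a release asset from its parts.
release_asset_url <- function(file, repo, tag) {
  paste("https://github.com", repo, "releases", "download", tag,
        utils::URLencode(basename(file)), sep = "/")
}

url <- release_asset_url("iris2.tsv.gz",
                         repo = "cboettig/piggyback-tests",
                         tag = "v0.0.1")
url
```

A URL like this can then be passed to a reader such as readr::read_tsv() in a script that does not load piggyback at all.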

Uploading data

If your GitHub repository doesn’t have any releases yet, piggyback will help you quickly create one. Create new releases to manage multiple versions of a given data file. While you can create releases as often as you like, making a new release is by no means necessary each time you upload a file. If maintaining old versions of the data is not useful, you can stick with a single release and upload all of your data there.

pb_new_release("cboettig/piggyback-tests", "v0.0.2")

Once we have at least one release available, we are ready to upload. By default, pb_upload will attach data to the latest release.

## We'll need some example data first.
## Pro tip: compress your tabular data to save space & speed upload/downloads
readr::write_tsv(mtcars, "mtcars.tsv.gz")

pb_upload("mtcars.tsv.gz", 
          repo = "cboettig/piggyback-tests", 
          tag = "v0.0.1")

Like pb_download(), pb_upload() will by default overwrite any file of the same name already attached to the release, unless the timestamp of the previously uploaded version is more recent. You can toggle these settings with overwrite=FALSE and use_timestamps=FALSE.

Additional convenience functions

List all files currently piggybacking on a given release. Omit the tag to see files on all releases.

pb_list(repo = "cboettig/piggyback-tests", 
        tag = "v0.0.1")

Delete a file from a release:

pb_delete(file = "mtcars.tsv.gz", 
          repo = "cboettig/piggyback-tests", 
          tag = "v0.0.1")

Note that this is irreversible unless you have a copy of the data elsewhere.

Multiple files

You can pass in a vector of file paths with something like list.files() to the file argument of pb_upload() in order to upload multiple files. Some common patterns:

library(magrittr)

## upload a folder of data
list.files("data", full.names = TRUE) %>% 
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")

## upload certain file extensions
list.files(pattern = "\\.(tsv\\.gz|tif|zip)$") %>% 
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")

Similarly, you can download all current data assets of the latest or a specified release by calling pb_download() without the file argument.
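
When selecting files for upload, keep in mind that the pattern argument of list.files() takes a single regular expression rather than a vector of globs, and that full.names = TRUE is needed to keep the directory in each returned path. The following self-contained sketch shows both in a temporary directory; the directory and file names are made up for illustration.

```r
# Self-contained demo in a temporary directory (file names are made up).
demo_dir <- file.path(tempdir(), "pb-demo-data")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.tsv.gz", "b.tif", "c.zip", "notes.txt")))

# pattern is a single regular expression, so extensions are combined
# with alternation; full.names = TRUE keeps the directory in each path.
files <- list.files(demo_dir,
                    pattern = "\\.(tsv\\.gz|tif|zip)$",
                    full.names = TRUE)
basename(files)
```

The resulting vector can be passed to the file argument of pb_upload() in the same way as in the patterns above.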

Caching

To reduce API calls to GitHub, piggyback caches most calls with a timeout of 1 second by default. This avoids repeating identical requests to update its internal record of the repository data (releases, assets, timestamps, etc.) during programmatic use. You can increase or decrease this delay by setting the environment variable piggyback_cache_duration to a value in seconds, e.g. Sys.setenv("piggyback_cache_duration"=10) for a longer delay or Sys.setenv("piggyback_cache_duration"=0) to disable caching, and then restarting R.

Valid file names

GitHub assets attached to a release do not support file paths. GitHub will convert most special characters (#, %, etc.) to . or throw an error (e.g. for file names containing $, @, /). piggyback will default to using the base name of the file only (i.e. it will only use "mtcars.csv" if provided a file path like "data/mtcars.csv").
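
The base-name behaviour can be previewed with base R's basename(). This is only a sketch of the naming rule, not a call into piggyback, and the second path is made up for illustration.

```r
# Directory components are dropped; only the final file name remains.
paths <- c("data/mtcars.csv", "results/2023/summary.tsv.gz")
basename(paths)
```

One consequence worth checking before uploading: two files that share a base name in different folders would presumably map to the same asset name on a release.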

A Note on GitHub Releases vs Data Archiving

piggyback is not intended as a data archiving solution. Importantly, bear in mind that there is nothing special about multiple “versions” in releases, as far as data assets uploaded by piggyback are concerned. The data files piggyback attaches to a release can be deleted or modified at any time; creating a new release to store data assets is the functional equivalent of just creating new directories v0.1, v0.2 to store your data. (GitHub Releases are always pinned to a particular git tag, so the code and other git-managed contents associated with the repo are more immutable, but remember that our data assets just piggyback on top of the repo.)

Permanent, published data should always be archived in a proper data repository with a DOI, such as zenodo.org. Zenodo can freely archive public research data files up to 50 GB in size, and data is strictly versioned: once released, a DOI always refers to the same version of the data, and new releases are given new DOIs. piggyback is meant only to lower the friction of working with data during the research process, e.g. by making data accessible to collaborators or continuous integration systems, including for private repositories.

What will GitHub think of this?

GitHub documentation at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project.

Of course, it will be up to GitHub to decide if this use of release attachments is acceptable in the long term.
