Introduction to Apache Arrow framework

Fernando Mayer, Niamh Cahill

The Apache Arrow framework

The definition of the Apache Arrow framework is best described from their website:

Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.

A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.

In other words, the Apache Arrow framework was designed to deal with large datasets (larger than memory), using in-memory analytics. This means that the computations made with “Arrow datasets” are extremely efficient, resulting in very fast computations, otherwise infeasible with standard computations.

The Apache Arrow framework can be used in many different programming languages. However, in each of these languages, there are specific libraries to deal with it. In R, the arrow package is available to load and manipulate Arrow datasets. The manipulation of Arrow objects are made through dplyr verbs, which helps users to feel familiar with it. Not all dplyr verbs are available to work with Arrow datasets, but the vast majority of the most used ones are already “translated” to be used with Arrow. A list of such functions can be found in Functions available in Arrow dplyr queries. A general introduction of using dplyr verbs with Arrow can be seen in Data analysis with dplyr syntax.

Using Apache Arrow with geslaR

The geslaR package makes use of the Apache Arrow framework to deal with the GESLA dataset in R.

In this tutorial, we will use the download_gesla() function, to download the full GESLA dataset, and show some basic data manipulation with the arrow package and dplyr verbs.

The first time you load the geslaR package, it will automatically load both the arrow and dplyr packages.

library(geslaR)
#> Loading required package: arrow
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

To download the full GESLA dataset, one can simply use

download_gesla()

This will create a directory called gesla_dataset in the current working directory (as defined by getwd()) and download the full dataset locally. This download may take some time (expect around 5 to 10 minutes), as it depends on internet connection. Note that this full dataset will need at least 7GB of (hard drive) storage, so make sure this is feasible. However, once downloaded, you will have access to the full dataset, and you will only need to do this once.

You will notice that the full dataset is composed of 5119 Apache Parquet files, ending in .parquet

## Number of downloaded files
length(list.files("gesla_dataset"))
#> [1] 5119
## Check the first files
head(list.files("gesla_dataset"))
#> [1] "a121-a12-nld-cmems.parquet"      "a2-a2-bel-cmems.parquet"        
#> [3] "aalesund-aal-nor-cmems.parquet"  "aarhus-aar-dnk-cmems.parquet"   
#> [5] "aasiaat-aas-grl-gloss.parquet"   "abashiri-347a-jpn-uhslc.parquet"

These files are the same as those originally distributed in the GESLA database, so that each one refers to a site from where the data comes from. To load this full dataset in R, use the arrow::open_dataset() function, specifying the location of the .parquet files. Although there are many files, this function recognizes them as a single dataset, because they all have the same structure (or “Schema”).

## Open dataset
da <- open_dataset("gesla_dataset")
## Check the object
da
#> FileSystemDataset with 5119 Parquet files
#> date_time: timestamp[us]
#> year: int64
#> month: int64
#> day: int64
#> hour: int64
#> country: string
#> site_name: string
#> lat: double
#> lon: double
#> sea_level: double
#> qc_flag: int64
#> use_flag: int64
#> file_name: string
#> 
#> See $metadata for additional Schema metadata
## Verify class
class(da)
#> [1] "FileSystemDataset" "Dataset"           "ArrowObject"      
#> [4] "R6"

Since this is an ArrowObject object, it will actually not load the full dataset in memory (as it would if it was a standard R object, such as a tibble or data.frame). Note, however, that some basic information, such as dim() and names() can be retrieved simply with

dim(da)
#> [1] 1172435674         13
names(da)
#>  [1] "date_time" "year"      "month"     "day"       "hour"      "country"  
#>  [7] "site_name" "lat"       "lon"       "sea_level" "qc_flag"   "use_flag" 
#> [13] "file_name"

Any other manipulation of the dataset must be made using dplyr verbs. For example, to count the number of observations by country, one could use

da |>
    count(country)
#> FileSystemDataset (query)
#> country: string
#> n: int64
#> 
#> See $.data for the source Arrow object

Note, however, that the output is just a query to the full dataset. To explicitly return the calculation, you should use dplyr::collect(), so the result is a standard tibble

da |>
    count(country) |>
    collect()
#> # A tibble: 113 × 2
#>    country         n
#>    <chr>       <int>
#>  1 BEL       1263467
#>  2 JPN      74580447
#>  3 NOR      28452201
#>  4 DNK      36726648
#>  5 USA     243838504
#>  6 GBR      76038844
#>  7 SLV        812230
#>  8 MEX      10755284
#>  9 CAN      67019312
#> 10 NLD     133699230
#> # ℹ 103 more rows

This is intentionally done so that you can manipulate, calculate, and extract information from the dataset, taking advantage of the Arrow in-memory analytics framework. This way, the computations should be faster, and the idea is that you just use dplyr::collect() when the final result is needed as an R object. For example, we could calculate the mean sea level for Ireland per year as

da |>
    filter(country == "IRL", use_flag == 1) |>
    group_by(year) |>
    summarise(mean = mean(sea_level)) |>
    arrange(year) |>
    collect()
#> # A tibble: 63 × 2
#>     year  mean
#>    <int> <dbl>
#>  1  1958  3.11
#>  2  1959  3.12
#>  3  1960  3.16
#>  4  1961  3.16
#>  5  1962  3.09
#>  6  1963  3.13
#>  7  1964  3.13
#>  8  1965  3.11
#>  9  1966  3.15
#> 10  1967  3.14
#> # ℹ 53 more rows

Any other queries could be made, as long as the dplyr verbs used are supported by the arrow package. For example, we could ask for the minimum, mean, and maximum sea level values for Ireland per year

da |>
    filter(country == "IRL", use_flag == 1) |>
    group_by(year) |>
    summarise(
        min = min(sea_level),
        mean = mean(sea_level),
        max = max(sea_level)) |>
    collect()
#> # A tibble: 63 × 4
#>     year   min     mean   max
#>    <int> <dbl>    <dbl> <dbl>
#>  1  2018 -4.09  0.0462  11.6 
#>  2  2019 -5.80  0.0436   3.18
#>  3  2003 -1.13 -0.00629  0.9 
#>  4  2004 -1.11  0.0391   4.07
#>  5  2005 -1.32  1.32     4.83
#>  6  2006 -2.47  0.692    4.94
#>  7  2007 -3.48 -0.0374   4.94
#>  8  2008 -3.41 -0.00481  4.85
#>  9  2009 -3.00  0.0108   2.94
#> 10  2010 -3.71 -0.0192   3.11
#> # ℹ 53 more rows

This same query could then be used to produce graphics with ggplot2, for example. In this case, note that the call to dplyr::collect() is mandatory in advance of using ggplot2 functions, as it will only accept standard R objects (such as tibble or data.frame).

library(ggplot2)
da |>
    filter(country == "IRL", use_flag == 1) |>
    group_by(year) |>
    summarise(
        min = min(sea_level),
        mean = mean(sea_level),
        max = max(sea_level)) |>
    collect() |>
    tidyr::pivot_longer(cols = c(min, mean, max)) |>
    ggplot(aes(x = year, y = value, colour = name)) +
    geom_line() +
    theme(legend.position = "top") +
    labs(colour = "")

mirror server hosted at Truenetwork, Russian Federation.