To showcase the performance of diseasystore on different
database backends, we include this vignette, which summarises a simple
benchmark: a sample dataset is created based on the
datasets::mtcars dataset. This data is repeated 1,000 times
and each row is given a unique ID (its row number in the combined data).
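Although the construction of the benchmark data is not shown in this section, a minimal sketch that reproduces its shape could look like the following (the exact construction, in particular the suffixing of the car names with the row ID, is an assumption based on the printout below):

# Repeat mtcars 1,000 times and give each row a unique ID (sketch only)
benchmark_data <- purrr::map(
  seq(1000),
  \(i) tibble::as_tibble(datasets::mtcars, rownames = "car")
) |>
  purrr::list_rbind() |>
  dplyr::mutate("row_id" = dplyr::row_number()) |>          # unique ID = the row number
  dplyr::mutate("car" = paste(.data$car, .data$row_id)) |>  # car names made unique (assumed)
  dplyr::relocate("row_id")

The resulting benchmark_data looks as follows: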
benchmark_data
#> # A tibble: 32,000 × 13
#> row_id car mpg cyl disp hp drat wt qsec vs am gear
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Mazda RX4… 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 2 2 Mazda RX4… 21 6 160 110 3.9 2.88 17.0 0 1 4
#> 3 3 Datsun 71… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 4 4 Hornet 4 … 21.4 6 258 110 3.08 3.22 19.4 1 0 3
#> 5 5 Hornet Sp… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
#> 6 6 Valiant 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3
#> 7 7 Duster 36… 14.3 8 360 245 3.21 3.57 15.8 0 0 3
#> 8 8 Merc 240D… 24.4 4 147. 62 3.69 3.19 20 1 0 4
#> 9 9 Merc 230 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
#> 10 10 Merc 280 … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
#> # ℹ 31,990 more rows
#> # ℹ 1 more variable: carb <dbl>

A simple diseasystore is built around this data, with
two ?FeatureHandler instances, one for the cyl variable and one for the
vs variable.
DiseasystoreMtcars <- R6::R6Class(
  classname = "DiseasystoreMtcars",
  inherit = DiseasystoreBase,
  private = list(
    # Map the exposed feature names to their FeatureHandlers
    .ds_map = list("n_cyl" = "mtcars_cyl", "vs" = "mtcars_vs"),

    # FeatureHandler for the number of cylinders (n_cyl)
    mtcars_cyl = FeatureHandler$new(
      compute = function(start_date, end_date, slice_ts, source_conn) {
        out <- benchmark_data |>
          dplyr::transmute(
            "key_car" = .data$car, "n_cyl" = .data$cyl,
            "valid_from" = Sys.Date() - lubridate::days(2 * .data$row_id - 1),
            "valid_until" = .data$valid_from + lubridate::days(2)
          )
        return(out)
      },
      key_join = key_join_sum
    ),

    # FeatureHandler for the vs variable
    mtcars_vs = FeatureHandler$new(
      compute = function(start_date, end_date, slice_ts, source_conn) {
        out <- benchmark_data |>
          dplyr::transmute(
            "key_car" = .data$car, .data$vs,
            "valid_from" = Sys.Date() - lubridate::days(2 * .data$row_id),
            "valid_until" = .data$valid_from + lubridate::days(2)
          )
        return(out)
      },
      key_join = key_join_sum
    )
  )
)

Two separate benchmark functions are created. The first benchmarking
function tests the computation time of
?DiseasystoreBase$get_feature() by first computing the
n_cyl feature and then the vs feature,
before finally deleting the computations from the database.
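To make the description concrete, a benchmark function along these lines could be sketched as follows. This is only a sketch: the constructor arguments of DiseasystoreMtcars$new(), the exact get_feature() arguments, and the clean-up via drop_diseasystore() are assumptions rather than the vignette's actual code.

# Connection to the backend under test (SQLite used purely as an example)
conn <- DBI::dbConnect(RSQLite::SQLite())

# A diseasystore on that backend (constructor arguments are assumed)
ds <- DiseasystoreMtcars$new(target_conn = conn)

# Benchmark 1: compute the n_cyl feature, then the vs feature, then clean up
benchmark_get_feature <- function(ds, conn) {
  ds$get_feature("n_cyl", start_date = Sys.Date() - 30, end_date = Sys.Date())
  ds$get_feature("vs",    start_date = Sys.Date() - 30, end_date = Sys.Date())

  # Delete the computed features from the database again
  # (clean-up via drop_diseasystore() and its arguments are assumptions)
  diseasystore::drop_diseasystore(conn = conn)
}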
The second benchmarking function tests the computation time of
?DiseasystoreBase$key_join_features() by joining the
vs feature to the n_cyl observable. Note that
the n_cyl and vs features are re-computed before the
benchmarks are started and, unlike in the benchmark of
?DiseasystoreBase$get_feature(), are not deleted by the
benchmarking function. In addition, we only use
the first 100 rows of benchmark_data for this test to
reduce computation time.
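Under the same assumptions, the second benchmark function could be sketched along these lines; in particular, passing the stratification as quosures via rlang::quos() is an assumption about the key_join_features() interface.

# Benchmark 2: key join the vs feature onto the n_cyl observable
benchmark_key_join <- function(ds) {
  ds$key_join_features(
    observable     = "n_cyl",
    stratification = rlang::quos(vs),   # form of the stratification argument is assumed
    start_date     = Sys.Date() - 30,
    end_date       = Sys.Date()
  )
}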
The performance of these benchmark functions is timed with the
{microbenchmark} package using 10 replicates. All
benchmarks are run on the same machine.
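In code, the timing step could look roughly like this; the pair of expressions shown is illustrative, since the vignette repeats the benchmarks for each database backend.

# Time each benchmark function with 10 replicates
benchmark_results <- microbenchmark::microbenchmark(
  "get_feature"       = benchmark_get_feature(ds, conn),
  "key_join_features" = benchmark_key_join(ds),
  times = 10
)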
The results of the benchmark are shown graphically below (mean and
standard deviation), where we measure the performance of
diseasystore across the different database backends.
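A sketch of how such a mean-and-standard-deviation figure could be produced from the microbenchmark results (the vignette's actual plotting code may differ):

library(ggplot2)

benchmark_results |>
  tibble::as_tibble() |>
  dplyr::group_by(expr) |>
  dplyr::summarise(
    "mean_ms" = mean(time) / 1e6,      # microbenchmark records times in nanoseconds
    "sd_ms"   = stats::sd(time) / 1e6
  ) |>
  ggplot(aes(x = expr, y = mean_ms)) +
  geom_pointrange(aes(ymin = mean_ms - sd_ms, ymax = mean_ms + sd_ms)) +
  coord_flip() +
  labs(x = NULL, y = "Time (ms)")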