Title: The Ultimate Tool for Reading Data in Bulk
Version: 1.2.0
Description: Designed to simplify and streamline the process of reading and processing large volumes of data in R, this package offers a collection of functions tailored for bulk data operations. It enables users to efficiently read multiple sheets from Microsoft Excel and Google Sheets workbooks, as well as various CSV files from a directory. The data is returned as organized data frames, facilitating further analysis and manipulation. Ideal for handling extensive data sets or batch processing tasks, bulkreadr empowers users to manage data in bulk effortlessly, saving time and effort in data preparation workflows. Additionally, the package seamlessly works with labelled data from SPSS and Stata.
License: MIT + file LICENSE
URL: https://github.com/gbganalyst/bulkreadr, https://gbganalyst.github.io/bulkreadr/
BugReports: https://github.com/gbganalyst/bulkreadr/issues
Depends: purrr
Imports: curl, dplyr, fs, googlesheets4, haven, inspectdf, labelled, lubridate, magrittr, methods, openxlsx, readr, readxl, rlang, sjlabelled, stats, stringr, tibble, tidyr
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0)
VignetteBuilder: knitr
Config/testthat/edition: 3
Encoding: UTF-8
RoxygenNote: 7.3.2
NeedsCompilation: no
Packaged: 2025-04-28 12:00:01 UTC; admin
Author: Ezekiel Ogundepo
Maintainer: Ezekiel Ogundepo <gbganalyst@gmail.com>
Repository: CRAN
Date/Publication: 2025-04-28 12:20:02 UTC
bulkreadr: The Ultimate Tool for Reading Data in Bulk
Description
Designed to simplify and streamline the process of reading and processing large volumes of data in R, this package offers a collection of functions tailored for bulk data operations. It enables users to efficiently read multiple sheets from Microsoft Excel and Google Sheets workbooks, as well as various CSV files from a directory. The data is returned as organized data frames, facilitating further analysis and manipulation. Ideal for handling extensive data sets or batch processing tasks, bulkreadr empowers users to manage data in bulk effortlessly, saving time and effort in data preparation workflows. Additionally, the package seamlessly works with labelled data from SPSS and Stata.
Author(s)
Maintainer: Ezekiel Ogundepo gbganalyst@gmail.com (ORCID)
Other contributors:
Ernest Fokoue epfeqa@rit.edu (ORCID) [contributor]
Golibe Ezeechesi golibe.ezeechesi@gmail.com [contributor]
Fatimo Adebanjo adebanjofatimo2000@gmail.com [contributor]
Isaac Ajao isaacoluwaseyiajao@gmail.com [contributor]
See Also
Useful links:
https://github.com/gbganalyst/bulkreadr
https://gbganalyst.github.io/bulkreadr/
Report bugs at https://github.com/gbganalyst/bulkreadr/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs: A value or the magrittr placeholder.
rhs: A function call using the magrittr semantics.
Value
The result of calling rhs(lhs).
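A minimal sketch of the operator in use: the pipe forwards the left-hand side into the first argument of the right-hand side call, so the call below is equivalent to head(iris, 3).
# pipe iris into head(); same result as head(iris, 3)
iris %>%
  head(3)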
User friendly date parsing function
Description
convert_to_date() parses an input vector into a POSIXct date object. It can also convert an Excel date serial number such as 42370 into a date value such as 2016-01-01.
Usage
convert_to_date(date_num_char, tz = "UTC")
Arguments
date_num_char: A character or numeric vector of dates.
tz: Time zone indicator. Defaults to "UTC".
Value
A vector of class Date.
Examples
## ** heterogeneous dates **
dates <- c(
44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
"13-01- 2022", "2023", "2023-2", 41750.2, 41751.99,
"11 07 2023", "2023-4"
)
convert_to_date(dates)
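A further sketch grounded in the description above: an Excel serial number converts directly, and the tz argument can be stated explicitly.
# Excel date serial number 42370 should parse to 2016-01-01
convert_to_date(42370)
# the default time zone, written out explicitly
convert_to_date("22.09.2022", tz = "UTC")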
Fill missing values in a data frame
Description
fill_missing_values()
is an efficient function that addresses missing
values in a data frame. It uses imputation by function, also known as
column-based imputation, to impute the missing values. For continuous
variables, it supports various methods of imputation, including minimum,
maximum, mean, median, harmonic mean, and geometric mean. For categorical
variables, missing values are replaced with the mode of the column. This
approach ensures accurate and consistent replacements derived from individual
columns, resulting in a complete and reliable dataset for improved analysis
and decision-making.
Usage
fill_missing_values(
df,
selected_variables = NULL,
method = c("mean", "min", "max", "median", "harmonic", "geometric")
)
Arguments
df: A dataframe to process for missing value imputation.
selected_variables: An optional vector of variable names within df to impute; if NULL (the default), all variables are considered.
method: A character string specifying the imputation method for continuous variables. Supported methods are "mean" (the default), "min", "max", "median", "harmonic", and "geometric".
Value
A data frame with missing values imputed according to the specified method.
Examples
library(dplyr)
# Assuming 'df' is the dataframe you want to process
df <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", NA, "versicolor", "setosa",
NA, "virginica", "setosa")
)
# Impute using the mean method for continuous variables
result_df_mean <- fill_missing_values(df, method = "mean")
result_df_mean
# Impute using the geometric mean for continuous variables and specify
# variables `Petal_Length` and `Petal_Width`.
result_df_geomean <- fill_missing_values(
  df,
  selected_variables = c("Petal_Length", "Petal_Width"),
  method = "geometric"
)
result_df_geomean
# Impute missing values (NAs) in a grouped data frame
# You can do that by using the following:
sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa",
"virginica", "virginica", "setosa")
)
sample_iris %>%
group_by(Species) %>%
group_split() %>%
map_df(fill_missing_values, method = "median")
Create a data dictionary from labelled data
Description
generate_dictionary()
creates a data dictionary from a specified data frame.
This function is particularly useful for understanding and documenting the
structure of your dataset, similar to data dictionaries in Stata or SPSS.
Usage
generate_dictionary(data)
Arguments
data: A data frame or a survey object.
Details
The function returns a tibble (a modern version of R's data frame) with the following columns:
- position: an integer vector indicating the column position in the data frame.
- variable: a character vector containing the names of the variables (columns).
- description: a character vector with a human-readable description of each variable.
- column type: a character vector specifying the data type (e.g., numeric, character) of each variable.
- missing: an integer vector indicating the count of missing values for each variable.
- levels: a list vector containing the levels for categorical variables, if applicable.
Value
A tibble representing the data dictionary. Each row corresponds to a variable in the original data frame, providing detailed information about the variable's characteristics.
Examples
# Creating a data dictionary from an SPSS file
file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")
wage_data <- read_spss_data(file = file_path)
generate_dictionary(wage_data)
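Because the dictionary is a tibble, it can be manipulated like any other data frame. A small sketch (assuming dplyr is attached) that keeps only variables containing missing values:
library(dplyr)
dictionary <- generate_dictionary(wage_data)
# keep only variables with at least one missing value
dictionary %>%
  filter(missing > 0)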
Summarize missingness in data frame columns
Description
inspect_na()
summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.
Usage
inspect_na(df)
Arguments
df: A data frame.
Details
The tibble returned contains the columns:
- col_name: a character vector containing the column names of df.
- cnt: an integer vector containing the number of missing values by column.
- pcnt: the percentage of records in each column that are missing.
Value
A tibble summarizing the count and percentage of columnwise missingness for a data frame.
Examples
library(dplyr)
# dataframe summary
inspect_na(airquality)
# grouped dataframe summary
airquality %>%
group_by(Month) %>%
inspect_na()
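The summary itself is an ordinary tibble, so it can be filtered further. A small sketch (using the columns listed under Details, with dplyr attached as above) that keeps only columns with missing values:
# keep only columns of airquality that contain missing values
inspect_na(airquality) %>%
  filter(pcnt > 0)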
Look for keywords in variable names and descriptions in labelled data
Description
The look_for()
function is designed to emulate the functionality of the Stata lookfor
command in R. It provides a powerful tool for searching through large datasets, specifically targeting variable names, variable label descriptions, factor levels, and value labels. This function is handy for users working with extensive and complex datasets, enabling them to quickly and efficiently locate the variables of interest.
Usage
look_for(
data,
...,
labels = TRUE,
values = TRUE,
ignore.case = TRUE,
details = c("basic", "none", "full")
)
Arguments
data: A data frame or a survey object.
...: An optional list of keywords, a character string (or several character strings), which can be formatted as a regular expression suitable for a grepl() pattern; all variables are displayed if no keyword is supplied.
labels: Whether or not to search variable labels (descriptions); TRUE by default.
values: Whether or not to search within values (factor levels or value labels); TRUE by default.
ignore.case: Whether or not to make the keywords case sensitive; TRUE (the default) makes the search case insensitive.
details: Add details about each variable: "basic" (the default), "none", or "full" (full details could be time consuming for big data frames).
Value
A tibble data frame featuring the variable position, name and description (if it exists) in the original data frame.
Examples
look_for(iris)
# Look for a single keyword.
look_for(iris, "petal")
look_for(iris, "s")
Extract or replace parts of an object
Description
pull_out() is similar to [. It acts on vectors, matrices, arrays and lists to extract or replace parts. It is pleasant to use with the magrittr (%>%) and base (|>) operators.
Value
pull_out() will return an object of the same class as the input object.
Examples
good_choice <- letters %>%
pull_out(c(5, 2, 1, 4))
good_choice
iris %>%
pull_out(, 1:4) %>%
head()
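As noted in the description, the base pipe works the same way; a minimal sketch:
# extract the first three letters with the base |> operator
letters |>
  pull_out(1:3)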
Reads all CSV files from a directory
Description
read_csv_files_from_dir() reads all CSV files from a given directory and returns an appended dataframe. The resulting dataframe is in the same order as the CSV files in the directory.
Usage
read_csv_files_from_dir(dir_path = ".", col_types = NULL, .id = NULL)
Arguments
dir_path: Path to the directory containing the CSV files.
col_types: One of NULL, a readr::cols() specification, or a string. If NULL, all column types are guessed from the input. Alternatively, you can use a compact string representation where each character represents one column (for example, "c" = character, "i" = integer, "d" = double). By default, reading a file without a column specification prints a message showing the guessed column types.
.id: The name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date. If NULL (the default), no extra column is created.
Value
A tibble. If there is any column type mismatch when the data frames are row-bound, an error will occur, because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.
See Also
read_excel_files_from_dir(), which reads Excel workbook data from a directory.
Examples
directory <- system.file("csvfolder", package = "bulkreadr")
read_csv_files_from_dir(dir_path = directory, .id = "cut")
# Column types mismatch error --------------------------------------
# If the `read_csv_files_from_dir()` function complains about a data type mismatch,
# then set the `col_types` argument to `"c"`.
# This will make all the column types in the resulting dataframe be characters.
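A sketch of that call, with every column read as character:
read_csv_files_from_dir(dir_path = directory, col_types = "c", .id = "cut")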
Read Excel Workbooks data from a directory
Description
read_excel_files_from_dir() reads all Excel workbooks in a given directory and returns an appended dataframe.
Usage
read_excel_files_from_dir(dir_path, col_types = NULL, .id = NULL)
Arguments
dir_path: Path to the directory containing the Excel (xls/xlsx) files.
col_types: Either NULL to guess the column types from the spreadsheet, or a character vector containing one entry per column (e.g. "text", "numeric", "date", "guess", "skip"); see readxl::read_excel() for details.
.id: The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions.
Value
A tibble. If there is any column type mismatch when the data frames are row-bound, an error will occur, because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.
See Also
read_excel_workbook(), which imports data from multiple sheets of an Excel workbook.
Examples
directory <- system.file("xlsxfolder", package = "bulkreadr")
read_excel_files_from_dir(dir_path = directory, .id = "cut")
# Column types mismatch error --------------------------------------
# If the `read_excel_files_from_dir()` function complains about a data type mismatch,
# then set the `col_types` argument to `"text"`.
# This will make all the column types in the resulting dataframe be characters.
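A sketch of that call, with every column read as text:
read_excel_files_from_dir(dir_path = directory, col_types = "text", .id = "cut")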
Import data from multiple sheets of an Excel workbook
Description
read_excel_workbook() reads the data from all sheets of an Excel workbook and returns an appended dataframe.
Usage
read_excel_workbook(path, col_types = NULL, .id = NULL)
Arguments
path: Path to the xls/xlsx file.
col_types: Either NULL to guess the column types from the spreadsheet, or a character vector containing one entry per column (e.g. "text", "numeric", "date", "guess", "skip"); see readxl::read_excel() for details.
.id: The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions.
Value
A tibble. If there is any column type mismatch when the data frames are row-bound, an error will occur, because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.
See Also
readxl::read_excel(), which reads a sheet of an Excel file into a data frame, and read_gsheets(), which imports data from multiple sheets of a Google Sheets spreadsheet.
Examples
path <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr", mustWork = TRUE)
read_excel_workbook(path = path, .id = "Year")
# Column types mismatch error --------------------------------------
# If the `read_excel_workbook()` function complains about a data type mismatch,
# then set the `col_types` argument to `"text"`.
# This will make all the column types in the resulting dataframe be characters.
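A sketch of that call, with every column read as text:
read_excel_workbook(path = path, col_types = "text", .id = "Year")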
Import data from multiple sheets in Google Sheets
Description
The read_gsheets()
function imports data from multiple sheets in a Google Sheets spreadsheet and appends the resulting dataframes from each sheet together to create a single dataframe. This function is a powerful tool for data analysis, as it allows you to easily combine data from multiple sheets into a single dataset.
Usage
read_gsheets(ss, col_types = NULL, .id = NULL)
Arguments
ss: Something that identifies a Google Sheet: its file ID as a string, or a URL from which the ID can be recovered. Processed through googlesheets4::as_sheets_id().
col_types: Column types. Either NULL to guess all from the spreadsheet, or a string of readr-style shortcodes with one character per column (e.g. "c" for character).
.id: The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions.
Value
A tibble. If there is any column type mismatch when the data frames are row-bound, an error will occur, because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.
See Also
googlesheets4::read_sheet(), which reads a Google (spread)Sheet into a data frame.
Examples
sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0"
read_gsheets(ss = sheet_id, .id = "sheet.name")
# Column types mismatch error --------------------------------------
# If the `read_gsheets()` function complains about a data type mismatch,
# then set the `col_types` argument to `"c"`.
# This will make all the column types in the resulting dataframe be characters.
# For example,
sheet_id <- "1rrjKAV05POre9lDVtHtZePTa8VROf1onVO47cHnhrTU"
try(read_gsheets(ss = sheet_id)) # error, column types mismatch
read_gsheets(ss = sheet_id, col_types = "c")
Read SPSS data file
Description
read_spss_data() is designed to seamlessly import data from SPSS data files (.sav or .zsav). It converts labelled variables into factors, a crucial step that enhances the ease of data manipulation and analysis within the R programming environment.
Usage
read_spss_data(file, label = FALSE)
Arguments
file: The path to the SPSS data file.
label: Logical indicating whether to use variable labels as column names (default is FALSE).
Value
A tibble containing the data from the SPSS file.
See Also
read_stata_data(), which reads a Stata data file and converts labelled variables into factors.
Examples
# Read an SPSS data file without converting variable labels as column names
file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")
data <- read_spss_data(file = file_path)
data
# Read an SPSS data file and convert variable labels as column names
data <- read_spss_data(file = file_path, label = TRUE)
data
Read Stata data file
Description
read_stata_data() reads a Stata data file (.dta) and converts labelled variables into factors.
Usage
read_stata_data(file, label = FALSE)
Arguments
file: The path to the Stata data file.
label: Logical indicating whether to use variable labels as column names (default is FALSE).
Value
A data frame containing the Stata data, with labeled variables converted to factors.
See Also
read_spss_data(), which reads an SPSS data file and converts labelled variables into factors.
Examples
# Read Stata data file without converting variable labels as column names
file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr")
data <- read_stata_data(file = file_path)
data
# Read Stata data file and convert variable labels as column names
data <- read_stata_data(file = file_path, label = TRUE)
data
Export Excel Sheets to CSV Files
Description
This function reads an Excel file, converts each sheet into a data frame, and writes each sheet to a CSV file in the specified output directory.
Usage
write_excel_sheets_to_csv(excel_path, output_dir = "data/")
Arguments
excel_path: A character string specifying the path to the Excel file.
output_dir: A character string specifying the directory where the CSV files will be saved. Defaults to "data/".
Value
A list of file paths corresponding to the exported CSV files.
Examples
# Path to the example Excel file shipped with the package
excel_file <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr")
# Export each sheet to its own CSV in a temporary directory
output_dir <- tempdir()
write_excel_sheets_to_csv(excel_file, output_dir)
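Because the sheets are written as ordinary CSV files, they can be read back in bulk. A sketch that pairs this function with read_csv_files_from_dir(), assuming the temporary directory holds only the exported CSVs (the "sheet" column name is illustrative):
# read the exported CSV files back into a single dataframe
read_csv_files_from_dir(dir_path = output_dir, .id = "sheet")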