dformula allows to easily modify, transform, add and extrapolate using the basic R formula. The operations on the data are the following:
| Operation | Function |
|---|---|
| Add new variables | add() |
| Transform existing variables | transform() |
| Rename existing variables | rename() |
| Selection rows and columns | select() |
| Removing row and column | remove |
The formula is composed of two part:
\[ column\_names \sim new\_variables \]
the right-hand side shows the names of the columns of the data and the left-hand side the transformation or the new variables to insert in the data.
The I() is used in the right-hand side to indicate the
type of transformation of the existing variable. In this function, we
can insert logical statement, function implemented in R or user build
function.
For example:
\[ var\_name_1 + var\_name_2 \sim I(log(var\_name_1)) + I(var\_name_2 == "something") \]
the two variable \(var_name_1\) and \(var_name_2\) are transformed in \(log(var_name_1)\) or selected to be equal to \("something"\).
In the same fashion of SQL, we have the from argument,
the input data, and the as argument, the new name of the
variables, after transformation, selection or addition.
The CRAN version can be loaded
or the development version from GitHub:
The data are available in the package will be used in this overview
which describes the Population and Area of world countries
Adding variablesThe add() function inserts new variables starting from
the existing columns in the data.
Suppose we want to calculate population density and attach this to the original dataset
new_pop <- add(from = pop_data, formula = ~ I(Population / Area))
head(new_pop)
## Country Population Area Var.1
## 1 China 1439323776 9388211.00 153.31183
## 2 India 1380004385 2973190.00 464.14941
## 3 United States 331002651 9147420.00 36.18536
## 4 Indonesia 273523615 1811570.00 150.98705
## 5 Pakistan 220892340 770.88 286545.68804
## 6 Brazil 212559417 8358140.00 25.43143and give a name to this new variable
new_pop <- add(from = pop_data, formula = ~ I(Population / Area), as = "pop_density")
head(new_pop)
## Country Population Area pop_density
## 1 China 1439323776 9388211.00 153.31183
## 2 India 1380004385 2973190.00 464.14941
## 3 United States 331002651 9147420.00 36.18536
## 4 Indonesia 273523615 1811570.00 150.98705
## 5 Pakistan 220892340 770.88 286545.68804
## 6 Brazil 212559417 8358140.00 25.43143Multiple variable can be added with a single formula
new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)))
head(new_pop)
## Country Population Area Var.1 Var.2
## 1 China 1439323776 9388211.00 153.31183 16.054965
## 2 India 1380004385 2973190.00 464.14941 14.905146
## 3 United States 331002651 9147420.00 36.18536 16.028982
## 4 Indonesia 273523615 1811570.00 150.98705 14.409704
## 5 Pakistan 220892340 770.88 286545.68804 6.647533
## 6 Brazil 212559417 8358140.00 25.43143 15.938746and with new names
new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)),
as = c("pop_density", "log_area"))
head(new_pop)
## Country Population Area pop_density log_area
## 1 China 1439323776 9388211.00 153.31183 16.054965
## 2 India 1380004385 2973190.00 464.14941 14.905146
## 3 United States 331002651 9147420.00 36.18536 16.028982
## 4 Indonesia 273523615 1811570.00 150.98705 14.409704
## 5 Pakistan 220892340 770.88 286545.68804 6.647533
## 6 Brazil 212559417 8358140.00 25.43143 15.938746If we have one transformation applied to a group of variables, we do not specify the function multiple times
new_pop <- add(from = pop_data, formula = Population + Area ~ log())
head(new_pop)
## Country Population Area Var.1 Var.2
## 1 China 1439323776 9388211.00 21.08744 16.054965
## 2 India 1380004385 2973190.00 21.04535 14.905146
## 3 United States 331002651 9147420.00 19.61764 16.028982
## 4 Indonesia 273523615 1811570.00 19.42690 14.409704
## 5 Pakistan 220892340 770.88 19.21319 6.647533
## 6 Brazil 212559417 8358140.00 19.17473 15.938746and with new column names
new_pop <- add(from = pop_data, formula = Population + Area ~ log(),
as = c("log_pop", "log_area"))
head(new_pop)
## Country Population Area log_pop log_area
## 1 China 1439323776 9388211.00 21.08744 16.054965
## 2 India 1380004385 2973190.00 21.04535 14.905146
## 3 United States 331002651 9147420.00 19.61764 16.028982
## 4 Indonesia 273523615 1811570.00 19.42690 14.409704
## 5 Pakistan 220892340 770.88 19.21319 6.647533
## 6 Brazil 212559417 8358140.00 19.17473 15.938746Suppose we want to add a numerical id for the
countries at the beginning of the dataset, using the
position argument
new_pop <- add(from = pop_data,
formula = ~ I(1:nrow(new_pop)),
position = "left", as = "id")
head(new_pop)
## id Country Population Area
## 1 1 China 1439323776 9388211.00
## 2 2 India 1380004385 2973190.00
## 3 3 United States 331002651 9147420.00
## 4 4 Indonesia 273523615 1811570.00
## 5 5 Pakistan 220892340 770.88
## 6 6 Brazil 212559417 8358140.00We can also add a constant variable. For example the year of the observation
new_pop <- add(from = pop_data, formula = ~ C("2020"), position = "left")
head(new_pop)
## Var.1 Var.2 Var.3 Country Population Area
## 1 2020 2020 2020 China 1439323776 9388211.00
## 2 2020 2020 2020 India 1380004385 2973190.00
## 3 2020 2020 2020 United States 331002651 9147420.00
## 4 2020 2020 2020 Indonesia 273523615 1811570.00
## 5 2020 2020 2020 Pakistan 220892340 770.88
## 6 2020 2020 2020 Brazil 212559417 8358140.00or both
new_pop <- add(from = pop_data,
formula = ~ I(1:nrow(new_pop)) + C("2020"),
position = "left", as = c("ids", "year"))
head(new_pop)
## ids year Country Population Area
## 1 1 2020 China 1439323776 9388211.00
## 2 2 2020 India 1380004385 2973190.00
## 3 3 2020 United States 331002651 9147420.00
## 4 4 2020 Indonesia 273523615 1811570.00
## 5 5 2020 Pakistan 220892340 770.88
## 6 6 2020 Brazil 212559417 8358140.00The C() construct add a constant for all the rows
We can be interested in having a dummy variable, i.e. a variable equal to \(1\) if some event happen or \(0\) otherwise. For example, we suppose to build a dummy variables with the most populated countries. In this we suppose countries with more than \(100\) million of people.
new_pop <- add(from = pop_data, formula = ~ I(Population > 100000000))
head(new_pop)
## Country Population Area Var.1
## 1 China 1439323776 9388211.00 1
## 2 India 1380004385 2973190.00 1
## 3 United States 331002651 9147420.00 1
## 4 Indonesia 273523615 1811570.00 1
## 5 Pakistan 220892340 770.88 1
## 6 Brazil 212559417 8358140.00 1or two variables one with the most populated countries and the other with the biggest extended countries
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000) + I(Area > 8000000))
head(new_pop)
## Country Population Area Var.1 Var.2
## 1 China 1439323776 9388211.00 1 1
## 2 India 1380004385 2973190.00 1 0
## 3 United States 331002651 9147420.00 1 1
## 4 Indonesia 273523615 1811570.00 1 0
## 5 Pakistan 220892340 770.88 1 0
## 6 Brazil 212559417 8358140.00 1 1or a variable indicating the most populated and the biggest countries togheter
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)
## Country Population Area Var.1
## 1 China 1439323776 9388211.00 1
## 2 India 1380004385 2973190.00 0
## 3 United States 331002651 9147420.00 1
## 4 Indonesia 273523615 1811570.00 0
## 5 Pakistan 220892340 770.88 0
## 6 Brazil 212559417 8358140.00 1If we want obtain a boolean vector, as an interrogation, setting to
TRUE the argument logic_convert the function
will return a boolean vector
new_pop <- add(from = pop_data,
formula = ~ I(Population > 100000000),
logic_convert = FALSE, as = "most_populated")
head(new_pop)
## Country Population Area most_populated
## 1 China 1439323776 9388211.00 TRUE
## 2 India 1380004385 2973190.00 TRUE
## 3 United States 331002651 9147420.00 TRUE
## 4 Indonesia 273523615 1811570.00 TRUE
## 5 Pakistan 220892340 770.88 TRUE
## 6 Brazil 212559417 8358140.00 TRUETransform variablesThe transform() function modifies existing variables in
the dataset.
Suppose we want to change the scale on the Population
new_pop <- transform(from = pop_data,
formula = Population ~ I(Population/10000))
head(new_pop)
## Country Population Area
## 1 China 143932.38 9388211.00
## 2 India 138000.44 2973190.00
## 3 United States 33100.27 9147420.00
## 4 Indonesia 27352.36 1811570.00
## 5 Pakistan 22089.23 770.88
## 6 Brazil 21255.94 8358140.00or we want a logarithmic transformation, renaming the variable
new_pop <- transform(from = pop_data,
formula = Population ~ I(log(Population)),
as = "log_pop")
head(new_pop)
##
## 1 China 21.08744 9388211.00
## 2 India 21.04535 2973190.00
## 3 United States 19.61764 9147420.00
## 4 Indonesia 19.42690 1811570.00
## 5 Pakistan 19.21319 770.88
## 6 Brazil 19.17473 8358140.00With a single formula multiple variables can be transformed, as showed before.
new_pop <- transform(from = pop_data,
formula = Population + Area~ I(log()))
head(new_pop)
## Country Population Area
## 1 China 21.08744 16.054965
## 2 India 21.04535 14.905146
## 3 United States 19.61764 16.028982
## 4 Indonesia 19.42690 14.409704
## 5 Pakistan 19.21319 6.647533
## 6 Brazil 19.17473 15.938746We can also transformed multiple variables with multiple transformations
new_pop <- transform(from = pop_data,
formula = Population + Area ~ I(Population > 100000000) + I(log(Area)))
head(new_pop)
## Country Population Area
## 1 China 1 16.054965
## 2 India 1 14.905146
## 3 United States 1 16.028982
## 4 Indonesia 1 14.409704
## 5 Pakistan 1 6.647533
## 6 Brazil 1 15.938746Rename variablesThe rename() function may be used to change names of
existing variables, for example
new_pop <- rename(from = pop_data, formula = Population ~ pop )
head(new_pop)
## Country pop Area
## 1 China 1439323776 9388211.00
## 2 India 1380004385 2973190.00
## 3 United States 331002651 9147420.00
## 4 Indonesia 273523615 1811570.00
## 5 Pakistan 220892340 770.88
## 6 Brazil 212559417 8358140.00or multiple variables
new_pop <- rename(from = pop_data, formula = Population + Area ~ pop + area)
head(new_pop)
## Country pop area
## 1 China 1439323776 9388211.00
## 2 India 1380004385 2973190.00
## 3 United States 331002651 9147420.00
## 4 Indonesia 273523615 1811570.00
## 5 Pakistan 220892340 770.88
## 6 Brazil 212559417 8358140.00Select variables and rowsIn the same fashion of SQL, the select() function first
select the rows, given a statement, and then shows the select
variables.
The first part of the formula are the columns to select, as the previous functions, and the right-hand side of the formula, the condition part, will select the rows.
Suppose to want to select only the most populated countries
new_pop <- select(from = pop_data,
formula = . ~ I(Population > 100000000))
head(new_pop)
## Country Population Area
## 1 China 1439323776 9388211.00
## 2 India 1380004385 2973190.00
## 3 United States 331002651 9147420.00
## 4 Indonesia 273523615 1811570.00
## 5 Pakistan 220892340 770.88
## 6 Brazil 212559417 8358140.00you can also add . to returns all variables instead of
nothing.
We want only the name of the most populated countries
new_pop <- select(from = pop_data,
formula = Country ~ I(Population > 100000000))
head(new_pop)
## Country
## 1 China
## 2 India
## 3 United States
## 4 Indonesia
## 5 Pakistan
## 6 BrazilWe might be interest in only the most populated and biggest countries
new_pop <- select(from = pop_data,
formula = . ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)
## Country Population Area
## 1 China 1439323776 9388211
## 3 United States 331002651 9147420
## 6 Brazil 212559417 8358140
## 9 Russia 145934462 16376870or both
new_pop <- select(from = pop_data,
formula = ~ I(Population > 100000000 | Area > 8000000))
head(new_pop)
## Country Population Area
## 1 China 1439323776 9388211.00
## 2 India 1380004385 2973190.00
## 3 United States 331002651 9147420.00
## 4 Indonesia 273523615 1811570.00
## 5 Pakistan 220892340 770.88
## 6 Brazil 212559417 8358140.00by selecting only the names
Remove variablesThe remove() function has the same syntax of
select() function, but now the rows and columns will be
removed.
In all the functions, except for rename, the argument
na.remove will remove all the rows with missing values,
after adding, transforming or selecting the rows.
The remove function, can be employed to remove all the
rows with at least a missing observation,
data("airquality")
dt <- airquality
dt_new <- remove(from = dt,formula = .~., na.remove = TRUE)
head(dt_new)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8If we are interested to focus on the observation with missing values,
the na.return = TRUE arguments of select
function will return only the incomplete rows after the selection