Get started

The ggmice package

The ggmice package provides visualizations for the evaluation of incomplete data, mice imputation model arguments, and multiply imputed data sets (mice::mids objects). The functions in ggmice adhere to the ‘grammar of graphics’ philosophy, popularized by the ggplot2 package. With that, ggmice enhances imputation workflows and provides plotting objects that are easily extended and manipulated by each individual ‘imputer’.

This vignette gives an overview of the different plotting function in ggmice. The core function, ggmice(), is a ggplot2::ggplot() wrapper function which handles missing and imputed values. In this vignette, you’ll learn how to create and interpret ggmice visualizations.

Experienced mice users may already be familiar with the lattice style plotting functions in mice. These ‘old friends’ such as mice::bwplot() can be re-created with the ggmice() function, see the Old friends vignette for advice.

Set-up

You can install the latest ggmice release from CRAN with:

install.packages("ggmice")

The development version of the ggmice package can be installed from GitHub with:

# install.packages("devtools")
devtools::install_github("amices/ggmice")

After installing ggmice, you can load the package into your R workspace. It is highly recommended to load the mice and ggplot2 packages as well. This vignette assumes that all three packages are loaded:

library(mice)
library(ggplot2)
library(ggmice)

We will use the mice::boys data for illustrations. This is an incomplete dataset (\(n = 748\)) with cross-sectional data on \(9\) growth-related variables (e.g., age in years and height in cm).

We load the incomplete data with:

dat <- boys

For the purpose of this vignette, we impute all incomplete variables \(m = 3\) times with predictive mean matching as imputation method. Imputations are generated with:

imp <- mice(dat, m = 3, method = "pmm")

We now have the necessary packages, an incomplete dataset (dat), and a mice::mids object (imp) loaded in our workspace.

The ggmice() function

The core function in the ggmice package is ggmice(). This function mimics how the ggplot2 function ggplot() works: both take a data argument and a mapping argument, and will return an object of class ggplot.

Using ggmice() looks equivalent to a ggplot() call:

ggplot(dat, aes(x = age))
ggmice(dat, aes(x = age))

The main difference between the two functions is that ggmice() is actually a wrapper around ggplot(), including some pre-processing steps for incomplete and imputed data. Because of the internal processing in ggmice(), the mapping argument is required for each ggmice() call. This is in contrast to the aesthetic mapping in ggplot(), which may also be provided in subsequent plotting layers. After creating a ggplot object, any desired plotting layers may be added (e.g., with the family of ggplot2::geom_* functions), or adjusted (e.g., with the ggplot2::labs() function). This makes ggmice() a versatile plotting function for incomplete and/or imputed data.

The object supplied to the data argument in ggmice() should be an incomplete dataset of class data.frame, or an imputation object of class mice::mids. Depending on which one of these is provided, the resulting visualization will either differentiate between observed and missing data, or between observed and imputed data. By convention, observed data is plotted in blue and missing or imputed data is plotted in red.

The mapping argument in ggmice() cannot be empty. An x or y mapping (or both) has to be supplied for ggmice() to function. This aesthetic mapping can be provided with the ggplot2 function aes() (or equivalents). Other mapping may be provided too, except for colour, which is already used to display observed versus missing or imputed data.

Incomplete data

If the object supplied to the data argument in ggmice() is a data.frame, the visualization will contain observed data in blue and missing data in red. Since missing data points are by definition unobserved, the values themselves cannot be plotted. What we can plot are sets of variable pairs. Any missing values on one variable can be displayed on top of the axis of the other. This provides a visual cue that the missing data is distinct from the observed values, but still displays the observed value of the other variable.

For example, the variable age is completely observed, while there are some missing entries for the height variable hgt. We can create a scatter plot of these two variables with:

ggmice(dat, aes(age, hgt)) +
  geom_point()

The age of cases with missing hgt are plotted on top of the horizontal axis. This is in contrast to a regular ggplot() call with the same arguments, which would leave out all cases with missing hgt. So, with ggmice() we loose less information, and may even gain valuable insight into the missingness in the data.

Another example of ggmice() in action on incomplete data is when one of the variables is categorical. The incomplete continuous variable hgt is plotted against the incomplete categorical variable reg with:

ggmice(dat, aes(reg, hgt)) +
  geom_point()

Again, missing values are plotted on top of the axes. Cases with observed hgt and missing reg are plotted on top of the vertical axis. Cases with observed reg and missing hgt are plotted on top of the horizontal axis. There are no cases were neither is observed, but otherwise these would be plotted on the intersection of the two axes.

The ‘grammar of graphics’ makes it easy to adjust the plots programmatically. For example, we could be interested in the differences in growth data between the city and other regions. Add facets based on a clustering variable with:

ggmice(dat, aes(wgt, hgt)) +
  geom_point() +
  facet_wrap(~ reg == "city", labeller = label_both)

Or, alternatively, we could convert the plotted values of the variable hgt from centimeters to inches and the variable wgt from kilograms to pounds with:

ggmice(dat, aes(wgt * 2.20, hgt / 2.54)) +
  geom_point() +
  labs(x = "Weight (lbs)", y = "Height (in)")
#> Warning: Mapping variable 'wgt * 2.2' recognized internally as wgt.
#> Please verify whether this matches the requested mapping variable.
#> Warning: Mapping variable 'hgt/2.54' recognized internally as hgt.
#> Please verify whether this matches the requested mapping variable.

A final example of ggmice() applied to incomplete data is faceting based on a missingness indicator. Doing so may help explore the missingness mechanisms in the incomplete data. The distribution of the continuous variable age and categorical variable reg are visualized faceted by the missingness indicator for hgt with:

# continuous variable
ggmice(dat, aes(age)) +
  geom_density() +
  facet_wrap(~ factor(is.na(hgt) == 0, labels = c("observed height", "missing height")))

# categorical variable
ggmice(dat, aes(reg)) +
  geom_bar(fill = "white") +
  facet_wrap(~ factor(is.na(hgt) == 0, labels = c("observed height", "missing height")))

Imputed data

If the data argument in ggmice() is provided a mice::mids object, the resulting plot will contain observed data in blue and imputed data in red. There are many possible visualizations for imputed data, four of which are explicitly defined in the mice package. Each of these can be re-created with the ggmice() function (see the Old friends vignette). But ggmice() can do even more.

For example, we could create the same scatter plots as the ones above, but now on the imputed data:

ggmice(imp, aes(age, hgt)) +
  geom_point()

ggmice(imp, aes(reg, hgt)) +
  geom_point()

ggmice(imp, aes(wgt, hgt)) +
  geom_point() +
  facet_wrap(~ reg == "city", labeller = label_both)

ggmice(imp, aes(wgt * 2.20, hgt / 2.54)) +
  geom_point() +
  labs(x = "Weight (lbs)", y = "Height (in)")
#> Warning: Mapping variable 'wgt * 2.2' recognized internally as wgt.
#> Please verify whether this matches the requested mapping variable.
#> Warning: Mapping variable 'hgt/2.54' recognized internally as hgt.
#> Please verify whether this matches the requested mapping variable.

These figures show the observed data points once in blue, plus three imputed values in red for each missing entry.

It is also possible to use the imputation number as mapping variable in the plot. For example, we can create a stripplot of observed and imputed data with the imputation number .imp on the horizontal axis:

ggmice(imp, aes(x = .imp, y = hgt)) +
  geom_jitter(height = 0, width = 0.25) +
  labs(x = "Imputation number")

A major advantage of ggmice() over the equivalent function mice::stripplot() is that ggmice allows us to add subsequent plotting layers, such as a boxplot overlay:

ggmice(imp, aes(x = .imp, y = hgt)) +
  geom_jitter(height = 0, width = 0.25) +
  geom_boxplot(width = 0.5, size = 1, alpha = 0.75, outlier.shape = NA) +
  labs(x = "Imputation number")

You may want to create a plot visualizing the imputations of multiple variables as one object. This can be done by combining ggmice with the functional programming package purrr and visualization package patchwork. For example, we could obtain boxplots of different imputed variables as one object using:

purrr::map(c("wgt", "hgt", "bmi"), ~ {
  ggmice(imp, aes(x = .imp, y = .data[[.x]])) +
    geom_boxplot() +
    labs(x = "Imputation number")
}) %>%
  patchwork::wrap_plots()

To re-create any mice plot with ggmice, see the Old friends vignette.

Other functions

The ggmice package contains some additional plotting functions to explore incomplete data and evaluate convergence of the imputation algorithm. These are presented in the order of a typical imputation workflow, where the missingness is first investigated using a missing data pattern and influx-outflux plot, then imputation models are built based on relations between variables, and finally the imputations are inspected visually to check for non-convergence.

Missing data pattern

The plot_pattern() function displays the missing data pattern in an incomplete dataset. The argument data (the incomplete dataset) is required, the argument square is optional and determines whether the missing data pattern has square or rectangular tiles, and the optional argument rotate changes the angle of the variable names 90 degrees if requested. Other optional arguments are cluster and npat.

# create missing data pattern plot
plot_pattern(dat)


# specify optional arguments
plot_pattern(
  dat,
  square = TRUE,
  rotate = TRUE,
  npat = 3,
  cluster = "reg"
)

Influx and outflux

The plot_flux() function produces an influx-outflux plot. The influx of a variable quantifies how well its missing data connect to the observed data on other variables. The outflux of a variable quantifies how well its observed data connect to the missing data on other variables. In general, higher influx and outflux values are preferred when building imputation models. The plotting function requires an incomplete dataset (argument data), and takes optional arguments to adjust the legend and axis labels.

# create influx-outflux plot
plot_flux(dat)


# specify optional arguments
plot_flux(
  dat,
  label = FALSE,
  caption = FALSE
)

Correlations between variables

The function plot_corr() can be used to investigate relations between variables, for the development of imputation models. Only one of the arguments (data, the incomplete dataset) is required, all other arguments are optional.

# create correlation plot
plot_corr(dat)


# specify optional arguments
plot_corr(
  dat,
  vrb = c("hgt", "wgt", "bmi"),
  label = TRUE,
  square = FALSE,
  diagonal = TRUE,
  rotate = TRUE
)

Predictor matrix

The function plot_pred() displays mice predictor matrices. A predictor matrix is typically created using mice::make.predictorMatrix(), mice::quickpred(), or by using the default in mice::mice() and extracting the predictorMatrix from the resulting mids object. The plot_pred() function requires a predictor matrix (the data argument), but other arguments can be provided too.

# create predictor matrix
pred <- quickpred(dat)

# create predictor matrix plot
plot_pred(pred)


# specify optional arguments
plot_pred(
  pred,
  label = FALSE,
  square = FALSE,
  rotate = TRUE,
  method = "pmm"
)

Algorithmic convergence

The function plot_trace() plots the trace lines of the MICE algorithm for convergence evaluation. The only required argument is data (to supply a mice::mids object). The optional argument vrb defaults to "all", which would display traceplots for all variables.

# create traceplot for one variable
plot_trace(imp, "hgt")


This is the end of the vignette. This document was generated using:

sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                       LC_CTYPE=Dutch_Netherlands.utf8   
#> [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C                      
#> [5] LC_TIME=Dutch_Netherlands.utf8    
#> 
#> time zone: Europe/Amsterdam
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggmice_0.1.0  ggplot2_3.4.2 mice_3.16.2  
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.6       utf8_1.2.3       generics_0.1.3   tidyr_1.3.0     
#>  [5] shape_1.4.6      stringi_1.7.12   lattice_0.21-8   lme4_1.1-34     
#>  [9] digest_0.6.31    magrittr_2.0.3   mitml_0.4-5      evaluate_0.21   
#> [13] grid_4.3.0       iterators_1.0.14 fastmap_1.1.1    foreach_1.5.2   
#> [17] jomo_2.7-6       jsonlite_1.8.7   glmnet_4.1-7     Matrix_1.5-4.1  
#> [21] nnet_7.3-18      backports_1.4.1  survival_3.5-5   purrr_1.0.1     
#> [25] fansi_1.0.4      scales_1.2.1     codetools_0.2-19 jquerylib_0.1.4 
#> [29] cli_3.6.1        rlang_1.1.1      munsell_0.5.0    splines_4.3.0   
#> [33] withr_2.5.0      cachem_1.0.8     yaml_2.3.7       pan_1.8         
#> [37] tools_4.3.0      nloptr_2.0.3     minqa_1.2.5      dplyr_1.1.2     
#> [41] colorspace_2.1-0 boot_1.3-28.1    broom_1.0.5      rpart_4.1.19    
#> [45] vctrs_0.6.2      R6_2.5.1         lifecycle_1.0.3  stringr_1.5.0   
#> [49] MASS_7.3-58.4    pkgconfig_2.0.3  gtable_0.3.3     pillar_1.9.0    
#> [53] bslib_0.4.2      glue_1.6.2       Rcpp_1.0.10      highr_0.10      
#> [57] xfun_0.37        tibble_3.2.1     tidyselect_1.2.0 rstudioapi_0.14 
#> [61] knitr_1.42       farver_2.1.1     patchwork_1.1.2  htmltools_0.5.4 
#> [65] nlme_3.1-162     labeling_0.4.2   rmarkdown_2.21   compiler_4.3.0

mirror server hosted at Truenetwork, Russian Federation.