library(lessR)
#>
#> lessR 4.4.5 feedback: gerbing@pdx.edu
#> --------------------------------------------------------------
#> > d <- Read("") Read data file, many formats available, e.g., Excel
#> d is default data frame, data= in analysis routines optional
#>
#> Many examples of reading, writing, and manipulating data,
#> graphics, testing means and proportions, regression, factor analysis,
#> customization, forecasting, and aggregation from pivot tables
#> Enter: browseVignettes("lessR")
#>
#> View lessR updates, now including time series forecasting
#> Enter: news(package="lessR")
#>
#> Interactive data analysis
#> Enter: interact()
The vignette examples of using lessR became so extensive that they exceeded the maximum R package installation size. A limited number of examples follow below. Find many more vignette examples at:
more examples of reading and writing data files
Many of the following examples analyze data in the Employee data set, included with lessR. To read an internal lessR data set, just pass the name of the data set to the lessR function Read(). Read the Employee data into the data frame d. For data sets other than those provided by lessR, enter the path name or URL between the quotes, or leave the quotes empty to browse for the data file on your computer system. See the Read and Write vignette for more details.
d <- Read("Employee")
#>
#> >>> Suggestions
#> Recommended binary format for data files: feather
#> Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter: details() for d, or details(name)
#>
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> integer: Numeric data values, integers only
#> double: Numeric data values with decimal digits
#> ------------------------------------------------------------
#>
#> Variable Missing Unique
#> Name Type Values Values Values First and last values
#> ------------------------------------------------------------------------------------------
#> 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
#> 2 Gender character 37 0 2 M M W ... W W M
#> 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
#> 4 Salary double 37 0 37 63788.26 104494.58 ... 66508.32 67562.36
#> 5 JobSat character 35 2 3 med low high ... high low high
#> 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
#> 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
#> 8 Post integer 37 0 22 92 74 86 ... 90 71 87
#> ------------------------------------------------------------------------------------------
d is the default name of the data frame for the lessR data analysis functions. Explicitly access the data frame with the data parameter in the analysis functions.
As an option, also read the table of variable labels. Create the table formatted as two columns: the first column is the variable name and the second column is the corresponding variable label. Not all variables need be entered into the table. The table can be a csv file or an Excel file.
Read the file of variable labels into the l data frame, currently the only permitted name. The labels will be displayed on both the text and visualization output. Each displayed label is the variable name juxtaposed with the corresponding label, as shown in the following output.
l <- rd("Employee_lbl")
#>
#> >>> Suggestions
#> Recommended binary format for data files: feather
#> Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter: details() for d, or details(name)
#>
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> ------------------------------------------------------------
#>
#> Variable Missing Unique
#> Name Type Values Values Values First and last values
#> ------------------------------------------------------------------------------------------
#> 1 label character 8 0 8 Time of Company Employment ... Test score on legal issues after instruction
#> ------------------------------------------------------------------------------------------
l
#> label
#> Years Time of Company Employment
#> Gender Man or Woman
#> Dept Department Employed
#> Salary Annual Salary (USD)
#> JobSat Satisfaction with Work Environment
#> Plan 1=GoodHealth, 2=GetWell, 3=BestCare
#> Pre Test score on legal issues before instruction
#> Post Test score on legal issues after instruction
more examples of bar charts and pie charts
Consider the categorical variable Dept in the Employee data table. Use BarChart() to tabulate and display the visualization of the number of employees in each department, here relying upon the default data frame (table) named d. Otherwise add the data= option for a data frame with another name.
Bar chart of tabulated counts of employees in each department.
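The chart and the output below correspond to the default call:

BarChart(Dept)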
#> >>> Suggestions
#> BarChart(Dept, horiz=TRUE) # horizontal bar chart
#> BarChart(Dept, fill="reds") # red bars of varying lightness
#> PieChart(Dept) # doughnut (ring) chart
#> Plot(Dept) # bubble plot
#> Plot(Dept, stat="count") # lollipop plot
#>
#> --- Dept ---
#>
#> Missing Values: 1
#>
#> ACCT ADMN FINC MKTG SALE Total
#> Frequencies: 5 6 4 6 15 36
#> Proportions: 0.139 0.167 0.111 0.167 0.417 1.000
#>
#> Chi-squared test of null hypothesis of equal probabilities
#> Chisq = 10.944, df = 4, p-value = 0.027
Specify a single fill color with the fill parameter, and the edge color of the bars with color. Set the transparency level with transparency. Against a lighter background, display the value for each bar with a darker color using the labels_color parameter. To specify a color, use color names, specify a color with either its rgb() or hcl() color space coordinates, or use the lessR custom color palette function getColors().
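For example, a call along these lines sets those parameters; the particular color and transparency values here are illustrative, not recovered from the original figure:

BarChart(Dept, fill="steelblue", color="black", transparency=.4, labels_color="darkblue")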
#> >>> Suggestions
#> BarChart(Dept, horiz=TRUE) # horizontal bar chart
#> BarChart(Dept, fill="reds") # red bars of varying lightness
#> PieChart(Dept) # doughnut (ring) chart
#> Plot(Dept) # bubble plot
#> Plot(Dept, stat="count") # lollipop plot
#>
#> --- Dept ---
#>
#> Missing Values: 1
#>
#> ACCT ADMN FINC MKTG SALE Total
#> Frequencies: 5 6 4 6 15 36
#> Proportions: 0.139 0.167 0.111 0.167 0.417 1.000
#>
#> Chi-squared test of null hypothesis of equal probabilities
#> Chisq = 10.944, df = 4, p-value = 0.027
Use the theme parameter to change the entire color theme: "colors", "lightbronze", "dodgerblue", "slatered", "darkred", "gray", "gold", "darkgreen", "blue", "red", "rose", "green", "purple", "sienna", "brown", "orange", "white", and "light". In this example, changing the full theme accomplishes the same as changing the fill color. Turn off the displayed value on each bar with the parameter labels set to "off". Specify a horizontal bar chart with the base R parameter horiz.
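For example, where the specific theme name is illustrative:

BarChart(Dept, theme="gray", labels="off", horiz=TRUE)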
#> >>> Suggestions
#> BarChart(Dept, horiz=TRUE) # horizontal bar chart
#> BarChart(Dept, fill="reds") # red bars of varying lightness
#> PieChart(Dept) # doughnut (ring) chart
#> Plot(Dept) # bubble plot
#> Plot(Dept, stat="count") # lollipop plot
#>
#> --- Dept ---
#>
#> Missing Values: 1
#>
#> ACCT ADMN FINC MKTG SALE Total
#> Frequencies: 5 6 4 6 15 36
#> Proportions: 0.139 0.167 0.111 0.167 0.417 1.000
#>
#> Chi-squared test of null hypothesis of equal probabilities
#> Chisq = 10.944, df = 4, p-value = 0.027
Consider the continuous variable Salary in the Employee data table. Use Histogram() to tabulate and display the number of employees within each bin of Salary, here relying upon the default data frame (table) named d, so the data= parameter is not needed.
Histogram of tabulated counts for the bins of Salary.
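The histogram and the summaries below come from the default call:

Histogram(Salary)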
#> >>> Suggestions
#> bin_width: set the width of each bin
#> bin_start: set the start of the first bin
#> bin_end: set the end of the last bin
#> Histogram(Salary, density=TRUE) # smoothed curve + histogram
#> Plot(Salary) # Violin/Box/Scatterplot (VBS) plot
#>
#> --- Salary ---
#>
#> n miss mean sd min mdn max
#> 37 0 83795.557 21799.533 56124.970 79547.600 144419.230
#>
#>
#>
#> --- Outliers --- from the box plot: 1
#>
#> Small Large
#> ----- -----
#> 144419.2
#>
#>
#> Bin Width: 10000
#> Number of Bins: 10
#>
#> Bin Midpnt Count Prop Cumul.c Cumul.p
#> ---------------------------------------------------------
#> 50000 > 60000 55000 4 0.11 4 0.11
#> 60000 > 70000 65000 8 0.22 12 0.32
#> 70000 > 80000 75000 8 0.22 20 0.54
#> 80000 > 90000 85000 5 0.14 25 0.68
#> 90000 > 100000 95000 3 0.08 28 0.76
#> 100000 > 110000 105000 5 0.14 33 0.89
#> 110000 > 120000 115000 1 0.03 34 0.92
#> 120000 > 130000 125000 1 0.03 35 0.95
#> 130000 > 140000 135000 1 0.03 36 0.97
#> 140000 > 150000 145000 1 0.03 37 1.00
#>
By default, the Histogram() function colors the bars according to the current, active color theme. The function also provides the corresponding frequency distribution, summary statistics, and the table of bin counts from which the histogram is constructed, as well as an outlier analysis based on Tukey's outlier detection rules for box plots.
Use the parameters bin_start, bin_width, and bin_end to customize the histogram. It is easy to change the color, either by changing the color theme with style(), or by just changing the fill color with fill. Refer to standard R colors, as shown with the lessR function showColors(), or implicitly invoke the lessR color palette generating function getColors(). Each 30 degrees of the color wheel is named, such as "greens", "rusts", etc., and implements a sequential color palette.
Customized histogram.
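Given the bin width of 14000 and the first bin beginning at 35000 shown below, the call was presumably of this form, perhaps with a fill color also specified as just described:

Histogram(Salary, bin_start=35000, bin_width=14000)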
#> >>> Suggestions
#> bin_end: set the end of the last bin
#> Histogram(Salary, density=TRUE) # smoothed curve + histogram
#> Plot(Salary) # Violin/Box/Scatterplot (VBS) plot
#>
#> --- Salary ---
#>
#> n miss mean sd min mdn max
#> 37 0 83795.557 21799.533 56124.970 79547.600 144419.230
#>
#>
#>
#> --- Outliers --- from the box plot: 1
#>
#> Small Large
#> ----- -----
#> 144419.2
#>
#>
#> Bin Width: 14000
#> Number of Bins: 8
#>
#> Bin Midpnt Count Prop Cumul.c Cumul.p
#> ---------------------------------------------------------
#> 35000 > 49000 42000 0 0.00 0 0.00
#> 49000 > 63000 56000 5 0.14 5 0.14
#> 63000 > 77000 70000 12 0.32 17 0.46
#> 77000 > 91000 84000 8 0.22 25 0.68
#> 91000 > 105000 98000 6 0.16 31 0.84
#> 105000 > 119000 112000 3 0.08 34 0.92
#> 119000 > 133000 126000 2 0.05 36 0.97
#> 133000 > 147000 140000 1 0.03 37 1.00
#>
more examples of scatter plots and related
Specify an X and a Y variable with the Plot() function to obtain a scatterplot. For two variables, both variables can be any combination of continuous or categorical. One variable can also be specified by itself. A scatterplot of two categorical variables yields a bubble plot. Below is a scatterplot of two continuous variables.
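Here, plot Years and Salary, which yields the correlation output below:

Plot(Years, Salary)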
#>
#> >>> Suggestions or enter: style(suggest=FALSE)
#> Plot(Years, Salary, enhance=TRUE) # many options
#> Plot(Years, Salary, color="red") # exterior edge color of points
#> Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
#> Plot(Years, Salary, out_cut=.10) # label top 10% from center as outliers
#>
#>
#> >>> Pearson's product-moment correlation
#>
#> Years: Time of Company Employment
#> Salary: Annual Salary (USD)
#>
#> Number of paired values with neither missing, n = 36
#> Sample Correlation of Years and Salary: r = 0.852
#>
#> Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
#> 95% Confidence Interval for Correlation: 0.727 to 0.923
#>
Enhance the default scatterplot with the parameter enhance. The visualization includes the mean of each variable indicated by the respective line through the scatterplot, the 95% confidence ellipse, labeled outliers, the least-squares regression line with its 95% confidence interval, and the corresponding regression line with the outliers removed.
Plot(Years, Salary, enhance=TRUE)
#> [Ellipse with Murdoch and Chow's function ellipse from their ellipse package]
#>
#>
#> >>> Suggestions or enter: style(suggest=FALSE)
#> Plot(Years, Salary, fill="skyblue") # interior fill color of points
#> Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
#> Plot(Years, Salary, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
#>
#> >>> Outlier analysis with Mahalanobis Distance
#>
#> MD ID
#> ----- -----
#> 8.14 Correll, Trevon
#> 7.84 Capelle, Adam
#>
#> 5.63 Korhalkar, Jessica
#> 5.58 James, Leslie
#> 3.75 Hoang, Binh
#> ... ...
#>
#>
#> >>> Pearson's product-moment correlation
#>
#> Years: Time of Company Employment
#> Salary: Annual Salary (USD)
#>
#> Number of paired values with neither missing, n = 36
#> Sample Correlation of Years and Salary: r = 0.852
#>
#> Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
#> 95% Confidence Interval for Correlation: 0.727 to 0.923
#>
The default plot for a single continuous variable includes not only the scatterplot, but also the superimposed violin plot and box plot, with outliers identified. Call this plot the VBS plot.
Plot(Salary)
#> [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
#>
#> >>> Suggestions
#> Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
#> Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
Following is a scatterplot in the form of a bubble plot for two categorical variables.
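The output below corresponds to the default call for these two categorical variables:

Plot(JobSat, Gender)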
#>
#> >>> Suggestions or enter: style(suggest=FALSE)
#> Plot(JobSat, Gender, size_cut=FALSE)
#> Plot(JobSat, Gender, trans=.8, bg="off", grid="off")
#> SummaryStats(JobSat, Gender) # or ss
#>
#> Joint and Marginal Frequencies
#> ------------------------------
#>
#> JobSat
#> Gender high low med Sum
#> M 3 11 4 18
#> W 8 2 7 17
#> Sum 11 13 11 35
#>
#> Cramer's V: 0.515
#>
#> Chi-square Test of Independence:
#> Chisq = 9.301, df = 2, p-value = 0.010
#>
#> Some Parameter values (can be manually set)
#> -------------------------------------------------------
#> radius: 0.22 size of largest bubble
#> power: 0.50 relative bubble sizes
more examples of t-tests and ANOVA
For the independent-groups t-test, specify the response variable to the left of the tilde, ~, and the categorical variable with two groups, the grouping variable, to the right of the tilde.
ttest(Salary ~ Gender)
#>
#> Compare Salary across Gender with levels M and W
#> Grouping Variable: Gender, Man or Woman
#> Response Variable: Salary, Annual Salary (USD)
#>
#>
#> ------ Describe ------
#>
#> Salary for Gender M: n.miss = 0, n = 18, mean = 91147.458, sd = 23128.436
#> Salary for Gender W: n.miss = 0, n = 19, mean = 76830.598, sd = 18438.456
#>
#> Mean Difference of Salary: 14316.860
#>
#> Weighted Average Standard Deviation: 20848.636
#>
#>
#> ------ Assumptions ------
#>
#> Note: These hypothesis tests can perform poorly, and the
#> t-test is typically robust to violations of assumptions.
#> Use as heuristic guides instead of interpreting literally.
#>
#> Null hypothesis, for each group, is a normal distribution of Salary.
#> Group M Shapiro-Wilk normality test: W = 0.962, p-value = 0.647
#> Group W Shapiro-Wilk normality test: W = 0.828, p-value = 0.003
#>
#> Null hypothesis is equal variances of Salary, homogeneous.
#> Variance Ratio test: F = 534924536.348/339976675.129 = 1.573, df = 17;18, p-value = 0.349
#> Levene's test, Brown-Forsythe: t = 1.302, df = 35, p-value = 0.201
#>
#>
#> ------ Infer ------
#>
#> --- Assume equal population variances of Salary for each Gender
#>
#> t-cutoff for 95% range of variation: tcut = 2.030
#> Standard Error of Mean Difference: SE = 6857.494
#>
#> Hypothesis Test of 0 Mean Diff: t-value = 2.088, df = 35, p-value = 0.044
#>
#> Margin of Error for 95% Confidence Level: 13921.454
#> 95% Confidence Interval for Mean Difference: 395.406 to 28238.314
#>
#>
#> --- Do not assume equal population variances of Salary for each Gender
#>
#> t-cutoff: tcut = 2.036
#> Standard Error of Mean Difference: SE = 6900.112
#>
#> Hypothesis Test of 0 Mean Diff: t = 2.075, df = 32.505, p-value = 0.046
#>
#> Margin of Error for 95% Confidence Level: 14046.505
#> 95% Confidence Interval for Mean Difference: 270.355 to 28363.365
#>
#>
#> ------ Effect Size ------
#>
#> --- Assume equal population variances of Salary for each Gender
#>
#> Standardized Mean Difference of Salary, Cohen's d: 0.687
#>
#>
#> ------ Practical Importance ------
#>
#> Minimum Mean Difference of practical importance: mmd
#> Minimum Standardized Mean Difference of practical importance: msmd
#> Neither value specified, so no analysis
#>
#>
#> ------ Graphics Smoothing Parameter ------
#>
#> Density bandwidth for Gender M: 14777.680
#> Density bandwidth for Gender W: 11630.912
Next, to analyze the operational efficiency of a weaving device, do the two-way independent-groups ANOVA analyzing the variable breaks across the levels of tension and wool. Specify the second independent variable preceded by a * sign, as in the sketch below.
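A sketch of the call, assuming the data are R's built-in warpbreaks data placed into the default data frame d:

d <- warpbreaks
ANOVA(breaks ~ tension * wool)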
#>
#> BACKGROUND
#>
#> Response Variable: breaks
#>
#> Factor Variable 1: tension
#> Levels: L M H
#>
#> Factor Variable 2: wool
#> Levels: A B
#>
#> Number of cases (rows) of data: 54
#> Number of cases retained for analysis: 54
#>
#> Two-way Between Groups ANOVA
#>
#>
#> DESCRIPTIVE STATISTICS
#>
#>
#> Equal cell sizes, so balanced design
#>
#>
#> tension
#> wool L M H
#> A 9 9 9
#> B 9 9 9
#>
#>
#> tension
#> wool L M H
#> A 44.56 24.00 24.56
#> B 28.22 28.78 18.78
#>
#> tension
#>
#> L M H
#> 1 36.39 26.39 21.67
#>
#> wool
#>
#> A B
#> 1 31.04 25.26
#>
#> NA
#>
#>
#>
#> tension
#> wool L M H
#> A 18.10 8.66 10.27
#> B 9.86 9.43 4.89
#>
#>
#> ANOVA
#>
#>
#> df Sum Sq Mean Sq F-value p-value
#> tension 2 2034.26 1017.13 8.50 0.0007
#> wool 1 450.67 450.67 3.77 0.0582
#> tension:wool 2 1002.78 501.39 4.19 0.0210
#> Residuals 48 5745.11 119.69
#>
#> Partial Omega Squared for tension: 0.217
#> Partial Omega Squared for wool: 0.049
#> Partial Omega Squared for tension & wool: 0.106
#>
#> Cohen's f for tension: 0.527
#> Cohen's f for wool: 0.226
#> Cohen's f for tension_&_wool: 0.344
#>
#>
#> TUKEY MULTIPLE COMPARISONS OF MEANS
#>
#> Family-wise Confidence Level: 0.95
#>
#> Factor: tension
#> -------------------------------
#> diff lwr upr p adj
#> M-L -10.00 -18.82 -1.18 0.02
#> H-L -14.72 -23.54 -5.90 0.00
#> H-M -4.72 -13.54 4.10 0.40
#>
#> Factor: wool
#> -----------------------------
#> diff lwr upr p adj
#> B-A -5.78 -11.76 0.21 0.06
#>
#> Cell Means
#> ------------------------------------
#> diff lwr upr p adj
#> M:A-L:A -20.56 -35.86 -5.25 0.00
#> H:A-L:A -20.00 -35.31 -4.69 0.00
#> L:B-L:A -16.33 -31.64 -1.03 0.03
#> M:B-L:A -15.78 -31.08 -0.47 0.04
#> H:B-L:A -25.78 -41.08 -10.47 0.00
#> H:A-M:A 0.56 -14.75 15.86 1.00
#> L:B-M:A 4.22 -11.08 19.53 0.96
#> M:B-M:A 4.78 -10.53 20.08 0.94
#> H:B-M:A -5.22 -20.53 10.08 0.91
#> L:B-H:A 3.67 -11.64 18.97 0.98
#> M:B-H:A 4.22 -11.08 19.53 0.96
#> H:B-H:A -5.78 -21.08 9.53 0.87
#> M:B-L:B 0.56 -14.75 15.86 1.00
#> H:B-L:B -9.44 -24.75 5.86 0.46
#> H:B-M:B -10.00 -25.31 5.31 0.39
#>
#>
#> RESIDUALS
#>
#> Fitted Values, Residuals, Standardized Residuals
#> [sorted by Standardized Residuals, ignoring + or - sign]
#> [res_rows = 20, out of 54 cases (rows) of data, or res_rows="all"]
#> ------------------------------------------------
#> tension wool breaks fitted residual z-resid
#> 5 L A 70.00 44.56 25.44 2.47
#> 9 L A 67.00 44.56 22.44 2.18
#> 4 L A 25.00 44.56 -19.56 -1.90
#> 8 L A 26.00 44.56 -18.56 -1.80
#> 1 L A 26.00 44.56 -18.56 -1.80
#> 24 H A 43.00 24.56 18.44 1.79
#> 36 L B 44.00 28.22 15.78 1.53
#> 2 L A 30.00 44.56 -14.56 -1.41
#> 23 H A 10.00 24.56 -14.56 -1.41
#> 29 L B 14.00 28.22 -14.22 -1.38
#> 37 M B 42.00 28.78 13.22 1.28
#> 34 L B 41.00 28.22 12.78 1.24
#> 40 M B 16.00 28.78 -12.78 -1.24
#> 14 M A 12.00 24.00 -12.00 -1.16
#> 18 M A 36.00 24.00 12.00 1.16
#> 19 H A 36.00 24.56 11.44 1.11
#> 16 M A 35.00 24.00 11.00 1.07
#> 41 M B 39.00 28.78 10.22 0.99
#> 44 M B 39.00 28.78 10.22 0.99
#> 39 M B 19.00 28.78 -9.78 -0.95
For a one-way ANOVA, just include one independent variable. A randomized block design is also available.
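For example, with the same data:

ANOVA(breaks ~ tension)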
more examples of analyzing proportions
The analysis of proportions is of two primary types. Here, just analyze the chi-square test of independence, which applies to two categorical variables. The first categorical variable listed in this example is the value of the parameter variable, the first parameter in the function definition, so it does not need the parameter name. The second categorical variable listed must include the parameter name by.
The question for the analysis is whether the observed frequencies of Jacket thickness and Bike ownership differ sufficiently from the frequencies expected under the null hypothesis to conclude that the variables are related.
d <- Read("Jackets")
#>
#> >>> Suggestions
#> Recommended binary format for data files: feather
#> Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter: details() for d, or details(name)
#>
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> ------------------------------------------------------------
#>
#> Variable Missing Unique
#> Name Type Values Values Values First and last values
#> ------------------------------------------------------------------------------------------
#> 1 Bike character 1025 0 2 BMW Honda Honda ... Honda Honda BMW
#> 2 Jacket character 1025 0 3 Lite Lite Lite ... Lite Med Lite
#> ------------------------------------------------------------------------------------------
Prop_test(Jacket, by=Bike)
#> variable: Jacket
#> by: Bike
#>
#> <<< Pearson's Chi-squared test
#>
#> --- Description
#>
#> Jacket
#> Bike Lite Med Thick Sum
#> BMW 89 135 194 418
#> Honda 283 207 117 607
#> Sum 372 342 311 1025
#>
#> Cramer's V: 0.319
#>
#> Row Col Observed Expected Residual Stnd Res
#> 1 1 89 151.703 -62.703 -8.288
#> 1 2 135 139.469 -4.469 -0.602
#> 1 3 194 126.827 67.173 9.287
#> 2 1 283 220.297 62.703 8.288
#> 2 2 207 202.531 4.469 0.602
#> 2 3 117 184.173 -67.173 -9.287
#>
#> --- Inference
#>
#> Chi-square statistic: 104.083
#> Degrees of freedom: 2
#> Hypothesis test of equal population proportions: p-value = 0.000
more examples of regression and logistic regression
The full output is extensive: Summary of the analysis, estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The motivation is to provide virtually all of the information needed for a proper regression analysis.
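The analysis below apparently comes from a call of this form, with the Employee data again read into the default data frame d:

d <- Read("Employee", quiet=TRUE)
reg(Salary ~ Years + Pre)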
#> >>> Suggestion
#> # Create an R markdown file for interpretative output with Rmd = "file_name"
#> reg(Salary ~ Years + Pre, Rmd="eg")
#>
#>
#> BACKGROUND
#>
#> Data Frame: d
#>
#> Response Variable: Salary
#> Predictor Variable 1: Years
#> Predictor Variable 2: Pre
#>
#> Number of cases (rows) of data: 37
#> Number of cases retained for analysis: 36
#>
#>
#> BASIC ANALYSIS
#>
#> Estimate Std Err t-value p-value Lower 95% Upper 95%
#> (Intercept) 54140.971 13666.115 3.962 0.000 26337.052 81944.891
#> Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
#> Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
#>
#> Standard deviation of Salary: 21,822.372
#>
#> Standard deviation of residuals: 11,753.478 for df=33
#> 95% range of residuals: 47,825.260 = 2 * (2.035 * 11,753.478)
#>
#> R-squared: 0.726 Adjusted R-squared: 0.710 PRESS R-squared: 0.659
#>
#> Null hypothesis of all 0 population slope coefficients:
#> F-statistic: 43.827 df: 2 and 33 p-value: 0.000
#>
#> -- Analysis of Variance
#>
#> df Sum Sq Mean Sq F-value p-value
#> Years 1 12107157290.292 12107157290.292 87.641 0.000
#> Pre 1 1639658.444 1639658.444 0.012 0.914
#>
#> Model 2 12108796948.736 6054398474.368 43.827 0.000
#> Residuals 33 4558759843.773 138144237.690
#> Salary 35 16667556792.508 476215908.357
#>
#>
#> K-FOLD CROSS-VALIDATION
#>
#>
#> RELATIONS AMONG THE VARIABLES
#>
#> Salary Years Pre
#> Salary 1.00 0.85 0.03
#> Years 0.85 1.00 0.05
#> Pre 0.03 0.05 1.00
#>
#> Tolerance VIF
#> Years 0.998 1.002
#> Pre 0.998 1.002
#>
#> Years Pre R2adj X's
#> 1 0 0.718 1
#> 1 1 0.710 2
#> 0 1 -0.028 1
#>
#> [based on Thomas Lumley's leaps function from the leaps package]
#>
#>
#> RESIDUALS AND INFLUENCE
#>
#> -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
#> [sorted by Cook's Distance]
#> [n_res_rows = 20, out of 36 rows of data, or do n_res_rows="all"]
#> -----------------------------------------------------------------------------------------
#> Years Pre Salary fitted resid rstdnt dffits cooks
#> Correll, Trevon 21 97 144419.230 120648.843 23770.387 2.424 1.217 0.430
#> James, Leslie 18 70 132563.380 111387.773 21175.607 1.998 0.714 0.156
#> Capelle, Adam 24 83 118138.430 130658.778 -12520.348 -1.211 -0.634 0.132
#> Hoang, Binh 15 96 121074.860 101158.659 19916.201 1.860 0.649 0.131
#> Korhalkar, Jessica 2 74 82502.500 59292.181 23210.319 2.171 0.638 0.122
#> Billing, Susan 4 91 82675.260 65484.493 17190.767 1.561 0.472 0.071
#> Singh, Niral 2 59 71055.440 59566.155 11489.285 1.064 0.452 0.068
#> Skrotzki, Sara 18 63 101352.330 111515.627 -10163.297 -0.937 -0.397 0.053
#> Saechao, Suzanne 8 98 65545.250 78362.271 -12817.021 -1.157 -0.390 0.050
#> Kralik, Laura 10 74 102681.190 85303.447 17377.743 1.535 0.287 0.026
#> Anastasiou, Crystal 2 59 66508.320 59566.155 6942.165 0.636 0.270 0.025
#> Langston, Matthew 5 94 59188.960 68681.106 -9492.146 -0.844 -0.268 0.024
#> Afshari, Anbar 6 100 79441.930 71822.925 7619.005 0.689 0.264 0.024
#> Cassinelli, Anastis 10 80 67562.360 85193.857 -17631.497 -1.554 -0.265 0.022
#> Osterman, Pascal 5 69 59704.790 69137.730 -9432.940 -0.826 -0.216 0.016
#> Bellingar, Samantha 10 67 76337.830 85431.301 -9093.471 -0.793 -0.198 0.013
#> LaRoe, Maria 10 80 71961.290 85193.857 -13232.567 -1.148 -0.195 0.013
#> Ritchie, Darnell 7 82 63788.260 75403.102 -11614.842 -1.006 -0.190 0.012
#> Sheppard, Cory 14 66 105027.550 98455.199 6572.351 0.579 0.176 0.011
#> Downs, Deborah 7 90 67139.900 75256.982 -8117.082 -0.706 -0.174 0.010
#>
#>
#> PREDICTION ERROR
#>
#> -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals
#> [sorted by lower bound of prediction interval]
#> [to see all intervals add n_pred_rows="all"]
#> ----------------------------------------------
#>
#> Years Pre Salary pred s_pred pi.lwr pi.upr width
#> Hamide, Bita 1 83 61036.850 55876.388 12290.483 30871.211 80881.564 50010.352
#> Singh, Niral 2 59 71055.440 59566.155 12619.291 33892.014 85240.296 51348.281
#> Anastasiou, Crystal 2 59 66508.320 59566.155 12619.291 33892.014 85240.296 51348.281
#> ...
#> Link, Thomas 10 83 76312.890 85139.062 11933.518 60860.137 109417.987 48557.849
#> LaRoe, Maria 10 80 71961.290 85193.857 11918.048 60946.405 109441.308 48494.903
#> Cassinelli, Anastis 10 80 67562.360 85193.857 11918.048 60946.405 109441.308 48494.903
#> ...
#> Correll, Trevon 21 97 144419.230 120648.843 12881.876 94440.470 146857.217 52416.747
#> Capelle, Adam 24 83 118138.430 130658.778 12955.608 104300.394 157017.161 52716.767
#>
#> ----------------------------------
#> Plot 1: Distribution of Residuals
#> Plot 2: Residuals vs Fitted Values
#> ----------------------------------
As with several other lessR functions, save the output to an object with the name of your choosing, such as r, and then reference desired pieces of the output. View the names of those pieces from the manual, here obtained with ?reg, or use the R names() function, as in the following example.
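For example, first save the output of the regression above to r:

r <- reg(Salary ~ Years + Pre)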
names(r)
#> [1] "out_suggest" "call" "formula" "vars"
#> [5] "out_title_bck" "out_background" "out_title_basic" "out_estimates"
#> [9] "out_fit" "out_anova" "out_title_mod" "out_mod"
#> [13] "out_mdls" "out_title_kfold" "out_kfold" "out_title_rel"
#> [17] "out_cor" "out_collinear" "out_subsets" "out_title_res"
#> [21] "out_residuals" "out_title_pred" "out_predict" "out_ref"
#> [25] "out_Rmd" "out_Word" "out_pdf" "out_odt"
#> [29] "out_rtf" "out_plots" "n.vars" "n.obs"
#> [33] "n.keep" "coefficients" "sterrs" "tvalues"
#> [37] "pvalues" "cilb" "ciub" "anova_model"
#> [41] "anova_residual" "anova_total" "se" "resid_range"
#> [45] "Rsq" "Rsqadj" "PRESS" "RsqPRESS"
#> [49] "m_se" "m_MSE" "m_Rsq" "cor"
#> [53] "tolerances" "vif" "resid.max" "pred_min_max"
#> [57] "residuals" "fitted" "cooks.distance" "model"
#> [61] "terms"
View any piece of output with the name of the output file, a dollar sign, and the specific name of that piece. Here, examine the fit indices.
r$out_fit
#> Standard deviation of Salary: 21,822.372
#>
#> Standard deviation of residuals: 11,753.478 for df=33
#> 95% range of residuals: 47,825.260 = 2 * (2.035 * 11,753.478)
#>
#> R-squared: 0.726 Adjusted R-squared: 0.710 PRESS R-squared: 0.659
#>
#> Null hypothesis of all 0 population slope coefficients:
#> F-statistic: 43.827 df: 2 and 33 p-value: 0.000
These expressions could also be included in a markdown document that systematically reviews each desired piece of the output.
more examples of run charts, time series charts, and forecasting
The time series plot, plotting the values of a variable across time, is a special case of a scatterplot, potentially with points of size 0 and adjacent points connected by a line segment. Indicate a time series by specifying the x-variable, the first variable listed, as a variable of type Date. Unlike base R functions, Plot() automatically converts data values to Date values when the dates are specified in a digital format, such as 18/8/2024, or in related formats such as 2024 Q3 or 2024 Aug. Otherwise, explicitly use the R function as.Date() to convert to this format before calling Plot(), or pass the date format directly with the ts_format parameter.
Plot() implements time series forecasting based on trend and seasonality with either exponential smoothing or regression analysis, including the accompanying visualization. Time series parameters include:

- ts_method: Set to "es" for exponential smoothing, the default, or "lm" for linear model regression.
- ts_unit: The time unit, either the naturally occurring interval between dates in the data, the default, or aggregated to a wider time interval.
- ts_ahead: The number of time units to forecast into the future.
- ts_agg: If aggregating the time unit, aggregate as the "sum", the default, or as the "mean".
- ts_PIlevel: The confidence level of the prediction intervals, with 0.95 the default.
- ts_format: Provides a specific format for the date variable if not detected correctly by default.
- ts_seasons: Set to FALSE to turn off seasonality in the estimated model.
- ts_trend: Set to FALSE to turn off trend in the estimated model.
- ts_type: Applies to exponential smoothing to specify additive or multiplicative seasonality, with additive the default.

In this StockPrice data file, the date conversion has already been done.
d <- Read("StockPrice")
#>
#> >>> Suggestions
#> Recommended binary format for data files: feather
#> Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter: details() for d, or details(name)
#>
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> Date: Date with year, month and day
#> double: Numeric data values with decimal digits
#> ------------------------------------------------------------
#>
#> Variable Missing Unique
#> Name Type Values Values Values First and last values
#> ------------------------------------------------------------------------------------------
#> 1 Month Date 1461 0 487 1985-01-01 ... 2025-07-01
#> 2 Company character 1461 0 3 Apple Apple ... Intel Intel
#> 3 Price double 1461 0 1440 0.0955960601568222 ... 22.8500003814697
#> 4 Volume double 1461 0 1459 175302400 137737600 ... 67885500 79094900
#> ------------------------------------------------------------------------------------------
head(d)
#> Month Company Price Volume
#> 1 1985-01-01 Apple 0.09559606 175302400
#> 23 1985-02-01 Apple 0.09816800 137737600
#> 42 1985-03-01 Apple 0.08530761 247430400
#> 63 1985-04-01 Apple 0.07416180 114060800
#> 84 1985-05-01 Apple 0.07158987 57344000
#> 106 1985-06-01 Apple 0.05487159 576016000
We have the date as Month, and also have variables Company and stock Price.
d <- Read("StockPrice")
#>
#> >>> Suggestions
#> Recommended binary format for data files: feather
#> Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter: details() for d, or details(name)
#>
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> Date: Date with year, month and day
#> double: Numeric data values with decimal digits
#> ------------------------------------------------------------
#>
#> Variable Missing Unique
#> Name Type Values Values Values First and last values
#> ------------------------------------------------------------------------------------------
#> 1 Month Date 1461 0 487 1985-01-01 ... 2025-07-01
#> 2 Company character 1461 0 3 Apple Apple ... Intel Intel
#> 3 Price double 1461 0 1440 0.0955960601568222 ... 22.8500003814697
#> 4 Volume double 1461 0 1459 175302400 137737600 ... 67885500 79094900
#> ------------------------------------------------------------------------------------------
Plot(Month, Price, filter=(Company=="Apple"), ts_area_fill="on")
#>
#> filter: (Company == "Apple")
#> -----
#> Rows of data before filtering: 1461
#> Rows of data after filtering: 487
#>
#> [with functions from Ryan, Ulrich, Bennett, and Joy's xts package]
#>
#> >>> Suggestions or enter: style(suggest=FALSE)
#> Plot(Month, Price, enhance=TRUE) # many options
#> Plot(Month, Price, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
#> Plot(Month, Price, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
With the by parameter, plot all three companies on the same panel.
Plot(Month, Price, by=Company)
#> [with functions from Ryan, Ulrich, Bennett, and Joy's xts package]
#>
#> >>> Suggestions or enter: style(suggest=FALSE)
#> Plot(Month, Price, enhance=TRUE) # many options
#> Plot(Month, Price, fill="skyblue") # interior fill color of points
#> Plot(Month, Price, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
#> Plot(Month, Price, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
Here, aggregate the mean by time, from months to quarters.
Plot(Month, Price, ts_unit="quarters", ts_agg="mean")
#> >>> Warning
#> The Date variable is not sorted in Increasing Order.
#>
#> For a data frame named d, enter:
#> d <- order_by(d, Month)
#> Maybe you have a by variable with repeating Date values?
#> Enter ?sort_by for more information and examples.
#> [with functions from Ryan, Ulrich, Bennett, and Joy's xts package]
#>
#> >>> Suggestions or enter: style(suggest=FALSE)
#> Plot(Month, Price, enhance=TRUE) # many options
#> Plot(Month, Price, color="red") # exterior edge color of points
#> Plot(Month, Price, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
#> Plot(Month, Price, out_cut=.10) # label top 10% from center as outliers
Plot() implements exponential smoothing or linear regression with seasonality forecasting, with accompanying visualization. Parameters include ts_ahead for the number of ts_units to forecast into the future, and ts_format to provide a specific format for the date variable if not detected correctly by default. Parameter ts_method defaults to "es" for exponential smoothing, or set to "lm" for linear regression. Control aspects of the exponential smoothing estimation and prediction algorithms with the parameters ts_level (alpha), ts_trend (beta), ts_seasons (gamma), ts_type for additive or multiplicative seasonality, and ts_PIlevel for the level of the prediction intervals.
To forecast Apple's stock price, focus here on the last several years of the data, beginning with Row 400 through Row 473, the last row of data for Apple. In this example, forecast ahead 24 months. Here, rely upon the default exponential smoothing estimation procedure from the fpp3 ecosystem package fable.
d <- d[400:473,]
Plot(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24)
#> [with functions from Ryan, Ulrich, Bennett, and Joy's xts package]
#> Registered S3 method overwritten by 'tsibble':
#> method from
#> as_tibble.grouped_df dplyr
#>
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, union
#> Loading required package: fabletools
#>
#> Attaching package: 'fabletools'
#> The following object is masked from 'package:lessR':
#>
#> model
#> [with functions from Hyndman and Athanasopoulos's, fpp3 packages]
#> -- standard reference: https://otexts.com/fpp3/
#>
#> Specified model
#> ---------------
#> Price [with no specifications]
#> The specified model is only suggested, and may differ from the estimated model.
#>
#> Estimated model
#> ---------------
#> Price ~ error("M") + trend("A")
#>
#>
#> Model analysis
#> --------------
#> Series: Price
#> Model: ETS(M,A,N)
#> Smoothing parameters:
#> alpha = 0.9771807
#> beta = 0.000100023
#>
#> Initial states:
#> l[0] b[0]
#> 40.85085 2.197282
#>
#> sigma^2: 0.0087
#>
#> AIC AICc BIC
#> 657.4638 658.3462 668.9841
#>
#> Mean squared error of fit to data: 100.392
#> Forecast
#> --------
#> Month predicted upper lower width
#> 1 Jun 2024 170.5502 139.34620 201.7542 62.40802
#> 2 Jul 2024 172.7439 128.73308 216.7547 88.02166
#> 3 Aug 2024 174.9376 120.78206 229.0932 108.31110
#> 4 Sep 2024 177.1313 114.18778 240.0748 125.88705
#> 5 Oct 2024 179.3250 108.43963 250.2104 141.77074
#> 6 Nov 2024 181.5187 103.27572 259.7617 156.48595
#> 7 Dec 2024 183.7124 98.54096 268.8838 170.34285
#> 8 Jan 2025 185.9061 94.13481 277.6774 183.54255
#> 9 Feb 2025 188.0998 89.98780 286.2118 196.22395
#> 10 Mar 2025 190.2935 86.04969 294.5373 208.48757
#> 11 Apr 2025 192.4872 82.28278 302.6916 220.40879
#> 12 May 2025 194.6809 78.65797 310.7038 232.04580
#> 13 Jun 2025 196.8746 75.15229 318.5968 243.44455
#> 14 Jul 2025 199.0683 71.74723 326.3893 254.64204
#> 15 Aug 2025 201.2620 68.42766 334.0962 265.66858
#> 16 Sep 2025 203.4556 65.18100 341.7303 276.54930
#> 17 Oct 2025 205.6493 61.99669 349.3020 287.30531
#> 18 Nov 2025 207.8430 58.86576 356.8203 297.95456
#> 19 Dec 2025 210.0367 55.78053 364.2929 308.51241
#> 20 Jan 2026 212.2304 52.73436 371.7265 318.99214
#> 21 Feb 2026 214.4241 49.72147 379.1268 329.40532
#> 22 Mar 2026 216.6178 46.73678 386.4989 339.76208
#> 23 Apr 2026 218.8115 43.77583 393.8472 350.07137
#> 24 May 2026 221.0052 40.83466 401.1758 360.34110
Next, implement the classic Holt-Winters exponential smoothing method from the base R function HoltWinters().
Plot(Month, Price, ts_unit="months", ts_agg="mean", ts_ahead=24,
ts_source="classic")
#> [with functions from Ryan, Ulrich, Bennett, and Joy's xts package]
#> Smoothing Parameters
#> alpha: 0.7688 gamma: 1.000
#>
#> Mean squared error of fit to data: 161.98548
#>
#> Coefficients for Linear Trend
#> b0: 167.6257
#> s1: 1.10394 s2: 3.97746 s3: 9.5661 s4: 5.7359 s5: -2.83534 s6: 4.88025
#> s7: 12.19702 s8: 3.2572 s9: 4.12003 s10: -3.73196 s11: -1.51245 s12: 0.65792
#>
#> Forecast
#> --------
#> Month predicted upper lower width
#> 1 Jun 2024 168.7297 144.53356 192.9258 48.39223
#> 2 Jul 2024 171.6032 141.08261 202.1238 61.04119
#> 3 Aug 2024 177.1918 141.44884 212.9348 71.48599
#> 4 Sep 2024 173.3616 133.06751 213.6558 80.58825
#> 5 Oct 2024 164.7904 120.40942 209.1714 88.76196
#> 6 Nov 2024 172.5060 124.38400 220.6280 96.24398
#> 7 Dec 2024 179.8228 128.23031 231.4152 103.18490
#> 8 Jan 2025 170.8829 116.03921 225.7267 109.68748
#> 9 Feb 2025 171.7458 113.83299 229.6586 115.82557
#> 10 Mar 2025 163.8938 103.06660 224.7210 121.65435
#> 11 Apr 2025 166.1133 102.50512 229.7215 127.21636
#> 12 May 2025 168.2837 102.01108 234.5562 132.54516
#> 13 Jun 2025 168.7297 98.17823 239.2811 141.10290
#> 14 Jul 2025 171.6032 98.64046 244.5659 145.92549
#> 15 Aug 2025 177.1918 101.89498 252.4887 150.59372
#> 16 Sep 2025 173.3616 95.80088 250.9224 155.12152
#> 17 Oct 2025 164.7904 85.02997 244.5508 159.52086
#> 18 Nov 2025 172.5060 90.60495 254.4070 163.80209
#> 19 Dec 2025 179.8228 95.83564 263.8099 167.97424
#> 20 Jan 2026 170.8829 84.86033 256.9056 172.04524
#> 21 Feb 2026 171.7458 83.73472 259.7568 176.02211
#> 22 Mar 2026 163.8938 73.93823 253.8493 179.91110
#> 23 Apr 2026 166.1133 74.25441 257.9722 183.71778
#> 24 May 2026 168.2837 74.56008 262.0072 187.44717
more examples of exploratory and confirmatory factor analysis
Access the lessR data set Mach4 for the analysis of the responses of 351 people to the Mach IV scale. Read the optional variable labels. Including the item contents as variable labels means that the output of the confirmatory factor analysis contains the item content grouped by factor.
d <- Read("Mach4", quiet=TRUE)
l <- Read("Mach4_lbl", var_labels=TRUE)
#>
#> >>> Suggestions
#> Recommended binary format for data files: feather
#> Create with Write(d, "your_file", format="feather")
#> More details about your data, Enter: details() for d, or details(name)
#>
#> Data Types
#> ------------------------------------------------------------
#> character: Non-numeric data values
#> ------------------------------------------------------------
#>
#> Variable Missing Unique
#> Name Type Values Values Values First and last values
#> ------------------------------------------------------------------------------------------
#> 1 label character 20 0 20 Never tell anyone the real reason you did something unless it is useful to do so ... Most people forget more easily the death of a parent than the loss of their property
#> ------------------------------------------------------------------------------------------
Calculate the correlations and store in R.
R <- cr(m01:m20)
#>
#> >>> No missing data
#>
#>
#> Note: To provide more color separation for off-diagonal
#> elements, the diagonal elements of the matrix for
#> computing the heat map are set to 0.
The correlation matrix for the analysis is named R. This item (observed variable) correlation matrix is the numerical input into the factor analysis.
Here, do a four-factor solution with the default "promax" rotation. The default name of the input correlation matrix is mycor, so pass R explicitly. The abbreviation for corEFA() is efa().
efa(R, n_factors=4)
#> EXPLORATORY FACTOR ANALYSIS
#>
#> Loadings (except -0.2 to 0.2)
#> -------------------------------------
#> Factor1 Factor2 Factor3 Factor4
#> m06 0.828 -0.290
#> m07 0.712
#> m10 0.539
#> m03 0.422 0.318
#> m09 0.323
#> m05 0.649
#> m18 0.555 0.253
#> m13 0.543 0.226
#> m01 0.490
#> m12 0.434 -0.230
#> m08 0.236 -0.202
#> m14 0.402 0.991 -0.401
#> m04 0.426
#> m20 0.237 -0.282
#> m17 0.267
#> m19
#> m11 -0.299 0.309 -0.609
#> m16 0.274 -0.455
#> m02 -0.319
#> m15 -0.207 0.203 -0.214
#>
#> Sum of Squares
#> ------------------------------------------------
#> Factor1 Factor2 Factor3 Factor4
#> SS loadings 1.933 2.038 1.825 1.099
#> Proportion Var 0.097 0.102 0.091 0.055
#> Cumulative Var 0.097 0.199 0.290 0.345
#>
#> CONFIRMATORY FACTOR ANALYSIS CODE
#>
#> MeasModel <-
#> " F1 =~ m01 + m02 + m03 + m04 + m05
#> F2 =~ m06 + m07 + m08 + m09 + m10 + m11
#> F3 =~ m12 + m13 + m14 + m15
#> F4 =~ m17 + m18 + m19 + m20
#> "
#>
#> fit <- lessR::cfa(MeasModel)
#>
#> library(lavaan)
#> fit <- lavaan::cfa(MeasModel, data=d)
#> summary(fit, fit.measures=TRUE, standardized=TRUE)
#>
#> Deletion threshold: min_loading = 0.2
#> Deleted items: m16
The confirmatory factor analysis is of multiple-indicator measurement scales; that is, each item (observed variable) is assigned to only one factor. The solution method is centroid factor analysis.
Specify the measurement model for the analysis in Lavaan notation. Define four factors: Deceit, Trust, Cynicism, and Flattery.
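A sketch of such a specification, reusing the item-to-factor assignments that efa() generated above; the pairing of the four factor names with these particular item sets is illustrative, not the published scale definition:

MeasModel <-
" Deceit =~ m01 + m02 + m03 + m04 + m05
  Trust =~ m06 + m07 + m08 + m09 + m10 + m11
  Cynicism =~ m12 + m13 + m14 + m15
  Flattery =~ m17 + m18 + m19 + m20
"
fit <- lessR::cfa(MeasModel)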
Aggregate with pivot(). Any function that processes a single vector of data, such as a column of data values for a variable in a data frame, and outputs a single computed value, the statistic, can be passed to pivot(). Functions can be user-defined or built-in.
Here, compute the mean and standard deviation of the stock Price for each company in the StockPrice data set, included with lessR.
d <- Read("StockPrice", quiet=TRUE)
pivot(d, c(mean, sd), Price, by=Company)
#> Company n na Price_mean Price_sd
#> 1 Apple 487 0 28.426 55.821
#> 2 IBM 487 0 62.352 50.028
#> 3 Intel 487 0 16.824 14.495
Interpret this call to pivot() as: for the data frame d, compute the mean and the standard deviation of Price for each level of Company.
Select any two of the three possibilities for multiple parameter values: multiple compute functions, multiple variables over which to compute, and multiple categorical variables by which to define the groups for aggregation, as in the sketch below.
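For example, an assumed variation that computes one statistic for two variables within each group:

pivot(d, mean, c(Price, Volume), by=Company)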
Generate color scales with getColors(). The default output of getColors() is a color spectrum of 12 hcl colors presented in the order in which they are assigned to discrete levels of a categorical variable. For clarity in the following function call, the default value of the pal, or palette, parameter is explicitly set to its name, "hues".
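The output below presumably follows from that call:

getColors(pal="hues")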
#>
#> h hex r g b
#> -------------------------------
#> 1 240 #4398D0 67 152 208
#> 2 60 #B28B2A 178 139 42
#> 3 120 #5FA140 95 161 64
#> 4 0 #D57388 213 115 136
#> 5 275 #9A84D6 154 132 214
#> 6 180 #00A898 0 168 152
#> 7 30 #C97E5B 201 126 91
#> 8 90 #909711 144 151 17
#> 9 210 #00A3BA 0 163 186
#> 10 330 #D26FAF 210 111 175
#> 11 150 #00A76F 0 167 111
#> 12 300 #BD76CB 189 118 203
lessR provides pre-defined sequential color scales across the range of hues around the color wheel in 30 degree increments: "reds", "rusts", "browns", "olives", "greens", "emeralds", "turquoises", "aquas", "blues", "purples", "violets", "magentas", and "grays".
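For example, the sequential blue palette shown below presumably comes from:

getColors("blues")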
#>
#> h hex r g b
#> -------------------------------
#> 1 240 #CCECFFFF 204 236 255
#> 2 240 #B4D8FCFF 180 216 252
#> 3 240 #9DC5EBFF 157 197 235
#> 4 240 #84B2DBFF 132 178 219
#> 5 240 #6B9FCCFF 107 159 204
#> 6 240 #4F8DBCFF 79 141 188
#> 7 240 #2D7CAEFF 45 124 174
#> 8 240 #006BA0FF 0 107 160
#> 9 240 #005B93FF 0 91 147
#> 10 240 #004C8AFF 0 76 138
#> 11 240 #004087FF 0 64 135
#> 12 240 #0040A9FF 0 64 169
To create a divergent color palette, specify a beginning and an ending color palette, which provide the values for the parameters pal and end_pal, where pal abbreviates palette. Here, generate colors from rust to blue.
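Presumably a call of this form, with the rust palette as pal and the blue palette as end_pal:

getColors("rusts", end_pal="blues")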
#>
#> color r g b
#> ----------------------
#> #70370FFF 112 55 15
#> #7D4A32FF 125 74 50
#> #8B5F4DFF 139 95 77
#> #997568FF 153 117 104
#> #A88E86FF 168 142 134
#> #B9ADAAFF 185 173 170
#> #AAB0B8FF 170 176 184
#> #8595A7FF 133 149 167
#> #658099FF 101 128 153
#> #466D8DFF 70 109 141
#> #1D5C83FF 29 92 131
#> #004D7AFF 0 77 122
lessR provides several utility functions for recoding, reshaping, and rescaling data.