Choosing the support spaces

Jorge Cabral

Introduction

The analysis performed in “Generalized Maximum Entropy framework” makes it evident that the estimates obtained via the Generalized Cross Entropy framework depend on the choice of the support spaces. In fact, the restrictions imposed on the parameter space through \(Z\) should reflect prior knowledge about the unknown parameters. However, such knowledge is not always available and there is no clear answer as to how those support spaces should be defined.

Prior-informed support space construction

If sufficient prior information is available, the support space can be constructed around a pre-estimate or prior mean, making the representation more efficient. — Golan et al. [1]

Preliminary estimates (e.g., from OLS or Ridge regression) can guide the center and/or range of the support points for the unknown parameters.

Ridge

The Ridge regression introduced by Hoerl and Kennard [2] is an estimation procedure to handle collinearity without removing variables from the regression model. By adding a small non-negative constant (the ridge or shrinkage parameter) to the diagonal of the correlation matrix of the explanatory variables, it is possible to reduce the variance of the OLS estimator through the introduction of some bias. Although the resulting estimators are biased, the biases are small enough for these estimators to be substantially more precise than the unbiased estimators. The challenge in Ridge regression lies in the selection of the ridge parameter. One straightforward approach is simply to plot the coefficients against several possible values of the ridge parameter and inspect the resulting traces.

The Ridge regression estimator of \(\boldsymbol{\beta}\) takes the form \[\begin{align} \widehat{\boldsymbol{\beta}}^{ridge}&= \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2+\lambda \|\boldsymbol{\beta}\|^2 \\ &=(\mathbf{X}'\mathbf{X}+\lambda \mathbf{I})^{-1}\mathbf{X}'\mathbf{y}, \end{align}\] where \(\lambda \geq 0\) denotes the ridge parameter and \(\mathbf{I}\) is a \(((K+1) \times (K+1))\) identity matrix. Note that when \(\lambda \rightarrow 0\) the Ridge regression estimator approaches the OLS estimator, whereas when \(\lambda \rightarrow \infty\) it approaches the zero vector. Thus, a trade-off between variance and bias is needed.
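As a minimal illustration of the closed-form expression above (not part of the GCEstim workflow, which uses ridgetrace() below), the estimator can be computed directly; note that, as written, the penalty is also applied to the intercept.

ridge.coef <- function(X, y, lambda) {
  # design matrix with an intercept column; the (K + 1) x (K + 1) identity
  # below penalizes the intercept as well, exactly as in the formula above
  X <- cbind(`(Intercept)` = 1, as.matrix(X))
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}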

Ridge preliminary estimates can be obtained (choosing the ridge parameter according to a given rule) and used to define, for instance, zero-centered support spaces [3]. Macedo et al. [4,5] suggest defining \(Z\) uniformly and symmetrically around zero, with limits established by the maximum absolute values of the ridge estimates. Those maximum absolute values are taken from the ridge trace computed over a vector of penalization parameters, usually between \(0\) and \(1\). This procedure is called RidGME.
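To make the idea concrete, the following sketch computes a ridge trace over a grid of penalization parameters between \(0\) and \(1\) and takes the maximum absolute value of each coefficient as a symmetric support limit. It uses MASS::lm.ridge() on toy data purely for illustration; the ridgetrace() function from GCEstim, used in the remainder of this vignette, is the intended tool and the two need not produce identical traces.

library(MASS)  # lm.ridge()

set.seed(1)
toy <- data.frame(y = rnorm(50), X1 = rnorm(50), X2 = rnorm(50))  # toy data, for illustration only

lambdas <- seq(0, 1, by = 0.01)
rt.toy <- lm.ridge(y ~ ., data = toy, lambda = lambdas)
# coef() returns one row of (unscaled) coefficients per value of lambda
limits <- apply(abs(coef(rt.toy)), 2, max)
cbind(lower = -limits, upper = limits)  # symmetric, zero-centered support limits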

Consider dataGCE (see “Generalized Maximum Entropy framework”).

coef.dataGCE <- c(1, 0, 0, 3, 6, 9)

Suppose we want to obtain the estimated model

\[\begin{equation} \widehat{\mathbf{y}}=\widehat{\beta_0} + \widehat{\beta_1}\mathbf{X001} + \widehat{\beta_2}\mathbf{X002} + \widehat{\beta_3}\mathbf{X003} + \widehat{\beta_4}\mathbf{X004} + \widehat{\beta_5}\mathbf{X005}. \end{equation}\]

In order to define the support spaces, let us obtain the ridge trace with the function ridgetrace(), setting the model formula and the data.

res.rt.01 <- 
  ridgetrace(
    formula = y ~ X001 + X002 + X003 + X004 + X005,
    data = dataGCE)

Since in our example the true parameters are known, we can add them to the ridge trace plot using the argument coef = coef.dataGCE of the plot function.

plot(res.rt.01, coef = coef.dataGCE)

The estimated Ridge coefficients that produce the lowest 5-fold cross-validation RMSE (by default, errormeasure = "RMSE", cv = TRUE, and cv.nfolds = 5) are

res.rt.01
#> 
#> Call:
#> ridgetrace(formula = y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE)
#> 
#> Coefficients:
#> (Intercept)         X001         X002         X003         X004         X005  
#>      1.0262      -0.1057       1.3074       3.0304       7.7508      10.8430

Yet, we are interested in the maximum absolute values, and those values can be obtained by setting the argument which = "max.abs" in the coef function.

coef(res.rt.01, which = "max.abs")
#> (Intercept)        X001        X002        X003        X004        X005 
#>    1.029594    1.083863    4.203592    3.574235    8.981052   12.021861

Note that the maximum absolute value of each estimate is greater than the true absolute value of the parameter (represented by the horizontal dashed lines in the plot).

coef(res.rt.01, which = "max.abs") > abs(c(1, 0, 0, 3, 6, 9))
#> (Intercept)        X001        X002        X003        X004        X005 
#>        TRUE        TRUE        TRUE        TRUE        TRUE        TRUE

Given this information, and if one wants to have zero-centered symmetric supports, it is possible to define, for instance, the following:

\(\mathbf{z}_0'= \left[ -1.029594, -1.029594/2, 0, 1.029594/2, 1.029594\right]\),
\(\mathbf{z}_1'= \left[ -1.083863, -1.083863/2, 0, 1.083863/2, 1.083863\right]\),
\(\mathbf{z}_2'= \left[ -4.203592, -4.203592/2, 0, 4.203592/2, 4.203592\right]\),
\(\mathbf{z}_3'= \left[ -3.574235, -3.574235/2, 0, 3.574235/2, 3.574235\right]\),
\(\mathbf{z}_4'= \left[ -8.981052, -8.981052/2, 0, 8.981052/2, 8.981052\right]\), and
\(\mathbf{z}_5'= \left[ -12.021861, -12.021861/2, 0, 12.021861/2, 12.021861\right]\).
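A small sketch of how these five-point supports could be built from the maximum absolute values (the matrix actually passed to lmgce below specifies only the lower and upper limits):

max.abs <- coef(res.rt.01, which = "max.abs")
# one row per coefficient: [-m, -m/2, 0, m/2, m]
t(sapply(max.abs, function(m) m * c(-1, -0.5, 0, 0.5, 1)))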


(RidGME.support <- 
  matrix(c(-coef(res.rt.01, which = "max.abs"),
           coef(res.rt.01, which = "max.abs")),
         ncol = 2,
         byrow = FALSE))
#>            [,1]      [,2]
#> [1,]  -1.029594  1.029594
#> [2,]  -1.083863  1.083863
#> [3,]  -4.203592  4.203592
#> [4,]  -3.574235  3.574235
#> [5,]  -8.981052  8.981052
#> [6,] -12.021861 12.021861

Using lmgce and setting support.signal = RidGME.support, it is possible to obtain the desired model

res.lmgce.RidGME <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    support.signal = RidGME.support,
    twosteps.n = 0
  )

Alternatively, it is possible to use the lmgce function directly, setting support.method = "ridge" and support.signal = 1. In this case, the support spaces are calculated internally.

res.lmgce.RidGME <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    support.method = "ridge",
    support.signal = 1,
    twosteps.n = 0
  )

The estimated GME coefficients with prior Ridge information are \(\widehat{\boldsymbol{\beta}}^{GME_{(RidGME)}}=\) (0.858, 0.116, -1.794, 0.601, 2.478, 6.275).

The prediction error is \(RMSE_{\mathbf{\hat y}}^{GME_{(RidGME)}} \approx\) 0.459, the cross-validation prediction error is \(CV\text{-}RMSE_{\mathbf{\hat y}}^{GME_{(RidGME)}} \approx\) 0.518, and the precision error is \(RMSE_{\boldsymbol{\hat\beta}}^{GME_{(RidGME)}} \approx\) 2.192.
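These values can, in principle, be reproduced from the fitted object; the sketch below assumes a fitted() method is available for lmgce objects (an assumption; otherwise the fitted values can be extracted from the object itself) and uses GCEstim::accmeasure as in the remainder of this vignette.

# prediction RMSE
GCEstim::accmeasure(fitted(res.lmgce.RidGME), dataGCE$y, which = "RMSE")
# precision RMSE
GCEstim::accmeasure(coef(res.lmgce.RidGME), coef.dataGCE, which = "RMSE")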

We can compare these results with the ones from the “Generalized Maximum Entropy framework” vignette.

\(OLS\) \(GME_{(100000)}\) \(GME_{(100)}\) \(GME_{(50)}\) \(GME_{(RidGME)}\)
Prediction RMSE 0.405 0.405 0.407 0.411 0.459
Prediction CV-RMSE 0.436 0.436 0.437 0.441 0.518
Precision RMSE 5.825 5.809 1.597 1.575 2.192

Although we did not have to blindly define the support spaces, the results are not very reassuring and a different strategy should be pursued. Furthermore, if we consider the data set dataincRidGME and want to obtain the estimated model

\[\begin{equation} \widehat{\mathbf{y}}=\widehat{\beta_0} + \widehat{\beta_1}\mathbf{X001} + \widehat{\beta_2}\mathbf{X002} + \widehat{\beta_3}\mathbf{X003} + \widehat{\beta_4}\mathbf{X004} + \widehat{\beta_5}\mathbf{X005} + \widehat{\beta_6}\mathbf{X006}. \end{equation}\]

we note that now not all the maximum absolute values of the estimates are greater than the true absolute values of the parameters.

res.rt.02 <- 
  ridgetrace(
    formula = y ~ .,
    data = dataincRidGME)
coef.dataincRidGME <- c(2.5, rep(0, 3), c(-8, 19, -13))
plot(res.rt.02, coef = coef.dataincRidGME)

coef(res.rt.02, which = "max.abs") > abs(coef.dataincRidGME)
#> (Intercept)        X001        X002        X003        X004        X005 
#>       FALSE        TRUE        TRUE        TRUE       FALSE        TRUE 
#>        X006 
#>        TRUE

If we use the maximum absolute values to define the support spaces, we exclude the true value of the parameter in two of them. To avoid that, we can broaden the support spaces by a factor greater than \(1\), for instance \(2\). That can be done by setting support.signal = 2 in lmgce.


res.lmgce.RidGME.02.alpha1 <-
  GCEstim::lmgce(
    y ~ .,
    data = dataincRidGME,
    support.method = "ridge",
    support.signal = 1,
    twosteps.n = 0
  )

res.lmgce.RidGME.02.alpha2 <-
  GCEstim::lmgce(
    y ~ .,
    data = dataincRidGME,
    support.method = "ridge",
    support.signal = 2,
    twosteps.n = 0
  )
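As a cross-check, the broadened supports could also be passed explicitly, multiplying the maximum absolute ridge estimates by \(2\); this assumes (not verified here) that lmgce handles an explicit support matrix and support.method = "ridge" with a numeric support.signal equivalently.

RidGME.support.02 <- cbind(-coef(res.rt.02, which = "max.abs"),
                           coef(res.rt.02, which = "max.abs"))
res.lmgce.RidGME.02.matrix <-
  GCEstim::lmgce(
    y ~ .,
    data = dataincRidGME,
    support.signal = 2 * RidGME.support.02,
    twosteps.n = 0
  )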

From the summaries we can confirm that both the prediction error and the cross-validation prediction error are smaller when the maximum absolute values are multiplied by \(2\).

summary(res.lmgce.RidGME.02.alpha1)$error.measure
#> [1] 2.764785
summary(res.lmgce.RidGME.02.alpha2)$error.measure
#> [1] 2.591847
summary(res.lmgce.RidGME.02.alpha1)$error.measure.cv.mean
#> [1] 2.862672
summary(res.lmgce.RidGME.02.alpha2)$error.measure.cv.mean
#> [1] 2.678707

The precision error is also smaller.

round(GCEstim::accmeasure(coef(res.lmgce.RidGME.02.alpha1), coef.dataincRidGME, which = "RMSE"), 3) 
#> [1] 4.214
round(GCEstim::accmeasure(coef(res.lmgce.RidGME.02.alpha2), coef.dataincRidGME, which = "RMSE"), 3) 
#> [1] 3.312

But, since the true values of the parameters are generally unknown, we also cannot know by which factor the maximum absolute values should be multiplied. To make matters more complicated, in some situations a “better” estimation is obtained when the factor is between \(0\) and \(1\). We might as well, then, test different values of the factor and choose, for instance, the one with the lowest k-fold cross-validation error. By default, support.signal.vector.n = 20 values logarithmically spaced between support.signal.vector.min = 0.3 and support.signal.vector.max = 20 are tested in a cv.nfolds = 5 fold cross-validation (CV) scenario, and the chosen factor is the one that produces a CV errormeasure = "RMSE" not greater than the minimum CV-RMSE plus one standard error (errormeasure.which = "1se").
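A minimal sketch of the one-standard-error rule just described (not the internal code of lmgce): given the mean CV error and its standard error for each candidate factor, keep the factors whose error does not exceed the minimum plus one standard error and pick one conservatively; the tie-breaking towards the smallest factor (narrowest supports) is an assumption here.

# candidate factors, logarithmically spaced as in the default settings
factors <- exp(seq(log(0.3), log(20), length.out = 20))

choose.factor.1se <- function(cv.mean, cv.se, factors) {
  i.min <- which.min(cv.mean)                        # lowest CV error
  ok <- cv.mean <= cv.mean[i.min] + cv.se[i.min]     # within one standard error of the minimum
  min(factors[ok])                                   # assumption: narrowest support wins
}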

res.lmgce.RidGME.02 <-
  GCEstim::lmgce(
    y ~ .,
    data = dataincRidGME,
    support.method = "ridge",
    twosteps.n = 0
  )

With plot it is possible to visualize how the CV-error changes with the different factors used to multiply the maximum absolute values given by the ridge trace.

plot(res.lmgce.RidGME.02, which = 2, NormEnt = FALSE)[[1]]

Red dots represent the CV-error and whiskers have the length of two standard errors for each of the 20 support spaces. The dotted horizontal line is the OLS CV-error. The black vertical dotted line corresponds to the support spaces that produced the lowest CV-error. The black vertical dashed line corresponds to the support spaces that produced the 1se CV-error. The red vertical dotted line corresponds to the support spaces that produced the elbow CV-error.

summary(res.lmgce.RidGME.02)
#> 
#> Call:
#> GCEstim::lmgce(formula = y ~ ., data = dataincRidGME, support.method = "ridge", 
#>     twosteps.n = 0)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -6.6901 -1.3847  0.3723  2.4831  7.2108 
#> 
#> Coefficients:
#>             Estimate Std. Deviation z value Pr(>|t|)    
#> (Intercept)  1.77244        0.12563  14.108  < 2e-16 ***
#> X001         0.76265        7.60140   0.100 0.920082    
#> X002         0.33140        8.33267   0.040 0.968275    
#> X003         0.01588        4.42229   0.004 0.997136    
#> X004        -0.30815        5.80985  -0.053 0.957700    
#> X005        18.14404        4.70974   3.852 0.000117 ***
#> X006        -5.86246       17.21197  -0.341 0.733402    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Normalized Entropy:
#>              NormEnt  SupportLL SupportUL
#> (Intercept) 0.671333  -2.593417  2.593417
#> X001        0.992142  -6.790953  6.790953
#> X002        0.994869  -3.650083  3.650083
#> X003        0.999982  -2.969648  2.969648
#> X004        0.997433  -4.796338  4.796338
#> X005        0.626514 -25.138406 25.138406
#> X006        0.932470 -17.996584 17.996584
#> 
#> Residual standard error: 2.787 on 113 degrees of freedom
#> Chosen factor for the Upper Limit of the Supports: 1.13, Chosen Error: 1se
#> Multiple R-squared: 0.4018, Adjusted R-squared:  0.37
#> NormEnt: 0.8878, CV-NormEnt: 0.8967 (0.01176)
#> RMSE: 2.705, CV-RMSE:   2.8 (0.2471)

Note that the prediction errors are worse than the ones obtained when the factor was \(2\), because the 1se error was chosen. In this case, the precision error is also not the best one.

plot(res.lmgce.RidGME.02, which = 5, NormEnt = FALSE, coef = coef.dataincRidGME)[[1]]

From the above plot, it seems that we should have chosen errormeasure.which = "min". That can, of course, be done using

res.lmgce.RidGME.02.min <-
  GCEstim::lmgce(
    y ~ .,
    data = dataincRidGME,
    support.method = "ridge",
    errormeasure.which = "min",
    twosteps.n = 0
  )

but that implies a complete re-estimation and can be very time-consuming. Since the results for all the evaluated support spaces are stored, choosing a different support should instead be done with changesupport

res.lmgce.RidGME.02.min <- 
  changesupport(res.lmgce.RidGME.02, "min")

From the summary we can conclude that the lowest prediction errors were obtained.

summary(res.lmgce.RidGME.02.min)
#> 
#> Call:
#> GCEstim::lmgce(formula = y ~ ., data = dataincRidGME, support.method = "ridge", 
#>     twosteps.n = 0)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7.0957 -1.7710 -0.0411  1.5043  6.0953 
#> 
#> Coefficients:
#>             Estimate Std. Deviation z value Pr(>|t|)    
#> (Intercept)   2.2149         0.1190  18.607  < 2e-16 ***
#> X001          2.5974         7.2020   0.361    0.718    
#> X002          0.2105         7.8949   0.027    0.979    
#> X003         -0.6147         4.1899  -0.147    0.883    
#> X004         -1.6352         5.5046  -0.297    0.766    
#> X005         21.4375         4.4623   4.804 1.55e-06 ***
#> X006        -10.8053        16.3076  -0.663    0.508    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Normalized Entropy:
#>              NormEnt  SupportLL SupportUL
#> (Intercept) 0.949413  -7.831890  7.831890
#> X001        0.989998 -20.508077 20.508077
#> X002        0.999773 -11.022927 11.022927
#> X003        0.997077  -8.968074  8.968074
#> X004        0.992060 -14.484515 14.484515
#> X005        0.949564 -75.915762 75.915762
#> X006        0.975226 -54.348090 54.348090
#> 
#> Residual standard error: 2.643 on 113 degrees of freedom
#> Chosen factor for the Upper Limit of the Supports: 3.4125, Chosen Error: min
#> Multiple R-squared: 0.5304, Adjusted R-squared: 0.5055
#> NormEnt: 0.979, CV-NormEnt: 0.9794 (0.001428)
#> RMSE: 2.565, CV-RMSE:  2.66 (0.4113)

The precision error is also the best one.

round(GCEstim::accmeasure(coef(res.lmgce.RidGME.02), coef.dataincRidGME, which = "RMSE"), 3) 
#> [1] 4.001

round(GCEstim::accmeasure(coef(res.lmgce.RidGME.02.min), coef.dataincRidGME, which = "RMSE"), 3) 
#> [1] 2.891

If we go back to our first example and use this last approach, called incRidGME,

res.lmgce.RidGME.01.1se <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    support.method = "ridge",
    twosteps.n = 0
  )
res.lmgce.RidGME.01.min <- changesupport(res.lmgce.RidGME.01.1se, "min")

we can see an improvement over the RidGME. In particular, when the 1se error is selected, the bias-variance trade-off seems more appropriate than when the min error is chosen.

\(OLS\) \(GME_{(100000)}\) \(GME_{(100)}\) \(GME_{(50)}\) \(GME_{(RidGME)}\) \(GME_{(incRidGME_{1se})}\) \(GME_{(incRidGME_{min})}\)
Prediction RMSE 0.405 0.405 0.407 0.411 0.459 0.423 0.411
Prediction CV-RMSE 0.436 0.436 0.437 0.441 0.518 0.450 0.424
Precision RMSE 5.825 5.809 1.597 1.575 2.192 2.018 1.589

Standardization

Since all parameter estimation methods have some drawbacks, we can try to avoid a pre-estimation step when defining the support spaces. Consider the model in (1) (see “Generalized Maximum Entropy framework”). It can be written as
\[\begin{align} \qquad \qquad \mathbf{y} &= \beta_0 + \beta_1 \mathbf{x_{1}} + \beta_2 \mathbf{x_{2}} + \dots + \beta_K \mathbf{x_{K}} + \boldsymbol{\epsilon}, \qquad \qquad (2) \end{align}\]

Standardizing \(y\) and \(x_j\), the model in (2) is rewritten as
\[\begin{align} y^* &= X^*b + \epsilon^*,\\ y^* &= b_1x_1^* + b_2x_2^* + \dots + b_Kx_K^* + \epsilon^*, \end{align}\] where \[\begin{align} y_i^*&=\frac{y_i-\frac{\sum_{i=1}^{N}y_i}{N}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y_i-\frac{\sum_{i=1}^{N}y_i}{N}\right)^2}},\\ x_{ji}^*&=\frac{x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}\right)^2}},\\ b_j&=\frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( x_{ji}-\frac{\sum_{i=1}^{N}x_{ji}}{N}\right)^2}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( y_i-\frac{\sum_{i=1}^{N}y_i}{N}\right)^2}}\beta_j, \end{align}\] with \(j\in \left\lbrace 1,\dots,K\right\rbrace\), and \(i \in \left\lbrace 1,\dots,N\right\rbrace\). In this formulation, \(b_j\) are called standardized coefficients.
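As a quick numerical check of the relation \(b_j=\frac{s_{x_j}}{s_y}\beta_j\), using plain OLS outside the GCEstim workflow (scale() uses the sample standard deviation, but the ratio \(s_{x_j}/s_y\) is unaffected by the choice of denominator):

fit.orig <- lm(y ~ X001 + X002 + X003 + X004 + X005, data = dataGCE)
dataGCE.std <- as.data.frame(scale(dataGCE))       # standardize y and the x's (all columns are numeric)
fit.std <- lm(y ~ . - 1, data = dataGCE.std)       # the standardized model has no intercept

s.x <- sapply(dataGCE[, paste0("X00", 1:5)], sd)
s.y <- sd(dataGCE$y)
cbind(rescaled = coef(fit.orig)[-1] * s.x / s.y,   # b_j obtained from the original-scale fit
      standardized = coef(fit.std)[paste0("X00", 1:5)])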

Although not bounded, standardized coefficients greater than \(1\) in magnitude tend to occur with low frequency, mainly in extremely ill-conditioned problems. Given this, one can define zero-centered support spaces for the standardized variables, symmetrically bounded by a “small” number (or vector of numbers), and then revert the support spaces to the original scale. By doing so, no pre-estimation is performed. lmgce uses this approach by default (support.method = "standardized").
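The rescaling back to the original scale follows from the same relation: a symmetric standardized support \(\left[-d, d\right]\) for \(b_j\) corresponds to \(\left[-d\,s_y/s_{x_j},\; d\,s_y/s_{x_j}\right]\) for \(\beta_j\). The sketch below illustrates this with \(d = 1\); the exact internal computation of lmgce (including the treatment of the intercept) is an assumption here.

d <- 1                                             # "small" bound on the standardized scale
s.x <- sapply(dataGCE[, paste0("X00", 1:5)], sd)
s.y <- sd(dataGCE$y)
cbind(lower = -d * s.y / s.x, upper = d * s.y / s.x)   # supports reverted to the original scale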

res.lmgce.1se <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    twosteps.n = 0
  )
#> Warning in GCEstim::lmgce(y ~ ., data = dataGCE, twosteps.n = 0): 
#> 
#> The minimum error was found for the highest upper limit of the support. Confirm if higher values should be tested.

We can also choose the support space that produced the lowest CV-error.

res.lmgce.min <- changesupport(res.lmgce.1se, "min")
summary(res.lmgce.1se)
#> 
#> Call:
#> GCEstim::lmgce(formula = y ~ ., data = dataGCE, twosteps.n = 0)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.08091 -0.20053  0.06421  0.39002  1.01289 
#> 
#> Coefficients:
#>              Estimate Std. Deviation z value Pr(>|t|)    
#> (Intercept) 0.9491701      0.0210431  45.106  < 2e-16 ***
#> X001        0.0003152      0.2716887   0.001  0.99907    
#> X002        0.1387515      2.5683772   0.054  0.95692    
#> X003        2.3460379      1.4284691   1.642  0.10052    
#> X004        6.1456104      3.2029279   1.919  0.05502 .  
#> X005        9.4138450      3.0664655   3.070  0.00214 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Normalized Entropy:
#>              NormEnt  SupportLL SupportUL
#> (Intercept) 0.669543  -1.385564  1.385564
#> X001        1.000000 -40.044509 40.044509
#> X002        0.999996 -53.653819 53.653819
#> X003        0.999167 -64.082185 64.082185
#> X004        0.996515 -82.107203 82.107203
#> X005        0.986704 -64.503624 64.503624
#> 
#> Residual standard error: 0.4289 on 94 degrees of freedom
#> Chosen Upper Limit for Standardized Supports: 6.623, Chosen Error: 1se
#> Multiple R-squared: 0.8205, Adjusted R-squared: 0.811
#> NormEnt: 0.942, CV-NormEnt: 0.9474 (0.02795)
#> RMSE: 0.4158, CV-RMSE: 0.4559 (0.06669)
summary(res.lmgce.min)
#> 
#> Call:
#> GCEstim::lmgce(formula = y ~ ., data = dataGCE, twosteps.n = 0)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.11689 -0.29240  0.01783  0.33891  0.99409 
#> 
#> Coefficients:
#>             Estimate Std. Deviation z value Pr(>|t|)    
#> (Intercept)  0.99575        0.02056  48.436  < 2e-16 ***
#> X001        -0.45746        0.26543  -1.724 0.084797 .  
#> X002         5.28770        2.50917   2.107 0.035087 *  
#> X003         5.22004        1.39554   3.741 0.000184 ***
#> X004        12.67322        3.12909   4.050 5.12e-05 ***
#> X005        15.60450        2.99578   5.209 1.90e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Normalized Entropy:
#>              NormEnt   SupportLL  SupportUL
#> (Intercept) 0.865202   -2.190354   2.190354
#> X001        0.999991 -120.925588 120.925588
#> X002        0.999338 -162.022706 162.022706
#> X003        0.999548 -193.514073 193.514073
#> X004        0.998376 -247.945652 247.945652
#> X005        0.996007 -194.786726 194.786726
#> 
#> Residual standard error: 0.4187 on 94 degrees of freedom
#> Chosen Upper Limit for Standardized Supports: 20, Chosen Error: min
#> Multiple R-squared: 0.8295, Adjusted R-squared: 0.8204
#> NormEnt: 0.9764, CV-NormEnt: 0.9775 (0.0007956)
#> RMSE: 0.406, CV-RMSE: 0.4309 (0.07052)

And we can do a final comparison between methods.

\(OLS\) \(GME_{(RidGME)}\) \(GME_{(incRidGME_{1se})}\) \(GME_{(incRidGME_{min})}\) \(GME_{(std_{1se})}\) \(GME_{(std_{min})}\)
Prediction RMSE 0.405 0.459 0.423 0.411 0.416 0.406
Prediction CV-RMSE 0.436 0.518 0.450 0.424 0.456 0.431
Precision RMSE 5.809 2.192 2.018 1.589 0.327 4.495

The precision error obtained with the 1se rule and support spaces defined by standardized bounds was the best, at a small cost in prediction error.

Conclusion

The choice of the support spaces is crucial for an accurate estimation of the regression parameters. Prior information can be used to define those support spaces. That information can be theoretical, can be obtained from previous regression models, or can rely on the distribution of standardized regression coefficients. From our analysis, the last approach produces good results.

References

1.
Golan A, Judge GG, Miller D. Maximum Entropy Econometrics : Robust Estimation with Limited Data. Wiley; 1996.
2.
Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12:55-67. doi:10.1080/00401706.1970.10488634
3.
Cabral J, Macedo P, Marques A, Afreixo V. Comparison of feature selection methods—modelling COPD outcomes. Mathematics. 2024;12:1398. doi:10.3390/math12091398
4.
Macedo P, Costa MC, Cruz JP. Normalized entropy: A comparison with traditional techniques in variable selection. 2022:190002. doi:10.1063/5.0081504
5.
Macedo P, Cabral J, Afreixo V, Macedo F, Angelelli M. RidGME estimation and inference in ill-conditioned models. In: Gervasi O, Murgante B, Garau C, et al., eds. Computational Science and Its Applications – ICCSA 2025 Workshops. Springer Nature Switzerland; 2025:300-313.

Acknowledgements

This work was supported by Fundação para a Ciência e Tecnologia (FCT) through CIDMA and projects https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020.
