
The task of model selection targets the question: If there are several competing models, how do I choose the most appropriate one? This vignette outlines the model selection tools implemented in RprobitB.

For illustration, we revisit the probit model of travelers deciding between two fictional train route alternatives from the vignette on model fitting:

form <- choice ~ price + time + change + comfort | 0
data <- prepare_data(form = form, choice_data = train_choice, id = "deciderID", idc = "occasionID")
model_train <- fit_model(
  data = data,
  scale = "price := -1",
  R = 1000, B = 500, Q = 10
)

As a competing model, we consider explaining the choices solely by the alternatives' price, i.e., the probit model with the formula choice ~ price | 0:

model_train_sparse <- update(model_train, form = choice ~ price | 0)

The update() function estimates a new version of model_train under new specifications; here, only form has been changed.

The model_selection() function

RprobitB provides the convenience function model_selection(), which takes an arbitrary number of RprobitB_fit objects and returns a matrix of model selection criteria:

model_selection(model_train, model_train_sparse)
#>      model_train model_train_sparse
#> npar           4                  1
#> LL      -1727.75           -1865.86
#> AIC      3463.49            3733.73
#> BIC      3487.42            3739.71

Specifying criteria is optional. By default, criteria = c("npar", "LL", "AIC", "BIC"). The argument can also be restricted to any subset of the available criteria, which are explained in the following.
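
For example, to compare only the number of parameters and the AIC (output omitted):

model_selection(model_train, model_train_sparse, criteria = c("npar", "AIC"))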

npar

"npar" yields the number of model parameters, which is computed by the npar() method:

npar(model_train, model_train_sparse)
#> [1] 4 1

Here, model_train has 4 parameters (one coefficient each for price, time, change, and comfort), while model_train_sparse has only a single price coefficient.

LL

If "LL" is included in criteria, model_selection() returns the model’s log-likelihood values. They can also be directly accessed via the logLik() method:3

logLik(model_train)
#> 'log Lik.' -1727.746 (df=4)
logLik(model_train_sparse)
#> 'log Lik.' -1865.865 (df=1)

AIC

Including "AIC" yields the Akaike’s Information Criterion (Akaike 1974), which is computed as 2LL+knpar,-2 \cdot \text{LL} + k \cdot \text{npar}, where LL\text{LL} is the model’s log-likelihood value, kk is the penalty per parameter with k=2k = 2 per default for the classical AIC, and npar\text{npar} is the number of parameters in the fitted model.

Alternatively, the AIC() method also returns the AIC values:

AIC(model_train, model_train_sparse, k = 2)
#>                    df      AIC
#> model_train         4 3463.492
#> model_train_sparse  1 3733.729

The AIC quantifies the trade-off between over- and under-fitting, where smaller values are preferred. Here, the increase in goodness of fit justifies the additional 3 parameters of model_train.

BIC

Similar to the AIC, "BIC" yields the Bayesian Information Criterion (Schwarz 1978), which is defined as $-2 \cdot \text{LL} + \log(\text{nobs}) \cdot \text{npar}$, where $\text{LL}$ is the model's log-likelihood value, $\text{nobs}$ is the number of data points (which can be accessed via the nobs() method), and $\text{npar}$ is the number of parameters in the fitted model. The interpretation is analogous to the AIC.
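
As with the AIC, the BIC value can be reproduced from the log-likelihood and the number of observations; again a minimal sketch with base R generics:

ll <- logLik(model_train)                                      # log-likelihood with "df" attribute
-2 * as.numeric(ll) + log(nobs(model_train)) * attr(ll, "df")  # reproduces BIC(model_train)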

RprobitB also provides a BIC() method:

BIC(model_train, model_train_sparse)
#>                    df      BIC
#> model_train         4 3487.422
#> model_train_sparse  1 3739.712

WAIC (with se(WAIC) and pWAIC)

WAIC is short for Widely Applicable (or Watanabe-Akaike) Information Criterion (Watanabe and Opper 2010). As for the AIC and BIC, smaller WAIC values indicate a better model. Including "WAIC" in criteria yields the WAIC value, its standard error se(WAIC), and the effective number of parameters pWAIC; see below.

The WAIC is defined as

$$-2 \cdot \text{lppd} + 2 \cdot p_\text{WAIC},$$

where $\text{lppd}$ stands for log pointwise predictive density and $p_\text{WAIC}$ is a penalty term proportional to the variance in the posterior distribution that is sometimes called the effective number of parameters; see McElreath (2020), p. 220, for a reference.

The $\text{lppd}$ is approximated as follows. Let $p_{si} = \Pr(y_i \mid \theta_s)$ be the probability of observation $y_i$ (here the single choices) given the $s$-th set $\theta_s$ of parameter samples from the posterior. Then

$$\text{lppd} = \sum_i \log \left( S^{-1} \sum_s p_{si} \right).$$

The penalty term is computed as the sum over the variances in log-probability for each observation:

$$p_\text{WAIC} = \sum_i \mathbb{V}_{\theta} \log (p_{si}).$$

The WAIC has a standard error of

$$\sqrt{n \cdot \mathbb{V}_i \left[ -2 \left( \text{lppd}_i - \mathbb{V}_{\theta} \log (p_{si}) \right) \right]},$$

where $\text{lppd}_i = \log \left( S^{-1} \sum_s p_{si} \right)$ denotes the pointwise contribution to $\text{lppd}$ and $n$ is the number of choices.
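
To illustrate these definitions, the following sketch computes the WAIC and its standard error by hand from a hypothetical matrix p_si with one row per posterior draw $s$ and one column per observation $i$ (the object returned by compute_p_si() may be organized differently):

# hypothetical S x n matrix of likelihood values Pr(y_i | theta_s)
S <- 100; n <- 50
p_si <- matrix(runif(S * n, 0.2, 0.9), nrow = S, ncol = n)
lppd_i <- log(colMeans(p_si))            # pointwise log predictive density
p_waic_i <- apply(log(p_si), 2, var)     # pointwise penalty terms
waic <- -2 * sum(lppd_i) + 2 * sum(p_waic_i)
se_waic <- sqrt(n * var(-2 * (lppd_i - p_waic_i)))
c(WAIC = waic, se = se_waic, pWAIC = sum(p_waic_i))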

Before computing the WAIC of an object, the probabilities $p_{si}$ must be computed via the compute_p_si() function:

model_train <- compute_p_si(model_train)

Afterwards, the WAIC can be accessed as follows:

WAIC(model_train)
WAIC(model_train_sparse)

MMLL

"MMLL" in criteria stands for marginal model log-likelihood. The model’s marginal likelihood Pr(yM)\Pr(y\mid M) for a model MM and data yy is required for the computation of Bayes factors, see below. In general, the term has no closed form and must be approximated numerically.

RprobitB uses the posterior Gibbs samples derived from the mcmc() function to approximate the model's marginal likelihood via the posterior harmonic mean estimator (Newton and Raftery 1994): Let $S$ denote the number of available posterior samples $\theta_1, \dots, \theta_S$. Then

$$\Pr(y \mid M) = \left( \mathbb{E}_\text{posterior} \, 1/\Pr(y \mid \theta, M) \right)^{-1} \approx \left( \frac{1}{S} \sum_s 1/\Pr(y \mid \theta_s, M) \right)^{-1} = \tilde{\Pr}(y \mid M).$$

By the law of large numbers, $\tilde{\Pr}(y \mid M) \to \Pr(y \mid M)$ almost surely as $S \to \infty$.
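
The estimator is straightforward to state in code; the following sketch computes the posterior harmonic mean estimate on the log scale (via the log-sum-exp trick for numerical stability) from a hypothetical vector log_lik_s containing the values $\log \Pr(y \mid \theta_s, M)$:

# hypothetical vector of log-likelihood values, one entry per posterior draw
log_lik_s <- rnorm(1000, mean = -1730, sd = 5)
S <- length(log_lik_s)
m <- max(-log_lik_s)
log_mml <- -(m + log(sum(exp(-log_lik_s - m))) - log(S))  # log of the harmonic mean estimate
log_mml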

As for the WAIC, computing the MMLL relies on the probabilities $p_{si} = \Pr(y_i \mid \theta_s)$, which must first be computed via the compute_p_si() function. Afterwards, the mml() function can be called with an RprobitB_fit object as input. The function returns the RprobitB_fit object, where the marginal likelihood value is saved as the entry "mml" and the marginal log-likelihood value as the attribute "mmll":

model_train <- mml(model_train)
model_train$mml
attr(model_train$mml, "mmll")

There are two options for improving the approximation: As for the WAIC, you can use more posterior samples. Alternatively, you can combine the posterior harmonic mean estimate with the prior arithmetic mean estimator (Hammersley and Handscomb 1964): For this approach, $S$ samples $\theta_1, \dots, \theta_S$ are drawn from the model's prior distribution. Then,

$$\Pr(y \mid M) = \mathbb{E}_\text{prior} \Pr(y \mid \theta, M) \approx \frac{1}{S} \sum_s \Pr(y \mid \theta_s, M) = \tilde{\Pr}(y \mid M).$$

Again, by the law of large numbers, $\tilde{\Pr}(y \mid M) \to \Pr(y \mid M)$ almost surely as $S \to \infty$. The final approximation of the model's marginal likelihood then is a weighted sum of the posterior harmonic mean estimate and the prior arithmetic mean estimate, where the weights are determined by the sample sizes.
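
Analogously to the harmonic mean sketch above, the prior arithmetic mean estimate can be computed on the log scale from a hypothetical vector log_lik_prior_s of log-likelihood values evaluated at prior draws:

# hypothetical vector of log Pr(y | theta_s, M), one entry per prior draw
log_lik_prior_s <- rnorm(1000, mean = -1900, sd = 20)
S <- length(log_lik_prior_s)
m <- max(log_lik_prior_s)
log_mml_prior <- m + log(sum(exp(log_lik_prior_s - m))) - log(S)  # log of the arithmetic mean estimate
log_mml_prior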

To use the prior arithmetic mean estimator, call the mml() function with a specification of the number of prior draws S and set recompute = TRUE:

model_train <- mml(model_train, S = 1000, recompute = TRUE)

Note that the prior arithmetic mean estimator works well if the prior and posterior distributions have a similar shape and strong overlap, as Gronau et al. (2017) point out. Otherwise, most of the sampled prior values result in a likelihood value close to zero and thereby contribute only marginally to the approximation. In this case, a very large number S of prior samples is required.

BF

The Bayes factor is an index of relative posterior model plausibility of one model over another (Marin and Robert 2014). Given data $\texttt{y}$ and two models $\texttt{mod0}$ and $\texttt{mod1}$, it is defined as

$$BF(\texttt{mod0}, \texttt{mod1}) = \frac{\Pr(\texttt{mod0} \mid \texttt{y})}{\Pr(\texttt{mod1} \mid \texttt{y})} = \frac{\Pr(\texttt{y} \mid \texttt{mod0})}{\Pr(\texttt{y} \mid \texttt{mod1})} \cdot \frac{\Pr(\texttt{mod0})}{\Pr(\texttt{mod1})}.$$

The ratio $\Pr(\texttt{mod0}) / \Pr(\texttt{mod1})$ expresses the factor by which $\texttt{mod0}$ is a priori assumed to be the correct model. Per default, RprobitB sets $\Pr(\texttt{mod0}) = \Pr(\texttt{mod1}) = 0.5$. The first factor, $\Pr(\texttt{y} \mid \texttt{mod0}) / \Pr(\texttt{y} \mid \texttt{mod1})$, is the ratio of the marginal model likelihoods. A value of $BF(\texttt{mod0}, \texttt{mod1}) > 1$ means that the model $\texttt{mod0}$ is more strongly supported by the data under consideration than $\texttt{mod1}$.
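
Given the marginal log-likelihood values stored by mml(), the Bayes factor under equal prior model probabilities can also be sketched by hand, since the prior odds then equal 1 and cancel:

# hypothetical sketch, assuming both models have been processed by compute_p_si() and mml()
mmll_0 <- attr(model_train$mml, "mmll")         # marginal log-likelihood of model_train
mmll_1 <- attr(model_train_sparse$mml, "mmll")  # marginal log-likelihood of model_train_sparse
exp(mmll_0 - mmll_1)                            # Bayes factor BF(model_train, model_train_sparse)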

Adding "BF" to the criteria argument of model_selection yields the Bayes factors. We find a decisive evidence (Jeffreys 1998) in favor of model_train.

model_selection(model_train, model_train_sparse, criteria = c("BF"))

pred_acc

Finally, adding "pred_acc" to the criteria argument of model_selection() returns the share of correctly predicted choices. From the following output we deduce that model_train correctly predicts about 6 percentage points more of the choices than model_train_sparse:

pred_acc(model_train, model_train_sparse)
#> [1] 0.6951178 0.6340048

References

Akaike, H. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control 19. https://doi.org/10.1109/TAC.1974.1100705.
Gronau, Q. F., A. Sarafoglou, D. Matzke, A. Ly, U. Boehm, M. Marsman, D. S. Leslie, J. J. Forster, E. Wagenmakers, and H. Steingroever. 2017. “A Tutorial on Bridge Sampling.” Journal of Mathematical Psychology 81. https://doi.org/10.1016/j.jmp.2017.09.005.
Hammersley, J. M., and D. C. Handscomb. 1964. “General Principles of the Monte Carlo Method.” Springer Verlag. https://doi.org/10.1007/978-94-009-5819-7_5.
Jeffreys, Harold. 1998. The Theory of Probability. OUP Oxford.
Marin, J., and C. Robert. 2014. Bayesian Essentials with R. Springer Verlag. https://doi.org/10.1007/978-1-4614-8687-9.
McElreath, R. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman; Hall/CRC. https://doi.org/10.1201/9780429029608.
Newton, M. A., and A. E. Raftery. 1994. “Approximate Bayesian Inference with the Weighted Likelihood Bootstrap.” Journal of the Royal Statistical Society: Series B (Methodological) 56 (1).
Schwarz, G. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6.
Watanabe, S., and M. Opper. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (12).