
This vignette documents the estimation procedure fit_model() in RprobitB.

Bayes estimation of the probit model

Bayes estimation of the probit model builds upon the work of McCulloch and Rossi (1994), Nobile (1998), Allenby and Rossi (1998), and Imai and van Dyk (2005). A key ingredient is the concept of data augmentation, see Albert and Chib (1993): The idea is to treat the latent utilities $U$ in the model equation $U = X\beta + \epsilon$ as additional parameters. Then, conditional on $U$, the probit model constitutes a standard Bayesian linear regression setup. Its posterior distribution can be approximated by iteratively drawing and updating each model parameter conditional on the other parameters (the so-called Gibbs sampling approach).
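To make the data augmentation idea concrete, below is a minimal, self-contained R sketch of an Albert-and-Chib-style Gibbs sampler for a plain binary probit model. This is an illustration only, not RprobitB's internal implementation; all variable names are made up for this sketch.

# Minimal data augmentation sketch for a binary probit model
# (illustrative only, not the RprobitB implementation).
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                  # design matrix with intercept
y <- as.numeric(X %*% c(-0.5, 1) + rnorm(n) > 0)

prior_prec <- diag(2) / 10               # prior: beta ~ MVN(0, 10 * I)
beta <- c(0, 0)
R <- 2000
draws <- matrix(NA_real_, R, 2)

for (r in seq_len(R)) {
  # 1. Draw latent utilities U | beta, y from univariate truncated normals:
  #    positive if y = 1, negative if y = 0 (inverse CDF method).
  mu <- as.vector(X %*% beta)
  p0 <- pnorm(0, mean = mu)              # P(U < 0) under the current mean
  u <- runif(n, min = ifelse(y == 1, p0, 0), max = ifelse(y == 1, 1, p0))
  U <- qnorm(u, mean = mu)
  # 2. Draw beta | U as in a standard Bayesian linear regression.
  V <- solve(prior_prec + crossprod(X))
  m <- V %*% crossprod(X, U)
  beta <- as.vector(m + t(chol(V)) %*% rnorm(2))
  draws[r, ] <- beta
}
colMeans(draws[-(1:1000), ])             # posterior means after burn-in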

A priori, we assume the following (conjugate) parameter distributions:

  • $(s_1,\dots,s_C)\sim D_C(\delta)$, where $D_C(\delta)$ denotes the $C$-dimensional Dirichlet distribution with concentration parameter vector $\delta = (\delta_1,\dots,\delta_C)$,

  • $\alpha\sim \text{MVN}_{P_f}(\psi,\Psi)$, where $\text{MVN}_{P_f}$ denotes the $P_f$-dimensional normal distribution with mean $\psi$ and covariance $\Psi$,

  • $b_c \sim \text{MVN}_{P_r}(\xi,\Xi)$, independently for all $c$,

  • $\Omega_c \sim W^{-1}_{P_r}(\nu,\Theta)$, independently for all $c$, where $W^{-1}_{P_r}(\nu,\Theta)$ denotes the $P_r$-dimensional inverse Wishart distribution with $\nu$ degrees of freedom and scale matrix $\Theta$,

  • and $\Sigma \sim W^{-1}_{J-1}(\kappa,\Lambda)$. A sketch of sampling from these priors follows below.
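For illustration, the following sketch draws once from each of these priors with arbitrary small dimensions and hyperparameters. It assumes the MCMCpack and MASS packages for the Dirichlet, multivariate normal, and inverse Wishart draws; RprobitB uses its own internal samplers.

# Illustrative draws from the prior distributions (arbitrary hyperparameters).
library(MCMCpack)  # rdirichlet(), riwish()
library(MASS)      # mvrnorm()

C <- 2; P_f <- 2; P_r <- 2; J <- 3
s     <- as.vector(rdirichlet(1, alpha = rep(1, C)))        # class weights
alpha <- mvrnorm(1, mu = rep(0, P_f), Sigma = diag(P_f))    # fixed coefficients
b     <- replicate(C, mvrnorm(1, mu = rep(0, P_r), Sigma = diag(P_r)))
Omega <- replicate(C, riwish(v = P_r + 2, S = diag(P_r)), simplify = FALSE)
Sigma <- riwish(v = J + 1, S = diag(J - 1))                 # error covariance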

These prior distributions imply the following conditional posterior distributions:

  • The class weights are drawn from the Dirichlet distribution \begin{equation} (s_1,\dots,s_C)\mid \delta,z \sim D_C(\delta_1+m_1,\dots,\delta_C+m_C), \end{equation} where for $c=1,\dots,C$, $m_c=\#\{n:z_n=c\}$ denotes the current absolute class size.

  • Independently for all $n$, we update the allocation variables $(z_n)_n$ from their conditional distribution \begin{equation} \text{Prob}(z_n=c\mid s,\beta,b,\Omega)=\frac{s_c\phi_{P_r}(\beta_n\mid b_c,\Omega_c)}{\sum_c s_c\phi_{P_r}(\beta_n\mid b_c,\Omega_c)}, \end{equation} where $\phi_{P_r}(\cdot\mid b_c,\Omega_c)$ denotes the density of the $P_r$-dimensional normal distribution with mean $b_c$ and covariance $\Omega_c$.

  • The class means $(b_c)_c$ are updated independently for all $c$ via \begin{equation} b_c\mid \Xi,\Omega,\xi,z,\beta \sim \text{MVN}_{P_r}(\mu_{b_c}, \Sigma_{b_c}), \end{equation} where $\mu_{b_c}=(\Xi^{-1}+m_c\Omega_c^{-1})^{-1}(\Xi^{-1}\xi +m_c\Omega_c^{-1}\bar{b}_c)$, $\Sigma_{b_c}=(\Xi^{-1}+m_c\Omega_c^{-1})^{-1}$, and $\bar{b}_c=m_c^{-1}\sum_{n:z_n=c} \beta_n$.

  • The class covariance matrices $(\Omega_c)_c$ are updated independently for all $c$ via \begin{equation} \Omega_c \mid \nu,\Theta,z,\beta,b \sim W^{-1}_{P_r}(\mu_{\Omega_c},\Sigma_{\Omega_c}), \end{equation} where $\mu_{\Omega_c}=\nu+m_c$ and $\Sigma_{\Omega_c}=\Theta^{-1} + \sum_{n:z_n=c} (\beta_n-b_c)(\beta_n-b_c)'$.

  • Independently for all $n$ and $t$, and conditionally on the other components, the utility vectors $(U_{nt:})$ follow a $(J-1)$-dimensional truncated multivariate normal distribution, where the truncation points are determined by the choices $y_{nt}$. To sample from a truncated multivariate normal distribution, we apply a sub-Gibbs sampler, following the approach of Geweke (1998) (see the sketch after this list): \begin{equation} U_{ntj} \mid U_{nt(-j)},y_{nt},\Sigma,W,\alpha,X,\beta \sim \mathcal{N}(\mu_{U_{ntj}},\Sigma_{U_{ntj}}) \cdot \begin{cases} 1(U_{ntj}>\max(U_{nt(-j)},0)) & \text{if}~ y_{nt}=j\\ 1(U_{ntj}<\max(U_{nt(-j)},0)) & \text{if}~ y_{nt}\neq j \end{cases}, \end{equation} where $U_{nt(-j)}$ denotes the vector $(U_{nt:})$ without the element $U_{ntj}$, $\mathcal{N}$ denotes the univariate normal distribution, $\Sigma_{U_{ntj}} = 1/(\Sigma^{-1})_{jj}$, and \begin{equation} \mu_{U_{ntj}} = W_{ntj}'\alpha + X_{ntj}'\beta_n - \Sigma_{U_{ntj}} (\Sigma^{-1})_{j(-j)} (U_{nt(-j)} - W_{nt(-j)}'\alpha - X_{nt(-j)}' \beta_n), \end{equation} where $(\Sigma^{-1})_{jj}$ denotes the $(j,j)$th element of $\Sigma^{-1}$, $(\Sigma^{-1})_{j(-j)}$ the $j$th row of $\Sigma^{-1}$ without the $j$th entry, and $W_{nt(-j)}$ and $X_{nt(-j)}$ the coefficient matrices $W_{nt}$ and $X_{nt}$, respectively, without the $j$th column.

  • Updating the fixed coefficient vector $\alpha$ is achieved by applying the formula for Bayesian linear regression of the regressors $W_{nt}$ on the regressands $(U_{nt:})-X_{nt}'\beta_n$, i.e. \begin{equation} \alpha \mid \Psi,\psi,W,\Sigma,U,X,\beta \sim \text{MVN}_{P_f}(\mu_\alpha,\Sigma_\alpha), \end{equation} where $\mu_\alpha = \Sigma_\alpha (\Psi^{-1}\psi + \sum_{n=1,t=1}^{N,T} W_{nt} \Sigma^{-1} ((U_{nt:})-X_{nt}'\beta_n))$ and $\Sigma_\alpha = (\Psi^{-1} + \sum_{n=1,t=1}^{N,T} W_{nt}\Sigma^{-1} W_{nt}')^{-1}$.

  • Analogously to $\alpha$, the random coefficients $(\beta_n)_n$ are updated independently via \begin{equation} \beta_n \mid \Omega,b,X,\Sigma,U,W,\alpha \sim \text{MVN}_{P_r}(\mu_{\beta_n},\Sigma_{\beta_n}), \end{equation} where $\mu_{\beta_n} = \Sigma_{\beta_n} (\Omega_{z_n}^{-1}b_{z_n} + \sum_{t=1}^{T} X_{nt} \Sigma^{-1} (U_{nt:}-W_{nt}'\alpha))$ and $\Sigma_{\beta_n} = (\Omega_{z_n}^{-1} + \sum_{t=1}^{T} X_{nt}\Sigma^{-1} X_{nt}')^{-1}$.

  • The error term covariance matrix $\Sigma$ is updated by means of \begin{equation} \Sigma \mid \kappa,\Lambda,U,W,\alpha,X,\beta \sim W^{-1}_{J-1}(\kappa+NT,\Lambda+S), \end{equation} where $S = \sum_{n=1,t=1}^{N,T} \varepsilon_{nt} \varepsilon_{nt}'$ and $\varepsilon_{nt} = (U_{nt:}) - W_{nt}'\alpha - X_{nt}'\beta_n$.
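The truncated-normal utility update is the least standard of these steps. The following function sketches the sub-Gibbs sampler for a single utility vector, using the inverse CDF method for the univariate truncated draws. It is a simplified illustration under the notation above, not RprobitB's internal code.

# Sub-Gibbs draw of one utility vector U (length J - 1) given its
# conditional mean mu = W'alpha + X'beta and the precision Sigma_inv
# (illustrative sketch of the Geweke (1998) approach).
draw_utility <- function(U, y, mu, Sigma_inv) {
  for (j in seq_along(U)) {
    # Conditional moments of U[j] given the other components:
    var_j  <- 1 / Sigma_inv[j, j]
    mean_j <- mu[j] - var_j * sum(Sigma_inv[j, -j] * (U[-j] - mu[-j]))
    # U[j] must exceed max(U[-j], 0) iff alternative j was chosen:
    p <- pnorm(max(U[-j], 0), mean_j, sqrt(var_j))
    u <- if (y == j) runif(1, p, 1) else runif(1, 0, p)
    U[j] <- qnorm(u, mean_j, sqrt(var_j))
  }
  U
}

# Example with J = 3 alternatives, i.e. two utility differences:
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
draw_utility(U = c(0, 0), y = 1, mu = c(0.2, -0.1), Sigma_inv = solve(Sigma))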

Parameter normalization

Samples obtained from the updating scheme described above lack identification (except for the $s$ and $z$ draws); compare the vignette on the model definition. Therefore, subsequent to the sampling, the following normalizations are required for the $i$th updates in each iteration $i$:

  • $\alpha^{(i)} \cdot \omega^{(i)}$,

  • $b_c^{(i)} \cdot \omega^{(i)}$, $c=1,\dots,C$,

  • $U_{nt}^{(i)} \cdot \omega^{(i)}$, $n = 1,\dots,N$, $t = 1,\dots,T$,

  • $\beta_n^{(i)} \cdot \omega^{(i)}$, $n = 1,\dots,N$,

  • $\Omega_c^{(i)} \cdot (\omega^{(i)})^2$, $c=1,\dots,C$, and

  • $\Sigma^{(i)} \cdot (\omega^{(i)})^2$,

where either $\omega^{(i)} = \sqrt{\text{const} / (\Sigma^{(i)})_{jj}}$ with $(\Sigma^{(i)})_{jj}$ the $j$th diagonal element of $\Sigma^{(i)}$, $1\leq j \leq J-1$, or alternatively $\omega^{(i)} = \text{const} / \alpha^{(i)}_p$ for some coordinate $1\leq p \leq P_f$ of the $i$th draw of the coefficient vector $\alpha$. Here, $\text{const}$ is any positive constant (typically 1). The preferences are flipped if $\omega^{(i)} < 0$, which is the case only if $\alpha^{(i)}_p < 0$. A sketch of this normalization in code follows.
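In code, the normalization amounts to rescaling each draw by $\omega^{(i)}$. The following helper is an illustrative sketch; the list layout of draw is an assumption for this sketch, not RprobitB's internal storage format.

# Normalize one draw with respect to a fixed error-term variance,
# i.e. the scale "Sigma_<j>,<j> := const" (illustrative sketch).
normalize_draw <- function(draw, const = 1, j = 1) {
  omega <- sqrt(const / draw$Sigma[j, j])
  draw$alpha <- draw$alpha * omega       # linear coefficients scale with omega
  draw$b     <- draw$b * omega           # each class mean b_c
  draw$U     <- draw$U * omega           # latent utilities
  draw$beta  <- draw$beta * omega        # random coefficients
  draw$Omega <- draw$Omega * omega^2     # covariances scale with omega^2
  draw$Sigma <- draw$Sigma * omega^2
  draw
}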

Burn-in and thinning

The theory behind Gibbs sampling states that the sequence of samples produced by the updating scheme is a Markov chain whose stationary distribution equals the desired joint posterior distribution. It takes a certain number of iterations for that stationary distribution to be approximated reasonably well. Therefore, it is common practice to discard the first $B$ out of $R$ samples (the so-called burn-in period). Furthermore, correlation between nearby samples is to be expected. In order to obtain (approximately) independent samples, we consider only every $Q$th sample when computing Gibbs sample statistics like the expectation and the standard deviation. The independence of the samples can be verified by computing the serial correlation, and the convergence of the Gibbs sampler can be checked by considering trace plots, see below. Burn-in and thinning amount to a simple subsetting of the draws, as sketched below.
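# Discard the first B draws, then keep every Q-th of the remaining ones
# (illustrative sketch):
R <- 10000; B <- 5000; Q <- 10
kept <- seq(from = B + 1, to = R, by = Q)   # indices of retained samples
# For a vector `draws` of length R, statistics are then computed on
# draws[kept], e.g. mean(draws[kept]) and sd(draws[kept]).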

The fit_model() function

The Gibbs sampling scheme described above can be executed by applying the function

fit_model(data = data)

where data must be an RprobitB_data object (see the vignette about choice data). The function has the following optional arguments:

  • scale: A character which determines the utility scale. It is of the form "<parameter> := <value>", where <parameter> is either the name of a fixed effect or Sigma_<j>,<j> for the <j>th diagonal element of Sigma, and <value> is the value of the fixed parameter (i.e. the constant $\text{const}$ introduced above). Per default, scale = "Sigma_1,1 := 1", i.e. the first error-term variance is fixed to 1.

  • R: The number of iterations of the Gibbs sampler. The default is R = 10000.

  • B: The length of the burn-in period, i.e. a non-negative number of samples to be discarded. The default is B = R/2.

  • Q: The thinning factor for the Gibbs samples, i.e. only every Qth sample is kept. The default is Q = 1.

  • print_progress: A boolean, determining whether to print the Gibbs sampler progress.

  • prior: A named list of parameters for the prior distributions (their default values are documented in the check_prior() function):

    • eta: The mean vector of length P_f of the normal prior for alpha.

    • Psi: The covariance matrix of dimension P_f x P_f of the normal prior for alpha.

    • delta: The concentration parameter of length 1 of the Dirichlet prior for s.

    • xi: The mean vector of length P_r of the normal prior for each b_c.

    • D: The covariance matrix of dimension P_r x P_r of the normal prior for each b_c.

    • nu: The degrees of freedom (a natural number greater than P_r) of the Inverse Wishart prior for each Omega_c.

    • Theta: The scale matrix of dimension P_r x P_r of the Inverse Wishart prior for each Omega_c.

    • kappa: The degrees of freedom (a natural number greater than J-1) of the Inverse Wishart prior for Sigma.

    • E: The scale matrix of dimension J-1 x J-1 of the Inverse Wishart prior for Sigma.

  • latent_classes: A list of parameters specifying the number and the updating scheme of latent classes, see the vignette on modeling heterogeneity. A call that sets several of these arguments explicitly is sketched below.
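As a concrete illustration, such a call could look as follows. All values are arbitrary choices for this sketch, and the prior entries shown override only the mean and covariance of the normal prior for alpha; here $P_f = 4$ fixed coefficients are assumed.

# Illustrative fit_model() call with explicit options (values arbitrary).
model <- fit_model(
  data  = data,                    # an RprobitB_data object
  scale = "Sigma_1,1 := 1",        # fix the first error-term variance to 1
  R     = 10000,                   # Gibbs sampler iterations
  B     = 5000,                    # burn-in length
  Q     = 10,                      # thinning factor
  print_progress = TRUE,
  prior = list(
    eta = rep(0, 4),               # prior mean of alpha
    Psi = 10 * diag(4)             # prior covariance of alpha
  )
)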

Example

In the previous vignette on choice data, we introduced the train_choice data set, which contains 2922 choices between two fictional train route alternatives. The following lines fit a probit model that explains the chosen trip alternatives (choice) by their price, time, number of changes, and level of comfort (the lower this value, the higher the comfort). For normalization, the first linear coefficient, the price, was fixed to -1, which allows interpreting the other coefficients as monetary values:

form <- choice ~ price + time + change + comfort | 0
data <- prepare_data(form = form, choice_data = train_choice, id = "deciderID", idc = "occasionID")
model_train <- fit_model(
  data = data,
  scale = "price := -1"
)

The estimated coefficients (using the mean of the Gibbs samples as a point estimate) can be printed via

coef(model_train)
#>            Estimate   (sd)
#> 1   price     -1.00 (0.00)
#> 2    time    -25.90 (2.09)
#> 3  change     -4.82 (0.84)
#> 4 comfort    -14.49 (0.86)

and visualized via

plot(coef(model_train), sd = 3)

The results indicate that the deciders value one hour of travel time at about 25€, an additional change at 5€, and a more comfortable class at 14€.

Checking the Gibbs samples

The Gibbs samples are saved in list form in the RprobitB_fit object at the entry "gibbs_samples", i.e.

str(model_train$gibbs_samples, max.level = 2, give.attr = FALSE)
#> List of 2
#>  $ gibbs_samples_raw:List of 2
#>   ..$ alpha: num [1:1000, 1:4] -0.000713 -0.022961 -0.031988 -0.036215 -0.03571 ...
#>   ..$ Sigma: num [1:1000, 1] 1.05 1.1 1.03 1.02 1.01 ...
#>  $ gibbs_samples_nbt:List of 2
#>   ..$ alpha: num [1:500, 1:4] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>   ..$ Sigma: num [1:500, 1] 617 611 626 552 734 ...

This object contains two elements:

  • gibbs_samples_raw is a list of the raw samples from the Gibbs sampler,

  • and gibbs_samples_nbt are the Gibbs samples used for parameter estimates, i.e. the normalized and thinned Gibbs samples after the burn-in.

Calling the summary function on the estimated RprobitB_fit object yields additional information about the Gibbs samples gibbs_samples_nbt. You can specify a list FUN of functions that compute point estimates of the Gibbs samples, for example

  • mean for the arithmetic mean,

  • stats::sd for the standard deviation,

  • R_hat for the Gelman-Rubin statistic (Gelman and Rubin 1992),

  • or custom statistics like the absolute difference between the median and the mean.

summary(model_train,
  FUN = c(
    "mean" = mean,
    "sd" = stats::sd,
    "R^" = R_hat,
    "custom_stat" = function(x) abs(mean(x) - median(x))
  )
)
#> Probit model
#> Formula: choice ~ price + time + change + comfort | 0 
#> R: 1000, B: 500, Q: 1
#> Level: Utility differences with respect to alternative 'B'.
#> Scale: Coefficient of effect 'price' (alpha_1) fixed to -1.
#> 
#> Gibbs sample statistics
#>                mean           sd           R^  custom_stat
#>  alpha
#>                                                           
#>      1        -1.00         0.00         1.00         0.00
#>      2       -25.90         2.09         1.04         0.07
#>      3        -4.82         0.84         1.00         0.02
#>      4       -14.49         0.86         1.00         0.02
#> 
#>  Sigma
#>                                                           
#>    1,1       661.69        59.21         1.03         7.30

Calling the plot method with the additional argument type = "trace" plots the trace of the Gibbs samples gibbs_samples_nbt:

par(mfrow = c(2, 1))
plot(model_train, type = "trace")

Additionally, we can visualize the serial correlation of the Gibbs samples via the argument type = "acf". The boxes in the top-right corner state the total sample size TSS (here R - B = 1000 - 500 = 500), the effective sample size ESS, and the factor by which TSS is larger than ESS.

par(mfrow = c(2, 3))
plot(model_train, type = "acf")

Here, the effective sample size is the value $\text{TSS} / (1 + \sum_{k\geq 1} \rho_k)$, where $\rho_k$ is the autocorrelation between samples that lie $k$ positions apart. The autocorrelations are estimated via the stats::acf() function, as sketched below.
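For reference, a direct computation of this quantity from a vector of draws could look as follows. This is a sketch; the sum is truncated at the default maximum lag of stats::acf().

# Effective sample size from the estimated autocorrelations (sketch):
ess <- function(x) {
  rho <- stats::acf(x, plot = FALSE)$acf[-1]   # drop the lag-0 value of 1
  length(x) / (1 + sum(rho))
}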

Model transformation after estimation

The transform method can be used to transform an RprobitB_fit object in three ways:

  1. change the length B of the burn-in period, for example
model_train <- transform(model_train, B = 1)
  2. change the thinning factor Q of the Gibbs samples, for example
model_train <- transform(model_train, Q = 100)
  3. or change the model normalization scale, for example
model_train <- transform(model_train, scale = "Sigma_1,1 := 1")

References

Albert, J. H., and S. Chib. 1993. “Bayesian Analysis of Binary and Polychotomous Response Data.” Journal of the American Statistical Association 88. https://doi.org/10.2307/2290350.
Allenby, G. M., and P. Rossi. 1998. “Marketing Models of Consumer Heterogeneity.” Journal of Econometrics 89. https://doi.org/10.1016/S0304-4076(98)00055-4.
Gelman, A., and D. B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4). https://doi.org/10.1214/ss/1177011136.
Geweke, J. 1998. “Efficient Simulation from the Multivariate Normal and Student-t Distributions Subject to Linear Constraints and the Evaluation of Constraint Probabilities.” Computing Science and Statistics 23.
Imai, K., and D. A. van Dyk. 2005. “A Bayesian Analysis of the Multinomial Probit Model Using Marginal Data Augmentation.” Journal of Econometrics.
McCulloch, R., and P. Rossi. 1994. “An Exact Likelihood Analysis of the Multinomial Probit Model.” Journal of Econometrics 64. https://doi.org/10.1016/0304-4076(94)90064-7.
Nobile, A. 1998. “A Hybrid Markov Chain for the Bayesian Analysis of the Multinomial Probit Model.” Statistics and Computing 8. https://doi.org/10.1023/A:1008905311214.