Estimation and Inference of VAR with Bayesian Methods

Jincheng Jiang

2024-11-19

This post wouldn’t exist without these resources:

  • Antolín-Díaz, Juan, and Juan F. Rubio-Ramírez. 2018. “Narrative Sign Restrictions for SVARs.” American Economic Review 108 (10): 2802–29.
  • Bayesian Macroeconometrics in R
  • Rubio-Ramírez, Juan F., Daniel F. Waggoner, and Tao Zha. 2010. “Structural Vector Autoregressions: Theory of Identification and Algorithms for Inference.” The Review of Economic Studies 77 (2): 665–96.
  • Kilian, Lutz, and Helmut Lütkepohl. 2017. Structural Vector Autoregressive Analysis. Cambridge University Press.
  • Yifei Lyu’s lecture notes

Many thanks!

A reduced form VAR

Let’s consider an m-dimensional VAR(p) model:

\[ Y_t = \alpha + \sum_{s=1}^{p} \beta_{s} Y_{t-s} + \epsilon_{t} \tag{1} \]

where we assume the disturbance terms are iid over time, \(\epsilon_{t} \sim \mathcal{N}(0, \Sigma)\), though \(\Sigma\) need not be diagonal. An illustrative example is a bivariate VAR(2) model containing GDP \(y_{t}\) and inflation \(\pi_{t}\):

\[ \begin{align*} y_{t} &= \alpha_{0} + \alpha_{11} y_{t-1} + \alpha_{12} \pi_{t-1} + \alpha_{21} y_{t-2} + \alpha_{22} \pi_{t-2} + \epsilon_{t}^{y} \\ \pi_{t} &= \beta_{0} + \beta_{11} y_{t-1} + \beta_{12} \pi_{t-1} + \beta_{21} y_{t-2} + \beta_{22} \pi_{t-2} + \epsilon_{t}^{\pi} \\ \end{align*} \]

which can be written as

\[ \begin{bmatrix} y_{t} \\ \pi_{t} \end{bmatrix} = \begin{bmatrix} \alpha_{0} \\ \beta_{0} \end{bmatrix} + \begin{bmatrix} \alpha_{11} & \alpha_{12} \\ \beta_{11} & \beta_{12} \end{bmatrix} \begin{bmatrix} y_{t-1} \\ \pi_{t-1} \end{bmatrix} + \begin{bmatrix} \alpha_{21} & \alpha_{22} \\ \beta_{21} & \beta_{22} \end{bmatrix} \begin{bmatrix} y_{t-2} \\ \pi_{t-2} \end{bmatrix} + \begin{bmatrix} \epsilon_{t}^{y}\\ \epsilon_{t}^{\pi} \end{bmatrix} \tag{2} \]

The matrix form (2) can be used to understand the compact form (1).

Estimating VAR via OLS

In model (1), the innovation at time \(t\) does not affect the lagged values of \(Y\), so the OLS assumptions \(\mathbb{E}(\epsilon_{t} x_{t})=0\) and \(\mathbb{E}(\epsilon_{t} \epsilon_{t-s})=0\) are naturally satisfied. Thus we can estimate model (1) by OLS.

Again, we illustrate the problem using the bivariate VAR(2) example. We can stack all observations vertically and write:

\[ \begin{bmatrix} y_{3} & \pi_{3} \\ y_{4} & \pi_{4} \\ \vdots & \vdots \\ y_{T} & \pi_{T} \end{bmatrix} = \begin{bmatrix} 1 & y_{2} & \pi_{2} & y_{1} & \pi_{1} \\ 1 & y_{3} & \pi_{3} & y_{2} & \pi_{2} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & y_{T-1} & \pi_{T-1} & y_{T-2} & \pi_{T-2} \\ \end{bmatrix} \begin{bmatrix} \alpha_{0} & \beta_{0} \\ \alpha_{11} & \beta_{11} \\ \alpha_{12} & \beta_{12} \\ \alpha_{21} & \beta_{21} \\ \alpha_{22} & \beta_{22} \\ \end{bmatrix} + \begin{bmatrix} \epsilon_{3}^{y} & \epsilon_{3}^{\pi} \\ \epsilon_{4}^{y} & \epsilon_{4}^{\pi} \\ \vdots & \vdots \\ \epsilon_{T}^{y} & \epsilon_{T}^{\pi} \\ \end{bmatrix} \]

which can be written as

\[ Y = X \beta + \epsilon \]

then we directly get

\[ \hat{\beta} = (X'X)^{-1}X'Y, \quad \hat{\epsilon}=Y-X \hat{\beta}, \quad \hat{\Sigma}=\frac{1}{T-p}\hat{\epsilon}' \hat{\epsilon} \]

This is done in the R code.
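
As a minimal sketch of this step in base R (assuming the raw data sit in a \(T \times m\) matrix dat; the function and variable names are mine, not necessarily those of the accompanying R code):

```r
# A minimal OLS estimator for a VAR(p); `dat` is a T x m data matrix and
# `p` is the lag order (names are illustrative, not the post's own code)
var_ols <- function(dat, p) {
  T_full <- nrow(dat)
  Y <- dat[(p + 1):T_full, , drop = FALSE]              # left-hand side, (T - p) x m
  X <- cbind(1, do.call(cbind, lapply(1:p, function(s)  # [1, Y_{t-1}, ..., Y_{t-p}]
    dat[(p + 1 - s):(T_full - s), , drop = FALSE])))
  beta_hat  <- solve(crossprod(X), crossprod(X, Y))     # (X'X)^{-1} X'Y
  eps_hat   <- Y - X %*% beta_hat
  Sigma_hat <- crossprod(eps_hat) / nrow(Y)             # eps'eps / (T - p)
  list(beta = beta_hat, eps = eps_hat, Sigma = Sigma_hat, Y = Y, X = X)
}
```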

Estimating VAR via Bayesian methods

Why bother with Bayesian methods when OLS would simply solve the problem? Because Bayesian methods are of great help when we want to do inference, which I will talk about below. Here I first outline the basic idea of the Bayesian approach (mainly from this author’s github repo and his book).

Let’s write (1) in the following more compact form

\[ y = (\mathbb{I}_{m} \otimes X) \alpha + \varepsilon \tag{3} \]

where \(y=\text{vec}(Y)\) and \(\varepsilon=\text{vec}(\epsilon) \sim \mathcal{N}(0, \Sigma \otimes \mathbb{I}_{T})\) are of dimension \(mT \times 1\) (in fact it should be \(m(T-p) \times 1\); for convenience I write \(T\)), and \(\alpha=\text{vec}(\beta)\) is of dimension \(m(mp+1) \times 1\). Here \(\text{vec}\) stacks a matrix vertically column by column and \(\otimes\) denotes the Kronecker product; in R these are as.vector and %x%. Note \(\mathbb{I}_{m}\) denotes the m-dimensional identity matrix.
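
As a toy check of these two operations:

```r
A <- matrix(1:4, 2, 2)   # columns are c(1, 2) and c(3, 4)
as.vector(A)             # vec(A) = c(1, 2, 3, 4), stacked column by column
diag(2) %x% A            # I_2 (x) A: a 4 x 4 block-diagonal matrix with A twice
```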

Again, we illustrate (3) using the bivariate VAR(2) example, which says

\[ \begin{align*} \begin{bmatrix} y_{3} \\ \vdots \\ y_{T} \\ \pi_{3} \\ \vdots \\ \pi_{T} \end{bmatrix}&= \left(\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \otimes X\right) \alpha + \varepsilon \\ &= \begin{bmatrix} X_{(T-2) \times 5} & \mathbf{0} \\ \mathbf{0} & X_{(T-2) \times 5} \end{bmatrix} \begin{bmatrix} \alpha_{0} \\ \vdots \\ \alpha_{22} \\ \beta_{0} \\ \vdots \\ \beta_{22} \end{bmatrix} + \begin{bmatrix} \epsilon_{3}^{y} \\ \vdots \\ \epsilon_{T}^{y} \\ \epsilon_{3}^{\pi} \\ \vdots \\ \epsilon_{T}^{\pi} \\ \end{bmatrix} \end{align*} \]

The parameters we want to estimate are \(\alpha\) and \(\Sigma\). Since we assume the disturbance terms are iid normal, given the parameters we know

\[ y \sim \mathcal{N} \left((\mathbb{I}_{m} \otimes X) \alpha, \Sigma \otimes \mathbb{I}_{T}\right) \]

thus we can write the likelihood function as

\[ \begin{align*} \mathcal{L}(\alpha, \Sigma \mid y, X)&=(2 \pi)^{-mT / 2}\left|\Sigma \otimes \mathbb{I}_T\right|^{-1 / 2} \\ \quad &\times \exp \left\{-\frac{1}{2}\left(y-\left(\mathbb{I}_m \otimes X\right) \alpha\right)'\left(\Sigma^{-1} \otimes \mathbb{I}_T\right)\left(y-\left(\mathbb{I}_m \otimes X\right) \alpha\right)\right\} \\ & \equiv p(y \mid \alpha, \Sigma, X) \end{align*} \]

after some manipulation we obtain

\[ \begin{align*} \ln \mathcal{L} &= -\frac{T}{2} \ln |\Sigma|-\frac{1}{2}(\alpha-\widehat{\alpha})' \mathcal{X}'\left(\Sigma^{-1} \otimes \mathbb{I}_T\right) \mathcal{X}(\alpha-\widehat{\alpha}) \\ & \quad -\frac{1}{2} \operatorname{tr}\left[(y-\mathcal{X} \widehat{\alpha})(y-\mathcal{X} \widehat{\alpha})'\left(\Sigma^{-1} \otimes \mathbb{I}_T\right)\right] \\ & \propto \ln \left[ \mathcal{N}(\alpha \mid \widehat{\alpha},\Sigma,\mathcal{X},y) \times \mathcal{IW}(\Sigma \mid \widehat{\alpha},\mathcal{X},y)\right] \end{align*} \]

where

\[ \begin{align*} \widehat{\alpha} &\equiv (\Sigma^{-1} \otimes X'X)^{-1}(\Sigma^{-1} \otimes X')y \\ \mathcal{X} &\equiv \mathbb{I}_{m} \otimes X \end{align*} \]

which means the likelihood function is (conditionally) proportional to the product of a normal density for \(\alpha\) and an inverse-Wishart density for \(\Sigma\). This suggests that we can sample the parameters from the posterior density using a Gibbs sampler, if we use appropriate priors.

By the way, I found a code snippet to generate random draws from the inverse-Wishart distribution in this package, which should be easy to translate into Rcpp code.
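
Alternatively, base R’s rWishart() suffices, using the fact that if \(W \sim \mathcal{W}(S^{-1}, \nu)\) then \(W^{-1} \sim \mathcal{IW}(S, \nu)\); a minimal sketch:

```r
# draw once from inverse-Wishart(S, nu) via base R's rWishart()
riwish <- function(S, nu) {
  solve(rWishart(1, df = nu, Sigma = solve(S))[, , 1])
}
```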

In fact this is the case if we use the independent normal-inverse-Wishart prior, which specifies independent normal and inverse-Wishart priors for \(\alpha\) and \(\Sigma\) respectively. The result is as follows.

First, by Bayes’ rule we know the joint posterior distribution satisfies

\[ p(\alpha, \Sigma \mid \mathcal{X}, y) \propto p(y \mid \alpha, \Sigma, X) p(\alpha,\Sigma) = p(y \mid \alpha, \Sigma, X) p(\alpha) p(\Sigma) \]

The priors are \(\alpha \sim \mathcal{N}(\bar{\alpha},\mathcal{E}_{\alpha})\) and \(\Sigma \sim \mathcal{IW}(\mathcal{E}_{\Sigma},\nu)\).

Given \(\Sigma\), after some computation (still following this author’s github repo or chapter 5 of this book) we find that the conditional posterior density of \(\alpha\) satisfies

\[ p(\alpha \mid \Sigma, \mathcal{X}, y) \propto \exp \left(-\frac{1}{2}\left[(\alpha-\tilde{\alpha})' \tilde{\Sigma}_{\alpha}^{-1}(\alpha-\tilde{\alpha})+C \right]\right) \propto \mathcal{N}(\alpha \mid \tilde{\alpha},\tilde{\Sigma}_{\alpha}) \tag{4} \]

where

\[ \begin{align*} \tilde{\Sigma}_{\alpha}^{-1}&=\mathcal{E}_{\alpha}^{-1}+\mathcal{X}'(\Sigma^{-1} \otimes \mathbb{I}_{T}) \mathcal{X} \\ \tilde{\alpha}&=\tilde{\Sigma}_{\alpha}(\mathcal{E}_{\alpha}^{-1}\bar{\alpha}+\mathcal{X}'(\Sigma^{-1}\otimes \mathbb{I}_{T})y) \\ C&=y'(\Sigma^{-1}\otimes \mathbb{I}_{T})y + \bar{\alpha}'\mathcal{E}_{\alpha}^{-1} \bar{\alpha} - \tilde{\alpha}'\tilde{\Sigma}_{\alpha}^{-1}\tilde{\alpha} \end{align*} \]

Similarly, given \(\alpha\), we have

\[ p(\Sigma \mid \beta, X, Y) = \mathcal{IW} \left(\mathcal{E}_{\Sigma}+(Y-X\beta)'(Y-X\beta),T+\nu \right) \tag{5} \]

where \(\alpha=\text{vec}(\beta)\).

Based on (4) and (5), we can apply the Gibbs sampler to estimate the parameters we are interested in. A possible selection of prior parameters is the following (a minimal sketch of the resulting sampler appears after the list):

  • \(\bar{\alpha}\): set the prior mean of the first lag of the dependent variable to one and set the prior mean of all other slope coefficients to zero, i.e. mimic a multivariate random walk model
  • \(\mathcal{E}_{\alpha}\): set to \(\eta \mathbb{I}_{m(mp+1)}\), where \(\eta\) is a hyperparameter
  • \(\mathcal{E}_{\Sigma}\): set to \(\mathbb{I}_{m}\)
  • \(\nu\): set to \(m+1\)
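
Under these priors, a bare-bones sketch of the sampler might look as follows (it reuses riwish() from above and the stacked Y and X from the OLS section, e.g. as returned by var_ols(); all names are mine, and burn-in handling is omitted):

```r
# Gibbs sampler for the independent normal-inverse-Wishart prior;
# Y is (T - p) x m, X is (T - p) x k with k = m * p + 1
gibbs_var <- function(Y, X, n_draws = 2000, eta = 10, nu = ncol(Y) + 1) {
  m <- ncol(Y); k <- ncol(X); Tn <- nrow(Y)
  # random-walk prior mean: 1 on each variable's own first lag, 0 elsewhere
  # (rows of the k x m coefficient matrix: intercept, lag-1 block, lag-2 block, ...)
  B_bar <- matrix(0, k, m); B_bar[cbind(1 + 1:m, 1:m)] <- 1
  alpha_bar <- as.vector(B_bar)
  E_alpha_inv <- diag(m * k) / eta                 # prior precision of alpha
  E_Sigma <- diag(m)                               # prior scale of Sigma
  Sigma <- diag(m)                                 # starting value
  XtX <- crossprod(X); XtY <- crossprod(X, Y)
  draws <- vector("list", n_draws)
  for (i in seq_len(n_draws)) {
    # alpha | Sigma ~ N(alpha_tilde, Sigma_tilde), equation (4)
    Sig_inv <- solve(Sigma)
    V <- solve(E_alpha_inv + Sig_inv %x% XtX)      # Sigma_tilde
    mu <- V %*% (E_alpha_inv %*% alpha_bar + as.vector(XtY %*% Sig_inv))
    alpha <- as.vector(mu + t(chol(V)) %*% rnorm(m * k))
    # Sigma | alpha ~ IW(E_Sigma + U'U, T + nu), equation (5)
    U <- Y - X %*% matrix(alpha, k, m)
    Sigma <- riwish(E_Sigma + crossprod(U), Tn + nu)
    draws[[i]] <- list(alpha = alpha, Sigma = Sigma)
  }
  draws
}
```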

Identification and Inference

Note that in (1), the random shocks are serially uncorrelated, but different shocks in the same period may be correlated. In macroeconomic analysis, people are mainly interested in the response of economic variables to structural shocks, i.e. exogenous and mutually independent shocks with explicit economic meaning. To achieve this, we want to find a matrix \(\mathcal{B}\) such that it transforms structural shocks \(\xi_{t}\) into reduced-form shocks \(\epsilon_{t}\)

\[ \mathcal{B} \xi_{t} = \epsilon_{t}, \xi_{t} \sim \mathcal{N}(0,D) \]

where \(D\) is a diagonal matrix. Letting \(B=\mathcal{B}^{-1}\), we can write

\[ BY_t = B\alpha + \sum_{s=1}^{p}B\beta_{s} Y_{t-s} + \xi_{t} \]

which is

\[ Y_{t} = B\alpha + (I-B)Y_{t} + \sum_{s=1}^{p} B\beta_{s}Y_{t-s} + \xi_{t} \tag{6} \]

Now it becomes inappropriate to estimate (6) by OLS: since \(Y_{t}\) itself appears on the right-hand side, \(\mathbb{E}(x_{t} \xi_{t}) \neq 0\), and the OLS estimator is inconsistent.

To better understand the problem, let’s go back to the reduced-form VAR with the structural shocks written explicitly

\[ Y_t = \alpha + \sum_{s=1}^{p} \beta_{s} Y_{t-s} + \mathcal{B} \xi_{t} \]

We can directly use OLS to estimate this equation and get \(\hat{\beta}_{s}\) and \(\hat{\epsilon}_{t}\). To estimate \(\mathcal{B}\), we utilize

\[ \text{Var}(\hat{\epsilon}_{t}) \equiv \Sigma = \mathcal{B} \mathcal{B}' \]

where we normalize \(D=\mathbb{I}_{m}\) to work with one-standard-deviation shocks. Note we have \(m^{2}\) entries of \(\mathcal{B}\) to estimate while the equality gives us only \(m(m+1)/2\) equations (\(\Sigma\) is symmetric), so \(\mathcal{B}\) is not identified: we lack \(m(m-1)/2\) further restrictions to pin down the \(\mathcal{B}\) matrix.

A possible solution is setting \(\mathcal{B}\) to be lower triangular, i.e. \(b_{ij} = 0\) if \(i < j\). This assumption used to be widely adopted because it is easy to implement and has a straightforward economic meaning: variable 1 is not affected by shocks 2 through \(m\) in the short run, variable 2 is not affected by shocks 3 through \(m\) in the short run, and so on. But more recently people have found this assumption might be too strong and may not be realistic.
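
In R this recursive scheme is a one-liner, since chol() returns the upper-triangular factor of \(\Sigma\):

```r
B_rec <- t(chol(Sigma_hat))   # lower triangular, with B_rec %*% t(B_rec) = Sigma_hat
```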

Another solution is using instrumental variables (IV), which is easy to implement but relies heavily on the availability of suitable data. It is not the focus of this project.

We focus on identification by sign restrictions, which will be illustrated later.

After identifying \(\mathcal{B}\), we can analyze the quantity we want, i.e. the response of economic variables to structural shocks, which is known as the structural impulse-response function (IRF). The structural IRF of variable \(i\) to shock \(j\), \(h\) periods after the shock takes place, is defined as

\[ \Phi_{ij}^{h} \equiv \frac{\partial y_{i,t+h}}{\partial \xi_{jt}} = \frac{\partial y_{i,t+h}}{\partial \epsilon_{t}} \frac{\partial \epsilon_{t}}{\partial \xi_{jt}}=\Psi_{i \cdot}^{h} \mathcal{B}_{\cdot j} \]

where \(\Psi^{h}\) is the reduced-form IRF, which can be easily computed once \(\hat{\beta}\) is estimated (and has been implemented in the R code).
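
As a sketch, \(\Psi^{h}\) can be computed from the companion form of the estimated coefficients (here beta_hat is the \((mp+1) \times m\) stacked coefficient matrix from the OLS sketch, with the intercept in the first row; names are mine):

```r
# reduced-form IRFs Psi^h, h = 0..H, via the companion matrix
irf_reduced <- function(beta_hat, p, H) {
  m <- ncol(beta_hat)
  A <- rbind(t(beta_hat[-1, , drop = FALSE]),         # top block [A_1 ... A_p]
             cbind(diag(m * (p - 1)), matrix(0, m * (p - 1), m)))
  Psi <- array(0, dim = c(m, m, H + 1))
  A_h <- diag(m * p)                                  # A^0
  for (h in 0:H) {
    Psi[, , h + 1] <- A_h[1:m, 1:m]                   # Psi^h = upper-left m x m block
    A_h <- A_h %*% A
  }
  Psi
}
# structural IRF of all variables to shock j at horizon h: Psi[, , h + 1] %*% B[, j]
```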

To do inference with frequentist methods, we bootstrap on \(\epsilon_{t}\) to reconstruct the data, then re-estimate the model and re-compute the IRF, typically 2000 times, and use the 5% and 95% quantiles of the computed IRFs as a confidence interval. With Bayesian methods, the sampling procedure naturally gives a credible interval for the IRF.

Sign Restrictions

Basics

Traditional sign restrictions don’t set entries of the \(\mathcal{B}\) matrix to specific values. Instead, they use economic theory to restrict the signs of entries of \(\mathcal{B}\) to be positive, negative, or zero. For example, consider the bivariate VAR(2) model

\[ \begin{bmatrix} y_{t} \\ \pi_{t} \end{bmatrix} = \begin{bmatrix} \alpha_{0} \\ \beta_{0} \end{bmatrix} + \begin{bmatrix} \alpha_{11} & \alpha_{12} \\ \beta_{11} & \beta_{12} \end{bmatrix} \begin{bmatrix} y_{t-1} \\ \pi_{t-1} \end{bmatrix} + \begin{bmatrix} \alpha_{21} & \alpha_{22} \\ \beta_{21} & \beta_{22} \end{bmatrix} \begin{bmatrix} y_{t-2} \\ \pi_{t-2} \end{bmatrix} + \mathcal{B} \begin{bmatrix} \xi_{t}^{y}\\ \xi_{t}^{\pi} \end{bmatrix} \]

Business cycle theory tells us that, faced with a demand shock, inflation should move in the same direction as the shock: \(\pi_{t} \uparrow\) when \(\xi_{t}^{y} \uparrow\). Thus, we can add sign restrictions to the \(\mathcal{B}\) matrix like

\[ \mathcal{B} \leftarrow \begin{bmatrix} \cdot & \cdot \\ + & \cdot \end{bmatrix} \]

This is very general and can be realistic. Adding restrictions to the matrix \(\mathcal{B}\) is actually equivalent to adding restrictions to the zero-order IRF \(\Phi^{0}\); of course we can also add restrictions to higher-order IRFs to restrict the (relatively) long-run effects of certain shocks on any variable.

The problem with sign restrictions is that they are difficult to implement. Rubio-Ramírez et al. (2010) give an algorithm to obtain a set of \(\mathcal{B}\) matrices satisfying sign restrictions, based on the fact that the matrix \(\mathcal{B}\) must satisfy

\[ \Sigma = \mathcal{B} \mathcal{B}' \]

which is a necessary condition for the true \(\mathcal{B}\). Also, using the Cholesky decomposition we can get

\[ \Sigma = P P' \]

where \(P\) is a lower triangular matrix. We can prove that \(\mathcal{B}\) must equal \(P\) multiplied by some orthogonal matrix \(Q\), where \(Q Q' = \mathbb{I}\). That is,

\[ \text{if} \quad \Sigma = PP', \text{then} \quad \Sigma=\mathcal{B} \mathcal{B}' \iff \mathcal{B} = PQ \]

Proof. \(\implies\): \(\mathcal{B} \mathcal{B}' = PP' \implies P^{-1}\mathcal{B} \mathcal{B}' P'^{-1}=\mathbb{I}\)

Let \(Q = P^{-1}\mathcal{B}\); then \(Q Q'= \mathbb{I}\) and \(\mathcal{B}=PQ\). Just note that \(P\) is triangular with a positive diagonal (from the Cholesky decomposition) and thus invertible.

\(\impliedby\): \(\mathcal{B} = PQ \implies \mathcal{B} \mathcal{B}' = PQ Q' P' = PP' = \Sigma\), and we are done.

This result tells us that if we find a set of appropriate orthogonal matrices \(Q\), we can get a set of matrices \(\tilde{\mathcal{B}}\) which may contain the true \(\mathcal{B}\). Here is how the algorithm works (a minimal code sketch follows the list):

  1. Draw \(\alpha\) and \(\Sigma\) from the posterior distribution, and apply the Cholesky decomposition to \(\Sigma\) to get \(P\).
  2. Generate an \(m \times m\) matrix \(X\) with all entries iid \(\mathcal{N}(0,1)\).
  3. Apply the QR decomposition to \(X\) to get \(X=QR\), where \(Q\) is orthogonal and \(R\) is upper triangular.
  4. Get “one draw” \(\tilde{\mathcal{B}}=PQ\).
  5. Check whether \(\tilde{\mathcal{B}}\), or the IRF generated by it, satisfies the sign restrictions; if yes, save this draw; if not, go back to step 1.
  6. Repeat steps 1-5 until the number of saved draws equals the number of draws you desire.
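
A minimal sketch of steps 2-5 for a single accepted draw (names are mine; for simplicity the rejection step here re-draws \(Q\) for the same \(\Sigma\) rather than returning all the way to step 1, and check is a user-supplied function encoding the sign restrictions; following Rubio-Ramírez et al. (2010), \(Q\) is normalized so that the diagonal of \(R\) is positive, which makes \(Q\) Haar-distributed):

```r
draw_B <- function(Sigma, check) {
  m <- nrow(Sigma)
  P <- t(chol(Sigma))                       # Cholesky factor from step 1
  repeat {
    W <- matrix(rnorm(m * m), m, m)         # step 2: iid N(0,1) entries
    QR <- qr(W)                             # step 3: QR decomposition
    Q <- qr.Q(QR) %*% diag(sign(diag(qr.R(QR))), m)   # normalize diag(R) > 0
    B_tilde <- P %*% Q                      # step 4: candidate draw
    if (check(B_tilde)) return(B_tilde)     # step 5: accept if signs hold
  }
}
# example: the bivariate demand-shock restriction B[2, 1] > 0
# B_draw <- draw_B(Sigma_draw, check = function(B) B[2, 1] > 0)
```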

Two points are worth noting:

  • With sign restrictions we only achieve set identification: we don’t identify an exact \(\mathcal{B}\) matrix; instead we get a set of \(\mathcal{B}\) matrices and thus a set of IRFs. We then plot the 5% and 95% quantiles of the IRFs and use the median or mean of the IRFs to mimic a point estimate.
  • About the matrix \(Q\): if \(Q\) is generated from the QR decomposition of an iid normal matrix, then \(Q\) follows the Haar distribution, which means its probability measure may be “biased” when we compute the IRF (which contains power terms of \(Q\)). Hence the frequentist “random draw” methodology may not be valid in this process; we had better turn to Bayesian methods to utilize this “belief”.

Narrative Sign Restrictions

Antolín-Díaz and Rubio-Ramírez (2018) combine sign restrictions with historical facts. They impose restrictions not only on structural parameters and IRFs, but also on the historical decomposition (HD), i.e. the cumulative effect of shock \(i\) on variable \(j\) during a certain period. In their article they use two kinds of such restrictions:

  • Type A: a given shock was the most important (least important) driver of the unexpected change in a variable during some periods. That is, the absolute value of the HD of this shock is the largest (smallest) among all shocks during those periods.
  • Type B: a given shock was the overwhelming (negligible) driver of the unexpected change in a given variable during the period. That is, the absolute value of its HD is larger (smaller) than the sum of the absolute values of the HDs of all other structural shocks.

Note that to check narrative sign restrictions, we need the HD and thus the estimated structural shocks \(\xi_{t}\), which should be computed as \(\hat{\xi}_{t} = \mathcal{B}^{-1}\hat{\epsilon}_{t}\) with \(\hat{\epsilon} = Y - X\hat{\beta}\) and a drawn \(\mathcal{B}\). This should not be difficult once the procedures for computing the Bayesian estimator, the IRF, and the HD are well implemented.
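
As a closing sketch, given one posterior draw (beta_draw, Sigma_draw), one accepted B_draw, and Psi from irf_reduced() above (computed out to horizon at least \(t-1\)), the shocks and the HD can be obtained as follows (names are mine):

```r
# structural shocks implied by one draw: xi_t = B^{-1} eps_t, row by row
eps <- Y - X %*% beta_draw            # reduced-form residuals, (T - p) x m
Xi  <- eps %*% t(solve(B_draw))       # row t holds xi_t'
# historical decomposition of shock j on variable i at time t:
# HD_{i,j,t} = sum_{h=0}^{t-1} (Psi^h B)[i, j] * xi_{j, t-h}
hd_ijt <- function(i, j, t, Psi, B, Xi) {
  sum(sapply(0:(t - 1), function(h) (Psi[, , h + 1] %*% B)[i, j] * Xi[t - h, j]))
}
```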