# Bayesian Vector Autoregressions: Estimation

## Summary and Keywords

Vector autoregressions (VARs) are linear multivariate time-series models able to capture the joint dynamics of multiple time series. Bayesian inference treats the VAR parameters as random variables and provides a framework for estimating the “posterior” probability distribution of the location of the model parameters by combining the information in a sample of observed data with prior information derived from a variety of sources, such as other macro or micro datasets, theoretical models, other macroeconomic phenomena, or introspection.

In empirical work in economics and finance, informative prior probability distributions are often adopted. These are intended to summarize stylized representations of the data generating process. For example, “Minnesota” priors, one of the most commonly adopted macroeconomic priors for the VAR coefficients, express the belief that an independent random-walk model for each variable in the system is a reasonable “center” for the beliefs about their time-series behavior. Other commonly adopted priors, the “single-unit-root” and the “sum-of-coefficients” priors, are used to enforce beliefs about relations among the VAR coefficients, such as the existence of co-integrating relationships among variables or of independent unit roots.

Priors for macroeconomic variables are often adopted as “conjugate prior distributions” (that is, distributions that yield a posterior distribution in the same family as the prior p.d.f.) in the form of Normal-Inverse-Wishart distributions, which are conjugate priors for the likelihood of a VAR with normally distributed disturbances. Conjugate priors allow direct sampling from the posterior distribution and fast estimation. When this is not possible, numerical techniques such as Gibbs and Metropolis-Hastings sampling algorithms are adopted.

Bayesian techniques allow for the estimation of an ever-expanding class of sophisticated autoregressive models that includes conventional fixed-parameter VAR models; Large VARs incorporating hundreds of variables; Panel VARs, which permit analysis of the joint dynamics of multiple time series of heterogeneous and interacting units; and VAR models that relax the assumption of fixed coefficients, such as time-varying parameter, threshold, and Markov-switching VARs.

Keywords: Bayesian inference, vector autoregression models, BVAR, SVAR, forecasting

## Introduction

Vector autoregressions (VARs) are linear multivariate time-series models able to capture the joint dynamics of multiple time series. The pioneering work of Sims (1980) proposed to replace the large-scale macroeconomic models popular in the 1960s with VARs and suggested that Bayesian methods could improve upon frequentist ones in estimating the model coefficients. Bayesian VARs (BVARs) with macroeconomic variables were first employed in forecasting by Litterman (1979) and Doan et al. (1984). Since then, VARs and BVARs have been a standard macroeconometric tool routinely used by scholars and policymakers for structural analysis, forecasting, and scenario analysis in an ever-growing number of applications.

The aim of this article is to review key ideas and contributions in the BVAR literature. A companion paper provides a brief survey of applications of BVARs in economics and finance, such as forecasting, scenario analysis, and structural identification (Miranda-Agrippino & Ricco, 2018). An exhaustive survey of the literature is beyond the scope of this article due to space limitations. Readers are referred to a number of monographs and more detailed surveys available on different topics in the BVARs literature.^{1}

Unlike frequentist statistics, Bayesian inference treats the VAR parameters as random variables and provides a framework to update probability distributions about the unobserved parameters conditional on the observed data. By providing such a framework, the Bayesian approach allows for incorporation of prior information about the model parameters into post-sample probability statements. The “prior” distributions about the location of the model parameters summarize pre-sample information available from a variety of sources, such as other macro or micro datasets, theoretical models, other macroeconomic phenomena, or introspection.

In the absence of pre-sample information, Bayesian VAR inference can be thought of as adopting “non-informative” (or “diffuse” or “flat”) priors that express complete ignorance about the model parameters, in light of the sample evidence summarized by the likelihood function (i.e., the probability density function of the data as a function of the parameters). Often, in such a case, Bayesian probability statements about the unknown parameters (conditional on the data) are similar to classical confidence statements about the probability of random intervals around the true parameter values. For example, for a VAR with Gaussian errors and a flat prior on the model coefficients, the posterior distribution of the coefficients is centered at the maximum likelihood estimator (MLE), with a variance that depends on the variance-covariance matrix of the residuals. Section "Inference in BVARs" discusses inference in BVARs and “non-informative” priors.

While non-informative priors can provide a useful benchmark, in empirical work with macroeconomic and financial variables informative priors are often adopted. In scientific data analysis, priors on the model coefficients do not incorporate the investigator’s “subjective” beliefs; instead, they summarize stylized representations of the data generating process. Conditional on a model, these widely held standardized priors aim at making the likelihood-based description of the data useful to investigators with potentially diverse prior beliefs (Sims, 2010b).^{2}

The most commonly adopted macroeconomic priors for VARs are the so-called Minnesota priors (Litterman, 1980). They express the belief that an independent random-walk model for each variable in the system is a reasonable “center” for the beliefs about their time-series behavior. While not motivated by economic theory, they are computationally convenient priors, meant to capture commonly held beliefs about how economic time series behave. Minnesota priors can be cast in the form of a Normal-Inverse-Wishart (NIW) prior, which is the conjugate prior for the likelihood of a VAR with normally distributed disturbances (see Kadiyala & Karlsson, 1997). Conjugate priors are such that the posterior distribution belongs to the same family as the prior probability distribution. Hence, they allow for analytical tractability of the posterior and computational speed. Because the data is incorporated into the posterior distribution only through the sufficient statistics, formulas for updating the prior into the posterior are in this case conveniently simple. It is often useful to think of the parameters of a prior distribution—known as “hyperparameters”—as corresponding to having observed a certain number of “dummy” or “pseudo-” observations with properties specified by the prior beliefs on the VAR parameters. Minnesota priors can be formulated in terms of artificial data featuring pseudo observations for each of the regression coefficients and that directly assert the prior on them.

Dummy observations can also implement prior beliefs about relations among the VAR coefficients, such as, for example, co-integration among variables. In this case, commonly used priors are formulated directly as linear joint stochastic restrictions among the coefficients.^{3} This is, for example, the case of the “single-unit root” prior, that is centered on a region of the VAR parameter space where either there is no intercept and the system contains at least one unit root, or the system is stationary and close to its steady state at the beginning of the sample (Sims, 1993).^{4} Another instance in which dummy observations are used to establish relations among several coefficients is the “sum-of-coefficients” prior, which incorporates the widely shared prior beliefs that economic variables can be represented by a process with unit roots and weak cross-sectional linkages (Litterman, 1979).^{5} Section "Informative Priors for Reduced-Form VARs" discusses some of the priors commonly adopted in the economic literature.

The hyperparameters can be either fixed using prior information (and sometimes “unorthodoxly” using sample information) or associated with hyperprior distributions that express beliefs about their values. A Bayesian model with more than one level of priors is called a hierarchical Bayes model. In empirical macroeconomic modeling, the hyperparameters associated with the informativeness of the prior beliefs (i.e., the tightness of the prior distribution) are usually left to the investigator’s judgment. In order to select a value for these hyperparameters, the VAR literature has adopted mostly heuristic methodologies that minimize pre-specified loss functions over a pre-sample (e.g., the out-of-sample mean squared forecast error in Litterman, 1979, or the in-sample fit in Bańbura et al., 2010). Conversely, Giannone et al. (2015) specified hyperprior distributions and chose the hyperparameters that maximize their posterior probability distribution conditional on the data. Section "Hyperpriors and Hierarchical Modeling" discusses hierarchical modeling and common approaches to choose hyperparameters not specified by prior information.

In section "Time-Varying Parameter, State-Dependent, Stochastic Volatility VARs" we discuss Bayesian inference in VAR models that relax the assumption of fixed coefficients in order to capture changes in the time-series dynamics of macroeconomic and financial variables, such as VARs with autoregressive coefficients, Threshold, and Markov Switching VARs.

Finally, in section "Bayesian Panel VARs" we discuss Panel Bayesian VARs that generalize VAR models by describing the joint dynamics of multiple time series of potentially heterogeneous and interacting units, such as the economies of several countries, regions, or sectors.

## Inference in BVARs

Vector autoregressions (VARs) are linear stochastic models that describe the joint dynamics of multiple time series. Let $y_t$ be an $n \times 1$ random vector that takes values in $\mathbb{R}^{n}$. The evolution of $y_t$ (the endogenous variables) is described by a system of $p$-th order difference equations, the VAR($p$):

$$y_t = A_1 y_{t-1} + \dots + A_p y_{t-p} + c + u_t. \tag{1}$$

In Eq. (1), $A_j$, $j = 1, \dots, p$, are $n \times n$ matrices of autoregressive coefficients, $c$ is a vector of $n$ intercepts, and $u_t$ is an $n$-dimensional vector of one-step-ahead forecast errors, or reduced-form innovations. The vector of stochastic innovations, $u_t$, is an independent and identically distributed random variable for each $t$. The distribution from which $u_t$ is drawn determines the distribution of $y_t$, conditional on its past $y_{1-p:t-1} \equiv \{y_{1-p}, \dots, y_0, \dots, y_{t-2}, y_{t-1}\}$. The standard assumption in the macroeconometric literature is that errors are Gaussian

$$u_t \sim \text{i.i.d. } \mathcal{N}(0, \Sigma). \tag{2}$$

This implies that the conditional distribution of $y_t$ is also normal.^{6}
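As a minimal illustration of the recursion in Eq. (1) with Gaussian innovations, the following Python sketch simulates a small VAR(2). The function name `simulate_var`, the coefficient values, and the zero initial conditions are assumptions chosen purely for illustration.

```python
import numpy as np

def simulate_var(A_list, c, Sigma, T, rng):
    """Simulate y_t = A_1 y_{t-1} + ... + A_p y_{t-p} + c + u_t, u_t ~ N(0, Sigma)."""
    n, p = len(c), len(A_list)
    chol = np.linalg.cholesky(Sigma)           # Sigma = chol @ chol.T
    y = np.zeros((T + p, n))                   # first p rows: assumed zero initial conditions
    for t in range(p, T + p):
        u_t = chol @ rng.standard_normal(n)    # Gaussian reduced-form innovation
        y[t] = c + u_t
        for j, A_j in enumerate(A_list, start=1):
            y[t] += A_j @ y[t - j]             # contribution of lag j
    return y

# Illustrative (hypothetical) parameter values for n = 2, p = 2.
rng = np.random.default_rng(0)
A1 = np.array([[0.5, 0.1], [0.0, 0.4]])
A2 = np.array([[0.2, 0.0], [0.1, 0.1]])
c = np.array([0.1, -0.2])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
y_sim = simulate_var([A1, A2], c, Sigma, T=200, rng=rng)
```

The array `y_sim` stacks the $p$ initial conditions on top of the $T$ simulated observations, which mirrors the way the likelihood later conditions on the first $p$ observations.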

Bayesian inference on the model in Eq. (1) amounts to updating prior beliefs about the VAR parameters, which are seen as stochastic variables, after having observed a sample $y_{1-p:t} \equiv \{y_{1-p}, \dots, y_0, \dots, y_{t-2}, y_t\}$. Prior beliefs about the VAR coefficients are summarized by a probability density function (p.d.f.), and updated using Bayes’ Law

$$p(A, \Sigma | y_{1-p:t}) = \frac{p(A, \Sigma)\, p(y_{1-p:t} | A, \Sigma)}{p(y_{1-p:t})} \propto p(A, \Sigma)\, p(y_{1-p:t} | A, \Sigma), \tag{3}$$

where we define $A \equiv [A_1, \dots, A_p, c]'$ as a $k \times n$ matrix, with $k = np + 1$. The joint posterior distribution of the VAR($p$) coefficients $p(A, \Sigma | y_{1-p:t})$ incorporates the information contained in the prior distribution $p(A, \Sigma)$, which summarizes the initial information about the model parameters, and the sample information summarized by $p(y_{1-p:t} | A, \Sigma)$. Viewed as a function of the parameters, the sample information is the likelihood function.^{7} The posterior distribution summarizes all the information available and is used to conduct inference on the VAR parameters.

Given the autoregressive structure of the model and the i.i.d. innovations, the (conditional) likelihood function of the sample observations $y_{1:T}$, conditional on $A$, $\Sigma$, and the first $p$ observations $y_{1-p:0}$, can be written as the product of the conditional distributions of each observation

$$p(y_{1:T} | A, \Sigma, y_{1-p:0}) = \prod_{t=1}^{T} p(y_t | A, \Sigma, y_{1-p:t-1}). \tag{4}$$

Under the assumption of Gaussian errors, the conditional likelihood of the VAR in Eq. (1) is

$$p(y_{1:T} | A, \Sigma, y_{1-p:0}) = \prod_{t=1}^{T} (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (y_t - A' x_t)' \Sigma^{-1} (y_t - A' x_t) \right\}, \tag{5}$$

where $x_t' \equiv [y_{t-1}' \; \dots \; y_{t-p}' \; 1]$.

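The likelihood in Eq. (5) can be written in compact form by using the seemingly unrelated regression (SUR) representation of the VAR

$$y = xA + u, \tag{6}$$

where the $T \times n$ matrices $y$ and $u$ and the $T \times k$ matrix $x$ are defined as

$$y = \begin{bmatrix} y_1' \\ \vdots \\ y_T' \end{bmatrix}, \qquad x = \begin{bmatrix} x_1' \\ \vdots \\ x_T' \end{bmatrix}, \qquad u = \begin{bmatrix} u_1' \\ \vdots \\ u_T' \end{bmatrix}. \tag{7}$$

Using this notation and standard properties of the trace operator, the conditional likelihood function can be equivalently expressed as

$$p(y_{1:T} | A, \Sigma, y_{1-p:0}) = (2\pi)^{-Tn/2} |\Sigma|^{-T/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} \widehat{S} \right] \right\} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} (A - \widehat{A})' x'x (A - \widehat{A}) \right] \right\}, \tag{8}$$

where $\widehat{A}$ is the maximum-likelihood estimator (MLE) of $A$, and $\widehat{S}$ the matrix of sums of squared residuals, that is

$$\widehat{A} = (x'x)^{-1} x'y, \qquad \widehat{S} = (y - x\widehat{A})'(y - x\widehat{A}). \tag{9}$$

The likelihood can also be written in terms of the vectorized representation of the VAR

$$\mathbf{y} = (\mathbb{I}_n \otimes x)\,\alpha + \mathbf{u}, \qquad \mathbf{u} \sim \mathcal{N}(0, \Sigma \otimes \mathbb{I}_T), \tag{10}$$

where $\mathbf{y} \equiv vec(y)$ and $\mathbf{u} \equiv vec(u)$ are $Tn \times 1$ vectors, and $\alpha \equiv vec(A)$ is $nk \times 1$. In this vectorized notation the likelihood function is written as

$$p(y_{1:T} | \alpha, \Sigma, y_{1-p:0}) = (2\pi)^{-Tn/2} |\Sigma|^{-T/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} \widehat{S} \right] \right\} \exp\left\{ -\frac{1}{2} (\alpha - \widehat{\alpha})' \left[ \Sigma^{-1} \otimes (x'x) \right] (\alpha - \widehat{\alpha}) \right\}, \tag{11}$$

where, consistently, $\widehat{\alpha} \equiv vec(\widehat{A})$ is $nk \times 1$. Detailed derivations for the multivariate Gaussian linear regression model can be found in Zellner (1971).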
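The SUR representation lends itself to a short numerical sketch. The Python code below stacks the rows $x_t' = [y_{t-1}' \dots y_{t-p}' \; 1]$ into the regressor matrix and computes the least-squares (Gaussian maximum-likelihood) estimate of $A$ together with the matrix of sums of squared residuals. The helper names `build_regressors` and `ols_var`, and the random-walk toy data, are assumptions made for illustration.

```python
import numpy as np

def build_regressors(data, p):
    """Return (y, x): rows of y are y_t', rows of x are [y_{t-1}', ..., y_{t-p}', 1]."""
    T_total, n = data.shape
    y = data[p:]                                              # T x n, conditioning on first p rows
    lags = [data[p - j:T_total - j] for j in range(1, p + 1)] # lag j aligned with y
    x = np.hstack(lags + [np.ones((T_total - p, 1))])         # T x k, k = n*p + 1
    return y, x

def ols_var(y, x):
    """Least-squares estimate of A (k x n) and residual cross-product matrix S (n x n)."""
    A_hat = np.linalg.solve(x.T @ x, x.T @ y)
    resid = y - x @ A_hat
    S_hat = resid.T @ resid
    return A_hat, S_hat

# Toy data: three independent random walks (hypothetical example data).
rng = np.random.default_rng(0)
data = np.cumsum(rng.standard_normal((120, 3)), axis=0)
y, x = build_regressors(data, p=2)
A_hat, S_hat = ols_var(y, x)
```

Because every equation shares the same regressors, a single call to `np.linalg.solve` delivers all $n$ columns of the estimate at once.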

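Given the likelihood function, Eq. (3) is used to update the prior information regarding the VAR parameters. An interesting case arises when we assume the absence of any information on the location of the model parameters. This setting can be formalized by assuming that $\alpha$ and $\Sigma$ are independently distributed, that is

$$p(\alpha, \Sigma) = p(\alpha)\, p(\Sigma), \tag{12}$$

with prior p.d.f.

$$p(\alpha) \propto \text{const}, \qquad p(\Sigma) \propto |\Sigma|^{-(n+1)/2}. \tag{13}$$

These priors are known as diffuse or Jeffreys priors (Geisser, 1965; Tiao & Zellner, 1964). Jeffreys priors are proportional to the square root of the determinant of the Fisher information matrix, and are derived from the Jeffreys “invariance principle,” meaning that the prior is invariant to re-parameterization (see Zellner, 1971).^{8}

Given this set of priors, it is straightforward to derive the posterior distribution of the VAR parameters as

$$p(\alpha, \Sigma | y) \propto |\Sigma|^{-(T+n+1)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} \widehat{S} \right] \right\} \exp\left\{ -\frac{1}{2} (\alpha - \widehat{\alpha})' \left[ \Sigma^{-1} \otimes (x'x) \right] (\alpha - \widehat{\alpha}) \right\}, \tag{14}$$

where the proportionality factor has been dropped for convenience.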

From the joint posterior in Eq. (14) one can readily deduce the form of the posterior for $\alpha$, conditional on $\Sigma$ and the observed sample. Also, the posterior can be integrated over $\alpha$ to obtain the marginal posterior for $\Sigma$. Therefore, it is possible to conveniently write the posterior distribution of the parameters as

$$p(\alpha, \Sigma | y) = p(\alpha | \Sigma, y)\, p(\Sigma | y), \tag{15}$$

where

$$\Sigma | y \sim \mathcal{I}\mathcal{W}(\widehat{S}, T - k), \tag{16}$$

$$\alpha | \Sigma, y \sim \mathcal{N}\left(\widehat{\alpha}, \Sigma \otimes (x'x)^{-1}\right). \tag{17}$$

Hence, given the diffuse priors on $\alpha$ and $\Sigma$, the posterior for the autoregressive coefficients is centered at the MLE, with posterior variance $\Sigma \otimes (x'x)^{-1}$.^{9} Interestingly, in this standard normal multivariate linear regression model, Bayesian probability statements about the parameters (given the data) have the same form as the frequentist pre-sample probability statements about the parameters’ estimator (see also Sims, 2010b). This is a more general property: Kwan (1998) has shown that, under widely applicable regularity conditions, an estimator $\widehat{\alpha}_T$ for which

$$\sqrt{T}(\widehat{\alpha}_T - \alpha) | \alpha \xrightarrow{D} \mathcal{N}(0, \Sigma) \tag{18}$$

allows the distribution of $\sqrt{T}(\alpha - \widehat{\alpha}_T) | \widehat{\alpha}_T$ to be approximated with high accuracy as $\mathcal{N}(0, \Sigma)$ in large samples. Hence, it is often possible to interpret $(1 - \rho)$ approximate confidence sets generated from the frequentist asymptotic approximate distribution as if they were sets in the parameter space with posterior probability $(1 - \rho)$.

In potentially mis-specified models for which linear regression coefficients are the object of interest, Muller (2013) proposes to adopt an artificial Gaussian posterior centered at the MLE but with a sandwich estimator for the covariance matrix. In fact, in the case of a mis-specified model, the shape of the likelihood (the posterior) is asymptotically Gaussian and centered at the MLE but of a different variance than the asymptotically normal sampling distribution of the MLE. This argument can be seen as a “flipping” of the frequentist asymptotic statement that supports the use of a sandwich estimator for the covariance matrix in mis-specified models, in line with the results in Kwan (1998).^{10}

An important case in which frequentist pre-sample probability statements and Bayesian post-sample probability statements about parameters diverge is that of time-series regression models with unit roots. In such cases, while the frequentist distribution of the estimator is skewed asymptotically, the likelihood, and hence the posterior p.d.f., remain unaffected (see Sims & Uhlig, 1991; Kim, 1994).

## Informative Priors for Reduced-Form VARs

Informative prior probability distributions incorporate information about the VAR parameters that is available before some sample is observed. Such prior information can be contained in samples of past data (from the same or a related system), or can be elicited from introspection, casual observation, and theoretical models. The first case is sometimes referred to as a “data-based” prior, while the second as a “nondata-based” prior.

An important case arises when the prior probability distribution yields a posterior distribution for the parameters in the same family as the prior p.d.f. In this case the prior is called a “natural conjugate prior” for the likelihood function (Raiffa & Schlaifer, 1961). In general, it has been shown that distributions in the exponential family are the only class of distributions that admit a natural conjugate prior, because they have a fixed number of sufficient statistics that does not increase as the sample size $T$ increases (see, e.g., Gelman et al., 2013). Because the data is incorporated into the posterior distribution only through the sufficient statistics, formulas for updating the prior into the posterior are in these cases conveniently simple.

Prior distributions can be expressed in terms of coefficients, known as hyperparameters, whose functions are sufficient statistics for the model parameters. It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of pseudo-observations with properties specified by the priors on the parameters. In general, for nearly all conjugate prior distributions, the hyperparameters can be interpreted in terms of “dummy” or pseudo-observations. The basic idea is to add to the observed sample extra “data” that express prior beliefs about the hyperparameters. The prior then takes the form of the likelihood function of these dummy observations. Hyperparameters can be either fixed using prior information, or associated to hyperprior distributions that express beliefs about their values. A Bayesian model with more than one level of priors is called a “hierarchical Bayes model.” In this section we review some of the most commonly used priors for VARs with macroeconomic and financial variables, while we discuss the choice of the hyperpriors and hierarchical modeling in section "Hyperpriors and Hierarchical Modeling."

## Natural Conjugate Normal-Inverse Wishart Priors

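The Normal-Inverse Wishart (NIW) conjugate priors, part of the exponential family, are commonly used prior distributions for ($A$, $\Sigma$) in VARs with Gaussian errors. These assume a multivariate normal distribution for the regression coefficients, and an Inverse Wishart specification for the covariance matrix of the error term, and can be written as

$$\Sigma \sim \mathcal{I}\mathcal{W}(\underline{S}, \underline{d}), \tag{19}$$

$$\alpha | \Sigma \sim \mathcal{N}(\underline{\alpha}, \Sigma \otimes \underline{\Omega}), \tag{20}$$

where $(\underline{S}, \underline{d}, \underline{\alpha}, \underline{\Omega})$ are the priors’ hyperparameters; $\underline{d}$ and $\underline{S}$ denote, respectively, the degrees of freedom and the scale of the prior Inverse-Wishart distribution for the variance-covariance matrix of the residuals. $\underline{\alpha}$ is the prior mean of the VAR coefficients, and $\underline{\Omega}$ acts as a prior on the variance-covariance matrix of the dummy regressors.^{11} The posterior distribution can be analytically derived and is given by

$$\Sigma | y \sim \mathcal{I}\mathcal{W}(\overline{S}, \overline{d}), \tag{21}$$

$$\alpha | \Sigma, y \sim \mathcal{N}(\overline{\alpha}, \Sigma \otimes \overline{\Omega}), \tag{22}$$

where, writing $\underline{\alpha} \equiv vec(\underline{A})$ and $\overline{\alpha} \equiv vec(\overline{A})$,

$$\overline{d} = \underline{d} + T, \qquad \overline{\Omega} = \left(\underline{\Omega}^{-1} + x'x\right)^{-1}, \qquad \overline{A} = \overline{\Omega}\left(\underline{\Omega}^{-1}\underline{A} + x'y\right),$$

$$\overline{S} = \underline{S} + \widehat{S} + \widehat{A}'x'x\widehat{A} + \underline{A}'\underline{\Omega}^{-1}\underline{A} - \overline{A}'\left(\underline{\Omega}^{-1} + x'x\right)\overline{A}.$$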

Comparing Eqs. (16)–(17) to Eqs. (19)–(20), it is evident that informative priors can be thought of as equivalent to having observed dummy observations ($y_d$, $x_d$) of size $T_d$, such that

$$\underline{S} = (y_d - x_d\underline{A})'(y_d - x_d\underline{A}), \qquad \underline{d} = T_d - k, \qquad \underline{\alpha} = vec(\underline{A}), \qquad \underline{A} = (x_d'x_d)^{-1}x_d'y_d, \qquad \underline{\Omega} = (x_d'x_d)^{-1}.$$

This idea was first proposed for a classical estimator for stochastically restricted coefficients by Theil (1963). Once a set of pseudo-observations able to match the desired hyperparameters is found, the posterior can be equivalently estimated using the extended samples $y_* = [y', y_d']'$, $x_* = [x', x_d']'$ of size $T_* = T + T_d$ obtaining

Indeed, it is easy to verify that the posterior moments obtained with the starred variables coincide with those in Eqs. (21)–(22). The posterior estimator efficiently combines sample and prior information using their precisions as weights in the spirit of the mixed estimation of Theil and Goldberger (1961). Posterior inference can be conducted via direct sampling.
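The equivalence between a conjugate update and least squares on an augmented sample can be checked numerically. The sketch below assumes the standard precision-weighted posterior mean for a Gaussian regression with a Kronecker-structured normal prior, $\overline{A} = (\underline{\Omega}^{-1} + x'x)^{-1}(\underline{\Omega}^{-1}\underline{A} + x'y)$, and builds dummy observations satisfying $x_d'x_d = \underline{\Omega}^{-1}$ and $x_d'y_d = \underline{\Omega}^{-1}\underline{A}$. All variable names and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, k, n = 60, 4, 2
x = rng.standard_normal((T, k))            # toy regressors
y = rng.standard_normal((T, n))            # toy dependent variables

A_prior = np.zeros((k, n))                 # assumed prior mean of A
A_prior[0, 0] = A_prior[1, 1] = 1.0
Omega_prior = 0.5 * np.eye(k)              # assumed prior scale matrix

# Precision-weighted posterior mean of A (standard conjugate result, assumed here).
Omega_inv = np.linalg.inv(Omega_prior)
A_post = np.linalg.solve(Omega_inv + x.T @ x, Omega_inv @ A_prior + x.T @ y)

# Dummy observations: x_d'x_d = Omega_prior^{-1}, x_d'y_d = Omega_prior^{-1} A_prior.
x_d = np.linalg.cholesky(Omega_inv).T
y_d = x_d @ A_prior

# Ordinary least squares on the augmented (starred) sample.
x_star = np.vstack([x, x_d])
y_star = np.vstack([y, y_d])
A_star = np.linalg.solve(x_star.T @ x_star, x_star.T @ y_star)

print(np.allclose(A_post, A_star))
```

The Cholesky factor is only one convenient way to construct dummies with the required cross-products; any $x_d$ with the same $x_d'x_d$ would give the same posterior mean.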

Algorithm 1: Direct Monte Carlo Sampling From Posterior of VAR Parameters

For $s = 1, \dots, n_{sim}$:

1. Draw $\Sigma^{(s)}$ from the Inverse-Wishart distribution $\Sigma | y \sim \mathcal{I}\mathcal{W}(S_*, T_* + \underline{d})$.
2. Draw $A^{(s)}$ from the Normal distribution $A^{(s)} | \Sigma^{(s)}, y \sim \mathcal{N}\left(\alpha_*, \Sigma^{(s)} \otimes (x_*'x_*)^{-1}\right)$.
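The two steps of the algorithm can be sketched in Python as follows. The sketch takes the posterior scale matrix, degrees of freedom, mean, and regressor cross-product as given inputs, so it is agnostic about how they were obtained. The function names, the inverse-Wishart construction (inverting a Wishart draw with integer degrees of freedom), and the numerical inputs are assumptions for illustration rather than a reference implementation.

```python
import numpy as np

def draw_inverse_wishart(S, dof, rng):
    """Draw Sigma ~ IW(S, dof) for integer dof >= n by inverting a Wishart(dof, S^{-1}) draw."""
    n = S.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(S))        # S^{-1} = L L'
    Z = rng.standard_normal((dof, n)) @ L.T         # rows are N(0, S^{-1})
    return np.linalg.inv(Z.T @ Z)

def sample_posterior(A_mean, xtx, S, dof, n_sim, rng):
    """Direct Monte Carlo: Sigma from IW, then A | Sigma from a matrix normal."""
    k, n = A_mean.shape
    L_x = np.linalg.cholesky(np.linalg.inv(xtx))    # factor of (x'x)^{-1}
    A_draws = np.empty((n_sim, k, n))
    Sigma_draws = np.empty((n_sim, n, n))
    for s in range(n_sim):
        Sigma = draw_inverse_wishart(S, dof, rng)   # step 1
        L_s = np.linalg.cholesky(Sigma)
        # step 2: vec(A) ~ N(vec(A_mean), Sigma kron (x'x)^{-1})
        A_draws[s] = A_mean + L_x @ rng.standard_normal((k, n)) @ L_s.T
        Sigma_draws[s] = Sigma
    return A_draws, Sigma_draws

# Hypothetical posterior inputs for a system with n = 2 and k = 3.
rng = np.random.default_rng(2)
A_mean = np.array([[0.9, 0.0], [0.0, 0.8], [0.1, -0.1]])
xtx = 50.0 * np.eye(3)
S = 40.0 * np.array([[2.0, 0.5], [0.5, 1.0]])
A_draws, Sigma_draws = sample_posterior(A_mean, xtx, S, dof=40, n_sim=2000, rng=rng)
```

Because the draws are independent, no burn-in or convergence diagnostics are needed, which is the practical appeal of conjugate priors.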

When, unlike in this case, it is not possible to sample directly from the posterior distribution, Markov chain Monte Carlo (MCMC) algorithms are usually adopted (see, e.g., Chib, 2001).^{12}

An important feature of the NIW priors in Eqs. (19)–(20) is the Kronecker factorization that appears in the Gaussian prior for *α*. As discussed in the previous section, because the same set of regressors appears in each equation, homoskedastic VARs can be written as SUR models. This symmetry across equations means that homoskedastic VAR models have a Kronecker factorization in the likelihood, which in turn implies that estimation can be broken into *n* separate least-squares calculations, each only of dimension *np* + 1. The symmetry in the likelihood can be inherited by the posterior, if the prior adopted also features a Kronecker structure as in Eq. (20). This is a desirable property that guarantees tractability of the posterior p.d.f. and computational speed. However, such a specification can result in unappealing restrictions and may not fit the actual prior beliefs one has—see discussions in Kadiyala and Karlsson (1997) and Sims and Zha (1998). In fact, it forces symmetry across equations, because the coefficients of each equation have the same prior variance matrix (up to a scale factor given by the elements of Σ). There may be situations in which theory suggests “asymmetric restrictions” may be desirable instead, for example, money neutrality implies that the money supply does not Granger-cause real output.^{13} Also, the Kronecker structure implies that prior beliefs must be correlated across the equations of the reduced form representation of the VAR, with a correlation structure that is proportional to that of the disturbances.

## Minnesota Prior

In macroeconomic and financial applications, the parameters of the NIW prior in Eqs. (19)–(20) are often chosen so that prior expectations and variances of $A$ coincide with the so-called Minnesota prior, which was originally proposed in Litterman (1980, 1986).^{14} The basic intuition behind this prior is that the behavior of most macroeconomic variables is well approximated by a random walk with drift. Hence, it “centers” the distribution of the coefficients in $A$ at a value that implies a random-walk behavior for all the elements in $y_t$:

$$y_t = c + y_{t-1} + u_t. \tag{32}$$

While not motivated by economic theory, these are computationally convenient priors, meant to capture commonly held beliefs about how economic time series behave.

The Minnesota prior assumes the coefficients $A_1, \dots, A_p$ to be a priori independent and normally distributed, with the following moments

In Eq. (33), $(A_\ell)_{ij}$ denotes the coefficient of variable $j$ in equation $i$ at lag $\ell$. In the original formulation of the prior, $\delta_i = 1$, in accordance with Eq. (32). The random-walk assumption, however, may not be appropriate if the variables in $y_t$ are characterized by substantial mean reversion. For stationary series, or series that have been transformed to achieve stationarity, Bańbura et al. (2010) center the distribution around zero (i.e., $\delta_i = 0$). The prior also assumes that lags of other variables are less informative than own lags, and that the most recent lags of a variable tend to be more informative than more distant lags. This intuition is formalized with $f(\ell)$. A common choice for this function is a harmonic lag decay, that is, $f(\ell) = \ell^{\lambda_2}$, a special case of which is $f(\ell) = \ell$, where the severity of the lag decay is regulated by the hyperparameter $\lambda_2$. The factor $\Sigma_{ij}/\omega_j^2$ accounts for the different scales of variables $i$ and $j$. The hyperparameters $\omega_j^2$ are often fixed using sample information, for example from univariate regressions of each variable onto its own lags.

Importantly, $\lambda_1$ is a hyperparameter that controls the overall tightness of the random walk prior. If $\lambda_1 = 0$ the prior information dominates, and the VAR reduces to a vector of univariate models. Conversely, as $\lambda_1 \to \infty$ the prior becomes less informative, and the posterior mostly mirrors sample information. We discuss the choice of the free hyperparameters in section "Hyperpriors and Hierarchical Modeling."
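To make the roles of $\lambda_1$, $\lambda_2$, and $\delta_i$ concrete, the following Python sketch tabulates prior means and variances for every coefficient $(A_\ell)_{ij}$. It assumes a commonly used Litterman-type specification in which the own-lag variance is $\lambda_1^2 / \ell^{\lambda_2}$ and cross-variable variances are additionally scaled by $\omega_i^2/\omega_j^2$. This particular scaling, the function name `minnesota_moments`, and the numerical values are assumptions and need not coincide with the exact moments in Eq. (33).

```python
import numpy as np

def minnesota_moments(n, p, lam1, lam2, omega2, delta):
    """Prior mean and variance of (A_l)_{ij}; arrays indexed as [lag - 1, i, j]."""
    mean = np.zeros((p, n, n))
    var = np.zeros((p, n, n))
    mean[0] = np.diag(delta)                    # own first lag centred at delta_i
    for lag in range(1, p + 1):
        decay = lag ** lam2                     # harmonic lag decay f(l) = l^lam2
        for i in range(n):
            for j in range(n):
                # assumed scale correction for coefficients on other variables
                scale = 1.0 if i == j else omega2[i] / omega2[j]
                var[lag - 1, i, j] = lam1 ** 2 / decay * scale
    return mean, var

# Hypothetical settings: two variables, two lags, random-walk centring.
omega2 = np.array([1.0, 4.0])                   # e.g. residual variances of univariate ARs
mean, var = minnesota_moments(n=2, p=2, lam1=0.2, lam2=2.0,
                              omega2=omega2, delta=np.ones(2))
```

With these settings the prior standard deviation of an own first-lag coefficient equals $\lambda_1 = 0.2$, and the variance at the second lag is one quarter of that at the first, which illustrates how more distant lags are shrunk more tightly toward zero.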

The Minnesota prior can be implemented using dummy observations. Priors on the A coefficients are implemented via the following pseudo-observations

where $J_p = diag([1^{\lambda_2}, 2^{\lambda_2}, \dots, p^{\lambda_2}])$ with geometric lag decay.^{15} To provide intuition on how the prior is implemented using artificial observations, we consider the simplified case of an $n = 2$, $p = 2$ VAR for the pseudo-observations. The first $n$ rows of Eq. (34) impose priors on $A_1$; that is, on the coefficients of the first lag. In the $n = 2$, $p = 2$ case one obtains,

that implies, for example, the following equations for the elements (1,1) and (1, 2) of *A*_{1}

Similar restrictions are obtained for the elements (2,1) and (2,2) of $A_1$. The following $n(p - 1)$ rows in Eq. (34) implement priors on the coefficients of the other lags. In fact, we readily obtain

which for example implies the following restriction for the element (1,1) of *A*_{2}

Similar restrictions obtain for the other elements of $A_2$. Prior beliefs on the residual covariance matrix $\Sigma$ can instead be implemented by the following block of dummies

In the *n* = 2, *p* = 2 case, they correspond to appending to the VAR equations *λ*_{3} replications of

$\lambda_3$ is the hyperparameter that determines the tightness of the prior on $\Sigma$. To understand how this works, it is sufficient to consider that given $\lambda_3$ artificial observations $z_i$, with $z_i \sim \mathcal{N}(0, \sigma_z^2)$, an estimator for the variance is given by $\lambda_3^{-1} \sum_{i=1}^{\lambda_3} z_i^2$.

Finally, uninformative priors for the intercept are often implemented with the following set of pseudo-observations

where $\epsilon$ is a hyperparameter usually set to a very small number.^{16}

## Priors for VAR With Unit Roots and Trends

Sims (1996, 2000) observed that flat-prior VARs, or more generally estimation methods that condition on initial values, tend to attribute an implausibly large share of the variation in observed time series to deterministic (and hence entirely predictable) components. The issue stems from the fact that ML and OLS estimators that condition on the initial observations and treat them as non-stochastic do not apply any penalization to parameter values that imply that these observations are distant from the variables’ steady state (or their trend if non-stationary). As a consequence, complex transient dynamics from the initial conditions to the steady state are treated as plausible and can explain an “implausibly” large share of the low-frequency variation of the data. This typically translates into poor out-of-sample forecasts. To understand the intuition, consider the univariate model

$$y_t = c + a y_{t-1} + u_t. \tag{40}$$

Iterating Eq. (40) backward yields

$$y_t = \left[ a^t y_0 + \sum_{j=0}^{t-1} a^j c \right] + \left[ \sum_{j=0}^{t-1} a^j u_{t-j} \right], \tag{41}$$

which, if $|a| < 1$, reduces to

$$y_t = \left[ \frac{c}{1-a} + a^t \left( y_0 - \frac{c}{1-a} \right) \right] + \left[ \sum_{j=0}^{t-1} a^j u_{t-j} \right]. \tag{42}$$

The first term in square brackets in Eq. (41) is the deterministic component: the evolution of $y_t$ from the initial conditions $y_0$, absent any shocks. The second term instead captures the stochastic evolution of $y_t$ due to the shocks realized between $[0, t - 1]$. The term $c/(1 - a)$ in Eq. (42) is the unconditional mean of $y_t$. If $y_t$ is close to non-stationary (i.e., $a \simeq 1$), the MLE of the unconditional mean of $y_t$ may be very far from $y_0$, and the “reversion to the mean” from $y_0$ is then used to fit the data (see Eq. 42).
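The split between deterministic and stochastic components can be illustrated with a short Python sketch. It assumes the AR(1) form $y_t = c + a y_{t-1} + u_t$ with a given initial value $y_0$. The function name `decompose_ar1` and the parameter values are hypothetical, chosen so that the process is persistent and starts far from its unconditional mean.

```python
import numpy as np

def decompose_ar1(y0, c, a, shocks):
    """Return the deterministic and stochastic parts of y_1, ..., y_T."""
    T = len(shocks)
    det = np.empty(T)
    sto = np.empty(T)
    d, s = y0, 0.0
    for t in range(T):
        d = c + a * d              # path from y0 absent any shocks
        s = a * s + shocks[t]      # accumulated, discounted shocks
        det[t], sto[t] = d, s
    return det, sto

rng = np.random.default_rng(3)
y0, c, a = 5.0, 0.02, 0.98         # unconditional mean c / (1 - a) = 1, far from y0
shocks = 0.1 * rng.standard_normal(50)
det, sto = decompose_ar1(y0, c, a, shocks)

# Direct simulation of the same process for comparison.
y = np.empty(50)
prev = y0
for t in range(50):
    prev = c + a * prev + shocks[t]
    y[t] = prev
```

In this example the deterministic component drifts from 5 toward 1 over the sample, so a large share of the low-frequency movement in `y` is attributed to the transition from the initial condition rather than to shocks.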

One way to deal with this issue is to use the unconditional likelihood, by explicitly incorporating the density of the initial observations in the inference. However, because most macroeconomic time series are effectively nonstationary, it is not obvious how the density of the initial observations should be specified.^{17} Another approach, proposed by Sims and Zha (1998) and Sims (2000), is instead to specify priors that downplay the importance of the initial observations and hence reduce the explanatory power of the deterministic component.

These types of priors, implemented through artificial observations, aim to reduce the importance that the deterministic component has in explaining a large share of the in-sample variation of the data, eventually improving forecasting performances out-of-sample (see Sims, 1996; Sims & Zha, 1998, for a richer discussion on this point).^{18}

The “co-persistence” (or “single-unit-root” or “dummy initial observation”) prior (Sims, 1993) reflects the belief that when all lagged $y_t$’s are at some level $\overline{y}_0$, $y_t$ tends to persist at that level. It is implemented using the following artificial observation

where $\overline{y}_{0,i}$, $i = 1, \dots, n$, are the averages of the initial values of each variable, usually set equal to the average of the first $p$ observations in the sample, and $k = np + 1$. Writing down the implied system of equations $y_d^{(4)} = A x_d^{(4)} + u_d^{(4)}$ one obtains the following stochastic restriction on the VAR coefficients

where $\mathbb{I}_n - A(1) = (\mathbb{I}_n - A_1 - \dots - A_p)$. The hyperparameter $\lambda_4$ controls the tightness of this stochastic constraint. The prior is uninformative for $\lambda_4 \to \infty$. Conversely, as $\lambda_4 \to 0$ the model tends to a form where either there is at least one explosive common unit root and the constant $c$ is equal to zero ($\overline{y}_0$ is the eigenvector of the unit root), or the VAR is stationary, $c$ is different from zero, and the initial conditions are close to the implied unconditional mean $(\overline{y}_0 = [\mathbb{I}_n - A(1)]^{-1} c)$. In the stationary form, this prior does not rule out cointegrated models. This prior induces prior correlation among all the VAR coefficients in each equation, including the constant.^{19}
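A possible construction of this artificial observation is sketched below in Python. It assumes the form commonly used in the literature, $y_d = \overline{y}_0'/\lambda_4$ and $x_d = [\overline{y}_0', \dots, \overline{y}_0', 1]/\lambda_4$, with the regressors ordered as lags first and the constant last. This form, the function name `single_unit_root_dummy`, and the numbers are assumptions for illustration.

```python
import numpy as np

def single_unit_root_dummy(y_init, p, lam4):
    """One artificial observation: y_d is 1 x n, x_d is 1 x k with k = n*p + 1."""
    y_bar = y_init.mean(axis=0)                           # average of the first p observations
    y_d = (y_bar / lam4)[None, :]
    x_d = (np.concatenate([np.tile(y_bar, p), [1.0]]) / lam4)[None, :]
    return y_d, x_d

# Hypothetical initial observations for n = 2 variables and p = 2 lags.
y_init = np.array([[1.0, 2.0], [3.0, 4.0]])
y_d, x_d = single_unit_root_dummy(y_init, p=2, lam4=0.5)

n, p = 2, 2
# Case 1: pure random walk (A_1 = I, A_2 = 0, c = 0); rows of A are [A_1'; A_2'; c'].
A_rw = np.vstack([np.eye(n), np.zeros((n, n)), np.zeros((1, n))])
# Case 2: stationary VAR whose unconditional mean equals y_bar (A_1 = 0.5 I, c = 0.5 y_bar).
y_bar = y_init.mean(axis=0)
A_st = np.vstack([0.5 * np.eye(n), np.zeros((n, n)), 0.5 * y_bar[None, :]])

resid_rw = y_d - x_d @ A_rw
resid_st = y_d - x_d @ A_st
```

Both coefficient configurations fit the dummy observation exactly, which matches the two limiting cases described in the text: a unit root with no constant, or a stationary system that starts at its unconditional mean.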

The “sums-of-coefficients” (or “no-cointegration”) prior (Doan et al., 1984) captures the belief that when the average of the lagged values of a variable *y _{j,t}* is at some level ${\overline{y}}_{0,j}$, then ${\overline{y}}_{0,j}$ is likely to be a good forecast of *y _{j,t}*. It also implies that knowing the average of lagged values of variable *j* does not help in predicting a variable *i* ≠ *j*. This prior is implemented using *n* artificial observations, one for each variable in *y _{t}*

The prior implied by these dummy observations is centered at 1 for the sum of coefficients on own lags for each variable, and at 0 for the sum of coefficients on other variables’ lags. It also introduces correlation among the coefficients of each variable in each equation. In fact, it is easy to show that equation by equation this prior implies the stochastic constraint

where
${({A}_{\ell})}_{jj}$ denotes the coefficient of variable *j* in equation *j* at lag
$\ell $. The hyperparameter *λ*_{5} controls the variance of these prior beliefs. As
${\lambda}_{5}\to \infty $ the prior becomes uninformative, while
${\lambda}_{\text{5}}\to \text{0}$ implies that each variable is an independent unit-root process, and there are no co-integration relationships.^{20}
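As a minimal sketch of how the two sets of artificial observations described above could be built, the Python snippet below assumes a row-stacked convention *Y_d* = *X_d* *B*, where *B* is the *k* × *n* matrix stacking the transposed lag matrices followed by the constant, and regressors are ordered as *p* lags and then the intercept. The function names and this ordering are illustrative assumptions, not the article's notation.

```python
import numpy as np

def single_unit_root_dummy(y0_bar, p, lam4):
    """One artificial observation for the co-persistence prior (assumed layout)."""
    y_d = (y0_bar / lam4)[None, :]                                   # 1 x n
    # p copies of the initial averages, then the intercept column, all scaled by 1/lam4
    x_d = np.concatenate([np.tile(y0_bar, p), [1.0]])[None, :] / lam4  # 1 x k
    return y_d, x_d

def sum_of_coefficients_dummy(y0_bar, p, lam5):
    """n artificial observations for the sums-of-coefficients prior (assumed layout)."""
    n = y0_bar.size
    y_d = np.diag(y0_bar) / lam5                                     # n x n
    # the same diagonal block for each lag, and a zero column for the intercept
    x_d = np.hstack([np.tile(y_d, (1, p)), np.zeros((n, 1))])        # n x k
    return y_d, x_d

# Illustrative data: three simulated random walks, two lags
rng = np.random.default_rng(0)
n, p = 3, 2
data = rng.normal(size=(50, n)).cumsum(axis=0)
y0_bar = data[:p].mean(axis=0)          # average of the first p observations

yd4, xd4 = single_unit_root_dummy(y0_bar, p, lam4=1.0)
yd5, xd5 = sum_of_coefficients_dummy(y0_bar, p, lam5=1.0)

# Independent random walks without drift: A_1 = I, all other coefficients zero
B_rw = np.vstack([np.eye(n), np.zeros((n * (p - 1) + 1, n))])
resid4 = yd4 - xd4 @ B_rw
resid5 = yd5 - xd5 @ B_rw
```

Under these assumptions, the independent random-walk coefficients fit both sets of dummy observations with zero residuals, which is the sense in which the priors are centered on unit-root behavior; smaller values of `lam4` and `lam5` scale the dummies up and so tighten the constraint.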

The Bayesian analysis of cointegrated VARs is an active area of research (a detailed survey is in Koop et al., 2006).^{21} Giannone et al. (2016) elicited theory-based priors for the long run of persistent variables that shrink toward a random walk those linear combinations of variables that are likely to have a unit root. Conversely, combinations that are likely to be stationary (i.e., cointegrating relationships among variables) are shrunk toward stationary processes. Operationally, this is achieved by rewriting the VAR in Eq. (1) as

where
$\Pi ={A}_{1}+\dots +{A}_{p}-{\mathbb{I}}_{n}$, ${P}_{j}=-({A}_{j+1}+\dots +{A}_{p})$, and *F* is any invertible *n*-dimensional matrix. The problem is then specified as setting a prior for
$\tilde{\Pi}\equiv \Pi {F}^{-1}$, conditional on a specific choice of *F*. *F* defines the relevant linear combinations of the variables in *y _{t}* that macroeconomic theory suggests to be a priori stationary or otherwise.

Another alternative is in Villani (2009). Here the VAR is written as


where *ρ*_{0} and *ρ*_{1} are *n* × 1 vectors. The first term, *ρ*_{0} + *ρ*_{1}*t*, captures a linear deterministic trend of *y _{t}*, whereas the law of motion of ${\stackrel{~}{y}}_{t}$ captures stochastic fluctuations around the deterministic trend, which can be either stationary or non-stationary. This alternative specification makes it possible to separate beliefs about the deterministic trend component from beliefs about the persistence of fluctuations around this trend. Let *A* = [*A*_{1}, . . ., *A _{p}*]′ and $\rho ={\left[{\rho}_{0}^{\prime},{\rho}_{1}^{\prime}\right]}^{\prime}$. It can be shown that if the prior distribution of *ρ* conditional on *A* and Σ is Normal, the (conditional) posterior distribution of *ρ* is also Normal (see also Del Negro & Schorfheide, 2011, for details). Hence, posterior inference can be implemented via Gibbs sampling.

## Priors From Structural Models

DeJong et al. (1993), Ingram and Whiteman (1994), and Del Negro and Schorfheide (2004) have proposed the use of priors for VARs that are derived from Dynamic Stochastic General Equilibrium (DSGE) models. This approach bridges VARs and DSGEs by constructing families of prior distributions informed by the restrictions that a DSGE-model implies on the VAR coefficients. This modeling approach is sometimes referred to as DSGE-VAR. Ingram and Whiteman (1994) derived prior information from the basic stochastic growth model of King et al. (1988) and reported that a BVAR based on the Real Business Cycle model prior outperforms a BVAR with a Litterman prior in forecasting real economic activity. Del Negro and Schorfheide (2004) extend and generalize this approach and show how to conduct policy simulations within this framework.

Schematically, the exercises can be thought of as follows. First, time series are simulated from a DSGE model. Second, a VAR is estimated from these simulated data. Population moments of the simulated data computed from the DSGE model solution are considered in place of sample moments. Since the DSGE model depends on unknown structural parameters, hierarchical prior modeling is adopted by specifying a distribution on the DSGE model parameters. A tightness parameter controls the weight of the DSGE model prior relative to the weight of the actual sample. Finally, Markov Chain Monte Carlo methods are used to generate draws from the joint posterior distribution of the VAR and DSGE model parameters.

## Priors for Model Selection

It is standard practice in VAR models to pre-select the relevant variables to be included in the system (and with how many lags). This procedure may be thought of as having dogmatic priors about which variables have non-zero coefficients in the system. The challenge is in selecting among an expansive set of potential models. Indeed, for a VAR with *n* endogenous variables, *q* additional potentially exogenous variables including a constant, and *p* lags, there are 2^{(q+pn)n+n(n–1)/2} possible models.
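To give a sense of scale, the count above can be evaluated directly. The short Python sketch below simply computes the exponent and the resulting number of models; the function name and the example dimensions are illustrative assumptions.

```python
def n_candidate_models(n, p, q):
    """Number of candidate models, 2 ** ((q + p*n)*n + n*(n-1)/2), as quoted in the text."""
    exponent = (q + p * n) * n + n * (n - 1) // 2   # n*(n-1) is always even
    return 2 ** exponent

# A small system: 3 endogenous variables, 2 lags, a constant only (q = 1)
small = n_candidate_models(n=3, p=2, q=1)    # exponent is 21 + 3 = 24

# A medium-sized quarterly system: 7 variables, 4 lags, a constant only
medium = n_candidate_models(n=7, p=4, q=1)   # exponent is 203 + 21 = 224
```

Even the small system already implies 2^{24} (about 16.8 million) candidate models, which illustrates why exhaustive comparison is impractical and why priors that automate the selection are attractive.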

Jarociński and Maćkowiak (2017) proposed to select the variables to be included in the system by systematically assessing the posterior probability of “Granger causal priority” (Sims, 2010a) in a BVAR with conjugate priors. Granger causal priority answers questions of the form “Is variable z relevant for variable x, after controlling for other variables in the system?” The authors provide a closed form expression for the posterior probability of Granger causal priority and suggest that variables associated with high Granger causal priority probabilities can be omitted from a VAR with the variables of interest.

Alternatively, one can adopt priors that support model selection and enforce sparsity. A variety of techniques, including double exponential (Laplace) priors and spike-and-slab priors, among others, have been adopted to handle this issue. Theoretical and empirical contributions on this topic include Mitchell and Beauchamp (1988), George et al. (2008), Koop (2013), Korobilis (2013), Bhattacharya et al. (2015), Griffin and Brown (2010, 2017), Giannone et al. (2017), and Huber and Feldkircher (2017).

## Hyperpriors and Hierarchical Modeling

As seen in the previous section, the informativeness of prior beliefs on the VAR parameters often depends on a set of free hyperparameters. Let *λ* ≡ [*λ*_{1}, *λ*_{2},. . .] denote the vector collecting all the hyperparameters not fixed using (pre)sample information, and *θ* denote all the VAR parameters (i.e., *A* and Σ). The prior distribution of *θ* is thus effectively *p*_{λ}(*θ*). Choosing a value for *λ* alters the tightness of the prior distribution and hence determines how strictly the prior is enforced on the data.

In order to set the informativeness of the prior distribution of the VAR coefficients, the literature has initially used mostly heuristic methodologies. Litterman (1980) and Doan et al. (1984), for example, choose a value for the hyperparameters that maximizes the out-of-sample forecasting performance over a pre-sample. Conversely, Bańbura et al. (2010) proposed to choose the shrinkage parameters that yield a desired in-sample fit in order to control for overfitting. Subsequent studies have then either used these as “default” values or adopted either one of these criteria. Robertson and Tallman (1999), Wright (2009), and Giannone et al. (2014) opt for the first, while for example, Giannone et al. (2008), Bloor and Matheson (2011), Carriero et al. (2009), and Koop (2013) follow Bańbura et al. (2010).

In VARs, Giannone et al. (2015) observed that, from a purely Bayesian perspective, choosing *λ* is conceptually identical to conducting inference on any other unknown parameter of the model. Specifically, the model is interpreted as a hierarchical one (Berger, 1985; Koop, 2003) and *λ* can be chosen as the maximizer of

$$p(\lambda |\mathbf{y})\propto p(\mathbf{y}|\lambda ,{y}_{1-p:0})\,p(\lambda ).\qquad (49)$$

This method is also known in the literature as the Maximum Likelihood Type II (ML-II) approach to prior selection (Berger, 1985; Canova, 2007). In Eq. (49), *p*(*λ*|**y**) is the posterior distribution of *λ* conditional on the data, and *p*(*λ*) denotes a prior probability density specified on the hyperparameters themselves, also known as the hyperprior distribution. In such a hierarchical model, the prior distribution for the VAR coefficients is treated as a conditional prior, that is, *p _{λ}*(*θ*) is replaced by *p*(*θ*|*λ*). In the case of a NIW family of distributions, the prior structure becomes *p*(*α*|Σ, *λ*) *p*(Σ|*λ*) *p*(*λ*). *p*(**y**|*λ*, *y*_{1–p:0}) is the marginal likelihood (ML) and is obtained as the density of the data as a function of *λ*, after integrating out all the VAR parameters. Conveniently, with conjugate priors the ML is available in closed form.

Conversely, the joint posterior of *α*, Σ and *λ* is not available in closed form. However, with NIW priors for *θ*, Giannone et al. (2015) set up the following Metropolis-Hastings sampler for the joint distribution

Algorithm 2: MCMC Sampler for a VAR with Hierarchical Prior.

For *s* = 1, . . ., *n _{sim}*:

1. Draw a candidate vector *λ*^{*} from the random walk distribution ${\lambda}^{*}\sim \mathcal{N}\left({\lambda}^{(s-1)},\kappa {H}^{-1}\right)$, where *H* is the Hessian of the negative of the log-posterior at the peak for *λ*, and *κ* is a tuning constant. Choose

   $${\lambda}^{(s)}=\begin{cases}{\lambda}^{*} & \text{with probability } \mathrm{min}\left\{1,\frac{p\left(y|{\lambda}^{*}\right)}{p\left(y|{\lambda}^{(s-1)}\right)}\right\}\\ {\lambda}^{(s-1)} & \text{otherwise}\end{cases}$$

2. Draw Σ^{(s)} from the full conditional posterior Σ|**y**, *λ*^{(s)} in Eq. (21).

3. Draw *A*^{(s)} from the full conditional posterior *A*^{(s)}|**y**, Σ^{(s)}, *λ*^{(s)} in Eq. (22).
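The structure of Algorithm 2 can be sketched on a deliberately simple stand-in model. The Python snippet below is not a VAR: it assumes scalar data *y_i* = *θ* + *e_i* with unit-variance Gaussian errors and a conjugate prior *θ* ~ N(0, *λ*²), so that the marginal likelihood of *λ* is available in closed form, as in the conjugate VAR case. The Gamma hyperprior, the fixed proposal scale `step` (used in place of *κH*^{−1}), and all function names are illustrative assumptions.

```python
import numpy as np

def log_marginal_likelihood(y, lam):
    """log p(y | lam) after integrating out theta ~ N(0, lam^2); errors are N(0, 1)."""
    T = y.size
    s = 1.0 + T * lam**2
    quad = y @ y - lam**2 * y.sum() ** 2 / s
    return -0.5 * (T * np.log(2.0 * np.pi) + np.log(s) + quad)

def log_hyperprior(lam):
    """Gamma(shape=2, scale=1) log-density up to a constant (an illustrative choice)."""
    return np.log(lam) - lam if lam > 0 else -np.inf

def hierarchical_sampler(y, n_sim, lam_init, step, rng):
    T = y.size
    lam = lam_init
    log_post = log_marginal_likelihood(y, lam) + log_hyperprior(lam)
    lam_draws, theta_draws = np.empty(n_sim), np.empty(n_sim)
    accepted = 0
    for s in range(n_sim):
        # Step 1: random-walk Metropolis proposal for the hyperparameter
        cand = lam + step * rng.standard_normal()
        if cand > 0:  # candidates outside the support are rejected
            cand_post = log_marginal_likelihood(y, cand) + log_hyperprior(cand)
            if np.log(rng.uniform()) < cand_post - log_post:
                lam, log_post = cand, cand_post
                accepted += 1
        # Steps 2-3 analogue: draw the model parameter from its conditional posterior
        post_var = lam**2 / (1.0 + T * lam**2)
        post_mean = post_var * y.sum()
        theta_draws[s] = post_mean + np.sqrt(post_var) * rng.standard_normal()
        lam_draws[s] = lam
    return lam_draws, theta_draws, accepted / n_sim

rng = np.random.default_rng(1)
y = 0.8 + rng.standard_normal(100)
lam_draws, theta_draws, acc_rate = hierarchical_sampler(
    y, n_sim=2000, lam_init=1.0, step=0.5, rng=rng
)
```

The acceptance step here uses the ratio of posterior kernels, marginal likelihood times hyperprior; with a flat hyperprior it reduces to the ratio of marginal likelihoods shown in Step 1. In practice the proposal scale would be tuned, for instance through *κ*, to obtain a reasonable acceptance rate.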

In a similar fashion, Belmonte et al. (2014) applied a hierarchical structure to time-varying parameters (TVP) models and specified priors for Bayesian Lasso shrinkage parameters to determine, in a data-driven way, whether coefficients in a forecasting model for inflation are zero, constant, or time-varying.

Carriero et al. (2015) evaluated the forecasting performance of BVARs where tightness hyperparameters are either chosen as the maximizers of Eq. (49) or set to default values, and found that the former route yields modest but statistically significant gains in forecasting accuracy, particularly at short horizons.

## Time-Varying Parameter, State-Dependent, Stochastic Volatility VARs

Models that allow parameters to change over time are increasingly popular in empirical research, in recognition of the fact that they can capture structural changes in the economy. In fact, it seems to be a common belief that the properties of many (if not most) macroeconomic time series have changed over time and can change across regimes or phases of the business cycle. Model parameters either change frequently and gradually over time according to a multivariate autoregressive process, as for example in time-varying parameters VARs (TVP-VARs), or they change abruptly and infrequently as in, for instance, Markov-switching or structural-break models.

## Time-Varying Parameters VAR (TVP-VAR)

Time-varying parameters VARs differ from fixed-coefficient VARs in that they allow the parameters of the model to vary over time, according to a specified law of motion.^{22} TVP-VARs often also include stochastic volatility (SV), which allows for time variation in the variance of the stochastic disturbances.^{23} Doan et al. (1984) were the first to show how estimation of a TVP-VAR with Litterman priors could be conducted by casting the VAR in state space form and using Kalman filtering techniques. This same specification is in Sims (1993). Bayesian time-varying parameter VARs have become popular in empirical macroeconomics following the work of Cogley and Sargent (2002, 2005) and Primiceri (2005), who provided the foundations for Bayesian inference in these models and then used innovations in MCMC algorithms to improve on their computational feasibility.

The basic TVP-VAR is of the form

where the constant coefficients of Eq. (1) are replaced by the time-varying *A _{j,t}*. Eq. (50) can be rewritten in compact form as

where *x _{t}* is defined as in Eq. (5), and *A _{t}* = [*A*_{1,t}, …, *A _{p,t}*, *c _{t}*]′. It is common to assume that the coefficients follow a random-walk process

where *α _{t}* ≡ *vec*(*A _{t}*). The covariance matrix $\Upsilon $ is usually restricted to be diagonal, and the innovations *ς _{t}* to be uncorrelated with *u _{t}*, with *u _{t}* distributed as in Eq. (61). The law of motion for *α _{t}* in Eq. (52), that is, the state equation, implies that ${\alpha}_{t+1}|{\alpha}_{t},\Upsilon \sim \mathcal{N}\left({\alpha}_{t},\Upsilon \right)$, which can be used as a prior distribution for *α*_{t+1}. Hence, the prior for all the states (i.e., ${\alpha}_{t}\ \forall t$) is a product of normal distributions. For the initial vector of the VAR coefficients, Cogley and Sargent (2002, 2005) use a prior of the form ${\alpha}_{1}\sim \mathcal{N}\left({\underset{\_}{\alpha}}_{1|0},{\underset{\_}{\Upsilon}}_{1|0}\right)$, where ${\underset{\_}{\alpha}}_{1|0}$ and ${\underset{\_}{\Upsilon}}_{1|0}$ are set by estimating a fixed-coefficient VAR with a flat prior on a pre-sample.^{24}

If the Gaussian prior for the states is complemented with IW priors for both Σ and Υ, then sampling from the joint posterior is possible with a Gibbs sampling algorithm. (Stochastic volatility in Bayesian VARs was initially introduced in Uhlig, 1997.)

Algorithm 3: Gibbs Sampling From Posterior of TVP-VAR Parameters

Select starting values for Σ^{(0)} and ϒ^{(0)}. For *s* = 1,. . . ., *n _{sim}*:

1. Draw ${\alpha}_{T}^{(s)}$ from the full conditional posterior

   $${\alpha}_{T}^{(s)}|{y}_{1:T},{\Sigma}^{(s-1)},{\Upsilon}^{(s-1)}\sim \mathcal{N}({\alpha}_{T|T},{\Upsilon}_{T|T})$$

   obtained from the Kalman filter. For *t* = *T* – 1, . . ., 1 draw ${\alpha}_{t}^{(s)}$ from the full conditional posterior

   $${\alpha}_{t}^{(s)}|{y}_{1:T},{\Sigma}^{(s-1)},{\Upsilon}^{(s-1)}\sim \mathcal{N}({\alpha}_{t|T},{\Upsilon}_{t|T})$$

   obtained from a simulation smoother.

2. Draw ϒ^{(s)} from

   $${\Upsilon}^{(s)}|{\alpha}_{1:T}^{(s)}\sim \mathcal{I}\mathcal{W}\left({\underset{\_}{S}}_{\Upsilon}+\sum _{t=1}^{T}\left[{\alpha}_{t+1}^{(s)}-{\alpha}_{t}^{(s)}\right]{\left[{\alpha}_{t+1}^{(s)}-{\alpha}_{t}^{(s)}\right]}^{\prime},{\underset{\_}{d}}_{\Upsilon}+T\right).$$

3. Draw Σ^{(s)} from

   $${\Sigma}^{(s)}|y,{\alpha}_{1:T}^{(s)}\sim \mathcal{I}\mathcal{W}\left(\underset{\_}{S}+\sum _{t=1}^{T}\left[{y}_{t}-\left({\mathbb{I}}_{n}\otimes {x}_{t}\right){\alpha}_{t}^{(s)}\right]{\left[{y}_{t}-\left({\mathbb{I}}_{n}\otimes {x}_{t}\right){\alpha}_{t}^{(s)}\right]}^{\prime},\underset{\_}{d}+T\right).$$
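Step 1 of Algorithm 3 can be illustrated with a forward-filtering, backward-sampling scheme, which is one common way to implement a simulation smoother and is assumed here rather than taken from the article. The sketch below treats a generic observation equation *y_t* = *X_t* *α_t* + *u_t* with random-walk states; the function name `draw_states`, the array layout, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def _sym(C):
    """Symmetrize a covariance matrix to guard against rounding asymmetries."""
    return 0.5 * (C + C.T)

def draw_states(y, X, Sigma, Upsilon, a_init, P_init, rng):
    """Draw alpha_{1:T} | y, Sigma, Upsilon for y_t = X_t alpha_t + u_t,
    alpha_{t+1} = alpha_t + s_t, with u_t ~ N(0, Sigma) and s_t ~ N(0, Upsilon)."""
    T, m = y.shape[0], a_init.size
    a_filt = np.empty((T, m))
    P_filt = np.empty((T, m, m))
    a_pred, P_pred = a_init, P_init            # moments of alpha_1 given no data
    for t in range(T):                         # forward pass: Kalman filter
        F = X[t] @ P_pred @ X[t].T + Sigma
        gain = P_pred @ X[t].T @ np.linalg.inv(F)
        a_filt[t] = a_pred + gain @ (y[t] - X[t] @ a_pred)
        P_filt[t] = P_pred - gain @ X[t] @ P_pred
        a_pred, P_pred = a_filt[t], P_filt[t] + Upsilon
    alpha = np.empty((T, m))
    alpha[-1] = rng.multivariate_normal(a_filt[-1], _sym(P_filt[-1]))
    for t in range(T - 2, -1, -1):             # backward pass: sample given alpha_{t+1}
        G = P_filt[t] @ np.linalg.inv(P_filt[t] + Upsilon)
        mean = a_filt[t] + G @ (alpha[t + 1] - a_filt[t])
        cov = P_filt[t] - G @ P_filt[t]
        alpha[t] = rng.multivariate_normal(mean, _sym(cov))
    return alpha

# Simulated example: one equation, two (nearly constant) coefficients
rng = np.random.default_rng(2)
T = 200
alpha_true = np.array([1.0, -0.5])
X = rng.standard_normal((T, 1, 2))             # X[t] is the 1 x 2 regressor matrix
y = X @ alpha_true + 0.1 * rng.standard_normal((T, 1))
alpha_draw = draw_states(
    y, X, Sigma=np.array([[0.01]]), Upsilon=1e-6 * np.eye(2),
    a_init=np.zeros(2), P_init=10.0 * np.eye(2), rng=rng,
)
```

Within a full Gibbs sweep, a draw such as `alpha_draw` would then feed the inverse-Wishart updates for ϒ and Σ in Steps 2 and 3, which are not shown here.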

When stochastic volatility is added to the framework, the VAR innovations are assumed to be still normally distributed but with variance that evolves over time (see Cogley & Sargent, 2002, 2005; Primiceri, 2005)

where *K* is a lower-triangular matrix with ones on the main diagonal, and Ξ* _{t}* a diagonal matrix with elements evolving following a geometric random-walk process

The prior distributions for ϒ and ${\sigma}_{\eta ,j}^{2},\ j=1,\dots ,n$ can be used to express beliefs about the magnitude of the period-to-period drift in the VAR coefficients, and the changes in the volatility of the VAR innovations respectively. In practice, these priors are chosen to ensure that innovations to the parameters are small enough that the short- and medium-run dynamics of *y _{t}* are not swamped by the random-walk behavior of *A _{t}* and Ξ* _{t}*. Primiceri (2005) extends the above TVP-VAR by also allowing the nonzero off-diagonal elements of the contemporaneous covariance matrix *K* to evolve as random-walk processes (i.e., *K* is replaced by *K _{t}* to allow for an arbitrary time-varying correlation structure). A Gibbs sampler to draw from the posterior distribution of the parameters is in Primiceri (2005).

## Markov Switching, Threshold, and Smooth Transition VARs

Contrary to the drifting coefficients models discussed in the previous section, Markov switching (MS) VARs are designed to capture abrupt changes in the dynamics of *y _{t}*.^{25} These can be viewed as models that allow for at least one structural break to occur within the sample, with the timing of the break being unknown. They are of the form

where *x _{t}* is defined as in Eq. (5). The matrix of autoregressive coefficients *A*(*s _{t}*) and the variance of the error term Σ(*s _{t}*) are a function of a discrete *m*-state Markov process *s _{t}* with fixed transition probabilities

If *π _{ii}* = 1 for some $i\in [1,\dots ,m]$, then ${\mathcal{S}}_{i}$ is an absorbing state from which the system is not allowed to move away. Suppose *m* = 2, and that both *A*(*s _{t}*) and Σ(*s _{t}*) change simultaneously when switching from ${\mathcal{S}}_{1}$ to ${\mathcal{S}}_{2}$ and vice versa. If a NIW prior is specified for *A*(*s _{t}*) and Σ(*s _{t}*), and *π*_{11} and *π*_{22} have independent Beta prior distributions, a Gibbs sampler can be used to sample from the posterior (see, e.g., Del Negro & Schorfheide, 2011).
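One block of such a Gibbs sampler, the update of the transition probabilities, is simple enough to sketch. Conditional on a draw of the state sequence (obtained elsewhere in the sampler, not shown), a Beta prior on each staying probability is conjugate to the observed transition counts. In the Python snippet below, the states are labeled 0 and 1, the initial state is treated as given, and the prior parameters `a` and `b` and the function names are illustrative assumptions.

```python
import numpy as np

def count_transitions(states, m=2):
    """counts[i, j] = number of moves from state i to state j along the sequence."""
    counts = np.zeros((m, m), dtype=int)
    for prev, curr in zip(states[:-1], states[1:]):
        counts[prev, curr] += 1
    return counts

def draw_stay_probabilities(states, a, b, rng):
    """Draw pi_11 and pi_22 given the states, each with an independent Beta(a, b) prior."""
    counts = count_transitions(states)
    pi11 = rng.beta(a + counts[0, 0], b + counts[0, 1])   # stays vs. exits from state 0
    pi22 = rng.beta(a + counts[1, 1], b + counts[1, 0])   # stays vs. exits from state 1
    return pi11, pi22

rng = np.random.default_rng(3)
states = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 1])   # a hypothetical draw of s_{1:T}
counts = count_transitions(states)
pi11, pi22 = draw_stay_probabilities(states, a=8.0, b=2.0, rng=rng)
```

A prior with `a` larger than `b` expresses the belief that regimes are persistent; the remaining blocks of the sampler would draw the state sequence and the regime-specific NIW parameters.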

A MS-VAR with non-recurrent states is called a “change-point” model (see Chib, 1998; Bauwens & Rombouts, 2012). Generalizing the specification to allow for more states, with the appropriate transition probabilities, makes it possible to adapt the change-point model to the case of several structural breaks (see also Koop & Potter, 2007, 2009; Liu et al., 2017, for models where the number of change points is unknown). Important extensions concern the transmission of structural shocks in the presence of structural breaks and in a time-varying coefficient environment, discussed in, for example, Sims (2006) and Koop et al. (2011), who also allow for cointegration.

In threshold VARs (TVARs), the coefficients of the model change across regimes when an observable variable exceeds a given threshold value. Bayesian inference in TVAR models is discussed in detail in Geweke and Terui (1993) and Chen and Lee (1995). A TVAR with two regimes can be written as

where *A* and *A*^{*} are *n* × *k* matrices that collect the autoregressive coefficients of the two regimes, Θ(•) is a Heaviside step function, that is, a discontinuous function whose value is zero for a negative argument and one for a positive argument, *τ _{t–d}* is the threshold variable at lag *d*, and *τ* is a potentially unobserved threshold value. The system in Eq. (57) can be easily generalized to allow for multiple regimes. TVARs have been applied to several problems in the economic literature (see, e.g., Koop & Potter, 1999; Ricco et al., 2016; Alessandri & Mumtaz, 2017).

If the coefficients gradually migrate to the new state(s), the model is called a smooth-transition VAR (STVAR). A STVAR model with two regimes can be written as

where *A*^{*}, *A*, and *x _{t}* are defined as in Eq. (57). The function *G*(*w _{t}*; *ϑ*, *w*) governs the transition across states, and is a function of the observable variable *w _{t}*, and of the parameters *ϑ* and *w*. In an exponential smooth-transition (EST) VAR, typically

where *ϑ* > 0 determines the speed of transition across regimes, *w* can be thought of as a threshold value, and *σ _{w}* is the sample standard deviation of *w _{t}*. The higher *ϑ*, the more abrupt the transition and the more the model collapses into a fixed threshold VAR. Among others, Gefang and Strachan (2009) and Gefang (2012) apply Bayesian techniques to estimate smooth-transition VAR models.
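The limiting behavior described above can be checked numerically. The Python sketch below uses a logistic transition function of the standardized distance (*w _{t}* − *w*)/*σ _{w}*; this is one commonly used specification and is an assumption made here for illustration, not necessarily the functional form used in the text. The names `theta`, `w_bar`, and `sigma_w` stand in for *ϑ*, *w*, and *σ _{w}*.

```python
import numpy as np

def logistic_transition(w_t, theta, w_bar, sigma_w):
    """Logistic weight in [0, 1] on the second regime (assumed functional form).
    Written with tanh, which equals 1 / (1 + exp(-z)) but avoids overflow for large |z|."""
    z = theta * (w_t - w_bar) / sigma_w
    return 0.5 * (1.0 + np.tanh(0.5 * z))

w_grid = np.linspace(-2.0, 2.0, 9)                # values of the transition variable
smooth = logistic_transition(w_grid, theta=1.0, w_bar=0.0, sigma_w=1.0)
sharp = logistic_transition(w_grid, theta=200.0, w_bar=0.0, sigma_w=1.0)

# Heaviside step at the threshold (value 0.5 assigned at exactly w_t = w_bar)
step = np.heaviside(w_grid, 0.5)
```

With a small `theta` the weight moves gradually from one regime to the other, while for a large `theta` it is numerically indistinguishable from the step function on this grid, which is the sense in which the smooth-transition model nests the threshold VAR.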

## Bayesian Panel VARs

Panel VARs generalize VAR models by describing the joint dynamics of multiple time series of heterogeneous and interacting units, such as the economies of several countries, regions, or sectors. Thorough reviews are in Canova and Ciccarelli (2013) and in Dieppe, van Roye, and Legrand (2016).

A panel VAR describes the evolution of *y _{t,i}*, the *n* × 1 vector of endogenous variables of each unit $i\in \left[1,\dots ,N\right]$, by a system of *p*-th order VARs

where *w _{t}* is a vector of *m* exogenous controls. The innovations are generally assumed to be *i.i.d.* and Gaussian

while being possibly correlated across units.

Stacking over the *N* units, the model assumes the form of a VAR(*p*) with exogenous controls

In Eq. (63)

moreover

and

While in Eq. (63) the system appears as a standard VAR, its panel structure is captured by three properties: (i) dynamic interdependencies: the dynamics of the variables in each unit depend on the lagged values of the other endogenous variables in the unit and possibly all other units, ${A}_{\ell ,jk}\ne 0$ for *j* ≠ *k*; (ii) static interdependencies: the innovations *u _{t,i}* can be correlated across units, Σ* _{ij}* ≠ 0 for *i* ≠ *j*; (iii) cross-unit (sub-sectional) heterogeneity: the VAR coefficients and residual variances can be unit-specific, ${A}_{\ell ,ik}\ne {A}_{\ell ,jk}$, ${H}_{i}\ne {H}_{j}$ and ${\Sigma}_{ii}\ne {\Sigma}_{jj}$ for $i\ne j$.^{26}

If all of these properties are present in the data, and relationships do not exist among the coefficients, the system is a VAR with a large cross-section (i.e., a Large VAR) and can be estimated with standard macroeconomic priors such as, for example, the Minnesota priors of section "Informative Priors for Reduced-Form VARs."^{27} However, it is often possible to assume that only some of these properties are relevant to the system of interest.

If the units do not have dynamic or static interdependencies (i.e., ${A}_{\ell ,jk}=0$ and Σ* _{jk}* = 0 for *j* ≠ *k*) and the dynamic coefficients are homogeneous across units (i.e., ${A}_{\ell ,jj}={\overline{A}}_{\ell}$, ${H}_{j}=\overline{H}$, and ${\Sigma}_{jj}=\overline{\Sigma}\ \forall j$), then

and the system simplifies into a single pooled VAR, with only *n* × (*np* + *m*) coefficients to be estimated.^{28} By stacking the observations first over different units and then over different times, the system can be cast in the standard SUR representation (Eq. 6) and estimated with standard priors (e.g., Normal-Inverse Wishart priors) and techniques.

If the dynamic coefficients are heterogeneous across units but no dynamic or static interdependencies are present, then the system breaks up into *N* independent VARs, with the following SUR representation

A random coefficients model for Eq. (65) assumes that the coefficients from each unit can be thought of as random draws from a common distribution. For example,

where *α _{i}* ≡ *vec*(*A _{i}*). Eq. (66) can be thought of as an exchangeable Bayesian prior on the units’ coefficients: the unit indices *i* are uninformative, in the sense that they can be exchanged without any loss of information. This approach was proposed by Zellner and Hong (1989), who used a “Minnesota” type prior with fixed and known residual covariance matrix, a diagonal Σ* _{a}* with overall tightness hyperparameter *λ _{a}*, and a plug-in pooled estimator for *a*. Jarociński (2010) proposes instead a fully Bayesian model in which all the parameters are treated as random variables, and a sophisticated hierarchical prior approach is adopted. Estimation can be achieved using a Gibbs sampler.

If dynamic interdependencies are allowed, the estimation problem becomes more complex. Canova and Ciccarelli (2004, 2009) have suggested solutions based on different Bayesian and cross-sectional shrinkage techniques that can deal with the parameter proliferation that arises in these cases. The approach works by assuming a factor structure for the matrix of coefficients and can be estimated with standard Bayesian priors and a Gibbs sampler. This structural factor approach is flexible and can also be used to estimate Panel VARs with dynamic coefficients that evolve over time, as done in, for example, Ciccarelli et al. (2012) and Canova and Ciccarelli (2013).

## Acknowledgments

We thank the editors of the *Oxford Research Encyclopedia of Economics and Finance* and an anonymous referee for useful comments and suggestions. We are also grateful to Fabio Canova, Andrea Carriero, Matteo Ciccarelli, Domenico Giannone, Marek Jarociński, Marco Del Negro, Massimiliano Marcellino, Giorgio Primiceri, Lucrezia Reichlin, and Frank Schorfheide for helpful comments and discussions. The views expressed in this paper are those of the authors and do not necessarily reflect those of the Bank of England or any of its committees.

## References

Alessandri, P., & Mumtaz, H. (2017, March). Financial conditions and density forecasts for US output and inflation. *Review of Economic Dynamics*, *24*, 66–78.

Bańbura, M., Giannone, D., & Reichlin, L. (2010). Large Bayesian vector auto regressions. *Journal of Applied Econometrics*, *25*(1), 71–92.

Bauwens, L., & Rombouts, J. V. K. (2012). On marginal likelihood computation in change-point models. *Computational Statistics & Data Analysis*, *56*(11), 3415–3429.

Belmonte, M. A. G., Koop, G., & Korobilis, D. (2014). Hierarchical shrinkage in time-varying parameter models. *Journal of Forecasting*, *33*(1), 80–94.

Berger, J. O. (1985). *Statistical decision theory and Bayesian analysis*. New York, NY: Springer.

Bernardo, J. M., & Smith, A. F. M. (2009). *Bayesian theory*, Wiley Series in Probability and Statistics. Chichester, U.K.: Wiley.

Bhattacharya, A., Pati, D., Pillai, N. S., & Dunson, D. B. (2015). Dirichlet-Laplace priors for optimal shrinkage. *Journal of the American Statistical Association*, *110*(512), 1479–1490.

Bloor, C., & Matheson, T. (2011). Real-time conditional forecasts with Bayesian VARs: An application to New Zealand. *The North American Journal of Economics and Finance*, *22*(1), 26–42.

Canova, F. (1992). An alternative approach to modeling and forecasting seasonal time series. *Journal of Business & Economic Statistics*, *10*(1), 97–108.

Canova, F. (1993). Forecasting time series with common seasonal patterns. *Journal of Econometrics*, *55*(1–2), 173–200.

Canova, F. (2007). *Methods for applied macroeconomic research*. Princeton, NJ: Princeton University Press.

Canova, F., & Ciccarelli, M. (2004). Forecasting and turning point predictions in a Bayesian panel VAR model. *Journal of Econometrics*, *120*(2), 327–359.

Canova, F., & Ciccarelli, M. (2009). Estimating multicountry VAR models. *International Economic Review*, *50*(3), 929–959.

Canova, F., & Ciccarelli, M. (2013). Panel vector autoregressive models: A survey. In T. B. Fomby, L. Kilian, & A. Murphy (Eds.), *VAR models in macroeconomics—new developments and applications: Essays in honor of Christopher A. Sims (advances in econometrics)* (pp. 205–246). Bingley, U.K.: Emerald Group.

Carriero, A., Clark, T. E., & Marcellino, M. (2015). Bayesian VARs: Specification choices and forecast accuracy. *Journal of Applied Econometrics*, *30*(1), 46–73.

Carriero, A., Kapetanios, G., & Marcellino, M. (2009). Forecasting exchange rates with a large Bayesian VAR. *International Journal of Forecasting*, *25*(2), 400–417.

Chen, C. W. S., & Lee, J. C. (1995). Bayesian inference of threshold autoregressive models. *Journal of Time Series Analysis*, *16*(5), 483–492.

Chib, S. (1998). Estimation and comparison of multiple change-point models. *Journal of Econometrics*, *86*(2), 221–241.

Chib, S. (2001). Markov chain Monte Carlo methods: Computation and inference. In J. J. Heckman & E. Leamer (Eds.), *Handbook of econometrics* (Vol. 5, pp. 3569–3649). Amsterdam: Elsevier.

Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. *The American Statistician*, *49*(4), 327–335.

Chiu, C.-W., Mumtaz, H., & Pinter, G. (2017). Forecasting with VAR models: Fat tails and stochastic volatility. *International Journal of Forecasting*, *33*(4), 1124–1143.

Ciccarelli, M., Ortega, E., & Valderrama, M. T. (2012). Heterogeneity and cross-country spillovers in macroeconomic-financial linkages. Working Paper Series 1498, European Central Bank.

Ciccarelli, M., & Rebucci, A. (2003). Bayesian VARs: A survey of the recent literature with an application to the European Monetary System. IMF Working Papers 03/102. Washington, DC: International Monetary Fund.

Cogley, T., & Sargent, T. J. (2002). Evolving post-World War II U.S. inflation dynamics. In *NBER Macroeconomics Annual 2001, Volume 16* (pp. 331–388). National Bureau of Economic Research.

Cogley, T., & Sargent, T. J. (2005). Drift and volatilities: Monetary policies and outcomes in the post WWII U.S. *Review of Economic Dynamics*, *8*(2), 262–302.

De Mol, C., Giannone, D., & Reichlin, L. (2008). Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components? *Journal of Econometrics*, *146*(2), 318–328.

DeJong, D. N., Ingram, B., & Whiteman, C. H. (1993). Analyzing VARs with monetary business cycle model priors. In *Proceedings of the American Statistical Association, Bayesian Statistics Section* (pp. 160–169). Alexandria, VA: The American Statistical Association.

Del Negro, M., & Schorfheide, F. (2004). Priors from general equilibrium models for VARS. *International Economic Review*, *45*(2), 643–673.

Del Negro, M., & Schorfheide, F. (2011). Bayesian macroeconometrics. In J. Geweke, G. Koop, & H. Van Dijk (Eds.), *The Oxford Handbook of Bayesian Econometrics* (pp. 293–389). Oxford, U.K.: Oxford University Press.

Dieppe, A., van Roye, B., & Legrand, R. (2016). The BEAR toolbox. Working Paper Series 1934. European Central Bank.

Doan, T., Litterman, R., & Sims, C. (1984). Forecasting and conditional projection using realistic prior distributions. *Econometric Reviews*, *3*(1), 1–100.

Gefang, D. (2012, February). Money-output causality revisited—A Bayesian logistic smooth transition VECM perspective. *Oxford Bulletin of Economics and Statistics*, *74*(1), 131–151.

Gefang, D., & Strachan, R. (2009). Nonlinear impacts of international business cycles on the U.K.—A Bayesian smooth transition VAR approach. *Studies in Nonlinear Dynamics & Econometrics*, *14*(1), 1–33.

Geisser, S. (1965). Bayesian estimation in multivariate analysis. *The Annals of Mathematical Statistics*, *36*(1), 150–159.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). *Bayesian data analysis* (3rd ed.). CRC Texts in Statistical Science. London, U.K.: Taylor & Francis.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). *Bayesian data analysis* (2nd ed.). CRC Texts in Statistical Science. London, U.K.: Taylor & Francis.

George, E. I., Sun, D., & Ni, S. (2008). Bayesian stochastic search for VAR model restrictions. *Journal of Econometrics*, *142*(1), 553–580.

Geweke, J. (1996, November). Bayesian reduced rank regression in econometrics. *Journal of Econometrics*, *75*(1), 121–146.

Geweke, J. (1999). Using simulation methods for Bayesian econometric models: Inference, development, and communication. *Econometric Reviews*, *18*(1), 1–73.

Geweke, J. (2005). *Contemporary Bayesian econometrics and statistics*. Wiley Series in Probability and Statistics. Chichester, U.K.: Wiley.

Geweke, J., & Terui, N. (1993). Bayesian threshold autoregressive models for nonlinear time series. *Journal of Time Series Analysis*, *14*(5), 441–454.

Geweke, J., & Whiteman, C. (2006). Bayesian forecasting. In G. Elliott, C. Granger, & A. Timmermann (Eds.), *Handbook of economic forecasting* (Vol. 1, pp. 3–80). Amsterdam: Elsevier.

Giannone, D., Lenza, M., Momferatou, D., & Onorante, L. (2014). Short-term inflation projections: A Bayesian vector autoregressive approach. *International Journal of Forecasting*, *30*(3), 635–644.

Giannone, D., Lenza, M., & Primiceri, G. E. (2015, May). Prior selection for vector autoregressions. *The Review of Economics and Statistics*, *97*(2), 436–451.

Giannone, D., Lenza, M., & Primiceri, G. E. (2016). Priors for the long run. CEPR Discussion Papers 11261, C.E.P.R. Discussion Papers.

Giannone, D., Lenza, M., & Primiceri, G. E. (2017). Economic predictions with big data: The illusion of sparsity. CEPR Discussion Papers 12256, C.E.P.R. Discussion Papers.

Giannone, D., Lenza, M., & Reichlin, L. (2008). Explaining the great moderation: It is not the shocks. *Journal of the European Economic Association*, *6*(2–3), 621–633.

Griffin, J. E., & Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. *Bayesian Analysis*, *5*(1), 171–188.

Griffin, J., & Brown, P. (2017). Hierarchical shrinkage priors for regression models. *Bayesian Analysis*, *12*(1), 135–159.

Hajivassiliou, V. A., & Ruud, P. A. (1994). Classical estimation methods for LDV models using simulation. In *Handbook of econometrics* (Vol. 4, pp. 2383–2441). Amsterdam: Elsevier.

Highfield, R. A. (1992). Forecasting similar time series with Bayesian pooling methods: Application to forecasting European output growth. In P. K. Goel & N. Sreenivas Iyengar (Eds.), *Bayesian analysis in statistics and econometrics* (pp. 303–326). New York, NY: Springer.

Huber, F., & Feldkircher, M. (2017). Adaptive shrinkage in Bayesian vector autoregressive models. *Journal of Business & Economic Statistics*, 1–13.

Ingram, B. F., & Whiteman, C. H. (1994). Supplanting the ‘Minnesota’ prior: Forecasting macroeconomic time series using real business cycle model priors. *Journal of Monetary Economics*, *34*(3), 497–510.

Jarociński, M. (2010). Responses to monetary policy shocks in the east and the west of Europe: A comparison. *Journal of Applied Econometrics*, *25*(5), 833–868.

Jarociński, M., & Maćkowiak, B. (2017). Granger causal priority and choice of variables in vector autoregressions. *The Review of Economics and Statistics*, *99*(2), 319–329.

Jarociński, M., & Marcet, A. (2011). Autoregressions in small samples, priors about observables and initial conditions. CEP Discussion Papers dp1061. London, U.K.: Centre for Economic Performance, LSE.

Jarociński, M., & Marcet, A. (2014). Contrasting Bayesian and frequentist approaches to autoregressions: The role of the initial condition. Working Papers 776. Barcelona, Spain: Barcelona Graduate School of Economics.

Jochmann, M., & Koop, G. (2015). Regime-switching cointegration. *Studies in Nonlinear Dynamics & Econometrics*, *19*(1), 35–48.

Kadiyala, R. K., & Karlsson, S. (1997). Numerical methods for estimation and inference in Bayesian VAR-models. *Journal of Applied Econometrics*, *12*(2), 99–132.

Karlsson, S. (2013). Forecasting with Bayesian vector autoregression. In G. Elliott & A. Timmermann (Eds.), *Handbook of economic forecasting* (Vol. 2, pp. 791–897). Amsterdam: Elsevier.

Kim, J. Y. (1994). Bayesian asymptotic theory in a time series model with a possible nonstationary process. *Econometric Theory*, *10*(3–4), 764–773.

Kim, C. J., & Nelson, C. (1999). *State-space models with regime switching: Classical and Gibbs-sampling approaches with applications* (Vol. 1). Cambridge, MA: MIT Press.

King, R. G., Plosser, C. I., & Rebelo, S. T. (1988). Production, growth and business cycles: I. The basic neoclassical model. *Journal of Monetary Economics*, *21*(2), 195–232.

Kleibergen, F., & Paap, R. (2002). Priors, posteriors and Bayes factors for a Bayesian analysis of cointegration. *Journal of Econometrics*, *111*(2), 223–249.

Kleibergen, F., & van Dijk, H. K. (1994). On the shape of the likelihood/posterior in cointegration models. *Econometric Theory*, *10*(3–4), 514–551.

Koop, G. (2003). *Bayesian econometrics*. Chichester, U.K.: John Wiley.

Koop, G. (2013). Forecasting with medium and large Bayesian VARs. *Journal of Applied Econometrics*, *28*(2), 177–203.

Koop, G., & Korobilis, D. (2010). Bayesian multivariate time series methods for empirical macroeconomics. *Foundations and Trends® in Econometrics*, *3*(4), 267–358.

Koop, G., & Potter, S. M. (1999). Dynamic asymmetries in U.S. unemployment. *Journal of Business & Economic Statistics*, *17*(3), 298–312.

Koop, G., & Potter, S. M. (2007). Estimation and forecasting in models with multiple breaks. *The Review of Economic Studies*, *74*(3), 763–789.

Koop, G., & Potter, S. M. (2009, August). Prior elicitation in multiple change-point models. *International Economic Review*, *50*(3), 751–772.

Koop, G., & Steel, F. (1991). A comment on: ‘To criticize the critics: An objective Bayesian analysis of stochastic trends.’ *Journal of Applied Econometrics*, *6*, 365–370.

Koop, G., Leon-Gonzalez, R., & Strachan, R. W. (2011). Bayesian inference in a time varying cointegration model. *Journal of Econometrics*, *165*(2), 210–220.

Koop, G., Strachan, R. W., van Dijk, H. K., & Villani, M. (2006). Monetary policy shocks: What have we learned and to what end? In T. C. Mills & K. P. Patterson (Eds.), *Palgrave handbook of econometrics* (Vol. 1, pp. 871–898). Basingstoke, U.K.: Palgrave Macmillan.

Korobilis, D. (2013). VAR forecasting using Bayesian variable selection. *Journal of Applied Econometrics*, *28*(2), 204–230.

Kroese, D. P., & Chan, J. (2014). *Statistical modeling and computation*. New York, NY: Springer.

Kwan, Y. K. (1998). Asymptotic Bayesian analysis based on a limited information estimator. *Journal of Econometrics*, *88*(1), 99–121.

Litterman, R. B. (1979). Techniques of forecasting using vector autoregressions. Working Papers 115. Minneapolis, MN: Federal Reserve Bank of Minneapolis.

Litterman, R. B. (1980). A Bayesian procedure for forecasting with vector autoregression. Working Papers. Cambridge, MA: MIT Department of Economics.

Litterman, R. B. (1986). Forecasting with Bayesian vector autoregressions-five years of experience. *Journal of Business & Economic Statistics*, *4*(1), 25–38.

Liu, P., Mumtaz, H., Theodoridis, K., & Zanetti, F. (2017). Changing macroeconomic dynamics at the zero lower bound. *Journal of Business & Economic Statistics*.

Lubik, T. A., & Matthes, C. (2015). Time-varying parameter vector autoregressions: Specification, estimation, and an application. *Economic Quarterly*, *4Q*, 323–352.

Miranda-Agrippino, S., & Ricco, G. (2017). The transmission of monetary policy shocks. Bank of England Working Papers 657. London, U.K.: Bank of England.

Miranda-Agrippino, S., & Ricco, G. (2018). Bayesian vector autoregressions: Applications. In J. Hamilton (Ed.), *Oxford Research Encyclopedia of Economics and Finance*. Oxford, U.K.: Oxford University Press.

Mitchell, T. J., & Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. *Journal of the American Statistical Association*, *83*(404), 1023–1032.

Müller, U. K. (2013). Risk of Bayesian inference in misspecified models, and the sandwich covariance matrix. *Econometrica*, *81*(5), 1805–1849.

Müller, U. K., & Elliott, G. (2003). Tests for unit roots and the initial condition. *Econometrica*, *71*(4), 1269–1286.

Panagiotelis, A., & Smith, M. (2008). Bayesian density forecasting of intraday electricity prices using multivariate skew t distributions. *International Journal of Forecasting*, *24*(4), 710–727.

Phillips, P. C. B. (1991a). Bayesian routes and unit roots: De rebus prioribus semper est disputandum. *Journal of Applied Econometrics*, *6*(4), 435–473.

Phillips, P. C. B. (1991b). To criticize the critics: An objective Bayesian analysis of stochastic trends. *Journal of Applied Econometrics*, *6*(4), 333–364.

Primiceri, G. E. (2005). Time varying structural vector autoregressions and monetary policy. *Review of Economic Studies*, *72*(3), 821–852.

Raiffa, H., & Schlaifer, R. (1961). *Applied statistical decision theory*, Studies in Managerial Economics. Cambridge, MA: Division of Research, Graduate School of Business Administration, Harvard University.

Ricco, G., Callegari, G., & Cimadomo, J. (2016). Signals from the government: Policy disagreement and the transmission of fiscal shocks. *Journal of Monetary Economics*, *82*, 107–118.

Robertson, J. C., & Tallman, E. W. (1999). Vector autoregressions: Forecasting and reality. *Economic Review*, *Q1*, 4–18.

Sims, C. A. (1980). Macroeconomics and reality. *Econometrica*, *48*(1), 1–48.

Sims, C. A. (1988). Bayesian skepticism on unit root econometrics. *Journal of Economic Dynamics and Control*, *12*(2), 463–474.

Sims, C. A. (1991). Comment by Christopher A. Sims on ‘To criticize the critics’, by Peter C. B. Phillips. *Journal of Applied Econometrics*, *6*(4), 423–434.

Sims, C. A. (1993). A nine-variable probabilistic macroeconomic forecasting model. In J. H. Stock & M. W. Watson (Eds.), *Business cycles, indicators and forecasting* (pp. 179–212). National Bureau of Economic Research, 28. Chicago: University of Chicago Press.

Sims, C. A. (1996). Inference for multivariate time series models with trend. Technical report. Princeton, NJ: Princeton University.

Sims, C. A. (2000). Using a likelihood perspective to sharpen econometric discourse: Three examples. *Journal of Econometrics*, *95*(2), 443–462.

Sims, C. A. (2005a). Conjugate dummy observation priors for VARs. Technical report. Princeton, NJ: Princeton University.

Sims, C. A. (2005b). Dummy observation priors revisited. Technical report. Princeton, NJ: Princeton University.

Sims, C. A. (2010a). Causal ordering and exogeneity. Technical report. Princeton, NJ: Princeton University.

Sims, C. A. (2010b). Understanding non-Bayesians. Technical report. Princeton, NJ: Princeton University.

Sims, C. A., & Uhlig, H. (1991). Understanding unit rooters: A helicopter tour. *Econometrica*, *59*(6), 1591–1599.

Sims, C. A., & Zha, T. (1998). Bayesian methods for dynamic multivariate models. *International Economic Review*, *39*(4), 949–968.

Sims, C. A., & Zha, T. (2006). Were there regime switches in U.S. monetary policy? *American Economic Review*, *96*(1), 54–81.

Strachan, R. W., & Inder, B. (2004). Bayesian analysis of the error correction model. *Journal of Econometrics*, *123*(2), 307–325.

Theil, H. (1963). On the use of incomplete prior information in regression analysis. *Journal of the American Statistical Association*, *58*(302), 401–414.

Theil, H., & Goldberger, A. S. (1961). On pure and mixed statistical estimation in economics. *International Economic Review*, *2*(1), 65–78.

Tiao, G. C., & Zellner, A. (1964). On the Bayesian estimation of multivariate regression. *Journal of the Royal Statistical Society. Series B (Methodological)*, *26*(2), 277–285.

Tierney, L. (1994). Markov chains for exploring posterior distributions. *Annals of Statistics*, *22*(4), 1701–1728.

Timmermann, A. (2006). Forecast combinations. In G. Elliott, C. Granger, & A. Timmermann (Eds.), *Handbook of economic forecasting* (Vol. 1, pp. 135–196). Amsterdam: Elsevier.

Uhlig, H. (1994a). On Jeffreys prior when using the exact likelihood function. *Econometric Theory*, *10*(3–4), 633–644.

Uhlig, H. (1994b). What macroeconomists should know about unit roots: A Bayesian perspective. *Econometric Theory*, *10*(3–4), 645–671.

Uhlig, H. (1997). Bayesian vector autoregressions with stochastic volatility. *Econometrica*, *65*(1), 59–74.

Villani, M. (2001). Bayesian prediction with cointegrated vector autoregressions. *International Journal of Forecasting*, *17*(4), 585–605.

Villani, M. (2009). Steady-state priors for vector autoregressions. *Journal of Applied Econometrics*, *24*(4), 630–650.

Wright, J. H. (2009). Forecasting US inflation by Bayesian model averaging. *Journal of Forecasting*, *28*(2), 131–144.

Zellner, A. (1971). *An introduction to Bayesian inference in econometrics*. Wiley Classics Library. Chichester, U.K.: Wiley-Interscience.

Zellner, A., & Hong, C. (1989). Forecasting international growth rates using Bayesian shrinkage and other procedures. *Journal of Econometrics*, *40*(1), 183–202.

## Notes:

(1.) Several books provide excellent in-depth treatments of Bayesian inference, among others Zellner (1971), Gelman et al. (2003), Koop (2003), and Geweke (2005). Canova (2007) provides a book treatment of VARs and BVARs in the context of methods for applied macroeconomic research. Several recent articles survey the literature on BVARs. Del Negro and Schorfheide (2011) offer a deep and insightful discussion of BVARs with a broader focus on Bayesian macroeconometrics and DSGE models. Koop and Korobilis (2010) discuss Bayesian multivariate time-series models, with an in-depth treatment of time-varying parameters and stochastic volatility models. Geweke and Whiteman (2006) and Karlsson (2013) provide detailed surveys with a focus on forecasting with Bayesian vector autoregressions. Ciccarelli and Rebucci (2003) surveyed BVARs in forecasting analysis with Euro Area data. Canova and Ciccarelli (2009, 2013) discussed panel Bayesian VARs. The reader is referred to Timmermann (2006) for an in-depth discussion of model averaging and forecast combination, a natural extension of the Bayesian framework. Finally, Dieppe et al. (2016) have developed the ready-to-use BEAR toolbox, which implements many of the methods described in this article. Other useful code sources are those accompanying Kroese and Chan (2014) and Koop and Korobilis (2010), both available online.

(2.) Bayesian priors can often be interpreted as frequentist penalized regressions (see, e.g., De Mol et al., 2008). A Gaussian prior for the regression coefficients, for example, can be thought of as a Ridge penalized regression. Having a double exponential (Laplace) prior on the coefficients is instead equivalent to a Lasso regularization problem.
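
The Ridge case can be checked numerically. The following is a minimal sketch, assuming a single-equation regression with known error variance and a zero-mean Gaussian prior with common variance; the names `sigma2`, `tau2`, and `lam` and the simulated data are illustrative assumptions, not quantities from the article. Under these assumptions the posterior mean coincides with the Ridge estimator with penalty `lam = sigma2 / tau2`.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 50, 4
X = rng.normal(size=(T, k))                      # simulated regressors
beta_true = np.array([1.0, -0.5, 0.0, 2.0])      # illustrative coefficients
sigma2 = 0.5 ** 2                                # error variance, assumed known
tau2 = 1.0                                       # prior: beta ~ N(0, tau2 * I)
y = X @ beta_true + rng.normal(scale=0.5, size=T)

# Posterior mean under the Gaussian prior (conjugate result, known sigma2)
post_prec = X.T @ X / sigma2 + np.eye(k) / tau2
post_mean = np.linalg.solve(post_prec, X.T @ y / sigma2)

# Ridge estimator with penalty lam = sigma2 / tau2
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# The two estimators agree up to floating-point error
print(np.allclose(post_mean, ridge))
```

A tighter prior (smaller `tau2`) corresponds to a larger penalty and therefore to stronger shrinkage toward zero.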

(3.) In principle, dummy observations can also implement prior beliefs about nonlinear functions of the parameters (a short discussion on this is in Sims, 2005b).

(4.) Such a prior is adopted to capture the belief that it is not plausible to assume that initial transients can explain a large part of the observed long-run variation in economic time series. Since a sample of given size contains no information on the behavior of time series at periodicities longer than the sample length, the prior assumptions implicitly or explicitly elicited in the analysis will inform results. This is a clear example of an issue in VAR inference for which the Bayesian approach makes prior information explicit and open to scientific discussion.

(5.) Several sets of pseudo-observations can be adopted at the same time. In fact, successive dummy observations modify the prior distribution as if they reflected successive observations of functions of the VAR parameters, affected by stochastic disturbances.

(6.) While the assumption of normally distributed errors makes the posterior p.d.f. tractable, modern computational methods permit straightforward characterization of posterior distributions obtained under different assumptions. Among others, Chiu et al. (2017) and Panagiotelis and Smith (2008) depart from the normality assumption and allow for *t*-distributed errors. It is interesting to observe that in large samples, and under certain regularity conditions, the likelihood function converges to a Gaussian distribution, with mean at the maximum likelihood estimator (MLE) and covariance matrix given by the usual MLE estimator for the covariance matrix. This implies that conditioning on the MLE and using its asymptotic Gaussian distribution is, approximately in large samples, as good as conditioning on all the data (see discussion in Sims, 2010b).

(7.) The marginal p.d.f. for the observations, denoted as *p*(*y*_{1–p:t}), is a normalizing constant and as such can be dropped when making inference about the model parameters.

(8.) “Non-informative” or “flat” priors are designed to extract the maximum amount of expected information from the data. They maximize the difference (measured by Kullback-Leibler distance) between the posterior and the prior when the number of samples drawn goes to infinity. Jeffreys priors for VARs are “improper,” in the sense that they do not integrate to one over the parameter space. Hence, they cannot be thought of as well-specified p.d.f.s. However, they can be obtained as a degenerate limit of the Normal-Inverse-Wishart conjugate distribution, and their posterior is proper. For an in-depth discussion of non-informative priors in multi-parameter settings, see Zellner (1971) and Bernardo and Smith (2009).

(9.) The marginal posterior distribution of the *k* × *n* matrix *A* is matricvariate *t* (see Kadiyala & Karlsson, 1997).

(10.) Müller (2013) shows that a Bayesian decision maker can justify using OLS with a sandwich covariance matrix when the probability limit of the OLS estimator is the object of interest, despite the fact that the linear regression model is known not to be the true model (see discussion in Sims, 2010b). Miranda-Agrippino and Ricco (2017) used this intuition to construct coverage bands for impulse responses estimated with Bayesian Local Projections (BLP). This method can be thought of as a generalization of BVARs that estimates a different model for each forecast horizon (as in direct forecasts) and hence induces autocorrelation in the reduced-form residuals, violating the i.i.d. assumption in Eq. (61).

(11.) The prior mean of the VAR coefficients is $\mathbb{E}\left[\alpha \right]=\underset{\_}{\alpha}$, for $\underset{\_}{d}>n$, while the variance is $\mathbb{V}ar\left[\alpha \right]={(\underset{\_}{d}-n-1)}^{-1}\underset{\_}{S}\otimes \underset{\_}{\Omega}$, for $\underset{\_}{d}>n+1$. Setting $\underset{\_}{d}=\max\left\{n+2,\,n+2h-T\right\}$ ensures that both the prior variances of *A* and the posterior variances of the forecasts at *T* + *h* are defined.

(12.) The key idea of MCMC algorithms is to construct a Markov chain for *θ* ≡ (*A*, Σ) that has the posterior as its (unique) limiting stationary distribution, and such that random draws can be sampled from the transition kernel *p*(*θ*^{(s+1)}|*θ*^{(s)}). Tierney (1994) and Geweke (2005) discuss the conditions for the convergence of the chain to the posterior distribution when starting from an arbitrary point in the parameter space. Typically, a large number of initial draws (known as the burn-in sample) is discarded to avoid including portions of the chain that have not yet converged to the posterior. Also, even if convergent, the chain may move slowly in the parameter space due to, for example, autocorrelation between the draws, and a large number of draws may be needed. See also Karlsson (2013) for a discussion of this point and of empirical diagnostic tests to assess the chain convergence. Further references include Geweke (1999), Chib and Greenberg (1995), and Geweke and Whiteman (2006).
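
The mechanics of sampling and discarding a burn-in can be sketched on a toy problem. The example below is a minimal Gibbs sampler for the mean and variance of a univariate Gaussian sample, not for the VAR parameters (*A*, Σ); the priors, their hyperparameters (`prior_var_mu`, `a0`, `b0`), the simulated data, and the chain lengths are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=1.0, scale=2.0, size=200)     # simulated data
n = y.size

# Hypothetical priors: mu ~ N(0, prior_var_mu), sigma2 ~ Inverse-Gamma(a0, b0)
prior_var_mu, a0, b0 = 100.0, 2.0, 2.0

n_draws, burn_in = 6000, 1000
mu, sigma2 = 0.0, 1.0                            # arbitrary starting point
draws = np.empty((n_draws, 2))

for s in range(n_draws):
    # mu | sigma2, y is Normal with precision-weighted mean
    v = 1.0 / (n / sigma2 + 1.0 / prior_var_mu)
    m = v * y.sum() / sigma2
    mu = rng.normal(m, np.sqrt(v))
    # sigma2 | mu, y is Inverse-Gamma; drawn as 1 / Gamma(shape, scale=1/rate)
    shape = a0 + 0.5 * n
    rate = b0 + 0.5 * np.sum((y - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    draws[s] = mu, sigma2

kept = draws[burn_in:]                           # discard the burn-in sample
mu_draws, sigma2_draws = kept[:, 0], kept[:, 1]
print(mu_draws.mean(), sigma2_draws.mean())
```

Each iteration draws from the full conditional distributions in turn, so the transition kernel depends only on the previous state of the chain. In practice the retained draws would also be inspected with convergence diagnostics before use.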

(13.) Such restrictions can be accommodated by replacing Eq. (19) with a truncated normal distribution. In this case, however, posterior moments are not available analytically and must be evaluated numerically, with consequential complications and loss of efficiency with respect to the MCMC algorithm discussed above (see Hajivassiliou & Ruud, 1994; Kadiyala & Karlsson, 1997, for further details).

(14.) The original formulation of the Litterman (1980) prior was of the form

where $\underset{\_}{\Gamma}\equiv diag([{\gamma}_{1}^{2},\dots ,{\gamma}_{n}^{2}])$ is assumed to be fixed, known, and diagonal. Highfield (1992) and Kadiyala and Karlsson (1997) observed that by modifying Litterman’s prior to make it symmetric across equations in the form of a NIW prior, the posterior p.d.f. was tractable.

(15.) Given the dummy observations in Eq. (34), the matrix Ω in Eq. (19) is diagonal and of the form

(16.) Canova (1992, 1993) proposed a set of artificial observations to account for seasonal patterns and potentially other peaks in the spectral densities.

(17.) This approach requires the use of iterative nonlinear optimization methods. The main issue with this approach is that nonstationary models have no unconditional (in other words, ergodic) distribution of the initial conditions. Also, while near-nonstationary models may have an ergodic distribution, the time required to arrive at the ergodic distribution from arbitrary initial conditions may be long. For this reason, using such a method requires strong beliefs about the stationarity of the model, which are rarely warranted in macroeconomics, and imposing the ergodic distribution on the first *p* observations may be unreasonable (see Sims, 2005a).

(18.) The treatment of unit roots in Bayesian and frequentist inference has been hotly debated. Important contributions include, among others, Sims (1988, 1991), Sims and Uhlig (1991), Koop and Steel (1991), Phillips (1991a, 1991b), Uhlig (1994a, 1994b), Müller and Elliott (2003), and Jarociński and Marcet (2011, 2014). The October/December 1991 issue of the *Journal of Applied Econometrics*, *6*(4), was entirely dedicated to this debate.

(19.) To put a heavier weight on the presence of a unit root, one could add to the observation in Eq. (43) an additional artificial observation that enforces the belief that *c* = 0. Alternatively, one could modify Eq. (43) to have a zero in place of ${\lambda}_{4}^{-1}$ as the observation corresponding to the intercept. In this case, the prior gives no plausibility to stationary models and, if used in isolation, allows for at least a single unit root without any restriction on *c*. Hence, despite the presence of a unit root, it may not necessarily reduce the importance of the deterministic component (see Sims, 2005a).

(20.) The sums-of-coefficients observations of Eq. (45) do not imply any restriction on the vector of intercepts *c*, since the artificial observations loading on the constant are set to zero. Therefore, this prior allows for a non-zero constant and hence for a linearly trending drift. To assign smaller probability to versions of the model in which deterministic transient components are much more important than the error term in explaining the series variance, one has to add to Eq. (45) artificial observations that favor *c* = 0 (see Sims, 2005a).

(21.) Among many others, contributions to the treatment of cointegration in Bayesian VARs are in Kleibergen and van Dijk (1994), Geweke (1996), Villani (2001), Kleibergen and Paap (2002), Strachan and Inder (2004), Koop et al. (2011), and Jochmann and Koop (2015).

(22.) Review articles are in Del Negro and Schorfheide (2011), Koop and Korobilis (2010), and Lubik and Matthes (2015).

(24.) See also the discussion in Karlsson (2013) for additional details on the specification of the prior for *α*_{t}.

(25.) Kim and Nelson (1999) is the standard reference for frequentist and Bayesian estimation of Markov switching models.

(26.) An additional potential property of the Panel VAR is the time-variation in the VAR coefficients. For ease of notation we abstract from this in the following exposition.

(27.) A Panel VAR can always be estimated as a Large VAR using standard macroeconomic priors (Bańbura et al., 2010). However, this comes at the cost of treating all the variables symmetrically, thus disregarding the unit structure and the fact that different variables may measure the same quantities in different units. Also, for large systems, the need to adopt overly tight priors to overcome the issue of dimensionality may distort the posterior distribution.

(28.) Alternatively, one could use standard priors to estimate a VAR for each of the *N* units separately and then average the results across units. Such a mean group estimator is inefficient relative to the pooled estimator under dynamic homogeneity but gives consistent estimates of the average system dynamic effects if dynamic heterogeneity is present. Conversely, the pooled estimator is inconsistent under dynamic heterogeneity due to the presence of correlation between the regressors and the error term.
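
The contrast between the two estimators can be illustrated with a small simulation. The sketch below is a deliberately simplified stand-in: each unit follows a univariate AR(1) without intercept rather than a VAR, and unit-level estimates are obtained by OLS rather than with informative priors. The panel dimensions and the grid of coefficients `rho` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 50, 500
rho = np.linspace(0.2, 0.8, N)        # heterogeneous AR(1) coefficients, mean 0.5

# Simulate y_{i,t} = rho_i * y_{i,t-1} + e_{i,t} for all units at once
y = np.zeros((N, T + 1))
for t in range(1, T + 1):
    y[:, t] = rho * y[:, t - 1] + rng.normal(size=N)

y_lag, y_cur = y[:, :-1], y[:, 1:]

# Mean group estimator: unit-by-unit OLS, then average across units
rho_hat_i = (y_lag * y_cur).sum(axis=1) / (y_lag ** 2).sum(axis=1)
mean_group = rho_hat_i.mean()

# Pooled estimator: one OLS regression imposing a common coefficient
pooled = (y_lag * y_cur).sum() / (y_lag ** 2).sum()

print(mean_group, pooled, rho.mean())
```

In this setup the pooled estimate weights each unit by the variance of its regressor, which is larger for more persistent units, so it drifts above the average coefficient, while the mean group estimate stays close to it.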