date: 04 April 2020

# Predictive Regressions

## Summary and Keywords

Predictive regressions are a widely used econometric environment for assessing the predictability of economic and financial variables using past values of one or more predictors. The applications considered by practitioners often involve predictors that have highly persistent, smoothly varying dynamics, as opposed to the much noisier variable being predicted. This imbalance tends to affect the accuracy of the estimates of the model parameters and the validity of inferences about them when one uses standard methods that do not explicitly recognize this and related complications. A growing literature has therefore introduced novel techniques specifically designed to produce accurate inferences in such environments. The frequent use of these predictive regressions in applied work has also led practitioners to question the validity of viewing predictability within a linear setting that ignores the possibility that predictability may occasionally be switched off. This in turn has generated a new stream of research aiming at introducing regime-specific behavior within predictive regressions in order to explicitly capture phenomena such as episodic predictability.

# The Basic Environment

Predictive regressions refer to linear regression models designed to assess the predictive power of past values of some economic or financial variable for the future values of another variable. In their simplest univariate form, these predictive regressions are formulated as

$$y_t = \beta_0 + \beta_1 x_{t-1} + u_t$$
(1)

with the main concern being the testing of the statistical significance of an estimate of $\beta_1$. Such models are in principle no different from simple linear regression specifications with lagged explanatory variables and for which standard inferences should apply under mild assumptions. However, the specific context in which they are encountered in many economics and finance applications, and the dynamic properties of commonly considered predictors in particular, have led to a vast body of research aiming to improve the quality and accuracy of inferences in such settings. Indeed, across many applications involving the estimation of such predictive regressions, it is often the case that predictors are highly persistent, behaving like nearly non-stationary processes, while predictands are typically noisier with rapidly mean-reverting dynamics. This imbalance in the stochastic properties of predictors and predictand is also often combined with the presence of sizable contemporaneous correlations between the shocks driving $y_t$ and $x_t$. The co-existence of these two important features and their common presence in many economic and finance applications tends to seriously distort inferences based on traditional significance tests that rely on standard normal approximations (e.g., t-ratios used on least squares–based estimates of $\beta_1$).

One of the most commonly encountered empirical applications subject to these complications has originated in the asset-pricing literature and has involved the study of the predictability of stock returns with valuation ratios and dividend yields in particular (e.g., Campbell & Yogo, 2006; Golez & Koudijs, 2018; Goyal & Welch, 2008; Koijen & Van Nieuwerburgh, 2011; Lettau & Van Nieuwerburgh, 2008; Lewellen, 2004; Stambaugh, 1999). Such predictors are well known to have roots close to unity in their autoregressive representation, making shocks to such series last for very long periods instead of dying off quickly, hence their labeling as highly persistent predictors. Stock returns, on the other hand, are well known to have very short memory with virtually no serial correlation, resulting in much noisier dynamics relative to those predictors. In parallel to these distinct stochastic characteristics of predictand and predictors, it is also often the case that shocks to scaled price variables (e.g., price to earnings, price to book value, price to sales) are contemporaneously negatively correlated with shocks to returns.

The distortions that affect traditional least squares– and t-ratio–based inferences conducted on $\beta_1$ in such settings typically materialize in the form of important size distortions that lead to too frequent wrong rejections of the null hypothesis and the finding of spurious predictability. It is important to emphasize, however, that these distortions are driven by the joint presence of persistence and contemporaneous correlations, with the latter’s magnitudes driving the seriousness of these wrong rejections.

These econometric complications have led to a vast research agenda aiming to develop alternative approaches to conducting inferences about $\beta_1$ with good size and power properties even under persistence and sizable contemporaneous correlations between predictors and predictand. A very prolific avenue of research in this context has involved recognizing the persistent nature of predictors by explicitly modeling them as nearly non-stationary local to unit root processes. Given that commonly used predictors such as valuation ratios cannot logically be viewed as pure unit root processes, as this would imply that prices and fundamentals (e.g., earnings, dividends) can diverge for long periods, the use of a near non-stationary framework offers a particularly useful compromise for capturing the stylized facts associated with these regression models. A popular specification for capturing persistence is the well-known local to unit root model often specified as

$$x_t = \left(1 - \frac{c}{T}\right) x_{t-1} + v_t$$
(2)

with $c$ referring to a strictly positive constant and $T$ to the sample size so that the associated autocorrelation coefficient is less than but possibly very close to unity. Such parameterizations lead to non-standard and non-Gaussian asymptotics for the associated test statistics used to test hypotheses on $\beta_1$, and their implementation requires the use of simulation-based critical values. These non-standard asymptotics can also easily accommodate contemporaneous correlations between $u_t$ and $v_t$, and it is generally hoped that they may lead to statistics with better size and power properties compared to the use of standard inferences relying on normal approximations.

One fundamental drawback of this more realistic framework, however, is that these non-standard asymptotics taking the form of stochastic integrals in Gaussian processes also depend on the unknown noncentrality parameter $c$ that controls persistence. This makes their practical implementation difficult. An important ensuing agenda then aimed at addressing this problem through more or less successful means. Early approaches involved considering bounds-type tests that rely on multiple tests conducted over a range of values of the nuisance parameter and subsequently corrected using Bonferroni bounds (Campbell & Yogo, 2006; Cavanagh, Elliott, & Stock, 1995; Jansson & Moreira, 2006). More recently the focus has shifted toward methods that involve either model or test statistic transformations so as to robustify the asymptotics to the influence of $c$. Examples include the use of instrumental variable as opposed to least squares–based estimation of $\beta_1$ with instruments designed in such a way that the resulting asymptotics no longer depend on $c$ (Kostakis, Magdalinos, & Stamatogiannis, 2015; Phillips & Magdalinos, 2009). Other related approaches have relied on model augmentation techniques that augment the original predictive regression with an additional predictor selected in such a way that inferences about $\beta_1$ have convenient nuisance parameter-free distributions (Breitung & Demetrescu, 2015). These two approaches have become the norm in the applied literature due to their good size and power properties and their ability to accommodate a rich set of features such as heteroskedasticity and serial correlation. A particularly useful feature of these methods is also their ability to handle multiple persistent predictors within (1) and to effectively be immune to persistence.

This line of research aiming at improving and robustifying inferences in the context of these predictive regressions also opened the way to novel approaches to modeling predictability and to the introduction of nonlinearities in particular. The main motivation driving this important extension and generalization was the recognition that predictability may not be a stable phenomenon but possibly varies across time or across economically relevant episodes. The predictive power of a predictor may, for instance, kick in solely during particular economic times while shutting off in other times (e.g., recessions versus expansions versus normal times). If ignored, the presence of such phenomena will almost certainly distort inferences about predictability in the sense of leading to conflicting outcomes depending on the sample periods being considered.

A burgeoning research agenda in this area has involved introducing the presence of regime-specific nonlinearities (e.g., structural breaks, threshold effects) within these predictive regressions while at the same time continuing to address the complications arising from the persistent nature of predictors and the particular type of endogeneity induced by the strong contemporaneous correlation between $u_t$ and $v_t$. An early example of a nonlinear predictive regression model in which nonlinearities have been modeled via threshold effects has, for instance, been introduced in Gonzalo and Pitarakis (2012, 2017). This new class of threshold predictive regressions allowed the parameters of the model to potentially alternate between two possible values depending on whether a variable proxying for the economic cycle exceeds or lies below a threshold parameter. This offered a convenient and intuitive way of attaching a cause to the presence of predictability while also allowing it to shut off during particular periods. Another related extension has involved allowing the parameters of (1) to be subject to structural breaks with time effectively acting as a threshold variable. Pitarakis (2017) has introduced a battery of tests designed to detect the presence of such effects while at the same time addressing the two common econometric complications. A related modeling framework has also been recently developed in Farmer, Schmidt, and Timmermann (2018), where the authors introduced the notion of pockets of predictability captured via smoothly varying functional parameters viewed as functions of time. Other fully non-parametric approaches effectively remaining agnostic about the functional form linking $y_t$ and $x_{t-1}$ have also been developed in Juhl (2014) and Kasparis, Andreou, and Phillips (2015), among others.

# Simple Predictive Regressions: Inference Problems and Early Research

Operating within the simple specification given by (1)–(2) is initially instructive to illustrate in greater depth the econometric complications that arise when testing the null hypothesis $H_0: \beta_1 = 0$ under the explicit modeling of the predictor as a near unit root process. For the sake of the exposition it is assumed that $u_t$ and $v_t$ are stationary disturbances that are i.i.d. but correlated and with the associated variance-covariance matrix given by $\Sigma = \begin{pmatrix} \sigma_u^2 & \sigma_{uv} \\ \sigma_{uv} & \sigma_v^2 \end{pmatrix}$. Given this simplified framework and some further regularity conditions (see Phillips, 1987), it is well known that the stochastic process $X_T(r) = x_{[Tr]}/\sqrt{T}$, where $x_{[Tr]} = \sum_{i=1}^{[Tr]} (1 - c/T)^{[Tr]-i} v_i$, satisfies an invariance principle with $X_T(r) \Rightarrow J_c(r)$ for $r \in [0,1]$. Here $J_c(r)$ is referred to as an Ornstein-Uhlenbeck process and can informally be viewed as the continuous time equivalent of an autoregressive process. More specifically, $J_c(r) = \int_0^r e^{-(r-s)c}\, dW_v(s)$ with $W_v(r)$ denoting a standard Brownian motion associated with the $v_t$'s. This process is clearly Gaussian but with the complication that its variance depends on a DGP-specific parameter, namely $c$. As a functional central limit theorem also holds for $w_t = (u_t, v_t)'$, with $T^{-1/2} \sum_{t=1}^{[Tr]} w_t \Rightarrow \Sigma^{1/2} (W_u(r), W_v(r))'$, following Cavanagh et al. (1995) the t-ratio associated with $\beta_1$ satisfies

$Display mathematics$
(3)

where $\rho = \sigma_{uv}/(\sigma_u \sigma_v)$ and $Z$ denotes a standard normal random variable.
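The dependence of the variance of $J_c(r)$ on $c$ can be checked numerically. The sketch below is only an illustration under stated assumptions: unit-variance Gaussian shocks, $x_0 = 0$, and $c > 0$ as in (2), in which case the Ornstein-Uhlenbeck limit at $r = 1$ has variance $(1 - e^{-2c})/(2c)$. The function names are hypothetical and not taken from any of the cited papers.

```python
import numpy as np

def simulate_endpoints(T, c, reps, rng):
    """Draw `reps` values of x_T from x_t = (1 - c/T) x_{t-1} + v_t.

    Assumptions (illustrative only): x_0 = 0 and v_t ~ i.i.d. N(0, 1).
    """
    phi = 1.0 - c / T
    x = np.zeros(reps)
    for _ in range(T):
        x = phi * x + rng.standard_normal(reps)
    return x

def ou_variance(c):
    """Variance of J_c(1) for c > 0 when J_c is driven by a standard Brownian motion."""
    return (1.0 - np.exp(-2.0 * c)) / (2.0 * c)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, reps = 400, 5000
    for c in (1.0, 5.0, 20.0):
        # Variance of x_T / sqrt(T) across replications versus its continuous-time limit
        sample_var = simulate_endpoints(T, c, reps, rng).var() / T
        print(f"c={c:5.1f}  simulated={sample_var:.3f}  OU limit={ou_variance(c):.3f}")
```

Larger values of $c$ correspond to faster mean reversion and hence a smaller limiting variance, which is why any limit theory built on $J_c(r)$ inherits $c$ as a nuisance parameter.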

The formulation in (3) is particularly instructive for understanding the nature of the complications that arise in predictive regressions and the joint role played by the presence of high persistence and a non-zero $\rho$ (induced by the non-zero contemporaneous covariance $\sigma_{uv}$) in particular. In such instances the limiting distribution in (3) depends on the noncentrality parameter $c$, complicating the practical implementation of inferences based on $t_{\hat\beta_1}$. If $\rho = 0$, however, we have $t_{\hat\beta_1} \Rightarrow N(0,1)$, suggesting that the normal approximation should lead to a test that is properly sized under sufficiently large sample sizes.
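The joint role of persistence and a non-zero $\rho$ can be illustrated with a small Monte Carlo experiment. This is a minimal sketch, assuming Gaussian unit-variance shocks, $x_0 = 0$, an intercept in the fitted regression, and a nominal 5% two-sided test based on the critical value 1.96. The function names and parameter values are illustrative choices, not taken from the cited literature.

```python
import numpy as np

def null_t_stats(T, c, rho, reps, rng):
    """t-ratios for H0: beta_1 = 0 from `reps` samples generated under the null."""
    phi = 1.0 - c / T
    v = rng.standard_normal((reps, T + 1))
    e = rng.standard_normal((reps, T + 1))
    u = rho * v + np.sqrt(1.0 - rho**2) * e      # corr(u_t, v_t) = rho, unit variances
    x = np.zeros((reps, T + 1))
    for t in range(1, T + 1):
        x[:, t] = phi * x[:, t - 1] + v[:, t]
    y = u[:, 1:]                                  # beta_0 = beta_1 = 0 under the null
    x_lag = x[:, :-1]
    # OLS with an intercept, written in demeaned form
    xd = x_lag - x_lag.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    sxx = (xd**2).sum(axis=1)
    beta_hat = (xd * yd).sum(axis=1) / sxx
    resid = yd - beta_hat[:, None] * xd
    s2 = (resid**2).sum(axis=1) / (T - 2)
    return beta_hat / np.sqrt(s2 / sxx)

def rejection_rate(T, c, rho, reps=2000, seed=0, crit=1.96):
    """Empirical size of the two-sided t-test that uses the normal critical value."""
    t_stats = null_t_stats(T, c, rho, reps, np.random.default_rng(seed))
    return float(np.mean(np.abs(t_stats) > crit))

if __name__ == "__main__":
    for rho in (0.0, -0.95):
        print(f"rho={rho:+.2f}  rejection rate={rejection_rate(250, 1.0, rho):.3f}")
```

Under these assumptions the rejection frequency with $\rho = 0$ should stay close to the nominal level, while a strongly negative $\rho$ combined with a small $c$ should produce visibly more rejections of a true null.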

Early research in this area has addressed the problem of the dependence of inferences on $c$ through a variety of methods which, although theoretically sound, were subject to practical shortcomings, often leading to tests that were conservative and had low power. Given the dependence of the quantiles of the limiting distribution in (3) on the unknown noncentrality parameter $c$, popular approaches relied on the early literature on multiple testing and Bonferroni-based techniques in particular.

In Cavanagh et al. (1995), for instance, the authors developed a Bonferroni-based confidence interval for $\beta_1$ that relied on an initial confidence interval for $c$ obtained following the confidence belt methodology of Stock (1991). Stock’s (1991) approach for constructing a confidence interval for $c$ (equivalently $\phi = 1 - c/T$) involves first implementing an Augmented Dickey Fuller (ADF)–type t-test for testing $H_0: \phi = 1$ on $x_t$. This ADF t-test is distributed as

$Display mathematics$
(4)

which depends solely on $c$. The idea is then to use the duality between hypothesis testing and confidence intervals to obtain a confidence interval for $c$ via the inversion of the acceptance region of the test. Letting $h_{L,\alpha_1/2}$ and $h_{U,1-\alpha_1/2}$ denote the $\alpha_1/2$ and $1-\alpha_1/2$ percentiles of $t_{adf}(c)$, we can write $\hat{t}_{adf}(c) \in [h_{L,\alpha_1/2}, h_{U,1-\alpha_1/2}]$ for the acceptance region of the test statistic. These critical values can then be inverted numerically to lead to the confidence interval for $c$, say $CI_c(\alpha_1) = [h_{U,1-\alpha_1/2}^{-1}(t_{adf}(c)), h_{L,\alpha_1/2}^{-1}(t_{adf}(c))] \equiv [c_L(\alpha_1), c_U(\alpha_1)]$, which is obtained for some given value of the test statistic and which effectively provides the range of values of $c$ that are in the acceptance region. For each value of $c$ in this interval, one can subsequently construct confidence intervals for $\beta_1$ using the limiting distribution of $t_{\hat\beta_1}$ in (3). It is worth pointing out, however, that these confidence intervals for $c$ have some undesirable properties in the sense of not being uniform in $\phi$ and leading to generally poor outcomes when the underlying $\phi$ is too far off the unit root scenario (see Mikusheva, 2007, who proposed an alternative way of constructing these confidence intervals for $c$ using a modification to the $\hat{t}_{adf}(c)$ statistic that leads to confidence intervals that are uniform across $\phi$). Given the confidence interval for $c$, it is then possible to proceed with a Bonferroni-based approach to obtain a confidence interval for $\beta_1$ that no longer depends on $c$. More formally a confidence interval for $\beta_1$ is first constructed for each value of $c$, say $CI_{\beta_1|c}(\alpha_2)$, using the limiting distribution of $t_{\hat\beta_1}$ in (3). A final confidence interval for $\beta_1$ that does not depend on $c$ is then obtained as the union across $c \in [c_L(\alpha_1), c_U(\alpha_1)]$ of these $CI_{\beta_1|c}(\alpha_2)$'s, leading to $CI_{\beta_1}(\alpha_1, \alpha_2) = [\min_{c_L(\alpha_1) \le c \le c_U(\alpha_1)} d_{t_{\hat\beta_1}, c, \alpha_2/2}, \max_{c_L(\alpha_1) \le c \le c_U(\alpha_1)} d_{t_{\hat\beta_1}, c, 1-\alpha_2/2}]$, with $d_{t_{\hat\beta_1}, c}$ referring to the critical values associated with (3). By the Bonferroni inequality, the resulting interval has coverage of at least $1 - \alpha_1 - \alpha_2$.
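The final union step can be sketched in code. The example below is a rough illustration rather than the procedure of Cavanagh et al. (1995): it takes the first-stage interval $[c_L, c_U]$ and the correlation $\rho$ as given inputs, and it approximates the $c$-specific critical values by finite-sample Monte Carlo under Gaussian shocks instead of using the limiting distribution in (3). All function names and numerical inputs are hypothetical.

```python
import numpy as np

def null_t_quantiles(T, c, rho, alpha2, reps, rng):
    """Simulated alpha2/2 and 1 - alpha2/2 quantiles of the t-ratio under H0 for a given c."""
    phi = 1.0 - c / T
    v = rng.standard_normal((reps, T + 1))
    u = rho * v + np.sqrt(1.0 - rho**2) * rng.standard_normal((reps, T + 1))
    x = np.zeros((reps, T + 1))
    for t in range(1, T + 1):
        x[:, t] = phi * x[:, t - 1] + v[:, t]
    # OLS with an intercept, written in demeaned form
    xd = x[:, :-1] - x[:, :-1].mean(axis=1, keepdims=True)
    yd = u[:, 1:] - u[:, 1:].mean(axis=1, keepdims=True)
    sxx = (xd**2).sum(axis=1)
    beta_hat = (xd * yd).sum(axis=1) / sxx
    s2 = ((yd - beta_hat[:, None] * xd) ** 2).sum(axis=1) / (T - 2)
    t_stats = beta_hat / np.sqrt(s2 / sxx)
    return np.quantile(t_stats, [alpha2 / 2.0, 1.0 - alpha2 / 2.0])

def bonferroni_critical_values(T, c_low, c_high, rho, alpha2=0.10,
                               n_grid=9, reps=2000, seed=0):
    """Min of lower and max of upper critical values over a grid of c in [c_low, c_high].

    c_low and c_high stand in for a first-stage confidence interval for c,
    which is assumed to be supplied by the user (hypothetical inputs here).
    """
    rng = np.random.default_rng(seed)
    grid = np.linspace(c_low, c_high, n_grid)
    q = np.array([null_t_quantiles(T, c, rho, alpha2, reps, rng) for c in grid])
    return float(q[:, 0].min()), float(q[:, 1].max())

if __name__ == "__main__":
    lower, upper = bonferroni_critical_values(T=250, c_low=2.0, c_high=15.0, rho=-0.9)
    print(f"Bonferroni critical values for the t-ratio: [{lower:.2f}, {upper:.2f}]")
```

Taking the minimum and maximum over the whole grid is what makes the approach conservative: the wider the first-stage interval for $c$, the wider the resulting band.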

Within this methodological context it is important to recognize that the choice of using the ADF-based t-ratio for obtaining a confidence interval for $c$ followed by the use of $t_{\hat\beta_1}$ is arbitrary in the sense that alternative test statistics fulfilling the same purpose may also be considered. An important literature followed this line of research by considering alternative test statistics with better optimality properties and better power properties across the relevant range of $\phi$; see Elliott and Stock (2001), for instance, for an alternative approach to obtaining confidence intervals for $c$ that relies on the point optimal test proposed in Elliott, Rothenberg, and Stock (1996). In an influential paper Campbell and Yogo (2006) focused on these issues in the specific context of the predictive regression setting as in (1)–(2). For the construction of a confidence interval for $c$, they proposed to rely on the more efficient DF-GLS test of Elliott et al. (1996) and for which they provided tabulations linking the magnitude of this test statistic with a corresponding confidence interval for $c$. Given this alternative approach to obtaining the relevant range of $c$ values, they subsequently also introduced an alternative to $t_{\hat\beta_1}$, which they referred to as their Q statistic. The latter is effectively a t-ratio on $\beta_1$ but obtained from the augmented specification $y_t = \beta_1 x_{t-1} + \lambda (x_t - \phi x_{t-1}) + \eta_t$ with $\lambda = \sigma_{uv}/\sigma_v^2$ and shown to lead to better power properties compared to the use of $t_{\hat\beta_1}$.

An important limitation of all these two-stage confidence interval–based approaches is that the resulting confidence intervals are typically not uniform in $\phi$, have potentially zero coverage probabilities, and may lead to poor power properties when it comes to conducting inferences about $\beta_1$. An excellent technical discussion of these shortcomings can be found in Phillips (2015). Also noteworthy is the fact that these methods are difficult to generalize to multiple predictor settings or for handling more flexible assumptions on the variances of the error processes.

Alternative routes to improving inferences about $\beta_1$ within (1)–(2) have also been considered around the same time as these Bonferroni-based approaches. One line of research involved improving the quality of the least squares estimator of $\beta_1$ by removing its bias. Note, for instance, that the least squares estimator of $\beta_1$ obtained from (1) is not unbiased as the predictor is not strictly exogenous. As shown in Stambaugh (1999) the bias of $\hat\beta_1$ can be formulated as $E[\hat\beta_1 - \beta_1] = (\sigma_{uv}/\sigma_v^2) E[\hat\phi - \phi]$. Under $\phi \approx 1$ it is well known that $\hat\phi$ has a strong downward bias, and as $\sigma_{uv}$ is typically negative, one clearly expects a strong upward bias in $\hat\beta_1$. It is this undesirable feature of $\hat\beta_1$ that this literature has attempted to address by appealing to existing results on biases of first order autocorrelation coefficients (e.g., Kendall, 1954) such as $E[\hat\phi - \phi] = -(1 + 3\phi)/T + O(T^{-2})$. We note, for instance, that an adjusted estimator of the slope parameter $\hat\beta_1^c = \hat\beta_1 + (\sigma_{uv}/\sigma_v^2)((1 + 3\phi)/T)$ satisfies $E[\hat\beta_1^c - \beta_1] = 0$ up to terms of order $O(T^{-2})$ under known $\phi$. Lewellen (2004) took advantage of these results to devise an alternative approach to testing $H_0: \beta_1 = 0$ that relies on a bias corrected estimator of $\beta_1$ given by $\hat\beta_1^{lw} = \hat\beta_1 - (\sigma_{uv}/\sigma_v^2)(\hat\phi - \phi)$ with $\phi$ set at $0.9999$. This expression is subsequently operationalized by replacing $\sigma_{uv}$ and $\sigma_v^2$ by suitable estimates. Naturally these approaches rely on an important set of assumptions for their validity (e.g., normality) and require a certain level of ad hoc input.
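The first-order bias adjustment can be illustrated with a short simulation. This is a sketch under explicit assumptions: Gaussian shocks, $x_0 = 0$, residual-based estimates of $\sigma_{uv}/\sigma_v^2$, and, as a plug-in choice made here rather than in the text, $\hat\phi$ substituted for the unknown $\phi$ in the Kendall-type term. The function names are hypothetical.

```python
import numpy as np

def simulate_pair(T, phi, rho, rng, beta1=0.0):
    """Simulate y_1..y_T and x_0..x_T with corr(u_t, v_t) = rho (illustrative DGP)."""
    v = rng.standard_normal(T + 1)
    u = rho * v + np.sqrt(1.0 - rho**2) * rng.standard_normal(T + 1)
    x = np.zeros(T + 1)
    for t in range(1, T + 1):
        x[t] = phi * x[t - 1] + v[t]
    return beta1 * x[:-1] + u[1:], x

def bias_corrected_slope(y, x):
    """Return (OLS slope, slope adjusted by the first-order bias term).

    Assumption: the unknown phi in (1 + 3 phi)/T is replaced by its OLS estimate.
    """
    T = len(y)
    X = np.column_stack([np.ones(T), x[:-1]])
    b = np.linalg.lstsq(X, y, rcond=None)[0]        # predictive regression
    a = np.linalg.lstsq(X, x[1:], rcond=None)[0]    # AR(1) fit for the predictor
    u_hat = y - X @ b
    v_hat = x[1:] - X @ a
    ratio = (u_hat @ v_hat) / (v_hat @ v_hat)       # estimate of sigma_uv / sigma_v^2
    return b[1], b[1] + ratio * (1.0 + 3.0 * a[1]) / T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    draws = np.array([bias_corrected_slope(*simulate_pair(100, 0.95, -0.9, rng))
                      for _ in range(2000)])
    raw_mean, adj_mean = draws.mean(axis=0)
    print(f"mean OLS slope={raw_mean:.4f}  mean adjusted slope={adj_mean:.4f}  (true value 0)")
```

With a negative $\rho$ the average OLS slope should sit above the true value of zero, while the adjusted estimator should be centered much closer to it; the adjustment is only first order, so some bias typically remains when $\phi$ is very close to one.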

# Robustifying Inferences to the Noncentrality Parameter: Recent Developments

A more recent trend in this literature on conducting inferences in predictive regressions with persistent predictors has aimed to jointly address two key concerns. The first concern is the need to operate within a more flexible environment than (1)–(2) that can accommodate multiple predictors while also taking into account complications such as serial correlation and heteroskedasticity. The second concern stems from the need to develop inferences with good size and power properties that are also robust to the persistence properties of the predictors. A more empirically relevant generalization of (1)–(2) can be formulated as

$$y_t = \beta_0 + \beta' x_{t-1} + u_t$$
(5)

$$x_t = \left(I_p - \frac{C}{T}\right) x_{t-1} + v_t$$
(6)

with $C = \mathrm{diag}(c_1, \ldots, c_p)$, $c_i > 0$, and $u_t$ and $v_t$ modeled as possibly dependent and cross-correlated stationary processes.

A novel approach to the problem of estimating the parameters of (5) and testing relevant hypotheses on $\beta$ has been developed in Kostakis et al. (2015), where the authors introduced an instrumental variable-based approach designed in such a way that the resulting asymptotics of a suitably normalized Wald statistic for testing hypotheses of the form $H_0: R\beta = r$ in (5) are $\chi^2$ distributed and not dependent on the $c_i$'s. Their framework is in fact more general than (5)–(6) as it can also accommodate predictors that are more or less persistent than those modeled as in (6), including pure unit root, stationary, or mildly persistent processes parameterized as $x_{it} = (1 - c_i/T^{\alpha}) x_{it-1} + v_{it}$ with $\alpha \in (0,1)$. The strength of the methodology lies in the fact that one can effectively operate and conduct inferences about $\beta$ while being agnostic about the degree of persistence of the predictors. Its reliance on a standard Wald statistic also makes the implementation of traditional Newey-West type corrections for accommodating serial correlation and heteroskedasticity particularly straightforward. This instrumental variable approach has originated in the earlier work of Phillips and Magdalinos (2009), who focused on a multivariate cointegrated system closely related to (5)–(6) with (5) replaced with $y_t = \beta' x_t + u_t$ and labeled as a cointegrated system with persistent predictors.

The main idea behind the instrumental variable approach involves instrumenting $x_t$ with a slightly less persistent version of itself constructed with the help of the first differenced $x_t$'s. In this sense the IV is generated using solely model-specific information and does not require any external information, hence its labeling as IVX. More specifically, the p-vector of instruments for $x_t$ is constructed as

$$\tilde{z}_t = \sum_{j=1}^{t} \left(I_p - \frac{C_z}{T^{\delta}}\right)^{t-j} \Delta x_j$$
(7)

for a given $\delta \in (0,1)$ and some given $C_z = \mathrm{diag}(c_{z1}, \ldots, c_{zp})$, $c_{zi} > 0$, for $i = 1, \ldots, p$. Note that as $\delta < 1$ the instruments are less persistent than $x_t$. From (6) we have $\Delta x_t = -(C/T) x_{t-1} + v_t$, which when combined within (7) leads to the following decomposition of the instrument vector

$$\tilde{z}_t = z_t - \frac{C}{T} \Psi_t$$
(8)

with $z_t = \sum_{j=1}^{t} (I_p - C_z/T^{\delta})^{t-j} v_j$ and $\Psi_t = \sum_{j=1}^{t} (I_p - C_z/T^{\delta})^{t-j} x_{j-1}$. Note that $z_t$ is such that $z_t = (I_p - C_z/T^{\delta}) z_{t-1} + v_t$, while $\Psi_t$ is a remainder term shown not to have any influence on the asymptotics. For practical purposes these mildly integrated IVs are generated as a filtered version of $x_t$ using (7) with a given $C_z$ and $\delta$ and are approximately equivalent to $z_t$. These are then used to obtain an IV-based estimator of $\beta$ from (5). More formally, letting $X$ and $Z$ denote the regressor and IV matrices, respectively, both obtained by stacking the elements of $x_t$ and $\tilde{z}_t$, we have $\hat\beta_{ivx} = (Z'X)^{-1} Z'y$, and the associated conditionally homoskedastic version of the Wald statistic is given by

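$$W_T = (R\hat\beta_{ivx} - r)' \left[ R\, (Z'X)^{-1} \left(\tilde\sigma_u^2\, Z'Z\right) (X'Z)^{-1} R' \right]^{-1} (R\hat\beta_{ivx} - r)$$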
(9)

with $\tilde\sigma_u^2 = \sum_t (y_t - \hat\beta_{ivx}' x_{t-1})^2 / T$. In the context of (5)–(6) Kostakis et al. (2015) established that

$$W_T \Rightarrow \chi^2(m)$$
(10)

with $m$ referring to the rank of the restriction matrix $R$, thus removing the need to be concerned with the magnitude of the $c_i$'s that parameterize the persistent predictors in the DGP.
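For a single predictor, the construction of the instrument and the Wald statistic can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the instrument is generated by the recursion $\tilde{z}_t = (1 - c_z/T^{\delta}) \tilde{z}_{t-1} + \Delta x_t$ with $\tilde{z}_0 = 0$, uses demeaned variables to handle the intercept, imposes conditional homoskedasticity, and omits any finite sample correction. Function names, the simulated DGP, and the default values are illustrative choices.

```python
import numpy as np

def simulate(T, c, rho, beta1, rng):
    """Simulate y_1..y_T and x_0..x_T from a univariate predictive regression (illustrative DGP)."""
    phi = 1.0 - c / T
    v = rng.standard_normal(T + 1)
    u = rho * v + np.sqrt(1.0 - rho**2) * rng.standard_normal(T + 1)
    x = np.zeros(T + 1)
    for t in range(1, T + 1):
        x[t] = phi * x[t - 1] + v[t]
    return beta1 * x[:-1] + u[1:], x

def ivx_instrument(x, delta=0.95, c_z=1.0):
    """Filter the first differences of x_0..x_T into a mildly integrated instrument."""
    T = len(x) - 1
    r_z = 1.0 - c_z / T**delta
    dx = np.diff(x)
    z = np.zeros(T + 1)                  # z[0] = 0 by construction
    for t in range(1, T + 1):
        z[t] = r_z * z[t - 1] + dx[t - 1]
    return z

def ivx_wald(y, x, delta=0.95, c_z=1.0):
    """Univariate IVX slope estimate and homoskedastic Wald statistic for H0: beta_1 = 0."""
    z_lag = ivx_instrument(x, delta, c_z)[:-1]
    z_d = z_lag - z_lag.mean()
    x_d = x[:-1] - x[:-1].mean()
    y_d = y - y.mean()
    szx = z_d @ x_d
    beta_ivx = (z_d @ y_d) / szx
    resid = y_d - beta_ivx * x_d
    sigma2_u = resid @ resid / len(y)
    wald = beta_ivx**2 * szx**2 / (sigma2_u * (z_d @ z_d))
    return float(beta_ivx), float(wald)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y, x = simulate(T=500, c=5.0, rho=-0.95, beta1=0.0, rng=rng)
    beta_ivx, wald = ivx_wald(y, x)
    # 3.84 is the 5% critical value of a chi-squared distribution with one degree of freedom
    print(f"beta_ivx={beta_ivx:.4f}  Wald={wald:.2f}  reject at 5%: {wald > 3.84}")
```

The same critical value is used regardless of the value of `c` in the simulated DGP, which is the practical appeal of the approach.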

Here it is important to point out that $\hat\beta_{ivx}$ continues to have a limiting distribution that depends on the $c_i$'s, so that the strength of the IVX methodology operates via the Wald statistic’s variance normalization, as illustrated by the middle term in (9), which effectively cancels out the asymptotic variance of $\hat\beta_{ivx}$, leading to an identity matrix (due to the asymptotic mixed normality of $\hat\beta_{ivx}$). Note also that the use of this IV approach is not inconsequential for the asymptotic properties of $\hat\beta_{ivx}$, which converges at a rate slower than $\hat\beta$ with a rate determined by the magnitude of $\delta$ used in the construction of the IVs. More specifically, $\hat\beta_{ivx} - \beta = O_p(T^{-(1+\delta)/2})$, which can be compared with the T-consistency of the standard least squares estimator $\hat\beta$.

To highlight some of these properties more explicitly, it is useful to revisit the simple univariate setting of (1)–(2). Letting $y_t^*$, $x_t^*$, and $\tilde{z}_t^*$ denote the demeaned versions of the variables of interest, the IVX-based estimator of $\beta_1$ is given by

$$\hat\beta_1^{ivx} = \frac{\sum_t y_t^*\, \tilde{z}_{t-1}^*}{\sum_t x_{t-1}^*\, \tilde{z}_{t-1}^*}$$
(11)

and from Phillips and Magdalinos (2009) and Kostakis et al. (2015) we have

$Display mathematics$
(12)

which highlights the fact that the distribution of the IVX estimator continues to depend on $c$ via the presence of the $J_c(r)$ process in the asymptotic variance. Thanks to the mixed normality of $\hat\beta_1^{ivx}$, however, the use of the IV-based variance normalization embedded in the Wald statistic given here by

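$$W_T(\beta_1 = 0) = \frac{\left(\hat\beta_1^{ivx}\right)^2 \left(\sum_t x_{t-1}^*\, \tilde{z}_{t-1}^*\right)^2}{\tilde\sigma_u^2 \sum_t \left(\tilde{z}_{t-1}^*\right)^2}$$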
(13)

leads to the outcome that $W_T(\beta_1 = 0) \Rightarrow \chi^2(1)$. At this stage it is also useful to point out that the demeaning of the IVs used in (11) and (13) was not strictly necessary as the IVX-based estimator of $\beta_1$ is invariant to their demeaning, as discussed in Kostakis et al. (2015).

The implementation of the estimator $\hat\beta_1^{ivx}$ requires one to take a stance on the magnitudes of $c_z$ and $\delta$, which are needed for generating the instrumental variables. As the choice of $c_z$ is innocuous, Phillips and Magdalinos (2009) suggest setting $c_z = 1$. The impact of $\delta$ used in the construction of the IVs is more problematic, however. Although the asymptotic analysis requires $\delta \in (0,1)$, it is clear that a choice for $\delta$ that is close to 1 will make the IV closer to the original variable that it is instrumenting. Choosing a $\delta$ that is much lower than 1 will have the opposite effect. As the choice of $\delta$ directly influences the rate of convergence of $\hat\beta_{ivx}$, with lower magnitudes of $\delta$ implying a slower rate of convergence, it is natural to expect that the choice of $\delta$ may raise important size versus power trade-offs, in smaller samples in particular. Kostakis et al. (2015) argue that a choice such as $\delta = 0.95$ offers excellent size/power trade-offs while they advise against choosing $\delta < 0.9$ due to potentially negative power implications. As shown in their simulations, the closer $\delta$ is to 1 the better the power properties of the IVX-based Wald statistic. However, this choice also tends to create nonignorable size distortions in moderate sample sizes such as $T = 500$. This is an issue the authors have explored in great detail, showing that the estimation of an intercept in (5) is the key driver of these size distortions that further amplify as $\delta \to 1$. To remedy this problem they introduced a finite sample correction to the formulation of the Wald statistic in (9) that is shown to make the Wald statistic match its asymptotic limit very accurately in finite samples even for $\delta$ close to 1. Note also that the inclusion of this finite sample correction has no bearing on the $\chi^2$ asymptotics in (10). In the context of the formulation in (9), the middle term $\tilde\sigma_u^2 Z'Z$ of the quadratic form is replaced with $\tilde\sigma_u^2 [Z'Z - \bar{z}_T \bar{z}_T' (1 - \hat\gamma)]$ with $\hat\gamma = \hat\sigma_{uv}^2/(\hat\sigma_u^2 \hat\sigma_v^2)$ and $\bar{z}_T$ referring to the p-vector of sample means of the IVs. This simple correction is shown to lead to a Wald statistic with excellent size control and power across a very broad range of persistence parameters.

An alternative yet similar approach to handling inferences within models such as (5)–(6) was also introduced in Breitung and Demetrescu (2015), who focused on a model augmentation approach instead. In the context of a simple predictive regression, the idea behind variable augmentation is to expand the specification in (1) with an additional carefully chosen regressor and to test $H_0: \beta_1 = 0$ in

$Display mathematics$
(14)

ignoring the restriction $\beta_1 = \psi_1$. They subsequently show that choosing $z_t$ to satisfy a range of characteristics, including that it is less persistent than $x_t$, leads to standard normally distributed t-ratios despite the presence of the highly persistent predictor in the DGP. These characteristics effectively require $z_t$ and related cross-moments to satisfy law of large numbers and CLT type results (e.g., for $\eta \in [0, 1/2]$, $\sum z_{t-1}^2 / T^{1+2\eta} = O_p(1)$, $\sum z_{t-1}^2 u_t^2 / T^{1+2\eta} \Rightarrow V_{zu} = O_p(1)$, $\sum z_{t-1} x_{t-1} / T^{3/2+\eta} \to_p 0$, and $\sum z_{t-1} u_t / (\sqrt{V_{zu}}\, T^{1/2+\eta}) \Rightarrow N(0,1)$). Inferences can be conducted using a t-statistic that can be further corrected for heteroskedasticity à la Eicker-White. There is naturally a broad range of candidates for $z_t$ that satisfy these requirements, including for instance the IVX variable of Phillips and Magdalinos (2009) but also fractionally integrated processes, short memory processes, etc. As discussed by the authors these choices may have important implications for the power properties of the tests. The framework in (14) can also be straightforwardly adapted to include both deterministic components such as an intercept and trends and multiple predictors as in (5), leading to $\chi^2$ distributed Wald statistics for testing $H_0: \beta = 0$.

# Capturing Nonlinearities within Predictive Regressions

This vast body of research on predictive regressions has mainly operated within a linear setting, implying that predictability if present is a stable phenomenon in the sense that the full sample-based estimator $\hat\beta$ converges to its true and potentially non-zero counterpart $\beta$. This naturally rules out scenarios whereby predictability may be a time-varying phenomenon with periods during which $\beta = 0$ and periods during which $\beta \neq 0$. Ignoring such economically meaningful phenomena may seriously distort the validity of standard techniques and the reliability of conclusions about the presence or absence of predictability. In the context of the predictability of stock returns, for instance, the presence of such phenomena may explain the conflicting empirical results that have appeared in the applied literature depending on the sample periods being considered.

These concerns have led to a novel research agenda that aimed to explicitly account for potential time variation in predictability by considering predictive regressions specified as

$Display mathematics$
(15)

with $x_t$ as in (6). Naturally this more realistic and flexible setting raises its own difficulties as one needs to take a stance on the type of time variation driving the evolving parameters. Popular parametric approaches that have been considered in the literature include standard structural breaks and threshold effects, among others. All of these regime-specific approaches effectively model time variation as

$Display mathematics$
(16)

with $D_t$ referring to a suitable 0/1 dummy variable. Such specifications allow predictability to shut off over particular periods (e.g., $\beta_1 = 0$ and $\beta_2 \neq 0$) determined by the way the dummy variables have been defined, so that hypotheses such as $H_0: \beta_1 = \beta_2$ or $H_0: \beta_1 = \beta_2 = 0$ become important to assess and call for a suitable toolkit.

Most of this literature has operated within simple univariate settings with only limited results developed for the multi-predictor case. In Gonzalo and Pitarakis (2012, 2017) the authors argued that a threshold-based parameterization of (16) can provide an economically meaningful yet parsimonious way of modeling time variation in the $\beta$'s. The inclusion of threshold effects effectively turns the linear predictive regressions into piecewise linear processes in which regimes are determined by the magnitude of a suitable threshold variable selected by the investigator. More formally within the simple predictive regression context, a two-regime threshold specification conforming to the notation in (16) can be formulated as

$$y_t=(\beta_{01}+\beta_{11}x_{t-1})\,I(q_t\leq\gamma)+(\beta_{02}+\beta_{12}x_{t-1})\,I(q_t>\gamma)+u_t$$
(17)

where $q_t$ is an observed threshold variable whose magnitude relative to $\gamma$ determines the regime structure. If $q_t$ is taken as a proxy of the business cycle, for instance, the specification in (17) could allow predictability to kick in (or be weaker/stronger) across economic episodes such as expansions and recessions. The fact that the threshold variable $q_t$ is under the control of the investigator can also be viewed as particularly advantageous in this context as it allows one to attach an observable cause to what drives time variation in predictability. An important additional advantage of using piecewise linear structures such as (17) comes from the fact that such functions may provide good approximations for a much wider class of functional forms as demonstrated in Petruccelli (1992).
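To make the regime structure in (17) concrete, the following minimal sketch simulates a two-regime threshold predictive regression with a local to unit root predictor and recovers the regime-specific coefficients by least squares when the threshold is treated as known. The function names, the AR(1) process chosen for $q_t$, the parameter values, and the independence of the shocks are illustrative assumptions; the framework discussed here also allows for endogeneity, which this sketch ignores.

```python
import numpy as np

def simulate_threshold_pr(T=2000, c=5.0, gamma=0.0,
                          b1=(0.0, 0.0), b2=(0.0, 0.5), seed=0):
    """Simulate a two-regime threshold predictive regression.

    b1 = (intercept, slope) when q_t <= gamma, b2 likewise when q_t > gamma.
    All numerical values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    rho = 1.0 - c / T                     # local to unit root persistence
    v = rng.standard_normal(T + 1)        # predictor shocks
    e = rng.standard_normal(T + 1)        # threshold variable shocks
    u = rng.standard_normal(T)            # predictand shocks (independent here)
    x = np.zeros(T + 1)
    q = np.zeros(T + 1)
    for t in range(1, T + 1):
        x[t] = rho * x[t - 1] + v[t]      # highly persistent predictor
        q[t] = 0.5 * q[t - 1] + e[t]      # stationary AR(1) threshold variable
    xlag, qt = x[:-1], q[1:]
    low = qt <= gamma
    y = np.where(low, b1[0] + b1[1] * xlag, b2[0] + b2[1] * xlag) + u
    return y, xlag, qt

def regime_ols(y, xlag, q, gamma):
    """Least squares fit of intercept and slope within each regime."""
    out = {}
    for name, mask in (("low", q <= gamma), ("high", q > gamma)):
        X = np.column_stack([np.ones(mask.sum()), xlag[mask]])
        out[name] = np.linalg.lstsq(X, y[mask], rcond=None)[0]
    return out

if __name__ == "__main__":
    y, xlag, q = simulate_threshold_pr()
    print(regime_ols(y, xlag, q, gamma=0.0))
```

With the default values, predictability is switched off in the low regime and active in the high regime, which is the kind of episodic pattern the threshold specification is designed to capture.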

In Gonzalo and Pitarakis (2012) the authors focused on predictive regressions of the type presented in (17) with their stochastic properties assumed to mimic the environments considered in the linear predictive regression literature (e.g., allowing for persistence and endogeneity). The threshold variable $q_t$ was in turn modeled as a strictly stationary and ergodic process whose innovations could potentially be correlated with those driving the predictor and predictand. Despite the presence of a highly persistent predictor parameterized as a local to unit root process, Gonzalo and Pitarakis (2012) showed that a Wald-type statistic for testing the null hypothesis of linearity ($H_0:(\beta_{01},\beta_{11})=(\beta_{02},\beta_{12})$) has a well-known limiting distribution that is free of nuisance parameters and, more importantly, does not depend on $c$. As the framework in (17) also raises the issue of unidentified nuisance parameters under the null hypothesis (in this instance $\gamma$), inferences are conducted using supremum Wald-type statistics viewed as a function of the unknown threshold parameter $\gamma$. Under suitable assumptions on the density of $q_t$, the indicator functions satisfy $I(q_t\leq\gamma)=I(F(q_t)\leq F(\gamma))$, with $F(\cdot)$ denoting the distribution function of $q_t$, so that for purely technical reasons the Wald statistic can also be viewed as a function of $F(\gamma)\equiv\lambda$. The key result in Gonzalo and Pitarakis (2012) is given by

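$$\sup_{\lambda}W_T(\lambda)\Rightarrow\sup_{\lambda}\frac{[B(\lambda)-\lambda B(1)]'[B(\lambda)-\lambda B(1)]}{\lambda(1-\lambda)}$$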
(18)

with $B(\lambda)$ denoting a standard Brownian motion whose dimension is given by the number of parameters whose equality is being tested under the null. A remarkable property of the limiting distribution in (18) is its robustness to the local to unit root parameter $c$, making inferences straightforward to implement. The task is further facilitated by the fact that the limiting distribution can be recognized as a normalized vector Brownian bridge and is extensively tabulated in the literature (see, e.g., Andrews, 1993). It is also important to note that the result in (18) remains valid in the context of (5)–(6) involving multiple predictors with potentially different $c_i$'s. A rejection of the null hypothesis of linearity in (17) would clearly support the presence of regime-specific predictability in $y_t$.
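As an illustration of how a supremum Wald-type statistic for linearity might be computed in practice, the sketch below searches over a grid of candidate thresholds taken from the empirical quantiles of the threshold variable. The homoskedastic form $T(SSR_0-SSR_1(\gamma))/SSR_1(\gamma)$, the 15% trimming, the grid size, and the function name are assumptions made for this example and need not match the exact normalization used in Gonzalo and Pitarakis (2012).

```python
import numpy as np

def sup_wald_linearity(y, xlag, q, trim=0.15, n_grid=100):
    """Sup-Wald statistic for linearity against a two-regime threshold model.

    Returns the maximized statistic and the threshold value attaining it.
    Homoskedastic Wald form; trimming and grid size are illustrative choices.
    """
    T = len(y)
    X = np.column_stack([np.ones(T), xlag])
    resid0 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr0 = resid0 @ resid0                      # restricted (linear) fit
    grid = np.quantile(q, np.linspace(trim, 1 - trim, n_grid))
    best_w, best_g = -np.inf, None
    for g in grid:
        low = (q <= g).astype(float)[:, None]
        Z = np.hstack([X * low, X * (1 - low)])  # regime-specific regressors
        resid1 = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        ssr1 = resid1 @ resid1
        w = T * (ssr0 - ssr1) / ssr1
        if w > best_w:
            best_w, best_g = w, g
    return best_w, best_g
```

The resulting value would then be compared with critical values of the normalized Brownian bridge-type limit, for the same trimming fraction and number of restrictions.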

Another hypothesis of interest in this context is the joint null $H_0:\beta_{01}=\beta_{02},\beta_{11}=\beta_{12}=0$, non-rejection of which would support a martingale difference type of behavior for stock returns. Unlike the scenario in (18), however, the Wald statistic associated with this latter hypothesis has a limiting distribution that depends on $c$, which led Gonzalo and Pitarakis (2012) to develop an IVX-type Wald statistic. More specifically, they showed that the Wald statistic for testing $H_0:\beta_{01}=\beta_{02},\beta_{11}=\beta_{12}=0$ in (17) is asymptotically equivalent to the sum of two independent Wald statistics. The first is $W_T(\lambda)$ in (18), used for testing $H_0:(\beta_{01},\beta_{11})=(\beta_{02},\beta_{12})$. The second is associated with testing $H_0:\beta_1=0$ in the linear predictive regression in (1), for which an IVX procedure can be implemented, say $W_T^{ivx}(\beta_1=0)$, known to be distributed as $\chi^2(1)$. This allowed them to construct a novel statistic given by the sum of these two Wald statistics, $\sup_{\lambda}W_T(\lambda)+W_T^{ivx}(\beta_1=0)$, and shown to be distributed as $\sup_{\lambda}[B(\lambda)-\lambda B(1)]'[B(\lambda)-\lambda B(1)]/[\lambda(1-\lambda)]+\chi^2(1)$. Although non-standard, this limit is free of the influence of $c$ and can be easily tabulated via simulation methods.
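The tabulation by simulation mentioned above can be sketched as follows: Brownian motions are approximated by scaled partial sums of Gaussian draws, the normalized Brownian bridge functional is maximized over a trimmed range of $\lambda$, and an independent $\chi^2(1)$ draw is added. The trimming interval, the discretization, the number of replications, and the function name are all assumptions of this example rather than choices documented in the source.

```python
import numpy as np

def simulate_limit(p=2, n_rep=2000, n_steps=500, trim=0.15,
                   add_chi2=True, seed=0):
    """Draws from sup over lambda of the normalized squared Brownian bridge,
    optionally plus an independent chi-square(1) variate.

    p is the dimension of the Brownian motion (number of restrictions).
    """
    rng = np.random.default_rng(seed)
    lam = np.arange(1, n_steps + 1) / n_steps
    keep = (lam >= trim) & (lam <= 1 - trim)     # trimmed range for the sup
    draws = np.empty(n_rep)
    for r in range(n_rep):
        # Partial sums scaled by sqrt(n_steps) approximate B(lambda)
        B = np.cumsum(rng.standard_normal((n_steps, p)), axis=0) / np.sqrt(n_steps)
        bridge = B - lam[:, None] * B[-1]        # B(lambda) - lambda * B(1)
        stat = (bridge[keep] ** 2).sum(axis=1) / (lam[keep] * (1 - lam[keep]))
        draws[r] = stat.max()
    if add_chi2:
        draws = draws + rng.chisquare(1, size=n_rep)
    return draws

if __name__ == "__main__":
    d = simulate_limit()
    print(np.quantile(d, [0.90, 0.95, 0.99]))    # simulated critical values
```

Upper quantiles of the simulated draws serve as approximate critical values; finer discretization and more replications would sharpen them.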

A rejection of these joint null hypotheses is naturally problematic to interpret when one is solely interested in whether regime-specific predictability is induced by the highly persistent predictor $x_t$. This is because a rejection of the null may occur not because of shifting slope parameters but due to shifting intercepts instead (i.e., $\beta_{01}\neq\beta_{02}$). This issue has been subsequently addressed in Gonzalo and Pitarakis (2017), where the authors developed a Wald-type test statistic for $H_0:\beta_{11}=\beta_{12}=0$ designed in such a way that its large sample behavior remains robust to whether $\beta_{01}=\beta_{02}$ or $\beta_{01}\neq\beta_{02}$. Their method effectively relies on obtaining a conditional least squares–based estimator of the unknown threshold parameter obtained from the null restricted version of (17) and using it as a plug-in estimator within an IVX-based Wald statistic for testing $H_0:\beta_{11}=\beta_{12}=0$. This is then shown to be distributed as $\chi^2(2)$ under the null regardless of whether the threshold parameter estimator is spurious or consistent for an underlying true value, i.e., regardless of whether the DGP has threshold effects in its intercept.
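The first step of this plug-in strategy, estimating the threshold from the null restricted model in which only the intercepts may differ across regimes, can be sketched as a grid search that minimizes the sum of squared residuals. The function name, the trimming fraction, and the quantile-based grid are assumptions of this example, and the subsequent IVX-based Wald step is not shown.

```python
import numpy as np

def cls_threshold_intercept_only(y, q, trim=0.15, n_grid=200):
    """Conditional least squares estimate of the threshold in the restricted
    model with regime-specific intercepts and no slopes.

    For each candidate threshold the fitted values are the regime means,
    so the SSR is the sum of within-regime squared deviations.
    """
    grid = np.quantile(q, np.linspace(trim, 1 - trim, n_grid))
    best_ssr, best_g = np.inf, None
    for g in grid:
        low = q <= g
        ssr = ((y[low] - y[low].mean()) ** 2).sum() \
            + ((y[~low] - y[~low].mean()) ** 2).sum()
        if ssr < best_ssr:
            best_ssr, best_g = ssr, g
    return best_g
```

When the intercepts do not actually shift, the minimizer is not estimating any true threshold, which corresponds to the spurious case the robustness result is meant to cover.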

Other parametric alternatives to the threshold-based approach have also been considered in this literature. A popular setting involves allowing the parameters of the predictive regression to be subject to deterministic structural breaks, effectively replacing $I(q_t\leq\gamma)$ with $I(t\leq k)$ in (17). Due to the presence of the highly persistent predictor, standard results from the structural break literature no longer apply in this context. In particular, a SupWald-type statistic for testing the null hypothesis of linearity no longer follows the normalized Brownian bridge–type distribution tabulated in Andrews (1993). Unlike the simplifications that occur in the context of threshold effects and that lead to convenient outcomes as in (18), the main issue in this context continues to be the dependence of inferences on the unknown noncentrality parameter $c$, with processes such as $J_c(r)$ appearing in the asymptotics. The invalidity of traditional parameter constancy tests under persistent predictors was pointed out in Rapach and Wohar (2006), who were concerned with assessing the presence of breaks in return-based predictive regressions. In this early work they suggested using Hansen’s (2000) fixed regressor bootstrap as a way of controlling for the unknown degree of persistence in the predictors. This idea has also been taken up and expanded in the more recent work of Georgiev, Harvey, Leybourne, and Taylor (2018).
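A minimal sketch in the spirit of a fixed regressor bootstrap for a SupWald break test is given below. The realized predictor is held fixed across bootstrap replications, and the bootstrap dependent variable is drawn as i.i.d. standard normal, which corresponds to a homoskedastic variant; heteroskedasticity-robust variants would instead rescale the original residuals. The function names, the homoskedastic Wald form, the trimming, and the number of replications are assumptions of this example, not the exact algorithms of the cited papers.

```python
import numpy as np

def sup_wald_break(y, X, trim=0.15):
    """Sup-Wald statistic for a single break in all coefficients of y on X."""
    T = len(y)
    r0 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr0 = r0 @ r0
    best = -np.inf
    for k in range(int(trim * T), int((1 - trim) * T) + 1):
        pre = (np.arange(T) < k).astype(float)[:, None]   # I(t <= k)
        Z = np.hstack([X * pre, X * (1 - pre)])
        r1 = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        ssr1 = r1 @ r1
        best = max(best, T * (ssr0 - ssr1) / ssr1)
    return best

def fixed_regressor_bootstrap_pvalue(y, xlag, n_boot=199, trim=0.15, seed=0):
    """Bootstrap p-value with the realized regressors kept fixed."""
    rng = np.random.default_rng(seed)
    T = len(y)
    X = np.column_stack([np.ones(T), xlag])    # realized regressors, held fixed
    stat = sup_wald_break(y, X, trim)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        y_star = rng.standard_normal(T)        # homoskedastic variant: N(0,1) draws
        boot[b] = sup_wald_break(y_star, X, trim)
    return stat, float(np.mean(boot >= stat))
```

Because the same persistent regressor path enters every bootstrap statistic, the bootstrap distribution inherits its persistence without requiring knowledge of $c$.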

Pitarakis (2017) proposed to bypass some of these difficulties by developing a CUSUMSQ-type statistic based on the squared residuals from (1) and shown to have a limiting distribution that does not depend on $c$ as in

$Display mathematics$
(19)

with $\hat{s}_T^2$ denoting a consistent estimator of the long-run variance of $(u_t^2-\sigma_u^2)$. Here the $\hat{u}_t$'s refer to the standard least squares–based residuals obtained from (1). The results obtained in Pitarakis (2017) naturally extend to multiple predictor settings (e.g., with $\hat{u}_t^2$ obtained from (5)), can accommodate conditional heteroskedasticity, and have been shown to have excellent power properties with good size control. In related recent work Georgiev et al. (2018) also developed new inference methods within predictive regressions as in (5)–(6) with either stochastically (e.g., $\beta_t$ evolving as a random walk) or deterministically varying (e.g., structural breaks) parameters using LM- and SupWald-type test statistics, respectively. Their approach to neutralizing the dependence of their asymptotics on the $c_i$'s relied on a fixed regressor bootstrapping algorithm that uses the realized $x_{t-1}$'s as a fixed regressor in the bootstrap.
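A generic CUSUM of squares statistic built from least squares residuals can be sketched as follows. Since the exact expression in (19) is not reproduced here, the normalization below, the use of the sample standard deviation of the squared residuals in place of a long-run variance estimator, and the function name are assumptions of this example; a HAC-type estimator would be needed if the squared errors are serially correlated.

```python
import numpy as np

def cusumsq_statistic(y, xlag):
    """CUSUM of squares statistic based on OLS residuals of y on (1, xlag).

    Computes max over k of |sum_{t<=k} u_t^2 - (k/T) sum_{t<=T} u_t^2|,
    scaled by s_hat * sqrt(T).
    """
    T = len(y)
    X = np.column_stack([np.ones(T), xlag])
    u_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    u2 = u_hat ** 2
    partial = np.cumsum(u2)
    frac = np.arange(1, T + 1) / T
    # Simple variance estimate; assumes squared errors are serially uncorrelated
    s_hat = np.std(u2)
    return np.max(np.abs(partial - frac * partial[-1])) / (s_hat * np.sqrt(T))
```

If the limit is the supremum of the absolute value of a Brownian bridge, as is typical for statistics of this type (again an assumption here), the 5% critical value would be approximately 1.358.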

The parametric approaches for capturing nonlinearities have led to various novel stylized facts on the predictability of stock returns. Within the threshold setting of Gonzalo and Pitarakis (2012), the authors documented strong countercyclicality in the predictability of U.S. returns with dividend yields, with the latter entering the predictive regression significantly solely during recessions. This phenomenon has generated considerable recent interest with numerous novel contributions aiming to explain it and document it more comprehensively. A particularly interesting novel approach has been introduced in Farmer et al. (2018), where the authors establish that pockets of predictability are a much broader phenomenon that is not solely confined to recessionary periods.

The concern for functional form misspecification that may affect the parametric nonlinear settings has also motivated fully nonparametric approaches to assessing predictability by letting $x_{t-1}$ enter (1) via an unknown functional form as in $y_t=f(x_{t-1})+u_t$. A particularly useful and simple-to-implement approach has been developed in Kasparis et al. (2015), where the authors focused on designing tests of $H_0:f(x)=\mu$ based on the Nadaraya-Watson kernel regression estimator of $f(\cdot)$ and whose distributions have been shown to be robust to the persistence properties of $x_t$, including local to unit root parameterizations. One shortcoming of these nonparametric techniques is the weakness of their power properties against linear alternatives when compared with parametric approaches.
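For reference, the Nadaraya-Watson estimator underlying such tests is a kernel-weighted local average of the predictand. The sketch below uses a Gaussian kernel and a user-supplied bandwidth; the kernel, the bandwidth choice, and the function name are assumptions of this example, and the test statistics of Kasparis et al. (2015) themselves are not reproduced.

```python
import numpy as np

def nadaraya_watson(x_eval, x, y, h):
    """Nadaraya-Watson estimate of f at the points x_eval.

    x plays the role of the lagged predictor and y the predictand;
    h is the bandwidth. Evaluation points far from all observations can
    produce zero total weight, so x_eval should stay within the data range.
    """
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    z = (x_eval[:, None] - np.asarray(x)[None, :]) / h
    w = np.exp(-0.5 * z ** 2)              # Gaussian kernel weights
    return (w @ np.asarray(y)) / w.sum(axis=1)
```

Under the null of no predictability the fitted function should be approximately flat at $\mu$ across evaluation points, which is the feature the tests exploit.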

The vast body of research broadly labeled as the predictive regression literature has been driven by concerns that arose in empirical applications across a variety of fields, and the asset pricing literature in particular. Numerous new avenues of research that may help address novel questions or revisit older ones through new methodological developments are expected to continue to grow and enrich this research area.

Alternative approaches for handling the joint presence of persistence and endogeneity in predictive regressions formulated as in (1)–(2) include Cai and Wang (2014), where the authors developed a projection-based method for estimating and testing the coefficients of interest, and Camponovo (2015), who introduced a novel differencing-based approach leading to Gaussian asymptotics in the same context. Although our review has focused on the most commonly encountered parameterizations of predictive regressions with predictors explicitly modeled as local to unit root processes, alternative approaches designed to remain agnostic about the process driving the predictors have also been recently explored. In Gungor and Luger (2018), for instance, the authors developed a novel approach that relies on signed rank-based tests of the null of no predictability, following the early work in Campbell and Dufour (1995) and leading to valid finite sample inferences that are invariant to the various econometric complications we discussed (see also Taamouti, 2015, for a comprehensive review of this sign-based inference literature in the context of both linear and nonlinear regression models). In parallel to this literature, further progress is also expected when it comes to capturing time variation in the parameters driving these predictive regressions. This is an area of particular relevance to economic and financial applications (see Cai, Wang, & Wang, 2015; Demetrescu, Georgiev, Rodrigues, & Taylor, 2019).

Most of the existing predictability literature has also been confined to the conditional mean of the predictands of interest, whereas predictability may be a much broader phenomenon potentially also (or solely) affecting the quantiles of the series of interest. In numerous risk-related applications one may be interested in uncovering factors influencing the extreme tails of a series. Generalizing the existing literature to accommodate time variation in such settings will almost certainly raise many novel challenges. Recent developments in this area include Fan and Lee (2019) and Lee (2013).

Given the increased availability of big data sets, the issue of handling multiple predictors having different stochastic properties in either linear or nonlinear contexts will also continue to create many technical challenges if one wishes to take advantage of the growing literature on high dimensional estimation, model selection, and prediction via shrinkage-based techniques. In Lee, Shi, and Gao (2018) and Koo, Anderson, Seo, and Yao (2016), the authors consider a predictive regression framework with a multitude of predictors having varying degrees of persistence and evaluate the properties of LASSO-based techniques for estimation and model selection.

## References

Andrews, D. W. K. (1993). Tests for parameter instability and structural change with unknown changepoint. Econometrica, 61, 821–856.

Breitung, J., & Demetrescu, M. (2015). Instrumental variable and variable addition based inference in predictive regressions. Journal of Econometrics, 187, 358–375.

Cai, Z., & Wang, Y. (2014). Testing predictive regression models with nonstationary regressors. Journal of Econometrics, 178, 4–14.

Cai, Z., Wang, Y., & Wang, Y. (2015). Testing instability in a predictive regression model with nonstationary regressors. Econometric Theory, 31, 953–980.

Campbell, B., & Dufour, J. M. (1995). Exact nonparametric orthogonality and random walk tests. Review of Economics and Statistics, 77, 1–16.

Campbell, J. Y., & Yogo, M. (2006). Efficient tests of stock return predictability. Journal of Financial Economics, 81, 27–60.

Camponovo, L. (2015). Differencing transformations and inference in predictive regression models. Econometric Theory, 31, 1331–1358.

Cavanagh, C. L., Elliott, G., & Stock, J. H. (1995). Inference in models with nearly integrated regressors. Econometric Theory, 11, 1131–1147.

Demetrescu, M., Georgiev, I., Rodrigues, P. M. M., & Taylor, A. M. R. (2019). Testing for episodic predictability in stock returns. Technical Report 2:2019, Essex Finance Centre.

Elliott, G., Rothenberg, T. J., & Stock, J. H. (1996). Efficient tests for an autoregressive unit root. Econometrica, 64, 813–836.

Elliott, G., & Stock, J. H. (2001). Confidence intervals for autoregressive coefficients near one. Journal of Econometrics, 103, 155–181.

Fan, R., & Lee, J. H. (2019). Predictive quantile regressions under persistence and conditional heteroskedasticity. Journal of Econometrics.

Farmer, L., Schmidt, L., & Timmermann, A. (2018). Pockets of predictability. Discussion Paper 12885, CEPR.

Georgiev, I., Harvey, D. I., Leybourne, S. J., & Taylor, A. M. R. (2018). Testing for parameter instability in predictive regression models. Journal of Econometrics, 204, 101–118.

Golez, B., & Koudijs, P. (2018). Four centuries of return predictability. Journal of Financial Economics, 127, 248–263.

Gonzalo, J., & Pitarakis, J. (2012). Regime specific predictability in predictive regressions. Journal of Business and Economic Statistics, 30, 229–241.

Gonzalo, J., & Pitarakis, J. (2017). Inferring the predictability induced by a persistent regressor in a predictive threshold model. Journal of Business and Economic Statistics, 35, 202–217.

Goyal, A., & Welch, I. (2008). A comprehensive look at the empirical performance of equity premium prediction. Review of Financial Studies, 21, 1455–1508.

Gungor, S., & Luger, R. (2018). Small-sample tests for stock return predictability with possibly non-stationary regressors and GARCH-type effects. Journal of Econometrics.

Hansen, B. E. (2000). Testing for structural change in conditional models. Journal of Econometrics, 97, 93–115.

Jansson, M., & Moreira, M. J. (2006). Optimal inference in regression models with nearly integrated regressors. Econometrica, 74, 681–714.

Juhl, T. (2014). A nonparametric test of the predictive regression model. Journal of Business and Economic Statistics, 32, 387–394.

Kasparis, I., Andreou, E., & Phillips, P. C. B. (2015). Nonparametric predictive regression. Journal of Econometrics, 185, 468–494.

Kendall, M. G. (1954). Note on bias in the estimation of autocorrelation. Biometrika, 41, 403–404.

Koijen, R. J. S., & Van Nieuwerburgh, S. (2011). Predictability of returns and cash flows. Annual Review of Financial Economics, 3, 467–491.

Koo, B., Anderson, H. M., Seo, M. W., & Yao, W. (2016). High dimensional predictive regression in the presence of cointegration. Working Paper 2851677, SSRN.

Kostakis, A., Magdalinos, A., & Stamatogiannis, M. (2015). Robust econometric inference for stock return predictability. Review of Financial Studies, 28, 1506–1553.

Lee, J. H. (2013). Predictive quantile regression with persistent covariates: IVX-QR approach. Journal of Econometrics, 192, 105–118.

Lee, J. H., Shi, Z., & Gao, Z. (2018). On LASSO for predictive regression. ArXiv.

Lettau, M., & Van Nieuwerburgh, S. (2008). Reconciling the return predictability evidence. Review of Financial Studies, 21, 1607–1652.

Lewellen, J. (2004). Predicting returns with financial ratios. Journal of Financial Economics, 74, 209–235.

Mikusheva, A. (2007). Uniform inference in autoregressive models. Econometrica, 75, 1411–1452.

Petruccelli, J. (1992). On the approximation of time series by threshold autoregressive models. Sankhya, 54, 106–113.

Phillips, P. C. B. (1987). Time series regression with a unit root. Econometrica, 55, 227–301.

Phillips, P. C. B. (2015). Pitfalls and possibilities in predictive regression. Working Paper 2003, Cowles Foundation Discussion Paper.

Phillips, P. C. B., & Magdalinos, A. (2009). Econometric inference in the vicinity of unity. Working Paper 06–2009, Singapore Management University.

Pitarakis, J. (2017). A simple approach for diagnosing instabilities in predictive regressions. Oxford Bulletin of Economics and Statistics, 79, 851–874.

Rapach, D. E., & Wohar, M. E. (2006). Structural breaks and predictive regression models of aggregate US returns. Journal of Financial Econometrics, 4, 238–274.

Stambaugh, R. F. (1999). Predictive regressions. Journal of Financial Economics, 54, 375–421.

Stock, J. H. (1991). Confidence intervals for the largest autoregressive root in US economic time series. Journal of Monetary Economics, 28, 435–460.

Taamouti, A. (2015). Finite sample sign based inference in linear and nonlinear regression models with applications in finance. L’actualité économique: Revue d’analyse économique, 91, 89–113.