# Score-Driven Models: Methodology and Theory

# Score-Driven Models: Methodology and Theory

- Mariia Artemova, Mariia ArtemovaSchool of Buiness and Economics, Vrije Universiteit Amsterdam; Tinbergen Institute
- Francisco Blasques, Francisco BlasquesSchool of Buiness and Economics, Vrije Universiteit Amsterdam; Tinbergen Institute
- Janneke van BrummelenJanneke van BrummelenSchool of Buiness and Economics, Vrije Universiteit Amsterdam; Tinbergen Institute
- and Siem Jan KoopmanSiem Jan KoopmanSchool of Buiness and Economics, Vrije Universiteit Amsterdam; Tinbergen Institute; CREATES Aarhus University

### Summary

Score-driven models belong to a wider class of observation-driven time series models that are used intensively in empirical studies in economics and finance. A defining feature of the score-driven model is its mechanism of updating time-varying parameters by means of the score function of the predictive likelihood function. The class of score-driven models contains many other well-known observation-driven models as special cases, and many new models have been developed based on the score-driven principle. Score-driven models provide a general way of parameter updating, or filtering, in which all relevant features of the observation density function are considered. In models with fat-tailed observation densities, the score-driven updates are robust to large observations in time series. This kind of robustness is a convenient feature of score-driven models and makes them suitable for applications in finance and economics, where noisy data sets are regularly encountered. Parameter estimation for score-driven models is straightforward when the method of maximum likelihood is used. In many cases, theoretical results are available under rather general conditions.

### Subjects

- Economic Theory and Mathematical Models

### Background and Rationale for Score-Driven Models

In empirical research in economics and finance, it is widely acknowledged that parameters in reduced-form representations of structural models are subject to various instabilities due to model misspecifications and approximations. These instabilities are partly caused by the limitations of linear model specifications, which are typically used in applied econometrics. Furthermore, empirical data sets of interest in economics and finance are typically subject to changing behaviors of economic agents (possibly aimed at returning to some level of equilibrium), endogenous and exogenous shocks, systemic innovations, fiscal policy changes, and more. To account for such instabilities and distortions, linear model specifications are often extended by replacing fixed parameters with time-varying parameters. These generalizations can apply to parameters related to the mean or location (constant and regression coefficients) and the variance or scale (error variance) equations of the model.

In the context of the standard linear regression model, time-varying parameters can be empirically detected by means of the recursive least squares method. Although the method was already developed in the original work of Gauss, it is often credited to Plackett (1950), who provided an elegant derivation based on matrix algebra. The recursive least squares method is a set of equations that provides the least squares estimates based on a set of observations that is growing with individual observations, sequentially over time. The standard errors of the recursive estimates tend to become smaller after each update because the sample size increases. The usual assumptions of the regression model are also applicable in the context of recursive estimation, including the assumption that the regression coefficients are constant over time. The relaxation of this assumption has been considered in the work of Rudolf E. Kalman, where effectively each regression parameter can potentially follow a linear dynamic process (see Kalman, 1960). The celebrated Kalman filter can be regarded as the corresponding recursive least squares method that allows for time-varying regression parameters. It should be emphasized that Kalman’s work was developed in the context of control and system theory, relevant in engineering and mathematics. Due to the work of Andrew C. Harvey in the 1970s, the statistical impact of Kalman’s work for linear regression, autoregressive moving average (ARMA), and other linear dynamic models has been revealed and acknowledged (see Harvey, 1981, Chapter 4, for a textbook treatment). Harvey’s work has been instrumental in the recognition of the Kalman filter and its relevance in econometrics. In particular, a strong case is provided by the notion that almost all linear dynamic models can be represented as a state space model that consists of an observation equation and an updating equation for the state vector with the time-varying parameters. Once the model of interest is represented in state space form, the Kalman filter can be applied straightforwardly. The output of the Kalman filter enables the estimation of the time-varying parameters as well as the evaluation of the likelihood function using the prediction error decomposition.

In the same period, in the seminal article Cox (1981), Sir David Cox recognized two different classes of statistical time-series models with time-varying parameters: observation-driven models and parameter-driven models. In the latter class of models, the parameters are treated as dynamic processes with their own source of errors. Due to this additional source of errors, the time-varying parameters are not perfectly predictable, even if the time-varying parameters are analyzed conditional on past and concurrent observations. The state space model clearly belongs to the class of parameter-driven models. For the class of observation-driven models, time-varying parameters are treated as functions of lagged dependent variables as well as exogenous variables. In such model specifications, when conditioning on past and concurrent observations, the time-varying parameters are perfectly predictable. As a result, the likelihood evaluation is relatively straightforward. Given these features and characteristics, observation-driven models have become popular in econometrics. Typical examples of observation-driven models are the generalized autoregressive conditional heteroskedasticity (GARCH) models of Engle (1982) and Bollerslev (1986), the exponential GARCH (EGARCH) model of Nelson (1991), and the autoregressive conditional duration (ACD) and intensity (ACI) models of Engle and Russell (1998).

A subclass of observation-driven models is the class of score-driven models proposed and developed by Creal et al. (2011, 2013) and Harvey (2013). For this class of models, the score of the conditional observational density is used to update the time-varying parameters in the model. Here, the score refers to the first derivate of the log likelihood function with respect to the parameter. A more precise definition of the derivative in the context of score-driven models is provided in the section “Model Specification.” The score-driven models are also known as generalized autoregressive score (GAS) models and as dynamic conditional score (DCS) models. Various factors have given rise to the use of score-driven models in empirical studies in economics and finance. For example, in many cases the score function provides an intuitive driving mechanism for the time-varying parameter. Generally, the score indicates in which “direction” the time-varying parameter must “move” in order to improve the fit in terms of a local (predictive) density. It can therefore be used “naturally” for designing new updating equations for time-varying parameters, especially in settings where it is not immediately obvious what is a good choice for a parameter-driving mechanism. The score-driven updating mechanism is of a particular convenience because it has the important advantage that the entire predictive density structure is exploited. In other words, all features of the available information are used in updating the time-varying parameter and the information is not limited to means and/or higher-order moments of the predictive density. A more formal rationale for using the score as the parameter-driving mechanism is provided by Blasques et al. (2015), who argued that score-driven models are optimal in approximating the conditional observation density, even when the model is not correctly specified.

In the initial work of both Creal et al. (2013) and Harvey (2013), it was argued that score-driven models also lead naturally to robust updating equations for parameters in models where fat-tailed and asymmetric densities are present. This feature of score-driven models makes them appealing for modeling noisy economic and financial data. In particular, high-frequency data, such as weekly, daily, or intradaily time series, are often contaminated with noise and outliers. Another key feature of score-driven models is that they are easy to implement and fast in computations. Score-driven models do not rely on complex algorithms or simulation-based methods for the estimation of fixed parameters and the filtering of time-varying parameters.

More specifically, a convenient feature of observation-driven models is that the likelihood has a closed-form expression that can be constructed from the prediction error decomposition. Hence, this class of models offers a convenient framework for use in empirical work. This is in contrast to parameter-driven models, where the unobserved time-varying parameter process has its own source of errors that need to be integrated out from the joint density function to obtain the likelihood. In many cases of empirical interest this integration problem cannot be solved analytically, and the solution is typically found in numerical simulation-based methods that are computationally intensive. Still, it was shown by Koopman et al. (2016) that a score-driven model has a forecasting performance similar to that of the corresponding parameter-driven model, even when the latter is the true data-generation process. Furthermore, it is relatively straightforward to extend observation-driven models to settings where, for example, nonlinear equations, asymmetric densities, and long-memory dynamic processes are present. It may be concluded that score-driven models can be applied straightforwardly to a wide range of different and elaborate models.

The score-driven class of models encompasses many of the well-known observation-driven models, such as GARCH, ACD, and ACI models. It has also given rise to a wide range of newly developed models that have proven to be useful for economic and financial applications. An overview of the literature on the more advanced score-driven models is provided in Artemova et al. (2022). In this article, a thorough but still practical introduction of the methodology for score-driven models is presented. The remainder of the article consists of three parts. First, the general model specification and corresponding methodology are introduced. Second, the various aspects of score-driven models are examined for a basic location model. Third, score-driven scale models are thoroughly discussed, with an emphasis on the Student’s *t* conditional volatility model.

### Score-Driven Model Specification

This section provides the statistical specification of the score-driven model and discusses its most important features for a univariate time series of *T* observations, which is denoted by ${y}_{1},{y}_{2},{y}_{3},\dots ,{y}_{T}$. In addition, the estimation of the fixed parameters in the model, the filtering of the time-varying parameters, and the predictions and forecasting of the observations, in-sample and out-of-sample, are discussed.

#### Model Specification

The score-driven model is specified as in Creal et al. (2013) and is given by

where *y*_{t} is the time series observation at time *t*, for $t=1,\dots ,T$, ${p}_{y}\left({y}_{t}|{f}_{t},{\mathrm{\mathcal{F}}}_{t-1};\theta \right)$ denotes the predictive conditional density for observation *y*_{t}, *f*_{t} is the time-varying parameter for the conditional density, with${\mathrm{\mathcal{F}}}_{t-1}$denoting the information set based on the observations${y}_{1},\dots ,{y}_{t-1}$, and *θ* is the parameter vector that contains fixed and unknown coefficients, including *ω*, *α*, and *β*, but also parameters that index the predictive density${p}_{y}$. It is assumed that the observations become available sequentially over time. When the observation *y*_{t} arrives, the time-varying parameter *f*_{t} is updated using the updating equation ${f}_{t+1}=\omega +\alpha {s}_{t}+\beta {f}_{t}$. The innovation term for the updating equation is the scaled score ${s}_{t}={S}_{t}\cdot {\nabla}_{t}$, where ∇_{t} denotes the score of the predictive conditional density ${p}_{y}\left({y}_{t}|{f}_{t},{\mathrm{\mathcal{F}}}_{t-1};\theta \right)$ with respect to *f*_{t} and ${S}_{t}\equiv S\left({f}_{t},{\mathrm{\mathcal{F}}}_{t-1};\theta \right)$ is a scaling function (discussed in the section “Possible Scaling Measures for the Score”). A similar formulation of the score-driven model was given by Harvey (2013).

The updating of *f*_{t} using the score of the conditional predictive density is intuitive. Namely, if a local ascent algorithm was used to find the local maximum of the predictive density at time *t* over *f*_{t} for a given parameter vector *θ*, then the score ∇_{t} indicates the direction in which the time-varying parameter must move, given the current position of *f*_{t}, according to the algorithm. For this reason, it is natural to use the score to determine the optimal updating step of *f*_{t}. This was underlined by Blasques et al. (2015), who proved the information-theoretic optimality of a simple class of score-driven models under certain regularity conditions. In particular, they showed that time-varying parameter updates based on the score always reduce the local Kullback-Leibler divergence in expectation and in every step. These results hold for misspecified models as well. Also, they argued that any parameter update that is not based on the score does not have this property. These are all limit results. However, Blasques et al. (2020) showed in a Monte Carlo study that the optimality also holds in finite samples for score-driven conditional volatility models.

Creal et al. (2008, 2013) discussed more options for generalizing the specification in Equation 1. For example, the model can be specified with a more extensive lag structure and with exogenous covariates ${x}_{t}$ in the updating equation for *f*_{t}. Also, the model can include other forms of nonlinearity in the updating equation, such as a regime-switching process and a long-memory process. Furthermore, a setting where the predictive density depends on past values of *y*_{t} and on some exogenous covariates${x}_{t}$can be considered. Finally, *y*_{t} and *f*_{t} can be vectors, or even matrices, rather than scalars. If *f*_{t} is a time-varying parameter vector, the score${s}_{t}$is a vector and the scaling factor${S}_{t}$is a matrix. Having an observation vector *y*_{t} results in having a multivariate score-driven model. A range of extensions, including multivariate score-driven models, is discussed in Artemova et al. (2022).

#### Possible Scaling Measures for the Score

The scaling function${S}_{t}$allows us to determine how the score ∇_{t} impacts the updating at time *t*, to obtain the next ${f}_{t+1}$. It must be emphasized that a different choice for${S}_{t}$results in an inherently different model, with a different set of statistical properties. Hence, the most suitable scaling function for a specific setting must be determined with some care.

Typically, a natural choice for the scaling factor is a function of the variance of the score ∇_{t}. Then the scaling is based on the curvature of the conditional log density of the *t*th observation. More specifically, it is common to use the inverse asymptotic variance of the score, which is equal to the conditional information matrix under standard regularity conditions; so ${S}_{t}={\mathrm{\mathcal{I}}}_{t\mid t-1}^{-1}$ where ${\mathrm{\mathcal{I}}}_{t\mid t-1}={\mathbb{E}}_{t-1}\left[{\nabla}_{t}{\nabla}_{t}^{\prime}\right]$. This leads to the scaled score${s}_{t}$having a variance of ${\mathrm{\mathcal{I}}}_{t\mid t-1}^{-1}$. It is also an intuitive scaling choice, because in this case the updating equation of *f*_{t} in Equation 1 can be interpreted as a Gauss-Newton algorithm for estimating *f*_{t} over time, as was pointed out by Creal et al. (2011). For this choice of ${S}_{t}$, well-known models become special cases of the score-driven model. For example, a score-driven volatility model ${y}_{t}={f}_{t}{\epsilon}_{t}$, with error${\epsilon}_{t}$being standard normally distributed, reduces to the GARCH model of Bollerslev (1986) for this choice of scaling. Another example of a model that is encompassed by score-driven models for this choice of scaling is the multiplicative error (MEM) model of Engle and Gallo (2006), which in turn encompasses the autoregressive conditional duration and intensity (ACD and ACI) models of Engle and Russell (1998) and Russell (2001), respectively (see Creal et al., 2013, for a more extensive discussion).

Another possible scaling matrix is${S}_{t}={\mathcal{J}}_{t\mid t-1}$, where ${\mathcal{J}}_{t\mid t-1}$is the square root of the

(pseudo-)inverse information matrix such that ${\mathcal{J}}_{t\mid t-1}^{\prime}\cdot {\mathcal{J}}_{t\mid t-1}={\mathrm{\mathcal{I}}}_{t\mid t-1}^{-1}$. This choice of ${S}_{t}$is particularly convenient, because it standardizes the score ∇_{t} such that${s}_{t}$itself has a unit variance, which improves the tractability of the statistical properties of the model.

Finally, a typical last-resort option is to set ${S}_{t}=I\phantom{\rule{0.1em}{0ex}}$. For this choice of scaling, or essentially the lack of scaling, the statistical properties of the model will typically become more complicated. This is a clear disadvantage. However, when the information matrix ${\mathrm{\mathcal{I}}}_{t\mid t-1}$ is a constant (scalar), or it does not depend on the time-varying parameter, the scaling is less relevant, because the scaled score is scaled again through its multiplication by *α* in the updating equation. Such cases include log-scale models of the form ${y}_{t}=exp\left(0.5{f}_{t}\right){\epsilon}_{t}$ and location models of the form ${y}_{t}={f}_{t}+{\epsilon}_{t}$, where ${S}_{t}={\mathrm{\mathcal{I}}}_{t\mid t-1}^{-1}$ is proportional to ${S}_{t}=1$.

#### Estimation, Filter Invertibility, and Asymptotic Properties

As highlighted already, a convenient property of observation-driven models, and hence of score-driven models, is that an explicit expression of the log likelihood is available. Hence the estimation of the static parameter vector *θ* via the method of maximum likelihood (ML) is straightforward. Consider a sample of *T* observations ${\left\{{y}_{t}\right\}}_{t=1}^{T}$ generated as described in Equation 1 under some true parameter${\theta}_{0}$. The true values of ${\left\{{f}_{t}\right\}}_{t=1}^{T}$ that are used to generate the observations are not observed. Therefore, a filtered sequence ${\left\{{\widehat{f}}_{t}\left(\theta \right)\right\}}_{t=1}^{T}$ is constructed recursively using the updating equation of *f*_{t} for some value of *θ* and some starting value ${\widehat{f}}_{1}$. The application of the prediction error decomposition and the use of the filtered sequence ${\left\{{\widehat{f}}_{t}\left(\theta \right)\right\}}_{t=1}^{T}$ provide the maximization problem:

It is straightforward to evaluate the value of the log likelihood for some *θ* because it requires only two steps: calculating the filtered time-varying parameter ${\left\{{\widehat{f}}_{t}\left(\theta \right)\right\}}_{t=1}^{T}$ via the score-driven updating in Equation 1, and calculating the log likelihood contribution of every ${y}_{t}$ given ${\widehat{f}}_{t}\left(\theta \right)\phantom{\rule{0.1em}{0ex}}$, for $t=1,\dots ,T$.

Before turning to the asymptotic properties of this ML estimator, an important property that must be examined when using observation-driven filters is filter invertibility. Invertibility essentially means that the filtered sequence ${\left\{{\widehat{f}}_{t}\left(\theta \right)\right\}}_{t=1}^{T}$ will “forget” its starting value in the limit. The starting value of the time-varying parameter *f*_{t} is unobserved. Hence, the true starting point ${f}_{1}$ is unknown and is not available for initializing the sequence ${\left\{{\widehat{f}}_{t}\left(\theta \right)\right\}}_{t=1}^{T}$. It needs to be replaced by some arbitrary value ${\widehat{f}}_{1}$. Sometimes it is possible to estimate the starting value alongside the other parameters, but typically this is not preferred, especially when *f*_{t} represents a high-dimensional vector. It is important to notice that when filter invertibility fails, the true path ${f}_{t}\left({\theta}_{0}\right)$ will not be retrieved in the limit, even if the true static parameter ${\theta}_{0}$ is known. Clearly, this will be problematic if the filtered values ${\widehat{f}}_{t}\left(\theta \right)$ are used to construct the log likelihood that is used for ML. This crucial point is sometimes overlooked by those who base their theoretical results on the implicit assumption that the value of ${f}_{1}$ is known exactly. Also, this assumption does not appear to be realistic because the rest of the sequence ${f}_{2},{f}_{3},\dots $ is assumed to be unobserved.

The concept of filter invertibility was discussed by Straumann and Mikosch (2006), who stressed that it is a condition needed for applicability of the quasi ML estimation of a general class of GARCH models. The importance of filter invertibility for the EGARCH model was highlighted by Wintenberger (2013). Blasques et al. (2018) provided a way to determine invertibility regions when no feasible analytical invertibility conditions on the parameters are available. Blasques et al. (2022) gave sufficient conditions for filter invertibility for score-driven models. As is usual in the stationary observation-driven literature, invertibility is established by showing that the filtered process converges almost surely to some unique stationary and ergodic limit process. They followed the approach of Straumann and Mikosch (2006), which adapted Bougerol (1993, Theorem 3.1) to obtain low-level conditions for filter invertibility for score filters and their first- and second-derivative processes. The latter results are useful for establishing asymptotic normality of the ML estimator.

After establishing the invertibility results for the score-driven model, Blasques et al. (2022) further provided sufficient conditions for consistency and asymptotic normality of the ML estimator ${\widehat{\theta}}_{T}$. Two conditions are crucial in this development. First, the updating equation of the “true” *f*_{t} must be contracting on average to ensure that the true time-varying parameter is stationary and ergodic. Second, the filtering equation of ${\widehat{f}}_{t}$ must be uniformly contracting to ensure filter invertibility. The asymptotic properties derived by Blasques et al. (2022) are global and they rely on low-level conditions in terms of “building blocks” of score-driven models, which are represented by the equations provided in Equation 1. For example, the derivatives of the score with respect to the parameters must have bounded moments up to some specific order. These results are particularly helpful for researchers who want to establish theoretical properties of the ML estimator for a specific score-driven model. Given certain conditions for the observations, the asymptotic properties remain applicable under potential model misspecification. These theoretical results do not trivially generalize to settings where the observations *y*_{t} and/or the time-varying parameter *f*_{t} are no longer univariate. However, specific asymptotic results can be established for particular multivariate score-driven models (see the discussions in Blasques et al., 2016, and Bazzi et al., 2017).

### Score-Driven Location Models

This section illustrates the workings of score-driven models in a basic setting. The model for location is a natural starting point to obtain further insight into score-driven models. These models focus on the mean equation and are typically used for analyzing macroeconomic time series.

#### Location Model Specification

When the aim is to filter the conditional mean ${f}_{t}=\mathbb{E}\left({y}_{t}|{\mathrm{\mathcal{F}}}_{t-1}\right)$ of a (univariate) sample of observed data ${\left\{{y}_{t}\right\}}_{t=1}^{T}$, consider the score-driven filtering model as given by

where ${\left\{{\epsilon}_{t}\right\}}_{t\in \mathrm{\mathbb{Z}}}$ is assumed to be an independent and identically distributed (i.i.d.) random sequence with mean 0 and probability density function${p}_{\epsilon}$, and where *f*_{t} is updated according to Equation 1. Assume that ${\epsilon}_{t}$is Gaussian with variance ${\sigma}^{2}$. Then ${S}_{t}={\mathrm{\mathcal{I}}}_{t\mid t-1}^{-1}={\sigma}^{2}$, leads to a scaled score ${s}_{t}={y}_{t}-{f}_{t}$ which yields the linear updating equation for *f*_{t} as given by

The updating for *f*_{t} reduces to an exponentially weighted moving average (EWMA) recursion when coefficients have values *ω* = 0, *β* = 1, and 0 < *α* < 1. When *α* > 0, the updating of the conditional expectation *f*_{t} is intuitive, since *f*_{t} is effectively the one-step-ahead forecast of *y*_{t} and ${y}_{t}-{f}_{t}$ is the corresponding forecast error. Hence, the updating equation, Equation 3, takes into account the forecast error to construct ${f}_{t+1}$, which is defined as the one-step-ahead forecast of the next observation ${y}_{t+1}$. Specifically, for *α* > 0, if *f*_{t} has a lower (higher) value than *y*_{t}, then the scaled error $\alpha \left({y}_{t}-{f}_{t}\right)$ will ensure that ${f}_{t+1}$ increases (decreases) compared to *f*_{t}.

Substituting Equation 2 into this equation shows that ${\left\{{f}_{t}\right\}}_{t\in \mathrm{\mathbb{Z}}}$ follows an autoregressive process of order 1, an AR(1) process, with autoregressive coefficient *β*. Hence, a necessary and sufficient condition for stationarity of this sequence is that $\mid \beta \mid <1$. The updating equation, Equation 3, implies that *f*_{t} is a weighted average of all past observations, where observation ${y}_{t-j}$ has weight $\alpha {\left(\beta -\alpha \right)}^{j-1}$, for $j=1,\dots ,t-1$. It follows immediately that the filtered sequence ${\widehat{f}}_{t}$ is stationary in the limit, if and only if $\mid \beta -\alpha \mid <1$. This is also the condition for filter invertibility.

Finally, by substituting Equation 3 into Equation 2, it follows that *y*_{t} is implicitly generated by an ARMA(1,1) process as given by

with autoregressive coefficient *β* and moving average coefficient *α* – *β*. This is an interesting result and it applies to any distribution for ${\epsilon}_{t}$. Notice that the process reduces to an AR(1) process when *α* = *β* , and to an MA(1) process when *β* = 0. Higher-order ARMA processes are obtained when higher-order lags for *f*_{t} and ${s}_{t}={y}_{t}-{f}_{t}$ are considered for the updating equation ${f}_{t+1}$ in Equation 1 or, more specifically, in Equation 3.

#### Robust Filtering

The properties of the score-driven location model are almost identical to those of a parameter-driven Gaussian (stationary) signal plus noise model, which is obtained by replacing ${s}_{t}$with a Gaussian random error sequence in the updating Equation 3. As shown by Harvey and Luati (2014), this parallel vanishes when considering a non-Gaussian predictive conditional density ${p}_{y}\left({y}_{t}|{f}_{t},{\mathrm{\mathcal{F}}}_{t-1};\theta \right)$in Equation 1. For instance, consider a fat-tailed conditional density such as the Student’s *t* density. The data-generation process for this model will lead to many observations that would be referred to as “outliers” in a Gaussian context. The updating of ${f}_{t+1}=\omega +\alpha \left({y}_{t}-{f}_{t}\right)+\beta {f}_{t}$ will not work well, because large observations from previous times will be incorporated in the filtered level *f*_{t}: it will take time for the outlying observations to “work through the system.” Hence, this dynamic model is not robust to large observations that are induced by a heavy-tailed conditional distribution. Therefore, it is preferred to adopt a model that accounts for these fat tails. Score-driven models are designed to do this by considering fat-tailed densities for ${p}_{y}\left({y}_{t}|{f}_{t},{\mathrm{\mathcal{F}}}_{t-1};\theta \right)$in Equation 1. A more elaborate discussion was provided by Caivano et al. (2016, Section 2.3 and 2.4), who demonstrated that the conditional score reflects the tail shape of distributions and how this connects to robustness.

For instance, let ${\epsilon}_{t}$in Equation 2 be Student’s *t* distributed with *v* > 0 degrees of freedom and scale parameter $\sigma >0$. Then the conditional observation distribution becomes

implying that the score-driven updating equation will become

according to Equation 1, using ${S}_{t}={\left(1+{\nu}^{-1}\right)}^{-1}{\sigma}^{2}$, which is proportional to the inverse of the conditional information matrix ${\mathrm{\mathcal{I}}}_{t\mid t-1}=\left(\nu +1\right){\sigma}^{-2}/\left(\nu +3\right)$. Equations 2 and 4 together form the score-driven Student’s *t* model proposed by Harvey (2013, Section 3.1) and Harvey and Luati (2014). Notice that the degrees of freedom parameter *v* is only required to be positive, which is why the model is referred to as a location model and not as a mean model. If the degrees of freedom $\nu \to \infty $, the update becomes identical to the Gaussian update of Equation 3, which is not surprising because in that case the Student’s *t* distribution approaches a normal distribution. The score is plotted for various different choices of *v* in Figure 1. For finite values of *v*, it is clear that the update in Equation 4 downweights large values of ${y}_{t}-{f}_{t}$, and this downweighting is more severe for smaller values of *v*. In that way, the updating takes into account the heavy tails of the disturbances ${\epsilon}_{t}$, because it implies that large prediction errors might be due to the nature of the innovations. As $\mid {y}_{t}-{f}_{t}\mid $ goes to infinity, the score ${s}_{t}$ goes to 0. Because of the redescending nature of the score, Caivano and Harvey (2014) referred to the weighting induced by this score-driven filter as a parametric form of trimming. In the robustness literature, trimming is a common technique in which observations above some threshold receive a weight of 0. Hence, the Student’s *t* location filter essentially induces a soft form of trimming.

Harvey and Luati (2014) provided asymptotic results for the ML estimator of this model, including an explicit asymptotic variance matrix in terms of the model’s parameters, but invertibility of the filter was not explicitly discussed. Blasques et al. (2022) have formulated conditions for consistency and asymptotic normality of the ML estimator of this model while taking account of the invertibility conditions. Blasques et al. (2018) developed a weaker version of the invertibility condition for this model. It is also possible to adopt a nonstationary version of the score-driven local level model as given by

where ${s}_{t}$is defined as in Equation 4; this model was suggested by Harvey and Luati (2014). It is a special case of the model in Equations 2 and 4 with *ω* = 0 and *β* = 1. This updating equation for *f*_{t} also implies an EWMA scheme but now with time-varying weights that account for the shape of the distribution (see Caivano et al., 2016). Furthermore, this model can be extended to a more general unobserved components model with score-driven time-varying trend and seasonal components. Such specifications can also be extended with explanatory variables and fixed effects (see Harvey & Luati, 2014).

#### Leptokurtosis and Asymmetry

Whereas the score-driven location filter based on Student’s *t* innovations induces a form of trimming, the filter of a score-driven location model based on the exponential generalized beta distribution of the second kind (EGB2), as proposed by Caivano and Harvey (2014) and Caivano et al. (2016), has a Winsorizing property. Winsorizing is a well-known approach for the treatment of large observations in the robustness literature. It entails an updating equation for ${f}_{t+1}$ where ${s}_{t}$is set to a constant value when the observation passes some given threshold (see Maronna et al., 2006, Chapter 2).

The EGB2 distribution allows for asymmetry and leptokurtosis, but unlike the Student’s *t* distribution, it has exponential tails instead of heavy tails. If ${\epsilon}_{t}$in Equation 2 is EGB2 distributed with mean 0, variance ${\sigma}^{2}>0$ and nonnegative shape parameters $\xi $ and $\varsigma $, then

where $\mathrm{\Delta}=\psi \left(\xi \right)-\psi \left(\varsigma \right)$, $h=\sqrt{{\psi}^{\prime}\left(\xi \right)+{\psi}^{\prime}\left(\varsigma \right)}$, $\psi \left(\cdot \right)$ is the digamma function, ${\psi}^{\prime}\left(\cdot \right)$ is the trigamma function, and$B\left(\cdot ,\cdot \right)$is the beta function (see, for instance, Wang et al., 2001, for further details). The distribution is symmetric if $\xi =\varsigma $, positively skewed if $\xi >\varsigma $, and negatively skewed if $\xi <\varsigma $. The kurtosis decreases as $\xi $ and $\varsigma $ increase and the distribution encompasses the normal distribution (if $\xi =\varsigma \to \infty $) and the Laplace distribution (if $\xi =\varsigma =0$). The scaled score ${s}_{t}$in Equation 1, corresponding to the predictive density in Equation 5, is given by

where the scale is set to ${S}_{t}={\sigma}^{2}$, which is proportional to the inverse of the conditional information matrix. It is clear that the fraction in the scaled score ${s}_{t}$is uniformly bounded between $-\mathit{h\sigma \xi}$ and $\mathit{h\sigma \varsigma}\phantom{\rule{0.1em}{0ex}}$, and ${s}_{t}$converges to these values as ${y}_{t}-{f}_{t}\to -\infty $ and ${y}_{t}-{f}_{t}\to \infty $, respectively. Hence, this updating mechanism is effectively subject to a gentle or soft form of Winsorizing. The contribution of large observations is bounded, but it is not redescending, as for Student’s *t* distributed errors; see Figure 2 for a visual representation of this.

When $\xi \ne \varsigma $, the distribution is skewed and the score update becomes asymmetric. For example, if the distribution is negatively skewed, so if $\xi <\varsigma $, then large positive values get a larger weight than large negative values of the same magnitude. This is intuitive because negative skewness implies that negative spikes are more likely to occur than positive spikes, and therefore the robust filter should be less sensitive to negative spikes.

Naturally, score-driven location models based on many other distributions can also be considered. This will lead to location models with different properties, because the score will take into account the shape of the distribution. Take, for example, a general error distribution (GED), also known as the exponential power distribution, with parameter *v* (see Harvey, 2013, Section 3.10). The GED encompasses the normal distribution (for *v* = 2) and the Laplace distribution (for *v* = 1). This GED is more peaked in comparison to the EGB2 distribution and has super-exponential tails for *v* > 1. This leads to the score being unbounded, unlike the score of the EGB2 distribution, but it will diverge at a lower rate than the score of the normal distribution if *v* < 2. Other distributions to consider for robust filtering are, for example, the generalized *t* distribution of McDonald and Newey (1988) and skewed versions of all aforementioned symmetric distributions, which can for instance be obtained by using the approach of Fernández and Steel (1998). The section “Score-Driven Scale Models” considers these distributions in the context of score-driven scale models.

#### Illustration: Treasury Bill Rate Spreads

An empirical illustration is presented to show how the score-driven location models can perform in practice.^{1} The data set under consideration consists of quarterly observations of the difference between the 3-month and the 6-month treasury bill (T-bill) rates. The sample ranges from 1959:Q3 until 2021:Q1.^{2} The T-bill rate for a certain maturity is the yield received by investors for T-bills of that particular maturity in the secondary market. The difference between rates of different maturities is referred to as the spread. The left panel of Figure 3 shows a plot of the resulting T-bill rate spreads. This time series is rather noisy, because it has some sudden and temporary spikes and drops, especially in the early 1980s. This seems to indicate the need for a robust filter if the goal is to filter some underlying location parameter.

To accommodate these concerns, score-driven models with Student’s *t* and EGB2 innovations are considered, together with a score-driven model with Gaussian innovations for comparison. The corresponding updating functions are provided in Equations 4, 6, and 3, respectively. The static parameters are estimated by the method of ML. It is anticipated that the robust Student’s *t* and EGB2 models will be the most suitable here, because of the occasional erratic behavior of the data. The importance of robustness is confirmed by the estimation results reported in Table 1. Both the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are the lowest for the model with Student’s *t* innovations and highest for the model with the Gaussian innovations. The EGB2 model has worse AIC and BIC values than the Student’s *t* model, but they are both better than those of the Gaussian model. These results appear to suggest that the Student’s *t* distribution, which has fat tails, fits the data better than the EGB2 distribution, which has exponential tails.

#### Table 1. Maximum Likelihood Estimation Results*

$\omega $ | $\alpha $ | $\beta $ | ${\sigma}^{2}$ | $\nu ,\xi $ | $\varsigma $ | LL | AIC | BIC | |
---|---|---|---|---|---|---|---|---|---|

Normal | 0.609 | 0.679 | 0.553 | 1.763 | ˗422.2 | 3.437 | 3.494 | ||

(0.154) | (0.065) | (0.087) | (0.158) | ||||||

Student’s | 0.353 | 1.567 | 0.714 | 0.516 | 2.632 | ˗370.6 | 3.029 | 3.100 | |

(0.104) | (0.145) | (0.052) | (0.059) | (0.330) | |||||

EGB2*** | 0.421 | 0.432 | 0.710 | 1.394 | 0.181 | 0.154 | ˗376.8 | 3.087 | 3.172 |

EGB2 sym. | 0.319 | 0.441 | 0.714 | 1.400 | 0.172 | ˗378.8 | 3.095 | 3.167 | |

(0.054) | (0.029) | (0.031) | (0.177) | (0.019) |

* The maximum likelihood estimates for the score-driven normal, Student’s *t*, and EGB2 location models, using data of T-bill rate spreads between 3 and 6 months’ maturity (see left panel of Figure 1). The symmetric EGB2 (sym.) model is subject to restriction $\xi =\varsigma $. Asymptotic standard errors are provided in parentheses.

** LL denotes log likelihood value, AIC denotes Akaike’s Information Criterion and BIC denotes Bayesian Information Criterion.

*** Standard errors are not reported for EGB2 due to numerical issues.

Flexible distributions like the EGB2 distribution can lead to optimization problems during parameter estimation, especially for this illustration, where the sample size is small. For this illustration, it appears that likelihood optimization for the EGB2 model enters into a saddlepoint, even when the process is restarted with new starting values. Hence, no standard errors are reported in this case. It appears that this numerical issue is caused by the joint estimation of $\xi $ and$\varsigma $. Therefore, Table 1 also reports the estimation results for the symmetric EGB2 model, which is obtained through the restriction $\xi =\varsigma $. For this model, we do not encounter these numerical issues during the estimation process. This fitted model has a slightly worse AIC, but a slightly better BIC than its unrestricted version.

The estimated degrees of freedom parameter of the Student’s *t* distribution is somewhat small. This estimation result implies that the error distribution is heavy-tailed, which in turn causes the score *s*_{t} to become more robust against outliers. This is clearly visible in the plotted response curve of the models presented in the right panel of Figure 3. For small to moderate values, the Student’s *t* filter reacts stronger than the Gaussian filter, but for larger values this is no longer the case. Indeed, the response of the former filter redescends back to 0 in the limit, whereas that of the latter diverges to infinity. For the EGB2 models, the score follows a path similar to that of the Student’s *t*, but it is monotonically increasing instead of redescending. The unrestricted EGB2 model has a score function that is asymmetric, since $\xi >\varsigma $ implies a mildly positively skewed distribution (the skewness is$0.33$). The left panel of Figure 3 presents the filtered location corresponding to the fitted models, where the filtered path of the symmetric EGB2 model is omitted because of its similarity to that of the asymmetric EGB2 model. It is clear that the Gaussian filter reacts more to large observations than the Student’s *t* and the EGB2 filters. The filtered paths of the Student’s *t* and EGB2 models are very similar, but the latter tends to have a slightly stronger reaction to large shocks than the former.

### Score-Driven Scale Models

Score-driven scale models have been widely studied in the literature. The conditional volatility model with the Student’s *t* density, as discussed in both Harvey and Chakravarty (2008) and Creal et al. (2008), is one of the first compelling applications of score-driven models. Similarly to location models, score-driven scale models with fat-tailed innovations lead to filters that are robust against large observations. For financial data, robustness of the scale or volatility filter is especially relevant, because the data tend to contain many “outliers” that do not necessarily imply a fundamental change in the underlying conditional volatility. However, it is important to emphasize that the potential robustness property of the resulting volatility filters is by no means the only rationale for considering score-driven scale models.

This section reviews univariate score-driven scale models. Artemova et al. (2022) consider multivariate scale models in which multiple volatilities and correlations can be modeled jointly.

#### Univariate Scale Models

For a univariate (demeaned) observation *y*_{t}, the conditional univariate scale is denoted by *f*_{t} and the basic score-driven scale model is formulated as

where ${\left\{{\epsilon}_{t}\right\}}_{t\in \mathrm{\mathbb{Z}}}$ is an i.i.d. sequence with mean 0 such that $\mathbb{E}\left({y}_{t}|{\mathrm{\mathcal{F}}}_{t-1}\right)=0$. When the variance of ${\epsilon}_{t}$is equal to 1, the model (Equation 7) reduces to a volatility model where *f*_{t} is equal to the conditional variance, that is,$\mathrm{Var}\left({y}_{t}|{\mathrm{\mathcal{F}}}_{t-1}\right)={f}_{t}$. However, this restriction is not necessary; the variance of ${\epsilon}_{t}$ could even be infinite. Instead of using demeaned observations, it is straightforward to extend the model to allow for a non-zero mean$\mu $and for autoregressive or ARMA dynamics in the observation equation. To keep this treatment simple, assume that observations${y}_{t}$are generated by the model in Equation 7.

A basic illustration is obtained by taking ${\epsilon}_{t}\sim \mathcal{N}\left(0,1\right)$ for every *t*. If the conditional variance *f*_{t} is updated according to the score-driven framework in Equation 1, with scaling factor ${S}_{t}={\mathrm{\mathcal{I}}}_{t\mid t-1}^{-1}=2{f}_{t}^{2}$, the updating equation becomes

where *ω* > 0, *α* ≥ 0, and *β* ≥ *α* to ensure positivity of *f*_{t}. It is not hard to see that the resulting model is equivalent to a regular GARCH model (Bollerslev, 1986) with parameters *α* and *β* ˗ *α*. There are certain parameter restrictions that impose stationarity and filter invertibility (Straumann & Mikosch, 2006). If instead a Student’s *t* distribution with *v* > 0 degrees of freedom is considered, without changing the updating equation of *f*_{t}, the resulting model is the GARCH-$t$ model of Bollerslev (1987). However, if a score-driven updating equation is used, the resulting model departs from the regular GARCH framework, because then:

where *ω* > 0, *α* ≥ 0, and *β* ≥ *α*, to impose positivity, and where the used scaling factor ${S}_{t}=2{f}_{t}^{2}$ is proportional to the inverse of conditional information matrix (see Table 2). Notice that this model is not based on a standardized Student’s *t* distribution and that *v* > 2 is not required, which is why this model is referred to as a scale model rather than a volatility model. Using a standardized Student’s *t* distribution with unit variance is also possible (see Creal et al., 2013, for an example). Similar to the Student’s *t* location model in Equations 2 and 4, it is clear that for finite values of *v*, large values of ${y}_{t}^{2}$ are downweighted, unlike in the Gaussian scale model. For $\nu \to \infty $, the updating function becomes identical to Equation 8, which also follows from the fact that the Student’s *t* distribution reduces to a Gaussian distribution in that case. Blasques et al. (2022) gave (parameter) restrictions for filter invertibility, consistency, and asymptotic normality of the MLE for this model in their main example.

#### Table 2. Score-driven Updates for Scale Models ${y}_{t}={\sigma}_{t}{\epsilon}_{t}$ for a Selection of Distributions and Parametrizations

Parameter | Distribution | $p\left({y}_{t}|{f}_{t}\right)$ | ${\nabla}_{t}=\partial logp\left({y}_{t}|{f}_{t}\right)/\partial {f}_{t}$ | ${\mathrm{\mathcal{I}}}_{t\mid t-1}$ |
---|---|---|---|---|

${f}_{t}={\sigma}_{t}^{2}$ | ${\epsilon}_{t}\sim \mathcal{N}\left(0,1\right)$ | $\frac{1}{\sqrt{2\pi {f}_{t}}}exp\left(-\frac{{y}_{t}^{2}}{2{f}_{t}}\right)$ | $\frac{1}{2}\left(\frac{{y}_{t}^{2}}{{f}_{t}^{2}}-{f}_{t}^{-1}\right)$ | $\frac{1}{2{f}_{t}^{2}}$ |

${f}_{t}={\sigma}_{t}^{2}$ | ${\epsilon}_{t}\sim t\left(\nu \right)$ | $\frac{{f}_{t}^{-1/2}}{B\left(\frac{1}{2},\frac{\nu}{2}\right)\sqrt{\nu}}{\left(1+\frac{{y}_{t}^{2}}{\nu {f}_{t}}\right)}^{-\frac{\nu +1}{2}}$ | $\frac{1}{2}\left(\frac{\left(\nu +1\right){\nu}^{-1}{y}_{t}^{2}{f}_{t}^{-2}}{1+{\nu}^{-1}{y}_{t}^{2}{f}_{t}^{-1}}-{f}_{t}^{-1}\right)$ | $\frac{\nu}{2\left(3+\nu \right){f}_{t}^{2}}$ |

${f}_{t}=log{\sigma}_{t}^{2}$ | ${\epsilon}_{t}\sim \mathcal{N}\left(0,1\right)$ | $\frac{1}{\sqrt{2\pi}exp\left(\frac{1}{2}{f}_{t}\right)}exp\left(-\frac{{y}_{t}^{2}}{2exp\left({f}_{t}\right)}\right)$ | $\frac{1}{2}\left(\frac{{y}_{t}^{2}}{exp{f}_{t}}-1\right)$ | $\frac{1}{2}$ |

${f}_{t}=log{\sigma}_{t}^{2}$ | ${\epsilon}_{t}\sim t\left(\nu \right)$ | $\frac{exp\left(-\frac{1}{2}{f}_{t}\right)}{B\left(\frac{1}{2},\frac{\nu}{2}\right)\sqrt{\nu}}{\left(1+\frac{{y}_{t}^{2}}{\nu exp\left({f}_{t}\right)}\right)}^{-\frac{\nu +1}{2}}$ | $\frac{1}{2}\left(\frac{\left(\nu +1\right){\nu}^{-1}{y}_{t}^{2}exp\left(-{f}_{t}\right)}{1+{\nu}^{-1}{y}_{t}^{2}exp\left(-{f}_{t}\right)}-1\right)$ | $\frac{\nu}{2\left(3+\nu \right)}$ |

${f}_{t}=log{\sigma}_{t}^{2}$ | ${\epsilon}_{t}\sim \mathrm{GED}{\left(\nu \right)}^{\left(a\right)}$ | $\frac{{K}_{\mathrm{GE}}\left(\nu \right)}{exp\left(\frac{1}{2}{f}_{t}\right)}exp\left(-\frac{1}{\nu}\frac{{\left|{y}_{t}\right|}^{\nu}}{exp\left(\frac{1}{2}\nu {f}_{t}\right)}\right)$ | $\frac{1}{2}\left(\frac{{\left|{y}_{t}\right|}^{\nu}}{exp\left(\frac{1}{2}\nu {f}_{t}\right)}-1\right)$ | $\frac{\nu}{4}$ |

${f}_{t}=log{\sigma}_{t}^{2}$ | ${\epsilon}_{t}\sim \mathrm{Gen}-t{\left(\nu ,h\right)}^{\left(b\right)}$ | $\frac{K\left(\nu ,h\right)}{exp\left(\frac{1}{2}{f}_{t}\right)}{\left(1+\frac{{\left|{y}_{t}\right|}^{h}}{\nu exp\left(\frac{h}{2}{f}_{t}\right)}\right)}^{-\frac{\nu +1}{h}}$ | $\frac{1}{2}\left(\frac{\left(\nu +1\right){\left(|{y}_{t}|exp\left(-\frac{1}{2}{f}_{t}\right)\right)}^{h}/\nu}{1+{\left(|{y}_{t}|exp\left(-\frac{1}{2}{f}_{t}\right)\right)}^{h}/\nu}-1\right)$ | $\frac{\mathit{\nu h}}{4\left(1+h+\nu \right)}$ |

${f}_{t}=log{\sigma}_{t}^{2}$ | ${\epsilon}_{t}\sim \mathrm{EGB}2{\left(\xi ,\varsigma \right)}^{\left(c\right)}$ | $\frac{hexp\left(-\frac{1}{2}{f}_{t}\right)exp\left(\xi \left(\frac{{\mathit{hy}}_{t}}{exp\left(\frac{1}{2}{f}_{t}\right)}+\mathrm{\Delta}\right)\right)}{B\left(\xi ,\varsigma \right){\left(1+exp\left(h\frac{{y}_{t}}{exp\left(\frac{1}{2}{f}_{t}\right)}+\mathrm{\Delta}\right)\right)}^{\xi +\varsigma}}$ | $\frac{1}{2}\left(\left[\left(\xi +\varsigma \right){b}_{t}-\xi \right]\frac{{\mathit{hy}}_{t}}{exp\left(\frac{1}{2}{f}_{t}\right)}-1\right)$ where ${b}_{t}=\frac{exp\left(\frac{{\mathit{hy}}_{t}}{exp\left(\frac{1}{2}{f}_{t}\right)}+\mathrm{\Delta}\right)}{1+exp\left(\frac{{\mathit{hy}}_{t}}{exp\left(\frac{1}{2}{f}_{t}\right)}+\mathrm{\Delta}\right)}$ | $\propto 1$ |

*Note*: The functions are defined as follows: $B\left(a,b\right)=\mathrm{\Gamma}\left(a\right)\mathrm{\Gamma}\left(b\right)/\mathrm{\Gamma}\left(a+b\right)$, $K\left(\nu ,h\right)=h/\left[2{\nu}^{1/h}B\left(1/h,\nu /h\right)\right]$, and ${K}_{\mathrm{GE}}\left(\nu \right)={\nu}^{1-1/\nu}/\left[2\phantom{\rule{0.5em}{0ex}}\mathrm{\Gamma}\left(1/\nu \right)\right]$.

^{a}This is the standardized general error distribution (GED) pdf of Zhu and Zinde-Walsh (2009).

^{b}This is the generalized *t* distribution introduced by McDonald and Newey (1988).

^{c}This is the standardized EGB2 distribution with mean 0 and variance $1$, where *h* and $\mathrm{\Delta}$ are as in Equation 5.

#### Univariate Log-Scale Models

A point of concern for scale models is that filtered conditional volatilities should be strictly positive. In score models, the positivity of the conditional volatility can be typically ensured by imposing appropriate parameter restrictions. To avoid having to impose positivity restrictions on the parameters, it may be convenient to choose a different parametrization, where the logarithm of the scale is modeled instead of the scale itself. An advantage of this choice of parametrization is also that the stationarity condition is relatively straightforward, as it will simply be $\mid \beta \mid <1$. So instead of modeling the scale ${f}_{t}={\sigma}_{t}^{2}$ in ${y}_{t}={\sigma}_{t}{\epsilon}_{t}$, the log scale ${f}_{t}=log\left({\sigma}_{t}^{2}\right)$ is modeled, which gives the log scale model:

where${\epsilon}_{t}$is again i.i.d. with mean 0 and where *f*_{t} is updated according to Equation 1 based on the distribution of the innovations. So if${\epsilon}_{t}$has unit variance, then $exp\left({f}_{t}\right)=\mathrm{Var}\left({y}_{t}|{\mathrm{\mathcal{F}}}_{t-1}\right)$. Creal et al. (2013) stated that the score-driven log scale model is equivalent to the well-known EGARCH model of Nelson (1991) if ${\epsilon}_{t}$has an asymmetric Laplace distribution. For a selection of other distributions, Table 2 reports the score ∇_{t} and the conditional information matrix corresponding to the log scale parametrization. As was pointed out, the conditional information matrix does not depend on *f*_{t} in this case, so the choice of scaling is less critical here. For example, using the Student’s *t* distribution leads to:

for a scaling factor equal to${S}_{t}=2$(see Creal et al., 2011, 2013; Harvey, 2013, Chapter 4). In the latter reference, the model is referred to as the Beta-$t$-EGARCH model since the fraction in the score function has a Beta distribution, if evaluated at the true parameter. This property makes it a theoretically appealing model because the asymptotic properties of the ML estimator of the parameter vector in this model can be derived in a convenient manner (see Harvey, 2013, Section 4.2). This model has been adopted in many empirical studies in the literature, and it has been shown to outperform the regular GARCH and GARCH-*t* models in most of the studies (see, for example, Harvey & Sucarrat, 2014, Blazsek et al., 2016, and Catania & Nonejad, 2020).

Table 2 also gives the score of the EGB2 log scale model. The corresponding score-driven model is discussed briefly in Caivano and Harvey (2014), for a slightly different parametrization and with the main focus on a model where both the location and scale are score-driven.

Figure 4 plots the news impact curves (NIC) of score-driven log scale models with a normal, Student’s *t*, and EGB2 distribution with the same variance. The NIC is calculated by evaluating the score ∇_{t} as a function of the standardized observations${y}_{t}/exp\left({f}_{t}/2\right)$. The bounded score of the Student’s *t* distribution reflects the fact that it is a fat-tailed distribution and therefore large observations do not imply a proportionally large increase in the underlying volatility process. On the contrary, the score of the EGB2 distribution does diverge, which reflects the exponential tails of the distribution. This divergence is linear instead of a quadratic, which reflects that it has excess kurtosis. It is interesting to compare the shape of these NICs to those of the location model that were plotted in Figure 2.

A more flexible alternative to the Student’s *t* distribution is the generalized *t* distribution proposed by McDonald and Newey (1988). Harvey and Lange (2017) considered a score-driven scale model based on this distribution. The corresponding predictive density for two positive shape parameters *v* and *h* is given in Table 2. For *h* = 2, this is a regular Student’s *t* distribution with *v* degrees of freedom, while a GED distribution with parameter *h* is obtained if $\nu \to \infty $. The GED with *v* = 2 is the normal distribution, whereas *v* = 1 results in a Laplace distribution. Table 2 presents the predictive density, score, and information matrix for a log scale model under the GED. The updating of the log scale parameter *f*_{t} of the score-driven model based on the generalized *t* distribution is given by:

using${S}_{t}=2$. So it is clear that this update has the same robustness properties as the Student’s *t* scale filter in Equation 11 as long as $\nu <\infty $, because in that case the score is bounded. Harvey and Lange (2017) provided explicit asymptotic results for the distribution of the ML estimator. They also proposed likelihood ratio (LR) and Lagrange multiplier (LM) tests for the null hypothesis of thin tails (that is, $\nu \to \infty $) against the alternative hypothesis of fat tails (that is, $\nu <\infty $).

#### Univariate Log-Scale Models with Skewness

Another way to introduce flexibility into the score-driven model is to use a skewed error distribution. For instance, Harvey and Sucarrat (2014) considered constructing score-driven scale models based on distributions skewed by the method of Fernández and Steel (1998). Any probability density function${p}_{\epsilon}\left(\epsilon \right)$ that is unimodal and symmetric around 0 can be converted to a skewed density function as follows:

where $0<\gamma <\infty $ is the skewing parameter, and for $\gamma =1$, $\gamma <1$, and $\gamma >1$, the distribution is symmetric, left skewed, and right skewed, respectively. If this skewing method is applied to the Student’s *t* distribution, the score is given by

It follows that skewing the distribution directly induces asymmetry in the score and therefore in the update of the log scale parameter *f*_{t}. Figure 5 presents a plot of the NIC for some different choices of $\gamma $. It is clear that for $\gamma <1$, negative values are downweighted more, whereas positive values are downweighted less than for the symmetric distribution. Hence, under negative skewness, the score update takes into account that positive values are relatively less common than negative values of the same magnitude and should therefore receive a relatively higher weight. The asymptotic results of the symmetric score-driven Student’s *t* log scale model generalize straightforwardly to this skewed variant. The same skewing approach can be used for other distributions, such as the GED.

It is important to realize that because of the skewness, the expectation of the innovations is no longer equal to 0, which can be solved by reformulating the observation equation of the model as

where${\mu}_{\epsilon}=\mathbb{E}\left[{\epsilon}_{t}\right]$. The score and therefore the updating equation of *f*_{t} must be altered accordingly. A more detailed discussion is provided in the study by Harvey and Sucarrat (2014, Section 4); in their empirical study, they showed that the score-driven skewed *t* model generally outperforms competing models, including alternative skewed models, such as the GJR GARCH model (Glosten et al., 1993) with skewed innovations and the normal mixture GARCH model (Alexander & Lazar, 2006).

There are also alternative ways to skew a distribution. For example, Harvey and Lange (2017) considered a score-driven scale model based on a skewed generalized *t* distribution, using the skewing method of Zhu and Galbraith (2010), which uses a slightly different parametrization than that of Fernández and Steel (1998). Furthermore, not only skewness is introduced, but also the shape parameters *v* and *h* are allowed to have a different value above and below 0. So apart from skewness, a more explicit form of asymmetry is introduced to the distribution as well. Harvey and Lange (2017) demonstrated that the resulting score-driven model works well in practice, although it is recommended to simplify it by testing different parameter restrictions using LR, Wald, or LM tests.

#### Volatility in Mean

Multiple extensions of score-driven scale models have been proposed in the literature. As an illustrative example, consider the volatility-in-mean model proposed in Harvey and Lange (2018). This model alters the log specification of the ARCH in mean (ARCH-M) model of Engle et al. (1987) to have score-driven updating of the standard deviation. So instead of the regular log scale specification in Equation 10, now the time-varying scale parameter$exp\left(0.5{f}_{t}\right)$ also occurs in the mean equation of the observations${y}_{t}$:

where the innovation sequence ${\left\{{\epsilon}_{t}\right\}}_{t\in \mathrm{\mathbb{Z}}}$ is i.i.d. with mean 0 and fixed variance, and where the dynamics of *f*_{t} are again score-driven as in Equation 1. Here *μ* and *λ* are fixed but unknown coefficients that determine the level and the ARCH-M effect, respectively. The rationale for including the conditional scale parameter in the mean equation is that volatility of stock returns tends to be positively correlated with the level of the returns, because the equity risk premium increases if uncertainty rises.

When the innovations${\epsilon}_{t}$are Student’s *t* distributed with *v* degrees of freedom, Harvey and Lange (2018) showed that

If *λ* = 0 (and *μ* = 0), the regular log scale update of Equation 11 is recovered. In practice, the last term of the score has little impact on the filter, because *λ* is typically very small. Harvey and Lange (2018) derived theoretical properties for the resulting model, including moments and asymptotic results of the ML estimator.

#### Illustration: Electricity Spot Price Returns

To demonstrate the robustness property of score-driven scale models with heavy-tailed distributions, this section compares the Gaussian and Student’s *t* volatility model for a time series of electricity spot price returns.^{3} The data under consideration is a sequence of daily returns of the PJM electricity market, which serves 13 states in the United States. The data^{4} span from April 6, 2008, to December 31, 2013 (2,096 observations). Electricity prices are known to show occasional extreme behavior in the form of sudden positive spikes, after which they quickly move back to the pre-spike level. The primary cause of this is the nonstorability of electricity

(see Escribano et al., 2011). In effect, the returns based on these prices also display these incidental “outliers.” Figure 6 clearly shows that these properties are observable for the prices and returns currently under consideration. Models that are robust to sporadic extreme observations, such as the Student’s *t* score-driven model, are thus most likely to be valuable in modeling conditional volatility dynamics of these data.

Consider the log scale specification in Equation 10 with a non-zero mean *μ*: ${y}_{t}=\mu +exp\left(\frac{1}{2}{f}_{t}\right){\epsilon}_{t}.$ The scaling factor is chosen to be equal to the inverse of the conditional information matrix, ${S}_{t}={\mathrm{\mathcal{I}}}_{t\mid t-1}^{-1}$. Table 2 reports the score and information matrix of the Gaussian and Student’s *t* distributions that were used to construct the models.

#### Table 3. Maximum Likelihood Estimation Results (Standard Errors in Parentheses)*

$\mu $ | $\omega $ | $\alpha $ | $\beta $ | $\nu $ | LL | AIC | BIC | |
---|---|---|---|---|---|---|---|---|

Gaussian | 0.316 | 0.125 | 0.074 | 0.927 | ˗4770.5 | 4.56 | 4.57 | |

(0.063) | (0.058) | (0.034) | (0.033) | |||||

Student’s | 0.031 | 0.091 | 0.073 | 0.916 | 4.354 | ˗4611.5 | 4.41 | 4.42 |

(0.052) | (0.120) | (0.048) | (0.110) | (0.459) |

* The parameter estimates from the score-driven Gaussian and Student’s *t* volatility models, for the PJM electricity price returns as presented in Figure 6.

The resulting ML estimates are reported in Table 3 alongside the corresponding likelihood-based criteria. According to both reported information criteria, AIC and BIC, the Student’s *t* model has a better in-sample log likelihood value than the Gaussian model, taking into account that the former model has one extra parameter. The estimates of *α* and *β* are similar for both models, but because the estimated degrees of freedom parameter *v* is small, the two resulting models are inherently different.

Figure 7 shows the NICs of the estimated models. The estimated degrees of freedom parameter *v* is low, so the estimated error density has heavy tails, and therefore the score update of the Student’s *t* model is robust to large values. This is visible in the figure, as the response value corresponding to this model is bounded by a constant value. For the Gaussian model, this is not the case, because the response increases at a quadratic rate as *y*_{t} grows. Figure 8 shows the filtered volatility corresponding to the two fitted models alongside the absolute returns. As the NICs indicate, the filtered volatilities of the Gaussian model are impacted more by large observations than those of the Student’s *t* model are. After a spike in the returns, it takes a while for the Gaussian filtered volatility to move back to the pre-spike level. The filtered volatilities of the Student’s *t* model also react to large observations, but in a less extreme manner. A spike in the returns does not indicate a large change in the underlying volatility process, given the fact that prices quickly move back to their original level. Therefore, the robust volatility filter of the score-driven Student’s *t* model seems to be more suitable than the nonrobust filter of the Gaussian model, since the former is less sensitive to these temporary increases in the returns.

### Acknowledgment

The authors are grateful to Ekaterina Ugulava for the outstanding support that she offered in her role as research assistant. Also, the authors thank MSc student Anthony van Veen for contributing to the scale model application.

#### Further Reading

- Harvey, A. C. (2022). Score-driven time series models.
*Annual Review of Statistics and Its Application*, 9(1), 321–342.

#### References

- Alexander, C., & Lazar, E. (2006). Normal mixture GARCH(1,1): Applications to exchange rate modelling.
*Journal of Applied Econometrics*,*21*(3), 307–336. - Artemova, M., Blasques, F., van Brummelen, J., & Koopman, S. J. (2022). Score-driven models: Methods and applications. In
*Oxford research encyclopedia of economics and finance*. Oxford University Press. - Bazzi, M., Blasques, F., Koopman, S. J., & Lucas, A. (2017). Time-varying transition probabilities for Markov regime switching models.
*Journal of Time Series Analysis*,*38*(3), 458–478. - Blasques, F., Gorgi, P., Koopman, S. J., Wintenberger, O. (2018). Feasible invertibility conditions and maximum likelihood estimation for observation-driven models.
*Electronic Journal of Statistics*,*12*(1), 1019–1052. - Blasques, F., Koopman, S. J., & Lucas, A. (2015). Information-theoretic optimality of observation-driven time series models for continuous responses.
*Biometrika*,*102*(2), 325–343. - Blasques, F., Koopman, S. J., Lucas, A., & Schaumburg, J. (2016). Spillover dynamics for systemic risk measurement using spatial financial time series models.
*Journal of Econometrics*,*195*(2), 211–223. - Blasques, F., Lucas, A., & van Vlodrop, A. C. (2020). Finite sample optimality of score-driven volatility models: Some Monte Carlo evidence.
*Econometrics and Statistics*,*19*, 47–57. - Blasques, F., van Brummelen, J., Koopman, S. J., & Lucas, A. (2022). Maximum likelihood estimation for score-driven models.
*Journal of Econometrics*,*227*(2), 325–346. - Blazsek, S., Chavez, H., & Mendez, C. (2016). Model stability and forecast performance of Beta-
*t*-EGARCH.*Applied Economics Letters*,*23*(17), 1219–1223. - Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity.
*Journal of Econometrics*,*31*(3), 307–327. - Bollerslev, T. (1987). A conditionally heteroskedastic time series model for speculative prices and rates of return.
*The Review of Economics and Statistics*,*69*(3), 542–547. - Bougerol, P. (1993). Kalman filtering with random coefficients and contractions.
*SIAM Journal on Control and Optimization*,*31*(4), 942–959. - Caivano, M., & Harvey, A. C. (2014). Time-series models with an EGB2 conditional distribution.
*Journal of Time Series Analysis*,*35*(6), 558–571. - Caivano, M., Harvey, A. C., & Luati, A. (2016). Robust time series models with trend and seasonal components.
*SERIEs*,*7*(1), 99–120. - Catania, L., & Nonejad, N. (2020). Density forecasts and the leverage effect: Evidence from observation and parameter-driven volatility models.
*The European Journal of Finance*,*26*(2–3), 100–118. - Cox, D. R. (1981). Statistical analysis of time series: Some recent developments.
*Scandinavian Journal of Statistics*,*8*(2), 93–115. - Creal, D., Koopman, S. J., & Lucas, A. (2008).
*A general framework for observation driven time-varying parameter models*. Tinbergen Institute Discussion Paper 08-108/4. - Creal, D., Koopman, S. J., & Lucas, A. (2011). A dynamic multivariate heavy-tailed model for time-varying volatilities and correlations.
*Journal of Business & Economic Statistics*,*29*(4), 552–563. - Creal, D., Koopman, S. J., & Lucas, A. (2013). Generalized autoregressive score models with applications.
*Journal of Applied Econometrics*,*28*(5), 777–795. - Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation.
*Econometrica*,*50*, 987–1007. - Engle, R. F., & Gallo, G. M. (2006). A multiple indicators model for volatility using intra-daily data.
*Journal of Econometrics*,*131*(1–2), 3–27. - Engle, R. F., & Russell, J. R. (1998). Autoregressive conditional duration: A new model for irregularly spaced transaction data.
*Econometrica*,*66*, 1127–1162. - Engle, R. F., Lilien, D. M., & Robins, R. P. (1987). Estimating time varying risk premia in the term structure: The ARCH-M model.
*Econometrica*,*55*, 391–407. - Escribano, A., Ignacio Peña, J., & Villaplana, P. (2011). Modelling electricity prices: International evidence.
*Oxford Bulletin of Economics and Statistics*,*73*(5), 622–650. - Fernández, C., & Steel, M. F. (1998). On Bayesian modeling of fat tails and skewness.
*Journal of the American Statistical Association*,*93*(441), 359–371. - Glosten, L. R., Jagannathan, R., & Runkle, D. E. (1993). On the relation between the expected value and the volatility of the nominal excess return on stocks.
*The Journal of Finance*,*48*(5), 1779–1801. - Harvey, A. C. (1981).
*Time series models*. Philip Allen. - Harvey, A. C. (2013).
*Dynamic models for volatility and heavy tails: With applications to financial and economic time series*(Vol. 52). Cambridge University Press. - Harvey, A. C., & Chakravarty, T. (2008).
*Beta-*t*-IGARCH*. University of Cambridge, Faculty of Economics, Working paper CWPE 08340. - Harvey, A. C., & Lange, R.-J. (2017). Volatility modeling with a generalized
*t*distribution.*Journal of Time Series Analysis*,*38*(2), 175–190. - Harvey, A. C., & Lange, R.-J. (2018). Modeling the interactions between volatility and returns using EGARCH-M.
*Journal of Time Series Analysis*,*39*(6), 909–919. - Harvey, A. C., & Luati, A. (2014). Filtering with heavy tails.
*Journal of the American Statistical Association*,*109*(507), 1112–1122. - Harvey, A. C., & Sucarrat, G. (2014). EGARCH models with fat tails, skewness and leverage.
*Computational Statistics & Data Analysis*,*76*, 320–338. - Kalman, R. E. (1960). A new approach to linear filtering and prediction problems.
*Transactions of the ASME–Journal of Basic Engineering*,*82*(Series D), 35–45. - Koopman, S. J., Lucas, A., & Scharth, M. (2016). Predicting time-varying parameters with parameter-driven and observation-driven models.
*The Review of Economics and Statistics*,*98*(1), 97–110. - Maronna, R. A., Martin, D., & Yohai, V. (2006).
*Robust statistics: Theory and methods*. John Wiley & Sons. - McCracken, M., & Ng, S. (2020).
*FRED-QD: A quarterly database for macroeconomic research*. Working Paper 2020-005, Federal Reserve Bank of St. Louis. - McDonald, J. B., & Newey, W. K. (1988). Partially adaptive estimation of regression models via the generalized
*t*distribution.*Econometric Theory*,*4*, 428–457. - Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns: A new approach.
*Econometrica*,*59*, 347–370. - Plackett, R. L. (1950). Some theorems in least squares.
*Biometrika*,*37*(1–2), 149–157. - Russell, J. R. (2001).
*Econometric modeling of multivariate irregularly-spaced high-frequency data*. Technical report, University of Chicago. - Straumann, D., & Mikosch, T. (2006). Quasi-maximum-likelihood estimation in conditionally heteroscedastic time series: A stochastic recurrence equations approach.
*The Annals of Statistics*,*34*(5), 2449–2495. - Wang, K.-L., Fawson, C., Barrett, C. B., & McDonald, J. B. (2001). A flexible parametric GARCH model with an application to exchange rates.
*Journal of Applied Econometrics*,*16*(4), 521–536. - Wintenberger, O. (2013). Continuous invertibility and stable QML estimation of the EGARCH(1,1) model.
*Scandinavian Journal of Statistics*,*40*(4), 846–867. - Zhu, D., & Galbraith, J. W. (2010). A generalized asymmetric Student-
*t*distribution with application to financial econometrics.*Journal of Econometrics*,*157*(2), 297–305. - Zhu, D., & Zinde-Walsh, V. (2009). Properties and estimation of asymmetric exponential power distribution.
*Journal of Econometrics*,*148*(1), 86–99.

### Notes

1. The code developed for this illustration is made available on gasmodel.com, in the code section.

2. The 6-month minus 3-month treasury bill rate [TB6M3Mx] was obtained from the FRED-QD database of McCracken and Ng (2020). The data were multiplied by a factor of 10 for scaling purposes.

3. The code developed for this illustrations is made available on gasmodel.com, in the code section.

4. Data were retrieved from the PJM website. Daily prices

*p*_{t}are constructed as the average of the hourly prices of each day. The corresponding returns were calculated as $\left({p}_{t}-{p}_{t-1}\right)/{p}_{t-1}$ and multiplied by 10 for scaling purposes.