Researchers are more likely to share notable findings. As a result, published findings tend to overstate the magnitude of real-world phenomena. This bias is a natural concern for asset pricing research, which has found hundreds of return predictors and little consensus on their origins.
Empirical evidence on publication bias comes from large-scale metastudies. Metastudies of cross-sectional return predictability have settled on four stylized facts that demonstrate publication bias is not a dominant factor: (a) almost all findings can be replicated, (b) predictability persists out-of-sample, (c) empirical t-statistics are much larger than 2.0, and (d) predictors are weakly correlated. Each of these facts has been demonstrated in at least three metastudies.
Empirical Bayes statistics turn these facts into publication bias corrections. Estimates from three metastudies find that the average correction (shrinkage) accounts for only 10%–15% of in-sample mean returns and that the risk of inference going in the wrong direction (the false discovery rate) is less than 10%.
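The shrinkage idea can be illustrated with a toy normal-normal empirical Bayes model. This is only a sketch under stated assumptions, not the estimator used in the metastudies: each observed t-statistic is assumed to equal a true value plus standard normal noise, the true values are assumed normally distributed, and the selection step that generates publication bias is ignored. The function name and the sample t-statistics are hypothetical.

```python
import numpy as np

def eb_shrinkage(t_stats):
    """Toy empirical Bayes shrinkage.

    Assumed model: t_i = theta_i + e_i, e_i ~ N(0, 1), theta_i ~ N(mu, tau^2).
    Then Var(t) = tau^2 + 1, so the noise share of the variance is 1 / Var(t).
    """
    t = np.asarray(t_stats, dtype=float)
    mu = t.mean()                      # estimate of the prior mean
    total_var = t.var(ddof=1)          # estimate of tau^2 + 1
    shrink = min(1.0, 1.0 / total_var) # fraction of each deviation attributed to noise
    posterior_mean = mu + (1.0 - shrink) * (t - mu)
    return shrink, posterior_mean

# Hypothetical t-statistics, for illustration only
shrink, corrected = eb_shrinkage([2.0, 4.0, 6.0, 8.0])
```

The returned `shrink` is the share of each t-statistic's deviation from the cross-sectional mean that this toy model attributes to noise; a small value means the corrected estimates stay close to the raw ones.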
Metastudies also find that t-statistic hurdles exceed 3.0 in multiple testing algorithms and that returns are 30%–50% weaker in alternative portfolio tests. These facts are easily misinterpreted as evidence of publication bias. Other misinterpretations conflate phrases such as “mostly false findings” with “many insignificant findings,” “data snooping” with “liquidity effects,” and “failed replications” with “insignificant ad-hoc trading strategies.”
Cross-sectional predictability may not be representative of other fields. Metastudies of real-time equity premium prediction imply a much larger effect of publication bias, although the evidence is not nearly as abundant as it is in the cross section. Measuring publication bias in areas other than cross-sectional predictability remains an important area for future research.
Article
Todd E. Clark and Elmar Mertens
Vector autoregressions with stochastic volatility (SV) are widely used in macroeconomic forecasting and structural inference. The SV component of the model conveniently allows for time variation in the variance-covariance matrix of the model’s forecast errors. In turn, that feature of the model generates time variation in predictive densities. The models are usually estimated with Bayesian methods, typically Markov chain Monte Carlo methods such as Gibbs sampling. Equation-by-equation methods developed since 2018 enable the estimation of models with large variable sets at much lower computational cost than the standard approach of estimating the model as a system of equations. The Bayesian framework also facilitates the accommodation of mixed frequency data, non-Gaussian error distributions, and nonparametric specifications. With advances made in the 21st century, researchers are also addressing some of the framework’s outstanding challenges, particularly the dependence of estimates on the ordering of variables in the model and reliable estimation of the marginal likelihood, which is the fundamental measure of model fit in Bayesian methods.
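To make the role of the SV component concrete, the sketch below simulates a univariate autoregression whose error variance follows a latent log-volatility process. It is a deliberately simplified, single-equation illustration rather than a full VAR, and it does not perform Bayesian estimation. The AR(1) law of motion for the log variance and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_ar1_sv(n, phi=0.8, rho_h=0.95, sigma_h=0.2, seed=0):
    """Simulate y_t = phi * y_{t-1} + exp(h_t / 2) * eps_t,
    with log variance h_t = rho_h * h_{t-1} + sigma_h * u_t (assumed AR(1))."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    h = np.zeros(n)  # latent log variance of the forecast error
    for t in range(1, n):
        h[t] = rho_h * h[t - 1] + sigma_h * rng.standard_normal()
        y[t] = phi * y[t - 1] + np.exp(0.5 * h[t]) * rng.standard_normal()
    return y, h

y, h = simulate_ar1_sv(200)
forecast_sd = np.exp(0.5 * h)  # time-varying one-step-ahead forecast standard deviation
```

Because `forecast_sd` changes from period to period, the one-step-ahead predictive density widens and narrows over time, which is the feature the abstract highlights.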
Article
Joanne Ercolani
Unobserved components models (UCMs), sometimes referred to as structural time-series models, decompose a time series into its salient time-dependent features. These typically characterize the trending behavior, seasonal variation, and (nonseasonal) cyclical properties of the time series. The components are usually specified in a stochastic way so that they can evolve over time, for example, to capture changing seasonal patterns. Among many other features, the UCM framework can incorporate explanatory variables, allowing outliers and structural breaks to be captured, and can deal easily with daily or weekly effects and calendar issues like moving holidays.
UCMs are easily constructed in state space form. This enables the application of the Kalman filter algorithms, through which maximum likelihood estimates of the structural parameters are obtained, optimal predictions are made about the future state vector and the time series itself, and smoothed estimates of the unobserved components can be determined. The stylized facts of the series are then established and the components can be illustrated graphically, so that one can, for example, visualize the cyclical patterns in the time series or look at how the seasonal patterns change over time. If required, these characteristics can be removed, so that the data can be detrended, seasonally adjusted, or have business cycles extracted, without the need for ad hoc filtering techniques. Overall, UCMs have an intuitive interpretation and yield results that are simple to understand and communicate to others. Factoring in its competitive forecasting ability, the UCM framework is hugely appealing as a modeling tool.
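As an illustration of the state space approach, the following sketch runs the Kalman filter for the simplest UCM, a local level model (a random-walk trend observed with noise). The function name, the large initial variance used as an approximate diffuse prior, and the example data are assumptions made for this sketch; seasonal and cyclical components are omitted.

```python
import numpy as np

def local_level_filter(y, sigma2_eps, sigma2_eta, a0=0.0, p0=1e7):
    """Kalman filter for y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t.

    Returns the filtered level and the Gaussian log-likelihood.
    p0 is a large initial variance (approximate diffuse prior, an assumption).
    """
    y = np.asarray(y, dtype=float)
    a_filt = np.empty(len(y))
    loglik = 0.0
    a, p = a0, p0
    for t in range(len(y)):
        v = y[t] - a              # one-step-ahead prediction error
        f = p + sigma2_eps        # prediction error variance
        k = p / f                 # Kalman gain
        a = a + k * v             # filtered estimate of the level
        p = p * (1.0 - k)         # filtered state variance
        a_filt[t] = a
        loglik += -0.5 * (np.log(2.0 * np.pi * f) + v * v / f)
        p = p + sigma2_eta        # predict next state variance (level is a random walk)
    return a_filt, loglik

y_obs = np.full(50, 5.0)  # hypothetical series, constant for easy checking
level, loglik = local_level_filter(y_obs, sigma2_eps=1.0, sigma2_eta=0.1)
```

Maximum likelihood estimation would maximize `loglik` over the two variances with a numerical optimizer; that step is left out here to keep the sketch short.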
Article
Markowitz showed that an investor who cares only about the mean and variance of portfolio returns should hold a portfolio on the efficient frontier. The application of this investment strategy proceeds in two steps. First, the statistical moments of asset returns are estimated from historical time series, and second, the mean-variance portfolio selection problem is solved separately, as if the estimates were the true parameters. The literature on portfolio decisions acknowledges the difficulty in estimating means and covariances in many instances. This is particularly the case in high-dimensional settings. Merton notes that it is more difficult to estimate means than covariances and that errors in estimates of means have a larger impact on portfolio weights than errors in covariance estimates. Recent developments in high-dimensional settings have stressed the importance of correcting the estimation error of traditional sample covariance estimators for portfolio allocation. The literature has proposed shrinkage estimators of the sample covariance matrix and regularization methods founded on the principle of sparsity. Both methodologies are nested in a more general framework that constructs optimal portfolios under constraints on different norms of the portfolio weights, including short-sale restrictions. Shrinkage methods use a target covariance matrix and trade off bias and variance between the standard sample covariance matrix and the target. Among targets, more prominence has been given to low-dimensional factor models that incorporate theoretical insights from asset pricing models. In these cases, one has to trade off estimation risk for model risk. The literature on regularization of the sample covariance matrix, by contrast, uses different penalty functions for reducing the number of parameters to be estimated.
Recent methods extend the idea of regularization to a conditional setting based on factor models, which increase with the number of assets, and apply regularization methods to the residual covariance matrix.
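The two-step procedure with a shrinkage covariance estimator can be sketched as follows. This is a minimal illustration under stated assumptions: the target is a scaled identity matrix, the shrinkage intensity `delta` is fixed by hand rather than chosen optimally, the portfolio is the global minimum-variance one (so estimated means are not needed), and the returns are simulated.

```python
import numpy as np

def shrink_covariance(returns, delta):
    """Linear shrinkage of the sample covariance toward a scaled identity target.

    delta in [0, 1] is a user-chosen intensity (an assumption of this sketch).
    """
    sample = np.cov(returns, rowvar=False)
    n_assets = sample.shape[0]
    target = np.eye(n_assets) * np.trace(sample) / n_assets
    return (1.0 - delta) * sample + delta * target

def min_variance_weights(cov):
    """Global minimum-variance weights: proportional to inverse(cov) @ ones."""
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)
    return raw / raw.sum()

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.02, size=(60, 5))  # hypothetical returns: 60 periods, 5 assets
weights = min_variance_weights(shrink_covariance(returns, delta=0.5))
```

Setting `delta=1.0` replaces the sample covariance with the target entirely and yields equal weights, which shows how shrinkage trades the variance of the sample estimator for the bias of the target.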
Article
Javier Hualde and Morten Ørregaard Nielsen
Fractionally integrated and fractionally cointegrated time series are classes of models that generalize standard notions of integrated and cointegrated time series. The fractional models are characterized by a small number of memory parameters that control the degree of fractional integration and/or cointegration. In classical work, the memory parameters are assumed known and equal to 0, 1, or 2. In the fractional integration and fractional cointegration context, however, these parameters are real-valued and are typically assumed unknown and estimated. Thus, fractionally integrated and fractionally cointegrated time series can display very general types of stationary and nonstationary behavior, including long memory, and this more general framework entails important additional challenges compared to the traditional setting. Modeling, estimation, and testing in the context of fractional integration and fractional cointegration have been developed in time and frequency domains. Related to both alternative approaches, theory has been derived under parametric or semiparametric assumptions, and as expected, the obtained results illustrate the well-known trade-off between efficiency and robustness against misspecification. These different developments form a large and mature literature with applications in a wide variety of disciplines.
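The role of a real-valued memory parameter d can be seen from the coefficients of the fractional difference operator (1 - L)^d, which follow the recursion pi_0 = 1 and pi_k = pi_{k-1} (k - 1 - d) / k. The sketch below computes these weights and applies the filter truncated at the start of the sample; the truncation convention and the function names are assumptions of this illustration.

```python
import numpy as np

def frac_diff_weights(d, n):
    """First n coefficients of (1 - L)^d via pi_k = pi_{k-1} * (k - 1 - d) / k."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(x, d):
    """Apply (1 - L)^d to x, treating pre-sample values as zero (truncated filter)."""
    x = np.asarray(x, dtype=float)
    w = frac_diff_weights(d, len(x))
    # For each t, sum_{k=0}^{t} w[k] * x[t - k]
    return np.array([w[: t + 1] @ x[t::-1] for t in range(len(x))])

first_diff = frac_diff([1.0, 3.0, 6.0, 10.0], d=1.0)  # d = 1 reproduces ordinary differencing
half_weights = frac_diff_weights(0.5, 4)               # d = 0.5: slowly decaying weights
```

With d = 0 or d = 1 the weights cut off after one or two terms, recovering the classical cases; for non-integer d they decay slowly, which is the source of long memory.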
Article
Mariia Artemova, Francisco Blasques, Janneke van Brummelen, and Siem Jan Koopman
The flexibility, generality, and feasibility of score-driven models have contributed much to their impact in both research and policy. Score-driven models provide a unified framework for modeling the time-varying features in parametric models for time series.
The score of the predictive likelihood function is used as the driving mechanism for updating the time-varying parameters. It leads to a flexible, general, and intuitive way of modeling the dynamic features in the time series while the estimation and inference remain relatively simple. These properties remain valid when models rely on non-Gaussian densities and nonlinear dynamic structures. The class of score-driven models has become even more appealing as developments in theory and methodology have progressed rapidly. Furthermore, new formulations of empirical dynamic models in this class have shown their relevance in economics and finance. In the context of macroeconomic studies, the key examples are nonlinear autoregressive, dynamic factor, dynamic spatial, and Markov-switching models. In the context of finance studies, the major examples are models for integer-valued time series, multivariate scale, and dynamic copula models. In finance applications, score-driven models are especially important because they provide particular updating mechanisms for time-varying parameters that limit the effect of the influential observations and outliers that are often present in financial time series.
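How a score-driven update limits the effect of outliers can be sketched for a time-varying variance. Differentiating the log density with respect to the variance f gives, up to the positive factor 1/(2 f^2), the term y^2 - f for a Gaussian density and w y^2 - f with w = (nu + 1) / ((nu - 2) + y^2 / f) for a Student-t density with nu degrees of freedom. The recursion, the simplified scaling, and all parameter values below are assumptions of this sketch, not a specification taken from the literature.

```python
import numpy as np

def score_step(y, f, nu=None):
    """Score of the log density with respect to the variance f,
    up to the positive factor 1 / (2 f^2) (simplified scaling, an assumption)."""
    if nu is None:                       # Gaussian case: GARCH-type update
        return y**2 - f
    w = (nu + 1.0) / ((nu - 2.0) + y**2 / f)  # weight shrinks for large |y|
    return w * y**2 - f

def filter_variance(y, omega=0.05, alpha=0.1, beta=0.9, nu=None, f0=1.0):
    """f_{t+1} = omega + beta * f_t + alpha * score_t (illustrative parameters)."""
    f = np.empty(len(y) + 1)
    f[0] = f0
    for t, yt in enumerate(y):
        f[t + 1] = omega + beta * f[t] + alpha * score_step(yt, f[t], nu)
    return f

y = np.array([0.1, -0.2, 10.0, 0.1])   # hypothetical returns with one outlier
f_gauss = filter_variance(y)           # Gaussian score
f_t = filter_variance(y, nu=5.0)       # Student-t score
```

After the outlier, the Gaussian recursion lets the variance jump in proportion to y^2, whereas the Student-t weight `w` falls as |y| grows, so the update stays bounded.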
Article
Piotr Śpiewanowski, Oleksandr Talavera, and Linh Vi
The 21st-century economy is increasingly built around data. Firms and individuals upload and store enormous amounts of data. Most of the produced data is stored on private servers, but a considerable part is made publicly available across the 1.83 billion websites available online. These data can be accessed by researchers using web-scraping techniques.
Web scraping refers to the process of collecting data from web pages either manually or using automation tools or specialized software. Web scraping is possible and relatively simple thanks to the regular structure of the code used for websites designed to be displayed in web browsers. Websites built with HTML can be scraped using standard text-mining tools, either scripts in popular (statistical) programming languages such as Python, Stata, or R, or stand-alone dedicated web-scraping tools. Some of those tools do not even require any prior programming skills.
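The regular structure of HTML is what makes extraction straightforward. The sketch below uses only Python's standard library to pull prices out of a small page. The HTML snippet, the `price` class name, and the `PriceParser` class are hypothetical; a real scraper would first download the page, for example with `urllib.request.urlopen`.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element as a float."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data.strip().lstrip("$")))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

# Hypothetical page content, inlined so the example needs no network access
PAGE = """
<ul>
  <li><span class="name">Milk</span> <span class="price">$1.20</span></li>
  <li><span class="name">Bread</span> <span class="price">$2.50</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(PAGE)
prices = parser.prices
```

Running the same parser on pages collected at regular intervals would produce the kind of high-frequency price series that the abstract describes.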
Since about 2010, with the omnipresence of social and economic activities on the Internet, web scraping has become increasingly popular among academic researchers. In contrast to proprietary data, whose acquisition might not be feasible due to substantial costs, web scraping can make interesting data sources accessible to everyone.
Thanks to web scraping, the data are now available in real time and in significantly more detail than what has traditionally been offered by statistical offices or commercial data vendors. In fact, many statistical offices have started using web-scraped data, for example, for calculating price indices. Data collected through web scraping have been used in numerous economic and finance projects and can easily complement traditional data sources.
Article
Atila Abdulkadiroğlu
Parental choice over public schools has become a major policy tool to combat inequality in access to schools. Traditional neighborhood-based assignment is being replaced by school choice programs, broadening families’ access to schools beyond their residential location. Demand and supply in school choice programs are cleared via centralized admissions algorithms. Heterogeneous parental preferences and admissions policies create trade-offs between efficiency and equity. The data from centralized admissions algorithms can be used effectively for credible research design toward a better understanding of school effectiveness, which in turn can be used for school portfolio planning and student assignment based on match quality between students and schools.
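The abstract does not name a specific admissions algorithm. As one widely used example of a centralized mechanism, the sketch below implements student-proposing deferred acceptance. The function name, the tiny data set, and the assumption that every school ranks every student are choices made for this illustration only.

```python
def deferred_acceptance(student_prefs, school_priorities, capacities):
    """Student-proposing deferred acceptance.

    student_prefs: dict student -> list of schools, most preferred first.
    school_priorities: dict school -> list of all students, highest priority first.
    capacities: dict school -> number of seats.
    Returns dict student -> school for the matched students.
    """
    rank = {s: {stu: i for i, stu in enumerate(order)}
            for s, order in school_priorities.items()}
    next_choice = {stu: 0 for stu in student_prefs}   # index of next school to try
    held = {s: [] for s in school_priorities}          # tentative admits per school
    unassigned = list(student_prefs)
    while unassigned:
        stu = unassigned.pop()
        prefs = student_prefs[stu]
        if next_choice[stu] >= len(prefs):
            continue                                   # list exhausted: stays unmatched
        school = prefs[next_choice[stu]]
        next_choice[stu] += 1
        held[school].append(stu)
        held[school].sort(key=lambda x: rank[school][x])
        if len(held[school]) > capacities[school]:
            unassigned.append(held[school].pop())      # reject lowest-priority applicant
    return {stu: s for s, admitted in held.items() for stu in admitted}

# Hypothetical example: three students, two schools
student_prefs = {"a": ["X", "Y"], "b": ["X", "Y"], "c": ["Y", "X"]}
school_priorities = {"X": ["b", "a", "c"], "Y": ["a", "b", "c"]}
capacities = {"X": 1, "Y": 2}
match = deferred_acceptance(student_prefs, school_priorities, capacities)
```

In this example students a and b both rank school X first, but X has one seat and gives b higher priority, so a is rejected and admitted at Y. Tie-breaking and priority rules of this kind are where efficiency and equity trade-offs arise.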
Article
Gianluca Cubadda and Alain Hecq
Reduced rank regression (RRR) has been extensively employed for modelling economic and financial time series. The main goals of RRR are to specify and estimate models that are capable of reproducing the presence of common dynamics among variables such as the serial correlation common feature and the multivariate autoregressive index models. Although cointegration analysis is likely the most prominent example of the use of RRR in econometrics, a large body of research is aimed at detecting and modelling co-movements in time series that are stationary or that have been stationarized after proper transformations. The motivations for the use of RRR in time series econometrics include dimension reduction, which simplifies complex dynamics and thus makes interpretations easier, as well as the pursuit of efficiency gains in both estimation and prediction. Via the final equation representation, RRR also establishes the nexus between multivariate time series and parsimonious marginal ARIMA (autoregressive integrated moving average) models. RRR’s drawback, which is common to all of the dimension reduction techniques, is that the underlying restrictions may or may not be present in the data.
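A minimal sketch of the RRR estimator, assuming an identity weighting matrix: the ordinary least squares coefficient matrix is projected onto the leading right singular vectors of the fitted values. The simulated data and function name are hypothetical; in a time-series application the regressors would typically be lags of the variables.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Rank-constrained least squares for Y = X B + E (identity weighting assumed)."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # Leading right singular vectors of the OLS fitted values span the
    # directions in the response space that the regressors explain best.
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:rank].T
    return B_ols @ V_r @ V_r.T

rng = np.random.default_rng(0)
# Hypothetical rank-1 coefficient matrix: one common factor drives all responses
B_true = np.outer([1.0, -1.0, 0.5, 2.0], [1.0, 0.5, -1.0])
X = rng.standard_normal((200, 4))
Y = X @ B_true + 0.1 * rng.standard_normal((200, 3))
B_hat = reduced_rank_regression(X, Y, rank=1)
```

Here the rank restriction holds by construction; with real data it may not, which is the drawback the abstract notes.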
Article
Jennifer L. Castle and David F. Hendry
Shared features of economic and climate time series imply that tools for empirically modeling nonstationary economic outcomes are also appropriate for studying many aspects of observational climate-change data. Greenhouse gas emissions, such as carbon dioxide, nitrous oxide, and methane, are a major cause of climate change as they cumulate in the atmosphere and reradiate the sun’s energy. As these emissions are currently mainly due to economic activity, economic and climate time series have commonalities, including considerable inertia, stochastic trends, and distributional shifts, and hence the same econometric modeling approaches can be applied to analyze both phenomena. Moreover, both disciplines lack complete knowledge of their respective data-generating processes (DGPs), so model search retaining viable theory but allowing for shifting distributions is important. Reliable modeling of both climate and economic-related time series requires finding an unknown DGP (or close approximation thereto) to represent multivariate evolving processes subject to abrupt shifts. Consequently, to ensure that the DGP is nested within a much larger set of candidate determinants, model formulations to search over should comprise all potentially relevant variables, their dynamics, indicators for perturbing outliers, shifts, trend breaks, and nonlinear functions, while retaining well-established theoretical insights. Econometric modeling of climate-change data requires a sufficiently general model selection approach to handle all these aspects. Machine learning with multipath block searches commencing from very general specifications, usually with more candidate explanatory variables than observations, to discover well-specified and undominated models of the nonstationary processes under analysis, offers a rigorous route to analyzing such complex data.
To do so requires applying appropriate indicator saturation estimators (ISEs), a class that includes impulse indicators for outliers, step indicators for location shifts, multiplicative indicators for parameter changes, and trend indicators for trend breaks. All ISEs entail more candidate variables than observations, often by a large margin when implementing combinations, yet can detect the impacts of shifts and policy interventions to avoid nonconstant parameters in models, as well as improve forecasts. To characterize nonstationary observational data, one must handle all substantively relevant features jointly: A failure to do so leads to nonconstant and mis-specified models and hence incorrect theory evaluation and policy analyses.
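Because a full set of impulse indicators contains as many candidate variables as observations, the indicators must be added in blocks. The sketch below shows a simple split-half version of impulse indicator saturation for a constant-mean model: indicators for each half are added in turn, significant ones are retained, and the retained set is re-tested jointly. The critical value, the two-block split, and the simulated data are assumptions of this sketch; multipath block searches are considerably more elaborate.

```python
import numpy as np

def ols_tstats(X, y):
    """OLS coefficients and their t-statistics."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, beta / se

def indicators(n, positions):
    """n x len(positions) matrix with one impulse dummy per position."""
    D = np.zeros((n, len(positions)))
    D[positions, np.arange(len(positions))] = 1.0
    return D

def split_half_iis(y, crit=2.58):
    """Split-half impulse indicator saturation around a constant mean."""
    n = len(y)
    const = np.ones((n, 1))
    retained = []
    for block in (np.arange(0, n // 2), np.arange(n // 2, n)):
        _, t = ols_tstats(np.hstack([const, indicators(n, block)]), y)
        retained.extend(block[np.abs(t[1:]) > crit].tolist())
    if retained:  # joint re-test of the indicators kept from both halves
        _, t = ols_tstats(np.hstack([const, indicators(n, retained)]), y)
        retained = [i for i, ti in zip(retained, t[1:]) if abs(ti) > crit]
    return sorted(retained)

rng = np.random.default_rng(0)
y = rng.standard_normal(40)  # hypothetical data
y[5] = 10.0                  # planted outlier at position 5
detected = split_half_iis(y)
```

Step, multiplicative, and trend indicators follow the same logic with different dummy shapes in place of the impulse columns.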