Statistical Significance and the Replication Crisis in the Social Sciences
Summary and Keywords
The recent “replication crisis” in the social sciences has led to increased attention on what statistically significant results entail. There are many reasons why false-positive results may be published in the scientific literature, such as low statistical power and “researcher degrees of freedom” in the analysis (where researchers, when testing a hypothesis, more or less actively seek results with p < .05). The results from three large replication projects in psychology, experimental economics, and the social sciences are discussed, with most of the focus on the last project, where the statistical power in the replications was substantially higher than in the other projects. The results suggest that a substantial share of published results in top journals do not replicate. While several replication indicators have been proposed, the main indicator for whether a result replicates is whether the replication study, using the same statistical test, finds a statistically significant effect (p < .05 in a two-sided test). For the project with very high statistical power, the various replication indicators agree to a larger extent than for the other replication projects, most likely because of that higher power. While the replications discussed are mainly experiments, there is no reason to believe that replicability would be higher in other parts of economics and finance; if anything, the opposite is likely due to more researcher degrees of freedom. Solutions to the often-observed low replicability are also discussed, including lowering the p value threshold for statistical significance to .005 and increasing the use of preanalysis plans and registered reports for new studies as well as replications, followed by a discussion of measures of peer beliefs.
Recent attempts to understand to what extent the academic community is aware of the limited reproducibility and can predict replication outcomes using prediction markets and surveys suggest that peer beliefs may be viewed as an additional reproducibility indicator.
The second most viewed TED talk, with more than 48 million views as of September 2018, is about power posing. The TED talk is based on a study where 42 men and women were reported to have been randomized to hold either a high-power or a low-power position for a couple of minutes (Carney, Cuddy, & Yap, 2010). The results are stunning, with large effects from a small intervention. High-power positions supposedly lead to higher testosterone levels, lower cortisol levels, increased financial risk-taking, and increased feelings of power. The study was published in the top scientific journal Psychological Science in 2010. Five years later the same journal published a replication attempt with 200 participants (Ranehill et al., 2015). The replication study differed from the original study in some aspects; one was that the replication was experimenter-blind. In the larger sample, Ranehill et al. failed to find support for power posing having any effects on hormones or behavior. A number of other failed replications have been published since then, and the first author has written an open letter describing some of the reasons why the original paper reports p values that are basically meaningless in terms of understanding false-positive risk (Carney, 2016). The power posing paper is by no means alone in making big claims that turn out not to replicate; there are reasons to believe that in many fields in the social sciences a substantial share of published results are false positives. This article describes the many reasons why this may be the case. There is also a discussion of the various replication projects attempting to gauge the share of false-positive results, including the Reproducibility Project: Psychology (RPP; Open Science Collaboration, 2015), the Experimental Economics Replication Project (EERP; Camerer et al., 2016), and the Social Science Replication Project (SSRP; Camerer et al., 2018), and the lessons learned from these projects.
In particular, the RPP led to a scientific debate about a potential replication crisis (Anderson et al., 2016; Gilbert, King, Pettigrew, & Wilson, 2016) and many subsequent replication projects (Cova et al., 2018; Ebersole et al., 2016; Klein et al., 2014; Schweinsberg et al., 2016). The many different solutions to the replication problem are also relevant. While some of the content in this article is also discussed in a recent book chapter (Camerer, Dreber, & Johannesson, 2019), the most significant difference between this article and the chapter is the focus here on the results from SSRP, which is not part of the book chapter. This article also summarizes the current pooled results on peer beliefs as reproducibility indicators and elaborates more on p values and decision markets.
Reasons for False Positive Results
Many possible factors determine the share of false-positive results in the scientific literature. For example, researchers can fake data. However, even excluding such clearly fraudulent behavior, there are many reasons to believe that a large share of published findings may be false-positives (Ioannidis, 2005).
P Values, Statistical Power, and Researcher Degrees of Freedom
Most researchers in economics and finance are somewhat aware of the importance of p values and statistical power, in the sense that researchers typically want low p values and worry about the increased probability of false-negative results that comes with low power. Yet power is rarely reported in economics and finance. Moreover, little attention is typically given to the problems arising from the combination of low power and statistically significant results, with some exceptions (Ioannidis, 2005; Ioannidis & Trikalinos, 2007; Leamer, 1983). Statistically significant results in low-powered studies not only have an increased risk of being false positives but also have a high probability of being exaggerated compared to the true effect even if it is positive (magnitude [type M] error) and a nonnegligible probability of being of the wrong sign, where the true effect is actually in the opposite direction of the observed statistically significant one (sign [type S] error; Gelman & Carlin, 2014).
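The type M and type S logic can be illustrated with a short simulation. This is a minimal sketch, not from Gelman and Carlin (2014); the true effect of 0.1 SD and 20 observations per group are hypothetical numbers chosen to make the study severely underpowered:

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # hypothetical small true effect (in SD units)
n = 20              # hypothetical per-group sample size -> low power
n_sims = 100_000

# Difference-in-means estimate has standard error sqrt(2/n) with unit-SD outcomes
se = np.sqrt(2 / n)
estimates = rng.normal(true_effect, se, n_sims)
z = estimates / se
significant = np.abs(z) > 1.96  # two-sided p < .05

# Type M error: average exaggeration factor among significant results
exaggeration = np.mean(np.abs(estimates[significant])) / true_effect
# Type S error: share of significant results with the wrong sign
sign_error = np.mean(estimates[significant] < 0)

print(f"power: {significant.mean():.2f}")                   # roughly 0.06
print(f"exaggeration ratio (type M): {exaggeration:.1f}")   # far above 1
print(f"wrong-sign share (type S): {sign_error:.2f}")
```

With these numbers, a statistically significant estimate overstates the true effect several-fold, and a nonnegligible share of significant estimates even point in the wrong direction.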
There are reasons to worry about low power problems. A study on 159 empirical economics topics with more than 6,700 studies concludes that most of empirical economics is underpowered, with the median power being only 18% (Ioannidis, Stanley, & Doucouliagos, 2018).1 There is also evidence of low power in behavioral and experimental economics, where power is perhaps more easily determined by the experimenter than in other subfields of economics and finance (Zhang & Ortmann, 2013).2
During the last few years there has been an increasing focus on the various “researcher degrees of freedom” (Gelman & Loken, 2013; Simmons, Nelson, & Simonsohn, 2011) as key explanations for the prevalence of false-positive results. “P hacking” (Simmons et al., 2011) refers to a process where a researcher tests a hypothesis and actively tries to find a statistically significant result (typically p < .05). This can be done by, for example, analyzing many measures but only reporting those with p < .05, including many conditions in an experiment but only reporting those with p < .05, testing different functional forms but only reporting those with p < .05, including or excluding covariates in the analysis but only reporting those specifications with p < .05, excluding “outliers” to obtain results with p < .05, and so on. While p hacking implies an intentional search for statistically significant results, the researcher can probably convince him- or herself even unintentionally that the final analysis (where p < .05) is the one that makes the most sense and is what he or she had in mind. The “garden of forking paths” (Gelman & Loken, 2013) is a related phenomenon where a researcher might aim to test a very specific hypothesis that has not been exactly specified, so the researcher allows the analysis to be contingent on the data and thus the results. Consider a researcher who is interested in testing whether some labor market intervention affects labor market outcomes. And consider the rare situation where the researcher has data from a randomized controlled trial in the field. There are many potential forks in this type of analysis. What if researchers find an effect for younger individuals but not older ones? They could argue that interventions, of course, matter the most for individuals with little previous experience.
Or researchers may find an effect for older individuals and not younger, and they could argue that this result is due to older individuals being more in need of updating their labor market skills. Or researchers may find an effect but only for women, or only for men, or only for individuals living in big cities because big cities typically have more competitive labor markets, or only for individuals living in the countryside because the range of jobs is narrower in the countryside, and so on. And how should the researcher define labor market outcomes? Should it be whether individuals are employed or not after some time period? Or should the focus be on the salary level at the time of their first employment after the intervention or at some other point? What time period after the intervention should be the focus? As long as the analysis has not been prespecified, it is enough to do just one test and still be forking, since if the results had come out differently the researcher would have continued forking (Gelman & Loken, 2013). This type of data-contingent analysis is common and often encouraged. However, the problem with both p hacking and forking is that p values end up being meaningless, increasing the rate of “statistically significant” results that are false positives.
Another relatively neglected factor in assessing whether a published result is true is the prior—the initial probability that the hypothesis is true. Priors are, however, typically subjective and hard to access. As mentioned in the “Discussion” section of this article, prediction markets and other information aggregation tools may be helpful to estimate priors.3
Replicating Treatment Effects
The focus here is on direct replications, where an experiment is run in the same way as the original experiment, ideally using exactly the same materials and software as in the original study. Experiments are probably easier to replicate than most other types of studies, and there is no reason to think that the rate of replicability would be higher in other fields of economics and finance. Most replications also involve different subject populations from those in the original studies. However, even if there are varying treatment effects across populations, there would be no systematic upward or downward bias in estimated effect sizes in the replications. This is important to consider in the discussion of the typically lower average effect size observed in systematic replication projects.4
Defining Successful Replication
The most commonly used indicator for whether a result replicates is whether the replication study, using the same statistical test, finds a statistically significant effect (p < .05, corresponding to a one-sided test at the 2.5% level) in the same direction as the original study—the statistical significance criterion.5 This criterion was used as the primary replication indicator in the RPP, the EERP, and the SSRP, and a complementary measure used in all three projects was the continuous measure of relative effect size of the replication, where standardized effect sizes are compared between the original study and the replication.
Additional replication indicators have also been put forward, including whether the original effect is included in the 95% confidence interval (CI) of the replication effect size. This measure is problematic as it does not test whether the replication effect size is different from zero or from the original effect size. Replications with higher power and thus narrower CIs will therefore lead to a lower replication rate. A second replication indicator is a meta-analysis combining the original result and the replication result. This measure is problematic since it assumes that there is no bias in the original estimate and that the only difference between the original result and the replication result is the sample size, which is unlikely to be the case since, unlike the replication, the original result is typically subject to researcher degrees of freedom and publication bias, leading to a biased estimate. A third replication indicator is the 95% prediction interval approach (Patil, Peng, & Leek, 2016), which implies testing whether the original result and replication result differ significantly from each other. This measure is problematic when original results have p values close to 5% and a lower bound of the CI close to zero. In these cases, original results are unlikely to be rejected unless the replication estimate is in the opposite direction of the original result, which will only be the case for true null effects 50% of the time. A fourth replication indicator is the “small telescopes” approach (Simonsohn, 2015b), where it is tested whether the replication effect size is significantly smaller (with a one-sided test at the 5% level) than a “small effect” in the original study.
A small effect is defined as the effect size the original study would have had 33% power to detect, and using this approach Simonsohn (2015a) recommends that replication sample sizes are 2.5 times the original sample size since this leads to a power of 80% to reject a “small effect” if the true effect is zero. A failed replication occurs if the effect size in the replication is significantly smaller than this small effect size. However, the definition of a small effect is relatively arbitrary, and there is an implicit assumption about all original studies being equally well powered. Moreover, if original studies are severely underpowered, this increase in sample sizes will not lead to high-powered replications. The final replication indicator discussed here concerns the Bayes factor, which is an alternative way to represent the strength of evidence in favor of the alternative hypothesis—here the original result—versus the strength of evidence of the null hypothesis of no effect (Marsman et al., 2017; Wagenmakers, Verhagen, & Ly, 2016). The Bayes factor expresses how much a data set shifts the support for the alternative hypothesis versus the null hypothesis, showing how much the prior odds are updated to the posterior odds due to the observed data. A default Bayes factor above 1 favors the hypothesis of an effect in the direction of the original paper, and a default Bayes factor below 1 favors the null hypothesis.6
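The small telescopes logic can be made concrete with a normal-approximation sketch. The function names and the n = 50 example below are illustrative, not from Simonsohn (2015a, 2015b):

```python
from scipy.stats import norm

def d33(n_per_group: int) -> float:
    """Effect size that a two-sample study with n per group had 33% power
    to detect at p < .05 (two-sided test, normal approximation)."""
    return (norm.ppf(0.975) + norm.ppf(1 / 3)) / (n_per_group / 2) ** 0.5

def power_to_reject_small(n_orig: int, multiple: float = 2.5) -> float:
    """Power of a replication with `multiple` x the original sample size
    to reject d33 (one-sided test, 5% level) when the true effect is zero."""
    small = d33(n_orig)
    se_rep = (2 / (multiple * n_orig)) ** 0.5
    return norm.cdf(small / se_rep - norm.ppf(0.95))

print(round(d33(50), 3))                  # the "small effect" for n = 50
print(round(power_to_reject_small(50), 2))
```

Under this approximation, a replication 2.5 times the original sample size has roughly 80% power to reject the small effect when the true effect is zero, regardless of the original sample size—which is exactly the rationale behind Simonsohn's 2.5x recommendation.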
The RPP (Open Science Collaboration, 2015) reported the outcomes for replications of 100 original studies, mainly experiments, published in three top journals in cognitive and social psychology in 2008: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition. Many of the original articles contained several experiments; the last one was chosen for replication in 84 out of 100 cases.
In the RPP, preregistered replication reports explaining potential differences between the original study and the replication as well as the exact statistical analysis were internally reviewed and also sent to the original authors for feedback. The replications had on average 92% power to detect 100% of the original effect size at the 5% significance level.
Using the primary replication indicator—the statistical significance criterion where a successful replication is defined as finding a statistically significant effect in the same direction as the original study—35 (36%) out of 97 original studies with statistically significant results (p < .05)7 replicated (Open Science Collaboration, 2015). The mean relative effect size in the replications was 49% of the original effect size. Moreover, 47% of the original effect sizes were in the 95% CI of the replication effect size, and the meta-analytic results where the original and replication results were combined suggest that 68% of the effects were statistically significant. Patil et al. (2016) find that 77% (out of 92 explored studies) of the replication effect sizes reported were within the 95% prediction interval. While this share is substantially higher than what was found with the statistical significance criterion, it is important to note that many of the prediction intervals also overlap a zero effect size in the replications. With the small telescope approach, only 25% of the replications conclusively failed to replicate (Simonsohn, 2015a, 2015b). As for the prediction interval approach, the suggested high number of successful replications with this approach is likely to be related to the power of the replications in the RPP, which, despite the high ambitions, probably was not as high as desired, given that many of the original effects were probably subject to type M errors and thus exaggerated. Finally, Etz and Vandekerckhove (2016) calculated Bayes factors for 72 of the replication studies in the RPP and found that a minority (21%) showed strong support for the alternative hypothesis, while the rest showed ambiguous information (38%) or weak support for the null hypothesis (35%). One would expect the various replication indicators to converge to some extent for replications with higher power—which is also what Camerer et al. (2018) found for the SSRP.
The EERP (Camerer et al., 2016) performed a systematic replication effort of 18 laboratory experiments in economics published in two top journals in economics in 2011–2014: the American Economic Review and the Quarterly Journal of Economics. This replication project used different inclusion criteria than the RPP, only including studies testing main effects in between-subject designs.8 The average replication power was 92% to find 100% of the original effect size at the 5% significance level. As in the RPP, all replications and analyses were preregistered and communicated to the original authors before the replications occurred.
Using the statistical significance criterion, 11 (61%) out of 18 original studies replicated, which is significantly lower than the mean replication power of 92% (p < .001). When it comes to the other replication indicators, the mean relative effect size of the replications was 65.9%.9 Twelve of 18 (67%) replication CIs included the original effect size. One replication CI exceeded the original effect size; if this is included, the number replicated increases to 13 of 18 (72%). Combining the original and replication results in a meta-analysis suggests that 14 of 18 (78%) of the effects were statistically significant. Eighty-three percent of the replication effect sizes were within the 95% prediction interval using the original effect size. This number increases to 89% when coding the replication with an effect size above the upper bound of the prediction interval as “replicated.” The small telescopes approach yielded identical results to the meta-analysis.
The SSRP (Camerer et al., 2018) performed a systematic replication effort of 21 social science experiments published in the two top general science journals Nature and Science, 2010–2015. This replication project included papers that tested for an experimental treatment effect within or between subjects, where the subjects were students or part of some other accessible subject pool (and to be included the paper also had to report at least one statistically significant treatment effect). Unlike in the RPP, the SSRP selected the first experiment in papers with more than one experiment. When there were several central results in the first experiment, one was randomly picked for replication.
Statistical power was substantially higher in the SSRP than in the RPP and EERP to address potentially exaggerated effect sizes for true-positive results among the original results. In more detail, the SSRP included a two-stage design for conducting replications. In stage 1, the replication had 90% power to detect 75% of the original effect size at the 5% significance level in a two-sided test. If the original result did not replicate in stage 1, the replication continued into stage 2 such that the pooled replication had 90% power to detect 50% of the original effect size at the 5% significance level. Replication sample sizes in stage 1 were on average three times as large as the original sample sizes, while replication sample sizes in stage 2 were on average six times as large as the original sample sizes. The choice of the stage 2 power was based on the RPP’s replication effect sizes being on average about 50% of the original effect sizes. As in the RPP and EERP, all replications were preregistered and communicated to the original authors before the replications occurred.
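The stage 1 and stage 2 sample sizes follow from a standard power calculation. The sketch below uses a normal approximation with a hypothetical original effect size of d = 0.5 and assumes the original study was powered at 80%; under those assumptions the replication samples come out in the same ballpark as the factors of three and six reported for the SSRP:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d: float, power: float = 0.90, alpha: float = 0.05) -> int:
    """Per-group sample size for a two-sample test of standardized
    effect size d (two-sided test, normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z / d) ** 2)

d_orig = 0.5                              # hypothetical original effect size
print(n_per_group(0.75 * d_orig))         # stage 1: 90% power for 75% of d
print(n_per_group(0.50 * d_orig))         # stage 2: 90% power for 50% of d
print(n_per_group(d_orig, power=0.80))    # original study at 80% power
```

Halving the detectable effect size roughly quadruples the required sample, which is why stage 2 is so much larger than stage 1.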
Using the statistical significance criterion, 13 (62%) out of 21 original studies replicated in stage 2. The mean relative effect size of the replications was 46.2%. For the 13 studies that replicated, the mean relative effect size was 74.5% whereas it was 0.3% for the 8 studies that did not replicate. This is further evidence suggesting that true-positive results also have exaggerated effect sizes in the original studies and thus that replications with “high” statistical power to detect 100% of the original effect size will be underpowered.
When it comes to the other replication criteria, 16 (76%) out of 21 studies had a statistically significant effect in the meta-analysis. Fourteen (66.7%) of the replication effect sizes were within the 95% prediction interval using the original effect size, whereas the small telescopes approach found that 12 (57.1%) studies replicated. The one-sided default Bayes factor exceeded 1 and thus provides evidence in favor of an effect in the direction of the original study for the same 13 studies that replicated according to the statistical significance criterion. This evidence is strong to extreme for 9 of these 13 studies. For the eight studies that failed to replicate according to the statistical significance criterion, the Bayes factor is below 1 in support of the null hypothesis, and this evidence is strong to extreme for four of these eight studies. In sum, the various replication indicators agree to a larger extent for the SSRP than for the other replication projects, and this is most likely due to the SSRP’s higher statistical power.
Besides increasing transparency and openness in data, analysis code, and materials (as discussed by, e.g., Miguel et al., 2014; Nosek et al., 2015), there are a number of other areas and practices that could be improved to increase the reliability of published results.
A recent paper proposes lowering the p value threshold for statistical significance from .05 to .005 (Benjamin et al., 2018). Results with p values between .05 and .005 would be referred to as suggestive evidence, while results with p values below .005 would be referred to as statistically significant. As Benjamin et al. show, a p value of .05 only gives strong support for an alternative hypothesis to be true if the prior probability and the statistical power are high. There are reasons to believe that this is rarely the case. Two attempts using different methods to estimate the prior probability of the original hypotheses in the RPP find low average priors of around 10% (Dreber et al., 2015; Johnson, Payne, Wang, Asher, & Mandal, 2017). Even when considering a more optimistic prior (1:5), the false-positive rate is also high for high-powered studies with the p value threshold of .05, while it remains low for a substantially larger range of statistical power with the p value threshold of .005.
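The argument can be reproduced with a few lines of arithmetic using the standard false-positive-rate formula; the power and prior values below are chosen for illustration (the 10% prior roughly matches the RPP estimates cited above):

```python
def false_positive_rate(alpha: float, power: float, prior: float) -> float:
    """Expected share of 'significant' findings that are false positives,
    given the significance threshold (alpha), statistical power, and the
    prior probability that a tested hypothesis is true."""
    false_pos = alpha * (1 - prior)   # nulls that cross the threshold
    true_pos = power * prior          # true effects that cross it
    return false_pos / (false_pos + true_pos)

# With a prior of about 10% and 80% power (both illustrative values):
print(round(false_positive_rate(0.05, 0.8, 0.10), 2))   # roughly a third
print(round(false_positive_rate(0.005, 0.8, 0.10), 2))  # an order of magnitude lower
```

Even at high power, more than a third of p < .05 findings would be false positives under a 10% prior, while the .005 threshold brings the rate down to about 5%.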
There is a substantial negative correlation between the p values of the original studies and replicability in the RPP, EERP, and SSRP, which suggests that lowering the p value threshold for statistical significance would be an effective way of lowering the fraction of false positives. To have enough power for statistical tests with the p < .005 threshold, sample sizes would need to increase by about 70% and/or measurement would have to be improved. The importance of increased sample sizes is one of several reasons for more team science (Ebersole et al., 2016; Klein et al., 2014; Munafò et al., 2017; Schweinsberg et al., 2016), with successful examples like the Many Labs projects (Ebersole et al., 2016; Klein et al., 2014) and the Pipeline Project (Schweinsberg et al., 2016), where a large number of laboratories implement the same research protocol to increase power and obtain precise effect sizes while exploring variation across samples and settings. While team science is a clear way forward for more reliable results, it is not clear to what extent there will be more of this in the near future within economics and finance.
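The roughly 70% figure follows from a normal-approximation power calculation; this sketch assumes 80% power is held fixed while the threshold tightens:

```python
from scipy.stats import norm

def sample_size_ratio(alpha_new: float, alpha_old: float,
                      power: float = 0.80) -> float:
    """Factor by which sample size must grow to keep `power` constant
    when the two-sided significance threshold tightens (normal approx.)."""
    z = lambda a: norm.ppf(1 - a / 2) + norm.ppf(power)
    return (z(alpha_new) / z(alpha_old)) ** 2

print(round(sample_size_ratio(0.005, 0.05), 2))  # close to 1.70
```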
An increased use of preregistered analysis plans—or preanalysis plans—in the fields of economics and finance dealing with p values would lead to a decrease in researcher degrees of freedom and thus increase the relevance of p values (Casey, Glennerster, & Miguel, 2012; Nosek, Ebersole, DeHaven, & Mellor, 2018). The preanalysis plan specifies all data collection and management, the tested hypotheses (ideally directional) including a hierarchy in terms of their importance, as well as all statistical tests. The preanalysis plan thus specifies how variables are coded, how control variables are used, how standard errors are treated, and so on—it is very detailed in order to minimize researcher degrees of freedom. Exploratory analyses where there are no clear hypotheses can also be discussed in the preanalysis plan. Yet “surprising” effects can also be discovered. Researchers can perform additional data-contingent analyses that are not mentioned in the preanalysis plan as long as it is made clear in the paper that these analyses are post hoc and thus have lower credibility. However, these exploratory analyses may be hypothesis-generating for future studies and allow for the discovery of true-positive surprising results.
There has also been a recent trend in using preregistration to combat both researcher-driven and journal-driven publication bias using Registered Reports. While increasingly used in, for example, psychology, they are still rare in economics and finance, with the Journal of Development Economics being a rare exception. With this approach, researchers submit what is basically a preanalysis plan that undergoes peer review and thus commit to perform a specific analysis. The journal, on the other hand, commits to publishing the article no matter the results. Given the evidence for p hacking in many fields of economics (Brodeur, Cook, & Heyes, 2018; Brodeur, Lé, Sangnier, & Zylberberg, 2016), Registered Reports could play an important role in curtailing the problem of unreliable results.10
Encouraging all types of replications is important to increase our understanding of the reliability of published results as well as to curb the use of the many researcher degrees of freedom. The field of psychology has probably moved the furthest among the social sciences in encouraging replications. One such example is the Registered Replication Reports (RRRs) organized by the Association for Psychological Science and published in Advances in Methods and Practices in Psychological Science. As in the Many Labs projects, several laboratories replicate the chosen studies once replication protocols have been developed together with the original authors. A key difference is that an RRR only deals with a specific topic, with recent examples of RRRs exploring the ego-depletion effect in 23 laboratories with a sample size of N = 2,141 (Hagger et al., 2016) as well as the effect of time pressure on cooperation in 21 laboratories with a sample size of N = 3,596 (Bouwmeester et al., 2017). In both cases the high-powered RRRs failed to reject the null hypothesis of no effect. This type of replication should be particularly possible in experimental economics but also in other subsets of economics and finance using the experimental method. Several other methods to encourage replications have also been proposed (Butera & List, 2017; Coffman, Niederle, & Wilson, 2017), and more are likely to be put forward.
A number of recent attempts have been made to understand to what extent the academic community is aware of the limited reproducibility and can predict replication outcomes using prediction markets and surveys (Camerer et al., 2016, 2018; Dreber et al., 2015). Prediction markets are a tool to aggregate private information (Plott & Sunder, 1988). In these markets, traders trade contracts with clearly defined outcomes, such as whether the hypothesis will replicate or not, and the market prediction is equivalent to the prediction of a single trader that has all pieces of information. In the case of a binary event, the price of a contract can, with some caveats (Manski, 2006), be interpreted as the probability the market assigns to the event. Prediction markets have been shown to successfully predict outcomes in many domains (Arrow et al., 2008) and were first proposed for use in science by Robin Hanson (Hanson, 1995) and later tested in the laboratory (Almenberg, Kittlitz, & Pfeiffer, 2009).11 Dreber et al. (2015), Camerer et al. (2016), Camerer et al. (2018), and Forsell et al. (2018) used prediction markets and surveys to elicit peer beliefs about replicability in the RPP, the EERP, the SSRP, and the Many Labs 2 (ML2).12
For each of these projects, researchers in the relevant fields were invited to participate in prediction markets where they were endowed with $50 to $100 to be used for trading on the outcome of the tested hypotheses. For the RPP this involved 41 of the 100 replications, while for the EERP, SSRP, and ML2 it involved all replications in these studies. Before participating in the prediction market, the researchers filled out a survey where they were asked to give their subjective probability for each hypothesis to replicate. For these projects taken together, peer beliefs about replication from the prediction markets and the surveys performed relatively well in predicting actual replication outcomes (in terms of the statistical significance criterion); thus peer beliefs may also be viewed as an additional reproducibility indicator.13 As discussed in a recent paper (Forsell et al., 2018), a simple analysis where a prediction market price or survey belief of more than 50% is interpreted as a correct prediction if the study replicates finds that pooling the results from all four prediction market studies yields a 73% (76/104) correct prediction rate whereas the survey yields a 66% (68/103) correct prediction rate.
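The simple scoring rule used in that analysis—reading a price above 50% as a prediction that the study will replicate—can be sketched as follows; the prices and outcomes here are made up for illustration:

```python
def correct_rate(prices: list[float], outcomes: list[int]) -> float:
    """Share of studies where a market price > .5, read as predicting a
    successful replication, matches the binary replication outcome."""
    hits = sum((p > 0.5) == bool(o) for p, o in zip(prices, outcomes))
    return hits / len(prices)

# Hypothetical final market prices and replication outcomes (1 = replicated)
prices = [0.80, 0.30, 0.65, 0.40, 0.55]
outcomes = [1, 0, 0, 0, 1]
print(correct_rate(prices, outcomes))  # 0.8
```

Applied to the pooled data from the four projects, this rule yields the 73% (76/104) market accuracy and 66% (68/103) survey accuracy reported by Forsell et al. (2018).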
The prediction market data can also be used in combination with replication results, power, and p value thresholds to estimate the probability that a tested hypothesis is true at different stages of the research process, including the prior probability of the tested hypothesis before the original result (Forsell et al., 2018).
What to Replicate?
Given the limited resources and the abundance of potential studies to replicate, an important question is which studies to focus on when only a subset can be replicated. One idea would be to use decision markets to guide this process. In a decision market one could, for example, use a decision rule that gives extra weight to replicating studies where there is a lot of uncertainty about the result (e.g., a prediction market price of 50 out of 100) or where there is substantial disagreement in expert predictions. This is still largely unexplored.
Almenberg, J., Kittlitz, K., & Pfeiffer, T. (2009). An experiment on prediction markets in science. PLOS ONE, 4(12), e8500.
Anderson, C. J., Bahník, Š., Barnett-Cowan, M., Bosco, F. A., Chandler, J., Chartier, C. R., . . . Zuni, K. (2016). Response to comment on “Estimating the reproducibility of psychological science.” Science, 351(6277), 1037.
Arrow, K. J., Forsythe, R., Gorham, M., Hahn, R., Hanson, R., Ledyard, J. O., . . . Zitzewitz, E. (2008). The promise of prediction markets. Science, 320(5878), 877–878.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., . . . Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10.
Bouwmeester, S., Verkoeijen, P. P. J. L., Aczel, B., Barbosa, F., Bègue, L., Brañas-Garza, P., . . . Wollbrant, C. E. (2017). Registered replication report: Rand, Greene, and Nowak (2012). Perspectives on Psychological Science, 12(3), 527–542.
Brodeur, A., Cook, N., & Heyes, A. (2018). Methods matter: P-hacking and causal inference in economics (No. 11796). Ottawa, ON: University of Ottawa Department of Economics.
Brodeur, A., Lé, M., Sangnier, M., & Zylberberg, Y. (2016). Star wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1), 1–32.
Butera, L., & List, J. A. (2017). An economic approach to alleviate the crises of confidence in science: With an application to the Public Goods Game. Working Paper. Cambridge, MA: National Bureau of Economic Research.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365.
Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., . . . Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436.
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., . . . Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644.
Camerer, C. F., Dreber, A., & Johannesson, M. (2019). Replication and other practices for improving scientific quality in experimental economics. In A. Schram & A. Ule (Eds.), Handbook of research methods and applications in experimental economics (pp. 83–102). London: Edward Elgar.
Carney, D. R. (2016). My position on “power poses.” Unpublished manuscript, University of California, Berkeley.
Carney, D. R., Cuddy, A. J. C., & Yap, A. J. (2010). Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science, 21(10), 1363–1368.
Casey, K., Glennerster, R., & Miguel, E. (2012). Reshaping institutions: Evidence on aid impacts using a preanalysis plan. The Quarterly Journal of Economics, 127(4), 1755–1812.
Coffman, L. C., Niederle, M., & Wilson, A. J. (2017). A proposal to organize and promote replications. American Economic Review, 107(5), 41–45.
Cova, F., Strickland, B., Abatista, A., Allard, A., Andow, J., Attie, M., . . . Zhou, X. (2018). Estimating the reproducibility of experimental philosophy. Review of Philosophy and Psychology. [Advance online publication]
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3(4), 286–300.
DellaVigna, S., & Pope, D. (2018). Predicting experimental results: Who knows what? Journal of Political Economy, 126(6), 2410–2456.
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., . . . Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343–15347.
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., . . . Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82.
Etz, A., & Vandekerckhove, J. (2016). A Bayesian perspective on the Reproducibility Project: Psychology. PLOS ONE, 11(2), e0149794.
Forsell, E., Viganola, D., Pfeiffer, T., Almenberg, J., Wilson, B., Chen, Y., . . . Dreber, A. (2018). Predicting replication outcomes in the Many Labs 2 study. Journal of Economic Psychology. [Advance online publication]
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651.
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Unpublished manuscript, Columbia University, New York.
Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60(4), 328–331.
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351(6277), 1037.
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., . . . Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11(4), 546–573.
Hanson, R. (1995). Could gambling save science? Encouraging an honest consensus. Social Epistemology, 9(1), 3–33.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124.
Ioannidis, J. P. A., Stanley, T. D., & Doucouliagos, H. (2018). The power of bias in economics research. The Economic Journal, 127(605), F236–F265.
Ioannidis, J. P., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4(3), 245–253.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532.
Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. (2017). On the reproducibility of psychological science. Journal of the American Statistical Association, 112(517), 1–10.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahník, Š., Bernstein, M. J., . . . Nosek, B. A. (2014). Investigating variation in replicability: A “many labs” replication project. Social Psychology, 45(3), 142–152.
Landy, J. F., Jia, M. (Liam), Ding, I., Viganola, D., Tierney, W., Dreber, A., . . . Ebersole, C. R. (2018). Crowdsourcing hypothesis tests: Making transparent how design choices shape research results. Manuscript in preparation.
Leamer, E. E. (1983). Let’s take the con out of econometrics. The American Economic Review, 73(1), 31–43.
Manski, C. F. (2006). Interpreting the predictions of prediction markets. Economics Letters, 91(3), 425–429.
Marsman, M., Schönbrodt, F. D., Morey, R. D., Yao, Y., Gelman, A., & Wagenmakers, E.-J. (2017). A Bayesian bird’s eye view of “Replications of important results in social psychology.” Royal Society Open Science, 4(1), 160426.
Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K. M., Gerber, A., . . . Van der Laan, M. (2014). Promoting transparency in social science research. Science, 343(6166), 30–31.
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., . . . Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021.
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., . . . Yarkoni, T. (2015). Promoting an open research culture. Science, 348(6242), 1422–1425.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Patil, P., Peng, R. D., & Leek, J. T. (2016). What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspectives on Psychological Science, 11(4), 539–544.
Plott, C. R., & Sunder, S. (1988). Rational expectations and the aggregation of diverse information in laboratory security markets. Econometrica, 56(5), 1085–1118.
Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., & Weber, R. A. (2015). Assessing the robustness of power posing: No effect on hormones and risk tolerance in a large sample of men and women. Psychological Science, 26(5), 653–656.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641.
Schweinsberg, M., Madan, N., Vianello, M., Sommer, S. A., Jordan, J., Tierney, W., . . . Uhlmann, E. L. (2016). The pipeline project: Pre-publication independent replications of a single laboratory’s research pipeline. Journal of Experimental Social Psychology, 66, 55–67.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Simonsohn, U. (2015a). Accepting the null: Where to draw the line? Data Colada, 42.
Simonsohn, U. (2015b). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26(5), 559–569.
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30–34.
Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General, 143(4), 1457–1475.
Wagenmakers, E.-J., Verhagen, J., & Ly, A. (2016). How to quantify the evidence for the absence of a correlation. Behavior Research Methods, 48(2), 413–426.
Zhang, L., & Ortmann, A. (2013). Exploring the meaning of significance in experimental economics. UNSW Australian School of Business Research Paper No. 2013-32. SSRN.
(1.) A study is defined as underpowered if it has less than 80% statistical power.
(3.) Other more commonly discussed factors involve file-drawer effects (Rosenthal, 1979), where null findings disappear into the “file drawers” because researchers do not write up or submit the paper, as well as publication bias (Sterling, 1959), where journals are biased against null results.
(4.) The Many Labs replication projects explore the degree of systematic variation across subject pools. The results do not support the idea of substantial systematic variation in average treatment effects across different samples (Ebersole et al., 2016; Klein et al., 2014).
(5.) There are, however, challenges comparing significance levels across experiments (Gelman & Stern, 2006). There is also no consensus on how to define successful replication (Cumming, 2008; Gelman & Stern, 2006; Open Science Collaboration, 2015; Simonsohn, 2015b; Verhagen & Wagenmakers, 2014).
(6.) Bayesian approaches vary in their choice of prior for the alternative hypothesis’ effect size, and these approaches favor the null hypothesis to different degrees (Etz & Vandekerckhove, 2016; Verhagen & Wagenmakers, 2014). Using the Bayes factor is similar to interpreting the p value as a continuous variable representing the strength of evidence for the tested hypothesis, but the Bayes factor has a more direct evidential interpretation than the p value.
(7.) Actually, four out of these p values were p < .06.
(8.) This is relevant for any potential comparison between RPP and EERP, since the RPP results indicated that main effects were more likely to replicate than interaction effects.
(9.) This number is the mean of the relative effect size in each study. If the mean replication effect size is divided by the mean original effect size as in the RPP, the number is 59%.
(10.) Preanalysis plans cannot insure against dishonest research practices like fraud. However, there are reasons to believe that researcher degrees of freedom are a bigger problem than actively deceitful behavior like fraud (John, Loewenstein, & Prelec, 2012).