Calls to reform the way we do science are becoming increasingly frequent. Most scientists seem to agree that we should fight problems such as p hacking, publication bias, and corrupting incentives. The key question is *how* we should do this: what makes a reform effective?

On September 1, a quantitatively impressive group of people wrote a paper arguing that we should redefine statistical significance by changing the alpha threshold for new discoveries from 0.05 to 0.005 (Benjamin, D. J., Berger, J.O., Johannesson, M., Nosek, B.A., Wagenmakers, E.J., Berk, R., Bollen, K.A., Brembs, B., Brown, L., Camerer, C., Cesarini, D., Chambers, C.D., Clyde, M., Cook, T.D., De Boeck, P., Dienes, Z., Dreber, A., Easwaran, K., Efferson, C., Fehr, E., Fidler, F., Field, A.P., Forster, M., George, E.I., Gonzalez, R., Goodman, S., Green, E., Green, D.P., Greenwald, A.G., Hadfield, J.D., Hedges, L.V., Held, L., Hua Ho, T., Hoijtink, H., Hruschka, D.J., Imai, K., Imbens, G., Ioannidis, J.P.A., Jeon, M., Jones, J.H., Kirchler, M., Laibson, D., List, J., Little, R., Lupia, A., Machery, E., Maxwell, S., McCarthy, M., Moore, D.A., Morgan, S.L., Munafó, M., Nakagawa, S., Nyhan, B., Parker, T.H., Pericchi, L., Perugini, M., Rouder, J., Rousseau, J., Savalei, V., Schönbrodt, F.D., Sellke, T., Sinclair, B., Tingley, D., Van Zandt, T., Vazire, S., Watts, D.J., Winship, C., Wolpert, R.L., Xie, Y., Young, C., Zinman, J., & Johnson, V.E., 2017).

On September 18, a quantitatively even more impressive group of people wrote a reply, arguing that alpha should not be set to any particular standard value but that researchers should (instead) transparently report and justify all choices they make when designing a study, including the alpha level (Lakens, D., Adolfi, F.G., Albers, C.J., Anvari, F., Apps, M.A.J., Argamon, S.E., Baguley, T., Becker, R.B., Benning, S.D., Bradford, D.E., Muchana, E.M., Caldwell, A.R., van Calster, B., Carlsson, R., Chen, Sau-Chin., Chung, B., Colling, L.J., Collins, G.S., Crook, Z., Cross, E.S., Daniels, S., Danielsson, H., DeBruine, L., Dunleavy, D.J., Earp, B.D., Feist, M.I., Ferrel, J.D., Field, J.G., Fox, N.W., Friesen, A., Gomes, C., Gonzalez-Marquez, M., Grange, J.A., Grieve, A.P., Guggenberger, R., Grist, J., van Harmelen, A.L., Hasselman, F., Hochard, K.D., Hoffarth, M.R., Holmes, N.P., Ingre, M., Isager, P.M., Isotalus, H.K., Johansson, C., Juszczyk, K., Kenny, D.A., Khalil, A.A., Konat, B., Lao, J., Larsen, E.K., Lodder, G.M.A., Lukavsky, J., Madan, C.R., Manheim, D., Martin, S.R., Marin, A.E., Mayo, D.G., McCarthy, R.J., McConway, K., McFarland, C., Nio, A.Q.X., Nilsonne, G., Lino de Oliveira, C., Orban de Xivry, J.J., Parsons, S., Pfuhl, G., Quinn, A.K., Sakon, J.J., Adil Saribay, S., Schneider, I.K., Selvaraju, M., Sjoerds, Z., Smith, S.G., Smits, T., Spies, J.R., Sreekumar, V., Steltenpohl, C.N., Stenhouse, N., Swiatkowski, W., Vadillo, M.A., Van Assen, M.A.L.M., Williams, M.N., Williams, S.E., Williams, D.R., Yarkoni, T., Ziano, I., & Zwaan, R.A., 2017).

With 88 authors, the ‘justify your alpha’ paper is clearly better than the ‘alpha should be 0.005’ paper, which has a pathetic 72 authors. Both papers have received quite some attention, but they have not always been characterized accurately. Recently a new pre-print was published which takes a very different look at the whole p < 0.005 discussion – but before we get into that, let’s briefly examine the original two papers.

## Reduce alpha to 0.005

Benjamin et al. propose to *“change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries”*. The key argument is that, when analyzed from a Bayesian perspective, a p value of 0.049 tends to provide very little evidence (given certain decisions about priors). They further claim that reducing the criterion for ‘evidence’ to p < 0.005 will *“immediately improve the reproducibility of scientific research in many fields”* and substantially reduce false positives.
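The “very little evidence” point can be made concrete with the Sellke–Berger lower bound on the Bayes factor, a common calibration in this literature (the function name below is mine; the bound itself is the standard one):

```python
import math

def min_bayes_factor(p):
    """Sellke-Berger lower bound on the Bayes factor for H0 vs H1:
    BF >= -e * p * ln(p), valid for p < 1/e."""
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

print(round(min_bayes_factor(0.049), 3))  # about 0.40: at best ~2.5:1 odds against H0
print(round(min_bayes_factor(0.005), 3))  # about 0.07: at best ~14:1 odds against H0
```

So even under the most favorable alternative, p = 0.049 can shift the odds against the null by no more than a factor of about 2.5 – weak evidence by most standards.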

Importantly, the authors do not state that this should become a criterion for publication, but a new standard of evidence, particularly for new discoveries. What is also very often missed is that the “*significance threshold […] should depend on the prior odds that the null hypothesis is true, the number of hypotheses tested, the study design, the relative cost of type I versus type II errors, and other factors that vary by research topic.*” As such, they already argue that the alpha value should be flexible.

Of special interest is their pro-active reply to the potential objection that their proposal does not fix issues such as p hacking and publication bias. They argue that reducing the statistical threshold “*complements — but does not substitute for — solutions to these other problems, which include good study design, ex ante power calculations, pre-registration of planned analyses, replications, and transparent reporting of procedures and all statistical analyses conducted.*”

While this seems a fair argument (no single measure can solve everything, right?), we will later see that it does *not* hold.

Onwards to the ‘justify your alpha’ paper…

## Justify your alpha

While Lakens et al. agree with many of the points raised by Benjamin et al., they disagree on several important issues. First, they state that “*there is insufficient evidence that the current standard for statistical significance is in fact a ‘leading cause of non-reproducibility’.*” Second, any alpha value is arbitrary, whether it is 0.05 or 0.005. Third, a lower significance threshold can have a wide range of consequences, both positive and negative, which should be carefully evaluated before any large-scale changes are proposed.

Their counter proposal is that “*researchers [should] justify their choice for an alpha level before collecting the data.*”

While they do discuss p hacking from time to time, they do not directly address it as a main issue that their proposal will or can solve. Nevertheless, they address it implicitly by stating that important decisions should be made before data collection, thus preventing researchers from making decisions conditional on the data that should not be conditional on the data (such as the choice of alpha level).

## Now what…

A bit ironically, both papers ultimately make essentially identical statements:

- the current choice of alpha = 0.05 is arbitrary
- a p value of 0.05 provides only weak evidence against the null hypothesis (or for the alternative hypothesis)
- an alpha value of 0.05 leads (under various modeling assumptions) to a high false positive report probability
- we need to change the way we use alpha
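The third bullet – the false positive report probability – follows from a standard calculation. A sketch with illustrative numbers (the 1:10 prior odds and 75% power are my assumptions for the example, in the range these papers discuss):

```python
def false_positive_report_prob(alpha, power, prior_odds_h1):
    """P(H0 is true | significant result), given the alpha level, the power,
    and the prior odds that H1 is true (pi1 / pi0)."""
    pi1 = prior_odds_h1 / (1 + prior_odds_h1)
    pi0 = 1 - pi1
    return alpha * pi0 / (alpha * pi0 + power * pi1)

# Assumed illustrative numbers: 1:10 prior odds for H1, 75% power.
print(false_positive_report_prob(0.05, 0.75, 1 / 10))   # about 0.40
print(false_positive_report_prob(0.005, 0.75, 1 / 10))  # about 0.0625
```

Under these assumptions roughly 40% of ‘discoveries’ at alpha = 0.05 would be false positives, dropping to about 6% at alpha = 0.005 – which is exactly the kind of projected improvement the rest of this post interrogates.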

The only real point at which the two papers diverge is the proposed solution for *how* we should change the way we use alpha. Benjamin et al. propose that alpha should be set to a new default of 0.005 and that it should be adjusted based on relevant factors (such as subject matter, study design, and prior odds). Lakens et al. disagree with the new default and agree only with the ‘adjust and justify alpha’ part. Frankly, most of their disagreement appears to be superficial.

## While two dogs are fighting for a bone…

… a third one runs away with it.

On November 19, Harry Crane published a pre-print entitled “Why ‘Redefining Statistical Significance’ Will Not Improve Reproducibility and Could Make the Replication Crisis Worse.” He specifically addresses the claims that alpha = 0.005 will decrease false positive rates and improve reproducibility. I tweeted a bit about this paper earlier, as it provides some fascinating arguments.

The two conclusions of this paper are:

- The claimed improvements to false positive rate and replication rate in Benjamin, et al. (2017) are exaggerated and misleading.
- There are plausible scenarios under which the lower [significance] cutoff will make the replication crisis worse.

Crane reaches these conclusions by taking the calculations/simulations of the paper by Benjamin et al. and introducing the effects of p hacking. The essence of his argument is straightforward: if you lower the significance threshold, people will adjust their behavior accordingly. That is, people will hack their p values until they reach the new threshold. When this happens structurally, it diminishes the projected positive effects of the lower threshold, up to the point that there are no positive effects left.

Crane distinguishes two kinds of p values:

- Sound p values – for which “*the standard interpretation is valid (i.e., the probability, under the null hypothesis, that the test statistic attains a value as extreme or more extreme than observed)*”
- Unsound p values – for which that interpretation is not valid

Crane goes on to show that the promised effects (fewer false positives and better reproducibility) of alpha = 0.005 are conditional on the hidden assumption that all the p values in the published literature are sound. This is, of course, false. He also shows that the larger the proportion of unsound significant p values in the literature, the smaller the effect of a new alpha threshold on decreasing false positives and increasing reproducibility. The key question thus becomes: how prevalent is p hacking? But first, let’s look at what p hacking actually is.

## P hacking

P hacking is a mysterious enemy that keeps popping up. We should know our enemy well, so let’s give p hacking a proper definition:

P hacking is the (undisclosed) use of any technique that results in an unsound p value.

I deliberately left out that the common purpose or goal of p hacking is to obtain a p value lower than alpha. Sometimes we are actually looking for a *high* p value, for example when one (incorrectly) wants to claim that there are no differences between groups or no effect of a moderator. However, Benjamin et al. explicitly focus on *false positives* (and not *false negatives*), so for now we will only consider hacked p values which are *lower* than the sound value and lower than alpha.

Why is p hacking a thing? Generally speaking, there are two categories of causes: 1) human nature and 2) structural incentives. To start with the latter: we are clearly incentivised to report ‘positive results’ with significant p values. An experimental study concluding that there are no significant differences just isn’t as easily published as one in which the differences are significant. However, it is also *human nature* to look for results that confirm our hypotheses (confirmation bias), that are unexpected, that tell a fascinating story, that will be picked up by newspapers, and so on.

Exploring a dataset by running multiple different analyses (which is a form of p hacking) is a very natural thing to do, especially if there is no clear pre-registered analysis protocol in place. We run studies because we expect to find something, so we look until we find something. Changing or removing the significance threshold will not affect *why* we hack, just *how* we hack and for *how long*.

## Is p hacking harder with p < 0.005 than p < 0.05?

It is trivially true that p hacking is ‘harder’ with p < 0.005 than with p < 0.05, but this is not as relevant or informative as one might think. Obtaining p < alpha is just a matter of time; it’s not really a question of ‘difficulty’. Given enough freedom to re-analyze a dataset, you are essentially guaranteed to find p < alpha.

The main limiting factor for p hacking is the maximum number of researcher degrees of freedom you could employ. If you have 1 participant with 1 value, you have no degrees of freedom left. However, as soon as you have a decent sample size and a range of measured variables, the number of degrees of freedom you could *potentially* utilize skyrockets. As such, it is fair to claim that for a (very) large proportion of studies the potential researcher degrees of freedom are so numerous that you are virtually guaranteed to find a p value lower than 0.05, or even lower than 0.005.

I concede that I would need more evidence to fully support the above claim, but bear with me – even with a weaker claim we will reach the same conclusion.

Let’s assume that a new alpha value *does* make it substantially harder to p hack, and thus that fewer p values are hacked. Counter-intuitively, this still does *not* automatically lead to fewer false positives *in the published literature*. This follows from the following (very realistic) assumptions:

- p hacking is common,
- p hacking predominantly results in p values lower than alpha,
- there are (many) more papers being submitted than being published, and
- there is a bias for publishing significant results.

As such, *even if* a lower alpha value reduces the total number of hacked p values, as long as there are more p-hacked manuscripts being submitted than there are publication slots, the number of p-hacked papers being published will remain more or less stable – all thanks to publication bias.
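A toy model of this publication-bias argument (all numbers invented for illustration):

```python
def hacked_in_print(hacked_sig, sound_sig, slots):
    """Expected number of p-hacked papers that get published, assuming
    journals fill their slots at random from the pool of significant
    submissions. A deliberately crude model of publication bias."""
    total_sig = hacked_sig + sound_sig
    accept_rate = min(1.0, slots / total_sig)
    return hacked_sig * accept_rate

# Under alpha = 0.05: 300 hacked + 200 sound significant submissions, 100 slots.
print(hacked_in_print(300, 200, 100))  # 60 hacked papers in print
# Suppose alpha = 0.005 halves the hacked submissions - but sound
# significant submissions shrink too:
print(hacked_in_print(150, 100, 100))  # still 60 hacked papers in print
```

As long as significant submissions outnumber the available slots and shrink roughly proportionally, the hacked share of the published literature does not move.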

## Will an alpha of 0.005 reduce false positives?

In addition to my argumentation, Crane provides another line of reasoning to show that a reduced alpha will have a minimal effect on reducing false positives. See the figure below:

In the above figure, we have ‘persistence’ on the x-axis: the extent to which p values that were hacked under alpha = 0.05 will remain significant under alpha = 0.005 (a persistence of 0 means that no hacked p value remains significant, while 1 means that all hacked p values remain significant under alpha = 0.005). The y-axis shows the false positive rate. The left panel shows projected false positive rates under the assumption that 5% of the p values are unsound (h = 0.05), while the right panel assumes 15% (h = 0.15).

Benjamin et al. implicitly assumed a persistence of exactly 0, so their calculations correspond to the very left part of the graph. In that scenario, the false positive rate under alpha = 0.005 will be very low. However, as soon as we let go of this (false) assumption, the false positive rate begins to increase rather quickly.
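The shape of this argument can be sketched with a crude model – emphatically not Crane’s exact calculations; the parameter values here are my assumptions:

```python
def fpr_after_change(persistence, h=0.15, sound_fpr=0.10, alpha_ratio=0.1):
    """False positive rate among published significant results after moving
    alpha from 0.05 to 0.005. Crude sketch, not Crane's exact model: hacked
    results (share h, all counted as false positives) survive at rate
    `persistence`; sound false positives shrink by `alpha_ratio`; sound
    true positives are assumed to keep their numbers (power is maintained)."""
    sound = 1 - h
    false_pos = h * persistence + sound * sound_fpr * alpha_ratio
    true_pos = sound * (1 - sound_fpr)
    return false_pos / (false_pos + true_pos)

for rho in (0.0, 0.25, 0.5, 1.0):
    print(rho, round(fpr_after_change(rho), 3))
```

With these assumed numbers the rate climbs from roughly 1% at persistence 0 to roughly 17% at persistence 1 – worse than the 15% hacked share we started from, because the sound false positives shrank while the hacked ones did not.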

Remember that Benjamin et al. said that their proposal *“complements — but does not substitute for”* solutions to these problems? Apparently that is not how it works, at all. The projected positive effects of their proposal do not *complement* other solutions (such as pre-registration) but are *conditional* on the full eradication of p hacking. With this condition not being met, the projected positive effects are substantially diminished.

The assumption that a lower alpha will reduce the number of hacked p values seems to rely (at least partially) on the assumption that under alpha = 0.005 scientists will keep p hacking as if alpha were still 0.05. This is very unlikely. P hacking tends to result in p < 0.05 because that is the perceived standard of evidence. If you move the standard of evidence, p hacking behavior will follow suit.

In short, the extent to which a lower alpha will reduce the false positive rate depends critically on the *persistence of p hacking*. Given that this persistence is definitely non-zero, the claims by Benjamin et al. provide us only with an unrealistic *upper limit* on the reduction in false positive rates. The *actual* effect will be smaller – possibly much smaller, depending on the persistence of p hacking behavior.

Having said that, this does not mean that we should not lower alpha. It does mean that doing so becomes a second-order solution, which will work only conditional on other problems being solved first.

## The way forward

Calls for statistical reform are always welcome, but sometimes they are cute yet meaningless as a policy change. Is progress built on small, incremental steps? Yes, sometimes. Other times you need a comprehensive restructuring of the system.

Let me rephrase: is reducing alpha a bad policy? Not necessarily. Is justifying your alpha a bad policy? It’s a noble goal. Is either of these likely to be effective as a *policy*? I strongly doubt it.

Reforms need to be comprehensive. Specifically, reforms *have to* include a restructuring of the publication process. Without that restructuring you are still left with publication bias and hacked p values – a combination that puts a very strict upper limit on the potential effect of *any* proposed change.

So, what kind of comprehensive reform would I suggest? Surprise, surprise: **Registered Reports**. What are those again? (from http://www.timvanderzee.com/registered-reports)

Registered Reports are scientific articles which are peer-reviewed prior to any data collection. In practice this means that authors write the Introduction and Method sections of a paper and send them to a journal, together with all relevant materials and analysis scripts. Peers then review the manuscript, highlight any changes that need to be made, and decide to decline or provisionally accept the manuscript for publication. After the authors have finished data collection and data analysis and have written the Results and Discussion sections, they again send the manuscript in for review. As the manuscript was already provisionally accepted, it will be published on the sole condition that the authors followed their protocol.

Why would Registered Reports be the comprehensive reform that will succeed where others failed? Because by its very nature it does the following:

- Publication decisions can no longer be conditional on the outcomes: no, or near-zero, publication bias
- Data analysis decisions can no longer be conditional on the outcomes: no, or near-zero, p hacking
- Theoretical decisions can no longer be conditional on the outcomes: no, or near-zero, HARKing
- It clearly distinguishes confirmatory findings from exploratory findings (the latter *can* still be p hacked and HARKed, but it’ll be transparent)
- Because the Method is pre-registered *and* peer-reviewed prior to data collection, peers can actually positively impact the quality of study designs
- Unlike casual pre-registration, the reviewers/editors actively check whether the planned protocol was followed

Where some merely call for ‘more transparency’, Registered Reports fundamentally restructure the research cycle in such a way that, by their very nature, they (almost) completely eliminate p hacking (for confirmatory results) and publication bias, amongst other positive outcomes. That is what makes a policy reform effective – not just the noble goals, but a restructuring of the processes such that undesired behavior becomes nearly impossible.

Registered Reports are becoming increasingly common. They are already accepted by over 80 journals, and this number keeps growing.

**Be part of the revolution – run your studies as Registered Reports.**

*References*

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Cesarini, D. (2017). Redefine statistical significance. *Nature Human Behaviour*.

Lakens, D., Adolfi, F. G., Albers, C., Anvari, F., Apps, M. A., Argamon, S. E., … & Bradford, D. E. (2017). Justify Your Alpha: A Response to “Redefine Statistical Significance”. Pre-print at *PsyArXiv Preprints*.

Crane, H. (2017). Why “Redefining Statistical Significance” Will Not Improve Reproducibility and Could Make the Replication Crisis Worse. Pre-print at *PsyArXiv Preprints*.

I believe a p value of .05 is weak evidence against the null and not for the alternative (as stated under “Now what…”).

“6. Unlike casual pre-registration, the reviewers/editors actively check whether the planned protocol was followed”

I think that’s an assumption.

Furthermore, I think that’s not verifiable.

Furthermore, any journal that doesn’t publish RRs but does publish pre-registered papers could include a simple statement for reviewers to indicate 1) whether the paper contained a link to pre-registration information (y/n), 2) whether the reviewer checked this information (y/n), and 3) whether the paper reports everything correctly (y/n).

Of course, the latter doesn’t mean editors/reviewers actually check everything, which is why, most importantly, pre-registration information should be available to the actual reader so he/she can check it.

As someone on Psychmap called it: “picture or it didn’t happen”.

https://www.facebook.com/groups/108454679531463?view=permalink&id=501083246935269

Agreed, it is not always verifiable to the public. However, adherence to the pre-registered protocol can be made easily verifiable by uploading and time-stamping it before data collection (e.g., after it has been peer-reviewed and accepted).

In case you haven’t seen it yet, here is a link to a pre-print by Hardwicke & Ioannidis showing what I have been afraid of, and been trying to make clear for some time now, concerning the non-availability of pre-registration information to readers of Registered Reports:

https://osf.io/preprints/bitss/fzpcy/

Should it be useful in some way or form, and/or for your possible entertainment, I tried to get attention for this a final, perhaps somewhat desperate, time here (just before the pre-print of Hardwicke & Ioannidis was published):

https://www.psychologicalscience.org/observer/preregistration-becoming-the-norm-in-psychological-science#comment-8352965

I am done with it all, but I hope you will (continue to) keep a critical eye on things. Keep it open, keep it real. Thank you for all your efforts trying to help improve Psychological Science.

Kind regards,

Alexander A. Aarts

https://osf.io/eqbas/

Lowering alpha to .005 for already published studies makes sense to me, as these studies have already been carried out. Given (H0 & p-hacking), many p values tend to end up just below .05, so lowering alpha protects against them.

For NEW studies, indeed, lowering alpha to .005 is a bad idea, because scientists will adapt their research strategy; when including more variables, more analyses, etc., it is just as easy to obtain p < .005 as it is now to obtain p < .05.

https://www.bayesianspectacles.org/redefine-statistical-significance-part-xi-dr-crane-forcefully-presents-a-red-herring/

The “reply” by the “redefine statistical significance” folks (or at least one of them)….

Nicely argued! For the less informed (e.g., me): HARK = Hypothesizing After the Results are Known.

Hi Tim.

If I understand Harry’s calculations, they assume the typical p-hacker will respond to a change from .05 to .005 by increasing their n to maintain power. The assumption, then, is that moving to p < .005 would not discourage p-hacking practices (or may even encourage them).

My intuition is that the "typical" p-hacker is someone who is not that sophisticated, and quite possibly has little understanding of power to begin with. That is, I think the typical p-hacker is more the bumbling fool than the deliberate charlatan. And if that's true, expecting them to double their n in response to a change in alpha may be unfounded. They might instead focus their research a bit better on more valid questions, or look for safer options like joining the "let's replicate everything" crowd as a way to get published.

I think anyone clever about stats prefers to go on about how unclever everyone else is rather than use their 'powers' for evil. 😉 The idea that there are these diabolically clever p-hackers out there prepared to game the system no matter what strikes me as implausible.

And if I'm right about who the p-hackers are, the FPR advantage of a lower alpha is greater when power is low, making it fairly sensitive to the practices of people who are ignorant about power.

Now, one could argue that p-hackers are actually quite sophisticated indeed and are deliberately setting out to game the system by running complex designs with umpteen variables, low 'real' power, and then deliberately failing to adjust for multiple comparisons. And maybe there are some out there doing that. But I don't think it's a large number, maybe 5% tops, and certainly nowhere near 15%. I suppose you could argue in response that a lot ARE doing that, though they aren't doing it with diabolical intentions but rather exactly because they are bumbling, and I admit I wouldn't have a good riposte to that.

But overall, the reason I think the % p-hacked number is smaller relates to my sense of the proportion of the replicability problem that arises from p-hacking vs. the proportion that arises from other causes. I haven't looked at this too closely myself, but by all accounts there are a lot of underpowered studies out there, some of them massively underpowered. Lack of power, combined with p<.05, could well alone account for the vast majority of failures to replicate.

Interested to hear what you think.