I stumbled upon an article which used a Bonferroni correction to ‘control’ family-wise error rates. While this isn’t shocking by itself, I was happily surprised by how they applied it. This is what they wrote:
“In order to account for multiple comparisons, statistical significance was set at a p-value of less than 0.0056 (.05/9 tests) using Bonferroni correction for chi-square and McNemar tests.”
Geryk et al. (2016)
Note that they did not multiply their p values by some amount, but instead lowered the threshold for statistical significance. Raising the p values is the typical way these corrections are applied (I have been told it is the default in SPSS), but it is wrong.
To my recollection, this is the first and only time I have seen a paper lower the alpha instead of raising the p values. Importantly, lowering the alpha is the only correct way to apply these kinds of error rate corrections. Here’s why:
What does the alpha mean, anyway?
In the Neyman-Pearson approach, alpha is a value which you need to set a priori; for example at 0.05, 0.01, or even 0.005, if you are so inclined. The alpha is the probability of committing a Type I error (or ‘false positive’) in the long run. That is, were you to endlessly repeat the same procedure, you would incorrectly decide to reject the null hypothesis no more than alpha% of the time. The alpha is an upper limit, meaning that in the long run you will usually commit fewer than alpha% errors.
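This long-run behaviour is easy to see in a quick simulation (my own sketch, not taken from any paper): generate data under the null hypothesis, run a simple two-sided z-test many times, and count how often p falls below alpha.

```python
import math
import random

def z_test_p(sample):
    # Two-sided z-test of "mean = 0" for data with known sd = 1.
    z = sum(sample) / math.sqrt(len(sample))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = random.Random(1)
alpha, sims, n = 0.05, 20000, 30

# H0 is true in every simulated study, so every rejection is a Type I error.
false_positives = sum(
    z_test_p([rng.gauss(0, 1) for _ in range(n)]) < alpha
    for _ in range(sims)
)
rate = false_positives / sims
print(f"Long-run Type I error rate: {rate:.3f}")  # close to alpha = 0.05
```

The realized error rate hovers around alpha, but no single run tells you whether that particular rejection was an error.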
Importantly, this method of error control does not give you any information about individual tests. When you obtain a p value which is lower than your alpha, you do not know whether or not you are making a Type I error. In reality, you are either right or wrong. Long-term error control is just a frequentist method that we can use to, well, control our long-term error rates – it’s just a statistical concept to guide our decision making. If you want to model the evidence in favor of certain hypotheses based on individual trials, you need to calculate something other than p values (contrary to popular belief, this doesn’t require you to change religion; you’re just using somewhat different, but related, equations).
So what is this family-wise error control about?
Controlling your long-term error rates is key (in this approach, anyway), as you don’t want to fool yourself too many times. However, when you run a lot of tests, the probability that you will make at least one Type I error increases rapidly (again, you’ll never know if or when you are making such an error; it’s just a statistical concept that can help guide our decisions). This is called the family-wise error rate: the probability of making at least one Type I error across multiple tests. This probability is equal to 1 − (1 − alpha)^k, where k is the number of tests.
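Plugging a few values into this formula shows how quickly the family-wise error rate grows (a small illustrative snippet, with alpha = 0.05 and a few arbitrary values of k):

```python
alpha = 0.05
for k in (1, 3, 9, 20):
    # Probability of at least one Type I error across k independent tests.
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:2d} tests -> family-wise error rate = {fwer:.3f}")
```

At k = 9 (the number of tests in the quoted paper) the family-wise error rate is already around 0.37.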
If you want to, you can apply any number of corrections to “control” your family-wise error rates. A popular technique is the Bonferroni correction, in which you simply divide the alpha by the number of hypothesis tests you have performed. However, it is overly conservative, so you are better off using a multiple comparison correction such as the Holm-Bonferroni correction, which controls the family-wise error rate just as well while being less conservative. There are many similar corrections, as well as other methods of dealing with multiple comparisons, but I will not go into these here.
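As a sketch of how such a correction works in practice, here is a minimal, hand-rolled Holm-Bonferroni procedure (the function name and example p values are my own inventions; for real analyses you would use a vetted implementation such as statsmodels’ `multipletests`):

```python
def holm_rejections(p_values, alpha=0.05):
    # Holm step-down: compare the i-th smallest p value (0-indexed)
    # against alpha / (k - i); stop at the first non-rejection.
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    rejected = [False] * k
    for step, i in enumerate(order):
        if p_values[i] >= alpha / (k - step):
            break
        rejected[i] = True
    return rejected

ps = [0.004, 0.013, 0.030, 0.600]

# Plain Bonferroni: every p value faces the same threshold alpha / k.
print([p < 0.05 / len(ps) for p in ps])  # [True, False, False, False]

# Holm is less conservative: 0.013 survives its threshold of 0.05 / 3.
print(holm_rejections(ps))               # [True, True, False, False]
```

Note that both procedures shrink the threshold, not the p values themselves.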
Why you change the alpha, not the p value
Earlier I mentioned that the only correct way to go about this is to lower your alpha, and that it is incorrect to increase your p values. It does not make any sense to raise your p values; they have nothing to do with long-term error rates, only the alpha does. The p value gives you the probability of obtaining data at least as extreme as what you observed, assuming that the null hypothesis is true. When you are doing multiple tests, this value does not increase or decrease; it just is. This is unlike switching from a one-tailed to a two-tailed test; in that scenario your p value will change, because you are testing the data under a different hypothesis.
Another reason why you don’t change your p values for the sake of controlling error rates is that you can run into nonsensical situations. For example, if you have a p value of >0.5 and you want to correct it for k = 2 tests, you will end up with a p value of >1.0, which does not make any sense. A final, and very important, reason why you should never mess with your p values is that it undermines meta-analytical techniques such as p-curve analyses. While I have sometimes seen papers mention both the “corrected” and “uncorrected” p values (again, there is no such thing as a corrected p value), this remains horribly confusing and potentially misleading. Simply state the p values, and set the alpha threshold to whatever you want it to be. As long as you do it a priori, you are all set.
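To make the decision equivalence (and the nonsense) concrete, here is a tiny sketch using the alpha and k from the quoted paper; the p values themselves are made up:

```python
alpha, k = 0.05, 9
p = 0.004

# Correct approach: leave the p value alone and lower the threshold.
print(p < alpha / k)   # True: significant at the corrected alpha

# Multiplying p by k reaches the same decision here...
print(p * k < alpha)   # True

# ...but breaks down as soon as p > 1/k:
print(0.6 * 2)         # 1.2 -- no longer a probability
```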
Bonus: What is the biggest contributor to false positives?
It is not the typical level of alpha. Reducing alpha from 0.05 to 0.005 (as argued for by a whole bunch of people in a recent pre-print) will certainly reduce the number of Type I errors. It also has the very pleasant side-effect of forcing us to use bigger samples, thus increasing the reliability of parameter estimation and of research in general. This is all well and good, but the elephant in the room is researchers’ degrees of freedom (note: just because reducing the alpha level is not a perfect solution to every problem doesn’t mean it’s a bad suggestion!).
If no true relationship exists between two variables (i.e., H0 is true), we will on average still obtain up to alpha% false positives per analysis. Performing multiple comparisons is one possible fork in the so-called ‘garden of forking paths’ – the sequence of possible decisions, or forks, that influence the obtained results (see, e.g., Simmons, Nelson, & Simonsohn, 2011). The forking paths include not only different statistical analyses, but also decisions such as which participants to include or exclude, which measures to use, which interactions to consider, which cut-off points to use, how to detect outliers, and many others (I write more on this in van der Sluis, van der Zee, & Ginn, 2017). The alpha, or false positive error rate, is only controlled under the assumption that exactly one path is followed, decided upon a priori and not conditional on the data. Problematically, a dataset can usually be analyzed in so many different ways that we learn very little from finding a p value lower than alpha (Gelman & Loken, 2013).
Can we reduce false positives by reducing the alpha? Yes we can! But this requires that we know not only how many tests were performed, but the full garden of (hypothetical) forking paths, as they all influence the (also hypothetical) long-term error rates. Even in the extreme case where you run exactly one test and obtain p < alpha (lucky you, it’s publication time!), your long-term error rate is most likely still much higher than alpha. Why? Because it is a long-term error rate, and as such it depends on how you would hypothetically behave in the long run. Had you not obtained a significant effect, you might have checked for outliers, removed one or two participants, and run the hypothesis test again, and again, and again… until you obtained significance.
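This inflation from flexible behavior can be illustrated with a toy simulation. Here I use repeated looks at a growing sample (testing “again, and again” as more data come in) rather than outlier removal, since it is the simplest such flexibility to code up; the z-test and all numbers are my own assumptions:

```python
import math
import random

def z_test_p(sample):
    # Two-sided z-test of "mean = 0" for data with known sd = 1.
    z = sum(sample) / math.sqrt(len(sample))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def flexible_run(rng, looks=(20, 40, 60, 80, 100), alpha=0.05):
    # Test at n = 20; if not significant, collect more data and test
    # again -- stopping as soon as any look crosses the threshold.
    sample = []
    for n in looks:
        sample += [rng.gauss(0, 1) for _ in range(n - len(sample))]
        if z_test_p(sample) < alpha:
            return True   # declared "significant" at some look
    return False

rng = random.Random(7)
sims = 5000

# H0 is true throughout, so every "significant" run is a false positive.
rate = sum(flexible_run(rng) for _ in range(sims)) / sims
print(f"Realized Type I error rate: {rate:.3f}")  # well above the nominal 0.05
```

Each individual test uses alpha = 0.05, yet the procedure as a whole commits false positives far more often than 5% of the time.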
The cruel nature of this problem is that you don’t actually need to have done these multiple tests; even your hypothetical behavior influences the hypothetical long-term error rates. Long runs are weird like that.
How to really control long-term error rates
The solution to the above problem is, in my opinion, straightforward (although not simple): we need to control our degrees of freedom. In practice, this means that we need to pre-register as much as possible, and we need to be held accountable for it. I will go one step further and state that Registered Reports are the only kind of publication that can come close to true error rate control. Editors and peer reviewers also influence which analyses we (potentially) run, and as such they influence our long-term behavior and error rates. Registered Reports are the only one amongst the many, many proposed solutions that comes close to eliminating the most influential biases: publication bias, researchers’ degrees of freedom, HARKing, p-hacking, and other questionable research practices.
I started out by explaining what the alpha parameter means. If you want to make meaningful use of it, you will want to make use of Registered Reports.
Geryk, L. L., Arrindell, C. C., Sage, A. J., Blalock, S. J., Reuland, D. S., Coyne-Beasley, T., & Carpenter, D. M. (2016). Exploring youth and caregiver preferences for asthma education video content. Journal of Asthma, 53(1), 101-106.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
van der Sluis, F., van der Zee, T., & Ginn, J. (2017, April). Learning about Learning at Scale: Methodological Challenges and Recommendations. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale (pp. 131-140). ACM.
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.