What are long-term error rates and how do you control them?

I stumbled upon an article which used a Bonferroni correction to ‘control’ family-wise error rates. While this isn’t shocking by itself, I was pleasantly surprised by how they applied it. This is what they wrote:

“In order to account for multiple comparisons, statistical significance was set at a p-value of less than 0.0056 (.05/9 tests) using Bonferroni correction for chi-square and McNemar tests.”

Geryk et al. (2016)

Note that they do not multiply the p values by some amount, but instead lower the threshold for statistical significance. Inflating the p values is the typical way this correction gets applied (I have been told it is the standard in SPSS), but it is wrong.

To my recollection, this is the first and only time I have seen a paper lower the alpha instead of raising the p values. Importantly, this is the only correct way to use this kind of error rate correction. Here’s why:

What does the alpha mean, anyway?

In the Neyman-Pearson approach, alpha is a parameter which you need to set a priori; for example at 0.05, 0.01, or even 0.005, if you are so inclined. The alpha is the probability of committing a Type I error (or ‘false positive’) in the long run. That is, were you to endlessly repeat the same procedure, you would incorrectly decide to reject the null hypothesis no more than alpha% of the time. The alpha is an upper limit, meaning that in the long run you will usually commit fewer than alpha% errors.
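
To make the ‘long run’ concrete, here is a minimal simulation sketch in Python (assuming numpy and scipy are available; the sample size of 30 per group and the seed are arbitrary choices for illustration). Both groups are drawn from the same distribution, so the null hypothesis is true by construction and every rejection is a Type I error:

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha = 0.05
n_studies = 100_000

rejections = 0
for _ in range(n_studies):
    # Both groups come from the same distribution, so H0 is true by construction.
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        rejections += 1  # every one of these rejections is a Type I error

print(f"Long-run Type I error rate: {rejections / n_studies:.3f}")
# Prints a value close to (and not systematically above) 0.05.

The printed rate hovers around 0.05; it describes the procedure over many repetitions, not any single test.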

Importantly, this method of error control does not give you any information about individual tests. When you obtain a p value which is lower than your alpha you do not know whether or not you are making a Type I error. In reality, you are either right or wrong. Long-term error control is just a frequentist method that we can use to, well, control our long-term error rates – it’s just a statistical concept to guide our decision making. If you want to model the evidence in favor of certain hypotheses based on individual trials you need to be calculating things other than p values (contrary to popular belief it doesn’t require you to change religion; you’re just using somewhat different, but related, equations).

So what is this family-wise error control about?

Controlling your long-term error rates is key (in this approach anyway), as you don’t want to fool yourself too many times. However, when you run a lot of tests the probability that you will make at least one Type I error increases rapidly (again, you’ll never know if/when you are making such an error; it’s just a statistical concept that can help us guide our decisions). This is called the family-wise error rate: the probability of making at least one Type I error across multiple tests. For independent tests, this probability is equal to 1 − (1 − alpha)^k, where k is the number of tests.
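
To see how quickly this grows, here is a tiny sketch of that formula in Python (it assumes the k tests are independent):

alpha = 0.05
for k in (1, 5, 9, 20):
    # Probability of at least one Type I error across k independent tests
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:2d} tests -> P(at least one Type I error) = {fwer:.3f}")

# k =  1 tests -> P(at least one Type I error) = 0.050
# k =  5 tests -> P(at least one Type I error) = 0.226
# k =  9 tests -> P(at least one Type I error) = 0.370
# k = 20 tests -> P(at least one Type I error) = 0.642

With the nine tests from the article quoted above, the family-wise error rate would already be around 0.37 without any correction, if the tests were independent.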

If you want to, you can apply any number of corrections to “control” your family-wise error rates. A popular technique is the Bonferroni correction, in which you simply divide the alpha by the number of hypothesis tests you have performed. However, it is overly conservative, so you are better off using a multiple comparison correction such as the Holm-Bonferroni correction, which controls the same family-wise error rate while being less conservative. There are many similar corrections, as well as other methods of dealing with multiple comparisons, but I will not go into these here.
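
To show what ‘lowering the alpha’ looks like in practice, here is a hedged sketch of both corrections in Python (the p values are made up for illustration). Note that the p values themselves are never touched; only the thresholds they are compared against change:

alpha = 0.05
p_values = [0.001, 0.012, 0.021, 0.040, 0.300]
k = len(p_values)

# Bonferroni: compare every p value against alpha / k.
bonferroni = [p < alpha / k for p in p_values]

# Holm-Bonferroni: sort the p values and compare the i-th smallest against
# alpha / (k - i); stop rejecting at the first non-significant test.
holm = [False] * k
order = sorted(range(k), key=lambda i: p_values[i])
for rank, i in enumerate(order):
    if p_values[i] < alpha / (k - rank):
        holm[i] = True
    else:
        break  # all remaining (larger) p values are non-significant as well

print("Bonferroni:     ", bonferroni)  # rejects only the smallest p value
print("Holm-Bonferroni:", holm)        # also rejects the second smallest

With these made-up numbers, Bonferroni rejects only the smallest p value, while Holm-Bonferroni also rejects the second smallest; that is what ‘less conservative’ means here, and in neither case is any p value altered.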

Why you change the alpha, not the p value

Earlier I mentioned that the only correct way to go about this is to lower your alpha, and that it is incorrect to increase your p values. It does not make any sense to raise your p values; they have nothing to do with long-term error rates, only the alpha does. The p value gives you the probability of obtaining the observed data (or more extreme data) assuming that the null hypothesis is true. When you are doing multiple tests this value does not increase or decrease; it just is. This is unlike switching from a one-tailed to a two-tailed test; in that scenario your p value does change, because you are testing the data under a different hypothesis.

Another reason why you don’t change your p values for the sake of controlling error rates is that you can run into nonsensical situations. For example, if you have a p value of >0.5 and you want to correct it for k = 2 tests, you will end up with a p value of >1.0, which does not make any sense. A final, and very important, reason why you should never mess with your p values is that it messes up meta-analytical techniques such as p-curve analyses. While I have sometimes seen papers mention both the “corrected” and “uncorrected” p values (again, there’s no such thing as a corrected p value), it remains horribly confusing and potentially misleading. Simply state the p values, and set the alpha threshold to whatever you want it to be. As long as you do it a priori, you are all set.

Bonus: What is the biggest contributor to false positives?

It is not the typical level of alpha. Reducing alpha from 0.05 to 0.005 (as argued for by a whole bunch of people in a recent pre-print) will certainly reduce the number of Type I errors. It also has the very pleasant side-effect of forcing us to use bigger samples, thus increasing the reliability of parameter estimation and of research in general. This is all well and good, but the elephant in the room is researchers’ degrees of freedom (note: just because reducing the alpha level is not a perfect solution to every problem doesn’t mean it’s a bad suggestion!).

If no true relationship exists between two variables (i.e., H0 is true) we will, in the long run, still obtain false positives in at most alpha% of our analyses. Performing multiple comparisons is just one fork in the so-called ‘garden of forking paths’: the sequence of possible decisions, or forks, that influence the obtained results (see e.g., Simmons, Nelson, & Simonsohn, 2011). The forking paths do not only include different statistical analyses, but also decisions such as which participants to include or exclude, which measures to use, which interactions to consider, which cut-off points to use, how to detect outliers, and many others (I write more on this in van der Sluis, van der Zee, & Ginn, 2017). The alpha, or false positive error rate, is only controlled under the assumption that exactly one path is followed, a path which is decided upon a priori and is not conditional on the data. Problematically, a dataset can usually be analyzed in so many different ways that we learn very little from finding a p value lower than alpha (Gelman & Loken, 2013).

Can we reduce false positives by reducing the alpha? Yes we can! But this requires that we not only know how many tests were performed, but the full garden of (hypothetical) forking paths, as they all influence the (also hypothetical) long-term error rates. Even in the extreme case where you run exactly one test and obtain p < alpha (lucky you, it’s publication time!), your long-term error rate is most likely still much higher than alpha. Why? Because it is a long-term error rate, and as such it depends on how you would hypothetically behave in the long run. Had you not obtained a significant effect, you might have checked for outliers, removed 1 or 2 participants, and run the hypothesis test again, and again, and again… until you obtained significance.

The cruel nature of this problem is that you don’t actually need to have done these multiple tests; even your hypothetical behavior influences the hypothetical long-term error rates. Long runs are weird like that.
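
Here is a rough sketch of that behavior in Python (assuming numpy and scipy; the 2 SD outlier cut-off, sample sizes, and seed are arbitrary choices for illustration). The forking path is a single, very mild one: only when the first test comes out non-significant do we drop outliers and test again. Even so, the long-run Type I error rate creeps above the nominal alpha:

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
alpha, n_studies = 0.05, 50_000

def drop_outliers(x):
    """Remove observations more than 2 SD from the sample mean."""
    return x[np.abs(x - x.mean()) < 2 * x.std()]

rejections = 0
for _ in range(n_studies):
    a = rng.normal(size=30)
    b = rng.normal(size=30)  # same distribution, so H0 is true
    _, p = stats.ttest_ind(a, b)
    if p >= alpha:
        # The fork: the outlier-free re-analysis only happens when the
        # first test 'fails', i.e. the decision is conditional on the data.
        _, p = stats.ttest_ind(drop_outliers(a), drop_outliers(b))
    if p < alpha:
        rejections += 1

print(f"Long-run Type I error rate: {rejections / n_studies:.3f}")
# Prints a value above the nominal 0.05.

The inflation here is modest because this is one mild fork; with a full garden of forking paths the long-run error rate can climb far higher.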

How to really control long-term error rates

The solution to the above problem is, in my opinion, straightforward (although not simple): we need to control our degrees of freedom. In practice, this means that we need to pre-register as much as possible, and we need to be held accountable for it. I will go one step further and state that Registered Reports are the only kind of publication that can come close to true error rate control. Editors and peer reviewers also influence which analyses we (potentially) run, and as such they influence our long-term behavior and error rates. Registered Reports are the only one amongst the many, many proposed solutions that comes close to eliminating the most influential biases: publication bias, researchers’ degrees of freedom, HARKing, p-hacking, and other questionable research practices.

I started out by explaining what the alpha parameter means. If you actually want to use it, you want to make use of Registered Reports.

 

References

Geryk, L. L., Arrindell, C. C., Sage, A. J., Blalock, S. J., Reuland, D. S., Coyne-Beasley, T., … & Carpenter, D. M. (2016). Exploring youth and caregiver preferences for asthma education video content. Journal of Asthma, 53(1), 101-106.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

van der Sluis, F., van der Zee, T., & Ginn, J. (2017, April). Learning about learning at scale: Methodological challenges and recommendations. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale (pp. 131-140). ACM.

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.

Comments

  1. I have a question regarding the forking paths. It seems the underlying assumption is that researchers are either not aware of the multiple tests and parameter settings that can be chosen, or that they search for a minimum p-value across these possibilities. In my experience most scientists are aware, particularly anyone dealing with time series, but are often unsure how to best deal with these multiple possibilities. I was taught by my PhD supervisor to always test a range of parameter settings (like filter settings for EEG analysis) to check whether an effect is robust against these variations, and I advise students likewise. If significant effects are only observed for a particular parameter setting, it is likely an analysis artefact. However, the situation is more complex when a range of p-values is observed, many of which are below alpha. How do you then choose the most appropriate parameter setting? Do you follow previous studies or do you select the most ‘conservative’ option (e.g. least amount of filtering or simplest statistical model)? Such a decision inevitably comes up in most of the research projects I’m involved in. Most researchers have their own personal preferences, but is there a more systematic solution to this problem? For example, if you perform the same statistical test for 100 different parameter settings and find p-values between 0.001 and 0.01, surely you shouldn’t try controlling your long-term error rates by setting alpha to 0.05/100 = 0.0005 and conclude the effect is not statistically significant. Likewise, how do you determine these settings a priori when submitting a registered report? And what do you do when these settings are not appropriate for the data you subsequently collect? Do you still report the results for alternative settings? In my opinion, the solution is not to constrain the degrees of freedom, but rather to develop a formal approach that appropriately utilises these degrees of freedom.

    • Thanks for your excellent comment!

      You are right: the forking path problem is very complicated, and it is hard to decide how to deal with it. Testing a range of parameter settings is a sure way to rapidly increase your error probabilities. There are multiple ways to deal with this, such as simply reporting all the tests you have done, correcting the alpha (not by just dividing it by the number of tests, but with something like Holm-Bonferroni), or actually modeling what you are trying to do. These methods already exist. For example, in Bayesian statistics you could simply use model averaging across all possible parameter settings. Furthermore, we have to remember not to think only in terms of hypothesis testing, as this is rarely useful without appreciating the parameter estimates of key variables. How much do the key outcome variables vary when you make different methodological decisions?

      Based on what I understand from your description, it sounds like researchers’ degrees of freedom are a serious problem. Pre-registration/Registered Reports are a good way to reduce them. Is this hard? Sure. But you don’t need to predict everything in advance; you can make a decision tree, or decide a priori that you will report a lot of (corrected) tests. I am not sure what you mean by “what do you do when these settings are not appropriate”, but it highlights the issue that we use post-data results to influence decisions that we should make prior to seeing the data, such as which hypotheses we are testing and at which alpha level. What we end up doing is making our methodologies conditional on the data. This essentially guarantees that we are fooling ourselves, although we won’t know to what extent.

      I know this sounds harsh, but when we are not actually able to control error rates, we should seriously reconsider why we are trying to do so at all. If we invalidate the alpha, we can’t use it.

  2. Just to clarify, I am in favour of Registered Reports. While I haven’t used them myself yet, I think they offer a great opportunity for researchers to strengthen their interpretations.

    I indeed believe that researchers’ degrees of freedom are real and pervasive. However, rather than considering them a serious problem, I think they offer an opportunity when dealt with appropriately. With increasing computational power it is possible to explore large areas of the parameter space (or forking paths) and accumulate information across these settings (or paths). For example, EEG is generally measured using a large number of electrodes (e.g. 64 channels). It is possible to predict a priori in which of these channels you expect an experimental effect (and you could specify this in a Registered Report) and only analyse this single channel. Alternatively, you could investigate all channels (and time points) and aggregate this information to identify areas (or clusters) of parameter space where certain experimental relationships hold. For an example, see e.g. the cluster-based permutation test implemented in FieldTrip: http://www.fieldtriptoolbox.org/tutorial/cluster_permutation_freq. I’d expect that similar tools will become available to simultaneously test numerous forking paths and aggregate statistical results across them. From your response I gather that this is already available for Bayesian statistics, but this should also be possible for the frequentist approach, no?
