Error Control: p-values versus Bayes Factors

UPDATE 1 (19/07/2016): Changed the R script to assume equal variances (because they are equal). This favors the p-values, but doesn’t affect the conclusions. I also fixed an error in the Type I error tables: Hits per Type I error is now almost always in favor of BF. Updated the tables to show more information.

UPDATE 2 (19/07/2016): Made more changes based on the very helpful comments by Alexander Etz (read his blog, especially how to become a Bayesian in eight easy steps). Changes summarized in the Conclusions section. Also added two references, in case you want to read more.

UPDATE 3 (20/07/2016): Read this post by Alexander Etz where he explains why you don’t even need the simulations I used below, but can solve the question with arithmetic.

I don’t think statistics is a necessary evil. It is necessary, for sure, and obviously it is also quite evil. But it is also so much more than that. How we use statistics is intrinsically related to how we think about evidence and science in general.

As for me, I want to quantify evidence for the things I study. This is why I am moving towards Bayesian statistics: it allows you to do exactly that. However, I was ‘raised’ as a Frequentist, which still influences my thinking much more than I think (and want) it does. One of those core Frequentist ideas is error control over imaginary long-run frequencies.

Some, like Daniel Lakens (read his blog; very educational), say that long-run error control is very important and that p-values are better at it than Bayes Factors are. I have put that to the test, and conclude otherwise. What follows is a set of simulations regarding the Frequentist error rate properties of p-values and Bayes Factors.

Rate of Type II errors of p-values and Bayes Factors

A Type II error is the failure to reject a false null hypothesis: a false negative. If you get a non-significant p-value for a true effect, that is a Type II error. For p-values the Type II error rate is directly tied to power: it equals 1 minus the power of the study. For Bayes Factors this is more complicated, as they are continuous measures of relative evidence and not meant to be dichotomized. Nevertheless, for the purpose of this exercise I counted a BF10 lower than 1/3 as a Type II error. (Note: BF10 is the relative evidence for the alternative hypothesis, or 1, compared to the null hypothesis, or 0.)

For each condition, 10,000 studies were simulated with two groups drawn from normal distributions (sd = 1). To estimate Type II error rates I varied the true effect size (Cohen’s d) from 0.2 to 1 in steps of 0.2, and I varied n to achieve roughly 25%, 50% and 75% power. I calculated p-values (assuming equal variances) as well as Bayes Factors. For the prior on Cohen’s d I used a Cauchy distribution with a scale of sqrt(2)/2, roughly .707. This prior expects half of the effect sizes to be smaller and half to be larger than .707, which means most of these simulations are biased against the Bayes Factors. More often than not I would choose a smaller scale myself, such as .4 (which, as far as I know, is the median effect size for educational interventions).
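As an aside, in case you want to find such per-group sample sizes yourself, here is a minimal sketch using base R’s power.t.test. This is just one way to do it; the exact n values behind the tables below are not necessarily the ones this snippet returns.

#Sketch: per-group n needed for roughly 25%, 50% and 75% power at each effect size
#(illustrative only; not necessarily the exact values used in the tables)
effect.sizes <- seq(0.2, 1, 0.2)
target.power <- c(0.25, 0.50, 0.75)
n.per.group <- outer(effect.sizes, target.power, Vectorize(function(d, pow) {
  ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05, power = pow)$n)
}))
dimnames(n.per.group) <- list(paste0("d = ", effect.sizes), paste0(target.power * 100, "% power"))
n.per.group #per-group sample sizes for a two-sample t-test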

Type II errors

The results clearly show that Bayes Factors consistently outperform p-values in terms of the Type II error rate. That is, if you use a Bayes Factor of 1/3 or smaller to conclude there is no effect, you will (much) less often make an erroneous decision. Surprisingly, for effect sizes of Cohen’s d = 0.6 and larger, you will rarely if ever make a Type II error. Bayes Factors do seem to have more trouble with small effect sizes. However, note that with this prior we put 50% of the weight on effects smaller than .707 and 50% on effects larger than .707. If you expect a smaller effect size, you would downscale your prior and again have extremely good Type II error control.

Do note that you will always more easily get a significant p-value than a Bayes Factor of 3 or larger. This is primarily because p-values are biased against the null hypothesis, whilst Bayes Factors aren’t.

UPDATE: Some (e.g. Ulrich Schimmack) strongly claim that inconclusive BF should also be considered errors, as they also fail to reject a false null hypothesis. I think this is a good point. However, it depends on how you define Type II errors and how you use BF. If you think an inconclusive BF is an ‘error’, then it is certainly true that using BF comes with a very high error rate, as you will often have inconclusive data. Frankly, I fail to see how saying “I don’t have enough evidence yet” can be considered an error. Normally we do not know the true effect size, or even whether there is an effect at all. Given limited data, saying the data are inconclusive is not just ‘not an error’; it is the most accurate statement you can make.

As an additional test I calculated the ratio of Hits per Error:

Hits per Type II error

For p-values, the ratio of Hits per Type II error is a function of power: with 25% power you get 1 significant p-value (hit) for every 3 non-significant p-values (Type II errors), so the ratio is 1/3. For 50% and 75% power the ratio is 1 and 3, respectively. With large effect sizes you never make Type II errors with Bayes Factors, meaning the Hits per Error ratio is undefined (division by zero) in those cases.
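To make the arithmetic explicit, the p-value ratio is simply power / (1 − power):

#Hits per Type II error for p-values: power / (1 - power)
power <- c(0.25, 0.50, 0.75)
power / (1 - power) #gives 1/3, 1 and 3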

If we assume that only BF > 3 are ‘hits’, then BF outperform p-values in all cases except with d = 0.2. Again, this is because the test is biased against BF due to the wide prior. However, if we assume that inconclusive BF are also hits, BF always outperform p-values. There is something to be said for this, as with an inconclusive BF you can always just keep sampling.

Rate of Type I errors of p-values and Bayes Factors

A Type I error is the incorrect rejection of a true null hypothesis: a false positive. For p-values the Type I error rate is given by the alpha level, which is usually 5%. Again, we need to estimate this rate for Bayes Factors. As there is no true effect, power cannot be calculated, so I simply used a range of arbitrarily chosen sample sizes.

Type I errors

Clearly, Bayes Factors substantially outperform p-values in terms of Type I error control. Even for very small samples of n = 10 or 20 per group, using BF > 3 is better than using p < 0.05 as a cut-off point. Note how, as the sample size increases, the distribution of BF shifts smoothly from almost exclusively inconclusive to accurately supporting the null hypothesis.

UPDATE: Likewise, you could argue that inconclusive BF should be considered errors as well. However, the definition of a Type I error (as I have always read it in various books) is the incorrect rejection of a true null. An inconclusive BF does not reject the true null, and thus isn’t an error. Furthermore, I want to emphasize again that Bayes Factors are to be used as continuous measures of relative evidence. I am perfectly fine with saying ‘inconclusive’ when the data are indeed inconclusive.

UPDATE 2: BF > 3 simply requires a larger critical t value than p < 0.05. As such, it will necessarily have a lower Type I error rate.
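A quick way to see this for yourself is the sketch below, which uses the same ttest.tstat function as the script at the end of this post; the n = 50 per group is an arbitrary example. It computes the critical t for p < 0.05 and the smallest t for which BF10 reaches 3:

require(BayesFactor)

n <- 50 #arbitrary per-group sample size
df <- 2 * n - 2
t.crit.p <- qt(0.975, df) #critical t for a two-sided p < 0.05

#smallest t for which BF10 >= 3 under the Cauchy prior with scale sqrt(2)/2
bf10 <- function(t) exp(ttest.tstat(t, n, n, rscale = sqrt(2)/2)$bf)
t.crit.bf <- uniroot(function(t) bf10(t) - 3, interval = c(0.1, 10))$root

c(t.for.p.05 = t.crit.p, t.for.BF.3 = t.crit.bf) #the BF > 3 threshold is the larger one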

Again, I also calculated the Hits per Error.

Hits per Type I error

With p-values the Hits per Type I error ratio is always 19 hits to 1 error, assuming an alpha of 5% (95% correct non-rejections for every 5% false positives). Of course, this assumes you specify your sampling plan, outlier removal and analysis plan a priori, and that you run exactly one analysis. In any other case your Type I error rate will inflate, sometimes quite dramatically, and the ratio will drop accordingly (see the sketch below). For n >= 50 per group, p-values outperform BF. Again, this is because BF are much more conservative, whilst p-values very eagerly refute the null hypothesis. Note that p-values only weakly outperform BF in most cases.

However, if we assume that inconclusive BF are hits, BF substantially and consistently perform better. Again, it is sensible to do so: when there is no effect, an inconclusive BF is more accurate than it is an error. Furthermore, if you decide to keep sampling you will gather more and more evidence in favor of the null (which is something p-values cannot do).
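To illustrate the earlier point that the 19-to-1 ratio only holds with a pre-specified sampling plan and a single analysis, here is a small sketch (not part of the original simulations) of what happens to the Type I error rate when you test after every 10 participants per group and stop at the first p < .05:

#Sketch: Type I error inflation under optional stopping (d = 0)
set.seed(1)
n.studies <- 5000 #number of simulated studies
looks <- seq(10, 100, 10) #per-group sample sizes at which the data are tested
false.positive <- replicate(n.studies, {
  x <- rnorm(max(looks)) #both groups come from the same null distribution
  y <- rnorm(max(looks))
  any(sapply(looks, function(k) t.test(x[1:k], y[1:k], var.equal = TRUE)$p.value < 0.05))
})
mean(false.positive) #well above the nominal 5%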

Conclusions

Based on my (somewhat odd) comparison of p<0.05 and BF>3, we can conclude the following for error rates:

  • Conclusion 1: BF > 3 leads to fewer Type I and Type II errors than p < 0.05.

However, if we also take into account the number of accurate ‘hits’, the story becomes slightly more complicated. By hits I mean a p < 0.05 or BF > 3 when there is a true effect (d > 0), and a p > 0.05 or BF < 1/3 when there is not (d = 0). Starting with the latter: when d = 0 you will easily get many accurate non-significant p-values, more than the number of BF < 1/3. As such, p-values could be considered better in this case, were it not that you cannot interpret a non-significant p-value in any useful way. In contrast, a BF < 1/3 is actually very informative, as it quantifies the evidence in favor of the null. The bar is higher with BF < 1/3, so you will reach it less easily, but once you do it is worth it. Secondly, if d > 0, you will sometimes get more accurate hits with BF than with p-values, depending on the BF prior. If the prior is sensitive to the true effect size, BF > 3 is a better cut-off for dichotomous decisions than p < 0.05, as long as the sample size is sufficient. With priors that are not so sensitive to the true effect size, and/or with smaller sample sizes, p < 0.05 outperforms BF > 3 as a decision rule. Again, this is partly because BF > 3 is simply a stricter criterion than p < 0.05, so it is typically harder to reach. However, you can make stronger claims with BF > 3 than with p < 0.05. I think it is fair to summarize this as follows:

  • Conclusion 2: BF > 3 is more conservative than p < 0.05, and as such requires more evidence to reach. What this means differs depending on the true effect size:

  • Conclusion 2a: If d = 0, it is (much) harder to get a BF < 1/3 than it is to get a non-significant p-value. However, the former can be interpreted as evidence for the null, while a non-significant p-value cannot be used as such.

  • Conclusion 2b: If d > 0, most of the time it will be harder to get a BF > 3 than it is to get a significant p-value. However, the former also constitutes more evidence for the finding.

References

Here are two references in case you want to read more about this and similar topics:

Schönbrodt, F. D., Wagenmakers, E. J., Zehetleitner, M., & Perugini, M. (2015). Sequential Hypothesis Testing With Bayes Factors: Efficiently Testing Mean Differences.
Schönbrodt, F. D., & Wagenmakers, E. J. (2016). Bayes Factor Design Analysis: Planning for Compelling Evidence.

R Script: (I might update this later with a much more powerful script; if you’re interested contact me)

#This script is adapted from a script made by Daniel Lakens. 

require(BayesFactor)

options(scipen=20) #disable scientific notation for numbers

#Set the simulation parameters first (they are needed to build the storage matrices)
nSims <- 10000 #number of simulated experiments
cohensd <- 0.0 #true effect size (d = 0 for Type I errors; d > 0 for Type II errors)
n <- 1000 #sample size in each group

t.dat <- matrix(NA, nSims, 1) #makes an empty matrix to store p-values
bf.dat <- matrix(NA, nSims, 1) #makes an empty matrix to store Bayes Factors

for(i in 1:nSims){ #for each simulated experiment
  x<-rnorm(n = n, mean = 0, sd = 1) #produce N simulated participants
  y<-rnorm(n = n, mean = cohensd, sd = 1) #produce N simulated participants
  z<-t.test(x, y, var.equal = TRUE) #perform the t-test
  BF10 <- exp(ttest.tstat(z$statistic, n, n, rscale = sqrt(2)/2)$bf) #BF10 from the t statistic; ttest.tstat returns the log BF, hence exp()
  t.dat[i, ] <- z$p.value
  bf.dat[i, ] <- BF10
}

#Percentages of significant and non-significant p-values, respectively
psign <- sum(round(t.dat, 9) < 0.05) / nSims * 100
pnonsign <- sum(round(t.dat, 9) >= 0.05) / nSims * 100

#Percentages of BF10 below 1/3, between 1/3 and 3, and above 3, respectively
BF13 <- sum(round(bf.dat, 9) < 1/3) / nSims * 100
BF133 <- sum(round(bf.dat, 9) >= 1/3 & round(bf.dat, 9) < 3) / nSims * 100
BF3 <- sum(round(bf.dat, 9) >= 3)  / nSims * 100

#Give output
cat("Percentage of significant p-values: ", psign, 
    "%.\nPercentage of non-significant p-values: ", pnonsign, 
    "%.\nPercentage of BF10 lower than 1/3: ", BF13, 
    "%.\nPercentage of BF10 between 1/3 and 3: ", BF133, 
    "%.\nPercentage of BF10 higher than 3: ", BF3, "%.", sep="")