UPDATE 1 (19/07/2016): Changed the R script to assume equal variances (because they are equal). This favors the p-values, but doesn’t affect the conclusions. I also fixed an error in the Type I error tables: Hits per Type I error are now almost always in favor of BF. Updated the tables to show more information.

UPDATE 2 (19/07/2016): Made more changes based on the very helpful comments by Alexander Etz (read his blog, especially how to become a Bayesian in eight easy steps). Changes summarized in the Conclusions section. Also added two references, in case you want to read more.

UPDATE 3 (20/07/2016): Read this post by Alexander Etz where he explains why you don’t even need the simulations I used below, but can solve the question with arithmetic.

I don’t think statistics is a necessary evil. It is necessary, for sure, and obviously it is also quite evil. But it is also so much more than that. How we use statistics is intrinsically related to how we think about evidence and science in general.

As for me, I want to quantify evidence for the things I study. This is why I am moving towards Bayesian statistics: it allows you to do exactly that. However, I was ‘raised’ as a Frequentist, which still influences my thinking much more than I think (and want) it does. One of those core Frequentist ideas is error control over imaginary long-run frequencies.

Some, like Daniel Lakens (read his blog; very educational) say long run error control is very important and p-values are better at it than are Bayes Factors. I have put that to the test, and conclude otherwise. What follows is a set of simulations regarding the Frequentist error rate properties of p-values and Bayes Factors.

## Rate of Type II errors of p-values and Bayes Factors

A Type II error is the failure to reject a false null hypothesis: a false negative. If you get a non-significant p-value when there is a true effect, that is a Type II error; for p-values, the Type II error rate is simply one minus the power of the study. For Bayes Factors this is more complicated, as they are continuous measures of relative evidence and not meant to be dichotomized. Nevertheless, for the purpose of this exercise I counted a BF10 lower than 1/3 as a Type II error. (Note: BF10 is the relative evidence for the alternative hypothesis, H1, compared to the null hypothesis, H0.)
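For clarity, the cut-offs used throughout this post can be written as a small helper function (Python here purely for illustration; the 1/3 and 3 thresholds are the ones from the text, the function name is mine):

```python
def bf_category(bf10):
    """Classify a Bayes Factor BF10 using the 1/3 and 3 cut-offs from the text."""
    if bf10 < 1 / 3:
        return "support for H0"  # counted as a Type II error when a true effect exists
    if bf10 > 3:
        return "support for H1"
    return "inconclusive"        # neither threshold reached: withhold judgment

print(bf_category(0.2))  # support for H0
print(bf_category(1.0))  # inconclusive
print(bf_category(5.0))  # support for H1
```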

Each time, 10,000 studies were simulated with two groups drawn from normal distributions (sd = 1). To estimate Type II errors I varied the true effect size (Cohen’s *d*) from 0.2 to 1 in steps of 0.2. I also varied n to achieve roughly 25%, 50% and 75% power. I calculated p-values (equal variances assumed) as well as Bayes Factors. For the prior on Cohen’s *d* I used a Cauchy distribution with a scale of sqrt(2)/2, roughly .707. This means you would expect half of the effect sizes to be smaller and half to be larger than .707, so most of these simulations are biased *against* Bayes Factors. More often than not I would choose a smaller scale myself, such as .4 (which, as far as I know, is the median effect size for educational interventions).
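The frequentist half of this procedure can be sketched in a few lines (Python here for illustration; the actual analysis used the R script at the end of the post). The specific numbers, d = 0.5 with n = 64 per group, are my own example (they give roughly 80% power), and the t-distribution critical value is approximated with the normal one, which is reasonable at this sample size:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
d, n, n_sims = 0.5, 64, 2000          # true effect, per-group n, simulated studies
z_crit = NormalDist().inv_cdf(0.975)  # ~1.96, normal approximation to the critical t

hits = 0
for _ in range(n_sims):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(d, 1) for _ in range(n)]
    # pooled two-sample t statistic (equal variances assumed, as in the post)
    sp2 = (stdev(x) ** 2 + stdev(y) ** 2) / 2
    t = (mean(y) - mean(x)) / (sp2 * 2 / n) ** 0.5
    hits += abs(t) > z_crit

print(hits / n_sims)  # estimated power; the Type II error rate is 1 minus this
```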

The results clearly show that Bayes Factors consistently outperform p-values in terms of Type II error rate. That is, if you use a Bayes Factor of 1/3 or smaller to conclude there is no effect, you will make erroneous decisions (much) less often. Surprisingly, for effect sizes of Cohen’s *d* = 0.6 and larger, you will rarely if ever make a Type II error. Bayes Factors do seem to have more trouble with small effect sizes. However, note that with our prior we put 50% weight on effects smaller and larger than .707, respectively. If you expect a smaller effect size, you will scale down your prior and again have extremely good Type II error control.

Do note that you will *always* more easily get a significant p-value than a Bayes Factor of 3 or larger. This is primarily because p-values are biased against the null hypothesis, whilst Bayes Factors aren’t.

UPDATE: Some (e.g. Ulrich Schimmack) strongly claim that inconclusive BF should also be considered errors, as they also fail to reject a false null hypothesis. I think this is a good point. However, it depends on how you define Type II errors and how you use BF. If you consider an inconclusive BF an ‘error’, then it is certainly true that using BF comes with a very high error rate, as you will often have inconclusive data. Frankly, I fail to see how saying “I don’t have enough evidence yet” can be considered an error. Normally we do not know the true effect size, or even whether there is any effect at all. Given limited data, saying the data is inconclusive is not just ‘not an error’, it’s the most accurate statement.

As an additional test I calculated the ratio of Hits per Error:

For p-values, the ratio of Hits per Type II error is a function of power. With 25% power you get 1 significant p-value (hit) per 3 non-significant p-values (Type II errors), so the ratio is 1/3. For 50% and 75% power this ratio is 1 and 3, respectively. With large effect sizes you never make Type II errors with Bayes Factors, meaning the Hits per Error ratio is undefined in those cases.
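The arithmetic behind these ratios is simply power divided by the Type II error rate; a quick check (Python, illustration only):

```python
for power in (0.25, 0.50, 0.75):
    ratio = power / (1 - power)  # hits per Type II error
    print(power, ratio)          # 0.25 -> 1/3, 0.50 -> 1, 0.75 -> 3
```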

If we assume that only BF > 3 are ‘hits’, then BF outperform p-values in all cases except d = 0.2. Again, this is because this test is biased *against* BF due to the wide prior. However, if we assume that inconclusive BF are also hits, BF always outperform p-values. There is something to be said for this, as with an inconclusive BF you can always just keep sampling.

## Rate of Type I errors of p-values and Bayes Factors

A Type I error is the incorrect rejection of a true null hypothesis: a false positive. For p-values the Type I error rate is given by the alpha level, which is usually 5%. Again, we need to calculate these rates for Bayes Factors. As there is no true effect, power cannot be calculated, so I simply used a range of arbitrarily chosen sample sizes.

Clearly, Bayes Factors substantially outperform p-values in terms of Type I error control. Even for very small samples of n = 10 or 20 per group, using BF>3 is better than using p<0.05 as a cut-off point. Note how as the sample size increases the proportion of BF smoothly goes from almost exclusively inconclusive to accurately supporting the null hypothesis.

UPDATE: Likewise, you could argue that inconclusive BF should be considered errors here as well. However, the definition of a Type I error (as I’ve always read it in various books) is the incorrect rejection of a true null. An inconclusive BF does not reject the true null, and thus isn’t an error. Furthermore, I want to emphasize again that Bayes Factors are meant to be used as continuous measures of relative evidence. I am perfectly fine with saying ‘inconclusive’ when the data are indeed inconclusive.

UPDATE 2: BF > 3 simply requires a larger critical t value than p < 0.05 does. As such, it will necessarily have a lower Type I error rate.
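One way to make this scale difference visible without the full Cauchy-prior machinery is the BIC approximation to the Bayes Factor, under which BF10 ≈ (1 + t²/df)^(N/2) / √N for a two-sample t-test with N observations in total. To be clear: this is an approximation, not the Cauchy-prior BF used in the simulations, and the choice of n = 100 per group below is mine. The sketch (Python, with 1.96 as the normal approximation to the critical t) shows that the t needed for BF10 > 3 is well above the t needed for p < 0.05:

```python
import math

def bf10_bic(t, n_per_group):
    """BIC approximation to BF10 for a two-sample t-test (an approximation only)."""
    N = 2 * n_per_group
    df = N - 2
    return (1 + t ** 2 / df) ** (N / 2) / math.sqrt(N)

n = 100
t_crit = 1.96                 # approximate critical t for p < 0.05
print(bf10_bic(t_crit, n))    # well below 3: p < 0.05 does not imply BF10 > 3

# smallest t (in steps of 0.01) with BF10 > 3, found by simple search
t = t_crit
while bf10_bic(t, n) < 3:
    t += 0.01
print(t)                      # noticeably larger than 1.96
```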

Again, I also calculated the Hits per Error.

With p-values the Hits per Type I error ratio is always 19 hits to 1 error, assuming you use an alpha of 5%. Of course, this assumes you specify your sampling plan, outlier removal, and analysis plan a priori, and that you run a single analysis. In any other case your Type I error rate will inflate, sometimes quite dramatically, and the Hits per Type I error ratio will shrink accordingly. For n >= 50 per group, p-values outperform BF. Again, this is because BF are much more conservative, whilst p-values very eagerly refute the null hypothesis. Note that p-values only *weakly* outperform BF in most cases.

However, if we assume that inconclusive BF are hits, BF substantially and consistently perform better. Again, it is sensible to do so: when there is no effect, an inconclusive BF is more accurate than it is erroneous. Furthermore, if you decide to keep sampling, you will necessarily gather more evidence in favor of the null (which is something p-values cannot do).
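The 19:1 figure is just (1 − α)/α, and the "sometimes quite dramatically" can be made concrete with the familywise error rate 1 − (1 − α)^k for k independent looks at null data. A quick illustration (Python; the choice of k = 5 uncorrected analyses is my own example):

```python
alpha = 0.05
print((1 - alpha) / alpha)  # 19 correct non-rejections per Type I error, one pre-specified test

k = 5                                # e.g. five uncorrected analyses of the same null data
fwer = 1 - (1 - alpha) ** k          # familywise Type I error rate, ~0.23
print(fwer, (1 - fwer) / fwer)       # the 19:1 ratio drops to roughly 3.4:1
```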

## Conclusions

Based on my (somewhat odd) comparison of p<0.05 and BF>3, we can conclude the following for error rates:

**Conclusion 1: BF > 3 leads to fewer Type I and Type II errors than p < 0.05.**

However, if we also take into account the number of accurate ‘hits’, the story becomes slightly more complicated. By hits I mean a p < 0.05 or BF > 3 when there is a true effect (d > 0), and a p > 0.05 or BF < 1/3 when there is not (d = 0).

Starting with the latter: when d = 0 you will easily get many accurate non-significant p-values, more than the number of BF < 1/3. As such, p-values could be considered better in this case, were it not that you cannot interpret a non-significant p-value in any useful way. In contrast, a BF < 1/3 is actually very informative, as it quantifies the evidence in favor of the null. The bar is higher with BF < 1/3, so you will reach it less easily, but once you do, it is worth it.

Secondly, if d > 0, you will sometimes get more accurate hits with BF than with p-values, depending on the BF prior. If the prior is sensitive to the true effect size, BF > 3 is a better cut-off for dichotomous decisions than p < 0.05, as long as the sample size is sufficient. With priors that are not so sensitive to the true effect size, and/or with smaller sample sizes, p < 0.05 outperforms BF > 3 as a decision rule. Again, this is partly because BF > 3 is simply a much steeper criterion than p < 0.05, so it is typically harder to reach. However, you can also make stronger claims with BF > 3 than with p < 0.05. I think it is fair to summarize this as follows:

**Conclusion 2: BF > 3 is more conservative than p < 0.05, and as such requires more evidence to reach. What this means differs depending on the true effect size:**

**Conclusion 2a: If d = 0, it is (much) harder to get a BF < 1/3 than it is to get a non-significant p-value. However, the former can be interpreted as evidence for the null, while a non-significant p-value cannot be used as such.**

**Conclusion 2b: If d > 0, most of the time it will be harder to get a BF > 3 than it is to get a significant p-value. However, the former also constitutes more evidence for the finding.**

**References**

Here are two references in case you want to read more about this and similar topics:

**R Script:** (I might update this later with a much more powerful script; if you’re interested, contact me)

```r
#This script is adapted from a script made by Daniel Lakens.
require(BayesFactor)
options(scipen = 20) #disable scientific notation for numbers

nSims <- 10000 #number of simulated experiments (defined first, as the matrices below use it)
cohensd <- 0.0 #set true effect size
n <- 1000 #sample size in each group

t.dat <- matrix(NA, nSims, 1) #empty matrix to store p-values
bf.dat <- matrix(NA, nSims, 1) #empty matrix to store Bayes Factors

for (i in 1:nSims) { #for each simulated experiment
  x <- rnorm(n = n, mean = 0, sd = 1) #produce n simulated participants
  y <- rnorm(n = n, mean = cohensd, sd = 1) #produce n simulated participants
  z <- t.test(x, y, var.equal = TRUE) #perform the t-test
  BF10 <- exp(ttest.tstat(z$statistic, n, n, rscale = sqrt(2)/2)$bf)
  t.dat[i, ] <- z$p.value
  bf.dat[i, ] <- BF10
}

#Percentages of significant and non-significant p-values, respectively
psign <- sum(round(t.dat, 9) < 0.05) / nSims * 100
pnonsign <- sum(round(t.dat, 9) >= 0.05) / nSims * 100

#Percentages of BF10 below 1/3, between 1/3 and 3, and above 3, respectively
BF13 <- sum(round(bf.dat, 9) < 1/3) / nSims * 100
BF133 <- sum(round(bf.dat, 9) >= 1/3 & round(bf.dat, 9) < 3) / nSims * 100
BF3 <- sum(round(bf.dat, 9) >= 3) / nSims * 100

#Give output
cat("Percentage of significant p-values: ", psign,
    "%.\nPercentage of non-significant p-values: ", pnonsign,
    "%.\nPercentage of BF10 lower than 1/3: ", BF13,
    "%.\nPercentage of BF10 between 1/3 and 3: ", BF133,
    "%.\nPercentage of BF10 higher than 3: ", BF3, "%.", sep = "")
```

Just a heads up on the script: it doesn’t work if you copy and paste it directly. t.dat and bf.dat need to go below nSims, as they both call nSims. Move those two lines below nSims and it works fine.

A few things are wrong with this that collectively suggest your study does not prove Bayes > Freq when it comes to independent-sample t-tests. I actually did similar simulations recently with a larger range of sample sizes and effect sizes with much smaller increments, AND I did not stack the deck towards p-values with prior power analyses. I also used the same prior for the test as you, and used non-Welch varieties of the frequentist t-test. You can see consistently in my 3D mesh plots, as you vary the BF01 and alpha thresholds, that any difference in Type I vs Type II error rates is only a function of the different scales. In other words, if you approximate the p-value on the Bayes scale using Bayes Factor Bounds (BFB), you get comparable Type I and Type II error rates.

Where do I think you went wrong? First, a personal thing: you have your Bayes factors backwards. BF01, the standard, is the ratio of the null probability to the alternative, so if the alternative is much more likely you should have BF01 < 1.0. You are using the inverse, the BF10 factor. Second, as others have mentioned, a Type II error is the ‘failure to reject an inaccurate null hypothesis’. So the ‘inconclusive’ section in the relevant graph SHOULD be considered a Type II error. With those corrections, your data suggest (on the surface) that the Bayes factor has a better Type I error rate while p-values have a better Type II error rate.

However, you used alpha = 0.05 and BF10 = 3 as your thresholds. What you’d find if you used the proper BFB equivalent in these studies, BF10 ~ 2.0, is that the error rates of each type are roughly equivalent. My thoughts? The Bayes factor, for THIS very specific application with Cauchy priors, isn’t a more powerful test to scientists like myself; you just used a more stringent threshold as an artifact of their different scales!