Sequential sampling with Bayes Factors: effects on error rates and parameter bias

By @RickCarlsson and @Research_Tim.

This post was written by Rickard and me. We might be wrong. As always: don’t believe something just because you’ve read it somewhere on the internet. Let us know if and how we are wrong.

How many participants do you need for your study? There are two ways to decide: fix the sample size a priori, or use sequential sampling. Many have been warned that the latter invalidates your analyses, or at least inflates your p-values if you use those. However, with proper error control you can do sequential sampling with frequentist statistics. In this post we will consider sequential sampling in Bayesian statistics. Controlled error rates and other frequentist desiderata are not part of Bayesian data analysis, but that doesn’t stop the curious frequentist, or the statistical agnostic, from looking into it.

Fixed sample size

A common way of running a study is the ‘fixed sample size’ method, which can be informed by power analyses, habits, or in the worst case: flair. In a previous post, Tim showed how Bayes Factors fare much better than p-values in terms of controlling false-positives (Type I errors), but this is simply because BF > 3 requires more evidence than p < 0.05.

The fixed sample size approach has various drawbacks, such as easily being either inefficient (when you run a larger study than necessary) or underpowered (when your sample is too small). Due to publication bias and QRPs, the literature contains biased parameter estimates, which greatly limits the practical use of power analyses.

Sequential sampling

A more efficient sampling method is to recruit participants in smaller batches and check after each batch whether there is sufficient evidence. Because of the efficiency and practical appeal of optional stopping, it is good to know how this method behaves. We will look at how this can be done using Bayes Factors as the stopping criterion, and how this affects long-term error rates and parameter bias.

Simulation parameters

Analysis and hypotheses: We simulate data for default Bayesian t-tests for independent groups. In every test, H0 states that the effect is exactly 0, and the alternative H1 is a Cauchy distribution centered on 0 with a scale of 0.707. This prior puts 50% of its mass on effect sizes between -0.707 and 0.707, and the remaining 50% on more extreme values. For simplicity’s sake we will not vary this prior, but later we will discuss how it affects your outcomes.
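
As a concrete illustration (a sketch, not our exact simulation script), this is what a single default Bayesian t-test with the 0.707-scale Cauchy prior looks like in R with the BayesFactor package:

```r
library(BayesFactor)

# Check the 50%-mass claim for a Cauchy prior with scale 0.707
pcauchy(0.707, scale = 0.707) - pcauchy(-0.707, scale = 0.707)  # = 0.5

# One default Bayesian t-test for two independent groups with that prior
set.seed(123)
x <- rnorm(50, mean = 0.5)  # hypothetical group with a true d of 0.5
y <- rnorm(50, mean = 0)    # control group
ttestBF(x = x, y = y, rscale = sqrt(2) / 2)  # rscale = 0.707...
```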

Population parameters: We fixed the population effect at the following “t-shirt” sizes, Cohen’s d = 0 (nude), 0.2 (small), 0.5 (medium), 0.8 (large), and 1.5 (extra large).

Bayes Factor criteria: We ran the simulations for two sets of criteria: BF > 3 and BF < 1/3 (in favor of H1 and H0 respectively), as well as BF > 10 and BF < 1/10. These seem to be common thresholds. Do note that there is no inherent reason to use any threshold at all, because the BF can be interpreted directly. However, in this scenario we need an a priori threshold to satisfy the imaginary reviewers of our pre-registration.

Sampling method: Our sampling method is as follows: we start with an initial batch of 25 participants in each group, run the Bayesian t-test, and check whether the BF has reached the criterion. If not, we add 5 more participants to each group, re-run the analysis, and repeat the whole process. If the criterion has still not been reached at n = 655 per group, we stop the simulation; we chose this upper limit because it gives 95% power to detect a small t-shirt effect (d = 0.2). A minimal sketch of this procedure in R is shown below.
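
The sketch below uses our own function and variable names (it is not the script we actually ran) and assumes BayesFactor’s ttestBF() and extractBF():

```r
library(BayesFactor)

# Sequential sampling as described above: start with 25 per group, test,
# then add 5 per group until BF10 > crit, BF10 < 1/crit, or n = 655 per group.
run_sequential <- function(d, crit = 3, n_start = 25, n_step = 5,
                           n_max = 655, rscale = sqrt(2) / 2) {
  x <- rnorm(n_start, mean = d)  # group with true effect d
  y <- rnorm(n_start, mean = 0)  # control group
  repeat {
    bf <- extractBF(ttestBF(x = x, y = y, rscale = rscale))$bf
    n  <- length(x)
    if (bf > crit)     return(list(decision = "H1", n = n, bf = bf))
    if (bf < 1 / crit) return(list(decision = "H0", n = n, bf = bf))
    if (n >= n_max)    return(list(decision = "undecided", n = n, bf = bf))
    x <- c(x, rnorm(n_step, mean = d))
    y <- c(y, rnorm(n_step, mean = 0))
  }
}

# Example: one simulated study with a medium true effect and the BF > 3 rule
set.seed(1)
run_sequential(d = 0.5, crit = 3)
```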

Outcomes: In terms of outcomes there are several things of interest:

  1. False-positive (Type I error): defined as supporting the H1 whilst H0 is true.
  2. False-negative (Type II error): defined as supporting the H0 whilst H0 is false.
  3. Magnitude error (Type M error): defined as the extent of the mismatch between the Cohen’s d in the sample versus the ‘true’ Cohen’s d.

Note: Both false-positives and false-negatives are tricky to define in the Bayesian framework; we’ll come back to this later. For tallying the simulations, though, a simple labelling scheme (sketched below) suffices.
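
A toy classifier for one simulated study, using the hypothetical run_sequential() from the sketch above and the ‘true’ d of the simulation:

```r
# Label one simulated study's outcome, given its decision and the true d
classify <- function(decision, true_d) {
  if (decision == "H1" && true_d == 0) return("false positive (Type I)")
  if (decision == "H0" && true_d != 0) return("false negative (Type II)")
  if (decision == "undecided")         return("undecided")
  "no binary error"
}

classify(run_sequential(d = 0, crit = 3)$decision, true_d = 0)
```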

Simulation Results

Error rates

[Figure: error rates and sample sizes for the BF > 3 / BF < 1/3 criteria]

First, we’ll look at false-positives (d = 0). Because BF > 3 requires a larger t-value than p < .05, it is conservative if we sample once with a fixed n. With optional stopping, we find H1 to be favored in 5.96% of the cases. The overall median sample size is 30 per group, which is very efficient. The 5.96% false-positive rate is somewhat higher than you would get in a fixed design with p-values and an alpha of 5%; this is the price you pay for flexibility and efficiency.

The results for d = 0.2 are interesting. Intuitively, one might say that in 77% of the cases we incorrectly support the null. However, remember that the scale of the prior on the effect size was 0.707. In this case, the data tend to support H0, and correctly so, because it is in fact a better model than the alternative: the true effect of 0.2 is closer to 0 than to 0.707. With d = 0.5 we see exactly the reverse: 77% of the simulations support the alternative, and rightfully so, because the null is the worse model. Still, it is clear that you can get ‘unwanted’ BFs. Finally, the results for 0.8 and 1.5 are more than excellent, especially given the lenient criterion.

[Figure: error rates and sample sizes for the BF > 10 / BF < 1/10 criteria]

Now let’s look at what happens if we sample until BF > 10 or BF < 1/10. When d is 0, we only get 2.99% false-positives, but we also get a fair amount (13%) of simulations that remain undecided because they reach the maximum sample size. Interestingly, the 3% false positives have a very low sample size of 70, compared to 295 in the analyses that correctly favor H0. Sometimes you just get an extreme sample, and if you are that unlucky the study will terminate very early (this is just how random sampling works; do independent replications to counteract it).

Because the prior is so far off, a true d of 0.2 causes your experiment to run very long. Unlike with the BF > 3 criterion, you now mostly end up supporting H1, or remain undecided. Finally, the results for 0.5 and larger could not be better, as you get perfect scores with small sample sizes.

Take away point 1: In Bayesian statistics, you typically compare specific hypotheses; that is, hypotheses much less vague than “d is not 0”. However, for some true effects the prior we used was still very vague. If ‘the truth’ is badly described by the hypotheses, the Bayes Factor can give odd results, including a high ‘error’ probability from a frequentist point of view, depending on the BF criterion.

To summarize, sampling until BF > 3 is not conservative compared to classical statistics (Type I error of about 6%). Further, for a weak true effect size, there is little chance of accepting the alternative. In contrast, sampling until BF > 10 keeps the Type I error rate around 3% and the Type II error rate below 20%. Further, in all cases, the median sample size will be smaller than for a fixed-N design with 80% or more power and alpha = .05.

Parameter bias

Up to now, we have been stuck in the realm of binary decisions, which is not all that informative. Here we will consider parameter estimation, and especially the extent of bias we get due to sequential sampling. This is what we call the Type M error: the distance between the sample d and the true d. It’s important to note that we are looking at the sample d here and not the posterior d. This is equivalent to using flat priors, which is something we would normally argue against in (almost) all cases. However, for simplicity’s sake we will use the sample d as the parameter estimate; we will leave posteriors for an upcoming blog post.
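
For reference, the sample d here is the usual pooled-standard-deviation Cohen’s d, and the Type M error is simply its distance from the true d (a sketch with our own function names):

```r
# Pooled-standard-deviation Cohen's d for two independent groups
cohens_d <- function(x, y) {
  sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
             (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / sp
}

# Type M error: how far the sample d lands from the true d
type_m_error <- function(x, y, true_d) cohens_d(x, y) - true_d
```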

For d = 0, the effect sizes average to zero for both BF > 3 and BF > 10. There is a peculiar dip in the distribution of effect sizes for BF > 3, though, making it bimodal with peaks at small positive and negative effects. However, the distribution is perfectly symmetrical and the average is thus zero. The effect size distributions for BF > 10 don’t show this dip, only a few mild peaks around zero.

Moving on to the small t-shirt (d = 0.2), we see that for BF > 3 it is underestimated, whereas it is slightly overestimated for BF > 10. For d = 0.5, we see that both BF > 3 and BF > 10 optional stopping lead to overestimation. The reason for over- and underestimation is easy to understand. For example, at n = 30 per group we can only stop when the sample d is above about 0.63 (reaching BF > 3) or below about 0.19 (reaching BF < 1/3). The effect sizes in between are effectively censored. When the procedure favors the null while the null is false, it will tend towards underestimation. When it favors the alternative, and the alternative is ‘true’, it will tend towards overestimation. With BF > 3, there is a substantial ‘bump’ of estimates slightly above 0. These are the samples that end up supporting H0, due to unlucky sampling. This does not occur for BF > 10, because this stricter criterion makes such a large unlucky sample extremely unlikely.
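
You can check such cut-offs yourself. The sketch below (our own wrapper, under the assumption that BayesFactor’s ttest.tstat() with simple = TRUE returns the plain BF10 for a given t-statistic) uses uniroot() to find the sample d at which each stopping bound is hit at n = 30 per group:

```r
library(BayesFactor)

# BF10 for an independent-groups t-test given a sample Cohen's d
# (for two equal groups of size n, t = d * sqrt(n / 2))
bf_from_d <- function(d, n, rscale = sqrt(2) / 2) {
  ttest.tstat(t = d * sqrt(n / 2), n1 = n, n2 = n,
              rscale = rscale, simple = TRUE)
}

# Sample d at which BF10 crosses 3 (stop for H1) at n = 30 per group
uniroot(function(d) bf_from_d(d, 30) - 3,     interval = c(0, 2))$root
# Sample d below which BF10 drops under 1/3 (stop for H0)
uniroot(function(d) bf_from_d(d, 30) - 1 / 3, interval = c(0, 2))$root
```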

The plots for d = 0.8 and 1.5 are overall much better, with 1.5 being identical for both criteria because of how easy it is to gather sufficient evidence for such a large effect.

The last two plots show the difference between the sample d‘s and the true d‘s. Values close to 0 mean little to no bias.

Take away point 2: Sequential sampling based on the BF tends to bias parameter estimates. The extent of this bias depends on various factors, such as the difference between the hypotheses you are comparing (with smaller differences allowing more bias), the prior (the stronger and more inaccurate the prior, the more bias), the BF criterion, and after how many participants you run the test(s).

Take away point 3: Fixed sample size designs are either inefficient (when you run a larger study than necessary) or underpowered (when your sample is too small). Sequential sampling designs are (much) more efficient, but introduce upwards parameter bias (when you terminate early) or downwards parameter bias (when you terminate late). Take your pick.

And now what?

There is always error in parameter estimation, and in many cases there is bias due to the specific modeling choices. Had we run the simulations without optional stopping, there would also be error. In theory, we seek to remove all sources of bias; in practice we seek to balance bias against efficiency. We can’t decide for you how that balance should be struck. Nevertheless, some general recommendations and remarks:

  • Overall, sequential sampling is more efficient than a fixed sample size. In fact, it is 50-70% more efficient than NHST designs with a priori fixed sample sizes. You also don’t need to rely on effect sizes from the literature, which are biased (upwards) due to publication bias.
  • Sequential sampling based on the BF has an impact on parameter estimation, which can be quite substantial under certain conditions. Studies that stop early cause overestimation, while late terminations cause underestimation. Note that this is a characteristic of sequential designs in general. In short: don’t use sequential sampling (based on the BF) for accurate parameter estimation.
  • If you are interested in accurate parameter estimation, this bias can of course be a serious problem. Note that for this goal you generally need very large samples anyway (e.g. upwards of 250), such that the bias caused by sequential sampling becomes relatively small, given that you will start with a much larger sample to begin with.
  • You don’t need to use a BF (or p-value) as the criterion for sequential sampling. You can also aim for accuracy, and sample until the credible interval is sufficiently small (see the sketch after this list). If and how this affects parameter estimation might be a topic for a new post.
  • Always start with a relatively large initial batch of participants; only then start adding new participants in smaller batches.
  • Always use multiple priors.
  • A BF of 3 is not a whole lot of evidence. It’s typically stronger and more informative than a p < 0.05, but it’s still not a lot of evidence. This can be especially dangerous when you are looking for small effects. If resources allow, aim for a BF of 10 at the least.
  • One of the most serious forms of bias is not statistical at all: publication bias. Always replicate your, and others’, studies. Attempt to pre-register your studies at journals whenever possible. Fight the bias!
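
For the accuracy-oriented stopping rule mentioned above, a rough sketch (our own names throughout; it assumes BayesFactor’s posterior() sampler and its ‘delta’ column for the standardized effect) could look like this:

```r
library(BayesFactor)

# Width of the 95% credible interval for the standardized effect (delta)
ci_width_delta <- function(x, y, rscale = sqrt(2) / 2, iterations = 5000) {
  post <- posterior(ttestBF(x = x, y = y, rscale = rscale),
                    iterations = iterations)
  unname(diff(quantile(as.numeric(post[, "delta"]),
                       probs = c(0.025, 0.975))))
}

# Keep adding participants until the credible interval is narrow enough
sample_until_precise <- function(d, target_width = 0.4, n_start = 50,
                                 n_step = 10, n_max = 1000) {
  x <- rnorm(n_start, mean = d)
  y <- rnorm(n_start, mean = 0)
  repeat {
    width <- ci_width_delta(x, y)
    if (width <= target_width || length(x) >= n_max)
      return(list(n_per_group = length(x), ci_width = width))
    x <- c(x, rnorm(n_step, mean = d))
    y <- c(y, rnorm(n_step, mean = 0))
  }
}
```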

References:

We used Greg Francis’ calculator to get the critical d and t values mentioned in the post.

In R we used the BayesFactor package by Richard Morey to run the analyses.

Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47(5), 609-612.

Schönbrodt, F. D., Wagenmakers, E. J., Zehetleitner, M., & Perugini, M. (2015). Sequential Hypothesis Testing With Bayes Factors: Efficiently Testing Mean Differences.

Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301-308.
