What are confidence intervals? It is the time interval between when you have successfully learned what confidence intervals mean and when you start to realize that you’ve never fully understood them.

As a master’s student I had to read Geoff Cumming’s “New Statistics” book, which prominently featured confidence intervals. To this day I still don’t fully understand why someone would advocate relying on them – it seems to be (at least partly) based on common misconceptions.

Yesterday someone notified me about a question regarding confidence intervals based on the idea that if the confidence intervals of two parameters overlap you can conclude that these parameters are *not* significantly different from each other. This is, however, wrong. We’ll get to why this is, but first we are going on a little adventure.

### What do confidence intervals *really* mean? (Part I)

Confidence intervals give a range of values which will “in the long run” contain the population value of a parameter a certain % of the time.

That likely did not help much, did it?

Put differently, if you construct a hundred 95% CIs, then about 95 of these intervals will contain the population value. Typically we only calculate a single CI, and not a hundred. Problematically, this gives us no way of knowing whether our single CI happens to be one of the lucky intervals which contain the population value. It either does or it doesn’t, but we don’t know.
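The “long run” idea can be made concrete with a small simulation. This is a minimal sketch, assuming normally distributed data with a known population mean (the values of `mu`, `sigma`, `n` and the function name `coverage_rate` are made up for illustration). It builds many intervals and counts how many contain the population value.

```python
import random
import statistics

def coverage_rate(n_intervals=2000, n=50, mu=10.0, sigma=2.0, seed=1):
    """Share of 95% CIs (mean +/- 1.96 * SE) that contain the known mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_intervals):
        # Draw one sample and build one interval around its mean.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if mean - 1.96 * se <= mu <= mean + 1.96 * se:
            hits += 1
    return hits / n_intervals

print(coverage_rate())
```

The printed share should land close to 0.95, but any single interval in the loop either contains `mu` or it does not.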

Importantly, the *specific values* of a calculated CI should not be directly interpreted. If we have a 95% CI of [4.0, 6.0] we can *not* interpret these limits. We can also *not* state that there is a 95% probability that the true population value is between 4 and 6, nor that we can have 95% confidence in this claim (whatever that means).

What helped me to better understand what CIs mean is to realize that the “95%” part does not refer to the limits that you calculate, but to the *procedure of calculating these limits*. We can be 100% confident that this procedure will, in the long run, provide us with limits that contain the population value 95% of the time.

We’ll come back to interpreting CIs again later.

### Calculating a 95% confidence interval

We’ll start with the textbook example of a CI: a 95% CI of a mean. While the % can be any number and you can calculate it for any parameter, the 95% CI of the mean appears to be the only thing most people will ever calculate.

The formula to calculate a 95% CI for a mean is: $\bar{X} \pm 1.96 \times \frac{s}{\sqrt{n}}$, where $\bar{X}$ is the sample mean, s the standard deviation, and n the sample size. Dividing the standard deviation (s) by the square root of the sample size (n) simply gives us the standard error (SE). I like to present the formula in this format as you can deduce that the interval will become smaller as the sample size increases (or when you have a lower standard deviation, for example when removing outliers).
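The calculation (the sample mean plus or minus 1.96 standard errors) can be sketched in a few lines. The function name `ci95_from_summary` and the example numbers are assumptions for illustration only.

```python
import math

def ci95_from_summary(mean, s, n):
    """95% CI for a mean from the sample mean, standard deviation and size."""
    se = s / math.sqrt(n)          # standard error of the mean
    return mean - 1.96 * se, mean + 1.96 * se

# Same mean and standard deviation, increasing sample size:
for n in (16, 64, 256):
    print(n, ci95_from_summary(5.0, 2.0, n))
```

Running it shows the interval narrowing as n grows, because n only enters through the standard error.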

Why do we use the standard error and not just the standard deviation, you ask? Excellent question! (Literally nobody ever asks such questions, but I’m just mimicking statisticians who like to pretend that people do.) If you want to get a confidence interval of the *sample*, you use the standard deviation of the sample. However, we are not interested in the sample but in the population. The standard error is simply the standard deviation of the sample means; because we are interested in the population (of sample means) we use the standard error.
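To see that the standard error really is the standard deviation of sample means, here is a rough simulation. It is a sketch that assumes normal data with made-up values for `mu`, `sigma` and `n`; the name `compare_se` is hypothetical.

```python
import random
import statistics

def compare_se(mu=100.0, sigma=15.0, n=25, n_samples=4000, seed=42):
    """Return (SD of simulated sample means, sigma / sqrt(n))."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(n_samples)
    ]
    sd_of_means = statistics.stdev(means)   # spread of the sample means
    theoretical_se = sigma / n ** 0.5       # the standard error formula
    return sd_of_means, theoretical_se

print(compare_se())
```

The two numbers should be close to each other, and both are much smaller than the standard deviation of the individual scores.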

### What do confidence intervals *really, really* mean? (Part II)

The above helps us to refine our understanding of CIs. The calculation of CIs is based on the central limit theorem, which states that as the sample size increases the distribution of a sample parameter (in this case, the mean) will start to approximate a normal distribution.

What does this mean? That the validity of calculating a CI relies on the assumption of normality. Importantly, this does not mean that the scores *in the sample* need to be normally distributed, but that the *scores in the population of samples* needs to be (approximately) normally distributed. Luckily, this is very often (approximately) the case if your sample size is large enough, but things like skew and outliers can dramatically ‘slow down’ the process of having a large enough sample to get this distribution.
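A quick way to see the effect of skew is to check how often the 1.96 × SE interval covers the true mean when the data are strongly skewed. This sketch assumes exponentially distributed scores with a true mean of 1, and the function names are hypothetical. Note that at very small n part of the shortfall also comes from estimating s from only a few observations.

```python
import math
import random

def z_interval_covers(sample, true_mean):
    """True if mean +/- 1.96 * SE of this sample contains true_mean."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return mean - half <= true_mean <= mean + half

def skewed_coverage(n, reps=2000, seed=7):
    """Coverage of the 1.96 * SE interval for exponential data (true mean 1)."""
    rng = random.Random(seed)
    hits = sum(
        z_interval_covers([rng.expovariate(1.0) for _ in range(n)], 1.0)
        for _ in range(reps)
    )
    return hits / reps

for n in (5, 20, 200):
    print(n, skewed_coverage(n))
```

Coverage should fall clearly short of 95% for the small samples and move towards it as n increases.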

In short, constructing confidence intervals gives us ranges of data that will contain the population mean X% of the time, assuming that the population of sample means is normally distributed.

### Calculating any X% confidence interval

Above we looked at the 95% confidence interval, but there is no reason to limit ourselves to 95%. What if we want to calculate a 90% confidence interval?

To calculate a CI for any particular ‘confidence level’ we can use this formula: $\bar{X} \pm z_{(1-p)/2} \times \frac{s}{\sqrt{n}}$, where p is the confidence level that you are interested in and $z_{(1-p)/2}$ is the (absolute) z value that cuts off a tail probability of (1-p)/2. In the case of a 95% CI you would thus look up the z value of (1-0.95)/2 = 0.025. Looking up the z value for a standard normal distribution (as that is the assumed distribution of sample means) gives us a z of 1.96. That is why a 95% CI is calculated with the 1.96 multiplier.
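Instead of looking the value up in a table, the multiplier for any confidence level can be computed from the standard normal distribution. A minimal sketch using Python's standard library (`z_multiplier` is a name chosen here for illustration):

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """z value that leaves (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 3))
```

For a confidence level of 0.95 this returns a value that rounds to 1.96.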

For a 90% CI we would look up the z value of (1-0.90)/2 = 0.05, which gives us a multiplier of 1.645. Does the z value of 0.05 ring any bells? If so, you are right, because a 90% CI has a very interesting relationship with the infamous p < 0.05. Let’s explore this taboo romance between p values and confidence intervals…

### What do confidence intervals *really, really, really* mean? (Part III)

As a reminder, a p value gives us the probability of obtaining a result at least as extreme as the one we found (e.g., a difference score of 3 or higher) assuming that the population value is zero, and that the distribution of sample values is normally distributed. It is no coincidence that this sounds roughly similar to the previously given definition of a confidence interval, because the calculations and assumptions of p values and confidence intervals are very much alike.

If we get a p < 0.05 we can (in frequentist statistics) decide to rule out that the population value is zero. However, if we get a p > 0.05 we can *not* state that the population value is (likely to be) zero. P values only allow us to *rule out* values, but *not rule them in*. Similarly, confidence intervals allow us to rule out values, but not rule in values. Put differently:

- Non-significant p values do not allow us to accept zero as a likely value for the population mean; we simply have not (yet) ruled it out.
- Confidence intervals do not allow us to accept the values inside the interval as likely population means; we simply have not (yet) ruled them out.

These frequentist methods are very useful tools for falsification, but not for confirmation.

### What do overlapping confidence intervals (not) mean?

Let’s get back to the original question: how can two confidence intervals overlap while the difference between the two values is significant?

I used to think that overlapping confidence intervals ruled out a significant difference, and if my memory serves me right this is also what I was told in my statistics classes. However, it is based on faulty reasoning.

Imagine that we are interested in the question whether Bayesians are taller than Frequentists (this is obviously true, but bear with me):

- Bayesians have a mean height of 1.90 meters, s = 0.1, n = 33.
- Frequentists have a mean height of 1.85 meters, s = 0.1, n = 33.

We can construct confidence intervals around these means:

- Bayesians have a mean height of 1.90 meters, 95% CI [1.866, 1.934].
- Frequentists have a mean height of 1.85 meters, 95% CI [1.816, 1.884].

These CIs overlap, so does this mean that these means do not differ significantly? It does not. If we run an (unpaired) t test we get: t(64) = 2.0310, p = 0.0464. Lo and behold, Bayesians are taller than Frequentists despite the fact that the confidence intervals of their height overlap.
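These numbers can be checked from the summary statistics alone. The sketch below assumes a pooled-variance (Student's) t test and, to stay within the standard library, gets the p value by numerically integrating the t density. The helper names are hypothetical, and a statistics package would normally do this for you.

```python
import math

def ci95(mean, s, n):
    half_width = 1.96 * s / math.sqrt(n)
    return mean - half_width, mean + half_width

def pooled_t(m1, s1, n1, m2, s2, n2):
    """t statistic and degrees of freedom for an unpaired, pooled-variance test."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se_diff, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p value via Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

bayes = ci95(1.90, 0.1, 33)
freq = ci95(1.85, 0.1, 33)
t, df = pooled_t(1.90, 0.1, 33, 1.85, 0.1, 33)
p = two_sided_p(t, df)

print(bayes, freq, "overlap:", bayes[0] < freq[1])
print(round(t, 4), df, round(p, 4))
```

The lower limit for the Bayesians lies below the upper limit for the Frequentists, so the intervals overlap, while the p value still comes out below 0.05.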

Our reasoning here is flawed, as we are comparing different things. The CIs around the two means are based on the assumption that each population mean is equal to each sample mean, but the t test is based on the assumption that the *population of difference scores* is equal to zero.

In other words, when we are running a significance test to compare two means, we are constructing a single distribution consisting of the difference scores. We then calculate the probability of the found difference score or a more extreme one (in this case 0.05 or larger) assuming that the true difference score is 0.

Because we combined the distributions of the two groups into a single distribution, the way we keep track of error is different than for two separate distributions. This is because we *square* the errors before adding them. If the error of both groups is 0.1, we don’t add these together to get 0.2; instead we take sqrt(0.1^2 + 0.1^2), which is 0.14. Due to this, the error in a combined distribution is always smaller than that of the two separate groups added together – not doing this will make us overestimate the amount of error and thus incorrectly assume that two groups can’t be significantly different when their confidence intervals overlap.
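Applied to the height example, the difference between adding errors and adding squared errors looks like this. It is a sketch that uses the 1.96 multiplier throughout (an approximation; the t test uses a slightly larger critical value) and variable names chosen for illustration.

```python
import math

# The example from the text: two errors of 0.1 combine to about 0.14, not 0.2.
print(round(math.hypot(0.1, 0.1), 2))

se_group = 0.1 / math.sqrt(33)             # SE of each group mean
se_diff = math.hypot(se_group, se_group)   # SE of the difference (squares added)
se_added = se_group + se_group             # what the overlap check implicitly uses

diff = 1.90 - 1.85
intervals_overlap = diff < 1.96 * se_added   # overlap of the two separate CIs
difference_ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
zero_excluded = difference_ci[0] > 0         # CI of the difference excludes 0

print(intervals_overlap, difference_ci, zero_excluded)
```

The overlap check compares the difference with 1.96 times the *sum* of the two standard errors, which is a stricter bar than the one the test on the difference actually applies.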

In short, this is what you need to remember when comparing two confidence intervals:

- When two confidence intervals do not overlap, the difference between the two parameters will be significant.
- When two confidence intervals do overlap, the difference between the two parameters can be significant or non-significant.

Now it’s time for a stats joke. Ha Ho.

“Our reasoning here is flawed, as we are comparing different things. *The CIs around the two means are based on the assumption that each population mean is equal to each sample mean*, but the t test is based on the assumption that the population of difference scores is equal to zero.”

The starred section is false, isn’t it?

I am fairly confident that it is correct.

When you calculate a CI around a mean you center the distribution of sample means on that mean. This is the ‘null hypothesis’ you are working with. In such a case it might be more informative to read ‘null hypothesis’ as ‘the to be nullified hypothesis’ (which, if I remember correctly, was the initial meaning of it anyway). As an example, say that the sample mean is 5 and you get a 90% CI of [4.0, 6.0]; you basically just ran two one-sided hypotheses tests: 4.0 and all smaller values are significantly different from the null hypothesis of X=5, and 6.0 and all larger values are also significant.

Here’s my heuristic for the conclusions that one can draw from overlapping error bars, whether those are SEs or CIs, assuming all kinds of unlikely things like normal distribution, 1.96=2.00, alpha = .05, and two-tailed tests.

1. If the error bars are CIs, then each end is 2 SEs from the mean. If they don’t overlap, there is at least 4 SEs between the means. So we can be confident (no pun intended) the groups are significantly different. If they do overlap, we can’t tell (without other information) whether the difference is significant.

2. If the error bars are SEs, then each end is 1 SE from the mean. If they do overlap, there is less than 2 SEs between the means, so we can be confident that the difference between the groups is not significant. If they don’t overlap, again, we need more information.

So error bars based on CIs allow you to positively identify significant differences, but not non-significant ones, and error bars based on SEs allow you to positively identify non-significant differences, but not significant ones.
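The heuristic can be written down as a small decision rule. This sketch adopts the same simplifications as the comment (both groups share one SE, 1.96 is rounded to 2, two-tailed alpha = .05) and additionally assumes independent groups; the function name and return labels are made up. Under those assumptions the difference is significant when the means are more than 2·√2 ≈ 2.83 SEs apart, which lies between the two overlap thresholds of 2 and 4 SEs.

```python
def heuristic(m1, m2, se, bars):
    """Conclusion the overlap heuristic allows for bars='ci' or bars='se'."""
    distance = abs(m1 - m2)
    if bars == "ci":
        # Each CI bar reaches 2 SEs, so non-overlap means at least 4 SEs apart.
        return "significant" if distance >= 4 * se else "undetermined"
    if bars == "se":
        # Each SE bar reaches 1 SE, so overlap means less than 2 SEs apart.
        return "not significant" if distance < 2 * se else "undetermined"
    raise ValueError("bars must be 'ci' or 'se'")

print(heuristic(0.0, 5.0, 1.0, "ci"))   # CI bars do not overlap
print(heuristic(0.0, 3.0, 1.0, "ci"))   # CI bars overlap
print(heuristic(0.0, 1.5, 1.0, "se"))   # SE bars overlap
print(heuristic(0.0, 3.0, 1.0, "se"))   # SE bars do not overlap
```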

This is probably all garbage, but for the moment I find it comforting.

It seems to me this can’t possibly be true. If you were to assume that the two populations had means equal to your sample means, there would be no point testing anything. The difference between sample means would be the true difference and you’d be done. In contrast, the coverage guarantee (or at least aspiration) for the interval is that P([sample.mean – 1.96 se < true.mean < sample.mean + 1.96 se]) = 1 – alpha, in which the true mean is whatever it is and the sample means vary.

Rather than two tests, it's perhaps helpful to view the interval as the result of running an infinite number of tests for all possible mean values and recording what range of them would not reject the sample mean at some fixed level. This is the interval construction procedure that gets the guarantee. The probability that your sample confidence interval contains the true mean is, in this framework at least, either 0 or 1. This makes it clearer, I think, that while we can always think in terms of equivalent hypothesis tests, we haven't privileged any particular null value in those tests.

Yes, constructing a CI is essentially running an infinite number of tests. My point is that with those tests your H0 is the sample mean and you’re essentially running significance tests to see which numbers are unexpectedly extreme assuming that the sample mean is the population mean. Those numbers are the ones outside the confidence interval.

Regarding: “If you were to assume that the two populations had means equal to your sample means, there would be no point testing anything”. You are never testing the null hypothesis – only the probability of data (and more extreme) given a particular null model.

Imagine that we have a sample mean of 0, and you calculate a CI around it of [-1.0, 1.0]; this means that all values more extreme than (-)1.0 are significantly different from 0. Now add 5 to all data points: your CI will be [4.0, 6.0], meaning that all values more extreme than 4 and 6 are significantly different from 5.

> “Yes, constructing a CI is essentially running an infinite number of tests. My point is that with those tests your H0 is the sample mean”

Nope. If you are testing with H0 equal to all the possible (sample) means in turn then your H0 is by definition not just the actual sample mean.

> “and you’re essentially running significance tests to see which numbers are unexpectedly extreme assuming that the sample mean is the population mean. Those numbers are the ones outside the confidence interval.”

There are no other numbers but the ones you’ve just decided to make the population means, so I’m not sure how this helps anything.

> Regarding: “If you were to assume that the two populations had means equal to your sample means, there would be no point testing anything”. You are never testing the null hypothesis – only the probability of data (and more extreme) given a particular null model.

Apologies. My phrasing was too terse and a bit ambiguous. What I meant was that there was no point testing because (without another two samples) there is no data left to test anything with; you just used it all getting your two H0s.

The crux of the matter, it seems to me, is that we can (on a good day with the wind behind us) get confidence intervals for each group mean, or an interval for the difference between the group means. The space, and thus the nulls, implicitly involved in each of these constructions are different. Since the difference of means question is answered by nulls implicit in the second interval, we answer the question using that interval, not some overlap or other function of the first intervals.

We might find it convenient to frame the problem as a null hypothesis test for H0: population difference = 0, but we don’t need to. We can equally well ask for an interval for that difference and reject if 0 is not in it, as above. As always, it makes no difference whether we test or interval. But the difference that confuses the people to whom this post is directed is, I submit, that the question has changed so the relevant nulls have to as well.

Thanks for all the replies! I’m under the impression that we are both stating approximately the same thing, but that my phrasing is off (or just plain wrong; that’s always a very plausible option). So I’m going to try again to see if and where I’m wrong. For convenience, I’ll number my statements:

1) In a typical null hypothesis test we assume that the population value X is 0.

2) To construct the CI limits we’ll not just calculate the probability of the data assuming X = 0, but of all possible population values. So we’ll calculate the probability of the data assuming X = 0.1, 0.2, etc.

3) The first assumed population values that give us a too extreme test statistic are the upper/lower limits of the CI.

4) For example, for assumed population values of X < 4 the observed test statistic is significant, and the same would be true for X > 6. Thus the CI would be [4.0, 6.0].

5) This tells us under which assumed population values the observed value would be “too unlikely” or “too surprising”. We have ruled out the values <4 and >6, but not ruled in the values of 4 < x < 6, as we can only refute but not confirm. Would you agree with these claims and the way I phrased them?
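The numbered steps above can be sketched as a brute-force search: try a grid of assumed population values, test the observed mean against each, and keep the values that are not rejected. This is an illustration under simplifying assumptions (a two-sided z test with a known standard error, and a finite grid instead of “all possible” values); `invert_z_test` is a hypothetical name.

```python
from statistics import NormalDist

def invert_z_test(sample_mean, se, alpha=0.05, step=0.001):
    """Smallest and largest assumed population mean not rejected at level alpha."""
    std_normal = NormalDist()
    start = sample_mean - 5 * se           # search 5 SEs to either side
    n_points = int(10 * se / step) + 1
    kept = []
    for i in range(n_points):
        mu0 = start + i * step             # assumed population value
        z = (sample_mean - mu0) / se
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p >= alpha:                     # observed mean is not "too surprising"
            kept.append(mu0)
    return min(kept), max(kept)

# A sample mean of 5 with an SE chosen so that 1.96 * SE equals 1.
lower, upper = invert_z_test(sample_mean=5.0, se=1 / 1.96)
print(round(lower, 2), round(upper, 2))
```

Up to the grid resolution, the surviving range matches the [4.0, 6.0] interval from the example.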

Dear Tim, thanks for the post. CIs are always a nagging issue. You tie these to the normal distribution. Although the central limit theorem is nice and well, there is no fundamental reason for this connection. You can calculate CIs for other distributions, discrete, multidimensional, infinite variance, … Therefore implying a strict connection of CI to Gaussian distribution is more misleading than helpful. Best, Peter

Why can’t I interpret a 95% confidence interval as containing the true parameter with 95% probability, when I could adopt a Bayesian viewpoint with the same model and an uninformative prior and, apparently, do so in all reasonableness? This seems to be a very popular argument against confidence intervals, that I have never understood…

The Bayesian interval allows you to make such a statement conditional on your assumption of a flat prior, which can be criticized as implausible. The frequentist interval allows you to make such a statement about the long-run performance of the procedure you use to generate intervals, but not about any interval in particular.

In any context where there is external information about the effect size (e.g., all values from -inf to +inf are not equally plausible a priori), the frequentist interval and the (numerically equivalent) bayesian interval under a flat prior will sometimes be plainly wrong as a statement about the value of the population parameter.

See

http://andrewgelman.com/2013/11/21/hidden-dangers-noninformative-priors/

Tim: Leaving approximation issues aside (e.g. your Wald interval will be a bad test and a dodgy interval for even quite large amounts of binomial data), I’d agree that if the underlying model is correct then you’ve inverted a test at level alpha to get an interval you have 100*(1-alpha)% confidence in.

Also, looking back, I think I see the expository problem more clearly. You say “you basically just ran two one-sided hypotheses tests: 4.0 and all smaller values are significantly different from the null hypothesis of X=5, and 6.0 and all larger values are also significant.” Fair enough. These are two end-point defining tests. But earlier you say that “The CIs around the two means are based on the assumption that each population mean is equal to each sample mean”. There are two tests being described in each sentence, but the nulls and tests of the first sentence are not the nulls and tests of the second sentence.

Reading back my comments I can only agree with you. My phrasing was incorrect, thanks for correcting me!

Hello Tim,

I think this is on the whole a good clarification of an interpretative problem lots of people have or have had at one point in their statistical education. A couple of perhaps minor issues keep me from adding this post to the recommendations for further reading for my students, which I thought I’d summarise here. (I appreciate that you probably purposefully glossed over these details for pedagogical reasons.)

– Much of what you discuss is also discussed in greater detail by Cumming & Finch (2005) (“Inference by eye: confidence intervals and how to read pictures of data”; American Psychologist).

– As other commenters pointed out, you can compute CIs without relying on the central limit theorem (e.g., using bootstrapping). In fact, the formula you present (I presume for pedagogical tractability) is only of limited use as it assumes Gaussian data (fine-grained, practically unbounded data in which there’s no relationship between the mean and the variance). This excludes Likert-type data, proportions etc. Furthermore, it assumes that the population SD has been estimated with great certainty (i.e., that n is large enough) and that the data points are independent of one another. Often, then, one can’t be 100% confident that the formula will yield 95% coverage intervals; indeed, one can often be 100% confident that it *won’t* (e.g., due to clustering, blocking, heteroskedasticity etc.). Other formulae are available for such cases but they have their own assumptions, so one can rarely be 100% confident in an algorithm’s coverage properties.

– As CP pointed out, I don’t think it’s helpful to say that confidence intervals “assume” that the population mean is equal to the sample mean. Such an assumption would obviate the need for confidence intervals in the first place. What you mean is that the sample mean is taken as a point of departure in the construction of the CI around it, but that’s just because the sample mean is an unbiased estimator of the population mean. Similarly, the t-test doesn’t assume the null hypothesis is true; this assumption would again obviate the need for testing it. Rather, it takes the null hypothesis as, well, a hypothetical.

– “What does this mean? That the validity of calculating a CI relies on the assumption of normality. Importantly, this does not mean that the scores in the sample need to be normally distributed, but that the scores in the population of samples needs to be (approximately) normally distributed.” That’s an assumption of using t-distributions when constructing CIs, but the formula you provided only assumes that the sample means are normally distributed, not that the data points themselves were sampled from a normal distribution.

– Even after reading Morey et al.’s (2016) “The fallacy of placing confidence in confidence intervals”, I still didn’t really understand why one can’t say that any given CI contains µ with a 95% probability. This was often explained in terms of “the CI either contains µ or it doesn’t, but we don’t know which”, which I didn’t find too helpful. (I’ll either win the lottery or I won’t, but I can still quantify the probability of each outcome.) I’m still not fully sure how best to explain it, though.

– Related to the previous point: Confidence intervals are indeed easily misinterpreted. But I’ll go out on a limb and say that typical misconceptions about confidence intervals are less aggravating than typical misconceptions about p-values. That doesn’t mean one shouldn’t try to understand them or teach them correctly, but it’s some form of consolation. (Incidentally, I think that Bayesian credible intervals can just as easily be misinterpreted, namely by neglecting their dependence on the specification of the model or any bias in the data.)
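As an illustration of the bootstrapping alternative mentioned in the list above, here is a minimal percentile-bootstrap sketch. The function name, the number of resamples and the Likert-style example data are all assumptions. The percentile method is only one of several bootstrap intervals, and it comes with its own assumptions, such as independent observations.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.fmean, confidence=0.95,
                 n_resamples=5000, seed=0):
    """Percentile bootstrap interval for stat, resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Made-up responses on a 1-5 scale.
responses = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5] * 3
print(statistics.fmean(responses), bootstrap_ci(responses))
```

Because the limits come from resampled means of the observed data, they cannot fall outside the 1 to 5 range of the scale.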


“Importantly, the specific values of a calculated CI should not be directly interpreted.”

Strictly speaking, that’s not so. There is a way to interpret realized (calculated) CIs. The concept is “bet-proofness”. I learned about it from a recent paper by Mueller-Norets (Econometrica 2016).

Mueller-Norets (2016, published version, p. 2185):

“Following Buehler (1959) and Robinson (1977), we consider a formalization of “reasonableness” of a confidence set by a betting scheme: Suppose an inspector does not know the true value of θ either, but sees the data and the confidence set of level 1−α. For any realization, the inspector can choose to object to the confidence set by claiming that she does not believe that the true value of θ is contained in the set. Suppose a correct objection yields her a payoff of unity, while she loses α/(1−α) for a mistaken objection, so that the odds correspond to the level of the confidence interval. Is it possible for the inspector to be right on average with her objections no matter what the true parameter is, that is, can she generate positive expected payoffs uniformly over the parameter space? … The possibility of uniformly positive expected winnings may thus usefully serve as a formal indicator for the “reasonableness” of confidence sets.”

“The analysis of set estimators via betting schemes, and the closely related notion of a relevant or recognizable subset, goes back to Fisher (1956), Buehler (1959), Wallace (1959), Cornfield (1969), Pierce (1973), and Robinson (1977). The main result of this literature is that a set is “reasonable” or bet-proof (uniformly positive expected winnings are impossible) if and only if it is a superset of a Bayesian credible set with respect to some prior. In the standard problem of inference about an unrestricted mean of a normal variate with known variance, which arises as the limiting problem in well behaved parametric models, the usual [realized confidence] interval can hence be shown to be bet-proof.”

We had a good discussion about it over at Andrew Gelman’s blog some months ago.

http://andrewgelman.com/2017/03/04/interpret-confidence-intervals/

Some good contributions there, esp. by Carlos Ungil and Daniel Lakeland.

Full reference:

Credibility of Confidence Sets in Nonstandard Econometric Problems

Ulrich K. Mueller and Andriy Norets (2016)

https://www.princeton.edu/~umueller/cred.pdf

http://onlinelibrary.wiley.com/doi/10.3982/ECTA14023/abstract