What are confidence intervals? It is the time interval between when you have successfully learned what confidence intervals mean and when you start to realize that you’ve never fully understood them.
As a Masters student I had to read Geoff Cumming’s “New Statistics” book, which prominently featured confidence intervals. To this day I still don’t fully understand why someone would advocate relying on them – it seems to be (at least partly) based on common misconceptions.
Yesterday someone notified me about a question regarding confidence intervals based on the idea that if the confidence intervals of two parameters overlap you can conclude that these parameters are not significantly different from each other. This is, however, wrong. We’ll get to why this is, but first we are going on a little adventure.
What do confidence intervals really mean? (Part I)
Confidence intervals give a range of values which will “in the long run” contain the population value of a parameter a certain % of the time.
That likely did not help much, did it?
Put differently, if you construct a hundred 95% CIs, then about 95 of these intervals will contain the population value. Typically we only calculate a single CI, and not a hundred. Problematically, this gives us no way of knowing whether our single CI happens to be one of the lucky intervals which contain the population value. It either does or it doesn’t, but we don’t know.
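To make the “in the long run” part concrete, here is a small simulation sketch. Everything in it is an assumption of mine for illustration: the population (Gaussian with mean 5 and standard deviation 1), the sample size of 50, and the function name `coverage`. It builds thousands of 95% CIs with the usual 1.96 multiplier and counts how many contain the true mean.

```python
import math
import random
import statistics


def coverage(n_intervals=5000, n=50, mu=5.0, sigma=1.0, z=1.96, seed=1):
    """Return the fraction of simulated CIs that contain the true mean mu.

    The population parameters (mu, sigma) are made up for this illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_intervals):
        # Draw one sample and build a single CI around its mean.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        if mean - z * se <= mu <= mean + z * se:
            hits += 1
    return hits / n_intervals


print(f"Share of intervals containing the population mean: {coverage():.3f}")
```

The share should land close to 0.95 (typically a touch below, because 1.96 is used together with an estimated standard deviation). Note that the simulation can only count the hits because it knows the true mean; with real data you have one interval and no way to check it.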
Importantly, the specific values of a calculated CI should not be directly interpreted. If we have a 95% CI of [4.0, 6.0] we cannot interpret these limits. We also cannot state that there is a 95% probability that the true population value is between 4 and 6, nor that we can have 95% confidence in this claim (whatever that means).
What helped me to better understand what CIs mean is realizing that the “95%” part does not refer to the limits that you calculate, but to the procedure of calculating these limits. We can be 100% confident that this procedure will, in the long run, provide us with limits that contain the population value 95% of the time.
We’ll come back to interpreting CIs again later.
Calculating a 95% confidence interval
We’ll start with the textbook example of a CI: a 95% CI of a mean. While the % can be any number and you can calculate it for any parameter, the 95% CI of the mean appears to be the only thing most people will ever calculate.
The formula to calculate a 95% CI for a mean is: $\bar{X} \pm 1.96 \times \frac{s}{\sqrt{n}}$, where $\bar{X}$ is the sample mean, $s$ the standard deviation, and $n$ the sample size. Dividing the standard deviation ($s$) by the square root of the sample size ($n$) simply gives us the standard error (SE). I like to present the formula in this format as you can deduce that the interval will become smaller as the sample size increases (or when you have a lower standard deviation, for example when removing outliers).
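As a minimal sketch of this calculation in Python: the function name `ci_mean_95` and the input values (a mean of 5, a standard deviation of 2) are made up for illustration. The second call uses a larger sample to show the interval shrinking.

```python
import math


def ci_mean_95(mean, s, n):
    """95% CI for a mean: mean +/- 1.96 * (s / sqrt(n))."""
    se = s / math.sqrt(n)   # standard error of the mean
    margin = 1.96 * se      # half-width of the interval
    return mean - margin, mean + margin


# Hypothetical sample: mean 5.0, standard deviation 2.0.
for n in (16, 64):
    low, high = ci_mean_95(mean=5.0, s=2.0, n=n)
    print(f"n = {n}: [{low:.2f}, {high:.2f}]")
# n = 16: [4.02, 5.98]
# n = 64: [4.51, 5.49]
```

Quadrupling the sample size halves the standard error, and with it the width of the interval.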
Why do we use the standard error and not just the standard deviation, you ask? Excellent question! (Literally nobody ever asks such questions, but I’m just mimicking statisticians who like to pretend that people do.) If you want to get a confidence interval of the sample, you use the standard deviation of the sample. However, we are not interested in the sample but in the population. The standard error is simply the standard deviation of the sample means; because we are interested in the population (of sample means) we use the standard error.
(Note: The above formula is just one of many, each of which has specific purposes and limitations. The one I present is probably one of the most commonly taught ones, but is not necessarily “better”. It assumes Gaussian data, which is actually not all too common – for example you can’t use this formula for Likert-scale data.)
What do confidence intervals really, really mean? (Part II)
The above helps us to refine our understanding of CIs. When we are working with t-tests, ANOVAs, etc. we tend to be working with Gaussian data. The calculation of CIs for this kind of data is based on the central limit theorem, which states that as sample size increases the distribution of the sample parameter (in this case, the mean) will start to approximate a normal distribution.
What does this mean? That the validity of calculating a CI relies on the assumption of normality. Importantly, this does not mean that the scores in the sample need to be normally distributed, but that the sample means in the population of samples need to be (approximately) normally distributed. Luckily, this is very often (approximately) the case if your sample size is large enough, but things like skew and outliers can dramatically ‘slow down’ the process: you then need a much larger sample before the sample means approximate this distribution.
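A quick way to see this for yourself is to simulate it. The sketch below is an illustration under my own assumptions: a strongly skewed (exponential) population, 5000 simulated samples per sample size, and a hand-rolled `skewness` helper. It measures how skewed the distribution of sample means still is at different sample sizes; values near zero mean the distribution looks roughly symmetric, as a normal distribution would.

```python
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])


def sample_means(n, n_samples=5000, seed=2):
    """Means of n_samples samples of size n from a skewed population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]


for n in (1, 5, 50):
    print(f"n = {n:>2}: skewness of sample means = {skewness(sample_means(n)):.2f}")
```

The skew shrinks as n grows, but it does not vanish quickly: with a population this lopsided the sample means are still visibly skewed at small sample sizes.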
In short, constructing confidence intervals gives us ranges of values that will contain the population mean X% of the time, assuming that the population of sample means is normally distributed.
Calculating any X% confidence interval
Above we looked at the 95% confidence interval, but there is no reason to limit ourselves to 95%. What if we want to calculate a 90% confidence interval?
To calculate a CI for any particular ‘confidence level’ we can use this formula: $\bar{X} \pm z_{(1-p)/2} \times \frac{s}{\sqrt{n}}$, where $p$ is the confidence level that you are interested in. In the case of a 95% CI you would thus look up the z value of (1-0.95)/2 = 0.025. Looking up the z value for a standard normal distribution (as that is the assumed distribution of sample means) gives us a z of 1.96 (ignoring the sign). That is why a 95% CI is calculated with the 1.96 multiplier.
For a 90% CI we would look up the z value of (1-0.90)/2 = 0.05, which gives us a multiplier of 1.64. Does the z value of 0.05 ring any bells? If so, you are right, because a 90% CI has a very interesting relationship with the infamous p < 0.05. Let’s explore this taboo romance between p values and confidence intervals…
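Instead of looking these values up in a table you can let Python do it. This is a small sketch using `statistics.NormalDist` from the standard library; the function name `z_multiplier` is my own.

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """Multiplier for a CI at the given confidence level (e.g. 0.95)."""
    tail = (1 - confidence) / 2
    # inv_cdf returns the (negative) z value that cuts off the lower tail,
    # so we flip the sign to get the multiplier.
    return -NormalDist().inv_cdf(tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} CI: multiplier = {z_multiplier(level):.3f}")
# 90% CI: multiplier = 1.645
# 95% CI: multiplier = 1.960
# 99% CI: multiplier = 2.576
```

The 1.64 in the text is this 1.645 cut off at two decimals.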
What do confidence intervals really, really, really mean? (Part III)
As a reminder, a p value gives us the probability of obtaining a result at least as extreme as the one we found (e.g., a difference score of 3 or higher) assuming that the population value is zero, and that the distribution of sample values is normally distributed. It is no coincidence that this sounds roughly similar to the previously given definition of a confidence interval, because the calculations and assumptions of p values and confidence intervals are very much alike.
If we get a p < 0.05 we can (in frequentist statistics) decide to rule out that the population value is zero. However, if we get a p > 0.05 we can not state that the population value is (likely to be) zero. P values only allow us to rule out values, but not rule them in. Similarly, confidence intervals allow us to rule out values, but not rule in values. Put differently:
- Non-significant p values do not allow us to accept zero as a likely value for the population mean; we simply have not (yet) ruled it out.
- Confidence intervals do not allow us to accept the values inside the interval as likely population means; we simply have not (yet) ruled them out.
These frequentist methods are very useful tools for falsification, but not for confirmation.
What do overlapping confidence intervals (not) mean?
Let’s get back to the original question: how can two confidence intervals overlap while the difference between the two values is significant?
I used to think that “overlapping intervals mean no significant difference” was a valid line of reasoning, and if my memory serves me right this is also what I was told in my statistics classes. However, it is based on faulty reasoning.
Imagine that we are interested in the question whether Bayesians are taller than Frequentists (this is obviously true, but bear with me):
- Bayesians have a mean height of 1.90 meter, s = 0.1, n = 33.
- Frequentists have a mean height of 1.85 meter, s = 0.1, n = 33.
We can construct confidence intervals around these means:
- Bayesians have a mean height of 1.90 meter, 95% CI [1.866, 1.934].
- Frequentists have a mean height of 1.85 meter, 95% CI [1.816, 1.884].
These CIs overlap, so does this mean that these means do not differ significantly? It does not. If we run an (unpaired) t test we get: t(64) = 2.0310, p = 0.0464. Lo and behold, Bayesians are taller than Frequentists despite the fact that the confidence intervals of their height overlap.
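If you want to check these numbers yourself, here is a sketch that runs the unpaired t test from the summary statistics alone. It assumes the classic pooled-variance (Student) t test, which matches the reported 64 degrees of freedom. To stay within the standard library it gets the p value by numerically integrating the t density; the helper names are mine, and in practice you would use a statistics package instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p value via Simpson's rule on the density from 0 to |t|."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return 1 - 2 * area


def unpaired_t(m1, s1, n1, m2, s2, n2):
    """Pooled-variance t test from means, standard deviations and sizes."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se_diff
    return t, df, two_sided_p(t, df)


t, df, p = unpaired_t(1.90, 0.1, 33, 1.85, 0.1, 33)
print(f"t({df}) = {t:.4f}, p = {p:.4f}")
```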
Our reasoning here is flawed, as we are comparing different things. The CI around each mean is built from that group’s own sample mean and standard error, whereas the t test is based on the assumption that the population difference score is equal to zero and uses the standard error of the difference.
In other words, when we are running a significance test to compare two means, we are constructing a single distribution consisting of the difference scores. We then calculate the probability of the found difference score or a more extreme one (in this case 0.05 or larger) assuming that the true difference score is 0.
Because we combined the distributions of the two groups into a single distribution, the way we keep track of error is different than for two separate distributions. This is because we square error before adding it. If the error of both groups is 0.1, we don’t add this together to 0.2; instead we take sqrt(0.1^2 + 0.1^2), which is 0.14. Due to this the error in a combined distribution is always smaller than that of two separate groups added together. Not doing this will make us overestimate the amount of error and thus incorrectly assume that two groups can’t be significantly different when their confidence intervals overlap.
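Applied to the height example, the difference between adding errors and combining squared errors looks like this. It is a rough sketch: the variable names are mine, and I use the 1.96 multiplier throughout as a normal approximation, whereas the t test itself uses a slightly larger critical value.

```python
import math

# Standard errors of the two groups (s = 0.1, n = 33 each).
se_a = 0.1 / math.sqrt(33)
se_b = 0.1 / math.sqrt(33)

diff = 1.90 - 1.85  # observed difference between the two means

# The two CIs stop overlapping only when the means are further apart
# than the two margins added together.
min_gap_no_overlap = 1.96 * (se_a + se_b)

# The difference is (approximately) significant once it exceeds the
# margin based on the combined error.
min_gap_significant = 1.96 * math.sqrt(se_a ** 2 + se_b ** 2)

print(f"Observed difference:            {diff:.4f}")
print(f"Needed for significance:        {min_gap_significant:.4f}")
print(f"Needed for non-overlapping CIs: {min_gap_no_overlap:.4f}")
```

The observed difference sits between the two thresholds: large enough to be significant, yet too small to pull the two intervals apart.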
In short, this is what you need to remember when comparing two confidence intervals:
- When two confidence intervals do not overlap, the difference between the two parameters will be significant.
- When two confidence intervals do overlap, the difference between the two parameters can be significant or non-significant.
Now it’s time for a stats joke. Ha Ho.