Venturing into explanation space

You have to design a study so that it can answer the research question(s). This is a given. Except that it is not.

Starting with the obvious: studies can’t answer questions, they can only provide data. While we often use statistics to help us deal with the data, that also doesn’t give us any answers, just another type of data; and the data is silent.

There is a very important distinction between explanations of findings and the findings themselves. The findings are not true or false, they just are. Only the interpretations and explanations of those findings can have some truthiness to them. We have to interpret the data in an attempt to say something meaningful. I think that one of the most vital but often overlooked things to consider when interpreting a finding is the explanation space.

[Figure: Bayesian likelihood distribution]

The explanation space is the collection of all the possible explanations of a given finding. Arguably only one of them is true, but many will have some truthiness in them. The challenge is to gather enough evidence so that we know which one is true, or at least more likely to be true than reasonable alternatives.

Problematically, the explanation space is virtually unlimited; there can be countless reasons why something is the way it is. A consequence of this is that the a priori likelihood of any explanation is very low. The combined likelihood of all explanations must add up to 1 (or 100%), so if there are many explanations, each individual explanation has a low likelihood.

That is, of course, if we have no prior information about the likelihoods. A good study can give us relevant information which we can use to update our beliefs about which explanations are more likely.
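To make the arithmetic concrete, here is a minimal sketch of such an update using Bayes' rule. It assumes that "likelihood" in the text can be read as the probability we assign to an explanation. The explanation names and all numbers are invented for illustration and do not come from any study.

```python
def uniform_prior(explanations):
    """With no prior information, every explanation gets an equal share of 1."""
    n = len(explanations)
    return {name: 1.0 / n for name in explanations}


def update(prior, finding_probability):
    """Bayes' rule: posterior is proportional to prior times P(finding | explanation)."""
    unnormalized = {name: prior[name] * finding_probability[name] for name in prior}
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


explanations = ["training effect", "testing effect", "expectancy effect", "chance"]
prior = uniform_prior(explanations)

# Hypothetical values: how probable the observed finding is under each explanation.
finding_probability = {
    "training effect": 0.8,
    "testing effect": 0.4,
    "expectancy effect": 0.4,
    "chance": 0.1,
}

posterior = update(prior, finding_probability)
for name in explanations:
    print(f"{name}: prior {prior[name]:.2f} -> posterior {posterior[name]:.2f}")
```

Because the posterior is renormalized to sum to 1, any explanation that gains probability does so at the expense of the others. Adding more explanations to the list lowers the starting share of each.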

Let’s look at the following example:

“Our chief method was to test the efficiency of some function or functions, then to give training in some other function or functions until a certain amount of improvement was reached, and then to test the first function or set of functions. Provided no other factors were allowed to affect the tests, the difference between the test before and after training measures the influence of the improvement in the trained functions on the functions tested.”

Woodworth & Thorndike (1901)

In this classic paper on the interdependence of mental functions, Woodworth and Thorndike argue that the (for its time revolutionary) study design allowed them to conclude that the training affected the participants’ mental functions. Although they do highlight a critical assumption about an alternative explanation (‘provided no other factors […] affect the test’), they would argue that overall they have successfully increased the relative likelihood of their interpretation.

Note that this is not what the data says; mostly because it doesn’t say anything, but also because the given explanation is just one of many explanations. The participants could have become better at the tests after the training because they had already taken very similar tests before the training; a testing effect (3). Or, they (were) expected to become better and thus scored higher; a placebo or Pygmalion effect (2, 4). Or, they didn’t actually increase in ability but scored higher by chance, for example by regression to the mean (6). Or, or, or…

Strictly speaking, this particular study provides little information on the plausibility of each of these alternative explanations. However, given what we know about how the world generally works, it is indeed very reasonable to accept that the training affected the mental functions.

Bluntly put: the value of an (experimental) study lies in its (potential) ability to affect the distribution of likelihoods in the explanation space. An informative study will redistribute the likelihoods across explanations so that some explanations ‘steal’ likelihood from others (remember that the sum likelihood is always 100%). If a study fails to do this, it is uninformative. It’s important to realize that the potential informational value of a study depends mostly on choices made before the study is actually done: the study design.
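One possible way to put a number on "how much a study redistributes the likelihoods" is the Kullback-Leibler divergence between the distribution after the study and the distribution before it. This is my own choice of measure for illustration, not something the argument above depends on, and the distributions below are made up.

```python
import math


def information_gain(prior, posterior):
    """KL divergence D(posterior || prior) in bits; 0 means the study changed nothing."""
    return sum(
        q * math.log2(q / prior[name])
        for name, q in posterior.items()
        if q > 0  # terms with zero posterior probability contribute nothing
    )


# Hypothetical explanation space with four explanations, A to D.
prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# An uninformative study leaves the distribution as it was.
after_uninformative_study = dict(prior)

# An informative study lets explanation A 'steal' likelihood from the rest.
after_informative_study = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}

print(information_gain(prior, after_uninformative_study))
print(information_gain(prior, after_informative_study))
```

The first value is zero and the second is positive. A larger value means the study moved more likelihood between explanations.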

Enter control groups

Many studies make use of control groups, and for many good reasons. I would argue that the main function of a control group is to limit the explanation space, and thus to increase a study’s informational value. Note that I am talking about control groups in the broad sense; any type of comparison between or within subjects.

The use of control comparisons has a long history. The study by Woodworth and Thorndike is often incorrectly credited as the first to use a control group (e.g. 5), but I’ve found an even earlier study using the same principle:

“A rabbit was made immune against tetanus […]. Blood was taken from the carotid artery of this rabbit. […] [t]his blood was injected into the body cavity of a mouse, and […] into that of another. Twenty-four hours later, these mice, together with two control-mice, were inoculated with tetanus of such virulence that the latter showed symptoms of tetanus after 20 hours, and were dead in 36 hours. Both of the treated mice, on the contrary, remained healthy.”

Hankin (1890)

Without the two control mice, the study would be much weaker as there would be many more plausible explanations for the finding that the treated mice remained healthy. In other words: for the exact same finding, having one or more controls can vastly increase the informational value of a study.

Each explanation can be treated as a theory (or model, or hypothesis) which makes certain predictions. As many explanations will share a limited number of predictions, we don’t have to treat them individually but can set up a control group (or comparison) to specifically test the prediction(s) of sets of explanations we want to rule out. Given the central role of comparisons, it is essential to decide which comparison(s) you want to make.
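As a rough sketch of how a control comparison removes a whole set of explanations at once, consider a simulated pre/post training study. I assume that both groups improve simply from taking the test twice (a testing effect) and that only the trained group gets an extra gain. All effect sizes, sample sizes and function names are hypothetical.

```python
import random
from statistics import mean


def simulate_group(n, practice_gain, training_gain, rng):
    """Return (pre, post) score lists for one group of n participants."""
    pre = [rng.gauss(50, 10) for _ in range(n)]
    # Post score = pre score + shared practice gain + training gain + noise.
    post = [score + practice_gain + training_gain + rng.gauss(0, 5) for score in pre]
    return pre, post


def mean_change(pre, post):
    """Average improvement from the first to the second test."""
    return mean(post) - mean(pre)


rng = random.Random(1)  # fixed seed so the sketch is reproducible

trained_pre, trained_post = simulate_group(2000, practice_gain=3.0, training_gain=2.0, rng=rng)
control_pre, control_post = simulate_group(2000, practice_gain=3.0, training_gain=0.0, rng=rng)

# Without a control group, the practice gain is mixed into the estimate.
naive_estimate = mean_change(trained_pre, trained_post)

# Subtracting the control group's change removes everything both groups share.
controlled_estimate = naive_estimate - mean_change(control_pre, control_post)

print(f"without control: {naive_estimate:.2f}")
print(f"with control:    {controlled_estimate:.2f}")
```

The estimate without a control lands near the sum of both gains, while the controlled estimate lands near the training gain alone. The control only rules out explanations that predict the same change in both groups. Anything that differs between the groups, such as expectations, needs its own comparison.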

The horses of explanation space

In medicine there is a saying that ‘when you hear hoof beats, think horses not zebras’. This applies very well to explanation space, as there are many common ‘baseline explanations’ which should typically take priority over more exotic interpretations, unless you have sufficient information to argue otherwise. Many studies share their explanation space to a large extent. We can (and should) use this to construct the horses of explanation space: baseline explanations or common alternative explanations which any zebra explanation must compete with.

Here is a set of baseline explanations which apply to most intervention-type studies, but most of which are applicable to many other types of research as well:

  1. People differ from each other, and from their earlier/future self
  2. People and measurements change over time (e.g. regression to the mean, random variation, measurement error)
  3. Doing something is better than doing nothing (e.g. A+B>A)
  4. People become better when they expect to become better (e.g. placebo or expectancy effect)
  5. People behave differently when they (think they) are observed (e.g. observation effects, social desirability bias)
  6. People who choose something are different from those who didn’t (e.g. self-selection bias)
  7. People who dropped out from a study are different from those who didn’t (e.g. non-random attrition)

If a study reports on a finding which resembles any of the above, carefully analyze whether they have provided information which can convince you to refute these null hypotheses and accept a more exotic explanation. If you are not able to show why any of these baseline alternative explanations are (much) less likely than the explanation of interest, you ought to be very careful with drawing conclusions.
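To see how convincing a horse can look, here is a small simulation of baseline 2, regression to the mean. It assumes a stable true ability, two equally noisy tests and no intervention whatsoever. The lowest scorers on the first test still "improve" on the second. All numbers are invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 5000

# Each person has a stable true ability; each test adds independent noise.
true_ability = [rng.gauss(100, 10) for _ in range(n)]
test1 = [ability + rng.gauss(0, 10) for ability in true_ability]
test2 = [ability + rng.gauss(0, 10) for ability in true_ability]  # nothing happened in between

# Select roughly the lowest-scoring 10% on the first test,
# as a study recruiting 'poor performers' might do.
cutoff = sorted(test1)[n // 10]
selected = [i for i in range(n) if test1[i] < cutoff]

first_mean = mean(test1[i] for i in selected)
second_mean = mean(test2[i] for i in selected)

print(f"selected group, first test:  {first_mean:.1f}")
print(f"selected group, second test: {second_mean:.1f}")
```

The selected group scores clearly higher the second time, even though nobody's ability changed. A comparison group selected in the same way would show the same rise, which is exactly why it is informative.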

Do you have suggestions for more baselines, or do you maybe disagree with what I’ve said? Use the comment section!

Practical recommendations:

Designing a study:

  1. Limit thy explanation space! Design studies so that they are likely to be informative.
  2. Make a list of all (sets of) plausible explanations of the intended outcome(s). Which can you actively control for? Which can you ‘argue away’?
  3. Although you can often very reasonably argue for or against the likelihood of specific explanations, it is just so much easier to argue if you have data.
  4. Given the partly shared explanations, some explanations are especially informative because they are uniquely diagnostic. Use this to your advantage!
  5. Control, or it didn’t happen.
  6. It is often more efficient to do a single large study with several good controls than to repeat a study several times before you can finally refute baseline hypotheses.
  7. However, replications are vital, as they are needed to refute explanations such as a mere researcher effect, or the even more boring baseline that ‘something happened once’.

Writing a paper:

  1. Which alternative explanation will reviewer 3 come up with in order to reject your paper?
  2. In case your study has some expected and some unexpected findings, do not selectively use post-hoc explanations to explain away the unexpected but not the expected findings (such as described here). This ignores the fact that many findings have a partly shared explanation space.
  3. Actively discuss the likelihoods of alternative explanations, including boring baseline explanations. Explicitly mention if and how much impact your study should have on the distribution of likelihoods.
  4. Sometimes, the main contribution of a study is more like an exploration of the explanation space, e.g. mapping the range of plausible explanations for a particular finding, some of which might have been overlooked by other research.

Interpreting a study:

  1. For each finding, note which explanation(s) is/are given. Did they actually test these, or are they (plausible) assumptions? What are the simpler explanations which are still likely?
  2. Confidence in explanations should not be binary, but on a continuum. Ask yourself how much this particular study should affect the distribution of likelihoods in explanation space.
  3. Criticizing a paper is easy, but remember: being a skeptical scientist means saying that I am wrong most of the time.

If you have suggestions for practical recommendations or baseline hypotheses, please let me know and I’ll add them to the list. All other comments are most welcome as well!


  1. Hankin, E. H. (1890). A cure for tetanus and diphtheria. Nature, 43, 121-123.
  2. Montgomery, G. H., & Kirsch, I. (1997). Classical conditioning and the placebo effect. Pain, 72(1), 107-113.
  3. Roediger, H. L., & Butler, A. C. (2011). The critical role of retrieval practice in long-term retention. Trends in Cognitive Sciences, 15(1), 20-27.
  4. Rosenthal, R. (1997). Interpersonal Expectancy Effects: A Forty Year Perspective.
  5. Solomon, R. L. (1949). An extension of control group design. Psychological Bulletin, 46(2), 137.
  6. Stigler, S. M. (1997). Regression towards the mean, historically considered. Statistical Methods in Medical Research, 6(2), 103-114.
  7. Woodworth, R. S., & Thorndike, E. L. (1901). The influence of improvement in one mental function upon the efficiency of other functions. Psychological Review, 8(3), 247-261.
