Multilevel, or it didn’t happen.

Imagine that on some days you have a headache, but not on other days. You suspect that this is either because you drink beer or because you drink water, and you want to find out which. You carefully keep track of when you get a headache while you either drink beer or water. Subsequently, you do some kind of significance test. Lo and behold, drinking beer is related to getting a headache, p < 0.05!

Given the study design and result, which of the following conclusions are valid?

  1. If I drink beer, I will probably get a headache.
  2. If I drink any alcoholic beverage, I will probably get a headache.
  3. If anyone drinks beer, they will probably get a headache.
  4. If anyone drinks any alcoholic beverage, they will probably get a headache.

I suspect that most will agree that the first is the only warranted statement, while all the others are – strictly speaking – not warranted. If you have understood it so far: congratulations! You now realize why for many studies it is absolutely essential to use a multilevel analysis.

Statement 2 is wrong because you did not account for the variance in how your body reacts to other types of alcoholic beverages. Beer is only one of many alcoholic beverages, and because you did not sample from the population of alcoholic beverages you cannot generalize to other alcoholic beverages. Only when you have a valid sample of the population of alcoholic beverages can you account for the variance between them (do try this at home, but sample with care).

Statements 3 and 4 are wrong because you also did not account for how other people vary in how they react to alcoholic beverages. You are only one of many people, and because you did not sample from the population of people you cannot generalize to other people.

In short, you can only generalize to whatever population you sampled from. In the above example you only sampled from your drinking behavior of beer and water, so you can only generalize to your future drinking behavior of beer and water. Makes sense, right?

Pyramids

Now let’s explore why this is important for designing studies using the following example:

You want to test which instruction method is able to teach children how to read. Specifically, you want to compare two methods: the ‘phonics’ method versus the ‘whole language’ method. You find a teacher who teaches many classes, and is willing to teach half of the classes according to the phonics method, and the other half with the whole language method. Lo and behold, the whole language method is related to learning how to read, p < 0.05!

Again, which conclusion is valid?

  1. If children receive the whole language method, they will probably learn how to read.
  2. If children receive the whole language method, they will probably learn how to count.
  3. If anyone receives the whole language method, they will probably learn how to read.
  4. If anyone receives the whole language method, they will probably learn how to count.

Again, I assume that most will agree that statements 2 through 4 are false. However, statement 1 is false as well. It was a trick question; sorry.

Statement 1 is false because it also overgeneralizes. Both teaching methods were given by one and the same teacher. You did not sample teachers from the population of teachers; therefore, you cannot generalize to other teachers. In other words, you can only generalize to other students who will be taught by this specific teacher, because that is the population you sampled from.

If we want to be able to generalize to children taught by any teacher, we should have sampled teachers from the teacher population, because only then can we account for how teachers vary amongst each other. Likewise, we should acknowledge that students are nested in classes, which are nested in schools. Why? Because students in a class are often more alike than students from different classes, and this difference in variance can quite easily confound your study. Education is like a very large pyramid.
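To make the nesting problem concrete, here is a minimal simulation sketch in plain Python. Everything in it is an assumption chosen for illustration, not data from a real study: 10 classes (5 per method), 20 students per class, a shared class effect and individual noise that both have a standard deviation of 1, and no true difference between the methods at all. It compares a naive t-test that treats all students as independent with a t-test on the class means.

```python
import random

# Two-sided 5% critical values of Student's t for the default design below:
# 200 students -> df = 198, 10 classes -> df = 8.
# They are hardcoded, so they only match the default arguments.
T_CRIT_STUDENTS = 1.972
T_CRIT_CLASSES = 2.306


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_variance(a) + (nb - 1) * sample_variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def simulate_class(rng, n_students, class_sd, student_sd):
    """Scores for one class: a shared class effect plus individual noise."""
    class_effect = rng.gauss(0, class_sd)
    return [class_effect + rng.gauss(0, student_sd) for _ in range(n_students)]


def false_positive_rates(n_sims=2000, classes_per_method=5, n_students=20,
                         class_sd=1.0, student_sd=1.0, seed=1):
    """Share of 'significant' results when the methods do not differ at all."""
    rng = random.Random(seed)
    naive_hits = 0
    class_hits = 0
    for _ in range(n_sims):
        phonics = [simulate_class(rng, n_students, class_sd, student_sd)
                   for _ in range(classes_per_method)]
        whole = [simulate_class(rng, n_students, class_sd, student_sd)
                 for _ in range(classes_per_method)]
        # Naive analysis: pretend all students are independent observations.
        t_naive = t_statistic([x for c in phonics for x in c],
                              [x for c in whole for x in c])
        # Class-level analysis: one (mean) score per class.
        t_class = t_statistic([mean(c) for c in phonics],
                              [mean(c) for c in whole])
        naive_hits += abs(t_naive) > T_CRIT_STUDENTS
        class_hits += abs(t_class) > T_CRIT_CLASSES
    return naive_hits / n_sims, class_hits / n_sims
```

Calling `false_positive_rates()` returns two proportions. Under these assumptions the naive rate should come out far above the nominal 5%, while the class-level rate should stay close to it. Analysing class means is only the crudest way of respecting the nesting; a proper multilevel model does the same job more gracefully, but the sketch shows why ignoring the pyramid is not harmless.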

However…

… all of the above is a very one-sided perspective. Whilst a “SAMPLE ALL THE THINGS!” approach to study design is certainly admirable, it is often very costly, not always plausible, sometimes nonsensical, and there are other ways to justify generalizations.

In the ‘learning how to read’ example, it would certainly be better if the study included multiple teachers. However, you could also argue you not only need to sample teachers, but also schools. But why stop there? You could account for differences between school districts, and provinces, and … There are virtually unlimited hierarchical structures you could identify, but it is not plausible to account for all of them.

Sometimes it is nonsensical to talk about samples and populations. The phonics and whole language methods are nested in the population of “teaching methods for reading”. But what does this population look like? How many teaching methods are there? Even if you could define this population, how would you draw a random sample from it? This doesn’t make any sense. What is even more important, we do not want to generalize to other teaching methods.

Arguing for generalizations

In the ‘drinking beer’ example, it was not tested whether you also get a headache from other beers, but you can argue why this is (most probably) the case. Arguing for generalizations is a tricky thing; many researchers will disagree on which findings can or cannot be generalized. If anything, recent large (multilab) replication studies have shown that we should be very cautious about making claims about ‘true effects’ and assuming that they generalize (e.g. 1, 2). Strong claims need very strong evidence.

So why can we be confident that the beer-headache finding will generalize to any alcoholic substance? Because we know the mechanism.

While knowing the mechanism might not be the sole criterion, or even a sufficient one, it is certainly paramount to having strong evidence. Problematically, fields like education and psychology are (generally speaking) far from uncovering such mechanisms.

Lacking sufficient understanding of (possible) mechanisms, we should be very wary of claims about generalizations in these fields, and many others. There is one thing we can do though: gather more evidence and narrow the scope of our claims to match the provided evidence.

Don’t generalize to what you didn’t sample from.

Practical guidelines

  1. Whatever you study, it is likely that there is hierarchical structure in the data. Make a diagram of this structure! Decide which levels you do and do not want to generalize to.
  2. Be aware that even if you are not specifically interested in the effects of a certain level it can still influence your results, so you will often still need to account for it. If you can’t control it, measure it.
  3. I haven’t said anything about how to do multilevel analyses. Maybe I’ll blog about it in the future, but for now I want to direct you to your local statistician to help with planning and analyzing a study.
  4. Generalization is not just a boring statistical topic; it is (or should be) core to our theories. Whatever you study, it is always localized or restricted to a range of e.g. people or settings. To truly understand what you’re researching, you must also know the boundary conditions of when it doesn’t apply anymore.
  5. Don’t generalize to what you didn’t sample from.
  6. Don’t generalize to what you didn’t sample from.
  7. Don’t generalize to what you didn’t sample from. Unless you have strong arguments to allow an untested generalization.
  8. It is a lot easier to make such arguments if you have supporting data.
  9. You can’t sample from every relevant population, so at least be aware of the limitations of your study, argue why you might still be able to make generalizations (and end with a recommendation that future research should use multilevel designs; surely that will work, right?).
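As a small illustration of “if you can’t control it, measure it”, the sketch below estimates the intraclass correlation (ICC): the share of the variance in scores that sits between groups (e.g. classes) rather than within them. It assumes a balanced design in which every group has the same size, and it uses the classic one-way ANOVA estimator. The function name and the example scores are hypothetical.

```python
def intraclass_correlation(groups):
    """One-way ANOVA estimate of the ICC for equally sized groups.

    `groups` is a list of lists, one inner list of scores per group
    (for example, one list per class).
    """
    k = len(groups[0])
    if k < 2 or len(groups) < 2:
        raise ValueError("need at least 2 groups of at least 2 scores each")
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch assumes equally sized groups")

    n_groups = len(groups)
    grand_mean = sum(sum(g) for g in groups) / (n_groups * k)
    group_means = [sum(g) / k for g in groups]

    # Mean squares between and within groups.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means)
                    for x in g) / (n_groups * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical scores: students within a class are identical,
# and all variation is between classes.
scores_per_class = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(intraclass_correlation(scores_per_class))  # 1.0
```

An ICC near 0 suggests that the level hardly matters for this outcome; the further it is from 0, the more misleading an analysis that ignores the level is likely to be. This is a rough diagnostic only, not a substitute for the multilevel analysis itself (or for your local statistician).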

As always, if you have any suggestions for this practical guidelines list, let me know and I’ll add it!

References

 

 

 

4 thoughts on “Multilevel, or it didn’t happen.”

  1. A great post that I wish I’d had available before I tried to understand multi-level modelling for the first time. A couple of textual niggles (just to show I read it attentively!)

    Whatever you study, it is very unlikely that there is no hierarchical structure in the data.
    Um… typo? Did you mean “there is a hierarchical structure”.

    You did not sample teachers from the teachers population, therefor,
    “Therefor” is actually a word in English (it translates to “daarvoor”), but I think you wanted “therefore” (“daarom”).

    Education is like a very large pyramid scheme.
    Well, it’s like a very large pyramid. But calling things a pyramid scheme (i.e., multilevel marketing, which is often a form of fraud) is more contentious. Plenty of things in life do resemble pyramid schemes, but education as a whole doesn’t. (Well, actually, adult masters programmes come close sometimes…)

    • Thanks for your suggestions, and your attention! I’ve made all the edits except the first suggestion: I do actually think that almost any study topic has hierarchical data. I did remove the double negative in the sentence, maybe it’s more clear now.

    • Thanks for your comment. I agree in the sense that you typically shouldn’t (and can’t) declare the truth based on a single study. In addition to your points, a “p < 0.05” gives (too) little information regarding the size of the evidence (not to be mistaken for the size of the effect). But as far as the example goes, it does make it more probable that you will get a headache if you repeat the experiment. Different statistics are needed to quantify how much more probable (e.g. Bayes Factors), and the study should of course be well controlled, as I argued for here: https://timvanderzee.wordpress.com/2016/06/29/venturing-into-explanation-space/.

      The main point of the example was that you certainly can’t say that if someone else repeated the experiment, that person would also probably get a headache, as this study provides no evidence whatsoever that this would be the case.

      The (maybe only?) clear exception to this rule is if we have good understanding of the mechanism of the finding (e.g. based on previous studies). Given that we do know the mechanism, we can indeed make all those larger claims, but this specific study has added little to no evidence for this purpose.

      (Note that having multiple studies replicating the finding that drinking beer is related to getting a headache does not count as knowing the mechanism. Replications are vital, but knowing the mechanism is a different thing.)
