Correlation Doesn’t Imply Causation (But It Does Waggle Its Eyebrows Suggestively and Gesture Furtively While Mouthing “Look over There”)
Well, this weekend in my private tutoring ended up being all about correlation/causation issues, and I feel like sharing, so here you go:
The issue of correlation versus causation shows up on the LSAT, the GMAT, and the GRE, among other standardized tests. It’s a good idea to start by trying to understand correlation and causation generally. We’ll look at specific examples/applications and a full question analysis towards the end.
What is Correlation?
To say that two items are correlated is to say that they vary together. Correlation is a statistical concept. A correlation between two items means that if one item is present then the other item is more likely to be present than it otherwise would be. The presence of one item does not have to guarantee the presence of the other or even make the presence of the other item likely in any absolute sense.
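For anyone who likes to see the arithmetic, here is a minimal sketch of that definition. The counts are invented purely for illustration (they are not real birth statistics), and `conditional_rates` is a hypothetical helper, not a function from any library. It compares how often B shows up when A is present with how often B shows up when A is absent.

```python
def conditional_rates(both, a_only, b_only, neither):
    """Return (P(B | A), P(B | not A)) from the four counts of a 2x2 table."""
    rate_given_a = both / (both + a_only)
    rate_given_not_a = b_only / (b_only + neither)
    return rate_given_a, rate_given_not_a


# Hypothetical counts: A = some item is present, B = some other item is present.
with_a, without_a = conditional_rates(both=30, a_only=70, b_only=50, neither=850)

# A and B are (positively) correlated if B is more likely when A is present.
correlated = with_a > without_a
print(f"P(B | A) = {with_a:.2f}, P(B | not A) = {without_a:.2f}, correlated: {correlated}")
```

With these made-up numbers, B is several times more likely when A is present, yet it still shows up in well under half of the A cases. That is correlation without any guarantee, and without B being likely in an absolute sense.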
On standardized tests like the LSAT, statements like the following indicate correlation:
- “Babies born prematurely were more likely to have low birth weights and to suffer from health problems than were babies not born prematurely.” (Premature birth is correlated with low birth weight and health problems.)
- “People who drive cars equipped with antilock brakes have more accidents than those who drive cars not equipped with antilock brakes.” (Antilock brakes are correlated with accidents.)
- “The number of airplanes equipped with a new anticollision device has increased steadily during the past two years. During the same period, it has become increasingly common for key information about an airplane’s altitude and speed to disappear suddenly from air traffic controllers’ screens.” (The new anticollision device is correlated with the disappearance of key information.)
Correlation is not causation.
What is Causation?
Causation, in fact, is neither so easily defined nor so easily established. In the purest sense, causation means that the “effect” event would not have happened had the “cause” event not occurred. The problem with this sort of definition is that it is virtually always impossible to draw definitive conclusions about what would have happened if things had been different. These “counterfactuals” can never be known with 100% certainty. Furthermore, many useful more-flexible definitions of causation fail to meet this difficult standard. In practice, we will avoid focusing on any exact definition of causation, and will instead “know it when we see it.”
On standardized tests like the LSAT, statements like the following indicate causation:
- “Adequate prenatal care significantly decreases the risk of low birth weight babies.”
- “Experiencing an earthquake can cause people to dream about earthquakes.”
- “Banks will lend more money if those standards are relaxed.”
Importantly, we must also recognize what causation is not. First, causation is not correlation, and we must be able to identify statements of correlation (see above) and to distinguish them from statements of causation. Second, causation is not the stuff of formal logical arrow diagrams and if/then statements – causation is not implication. To say that “smoking ‘causes’ cancer” does not mean that “if you smoke, then you will get cancer.” Causation as it is used on these tests can exist even if the cause in question does not always produce the effect. Third, causation is not dependent on the effect in question always being a product of the given cause. That is, something can be a cause of a given effect even if it is not the only cause capable of producing that effect. All of these points are regularly represented in trap wrong answers on these tests.
How Can Correlation be Explained?
Given that correlation does not imply causation but certainly hints at it, many questions are built around presenting correlation and inviting an inference of causation. Thus, it is useful to investigate the possible explanations of an observed correlation, most of which involve causation of one sort or another.
Ultimately, any correlation can be explained in one of just five ways.
- 1. A causes B. Given that two events A and B are correlated, it could be the case that A does, in fact, cause B. Even though simple A-to-B causation is not the only explanation of an observed correlation, we should not forget that causation is one possible explanation for that correlation. If smoking and lung cancer are correlated, maybe it’s because smoking really does cause lung cancer.
- 2. B causes A. Another possibility is that causation exists between A and B but runs in the opposite direction. Maybe it’s not that smoking causes lung cancer but instead that lung cancer causes smoking. This particular example seems highly unlikely for reasons both practical and theoretical (to be discussed below), but it is important as a general matter to bear in mind that causation can run in either direction and that the “obvious” direction isn’t always the right one.
- 3. C causes A and B. If some third factor actually caused both A and B, then a correlation could be observed between A and B even though those two events had no direct connection at all. If drug use and teen pregnancy are correlated, maybe neither actually causes the other, but, instead, rebellious personality traits (or broken homes, or poverty, or lack of positive role models, or whatever) cause both. Given any correlation between items A and B, an infinite number of possible third factors lurk and threaten to throw a wrench into straightforward A-to-B causation.
- 4. A Mix of the Above. Causation is complicated, and it will often be the case that more than one factor has a role in creating an observed correlation. Perhaps causation runs bidirectionally, in a sort of feedback loop – A promotes B, which promotes A, which promotes B, and so forth. Two people can end up at lunch together without either person being the exclusive cause of their mutual presence. Perhaps multiple third factors have parts to play. Such more-complex relationships are less often tested, but they are equally valid as explanations of correlations between variables.
- 5. Coincidence. Finally, sometimes an observed correlation is simply a result of random chance. If two people are both present in a given coffee shop once or twice, and then both are absent at a couple of other times, it might mean something, but then again, it might just be a fluke.
How Can We Determine the Right Explanation for a Correlation?
Okay, so given a particular A-and-B correlation, which of the above five explanations is the right one? Fortunately, each possible explanation comes along with a method of ruling it out. These tools make it possible to parse among these five possibilities and try to establish what’s really going on.
One aspect of causation that is not required of either implication or correlation is time ordering. Causes happen before effects. Simply showing that a given causal or reverse causal relationship is out of time order is sufficient to reject it. It is highly unlikely that lung cancer causes smoking not only because such a relationship has no obvious mechanism (how and why would lung cancer cause smoking?) but also, more fundamentally, because the smoking usually happens first.
This time ordering issue extends as well to analysis of possible third factors, but we also have a second tool to handle those cases. That tool is known as a “control,” and it involves holding the third factor fixed to see if the A-and-B correlation persists. If so, then it can be concluded, at least, that the third factor is not the only cause involved in producing the A-and-B correlation. For example, in order to determine whether the connection between teen pregnancy and drug use is caused by a third factor like rebelliousness, one could study only individuals with similar levels of rebelliousness. In those cases, does the drug use/teen pregnancy correlation persist? If not, then rebelliousness may well have been the true cause. If so, then something beyond rebelliousness must be going on. This other something could be causation between drug use and teen pregnancy, but it could also be some other third factor (broken homes? poverty?). Thus, even a “favorable” result does not establish the causal link between A and B.
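A control can be sketched with a little arithmetic. Everything below is hypothetical: the counts are invented, the labels simply echo the rebelliousness example, and `rate_gap` is a made-up helper. The idea is to measure the A-and-B correlation once with everyone pooled together and then again within each group that shares the same level of the third factor.

```python
def rate_gap(both, a_only, b_only, neither):
    """P(B | A) minus P(B | not A) for one 2x2 table of counts."""
    return both / (both + a_only) - b_only / (b_only + neither)


# Hypothetical counts per level of the third factor, in the order
# (A and B, A only, B only, neither).
strata = {
    "high rebelliousness": (48, 32, 12, 8),
    "low rebelliousness": (2, 18, 8, 72),
}

# Pooling ignores the third factor: add the counts column by column.
pooled = tuple(sum(column) for column in zip(*strata.values()))
pooled_gap = rate_gap(*pooled)

# Controlling holds the third factor fixed: look inside each group separately.
stratum_gaps = {label: rate_gap(*counts) for label, counts in strata.items()}

print(f"pooled gap: {pooled_gap:+.2f}")
for label, gap in stratum_gaps.items():
    print(f"{label} gap: {gap:+.2f}")
```

In this invented data set the pooled gap is clearly positive, but the gap within each group is zero, which is the pattern you would expect if the third factor alone produced the correlation. Had the gap persisted within the groups, something beyond that third factor would have to be going on.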
The above methods work for mixed causation as well, because mixed causation is just a sum of the possibilities discussed above.
Finally, the possibility that a correlation is merely a coincidence can never be ruled out entirely, but the use of multiple trials can render coincidence extremely unlikely. Seeing two people both present at a coffee shop once or twice and both absent on a couple of other occasions, completely by chance, is not especially likely, but it is certainly plausible. However, seeing that same pattern hold 90% of the time across hundreds of observations would lead any reasonable observer to conclude that something more than mere coincidence was at play. On tests like the LSAT, mention of a scientific study is generally sufficient to imply the multiple trials needed to effectively rule out coincidence as an explanation for a correlation.
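To put rough numbers on the coffee shop intuition, here is a small sketch under loudly stated assumptions: each observation is independent, and by pure chance the two people "match" (both present or both absent) half the time. Neither assumption comes from any real data, and `chance_of_at_least` is a hypothetical helper that simply adds up the binomial tail.

```python
from math import comb


def chance_of_at_least(matches, trials, p_match=0.5):
    """Probability of at least `matches` matching observations out of `trials`
    if every observation matches independently with probability `p_match`."""
    return sum(
        comb(trials, k) * p_match**k * (1 - p_match) ** (trials - k)
        for k in range(matches, trials + 1)
    )


# Assumed scenario: a match on all 4 of 4 visits versus 180 of 200 visits (90%).
few = chance_of_at_least(4, 4)
many = chance_of_at_least(180, 200)

print(f"4 of 4 by chance: {few:.4f}")
print(f"180 of 200 by chance: {many:.2e}")
```

Under these assumptions, four matches out of four happen by chance about 6% of the time, which is plausible enough to shrug off. A 90% match rate across 200 observations is so improbable by chance that coincidence stops being a serious explanation.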
How Can These Ideas Be Used on the Test?
Solving a correlation/causation question typically comes down to evaluating the various possible explanations for a given correlation and applying the above analysis. These questions often establish correlation and proceed to conclude that causation is present. Answer choices that don’t speak to any of the five possible explanations of correlation should generally be eliminated or ignored.
If the question calls for strengthening the inference of causation, a correct answer could be one that uses the above techniques to rule out other possible explanations of the correlation. Alternatively, a premise that showed that the inferred causation was in time order would also strengthen the conclusion and be a right answer.
If the question instead calls for weakening the conclusion, an answer that provides or supports another plausible explanation for the correlation would be correct, as would an answer demonstrating that the inferred causal relationship is out of time order.
In short, look through the answer choices, ignoring anything other than the above-listed explanations of correlation, and find one (the right answer) that applies one of those explanations to either strengthen or weaken the inference of causation (as appropriate). It is useful to have an idea in mind before turning to the answer choices, but be aware that there is generally more than one way to strengthen or weaken an inference of causation.
Want to See an LSAT Example?
Sure! Thanks for asking!
Okay, let’s look at LSAT PrepTest 30, Section 4, Question 17 (Next 10 LSAT, page 74).
This is a weaken question about correlation and causation. The first sentence is a premise, and it establishes a correlation between heavy coffee drinking and heart disease. Notably, it also explicitly rules out smoking and age as possible third factor explanations. The test-taker should make a mental note at this point to the effect that any answer choice discussing smoking or age is almost certainly wrong – and such an answer choice is highly likely to appear, since the question-writer went to the trouble of mentioning the correction for these two factors. The last sentence, while not exactly a conclusion, states an inference of causation; surely the researchers limited their coffee drinking because they thought that excessive coffee drinking was a cause of heart disease. The question stem makes this inference of causation more explicit (note the use of the word “result” toward the end) and establishes the question type as “weaken.”
Before turning to the answer choices, let’s speculate a bit about what we might be looking for.
To weaken the correlation-to-causation inference of the argument, we could establish that causation is impossible based on incorrect time order. That is, we could show that heart disease precedes, rather than follows, coffee drinking. This explanation would certainly work, if present in an answer choice, but it seems unlikely that we’ll find an answer along these lines given what we know about heart disease and coffee consumption (i.e. in real life, coffee-drinking usually happens first).
A correct answer could also offer an alternative explanation for the coffee-heart disease correlation. The options are reverse causation (heart disease causes excessive coffee drinking), third cause (something else causes both heart disease and excessive coffee drinking — but remember, we can’t use smoking or age!), a combination of the above, or coincidence. Reverse causation seems unlikely, both because of time order and on the basis of common sense. If an answer choice provided reverse causation, it would be correct, but such an answer choice will probably not be forthcoming. Third cause is very possible. What factors other than age and smoking could influence both coffee drinking and heart disease? A combination of the above won’t show up often at all, but if it does the analysis is just a combination of the analysis above. Finally, coincidence can be ruled out here; the question noted researchers and implied the presence of a study, and such a study can be assumed to provide enough trials to make coincidence implausible.
Given this analysis, we might expect the correct answer to feature an alternative cause of coffee drinking and heart disease – but one other than smoking or age.
Let’s see those answer choices.
- Answer A doesn’t speak to any part of the correlation/causation analysis. It’s not really worth speculating about how Answer A might apply to the argument, at least until the other answer choices have been considered. Ignore it for now, with the understanding that it’s almost certainly not the right answer.
- Answer B mentions third factors – soft drinks and health worries – that reduce coffee drinking. However, it’s not clear how either of these factors interacts with heart disease, especially to reduce it. And if they indeed reduce coffee drinking, then they ought to reduce heart disease as well, if they are to explain the observed coffee-heart disease correlation. Don’t strain too hard speculating on unstated, indirect causal chains. Let’s keep looking.
- Answer C presents a different third factor – stress. Stress is established as “a major causal factor in heart disease.” Even better, it is very reasonable to think that high stress could lead people to drink coffee in excess. Answer C notes that the researchers did not study any possible connection between these two factors. Absent such study, it seems just as likely that stress is behind the coffee-heart disease correlation as it is that coffee is the ultimate cause acting in this case. This sort of third factor is exactly what we were looking for, and Answer C is correct.
- Answer D mentions smoking. Smoking was ruled out, so Answer D is wrong. The end.
- Answer E throws an intermediate cause into the chain. Apparently, coffee drinking causes high blood cholesterol, and high blood cholesterol causes (or at least “indicates”) heart disease. Causation can involve multiple steps, so the presence of an intermediate cause in no way weakens the inference that coffee drinking causes heart disease. In fact, by establishing a specific mechanism explaining that causation, Answer E powerfully strengthens the conclusion of this argument. That’s great – except that this is a weaken question. Note that it’s typical for one wrong answer in a weaken question to strengthen the conclusion instead. This is that wrong answer.
In summary, this question asked us to examine a conclusion of causation based on a premise establishing correlation, to consider possible alternative explanations for the observed correlation, and to browse through the answer choices until coming to one that offered such an explanation. We ignored answer choices not falling into any of the categories of explanations or the methods for ruling out those same explanations, avoided any answer relying on specifically excluded factors, and dodged the typical weaken answer choice that cut the wrong way (strengthening the conclusion instead of weakening it). Those steps led to the correct answer, which was C.
Additional examples of the correlation/causation issue can be found in the following LSAT and GMAT questions:
- LSAT PrepTest 29, Section 4, Question 20 (Next 10 LSAT, page 41)
- LSAT PrepTest 29, Section 4, Question 24 (Next 10 LSAT, page 43)
- LSAT PrepTest 30, Section 2, Question 12 (Next 10 LSAT, page 57)
- LSAT PrepTest 30, Section 2, Question 15 (Next 10 LSAT, page 58)
- LSAT PrepTest 30, Section 2, Question 25 (Next 10 LSAT, page 61)
- LSAT PrepTest 30, Section 4, Question 11 (Next 10 LSAT, page 73)
- LSAT PrepTest 31, Section 2, Question 9 (Next 10 LSAT, page 90)
- LSAT PrepTest 31, Section 3, Question 9 (Next 10 LSAT, page 98)
- LSAT PrepTest 32, Section 4, Question 1 (Next 10 LSAT, page 138)
- LSAT PrepTest 33, Section 1, Question 25 (Next 10 LSAT, page 159)
- LSAT PrepTest 33, Section 3, Question 20 (Next 10 LSAT, page 173)
- LSAT PrepTest 35, Section 1, Question 9 (Next 10 LSAT, page 222)
- LSAT PrepTest 35, Section 4, Question 24 (Next 10 LSAT, page 246)
- LSAT PrepTest 37, Section 2, Question 14 (Next 10 LSAT, page 299)
- LSAT PrepTest 37, Section 4, Question 17 (Next 10 LSAT, page 312)
- LSAT PrepTest 38, Section 1, Question 13 (Next 10 LSAT, page 325)
- LSAT PrepTest 52, Section 1, Question 2 (10 New LSAT, page 8)
- LSAT PrepTest 53, Section 1, Question 17 (10 New LSAT, page 48)
- LSAT PrepTest 54, Section 2, Question 14 (10 New LSAT, page 91)
- LSAT PrepTest 55, Section 1, Question 7 (10 New LSAT, page 118)
- LSAT PrepTest 55, Section 1, Question 22 (10 New LSAT, page 122)
- LSAT PrepTest 55, Section 3, Question 9 (10 New LSAT, page 134)
- LSAT PrepTest 56, Section 3, Question 17 (10 New LSAT, page 168)
- LSAT PrepTest 57, Section 2, Question 6 (10 New LSAT, page 194)
- LSAT PrepTest 57, Section 2, Question 14 (10 New LSAT, page 196)
- LSAT PrepTest 57, Section 2, Question 20 (10 New LSAT, page 198)
- LSAT PrepTest 57, Section 3, Question 4 (10 New LSAT, page 201)
- LSAT PrepTest 58, Section 1, Question 11 (10 New LSAT, page 227)
- LSAT PrepTest 58, Section 4, Question 20 (10 New LSAT, page 249)
- LSAT PrepTest 59, Section 2, Question 1 (10 New LSAT, page 264)
- LSAT PrepTest 59, Section 2, Question 4 (10 New LSAT, page 265)
- LSAT PrepTest 59, Section 2, Question 8 (10 New LSAT, page 266)
- LSAT PrepTest 59, Section 2, Question 11 (10 New LSAT, page 267)
- LSAT PrepTest 59, Section 2, Question 22 (10 New LSAT, page 269)
- LSAT PrepTest 60, Section 1, Question 14 (10 New LSAT, page 299)
- LSAT PrepTest 60, Section 3, Question 4 (10 New LSAT, page 309)
- LSAT PrepTest 61, Section 4, Question 4 (10 New LSAT, page 352)
- The Official Guide for GMAT Review, 13th Edition by the Graduate Management Admission Council — Section 8.4: Critical Reasoning Practice Questions: #19 (page 505); #37 (page 511); #55 (pages 516-517); #82 (page 525); #115 (page 535); #118 (page 536)
- GMAT Verbal Review, 2nd Edition by the Graduate Management Admission Council — Critical Reasoning Sample Questions: #5 (page 117); #7 (page 118); #12 (page 120); #20 (page 122); #30 (page 126); #33 (page 128); #46 (page 134); #47 (page 135); #55 (page 138); #59 (page 140); #62 (page 141)
For even more about the correlation/causation fallacy, check out the Wikipedia article http://en.wikipedia.org/wiki/Correlation_Does_Not_Imply_Causation.
Finally, thanks to the wonderful Randall Munroe for the title quote. For his excellent comic on the subject, check out http://xkcd.com/552/.