Correlation Does Not Imply Causation


Last week’s post about the ideal length for all kinds of different digital content sparked an offline debate about the differences between correlation and causation. Understanding the differences is crucial to many forms of analytics, and some people still get the two muddled up.

In simple terms, a correlation is a statistical relationship between two random variables or data sets. For example, it just so happens that there’s a very tight correlation between the Per Capita Consumption of Cheese in the USA and the Number of People Who Die By Becoming Tangled In Their Bedsheets. Here’s a chart to prove it:

[Chart: Spurious Correlation]

Here, the correlation is very high at 0.947091 (the closer a correlation coefficient is to 1 or -1, the stronger the linear relationship between the data sets; a value near 0 means little or no linear relationship).
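For anyone curious how a figure like this is calculated, here is a minimal sketch of the Pearson correlation coefficient in Python. I am assuming Pearson's r is the measure behind the chart, which the chart itself doesn't confirm. The function name pearson_r and the three short series are made up for illustration; they are not the cheese or bedsheet figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-movement of the two series around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each series around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative numbers, NOT real cheese or bedsheet data.
series_a = [1.0, 2.0, 3.0, 4.0, 5.0]
series_b = [2.0, 4.0, 6.0, 8.0, 10.0]  # rises in lockstep with series_a
series_c = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls in lockstep with series_a

r_up = pearson_r(series_a, series_b)    # perfect positive relationship
r_down = pearson_r(series_a, series_c)  # perfect negative relationship
print(r_up, r_down)
```

Note that the calculation only measures how closely two sets of numbers move together. Nothing in it knows, or cares, what the numbers represent.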

But, of course, this is purely a chance correlation. There’s no logical reason why more or fewer people should die tangled in their bedsheets if people eat more or less cheese, or vice versa. That’s like observing that all elephants have four legs and that zebras also have four legs, and therefore deducing that all zebras must be elephants.
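Chance correlations like this are easy to come by if you compare enough data sets. As a rough sketch, the Python below generates 50 completely unrelated random series and then hunts for the pair that happens to line up best. Everything here is an assumption for illustration: the series are simulated random walks, and the seed, the series count and the length of 10 points are arbitrary choices of mine.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, steps):
    """A series that drifts up or down by a random amount at each step."""
    level, walk = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0.0, 1.0)
        walk.append(level)
    return walk


rng = random.Random(42)  # fixed seed so the run is repeatable
# 50 series generated independently: none of them can influence another.
series = [random_walk(rng, 10) for _ in range(50)]

# Check all 1,225 possible pairs and keep the strongest correlation found.
best_r, best_pair = max(
    (
        (pearson_r(series[i], series[j]), (i, j))
        for i, j in itertools.combinations(range(len(series)), 2)
    ),
    key=lambda item: abs(item[0]),
)
print(f"strongest correlation: r = {best_r:.3f} between series {best_pair}")
```

The winning pair will typically look impressively related, even though we know for certain that neither series had anything to do with the other. Search through enough unrelated data and a cheese-and-bedsheets match is almost bound to turn up.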

To save us from this zoological confusion, the crucial consideration of causation comes in. Whenever a correlation is found, for example between tweet length and the number of retweets received, we also have to ask ourselves whether one thing may actually have caused the other. Is it plausible that altering the length of a tweet will have a causal effect on the likelihood of it being retweeted? The answer is ‘almost certainly not,’ because there are far more variables at play, like the nature of the tweet, the words used, the time it is sent, its social currency at the time and so on. And, of course, no one has yet found a credible explanation of why an independent choice like tweet length could influence a reciprocal and direct retweeting act elsewhere.

Whenever we encounter correlation in our data analyses, we must take care not to assume causation. At most, correlation is a hint that causation may exist. Once we have observed a correlation, we must dig into additional research and datasets before we can conclude that causation actually exists.
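One way to probe further is to change one variable ourselves and see whether the other responds. The sketch below simulates that idea in Python under assumptions that are entirely mine: a hidden factor drives both x and y, so the two correlate strongly in the observed data, but x has no effect on y at all. When x is set independently of the hidden factor, the correlation should fall away. The variable names, noise levels and sample size are all hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# A hidden factor that we never observe directly.
hidden = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Observed data: x and y each follow the hidden factor plus their own noise.
# Neither one influences the other.
x_observed = [h + rng.gauss(0.0, 0.5) for h in hidden]
y_observed = [h + rng.gauss(0.0, 0.5) for h in hidden]
r_observed = pearson_r(x_observed, y_observed)

# Experiment: we choose x ourselves, ignoring the hidden factor.
# y is generated exactly as before, so it cannot respond to our choice.
x_chosen = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_after = [h + rng.gauss(0.0, 0.5) for h in hidden]
r_experiment = pearson_r(x_chosen, y_after)

print(f"observed data:  r = {r_observed:.3f}")
print(f"our experiment: r = {r_experiment:.3f}")
```

In this made-up world, the strong correlation in the observed data says nothing about what would happen if we changed x. Real data is rarely this tidy, but the principle is the same: a correlation earns the label of causation only after it survives this kind of test.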

For my explanation of correlation for dog owners, click here.
For more entertaining ‘Spurious Correlations’, see Tyler Vigen’s blog here.