On Outliers: What They Represent, and Why the Central Limit Theorem Is Typically Off
by Danielle Fong
The central limit theorem states that if you have many small, independent random variables, each with finite variance, then their sum is distributed approximately as a bell curve. Strikingly, almost everything is made up of many small parts, and these parts don’t tend to influence each other very much.
So much of what we can measure seems to fit a bell curve. This is why the normal distribution works. Because this assumption tends to work well, it is usually taken as a matter of course. Students are taught it, lecturers preach it, researchers apply it, and startlingly few stop to question it.
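To see the theorem at work, here is a minimal simulation sketch. The choice of uniform draws, 50 parts per sum, and 20,000 trials are illustrative assumptions of mine, and the function names are made up for this example. For a true bell curve, the fractions of outcomes within 1, 2, and 3 standard deviations are about 0.683, 0.954, and 0.997; the simulated sums should land close to those values.

```python
import random
import statistics


def sum_of_small_parts(n_parts, rng):
    """Add up n_parts independent draws from Uniform(-1, 1)."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(n_parts))


def fraction_within(samples, k):
    """Fraction of samples within k sample standard deviations of the sample mean."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return sum(abs(x - mu) <= k * sigma for x in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
sums = [sum_of_small_parts(50, rng) for _ in range(20_000)]

for k in (1, 2, 3):
    # Compare against the bell-curve values 0.683, 0.954, 0.997.
    print(k, round(fraction_within(sums, k), 3))
```

Each individual draw is flat, not bell-shaped; the bell only appears once many of them are added together.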
Suppose the variables are not small, or suppose they’re not independent. Suppose, under certain conditions, the value of one variable would seriously affect another. Suppose we’re talking about the buildup of snow on a mountain slope. Most of the time, snowflakes can gradually build, without significant effect. But once enough builds, you don’t find snowflakes resting calmly upon a drift. What you find is an avalanche.
The sum total of snowflake movement isn’t what we might expect. The snowflakes on top used to be lightly packed down by the new snow gradually falling on them. The snowflakes on the bottom used to just sit there. But they’re not just sitting there. They’re moving fast, and they’re moving down.
The central limit theorem doesn’t always apply! If you have a model where, most of the time, a change in one part doesn’t affect another much, but some of the time it really does, then you can’t assume that your outcomes will follow a bell curve. Entirely different outcomes are possible.
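A toy model makes the point concrete. In the sketch below, which is my own illustration and not a model of real snow, 100 parts each move by +1 or -1. Usually they move independently, but with an assumed 1% chance they all move the same way at once: an avalanche. The names `coupled_sum`, `tail_fraction`, and `normal_tail` are hypothetical helpers. The script compares how often the total lands more than five standard deviations from its mean with what a bell curve would predict.

```python
import math
import random
import statistics


def coupled_sum(n_parts, p_avalanche, rng):
    """Sum of n_parts steps of +1 or -1.

    Usually the steps are independent. With probability p_avalanche,
    every step shares one direction (the 'avalanche' case).
    """
    if rng.random() < p_avalanche:
        return n_parts * rng.choice((-1, 1))
    return sum(rng.choice((-1, 1)) for _ in range(n_parts))


def tail_fraction(samples, k):
    """Observed fraction of samples more than k sample std devs from the mean."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return sum(abs(x - mu) > k * sigma for x in samples) / len(samples)


def normal_tail(k):
    """Two-sided probability that a normal variable is more than k sigma from its mean."""
    return math.erfc(k / math.sqrt(2.0))


rng = random.Random(1)  # fixed seed so the run is repeatable
samples = [coupled_sum(100, 0.01, rng) for _ in range(10_000)]

print("observed beyond 5 sigma:", tail_fraction(samples, 5))
print("bell curve prediction:  ", normal_tail(5))
```

The rare coupled trials are the only ones that reach the far tail, and they get there far more often than the bell curve allows.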
Traditionally, when you have some measurement that’s far outside of normal, you call this an outlier. In much statistical analysis, these are ignored or thrown out (for example, even the most extraordinary measurements will barely affect a median). This is useful if you want to study the ordinary behavior of something. But sometimes, the information you gain from outliers is by far the most interesting. Some of the outliers are expected in normal distributions (bell curves). But some of them are outliers because the model doesn’t apply. Some outliers are avalanches.
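As a small illustration of how a median shrugs off an extreme value while a mean does not, consider the following sketch. The readings are made-up numbers chosen only for the example.

```python
import statistics

# Hypothetical measurements clustered around 10, plus one extreme value.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]
with_outlier = readings + [1000.0]

# How far each summary moves when the extreme value is included.
mean_shift = abs(statistics.fmean(with_outlier) - statistics.fmean(readings))
median_shift = abs(statistics.median(with_outlier) - statistics.median(readings))

print("mean moved by:  ", mean_shift)
print("median moved by:", median_shift)
```

The median tells you about the ordinary readings and nothing about the avalanche, which is exactly why it is the wrong tool when the avalanche is what you care about.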
We started from an iffy assumption. Not everything is made up of independent random variables. Parts affect each other. Sometimes it’s violent, like a chain reaction, or our avalanche. And sometimes, it’s magical. Life is an example of this. If the chemicals that make up our bodies didn’t bond so strongly to one another, our DNA would unwind, and you’d be a puddle.
Statisticians try to account for this. One example on the tips of our tongues lately: finance. Roughly speaking, the ‘beta’ used in finance measures how strongly a stock’s returns move with a benchmark portfolio, which lets the linear correlation with that portfolio be factored out of the stock’s volatility. Despite this, there seem to be ‘six-sigma events’ happening all the time, things which, according to the theories, and ‘ordinary’ data, really shouldn’t be happening at all.[1] What’s going on?
When small effects just add up simply, you can model them by what’s called a linear model (so called because if you plot the total against the size of the input on an XY graph, you’ll get a straight line). This doesn’t always work, and in fact, most interesting phenomena are non-linear.
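One way to see the difference is to check whether effects add: in a linear model, the response to two loads together equals the sum of the responses to each alone. The sketch below is a made-up illustration; the function names, the slope, the threshold, and the size of the release are all assumptions chosen for the example.

```python
def linear_response(load, slope=0.5):
    """Response grows in proportion to the load."""
    return slope * load


def avalanche_response(load, threshold=10.0, slope=0.5, release=100.0):
    """Proportional below the threshold, then a sudden large release."""
    return slope * load if load < threshold else release


def is_additive(f, a, b, tol=1e-9):
    """True if f(a + b) equals f(a) + f(b), up to a small tolerance."""
    return abs(f(a + b) - (f(a) + f(b))) <= tol


print(is_additive(linear_response, 6.0, 6.0))     # small effects simply add
print(is_additive(avalanche_response, 2.0, 3.0))  # still additive below the threshold
print(is_additive(avalanche_response, 6.0, 6.0))  # crossing the threshold breaks additivity
```

Below the threshold the two models are indistinguishable, which is part of why a linear model can look fine right up until it fails.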
Models are limited. They break down. And one can’t really account for every possibility. Financial markets can collapse because of dust bowls, with the collapse furthered by widespread investor panic. The destructive power of armies during world wars can be dwarfed by exhaustion and a powerful flu, though normally flu is beaten with chicken soup.[2] And planets full of ten-story-tall reptiles can be wiped out by meteorites. No number of small rocks would matter; usually they just ping off. Bet they didn’t see that coming.
The next time you hear about your monthly six, or seven, or eight sigma event, keep this in mind. Outliers are where the model breaks down. They happen more often than standard models would expect, and they often point out problems in the understanding of the system. If they start happening all the time, start mistrusting your statisticians. And your Wall Street.[4]
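For a sense of scale, here is a rough calculation of how rare such events ought to be if daily changes really followed a bell curve. The figure of 252 trading days per year is an assumption, as is treating each day as an independent draw; the helper names are made up for this sketch.

```python
import math

TRADING_DAYS_PER_YEAR = 252  # assumption: a typical count of market days


def two_sided_tail(k):
    """Probability that a normal variable lands more than k sigma from its mean."""
    return math.erfc(k / math.sqrt(2.0))


def expected_years_between(k, per_year=TRADING_DAYS_PER_YEAR):
    """Average wait, in years, between k-sigma days under the bell-curve assumption."""
    return 1.0 / (two_sided_tail(k) * per_year)


for k in (3, 4, 5, 6):
    print(k, two_sided_tail(k), expected_years_between(k))
```

Under these assumptions a six-sigma day should turn up about once in two million years. If you are seeing one every month, the sensible conclusion is that the bell curve is the wrong model, not that you are extraordinarily unlucky.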
One can show the recursiveness of the model verification problem. Suppose one says to you: drug x worked better than drug y, with a 95% confidence. You might reply: what’s the confidence of that confidence? Somewhere along the line, they’ll have to say 100%, and then they need to prove consistency. One can show that theories with unknown external variables and relationships (as are dealt with in statistics) aren’t even formally recursively enumerable,[5] but even if they were, no such theory could prove its own consistency.
Back on planet Earth, is there anything we should be worried about? Some who talk about the climate claim that, since basically the temperature of the earth just goes up and down naturally, it’s not really anything to worry about. But outliers can mark where models break down. Where systems change. This is a chart of temperature over the past few centuries. You may notice, to the right, an outlier. Maybe it’s telling us something.[6]
“Now let’s talk about efficient market theory, a wonderful economic doctrine that had a long vogue in spite of the experience of Berkshire Hathaway. In fact one of the economists who won — he shared a Nobel Prize — and as he looked at Berkshire Hathaway year after year, which people would throw in his face as saying maybe the market isn’t quite as efficient as you think, he said, “Well, it’s a two-sigma event.” And then he said we were a three-sigma event. And then he said we were a four-sigma event. And he finally got up to six sigmas — better to add a sigma than change a theory, just because the evidence comes in differently. [Laughter] And, of course, when this share of a Nobel Prize went into money management himself, he sank like a stone.”
5 – As Wikipedia states, a recursively enumerable language is a formal language for which there exists a Turing machine (or other computable function) which will enumerate all valid strings of the language.
6 – (update) For further musings on uncertainty and randomness, a terrific book is Nassim Taleb’s The Black Swan. A related essay is available here. Also intriguing is the working paper “Extreme Sample Selection Bias: Conditions that Cause the Correlation Between Two Variables to Switch Signs” by Tim Groseclose.