Friday, June 12, 2015

SD v. SE (of the mean)

A few days ago I wrote an article criticizing a post from Dr. Ross McKitrick about Karl et al. (2015), and one of the points I brought up was that McKitrick's use of the standard deviation of the sample measurements to describe the uncertainty of the mean was incorrect.  I said that the standard error of the mean is the correct figure to use instead.  Several commenters at other blogs (such as here and here, examples I think from the same person) have either asked which of us is correct, or otherwise suggested that I am incorrect.  From the comments that McKitrick and I exchanged on my post, it seems we agree that the standard error of the mean is the correct measure; however, some area-weighting issues that McKitrick pointed out do remain, and I have very briefly talked about them here.

I think some may still be confused about the difference between the standard deviation and the standard error of the mean, though.  This post will help illustrate that difference, and the purpose each serves.


Distributions


Let's say that I want to measure something.  Maybe I want to know the fraction of income that people spend on rent.  Maybe I want to know how a group of people score on an eye exam, scoring each person on a scale centered at zero (perfect vision), with negative numbers meaning more near-sighted and positive numbers more far-sighted.  Maybe I want to know the income distribution of some population, measured in tens of thousands of dollars.  These all have to do with people, but they don't have to: the fraction of days in a year that regions get rainfall, the distance an ant may travel from its colony in a day, and so on.

The graphs below are all potential distributions for some of these questions.  They represent what we can think of as "reality": if we sampled everyone, or everything, these are potential probability distributions that we would come up with; they're the probability distributions underlying what looks to us like randomness in our sampling.  The top distributions are examples of Beta distributions, and range from 0 to 1 (a uniform distribution is a special case of a beta distribution).  The middle row shows Normal/Gaussian distributions, and the last row shows Gamma distributions.
The vertical bars in each graph are the means of these distributions.  They were calculated from simple equations you can find online that depend on the parameters for each distribution.  In other words, each distribution gives us a unique mean.  If we know the distribution, we know the mean (clearly we can't do this the other way around).
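If you'd like to play along at home, here is a minimal sketch in Python (with SciPy) of pulling a mean straight out of a distribution's parameters.  The parameter values below are placeholders for illustration, not the ones behind the figures above.

```python
from scipy import stats

# Frozen distributions; the parameter choices here are arbitrary illustrations.
beta_dist   = stats.beta(a=2.0, b=5.0)        # Beta: mean = a / (a + b)
normal_dist = stats.norm(loc=0.0, scale=1.5)  # Normal: mean = loc
gamma_dist  = stats.gamma(a=3.0, scale=2.0)   # Gamma: mean = shape * scale

for name, dist in [("beta", beta_dist), ("normal", normal_dist), ("gamma", gamma_dist)]:
    print(f"{name}: mean = {dist.mean():.4f}")
```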

Let's say I want to figure out the fraction of income people spend on rent, and the "reality" is the distribution given by the gray Beta distribution.  I start my sampling and record my observations after 100 people sampled, 10,000 people sampled, and 1,000,000 people sampled.  Each person is a random drawing from the "real" distribution above.  These results are graphed in a series of histograms below.

The curved black line is the actual distribution, the outline of the curve in the previous figure.  The histogram bars outline the distribution of the sampled values.  The horizontal lines above each histogram indicate a straight 2-times-standard-deviation interval for the observed data (gray lines), and the bounds of the mean-centered 95% of those sampled values (black lines).  The dots are the sample means.

After only 100 samples I can see that most people spend less than half of their income on rent, and they tend to cluster around 1/4 – 1/3.  After 10,000 people sampled, I get a much better idea of the distribution, and after a million people I have a very smooth sampling distribution that has converged to the actual distribution.  Now I don't, of course, know what the actual distribution is when I'm doing my sampling of people.  I just know that the more data I get, the more I learn about this distribution.  A very discerning eye will pick up that the sample mean also converges on the actual mean.
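Here is a rough sketch of that sampling experiment in Python.  I'm assuming a Beta(2, 5) as a stand-in for the "real" rent-fraction distribution, since its standard deviation matches the 0.1597 figure quoted further down; treat the exact parameters as my guess rather than gospel.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real = stats.beta(a=2.0, b=5.0)   # assumed stand-in for the "real" distribution

for n in (100, 10_000, 1_000_000):
    sample = rng.beta(2.0, 5.0, size=n)
    # The sample mean drifts toward the distribution mean as n grows, while the
    # sample SD settles near the distribution SD rather than shrinking.
    print(f"n={n:>9,}: mean={sample.mean():.4f}  sd={sample.std(ddof=1):.4f}"
          f"  (true mean={real.mean():.4f}, true sd={real.std():.4f})")
```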

This is what sampling helps us do.  It helps us determine, with better certainty, mathematical characteristics of the actual distribution, such as the mean, or the median, or the mode, and so on.  We can see the same thing with some other distributions; take for instance the blue normal distribution, or the red gamma distribution from above.
(A minor point, too: while the standard deviation can be calculated no matter what the sample is, it clearly does not translate into coverage intervals like "mean ± 2 SD covers 95%" for these graphs.  It only does that job for distributions that are approximately normal.)
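A quick way to see this, sticking with the same assumed Beta(2, 5): compute the mean ± 2-SD band from a large sample, then check how much of the distribution it actually covers and where the true central 95% interval sits.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.beta(2.0, 5.0, size=1_000_000)   # large sample from the assumed Beta(2, 5)

lo, hi = x.mean() - 2 * x.std(ddof=1), x.mean() + 2 * x.std(ddof=1)
coverage = np.mean((x >= lo) & (x <= hi))

# The +/- 2*SD band dips below zero (impossible for a fraction of income), its
# coverage isn't the nominal 95%, and the true central 95% interval is
# noticeably asymmetric about the mean.
print(f"mean +/- 2*SD band: ({lo:.3f}, {hi:.3f}), coverage = {coverage:.3f}")
print("central 95% interval:", np.percentile(x, [2.5, 97.5]))
```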

No matter how much I sample, the standard deviation of my sample does not shrink; it just settles toward the standard deviation of the underlying distribution.  However, as my sample size increases, my certainty in the mean (or maybe in the 25th percentile, the mode, what have you) clearly goes up.  So we cannot use the standard deviation as a description of my certainty in any of these.

Shrinking Uncertainty


So what do I use?  One can work through the Central Limit Theorem to come up with an exact equation (which is in fact a very simple one), but let's do this with simulation.  Let's say that I sampled 100 people, or 300 people, or 1000 people, or 3000 people.  Obviously there are many ways for my sample to be a representative selection of all people, so each sample of a certain size will be different from another possible sample of the same size.  But since each such sample contains the same number of people, we have the same expectation about what each will tell us about the real distribution's characteristics.

We can simulate many such samples of each size and make histograms of the results, keeping with our "real" gray beta distribution that represents the fraction spent on rent.  Those histograms are graphed below.  The numbers next to each curve give the size of the samples drawn to make that curve; overall, 100,000 samples of each size were drawn.
Each histogram also has a corresponding curve, which is a normal distribution with mean equal to the mean of all the sample means in the histogram (so a mean of means), and standard deviation equal to the standard deviation of those sample means (so an SD of means).  The fit is quite remarkable, no?  This is ultimately the result of the Central Limit Theorem: as sample size increases (from 100 to 1000 to 3000), the certainty of the mean estimate increases as well.  That is, the distance you can expect your sample mean to be from the actual mean decreases as your sample size grows.
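The simulation behind those histograms can be sketched in a few lines.  Again, the Beta(2, 5) is my assumption, and I use fewer repeats than the post's 100,000 just to keep it quick.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
real = stats.beta(a=2.0, b=5.0)   # assumed "real" distribution
n_repeats = 10_000                # the post uses 100,000; fewer here for speed

for n in (100, 300, 1000, 3000):
    # Draw many samples of size n and record the mean of each one.
    means = np.array([rng.beta(2.0, 5.0, size=n).mean() for _ in range(n_repeats)])
    print(f"n={n:>4}: mean of means={means.mean():.4f}  "
          f"SD of means={means.std(ddof=1):.6f}  "
          f"CLT prediction={real.std() / np.sqrt(n):.6f}")
```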

What's more, this standard error of the mean decreases with the square root of the sample size.  A quick and dirty way to check this with normal distributions is to compare the peaks of two of the curves (the peak height of a normal density is inversely proportional to its standard deviation): the peak of the 3000-size curve is 136.5, and the peak of the 1000-size curve is 79.  The square of their ratio is 2.99, essentially 3, which is the ratio of the sample sizes.  The ratio between the 3000-peak and the 300-peak is similarly 3.15, whereas the square root of 10 (the ratio of those sample sizes) is 3.16.
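You can reproduce that back-of-the-envelope check directly, since the peak of a normal density is 1/(SE·√(2π)) and SE = σ/√n; the 0.1597 value is the distribution standard deviation quoted in the next paragraph.

```python
import numpy as np

sigma = 0.1597   # SD of the "real" distribution, quoted just below
peaks = {n: np.sqrt(n) / (sigma * np.sqrt(2 * np.pi)) for n in (300, 1000, 3000)}

print(peaks)                              # roughly 43, 79, and 137
print((peaks[3000] / peaks[1000]) ** 2)   # ~3.0, the ratio of the sample sizes
print(peaks[3000] / peaks[300])           # ~3.16, i.e. the square root of 10
```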

And for the final kicker: the standard deviation of all of the sample means in the size-3000 histogram is 0.002922, and the standard deviation of the real beta distribution (the square root of its variance, which can be calculated from equations found at Wikipedia) is 0.1597.  Multiply the first number by the square root of 3000 and you get 0.002922 × 54.77 ≈ 0.160, very close to the second.

This is a very lengthy means of arriving at the final result, which you can find in any introductory statistics textbook (or online as here or here): if you have a sample mean, call it m, computed from N observations whose sample standard deviation is s, then it approximates the distribution mean µ by

µ ≈ m ± s/√N

(Here I use a 1-standard-error bound.)  So, for Karl et al., disregarding the question of area-weighting (and even accounting for it), we want to use the standard error numbers: 0.12 ± 0.02 (2 s.e. bound), or 0.146 ± 0.024 (2 s.e. bound), as per my previous post on the matter.

(In particular, the sample standard deviation that goes into the standard error is usually computed with N − 1 in the denominator rather than N (Bessel's correction), which removes the bias in the variance estimate.  However, the effect is pretty minimal, and I won't go into showing this.)
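For completeness, here is a minimal sketch of the whole calculation on a sample; the synthetic data is just a placeholder, and SciPy's sem is used as a cross-check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.beta(2.0, 5.0, size=1000)        # placeholder sample; any data works here

m  = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))     # s / sqrt(N), with the N - 1 correction

print(f"mean = {m:.4f} +/- {2 * se:.4f} (2 s.e. bound)")
print(f"scipy agrees: sem = {stats.sem(x):.6f}")
```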
