# Lies and statistics

Opinion
Feb 05, 2004 | 4 mins

Networking | Security

Statistics can be deceiving when taken out of context.

I recently had occasion to write to the publisher of a magazine with a couple of complaints about the way they had represented information based on statistical analysis. It seems to me that readers of this column may appreciate a little clarity about exactly what they should expect from writers and editors when they report on, say, computer crime statistics.

Here’s what I wrote; I have changed all the details to avoid embarrassing the guilty.

* * *

In Mumble_Mumble for Winter 2004, an unnamed author wrote, “How to stop hacking in school” on page 82:

In the article, the author wrote, “One study found that 60 percent of boys in grades 6 through 12 who hacked into their schoolmates’ computers were involved in at least one criminal computer trespass by age 24.”

(IMPORTANT NOTE TO READERS: This is NOT what was written. I’m MAKING THIS UP for the example only. DON’T USE THIS STATISTIC as if it were true.)

This is half of what statisticians call a “two-way contingency table” – that is, it is supposed to allow us to understand relationships between two variables. In this case, the variables are (a) hacking in school and (b) being convicted of criminal computer trespass by age 24. The full table would look something like this:

| | No criminal trespass by age 24 | At least one criminal trespass by age 24 |
| --- | --- | --- |
| Children who hacked | 40% | 60% |
| Children who did not hack | ? | ? |

As you can see, part of this table is missing. The information that was reported in the article is completely useless without the rest of the contingency table. We need to know what percent of the students who did not hack others were involved in at least one criminal trespass by age 24. Without that part of the picture, there is no way to evaluate the meaning of the statistic. For example, did half as many non-hackers commit criminal trespass as hackers? The same proportion? Twice as many?
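To make the point concrete, here is a small Python sketch. The non-hacker rates are hypothetical (the article supplied none); the point is that the same 60% figure implies very different relative risks depending on the missing cell:

```python
# Hypothetical illustration: the 60% figure for hackers means nothing
# until we know the corresponding rate among non-hackers.

def relative_risk(rate_exposed: float, rate_unexposed: float) -> float:
    """Ratio of the trespass rate among hackers to that among non-hackers."""
    return rate_exposed / rate_unexposed

hacker_rate = 0.60  # the invented figure from the example

# Three made-up rates for the missing "did not hack" row of the table
for non_hacker_rate in (0.30, 0.60, 0.90):
    rr = relative_risk(hacker_rate, non_hacker_rate)
    print(f"non-hackers at {non_hacker_rate:.0%}: relative risk = {rr:.2f}")
# → non-hackers at 30%: relative risk = 2.00
# → non-hackers at 60%: relative risk = 1.00
# → non-hackers at 90%: relative risk = 0.67
```

With the missing cell at 30%, hackers look twice as likely to offend; at 60%, hacking is unrelated; at 90%, hackers actually offend *less*. The published half-statistic is compatible with all three stories.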

The second issue is that the writer reported the study results with no indication of reliability. Was the sample 10? 100? 1,000? 10,000? Each of those sample sizes implies a different (and well-established) reliability for an estimated proportion. There are well-known formulae (and tables based on them) for computing what are called “confidence intervals” for estimated percentages. A confidence interval for an estimated percentage defines a range such that the assertion that the true population proportion lies within that range will be correct with some chosen degree of confidence – usually 95% or 99%. For example, was the true percentage of children later convicted of criminal trespass between 59% and 61%? 55% and 65%? 50% and 70%? 40% and 80%? What??

Any elementary statistics book will show you that the formulae for calculating the upper and lower 95% confidence limits of a percentage based on an observed percentage “p” from a sample of size “n” are:

L(lower) = p – {1.96 * SQRT[p(100-p)/n] + 50/n}

L(upper) = p + {1.96 * SQRT[p(100-p)/n] + 50/n}

If the 60% proportion were based on a sample of, say, 100 children in all, then the 95% confidence limits would be roughly 50% to 70%. That is, 19 times out of 20, a 95% confidence interval calculated this way from a random sample of 100 children from this population would include the true population percentage.
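The limits quoted above can be checked with a few lines of Python implementing the formula from the text (a normal approximation with a continuity correction, working in percentage units):

```python
import math

def confidence_limits(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% confidence limits (in percent) for an observed percentage p
    from a sample of size n: p ± {z * sqrt[p(100 - p)/n] + 50/n}."""
    half_width = z * math.sqrt(p * (100 - p) / n) + 50 / n
    return p - half_width, p + half_width

lower, upper = confidence_limits(60, 100)
print(f"{lower:.1f}% to {upper:.1f}%")  # → 49.9% to 70.1%
```

With n = 100 the interval is about 20 percentage points wide; quadrupling the sample to n = 400 roughly halves the width, which is exactly the kind of context the article left out.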

Finally, one should always note that association and correlation do not prove causality. That is, even if a higher proportion of kids who hacked were really convicted of criminal trespass, the observation by itself would not prove that hacking in childhood caused the children to commit criminal trespass. The association could be the result of sampling variability (i.e., the scientists were unlucky and got an unrepresentative sample). The result could also occur simply because the two phenomena had shared roots, but neither one caused the other. The observations as reported don’t prove or disprove either explanation.

In summary, when reading such statistics, be sure that you have looked at both parts of a two-by-two contingency table, always check for the sample size and the confidence limits of statistics based on sampling from a population, and don’t assume that statistical associations necessarily imply causal relationships.

* * *