Statistics can be deceiving when taken out of context

I recently had occasion to write to the publisher of a magazine with a couple of complaints about the way they had represented information based on statistical analysis. It seems to me that readers of this column may appreciate a little clarity about exactly what they should expect from writers and editors when they report on, say, computer crime statistics. Here's what I wrote; I have changed all the details to avoid embarrassing the guilty.

* * *

In Mumble_Mumble for Winter 2004, an unnamed author wrote "How to stop hacking in school" on page 82. In the article, the author wrote, "One study found that 60 percent of boys in grades 6 through 12 who hacked into their schoolmates' computers were involved in at least one criminal computer trespass by age 24."

(IMPORTANT NOTE TO READERS: This is NOT what was written. I'm MAKING THIS UP for the example only. DON'T USE THIS STATISTIC as if it were true.)

This is half of what statisticians call a "two-way contingency table" – that is, it is supposed to allow us to understand relationships between two variables. In this case, the variables are (a) hacking in school and (b) being convicted of criminal computer trespass by age 24. The full table would look something like this:

No criminal trespass by age 24
  Children who hacked: 40%
  Children who did not hack: ?

At least one criminal trespass by age 24
  Children who hacked: 60%
  Children who did not hack: ?

As you can see, part of this table is missing. The information that was reported in the article is completely useless without the rest of the contingency table. We need to know what percent of the students who did not hack others were involved in at least one criminal trespass by age 24. Without that part of the picture, there is no way to evaluate the meaning of the statistic. For example, did half as many non-hackers commit criminal trespass as hackers? The same proportion? Twice as many?

The second issue is that the writer reported the study results with no indication of reliability. Was the sample 10? 100? 1,000? 10,000? Each of those sample sizes is associated with different (and well-established) reliability for estimated proportions. There are well-known formulae (and tables based on them) that allow us to estimate what are called "confidence intervals" for estimated percentages. Confidence intervals for an estimated percentage define a range of percentages such that the likelihood of being right in asserting that the true proportion lies within that range is some arbitrary degree of confidence – usually 95% or 99%. For example, one might ask whether the true percentage of children later convicted of criminal trespass was between 59 and 61%? 55 and 65%? 50 and 70%? 40 and 80%? What??

Any elementary statistics book will show you that the formulae for calculating the lower and upper 95% confidence limits of a percentage, based on an observed percentage "p" from a sample of size "n", are:

L(lower) = p - {1.96 * SQRT[p(100-p)/n] + 50/n}
L(upper) = p + {1.96 * SQRT[p(100-p)/n] + 50/n}

If the 60% proportion were based on a sample of, say, 100 children in all, then the 95% confidence limits would be 50% to 70%. That is, we would be right 19 times out of 20 that our calculated 95% confidence interval included the true population percentage when taking random samples of 100 children from this population.
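To see how the sample size drives the width of that interval, here is a minimal sketch in Python (my own illustration, not part of the original column or any real study) that applies the formulae above to the made-up 60% figure for several hypothetical sample sizes:

from math import sqrt

def confidence_limits(p, n, z=1.96):
    """Approximate 95% confidence limits (in percent) for an observed
    percentage p from a sample of size n, using the normal approximation
    with the 50/n continuity correction from the formulae above."""
    half_width = z * sqrt(p * (100 - p) / n) + 50 / n
    return max(0.0, p - half_width), min(100.0, p + half_width)

# Made-up 60% figure, hypothetical sample sizes
for n in (10, 100, 1000, 10000):
    lower, upper = confidence_limits(60, n)
    print(f"n = {n:>5}: 95% confidence limits roughly {lower:.1f}% to {upper:.1f}%")

With a sample of 100 this reproduces the 50% to 70% interval quoted above; with 1,000 it narrows to about 57% to 63%, and with 10,000 to about 59% to 61%.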
Finally, one should always note that association and correlation do not prove causality. That is, even if a higher proportion of kids who hacked were really convicted of criminal trespass, the observation by itself would not prove that hacking in childhood caused the children to commit criminal trespass. The association could be the result of sampling variability (i.e., the scientists were unlucky and got an unrepresentative sample). The result could also occur simply because the two phenomena had shared roots, but neither one caused the other. The observations as reported don't prove or disprove either explanation. (The short simulation at the end of this column shows how shared roots alone can produce just such an association.)

In summary, when reading such statistics, be sure that you have looked at both parts of a two-by-two contingency table, always check for the sample size and the confidence limits of statistics based on sampling from a population, and don't assume that statistical associations necessarily imply causal relationships.

* * *

Readers who want to learn more about reading statistics without being bamboozled can download the paper "Understanding Studies and Surveys of Computer Crime" from my Web site at:

https://www2.norwich.edu/mkabay/methodology/crime_stats_methods.pdf

Or you can read Chapter 4 of the _Computer Security Handbook, 4th edition_, edited by Seymour Bosworth and M. E. Kabay.
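Postscript: for readers who like to experiment, here is a small simulation in Python (my own sketch, with entirely arbitrary probabilities, not based on any real data) of the "shared roots" point above. A single hypothetical background factor raises the chance of both childhood hacking and later criminal trespass, while hacking itself has no effect at all in the model:

import random

random.seed(1)

# Arbitrary, made-up probabilities chosen only to illustrate confounding:
# one shared background factor raises the probability of BOTH behaviours,
# but hacking has no effect on later trespass anywhere in this model.
N = 100_000
hack_count = no_hack_count = 0
trespass_given_hack = trespass_given_no_hack = 0

for _ in range(N):
    shared_root = random.random() < 0.30                        # common cause
    hacked = random.random() < (0.60 if shared_root else 0.10)
    trespassed = random.random() < (0.50 if shared_root else 0.05)
    if hacked:
        hack_count += 1
        trespass_given_hack += trespassed
    else:
        no_hack_count += 1
        trespass_given_no_hack += trespassed

print(f"Trespass rate among simulated hackers:     {100 * trespass_given_hack / hack_count:.1f}%")
print(f"Trespass rate among simulated non-hackers: {100 * trespass_given_no_hack / no_hack_count:.1f}%")

In this toy model the simulated hackers end up with roughly three times the trespass rate of the non-hackers, even though the model gives hacking no causal effect whatever; the whole difference comes from the shared background factor.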