Verizon data breach investigations report, Part 1

* A sound methodology

The Verizon Business RISK Team recently published a valuable analysis of four years of data on security breaches among their clients entitled “2008 Data Breach Investigations Report.” Wade H. Baker, C. David Hylender and J. Andrew Valentine are the authors; their contributors include my old friend and colleague Dr Peter Tippett, MD, PhD; A. Bryan Sartin; Stan S. Kang; Christopher Novak; and members of the Verizon RISK Team.

Brad Reed has pointed out the main findings recently in Network World and the paper itself includes a good executive summary; therefore, in the next few columns, I will elaborate on the implications of specific points from the report.

Today I want to draw readers’ attention to the methodology of this landmark study.

As most people realize, all published information about data-security breaches should be examined with critical faculties fully engaged. Studies and statistics about computer crimes consistently suffer from the following methodological problems:

• Limited ascertainment (the crimes may not be detected).

• Restricted reporting (many organizations don’t want to report breaches at all and there is no centralized reporting facility to collate the data).

• Non-random samples (it is not possible to generalize from the samples to a wider population because the reports come from self-selected reporting organizations).

For more information about these issues, see my paper, “Understanding Computer Crime Studies and Statistics v4.” 

I believe that the study is unique in drawing upon a massive database of more than 500 specific investigations carried out by the Verizon RISK Team over the last four years. As the authors write, “Furthermore, it contains firsthand information on actual security breaches rather than on network activity, attack signatures, vulnerabilities, public disclosures, and media interpretation that form the basis of most publications in the field. While many reports in the security industry rely on surveys as the primary data collection instrument, this data set is inherently more objective.”

Surveys are inherently limited because it is difficult or impossible to determine whether the willingness to participate in the survey is correlated with any particular attributes of the participants; e.g., perhaps those who refuse to participate have worse security than those who participate – or vice versa. We don’t know and cannot know based on the survey results.
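The self-selection problem can be illustrated with a small simulation. This is only a sketch under invented assumptions: the hidden "security quality" score, the breach probabilities and the response probabilities are all hypothetical and do not come from the Verizon report or any real survey.

```python
import random


def simulate_survey(n_orgs=10_000, seed=0):
    """Compare the true breach rate with the rate seen among survey respondents.

    All numbers here are hypothetical. Each organization gets a hidden
    security-quality score between 0 (poor) and 1 (strong).
    """
    rng = random.Random(seed)
    population = []   # breach outcome for every organization
    respondents = []  # breach outcome for those who answer the survey

    for _ in range(n_orgs):
        quality = rng.random()
        # Assumption: weaker security means a higher chance of a breach.
        breached = rng.random() < 0.5 * (1 - quality)
        # Assumption: stronger security means more willingness to respond.
        responds = rng.random() < quality
        population.append(breached)
        if responds:
            respondents.append(breached)

    true_rate = sum(population) / len(population)
    survey_rate = sum(respondents) / len(respondents)
    return true_rate, survey_rate


true_rate, survey_rate = simulate_survey()
print(f"true breach rate:   {true_rate:.3f}")
print(f"survey breach rate: {survey_rate:.3f}")
```

Under these assumptions the survey understates the breach rate; reverse the response assumption and it overstates it. An analyst who sees only the respondents' answers has no way to tell which case applies.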

In contrast, the organizations studied in the Verizon report either were clients of the RISK Team (or on an incident response retainer contract) before they had breaches or were referred to Verizon after the breaches. In either case, the fact that these are known clients increases the reliability of the findings compared with surveys where anonymous respondents can fill in the blanks without verification of their identity. No one is claiming that the sample is a random sample that allows generalization of the sample results to the universe of all possible corporate victims of security breaches; the authors themselves warn:

“Though challenges such as sampling techniques, response rates, and self-selection are not relevant to the research method used in this study, it cannot be concluded that the findings are therefore unbiased. Perhaps most obvious is that the data set is dependent upon cases which Verizon Business was engaged to investigate. Readers familiar with publicly available statistics on data loss will quickly recognize differences between these sources and the results presented in this report. This has much to do with caseload. For instance, it is simply more likely that an organization will desire a forensic examination following a network intrusion than a lost laptop.”

It is refreshing to see a security report with this degree of statistical awareness.

Most important, the detailed statistics, including causes or methods, numbers of records compromised, types of data involved, time span of events, discovery methods, and estimated costs, were based on analysis by trained professionals, not on self-reported, unverified guesswork by anonymous respondents. One of the most serious methodological problems of studies that rely on multiple-choice responses by unknown respondents is that it is difficult to validate the data; presentation of cost-classes, for example, naturally attracts respondents to whichever categories are presented. Checking a box on a form is a lot easier than actually measuring costs or analyzing causes, but the unverified results are of dubious reliability.

Survey-design courses demonstrate many methods for validation of surveys, none of which are ever used in popular security surveys as far as I have seen in over 20 years of study. Examples of internal and external survey-validation techniques include multiple questions in different parts of the survey instrument addressing the same metric using different wording and different scales; repeated administration of the instrument to identifiable individuals to measure intra-respondent variability; and follow-up studies to compare the survey results with data collected independently.
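The first of these techniques, asking about the same metric twice with different scales, can be sketched in a few lines. The answers below are invented for illustration, and the 0.3 tolerance and the crude rescaling are arbitrary assumptions rather than an established standard.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant answers")
    return cov / sqrt(var_x * var_y)


def flag_inconsistent(xs, ys, x_max, y_max, tolerance=0.3):
    """Indexes of respondents whose two rescaled answers disagree.

    Each answer is divided by its scale maximum (a crude rescaling);
    a gap larger than `tolerance` is treated as inconsistent.
    """
    return [
        i for i, (x, y) in enumerate(zip(xs, ys))
        if abs(x / x_max - y / y_max) > tolerance
    ]


# Hypothetical answers to the same question asked twice:
# early in the form on a 1-5 scale, later on a 0-100 scale.
q_early = [1, 2, 2, 4, 5, 3, 1, 4]
q_late = [10, 25, 30, 70, 95, 50, 20, 20]

print("agreement (Pearson r):", round(pearson(q_early, q_late), 2))
print("inconsistent respondents:", flag_inconsistent(q_early, q_late, 5, 100))
```

A low correlation across the whole sample suggests that the question is being misread or answered carelessly; individual respondents flagged as inconsistent can be followed up or excluded.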

For an excellent brief (30-page) introduction to sound survey design, see David S. Walonick’s free tutorial.

In the next column, I’ll look at the implications of one of the report’s conclusions: “In a finding that may be surprising to some, most data breaches investigated were caused by external sources.”
