Why Big Data and analytics should care about my burning waterskis

How you code data in your Big Data project has consequences; just take a look at ICD-10

A list friend recently noted that the latest International Statistical Classification of Diseases and Related Health Problems (otherwise called “ICD codes”) is not only very interesting but also very amusing. The ICD system, now in its 10th version, was designed and is administered by the World Health Organization and is described as:

… the standard diagnostic tool for epidemiology, health management and clinical purposes. This includes the analysis of the general health situation of population groups. It is used to monitor the incidence and prevalence of diseases and other health problems, providing a picture of the general health situation of countries and populations.

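What’s so interesting about the ICD-10 is how remarkably specific it is. (Strictly speaking, the seven-character codes below come from ICD-10-CM, the clinical modification used in the United States.) For example, there are codes for injuries related to walking into things: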

W22.02XA, “walked into lamppost, initial encounter”

[Image: A man colliding with a lamppost. Credit: Flickr]

The category W22 is for “striking against or struck by other objects” but not for “striking against or struck by object with subsequent fall” (which would be W18.09). The following “.0” indicates “striking against stationary object” while the subsequent “2” indicates a lamppost. If, instead of a “2”, it was a “3”, the thing struck would be furniture, while a “1” would denote a wall (unless, of course, it was a swimming pool wall, in which case it would be a “4”).

The “XA” suffix indicates an initial encounter (here “encounter” means “interview with a medical professional”), while “XD” would indicate a subsequent encounter. The suffix “XS” would indicate “sequela”, which means “a pathological condition resulting from a prior disease, injury, or attack”, implying that the lamppost collision caused some kind of ongoing problem (to the patient, not the lamppost).
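The anatomy of a code like W22.02XA can be sketched in a few lines of Python. This is only an illustration: the lookup tables hold just the values mentioned above, the names `parse_code` and `describe` are made up for this example, and treating the “X” as padding in front of the final character is an assumption rather than something taken from the official specification.

```python
import re
from typing import NamedTuple

# Lookup tables limited to the values mentioned in the article;
# the real code set is far larger. All names here are hypothetical.
STATIONARY_OBJECTS = {
    "1": "wall",
    "2": "lamppost",
    "3": "furniture",
    "4": "swimming pool wall",
}
ENCOUNTERS = {
    "A": "initial encounter",
    "D": "subsequent encounter",
    "S": "sequela",
}

# Category (letter + two digits), dot, subcategory digits,
# optional "X" padding (assumption), then the encounter character.
CODE_PATTERN = re.compile(r"^([A-Z]\d{2})\.(\d{1,3})(X*)([ADS])$")


class ParsedCode(NamedTuple):
    category: str     # e.g. "W22"
    subcategory: str  # digits after the dot, e.g. "02"
    encounter: str    # decoded final character


def parse_code(code: str) -> ParsedCode:
    """Split a code such as 'W22.02XA' into its parts."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognised code: {code!r}")
    category, subcategory, _padding, suffix = match.groups()
    return ParsedCode(category, subcategory, ENCOUNTERS[suffix])


def describe(code: str) -> str:
    """Give a rough plain-English reading of a code."""
    parsed = parse_code(code)
    # W22.0x: striking against a stationary object; x names the object.
    if parsed.category == "W22" and parsed.subcategory.startswith("0"):
        thing = STATIONARY_OBJECTS.get(parsed.subcategory[1:2], "unknown object")
        return f"struck {thing}, {parsed.encounter}"
    return f"{parsed.category}.{parsed.subcategory}, {parsed.encounter}"


print(describe("W22.02XA"))
print(describe("W22.03XD"))
```

Even this toy version shows how much meaning is packed into each character position, and how a slip in any one of them silently yields a different but equally valid code.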

You think that’s detailed? How about:

W59.21 Bitten by turtle

W59.22 Struck by turtle

W59.29 Other contact with turtle

[Image: A turtle about to strike. Credit: Wikipedia]

The thing I can’t fathom here is how one gets “struck by turtle”. Turtles are not generally prone to rising up and slapping people, nor are they often (if ever) found sailing through the air, so how this classification ever arose (ha!) has to be a mystery.

But wait! It gets better. How about:

V91.07 Burn due to water-skis on fire

Really? I mean, how often in the history of mankind has anyone, other than perhaps Evel Knievel, been burnt by their waterskis bursting into flame?

[Image: Evel Knievel off the clock. Credit: Wikipedia]

While I’m sure that this incredibly detailed data is considered crucial by someone, somewhere, it raises interesting and relevant questions about the use of such intelligence in Big Data and analytics. 

To start with, the accuracy of data collected from real-world sources by humans is subject to interpretation and misinterpretation, and the greater the specificity of coding, the more likely it is that an event will be miscoded.
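One way an analyst might blunt the effect of that kind of miscoding is to roll specific codes up to their three-character category before counting. The sketch below is a hedged illustration, not an established method from the ICD documentation: the `category` helper is hypothetical and the records are invented sample values, not real patient data.

```python
from collections import Counter


def category(code: str) -> str:
    """Return the three-character category, e.g. 'W59' for 'W59.22'."""
    return code.split(".", 1)[0]


# Made-up records for illustration only; not real patient data.
records = ["W59.21", "W59.22", "W59.21", "W59.29", "W22.02XA"]

# Counts at full specificity depend on the coder telling
# "bitten by turtle" apart from "struck by turtle".
by_code = Counter(records)

# Counts at category level are unaffected if sibling codes
# within the same category get confused with one another.
by_category = Counter(category(code) for code in records)

print(by_code)
print(by_category)
```

The trade-off is the obvious one: the coarser count is more robust to coding slips but throws away exactly the detail the fine-grained codes were meant to capture.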

[Image: An okapi ready to attack. Credit: Wikipedia]

Secondly, it illustrates the a priori interest bias that is often found in coding systems. The fact that turtles are specifically referenced in the ICD-10 while, for example, okapis aren’t shows that someone on the coding committee had turtle issues and managed to convince the other committee members that her particular hobby horse was crucially important to world health (being struck by a hobby horse would probably be coded as the rather more generic “W20.8 Other cause of strike by thrown, projected or falling object” unless, of course, she ran into it, in which case it would be “W22.8 Striking against or struck by other objects”).

Thirdly, it shows regionality bias because I’ll bet there aren’t a hell of a lot of burning waterski accidents in, say, Iraq, while “burn due to camel on fire” is most likely an everyday occurrence (I actually can’t figure out how to code this, as it falls between the groups “W20-W49 Exposure to inanimate mechanical forces” and “W50-W64 Exposure to animate mechanical forces”).

With these kinds of issues in a coding system where entities are so minutely specified, detail can gain a prominence that isn’t necessarily useful while, at the same time, making the analysis task more complex. Sure, exactly what animal bit you matters, but having a code for turtles and none for okapis makes little sense. A generic code plus a qualifying text field would serve the same purpose, provide more relevant data, and be a better strategy.
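The “generic code plus qualifying text field” idea might look something like the following. This is a minimal sketch under stated assumptions: the `InjuryRecord` type, its `detail` field, and the `mentions` helper are all hypothetical, and the sample records are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjuryRecord:
    code: str         # a generic code, e.g. "W55.81" (bitten by other mammals)
    detail: str = ""  # free-text qualifier supplied by whoever records the event


def mentions(records, code, term):
    """Return records with the given code whose detail text mentions term."""
    term = term.lower()
    return [r for r in records if r.code == code and term in r.detail.lower()]


# Invented sample data; not real patient records.
records = [
    InjuryRecord("W55.81", "Okapi bite to left hand"),
    InjuryRecord("W55.81", "bitten by ferret"),
    InjuryRecord("W59.21"),
]

for record in mentions(records, "W55.81", "okapi"):
    print(record.code, "-", record.detail)
```

The code set stays small and stable, while the text field can hold okapis, camels, or anything else the committee never thought of. The cost is that free text needs cleaning and searching, which is its own analysis problem.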

Perhaps the medical world moves in far more mysterious ways than I understand but this would seem to illustrate that the maxim “less is more” applies even when you’re dealing with Big Data.

Now, excuse me, I must put a bandage on my okapi bite (covered by the more general “W55.81 Bitten by other mammals”) and extinguish my wakeboard, which was what got the okapi upset in the first place. Luckily I didn’t get burned by the wakeboard (which would be “V91.89 Other injury due to other accident to unspecified watercraft”).

