Big Data is a phrase we hear over and over again. Yes, it's obvious: Big Data means, well, big data, and lots of it. We all get that Facebook, Twitter and the other mega web apps generate enormous amounts of data. But beyond the mega web apps, what really is Big Data? What can we do with it, and why does everyone get so excited about it? For help with this I went to my friends at LexisNexis, makers of HPCC Systems.
When we speak about big data, the problem is not amassing a lot of data. The real trick is analyzing that data to make something of value out of it. The folks at LexisNexis have been doing this for a long time. HPCC Systems is LexisNexis's own in-house big data solution, which they open sourced about a year ago.
For purposes of this article, though, whether we are speaking about HPCC or Hadoop or any other big data solution is not important. I want to illustrate what you can do with good analysis of big data, so I am going to share a case study by HPCC Systems on a proof of concept they did for the Office of the Medicaid Inspector General (OMIG) of a large Northeastern state.
HPCC Systems was given a large list of names and addresses. Overlaying their own publicly available data, they sought to identify social clusters of Medicaid recipients living in expensive houses and driving expensive cars. Of course, it helps if you have 50 TB of public data and lots of experience building social graphs.
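To make the idea of overlaying data sets a little more concrete, here is a toy sketch in Python. It is not how HPCC Systems did it (the case study does not give their code, and HPCC has its own tooling). Every name, address, dollar figure and threshold below is invented for illustration, and the "cluster" here is simply everyone on the list who shares an address with a flagged recipient, which is a far cruder notion than a real social graph.

```python
from collections import defaultdict

# Hypothetical thresholds; the case study does not say what counted as "expensive".
HOME_VALUE_THRESHOLD = 750_000
CAR_VALUE_THRESHOLD = 60_000

# Invented stand-in for the list of names and addresses.
recipients = [
    {"name": "A. Smith", "address": "1 Elm St"},
    {"name": "B. Smith", "address": "1 Elm St"},
    {"name": "C. Jones", "address": "9 Oak Ave"},
    {"name": "D. Lee", "address": "4 Pine Rd"},
]

# Invented stand-ins for public property and vehicle records.
property_values = {"1 Elm St": 1_200_000, "9 Oak Ave": 180_000, "4 Pine Rd": 900_000}
vehicle_values = {"A. Smith": 95_000, "B. Smith": 20_000, "C. Jones": 15_000, "D. Lee": 12_000}


def flag_recipients(recipients, property_values, vehicle_values):
    """Return names of recipients with both an expensive home and an expensive car."""
    flagged = []
    for person in recipients:
        home_value = property_values.get(person["address"], 0)
        car_value = vehicle_values.get(person["name"], 0)
        if home_value >= HOME_VALUE_THRESHOLD and car_value >= CAR_VALUE_THRESHOLD:
            flagged.append(person["name"])
    return flagged


def cluster_by_address(recipients, flagged_names):
    """Group every recipient who shares an address with a flagged recipient."""
    flagged_addresses = {
        person["address"] for person in recipients if person["name"] in flagged_names
    }
    clusters = defaultdict(list)
    for person in recipients:
        if person["address"] in flagged_addresses:
            clusters[person["address"]].append(person["name"])
    return dict(clusters)


flagged = flag_recipients(recipients, property_values, vehicle_values)
clusters = cluster_by_address(recipients, flagged)
print(flagged)
print(clusters)
```

On a four-row list this is trivial. The hard part at real scale is doing the same kind of join across terabytes of messy records and linking people by more than a shared address.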
In any event, this is the kind of task that HPCC and other big data solutions are built for. Comparing Medicaid rolls with purchases of cars and homes revealed some interesting results. Here is a map that was generated:
Not only did the analysis turn up lots of likely Medicaid fraud, but it also turned up connections that could be indicative of money laundering and mortgage fraud. This kind of result simply would not be possible without the power of a big data analysis engine like HPCC Systems.
I had a chance to speak with Jo Prichard of LexisNexis, who showed me some other examples of big data analysis. One involved taking the total page views of Wikipedia for the year, along with public mentions of specific personalities, for example comparing public mentions of Whitney Houston with hits on her Wikipedia page. Again, the results were pretty extraordinary. Another example was prescription drug abuse. Again, overlaying public data on the initial data set yielded some eye-opening results.
This really only scratches the surface of what you can do with big data if you have the horsepower and the analysis to use it. In this case it is HPCC Systems, but it could be Hadoop (though the LexisNexis folks say not as easily as you can with HPCC) or another big data solution. This kind of insight is what gets people really excited about big data beyond the Facebook-Twitter crowd.