Searching your data resources to see if you're dealing with any of the entities named in the Panama Papers isn't a Big Data problem.
On May 9th, The International Consortium of Investigative Journalists will release a searchable database that will detail over 200,000 entities that are part of the Panama Papers investigation.
While this will be intriguing for most of us, if you’re in a financial organization of any kind and there’s the remotest chance that you might have dealings with any of these entities, or with parties who might be fronting for or involved with them, May 9th will be (or depending on when you read this, is or has been), shall we say, “a bad day” for you.
Your challenge, once you get your hands on the ICIJ data, will be to search your organization's data resources for these names. The problem is that it's pretty much guaranteed you're not prepared for this. What most of you will do is assume this is a Big Data problem and attempt to aggregate all of your data resources into what will amount to a Frankenbase so you can run analytics on it. This will be neither simple nor quick.
It won't be simple or quick because financial institutions are some of the worst offenders when it comes to multiple data silos. Moreover, even when you've negotiated and/or bashed heads to overcome the politics of silo ownership, you'll be faced with the not insignificant task of normalizing the data you're hoping to stuff into your Frankenbase.
And even when you've got all of your ducks in a row, there's a grim reality to what you're trying to do: A 2013 survey by Infochimps found that a remarkable 55% of Big Data projects are never completed. Of those failures, 39% were attributed to … yes, you guessed it … siloed data and non-cooperation (otherwise known as "politics"), while 41% were stymied by technical roadblocks.
Oh, and then there's the cost: A 2014 Dell survey found "Budgets for big data projects are expected to rise to an average of $6 million over the next two years."
My friends over at Pneuron [Disclosure: In 2014 I wrote a short series of posts for the Pneuron blog] recently pitched me on their approach to analyzing the ICIJ data using their technology. I’ve been a big fan of Pneuron’s technology since I first wrote about the company back in 2013 and their strategy for doing the sort of data mining required for this near-Herculean task makes a lot of sense.
Simon Moss, Pneuron's CEO, argues: "This is not a Big Data problem, it's a diversity and distribution problem." By diversity, Moss means that the range of formats and contexts in which the data is stored is going to be a big issue, and by distribution, he's referring to the data being virtually and geographically dispersed. These are not trivial problems.
Moss points out that the first step in addressing the problem of searching for entity names is sorting and matching; the ICIJ data will contain names such as "Robert P. Jones," but in the various data silos of a financial institution that name might appear as "Jones, R.P." or "Jones, Robert" or even "Bob Jones." Moreover, data in one silo might identify his spouse as "Ethel Jones," who also has accounts in her own name only in other silos. The matching problem is even trickier when it comes to company names and relationships. There's also the issue of multiple entities partnering in an account; they all need to be connected and their transactions scrutinized both jointly and separately.
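To make the name-matching problem concrete, here's a minimal sketch of the kind of canonicalization and fuzzy scoring involved. This is a toy illustration, not Pneuron's code: the nickname table and the similarity threshold are assumptions, and a production entity-resolution system would add phonetic codes, probabilistic weighting, and far larger alias lists.

```python
import difflib
import re

# Tiny, assumed nickname table for illustration only.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

def normalize_name(name: str) -> str:
    """Reduce a personal name to a canonical 'first last' form:
    handle 'Last, First' ordering, drop middle initials and
    punctuation, and lowercase."""
    name = name.strip().lower()
    if "," in name:  # "Jones, Robert" -> "robert jones"
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    # drop single-letter initials like "p." and trailing periods
    parts = [p.rstrip(".") for p in re.split(r"\s+", name)]
    parts = [p for p in parts if len(p) > 1]
    return " ".join(parts)

def canonical(name: str) -> str:
    """Normalize, then expand common nicknames to formal names."""
    parts = [NICKNAMES.get(p, p) for p in normalize_name(name).split()]
    return " ".join(parts)

def match_score(a: str, b: str) -> float:
    """Similarity in [0, 1] between two canonicalized names."""
    return difflib.SequenceMatcher(None, canonical(a), canonical(b)).ratio()
```

With this sketch, "Robert P. Jones," "Jones, Robert," and "Bob Jones" all canonicalize to the same string and score a perfect match, while unrelated names score low; real matching engines apply the same idea with much more sophisticated (deterministic, probabilistic, or Bayesian) scoring.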
The whole idea of moving this massive, complex, distributed data into a centralized database or databases should not only be daunting, it should inspire horror at the scale of the task.
Pneuron's solution is their eponymous technology, which involves deploying modules, called "Pneurons," on, or near to, each of the data resources. There are different Pneurons for various tasks: Data and Application Pneurons are used to access data sources including databases, files, spreadsheets, and web services, while Analysis Pneurons perform various types of matching (deterministic, probabilistic, Bayesian, etc. - a function that is obviously highly relevant in searching for the names of suspect entities), predictive modeling, and statistical analysis. All of these modules normalize the data before passing it up the chain to the Pneuron Cortex, which manages, routes, and coordinates the activities of the various deployed Pneurons. Finally, output Pneurons persist, visualize, or deliver data to files, databases, or other destinations.
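The underlying pattern - adapters running beside each heterogeneous silo, emitting a common record shape, with a coordinator fanning the results back in - can be sketched in a few lines. To be clear, this is a hypothetical illustration of the pattern, not Pneuron's actual API; the `Record` shape, adapter functions, and sample silos are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Record:
    """Common normalized shape every adapter emits."""
    source: str
    entity_name: str

def csv_adapter(rows: Iterable[str]) -> list[Record]:
    # Adapter beside a flat-file silo with "acct_id,name" rows.
    return [Record("csv-silo", r.split(",")[1].strip()) for r in rows]

def ledger_adapter(rows: Iterable[dict]) -> list[Record]:
    # Adapter beside a core-banking database returning dicts.
    return [Record("ledger-db", r["holder"]) for r in rows]

def coordinate(adapters: list[Callable[[], list[Record]]]) -> list[Record]:
    # The "cortex" role: gather normalized records from every
    # adapter so matching can run over one uniform stream.
    out: list[Record] = []
    for adapter in adapters:
        out.extend(adapter())
    return out
```

The point of the design is that normalization happens next to each source, so only small, uniform records travel to the coordinator - the data itself never has to be bulk-copied into a central Frankenbase.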
System design is done using Pneuron's Design Studio which provides a graphical drag and drop interface that makes configuration and modification about as intuitive as it gets.
Pneuron claims they can go from installation through deployment and configuration to displaying analytics results or handing data off to an external analytics system in under one month for large-scale projects and, in simple cases, in as little as four hours. In the case of the ICIJ entity search, you won't have a lot of time to waste. So, if on May 9th you're going to be having, are having, or had, a "bad day," you might want to check out Pneuron's strategy.