Saying that big data is hot today is as big an understatement as saying that the New England Patriots like to stretch the rules of football. It's hard to go anywhere or speak to anyone without the term "big data" coming up. In fact, I flew to Milan and back this week and saw a big data story in the airline magazine. The term big data is a bit over-used, as it means different things to different people. But there's one commonality to all the definitions, and that's…(drumroll, please)…data!
My statement above, that big data depends on data, seems very obvious, but success with analyzing big data requires more than just raw data. It requires good-quality data. So a more accurate statement would be that success with big data requires prepared data. When it comes to analytics, there's an old axiom that goes "garbage in, garbage out," meaning that if you throw high volumes of poorly formed data into an analytic solution, you'll get bad results.
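To make "garbage in, garbage out" concrete, here is a minimal Python sketch. The order amounts, the `is_plausible` helper, and its thresholds are all made up for illustration; the point is only that a couple of bad records can wreck a simple statistic.

```python
from statistics import mean

# Hypothetical order amounts. Two bad records slipped in:
# a -1.0 placeholder for "unknown" and a 1,000,000.0 data-entry error.
raw_amounts = [120.0, 95.5, 110.0, -1.0, 1_000_000.0, 102.5]


def is_plausible(amount, low=0.0, high=10_000.0):
    """Assumed sanity rule: keep amounts above `low` and at most `high`."""
    return low < amount <= high


# Garbage in: averaging the raw values includes the bad records.
naive_average = mean(raw_amounts)

# Prepared data: filter out implausible values before analyzing.
clean_amounts = [a for a in raw_amounts if is_plausible(a)]
clean_average = mean(clean_amounts)

print(f"average of raw data:      {naive_average:,.2f}")
print(f"average of prepared data: {clean_average:,.2f}")
```

The raw average lands well above 100,000, while the prepared average is 107.00. Same analysis, very different answer.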
Historically, cleansing and preparing data has been a long, arduous process. When I was at Yankee Group, we migrated CRM systems, but before the migration could start, the company spent a year doing nothing but cleaning up the records in the existing system so we didn't import bad data. Even after all that work, a bunch of bad information was still migrated over.
Recently, I ran across a company called Paxata that offers what it calls "self-service adaptive data preparation." The technology can combine, clean, and shape data before any kind of analytics or operational reporting is done. Many of the existing business intelligence products on the market claim to make the analytic process easier, but the fact is that most data scientists and data analysts spend the majority of their time trying to prepare data for analysis. Data scientists and analysts are scarce today, so most of them command high salaries. Given that, I would think most businesses would rather have these highly paid people figuring out what the data means than cleansing it.
Paxata covers the entire lifecycle of data preparation, including exploring, cleansing, changing, shaping, and publishing the data to get it ready to be analyzed. The product also lets different data teams share the same data set while simultaneously editing and accessing the information across multiple devices. It also acts as a governance solution, tracking every step within a project, with full replay capabilities to review the changes that were made.
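For readers who want a feel for what "combine, clean, and shape" means in practice, here is a rough sketch in plain Python. To be clear, this is not Paxata's API or how its product works; the two CRM exports, the column names, and the `prepare` function are all hypothetical. The `steps` list is a very loose stand-in for the idea of tracking every change so it can be reviewed later.

```python
import csv
import io

# Hypothetical exports from two CRM systems (illustrative data only).
CRM_A = """name,email,state
Ann Lee,ANN@EXAMPLE.COM ,ma
Bob Roy,bob@example.com,NY
"""
CRM_B = """name,email,state
Ann Lee,ann@example.com,MA
Cy Dow,,CT
"""


def load(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare(*sources):
    """Combine, clean, and shape rows; return them with a log of each step."""
    steps = []

    # Combine: stack the rows from every source.
    rows = [row for src in sources for row in load(src)]
    steps.append(f"combined {len(rows)} rows")

    # Clean: normalize formatting so equal values compare equal.
    for row in rows:
        row["email"] = row["email"].strip().lower()
        row["state"] = row["state"].strip().upper()
    steps.append("normalized email and state")

    # Clean: drop records that lack the key field.
    with_email = [r for r in rows if r["email"]]
    steps.append(f"dropped {len(rows) - len(with_email)} rows without email")

    # Clean: de-duplicate on email, keeping the first occurrence.
    unique = {}
    for r in with_email:
        unique.setdefault(r["email"], r)
    steps.append(f"removed {len(with_email) - len(unique)} duplicates")

    # Shape: sort by name so the output is ready to hand off.
    prepared = sorted(unique.values(), key=lambda r: r["name"])
    return prepared, steps


prepared, steps = prepare(CRM_A, CRM_B)
for step in steps:
    print(step)
for row in prepared:
    print(row)
```

Four raw rows go in and two usable records come out, with a short trail showing what was changed along the way. Real-world data is far messier than this, which is why the work eats so much analyst time.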
Paxata customers can expect an increase in analytic productivity on larger data sets while minimizing the risk of data sprawl. The product is available both as a cloud service for data prep flexibility and as an on-premises solution that can be integrated with something like Hadoop for faster time to value.
As I said earlier, big data is a hot topic today, but business and IT leaders need to understand that analyzing bad data means bad analytics and, possibly, the wrong business decisions. Because of this, I certainly expect to see data preparation become a market that's as hot as, or hotter than, big data.
Since the Super Bowl is this weekend, I'll finish with another football analogy. Big data, like Tom Brady, gets all the headlines, but data preparation is like Adam Vinatieri, the unsung hero and the real reason why the Patriots won those early Super Bowls.