MYTH 4: Collect it now, sort it out later
Storage is getting cheaper all the time, but it's not free. And for many companies, the appetite for data is expanding faster than storage costs are decreasing, says Brad Peters, CEO of San Francisco-based Birst, a cloud-based business intelligence vendor.
Companies think that if they just collect the data, they'll figure out what to do with it later, he says. “I see a number of large corporations collecting boatloads of stuff, their expense on it goes up, and they don't get any value out of it.”
In fact, with some data sets, the law of diminishing returns starts to apply. Say, for example, you're polling people to predict an election. You need a certain number of people to get a representative sample. But after a point, adding more people won't significantly affect the margin of error.
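The polling example can be made concrete with a rough sketch. It assumes a simple random sample, a 95 percent confidence level (z = 1.96) and the worst-case 50/50 split; the function name and the sample sizes are illustrative choices, not figures from anyone quoted here.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a simple random sample of size n.

    Assumes a 95% confidence level (z = 1.96) and, by default, the
    worst-case proportion p = 0.5. Both are illustrative assumptions.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Each tenfold increase in sample size buys less and less precision.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:>7,}: +/- {margin_of_error(n) * 100:.2f} percentage points")
```

Under these assumptions, going from 100 to 1,000 respondents cuts the margin of error from about 9.8 to about 3.1 percentage points, while going from 10,000 to 100,000 only trims it from about 1.0 to about 0.3. The error shrinks with the square root of the sample size, so every extra digit of precision costs roughly 100 times more data.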
“Do you store a bunch of data you may need, that might give you a couple more digits of precision?” he asks. “Or do you buy more people power? Do you secure your networks better? We're not going too fast as an economy, and budgets aren't increasing.”
And it's not just storage costs, says Dean Gonsowski, global head of information governance and big data management at San Francisco-based Recommind, which specializes in unstructured data analytics.
For example, it may cost the company if the data gets out, he says. And having data sitting around in warehouses means that it's subject to e-discovery arising from court cases.
Finally, the more data, the longer it takes to sort through it. “When the repositories get into the billions of records, searches take hours or weeks,” he says. “The volume of information really starts clogging systems that were never built to handle those volumes.”
MYTH 5: All data is created equal
The state of Virginia has been collecting data on student enrollments, financial aid, and degree awards for the past 20 years. But that doesn't mean that the data collected 20 years ago means the same thing as the data stored in that same field today.
“The biggest problem I deal with is that just because it's in the data dictionary, researchers think it's fair game,” says Tod Massa, the policy research and data warehousing director for Virginia's State Council of Higher Education. “For example, data on student test scores on the ACT and SAT were initially only collected on in-state students, then there was a gap, then it was collected on both in-state and out-of-state students.” Similarly, race and ethnicity are tracked differently at the K-12 level and in higher education.
In fact, any particular data point might be reported differently by different institutions, or at different points in time, or by different people at those institutions. “If you're in an isolated shop or enterprise that is solely responsible for the data it collects, then you might have a different situation,” he says. “But then even, I suspect that the meanings of data change over time.”
As a result, analysts need to have not just statistical skills, but also local knowledge of the data and knowledge of trends in the industry as a whole, such as SAT and ACT scores being re-calibrated.
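One way an analyst might spot the kind of shift Massa describes in the test-score data is to check how well a field is populated for each group in each collection year before treating it as one continuous series. The sketch below is only an illustration: the records, years, field layout and function name are all made up for the example and do not reflect Virginia's actual data.

```python
from collections import defaultdict

# Hypothetical records: (collection_year, residency, sat_score or None).
# The years and scores are invented purely to illustrate the check.
records = [
    (1995, "in-state", 1100),
    (1995, "out-of-state", None),
    (2002, "in-state", None),
    (2002, "out-of-state", None),
    (2010, "in-state", 1150),
    (2010, "out-of-state", 1210),
]

def coverage_by_group(rows):
    """Return {(year, residency): share of rows with a non-missing score}."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for year, residency, score in rows:
        key = (year, residency)
        total[key] += 1
        if score is not None:
            filled[key] += 1
    return {key: filled[key] / total[key] for key in total}

coverage = coverage_by_group(records)
for (year, residency), share in sorted(coverage.items()):
    print(year, residency, f"{share:.0%}")
```

A table like this shows where a field was collected for only part of the population, or not at all. It cannot show that a field's definition changed while still being filled in, which is where the local knowledge comes in.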