
The 7 steps in Big Data delivery

By Jill Dyché, vice president of thought leadership for DataFlux, special to Network World
July 11, 2012 10:24 AM ET

Network World - This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note that it will likely favor the submitter's approach.

The Big Data trend represents the evolving need to process large amounts of data with a new crop of technology solutions that aren't necessarily your father's database. So, what does a company need to consider when contemplating getting started with Big Data?

First, it needs to know what Big Data is. Here is how I define it:

"The emerging technologies and practices that enable the collection, processing, discovery and storage of large volumes of structured and unstructured data quickly and cost-effectively."

Big Data -- from financial trades to human genomes to telemetry sensors in cars to social media interactions to Web logs and beyond -- is expensive to process and store in traditional databases. To solve that problem, new technologies leverage open source solutions and commodity hardware to store data efficiently, parallelize workloads and deliver screaming-fast processing power.

As more IT departments research Big Data alternatives, the discussion centers on stacks, processing speeds and platforms. And while these IT departments are savvy enough to grasp the limitations of their incumbent technologies, many can't articulate the business value of these alternative solutions, let alone how they will classify and prioritize the data once they identify it. Enter Big Data governance.

In fact, as we look at the emerging need for Big Data, discussions of platforms and processes are only part of the overall approach to Big Data delivery. In reality, we're seeing seven steps in realizing the full potential of a Big Data development effort:

Collect: Data is collected from the data sources and distributed across multiple nodes -- often a grid -- each of which processes a subset of data in parallel.

Process: The system then uses that same high-powered parallelism to perform fast computations against the data on each node. The nodes then "reduce" the resulting data findings into more consumable data sets to be used by either a human being (in the case of analytics) or machine (in the case of large-scale interpretation of results).

Manage: Often the Big Data being processed is heterogeneous, originating from different transactional systems. That data usually needs to be understood, defined, annotated, cleansed and audited for security purposes.

Measure: Companies will often measure the rate at which that data can be integrated with other customer behaviors or records and whether the rate of integration or correction is increasing over time. Business requirements should inform the type of measurement and ongoing tracking.

Consume: The resulting use of the data should fit in with the original requirement for the processing. For instance, if bringing in a few hundred terabytes of social media interactions helps us understand whether and how social media data drives additional product purchases, then we should set up rules for how social media data should be accessed and updated. This is equally important for machine-to-machine data access.
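
To make the Collect and Process steps concrete, here is a minimal Python sketch of the pattern they describe: records are spread across several partitions, each partition is computed in parallel, and the partial results are then "reduced" into one smaller data set. It is an illustration under stated assumptions, not any particular product's implementation. The function names (collect, process_partition, merge_counts, run) are hypothetical, word counting stands in for whatever computation the business actually needs, and threads on one machine stand in for the nodes of a real grid.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def collect(records, node_count):
    """Collect: distribute records round-robin across node_count partitions."""
    partitions = [[] for _ in range(node_count)]
    for i, record in enumerate(records):
        partitions[i % node_count].append(record)
    return partitions


def process_partition(partition):
    """Process: the per-node computation, here a word count over one subset."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce: combine two partial results into one."""
    return left + right


def run(records, node_count=3):
    partitions = collect(records, node_count)
    # Assumption: worker threads simulate separate nodes on a single machine.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(process_partition, partitions))
    # Fold the per-node results into one consumable data set.
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    sample = ["big data", "Big deal", "data data"]
    print(run(sample, node_count=2))
```

The reduce step works here because word counts can be merged in any order. A real workload would need a combining function with the same property.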
