Defining 'big data' depends on who's doing the defining

When does data become big? AWS, IBM and research firms each have their own definitions.

Big data is an IT buzzword nowadays, but what does it really mean? When does data become big?

At a recent Big Data and High Performance Computing Summit in Boston hosted by Amazon Web Services (AWS), data scientist John Rauser mentioned a simple definition: Any amount of data that's too big to be handled by one computer.

Some say that's too simplistic. Others say it's spot on.

"Big data has to be one of the most hyped technologies since, well, the last most hyped technology, and when that happens, definitions become muddled," says Jeffrey Breen of Atmosphere Research Group.

The lack of a standard definition points to the immaturity of the market, says Dan Vesset, program vice president of the business analytics division at research firm IDC. But he isn't quite buying the definition floated by AWS. "I'd like to see something that actually talks about data instead of the infrastructure needed to process it," he says.

Others agree with the AWS definition.

"It may not be all inclusive, but I think for the most part that's right," says Jeff Kelly, a big data analytics analyst at the Wikibon project. Part of the idea of big data is that it's so big that analyzing it needs to be spread across multiple workloads, hence AWS's definition. "When you're hitting the limits of your technology, that's when data gets big," Kelly says.

One of the most common definitions of big data uses three terms, all of which happen to start with the letter V: volume, velocity and variety. Many analyst firms, such as IDC, and companies, such as IBM, seem to coalesce around this definition. Volume means the massive amount of data generated and collected by organizations; velocity refers to the speed at which the data must be analyzed; and variety means the vast array of different types of data that are collected, from text to audio, video, web logs and more.

But some are skeptical of that definition, too. Breen has a fourth "v" to add to the definition: vendor.

Companies such as AWS and IBM tailor definitions to support their products, Breen says. AWS, for example, offers a variety of big data analytics tools, such as Elastic MapReduce, a cloud-based big data processing service.

"The cloud provides instant scalability and elasticity and lets you focus on analytics instead of infrastructure," Amazon spokesperson Tera Randall wrote in an e-mail. "It enhances your ability and capability to ask interesting questions about your data and get rapid, meaningful answers." Randall says Rauser's big data definition is not an official AWS definition of the term, but was used to describe the challenges businesses face in managing big data.

Big data analytics in the cloud is an emerging market, though, Kelly says. Google, for example, recently released BigQuery, the company's cloud-based data analytics tool. IBM, for its part, says information is "becoming the petroleum of the 21st century," fueling business decisions across a variety of industries.

Big data is a big market, though, IDC says. According to IBM, IDC estimates enterprises will invest more than $120 billion by 2015 to capture the business impact of analytics, across hardware, software and services. The big data market is growing seven times faster than the overall IT and communications business, IDC says.

But Vesset, the IDC researcher, says big data is not about how it is defined, but rather about what is done with the data. The biggest challenge organizations have today is understanding which technologies are best for which data and use cases. With the rise of Hadoop, an open source big data analytics tool, some question whether unstructured data services like it will spell the end of traditional relational databases.

"Both have a role to play," he says, and most large organizations will likely use both. Relational databases take a structured approach to the data and will suit organizations that have a large amount of data subject to compliance or security requirements, for example. Large-scale collection of data on an ad hoc basis is more unstructured and would take advantage of Hadoop computing clusters, he believes.

How big data is defined, though, is slightly more intangible, at least so far. Kelly perhaps has the best definition: "You know it when you see it."

Network World staff writer Brandon Butler covers cloud computing and social collaboration. He can be found on Twitter at @BButlerNWW.

Copyright © 2012 IDG Communications, Inc.