This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.
In IT we love creating new hype cycles and catchphrases. And like fashion trends, we seem to have a 20-year cycle where we go back to what we've done before but slap a new name on it and insist everybody must "have" it immediately. The latest hype: big data.
From Interop to cloud conferences and even to Dilbert, we are being told if we don't have a big data strategy -- that, by the way, aligns with our cloud strategy -- we are behind, and our company will crash and burn.
There are three important reality checks about big data. First, it's not really new. Companies like Amazon, Microsoft and Google have been doing big data work since the '90s. In fact, companies have been mining data for decades. It may have been accessible or affordable only to a few very large companies with big wallets and big mainframe installations, but it has existed. Today, advanced data mining capability and algorithms are accessible to nearly everyone thanks to inexpensive computing and storage capacity as well as new tools and techniques.
In fact, many folks think big data is just a new name for business intelligence (BI). While there are similarities, big data goes beyond BI. I love how Stuart Miniman, a senior analyst at Wikibon, talks about the "bit flip" from BI to big data. Here is how I see that bit flip in action:
Second reality check: The "big" part is relative. We are absolutely dealing with a record level of digital data growth across all industries and organizations. According to IDC, we are creating more than 58 terabytes of data every second, and we expect to have some 35 zettabytes of digitally stored data by 2020. However, big data doesn't have to be massive. It's not so much the size but what you need to do with it and the time required to process it. A small company with 100 terabytes might have a big data problem, because it needs to extract, analyze and make decisions from multiple data sets about its product.
Third, the definition of data used in big data processes is broad. It can include both structured and unstructured data, and for some companies, the most vital big data is metadata, or the data about the data. Gartner does a good job of defining the data characteristics in big data as having volume, variety and velocity.
McKinsey defines big data as "datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." What I would add to this is: "that requires massively parallel software (systems) running on tens, hundreds or even thousands of servers (clouds)."
Beyond coming to a common understanding and definition of big data, the next big hurdle for most companies is how to get started. As with cloud computing, big data seems to require a massive investment and implementation of multiple solutions, new IT and business processes, and a new level of business agility. Here are seven steps to big data success:
Step 1: Admit you have a problem. This is always the hardest step. Ten years ago, we refused to admit our network was no longer protected by a ring of firewalls and proxy settings, and we had to open up our infrastructure for employee remote access and embrace the Internet. With big data, IT leaders need to step back and evaluate their data situation.
- Are you overwhelmed with your data sets?
- Do you NOT know where all your data sits?
- Are you (or the business leaders) NOT getting the information you need from your data?
- Do you have business leaders making decisions NOT based on data?
- Do you see an opportunity to make IT more relevant in business policy and strategic decisions?
If you're like most companies, the answer is yes to some or all of these questions, and it's time to get control of your data and the intelligence you can gain from that data to benefit the business.
Step 2: Recognize the big opportunity you have with big data. We are always being told to be more relevant to the business. The term "business technology" has been thrown around for years, but it's not always easy to see how our latest software and processes directly impact revenue or global growth. Big data can. Why? Because information is power, and business leaders need the information trapped in the data to compete, thrive and grow. The business, from sales to marketing to the C suite, is overwhelmed by the amount of data being generated by employees, customers and the market. Your ability to bring concise and real-time information and analysis of that data can and will drive increased revenue.
Step 3: Create your big data plan. As with any plan, you should start with the end in mind. What does the business need to know? What are the questions they need answered? Define this and get joint agreement before you even start playing with Hadoop. The whole point of this exercise is driving business intelligence and success. Then, follow these steps (obviously over-simplified, and each step could take weeks or months depending on your organization):
- Isolate the data that is part of your "big data" equation
- Separate "product" big data from "company" big data, such as making sure employee data needed for HR analysis is separate from customer or product search data in your e-commerce platform
- Recognize and understand the peaks and valleys of your data
- Understand which technologies allow real-time (or near real-time) big data processing
- Identify key solutions/vendors
- Start small, evaluate and grow -- do a project where you can quickly show success and ROI, then move to the next big data project
- Continually analyze, adjust and give input -- big data is agile and should be adjusted as your data, the intelligence and the business requirements change
Step 4: Think distributed. Big data requires us to shift our thinking about our systems and infrastructure. Just as virtualization fundamentally changed how we were able to utilize servers and applications, distributed systems and processing enable us to manage big data: a distributed architecture allows us to break the problem into many tasks and then distribute those tasks across multiple systems. The good news is we have a growing number of tools and architectural frameworks to leverage, with names like Cassandra, Hadoop, VMware, Red Hat and many more. Distributed systems are not new, but big data takes earlier approaches to a whole new level. Some examples of distributed approaches include:
- Multi-tenant architecture
- Distributed database
- Multi-core CPUs
- Parallel processing
- Distributed file systems
- Distributed load balancing
- RAID algorithms
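To make the "break the problem into many tasks" idea concrete, here is a minimal Python sketch of the split, process and merge pattern. It is an illustration only, not how Hadoop or any other product named above works internally: the function names and sample data are hypothetical, and threads stand in for what would be separate servers in a real deployment.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(records, task_count):
    """Break one large job into roughly equal, independent chunks."""
    return [records[i::task_count] for i in range(task_count)]


def count_words(chunk):
    """Work done by a single worker: count words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, worker_count=4):
    """Fan the chunks out to workers, then merge the partial results."""
    tasks = split_into_tasks(records, worker_count)
    # Assumption: threads stand in for separate servers in this sketch.
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_counts = list(pool.map(count_words, tasks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical sample data for illustration.
sample_log = [
    "big data is not new",
    "big data is relative",
    "data about data is metadata",
]
word_totals = distributed_word_count(sample_log, worker_count=2)
```

Because each chunk is processed independently, adding workers (or machines) is mostly a matter of changing `worker_count`; only the final merge step needs to see every partial result.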
Step 5: Go beyond distributed to decentralized. This is the real paradigm shift for most companies. And this is where the cloud and big data come together, since the Internet is the largest distributed and decentralized system in the world, and we should leverage the Internet backbone as much as we can when implementing big data.
We are comfortable with distributed instances or compute processing, but decentralized often brings the feeling of lost control. Why is this necessary? Embracing a decentralized approach to big data is required because of all the unused instances and storage capacity going to waste due to overbuilding and orphaned services.
More importantly, distributed components alone will not allow us to keep up with our data growth. Remember that 35 zettabytes expected by 2020? Even if we stay on pace with our current data center build out, which is at a record high worldwide, we can't build centralized infrastructure fast enough. IDC estimates that by 2020, we will have a 60% gap between digital data created and data center capacity available (see chart below).
Source: IDC Digital Universe Study, 2011
However, part of that is because we don't fully utilize the capacity we already have. Gartner estimates that most computers, servers and networks are running at 30% capacity in order to be ready for peaks or future growth. While we would never run at 90% or 100% capacity, we can make better use of that excess capacity without creating excess risk, while saving millions of dollars and improving the TCO of our existing infrastructure.
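The arithmetic behind that utilization argument can be sketched in a few lines of Python. The 30% figure is the Gartner estimate cited above; the 1,000-terabyte fleet size and the 60% target are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
def reclaimable_capacity(total_capacity, current_pct, target_pct):
    """Capacity freed up by moving from current to target utilization."""
    if not 0 < current_pct <= target_pct <= 100:
        raise ValueError("expected 0 < current_pct <= target_pct <= 100")
    return total_capacity * (target_pct - current_pct) / 100


def workload_growth_factor(current_pct, target_pct):
    """How much the workload can grow before new hardware is needed."""
    if not 0 < current_pct <= target_pct <= 100:
        raise ValueError("expected 0 < current_pct <= target_pct <= 100")
    return target_pct / current_pct


# Assumed figures: a 1,000 TB storage fleet at 30% utilization,
# with a (hypothetical) target of 60% that still leaves room for peaks.
extra_tb = reclaimable_capacity(1000, current_pct=30, target_pct=60)
growth = workload_growth_factor(current_pct=30, target_pct=60)
```

Under these assumptions the same hardware absorbs an additional 300 TB, or twice the current workload, before any new capacity has to be built.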
What are the key characteristics of decentralization?
- No central bottleneck
- Power of large numbers
- Organic, demand driven growth in capacity
- Leverages existing infrastructure and devices on edge
- Shared information
- Concept of "Contribution" to the community
- Assumes everyone/every node is "untrusted"
- Geographic spread of:
  - Ownership and participation
  - Management overhead
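Two of these characteristics, "no central bottleneck" and "assumes every node is untrusted," can be illustrated with a short Python sketch. This is a toy model under stated assumptions, not a description of any vendor's product: the `Node` class, the `store` and `fetch` functions, the chunk size and the replica count are all hypothetical. The idea is that data is split into chunks, each chunk is copied to more than one peer, and every chunk is checked against its SHA-256 digest on the way back, so a misbehaving peer cannot silently corrupt the result.

```python
import hashlib


class Node:
    """A hypothetical storage peer; it may return corrupted data."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def put(self, digest, chunk):
        self.chunks[digest] = chunk

    def get(self, digest):
        return self.chunks.get(digest)


def store(data, nodes, chunk_size=4, replicas=2):
    """Spread chunks across peers; return the digests needed to rebuild."""
    manifest = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        # Place each chunk on `replicas` different peers, round-robin.
        for offset in range(replicas):
            nodes[(index + offset) % len(nodes)].put(digest, chunk)
        manifest.append(digest)
    return manifest


def fetch(manifest, nodes):
    """Rebuild the data, trusting no peer: verify every chunk by hash."""
    parts = []
    for digest in manifest:
        for node in nodes:
            chunk = node.get(digest)
            if chunk is not None and hashlib.sha256(chunk).hexdigest() == digest:
                parts.append(chunk)
                break
        else:
            raise LookupError(f"no valid copy of chunk {digest[:8]}")
    return b"".join(parts)


peers = [Node("a"), Node("b"), Node("c")]
manifest = store(b"big data everywhere", peers)

# Simulate one misbehaving peer that corrupts everything it holds.
for digest in peers[0].chunks:
    peers[0].chunks[digest] = b"????"

restored = fetch(manifest, peers)
```

In this sketch the only thing that has to be kept safe is the small list of digests; the bulk data can live on any peer, because a bad copy is detected and the next replica is tried instead.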
There are good examples of decentralized approaches today. Perhaps the best-known one, which we don't really think of as decentralized, is the open source movement, in which programmers who support the open source philosophy contribute to the community by voluntarily writing and exchanging programming code for software development. If you look at the characteristics of decentralization above, the open source community is a perfect example of decentralized development. And, while 10 years ago few enterprises utilized open source code in production, today you are in the minority if you do not leverage open source components in your stack.
There are two newer examples of decentralization in the dev/ops community, which I define as decentralized cloud systems: CloudStack and OpenStack. What I like about OpenStack is not only the community aspect and how even big vendors like IBM, HP and Cisco are jumping on board, but how it tackles the "control" issue of decentralized architectures by providing a centralized dashboard and interface. We are still in the early stages of decentralization, but this will be a key trend over the next few years as we continue to experience record data growth and need to process, analyze and make decisions about that data.
Step 6: Hire/grow the right people and skills. I have long said that cloud does not mean fewer IT jobs, but the advent of cloud and big data does mean we need to evolve our skill set and talent pool. There are some existing roles, like database administrator, that become even more vital in the big data world. Other roles you should start nurturing and hiring in your IT organization include:
- Data scientists
- Random theorists (algorithms)
- Business analysts
- UX/UI experts
Some of these seem logical for an IT shop, but the ones I always get questions about are the business analyst and UX/UI roles, which have not traditionally sat in IT. You could put these roles in product management, but they need to work hand in hand with the dev/ops team on the big data solution. This is because if you cannot present the big data information visually to the business side, you will not succeed. Dashboards, charts and easy-to-understand analysis are key.
Also, since I mentioned it, if you have not already integrated your dev/ops teams to better manage your cloud implementations, then do it now. Our world can no longer operate with these two functions as separate silos. They must be joined at the hip and working in complete alignment for any cloud or big data strategy to succeed.
Step 7: Use data with your big data. Just as IT roles might start to look strangely business-focused, IT needs to change the way it is measured. Everyone on your team must be metrics driven and have a passion for tracking and moving toward that key performance indicator (KPI). And these should be aligned with business metrics, not just with releasing on time or delivering quality code.
I think the best description of this required cultural shift is "growth hacker." A growth hacker is someone who loves driving toward metrics, is a creative problem solver, and is continually exploring new ways to push metrics up and to the right. While typically this was the job of the business side, every member of the tech team should have clear metrics and be empowered to find new ways to drive stretch results.
Big data might not be the answer to all of our prayers, but it does represent an opportunity for IT to have a seat at the table and directly drive stronger revenues, market penetration and share of voice in an ever more competitive global marketplace.
Margaret Dawson is a 20-year high-tech industry veteran and cloud expert. She is a frequent author and speaker on cloud computing, big data, network security, integration and other business and technology themes. Currently, Margaret is vice president of product management at Symform.