Three incremental, manageable steps to building a “data first” data lake

Instead of extracting, transforming and loading data into separate analytic clusters or data warehouses, converge data so all applications can use it in real time


Applications have always dictated the data. That has made sense historically, and to some extent, continues to be the case. But an “applications first” approach creates data silos that are causing operational problems and preventing organizations from getting the full value from their business intelligence initiatives.

For the last few decades, the accepted best practice has been to keep operational and analytical systems separate in order to prevent data analysis workloads from disrupting business operations. With this approach, any holistic analysis of the data stored in operational systems requires extracting, transforming and loading into separate analytic clusters or data warehouses. This requires additional resources, generates duplicate data, and takes considerable time, making it difficult or impossible to achieve the operational agility or algorithmic business processes recommended by Ernst & Young and Gartner, respectively.

A “data first” approach, by contrast, holds the promise of creating an infrastructure capable of capturing and consolidating all data into a converged data store or “data lake” where it can be accessed simultaneously and securely by many different applications in real time as it becomes available. Such a converged architecture simplifies data management and protection, supports new applications that combine operations and analytics, and avoids the dreaded “multiple versions of the truth” phenomenon inherent in data silos.

Outlined here are three incremental and manageable steps any organization can take to begin implementing a data first strategy.

Step #1: Create a data lake. Start by creating a data lake, and include as many data sets and sources as possible. To minimize duplication, endeavor to make the data lake serve as a system of record for as many applications as practical by fully migrating their data sets. Then “complete” the data lake by replicating, as needed or desired, data from those existing applications whose data sets cannot be migrated—at least initially—for whatever reason. In other words, migrate what you can, and replicate what you must. To enable more holistic analyses, also be sure to include in the data lake those sources of data that are currently unused, but hold potential value.
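The “migrate what you can, replicate what you must” rule can be sketched as a simple planning step. The source names and the `movable` flag below are hypothetical illustrations, not a specific tool’s API:

```python
# A minimal sketch of "migrate what you can, replicate what you must".
# Source names and the `movable` flag are hypothetical examples.

def plan_ingestion(sources):
    """Split data sources into full migrations and ongoing replications."""
    plan = {"migrate": [], "replicate": []}
    for name, movable in sources:
        # Sources whose data sets can be fully moved make the lake their
        # system of record; the rest are mirrored into the lake instead.
        plan["migrate" if movable else "replicate"].append(name)
    return plan

sources = [
    ("clickstream-logs", True),    # unused but valuable source: migrate
    ("crm-exports", True),         # batch extracts: migrate
    ("mainframe-ledger", False),   # cannot move yet: replicate into the lake
]

print(plan_ingestion(sources))
```

The payoff of keeping an explicit plan like this is that the “replicate” list doubles as a backlog: each entry is a candidate for full migration later, shrinking duplication over time.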

While filling the data lake, be mindful of the requirements of any shared data environment, including satisfying the needs for a global namespace, unified security, high availability, high performance, multi-tenancy, data protection (replication, backup/restore and disaster recovery), etc. Of these requirements, the only one that might be new or substantially different with a data first data lake is the need for multi-tenancy. Because the consolidated and converged data will need to be shared simultaneously by different applications and users in different roles across different departments, it will be important to support the various “tenants” in a way that preserves data availability, security and integrity.
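Multi-tenancy in a shared namespace amounts to scoping access by role and path, with a default-deny posture to preserve isolation. The following is a simplified sketch under assumed tenant names, roles, and paths; real deployments would delegate this to the platform’s own access controls:

```python
# A simplified sketch of multi-tenant access control over a shared,
# global lake namespace. Tenants, roles, and paths are illustrative
# assumptions, not any specific product's API.

TENANT_ACLS = {
    "/lake/marketing": {"marketing-analyst": {"read"},
                        "etl-service": {"read", "write"}},
    "/lake/finance":   {"finance-analyst": {"read", "write"}},
}

def can_access(role, path, action):
    """Walk up the namespace until a matching ACL entry is found."""
    while path:
        acl = TENANT_ACLS.get(path)
        if acl is not None:
            return action in acl.get(role, set())
        path = path.rsplit("/", 1)[0]
    return False  # default deny preserves tenant isolation

print(can_access("marketing-analyst", "/lake/marketing/campaigns/q3", "read"))
```

Resolving permissions from the nearest enclosing tenant directory lets new data sets inherit sensible defaults without per-file configuration.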

To keep costs low while making the data lake scalable, consider using commodity hardware deployed in clusters. And to maximize the data lake’s ultimate potential, use open, standards-based software with published interfaces, plug-ins and other means for integrating with other applications, services and systems. Such an “open first” approach would give preference to technologies like Linux, KVM, Hadoop, Spark, Mesos and OpenStack, and would limit any extensions or enhancements to those based on applicable industry standards, such as SQL or NFS.

To avoid a setback, resist the temptation to take on too much data too soon. Even a partially full data lake (think: reservoirs in California) can provide immediate benefits by offloading at least some data from data warehouses, Web analytics, databases, mainframes and other enterprise storage systems that are orders of magnitude more expensive. So start small, but think big.

Step #2: Begin using the data lake. The second step is to begin achieving those immediate benefits by identifying one or more new applications or use cases that were previously impractical or impossible with disparate data sources. To maximize the potential for a successful first attempt, pick some low-hanging fruit that will be easy to implement and impose minimal risk to the business. But also consider use cases that will be able to leverage a wide and deep data lake.

Examples of initial projects include integrating analytics into some operations, taking advantage of the lake’s increased data variety, volumes and/or velocities, and mining newly available data sources. True, implementing a new application that utilizes new data sources will likely take more effort, but the rewards are likely to be more meaningful to the business.

A good example that is common across virtually all industries is a “Customer 360” application that leverages both existing and new data. Keep it simple, though, at least initially, by using the app only to support a marketing campaign or enhance a CRM application.
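At its core, a Customer 360 application joins records about the same customer from silos that now sit side by side in the lake. The toy sketch below assumes hypothetical CRM and web-event records with made-up field names:

```python
# A toy "Customer 360" join: unify records about the same customer from
# two formerly siloed sources. All field names and values are hypothetical.

from collections import defaultdict

crm_records = [
    {"customer_id": "c-100", "name": "Acme Corp", "segment": "enterprise"},
]
web_events = [
    {"customer_id": "c-100", "page": "/pricing"},
    {"customer_id": "c-100", "page": "/docs"},
]

def build_customer_360(crm, events):
    """Attach each customer's web activity to their CRM profile."""
    profiles = {r["customer_id"]: dict(r) for r in crm}
    visits = defaultdict(list)
    for e in events:
        visits[e["customer_id"]].append(e["page"])
    for cid, profile in profiles.items():
        profile["recent_pages"] = visits.get(cid, [])
    return profiles

print(build_customer_360(crm_records, web_events))
```

Even this simple join illustrates the payoff: a marketing campaign can now target by segment and recent browsing behavior in a single query, something the separate silos could not support.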

After gaining some experience and competency with the data lake, give serious consideration to taking on some of the use cases that more fully leverage the breadth and depth of its data, especially those applications that enhance revenues, reduce costs, streamline operations, mitigate risk and/or address security needs.

Step #3: Make the data lake real-time. The third step involves putting the data lake to the test with real-time applications. Getting actionable insights in real time is something siloed architectures struggle to do and, therefore, holds the potential for maximizing the return on the investment in the data first strategy.

Real-time functionality is at the core of many new transformational applications that need to be able to perform analytics directly on operational data as it becomes available. These applications are normally unique to each industry, with visible early adopters in the retail, financial services and telecommunications sectors. But what they all share is the need for speed, versatility and extensibility to accommodate diverse requirements, groups and business functions—all of which embody what a data lake is designed to do.

Such transformational operational insight and agility requires the ability to quickly analyze and understand streaming data in context. The context comes from understanding both short- and long-term trends and patterns from both data-at-rest and incoming data.
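One common way to combine the two kinds of context is to score each incoming event against a long-term baseline computed from data-at-rest, smoothed by a short sliding window over the incoming stream. The window size and threshold below are arbitrary illustrative choices, not recommendations:

```python
# A sketch of scoring streaming values against both a long-term baseline
# (data-at-rest) and a short rolling window (data-in-motion). The window
# size and deviation threshold are arbitrary illustrative choices.

from collections import deque

def stream_scores(history, incoming, window=3, threshold=2.0):
    """Flag events whose rolling-window mean deviates from the
    long-term baseline by more than the threshold."""
    baseline = sum(history) / len(history)   # long-term context
    recent = deque(maxlen=window)            # short-term context
    flags = []
    for value in incoming:
        recent.append(value)
        window_mean = sum(recent) / len(recent)
        flags.append(abs(window_mean - baseline) > threshold)
    return flags

history = [10, 11, 9, 10]         # data-at-rest: baseline mean = 10.0
incoming = [10, 10, 15, 16, 17]   # data-in-motion
print(stream_scores(history, incoming))
# → [False, False, False, True, True]
```

The rolling window keeps a single noisy event from raising a flag, while the baseline supplies the long-term trend that gives each short-term spike its meaning.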

And this raises an important point: To get maximum benefit from having a data lake, the applications need to be able to work with both data-in-motion and data-at-rest. Many data analysts tend to consider “Big Data” as always being at rest, marveling at its volume and variety, and can lose sight of the fact that all of that data was created one event at a time from a wide range of sources—old and new, batch and transactional.

Indeed, it is this ability to harness many different data flows, and to understand their meaning in context and in real-time, that should be considered the hallmark of a successful data first data lake. And when that ability is achieved, the data lake becomes “enterprise grade” and ready to take on truly transformational applications.

The three incremental steps outlined here can enable any organization to approach a data first data lake the prudent way: feet first. And by helping to build competence and instill confidence, these small first steps will clear the way to diving down deep to discover the operational insights and competitive advantages previously hidden beneath the surface.
