How to derive real, actionable insights from your data lake: Five best practices

By taking steps to ensure the quality of assets within the data lake, organizations can prevent their lakes from becoming data swamps


This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter’s approach.

More businesses are embarking on data lake initiatives than ever before, yet Gartner predicts 90% of deployed data lakes will be useless through 2018 as they become overwhelmed with data that has no clear use case. Organizations may see the value of having a single repository to house all enterprise data, but lack the resources, knowledge and processes to ensure the data in the lake is of good quality and actually useful to the business. To truly leverage your organization’s data lake to derive real, actionable insights, there are five best practices to keep in mind:

Ensure you’re populating the data lake with all enterprise data, not just the data that is easiest to reach.

Companies are making massive investments in emerging technologies like Hadoop, Spark and Kafka to build their data lakes, but the value and insight they gain is limited by their ability to get data assets from diverse sources into those environments. Most companies have no trouble ingesting newer sources of data from IoT or mobile devices, but often miss the mainframe, which is inherently difficult to access yet vital to completing the 360-degree view of a business. Mainframe data serves as key customer reference data and helps make sense of newer sources.

Consider compliance needs before beginning any data lake project.

Businesses should first discuss their regulatory needs and, when necessary, create a system to preserve a copy of their data in its original, unaltered format. This is especially important in highly regulated industries like banking, insurance and healthcare, which must maintain data lineage for compliance purposes. To keep up with evolving regulations, businesses also need the flexibility to write and adjust their own rules, and should look for a vendor that provides this.
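
As an illustration of preserving an unaltered copy, the Python sketch below stores each incoming payload byte for byte under its SHA-256 checksum and appends a lineage record to a log. It is a minimal sketch, not a compliance solution: the function names, the `raw_zone` directory layout and the `lineage.jsonl` log file are all assumptions made for this example.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def archive_raw(payload: bytes, source: str, raw_zone: Path) -> dict:
    """Store the payload byte for byte and return a lineage record.

    The directory layout and record fields are hypothetical.
    """
    digest = hashlib.sha256(payload).hexdigest()
    target = raw_zone / source / f"{digest}.bin"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # content-addressed: never overwrite an original
        target.write_bytes(payload)
    record = {
        "source": source,
        "sha256": digest,
        "path": str(target),
        "archived_at": datetime.now(timezone.utc).isoformat(),
    }
    # Append-only lineage log, one JSON object per line.
    with (raw_zone / "lineage.jsonl").open("a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record


def verify_raw(record: dict) -> bool:
    """Check that the archived copy still matches its recorded checksum."""
    data = Path(record["path"]).read_bytes()
    return hashlib.sha256(data).hexdigest() == record["sha256"]
```

Downstream cleansing and transformation would then work on copies, so that the archived original and its checksum remain available if an auditor asks where a figure came from.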

Create clear, consistent rules to catalogue and govern data.

It’s essential not only to document and catalogue data, but also to create enterprise-wide business and technical rules to govern it, preventing misinterpretation by different departments. For example, if a business creates rules to define “mortgage risk” within its data, every department works from the same definition and the business can use it to report to regulatory authorities. It’s certainly no easy task: it often involves multiple sources of data within disparate departments and is subject to human error, as people often add their own free-form rules to segment data that may not be clear to others. But when done right, it makes the data valuable to even the least technical user.
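
One way to picture an enterprise-wide rule is as a single named definition that every department looks up rather than re-implements. The Python sketch below is purely illustrative: the `BusinessRule` class, the field names and the 0.80 loan-to-value threshold are hypothetical and do not represent any regulatory definition of mortgage risk.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class BusinessRule:
    name: str
    description: str  # plain-language meaning for non-technical readers
    owner: str        # who is responsible for changing the definition
    predicate: Callable[[dict], bool]


# Hypothetical definition: the threshold and field names are illustrative
# only, not a regulatory standard.
HIGH_MORTGAGE_RISK = BusinessRule(
    name="high_mortgage_risk",
    description="Loan-to-value ratio above 0.80",
    owner="risk-office",
    predicate=lambda loan: loan["balance"] / loan["property_value"] > 0.80,
)

# The catalogue is the only place a rule is defined.
CATALOG: Dict[str, BusinessRule] = {HIGH_MORTGAGE_RISK.name: HIGH_MORTGAGE_RISK}


def apply_rule(rule_name: str, record: dict) -> bool:
    """Evaluate a record against the single catalogued definition of a rule."""
    return CATALOG[rule_name].predicate(record)
```

Because marketing, finance and compliance would all call `apply_rule("high_mortgage_risk", ...)`, a change to the definition happens in one place and carries a description and an owner with it.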

Don’t look at data integration and data quality as sequential steps. Wherever possible, it’s best to cleanse the data as it’s ingested into the data lake.

Checking data quality at the same time the data is ingested, whether hourly, daily or weekly, saves time and frustration later on. There are exceptions, especially when looking for duplicate customer information that may already exist within the lake, which requires comparing new records against those already stored.
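
A rough Python sketch of this idea follows. Each record is normalized and checked as it arrives, failures are set aside with a reason, and the duplicate check against existing lake contents runs as a separate pass. The field names (`name`, `email`) and the specific checks are assumptions chosen for illustration; real quality rules would be far richer.

```python
from typing import Iterable, List, Set, Tuple


def cleanse(record: dict) -> dict:
    """Normalize fields; raise ValueError if the record cannot be repaired."""
    email = str(record.get("email", "")).strip().lower()
    if email.count("@") != 1 or email.startswith("@") or email.endswith("@"):
        raise ValueError(f"unusable email: {email!r}")
    name = " ".join(str(record.get("name", "")).split())  # collapse whitespace
    if not name:
        raise ValueError("missing name")
    return {**record, "email": email, "name": name}


def ingest(records: Iterable[dict]) -> Tuple[List[dict], List[Tuple[dict, str]]]:
    """Cleanse each record as it arrives; quarantine the ones that fail."""
    accepted: List[dict] = []
    rejected: List[Tuple[dict, str]] = []
    for record in records:
        try:
            accepted.append(cleanse(record))
        except ValueError as err:
            rejected.append((record, str(err)))
    return accepted, rejected


def find_duplicates(accepted: List[dict], existing_emails: Set[str]) -> List[dict]:
    """Separate pass: flag records whose email is already in the lake."""
    return [r for r in accepted if r["email"] in existing_emails]
```

Keeping rejected records alongside the reason they failed, rather than silently dropping them, leaves a trail that can be reviewed later.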

Enlist the help of third-party databases to find and add the missing information to create a single view of customers.

Organizations want to use data lakes to create a single, 360-degree view of their customers, whether for marketing purposes or otherwise. But common “dirty data” issues, like duplicate records or mismatched email addresses, undermine the effort and ROI of the entire data lake initiative. One way to add missing information is through third-party databases, which help create a complete picture of a customer. To get the most out of these third parties, select a partner that has both a worldwide view of data and expertise in targeting both B2B and B2C.

For a complete, accurate and detailed view of prospects and customers, companies should consider adding the following data:

  • Email Services to confirm that global email addresses are valid, active and deliverable.
  • Phone Services to ensure a phone number is valid, in service, and matches the subscriber name.
  • IP Services to identify the location of an IP address and whether it’s a proxy.
  • Address Services to determine whether a postal address is a P.O. Box or a single- or multi-unit dwelling.
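
Before sending records to services like those listed above, it can help to screen out values that are obviously malformed. The Python sketch below uses only the standard library, and its function names are hypothetical. It checks the shape of an email address and whether an IP address is publicly routable. It cannot tell whether an address is deliverable or whether an IP is a proxy; those answers still have to come from a third-party service.

```python
import ipaddress
import re

# Deliberately loose shape check: something@something.something
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def precheck_email(value: str) -> bool:
    """Return True if the value at least looks like an email address."""
    return EMAIL_SHAPE.match(value.strip()) is not None


def precheck_ip(value: str) -> str:
    """Classify an IP string as 'invalid', 'public' or 'non-public'."""
    try:
        ip = ipaddress.ip_address(value.strip())
    except ValueError:
        return "invalid"
    return "public" if ip.is_global else "non-public"
```

Only values that pass these local checks would need to go out for external verification.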

By taking steps to ensure the quality of assets within the data lake, organizations can not only prevent their lakes from becoming data swamps, but can truly reap the benefits of the troves of data available to them. By harnessing all types of data, from legacy to newer sources, companies can rest assured that their decisions are based on complete, enterprise-wide data sets, and that they have a complete view of their customers.


Copyright © 2017 IDG Communications, Inc.
