Understanding mass data fragmentation

Data is viewed as the primary source of fuel for digital transformation, but mass data fragmentation is holding companies back.

The digital transformation era is upon us, and it’s changing the business landscape faster than ever.

I’ve seen numerous studies showing that digital companies are more profitable and hold more share in their respective markets. Businesses that master being digital will sustain market leadership; those that can’t will struggle to survive, and many will disappear.

This is why digital transformation is now a top initiative for every business and IT leader. A recent ZK Research study found that a whopping 89% of organizations now have at least one digital initiative under way, showing the level of interest across all industry verticals.

Digital success lies in the quality of data

The path to becoming a digital company requires more than a CIO snapping their fingers and declaring the organization digital. Success lies in finding the key insights buried in the massive amounts of data businesses hold today. That requires machine-learning–driven analytics, a topic that has received a significant amount of media attention.

The other half of the equation is data. Machine learning alone doesn’t do anything. It needs to analyze data, and as the old axiom goes, good data leads to good insights, and bad data leads to bad insights.

Mass data fragmentation hinders digital initiatives

For most companies, data isn’t the fuel that powers digital transformation — it’s the biggest obstacle, because of something I’m calling mass data fragmentation (MDF). That’s a technical way of saying that data is scattered all over the place and largely unstructured, leaving organizations with an incomplete view of it. Data is fragmented across silos, within silos and across locations.

Adding to the problem, most companies keep multiple copies of the same data. Some data managers have told me that about two-thirds of their secondary storage consists of copies, but because no one knows which copies can safely be deleted, they are forced to keep everything. If bad data leads to bad insights, then fragmented data leads to fragmented insights, which can lead to bad business decisions.

Digital natives such as Amazon and Google are data-centric and architected their infrastructure to avoid the MDF issue. This is why those businesses are agile, nimble and always seem to be at the forefront of market transitions. They have access to a larger set of quality data and are able to gain insights that other companies can’t.

Many factors contribute to mass data fragmentation

The majority of companies were born in an era when data was viewed not as a competitive asset but rather as a necessary evil. For most companies, the mere mention of data evokes images of high-priced storage systems, ineffective backups, a complex management problem and a source of risk that could cripple the company.

To solve the MDF problem, it’s important to understand how we got here. Below are the main factors that have contributed to MDF.

  • Data has exploded. Data growth continues at an exponential rate. Ninety percent of all the data ever generated has been created in the past five years. Video, IoT, messaging and the cloud will only exacerbate the problem. The legacy mindset of “keep everything forever” is no longer viable.
  • Most data is unstructured. Most organizations have far more data than they realize. The typical storage manager knows how much data resides on centralized storage systems, but that’s just a fraction of what exists. The average enterprise likely has millions of gigabytes of information on ad hoc storage systems in dozens, perhaps hundreds, of locations. Then there’s the cloud, which includes corporate-sanctioned public services as well as the hundreds of consumer-grade ones that workers use. IT can no longer secure, control and use all of that data.
  • Data is dark. Even if IT managers knew where all their data was, it’s unlikely they would know its contents — for example, whether it contains personally identifiable information, who owns it, and when and by whom it was last accessed. Data is essentially a black hole, making it nearly impossible to manage and to meet increasingly stringent compliance requirements.
  • Secondary storage dominates. Often about 80% of an organization’s data falls into the secondary storage bucket: backups, archives, file shares, object stores, data warehouses and public clouds. Secondary storage mostly holds data that is accessed infrequently rather than data that actively contributes to the company’s overall data set, which means any insights locked in secondary storage will likely never be discovered.
  • Data is managed by legacy infrastructure. The IT industry is in an unparalleled period of innovation. Containers, flash storage, the cloud, mobile improvements and software-defined infrastructure have made infrastructure highly agile and brought it into alignment with digital trends. Secondary storage, however, has stood still for the better part of three decades. Most organizations use a mix of siloed, outdated point products that were each built for one specific function, such as backup or file sharing.

MDF is a serious enough problem that it is now impairing organizations’ ability to compete in the digital era. It’s time for a major shift in the storage industry — not a few incremental improvements, but rather a complete rethink of managing data that addresses the many problems associated with MDF. This requires a fresh approach to data management, but that’s the subject of another post.
