Data hoarding is not a viable strategy anymore

As the volume of data that organizations collect continues to grow, storing everything simply isn’t cost-effective. A strategy that balances data reduction with fidelity is increasingly vital.

For years it has been standard practice for organizations to store as much data as they can. Cheaper storage options, combined with the hype around big data, encouraged data hoarding on the assumption that value would be extracted at some point in the future.

With advances in data analysis, many companies are now successfully mining their data for useful business insights, but the sheer volume of data being produced and the effort required to prepare it for analysis are prime reasons to reconsider your strategy. To balance cost and value, it’s important to look beyond data hoarding and find ways of processing and reducing the data you’re collecting.

Exponential data growth

The volume of data that’s being produced daily is growing fast. People generate enormous amounts of data, but machine-generated data is set to eclipse that. As the IoT grows from an estimated 23 billion connected devices this year to almost 31 billion by 2020 and a staggering 75 billion by 2025, according to IHS data at Statista, collecting and storing all that raw data is starting to look impractical.

We’ve kept pace with data generation so far by adopting better compression technologies and backing up incrementally, focusing on what has changed, but as the volume increases we’re going to fall woefully behind. We must find a way to reduce the amount of data we’re collecting.

Identifying what you need

The most expensive way to store data is in its raw form, so we need to reduce it by extracting pertinent details such as averages or standard deviations. Streamlining the data we collect and processing it into a useful format seems the obvious answer; however, it’s not as easy as it sounds.
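
To make that concrete, here’s a minimal Python sketch of this kind of reduction, assuming a batch of raw numeric sensor readings; the statistics kept and the sample data are purely illustrative:

import statistics

def summarize(readings):
    # Collapse a batch of raw numeric readings into a compact summary.
    # Which statistics to keep depends on what your analytics need;
    # these five are just a common starting point.
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": statistics.mean(readings),
        "stdev": statistics.stdev(readings) if len(readings) > 1 else 0.0,
    }

# A day of one-minute temperature samples (1,440 raw values)
# collapses to five numbers.
raw = [20.0 + 0.01 * i for i in range(1440)]
print(summarize(raw))

The trade-off is stark: once the raw values are discarded, only the questions those five numbers can answer remain answerable.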

In some cases, it may be prudent to retain raw data for future audits in the event of liability exposure. Regulatory requirements must also be weighed when deciding what data to keep and what to let go.

Part of the difficulty with boiling data down is that our machine learning and artificial intelligence techniques for analysis are still developing. That means we’re betting on what will be valuable and what we can afford to discard. It’s neither practical nor prudent to try to store all raw data, but there’s a balance to be found, and much depends on your specific business.

Processing at the edge

Figuring out what data you want to keep and how the rest of the data you’re collecting should be processed is just one piece of the puzzle. You also need to work out where that processing and data reduction will take place. There’s a natural tendency to centralize data for analysis, but collecting it all and sending it to the cloud for processing takes time and costs money.

In many cases it will prove more cost-effective to reduce data at the edge, as close as possible to where it’s generated. Forwarding only what you need for analysis cuts both storage requirements and network traffic. The trick is accurately identifying what you need, but as machine learning advances we’ll be able to move beyond educated guessing.
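
As a rough sketch of what that can look like in practice, the loop below buffers readings on the edge device and forwards only a periodic summary; read_sensor, forward_to_cloud, and the ten-minute window are hypothetical stand-ins for your real sensor, transport, and cadence:

import statistics
import time

WINDOW_SECONDS = 600  # forward one summary every ten minutes (illustrative)

def forward_to_cloud(summary):
    # Stand-in for a real transport such as MQTT or HTTPS.
    print("forwarding:", summary)

def run_edge_loop(read_sensor):
    buffer = []
    window_start = time.time()
    while True:
        buffer.append(read_sensor())
        now = time.time()
        if now - window_start >= WINDOW_SECONDS:
            # Send a handful of numbers instead of every raw sample.
            forward_to_cloud({
                "count": len(buffer),
                "mean": statistics.mean(buffer),
                "max": max(buffer),
            })
            buffer.clear()
            window_start = now
        time.sleep(1)  # sampling interval; tune to the sensor

However many samples are taken per window, only three numbers cross the network, which is exactly where the storage and traffic savings come from.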

Mapping the future

To mitigate the risk of discarding valuable data, you need to draw up some projections and ask probing questions about the future of your business. Don’t just look at what you use data for today; ask what you might use it for tomorrow. If there are new sources on the horizon, work out what they’ll need to provide for effective business analytics.

There must be some kind of ROI calculation here: what is the cost of storing this data versus its potential future value? Work out your ideal topology and plan how you’ll reduce data, forward it, store it, process it, and analyze it.
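
As a back-of-envelope illustration (every figure here is hypothetical), the comparison can be as blunt as projected storage cost against the value you expect the data to unlock:

# All figures are hypothetical, for illustration only.
raw_tb = 500                # one year's raw data, in TB
cost_per_tb_year = 250.0    # storage cost in dollars per TB per year
retention_years = 3

# Cost of keeping one year's raw data for the full retention period.
storage_cost = raw_tb * cost_per_tb_year * retention_years
expected_value = 150_000.0  # your estimate of the insight this data unlocks

print(f"storage cost: ${storage_cost:,.0f} vs. expected value: ${expected_value:,.0f}")
if expected_value < storage_cost:
    print("Candidate for reduction: summarize at the edge, discard the raw stream.")

The real numbers will be harder to pin down than this, especially the expected value, but even a crude estimate beats keeping everything by default.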

In the short term it may be necessary to err on the side of caution and make provisions to store more data points. The best strategy right now may be to process at the edge where you can, but combine that with more traditional centralization of data where there’s less clarity around its value.

Being proactive

As the mountain of data grows ever larger, failing to act is asking for trouble. You need a smart cloud data management strategy to drive innovation, and it will rely on the data collection and processing foundation you build. The speed at which new data is accumulating and its projected growth mean that time is of the essence. Trying to retrofit a processing procedure or introduce a streamlined data topology will never be as easy as it is today.

Use your current business performance and future goals to identify the data you need, find ways to process that data at the edge where practical, and weigh the value of analysis against the cost of storage. The ideal data strategy will take time to figure out and will differ from organization to organization, but what’s certain is that data hoarding is no longer a viable approach.
