Network World

Data shuffling: A safer way to analyze confidential data?

Oklahoma State professor explains data protection technique
By Bob Brown, Network World, 06/19/2008

Oklahoma State University’s Technology Business Assessment Group recently announced it will fund research on an approach to information protection called data shuffling. The project is led by Professor Rathindra Sarathy of OSU’s Department of Management Science and Information Systems, who explains to us just what data shuffling is and why it could be coming to your network soon.

Can you give me a quick layman’s explanation of data shuffling, then a little more technical one for our readers in IT security? Also, how’s it different from encryption?

Data shuffling (U.S. patent: 7200757) belongs to a class of data masking techniques that try to protect confidential, numerical data while retaining the analytical value of that data. Let us say that you want to provide confidential salary data to an analyst. The goal is to try to answer questions such as “Controlling for experience, education and other factors, is there a difference in salary between male and female managers?” or “What are the best predictors of salary among variables such as Age, Sex, Experience, Education, Race, etc.?”

You do not want to provide the original salary data to the analyst, for obvious confidentiality reasons. Even if you remove personally identifiable information before providing the original confidential data, security is not assured since it is usually easy to identify an individual if you know their characteristics. Conventional encryption techniques would not be of value, since the unencrypted original salary is necessary to perform analysis. Hence, one approach is to try to modify the numbers (masking the numbers) before you provide them to the analyst. Data shuffling would intelligently re-assign the original salary numbers such that the results of the analysis come out correct. Simultaneously it prevents you from associating the original salary numbers with the correct individuals. The real power of data shuffling shows up when you want to maintain complicated relationships among several variables, including both confidential and nonconfidential, such as in the second question above.
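The point about identifying individuals from their characteristics can be illustrated with a small sketch. The records, field layout and function name below are hypothetical and not taken from the interview; the term "quasi-identifier" is a common label for such characteristics and is used here only for convenience. The sketch counts how many name-free records are still unique on their remaining characteristics.

```python
from collections import Counter

# Hypothetical records with names already removed:
# (age, sex, years_experience, education, salary)
records = [
    (34, "F", 10, "MBA", 98000),
    (34, "M", 10, "MBA", 105000),
    (51, "F", 25, "PhD", 143000),
    (29, "M", 4, "BS", 61000),
    (29, "M", 4, "BS", 64000),
]


def unique_by_quasi_identifiers(rows, n_quasi):
    """Return the rows whose first n_quasi fields match no other row."""
    counts = Counter(row[:n_quasi] for row in rows)
    return [row for row in rows if counts[row[:n_quasi]] == 1]


# Anyone who knows a colleague's age, sex, experience and education
# can read off the salary of every record returned here.
exposed = unique_by_quasi_identifiers(records, 4)
print(f"{len(exposed)} of {len(records)} records are unique on their characteristics")
```

In this toy data set, three of the five records can be tied to a single person even though no name appears anywhere.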

Data shuffling isn’t something we’ve written about, though I do see a fair number of references to it on the Web. Do you have a sense of how hot a concept this is now?

Several researchers are working on data masking concepts. Data shuffling is a particular method of data masking that we have patented. We believe that it has strong potential. Unfortunately, organizations have not realized the power of data shuffling and the potential benefits that come from using this approach. Our main thrust in the next two years will be to educate and promote the benefits of data shuffling.

As for commercial products, there are a couple of data masking products in the marketplace. But, unlike data shuffling, they handle only fairly simplistic situations. As a result, the masked data does not offer the same quality assurance that data shuffling provides.

I saw a presentation you did that focused on protecting data in healthcare settings. Is that where you see data shuffling taking hold initially, or what other vertical markets do you think are especially good fits?

Healthcare is definitely one of our current focus areas, but there are many other applications such as insurance claims data or other types of financial analysis applications where it can be useful. In fact, data shuffling can be used in any situation where an organization wishes to analyze or share any confidential data.

What form might data shuffling take in commercial products?

We think that data shuffling can function as a stand-alone software product, as an add-in to other products (including spreadsheets), or even as an XML Web service delivered through the Internet. Prototypes of all three versions are being made and initial tests have been successful.

I see you’re looking to spin off your research into a start-up. Have you done a start-up before and what’s your vision for what the company might be like?

This would be our first start-up. Our vision is to have our product become as ubiquitous as, say, software packages such as SAS or spreadsheets. We envision that once companies see the potential of our product, they will not hesitate to use the right experts for analysis (whether in-house or outsourced), and leverage the value of historical and legacy data that otherwise sits unused because of confidentiality concerns.

Any other hot buttons or points you think are essential to make for someone trying to get their arms around the data shuffling/masking concept?

The biggest hurdle that we have come across is simply a lack of awareness among organizations that there are intelligent mechanisms that can harness the analytical value of confidential data. Understandably, the instinctive reaction among many organizations is to lock away the data, or restrict access to it using expensive methods such as “secure centers” for pre-approved authorized analysts. One organization literally has a highly secure physical room, where only those people who have been thoroughly “vetted” can even enter. Yet we have seen frequent instances where privacy is compromised when someone loses a laptop or a CD containing confidential data. Our method offers a refreshing, less expensive and more secure alternative. In summary, think of data shuffling as a way to leverage the value of confidential data in a more efficient and effective manner.

Conceptually, data shuffling is relatively simple. As I said before, assume that we have data on a thousand individuals on their gender, education, years of experience, age and salary. Data shuffling essentially assigns the salary of the first individual to the 85th individual, the age of the first individual to the 657th individual, and so on. The process is essentially repeated for every individual so that when the data shuffling is completed the original confidential values have all been reassigned. But if we do the shuffling randomly (like a card shuffling machine), we would lose all the relationships between the variables. With data shuffling, the values are shuffled, but relationships and confidentiality are preserved.
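The contrast between random shuffling and relationship-preserving shuffling can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the patented procedure: it assumes a single nonconfidential variable, a roughly linear relationship, and a rank-matching step chosen here for brevity. The names `pearson` and `rank_shuffle` and the synthetic salary data are all hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def rank_shuffle(confidential, nonconfidential, rng):
    """Reassign confidential values across records by rank matching.

    Assumption for this sketch: each record gets a synthetic score that
    depends only on its nonconfidential value plus fresh noise, and the
    record with the k-th smallest score receives the k-th smallest
    original confidential value.
    """
    n = len(confidential)
    r = pearson(nonconfidential, confidential)
    mx = sum(nonconfidential) / n
    sx = (sum((x - mx) ** 2 for x in nonconfidential) / n) ** 0.5
    noise = (1 - r * r) ** 0.5
    scores = [r * (x - mx) / sx + noise * rng.gauss(0, 1)
              for x in nonconfidential]
    order = sorted(range(n), key=lambda i: scores[i])
    values = sorted(confidential)
    shuffled = [0.0] * n
    for rank, i in enumerate(order):
        shuffled[i] = values[rank]
    return shuffled


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data: 1,000 individuals, salary loosely tied to experience.
    experience = [rng.uniform(0, 30) for _ in range(1000)]
    salary = [30000 + 2000 * e + rng.gauss(0, 10000) for e in experience]

    naive = salary[:]
    rng.shuffle(naive)  # the "card shuffling machine"
    masked = rank_shuffle(salary, experience, rng)

    print("original r:", round(pearson(experience, salary), 3))
    print("random shuffle r:", round(pearson(experience, naive), 3))
    print("rank shuffle r:", round(pearson(experience, masked), 3))
```

In both shuffles every original salary still appears exactly once in the released column, so totals, averages and percentiles are unchanged. Only the rank-matched version should keep the correlation with experience close to the original, while the purely random shuffle drives it toward zero.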

