Oklahoma State University's Technology Business Assessment Group recently announced it will fund research on an approach to information protection called data shuffling. The project is led by Professor Rathindra Sarathy of OSU's Department of Management Science and Information Systems, who explains to us just what data shuffling is and why it could be coming to your network soon.
Can you give me a quick layman’s explanation of data shuffling, then a little more technical one for our readers in IT security? Also, how’s it different from encryption?
Data shuffling (U.S. patent: 7200757) belongs to a class of data masking techniques that try to protect confidential, numerical data while retaining the analytical value of the confidential data. Let us say that you want to provide confidential salary data to an analyst. The goal is to try to answer questions such as “Controlling for experience, education and other factors, is there a difference in salary between male and female managers?” or “What are the best predictors of salary among variables such as Age, Sex, Experience, Education, Race, etc.?”
You do not want to provide the original salary data to the analyst, for obvious confidentiality reasons. Even if you remove personally identifiable information before providing the original confidential data, security is not assured since it is usually easy to identify an individual if you know their characteristics. Conventional encryption techniques would not be of value, since the unencrypted original salary is necessary to perform analysis. Hence, one approach is to try to modify the numbers (masking the numbers) before you provide them to the analyst. Data shuffling would intelligently re-assign the original salary numbers such that the results of the analysis come out correct. Simultaneously it prevents you from associating the original salary numbers with the correct individuals. The real power of data shuffling shows up when you want to maintain complicated relationships among several variables, including both confidential and nonconfidential, such as in the second question above.
Data shuffling isn’t something we’ve written about, though I do see a fair number of references to it on the Web. Do you have a sense of how hot a concept this is now?
Several researchers are working on data masking concepts. Data shuffling is a particular method of data masking that we have patented. We believe that it has strong potential. Unfortunately, organizations have not realized the power of data shuffling and the potential benefits that come from using this approach. Our main thrust in the next two years will be to educate and promote the benefits of data shuffling.
As for commercial products, there are a couple of data masking products in the marketplace. But, unlike data shuffling, they handle only fairly simplistic situations. As a result, the masked data does not offer the same quality assurance that data shuffling provides.
I saw a presentation you did that focused on protecting data in healthcare settings. Is that where you see data shuffling taking hold initially, or what other vertical markets do you think are especially good fits?
Healthcare is definitely one of our current focus areas, but there are many other applications such as insurance claims data or other types of financial analysis applications where it can be useful. In fact, data shuffling can be used in any situation where an organization wishes to analyze or share any confidential data.
What form might data shuffling take in commercial products?
We think that data shuffling can function as a stand-alone software product, as an add-in to other products (including spreadsheets), or even as an XML Web service delivered through the Internet. Prototypes of all three versions are being made and initial tests have been successful.
I see you’re looking to spin off your research into a start-up. Have you done a start-up before and what’s your vision for what the company might be like?
This would be our first start-up. Our vision is to have our product become as ubiquitous as, say, software packages such as SAS or spreadsheets. We envision that once companies see the potential of our product, they will not hesitate to use the right experts for analysis (whether in-house or outsourced), and leverage the value of historical and legacy data that otherwise sits unused because of confidentiality concerns.
Any other hot buttons or points you think are essential to make for someone trying to get their arms around the data shuffling/masking concept?
The biggest hurdle that we have come across is simply lack of awareness among organizations that there are intelligent mechanisms that can harness the analytical value of confidential data. Understandably, the instinctive reaction among many organizations is to lock away the data, or restrict access to it using expensive methods such as “secure centers” for pre-approved authorized analysts. One organization literally has a highly secure physical room, where only those people who have been thoroughly “vetted” can even enter. Yet we have seen frequent instances where privacy is compromised when someone loses their laptop or a CD containing confidential data. Our method offers a refreshing, less expensive and more secure alternative. In summary, think of data shuffling as a way to leverage the value of confidential data in a more efficient and effective manner.
Conceptually, data shuffling is relatively simple. As I said before, assume that we have data on a thousand individuals: their gender, education, years of experience, age and salary. Data shuffling essentially assigns the salary of the first individual to the 85th individual, the age of the first individual to the 657th individual, and so on. The process is essentially repeated for every individual so that when the data shuffling is completed the original confidential values have all been reassigned. But if we do the shuffling randomly (like a card shuffling machine), we would lose all the relationships between the variables. With data shuffling, the values are shuffled, but relationships and confidentiality are preserved.
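The contrast between random shuffling and relationship-preserving shuffling can be illustrated with a small sketch. The interview does not describe the internals of the patented method, so the code below is not that algorithm. It is a simplified, rank-based stand-in that handles one confidential variable (salary) and one nonconfidential variable (experience). The function names, the synthetic data and the noise model are all assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def naive_shuffle(confidential, rng):
    """Card-shuffle style reassignment: hides values but breaks relationships."""
    shuffled = list(confidential)
    rng.shuffle(shuffled)
    return shuffled


def rank_based_shuffle(confidential, nonconfidential, rng):
    """Toy relationship-preserving shuffle (illustrative, NOT the patented method).

    1. Give every record a synthetic score that is correlated with its
       nonconfidential value about as strongly as the real confidential
       value is.
    2. Hand the k-th smallest original confidential value to the record
       with the k-th smallest score.
    """
    n = len(confidential)
    rho = pearson(confidential, nonconfidential)
    mean = sum(nonconfidential) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in nonconfidential) / n)
    noise_weight = math.sqrt(max(0.0, 1.0 - rho ** 2))
    scores = [
        rho * (v - mean) / std + noise_weight * rng.gauss(0, 1)
        for v in nonconfidential
    ]
    order = sorted(range(n), key=scores.__getitem__)  # record indices by score rank
    sorted_values = sorted(confidential)
    shuffled = [None] * n
    for rank, record in enumerate(order):
        shuffled[record] = sorted_values[rank]
    return shuffled


def make_demo_data(n, rng):
    """Synthetic (experience, salary) records; all numbers are made up."""
    experience = [rng.gauss(15, 5) for _ in range(n)]
    salary = [30000 + 2000 * e + rng.gauss(0, 10000) for e in experience]
    return experience, salary


rng = random.Random(0)
experience, salary = make_demo_data(1000, rng)
naive = naive_shuffle(salary, rng)
masked = rank_based_shuffle(salary, experience, rng)

print("original correlation  :", round(pearson(salary, experience), 2))
print("naive shuffle         :", round(pearson(naive, experience), 2))
print("rank-based shuffle    :", round(pearson(masked, experience), 2))
```

In both cases every original salary value is reused exactly once, so summary statistics of the salary column alone are unchanged. Only the rank-based version keeps the salary and experience relationship roughly intact. A real masking method would also have to handle several variables at once and provide measurable disclosure protection, which this sketch does not attempt.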