Contributing Writer

Real-world backup woes and how to fix them

Aug 21, 2019 | 12 mins
Cloud Computing, Data Center, Disaster Recovery

How four enterprises solved problems with their backup and restore processes


Data backup and restoration can be somewhat of a black-box effort. You often don’t know whether you fully nailed it until disaster strikes, and there is always room for improvement, especially as cloud and hybrid options grow. We asked four network professionals to share what made them realize they should do more to bolster their organization’s backup and recovery processes, and how they made that happen. Here are their stories.

A Kansas university outgrows tape backups

The aha moment: In May 2011, a tornado hit Joplin, Mo., and Tim Pearson, a volunteer fire chief in a nearby town, was called in to help in the aftermath. “Suddenly, I was in a town that I knew well but couldn’t recognize anything. They literally painted intersection names on the streets to help people get oriented,” says Pearson, who is director of infrastructure and security at Pittsburg State University in Pittsburg, Kan.

His colleagues with data centers in Joplin, Mo., were struggling just to identify where the sites should be, let alone how to get their networks back online. He realized that PSU’s approach of having traditional tape backups, rotated weekly, in a bank vault across town didn’t provide enough reliability for the region’s weather patterns. “We had to take a fresh look at our vulnerabilities,” he says.

Geographic diversity

The fix: Initially, Pearson and his team addressed the university’s geographic vulnerability by placing another Dell EqualLogic storage array and 50 percent of its virtual computing horsepower in the basement of a library across campus from the university’s primary data center. The team also added a Dell MD3200 storage array at Wichita State University (WSU), which PSU connects to via the Kansas Research and Education Network, using a high-speed fiber ring. Data was manually replicated to the secondary site (the library) several times throughout the day. Backups were sent nightly to WSU, eliminating the cumbersome tape process that had been in place.

“A tape retrieved from the vault might be a week old and take a day to recover,” Pearson says, adding that a disaster that took out the primary and secondary sites would make it even more difficult to restore the data from the tapes.

Although the library and WSU arrays worked well, the PSU team decided to improve backup and recovery even more, weaving in Hedvig’s Distributed Storage Platform (software-defined storage) for automated orchestration. Hedvig uses agreed-upon policies to manage data replication in real time among multiple nodes: the primary data center, the library and WSU. “As long as two of the three nodes are up and running, our data is accessible,” he says.
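The two-of-three rule Pearson describes is a simple quorum check. The following is a minimal sketch of that idea only, not Hedvig's API; the function name, the site labels and the `required` parameter are hypothetical and chosen for illustration.

```python
def data_accessible(sites_up: dict[str, bool], required: int = 2) -> bool:
    """Return True when at least `required` replica sites are reachable.

    Hypothetical helper: models the "two of three nodes" rule from the
    article, not any vendor's actual availability logic.
    """
    # Booleans count as 0 or 1, so sum() gives the number of sites that are up.
    return sum(sites_up.values()) >= required


# Example: the WSU link is down, but the two campus sites are still up.
sites = {"primary": True, "library": True, "wsu": False}
print(data_accessible(sites))
```

With the WSU link down the check still passes, which matches the router-reboot incident described in the article; losing a second site would make it fail.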

The system was tested recently when the link to WSU was temporarily shut down due to an unplanned router reboot. “Hedvig noted a problem, isolated it and got the WSU system caught up as soon as the link came back online 15 minutes later. Our data center continued normal operations throughout the incident,” Pearson says.

Hedvig works well with the university’s legacy systems, which are still housed on a Unix server with iSCSI connections. “Most of the other vendors we looked at didn’t support that type of legacy configuration [which the school is dependent upon], but Hedvig handles it quite elegantly. Their client-facing ‘proxy’ interfaces (small physical or virtual Linux servers) serve as multiprotocol connectors into the Hedvig storage environment and offer a range of block and object-oriented protocols, including NFS, Amazon S3 and even iSCSI,” Pearson says. 

PSU’s IT team tests recoverability as part of routine maintenance, bringing down nodes and recording response times. All of the storage network configurations are well documented and updated often. 

“My experience in the fire service and at Joplin makes me aware that you can’t take anything for granted, and my advice is to get as much geographic diversity in your storage network as possible,” Pearson says.

Correctional services team shores up backup vulnerabilities

The aha moment: “There were two moments that really drove us into high gear for changing how we’re doing backup and recovery – one man-made and the other a natural disaster,” says Dwain Caldwell, a systems administrator in Iowa’s Department of Correctional Services. Caldwell works in DCS’s First Judicial District, which provides correctional services to 11 counties in northeast Iowa.

A few years ago, a user in a supervisory role visited a Web site, not knowing it had ransomware. “Nothing jumped out to the person,” Caldwell says. The ransomware penetrated the main file systems, but Caldwell and his team were able to stop it relatively quickly. Although the team had a valid backup to restore to, the time it took to bring operations back to normal was longer than expected. “Training employees helps, but we can’t control social engineering. What we can control is how fast we can get back online,” he says.

The second incident was a storm that sent water into the building where the primary site is housed and caused a power outage in the secondary site’s building. “I didn’t think we were susceptible [to full downtime] until that happened,” Caldwell says. Having primary and secondary sites so close together with no third alternative was an unreliable strategy.

Virtualization speeds data recovery

The fix: In recent years, DCS and the Department of Corrections as a whole have worked to virtualize their computing environments, including using virtual desktop infrastructure, and Caldwell says his district of DCS is at about 80 percent virtualized. This has made implementing a new data-backup and restoration plan much simpler.

DCS uses Nutanix Core hyperconverged infrastructure to handle VDI and data protection and disaster recovery in the data center and remote sites. “We are able to set up our policies for backup and restore so it all happens behind the scenes if someone makes a mistake,” he says.

Nutanix frequently takes and stores snapshots of production environments, typically every 15 minutes, so if DCS is hit by a ransomware attack, Caldwell and his team can automatically restore the system to the most recent snapshot.
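To make the effect of a 15-minute snapshot cadence concrete, here is a rough sketch of choosing a restore point after an infection. It is not Nutanix's interface; the function name, the timestamps and the assumption that the infection time is known are all illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional


def latest_clean_snapshot(
    snapshots: list[datetime], infected_at: datetime
) -> Optional[datetime]:
    """Return the newest snapshot taken strictly before the infection time.

    Hypothetical helper: assumes the infection time is known and that any
    snapshot taken before it is clean.
    """
    clean = [s for s in snapshots if s < infected_at]
    return max(clean) if clean else None


# Assumed schedule: one snapshot every 15 minutes, 09:00 through 10:45.
start = datetime(2019, 8, 21, 9, 0)
snapshots = [start + timedelta(minutes=15 * i) for i in range(8)]

infected_at = datetime(2019, 8, 21, 10, 20)
restore_point = latest_clean_snapshot(snapshots, infected_at)

# The gap between the infection and the restore point is the work that is lost.
print(restore_point, infected_at - restore_point)
```

With a fixed 15-minute interval, the work lost between the restore point and the incident is always under 15 minutes, provided the infection is detected before later snapshots are relied on.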

The IT team has developed experiments to test recovery time, including taking down a server room so a node goes offline. “The goal is to see how long it takes VMs on that node to come back online on other nodes,” he says.

Restoring applications goes hand in hand with restoring data, he says, because most of the applications are so data-dependent, such as the probation and parole applications. “Users need access to historical data as much as the application itself,” he says.

In the event data becomes unavailable from the Nutanix system, as in a flood or storm, Caldwell can tap into incremental backups stored on an EMC Data Domain storage appliance located in the same city as well as one in another geographical location, with the closer location getting backed up more frequently. “We’d spin the best backup into a virtual-sandbox environment and then push it to the main data center,” he says.

“Backup solutions today are so much more universal than before. You used to have to make sure the environment you were restoring the tape in exactly matched the original configuration. In our hypervisor environment, we are able to have our data available more quickly and efficiently,” Caldwell says. The virtualized environment and automation also enable all storage responsibilities to be handled by two members of the IT team. “We are able to perform the backup and restoration piece and still wear a lot of other hats.”

Backup and recovery for Microsoft Office 365

The aha moment: The Aquilini Group has many subsidiaries, including the Vancouver Canucks and the team’s home rink, Rogers Arena. The company also owns all of the arena’s operations, including food and beverage services, as well as hotels, construction companies, restaurants, and blueberry and cranberry farms. The common theme across these investments is the need to protect data – whether it be customer information, surveillance-camera footage or point-of-sale transactions. That protection was tested when a SAN upgrade led by a third party went wrong and could have cost the company a significant amount of data.

“We wouldn’t have been able to serve food and beverage at an event, which would have resulted in revenue losses and customer dissatisfaction,” says Bryce Hollweg, director of IT at Aquilini Investment Group in Vancouver, B.C. Fortunately, the internal IT team had backed up the data properly and was able to restore all data. But the episode left Hollweg wanting to be even more proactive about backing up all data – even data generated by applications in the cloud.

Third-party backup for SaaS

The fix: The Aquilini Group has migrated to Microsoft Office 365 for its nearly 1,500 employees. And while Microsoft is good about guaranteeing uptime of the application, like most SaaS providers, it is less willing to take responsibility for data integrity. “We have some sensitive data that traverses the Office 365 network and need to protect it,” Hollweg says. In addition, loss of the company’s mailboxes would undoubtedly cause productivity degradation. “The more layers you can put in place, the better. A secondary and tertiary measure for cloud applications is not a bad practice.”

Aquilini uses Veeam Backup for Microsoft Office 365 as a secondary measure to protect Exchange Online, SharePoint Online, Teams (chat), and OneDrive against accidental deletion, support rapid restore, and meet compliance demands. The backups can be stored on premises, in the cloud in Microsoft Azure or Amazon Web Services, or at a third-party provider.

Hollweg says he doesn’t mind having multiple, targeted tools to manage, even with a lean staff, because the protection is customized to the type of data being stored, which makes recoverability faster and easier. “Segregating information is good so there’s not one pot where if someone cracks the code, they have access to the crown jewels.”

Local protection for virtual machines

The aha moment: When The CSI Companies, a recruiting and healthcare IT consulting firm based in Jacksonville, Fla., decided to virtualize its environment, including SQL Server, with VMware, Matt Greaves wanted to make sure that recovery time objectives remained intact.

“When we started doing recovery tests for all the virtual machines, the results were scary. An entire site restore, which we thought would take 30 hours, was more like 90 hours. That was a huge pain point,” says Greaves, director of IT at The CSI Companies. “With 3,000 to 4,000 people needing to get paid each week, even two hours of downtime for payroll systems can cause a significant rift.”

The previous backup and recovery software that The CSI Companies used required IT to manually set policies for when to perform backups, for what period of time, and for which applications. Inevitably, there were gaps that would leave them with an out-of-date or incomplete backup, and the only option after a disaster would be to manually dig through and restore individual transaction logs.

On-premises backup can cost less

The fix: Greaves decided to take advantage of the virtualized environment and deployed a stand-alone storage appliance from Rubrik that hooks directly into the VMware environment. IT can apply a specific policy – gold, for example – to the VMs listed in vCenter and automatically protect data on a granular level. “They do policy-driven backup points so I can set the SQL Server to get a transaction log snapshot every few minutes and then a full database snapshot every couple of hours,” he says. Transaction logs are now applied automatically as needed for a full restore.
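The restore logic behind that policy (start from the latest full snapshot, then replay the transaction logs taken after it) can be sketched as follows. This is an illustration of the general technique, not Rubrik's or SQL Server's API; the function name and the timestamps are hypothetical.

```python
from datetime import datetime


def restore_plan(
    full_snapshots: list[datetime],
    log_snapshots: list[datetime],
    target: datetime,
) -> tuple[datetime, list[datetime]]:
    """Pick the base full snapshot and the logs to replay to reach `target`.

    Hypothetical helper: returns the newest full snapshot at or before the
    target time, plus every log snapshot after that base and up to the target,
    in replay order.
    """
    candidates = [f for f in full_snapshots if f <= target]
    if not candidates:
        raise ValueError("no full snapshot at or before the target time")
    base = max(candidates)
    replay = sorted(t for t in log_snapshots if base < t <= target)
    return base, replay


# Assumed schedule: full snapshots every two hours, logs every five minutes.
fulls = [datetime(2019, 8, 21, 8, 0), datetime(2019, 8, 21, 10, 0)]
logs = [
    datetime(2019, 8, 21, 10, 5),
    datetime(2019, 8, 21, 10, 10),
    datetime(2019, 8, 21, 10, 15),
]
base, replay = restore_plan(fulls, logs, datetime(2019, 8, 21, 10, 12))
print(base, replay)
```

Here the plan restores the 10:00 full snapshot and replays the 10:05 and 10:10 logs. The shorter the log interval, the less data falls between the last log and the failure.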

“Backup and recovery used to be something managed on a daily basis, now the only time we need to manage Rubrik is if we get an alert and need to go investigate,” he says. As for documentation, Greaves says coworkers can get up to speed on Rubrik’s use with a one-page best-practices sheet that sits on the company’s SharePoint site. 

He considered moving applications and infrastructure to the cloud, including backup and recovery, but balked at the price. “It’s so easy to get into the cloud for infrastructure and start spinning stuff up, but there is an hourly cost to all those tools. When we did a cost analysis, it was far cheaper to keep everything on premises,” he says.

Experts recommend SaaS backup

Many IT managers feel confident about their ability to back up and restore data from on-site or from a secondary data center. It’s when you introduce cloud-based services that things get murky.

“We see companies engaging a cloud service to replace on-premise service for applications like CRM without any real understanding of how that service handles backup and restore issues,” says John Burke, CIO and principal research analyst at Nemertes Research.

Customers often get hyper-focused on failover capabilities and business continuity but don’t consider data corruption issues or times when they need to roll back to a previous week’s data. “That’s not always a default capability,” Burke says.

Vinny Choinski, senior IT validation analyst at Enterprise Strategy Group, agrees, emphasizing that “data recovery is your responsibility” when it comes to SaaS. “What if someone deletes your data? It’s prudent to make sure you understand the recovery climate of your application.”

One option for winnowing a growing field of backup and recovery service providers is to ask your SaaS provider who they prefer. Opting for one of their partners could make integrating backup for SaaS easier as well.

And while signing on to backup and recovery services for your SaaS will likely add to the cost of what you planned as a lower-cost option for your applications, both Burke and Choinski say not doing so will leave your data vulnerable.