American Cancer Society embraces Big Data

In 2012, the American Cancer Society, one of the largest non-profit groups in the country, realized that its decentralized organizational structure had to change. So the ACS consolidated 13 locations (a national home office and 12 separate locations, each its own charity) into one company based in Atlanta with a dozen divisions.

The agency quickly realized that its IT structure needed an overhaul as well. The newly centralized Siebel database with 4,000 objects and 150 tables ran on 8-year-old HP-UX hardware. Reports took an average of four hours, meaning users needed a second computer for other work while reports were churning.

In March 2013, the American Cancer Society hired Blake Sanders, who brought with him 20 years of experience in business analytics and data warehousing. In the newly created position as Vice President of Architecture and Data Management, his orders were to prepare for a Big Data future. He decided to start with a data warehousing appliance to address speed and data lag, and to provide a foundation for addressing data complexity going forward.

"We went through a fairly standard RFP process," says Sanders. "We planned it all out rigidly. There was a template sent to all parties, with a description of the situation and the problems to solve, asking them to please respond to all questions." Sanders and the staff had gathered plenty of information on the market, and sent the RFP to four vendors: Oracle/Exadata, Microsoft, IBM/Netezza, and Teradata.

Questions such as, "Does your platform allow integration with these specific ETL (extract, transform, load) tools?" were included. As were questions related to connectivity to other systems, data modeling software, maintainability and maintenance, staffing requirements, and integration with other toolsets beyond ETL. For a couple of months, Sanders and team evaluated and clarified responses. Ultimately, they narrowed the Proof of Concept vendors down to a manageable two: Teradata and Netezza.

"We knew we couldn't handle four proof of concepts at one time," says Sanders. "Nor could all of them solve our problems, and that's what we had to prove: the solution would make a significant impact on operations." Including hardware, software, installation, and services, the project budget was close to a million dollars.

Sanders had installed a Netezza (now officially the IBM PureData System for Analytics) system at a previous job in 2006 and had been happy with the results. Although careful to remain neutral, he would have been perfectly fine installing another Netezza system for the ACS.

A meaningful Proof of Concept

Although donors to the American Cancer Society understand the need for technology to support research, the image in their minds is doctors and test tubes, not computers, says Sanders. "Donor spending on technology can be seen as less valuable."

Being a good steward meant proving that the computer improvements would more than pay for themselves, so tracking Total Cost of Ownership and Return on Investment would be critical. Sanders also wanted users to stop waiting for data and start using it.

Sanders outlined his Proof of Concept process and goals:

  1. Clearly support business needs
  2. Establish and track success metrics
  3. Fully explore product features
  4. Differentiate fact from fiction (marketing hype)
  5. Examine "exotic use cases"
  6. Attempt to illustrate return on investment

Knowing it would be difficult to go back and add something later, Sanders needed to solve current issues with speed and provide a foundation for the next three to five years.

It’s difficult to put a number on something like productivity gains, but Sanders tried to get specifics on how the company would benefit from saving X work hours per week. The increased efficiency might allow management to reduce staff or to launch new initiatives without adding headcount. Those details would go into a chart as cumulative savings and exotic use cases.

The race was on

Netezza and Teradata installed their systems side-by-side in the ACS data center the same week. Sanders didn't want any of the data leaving the building, so he couldn't rely on cloud and remote testing. Plus, any tweaking would be done by his team rather than the vendors.

The data the ACS currently manages comes from 76 million constituents (donors, volunteers, staff, etc.) gathered from more than 6,000 charity events per year. The total dataset was, says Sanders, "a surprisingly small 2.5TBs of current data."

Sanders created a test dataset of about 20 tables (4,000 objects) from his total of 150 tables, and provided the same dataset to both vendors. Preparing the data for the test was a dry run for converting all his data to the new system.

Rather than splitting his internal IT group into Team Netezza and Team Teradata, Sanders wanted all of the members of his team to use both systems so everyone could compare them during later evaluations. Each step of the execution plan was the same for both systems so Sanders could compare apples to apples.

The Proof of Concept lasted about six weeks. On each system, they loaded data, noted features and administration details, and performed query tuning. They ran small, medium, and large queries, about 15 in all, and monitored which tables were being used. Sanders had a "fast food" slogan for the project: "Make it fast, make it fresh, make it better."

Sanders and his ACS team made a script of things to accomplish: set up the database, import the dataset, evaluate the management tools available, and start testing response times. While testing, they tweaked the systems for indexing and aggregations. They wound up spending an extra week testing areas that were not current issues but that they knew would be necessary in the future.
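The article does not show the team's actual test script. As a rough illustration only, a side-by-side timing run of the same queries against two systems might look like the Python sketch below. Everything in it is an assumption made for the example: the `donations` table, the three queries, the `system_a`/`system_b` names, and the use of in-memory SQLite databases as stand-ins for the two appliances.

```python
import sqlite3
import time


def time_queries(conn, queries, repeats=3):
    """Run each query `repeats` times and keep the best wall-clock time in seconds."""
    results = {}
    for label, sql in queries.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()
            best = min(best, time.perf_counter() - start)
        results[label] = best
    return results


def make_stand_in_system():
    """Build a small in-memory database (hypothetical stand-in for one appliance)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE donations (constituent_id INTEGER, amount REAL)")
    rows = [(i % 100, float(i)) for i in range(2_000)]
    conn.executemany("INSERT INTO donations VALUES (?, ?)", rows)
    return conn


# Hypothetical small, medium, and large queries; identical for both systems.
QUERIES = {
    "small": "SELECT COUNT(*) FROM donations",
    "medium": "SELECT constituent_id, SUM(amount) FROM donations "
              "GROUP BY constituent_id",
    "large": "SELECT a.constituent_id, COUNT(*) FROM donations a "
             "JOIN donations b ON a.constituent_id = b.constituent_id "
             "GROUP BY a.constituent_id",
}

# Both systems receive the same dataset and the same query list.
systems = {"system_a": make_stand_in_system(), "system_b": make_stand_in_system()}
timings = {name: time_queries(conn, QUERIES) for name, conn in systems.items()}

for label in QUERIES:
    row = ", ".join(f"{name}: {timings[name][label]:.4f}s" for name in systems)
    print(f"{label:<6} {row}")
```

Keeping the dataset, the query list, and the timing method identical across systems is what makes the resulting numbers comparable.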

Productivity gains were immediate. Query times went from an average of four hours on the old system to about 40 seconds on the new systems. Yes, that is about 370 times faster than before. Instead of being able to run 1,000 reports a week, they could now run 4,990 reports in the same amount of time. That added up to $119,700 per week in savings based on employee time saved.
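The arithmetic behind these figures can be checked with a short Python sketch. The inputs are the rounded numbers quoted above. The per-report value is only what those numbers imply when divided out; the article does not say how the weekly savings figure was actually derived, so treat that step as an assumption.

```python
# Rounded figures quoted in the article.
OLD_QUERY_SECONDS = 4 * 60 * 60   # "an average of four hours"
NEW_QUERY_SECONDS = 40            # "about 40 seconds"
OLD_REPORTS_PER_WEEK = 1_000
NEW_REPORTS_PER_WEEK = 4_990
WEEKLY_SAVINGS_USD = 119_700


def speedup(old_seconds, new_seconds):
    """Ratio of old to new query time."""
    return old_seconds / new_seconds


def implied_value_per_extra_report(savings, old_reports, new_reports):
    """Weekly savings divided by the additional reports run per week.

    Assumption: savings scale linearly with the extra reports.
    """
    return savings / (new_reports - old_reports)


print(speedup(OLD_QUERY_SECONDS, NEW_QUERY_SECONDS))  # 360.0
print(implied_value_per_extra_report(
    WEEKLY_SAVINGS_USD, OLD_REPORTS_PER_WEEK, NEW_REPORTS_PER_WEEK))  # 30.0
```

With the rounded inputs the speedup comes out at 360, which sits inside the 350-370x range Sanders cites a year later. The weekly savings figure works out to $30 for each of the 3,990 additional reports.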

And for the first time, users could drill down into reports as easily as they did on spreadsheets. Users could look at the data in ways they never could before. Sanders says the speed increase has driven behavior changes in users who now can ask questions of their data over and over in seconds.

One hard cost eliminated by a new system would be the second computer needed by each user. When reports took hours, users needed another system to use while waiting. Some even had three systems on their desks. Other hard costs that would be eliminated, and so help pay for the new system, included the Oracle licenses and maintenance on the older HP-UX equipment. Cumulative savings after a few years would add up to the new system's purchase price, and the savings would only grow after that.
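A cumulative-savings break-even check of this kind can be sketched in a few lines of Python. The purchase price and yearly savings below are placeholder values chosen only to make the example run. They are not ACS figures, and the function name `break_even_year` is hypothetical.

```python
from itertools import accumulate


def break_even_year(purchase_price, annual_savings):
    """Return the 1-based year in which cumulative savings first cover the
    purchase price, or None if they never do within the years supplied."""
    for year, total in enumerate(accumulate(annual_savings), start=1):
        if total >= purchase_price:
            return year
    return None


# Placeholder inputs (assumptions for illustration, not figures from the article).
example_price = 800_000
example_savings = [250_000, 300_000, 350_000, 400_000]

print(break_even_year(example_price, example_savings))  # 3
```

Here the running totals are 250,000, then 550,000, then 900,000, so the placeholder system pays for itself in the third year.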

Sanders felt Netezza was in the lead at that point. And what he knew about Netezza back in 2006 was still current.

Comparing futures

The second half of the Proof of Concept looked toward the future. Sanders was essentially testing appliances to perform data warehousing today, but he wanted to evolve into a true Big Data system over the next three to five years. The ACS needed to move to Hadoop and monitor real-time data movement, for example to personalize the ACS website experience for volunteers and Relay for Life charity participants.

Netezza had new hardware and was faster than the 2006 model Sanders had used in the past, but its software remained essentially the same. However, Teradata's 14.10 operating system showed considerable improvement from prior versions. "It looked like they had innovated their software lots more than Netezza," says Sanders. "Netezza had been in the lead, but I changed my mind at that point to Teradata."

Sanders did not want to say exactly how much the project cost, but he said that projects of this type for organizations of similar size to the American Cancer Society end up being around $750,000. Based on a "node compute power" scale Sanders developed to equalize different hardware requirements, there was little price difference between Netezza and Teradata.

So far, so good

The RFP/POC process lasted about six months, and the Teradata hardware was installed in mid-October of 2013. By December, the ACS was in production with a weekly refresh of the Siebel reporting system. By January, the refresh was occurring daily.

Since then, the ACS added a datamart for the Finance, Planning and Accounting group and provided data sources for the marketing team to do some basic campaign analysis, according to Sanders.

“One year later, and we are still enjoying a 350-370x query performance gain over the old data architecture, and are moving forward with simplifying the data model to make it more friendly to ad-hoc queries. Maintenance has not been an issue. We’ve not had any downtime with the system during the year, and any maintenance has been minor. What we’re working on next is to improve our data pipeline even further by using change data capture on our Siebel application data in order to load into Teradata in near real time, thus reducing our batch overnight load window even further. We will be able to load data from application to reporting data store as it changes, and provide business activity monitoring where we’ve never been able to before,” says Sanders.

James E. Gaskin writes books, articles, and jokes about technology, and consults for those who don't read his books and articles.


Copyright © 2015 IDG Communications, Inc.
