Network World Editor in Chief John Dix first spoke to Derek Strauss a year ago when he was about three years into his new role as TD Ameritrade’s first Chief Data Officer. He had built a new group, the Enterprise Data and Analytics Group, and had just finished 18 months of work to stand up nine new platforms, including a Hadoop data store and a metadata repository. Dix recently visited Strauss to see how this massive undertaking is working out.
Derek Strauss, Chief Data Officer, TD Ameritrade
Where do we start for an update on what you’ve achieved since we last spoke?
I’ve got a long list of things we’ve been tracking in terms of value, so I can hit some of the high spots, and then it might be good to step back and look at some of the other things we’re gearing up for that are only possible because of the foundation we’ve laid. We’re going to be embarking on a pretty aggressive timeline for these new initiatives, and I feel good about being aggressive because the foundation is in place.
You mentioned the Hadoop effort, so why don’t we start there?
The drive with Hadoop is around personalization, so our clients feel like we know them and we can provide useful insights and education without it feeling creepy. The focus is to be like Amazon’s suggestions, where you go, “Wow, I like what they’re suggesting, that’s really useful.”
We’re calling the Hadoop environment the data marshalling yard. Why? Because that’s what is typically upstream from a warehouse. Think about raw materials being brought together to be manufactured into something. They will often be transported by rail and come into a marshalling yard where they’ll be sorted for delivery to various factories and warehouses downstream, and where you can perform analytics on the raw material as it stands. So it seemed like a natural analogy to call it a data marshalling yard.
What have we done with that? A couple of key things. We have mainly focused on pulling in chat information and emails, a lot of textual stuff, to try to understand client behavior so we can optimize the client experience in specific scenarios. We’re also looking at what our clients are talking about and reading. When they phone us, what do they want to talk about? Putting all of that together with their activity on our site, we figure out this client is really interested in certain types of asset classes, and we can then look to see if there are any reports by third parties, by government, by whoever, and say, “It seems like this is an area you’re interested in. Are you aware these resources have just been published? Here’s a link to them.” All of that is around personalization.
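To make that matching concrete, here is a minimal sketch, in Python, of how client activity might be tagged with interest topics and matched to published resources. The topic names, keywords, data structures and sample resources are illustrative assumptions, not TD Ameritrade’s actual model, which relies on much richer text analytics in the Hadoop environment.

```python
# Hypothetical sketch: tag a client's interests from their activity, then match
# those interests against recently published resources. All data is illustrative.

client_activity = {
    "chats": ["question about muni bond ladders", "asked about ETF expense ratios"],
    "pages_viewed": ["fixed-income screener", "etf-comparison-tool"],
}

# Naive keyword tagging stands in for the text analytics done on chats and emails.
TOPIC_KEYWORDS = {
    "fixed_income": ["bond", "muni", "fixed-income"],
    "etfs": ["etf"],
}

def infer_interests(activity):
    text = " ".join(activity["chats"] + activity["pages_viewed"]).lower()
    return {topic for topic, words in TOPIC_KEYWORDS.items()
            if any(w in text for w in words)}

published_resources = [
    {"title": "SEC report on ETF liquidity", "topics": {"etfs"}},
    {"title": "Municipal bond market update", "topics": {"fixed_income"}},
    {"title": "Options strategies primer", "topics": {"options"}},
]

interests = infer_interests(client_activity)
suggestions = [r["title"] for r in published_resources if r["topics"] & interests]
print(suggestions)  # resources worth surfacing to this client, with a link
```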
So we’re realizing analytics benefits, but there are also benefits around data and data management.
Let’s take a simple example of a codes table. A code could be anything, but let’s look at country codes. South Africa is ZA, the United States is US. When it comes to programmers writing programs, if there isn’t one country code table everyone can refer to as the authoritative table, everyone hard codes the table into their program. But any large organization has hundreds of systems, so you’ve probably got 100 country code tables hanging around, or worse, one for every program.
Master data management is all about trying to solve that. Country code is just one simple example, but when we started looking at this it was amazing how many times people have created redundant tables, and that can lead to all sorts of regulatory and compliance problems and a lot of inaccuracies.
Take me, for example. I was born in Rhodesia. Rhodesia doesn’t exist anymore, but if you’re looking for Derek’s birthplace, are you going to know Rhodesia is now Zimbabwe? Keeping that memory of geographical stuff centralized is something every organization needs and no one really has.
We implemented a master data management capability and the first thing we tackled was country codes. Now our application development teams know they can go to one authoritative source to find it. They’re not continuing to perpetuate the redundancy and the inaccuracies in the data, plus if something changes, they don’t have to remember to update their program because someone in the business now owns and is responsible for updating that data.
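As a rough illustration of what that one authoritative source buys you, here is a minimal Python sketch of a single country-code lookup that also resolves historical names like Rhodesia. The table contents, mappings, and function name are hypothetical; the point is that one table owned by a business steward replaces hundreds of hard-coded copies.

```python
# Minimal sketch of one authoritative country-code source, instead of each
# program hard-coding its own table. Contents are illustrative.

COUNTRY_CODES = {
    "ZA": "South Africa",
    "US": "United States of America",
    "ZW": "Zimbabwe",
}

# Historical names map to the current authoritative record, so a birthplace of
# "Rhodesia" still resolves even though the country no longer exists.
SUCCESSOR_OF = {
    "Rhodesia": "ZW",
}

def resolve_country(name_or_code: str) -> str:
    """Return the current code for a country code, name, or historical name."""
    if name_or_code in COUNTRY_CODES:
        return name_or_code
    if name_or_code in SUCCESSOR_OF:
        return SUCCESSOR_OF[name_or_code]
    by_name = {name: code for code, name in COUNTRY_CODES.items()}
    return by_name[name_or_code]

print(resolve_country("Rhodesia"))  # "ZW" -- one answer, maintained in one place
```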
Those kinds of efficiencies are huge and very often get overlooked. When you think of the Chief Data Officer role, people just think about the sizzle of the analytics side, but there’s a very real efficiency side to data management, which is a big plus for any organization.
Once you have this master data management capability, I presume you go around looking for duplication of effort and multiple versions of the truth?
Right. And when you find it you need to find someone to own it. That’s the data governance side of things. You find an owner and that owner points to the data steward who is normally someone who is already doing work trying to fix the problem, and you say, “Here’s a tool where you can analyze all the different values you’ve got today, harmonize them, create one source of the truth and you own that and you make sure that is up to date and everyone else starts using that.” That makes a big difference.
But there are literally hundreds and hundreds of instances where this would apply and it’s a question of working with the business groups who are constantly tripping over these things, prioritizing them, and just picking them off one at a time and working through it.
The big elephant in the room is the client, because we, like many financial organizations, have grown up being account-centric. So John, let’s open an account for you. Oh, and you’d like to try something else? Well, let’s open another account for you, and another, and another. Every time we open an account for you we redundantly create information about you in that account record. We don’t have one central record about you.
Behind the scenes, for financial firms to be able to deal with you as a client and understand your total business with us and treat you accordingly, we’ve got a thousand gnomes running around all night trying to bring all this information together.
I’m exaggerating for effect, of course, but it’s a big thing because it’s like open heart surgery for the organization and you’ve got to really know that you’re going to be successful and you’ve got to plan the creation of a client master very carefully. We now have an opportunity to address that head-on because we’ve put a lot of the building blocks in place. I’ll come back to that one. That was just sowing the seed. Master data management is a key benefit and it’s all about efficiency.
Data quality improvement is another key benefit. The Patriot Act stipulated a bunch of things about anti-money laundering, and there are about five major client attributes that are critical and have to be in good order. One of them is date of birth.
How could there be any fluctuation around that?
Any company that has grown through acquisition has had to make some decisions where expediency won out over guarantees for the highest quality of data. For example, if we had acquired a book of business with a couple thousand clients and their records related to date of birth were incomplete, we might have decided to bring them in with today’s date as the date of birth and the idea that we would go back and fix it over time. The expedient thing was to get the conversion done. Other times the programs capturing the data in the companies we acquired didn’t have the right sort of edits, so you had people with birth dates in the 1800s instead of the 1900s, or birth dates in the future. Just crazy stuff.
We saw all those things and thought, “Okay, this is going to be interesting. We’re going to have to do some real work analyzing these and figuring out the root causes and figuring out the best way of remediation.”
In the past we didn’t know the extent of the problem. We stumbled on it occasionally and have had problems running various types of reports, and we’ve had to rush back and try to figure out what was going on. Now we know what’s going on. Now we know where the problems are. Now we’re actually going back and working to fix it, which is huge. That’s all the authorities want from any organization they audit. They know it’s not perfect. What matters is what you’re doing about it and whether you understand the risk.
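As a rough sketch of the kind of profiling that surfaces those problems, here are a few illustrative date-of-birth checks in Python: a placeholder date equal to the conversion date, a birth year before 1900, and a date in the future. The field names, thresholds, and sample records are assumptions for illustration, not the firm’s actual rules.

```python
# Illustrative date-of-birth profiling rules, assuming each client record
# carries a date_of_birth and a record-creation date. Thresholds are examples.

from datetime import date

def dob_issues(dob: date, record_created: date) -> list[str]:
    issues = []
    if dob == record_created:
        issues.append("placeholder: DOB equals conversion/record-creation date")
    if dob.year < 1900:
        issues.append("implausible: birth year before 1900")
    if dob > date.today():
        issues.append("invalid: birth date in the future")
    return issues

# Sample records converted from an acquired book of business (made-up data).
records = [
    {"client": "A", "dob": date(2006, 3, 15), "created": date(2006, 3, 15)},
    {"client": "B", "dob": date(1885, 1, 1),  "created": date(2006, 3, 15)},
    {"client": "C", "dob": date(2030, 7, 4),  "created": date(2006, 3, 15)},
]

for r in records:
    problems = dob_issues(r["dob"], r["created"])
    if problems:
        print(r["client"], problems)  # feed these into the remediation queue
```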
And all of these things, of course, have spinoff advantages to the analytics group because they’re starting to work with data that is in better shape, and of course if you’re working off data that’s got high integrity your decisions are going to be stronger and it’s going to be easier.
Are you bringing all the data into one place to improve the quality, or trying to improve it where it sits?
We’re trying to fix it where it is, at the actual source. But that’s a good point because, as we start thinking about creating a client master, ideally in the fullness of time we’ll have just one place where that data is and it will be good data. But because we’ve started fixing it at the source now, when we do create that client master we’re going to be creating it with good data as opposed to data that we have to go fix.
But it’s complicated. If there are seven different sources for this particular thing, say date of birth, which of those would we consider to be the authoritative source? If we really wanted to save ourselves the trouble of trying to fix all seven of them, which one would we fix now? We’re trying to do that thinking as well.
In some cases it’s not possible to do that; we’ve got to go out to all seven because of the way our systems are set up. But in other cases it’s possible to just go after one now. Again, this blocking and tackling around data wrangling is not the sexy stuff, it’s not the sizzle, but it’s critical to getting it right for the organization.
Has all of this effort required you to bring in some new types of specialists?
We’re not going to employ 100 data scientists. It’s just not going to happen in a company our size. It’s much better to think of a way to crowdsource our data science skills.
So, working with some universities, we set up a collaborative data science platform in the Amazon cloud. We moved a bunch of our data up there, signed NDAs with about 12 universities and said, “You guys need real data so your master’s and doctoral students can roll their sleeves up and play with data, and we need crowdsourcing of ideas. This is a marriage. We can both give and get something from this.”
We had a formal launch of the platform in June and we’ve had really good interaction between our analysts and the university guys. The universities have come back with phenomenal ideas and insights that we’re still developing. Over time it gives us access to some of the best and brightest students, some of whom may want to come join us. This has been very successful and we continue to push.
Coming back to the client master, where do you stand in creating that?
We created a client profile from a lot of the data we’ve been collecting, which is a consolidated view of key client attributes. We’ve never had a client record as such and this is a start, but this is not the master yet. This is tactical, but we’re already starting to use that to effectively target specific clients because we now have a view of what their interests are. In fact, this is part of the bigger personalization initiative.
Within personalization there may be 20 different topics. One of them is onboarding. When we onboard our clients we’re capturing 30 attributes related to that client, and right now we’re holding them in an Oracle database, but in time we’re going to set up our client master and move this into the client master domain.
So, you will still have multiple versions but now synchronized?
It will take some time before it’s one and only one source that everyone is using directly. Usually what happens is you first create what’s known as a registry, a central index that creates the joins between all the different instances where your client records are held. You start using that as a point people can refer to, and it grows over time as you create more and more authoritative data in it. It refines over time and ultimately it becomes the golden source, the golden record everyone uses. It’s a journey. It takes a couple of years to achieve that, but the registry, the client index, is something you can stand up much faster.
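Here is a minimal Python sketch of what such a registry might look like: one enterprise client identifier mapped to the record keys held in each source system. The system names, identifiers, and class shape are purely illustrative.

```python
# Sketch of a registry-style client index: a central mapping from an enterprise
# client ID to the record IDs held in each source system. All names are made up.

from collections import defaultdict

class ClientRegistry:
    def __init__(self):
        # enterprise_id -> {source_system: source_record_id}
        self._index = defaultdict(dict)

    def link(self, enterprise_id: str, source_system: str, source_record_id: str):
        """Record that this source-system record belongs to this client."""
        self._index[enterprise_id][source_system] = source_record_id

    def records_for(self, enterprise_id: str) -> dict:
        """All known source records for one client -- the joins the registry provides."""
        return dict(self._index[enterprise_id])

registry = ClientRegistry()
registry.link("C-1001", "brokerage_accounts", "ACCT-778")
registry.link("C-1001", "retirement_accounts", "IRA-0042")
registry.link("C-1001", "onboarding_profile", "OB-5531")

print(registry.records_for("C-1001"))
```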
So there are interim steps toward that Holy Grail.
Yes. There’s certain data our business folks have wanted to get their hands on forever, and for one reason or another it’s just been too hard to get hold of. We’ve now implemented this virtual capability where we don’t have to move the data. We can actually create a view of the data across many different sources and that has helped people get an understanding of the data without having to write new programs.
In the past, someone in analytics would say, “In order to do this I think I need this kind of data and I think it’s sitting in those systems.” Then they’d go to the data warehouse team and say, “I need that data to be extracted, transformed and loaded into the enterprise data warehouse.”
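For contrast with that ETL path, here is a minimal Python sketch of a query-time view that assembles client attributes from several sources without moving the data. The source systems and fields are illustrative assumptions; a real deployment would sit on a data virtualization layer rather than in-memory lookups.

```python
# Sketch of a "virtual view" that joins client data across sources at query
# time, leaving the data where it lives. Sources and fields are illustrative.

# Pretend these are live connections to three separate systems of record.
accounts_system = {"C-1001": {"accounts_open": 3}}
trading_system  = {"C-1001": {"trades_last_90d": 27}}
onboarding_db   = {"C-1001": {"segment": "active trader"}}

def client_view(client_id: str) -> dict:
    """Assemble one consolidated view at query time, without an ETL step."""
    view = {"client_id": client_id}
    for source in (accounts_system, trading_system, onboarding_db):
        view.update(source.get(client_id, {}))
    return view

print(client_view("C-1001"))
# {'client_id': 'C-1001', 'accounts_open': 3, 'trades_last_90d': 27, 'segment': 'active trader'}
```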