Nordstrom's use of APM lets it identify app performance issues in minutes vs. days

The high-end retailer uses Dynatrace Application Performance Management tools in development all the way through to production

Gopal Brugalette, Senior Applied Architect for Performance Engineering in Nordstrom’s Performance Engineering Group

Application performance is critical in most organizations, but it is particularly vital when it comes to the booming world of online retail. Network World Editor in Chief John Dix recently caught up with Gopal Brugalette, Senior Applied Architect for Performance Engineering in Nordstrom’s Performance Engineering Group, to learn how Nordstrom is using Dynatrace Application Performance Management tools in development all the way through to production.

How about we start with a description of your group and an explanation of your role.

I am in the Performance Engineering Group, and we work with just about all the IT systems within Nordstrom to ensure they can scale to support various workloads. We also do performance engineering, where we use tools in a test environment to simulate thousands of users or millions of transactions against our environments, then monitor and analyze how well they’re performing and make tuning and code changes so they will perform well when they go into production.
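As a rough illustration of the load-simulation side of that work, here is a minimal sketch in Python. The endpoint and user count are hypothetical, not Nordstrom’s actual tooling, which drives thousands of users with dedicated load-test products:

```python
# Minimal load-simulation sketch (illustrative only; the endpoint and
# user count are hypothetical, and real tests simulate thousands of
# users with dedicated tools and realistic workload models).
import concurrent.futures
import time
import urllib.request

TARGET_URL = "https://example.com/"   # hypothetical endpoint
VIRTUAL_USERS = 50

def one_request(_):
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
            resp.read()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=VIRTUAL_USERS) as pool:
    results = list(pool.map(one_request, range(VIRTUAL_USERS)))

latencies = sorted(t for ok, t in results if ok)
errors = sum(1 for ok, _ in results if not ok)
print(f"errors: {errors} of {VIRTUAL_USERS}")
if latencies:
    print(f"median response time: {latencies[len(latencies) // 2]:.3f}s")
```

Note that a harness like this only measures end-to-end response time from the outside, which is exactly the limitation discussed later in the interview.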

My primary responsibility is on the website, www.Nordstrom.com, although I work on some other systems as well, including our store systems, which salespeople use to help customers, and internal systems like our order management system and some of our internal reporting systems.

Are you typically called in after a performance problem has cropped up, or in the design phase to ensure it is done right from the get-go?

We typically try to get involved fairly early so we can start planning for testing as well as production monitoring. The end-to-end performance engineering lifecycle is really about starting early and designing for performance. There are certain design patterns that can be implemented to better support scalability and performance, and it is ideal to start thinking early about what we call the workload model -- what kind of volume the system needs to support, which systems it is going to interface with, and how best to design those interfaces for performance. From there we think about the best way to do the testing, the best way to do monitoring, the performance areas of concern in a given architecture, the kinds of things we know are vulnerable to failure, and so on.

How big is the performance engineering group?

We have 30-40 people spread out over India, the United States and Mexico.

Before we dig in deeper on the performance stuff, can you give us a thumbnail description of your technical environment?

We have two data centers supporting about 119 full-line stores, 167 Nordstrom Rack stores and about 63,000 employees.

Is the website hosted internally and, if so, on how many servers?

Internally on hundreds of servers. And that’s just our website. We have all our store systems and inventory systems and order management, etc. It goes on and on.

When it comes to performance management, where did the need come from? Were you trying to solve a specific problem or just being proactive to make sure customers had the best possible experience?

It’s a combination of both. I’ve been with Nordstrom about four years and the performance engineering team for the website was formed about six years ago after an outage that made it clear we needed to look more closely at performance. But on an ongoing basis it’s definitely about making sure the customer has the best experience possible, which is one of Nordstrom’s driving philosophies: Always focus on customer experience.

But this is also about making sure our site can scale to support our business growth. Over the last three years our website grew 30% year-over-year in terms of dollars, and transaction volumes grew at essentially the same rate. So making sure we can continue to support that growth is a key priority.

When you arrived were there any APM tools in place?

We didn’t have any APM tools when I came in, so a big focus for me has been bringing in additional tools and the performance engineering techniques and methodologies to manage it. At the time, we were using tools to simulate thousands of users and that gives you some information, but basically the only thing you can do is look at end-to-end response time. So, if a simulated customer clicks on a button and it takes five seconds for that page to load, that’s the customer experience at a very simple level.

And if we saw performance was unacceptably slow we would dig around. We would look at server health, like CPU and memory usage, to see if systems were operating within acceptable parameters, and do isolation tests to determine if there might be a problem in the code, and then look through the code and do a lot of guesswork to try to figure out where the problem was, come up with a solution and retest. It was a very time-consuming and inexact approach.

That brings up another performance related issue: speed of delivery. We want to be able to get new features out to the customer very quickly, and it had been taking us a long time to find issues in the test environment and resolve them. That’s from the test perspective. We had a similar problem in production. Because we weren’t able to do the level of testing we wanted, issues would get into production and we would again have a problem detecting them and identifying the root cause, so it could result in production downtime.

So we knew we needed to bring in an APM tool. We wanted help with both preproduction testing and development as well as in production, so we evaluated a few. We ultimately went with Dynatrace and implemented it in the test and development environments, and then once we got familiar with it we moved it into production.

When was that?

It was about three years ago. The thing that really stood out for us about Dynatrace was their PurePath capabilities that let you dive in and follow a transaction across the entire infrastructure, as one system makes a call to another system, which then makes a call to another system, etc. It makes it very easy to trace and identify where performance issues are.

Once you implemented it, was it like turning on a big performance spotlight?

It was, absolutely. Because before our testing tool would say, “Your transaction took five seconds” while we had a target of, say, two seconds. We knew this transaction might call four or five different servers, but we had no idea where the time was being spent and why.

There is a term in testing called “black box testing” where you’re testing something you can’t see inside. It’s a total black box. You don’t know what’s going on. You’re just kind of poking around from the outside. Dynatrace completely opened it up so we could see everything that our code was doing. So if a transaction was taking five seconds we could go into Dynatrace and see three of those seconds are being taken by this method on this call, and the developer could go fix that. What would literally take three or four days could now be accomplished in a few hours. I can’t even imagine trying to do any sort of performance engineering without an APM tool anymore.

You wonder how you got by without it up to that point.

Honestly, I have been doing performance work for 15 years and before these tools existed it was like beating our heads against the wall. With Dynatrace, identifying issues went from days to hours. Our test cycles went from weeks to days. We have also used it in production to be able to identify the root cause of issues in a few hours or minutes that would have taken days to track down without it.

Working with a production issue is challenging. A typical approach is to determine the factors that contributed to the issue, try to reproduce it in a test environment, and only then work out the root cause and identify a fix. An APM solution helps you determine the root cause directly in production, from the actual incident, which significantly speeds up resolution time.

So with black box testing you would tweak whatever parameters were available to you and see what the outcome was?

It was a very iterative test and tune methodology. Okay, we think the problem is here. Let’s either change some parameters on the server configuration or make some changes to the code. We weren’t confident that would be it, but we knew these were dials we could turn. So we would make those changes and retest to see if the performance was improved.

Or we would try to change our test to try to accentuate a certain area. For example, if we had five servers involved we would try to change the test until we hit one server differently than the others so that we would get some additional information. Or test one server at a time to see if it was one system that was having a problem. It was more exploratory and iterative. It took a lot of time, which is why test cycles would extend to weeks.

How granular does the Dynatrace tool get?

It can go to essentially the method level. If you think of code design, you’re essentially executing all these methods or functions. Generically it’s kind of like a stack trace, so you can go in, not quite to the line of code, but one level up from lines of code, and you can see methods or functions being executed and see where the time is being spent.
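To make “one level up from lines of code” concrete, here is a minimal sketch using Python’s built-in profiler; the functions are hypothetical stand-ins, and an APM agent captures something analogous (per-method timing and call counts), but continuously and across distributed systems rather than in a single process:

```python
# Method-level timing with Python's built-in profiler -- a rough,
# single-process analogue of the per-method breakdown an APM tool
# provides (the functions here are hypothetical stand-ins).
import cProfile

def fetch_price(sku):
    # stand-in for a method that does real work
    return sum(i * i for i in range(5_000))

def render_page():
    return [fetch_price(sku) for sku in range(200)]

# Prints each function with its call count and cumulative time,
# showing which method the time is actually being spent in.
cProfile.run("render_page()", sort="cumulative")
```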

Additionally, it gives you a lot of other great information. You can very easily see how many calls you’re making. So it might show you a database call is taking 10 milliseconds, but then it will tell you you’re making 1,000 calls. So while the database call itself is not a problem, the fact that you’re making 1,000 of them is: at 10 milliseconds each, those calls add up to 10 seconds. You can also see whether time is being spent in the network layer, the operating system layer, or the code itself. It really gives you a lot of detail.
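A minimal sketch of that anti-pattern and the usual fix, with a hypothetical data-access layer (the class and the timings in the comments are illustrative, not any specific database client):

```python
# The "1,000 cheap calls" problem (class and method names hypothetical).
class FakeDB:
    """Stand-in for a database client; counts round trips."""
    def __init__(self):
        self.round_trips = 0

    def query_one(self, item_id):
        self.round_trips += 1          # ~10 ms each in the example above
        return {"id": item_id}

    def query_many(self, item_ids):
        self.round_trips += 1          # one trip regardless of row count
        return [{"id": i} for i in item_ids]

db = FakeDB()
# Anti-pattern: one round trip per item.
# 1,000 trips x 10 ms = 10 seconds of accumulated latency.
items = [db.query_one(i) for i in range(1_000)]
print(db.round_trips)   # 1000

db = FakeDB()
# Fix: batch everything into a single query -> one round trip.
items = db.query_many(range(1_000))
print(db.round_trips)   # 1
```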

How often is it a network or server hardware problem versus a coding problem?

It’s almost always coding. We very rarely need to add servers. It does happen occasionally, but that’s usually when we only have one or two servers to begin with and then our volumes double and we need to increase server capacity. At a guess, I’d say 90% of the time it’s a code problem, or even more.

After using the tool for a while, can you classify the coding problems? Do most problems end up being X?

There are so many varieties. But there is a group we call “design issues.” People are using a lot of shared libraries or shared functions, may not fully understand how something works, and end up with code that makes those extra database calls. If we’re calling a database for the same information over and over again, not realizing that a previous line of code already made that call, that’s going to slow you down. Even if there are only two calls at 10 milliseconds each, if one of them is redundant, that extra 10 milliseconds is going to seriously affect your scalability at volume. That’s a design thing Dynatrace is really good at surfacing.
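One common fix for that particular design issue is to cache or memoize the lookup so repeated requests for the same information cost only one round trip. A minimal sketch, with a hypothetical lookup function:

```python
# Memoizing a repeated lookup so identical calls hit the database once
# (the function name and return value are hypothetical).
from functools import lru_cache

@lru_cache(maxsize=1024)
def get_customer_tier(customer_id):
    # Without the cache, every call here would be another round trip.
    print(f"db hit for customer {customer_id}")
    return "standard"

get_customer_tier(42)   # db hit
get_customer_tier(42)   # served from cache -- no second round trip
```

Caching brings its own concerns (staleness, memory), but it directly eliminates the redundant-call pattern described above.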

Do you use any of the other Dynatrace tools?

Off and on we’ve been using User Experience Monitoring for some of our store systems. So, while the core Dynatrace tools are server-side instrumentation, UEM involves instrumentation on the user’s browser or mobile device. If it is a browser, for example, you can see how long an image or JavaScript took to load and execute. It gives you that full end-to-end picture. We’ve been using that for our store systems so we can see what our salespeople are experiencing, and we’re looking at expanding that.

Anything you wish the tool could provide that it doesn’t today, or anything you would want more of?

Once you implement APM across your entire enterprise like we have, the challenge is the huge volumes of data you generate. It is almost beyond the human ability to analyze and comprehend. So we essentially need a machine learning technology to be able to assist with issue detection and correlation analysis, basically a way for the APM system to tell us, “Hey, you’ve got a problem here and this is the extent of the problem and we’ve done some automated analysis and this is probably where the problem area is” so you can quickly resolve it.

The typical approach is you set predetermined thresholds. You might monitor the response time of a particular page, like your checkout page or your home page, and say “I want the system to tell me every time the home page takes longer than six seconds to load.”

That sounds good, but what happens is that alert is going to fire a lot, and then you have to decide, “Is this important for me to look at, given all the other alerts I’m getting?” If you decide to look at it, you then have to figure out whether it is serious and what the cause is. So you start looking through a lot of data to do the full analysis, and that takes a ton of time. It’s very complicated.
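A sketch of the contrast being drawn here: a fixed threshold can fire constantly on normal variation yet still miss a real regression, while even a simple statistical baseline flags samples that are unusual relative to recent history. All numbers are hypothetical, and the anomaly detection in real APM products is considerably more sophisticated:

```python
# Fixed threshold vs. a simple rolling baseline (numbers hypothetical).
import statistics

THRESHOLD_S = 6.0   # "tell me every time the page takes > 6 seconds"

def static_alert(load_time_s):
    return load_time_s > THRESHOLD_S

def baseline_alert(load_time_s, recent_samples, z=3.0):
    """Flag a sample more than z standard deviations above the recent
    mean -- a crude stand-in for automated anomaly detection."""
    mean = statistics.mean(recent_samples)
    stdev = statistics.stdev(recent_samples)
    return load_time_s > mean + z * stdev

recent = [2.1, 2.3, 1.9, 2.2, 2.0, 2.4, 2.1, 2.2]
print(static_alert(5.0))            # False: under the fixed threshold
print(baseline_alert(5.0, recent))  # True: far outside normal behavior
```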

So where the tools are going, and where they really need to go, is to the next step, where they can say, “Hey, you have a problem and it’s affecting this many customers and it’s of this nature and here’s what’s involved in addressing it.” That is, I think, an exciting new area for the APM tools.

Great. Anything else I didn’t think to ask that you think is important to get across?

I think we kind of touched on it, but if readers are serious about the performance of their systems, then it’s really about addressing it through the whole lifecycle in development, in testing and in production, really taking that APM mindset. A lot of implementations just focus on one area, but to be really successful you have to do it essentially from start to finish, from design and development all the way through to production. That’s one of the reasons we really like Dynatrace, because it can be used in all of them. But regardless of what tools you’re using, you need to take that mindset to really be successful.

Copyright © 2015 IDG Communications, Inc.
