This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter’s approach.
Many companies focus on using cheaper, faster data warehouses to organize their accumulated customer data for business intelligence. But to remain competitive, businesses need to use customer data to build applications that make real-time, data-driven decisions.
If you can interpret each attribute, event and transaction as a hint to help make the best predictions and decisions for your customers, you can transition from just customer understanding to customer action. But acting upon data to optimize customer experiences requires an architecture different from traditional data warehousing and business intelligence applications.
An enterprise’s journey from merely possessing data to acting upon it happens in four stages:
- Collection. Make sure all your touch points with your customers are instrumented to record all relevant information about each interaction.
- Organization. Bring all this data into one place, since siloed data is difficult to work with, and make it easily consumable by other systems and analysts.
- Understanding. Ask questions, get answers and form hypotheses based on your customer data that may help inform business decisions.
- Action. Close the feedback loop by plumbing your collected customer data and their derived insights back into the customer experience.
By instrumenting customer touch points like web and mobile to record impressions, clicks and transactions and then using Hadoop HDFS and MapReduce to store and organize those logs at scale, most modern companies already have the first two stages covered. Many have also mastered the understanding stage with Hadoop-based business intelligence tools like Hive to provide 360-degree views of customers, predict ROIs, and build informative reports that inform business decision-making.
But few organizations have been able to make the leap from customer understanding to the ultimate goal of improved customer experience. Successful examples of bridging this gap include new, high-value use cases:
- Recommendations, not just for content and products, but also for less traditional choices like financial investments and connections on social networks
- Search personalization (“Is this person looking for information about Jaguar the car, the operating system, or the animal?”)
- Prediction and prevention of problems like bad weather for agriculture or inefficient energy consumption
- Targeted offers and promotions
- User experience optimization, such as creating different interfaces for people who will use your website differently
Even the companies that have explored these high-value use cases usually undergo an unnecessarily long process that involves humans at too many points. Their data flows in batch between stages: Events collected from customer touch points are written to log files or transactional databases, then they are bulk imported into HDFS, then they go through a series of ETL jobs that result in clean files that a data scientist can consume.
Finally, those data scientists author and run their algorithms over these clean files to produce results that should be displayed to users (for example, recommendations). They hand these results over to engineers, who load them into a key-value store such as HBase or Cassandra so the website and other touchpoints can display results to customers.
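The hand-offs in this batch pipeline can be sketched as a chain of stages. This is a toy illustration, not any particular system: each function name is invented, and each body stands in for what would really be log shippers, HDFS bulk loads, ETL jobs, model runs, and key-value store writes.

```python
# Illustrative sketch of the staged batch pipeline described above.
# Every function and field name here is hypothetical.

def collect_events(events):
    """Stage 1: touch points append raw events to log files."""
    return list(events)  # stand-in for writing to log files

def bulk_import(raw_logs):
    """Stage 2: periodic bulk import of logs into HDFS."""
    return raw_logs  # stand-in for an HDFS load

def etl(raw_records):
    """Stage 3: ETL jobs clean and normalize the records."""
    return [r for r in raw_records if r.get("user_id")]  # drop malformed rows

def score(clean_records):
    """Stage 4: data scientists run algorithms over the clean files."""
    recs = {}
    for r in clean_records:          # toy "model": recommend each user's
        recs[r["user_id"]] = r["item"]  # most recently seen item
    return recs

def load_serving_store(recs):
    """Stage 5: engineers load results into a key-value store."""
    return dict(recs)  # stand-in for HBase/Cassandra writes

events = [
    {"user_id": "u1", "item": "book"},
    {"user_id": None, "item": "noise"},  # malformed event dropped by ETL
    {"user_id": "u1", "item": "lamp"},
]
store = load_serving_store(score(etl(bulk_import(collect_events(events)))))
print(store)  # {'u1': 'lamp'} -- already stale by the time it lands
```

Each arrow in the chain is a human or scheduler hand-off in practice, which is where the days-to-weeks of latency accumulate.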
By the time this process is complete, the data sitting in the key-value store is already stale because it has taken days, if not weeks, for it to go through this long pipeline of stages. We could be doing more – and doing it faster. But better, faster predictive analytics applications require a different architecture from the beginning of the collection stage through the action stage. This architecture must ingest data in real time and allow for experimentation and rapid iteration by data scientists.
Real-time data ingestion
If predictions are based on out-of-date information, they likely won’t be useful in the present. If you don’t have someone’s purchase history up to date, for example, they may no longer be looking for the item you’re recommending by the time you recommend it. Any point, click, or swipe you miss can cost you valuable insight about the customer. For data scientists to create algorithms based on all the data a user has generated up to each moment, all per-user information must be maintained and available in real time to score predictive models and drive decisions.
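One way to picture this requirement is a per-user profile that is updated the instant each event arrives, so any decision logic sees everything known up to that moment. The sketch below is illustrative; the event shape and decision rule are invented for the example.

```python
from collections import defaultdict

# Hypothetical per-user state, kept current as events arrive in real time.
user_state = defaultdict(lambda: {"purchases": [], "clicks": []})

def ingest(event):
    """Apply each event to the user's profile as soon as it arrives."""
    state = user_state[event["user_id"]]
    state[event["type"] + "s"].append(event["item"])

def recommend(user_id):
    """Toy decision rule: never re-recommend an item already purchased."""
    state = user_state[user_id]
    candidates = [i for i in state["clicks"] if i not in state["purchases"]]
    return candidates[-1] if candidates else None

ingest({"user_id": "u1", "type": "click", "item": "camera"})
print(recommend("u1"))  # camera: clicked, not yet bought
ingest({"user_id": "u1", "type": "purchase", "item": "camera"})
print(recommend("u1"))  # None: the purchase is reflected immediately
```

With a batch pipeline, the second recommendation would still say "camera" for days; with up-to-the-moment state, the purchase changes the answer instantly.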
The Lambda Architecture provides an approach to solving this problem by augmenting the staged data pipeline of traditional systems (the “batch layer”) with a system responsible for collecting and processing only recent incremental changes (the “speed layer”). This means the batch layer may continue crunching historical data and the speed layer need only crunch the data accumulated since the last batch process was completed. In this paradigm, a “serving layer” merges the results from the batch layer and speed layer, generating combined responses that consider both historical data from the batch layer and recent data from the speed layer. For a helpful summary of the Lambda Architecture, read the blog post by James Kinley.
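The three layers can be reduced to a minimal sketch: a precomputed batch view over historical data, a speed view covering only increments since the last batch run, and a serving-layer query that merges the two. The views and the per-user counts below are invented for illustration.

```python
# Minimal, hypothetical sketch of the Lambda Architecture's three layers.

batch_view = {"u1": 40, "u2": 7}  # e.g. lifetime purchase counts, recomputed nightly
speed_view = {"u1": 2, "u3": 1}   # counts accumulated since the last batch run

def serving_query(user_id):
    """Serving layer: merge historical and recent results at query time."""
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

print(serving_query("u1"))  # 42: 40 historical + 2 recent
print(serving_query("u3"))  # 1: only seen since the last batch run
```

The merge works here because counts are additive; for other computations the serving layer needs a merge function appropriate to the query.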
While this architecture allows data to flow through the system and answer queries in real time, the process of testing and modifying these queries remains slow. Especially when building predictive models and recommender systems, the process of making an algorithmic modification may require changing code in all three layers of the Lambda Architecture.
The ability to experiment with algorithmic modifications, however, is as important as having data available in real time. Data scientists must be able to run experiments and get feedback quickly to optimize their predictive models. After all, what’s the point of getting the wrong answer in real time?
Experimentation and rapid iteration
To optimize the quality of predictive analytics applications, data scientists need to quickly experiment with new algorithms and then make modifications based on their findings. If they have an idea about how to detect risk or fraud or give better product recommendations, they should be able to test that hypothesis by running a live experiment – and they should be able to run these experiments in days, not months.
Instead of using key-value stores merely as serving layers for precomputed scores, you can use them as 360-degree views of each customer that are updated in real time. Each key should represent a single user, and the values should contain the attributes and interactions recorded about that user. This customer-centric dataset can be used directly for training predictive models without complex preprocessing. And instead of loading batch-generated model scores back into the key-value store for serving, data scientists can deploy scoring functions to be run against real-time customer data at request time.
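Concretely, the layout might look like the following sketch, with one row per user holding both attributes and interactions. The store is a plain dict here and all names are hypothetical; in production this would be a table in HBase or Cassandra.

```python
# Hypothetical customer-centric key-value layout: one key per user,
# value holds that user's attributes and recorded interactions.
kv_store = {
    "u1": {
        "attributes": {"segment": "premium"},
        "interactions": [{"type": "view", "item": "tent"}],
    }
}

def record_interaction(user_id, interaction):
    """Write path: append each interaction to the user's row as it happens."""
    row = kv_store.setdefault(user_id, {"attributes": {}, "interactions": []})
    row["interactions"].append(interaction)

def score_offer(row):
    """A deployable scoring function run against live per-user data,
    rather than a precomputed batch score loaded into the store."""
    views = [i["item"] for i in row["interactions"] if i["type"] == "view"]
    return views[-1] if views else None

record_interaction("u1", {"type": "view", "item": "stove"})
print(score_offer(kv_store["u1"]))  # stove: the most recent view
```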
As customers interact with touch points, a scoring service can provide an API endpoint that scores the predictive models deployed by data scientists against the relevant per-customer data from the key-value store. Since the per-customer data is kept up to date as events are collected in real time, the predictive model will have interaction data up to the current moment available, resulting in a fully informed prediction or decision.
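Such a scoring service might look like the sketch below: data scientists publish scoring functions into a registry, and the endpoint fetches the user's current row and applies the requested model at request time. The registry, endpoint, and model names are all invented for illustration, and the key-value store is a plain dict standing in for HBase or Cassandra.

```python
# Hypothetical scoring service: deployed models are looked up by name
# and applied to up-to-the-moment per-customer data.
kv_store = {"u1": {"clicks": ["tent", "stove"]}}
model_registry = {}

def deploy(name, fn):
    """Data scientists publish a new or updated scoring function."""
    model_registry[name] = fn

def score_endpoint(model_name, user_id):
    """API endpoint: fetch the user's current data, then score it."""
    user_data = kv_store.get(user_id, {})
    return model_registry[model_name](user_data)

# Deploying a new model is a registry update, not a pipeline rebuild.
deploy("last_click", lambda u: u.get("clicks", [None])[-1])
print(score_endpoint("last_click", "u1"))  # stove
```

Because swapping a model is just a registry update, a data scientist can run a live experiment without touching the ingestion or storage layers.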
This approach allows data scientists, who often take on the responsibility of data cleansing and getting code into production when engineers are overloaded, to focus on doing data science: authoring models, running experiments, and turning the learning from those experiments into improved experiences.
Turning your customer data into action can be a slow and challenging journey if the key ingredients of real-time data ingestion and experimentation must be bolted onto a system originally designed for customer understanding. But if software architects keep these requirements in mind from the beginning, purpose-built systems for rapidly developing and optimizing customer experiences can enable next-generation use cases.
Garrett Wu founded WibiData in 2010 and serves as the company’s Chief Technology Officer. His expertise includes web-scale distributed infrastructure, personalization algorithms and predicting consumer behavior to optimize customer experience. Previously, he was the technical lead of Google’s personalized recommendations team.