Handmade marketplace Etsy has grown to 800,000 sellers and 40+ million monthly visitors. All that activity generates enormous quantities of data, which Etsy uses to drive site improvements.
Credit: Robert Hagadone
Etsy’s big data brain trust: Senior Software Engineer Steve Mardenfeld, left; Principal Engineer Dan McKinley and Nell Thomas, group manager, data analysts, at the company’s Brooklyn, N.Y., headquarters. Also pictured is McKinley’s dog, Dottie Matrix.
At Etsy, the online marketplace for handmade crafts and vintage items, the benefits of big data analytics come in small packages -- for example, something as simple as tweaking a "favorite" button, which lets visitors bookmark products they like.
"Someone was digging through data and noticed that relatively few people click on that button, but of those who do click on it, a pretty high percentage wind up signing up for the site," recalls Dan McKinley, principal engineer and five-year Etsy veteran. "So we tried making it more prominent for people who hadn't signed up yet. That was worth a few percentage points of people signing up, which in our world is a pretty huge increase."
Sifting through data, adjusting page elements, and improving site engagement is standard operating procedure at Etsy, which uses an approach known as continuous deployment. Any of Etsy's 150+ engineers can deploy code to the live site at any time -- and that happens 20 to 30 times a day. (Newly hired engineers are encouraged to deploy on their first day on the job.)
"With continuous deployment, we're able to push lots of small, incremental changes," says Steve Mardenfeld, senior software engineer at Etsy. "It's the perfect vehicle for experiments."
Data provides the rationale for these engineering experiments -- and Etsy has gobs of it.
The Etsy way
More than 14 million shoppers have made purchases and 100 million items have been sold since Etsy was founded in 2005 in an apartment in the Fort Greene neighborhood of Brooklyn, N.Y. Founder Rob Kalin came up with the idea because he couldn't find a viable marketplace to sell his photos, paintings and carpentry products. He teamed with Chris Maguire and Haim Schoppik to design, build and launch the site.
Today Etsy occupies a former cardboard-box factory in an area of the borough known as DUMBO (which stands for Down Under the Manhattan Bridge Overpass). It's a laid-back environment where dogs roam, conference rooms are decorated with handcrafted pieces commissioned from Etsy artists, and employees are given a decorating budget to trick out their workspaces. A staff lunch program, known as "Eatsy," serves up free lunch for employees three times a week.
Creativity is rampant. "No matter what job you have here, people are encouraged to think creatively about how to get things done," says Nell Thomas, group manager, data analysts, at Etsy.
At the helm is Etsy's onetime CTO Chad Dickerson, who took on the CEO role in mid-2011. The company has nearly 400 employees, and last year it raised $40 million in Series F venture financing. There are 800,000 active sellers and 40+ million monthly visitors to the site. In 2012, sales jumped 70% to $895.1 million, and page views climbed 28% to 16.7 billion.
With all that activity, Etsy generates enormous quantities of data. Every interaction with the site -- a page view, a click, a pop-up -- is collected. "We're doing about 175 million events per day, which amounts to roughly 75 gigabits of event data that we store per day," Mardenfeld says.
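To put the quoted figures in perspective, the short calculation below converts them into an average event rate and an average stored size per event. It is only a back-of-the-envelope sketch: it takes the article's numbers at face value, assumes decimal units (1 gigabit = 10^9 bits), and uses function names that are purely illustrative.

```python
# Figures as quoted in the article.
EVENTS_PER_DAY = 175_000_000
GIGABITS_PER_DAY = 75

SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(events_per_day: int) -> float:
    """Average event rate, assuming traffic is spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def bytes_per_event(gigabits_per_day: float, events_per_day: int) -> float:
    """Average stored size of one event, assuming 1 gigabit = 1e9 bits."""
    total_bytes = gigabits_per_day * 1e9 / 8
    return total_bytes / events_per_day


if __name__ == "__main__":
    print(round(events_per_second(EVENTS_PER_DAY)))                   # 2025
    print(round(bytes_per_event(GIGABITS_PER_DAY, EVENTS_PER_DAY)))   # 54
```

Under these assumptions the site averages roughly 2,000 events per second, at a few dozen bytes per stored event. Real traffic is bursty, so peak rates would be higher than this average.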
Democratization of data
Data analysis is everywhere at Etsy; it's not the domain of any single group. "We try to have it be a work in progress, be part of the culture, and be embedded throughout different parts of the site and parts of the company," Thomas says.
The lack of centralization is deliberate. "There's less central control, which can mean more opportunity for people to bite off different parts of the data we have and use it. That can lead to really positive things, and it can also be a challenge in terms of making sure people understand the data and are making decisions based on correct interpretations," says Thomas, who acts as an ambassador between Etsy's data teams and the rest of the company.
"It might be simpler if there were one monolithic group that came down as the source of data truth, but it would create a bottleneck and a silo that wouldn't necessarily help us move quickly and use data to inform what we're doing."
Internally, data analysis is incorporated throughout the product life cycle, helping development teams to design and prioritize site changes.
"The engineers and product people who are building features on the site are doing experimentation, and a majority of features are A/B tested, so everybody in those groups, to some extent, uses big data in order to analyze those things," McKinley says.
"We also use data to decide what we're going to do going forward, working with our product road map," Mardenfeld adds. "We use it all over. We use data to make sure that our products are behaving the way we're expecting them to. We use data to understand and gain insight into how people are using the site, and we use it to iterate as well. It's part of all these different steps."
Sharing the data
With such a massive volume of merchandise for sale, it's a constant challenge to make sellers' items more discoverable by shoppers. Etsy uses big data to power the content shown to site visitors, for example through its product recommendation system and search ranking. The clickstream data is processed in real time and used to deliver relevant content to each user.
At the feature level, big data powers Etsy's Taste Test, which takes users through a product quiz of sorts before recommending products they might like. It also powers recommendations for visitors who come to Etsy via Google Product Listing Ads.
"The way we use the data allows us to differentiate between user groups and helps optimize the experience for buyers and for the crafters and small businesses that are trying to sell their goods to people around the world," Thomas says.
Externally, Etsy prepares analytic products for shop owners that allow each seller to see how they're doing. Shop Stats, an analytics system for sellers, shows, for instance, what people were searching for, how they navigated to the shop, and how many purchases were made. In the big picture, Etsy publishes a monthly report that shares overall business metrics such as total goods sold by the Etsy community, number of items listed, site membership and page views.
Hadoop and critical thinkers
Making data usable throughout the company requires a combination of people and technology.
On the people front, working with data at Etsy requires a blend of business, analytics and engineering skills.
"When we hire people who we intend to be analysts, we look for qualities like critical thinking skills, skepticism, and an ability to think statistically. We expect that we'll be able to train them in whatever programming language they'll need to do their day-to-day job," McKinley says. "Typically we're hiring engineers and training them in basic statistics, or we're hiring people in statistics and training them in basic engineering. The people who are awesome at both of those things are very few and far between."
On the technology front, Etsy uses a wide range of tools. The company collects transactional data, which is anything to do with products, listings and purchases, as well as behavioral data, which includes any kind of interaction that people have while they're browsing the site. As site traffic has grown, so have Etsy's analytic capabilities. The e-commerce site has beefed up its event-logging platforms, its analytics infrastructure and its presentation tools.
Ensuring data consistency and accuracy is one of the biggest challenges. "We're making decisions with data, yet it's very hard to actually make sure that the data is correct," says Mardenfeld, who's focused on building the infrastructure that powers Etsy's big data projects. "We put a lot of work into error checking, making sure our collection pipelines are working. Data is a little bit of a different beast. You can't just get your code to compile. You have to compile and also make sure that it makes sense. I think that's the hardest part about this."
In terms of platforms and tooling, Hadoop plays a key role in storing and processing the data. Etsy runs dozens of workflows each night on Amazon's cloud-based Elastic MapReduce service. Rather than keeping a single cluster running continuously, Etsy brings up a new cluster for each job so it can tailor the number and types of instances to the workload.
"We have our own custom event-logging frameworks, and we store all the data in [Hadoop Distributed File System (HDFS)]. We process the data into ETL using a data flow language known as Cascading, and then we push it downstream to a data warehouse, which is Vertica," Mardenfeld says.
Etsy also uses Elastic MapReduce clusters to analyze the data and perform predictive analytics. "Hadoop is an important part of our pipeline. I don't think we'd be able to do any of this without it," Mardenfeld says.
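The flow Mardenfeld describes (raw event logs, an ETL step, then a load into a warehouse) can be illustrated with a small, single-machine sketch. This is not Etsy's code and does not use Hadoop, Cascading or Vertica; the event fields ("date", "type"), the function names and the in-memory list standing in for the warehouse are all assumptions made for illustration.

```python
import json
from collections import Counter


def extract(lines):
    """Parse raw log lines, skipping any that are not JSON objects
    carrying the (assumed) required fields."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: drop it rather than fail the whole job
        if isinstance(event, dict) and {"date", "type"} <= event.keys():
            yield event


def transform(events):
    """Aggregate events into (date, event_type, count) rows."""
    counts = Counter((e["date"], e["type"]) for e in events)
    return sorted((date, etype, n) for (date, etype), n in counts.items())


def load(rows, warehouse):
    """Append rows to the warehouse table (here, just a Python list)."""
    warehouse.extend(rows)


if __name__ == "__main__":
    raw = [
        '{"date": "2013-01-01", "type": "page_view"}',
        '{"date": "2013-01-01", "type": "click"}',
        '{"date": "2013-01-01", "type": "page_view"}',
        "not json",
    ]
    table = []
    load(transform(extract(raw)), table)
    for row in table:
        print(row)
```

In a production pipeline each stage would run as a distributed job over files in HDFS, but the shape is the same: tolerate bad input at the extract step, reduce the raw events to compact rows, and hand those rows to a store that analysts can query with SQL.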
To digest the data, Etsy has built a number of homegrown tools. "We write a bunch of custom UIs for this, for our internal tools. One of them is what we call the A/B Analyzer, which allows us to easily do analysis on experiments that we run. We also have our own internal funnel tool and our own dashboard tool," Mardenfeld says.
The homegrown presentation tools make it easier for teams throughout Etsy to access and make use of data for experimentation and to inform product development, even if they don't have statistical expertise. A launch calendar keeps track of all the current, active experiments at Etsy, and Etsy employees can simply click on an experiment and, using the homegrown dashboards, see the results to date of that experiment.
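The article does not say which statistics the A/B Analyzer computes. As a rough illustration of the kind of check such a tool might run, the sketch below applies a standard two-proportion z-test to conversion counts from a control group and a treatment group. The function name and the sample numbers are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variant A (control) and B (treatment).

    Returns (z, p_value), where p_value is two-sided and based on the
    normal approximation. Assumes both groups are large and the pooled
    rate is strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical experiment: 10,000 visitors per variant.
    z, p = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the difference in sign-up or purchase rate is unlikely to be noise. Wrapping a test like this behind a dashboard is one way to let people without statistical training read experiment results safely.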
"We had a lot of questions that were of the same type, so we've generalized those so it's easy for people to get the answers to those questions without doing a lot of work," Mardenfeld says. "For more custom questions, you can answer questions in Vertica, you can use SQL, and for more in-depth data mining and analysis and building products, then you can jump down to writing things in MapReduce and Cascading."
Big data, little changes
Etsy's continuous deployment approach sets up an ideal scenario for tying single, isolated site changes to experiments, and it makes it easy to identify the culprit if a code change causes problems.
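Tying one change to one experiment depends on each visitor consistently seeing the same version of the page. The article does not describe how Etsy assigns visitors to variants; one widely used approach, sketched here with hypothetical names, is to hash the visitor ID together with the experiment name so that assignment is stable across visits and independent between experiments.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to one variant of an experiment.

    The same (user_id, experiment) pair always yields the same variant,
    and including the experiment name keeps assignments in different
    experiments from lining up with each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # "favorite_button_prominent" is a made-up experiment name.
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid, "favorite_button_prominent"))
```

Because no assignment table has to be stored, a scheme like this fits a workflow of many small, short-lived experiments: starting or ending one is just a code or configuration change.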
"When you make multiple changes to a site or a page, it's hard to figure out what's not working the way you want it to," Mardenfeld says. "When you change one thing at a time, you're able to see where you went down the wrong path and can backtrack very easily."
Another benefit of continuous deployment is the ability to pull the plug on a code change that didn't live up to expectations. "We're more likely, we think, to notice that we're doing something that's bad and to stop," McKinley says. "Whereas operationally and emotionally, if you work on something for many months and then release it, there's nothing that will stop you from releasing it because you're invested in it."
Most often, the changes yield modest gains with minimal impact -- and that's the plan. "We're playing with people's livelihoods here, so it behooves us to be very careful," McKinley says.
Being able to improve business for the 800,000 craftspeople and small business owners who enable Etsy's existence is a motivator for employees.
"It feels really good at Etsy to be optimizing something that's not just about a corporation's bottom line," Thomas says. "The work we're doing, the ways that we're looking at data and using it to make things better, is all about helping the sellers on Etsy to be more successful."