Founded in 2007, CrunchBase is a website offering a vast amount of data about startup activity. Want to know who founded a startup, who invested in it, or who it is competing with? CrunchBase has the answers. And in a somewhat frothy marketplace, CrunchBase is an increasingly heavily trafficked web property. The site contains over 650,000 profiles of individuals and companies. As such, CrunchBase has a significant opportunity to monetize that data, and is accordingly concerned about people who seek to use that data for their own commercial aims.
I spent time talking with Kurt Freytag, head of product at CrunchBase, to look at the engineering work that goes into the site. As the site grew in size and traffic, Freytag noticed oddly shaped traffic and random spikes that were putting significant strain on its infrastructure. Of course, the team could have simply thrown more horsepower at the site, but Freytag was keen to identify the real root causes of the issues. He quickly concluded that bot traffic was hitting the site hard and crawling through its data. While this is a primary concern in terms of performance, it also introduces real commercial risk as third parties use the site's data elsewhere. People were literally stealing CrunchBase's data and monetizing it. Something had to be done.
Part of the problem was that the initial version of CrunchBase was built in only a few months and, while functional, wasn’t exactly robust (or, more fairly, no one had had the time to test its robustness). It only took traffic north of 1,000 page views per minute to degrade the site’s performance. Delving into usage patterns, however, soon surfaced some interesting insights. Although the CrunchBase DevOps team was largely coping with the load on the infrastructure, aggressive bad bots brought the site down on two successive occasions. Freytag remembers, “We would get 200 simultaneous requests for long tail pages. It was totally unlike user traffic -- nobody hits those pages, but bots do. When they hit 200 simultaneously, all hell breaks loose! There was no apparent path forward to solve this problem without a lot of effort and playing whack-a-mole with IP addresses wasn’t working.”
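The pattern Freytag describes, a sudden wave of requests for pages that human visitors rarely open, can be illustrated with a small sliding-window counter. This is only a sketch and not how CrunchBase or Distil detect bots: the function name, the `popular_paths` input, and the one-second window are assumptions made for illustration, and only the threshold of 200 comes from the quote above.

```python
from collections import deque


def find_long_tail_bursts(requests, popular_paths, window_seconds=1.0, threshold=200):
    """Return timestamps at which long-tail requests in the trailing window reach the threshold.

    requests: iterable of (timestamp_seconds, path) pairs, sorted by timestamp.
    popular_paths: set of paths that real users visit often (hypothetical input).
    """
    window = deque()  # timestamps of recent long-tail requests
    bursts = []
    for ts, path in requests:
        if path in popular_paths:
            continue  # ordinary user traffic, not counted
        window.append(ts)
        # Drop requests that have fallen out of the trailing window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            bursts.append(ts)
    return bursts


# Hypothetical log: ten human hits on the home page, then 200 long-tail
# requests arriving within a fifth of a second.
popular = {"/", "/company/popular-startup"}
human = [(float(i), "/") for i in range(10)]
bot_wave = [(10.0 + i * 0.001, f"/person/obscure-{i}") for i in range(200)]
bursts = find_long_tail_bursts(sorted(human + bot_wave), popular)
print(len(bursts))
```

Note that the counter looks only at which pages are requested and when, not at who is asking, so it does not depend on IP addresses the way a blocklist does.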
Delving deeper, Freytag saw that automated bots were jumping IPs and masquerading as different user agents. CrunchBase didn’t have the tools or instrumentation to control malicious traffic, and IT couldn’t block the bots. At the same time, Freytag knew this was not his team’s skill set, nor one that he wanted to build.
“Bot detection and deterrence isn’t (and shouldn’t be) our core competency,” he said. “Bots are a constant distraction from making CrunchBase better for our users.”
Freytag was keen to find a fix. He came across security vendor Distil and agreed to take a look at how applicable Distil was to CrunchBase’s problem set. Initially, Distil was seen as a tool to deal primarily with the performance issues, and protection of IP became a secondary value offering. Freytag was adamant that he didn’t want to have to make changes to his underlying web infrastructure to implement a solution. Distil runs traffic through a proxy, so little within the existing infrastructure needs to change to implement it.
“I wanted the solution to be non-intrusive from an operations perspective. I didn’t want all of our traffic going through someone else’s servers,” said Freytag. “Distil’s touch is very lightweight, handling the request inline and moving it along. That drastically simplifies bot detection in my ecosystem.”
Distil quickly helped to stabilize the site, and then the secondary value kicked in. Freytag told me that once they got the site stabilized, they quickly switched to bot-fighting mode. Freytag hadn’t anticipated that so much of their traffic was due to bots. Having discovered it, they wanted to do everything they could to protect the CrunchBase dataset. CrunchBase started slowly, adding CAPTCHA for almost all Distil traps. They quickly realized that they were showing CAPTCHA to ~10% of their traffic, yet only ~1% of those CAPTCHAs were even attempted. Those numbers told Freytag that the browsers being shown a CAPTCHA weren’t human, so they tightened security and blocked browsers that failed Distil tests.
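To see why those two percentages were so persuasive, it helps to work them through on a concrete volume. The sketch below assumes a purely hypothetical 1,000,000 requests; only the ~10% and ~1% shares come from Freytag, and the function name is invented for illustration.

```python
def captcha_funnel(total_requests, challenged_pct=10, attempted_pct=1):
    """Split traffic into CAPTCHA-challenged, attempted, and ignored counts.

    Percentages are whole numbers so the arithmetic stays in integers.
    """
    challenged = total_requests * challenged_pct // 100
    attempted = challenged * attempted_pct // 100
    return {
        "challenged": challenged,
        "attempted": attempted,
        "ignored": challenged - attempted,
    }


# Hypothetical volume; the article does not give absolute traffic figures.
funnel = captcha_funnel(1_000_000)
print(funnel)
```

On that assumed volume, 100,000 requests would see a CAPTCHA and only 1,000 would try to solve it, leaving 99,000 challenges ignored. Put another way, roughly 0.1% of all traffic ever attempted a CAPTCHA.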
The improvement in performance appears to be lasting. "The only bots we believe still get through are those that crawl the site very slowly over a long period of time, jumping IPs periodically," said Freytag. "Fortunately, such an approach dramatically limits the amount of data any one bot can retrieve."
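A rough calculation shows how much a slow crawl limits a scraper. The 650,000 profile count is the figure cited earlier in this article; the rate of one request every 30 seconds is a hypothetical value chosen for illustration, not a number from Freytag or Distil.

```python
SECONDS_PER_DAY = 86_400


def days_to_crawl(profile_count, seconds_per_request):
    """Days a single bot needs to fetch every profile at a fixed request rate."""
    return profile_count * seconds_per_request / SECONDS_PER_DAY


# 650,000 profiles at an assumed one request every 30 seconds.
days = days_to_crawl(650_000, 30)
print(round(days, 1))
```

At that assumed pace, a single bot would need well over seven months of uninterrupted crawling to copy the site once, by which point much of the data would be stale.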
In terms of the commercial value that resolving the issue has generated for CrunchBase, the results speak for themselves. Blocking bots and locking down the API unexpectedly increased the number of inquiries about CrunchBase’s commercial content license, generating almost 1,000 prospects to date. While many web scrapers will never become legitimate users, it was a welcome surprise that many were interested in becoming paying customers, Freytag reports.
This article is published as part of the IDG Contributor Network.