In the course of my research, I'm running across some VERY large data warehouses. Several of them, especially in the web log/network event area, are in the multi-petabyte range. Perhaps most surprisingly, they're run on a broad range of data management software -- not just Teradata, but also Greenplum, Hadoop/Hive (which isn't even a DBMS!), and others.
My current golly-gee-that's-really-big list goes something like this:
- eBay has a 6 1/2 petabyte database running on Greenplum and a 2 1/2 petabyte enterprise data warehouse running on Teradata.
- Facebook has a 2 1/2 petabyte data warehouse running on Hadoop/Hive.
- Wal-Mart, Bank of America, another financial services company, and Dell also have very large Teradata databases.
- Yahoo’s web/network events database, running on proprietary software, sounded about 1/6th the size of eBay’s Greenplum system when it was described about a year ago.
- Fox Interactive Media/MySpace has multi-hundred terabyte databases running on each of Greenplum and Aster Data nCluster.
- TEOCO has 100s of terabytes running on DATAllegro. To a probably lesser extent, the same is now also true of Dell.
- Vertica has a couple of unnamed customers with databases in the 200 terabyte range.