So what's the library to do? Big data experts say there are a variety of options to consider. It would probably make the most sense for library officials to find a tool for storing the data, another for indexing it, and yet another to run queries against it, says Mark Phillips, director of community and developer evangelism at Basho, maker of Riak, an open source database tool with a simple, massively scalable key-value store.
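To make the store / index / query split Phillips describes concrete, here is a minimal in-memory sketch. It is illustrative only: the class names (`TweetStore`, `TweetIndex`) and the `search` function are hypothetical, it does not use Riak or any other product's API, and a real system would put each role on its own distributed service.

```python
from collections import defaultdict


class TweetStore:
    """Hypothetical key-value store: tweet ID -> tweet text."""

    def __init__(self):
        self._data = {}

    def put(self, key, text):
        self._data[key] = text

    def get(self, key):
        return self._data[key]


class TweetIndex:
    """Hypothetical inverted index: word -> set of tweet IDs containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, key, text):
        # Naive tokenization: lowercase and split on whitespace.
        for word in text.lower().split():
            self._postings[word].add(key)

    def lookup(self, word):
        # Return a copy so callers cannot mutate the index.
        return set(self._postings.get(word.lower(), set()))


def search(store, index, words):
    """Query layer: return the texts of tweets containing every word."""
    keys = None
    for word in words:
        hits = index.lookup(word)
        keys = hits if keys is None else keys & hits
    return [store.get(k) for k in sorted(keys or [])]


store, index = TweetStore(), TweetIndex()
for tweet_id, text in [("t1", "big data at the library"),
                       ("t2", "the library archives tweets")]:
    store.put(tweet_id, text)   # storage step
    index.add(tweet_id, text)   # indexing step

print(search(store, index, ["library", "tweets"]))
```

Keeping the three roles behind separate interfaces is what would let the library swap in a different tool for any one of them without rebuilding the others.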
Big data management tools have turned into a robust industry, with both proprietary and open source options available for different use cases and budgets. One of the biggest questions Library of Congress officials will have to tackle is how hands-on they're willing to be in creating and managing the system. If the library wants to take an open source route, there are a variety of tools that can be used to create and manage databases, everything from a Hadoop cluster to a Greenplum database that specializes in high input/output read/write capabilities. Those can be combined with Apache Solr, an open source search tool. Open source gives developers a free way to take the source code and build a system on commodity hardware, but it can also require a lot of developer work on the back end. The library can also go the proprietary, and more expensive, route of using database software from the likes of Oracle or SAP.
Either way, the amount of data the library has for the Twitter project is not insurmountable. 133TB, and growing, is a large amount of data, but Basho has customers managing petabytes of data on its platform, Phillips says. If the library can track how much the database will be growing each month or quarter, then so long as it has the hardware capacity to store the data, the database software should be able to handle it.
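The capacity tracking described here comes down to simple arithmetic. The sketch below assumes steady linear growth. Only the 133TB starting size comes from the article; the 5TB monthly growth rate and 200TB of installed capacity are made-up figures for illustration, and both function names are hypothetical.

```python
import math


def projected_size_tb(current_tb, monthly_growth_tb, months):
    """Projected archive size after `months`, assuming linear growth."""
    return current_tb + monthly_growth_tb * months


def months_until_full(current_tb, monthly_growth_tb, capacity_tb):
    """Whole months until the archive reaches the installed capacity."""
    if monthly_growth_tb <= 0:
        raise ValueError("monthly_growth_tb must be positive")
    remaining_tb = capacity_tb - current_tb
    if remaining_tb <= 0:
        return 0
    return math.ceil(remaining_tb / monthly_growth_tb)


# 133 TB is the figure cited in the article; the growth rate and
# capacity below are assumptions chosen only to show the calculation.
print(projected_size_tb(133.0, 5.0, 12))     # size after one year
print(months_until_full(133.0, 5.0, 200.0))  # months of headroom left
```

Under these assumed numbers the archive would reach 193TB after a year and run out of room in 14 months, which is the kind of lead time a hardware purchasing plan would need.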
Should the library use the cloud? Theoretically, the library could use a public cloud resource like Amazon Web Services to store all this data and have AWS supply the constantly increasing hardware capacity needed to hold all these tweets. Seth Thomas, a Basho engineer, doesn't know if that would be cost-effective over the long term, though. A hybrid architecture is likely more fiscally wise, since the library plans to keep this data forever. Perhaps storing the data on-site and using a cloud-based service for analytics could work. That would let the analytics service scale resources up and down as each search requires, so the final system could handle the full range of requests made of it.
However the library decides to index the tweets, just remember: the next time you update your status on Twitter, it's being recorded somewhere.