Computerworld Australia - The National Library of Australia has opted for an open source platform to drive its newly unveiled search engine.
Called Trove, the search engine provides access to more than 90 million items about Australians and Australia, sourced from more than 1000 libraries and cultural institutions across the country.
The project's team of five developers used SOLR 1.4, which internally uses Lucene 2.9, for the main bibliographic search database and the web page archive, and MySQL 5 for managing all data relationships.
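For readers unfamiliar with Solr, searches are sent to it as HTTP requests. The following is a minimal sketch of how such a request might be built in Python, not Trove's actual code. The endpoint URL and the function name `build_solr_query` are assumptions made for illustration; `q`, `start`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the article does not give the address of Trove's Solr server.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_solr_query(text, start=0, rows=10):
    """Build a URL for Solr's select handler (illustrative helper, not Trove code)."""
    params = {
        "q": text,       # the search terms
        "start": start,  # offset of the first result, used for paging
        "rows": rows,    # number of results to return
        "wt": "json",    # ask Solr for a JSON response
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_solr_query("gold rush", rows=5))
```

Fetching the resulting URL from a running Solr instance would return the matching documents; the sketch stops at building the request so that it runs without a server.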
"That was something that was pretty important to us, we didn't want to go and build something in-house," Trove manager, Rose Holley, said. "We wanted it to be shareable when we were finished."
Holley said the search engine evolved out of the library's newspaper digitisation program, which began two years ago and runs off Lucene 2.9 "natively".
That program involved the use of Optical Character Recognition software to automatically convert old newspaper images into digital text. The small fonts and uneven printing of many of the newspaper pages made conversion difficult and not always accurate. As a result, more than 5000 online users helped correct the text, and the top correctors were subsequently slated to receive Australia Day awards for their efforts.
"If that was successful then our master plan was always to transfer the rest of our service into that infrastructure," Holley said. "So the infrastructure for the newspapers service is the same for Trove."
The project team also opted for Jetty as a web server, Nginx as the HTTP front end and reverse proxy, JavaServer Pages (JSP) for the newspapers part of the site, and Restlet and FreeMarker for the remaining portions of the service.
Additionally, one of the main steps taken was to use solid-state disks (SSDs) -- four Intel X25-M 160GB drives in each machine -- for the Lucene indices to achieve the necessary performance. Trove issues more than 8000 I/Os per second to the SSDs, which the team says would be expensive to achieve with even the fastest SAN setup.
The Trove website, meanwhile, is split into eight searchable categories.
Unlike Google search, which provides a list of websites for search results, Trove displays links to items.
"We are searching across metadata mostly from cultural heritage organisations," Holley said. "We have over a thousand organisations that have been providing their data. Obviously libraries have been looking to standards of data sharing for several years and we use a mechanism called OAI -- the Open Archives Initiative.
"Whereas Google is trolling metadata for a website, we are doing a similar thing for data but for unique Australian objects, many of which are objects that cultural heritage institutions have digitised. Most of them you wouldn't normally be able to find on the web through Google search because they are in the deep web or some database that wraps them up.
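The OAI mechanism Holley mentions is usually implemented through the initiative's Protocol for Metadata Harvesting (OAI-PMH), in which a harvester repeatedly requests batches of records from a repository. The sketch below shows the general shape of that loop in Python, assuming the standard `ListRecords` verb and `oai_dc` metadata format. The repository address, the sample response and the function names are hypothetical, and nothing here is taken from Trove's own harvester.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (illustrative helper)."""
    if resumption_token:
        # A resumption token is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumption token from a response, or None if this was the last batch."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Made-up response fragment, trimmed to the one element the sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    base = "http://example.org/oai"  # hypothetical repository address
    print(list_records_url(base))
    print(list_records_url(base, resumption_token=next_token(SAMPLE_RESPONSE)))
```

A real harvester would fetch each URL, store the returned records, and keep requesting until no resumption token comes back.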