A couple of articles ago I wrote about archiving messages in your Gmail account using Gmail Backup. So, once you’ve got an archive, what can you do with it? Using Gmail Backup you can restore it to Gmail or another server but what about using the data that’s in your email to understand more about your connections and relationships?
Given that you’re dealing with Biggish Data (my email archive is 18.5GB containing 296,185 messages in 233 folders), it would be nice to be able to search the collection. At this scale you’re going to need something more powerful than grep, so I’d suggest you check out dtSearch.
dtSearch has been around a long time (since 1991) and counts organizations both big and small among its clients. It’s a Windows-only product that will ingest any text documents you care to throw at it, including MS Office files (Word, Excel, PowerPoint, Access, and OneNote) as well as other formats such as ZIP, HTML, XML/XSL, and PDF. It also supports Exchange, Outlook, Thunderbird, and other popular email types, including multilevel nested attachments, and can be interfaced to databases.
dtSearch can build an index of up to one terabyte in size (obviously the size of the data indexed can be much greater than the size of the index) and you can simultaneously search as many indexes as you please. Indexes can be updated manually or on a schedule you set in the dtSearch Index Manager, and you can build multiple indexes simultaneously.
I tested dtSearch Desktop with Spider version 7.77, which adds the ability to “spider” web sites for content. Indexing my email archive took just under six hours, but once the index was built, searching for anything gave results in about one second!
You can do complex searches that include conditionals and, as I noted above, that also span multiple indexes. Searching supports “stemming” (identifying the root of words which allows “stem” and “stemming” to be found), phonic searching (“sounds like”), fuzzy searching (which finds words even if misspelled), and searching for synonyms.
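To make those terms concrete, here is a minimal Python sketch of what stemming and fuzzy matching mean in principle. It is not dtSearch’s implementation and uses none of its APIs: `naive_stem` and `fuzzy_lookup` are hypothetical helpers, the stemmer is a deliberately crude suffix stripper, and the fuzzy match leans on the standard library’s `difflib`.

```python
import difflib

def naive_stem(word):
    """Toy stemmer (an assumption for illustration, not a real algorithm
    such as Porter): strip one common suffix, then collapse a doubled
    final consonant so that "stemming" reduces to "stem"."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

def fuzzy_lookup(term, vocabulary, cutoff=0.8):
    """Return vocabulary words similar to a possibly misspelled term,
    best match first, using difflib's similarity ratio."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff)

# Words that share a stem would be treated as the same search term.
same_root = naive_stem("stemming") == naive_stem("stem")

# A misspelled query can still find the intended word.
vocabulary = ["receive", "recipe", "archive"]
matches = fuzzy_lookup("recieve", vocabulary)
```

A production search engine does this work at index time and with far better linguistics, which is why its queries stay fast; the sketch only shows why “stem” and “stemming” can match, and why a typo need not sink a search.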
Once you’ve run a query, dtSearch can print single found items, produce reports that summarize or detail multiple items, and export results in CSV or XML formats.
dtSearch also offers other versions including Network with Spider (designed for use in a LAN environment), Web with Spider (for publishing searchable data online; it provides HTML5 templates for presentation; see the online demo), Publish (for creating searchable document collections on CDs, DVDs, and USB drives), and Engine (an SDK for integrating dtSearch technology into applications).
The performance of dtSearch is truly impressive, and the fact that it’s not only fast but can handle Big Data makes it ideal for all sorts of heavy-lifting searches as well as digital forensics; indeed, the company has extensive advice on how to use dtSearch for just that purpose.
There are some things dtSearch doesn’t do, such as exporting the data from only one or more indexed fields (for example, just “Sender” and “Date”), although exporting to CSV and importing into Excel allows you to slice and dice the data with ease. My only other criticism of dtSearch is that its user interface looks a little dated.
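If you’d rather script the slicing than do it in Excel, a few lines of Python can pull just the fields you want out of a CSV export. This is a sketch under assumptions: the column names (“Sender”, “Date”, “Subject”) and the sample rows are hypothetical stand-ins, and a real dtSearch export may label its columns differently.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a CSV export; the real column
# names and layout may differ.
SAMPLE_CSV = """Sender,Date,Subject
alice@example.com,2014-03-01,Lunch?
bob@example.com,2014-03-02,Quarterly report
alice@example.com,2014-03-05,Re: Lunch?
"""

def select_columns(csv_text, columns):
    """Keep only the named columns from each row of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] for name in columns} for row in reader]

def count_by(rows, column):
    """Count how many rows share each value in the given column."""
    return Counter(row[column] for row in rows)

# Reduce the export to just the sender and date of each message.
rows = select_columns(SAMPLE_CSV, ["Sender", "Date"])

# Rank correspondents by message count, busiest first.
top_senders = count_by(rows, "Sender").most_common()
```

Counting messages per sender is about the simplest way to start seeing who you actually correspond with; to work from a file on disk, read its contents and pass the text to `select_columns` in place of the sample.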
Those quibbles aside, dtSearch is a search monster that is an absolute must if you’re doing Big or Biggish Data stuff. Desktop with Spider is priced at $199 per seat and gets a Gearhead rating of 5 out of 5.