A couple of articles ago I wrote about archiving messages in your Gmail account using Gmail Backup. So, once you’ve got an archive, what can you do with it? Using Gmail Backup you can restore it to Gmail or to another server, but what about using the data in your email to understand more about your connections and relationships?
Given that you’re dealing with Biggish Data (my email archive is 18.5GB containing 296,185 messages in 233 folders) it would be nice to be able to search the collection but at this scale you’re going to need something more powerful than grep: I’d suggest you check out dtSearch.
dtSearch has been around a long time (since 1991) and counts a lot of organizations, both big and small, as clients. It’s a Windows-only product that will ingest any text documents you care to throw at it, including MS Office files (Word, Excel, PowerPoint, Access, and OneNote) as well as other common formats such as ZIP, HTML, XML/XSL, and PDF. It also supports Exchange, Outlook, Thunderbird, and other popular email types, including multilevel nested attachments, and it can be interfaced to databases.
dtSearch can build an index of up to one terabyte in size (obviously the size of the data indexed can be much greater than the size of the index) and you can simultaneously search as many indexes as you please. Indexes can be updated manually or on a schedule you set in the dtSearch Index Manager, and you can build multiple indexes simultaneously.
I tested dtSearch Desktop with Spider version 7.77, which adds the ability to “spider” web sites for content. Indexing my email archive took just under six hours, but once the index was built, searching for anything returned results in about one second!
You can do complex searches that include conditionals and, as I noted above, that also span multiple indexes. Searching supports “stemming” (identifying the root of a word, so that a search for “stem” also finds “stemming”), phonic searching (“sounds like”), fuzzy searching (which finds words even if they are misspelled), and searching for synonyms.
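To give a feel for what fuzzy searching means in practice, here is a minimal Python sketch built on the standard library’s difflib module. This is not how dtSearch implements fuzzy search; the `fuzzy_lookup` function, the 0.75 similarity cutoff, and the tiny vocabulary are all assumptions made purely for illustration.

```python
from difflib import get_close_matches


def fuzzy_lookup(term, vocabulary, cutoff=0.75):
    """Return vocabulary words that closely resemble `term`.

    Hypothetical helper for illustration only: it compares the query
    against every known word and keeps those whose similarity ratio
    is at or above `cutoff` (0.0 = nothing in common, 1.0 = identical).
    """
    return get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


# A made-up list standing in for the words found in an index.
vocabulary = ["archive", "index", "message", "search"]

# A transposed-letter typo still finds the intended word.
print(fuzzy_lookup("serach", vocabulary))
```

A real search engine does this against an index rather than by scanning a word list, which is why it stays fast at scale, but the idea of tolerating small spelling differences is the same.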
When you’ve run a query, dtSearch can print individual found items or reports that summarize or detail multiple items, and it can export results in CSV or XML format.
dtSearch also offers other versions including Network with Spider (designed for use in a LAN environment), Web with Spider (for publishing searchable data online; it provides HTML5 templates for presentation; see the online demo), Publish (for creating searchable document collections on CDs, DVDs, and USB drives), and Engine (an SDK for integrating dtSearch technology into applications).
The performance of dtSearch is truly impressive, and the fact that it’s not only fast but can also handle Big Data makes it ideal for all sorts of heavy-lifting searches as well as digital forensics; indeed, the company has extensive advice on how to use dtSearch for just that purpose.
There are some things dtSearch doesn’t do, such as exporting the data from only selected indexed fields (for example, just “Sender” and “Date”), although exporting to CSV and importing into Excel allows you to slice and dice the data with ease. My only other criticism of dtSearch is that its user interface looks a little dated.
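If you’d rather not go through Excel, a few lines of Python can trim an exported CSV down to just the fields you want. This is only a sketch: the `select_columns` function is my own, and the column names (“Sender”, “Date”, “Subject”) are assumptions, so check the header row of your actual export before relying on them.

```python
import csv
import io


def select_columns(src, dst, columns):
    """Copy only the named columns from one CSV stream to another.

    `src` and `dst` are open text streams (files or StringIO objects).
    The column names are whatever appears in the export's header row;
    the ones used in the demo below are hypothetical.
    """
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops any column not listed.
    writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Made-up sample data standing in for an exported search report.
sample = io.StringIO(
    "Sender,Date,Subject\r\n"
    "alice@example.com,2016-01-02,Lunch?\r\n"
)
trimmed = io.StringIO()
select_columns(sample, trimmed, ["Sender", "Date"])
print(trimmed.getvalue())
```

To use it on real files, open the export and the destination with `open(path, newline="")` and pass those file objects in place of the StringIO streams.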
Those quibbles aside, dtSearch is a search monster that is an absolute must if you’re doing Big or Biggish Data stuff. Desktop with Spider is priced at $199 per seat and gets a Gearhead rating of 5 out of 5.