dtSearch: How to handle Big or even Biggish Data

You've got lots of business intelligence to sift through and grep won't cut it. What's a geek to do?


A couple of articles ago I wrote about archiving messages in your Gmail account using Gmail Backup. So, once you’ve got an archive, what can you do with it? Using Gmail Backup you can restore it to Gmail or to another server, but what about using the data in your email to understand more about your connections and relationships?

Given that you’re dealing with Biggish Data (my email archive is 18.5GB containing 296,185 messages in 233 folders), it would be nice to be able to search the collection, but at this scale you’re going to need something more powerful than grep. I’d suggest you check out dtSearch.

dtSearch has been around a long time (since 1991) and has a lot of organizations, both big and small, as clients. It’s a Windows-only product that will ingest any text documents you care to throw at it, including MS Office files (Word, Excel, PowerPoint, Access, and OneNote) as well as other "Office" formats including ZIP, HTML, XML/XSL, and PDF. It also supports Exchange, Outlook, Thunderbird, and other popular email types, including multilevel nested attachments, and it can be interfaced to databases.

dtSearch can build an index of up to one terabyte in size (obviously the size of the data indexed can be much greater than the size of the index) and you can simultaneously search as many indexes as you please. Indexes can be updated manually or on a schedule you set in the dtSearch Index Manager, and you can build multiple indexes simultaneously.
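To make the idea of building indexes and searching several of them at once a little more concrete, here is a toy inverted index in Python. This is only a sketch of the general technique, not dtSearch's actual implementation or API; the function names `build_index` and `search` and the sample documents are invented for illustration.

```python
from collections import defaultdict
import re


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(indexes, word):
    """Look up one word across several indexes and merge the hits."""
    hits = set()
    for index in indexes:
        # .get() avoids creating empty entries in the defaultdict.
        hits |= index.get(word.lower(), set())
    return sorted(hits)


# Two separate, made-up collections, each with its own index.
mail_index = build_index({
    "mail-1": "Quarterly budget review",
    "mail-2": "Lunch on Friday?",
})
docs_index = build_index({"doc-1": "Budget forecast draft"})

print(search([mail_index, docs_index], "budget"))
```

The point of the sketch is that a lookup touches only the index entry for the word, not the documents themselves, which is why an indexed search can stay fast where a linear scan with grep would not.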

Figure: dtSearch indexing my email archive. (Credit: Mark Gibbs)

I tested dtSearch Desktop with Spider version 7.77; the Spider component adds the ability to “spider” web sites for content. Indexing my email archive took just under six hours, but once the index was built, searching for anything gave results in about one second!

Figure: Setting up a dtSearch search request. (Credit: Mark Gibbs)

You can do complex searches that include conditionals and, as I noted above, that also span multiple indexes. Searching supports “stemming” (reducing words to their root so that a search for “stem” also finds “stemming”), phonic searching (“sounds like”), fuzzy searching (which finds words even if they are misspelled), and searching for synonyms.
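For readers who haven't met stemming and fuzzy matching before, the following Python sketch shows roughly what each one means. It is an illustration under loud assumptions: `crude_stem` is a deliberately simplistic, hypothetical suffix stripper, and the fuzzy match uses the standard library's `difflib`; neither reflects how dtSearch implements these features.

```python
import difflib


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo a doubled final consonant, so "stemm" becomes "stem".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def fuzzy_lookup(query, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word to a possibly misspelled query."""
    return difflib.get_close_matches(query, vocabulary, n=1, cutoff=cutoff)


# Stemming: different forms of a word collapse to the same root.
print([crude_stem(w) for w in ("stem", "stems", "stemmed", "stemming")])

# Fuzzy searching: a misspelled query still finds the intended word.
print(fuzzy_lookup("recieve", ["receive", "recipe", "archive"]))
```

A production stemmer handles far more cases than this, but the principle is the same: normalize both the indexed words and the query so that near-identical forms match.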

When you’ve run a query, dtSearch can print single found items, produce reports that summarize or detail multiple items, and export results in CSV or XML formats.

dtSearch also offers other versions including Network with Spider (designed for use in a LAN environment), Web with Spider (for publishing searchable data online; it provides HTML5 templates for presentation; see the online demo), Publish (for creating searchable document collections on CDs, DVDs, and USB drives), and Engine (an SDK for integrating dtSearch technology into applications).

The performance of dtSearch is truly impressive, and the fact that it’s not only fast but can also handle Big Data makes it ideal for all sorts of heavy-lifting searches as well as digital forensics; indeed, the company has extensive advice on how to use dtSearch for just that purpose.

There are some things dtSearch doesn’t do, such as exporting only selected indexed fields (for example, just “Sender” and “Date”), although exporting to CSV and importing into Excel allows you to slice and dice the data with ease. My only other criticism of dtSearch is that its user interface looks a little dated.
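If Excel isn't your thing, trimming an exported CSV down to a couple of fields is also easy to script. The Python sketch below assumes the export has a header row with columns named "Sender", "Date", and "Subject"; those column names and the sample rows are made up for illustration and may not match what dtSearch actually writes, so adjust them to your own file.

```python
import csv
import io


def keep_columns(csv_text, wanted):
    """Return CSV text containing only the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    # extrasaction="ignore" silently drops every column not listed in `wanted`.
    writer = csv.DictWriter(
        out, fieldnames=wanted, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical export; real column names may differ.
sample = (
    "Sender,Date,Subject\n"
    "alice@example.com,2015-03-01,Budget review\n"
    "bob@example.com,2015-03-02,Lunch on Friday?\n"
)

print(keep_columns(sample, ["Sender", "Date"]))
```

For a real export you would read the text from the file on disk rather than from an in-memory string, but the column filtering works the same way.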

Those quibbles aside, dtSearch is a search monster that is an absolute must if you’re doing Big or Biggish Data stuff. Desktop with Spider is priced at $199 per seat and gets a Gearhead rating of 5 out of 5.
