Digging the search scene

Opinion
Sep 03, 2003
Enterprise Applications, Web Search

* ht://Dig from the ht://Dig Group

Building effective search features into Web sites is easy when the amount of content is limited, but when you get involved with documentation, lists, or anything else that is voluminous, you need something rather more industrial. Free would be even nicer.

OK, here’s the answer: ht://Dig from the ht://Dig Group (see links below), a powerful medium-scale search engine released under the GNU General Public License.

According to the group: “The ht://Dig system is a complete World Wide Web indexing and searching system for a domain or intranet. This system is not meant to replace the need for powerful Internet-wide search systems like Lycos, Infoseek, Google and AltaVista. Instead it is meant to cover the search needs for a single company, campus, or even a particular sub section of a Web site.”

ht://Dig can span multiple servers as long as they understand HTTP because the tool builds its index by crawling the sites to be indexed as if it were a Web browser.
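
To make the crawl-and-index idea concrete, here is a rough sketch in Python of what such a tool does: fetch pages over plain HTTP, pull out the words and the links, and follow the links that stay within the allowed hosts. This is an illustration of the general technique, not ht://Dig’s own code, and the host names, start URLs, and the crawl_and_index helper are all made up for the example.

# Conceptual sketch of an HTTP crawl-and-index pass; not ht://Dig's code.
import re
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkAndTextParser(HTMLParser):
    """Collects href links and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

def crawl_and_index(start_urls, allowed_hosts, max_pages=50):
    """Breadth-first crawl over plain HTTP, building a word -> URLs index."""
    index = defaultdict(set)              # inverted index: word -> set of URLs
    queue, seen = deque(start_urls), set(start_urls)
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                       # unreachable page: skip it
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index[word].add(url)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).hostname in allowed_hosts and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index

# Example: index two (hypothetical) servers in the same intranet.
idx = crawl_and_index(["http://www.example.com/", "http://docs.example.com/"],
                      allowed_hosts={"www.example.com", "docs.example.com"})
print(sorted(idx.get("search", set())))

Because everything in the sketch travels over ordinary HTTP, it makes no difference whether the pages live on one server or ten, which is exactly why ht://Dig can span multiple servers.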

To run ht://Dig you’ll need a Unix machine and both a C and a C++ compiler (C++ is needed for ht://Dig itself while the C compiler is needed to compile some of the GNU libraries).

ht://Dig has been tested on these machines with these compilers:

* Sun Solaris SPARC 2.X (using gcc/g++)

* Sun SunOS 4.1.4 SPARC (using gcc/g++ 2.7.0)

* HP/UX 10.X (using gcc/g++)

* IRIX 5.3 and 6.X (using the SGI C++ compiler)

* Most Linux distributions (using gcc/g++)

* Most BSD platforms, including BSDI and Mac OS X (using gcc/g++)

The tool has also been implemented under IIS on Windows.

Like any search system, ht://Dig requires a lot of disk space. For example, a full word index for 13,000 documents will require about 150M bytes of storage, or roughly 11.5K bytes per document.
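
If your collection is bigger or smaller than that, a back-of-the-envelope estimate is easy to work out. The snippet below simply scales the 150M bytes for 13,000 documents figure; the assumption that index size grows roughly linearly with document count is mine, not a figure from the ht://Dig documentation.

# Back-of-the-envelope index sizing, scaling the 13,000-document data point
# from the article. Linear growth with document count is an assumption.
BYTES_PER_DOCUMENT = 150_000_000 / 13_000      # about 11.5K bytes per document

def estimated_index_bytes(document_count: int) -> int:
    """Rough full-word-index size in bytes for a given number of documents."""
    return int(document_count * BYTES_PER_DOCUMENT)

print(estimated_index_bytes(50_000))           # roughly 577,000,000 bytes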

ht://Dig has a long feature list that includes:

* Support for the Robot Exclusion Protocol

* Boolean expression searching

* Configurable search results using HTML templates

* Fuzzy searching

* Multiple search algorithms, including exact, soundex, metaphone, stemming (common word endings), synonyms, accent stripping, and substring and prefix matching (a sketch of Soundex follows below)

* Support for searching HTML and text files

* E-mail notification of expired documents

* Indexing and searching of SGML entities such as ‘à’ and ISO-Latin-1 characters
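
Soundex, one of the fuzzy-match algorithms on that list, gives a flavor of how fuzzy searching works: each word is reduced to its first letter plus three digits describing how it sounds, so words that sound alike land on the same key. The sketch below implements the classic Soundex coding rules in Python; it is a generic illustration, not ht://Dig’s own implementation.

# Generic implementation of the classic Soundex coding rules.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6",
}

def soundex(word: str) -> str:
    """Classic four-character Soundex code: first letter plus three digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0].upper()
    digits = []
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:      # skip vowels and collapse repeats
            digits.append(code)
        if ch not in "hw":             # h and w do not break a run of repeats
            prev = code
    return (first + "".join(digits) + "000")[:4]

# Words that sound alike map to the same key, so "Robert" matches "Rupert".
print(soundex("Robert"), soundex("Rupert"))   # R163 R163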

If you want to see ht://Dig in action, check out the Web site of the National Public Radio station KCRW in Los Angeles (the greatest radio station ever); the site makes extensive use of ht://Dig, and the search performs excellently.


Mark Gibbs is an author, journalist, and man of mystery. His writing for Network World is widely considered to be vastly underpaid. For more than 30 years, Gibbs has consulted, lectured, and authored numerous articles and books about networking, information technology, and the social and political issues surrounding them. His complete bio can be found at http://gibbs.com/mgbio
