The chicken or the egg?
When it comes to widespread use of XML on the Internet, that could be the question. None of the major search engines support XML, because so few Web sites use it. But without encouragement from the search engines, who's going to add XML to their pages?
Last year, the World Wide Web Consortium approved an XML specification that many thought would help spark a revolution in Web-based information retrieval. With XML tags more precisely defining the information contained in documents, the Web would become a far easier place to find just what you were looking for.
The problem, however, is the millions upon millions of legacy HTML documents out there.
"HTML is the name of the game," says Mark Sprague, co-founder and senior vice president of product design at Northern Light, a search company in Cambridge, Mass. "There would need to be a critical mass of XML or Dublin Core [a proposed metadata element set for use in the discovery of electronic resources] for us to support it. We'll be reactive to XML."
Not all bad
While search engines are not supporting XML like many believed they would, the specification does have its uses:
* A replacement for EDI: Antiquated and hard to implement, Electronic Data Interchange is ripe for replacement. XML can be used in place of the mainframe-based system and runs on just about any platform.
* Extranets: Exchanging information with a multitude of outside partners can be difficult. Using XML and document type definitions (DTDs), companies can provide a uniform data-handling system for all of their partners. Lycos and Excite, to some degree, are both using XML to gather information from external partners.
* Simplified APIs: XML can also be used to create application programming interfaces (APIs) for various Web applications. Developers can write custom APIs to interact with their database, Web and application servers.
* ERP integration: For companies with multiple enterprise resource planning (ERP) systems or the need to integrate existing data sources with an ERP, XML could be a solution. XML can be used as the common communication and data exchange mechanism between the various data sources.
* Document management: A natural fit, since XML is a simplified version of SGML. Complicated and unwieldy, SGML, like EDI, is hard to implement, especially for smaller outfits. XML's open architecture lets virtually anyone create a document repository with few resources.
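To make the data-exchange idea concrete, here is a minimal, invented example of the kind of self-describing record XML makes possible. The tag names are illustrative only, not drawn from any real partner feed or ERP product:

```xml
<purchase-order>
  <order-id>10452</order-id>
  <customer>Acme Corp.</customer>
  <item sku="X-100" quantity="12">Widget</item>
  <total currency="USD">144.00</total>
</purchase-order>
```

Any system on any platform that can read XML can pull the order number, customer and total out of a record like this without knowing anything about the system that produced it.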
HTML describes how data should be displayed in a browser window, but does not define what that data is. The page could contain a conference agenda, a classified ad, a book review or the latest product news, but it's all basically the same to a browser.
Still, because of its ease of use and mostly standardized tags, HTML has become wildly popular.
XML, in contrast, has few predefined tags: two Web sites could use XML tags with the same name but define them in very different ways, making XML more difficult to implement than HTML.
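A quick invented illustration of the naming problem: the same tag can carry entirely different meanings on two sites, and nothing in XML itself resolves the conflict.

```xml
<!-- On a bookseller's site -->
<title>The Cathedral and the Bazaar</title>

<!-- On a job-listings site -->
<title>Senior Network Engineer</title>
```

A crawler that indexed both sites would have no way to know, from the tag alone, whether `<title>` names a book or a job.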
In turn, search engines have had to develop their own languages for interpreting, sorting and indexing the data returned from their Web crawlers and spiders. Because they already have their own metadata schemas, they are reluctant to develop systems to handle another one that few people currently use, no matter how cumbersome those existing schemas are.
Northern Light, which catalogs not just the Web, but a number of "special" information collections, uses a proprietary formatting system to handle data from its partners and Web crawlers.
For each new partner the company takes on, Northern Light engineers must develop a system for extracting information from the incoming data stream. "We have to take the third party's vocabulary and match it to ours," says Sheri Larsen, director of content processing at Northern Light. "Any new project is a big one, as we try to pull out as much as possible from documents."
XML could provide a standard metadata language for site developers such as Northern Light and authors such as its partners. Web crawlers would have to travel no farther than the XML tags to know exactly what is on any given page. The impetus to support XML in the crawler would have to come from people and companies developing sites with XML, Sprague says. At the moment, most sites are sticking with the tried and true HTML. "There is just so much HTML out there," Sprague says.
Northern Light is not alone in shunning XML when it comes to scouring the Web. Lycos, AltaVista and Excite also dismiss XML.
"We are not currently planning on using XML in the near future," says Ilene Quinn, spokeswoman for Compaq-owned Altavista. "That does not rule out any future use, but at the moment there is nothing underway to incorporate XML into the AltaVista search engine."
So what's it good for?
This is not to say XML is a complete waste of time.
Excite and Lycos say they are looking to use XML behind the scenes to handle data from partners.
"Our intent is to use XML on the backend as a means of very structured access," says Graham Spencer, co-founder and chief technology officer at Excite. "That's really what XML is better for."
Spencer says that Excite is trying to convince new information partners to encode data streams being sent to Excite in XML. Many of Excite's existing data feeds have varying types of formats, meaning Excite's content team must develop different Perl scripts to sort all the incoming information.
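The back-end parsing Spencer describes can be sketched in a few lines. This is a generic illustration using Python's standard XML library, with made-up element names; it is not Excite's actual feed format or tooling (which, per the article, was a collection of Perl scripts):

```python
import xml.etree.ElementTree as ET

def parse_feed(xml_text):
    """Extract (headline, date) pairs from a hypothetical partner news feed."""
    root = ET.fromstring(xml_text)
    return [(story.findtext("headline"), story.findtext("date"))
            for story in root.findall("story")]

sample = """<newsfeed>
  <story><headline>XML gains ground</headline><date>1999-03-15</date></story>
  <story><headline>Search engines hold back</headline><date>1999-03-15</date></story>
</newsfeed>"""

for headline, date in parse_feed(sample):
    print(date, headline)
```

The point is that one parser handles every partner, provided they all agree on the feed's structure; with ad hoc formats, each new partner means a new script.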
Lycos developed an XML document type definition (DTD) for communications with certain partners, according to Lincoln Jackson, product manager for search and navigation. The DTD provides a standard format for data being streamed into the Lycos site.
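A DTD of the kind Jackson describes might look like the following sketch. The element names here are invented for illustration and are not Lycos's actual format:

```dtd
<!-- Hypothetical partner-feed DTD; element names are illustrative only -->
<!ELEMENT newsfeed (story+)>
<!ELEMENT story    (headline, byline?, date, body)>
<!ELEMENT headline (#PCDATA)>
<!ELEMENT byline   (#PCDATA)>
<!ELEMENT date     (#PCDATA)>
<!ELEMENT body     (#PCDATA)>
```

A feed that declares this DTD can be validated automatically: a document missing a required element such as `headline` is rejected before it ever reaches the site's content pipeline.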
But using XML to parse data from corporate partners, where there is a degree of control, is different from the Web, where anything goes.
Jackson frets that once unleashed on the Web, XML could mean the same type of "keyword spamming" that search engines now try to filter out of existing HTML meta tags. Unscrupulous developers could alter tag definitions for their own gain, defeating the purpose of a standard.
The Dublin Core initiative aims to bring a Dewey Decimal-like system to the Web that will give mainstream users and Web catalogers alike a standard means of tagging their information resources. According to the Dublin Core Metadata Initiative Web site, most of the specification's elements "have a commonly understood semantics of roughly the complexity of a library card catalog card."
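Dublin Core elements can be embedded in ordinary HTML meta tags, so a page can describe itself without waiting for full XML support. A sketch of the convention, with invented content values:

```html
<meta name="DC.title"   content="XML and the search engines">
<meta name="DC.creator" content="Jane Reporter">
<meta name="DC.date"    content="1999-03-15">
<meta name="DC.subject" content="XML; metadata; search engines">
```

A crawler that recognized these names could index the page by title, author and subject instead of guessing from the body text.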
While such a system could aid a search engine's ability to catalog the Web, it is far from being finalized as a specification and put into mainstream use.
Sprague believes that if and when XML begins to gain acceptance, many sites will split into two versions: an XML site and a sister HTML site.
For now, though, HTML remains king of the hill.