
Searching for XML

The chicken or the egg?

When it comes to widespread use of XML on the Internet, that could be the question. None of the major search engines supports XML, because so few Web sites use it. But without encouragement from the search engines, who's going to add XML to their pages?

Last year, the World Wide Web Consortium approved an XML specification that many thought would help spark a revolution in Web-based information retrieval. With XML tags more precisely defining the information contained in documents, the Web would become a far easier place to find just what you were looking for.

The problem, however, is the millions upon millions of legacy HTML documents out there.

"HTML is the name of the game," says Mark Sprague, co-founder and senior vice president of product design at Northern Light, a search company in Cambridge, Mass. "There would need to be a critical mass of XML or Dublin Core [a proposed metadata element set for use in the discovery of electronic resources] for us to support it. We'll be reactive to XML."

Not all bad

While search engines are not supporting XML as many believed they would, the specification does have its uses:

* A replacement for EDI: Antiquated and hard to implement, Electronic Data Interchange is ripe for the picking. XML can take the place of the mainframe-based system and can run on just about any platform.

* Extranets: Exchanging information with a multitude of outside partners can be difficult. Using XML and Document Type Definitions (DTDs), a company can provide a uniform data-handling system for all of its partners. Lycos and Excite, to some degree, are both using XML to gather information from external partners.

* Simplified APIs: XML can also be used to create application programming interfaces (APIs) for various Web applications. Developers can custom-code their own APIs to interact with their database, Web and application servers.

* ERP integration: For companies with multiple enterprise resource planning (ERP) systems or the need to integrate existing data sources with an ERP, XML could be a solution. XML can be used as the common communication and data exchange mechanism between the various data sources.

* Document management: A given, with XML being a simplified version of SGML. Complicated and unwieldy, SGML, like EDI, is hard to implement, especially for smaller outfits. XML's open architecture allows virtually anyone to create a document repository with few resources.

Separating the data from the presentation could make data interchange and retrieval across the Internet more efficient by letting authors and sites define just what's in their documents.

HTML describes how data should be displayed in a browser window, but does not define it. The page could contain a conference agenda, a classified ad, a book review or the latest product news, but it's all basically the same to a browser.

Still, because of its ease of use and mostly standardized tags, HTML has become wildly popular.

XML, in contrast, has few defined tags: Two Web sites could have XML tags with the same name, but might define them in very different ways, making XML more difficult to implement than HTML.
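To make the contrast concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The fragments and tag names (session, title, start, room) are invented for illustration and come from no real site; a different site could just as easily pick other names, or give the same names other meanings.

```python
import xml.etree.ElementTree as ET

# For comparison, an HTML version of the same item might look like
#   <p><b>Keynote</b><br>9:00 a.m.<br>Ballroom A</p>
# where the tags say how to display the text, not what each piece is.

# Hypothetical XML version: each tag names the data it holds.
XML_FRAGMENT = """
<session>
  <title>Keynote</title>
  <start>9:00 a.m.</start>
  <room>Ballroom A</room>
</session>
"""

def read_session(xml_text):
    """Return the session's fields as a dict keyed by tag name."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

session = read_session(XML_FRAGMENT)
print(session["title"], "at", session["start"], "in", session["room"])
```

A crawler reading the XML version can pick out the room or the start time by name, while the HTML version gives it only bold text and line breaks to guess from.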

In turn, search engines have had to develop their own language for interpreting, sorting and indexing the data returned from their Web crawlers and spiders. Because they already have their own metadata schemas, they are reluctant to develop systems to handle another one that few people currently use - no matter how cumbersome those existing schemas are.

Northern Light, which catalogs not just the Web, but a number of "special" information collections, uses a proprietary formatting system to handle data from its partners and Web crawlers.

For each new partner the company takes on, Northern Light engineers must develop a system for extracting information from the incoming data stream. "We have to take the third party's vocabulary and match it to ours," says Sheri Larsen, director of content processing at Northern Light. "Any new project is a big one, as we try to pull out as much as possible from documents."
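The vocabulary matching Larsen describes can be pictured as a field-renaming step. The sketch below is only an illustration in Python: the field names and the mapping are hypothetical, and the article does not describe how Northern Light's proprietary system actually works.

```python
# Hypothetical mapping from one partner's field names to an in-house vocabulary.
PARTNER_TO_HOUSE = {
    "headline": "title",
    "byline": "author",
    "pubdate": "date",
}

def normalize(record, mapping):
    """Rename known partner fields; set unknown fields aside for review."""
    known, unknown = {}, {}
    for field, value in record.items():
        if field in mapping:
            known[mapping[field]] = value
        else:
            unknown[field] = value
    return known, unknown

# Hypothetical incoming record from a partner.
partner_record = {"headline": "Example story", "byline": "Staff", "section": "News"}
known, unknown = normalize(partner_record, PARTNER_TO_HOUSE)
print(known)
print(unknown)
```

Every new partner needs its own mapping table in this picture, which is why each new project is a big one; a shared metadata language would shrink or remove that step.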

XML could provide a standard metadata language for site developers, such as Northern Light, and authors, such as its partners. Web crawlers would have to travel no further than the XML tags to know exactly what is on any given page. According to Sprague, the impetus to support XML in the company's crawler would have to come from people and companies developing sites with XML. At the moment, most sites are sticking with the tried-and-true HTML. "There is just so much HTML out there," Sprague says.

Northern Light is not alone in shunning XML when it comes to scouring the Web. Lycos, AltaVista and Excite also dismiss XML.

"We are not currently planning on using XML in the near future," says Ilene Quinn, spokeswoman for Compaq-owned AltaVista. "That does not rule out any future use, but at the moment there is nothing underway to incorporate XML into the AltaVista search engine."

So what's it good for?

This is not to say XML is a complete waste of time.

Excite and Lycos say they are looking to use XML behind the scenes to handle data from partners.

"Our intent is to use XML on the backend as a means of very structured access," says Graham Spencer, co-founder and chief technology officer at Excite. "That's really what XML is better for."

Spencer says that Excite is trying to convince new information partners to encode data streams being sent to Excite in XML. Many of Excite's existing data feeds have varying types of formats, meaning Excite's content team must develop different Perl scripts to sort all the incoming information.

Lycos developed an XML document type definition (DTD) for communications with certain partners, according to Lincoln Jackson, product manager for search and navigation. The DTD provides a standard format for data being streamed into the Lycos site.
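The benefit of an agreed format is that one parser can handle every partner's feed and reject records that do not conform. The Python sketch below illustrates the idea with a simple required-field check. It is a stand-in for real DTD validation, which Python's standard library does not perform, and the feed, item, title, url and summary element names are assumptions, not Lycos's actual DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical required children of each <item>; a real DTD would declare these.
REQUIRED_FIELDS = ("title", "url", "summary")

def parse_feed(xml_text):
    """Return (items, errors) for a feed of <item> elements under one root."""
    root = ET.fromstring(xml_text)
    items, errors = [], []
    for position, item in enumerate(root.findall("item"), start=1):
        fields = {name: item.findtext(name) for name in REQUIRED_FIELDS}
        # findtext returns None when the child element is absent.
        missing = [name for name, text in fields.items() if text is None]
        if missing:
            errors.append(f"item {position}: missing {', '.join(missing)}")
        else:
            items.append(fields)
    return items, errors

# Hypothetical partner feed; the second item lacks a <url>.
FEED = """
<feed>
  <item><title>A</title><url>http://example.com/a</url><summary>First</summary></item>
  <item><title>B</title><summary>No link</summary></item>
</feed>
"""
items, errors = parse_feed(FEED)
print(items)
print(errors)
```

With a shared format, the same function serves every partner, in place of a separate script per feed.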

But using XML to parse data from corporate partners, where there is a degree of control, is different from the Web, where anything goes.

Jackson frets that once unleashed on the Web, XML could mean the same type of "keyword spamming" that search engines now try to filter out of existing HTML meta tags. Unscrupulous developers could alter tag definitions for their own gain, defeating the purpose of a standard.

The Dublin Core initiative aims to bring a Dewey Decimal-like system to the Web that will give mainstream users and Web catalogers alike a standard means of tagging their information resources. According to the "Dublin Core Metadata Initiative" Web site, most of the specification's elements "have a commonly understood semantics of roughly the complexity of a library card catalog card."

While such a system could aid a search engine's ability to catalog the Web, it is far from becoming a specification and being put into mainstream use.
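As a rough illustration of what Dublin Core tagging could look like to a cataloger, the Python sketch below reads a small record expressed in XML. The record, its wrapping record element and its values are made up for this example. The namespace URI is the one published for the Dublin Core element set and is not something the sources quoted here mention.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set (version 1.1).
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical record; the wrapping <record> element is an assumption.
RECORD = f"""
<record xmlns:dc="{DC}">
  <dc:title>Example page</dc:title>
  <dc:creator>Example Author</dc:creator>
  <dc:subject>XML</dc:subject>
  <dc:subject>search engines</dc:subject>
</record>
"""

def dublin_core_fields(xml_text):
    """Collect Dublin Core elements into a dict of lists (elements may repeat)."""
    fields = {}
    for element in ET.fromstring(xml_text):
        # ElementTree writes namespaced tags as "{namespace-uri}localname".
        if element.tag.startswith("{" + DC + "}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append(element.text)
    return fields

fields = dublin_core_fields(RECORD)
print(fields)
```

Because every site would use the same small set of element names, a crawler could apply one routine like this everywhere instead of learning each site's vocabulary.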

Sprague believes that if and when XML begins to gain acceptance, many sites will split into two versions - one supporting XML and a sister HTML site.

For now, though, HTML remains king of the hill.

