Working out the bugs in XML databases
There's a growing belief that XML-based information needs its own database.
As network executives begin to experiment with Web services, they're likely to find that they need a new kind of data store: the XML database.
These software products are designed to efficiently store and manage the growing numbers of XML documents that users are creating, especially in Web interactions with business partners and customers. Advocates cite several advantages of XML databases compared with traditional databases: simplicity, ease of application development, ability to search and query XML documents, and fast document retrieval.
There's no formal, standard definition of an XML database, although the XML:DB Initiative (www.xmldb.org) describes such a database as one that defines a logical model for an XML document (not for the data in the document), and manages documents based on that model. The key point is the database "thinks and acts" based on XML - XML goes in, and XML comes out, even though these products can physically store the documents in an object or relational database or a proprietary storage model, such as indexed files.
The lack of formal definition is just one issue that raises the hackles of critics. They also point to the immaturity of the products and of XML standards; the absence of a standard, reliable query language to match the SQL used in relational databases; and possible data integrity problems.
Relational vendors are also adding better support for XML. For example, Microsoft is developing the Yukon release of SQL Server. Oracle demonstrated to customers in December a technology called Project XDB. The goal of both projects is to let the databases treat XML documents as a new data type and manage them as they now work with relational data and objects.
"If I had an Oracle [relational] database, I'd want to really know what's going in the background to handle XML," says Larry Hanson, data architect for the California Board of Equalization (BOE), a tax authority that handles sales and other taxes for the state. "If you store these documents as objects, for example, can you query them, and tag them?" Oracle claims that these actions will be possible with XDB, but how well the technology performs on large volumes of data or very large data sets remains to be seen.
Hanson's point, echoed by others, is that XML data is fundamentally different from relational data.
"XML data are extremely well-suited to hierarchical storage," says Hanson, who is a former database administrator. "In XML databases, an online tax return can be stored in its entirety. In a relational database, each line of the return would have to be a different table [of data in rows and columns]." Trying to "force fit" an XML document into the rigid relational structure can waste storage space and lead to inefficiencies in queries and retrievals.
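To make the hierarchical point concrete, here is a minimal sketch in Python, using only the standard library rather than any XML database product. The tax return document, its element names and the line_amount helper are invented for illustration; they are not the BOE's actual schema. The sketch shows a return kept whole as one document and queried by path, with no per-line tables.

```python
import xml.etree.ElementTree as ET

# Hypothetical sales tax return; element and attribute names are assumptions.
RETURN_XML = """
<taxReturn id="R-1001">
  <filer><name>Acme Hardware</name></filer>
  <lines>
    <line number="1" label="Total sales">50000.00</line>
    <line number="2" label="Exempt sales">8000.00</line>
    <line number="3" label="Taxable sales">42000.00</line>
  </lines>
</taxReturn>
"""

def line_amount(doc, number):
    """Return the amount on a given line of the return, or None if absent."""
    # The path walks the document's own hierarchy: lines -> line[@number].
    node = doc.find(f"./lines/line[@number='{number}']")
    return float(node.text) if node is not None else None

# The whole return is parsed and held as a single tree.
doc = ET.fromstring(RETURN_XML)
taxable = line_amount(doc, 3)
print(doc.get("id"), taxable)
```

A return with a fourth line would be queried the same way, since the path expression does not depend on how many lines the document contains.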
Analysts expect these benefits to fuel a fast-growing market. IDC estimates enterprise spending for XML databases will grow by 130% annually, reaching $700 million in 2004. XML databases will complement relational databases, according to IDC analyst Anthony Picardi - the former being better suited for storing and processing XML documents, the latter for numbers and text.
There are plenty of choices for network executives to evaluate, with at least two dozen native XML database products (see XML Database Products).
The key vendors include Software AG and eXcelon, which stores documents in its ObjectStore object-oriented database. A host of smaller vendors, such as NeoCore, IXIA and XYZFind, are working on XML database products. There are also a number of open source projects. One is Xindice, formerly dbXML Core, which is now handled by The Apache Software Foundation.
Knowing whether and when to use a native XML database hinges on the kind of data you're dealing with, and what you want to do with it.
Companies are finding that new applications such as Web services, which are built on XML, tend to have data models that don't map well to traditional relational structures, says Philippe Gelinas, CEO of software developer Ixiasoft, which developed the TextML Server for XML documents.
The server is designed as a low-cost product - about $10,000, while some rivals cost about $50,000 - that can work with an array of development tools.
"Often customers try to make these applications work first with an existing [relational] database and find it doesn't work," he says. "Then they shop for an XML database."
Some users, like California's Hanson, are early adopters, already convinced of the importance of XML to the corporation. Two years ago, Hanson began designing an alternative to paper tax returns: filing electronically via a Web site. The tax data eventually had to end up in the mainframe database, the venerable Adabas from Software AG.
There were two ways to get it there, and each had drawbacks. With the first option, the XML documents would be stored in Adabas as huge binary large objects, the way images and sometimes text are stored in relational databases. But then the documents become opaque: They cannot be searched or queried.
The California BOE was already doing some work with the second option: The documents are picked apart by a parser program, and the data is sent to the mainframe in a form Adabas can use. But this creates more processing overhead, and changes to the documents, such as adding a new line to the sales tax form, would force administrators to make changes to the underlying database structure.
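The drawback of the second option can be sketched in a few lines of Python. This is an illustration only: it uses SQLite in place of the mainframe database, and the filing document, table layout and shred function are all hypothetical. It shows a parser copying known elements into fixed columns, and a newly added form line that has nowhere to go until an administrator changes the table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical filing; <newSurcharge> stands in for a line added to the form later.
FILING = (
    "<filing id='F-1'>"
    "<totalSales>50000</totalSales>"
    "<taxDue>3625</taxDue>"
    "<newSurcharge>12</newSurcharge>"
    "</filing>"
)

# The relational table only knows about these elements (assumed layout).
COLUMNS = ("totalSales", "taxDue")

def shred(conn, xml_text):
    """Copy known elements into the table; return the tags that had no column."""
    root = ET.fromstring(xml_text)
    values = [float(root.findtext(name)) for name in COLUMNS]
    conn.execute(
        "INSERT INTO filing (id, totalSales, taxDue) VALUES (?, ?, ?)",
        [root.get("id"), *values],
    )
    # Anything else in the document is lost unless the schema is altered.
    return [child.tag for child in root if child.tag not in COLUMNS]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filing (id TEXT PRIMARY KEY, totalSales REAL, taxDue REAL)")

unmapped = shred(conn, FILING)
row = conn.execute("SELECT totalSales, taxDue FROM filing WHERE id = 'F-1'").fetchone()
print(row, unmapped)
```

In this sketch, capturing the new line would mean adding a column to the table and updating the parser, which is the kind of structural change the article describes.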
Hanson deployed Software AG's Tamino XML database. The XML documents created by tax filers at the Web sites are stored in Tamino.
The subset of data needed by the mainframe is parsed out.
The entire unmodified sales tax filing, and all of its data, is stored in Tamino, where BOE users, working with a Web browser, have begun querying the data and creating management reports.
"Once people move into XML, they'll run into the same thing we did," Hanson predicts.
"If you're getting the XML document instead of paper, where do you put it? How do you store it, and what are you going to do with it?" he adds. In the long term, his goal is to let users have a combined view of all data, in XML and traditional databases, through a Web browser.
Achieving that goal will not be easy, because XML databases still have numerous weak points. The user interface for new products may be rough. In the case of the California BOE, data administrators had to write extra code to update Tamino and the mainframe database. Queries are a challenge because there are several different XML query languages, and these are still in flux. Finally, integration between XML and corporate data stores requires still more coding at this early stage.