A nurse tags vital medical documents, allowing a patient's pharmacy and insurance carrier to share specific information almost instantly... A company's server pulls the key legal information it needs from a vendor for the two to begin conducting electronic business right away... A government employee immediately calls up a listing of comparison prices for a new computer monitor from all the procurement contracts available to her... Scenarios such as these demonstrate the type of system integration, expanded functionality and processing speed that could all be possible in the near future - enabled by XML.

XML (Extensible Markup Language), a document-description language based on SGML (Standard Generalized Markup Language) and proposed successor to HTML (HyperText Markup Language), has quickly become one of the hottest emerging technologies. Gartner Group positioned XML at the top of its 1998 Hype Cycle of Emerging Technologies and most of the other major research organizations have produced reports about the XML initiative.

Genesis of XML

XML is a subset of SGML that is optimized for information delivery over the Web. SGML defines a document syntax that "tags" sections of a document as having distinctive meanings. For example, certain words or sections of this article might be "marked up" with SGML tags as the <title>, the <author>, the <biography>, the <body> and the <acknowledgments>. In the publishing industry, SGML has been well established as the standard for how document markups should be structured. However, as a generic language, SGML itself does not specify what the markup tags should be. SGML allows you to define your own tags and markup rules to describe arbitrarily complex semantic relationships among elements of a document.

During the development of the World Wide Web, its architect Tim Berners-Lee decided that predefining a small number of specific tags and markup rules for all documents exchanged on the Web would make it easier to develop document readers (now called "browsers") to format marked-up documents. Berners-Lee focused on how documents were presented (rather than how to process their contents) and came up with a definition of tags and markup rules that is called the Hypertext Markup Language (HTML).

HTML has been an overwhelming success, and the current document Web is ubiquitous. But while HTML made it easy to specify how documents should be presented on a screen for human beings to read, it didn't provide any help in structuring the information content of documents for software systems to process. With the exponential growth of documents available on the Web, people soon wanted to do more than just read documents - they wanted to conduct business on the Web as well. The Web was becoming a victim of its own success, having become a ubiquitous "information system" without information-processing capabilities.

Let's look at Figure 1 as an example. This table represents a typical Internet shopping Web site, featuring "best deals of the site." Human beings immediately understand that the price of an Intel Pentium chip is $111.69. However, we derive this semantic information from our human visual perception and experience. It doesn't exist in the HTML document for software programs to extract.

Figure 1. A web page presentation.

To understand this point, let's look at the corresponding HTML source in Figure 2. The HTML source basically specifies how to print a table. Each pair of <TR> and </TR> tags encloses a table row; each pair of <TD> and </TD> tags encloses the table data for a cell of the table. Other tags cover presentation attributes such as font type, display color and table-cell size. A browser uses these presentation markups to render the table in Figure 1. Because 111.69 is printed under the price column heading and in the Pentium chip row, human intelligence makes the price-relationship inference. Yet a software system scanning through the HTML document in Figure 2 can only discern character strings in table cells and rows!

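
To make the limitation concrete, here is a minimal Python sketch of what a program sees when it scans table markup of this kind. The HTML fragment, the `CellCollector` class and the `extract_cells` function are illustrative assumptions, not the actual source shown in Figure 2; only the product and price come from the example above.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for presentation-only table markup; the layout and
# attributes are assumptions, not a copy of the article's Figure 2.
HTML_SOURCE = """
<TABLE>
  <TR><TD><B>Product</B></TD><TD><B>Price</B></TD></TR>
  <TR><TD>Intel Pentium chip</TD><TD><FONT COLOR="red">111.69</FONT></TD></TR>
</TABLE>
"""


class CellCollector(HTMLParser):
    """Collects the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_cells(html):
    """Return the table as a list of rows, each a list of cell strings."""
    collector = CellCollector()
    collector.feed(html)
    return collector.rows


for row in extract_cells(HTML_SOURCE):
    print(row)
```

The result is a grid of strings. Nothing in the markup says that "111.69" is a price or that it belongs to the chip; a program would have to guess that from cell positions, which breaks as soon as the page layout changes.
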
Figure 2. The corresponding HTML source file.

To enable automatic processing of information, we need markups similar to those in Figure 3. In this example, tags associate information content, not just display attributes, with the text. The resulting document is self-describing in terms of information content and can be interpreted easily by both human beings and software systems.

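
A short Python sketch suggests how a program might read such a self-describing document. The `<Product>` and `<Price>` tags follow the article's example; the `<Catalog>` and `<Name>` tags, the `currency` attribute, the second product with its price, and the `read_prices` function are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document. Only <Product>, <Price> and the Pentium price come
# from the article's example; everything else is an illustrative assumption.
XML_SOURCE = """
<Catalog>
  <Product>
    <Name>Intel Pentium chip</Name>
    <Price currency="USD">111.69</Price>
  </Product>
  <Product>
    <Name>Computer monitor</Name>
    <Price currency="USD">249.00</Price>
  </Product>
</Catalog>
"""


def read_prices(xml_text):
    """Map each product name to its price, using the tags rather than layout."""
    root = ET.fromstring(xml_text.strip())
    prices = {}
    for product in root.findall("Product"):
        name = product.findtext("Name")
        prices[name] = float(product.findtext("Price"))
    return prices


prices = read_prices(XML_SOURCE)
print(prices["Intel Pentium chip"])
```

Because the program asks for the price by tag name, it keeps working no matter how the information is later displayed on a page.
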
Figure 3. A corresponding XML document.

With a document marked up this way, you could do more effective information searching. For example, if you wanted to buy a Pentium chip from an Internet store, you could submit a search for "<Product> Pentium" on the Web. This type of search would eliminate unwanted pages, such as those referencing Pentium computers using the chip, the chip's performance information, the chip's architecture, development guides for the chip, bugs related to the chip or even the manufacturing process of the chip. Only pages specifically identified as selling the product would be returned. Even better, you could direct the search to sort the returned items based on the values under the <Price> tag and quickly identify the store from which you would buy the chip.

The potential of XML is well recognized by leaders in Internet development. It has been developed under the auspices of the World Wide Web Consortium (W3C) since 1996 with strong support from enterprises in a wide spectrum of industries. Both Microsoft's browser (IE4) and Netscape's browser (in beta for v4, with integration planned for v5) currently support XML. In addition, many public domain XML tools are available on the Internet, with Microsoft perhaps the strongest contributor and supporter. XML is on its way to becoming another ubiquitous information format on the Net.

Presenting XML Documents

XML markups may not cover presentation attributes. Therefore, we need a separate mechanism to control the display of XML documents for human readers. Two basic approaches exist for this purpose. The first approach specifies an XML-to-HTML mapping, which contains a set of rules describing the occurrence of XML markups and the corresponding HTML markups to be generated. The Extensible Stylesheet Language (XSL) is being developed to specify this mapping. The XSL example in Figure 4 specifies that the XML <Price> tag should result in the generation of an HTML <TD> tag with the specified attributes.
When an XML file and its accompanying XSL file are transmitted to a browser, the XSL processor (soon to be integrated into Microsoft browsers) will automatically generate the resulting HTML file and present it in the browser. This activity happens behind the scenes - users only see a table like Figure 1 appear in the browser.

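
The idea of a rule-based XML-to-HTML mapping can be sketched in a few lines of Python. This is not XSL syntax and not how an XSL processor is implemented; it only imitates the principle that each XML tag is matched by a rule that names the HTML tag and attributes to generate. The `RULES` table, the `to_html` function, the `<Catalog>` and `<Name>` tags and the chosen attributes are all assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping rules: XML tag -> (HTML tag, HTML attributes).
# A real stylesheet would express such rules in XSL, not in a Python dict.
RULES = {
    "Catalog": ("TABLE", {"BORDER": "1"}),
    "Product": ("TR", {}),
    "Name": ("TD", {}),
    "Price": ("TD", {"ALIGN": "right"}),
}


def to_html(element):
    """Recursively apply the mapping rule for this element and its children."""
    html_tag, attrs = RULES[element.tag]
    attr_text = "".join(f' {key}="{escape(value)}"' for key, value in attrs.items())
    text = escape((element.text or "").strip())
    children = "".join(to_html(child) for child in element)
    return f"<{html_tag}{attr_text}>{text}{children}</{html_tag}>"


XML_SOURCE = (
    "<Catalog><Product><Name>Intel Pentium chip</Name>"
    "<Price>111.69</Price></Product></Catalog>"
)

print(to_html(ET.fromstring(XML_SOURCE)))
```

The same XML document could be paired with a different rule table to produce a different presentation, which is the practical benefit of keeping content and display rules separate.
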
Figure 4. An XSL mapping example.

The second approach, for sophisticated client-side presentation control, is client-side script programs. When an XML file and associated script program are transmitted to a browser, the browser's built-in XML parser will parse the XML file and allow the script program to access any information element in the XML file via the Document Object Model (DOM) API. The script program can take full control of the information presentation and customize the presentation based on user interactions.

The combination of XSL and DOM support enables XML client-side processing capabilities that approach what we can do now with Java applets, ActiveX controls or browser plug-ins. Yet the development process is much simpler and the resulting system much cleaner. As XML becomes ubiquitous, XSL and DOM-exploiting scripts should become the mainstream mechanisms for client-side capabilities.

Document Type Definition (DTD)

Looking at the markups in Figure 3, we may wonder about the origin of those strange tags. Technically, XML allows us to define whatever tags we consider appropriate for our applications. In reality, however, each industry has its own jargon and business rules, which naturally describe the issues and concepts important for the trade. They constitute an XML vocabulary: a set of elements (words) and markup rules (grammar) for valid constructions of those elements used in a particular trade. An XML vocabulary may be defined in an XML document type definition (DTD) and referenced by all XML documents relevant to the industry.

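
One way to picture the "grammar" half of a vocabulary is as a set of rules that a program can check. The Python sketch below hard-codes the rules of a hypothetical e-mail vocabulary: an `email` holds a `head` followed by a `body`, and the `head` holds exactly one `from` (which must contain an `address`), one or more `to`, any number of `cc`, and exactly one `subject`. The element names and the `check_email` function are assumptions for illustration. A real system would read such rules from a DTD with a validating parser instead of hand-coding them, and this sketch does not check the order of fields within the head.

```python
import xml.etree.ElementTree as ET


def check_email(xml_text):
    """Return a list of rule violations; an empty list means the document conforms."""
    root = ET.fromstring(xml_text)
    if root.tag != "email":
        return ["root element must be email"]
    if [child.tag for child in root] != ["head", "body"]:
        return ["email must contain a head followed by a body"]

    errors = []
    head = root.find("head")
    counts = {tag: len(head.findall(tag)) for tag in ("from", "to", "subject")}
    if counts["from"] != 1:
        errors.append("head must contain exactly one from field")
    if counts["to"] < 1:
        errors.append("head must contain one or more to fields")
    if counts["subject"] != 1:
        errors.append("head must contain exactly one subject field")
    # cc fields are optional and repeatable, so there is nothing to count.

    sender = head.find("from")
    if sender is not None and sender.find("address") is None:
        errors.append("from must contain an address")
    return errors


VALID = (
    "<email><head><from><name>A. Sender</name><address>a@example.com</address></from>"
    "<to>b@example.com</to><subject>Hello</subject></head><body>Hi there</body></email>"
)
MISSING_TO = (
    "<email><head><from><address>a@example.com</address></from>"
    "<subject>Hello</subject></head><body>Hi there</body></email>"
)

print(check_email(VALID))
print(check_email(MISSING_TO))
```

Because sender and receiver apply the same rules, a document that passes the check on one side can be processed with confidence on the other.
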
Figure 5. A DTD file for e-mail.

Figure 5 is a fragment of a DTD file for e-mail. It says that each e-mail shall have two parts: a head and a body. The head part shall have a from field, one or more to fields, zero or more cc fields and a subject field. The from field has an optional name and a required address. A name field is just a text string.

Such a DTD file is useful in several different ways. First, it allows special e-mail editors to be developed. With the help of the DTD file, these editors can guide users to fill in required fields (such as the address field), skip over optional fields and create well-structured e-mails. The inclusion of markup tags in the document may happen behind the scenes and be totally transparent to the users. Second, by referencing the same DTD file, a receiver can parse the document, validate its syntactical correctness, retrieve each information item and process it accordingly. Effectively, a DTD file shares specifications about the information-content structure among the parties involved in the information exchange.

The e-mail example may seem intuitive and simplistic. That's because e-mail structure is already defined in the Simple Mail Transfer Protocol (SMTP) and is widely used today. The power of XML resides in not needing to define new protocols. As long as we document the information structure in a DTD file, we can use the same infrastructure to exchange any information, including bank transactions, credit reports, financial statements, purchase orders, medical records, traffic citations and telephone bills. XML essentially establishes a universal framework for information exchange, with the DTD serving as the associated data dictionary and database schema for meta-information.

Many industry-specific languages are already defined under this universal information framework.
The Open Financial Exchange (OFX) specifies the data format used by personal financial management software, including Intuit Quicken and Microsoft Money, to conduct financial transactions over the Web. The Open Software Distribution (OSD) defines important attributes of a software component, so that it may be automatically distributed and installed over the Internet when a new version becomes available. The Chemical Markup Language (CML) allows chemists to exchange descriptions of molecules, formulas and chemical structures. These are all XML-based languages, each setting a standard of information encoding and exchange for its specific industry.

Information Integration

While many new XML-based languages are still evolving, existing systems may have to exchange information before an industry-wide standard becomes available. XML technologies can also address this immediate need. In essence, an XML file is a flat-text file with arbitrary text markups. This data format is perhaps the lowest common denominator for communication, since any software system has to be able to handle simple text I/O. Flat-text files may not be as compact as binary representations, but they have the advantage of universal availability. Any existing information, regardless of its source and format, may be converted to flat-text XML for integration and further processing.

With enterprise documents routinely converted to HTML for distribution on the Internet, interchange of complex, structured information on the Internet represents the next frontier. What should be the upgrade of HTML for complex, structured information? In the absence of a universal database representation, the flat-text-based XML appears to be the only emerging standard. Furthermore, our experience of Internet standards driving intranet ones will likely repeat in XML.
As more and more structured information gets converted to XML for external consumption, people naturally will begin to ask if they should simply produce XML information in the first place. The ready availability of XML parsers, development tools and support infrastructures will further this momentum.

Electronic Commerce

A major objective of XML development is to facilitate software processing of structured information across the Web, including both consumer-to-business and business-to-business electronic commerce activities. The current HTML Web works effectively for simple consumer-to-business transactions, such as on-line shopping. OFX and other XML-based technologies will be able to handle more sophisticated transactions, such as on-line banking, in which consumers use fairly sophisticated personal financial management systems to interact with their banks.

Business-to-business electronic commerce existed well before the advent of the Web, primarily as Electronic Data Interchange (EDI). Despite its quarter-century history, however, there are currently only 80,000 EDI-enabled businesses, or fewer than 2% of all registered businesses in the United States. The Web movement has motivated most enterprises to connect to the Internet and gradually open up extranet connections to their business systems; yet business-to-business transactions across the Web are still relatively rare, because of the human-centric nature of the current HTML Web. Because software systems can process XML information automatically, business-to-business electronic commerce across the Internet will be more feasible when existing business systems are enhanced to process XML documents.

Both EDI and XML need standards to conduct electronic commerce. But the standards and standardization processes are very different. EDI standards are formal, rigorous and top-down. An industry must first specify standards on message formats, element dictionary and processing interactions.
Then participating organizations can join a service bureau, subscribe to its network services and develop proprietary software to connect to their applications. All message bits and all valid interaction sequences are defined, and in practice, one or a few industry powerhouses often steer standards to fit their business processes and information systems. As a result, most small and midsize businesses find EDI rigid, expensive and hard to maintain.

In contrast, XML-based standards are flexible and often evolve from the bottom up. Initially, organizations can just point to the same DTD file and exchange XML-based business information in the absence of any formal standard. The implied agreement covers only a shared tag set to represent business data elements. Existing applications already understand these business data elements. The enhancement works simply as a front end to process text-based XML streams and parse them into equivalent internal forms. Over a period of time, a popular DTD should emerge as the standard for the industry and be fully recognized by all participants.

Opportunities and Risks

In the current state of the Internet evolution, CIOs face a major challenge to open up enterprise applications for customers, suppliers and partners to conduct business on the Internet. This challenge presents a major opportunity for solution vendors. For example, SAP has launched "Retail Online Store" electronic commerce functions and updated its licensing agreement to cover "Session Users" - Internet users who are not its licensee's employees. AMS solutions also face similar situations. Procurement systems need to interface with external vendors' catalog systems for product selection and specification, and billing systems need to support bill presentment, payment and customer care functions across the Internet, to name just a few.
These electronic commerce capabilities signify not only tactical product differentiation, but also a strategic opportunity for a dominant position in the market segment. For AMS procurement systems to electronically interact with external vendors' catalog and ordering systems, for example, we need those vendors' systems to support our procurement systems in turn. From the vendors' perspective, they would rather support one procurement system than 100. Similarly, our client organizations do not want to support 100 purchase order formats; they would rather support one. In essence, the common desire to simplify application connectivity has created an environment for solution vendors to engage in a winner-take-all competition, which should result in a much less fragmented marketplace than what we have seen previously. A solution provider who can lead the technology evolution and solidify support across the industry can expect the reward of overwhelming dominance in the market segment.

While the bottom-up evolution of XML-based standards may seem low-key, laid back or even downright chaotic, this environment breeds winner-take-all competition. In fact, this type of competition has been all around us for years as part of the Internet evolution. Powerful as Microsoft is, for example, it could not resist the Internet revolution - at one time characterized by the competition to dominate HTML, HTTP and browser standards. This time Microsoft has joined other technology vendors in supporting XML. The enabling technology and infrastructure are stable enough for major competitions to take place in the application space. Application vendors and solution providers have to recognize the significance of this competition and reflect it in their strategic plans.

In short, broad agreement on and widespread anticipation of XML technology signify a new stage of the Internet evolution.
Under this common technology framework, application vendors and solution providers can compete to standardize electronic commerce solutions for the industry segments they serve. It's an Internet-size opportunity. And it's time to vie for the leadership of next-generation business systems.

Biography

Dr. Winston Chung is the director of the Internet and Web Technologies Laboratory in the AMS Center for Advanced Technologies (AMSCAT). His research is focused on Internet technologies. Winston provides consultation to AMS client projects on Web-based electronic commerce development. He has more than eleven years of experience with network software architecture, design, development, strategy and application. Prior to joining AMS, Winston led several TCP/IP and Internet development projects from PC to mainframe platforms for IBM's Networking Systems Division. In addition, he developed firewall solutions and contributed to the standardization of security protocols as part of the Internet Engineering Task Force (IETF). Winston earned his Ph.D. in Computer Science and Engineering from Auburn University and holds a patent in network protocol design.

Acknowledgments

The author would like to acknowledge the XML research efforts of Joel Nylund and Almaz Tekle at AMSCAT. He also wishes to thank Shahla Butler, Jerry Grochow, Wick Keating and Mark Raiffa, who provided valuable comments on this article.