Wrox Professional XML 2nd Edition May 2001 ISBN 1861005059 pdf

Professional XML 2nd Edition

  Mark Birbeck Jason Diamond Jon Duckett Oli Gauti Gudmundsson

  Pete Kobak Evan Lenz Steven Livingstone Daniel Marcus

  Stephen Mohr Nikola Ozu Jon Pinnock

  Keith Visco Andrew Watt Kevin Williams Zoran Zaev

Professional XML 2nd Edition

  © 2001 Wrox Press All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical articles or reviews. The author and publisher have made every effort in the preparation of this book to ensure the accuracy of the information. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, Wrox Press, nor its dealers or distributors will be held liable for any damages caused or alleged to be caused either directly or indirectly by this book.

  Published by Wrox Press Ltd, Arden House, 1102 Warwick Road, Acocks Green,

  Trademark Acknowledgements

  Wrox has endeavored to provide trademark information about all the companies and products mentioned in this book by the appropriate use of capitals. However, Wrox cannot guarantee the accuracy of this information.

  Credits

  Authors: Mark Birbeck, Jason Diamond, Jon Duckett, Oli Gauti Gudmundsson, Pete Kobak, Evan Lenz, Steven Livingstone, Daniel Marcus, Stephen Mohr, Nikola Ozu, Jon Pinnock, Keith Visco, Andrew Watt, Kevin Williams, Zoran Zaev

  Technical Reviewers: Daniel Ayers, Martin Beaulieu, Arnaud Blandin, Maxime Bombadier, Joseph Bustos, David Carlisle, Pierre-Antoine Champin, Robert Chang, Michael Corning, Chris Crane, Steve Danielson, Chris Dix, Sébastien Gignoux, Tony Hong, Paul Houle, Craig McQueen, Thomas B. Passin, Dave Pawson, Gary L Peskin, Phil Powers DeGeorge, Eric Rajkovic, Gareth Reakes, Matthew Reynolds, David Schultz, Marc H. Simkin, Darshan Singh, Paul Warren, Karli Watson

  Technical Architect: Timothy Briggs

  Technical Editors: Phil Jackson, Simon Mackie, Chris Mills, Andrew Polshaw

  Category Managers: Dave Galloway, Sonia Mulineux

  Project Administrator: Beckie Stones

  Production Co-ordinator: Pip Wonson

  Author Agent: Marsha Collins

  Indexers: Andrew Criddle, Bill Johncocks

  Proof reader: Agnes Wiggers

  Diagrams: Shabnam Hussain

  About the Authors

Mark Birbeck

  Mark Birbeck is Technical Director of Parliamentary Communications Ltd. where he has been responsible for the design and build of their political portal, ePolitix.com. He is also managing director of XML consultancy x-port.net Ltd., responsible for the publishing system behind spiked-online.com. Although involved in XML for a number of years, his special interests lie in metadata, and in particular the use of RDF. He particularly welcomes Wrox's initiative in trying to move these topics out of the shadows and into the mainstream.

  Mark would particularly like to thank his long-suffering partner Jan for putting up with the constant smell of midnight oil being burned. He offers the consolation that at least he will already be up when their first child Louis demands attention during the small hours.

  Jon Duckett

  Jon has been using and writing about XML since 1998, when he co-authored and edited Wrox's first XML publication. Having spent the past 3 years working for Wrox in the Birmingham UK offices, Jon is currently working from Sydney, so that he can get a different view out of the window while he is working and supping on a nice cup of tea...

Oli Gauti Gudmundsson

  Oli is working for SALT, acting as one of two Chief System Architects of the SALT Systems, and Development Director in New York. He is currently working on incorporating XML and XSL into SALT’s web authoring and content management systems. He has acted as an instructor in the Computer Science course (Java) at the University of Iceland, and Java is one of his greatest strengths (and pleasures!). As a hobby he is trying to finish his BS degree in Computer Engineering.

  His nationality is Icelandic, but he is currently situated in New York with his girlfriend Edda. He can be reached at [email protected].

Pete Kobak

  Pete Kobak built and programmed his first computer from a kit in 1978, which featured 256 bytes of RAM and a single LED output. After a fling as an electrical engineer for IBM, Pete gradually moved into software development to support mainframe manufacturing. He earned geek programmer status in the late '80s when he helped to improve Burroughs' Fortran compiler by introducing vectorization of DO loops. Justified by his desire to continue to pay his mortgage, Pete left Burroughs in 1991 to put lives in jeopardy by developing medical laboratory software in OS/2. In 1997, Pete somehow convinced The Vanguard Group to hire him to do Solaris web development, even though he could barely spell “Unix”. He has helped to add new features to their web site since then, specializing in secure web communication. Pete's current interest is in web application security, trying to find the right techniques to enforce

  

I'd like to dedicate my humble contribution to my wife Geraldine, and to my children Mary, John, and Patricia. They have sacrificed my time and attention for me to be able to complete this project. This chapter is a family effort.

  Evan Lenz

  Evan Lenz currently works as a software engineer for XYZFind Corp. in Seattle, WA. His primary area of expertise is in XSLT, and he enjoys exploring new ways to utilize this technology for various projects. His work at XYZFind includes everything from XSLT and Java development to writing user's manuals, to designing the XML query language used in XYZFind's XML database software. Wielding a professional music degree and a philosophy major, he hopes to someday bring his varying interests together into one grand, masterful scheme.

  Thanks to my precious wife, Lisa, and my baby son, Samuel, for putting up with Daddy's long nights. And praise to my Lord and Savior, Jesus Christ, without whom none of this would be possible or meaningful.

  Steven Livingstone

  Steven Livingstone is an IT Architect with IBM Global Services in Winnipeg, Canada. He has contributed to numerous Wrox books and magazine articles, on subjects ranging from XML to E-Commerce. Steven’s current interests include E-Commerce, ebXML, .NET, and Enterprise Application Architectures.

  Steven would like to thank everyone at Wrox, especially for the understanding as he emigrated from Scotland to Canada (and that could be another book itself ;-) Most importantly he wants to thank Loretito for putting up with him whilst writing – gracias mi tesoro.

  Congratulations Celtic on winning the Treble :)

Daniel Marcus

  Dr. Marcus has twenty years of experience in software architecture and design. He is co-founder, President, and Chief Operating Officer at Speechwise Technologies, an applications software company at the intersection of speech, wireless, and Internet technologies. Prior to starting Speechwise, he was Director of E-Business Consulting at Xpedior, leading the strategy, architecture, and deployment of e-business applications for Global 2000 and dot-com clients. Dr. Marcus has been a Visiting Scholar at Princeton's Institute for Advanced Study, a research scientist at the Lawrence Livermore National Laboratory, and is the author of over twenty papers in computational science. He is a Sun-Certified Java Technology Architect and holds a Ph.D. in Mechanical Engineering from the University of California, Berkeley.

Stephen Mohr

  Stephen Mohr is a software systems architect with Omicron Consulting, Philadelphia, USA. He has more than ten years' experience working with a variety of platforms and component technologies. His research interests include distributed computing and artificial intelligence.

Nikola Ozu

  Nikola Ozu is an independent systems architect who lives in Wyoming at the end of a few miles of dirt road – out where the virtual community is closer than town, but only flows at 24kb/s, and still does not deliver pizza. His current project involves bringing semantic databases, text searching, and multimedia components together with XML – on the road to Xanadu. Other recent work has included the usual web design consulting, some XML vocabularies, and an XML-based production and full-text indexing system for a publisher of medical reference books and databases.

  In the early 90s, Nik designed and developed a hypertext database called Health Reference Center; followed by advanced versions of InfoTrac. Both of these were bibliographic and full-text databases, delivered as monthly multi-disc CD-ROM subscriptions. Given the large text databases involved, some involvement with SGML was unavoidable. His previous work has ranged from library systems on mainframes to embedded micro systems (telecom equipment, industrial robots, toys, arcade games, and videogame cartridges). In the early 70s, he was thrilled to learn programming using patch boards, punch cards, paper tape, and printouts (and Teletypes, too). When not surfing the 'net, he surfs crowds, the Tetons, and the Pacific; climbs wherever there is rock; and tries to get more than a day's walk from the nearest road now and then. He enjoys these even more when accompanied by his teenage son, who's old enough now to appreciate the joy of mosh pits and sk8ing the Mission District after midnight.

  

To Noah: May we always think of the next (2^3 - 1) generations instead of just our own 2^0.

My thanks to the editors and illustrators at Wrox and my friend Deanna Bauder for their help with this project. Also, thanks and apologies to my family and friends who endured my disappearances into the WriterZone for days on end.

  Jonathan Pinnock

  Jonathan Pinnock started programming in Pal III assembler on his school's PDP 8/e, with a massive 4K of memory, back in the days before Moore's Law reached the statute books. These days he spends most of his time developing and extending the increasingly successful PlatformOne product set that his company, JPA, markets to the financial services community. He seems to spend the rest of his time writing for Wrox, although he occasionally surfaces to say hello to his long-suffering wife and two children. JPA’s home page is www.jpassoc.co.uk.

Keith Visco

  Keith Visco currently works for Intalio, Inc., the leader in Business Process Management, as a manager and project leader for XML based technologies. Keith is the project leader for the open source data-binding framework, Castor. He has been actively working on open source projects since 1998, including the Mozilla project where he is the original author of Mozilla's XSLT processor (donated by his previous employer, The MITRE Corporation) and is the current XSLT module owner.

  In all aspects of his life, Keith is most inspired after drinking a large Dunkin' Donuts Hazelnut

  

I would like to acknowledge Intalio, Inc. and The Exolab Group for giving me the opportunity to work on many industry-leading technologies. I would like to thank my team at Intalio, specifically Arnaud Blandin and Sebastien Gignoux, for their hard work as well as their invaluable feedback on this chapter. I would also like to thank my family for their unconditional support and incessant input into all phases of my life. A special thanks to Cindy Iturbe, whose encouragement means so much to me and for teaching me that with a little patience and hard work all things are possible, no matter how distant things may seem.

Andrew Watt

  Andrew Watt is an independent consultant who enjoys few things more than exploring the technologies others have yet to sample. Since he wrote his first programs in 6502 Assembler and BBC Basic in the mid 1980s, he has sampled Pascal, Prolog, and C++, among others. More recently he has focused on the power of web-relevant technologies, including Lotus Domino, Java and HTML. His current interest is in the various applications of the Extensible Markup Meta Language, XMML, sometimes imprecisely and misleadingly called XML. The present glimpse he has of the future of SVG, XSL-FO, XSLT, CSS, XLink, XPointer, etc when they actually work properly together is an exciting, if daunting, prospect. He has just begun to dabble with XQuery. Such serial dabbling, so he is told, is called “life-long learning”.

  In his spare time he sometimes ponders the impact of Web technologies on real people. What will be the impact of a Semantic Web? How will those other than the knowledge-privileged fare?

  To the God of Heaven who gives human beings the capacity to see, think and feel. To my father who taught me much about life. My heartfelt thanks go to Gail, who first suggested getting into writing, and now suffers the consequences on a fairly regular basis, and to Mark and Rachel, who just suffer the consequences.

Kevin Williams

  Kevin’s first experience with computers was at the age of 10 (in 1980) when he took a BASIC class at a local community college on their PDP-9, and by the time he was 12, he stayed up for four days straight hand-assembling 6502 code on his Atari 400. His professional career has been focused on Windows development – first client-server, then onto Internet work. He’s done a little bit of everything, from VB to Powerbuilder to Delphi to C/C++ to MASM to ISAPI, CGI, ASP, HTML, XML, and any other acronym you might care to name; but these days, he’s focusing on XML work. Kevin is a Senior System Architect for Equient, an information management company located in Northern Virginia. He may be reached for comment at [email protected].

Zoran Zaev

  Zoran is a Sr. Web Solutions Architect with Hitachi Innovative Solutions, Corp. in the Washington DC area. He has worked in technology since the time when 1 MHz CPUs and 48Kb were considered 'significant power', in the now distant 1980s. In the mid 1990s, Zoran became involved in web applications development. Since then, he has worked helping large and small

  

I would like to thank my wife, Angela, for her support and encouragement, as well as sharing some of her solid writing knowledge. And, you can never go wrong thanking your parents, so 'fala' to my mom, Jelica and dad, Vanco. On the professional side, I would like to thank Ellen Manetti for her strong project management example, and Pete Johnson, founder of Virtualogic, Inc., for his vision-inspiring influence. Finally, thanks to Beckie and Marsha from Wrox for their always-timely assistance and to Jan from "Images by Jan".

  Zoran can be reached at [email protected]

  Introduction

  eXtensible Markup Language (XML) has emerged as nothing less than a phenomenon in computing. It is a concept elegant in its simplicity, driving dramatic changes in the way Internet applications are written. This second edition revises the first to keep pace with this fast-changing field: many technologies have been superseded, and new ones have emerged.

  What Does This Book Cover?

  This book explains and demonstrates both the essential techniques for designing and using XML documents, and many of the related technologies that are important today. Almost everything in this book will be based around a specification provided by the World Wide Web Consortium (W3C).

  These specifications are at various levels of completion and some of the technologies are nascent, but we expect them to become very popular when their specifications are finalized because they are useful or essential. The wider XML community is increasingly jumping in and offering new XML-related ideas outside the control of the W3C, although the W3C is still central and important to the development of XML. The focus of this book is on learning how to use XML as an enabling technology in real-world applications. It presents good design techniques, and shows how to interface XML-enabled applications with web applications. Whether your requirements are oriented toward data exchange or presentation, this book will cover all the relevant techniques in the XML community. Most chapters contain a practical example (unless the technology is so new that there were no working implementations at the time of writing). As XML is a platform-neutral technology, the

Who Is This Book For?

  This book is for the experienced developer, who already has some basic knowledge of XML, to learn how to build effective applications using this exciting but simple technology. Web site developers can learn techniques, using XSLT stylesheets and other technologies, to take their sites to the next level of sophistication. Other developers can learn where and how XML fits into their existing systems and how they can use it to solve their application integration problems.

  XML applications can be distributed and are usually web-oriented. This book focuses on this kind of application and so we would expect the reader to have some awareness of multi-tier architecture - preferably from a web perspective. We do revisit the XML fundamentals, in case some of them have been missed in your experience, covering the full specification thoroughly but fairly quickly.

  A variety of programming languages will be used in this book, and we do not expect you to be proficient in them all. The techniques taught in this book can be transferred to other programming languages. Because XML is cross-platform, Java is used widely in this book, especially as it has a wealth of tools to manipulate XML. Other languages covered include JavaScript, VBScript, VB, C#, and Perl. We expect the reader to be proficient in a programming language, but it does not matter which one.

  How is this Book Structured?

  Although many authors have contributed towards this book, we have tied the chapters together under unifying themes. As you will read below, the book has effectively been split into six sections. A standard example using a toy company has been used in chapters where possible, so you can see how different technologies can explain, describe, or transform the same data in different ways.

  A small number of the chapters, e.g. Chapter 23, rely heavily on a previous chapter, but this will be made clear. Most of the chapters will be relatively self-contained.

  Learning Threads

  XML is evolving into a large, wide-ranging field of related markup technologies. This growth is powering XML applications. With growth comes divergence. Different readers will come to this book with different expectations. XML is different things to different people.

Foundation

  Chapter 1 introduces the XML world in general, discussing the technologies that are relevant today and may be relevant tomorrow, but with very little code. Chapters 2 (Basic XML Syntax) and 3 (Advanced XML Syntax) cover the fundamentals of XML 1.0. Chapter 2 gives you the basic syntax of an XML document, while Chapter 3 covers slightly more advanced issues like namespaces. These chapters form the irreducible minimum you need to understand XML and, depending on your experience, you may want to skip these introductory chapters. Chapter 4 teaches you about the Infoset, a standard way of describing XML, which provides an abstract representation for XML data. In Chapter 5, we cover document validation using DTDs. Although, as you learn in the subsequent two chapters, other schema-based validation languages exist that supersede DTDs, they are not quite dead as many more XML parsers validate with DTDs than any other schema language, and DTDs are relatively

  In Chapter 8, we explain the XPath specification – a method of referring to specific fragments of XML that is relevant to and used by other XML technologies. These include XSLT, described in Chapter 9. Here we teach you how to transform your XML documents into anything else, based on certain stylesheet declarations. In Chapter 10, we show various linking technologies, such as XLink and XPointer and describe the XML Fragment Interchange specification. These ten chapters are enough for you to learn about all of the immediately useful XML technologies – for those who just use XML. You may already have a lot of experience of XML and so some of these chapters will be re-treading over well-walked ground, but everybody should be able to learn something new, especially because XML Schema acquired Proposed Recommendation status, the penultimate stage of the W3C specifications, just two months before this book was printed. Although a wealth of XML techniques lie ahead, you will have a firm foundation upon which to build.
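  To give a flavor of what "referring to specific fragments of XML" means in practice, here is a minimal sketch in Python. It uses the limited XPath subset offered by the standard library's xml.etree.ElementTree module rather than a full XPath implementation. The catalog document, its element names, and the two helper functions are hypothetical, invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
CATALOG = """<catalog>
  <toy id="t1"><name>Robot</name><price>19.99</price></toy>
  <toy id="t2"><name>Kite</name><price>7.50</price></toy>
</catalog>"""


def toy_names(xml_text):
    """Return the text of every <name> that is a child of a <toy>."""
    root = ET.fromstring(xml_text)
    # "toy/name" is a relative path evaluated from the root element.
    return [node.text for node in root.findall("toy/name")]


def toy_by_id(xml_text, toy_id):
    """Return the name of the toy whose id attribute matches, or None."""
    root = ET.fromstring(xml_text)
    # The [@id='...'] predicate filters <toy> elements by attribute value.
    match = root.find(f"toy[@id='{toy_id}']/name")
    return None if match is None else match.text
```

  Calling toy_by_id(CATALOG, "t2") selects a single fragment by attribute, while toy_names(CATALOG) selects every matching fragment in document order. A full XPath processor supports many more axes and functions than this subset.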

  So the Foundation thread includes:

  ❑ Chapter 1: Introducing XML
  ❑ Chapter 2: Basic XML Syntax
  ❑ Chapter 3: Advanced XML Syntax
  ❑ Chapter 4: The XML Information Set
  ❑ Chapter 5: Validating XML: Schemas
  ❑ Chapter 6: Introducing XML Schema
  ❑ Chapter 7: XML Schema Alternatives
  ❑ Chapter 8: Navigating XML – XPath
  ❑ Chapter 9: Transforming XML
  ❑ Chapter 10: Fragments, XLink, and XPointer

XML Programming

  XML is both machine and human readable and, not surprisingly, some standard APIs have been created to manipulate XML data. These APIs are implemented in JavaScript, Java, Visual Basic, C++, Perl, and many other languages. These provide a standard way of manipulating, and developing for, XML documents. In Chapter 11, we consider the first API, which emerged from the HTML world, the DOM. This has been released as a specification from the W3C, and Level 2 of this specification has recently been released. XML data can be thought of as hierarchical and object-oriented, and the DOM provides methods and properties for retrieving and manipulating XML nodes. Chapter 12 discusses SAX, a lightweight alternative to the DOM. When manipulating the DOM, the entire document has to be read into memory; SAX, however, reads only as much data as is necessary to retrieve or manipulate a specific node.
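  The contrast between the two APIs can be sketched with Python's standard library, which ships both a DOM implementation (xml.dom.minidom) and a SAX parser (xml.sax). This is only an illustrative sketch: the order document, the ItemHandler class, and the helper function names are assumptions made for the example and are not taken from later chapters.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, invented for this illustration.
DOC = "<order><item>Robot</item><item>Kite</item></order>"


def items_with_dom(xml_text):
    """DOM style: build the whole tree in memory, then query it."""
    dom = minidom.parseString(xml_text)
    return [el.firstChild.data for el in dom.getElementsByTagName("item")]


class ItemHandler(xml.sax.ContentHandler):
    """SAX style: react to parse events and keep only what is needed."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # collects text only while inside an <item>

    def startElement(self, name, attrs):
        if name == "item":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buffer))
            self._buffer = None


def items_with_sax(xml_text):
    handler = ItemHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.items
```

  Both functions return the same list for DOC, but the DOM version holds the complete document tree while it works, whereas the SAX handler retains only the text of the elements it cares about. That difference matters most for very large documents.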

  Chapter 13 is the last chapter in this section, and it covers Declarative Programming with XML. Most programmers use procedural languages, but XML and the XML specifications don't care about how a particular language or application performs a job, just that it does it according to the declarations made.

  The Programming thread therefore includes:

  ❑ Chapter 11: The Document Object Model
  ❑ Chapter 12: SAX 2
  ❑ Chapter 13: Schema Based Programming

XML as Data

  There are four chapters in this section, all targeted specifically at the storage, retrieval, and manipulation of data – as it relates to XML. Chapter 14, Data Modeling, explains how to plan your project 'properly', and so model your XML on your data and build better applications because of it.

  Chapter 15 extends this concept by covering the binding of the data to XML (and vice versa). Chapter 16, Querying XML, covers a nascent technology known as XML Query. It aims to provide the power of SQL in an XML format. This short chapter teaches you how to use the technology as it stands at the time of writing. The final chapter is a case study, which describes how to relate your databases to your XML data and so integrate your XML and RDBMS in the best way possible. This means that the Data thread contains:

  ❑ Chapter 14: Data Modeling
  ❑ Chapter 15: Data Binding
  ❑ Chapter 16: Querying XML
  ❑ Chapter 17: Case Study: XML and Databases

Presentation of XML

  Chapter 18 covers an XML technology called SVG – Scalable Vector Graphics. This XML technology, when coupled with an appropriate viewer (for example, Adobe SVG Viewer), allows quite detailed graphics files to be displayed and manipulated. In Chapter 19, we describe VoiceXML, an XML technology to allow voice recognition and processing on the Web. XML data can be converted to VoiceXML and, using the appropriate technology, can be spoken and interacted with over a telephone.

  Chapter 20 covers the final technology in this section, XSL-FO. This is an emerging technology that allows the layout of pages to be specified exactly, much in the same way as PDF does now. The main difference is that XSL-FO is XML too, and so can be manipulated using the same XML tools you may be used to. Also, XSL-FO can be converted to PDF if necessary for users without XSL-FO viewers. In the Presentation thread, therefore, we cover:

  ❑ Chapter 18: Presenting XML Graphically
  ❑ Chapter 19: VoiceXML
  ❑ Chapter 20: XSL Formatting Objects: XSL-FO

XML as Metadata

  In this thread, we discuss how XML can be used to represent metadata – that is, the meaning or semantics of data, rather than the data itself. In Chapter 21, we cover the setting up of an index of XML data. This chapter uses a Java indexing application, but the techniques are applicable to any indexing tool. Chapter 22 is where we really get to the meat of the topic, where we talk about RDF – a language to describe metadata. We cover the elements and syntax of this technology. In Chapter 23, we go over some practical examples of RDF technology, before describing RDDL – a method of bundling resources at the URL of a namespace, so that a RDDL-enabled application can learn what technology the namespace actually refers to, and can access its schema and standard transforms.

  In the Metadata thread, we cover:

  ❑ Chapter 21: Case Study: Generating a Site Index
  ❑ Chapter 22: RDF
  ❑ Chapter 23: RDF Code Samples and RDDL

XML used for B2B

  The final section of this book describes what is quite possibly the most important use of XML – B2B and Web Services. In the past, the communication protocols for B2B (e.g. EDI) have been proprietary, and expensive – both in terms of cost, and processor power. Using XML vocabularies, an open and programmable model can be used for B2B transactions. In Chapter 24, we describe Simple Object Access Protocol. SOAP was a mostly Microsoft initiative (although the W3C are developing the XML Protocol specification, which should be very similar to SOAP), which allows two applications to specify services using XML. We cover the intricacies of this protocol, so that you can use it to web-enable any service you would care to mention.

  Chapter 25 covers Microsoft's BizTalk Server. This server can control all B2B transactions, using the open BizTalk framework. BizTalk is just one method of using SOAP to conduct business transactions, but it is Microsoft's and is very popular. In Chapter 26, we have a case study discussing E-Business integration using XML. There are a number of business standards for commerce, and this chapter explains how you can integrate all of the standards, without having to write code for every possible B2B transaction between competing standards.

  We end in Chapter 27, with a discussion of the Web Services Description Language, which allows us to formalize other XML vocabularies by defining services that a SOAP, or other client, can connect to. WSDL describes each service and what it does. In addition, in this chapter, we cover UDDI (Universal Description, Discovery, and Integration), which is a way of automating the discovery of, and transactions with, various services. In many cases, human interaction should not be necessary to find a service, and using public registration services, UDDI makes this possible. Both of these technologies are nascent but their importance will grow as more and more businesses make use of them.

  In summary, in the B2B thread, we describe in each chapter the following:

  ❑ Chapter 24: SOAP
  ❑ Chapter 25: B2B with Microsoft BizTalk Server

What You Need to Use this Book

  The book assumes that you have some knowledge of HTML, of a procedural or object-oriented programming language (e.g. Java, VB, C++), and some minimal XML knowledge. For some of the examples in this book, a Java Runtime Environment (http://java.sun.com/j2se/1.3/) will need to be installed on your system, and some other chapters require applications such as MS SQL Server, MS Index Server, and BizTalk. The complete source for larger portions of code from the book is available for download from: http://www.wrox.com/. More details are given in the section of this Introduction called "Support, Errata, and P2P".

  Conventions

  To help you get the most from the text and keep track of what's happening, we've used a number of conventions throughout the book. For instance:

  These boxes hold important, not-to-be forgotten information, which is directly relevant to the surrounding text.

  While this style is used for asides to the current discussion.

  As for styles in the text:

  When we introduce them, we highlight important words

  We show keyboard strokes like this: Ctrl-A

  We show filenames, and code within the text like so: doGet()

  Text on user interfaces is shown as: File | Save

  URLs are shown in a similar font, as so: http://www.w3c.org/

  We present code in two different ways. Code that is important, and testable is shown as so:

  In our code examples, the code foreground style shows new, important, pertinent code

  Code that is an aside, shows examples of what not to do, or has been seen before is shown as so:

  Code background shows code that's less important in the present context, or has been seen before.

  > java com.ibm.wsdl.Main -in Arithmetic.WSDL
  >> Transforming WSDL to NASSL ..
  >> Generating Schema to Java bindings ..
  >> Generating serializers / deserializers ..
  Interface 'wsdlns:ArithmeticSoapPort' not found.

Support, Errata, and P2P

  The printing and selling of this book was just the start of our contact with you. If there are any problems whatsoever with the code or the explanation in this book, we welcome input from you. A mail to [email protected] should elicit a response within two to three days (depending on how busy the support team are).

  In addition to this, we also publish any errata online, so that if you have a problem, you can check on the Wrox web site first to see if we have updated the text at all. First, pay a visit to www.wrox.com, then click on the Books | By Title (Z-A), or Books | By ISBN link on the left hand side of the page. Navigate to this book (this ISBN is 1861005059, if you choose to navigate this way) and then click on it.

  All of the code for this book can be downloaded from our site. It is included in a zip file, and all of the code samples in this book can be found within, referenced by chapter number. In addition, at p2p.wrox.com, we have our free "Programmer to Programmer" discussion lists. There are a few relevant to this book, and any questions you post will be answered by either someone at Wrox, or someone else in the developer community. Navigate to http://p2p.wrox.com/xml, and subscribe to a discussion list from there. All lists are moderated and so no fluff or spam should be received in your Inbox.

  Tell Us What You Think

  We've worked hard to make this book as useful to you as possible, so we'd like to know what you think. We're always keen to know what it is you want and need to know. We appreciate feedback on our efforts and take both criticism and praise on board in our future editorial efforts. If you've anything to say, let us know on: [email protected]

  Or via the feedback links on: http://www.wrox.com

  Introducing XML

  In this chapter, we'll look at the origins of XML, the core technologies and specifications that are related to XML, and an overview of some current, and future applications of XML. The later sections of this introduction should also serve as something of a road map to the rest of the book.

  Y Origins and Goals of XML

  FL

  "XML", as we all know, is an acronym for Extensible Markup Language – but what is a markup language? What is the history of markup languages, what are the goals of XML, and how does it improve upon earlier markup?

  Markup Languages

  Ever since the invention of the printing press, writers have made notes on manuscripts to instruct the printers on matters such as typesetting and other production issues. These notes were called "markup".

  A collection of such notes that conform to a defined syntax and grammar can certainly be called a "language". Proofreaders use a hand-written symbolic markup language to communicate corrections to editors and printers. Even the modern use of punctuation is actually a form of markup that remains with the text to advise the reader how to interpret that text.

  These early markup languages use a distinct appearance to differentiate markup from the text to which it refers. For example, proofreaders' marks consist of a combination of cursive handwriting and special symbols to distinguish markup from the typeset text. Punctuation consists of special symbols that cannot be confused with the alphabet and numbers that represent the textual content. These symbols are so

Chapter 1

  The ASCII character set standard was the early basis for widespread data exchange between various hardware and software systems. Whatever the internal representation of characters, conversion to ASCII allowed these disparate systems to communicate with each other. In addition to text, ASCII also defined a set of symbols, the C0 control characters (using the hexadecimal values 00 to 1F), which were intended to be used to mark up the structure of data transmissions. Only a few of these symbols found widespread acceptance, and their use was often inconsistent. The most common example is the character(s) used to delimit the end of a line of text in a document. Teletype machines used the physical motion-based character pair CR-LF (carriage-return, line-feed). This was later used by both MS-DOS and MS-Windows; UNIX uses a single LF character; and MacOS uses a single CR character. Due to conflicting and non-standard uses of C0 control characters, document interchange between different systems still often requires a translation step, since even a simple text file cannot be shared without conversion.

  Various forms of delimiters have been used to define the boundaries of containers for content, special symbol glyphs, presentation style of the text, or other special features of a document. For example, the C and C++ programming languages use the braces {} to delimit units of data or code. A typesetting language, intended for manual human editing, might use strings that are more readable, like ".begin" and ".end".
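  The translation step for line endings described above can be sketched in a few lines of Python. This is only an illustration: the function name normalize_newlines and the sample strings are our own assumptions, and a real converter would also have to consider character encodings.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR-LF pairs and bare CR characters to a single LF."""
    # Replace the two-byte CR-LF pair first, so that its CR is not
    # converted a second time by the bare-CR replacement that follows.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical sample data: the same two lines as each system would store them.
samples = {
    "MS-DOS / MS-Windows": b"line one\r\nline two\r\n",
    "UNIX": b"line one\nline two\n",
    "MacOS": b"line one\rline two\r",
}

normalized = {name: normalize_newlines(raw) for name, raw in samples.items()}

for name, text in normalized.items():
    print(name, text)
```

  After conversion, all three samples hold the same bytes, which is exactly what the receiving system needs before it can treat the files as the same document.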

  Markup is a method of conveying metadata (information about another dataset).

  XML is a relatively new markup language, but it is a subset of, and is based upon, a mature markup language called Standard Generalized Markup Language (SGML). The WWW's Hypertext Markup Language (HTML) is also based upon SGML; indeed, it is an application of SGML. There is a new version of HTML 4 that is called Extensible Hypertext Markup Language (XHTML), which is similarly an application of XML. All of these markup languages are for metadata, but SGML and XML may be further considered meta-languages, since they can be used to create other metadata languages. Just as HTML was expressed in SGML, XHTML and others will use XML.

  SGML-based markup languages all use literal strings of characters, called tags, to delimit the major components of the metadata, called elements.

  Tags act as object delimiters and carry other such markup; they are distinct from the content they enclose (no matter whether that content is simple text or text that is program code). Of course, there has often been conflict between different sets of tags and their interpretation. Without common delimiter vocabularies, or even common internal data formats, it has been very difficult to convert data from one format to another, or otherwise share data between applications and organizations. For example, the following two markup excerpts (Chapter_01_01.html & Chapter_01_01.xml) show familiar HTML and similar XML elements with their delimiting tags:

  <HTML>
  <HEAD>
  <TITLE>Product Catalog (Toysco-only)</TITLE>
  </HEAD>
  <BODY>
  <H1>Product Catalog</H1>
  <H2>Product Descriptions</H2>
  <HR WIDTH=33% ALIGN=LEFT>
  <H3>Mega Wonder Widget</H3>
  <P>The <EM>Mega Wonder Widget</EM> is a popular toy with a 20 oz. capacity.
  It costs only $12.95 to make, whilst selling for $33.99 (plus $3.95 S&amp;H).<BR>
  <H3>Giga Wonder Widget</H3>
  <P>The <EM>Giga Wonder Widget</EM> is even more popular, because of its larger
  55 oz. capacity. It has a similar profit margin (costs $19.95, sells for $49.99).
  ...
  <HR>
  <P><I>Updated:</I> 2001-04-01 <I>by Webmaster Will</I>
  </BODY>
  </HTML>

  This rather simplistic document uses the few structural tags that exist in HTML, such as <TITLE>, <H1>, <H2>, and <H3> for headers, and <P> for paragraphs. This structure is limited to a very basic presentation model of a document as a printed page. Other tags, such as <HR> and <EM>, are purely about the appearance of the data. Indeed, most HTML tags are now used to describe the presentation of data, interactive logic for user input and control, and external multimedia objects. These tags give us no idea what structured data (names, prices, etc.) might appear within the text, or where it might be in that text.

  On the other hand, XML allows us to create a structural model of the data within the text. Presentation cues can be embedded as with HTML tags, but the best XML practice is to separate the data structure from presentation. An external style sheet can be used with XML to describe the data's presentation model. So, we might convert, and extend, the above HTML example into the following XML data file (Chapter_01_01.xml):

  <?xml version="1.0" ?>
  <!DOCTYPE ProductCatalog [
  <!ELEMENT ProductCatalog (HEAD?, BODY?) >
  <!ELEMENT HEAD (TITLE, Updated, Author+, Security*) >
  <!ELEMENT BODY (H1, H2, Products) >
  <!ELEMENT Products ((H3, Product)+) >
  <!ELEMENT Product (#PCDATA|Prodname|Capacity|Cost|Price|Shipfee|BR)* >
  <!ELEMENT H1 (#PCDATA) >
  <!ELEMENT H2 (#PCDATA) >
  <!ELEMENT H3 (#PCDATA) >
  <!ELEMENT TITLE (#PCDATA) >
  <!ELEMENT Updated (#PCDATA) >
  <!ELEMENT Author (#PCDATA) >
  <!ELEMENT Security (#PCDATA) >
  <!ELEMENT Prodname (#PCDATA) >
  <!ELEMENT Capacity (#PCDATA) >
  <!ELEMENT Cost (#PCDATA) >
  <!ELEMENT Price (#PCDATA) >
  <!ELEMENT Shipfee (#PCDATA) >
  <!ELEMENT BR EMPTY >
  <!ATTLIST Capacity unit CDATA #IMPLIED >
  <!ATTLIST Cost currency CDATA #IMPLIED >
  <!ATTLIST Price currency CDATA #IMPLIED >
  <!ATTLIST Shipfee currency CDATA #IMPLIED >
  <!ENTITY MWW "Mega Wonder Widget" >
  <!ENTITY GWW "Giga Wonder Widget" >
  ]>
  <ProductCatalog>
  <HEAD>
  <TITLE>Product Catalog</TITLE>
  <Updated>2001-04-01</Updated>
  <Author>Webmaster Will</Author>
  <Security>Toysco-only (TRADE SECRET)</Security>
  </HEAD>
  <BODY>
  <H1>Product Catalog</H1>
  <H2>Product Descriptions</H2>
  <Products>
  <H3>&MWW;</H3>
  <Product>
  The <Prodname>&MWW;</Prodname> is a popular toy with a
  <Capacity unit="oz.">20</Capacity> capacity. It costs only
  <Cost currency="USD">12.95</Cost> to make, whilst selling for
  <Price currency="USD">33.99</Price> (plus
  <Shipfee currency="USD">3.95</Shipfee> S&amp;H).<BR/>
  </Product>
  <H3>&GWW;</H3>
  <Product>
  The <Prodname>&GWW;</Prodname> is even more popular, because of its larger
  <Capacity unit="oz.">55</Capacity> capacity. It has a similar profit
  margin (costs <Cost currency="USD">19.95</Cost>, sells for
  <Price currency="USD">49.99</Price>).<BR/>
  </Product>
  ...
  </Products>
  </BODY>
  </ProductCatalog>

  The XML document looks very similar to the HTML version, with comparable text content, and some equivalent tags (as XHTML). XML goes far beyond HTML by allowing the use of custom tags (like <Prodname> or <Capacity>) that preserve some structured data that is embedded within the text of the description. We can't do this in HTML, since its set of tags is more or less fixed, changing slowly as browser vendors embrace new features and markup. In contrast, anyone can add tags to their own XML data. The use of tags to describe data structure allows easy conversion of XML to an arbitrary DBMS format, or alternative presentations of the XML data such as in tabular form or via a voice synthesizer connected to a telephone.
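  To make the conversion to a tabular or DBMS-style form more concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The CATALOG string is a cut-down copy of the catalog, with the entity references written out in full and the prices taken from the HTML version; the names FIELDS and product_rows are our own assumptions, not part of the book's code download.

```python
import xml.etree.ElementTree as ET

# Cut-down, illustrative copy of the product catalog (an assumption:
# entities are expanded and the DTD is omitted to keep the sketch short).
CATALOG = """<ProductCatalog>
  <BODY>
    <Products>
      <Product>The <Prodname>Mega Wonder Widget</Prodname> is a popular toy with a
        <Capacity unit="oz.">20</Capacity> capacity. It costs only
        <Cost currency="USD">12.95</Cost> to make, whilst selling for
        <Price currency="USD">33.99</Price> (plus
        <Shipfee currency="USD">3.95</Shipfee> S&amp;H).</Product>
      <Product>The <Prodname>Giga Wonder Widget</Prodname> is even more popular,
        because of its larger <Capacity unit="oz.">55</Capacity> capacity.
        It has a similar profit margin (costs
        <Cost currency="USD">19.95</Cost>, sells for
        <Price currency="USD">49.99</Price>).</Product>
    </Products>
  </BODY>
</ProductCatalog>"""

# The custom tags we want to pull out of the descriptive text.
FIELDS = ("Prodname", "Capacity", "Cost", "Price", "Shipfee")


def product_rows(xml_text):
    """Return one dictionary per <Product>, keyed by the custom tag names."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.iter("Product"):
        # findtext() returns None when a product lacks that child element.
        rows.append({field: product.findtext(field) for field in FIELDS})
    return rows


rows = product_rows(CATALOG)
for row in rows:
    print(row)
```

  Each dictionary corresponds to one row of a table, with a missing value (None) where a product has no such element, as with the second product's shipping fee. The surrounding descriptive text is simply ignored, something that would need fragile text matching with the HTML version.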

  We have also assumed that we will use a stylesheet to format the XML data for presentation. Therefore, we are able to omit certain labels from our text (such as the $ sign in prices, and the "oz." after the capacity value). We will then rely upon the formatting process to insert them in the output, as appropriate. In a similar fashion, we have put the document update information in the header (where it can be argued that it logically belongs). When we transform the data for output, this data can be displayed as a footer with various string literals interspersed. In this way, it can appear to be identical to the HTML version. It should be obvious from the examples that HTML and XML are very similar in both overall structure and syntax. Let's look at their common ancestor, before we move on to the goals of XML.
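  The formatting step described above would normally be the job of a stylesheet, but the idea can be sketched in Python. In this illustration the SNIPPET document is a simplified stand-in for the catalog, and the CURRENCY_SYMBOLS table and render function are hypothetical names of our own. The labels that we left out of the data (the $ sign, the unit, and the "Updated:" text) are reinserted during output.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the catalog (an assumption made to keep this short).
SNIPPET = """<ProductCatalog>
  <HEAD>
    <Updated>2001-04-01</Updated>
    <Author>Webmaster Will</Author>
  </HEAD>
  <Product>
    <Prodname>Mega Wonder Widget</Prodname>
    <Capacity unit="oz.">20</Capacity>
    <Price currency="USD">33.99</Price>
  </Product>
</ProductCatalog>"""

# Hypothetical lookup table mapping currency codes to display symbols.
CURRENCY_SYMBOLS = {"USD": "$"}


def render(xml_text):
    """Produce display text, adding the labels that the data itself omits."""
    root = ET.fromstring(xml_text)
    lines = []
    for product in root.iter("Product"):
        name = product.findtext("Prodname")
        capacity = product.find("Capacity")
        price = product.find("Price")
        # Unknown currency codes fall back to no symbol at all.
        symbol = CURRENCY_SYMBOLS.get(price.get("currency"), "")
        lines.append(f"{name}: {capacity.text} {capacity.get('unit')}, {symbol}{price.text}")
    # Header metadata is moved to the end, to be displayed as a footer.
    updated = root.findtext("HEAD/Updated")
    author = root.findtext("HEAD/Author")
    lines.append(f"Updated: {updated} by {author}")
    return "\n".join(lines)


output = render(SNIPPET)
print(output)
```

  Because the unit and currency live in attributes rather than in the text, a different output process could just as easily convert them (to grams or another currency, say) without touching the data file.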

  SGML and Document Markup Languages

  SGML is an acronym for Standard Generalized Markup Language, an older and much more complex markup language than XML. It has been codified as an international standard by the ISO (International Organization for Standardization) as ISO 8879 and WebSGML.

  

  The ISO doesn't put very much of its standards information online, but they do maintain a website at http://www.iso.ch, and offer the paper version of ISO 8879 for sale at http://www.iso.ch/cate/d16387.html. General SGML information and links can be found at http://www.w3.org/MarkUp/SGML and http://xml.coverpages.org. WebSGML (ISO 8879:1986 TC2. Information technology – Document Description and Processing Languages) is described online at http://www.sgmlsource.com/8879rev/n0029.htm.

  SGML has been widely used by the U.S. government and its contractors, large manufacturing companies, and publishers of technical information. Publishers often construct paper documents, such as books, reports, and reference manuals from SGML. Often, these SGML documents are then transformed into a presentation format such as PostScript, and sent to the typesetter and printer for output to paper. Technical specifications for manufacturing can also be exchanged via SGML documents. However, SGML's complexities and the high cost of its implementation have meant that most businesses and individuals have not been able to afford to embrace this powerful technology.

  SGML History

  In 1969, a person walked on the Moon for the first time. In the same year, Ed Mosher, Ray Lorie, and Charles F. Goldfarb of IBM Research invented the first modern markup language, Generalized Markup Language (GML). GML was a self-referential language for marking the structure of an arbitrary set of data, and was intended to be a meta-language: a language that could be used to describe other languages, their grammars and vocabularies. GML later became SGML. In 1986, SGML was adopted as an international data storage and exchange standard by the ISO. When Tim Berners-Lee developed HTML in the early 1990s, he made a point of maintaining HTML as an application of SGML.

  With the major impact of the World Wide Web (WWW) upon commerce and communications, it could be argued that the quiet invention of GML was a more significant event in the history of technology than the high adventure of that first trip to another celestial body. GML led to SGML, the parent of both HTML and XML. The complexity of SGML and lack of content tagging in HTML led to the need for a new markup language for the WWW and beyond – XML.

  Goals of XML

  In 1996, the principal design organization for technologies related to the WWW, the World Wide Web Consortium (W3C), began the process of designing an extensible markup language that would combine the flexibility of SGML and the widespread acceptance of HTML. That language is XML.

  The W3C home page is at http://www.w3.org, and its XML pages begin with an overview at http://www.w3.org/XML. Most technical documents can be found at http://www.w3.org/TR...

  XML version 1.0 was defined in a February 1998 W3C Recommendation, which, like an Internet