976 Beginning XML, 5th Edition

  

BEGINNING XML

INTRODUCTION

⊲ PART I INTRODUCING XML
CHAPTER 1 What Is XML?
CHAPTER 2 Well-Formed XML
CHAPTER 3 XML Namespaces

⊲ PART II VALIDATION
CHAPTER 4 Document Type Definitions
CHAPTER 5 XML Schemas
CHAPTER 6 RELAX NG and Schematron

⊲ PART III PROCESSING
CHAPTER 7 Extracting Data from XML
CHAPTER 8 XSLT

⊲ PART IV DATABASES
CHAPTER 9 XQuery
CHAPTER 10 XML and Databases

⊲ PART V PROGRAMMING
CHAPTER 11 Event-Driven Programming
CHAPTER 12 LINQ to XML

⊲ PART VI COMMUNICATION
CHAPTER 13 RSS, Atom, and Content Syndication
CHAPTER 14 Web Services
CHAPTER 15 SOAP and WSDL
CHAPTER 16 AJAX

⊲ PART VII DISPLAY
CHAPTER 17 XHTML and HTML 5
CHAPTER 18 Scalable Vector Graphics (SVG)

⊲ PART VIII CASE STUDY
CHAPTER 19 Case Study: XML in Publishing

APPENDIX A Answers to Exercises
APPENDIX B XPath Functions
APPENDIX C XML Schema Data Types

INDEX

BEGINNING XML

Joe Fawcett

Liam R.E. Quin

  

Danny Ayers

John Wiley & Sons, Inc.

Beginning XML

Published by
John Wiley & Sons, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256

Copyright © 2012 by Joe Fawcett, Liam R.E. Quin, and Danny Ayers

Published by John Wiley & Sons, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-1-118-16213-2
ISBN: 978-1-118-22612-4 (ebk)
ISBN: 978-1-118-23948-3 (ebk)
ISBN: 978-1-118-26409-6 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material

Library of Congress Control Number: 2012937910

Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Wrox Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this book.

  

I’d like to dedicate this book to my parents, especially to my mother Sheila who, unfortunately, will never be able to read this. I love you both.

Joe Fawcett

Dedicated to Yuri Rubinsky, without whom there would be no XML.

Liam Quin

Dedicated to my mother, Mary (because this will amuse her no end).

Danny Ayers

ABOUT THE AUTHORS

JOE FAWCETT has been writing software, on and off, for forty years. He was one of the first people to be awarded the accolade of Most Valuable Professional in XML by Microsoft. Joe is head of software development for Kaplan Financial UK in London, which specializes in training people in business and accountancy and has one of the leading accountancy e-learning systems in the UK. This is the third title for Wrox that he has written in addition to the previous editions of this book.

LIAM QUIN is in charge of the XML work at the World Wide Web Consortium (W3C). He has been involved with markup languages and text since the early 1980s, and was involved with XML from its inception. He has a background in computer science and digital typography, and also maintains a website dedicated to the love of books and illustrations. He lives on an old farm near Milford, in rural Ontario, Canada.

DANNY AYERS is an independent researcher and developer of Web technologies, primarily those related to linked data. He has been an XML enthusiast since its early days. His background is in electronic music, although this interest has taken a back seat since the inception of the Web. Offline, he’s also an amateur woodcarver. Originally from the UK, he now lives in rural Tuscany with two dogs and two cats.

ABOUT THE TECHNICAL EDITOR

KAREN TEGTMEYER is an independent consultant and software developer with more than 10 years of experience. She has worked in a variety of roles, including design, development, training, and architecture. She also is an Adjunct Computer Science Instructor at Des Moines Area Community College.

CREDITS

EXECUTIVE EDITOR
Carol Long

PROJECT EDITOR
Victoria Swider

TECHNICAL EDITOR
Karen Tegtmeyer

PRODUCTION EDITOR
Kathleen Wisor

COPY EDITOR
Kim Cofer

EDITORIAL MANAGER
Mary Beth Wakefield

FREELANCER EDITORIAL MANAGER
Rosemarie Graham

ASSOCIATE DIRECTOR OF MARKETING
David Mayhew

MARKETING MANAGER
Ashley Zurcher

BUSINESS MANAGER
Amy Knies

PRODUCTION MANAGER
Tim Tate

VICE PRESIDENT AND EXECUTIVE GROUP PUBLISHER
Richard Swadley

VICE PRESIDENT AND EXECUTIVE PUBLISHER
Neil Edde

ASSOCIATE PUBLISHER
Jim Minatel

PROJECT COORDINATOR, COVER
Katie Crocker

PROOFREADERS
James Saturnio, Word One
Sara Eddleman-Clute, Word One

INDEXER
Johnna VanHoose Dinse

COVER DESIGNER
Ryan Sneed

COVER IMAGE
© Marcello Bortolino

ACKNOWLEDGMENTS

I’D LIKE TO HEARTILY ACKNOWLEDGE the help of the editor Victoria Swider and the acquisitions editor Carol Long, who kept the project going when it looked as if it would never get finished. I’d like to thank the authors of the previous edition, especially Jeff Rafter and David Hunter, who let us build on their work when necessary. I’d also like to thank my wife Gillian and my children Persephone and Xavier for putting up with my absences and ill humor over the last year; I’ll make it up to you, I promise.

—Joe Fawcett

THANKS are due to my partner and to the pets for tolerating long and erratic hours, and of course to Alexander Chalmers, for creating the Dictionary of Biography in 1810.

—Liam Quin

MANY THANKS to Victoria, Carol, and the team for making everything work. Thanks too to Joe for providing the momentum behind this project and to Liam for keeping it going.

—Danny Ayers


INTRODUCTION

THIS IS THE FIFTH EDITION OF A BOOK that has proven popular with professional developers and academic institutions. It strives to impart knowledge on a subject that at first was seen by some as just another fad, but that instead has come to maturity and is now often just taken for granted.

Almost six years have passed since the previous edition — a veritable lifetime in IT terms. In reviewing the fourth edition for what should be kept, what should be updated, and what new material was needed, the current authors found that about three-quarters of the material was substantially out of date. XML has far more uses than five years ago, and there is also much more reliance on it under the covers. It is now no longer essential to be able to handcraft esoteric configuration files to get a web service up and running. It has also been found that, in some places, XML is not always the best fit. These situations and others, along with a complete overhaul of the content, form the basis for this newer version.

So, what is XML? XML stands for eXtensible Markup Language, which is a language that can be used to describe data in a meaningful way. Virtually anywhere there is a need to store data, especially where it may need to be consumed by more than one application, XML is a good place to start. It has gained a reputation for being a candidate where interoperability is important, either between two applications in different businesses or simply those within a company. Hundreds of standardized XML formats now exist, known as schemas, which have been agreed on by businesses to represent different types of data, from medical records to financial transactions to GPS coordinates representing a journey.

WHO THIS BOOK IS FOR

This book aims to suit a fairly wide range of readers. Most developers have heard of XML but may have been a bit afraid of it. XML has a habit nowadays of being used behind the scenes, and it’s only when things don’t work as expected, or when developers want to do something a little different, that they start to realize that they must open the hood. To those people we say: fear no longer. It should also suit the developer experienced in other fields who has never had a formal grounding in the subject. Finally, it can be used as a reference when you need to try something out for the first time. Nearly all the technologies in the book have a Try It Out section associated with them that first gets you up and running with a simple example and then explains how to progress from there.

What you don’t need for this book is any knowledge of markup languages in general. This is all covered in the first few chapters. It is expected that most of the readership will have some knowledge of and experience with web programming, but we’ve tried to spread our examples so that knowledge could include using the Microsoft stack, Java, or one of the other open source frameworks, such as PHP or Python. And just in case you are worried about the Beginning part of the title, that’s a Wrox conceit that applies more to the style of the book than to your level of experience. Many of the concepts covered, especially in later chapters, are from the real world and are far from the Hello World genre.

WHAT THIS BOOK COVERS

  This book aims to teach you all you need to know about XML — what it is, how it works, what technologies accompany it, and how you can make it work for you, from simple data transfer to a way to provide multi-channeled content. The book sets out to answer these fundamental questions:

➤ What is XML?
➤ How do you use XML?
➤ How does it work?
➤ What can you use it for?

The basic concepts of XML have remained unchanged since their launch, but the surrounding technologies have changed dramatically. This book gives a basic overview of each technology and how it arose, but the majority of the examples use the latest version available. The examples are also drawn from more than one platform, with Java and .NET sharing most of the stage. XML products have also evolved. For example, at one time there were many free and commercial Extensible Stylesheet Language Transformations (XSLT) processors (XSLT is used to manipulate XML, changing it from one structure to another, and is covered in Chapter 8), but since version 2 appeared the number has reduced considerably as the work needed to develop and maintain the software has risen.

HOW THIS BOOK IS STRUCTURED

  We’ve tried to arrange the subjects covered in this book to lead you along the path of novice to expert in as logical a manner as possible. The sections each cover a different area of expertise. Unless you’re fairly knowledgeable about the basics, we suggest you read the introductory chapters in Part 1, although skimming through may well be enough for the savvier user. The other sections can then be read in order or can be targeted directly if they cover an area that you are particularly interested in. For example, when your boss suddenly tells you that your next release must offer an XQuery add-in, you can head straight to Chapter 9. A brief overview of the book is as follows:

➤ You begin by learning exactly what XML is and why people felt it was needed.
➤ We then take you through how to create XML and what rules need to be followed.
➤ Once you’ve mastered that, you move on to what a valid XML document is and how you can be sure that yours is one of them.
➤ Then you’ll look at how you can manipulate XML documents to extract data and to transform them into other formats.
➤ Next you deal with storing XML in databases — the advantages and disadvantages and how to query them when they’re there.
➤ You then look at other ways to extract data, especially those suitable to dealing with large documents.
➤ We then cover some uses of XML, how to publish data in an XML format, and how to create and consume XML-based web services. We explain how AJAX came about and how it works, alongside some alternatives to XML and when you should consider them.
➤ We follow up with a couple of chapters on how to use XML for web page and image display.
➤ Finally, there’s a case study that ties a lot of the various XML-based technologies together into a real-world example.

We’ve tried to organize the book in a logical fashion, such that you are introduced to the basics and then led through the different technologies associated with XML. These technologies are grouped into six sections covering most of the topics that you’ll encounter with XML, from validation of the original data to processing, storage, and presentation.

Part I: Introduction
This is where most readers should start. The chapters in this part cover the goals of XML and the rules for constructing it. After reading this part you should understand the basic concepts and terminology. If you are already familiar with XML, you can probably just skim these chapters.

Chapter 1: What Is XML? — Chapter 1 covers the history of XML and why it is needed, as well as the basic rules for creating XML documents.

Chapter 2: Well-Formed XML — This chapter goes into more detail about what is and isn’t allowed if a document is to be called XML. It also covers the modern naming system that is used to describe the different constituent parts of an XML document.

Chapter 3: XML Namespaces — Everyone’s favorite, the dreaded topic of namespaces, is explained in a simple-to-understand fashion. After reading this chapter, you’ll be the expert while everyone else is scratching their heads.

Part II: Validation
This part covers different techniques that help you verify that the XML you’ve created, or received, is in the correct format.

Chapter 4: Document Type Definitions — DTDs are the original validation mechanism for XML. This chapter shows how they are used to both constrain the document and to supply additional content.

Chapter 5: XML Schemas — XML Schemas are the more modern way of describing an XML document’s format. This chapter examines how they work and discusses the advantages and disadvantages over DTDs.

Chapter 6: RELAX NG and Schematron — Sometimes neither DTDs nor schemas provide what you need. This chapter discusses two other methods by which you can check if your XML is valid, and also includes examples of mixing more than one validation technique.

Part III: Processing
This section covers retrieving data from an XML document and also transforming one format of XML to another. Included is a thorough grounding in XPath, one of the cornerstones of many XML technologies.

Chapter 7: Extracting Data from XML — This chapter covers the document object model (DOM), one of the earliest ways devised to extract data from XML. It then goes on to describe XPath, one of the cornerstone XML technologies that can be used to pinpoint one or many items of interest.

Chapter 8: XSLT — XSLT is a way to transform XML from one format to another, which is essential if you are receiving documents from external sources and need your own systems to be able to read them. It covers the basics of version 1, the more advanced features of the current version, and shows a little of what’s scheduled in the next release.

Part IV: Databases
For many years there has been a disparity between data held in a database and that stored as XML. This part brings the two together and shows how you can have the best of both worlds.

Chapter 9: XQuery — XQuery is a mechanism designed to query existing documents and create new XML documents. It works especially well with XML data that is stored in databases, and this chapter shows how that’s done.

Chapter 10: XML and Databases — Many database systems now have functionality designed especially for XML. This chapter examines three such products and shows how you can both query and update existing data as well as create new XML, should the need arise.

Part V: Programming
This part looks at two programming techniques for handling XML. Chapter 11 covers dealing with large documents, and Chapter 12 shows how Microsoft’s latest universal data access strategy, LINQ, can be used with XML.

Chapter 11: Event-Driven Programming — This chapter looks at two different ways of handling XML that are especially suited to processing large files. One is based on an open source API and the examples are implemented in Java. The second is a key part of Microsoft’s .NET Framework and shows examples in C#.

Chapter 12: LINQ to XML — This chapter shows Microsoft’s latest way of handling XML, from creation to querying and transformation. It contains a host of examples that use both C# and VB.NET, which, for once, currently has more features than its .NET cousin.

Part VI: Communication
This part has four chapters that deal with using XML as a means of communication. It covers presenting data in a way that many different systems can utilize and then shows how web services can make data available to a variety of different clients. It concludes with a discussion on how complex data can be described in a standard way that’s accessible to all.

  

Chapter 13: RSS, Atom, and Content Syndication — This chapter covers the two main ways in which content, such as news feeds, is presented in a platform-independent fashion. It also covers how the same XML format can be used to present structured data such as customer listings or sales results.

Chapter 14: Web Services — One of the biggest software success stories over the past ten years has been web services. This chapter examines how they work and where XML fits into the picture, which is essential knowledge, should things start to go wrong.

Chapter 15: SOAP and WSDL — This chapter burrows down further into web services and describes two major systems used within them: SOAP, which dictates how services are called, and Web Services Description Language (WSDL), which is used to describe what a web service has to offer.

Chapter 16: AJAX — The final chapter in this section deals with AJAX and how it can help your website provide up-to-the-minute information, yet remain responsive and use less bandwidth. Obviously XML is involved, but the chapter also examines the situations when you’d want to abandon XML and use an alternative technology.

Part VII: Display
This part shows two ways in which XML can help display information in a user-friendly form as well as in a format that can be read by a machine.

Chapter 17: XHTML and HTML 5 — This chapter covers how and where to use XHTML and why it is preferred over traditional HTML. It then goes on to show the newer features of HTML 5 and how it has removed some of these obstacles.

Chapter 18: Scalable Vector Graphics (SVG) — This chapter shows how images can be stored in an XML format and what the advantages are to this method. It then shows how this format can be combined with others, such as HTML, and why you would do this.

Part VIII: Case Study
This part contains a case study that ties in the many uses of XML and shows how they would interact in a real-world example.

Chapter 19: Case Study: XML in Publishing — The case study shows how a fictional publishing house goes from proprietary-based publishing software to an XML-based workflow and what benefits this brings to the business.

Appendices
The three appendices contain reference material and solutions to the end-of-chapter exercises.

Appendix A: Answers to Exercises — This appendix contains solutions and suggestions for the end-of-chapter exercises that have appeared throughout the book.

Appendix B: XPath Functions — This appendix contains information on the majority of XPath functions, their signatures, return values, and examples of how and where you would use them.

Appendix C: XML Schema Data Types — This appendix contains information on the numerous built-in data types defined by XML Schema. It shows how they are related and also how they can be constrained by different facets.

WHAT YOU NEED TO USE THIS BOOK

There’s no need to purchase anything to run the examples in this book; all the examples can be written with and run on freely available software. You’ll need a machine with a standard browser — Internet Explorer, Firefox, Chrome, or Safari should do as long as it’s one of the more recent editions. You’ll need a basic text editor, but even Notepad will do if you want to create the examples rather than just download them from the Wrox site. You’ll also need to run a web server for some of the code; either the free version of IIS for Windows or one of the many open source implementations such as Apache for other systems will do.

For some of the coding examples you’ll need Visual Studio. You can either use a commercial version or the free one available for download from Microsoft. If you want to use the free version, Visual Studio Express 2010, be aware that each edition of Visual Studio concentrates on a specific area such as C# or web development, so to try all the examples you’ll need to download the C# edition, the VB.NET edition, and the Web edition. You should also install Service Pack 1 for Visual Studio 2010. Once everything is installed you’ll be able to open the sample solutions or, failing that, one of the sample projects within the solutions by choosing File ➪ Open ➪ Project/Solution . . . and browsing to either the solution file or the specific project you want to run.

As this book went to press Microsoft was preparing to release a new version, Visual Studio 2011. The examples in this book should all work with this newer version although the screenshots may differ slightly.

  CONVENTIONS

  To help you get the most from the text and keep track of what’s happening, we’ve used a number of conventions throughout the book.

TRY IT OUT
The Try It Out is an exercise you should work through, following the text in the book.

1. They usually consist of a set of steps.

2. Each step has a number.

3. Follow the steps through with your copy of the database.

How It Works
After each Try It Out, the code you’ve typed will be explained in detail.

WARNING Boxes with a warning icon like this one hold important, not-to-be-forgotten information that is directly relevant to the surrounding text.

NOTE The pencil icon indicates notes, tips, hints, tricks, and asides to the current discussion.

As for styles in the text:

➤ We highlight new terms and important words when we introduce them.
➤ We show keyboard strokes like this: Ctrl+A.
➤ We show filenames, URLs, and code within the text like so: persistence.properties.
➤ We present code in two different ways:

We use a monofont type with no highlighting for most code examples.

We use bold to emphasize code that’s particularly important in the present context.

SOURCE CODE

As you work through the examples in this book, you may choose either to type in all the code manually, or to use the source code files that accompany the book. All the source code used in this book is available for download from the Wrox site. When at the site, simply locate the book’s title (use the Search box or one of the title lists) and click the Download Code link on the book’s detail page to obtain all the source code for the book. Code that is included on the website is highlighted by a download icon.

Listings include the filename in the title. If it is just a code snippet, you’ll find the filename in a code note such as this:

filename

NOTE Because many books have similar titles, you may find it easiest to search by ISBN; this book’s ISBN is 978-1-118-16213-2.

Once you download the code, just decompress it with your favorite compression tool. Alternately, you can go to the main Wrox code download page to see the code available for this book and all other Wrox books.

INTRODUCTION ERRATA

  We make every effort to ensure that there are no errors in the text or in the code. However, no one is perfect, and mistakes do occur. If you fi nd an error in one of our books, like a spelling mistake or faulty piece of code, we would be very grateful for your feedback. By sending in errata you may save another reader hours of frustration and at the same time you will be helping us provide even higher quality information. To fi nd the errata page for this book and locate the title using the Search box or one of the title lists. Then, on the book details page, click the Book Errata link. On this page you can view all errata that has been submitted for this book and posted by Wrox editors. A complete book list including links to each book’s errata is also available a misc-pages/booklist.shtml .

  If you don’t spot “your” error on the Book Errata page and complete the form there to send us the error you have found. We’ll check the information and, if appropriate, post a message to the book’s errata page and fi x the problem in subsequent editions of the book.

  For author and peer discussion, join the P2P forums at he forums are a web-based system for you to post messages relating to Wrox books and related technologies and interact with other readers and technology users. The forums offer a subscription feature to e-mail you topics of interest of your choosing when new posts are made to the forums. Wrox authors, editors, other industry experts, and your fellow readers are present on these forums. At you will fi nd a number of different forums that will help you not only as you read this book, but also as you develop your own applications. To join the forums, just follow these steps: 1. Go and click the Register link.

  2. Read the terms of use and click Agree.

  3. Complete the required information to join as well as any optional information you wish to provide and click Submit.

  

4. You will receive an e-mail with information describing how to verify your account and

complete the joining process.

  NOTE You can read messages in the forums without joining P2P but in order to post your own messages, you must join.

  xxxiv

  INTRODUCTION

  Once you join, you can post new messages and respond to messages other users post. You can read messages at any time on the web. If you would like to have new messages from a particular forum e-mailed to you, click the Subscribe to this Forum icon by the forum name in the forum listing. For more information about how to use the Wrox P2P, be sure to read the P2P FAQs for answers to questions about how the forum software works as well as many common questions specifi c to P2P and Wrox books. To read the FAQs, click the FAQ link on any P2P page.


  PART I Introducing XML

  CHAPTER 1: What Is XML?

  

CHAPTER 2: Well-Formed XML

  CHAPTER 3: XML Namespaces

1 What Is XML?

  WHAT YOU WILL LEARN IN THIS CHAPTER:

  ➤ The story before XML

  ➤ How XML arrived

  ➤ The basic format of an XML document

  ➤ Areas where XML is useful

  ➤ A brief introduction to the technologies surrounding, and associated with, XML

  

  XML stands for Extensible Markup Language (presumably the original authors thought that sounded more exciting than EML) and its development and usage have followed a common path in the software and IT world. It started out more than ten years ago and was originally used by very few; later it caught the public eye and began to pervade the world of data exchange. Subsequently, the tools available to process and manage XML became more sophisticated, to such an extent that many people began to use it without being really aware of its existence. Lately there has been a bit of a backlash in certain quarters over its perceived failings and weak points, which has led to various proposed alternatives and improvements.

  Nevertheless, XML now has a permanent place in IT systems and it’s hard to imagine any non-trivial application that doesn’t use XML for either its configuration or data to some degree. For this reason it’s essential that modern software developers have a thorough understanding of its principles, what it is capable of, and how to use it to their best advantage. This book can give the reader all those things.


  NOTE Although this chapter presents some short examples of XML, you aren’t expected to understand all that’s going on just yet. The idea is simply to introduce the important concepts behind the language so that throughout the book you can see not only how to use XML, but also why it works the way it does.

STEPS LEADING UP TO XML: DATA REPRESENTATION AND MARKUPS

  There are two main uses for XML. One is a way to represent low-level data, for example configuration files. The second is a way to add metadata to documents; for example, you may want to stress a particular sentence in a report by putting it in italics or bold.

  The first usage for XML is meant as a replacement for the more traditional ways this has been done before, usually by means of lists of name/value pairs as seen in Windows’ INI or Java’s Properties files. The second application of XML is similar to how HTML files work. The document text is contained in an overall container, the <body> element, with individual phrases surrounded by <i> or <b> tags.

  For both of these scenarios a multiplicity of techniques has been devised over the years. The problem with these disparate approaches has become more apparent than ever with the increased use of the Internet and the spread of distributed applications, particularly those that rely on components designed and managed by different parties. That problem is one of intercommunication. It’s certainly possible to design a distributed system that has two components, one outputting data using a Windows INI file and the other turning it into a Java Properties format. Unfortunately, it means a lot of development on both sides, which shouldn’t really be necessary and diverts resources from the main objective: developing new functionality that delivers business value.
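  To make the name/value-pair idea concrete, here is a minimal Python sketch that reads a small INI-style configuration and re-expresses it as XML. The settings, the function name ini_to_xml, and the <configuration> root element are all invented for illustration; the sketch also assumes every section and key name happens to be a valid XML element name.

```python
import configparser
import xml.etree.ElementTree as ET

# Hypothetical settings in INI form: name/value pairs grouped into sections.
INI_TEXT = """\
[display]
width = 800
height = 600

[user]
name = Joe
"""


def ini_to_xml(ini_text: str) -> str:
    """Return an XML string with one element per INI section and per key."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)

    # The root element name is an arbitrary choice made for this sketch.
    root = ET.Element("configuration")
    for section in parser.sections():
        section_el = ET.SubElement(root, section)
        for key, value in parser.items(section):
            ET.SubElement(section_el, key).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = ini_to_xml(INI_TEXT)
print(xml_text)
```

  Any component that understands XML can now read these settings without knowing anything about the INI conventions they started from.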

  XML was conceived as a solution to this kind of problem; it is meant to make passing data between different components much easier and relieve the need to continually worry about different formats of input and output, freeing up developers to concentrate on the more important aspects of coding such as the business logic. XML is also seen as a solution to the question of whether files should be easily readable by software or by humans; XML’s aim is to be both. You’ll be examining the distinction between data-oriented and document-centric XML later in the book, but for now let’s look a bit more deeply into what the choices were before XML when there was a need to store or communicate data in an electronic format.

  This section takes a mid-level look at data representation, without taking too much time to explain low-level details such as memory addresses and the like. For the purposes here you can store data in files two ways: as binary or as text.

  Binary Files

  A binary file, at its simplest, is just a stream of bits (1s and 0s). It’s up to the application that created the binary file to understand what all of the bits mean. That’s why binary files can only be read and produced by certain computer programs, which have been specifically written to understand them.
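  The following Python sketch illustrates why the reader of a binary file has to know what the bits mean. The record layout (a 16-bit id, a 32-bit float price, and a 4-byte code) is hypothetical and chosen only for this example; the same bytes are then read again under a different, wrong assumption about the layout.

```python
import struct

# Hypothetical record layout shared by writer and reader:
# little-endian unsigned 16-bit id, 32-bit float price, 4-byte code.
RECORD_FORMAT = "<Hf4s"


def write_record(item_id: int, price: float, code: bytes) -> bytes:
    """Pack one record into its 10-byte binary form."""
    return struct.pack(RECORD_FORMAT, item_id, price, code)


def read_record(data: bytes) -> tuple:
    """Unpack a record, assuming the layout in RECORD_FORMAT."""
    return struct.unpack(RECORD_FORMAT, data)


raw = write_record(42, 9.5, b"ABCD")
item_id, price, code = read_record(raw)
print(item_id, price, code)

# A program that assumes five 16-bit integers instead reads the same
# bits without error, but the values it gets back are meaningless.
wrong = struct.unpack("<HHHHH", raw)
print(wrong)
```

  Nothing in the bytes themselves says which interpretation is correct; that knowledge lives only in the programs written for the format.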


  For example, when saving a document in Microsoft Word, using a version before 2003, the file created (which has a .doc extension) is in a binary format. If you open the file in a text editor such as Notepad, you won’t see anything resembling the original Word document; the best you’ll be able to see is the occasional line of text surrounded by gibberish, rather than the prose itself, which could be formatted in a number of ways such as bold or italic. The characters in the document other than the actual text are metadata, literally information about information. Mixing data and metadata is both common and straightforward in a binary file. Metadata can specify things such as which words should be shown in bold, what text is to be displayed in a table, and so on. To interpret this file you need the help of the application that created it. Without a converter that has in-depth knowledge of the underlying binary format, you won’t be able to open a document created in Word with another similar application such as WordPerfect.

  The main advantage of binary formats is that they are concise and can be expressed in a relatively small space. This means that more files can be stored (on a hard drive, for example) but, more importantly nowadays, less bandwidth is used when transporting these files across networks.
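  As a rough illustration of how concise a binary format can be, this Python sketch stores the same numbers twice: once packed as 4-byte integers and once as lines of decimal text. The data set is invented for the example, and the exact ratio depends entirely on the values and encodings chosen.

```python
import struct

# Hypothetical data set: 1,000 seven-digit readings.
readings = list(range(1_000_000, 1_001_000))

# Binary form: every value packed as a 4-byte unsigned integer.
binary_data = struct.pack(f"<{len(readings)}I", *readings)

# Text form: every value written as decimal digits, one per line.
text_data = "\n".join(str(r) for r in readings).encode("ascii")

print(len(binary_data))  # 4000 bytes
print(len(text_data))    # 7999 bytes
```

  Here the text form is about twice the size of the binary form, which is the kind of difference that matters when files are stored in bulk or sent across a network.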

  Text Files

  The main difference between text and binary files is that text files are human and machine readable. Instead of a proprietary format that needs a specific application to decipher it, the data is such that each group of bits represents a character from a known set. This means that many different applications can read text files. On a standard Windows machine you have a choice of Notepad, WordPad, and others, including command-line utilities such as Edit. Non-Windows machines have a similarly wide range of programs available, such as Emacs and Vim.
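  The idea that each group of bits stands for a character from a known set can be sketched in a few lines of Python. The byte values below are written out by hand for illustration and are assumed to use the ASCII character set, where each character occupies one 8-bit group.

```python
# The raw bytes of a tiny text file, written out by hand for illustration.
raw = bytes([88, 77, 76, 32, 105, 115, 32, 116, 101, 120, 116])

# Any program that knows the character set can turn the bytes into text.
text = raw.decode("ascii")
print(text)  # XML is text

# Each character corresponds to one 8-bit group in the file.
for byte, char in zip(raw[:3], text[:3]):
    print(format(byte, "08b"), char)

# Encoding the text again gives back exactly the same bytes.
round_trip = text.encode("ascii")
```

  Because the mapping between bits and characters is published and widely implemented, no single application owns the file.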