Syntax Structure of XML Documents
Chapter 6 • XML and Data Representation
Comments
A comment begins with the characters <!-- and ends with -->. A comment can span multiple lines in the document and contain any data except the literal string “--”. You can place
comments anywhere in your document outside other markup. Here is an example:
<!-- My comment is imminent. -->
Comments are not part of the textual content of an XML document and the parser will ignore them. The parser is not required to pass them along to the application, although it may do so.
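The parser behavior described above can be observed with Python's standard xml.etree.ElementTree module, used here only as an illustration. The note and to element names are made up for this sketch. By default this parser drops comments; passing a TreeBuilder created with insert_comments=True (available in Python 3.8 and later) asks it to pass them along to the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names are invented for this sketch.
DOC = "<note><!-- My comment is imminent. --><to>Ann</to></note>"

# Default behavior: the comment is ignored and never reaches the tree.
root = ET.fromstring(DOC)
default_tags = [child.tag for child in root]

# Opt-in behavior: ask the tree builder to keep comments (Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root_with_comments = ET.fromstring(DOC, parser=parser)
comment_node = root_with_comments[0]

print(default_tags)                    # only the <to> element is listed
print(comment_node.tag is ET.Comment)  # the first child is now the comment
print(repr(comment_node.text))         # the comment's text, spaces included
```

Either way, the comment text never becomes part of the element content, which matches the statement that comments are not part of the textual content of the document.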
Processing Instructions
Processing instructions (PIs) allow documents to contain instructions for applications that will import the document. Like comments, they are not textually part of the XML document, but this
time around the XML processor is required to pass them to an application.
Processing instructions have the form: <?name pidata?>. The name, called the PI target, identifies the PI to the application. For example, you might have <?font start italic?>
and <?font end italic?>, which instruct the application to start italicizing the text and to stop, respectively.
Applications should process only the targets they recognize and ignore all other PIs. Any data that follows the PI target is optional; it is for the application that recognizes the target. The names
used in PIs may be declared as notations in order to formally identify them. Processing instruction names beginning with xml are reserved for XML standardization.
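As a rough sketch of how an application might consume PIs, the following uses Python's standard xml.dom.minidom module, which keeps processing instructions in the parsed tree. The para element, the printer target, and the KNOWN_TARGETS set are assumptions made for this example; only the font target comes from the text above.

```python
from xml.dom import minidom

# Hypothetical document mixing a recognized target (font) with an unknown one.
DOC = (
    "<para>Some <?font start italic?>emphasized<?font end italic?> text."
    "<?printer eject?></para>"
)

KNOWN_TARGETS = {"font"}  # assumption: the targets this application understands


def collect_instructions(xml_text):
    """Return (target, data) pairs for the PIs this application recognizes."""
    dom = minidom.parseString(xml_text)
    found = []
    for node in dom.documentElement.childNodes:
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            if node.target in KNOWN_TARGETS:  # all other PIs are ignored
                found.append((node.target, node.data))
    return found


instructions = collect_instructions(DOC)
print(instructions)
```

The printer instruction is delivered by the parser just like the others, but the application skips it because it does not recognize the target.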
CDATA Sections
In a document, a CDATA section instructs the parser to ignore the reserved markup characters. So, instead of using entities to include reserved characters in the content as in the above example of
&lt;non-element&gt;, we can write:
<![CDATA[ <non-element> ]]>
Between the start of the section, <![CDATA[ and the end of the section, ]]>, all character data are passed verbatim to the application, without interpretation. Elements, entity references,
comments, and processing instructions are all unrecognized and the characters that comprise them are passed literally to the application. The only string that cannot occur in a CDATA section is
“]]>”.
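The equivalence between the two notations can be checked with a short Python sketch based on the standard xml.etree.ElementTree module. The p wrapper element is an assumption of the example.

```python
import xml.etree.ElementTree as ET

# The same character data written with entity references and with CDATA.
with_entities = ET.fromstring("<p>&lt;non-element&gt;</p>")
with_cdata = ET.fromstring("<p><![CDATA[<non-element>]]></p>")

# Both forms deliver identical text to the application.
print(with_entities.text)
print(with_cdata.text)
print(with_entities.text == with_cdata.text)
```

In both cases the application receives the literal characters and no child element named non-element is created.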
Document Type Declarations (DTDs)
Document type declarations (DTDs) are reviewed in Section 6.1.2 below. A DTD is used mainly to define constraints on the logical structure of documents, that is, the valid tags and their
arrangement/ordering.
Ivan Marsic • Rutgers University
This is about as much as an average user needs to know about XML. Obviously, it is simple and concise. XML is designed to handle almost any kind of structured data: it constrains neither the
vocabulary (set of tags) nor the grammar (rules of how the tags combine) of the markup language that the user intends to create. XML allows you to create your own tag names. Another way to
think of it is that XML only defines punctuation symbols and rules for forming “sentences” and “paragraphs,” but it does not prescribe any vocabulary of words to be used. Inventing the
vocabulary is left to the language designer.
For any given application, however, it is probably not meaningful for tags to occur in a completely arbitrary order, even though from a strictly syntactic point of view there is nothing wrong with such an XML
document. So, if the document is to have meaning, and certainly if you are writing a stylesheet or application to process it, there must be some constraint on the sequence and nesting of tags,
stating, for example, that a chapter is a sub-element of a book, and not the other way around. These constraints can be expressed using an XML schema (Section 6.2 below).
XML Document Example
The letter document shown initially in this chapter can be represented in XML as follows:
Listing 6-1: Example XML document of a correspondence letter.
 1  <?xml version="1.0" encoding="UTF-8"?>
 2  <!-- Comment: A personal letter marked up in XML. -->
 3  <letter language="en-US" template="personal">
 4    <sender>
 5      <name>Mr. Charles Morse</name>
 6      <address type="return">
 7        <street>13 Takeoff Lane</street>
 8        <city>Talkeetna</city><state>AK</state>
 9        <postal-code>99676</postal-code>
10      </address>
11    </sender>
12    <date format="English_US">February 29, 1997</date>
13    <recipient>
14      <name>Mrs. Robinson</name>
15      <address type="delivery">
16        <street>1 Entertainment Way</street>
17        <city>Los Angeles</city><state>CA</state>
18        <postal-code>91011</postal-code>
19      </address>
20    </recipient>
21    <salutation style="formal">Dear Mrs. Robinson, </salutation>
22    <body>
23      Here's part of an update ...
24    </body>
25    <closing>Sincerely,</closing>
26    <signature>Charlie</signature>
27  </letter>
Line 1 begins the document with a processing instruction <?xml ... ?>. This is the XML declaration, which, although not required, explicitly identifies the document as an XML document and indicates the version of XML to which it was authored.
A variation on the above example is to define the components of a postal address (lines 7–9 and 16–18) as element attributes:
<address type="return" street="13 Takeoff Lane" city="Talkeetna" state="AK" postal-code="99676" />
Notice that this element has no content, i.e., it is an empty element. This produces a more concise markup, particularly suitable for elements with well-defined, simple, and short content.
One quickly notices that XML encourages naming the elements so that the names describe the nature of the named object, as opposed to describing how it should be displayed or printed. In this
way, the information is self-describing, so it can be located, extracted, and manipulated as desired. This kind of power has previously been reserved for organized scalar information
managed by database systems.
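To make the "located, extracted, and manipulated" claim concrete, here is a small sketch that queries an abridged copy of the letter from Listing 6-1 using Python's standard xml.etree.ElementTree module. The variable names are assumptions of the example, and the XML declaration, date, and body of the letter are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the letter in Listing 6-1.
LETTER = """\
<letter language="en-US" template="personal">
  <sender>
    <name>Mr. Charles Morse</name>
    <address type="return">
      <street>13 Takeoff Lane</street>
      <city>Talkeetna</city><state>AK</state>
      <postal-code>99676</postal-code>
    </address>
  </sender>
  <recipient>
    <name>Mrs. Robinson</name>
    <address type="delivery">
      <street>1 Entertainment Way</street>
      <city>Los Angeles</city><state>CA</state>
      <postal-code>91011</postal-code>
    </address>
  </recipient>
</letter>
"""

letter = ET.fromstring(LETTER)

# Locate information by what it means, not by where it is displayed.
recipient_name = letter.findtext("recipient/name")
delivery_city = letter.findtext("recipient/address[@type='delivery']/city")
postal_codes = [code.text for code in letter.iter("postal-code")]

print(recipient_name)
print(delivery_city)
print(postal_codes)
```

Because the element names describe the nature of the data, the queries read almost like the questions they answer, for example "the city of the delivery address of the recipient."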
You may have also noticed a potential hazard that comes with this freedom—since people may define new XML languages as they please, how can we resolve ambiguities and achieve common
understanding? This is why, although the core XML is very simple, there are many XML-related standards to handle translation and specification of data. The simplest way is to explicitly state
the vocabulary and composition rules of an XML language and enforce those across all the involved parties. Another option, as with natural languages, is to have a translator in between, as
illustrated in Figure 6-1. The former solution employs XML Schemas, introduced in Section 6.2 below, and the latter employs transformation languages, introduced in Section 6.4 below.
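The translator idea can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. This is only an illustration, not the transformation-language approach of Section 6.4: the translate_address function and the ATTR_TO_ELEMENT mapping are hypothetical, and the mapping assumes that one language stores the address as attributes (with a zip attribute) while the other uses child elements (with a postal-code element).

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from attribute names in one language
# to child-element names in the other.
ATTR_TO_ELEMENT = {
    "street": "street",
    "city": "city",
    "state": "state",
    "zip": "postal-code",
}


def translate_address(attribute_style):
    """Convert an attribute-style address into an element-style address."""
    element_style = ET.Element("address")
    address_type = attribute_style.get("type")
    if address_type is not None:
        element_style.set("type", address_type)  # the type stays an attribute
    for attr_name, element_name in ATTR_TO_ELEMENT.items():
        value = attribute_style.get(attr_name)
        if value is not None:
            ET.SubElement(element_style, element_name).text = value
    return element_style


source = ET.fromstring(
    '<address type="return" street="13 Takeoff Lane" '
    'city="Talkeetna" state="AK" zip="99676"/>'
)
translated = translate_address(source)
print(ET.tostring(translated, encoding="unicode"))
```

Note that the translator has to know both vocabularies, including the fact that zip and postal-code name the same concept. That knowledge is exactly what an explicit specification of each language makes available.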
Well-Formedness
A text document is an XML document if it has a proper syntax as per the XML specification. Such a document is called a well-formed document. An XML document is well-formed if it
conforms to the XML syntax rules:
• Begins with the XML declaration <?xml ... ?>
• Has exactly one root element, called the root or document, and no part of it can appear in the content of any other element
• Contains one or more elements delimited by start-tags and end-tags (also remember that XML tags are case sensitive)
• All elements are closed, that is, all start-tags must match end-tags
[Figure 6-1 placeholder: “XML language for letters, variant 1” encodes the address as attributes (type, street, city, state, zip); “XML language for letters, variant 2” encodes it as child elements (street, city, state, postal-code) of an address element with a type attribute; a “Translator” sits between the two.]
Figure 6-1: Different XML languages can be defined for the same domain and/or concepts. In such cases, we need a “translator” to translate between those languages.
• All elements must be properly nested within each other, such as
<outer><inner>inner content</inner></outer>
• All attribute values must be within quotations
• XML entities must be used for special characters. Each of the parsed entities that are referenced directly or indirectly within the document is well-formed.
Even if documents are well-formed they can still contain errors, and those errors can have serious consequences. XML Schemas (introduced in Section 6.2 below) provide a further level of error checking. A well-formed XML document may in addition be valid if it meets constraints
specified by an associated XML Schema.
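Several of the well-formedness rules listed above can be exercised with a small Python sketch that relies on the standard xml.etree.ElementTree parser, which raises ParseError for input that is not well-formed. The is_well_formed helper and the sample strings are assumptions of the example. Note that this particular parser does not insist on an XML declaration, and it checks well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "properly nested": "<outer><inner>inner content</inner></outer>",
    "improperly nested": "<outer><inner>inner content</outer></inner>",
    "case mismatch in tags": "<Outer>text</outer>",
    "unquoted attribute value": "<address type=return/>",
    "two root elements": "<a>one</a><b>two</b>",
    "unescaped special character": "<p>AT&T</p>",
    "escaped special character": "<p>AT&amp;T</p>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

Only the properly nested sample and the sample with the escaped ampersand pass. Each of the others violates one of the rules in the list.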
Document- vs. Data-Centric XML
Generally speaking, there are two broad application areas of XML technologies. The first relates to document-centric applications, and the second to data-centric applications. Because XML can
be used in so many different ways, it is important to understand the difference between these two categories. See more at
http:www.xmleverywhere.comnewsletters20000525.htm Initially, XML’s main application was in semi-structured document representation, such as
technical manuals, legal documents, and product catalogs. The content of these documents is typically meant for human consumption, although it could be processed by any number of
applications before it is presented to humans. The key element of these documents is semi- structured marked-up text. A good example is the correspondence letter in Listing 6-1 above.
By contrast, data-centric XML is used to mark up highly structured information such as the textual representation of relational data from databases, financial transaction information, and
programming language data structures. Data-centric XML is typically generated by machines and is meant for machine consumption. It is XML’s natural ability to nest and repeat markup that
makes it the perfect choice for representing these types of data.
Key characteristics of data-centric XML:
• The ratio of markup to content is high. The XML includes many different types of tags. There is no long-running text.
• The XML includes machine-generated information, such as the submission date of a purchase order using a date-time format of year-month-day. A human authoring an XML document is unlikely to enter a date-time value in this format.
• The tags are organized in a highly structured manner. Order and positioning matter, relative to other tags. For example, TBD
• Markup is used to describe what a piece of information means rather than how it should be presented to a human.
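A minimal sketch of machine-generated, data-centric XML follows, again using Python's standard xml.etree.ElementTree module. The purchase-order vocabulary (purchase-order, submitted, item, sku, quantity) and the sample values are invented for illustration and are not taken from any standard.

```python
import datetime
import xml.etree.ElementTree as ET


def build_purchase_order(order_id, submitted, items):
    """Build a purchase order element; all names here are hypothetical."""
    order = ET.Element("purchase-order", {"id": order_id})
    # The date is written in year-month-day form, as a machine would emit it.
    ET.SubElement(order, "submitted").text = submitted.isoformat()
    for sku, quantity in items:
        ET.SubElement(order, "item", {"sku": sku, "quantity": str(quantity)})
    return order


order = build_purchase_order(
    "PO-1001",
    datetime.date(2024, 5, 17),
    [("A-17", 2), ("B-42", 1)],
)
order_xml = ET.tostring(order, encoding="unicode")
print(order_xml)
```

The output consists almost entirely of markup with no running text, and the repeated item elements show how nesting and repetition map naturally onto records and lists.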
An interesting example of data-centric XML is the XML Metadata Interchange (XMI), which is an OMG standard for exchanging metadata information via XML. The most common use of XMI
is as an interchange format for UML models, although it can also be used for serialization of models of other languages (metamodels). XMI enables easy interchange of metadata between
UML-based modeling tools and MOF (Meta-Object Facility)-based metadata repositories in
distributed heterogeneous environments. For more information see here: http://www.omg.org/technology/documents/formal/xmi.htm