Syntax Structure of XML Documents

Chapter 6 • XML and Data Representation 215

Comments

A comment begins with the characters <!-- and ends with -->. A comment can span multiple lines in the document and contain any data except the literal string "--". You can place comments anywhere in your document outside other markup. Here is an example:

    <!-- My comment is imminent. -->

Comments are not part of the textual content of an XML document, and the parser may ignore them: it is not required to pass them along to the application, although it may do so.

Processing Instructions

Processing instructions (PIs) allow documents to contain instructions for applications that will import the document. Like comments, they are not textually part of the XML document, but unlike comments, the XML processor is required to pass them to the application. Processing instructions have the form <?name pidata?>. The name, called the PI target, identifies the PI to the application. For example, you might have <?font start italic?> and <?font end italic?>, which instruct the application to start and to stop italicizing the text, respectively. Applications should process only the targets they recognize and ignore all other PIs. Any data that follows the PI target is optional; it is for the application that recognizes the target. The names used in PIs may be declared as notations in order to formally identify them. Processing instruction names beginning with "xml" are reserved for XML standardization.

CDATA Sections

In a document, a CDATA section instructs the parser to ignore the reserved markup characters. So, instead of using entities to include reserved characters in the content, as in the above example of &lt;non-element&gt;, we can write:

    <![CDATA[ <non-element> ]]>

Between the start of the section, <![CDATA[, and the end of the section, ]]>, all character data are passed verbatim to the application, without interpretation.
Elements, entity references, comments, and processing instructions are all unrecognized inside a CDATA section, and the characters that comprise them are passed literally to the application. The only string that cannot occur in a CDATA section is "]]>".

Document Type Declarations (DTDs)

Document type declarations (DTDs) are reviewed in Section 6.1.2 below. A DTD is used mainly to define constraints on the logical structure of documents, that is, the valid tags and their arrangement/ordering.

Ivan Marsic • Rutgers University 216

This is about as much as an average user needs to know about XML. The core syntax is simple and concise. XML is designed to handle almost any kind of structured data: it constrains neither the vocabulary (the set of tags) nor the grammar (the rules of how the tags combine) of the markup language that the user intends to create. XML allows you to create your own tag names. Another way to think of it is that XML defines only punctuation symbols and rules for forming "sentences" and "paragraphs," but it does not prescribe any vocabulary of words to be used. Inventing the vocabulary is left to the language designer.

For any given application, however, it is probably not meaningful for tags to occur in a completely arbitrary order, even though from a strictly syntactic point of view there is nothing wrong with such an XML document. So, if the document is to have meaning, and certainly if you are writing a stylesheet or application to process it, there must be some constraint on the sequence and nesting of tags, stating, for example, that a chapter is a sub-element of a book, and not the other way around. These constraints can be expressed using an XML schema (Section 6.2 below).

XML Document Example

The letter document shown initially in this chapter can be represented in XML as follows:

Listing 6-1: Example XML document of a correspondence letter.

     1  <?xml version="1.0" encoding="UTF-8"?>
     2  <!-- Comment: A personal letter marked up in XML. -->
     3  <letter language="en-US" template="personal">
     4    <sender>
     5      <name>Mr. Charles Morse</name>
     6      <address type="return">
     7        <street>13 Takeoff Lane</street>
     8        <city>Talkeetna</city><state>AK</state>
     9        <postal-code>99676</postal-code>
    10      </address>
    11    </sender>
    12    <date format="English_US">February 29, 1997</date>
    13    <recipient>
    14      <name>Mrs. Robinson</name>
    15      <address type="delivery">
    16        <street>1 Entertainment Way</street>
    17        <city>Los Angeles</city><state>CA</state>
    18        <postal-code>91011</postal-code>
    19      </address>
    20    </recipient>
    21    <salutation style="formal">Dear Mrs. Robinson,</salutation>
    22    <body>
    23      Here's part of an update ...
    24    </body>
    25    <closing>Sincerely,</closing>
    26    <signature>Charlie</signature>
    27  </letter>

Line 1 begins the document with the construct <?xml ... ?>, which has the form of a processing instruction. This is the XML declaration, which, although not required, explicitly identifies the document as an XML document and indicates the version of XML to which it was authored.

A variation on the above example is to define the components of a postal address (lines 6–10 and 15–19) as element attributes:

    <address type="return" street="13 Takeoff Lane" city="Talkeetna" state="AK" postal-code="99676"/>

Notice that this element has no content, i.e., it is an empty element. This produces a more concise markup, particularly suitable for elements with well-defined, simple, and short content.

One quickly notices that XML encourages naming the elements so that the names describe the nature of the named object, as opposed to describing how it should be displayed or printed. In this way, the information is self-describing, so it can be located, extracted, and manipulated as desired. This kind of power has previously been reserved for organized scalar information managed by database systems.

You may have also noticed a potential hazard that comes with this freedom: since people may define new XML languages as they please, how can we resolve ambiguities and achieve common understanding? This is why, although the core XML is very simple, there are many XML-related standards to handle translation and specification of data.
The simplest way is to explicitly state the vocabulary and composition rules of an XML language and enforce those across all the involved parties. Another option, as with natural languages, is to have a translator in between, as illustrated in Figure 6-1. The former solution employs XML Schemas, introduced in Section 6.2 below, and the latter employs transformation languages, introduced in Section 6.4 below.

[Figure 6-1: Different XML languages can be defined for the same domain and/or concepts. In such cases, we need a "translator" to translate between those languages. The figure shows a translator between two XML languages for letters. Variant 1 encodes the address in attributes: <address type="return" street="13 Takeoff Lane" city="Talkeetna" state="AK" zip="99676"/>. Variant 2 encodes it in child elements: <address type="return"> <street>13 Takeoff Lane</street> <city>Talkeetna</city> <state>AK</state> <postal-code>99676</postal-code> </address>.]

Well-Formedness

A text document is an XML document if it has a proper syntax as per the XML specification. Such a document is called a well-formed document. An XML document is well-formed if it conforms to the XML syntax rules:

• If the XML declaration <?xml ... ?> is present, it appears at the very beginning of the document (the declaration itself is optional)
• The document has exactly one root element, called the root or document element, and no part of it appears in the content of any other element
• The document contains one or more elements delimited by start-tags and end-tags (remember that XML tags are case sensitive)
• All elements are closed, that is, every start-tag has a matching end-tag
• All elements are properly nested within each other, such as <outer><inner>inner content</inner></outer>
• All attribute values are enclosed in quotation marks
• XML entities are used for special characters such as < and &. Each parsed entity that is referenced directly or indirectly within the document must itself be well-formed.
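As a rough illustration of the rules above, any conforming XML parser rejects text that violates them. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are made up for this illustration and do not come from the chapter. It checks well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, i.e. it obeys the syntax rules."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents, one per rule discussed above.
samples = {
    "properly nested": "<outer><inner>inner content</inner></outer>",
    "no XML declaration": "<letter language='en-US'/>",
    "overlapping tags": "<outer><inner>text</outer></inner>",
    "unquoted attribute value": "<address type=return/>",
    "two root elements": "<a/><b/>",
    "tag case mismatch": "<Letter></letter>",
    "unescaped ampersand": "<body>Tom & Jerry</body>",
}

for label, document in samples.items():
    print(f"{label}: {is_well_formed(document)}")
```

Only the first two samples are accepted. The second one shows that a document without the XML declaration can still be well-formed.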
Even if documents are well-formed, they can still contain errors, and those errors can have serious consequences. XML Schemas (introduced in Section 6.2 below) provide a further level of error checking. A well-formed XML document may in addition be valid if it meets the constraints specified by an associated XML Schema or DTD.

Document- vs. Data-Centric XML

Generally speaking, there are two broad application areas of XML technologies. The first relates to document-centric applications, and the second to data-centric applications. Because XML can be used in so many different ways, it is important to understand the difference between these two categories. See more at http://www.xmleverywhere.com/newsletters/20000525.htm

Initially, XML's main application was in semi-structured document representation, such as technical manuals, legal documents, and product catalogs. The content of these documents is typically meant for human consumption, although it could be processed by any number of applications before it is presented to humans. The key element of these documents is semi-structured marked-up text. A good example is the correspondence letter in Listing 6-1 above.

By contrast, data-centric XML is used to mark up highly structured information such as the textual representation of relational data from databases, financial transaction information, and programming language data structures. Data-centric XML is typically generated by machines and is meant for machine consumption. XML's natural ability to nest and repeat markup makes it well suited for representing these types of data.

Key characteristics of data-centric XML:

• The ratio of markup to content is high. The XML includes many different types of tags. There is no long-running text.
• The XML includes machine-generated information, such as the submission date of a purchase order using a date-time format of year-month-day. A human authoring an XML document is unlikely to enter a date-time value in this format.
• The tags are organized in a highly structured manner. Order and positioning matter, relative to other tags. For example, TBD
• Markup is used to describe what a piece of information means rather than how it should be presented to a human.

An interesting example of data-centric XML is the XML Metadata Interchange (XMI), which is an OMG standard for exchanging metadata information via XML. The most common use of XMI is as an interchange format for UML models, although it can also be used for serialization of models of other languages (metamodels). XMI enables easy interchange of metadata between UML-based modeling tools and MOF (Meta-Object Facility)-based metadata repositories in distributed heterogeneous environments. For more information see http://www.omg.org/technology/documents/formal/xmi.htm.
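To make the data-centric characteristics more concrete, here is a minimal sketch of a program serializing one of its data structures to XML, using Python's standard xml.etree.ElementTree module. The purchase-order vocabulary (purchase-order, submitted, items, item, quantity, unit-price) and the function name order_to_xml are invented for this illustration; they are not part of any standard or of this chapter.

```python
import datetime
import xml.etree.ElementTree as ET


def order_to_xml(order: dict) -> ET.Element:
    """Serialize a purchase-order dictionary into a data-centric XML element."""
    root = ET.Element("purchase-order", {"id": order["id"]})
    # Machine-generated submission date in year-month-day form.
    ET.SubElement(root, "submitted").text = order["submitted"].isoformat()
    items = ET.SubElement(root, "items")
    for line in order["items"]:
        item = ET.SubElement(items, "item", {"sku": line["sku"]})
        ET.SubElement(item, "quantity").text = str(line["quantity"])
        ET.SubElement(item, "unit-price").text = f"{line['unit_price']:.2f}"
    return root


# Hypothetical order record, as it might live in a program's memory.
order = {
    "id": "PO-1001",
    "submitted": datetime.date(2024, 5, 17),
    "items": [
        {"sku": "A-17", "quantity": 2, "unit_price": 9.5},
        {"sku": "B-42", "quantity": 1, "unit_price": 120.0},
    ],
}

root = order_to_xml(order)
print(ET.tostring(root, encoding="unicode"))
```

The result has a high ratio of markup to content, contains no running text, and names each value by what it means rather than by how it should be displayed.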

6.1.2 Document Type Definition (DTD)

Document Type Definition (DTD) is a schema language for XML inherited from SGML; it was used initially, before XML Schema was developed. A DTD is one of the ways to define the structure of XML documents, i.e., the document's metadata. There are four kinds of declarations in XML: (1) element type declarations; (2) attribute list declarations; (3) entity declarations; and (4) notation declarations.

Element Type Declarations

Element type declarations identify the names of elements and the nature of their content, thus putting a type constraint on the element. Typical element type declarations look like this:

    <!ELEMENT chapter (title, paragraph+, figure?)>
    <!ELEMENT title (#PCDATA)>

Each declaration consists of the declaration type (<!ELEMENT), the element name, and the element's content model (the definition of allowed content).

The first declaration identifies the element named chapter. Its content model follows the element name. The content model defines what an element may contain. In this case, a chapter must contain a title and one or more paragraphs, and may contain a figure. The commas between element names indicate that they must occur in succession. The plus after paragraph indicates that it may be repeated more than once but must occur at least once. The question mark after figure indicates that it is optional (it may be absent). A name with no punctuation, such as title, must occur exactly once. The following table summarizes the meaning of the symbol after an element:

    Kleene symbol   Meaning
    none            The element must occur exactly once
    ?               The element is optional
    *               The element can be skipped or included one or more times
    +               The element must be included one or more times

Declarations for paragraph, title, figure, and all other elements used in any content model must also be present for an XML processor to check the validity of a document.

In addition to element names, the special symbol #PCDATA is reserved to indicate character data. PCDATA stands for parsed character data. Elements that contain only other elements are said to have element content.
Elements that contain both other elements and #PCDATA are said to have mixed content. For example, the definition for paragraphs might be

    <!ELEMENT paragraph (#PCDATA | quote)*>

The vertical bar indicates an "or" relationship, and the asterisk indicates that the content is optional (may occur zero or more times); therefore, by this definition, paragraphs may contain zero or more characters and quote tags, mixed in any order. All mixed content models must have this form: #PCDATA must come first, all of the elements must be separated by vertical bars, and the entire group must be optional.

Two other content models are possible: EMPTY indicates that the element has no content (in XML such an element is typically written as a single empty-element tag of the form <name/>, so no separate end-tag appears), and ANY indicates that any content is allowed. The ANY content model is sometimes useful during document conversion, but should be avoided at almost any cost in a production environment because it disables all content checking in that element.

Attribute List Declarations

Elements that have one or more attributes are specified in the DTD using attribute list declarations. An example for a figure element could look like this:

    <!ATTLIST figure
        caption CDATA #REQUIRED
        scaling CDATA #FIXED "100">

The declaration consists of the declaration type (<!ATTLIST), the name of the associated element, and then, repeated for each attribute of the element, the attribute name, its data type, and a keyword or default value.

CDATA, as before, stands for character data, and #REQUIRED means that the caption attribute of figure has to be present. Another keyword is #FIXED followed by a value, which means that the attribute acts like a constant. Yet another keyword is #IMPLIED, which indicates an optional attribute. Besides CDATA, other attribute types include ID and enumerated types, such as

    <!ATTLIST person sibling (brother | sister) #REQUIRED>

Enumerated attributes can take one of a list of values provided in the declaration.
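The meaning of the keywords in the ATTLIST declarations for figure and person can be sketched in a few lines of code. Python's standard-library XML parsers do not validate documents against a DTD, so the sketch below mirrors the two declarations by hand in a table; the names RULES and attribute_errors, and the table layout, are assumptions made for this illustration. A real validating parser does more, for example it supplies the fixed or default value when the attribute is absent.

```python
import xml.etree.ElementTree as ET

# Hand-written mirror of the two ATTLIST declarations discussed above.
# Each rule: (element, attribute, allowed values or None for CDATA,
#             keyword, fixed value or None)
RULES = [
    ("figure", "caption", None, "#REQUIRED", None),
    ("figure", "scaling", None, "#FIXED", "100"),
    ("person", "sibling", ("brother", "sister"), "#REQUIRED", None),
]


def attribute_errors(elem: ET.Element) -> list:
    """Return a list of messages describing how `elem` violates RULES."""
    errors = []
    declared = set()
    for tag, attr, allowed, keyword, fixed in RULES:
        if tag != elem.tag:
            continue
        declared.add(attr)
        value = elem.get(attr)
        if value is None:
            # Only #REQUIRED attributes must be written out in the document.
            if keyword == "#REQUIRED":
                errors.append(f"{tag}: required attribute '{attr}' is missing")
            continue
        if keyword == "#FIXED" and value != fixed:
            errors.append(f"{tag}: attribute '{attr}' must equal '{fixed}'")
        if allowed is not None and value not in allowed:
            errors.append(f"{tag}: '{value}' is not allowed for '{attr}'")
    for attr in elem.attrib:
        if attr not in declared:
            errors.append(f"{elem.tag}: attribute '{attr}' is not declared")
    return errors


ok_figure = ET.fromstring('<figure caption="Overview" scaling="100"/>')
bad_figure = ET.fromstring('<figure scaling="50"/>')
bad_person = ET.fromstring('<person sibling="cousin"/>')

for element in (ok_figure, bad_figure, bad_person):
    print(element.tag, attribute_errors(element))
```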
Entity Declarations

As stated above, entities are used as substitutes for reserved characters, but also to refer to often repeated or varying text and to include the content of external files. An entity is defined by its name and an associated value. An internal entity is one for which the parsed content (replacement text) lies inside the document, like so:

    <!ENTITY substitute "This text is often repeated.">

The declaration consists of the declaration type (<!ENTITY), the entity name, and the entity value. The value can be any literal; single or double quotes can be used, but they must be properly matched.

Conversely, the replacement text of an external entity resides in a file separate from the XML document. The content can be accessed using either a system identifier, which is a URI (Uniform Resource Identifier, see Appendix C) address, or a public identifier, which serves as a basis for generating a URI address. Examples are:

    <!ENTITY alternate SYSTEM "http://any.website.net/book/text.xml">
    <!ENTITY surrogate PUBLIC "-//home/mrbrown/text">

Here the declaration type and the entity name are followed by the keyword SYSTEM or PUBLIC and then by the external ID (a URI or other identifier). Note that XML requires a public identifier in an entity declaration to be followed by a system identifier; it is not shown in the second example.

Notation Declarations

Notations are used to associate actions with entities. For example, a PDF file format can be associated with the Acrobat application program. Notations identify, by name, the format of these actions. Notation declarations are used to provide an identifying name for the notation. They are used in entity or attribute list declarations and in attribute specifications. This is a complex and controversial feature of DTD and the interested reader should seek details elsewhere.

DTD Example

The following fragment of DTD code defines the production rules for constructing book documents.

Listing 6-2: Example DTD

     1  <?xml version="1.0" encoding="UTF-8"?>
     2  <!DOCTYPE mydoc [
     3    <!ENTITY % first SYSTEM "first.dtd">
     4    <!ENTITY % second SYSTEM "second.dtd">
     5    <!ENTITY % third SYSTEM "third.dtd"> %first;%second;%third;
     6  ]>
     7  <!ELEMENT book (title,author,chapter+)>
     8  <!ATTLIST book isbn CDATA #IMPLIED>
     9  <!ELEMENT model (datamodel,data_options)>
    10  <!ELEMENT datamodel EMPTY>
        . . .

In the above DTD fragment, Lines 3–5 list three extra DTD documents that will be imported into this one at the time the current document is parsed. Line 7 shows an element declaration: the element book consists of a title, an author, and at least one chapter, in that order. Line 8 says that book has an optional attribute, isbn, of the type character data.

Limitations of DTDs

DTDs provided the first schema language for XML documents. Their limitations include:

• Language inconsistency, since DTD uses a non-XML syntax
• Failure to support namespace integration
• Lack of modular vocabulary design
• Rigid content models (new type definitions cannot be derived from the old ones)
• Lack of integration with data-oriented applications

Conversely, XML Schema allows much more expressive and precise specification of the content of XML documents. This flexibility also carries the price of complexity. W3C is making efforts to phase DTDs out. XML Schema is described in Section 6.2 below.
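One way to read an element content model such as the one for book on line 7 of Listing 6-2 is as a regular expression over the sequence of child element names, which is why the occurrence symbols are called Kleene symbols. The following is a minimal sketch of that reading, assuming Python's standard re and xml.etree.ElementTree modules; the hand-translated pattern BOOK_MODEL and the helper matches_model are illustrative names, and this is not how a validating parser is actually implemented.

```python
import re
import xml.etree.ElementTree as ET

# The content model (title,author,chapter+) rewritten by hand as a regular
# expression over child element names, each name followed by one space.
BOOK_MODEL = re.compile(r"title author (chapter )+")


def matches_model(elem: ET.Element, model: re.Pattern) -> bool:
    """Check the ordered child element names of `elem` against `model`."""
    child_names = "".join(child.tag + " " for child in elem)
    return model.fullmatch(child_names) is not None


good = ET.fromstring("<book><title/><author/><chapter/><chapter/></book>")
no_chapter = ET.fromstring("<book><title/><author/></book>")
wrong_order = ET.fromstring("<book><author/><title/><chapter/></book>")

for label, doc in [("good", good), ("no chapter", no_chapter),
                   ("wrong order", wrong_order)]:
    print(f"{label}: {matches_model(doc, BOOK_MODEL)}")
```

Only the first document satisfies the model: the second lacks the chapter that the plus sign requires, and the third breaks the ordering imposed by the commas.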

6.1.3 Namespaces

Inventing new languages is an arduous task, so it is beneficial to reuse parts of an existing XML language defined by a schema. Also, there are many occasions when an XML document needs to use markup defined in multiple schemas, which may have been developed independently. As a result, some tag names may not be unique. For example, the word "title" is used to signify the name of a book or work of art, a form of nomenclature indicating a person's status, the right to ownership of property, etc. People easily