XML Parsers Structure of XML Documents

Chapter 6 • XML and Data Representation

Object-Model Oriented Paradigm: DOM

DOM (Document Object Model)

Practical Issues

Additional features relevant for both event-oriented and object-model oriented parsers include:
• Validation against a DTD
• Validation against an XML Schema
• Namespace awareness, i.e., the ability to determine the namespace URI of an element or attribute

These features affect the performance and memory footprint of a parser, so some parsers do not support all of them. Check the documentation of the particular parser for its list of supported features.
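The parser features listed above can be probed programmatically. The following is a minimal sketch in Python, chosen here only for illustration; it assumes the standard-library `xml.sax` (event-oriented) and `xml.dom.minidom` (object-model oriented) parsers. The helper names `supports`, `sax_names`, and `dom_root_name` are hypothetical, and the sample document borrows the letter namespace used later in this chapter.

```python
import io
import xml.sax
from xml.dom import minidom
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException
from xml.sax.handler import ContentHandler, feature_namespaces, feature_validation

# Illustrative document; the namespace URI is the one used for letters in this chapter.
DOC = b'<lt:letter xmlns:lt="http://any.website.net/letter"><lt:sender/></lt:letter>'


def supports(feature):
    """Return True if the default SAX parser accepts switching `feature` on."""
    parser = xml.sax.make_parser()
    try:
        parser.setFeature(feature, True)
    except (SAXNotSupportedException, SAXNotRecognizedException):
        return False
    return True


class NamespaceCollector(ContentHandler):
    """Event-oriented handler that records the name of every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, `name` is a (namespace URI, local name) pair.
        self.seen.append(name)


def sax_names(data):
    """Parse `data` with a namespace-aware SAX parser; return the recorded names."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = NamespaceCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.seen


def dom_root_name(data):
    """Object-model view: namespace URI and local name of the root element."""
    root = minidom.parseString(data).documentElement
    return (root.namespaceURI, root.localName)


if __name__ == "__main__":
    print("namespace awareness:", supports(feature_namespaces))
    print("validation:", supports(feature_validation))
    print(sax_names(DOC))
    print(dom_root_name(DOC))
```

Asking the parser, rather than assuming, mirrors the advice above: whether validation can be switched on depends on which parser the installation provides.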

6.2 XML Schemas

Although there is no universal definition of schema, scholars generally agree that schemas are abstractions or generalizations of our perceptions of the world around us, which is molded by our experience. Functionally, schemas are knowledge structures that serve as heuristics and help us evaluate new information. An integral part of a schema is our expectations of people, places, and things. Schemas provide a mechanism for describing the logical structure of information, in the sense of what elements can or should be present and how they can be arranged. Deviant news violates these expectations, resulting in schema incongruence.

In XML, schemas are used to make a class of documents adhere to a particular interface and thus allow the XML documents to be created in a uniform way. Stated another way, schemas allow a document to communicate meta-information to the parser about its content, or its grammar. Meta-information includes the allowed sequence and arrangement/nesting of tags, attribute values and their types and defaults, the names of external files that may be referenced and whether or not they contain XML, the formats of some external (non-XML) data that may be referenced, and the entities that may be encountered. Therefore, a schema defines the document production rules. XML documents conforming to a particular schema are said to be valid documents. Notice that having a schema associated with a given XML document is optional. If there is a schema for a given document, it must appear before the first element in the document.

Here is a simple example to motivate the need for schemas. In Section 6.1.1 above I introduced an XML representation of a correspondence letter and used the tags <letter>, <sender>, <name>, <address>, <street>, <city>, etc., to mark up the elements of a letter. What if somebody used the same vocabulary in a somewhat different manner, such as the following?

Listing 6-5: Variation on the XML example document from Listing 6-1.
Ivan Marsic • Rutgers University

    <?xml version="1.0" encoding="UTF-8"?>
    <letter>
      <sender>Mr. Charles Morse</sender>
      <street>13 Takeoff Lane</street>
      <city>Talkeetna, AK 99676</city>
      <date>29.02.1997</date>
      <recipient>Mrs. Robinson</recipient>
      <street>1 Entertainment Way</street>
      <city>Los Angeles, CA 91011</city>
      <body>
    Dear Mrs. Robinson,

    Here's part of an update ...

    Sincerely,
      </body>
      <signature>Charlie</signature>
    </letter>

We can quickly figure out that this document is a letter, although it appears to follow different rules of production than the example in Listing 6-1 above. If asked whether Listing 6-5 represents a valid letter, you would likely respond: “It probably does.” However, to support automatic validation of a document by a machine, we must precisely specify and enforce the rules and constraints of composition. Machines are not good at handling ambiguity, and this is what schemas are about. The purpose of a schema in markup languages is to:
• Allow machine validation of document structure
• Establish a contract between multiple parties who exchange XML documents about how those documents will be structured

There are many other schemas that are used regularly in our daily activities. Another example schema was encountered in Section 2.2.2: the schema for representing the use cases of a system under discussion, Figure 2-1.
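To see why a machine needs explicit rules, consider the following minimal sketch in Python, which checks Listing 6-5 against one hand-written structural rule. It is a stand-in for a schema, not XML Schema validation. The rule itself is an assumption: it supposes, as the tag list from Section 6.1.1 suggests, that <sender> is expected to wrap a <name> and an <address>. The names `REQUIRED_CHILDREN` and `structure_problems` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Listing 6-5, with the XML declaration omitted for brevity.
LISTING_6_5 = """<letter>
  <sender>Mr. Charles Morse</sender>
  <street>13 Takeoff Lane</street>
  <city>Talkeetna, AK 99676</city>
  <date>29.02.1997</date>
  <recipient>Mrs. Robinson</recipient>
  <street>1 Entertainment Way</street>
  <city>Los Angeles, CA 91011</city>
  <body>
Dear Mrs. Robinson,

Here's part of an update ...

Sincerely,
  </body>
  <signature>Charlie</signature>
</letter>"""

# ASSUMPTION: a hand-written rule standing in for a real schema.
# It maps a parent tag to the child tags it is expected to contain.
REQUIRED_CHILDREN = {"sender": ["name", "address"]}


def structure_problems(xml_text):
    """Return a list of human-readable rule violations (empty if none)."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    problems = []
    if root.tag != "letter":
        problems.append(f"root is <{root.tag}>, expected <letter>")
    for parent_tag, required in REQUIRED_CHILDREN.items():
        for parent in root.iter(parent_tag):
            present = [child.tag for child in parent]
            for tag in required:
                if tag not in present:
                    problems.append(f"<{parent_tag}> lacks <{tag}>")
    return problems


if __name__ == "__main__":
    for problem in structure_problems(LISTING_6_5):
        print(problem)
```

The document parses without error, so it is well-formed, yet it breaks the assumed rule. Writing such checks by hand for every rule does not scale, which is the gap that a declarative schema language fills.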

6.2.1 XML Schema Basics

XML Schema provides the vocabulary to state the rules of document production. It is an XML language for which the vocabulary is defined using itself. That is, the elements and datatypes that are used to construct schemas, such as schema, element, sequence, string, etc., come from the http://www.w3.org/2001/XMLSchema namespace (see Figure 6-4). The XML Schema namespace is also called the “schema of schemas,” for it defines the elements and attributes used for defining new schemas.

Figure 6-4: Using XML Schema. Step 1: use the Schema vocabulary to define a new XML language (Listing 6-6). Step 2: use both to produce valid XML documents (Listing 6-7).
The figure shows three parts:
• The http://www.w3.org/2001/XMLSchema namespace (schema, element, complexType, sequence, string, boolean). This is the vocabulary that XML Schema provides to define your new vocabulary.
• The http://any.website.net/letter namespace (letter, sender, recipient, name, address, street, city, salutation).
• An instance document that conforms to the “letter” schema:

    <?xml version="1.0" encoding="UTF-8"?>
    <lt:letter xmlns:lt="http://any.website.net/letter"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://any.website.net/letter
            http://any.website.net/letter/letter.xsd"
        lt:language="English_US" lt:template="personal">
      <lt:sender> ...
    </lt:letter>

The first step involves defining a new language (see Figure 6-4). The following is an example schema for correspondence letters, an example of which is given in Listing 6-1 above.

Listing 6-6: XML Schema for correspondence letters (see an instance in Listing 6-1).

    <?xml version="1.0" encoding="UTF-8"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://any.website.net/letter"
        xmlns="http://any.website.net/letter"
        elementFormDefault="qualified">
      <xsd:element name="letter">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="sender" type="personAddressType"
                minOccurs="1" maxOccurs="1"/>
            <xsd:element name="date" type="xsd:date" minOccurs="0"/>