XML Schema Basics

Chapter 6 • XML and Data Representation 229

Line 2a: Declares the target namespace as http://any.website.net/letter; the elements defined by this schema are to go in the target namespace.

Line 2b: The default namespace is set to http://any.website.net/letter, the same as the target namespace, so the elements of this namespace do not need the namespace qualifier/prefix within this schema document.

Line 2c: This directive tells instance documents that conform to this schema that any elements they use which were declared in this schema must be namespace qualified. If elementFormDefault is not specified, its default value is unqualified. The corresponding directive for qualifying attributes is attributeFormDefault, which can take the same values.

Lines 3–17: Define the root element letter as a compound datatype (xsd:complexType) comprising several other elements. Some of these elements, such as salutation and body, contain the simple, predefined datatype xsd:string. Others, such as sender and recipient, contain the compound type personAddressType, which is defined below in this schema document (lines 18–23). This complex type is also a sequence, which means that all the named elements must appear in the sequence listed.

The letter element is defined as an anonymous type since it is defined directly within the element definition, without specifying the attribute "name" of the <xsd:complexType> start tag (line 4). This is called an inlined element declaration. Conversely, the compound type personAddressType, defined as an independent entity in line 18, is a named type, so it can be reused by other elements (see lines 6 and 8).

[Figure 6-5: Document structure defined by the correspondence letters schema (see Listing 6-6). The diagram shows the lt:letter element (anonymous type) with its sub-elements and attributes, the lt:personAddressType type, and the lt:address element (anonymous type). NOTE: The symbolic notation is inspired by the one used in [McGovern et al., 2003].]

Legend of Figure 6-5, Kleene operators:

  Indicator       Meaning                 Occurrences           Equivalent attributes
  (no indicator)  Required                One and only one
  ?               Optional                None or one           minOccurs = 0, maxOccurs = 1
  ∗               Optional, repeatable    None, one, or more    minOccurs = 0, maxOccurs = ∞
  +               Required, repeatable    One or more           minOccurs = 1, maxOccurs = ∞
  Unique          Element values must be unique

Legend of Figure 6-5, XML Schema symbols: choice; sequence; all; element reference; element immediately within schema, i.e., global; element not immediately within schema, i.e., local; element has sub-elements (not shown); element has sub-elements (shown); attribute of an element; group of elements; attributeGroup.

Ivan Marsic • Rutgers University 230

[Diagram: instance documents conform to a schema document.]
Line 6a: The multiplicity attributes minOccurs and maxOccurs constrain the number of occurrences of the element. The default value of these attributes is 1, so line 6a is redundant and it is omitted for the remaining elements (but see lines 7 and 27a). In general, an element is required to appear in an instance document (defined below) when the value of minOccurs is 1 or more.

Line 7: Element date is of the predefined type xsd:date. Notice that the value of minOccurs is set to 0, which indicates that this element is optional.

Lines 14–15: Define two attributes of the element letter, that is, language and template. The language attribute is of the built-in type xsd:language (Section 6.2.3 below).

Lines 18–23: Define our own personAddressType type as a compound type comprising a person's name and postal address (as opposed to a business-address-type). Notice that the postal address element is referred to in line 21 (attribute ref) and it is defined elsewhere in the same document. The personAddressType type is used as the type of the sender and recipient elements in lines 6 and 8, respectively.

Lines 24–33: Define the postal address element, referred to in line 21. Of course, this could have been defined directly within the personAddressType datatype, as an anonymous sub-element, in which case it would not be reusable. (Although the element is not reused in this schema, I anticipate that an external schema may wish to reuse it; see Section 6.2.4 below.)

Line 27a: The multiplicity attribute maxOccurs is set to "unbounded" to indicate that the street address is allowed to extend over several lines.

Notice that Lines 2a and 2b above accomplish two different tasks. One is to declare the namespace URI that the letter schema will be associated with (Line 2a). The other task is to define the prefix for the target namespace that will be used in this document (Line 2b). The reader may wonder whether this could have been done in one line.
But, in the spirit of the modularity principle, it is always better to assign different responsibilities (tasks) to different entities (in this case, different lines).

The second step is to use the newly defined schema for production of valid instance documents (see Figure 6-4). An instance document is an XML document that conforms to a particular schema. To reference the above schema in letter documents, we do as follows:

Listing 6-7: Referencing a schema in an XML instance document (compare to Listing 6-1)

  1   <?xml version="1.0" encoding="UTF-8"?>
  2   <!-- Comment: A personal letter marked up in XML. -->
  3   <lt:letter xmlns:lt="http://any.website.net/letter"
  3a      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  3b      xsi:schemaLocation="http://any.website.net/letter
  3c                          http://any.website.net/letter/letter.xsd"
  3d      lt:language="en-US" lt:template="personal">
  4     <lt:sender>
        ... <!-- similar to Listing 6-1 -->
  10    </lt:sender>
        ... <!-- similar to Listing 6-1 -->
  25  </lt:letter>

The above listing is said to be valid (unlike Listing 6-1, for which we generally only know that it is well-formed). The two documents (Listings 6-1 and 6-7) are the same, except for referencing the letter schema as follows:

Step 1 (line 3): Tell a schema-aware XML processor that all of the elements used in this instance document come from the http://any.website.net/letter namespace. All the element and attribute names will be prefaced with the lt: prefix. (Notice that we could also use a default namespace declaration and avoid the prefix.)

Step 2 (line 3a): Declare another namespace, the XMLSchema-instance namespace, which contains a number of attributes (such as schemaLocation, to be used next) that are part of a schema specification. These attributes can be applied to elements in instance documents to provide additional information to a schema-aware XML processor. Again, a usual convention is to use the namespace prefix xsi: for XMLSchema-instance.
Step 3 (lines 3b–3c): With the xsi:schemaLocation attribute, tell the schema-aware XML processor to establish the binding between the current XML document and its schema. The attribute contains a pair of values. The first value is the namespace identifier whose schema's location is identified by the second value. In our case the namespace identifier is http://any.website.net/letter and the location of the schema document is http://any.website.net/letter/letter.xsd. (In this case, it would suffice to have only letter.xsd as the second value, since the schema document's URL overlaps with the namespace identifier.) Typically, the second value will be a URL, but specialized applications can use other types of values, such as an identifier in a schema repository or a well-known schema name. If the document used more than one namespace, the xsi:schemaLocation attribute would contain multiple pairs of values, all within a single pair of quotation marks.

Notice that the schemaLocation attribute is merely a hint. If the parser already knows about the schema types in that namespace, or has some other means of finding them, it does not have to go to the location you gave it.

XML Schema defines two aspects of an XML document structure:

1. Content model validity, which tests whether the arrangement and embedding of tags is correct. For example, a postal address tag must have nested within it the street, city, and postal-code tags. A country tag is optional.

2. Datatype validity, which is the ability to test whether specific units of information are of the correct type and fall within the specified legal values. For example, a postal code is a five-digit number.

Data types are the classes of data values, such as string, integer, or date. Values are instances of types. There are two types of data:

1. Simple types are elements that contain data but not attributes or sub-elements. Examples of simple data values are integer or string, which do not have parts.
New simple types are defined by deriving them from existing simple types (built-in and derived).

2. Compound types are elements that allow sub-elements and/or attributes. An example is the personAddressType type defined in Listing 6-6. Complex types are defined by listing the elements and/or attributes nested within them.
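
As a minimal sketch of the two kinds of types, the schema fragment below derives a new simple type from the built-in xsd:string and then uses it inside a compound type. The type names postalCodeType and mailingAddressType and the attribute kind are hypothetical, introduced only for illustration; they are not taken from Listing 6-6. The five-digit pattern and the optional country element follow the examples mentioned above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://any.website.net/letter"
            xmlns="http://any.website.net/letter"
            elementFormDefault="qualified">

  <!-- Simple type (hypothetical name): derived by restriction from the
       built-in xsd:string; legal values consist of exactly five digits. -->
  <xsd:simpleType name="postalCodeType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\d{5}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Compound type (hypothetical name): defined by listing the nested
       elements and attributes. -->
  <xsd:complexType name="mailingAddressType">
    <xsd:sequence>
      <!-- street may repeat, so an address can span several lines -->
      <xsd:element name="street" type="xsd:string" maxOccurs="unbounded"/>
      <xsd:element name="city" type="xsd:string"/>
      <!-- datatype validity: the value must match postalCodeType -->
      <xsd:element name="postal-code" type="postalCodeType"/>
      <!-- minOccurs="0" makes the element optional -->
      <xsd:element name="country" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
    <!-- hypothetical attribute, shown only because compound types
         may also carry attributes -->
    <xsd:attribute name="kind" type="xsd:string"/>
  </xsd:complexType>

</xsd:schema>
```

Under these assumptions, a postal-code element with the content 08854 would pass datatype validation, whereas the content 8854 or 0885A would be rejected. Omitting the city element would violate the content model rather than a datatype.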

6.2.2 Models for Structured Content

As noted above, a schema defines the content model of XML documents, that is, the legal building blocks of an XML document. A content model indicates what a particular element can contain. An element can contain text, other elements, a mixture of text and elements, or nothing at all. The content model defines:

• elements that can appear in a document
• attributes that can appear in a document
• which elements are child elements
• the order of child elements
• the multiplicity of child elements
• whether an element is empty or can include text
• data types for elements and attributes
• default and fixed values for elements and attributes

This section reviews the schema tools for specifying syntactic and structural constraints on document content. The next section reviews datatypes of elements and attributes, and their value constraints.

XML Schema Elements

XML Schema defines a vocabulary of its own, which is used to define other schemas. Here I provide only a brief overview of XML Schema elements that commonly appear in schema documents. The reader should look for the complete list here: http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html.

The schema element defines the root element of every XML Schema.

Syntax of the schema element (attributes are optional unless stated otherwise):

  <schema
    id=ID
    attributeFormDefault=qualified | unqualified
    elementFormDefault=qualified | unqualified
    blockDefault=(#all | list of (extension | restriction | substitution))
    finalDefault=(#all | list of (extension | restriction | list | union))
    targetNamespace=anyURI
    version=token
    xmlns=anyURI
    any attributes >
    ((include | import | redefine | annotation)∗,
     ((simpleType | complexType | group | attributeGroup | element | attribute | notation), annotation∗)∗)
  </schema>

Description:

  id: Specifies a unique ID for the element.
  attributeFormDefault: The form for attributes declared in the target namespace of this schema. The value must be qualified or unqualified. Default is unqualified. unqualified indicates that attributes from the target namespace are not required to be qualified with the namespace prefix. qualified indicates that attributes from the target namespace must be qualified with the namespace prefix.
  elementFormDefault: The form for elements declared in the target namespace of this schema. The value must be qualified or unqualified. Default is unqualified. unqualified indicates that elements from the target namespace are not required to be qualified with the namespace prefix. qualified indicates that elements from the target namespace must be qualified with the namespace prefix.
  targetNamespace: A URI reference of the namespace of this schema.
  xmlns: Required. A URI reference that specifies one or more namespaces for use in this schema. If no prefix is assigned, the schema components of the namespace can be used with unqualified references.

Kleene operators ?, +, and ∗ are defined in Figure 6-5.

The element element defines an element. Its parent element can be one of the following: schema, choice, all, sequence, and group.

Syntax of the element element (all attributes are optional):

  <element
    id=ID
    name=NCName
    ref=QName
    type=QName
    substitutionGroup=QName
    default=string
    fixed=string
    form=qualified | unqualified
    maxOccurs=nonNegativeInteger | unbounded
    minOccurs=nonNegativeInteger
    nillable=true | false
    abstract=true | false
    block=(#all | list of (extension | restriction))
    final=(#all | list of (extension | restriction))
    any attributes >
    (annotation?, ((simpleType | complexType)?, (unique | key | keyref)∗))
  </element>

Description:

  name: Specifies a name for the element. This attribute is required if the parent element is the schema element.
  ref: Refers to the name of another element. This attribute cannot be used if the parent element is the schema element.
  type: Specifies either the name of a built-in data type, or the name of a simpleType or complexType element.
  default: This value is automatically assigned to the element when no other value is specified. (Can only be used if the element's content is a simple type or text only.)
  maxOccurs: Specifies the maximum number of times this element can occur in the parent element. The value can be any number >= 0, or if you want to set no limit on the maximum number, use the value unbounded. Default value is 1.
  minOccurs: Specifies the minimum number of times this element can occur in the parent element. The value can be any number >= 0. Default is 1.

Kleene operators ?, +, and ∗ are defined in Figure 6-5.

The group element is used to define a collection of elements to be used to model compound elements. Its parent element can be one of the following: schema, choice, sequence, complexType, restriction (both simpleContent and complexContent), extension (both simpleContent and complexContent).

Syntax of the group element (all attributes are optional):

  <group
    id=ID
    name=NCName
    ref=QName
    maxOccurs=nonNegativeInteger | unbounded
    minOccurs=nonNegativeInteger
    any attributes >
    (annotation?, (all | choice | sequence))
  </group>

Description:

  name: Specifies a name for the group. This attribute is used only when the schema element is the parent of this group element. Name and ref attributes cannot both be present.
  ref: Refers to the name of another group. Name and ref attributes cannot both be present.

The attributeGroup element is used to group a set of attribute declarations so that they can be incorporated as a group into complex type definitions.

Syntax of attributeGroup (all attributes are optional):

  <attributeGroup
    id=ID
    name=NCName
    ref=QName
    any attributes >
    (annotation?, ((attribute | attributeGroup)∗, anyAttribute?))
  </attributeGroup>

Description:

  name: Specifies the name of the attribute group. Name and ref attributes cannot both be present.
  ref: Refers to a named attribute group. Name and ref attributes cannot both be present.
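
The name/ref distinction described for group and attributeGroup can be sketched as follows. This is only an illustration under stated assumptions: the names closingGroup, letterAttributes, and note are hypothetical and do not come from Listing 6-6, although the element and attribute names inside them echo the letter example. Each group is declared once with name directly under schema and then incorporated elsewhere with ref.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://any.website.net/letter"
            xmlns="http://any.website.net/letter"
            elementFormDefault="qualified">

  <!-- Named element group (hypothetical): "name" is used because the
       parent of this group element is the schema element. -->
  <xsd:group name="closingGroup">
    <xsd:sequence>
      <xsd:element name="closing" type="xsd:string"/>
      <xsd:element name="signature" type="xsd:string"/>
    </xsd:sequence>
  </xsd:group>

  <!-- Named attribute group (hypothetical): a reusable set of
       attribute declarations. -->
  <xsd:attributeGroup name="letterAttributes">
    <xsd:attribute name="language" type="xsd:language"/>
    <xsd:attribute name="template" type="xsd:string"/>
  </xsd:attributeGroup>

  <!-- A hypothetical element whose anonymous complex type pulls in
       both groups by reference; "ref" and "name" are never combined. -->
  <xsd:element name="note">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="body" type="xsd:string"/>
        <xsd:group ref="closingGroup"/>
      </xsd:sequence>
      <xsd:attributeGroup ref="letterAttributes"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```

In this sketch a note element must contain body, closing, and signature in that order, because the referenced group expands in place inside the enclosing sequence. The two attributes remain optional, since attribute declarations default to use="optional".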