Models for Structured Content

Ivan Marsic • Rutgers University 234

    abstract=true|false
    block=(#all | list of (extension | restriction))
    final=(#all | list of (extension | restriction))
    any attributes
  >
  (annotation?, ((simpleType | complexType)?, (unique | key | keyref)∗))
  </element>

Kleene operators ?, +, and ∗ are defined in Figure 6-5.

The group element is used to define a collection of elements to be used to model compound elements. Its parent element can be one of the following: schema, choice, sequence, complexType, restriction (both simpleContent and complexContent), extension (both simpleContent and complexContent).

Syntax of the group element          Description (all attributes are optional)

  <group
    id=ID
    name=NCName                      Specifies a name for the group. This attribute is
                                     used only when the schema element is the parent of
                                     this group element. Name and ref attributes cannot
                                     both be present.
    ref=QName                        Refers to the name of another group. Name and ref
                                     attributes cannot both be present.
    maxOccurs=nonNegativeInteger | unbounded
    minOccurs=nonNegativeInteger
    any attributes
  >
  (annotation?, (all | choice | sequence))
  </group>

The attributeGroup element is used to group a set of attribute declarations so that they can be incorporated as a group into complex type definitions.

Syntax of attributeGroup             Description (all attributes are optional)

  <attributeGroup
    id=ID
    name=NCName                      Specifies the name of the attribute group. Name and
                                     ref attributes cannot both be present.
    ref=QName                        Refers to a named attribute group. Name and ref
                                     attributes cannot both be present.
    any attributes
  >
  (annotation?, ((attribute | attributeGroup)∗, anyAttribute?))
  </attributeGroup>

Chapter 6 • XML and Data Representation 235

The annotation element specifies schema comments that are used to document the schema. This element can contain two elements: the documentation element, meant for human consumption, and the appinfo element, for machine consumption.

Simple Elements

A simple element is an XML element that can contain only text.
It cannot contain any other elements or attributes. However, the “only text” restriction is ambiguous, since the text can be of many different types. It can be one of the built-in types that are included in the XML Schema definition, such as boolean, string, or date, or it can be a custom type that you define yourself, as will be seen in Section 6.2.3 below. You can also add restrictions (facets) to a data type in order to limit its content, and you can require the data to match a defined pattern. Examples of simple elements are the salutation and body elements in Listing 6-6 above.

Groups of Elements

XML Schema enables collections of elements to be defined and named, so that the elements can be used to build up the content models of complex types. Unnamed groups of elements can also be defined and, along with elements in named groups, they can be constrained to appear in the same order (sequence) as they are declared. Alternatively, they can be constrained so that only one of the elements may appear in an instance.

A model group is a constraint in the form of a grammar fragment that applies to lists of element information items, such as plain text or other markup elements. There are three varieties of model group:

• Sequence (element sequence): all the named elements must appear in the order listed;
• Conjunction (element all): all the named elements must appear, although they can occur in any order;
• Disjunction (element choice): one, and only one, of the elements listed must appear.
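
The difference between the three model groups can be illustrated with a minimal Python sketch. Python is used here only for illustration and is not part of XML Schema. The sketch assumes that every declared element uses the default occurrence of exactly one and that the declared names are distinct. The function names (child_names, matches_sequence, matches_all, matches_choice) and the letter instance are hypothetical, invented for this example; only the element names are borrowed from Listing 6-6. This is not a schema validator.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def child_names(xml_text):
    """Return the tag names of the root element's children, in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]


def matches_sequence(declared, children):
    """sequence: all declared elements appear, in the declared order."""
    return list(children) == list(declared)


def matches_all(declared, children):
    """all: all declared elements appear exactly once, in any order."""
    return Counter(children) == Counter(declared)


def matches_choice(declared, children):
    """choice: one, and only one, of the declared elements appears."""
    return len(children) == 1 and children[0] in declared


# Hypothetical declaration list and instance document (assumptions for
# illustration; the body element deliberately precedes salutation).
DECLARED = ["salutation", "body", "signature"]
LETTER = (
    "<letter>"
    "<body>Text of the letter</body>"
    "<salutation>Dear reader</salutation>"
    "<signature>The author</signature>"
    "</letter>"
)

if __name__ == "__main__":
    names = child_names(LETTER)
    print("sequence:", matches_sequence(DECLARED, names))
    print("all:     ", matches_all(DECLARED, names))
    print("choice:  ", matches_choice(DECLARED, names))
```

Because the children of this letter appear out of the declared order, the instance satisfies only the all constraint: it fails sequence (wrong order) and fails choice (more than one element present).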

6.2.3 Datatypes

In the XML Schema specification, a datatype is defined by:

(a) Value space, which is the set of distinct values that a given datatype can assume. For example, the value space for the int type is the set of integers in the range [−2147483648, 2147483647], i.e., signed 32-bit numbers.

(b) Lexical space, which is the set of allowed lexical representations or literals for the datatype. For example, the float-type number 0.00125 has the alternative representation 1.25E−3. Valid literals for the float type also include abbreviations for positive and negative infinity (±INF) and Not a Number (NaN).

(c) Facets that characterize properties of the value space, individual values, or lexical items. For example, a datatype is said to have a “numeric” facet if its values are conceptually quantities in some mathematical number system. Numeric datatypes can further have a “bounded” facet, meaning that an upper and/or lower value is specified. For example, postal codes in the U.S. are bounded to the range [10000, 99999].

XML Schema has a set of built-in or primitive datatypes that are not defined in terms of other datatypes. We have already seen some of these, such as xsd:string, which was used in Listing 6-6. More are introduced below. Unlike these, derived datatypes are those that are defined in terms of other datatypes (either primitive types or derived ones).

Simple Types: simpleType

These types are atomic in that they can only contain character data and cannot have attributes or element content. Both built-in simple types and their derivations can be used in all element and attribute declarations. Simple-type definitions are used when a new data type needs to be defined, where this new type is a modification of some other existing simple type. Table 6-1 shows a partial list of the Schema-defined types.
There are over 40 built-in simple types and the reader should consult the XML Schema specification (see http://www.w3.org/TR/xmlschema-0/, Section 2.3) for the complete list.

Table 6-1: A partial list of primitive datatypes that are built into the XML Schema.

Name            Examples                            Comments
string          My favorite text example
byte            −128, −1, 0, 1, …, 127              A signed byte value
unsignedByte    0, …, 255                           Derived from unsignedShort
boolean         0, 1, true, false                   May contain either true or false, 0 or 1
short           −5, 328                             Signed 16-bit integer
int             −7, 471                             Signed 32-bit integer
integer         −2, 435                             Integer of arbitrary size (not limited to 32 bits)
long            −4, 123456                          Signed 64-bit integer
float           0, −0, −INF, INF, −1E4,             Conforming to the IEEE 754 standard for 32-bit
                1.401298464324817e−45,              single precision floating point numbers. Note the
                3.402823466385288e+38, NaN          use of abbreviations for positive and negative
                                                    infinity (±INF), and Not a Number (NaN)
double          0, −0, −INF, INF, −1E4,             Conforming to the IEEE 754 standard for 64-bit
                4.9e−324, 1.797e308, NaN            double precision floating point numbers
duration        P1Y2M3DT10H30M12.3S                 1 year, 2 months, 3 days, 10 hours, 30 minutes,
                                                    and 12.3 seconds
dateTime        1997-03-31T13:20:00.000-05:00       March 31st 1997 at 1:20 pm Eastern Standard Time,
                                                    which is 5 hours behind Coordinated Universal Time
date            1997-03-31
time            13:20:00.000,
                13:20:00.000-05:00
gYear           1997                                The “g” prefix signals time periods in the
                                                    Gregorian calendar.
gDay            ---31                               The 31st day
QName           lt:sender                           XML Namespace QName (qualified name)
language        en-GB, en-US, fr                    Valid values for xml:lang as defined in XML 1.0
ID              this-element                        An attribute that identifies the element; can be
                                                    any string that conforms to the rules for
                                                    assigning the element names.
IDREF           this-element                        IDREF attribute type; refers to an element which
                                                    has the ID attribute with the same value

A straightforward use of built-in types is the direct declaration of elements and attributes that conform to them.
For example, in Listing 6-6 above I declared the signature element and the template attribute of the letter element, both using the xsd:string built-in type:

    <xsd:element name="signature" type="xsd:string"/>
    <xsd:attribute name="template" type="xsd:string"/>

New simple types are defined by deriving them from existing simple types (built-in and derived ones). In particular, we can derive a new simple type by restricting an existing simple type; in other words, the legal range of values for the new type is a subset of the existing type’s range of values. We use the simpleType element to define and name the new simple type. We use the restriction element to indicate the existing (base) type, and to identify the facets that constrain the range of values. A complete list of facets is provided below.

Facets and Regular Expressions

We use the “facets” of datatypes to constrain the range of values. Suppose we wish to create a new type of integer called zipCodeType whose range of values is between 10000 and 99999 (inclusive). We base our definition on the built-in simple type integer, whose range of values also includes integers less than 10000 and greater than 99999. To define zipCodeType, we restrict the range of the integer base type by employing two facets called minInclusive and maxInclusive (to be introduced below):

Listing 6-8: Example of new type definition by facets of the base type.

    <xsd:simpleType name="zipCodeType">
      <xsd:restriction base="xsd:integer">
        <xsd:minInclusive value="10000"/>
        <xsd:maxInclusive value="99999"/>
      </xsd:restriction>
    </xsd:simpleType>

Table 6-2 and Table 6-3 list the facets that are applicable for built-in types. The facets identify various characteristics of the types, such as:

• length, minLength, maxLength: the exact, minimum, and maximum character length of the value
• pattern: a regular expression pattern for the value (see more below)
• enumeration: a list of all possible values (an example is given in Listing 6-9 below)