Translation Transformation Using XSLT Stylesheets

Managing Content Categorizer 9-23 original content is not modified, and both the translated and transformed XML files are discarded after use. This section covers the following topics: ■ Translation on page 9-23 ■ Transformation Using XSLT Stylesheets on page 9-23 ■ SearchML Transformation on page 9-24 ■ Flexiondoc Transformation on page 9-25 ■ Example Files on page 9-27

9.5.1 Translation

The translation step uses the OutsideIn XML Export filters to output the XML in either SearchML or Flexiondoc XML format, depending on the type of content being translated and whether the format is available for the platform being used. This translation process enables Categorizer to support a large number of different source document formats. The transformation step uses eXtensible Stylesheet Language Transformations XSLT to transform the initial XML output into an XML equivalent that can be easily searched and analyzed by Content Categorizer, based on search rules defined by the user. An overview of the transformation process may be useful to anyone interested in the categorization process, and serve as a starting point for users who would like to define their own XSLT stylesheets to accommodate their specific document processing needs. Translation Using OutsideIn XML Export Filters A runtime version of the OutsideIn XML Export product is integrated and installed with Content Server, and it filters content checked in for categorization. The Export filters convert content to XML for transformation using Categorizer’s XSLT stylesheets. The transformation is necessary because the Export XML schemas, Flexiondoc and SearchML, are not in a form easily searched by Content Categorizer rules.

9.5.2 Transformation Using XSLT Stylesheets

Two stylesheets are included with Content Categorizer and applied based on the initial translation format provided by the OutsideIn XML Export filter. The stylesheets are located in the following directory. cs_rootdatacontentcategorizerstylesheets For content items output in SearchML, searchml_to_scc.xsl is applied. For content items output in Flexiondoc, flexiondoc_to_scc.xsl is applied. SearchML and Flexiondoc both reproduce style designations found in the source content, but they do so differently, in ways not detectable by Content Categorizer rules. The appropriate stylesheet can recognize the necessary style information in each format and use that information as the basis for transforming the final output tags into an XML document useful to Content Categorizer. The similarity between SearchML and Flexiondoc depends on the degree to which internal styles or metadata are used in the content. When working with content using named styles, such as Microsoft Word, the resultant output will be similar. When working with content in formats such a PDF or text, results come out with more generic tagging. 9-24 Application Administrators Guide for Content Server

9.5.3 SearchML Transformation