Machine-processed formatting specifications

$Date: 2006/03/03 13:53:06 $(UTC)


Table of Contents

1. Introduction
2. Annotation mechanism
3. Document-level information
4. Page-level information
5. Box-level information
5.1. Box descriptions
6. DocBook augmentation
References

1. Introduction

Formatting specifications are lengthy prose documents with guidelines on the implementation of processes to effect the rendering of information. The UBL HISC output formatting specifications are written in XML using the DocBook [DocBook] vocabulary and published using XSLT [XSLT] stylesheets.

Contained within an HISC formatting specification is a lot of granular information that can be useful input to the implementation when processed in a mechanical fashion. This information is identified through innocuous annotations in DocBook using attributes not utilized by the formatting stylesheets.

The fields described below are specific to the formatting specifications following the United Nations Layout Key, capturing the grid information for the location and dimensions of fielded information presented in a form. The accuracy of these form and box specifications is validated by a visual inspection of a pro forma presentation of the form information automatically synthesized from the annotations. Thus, if the form appears to be presented correctly, then there must be sufficient annotation in the document for an implementation to be created.

The approach described below can be applied to any specification from which useful embedded implementation information can be distilled for machine processing.

2.  Annotation mechanism

The annotation of a DocBook instance is accomplished in such a way that the annotations do not interfere with either validation or formatting. There are a number of common attributes declared for use with DocBook instances, some of which have some semantic use in linking or software documentation, but most of which do not impact on visual formatting. The HISC-specific information is found in uses of these reserved attributes.

Specifically, the role= attribute is exclusively used to identify particular sections of the specification. The nested use of DocBook constructs using different role= attributes gives context to the information items identified with the same role= value.

Most of the values exposed by this method are obtained using the text value of the entire node. For example, the following XSLT:

  <xsl:value-of select=".//entry[@role='row']"/>

would be used to get the value "1" from the descendant <entry> element marked up as:

  <entry role="row"><literal>1</literal></entry>
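
For an implementation that consumes the annotations outside of XSLT, the same whole-node text value can be obtained with a general-purpose XML library. The following is a minimal sketch in Python using the standard library's ElementTree, not part of the HISC tooling; the surrounding <section> and its id value are hypothetical and the table markup around the <entry> is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def node_text(element):
    """Return the concatenated text of an element and all of its descendants."""
    return "".join(element.itertext())


# Hypothetical, simplified box section; a real specification would nest
# the <entry> inside DocBook table markup.
section = ET.fromstring(
    '<section role="HISC" id="example-box">'
    '<entry role="row"><literal>1</literal></entry>'
    '</section>'
)

# find() returns the first matching descendant in document order.
row_entry = section.find(".//entry[@role='row']")
row = node_text(row_entry)
print(row)
```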

Some of the values exposed by this method are obtained using only the top-level text child because there are top-level element children being used as heralds. For example, the following XSLT:

  <xsl:value-of select=".//para[@role='docnumber']/text()"/>

would be used to get the value "220" from the descendant <para> element's text children, ignoring the emphasized herald:

  <para role="docnumber"><emphasis>Document number: </emphasis>220</para>
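
Outside of XSLT, the top-level text has to be assembled by hand in libraries that attach text to elements rather than exposing text nodes. The following is a minimal Python sketch using the standard library's ElementTree, where text following a child element is stored in that child's tail; the helper name top_level_text is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def top_level_text(element):
    """Join the text that is a direct child of the element, skipping child elements."""
    parts = [element.text or ""]                      # text before the first child
    parts.extend(child.tail or "" for child in element)  # text after each child
    return "".join(parts)


para = ET.fromstring(
    '<para role="docnumber"><emphasis>Document number: </emphasis>220</para>'
)
docnumber = top_level_text(para)
print(docnumber)
```

Note that this sketch joins every top-level text child, whereas xsl:value-of in XSLT 1.0 returns only the first selected text node. The two agree when the herald is the only element child and precedes the value, as in the markup above.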

Some of the values exposed by this method are in the id= attribute, chosen so that validating the DocBook instance guarantees the necessary uniqueness of the values.

3.  Document-level information

There are three pieces of information used to describe the information outside of the form boxes:

  • <xsl:value-of select=".//para[@role='doctitle']/text()"/>

    The document type title at the top of the form above the form boxes.

  • <xsl:value-of select=".//para[@role='docnumber']/text()"/>

    The numeric reference to the document type, rendered in the bottom left outside of the form boxes.

  • <xsl:value-of select=".//para[@role='docabbrev']/text()"/>

    An optional alphanumeric reference to the document type, rendered in the bottom left outside of the form boxes, in conjunction with the numeric reference to the document type. This is used only if necessary to disambiguate two different forms that have the same UN layout document number.
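
The three document-level items can be gathered in one pass. The following is a minimal Python sketch using the standard library's ElementTree, assuming a simplified instance; the <article> wrapper, the heralds, and the title "Example Form" are hypothetical sample data, and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample instance; no docabbrev is present because it is optional.
SAMPLE = """<article>
  <para role="doctitle"><emphasis>Document title: </emphasis>Example Form</para>
  <para role="docnumber"><emphasis>Document number: </emphasis>220</para>
</article>"""


def top_level_text(element):
    """Join the element's own text children, ignoring herald child elements."""
    return (element.text or "") + "".join(child.tail or "" for child in element)


def document_info(root):
    """Collect the document-level values; an absent item is reported as None."""
    info = {}
    for role in ("doctitle", "docnumber", "docabbrev"):
        para = root.find(f".//para[@role='{role}']")
        info[role] = top_level_text(para).strip() if para is not None else None
    return info


info = document_info(ET.fromstring(SAMPLE))
print(info)
```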

4.  Page-level information

There are at most two pages described for a UN Layout Key form. All forms have a "first" page, while only some of the forms have a "continuation" page. Box-level information is grouped within sections marked for each kind of page.

Every formatting specification will have a section with first-page information marked as:

    <section role="first">

Those formatting specifications describing a continuation page will have a section with information marked as:

    <section role="continuation">
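
An implementation might locate the two page sections as follows. This is a minimal Python sketch using the standard library's ElementTree; the <article> wrapper, the titles, and the function name page_sections are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample instance with both kinds of page.
SAMPLE = """<article>
  <section role="first"><title>First page</title></section>
  <section role="continuation"><title>Continuation page</title></section>
</article>"""


def page_sections(root):
    """Return (first, continuation); continuation is None when not described."""
    first = root.find(".//section[@role='first']")
    if first is None:
        # Every formatting specification is expected to describe a first page.
        raise ValueError("no section marked role='first' was found")
    continuation = root.find(".//section[@role='continuation']")
    return first, continuation


first, continuation = page_sections(ET.fromstring(SAMPLE))
print(first.get("role"), continuation is not None)
```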

5.  Box-level information

A group of box descriptions can be collected in a section with a unique identifier marked as follows, though it is not obligatory to include the identifier unless the group of box descriptions is being reused:

    <section id="{unique-section-identifier}">

When an existing group of box descriptions is being reused (typically on the continuation page), the individual descriptions are not copied in the specification. Rather, they are referenced indirectly by pointing to the section in which the box sections and descriptions are found as follows:

    <xref role="HISC" linkend="{unique-section-identifier}"/>
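
Resolving such a reference amounts to looking up the section by its identifier and reading the box sections found inside it. The following is a minimal Python sketch using the standard library's ElementTree. The identifiers, the placement of the <xref> inside a <para>, and the choice to return boxes in document order are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: the continuation page reuses the group "common-boxes".
SAMPLE = """<article>
  <section role="first">
    <section id="common-boxes">
      <section role="HISC" id="box-a"/>
      <section role="HISC" id="box-b"/>
    </section>
  </section>
  <section role="continuation">
    <para><xref role="HISC" linkend="common-boxes"/></para>
    <section role="HISC" id="box-c"/>
  </section>
</article>"""


def page_boxes(root, page):
    """List the box sections of a page, following HISC cross references."""
    by_id = {el.get("id"): el for el in root.iter("section") if el.get("id")}
    boxes = []
    for el in page.iter():  # document order, starting with the page itself
        if el.tag == "section" and el.get("role") == "HISC":
            boxes.append(el)
        elif el.tag == "xref" and el.get("role") == "HISC":
            target = by_id[el.get("linkend")]
            boxes.extend(target.findall(".//section[@role='HISC']"))
    return boxes


root = ET.fromstring(SAMPLE)
first_ids = [b.get("id") for b in page_boxes(root, root.find(".//section[@role='first']"))]
continuation_ids = [
    b.get("id") for b in page_boxes(root, root.find(".//section[@role='continuation']"))
]
print(first_ids, continuation_ids)
```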

5.1.  Box descriptions

A filler box has no XPath or herald information and just takes up room on the page to fill out all rows and columns of the form. Filler boxes must be specified to indicate the use of borders; thus, when an irregularly shaped space must be filled, it must be filled with as many filler boxes as there are different border settings. A filler box is described in a single section without a unique identifier and marked as:

    <section role="HISC">

A given box of XPath or herald information in the UN Layout Key is described in a single section with an obligatory unique identifier marked as:

    <section role="HISC" id="{unique-box-identifier}">

With the single box section in which the descriptions for a given box are found (as above) as the current node, the following information about the box can be extracted from the markup:

  • <xsl:value-of select="@id"/>

    The unique identifier for the box description.

  • <xsl:value-of select=".//entry[@role='row']"/>

    The row number (1-origin) of the top left of the box.

  • <xsl:value-of select=".//entry[@role='column']"/>

    The column number (1-origin) of the top left of the box.

  • <xsl:value-of select=".//entry[@role='height']"/>

    The height (1-origin) of the box.

  • <xsl:value-of select=".//entry[@role='width']"/>

    The width (1-origin) of the box.

  • <xsl:value-of select=".//entry[@role='line-before']"/>

    A "true" indication of the before-side (top) of the box being lined.

  • <xsl:value-of select=".//entry[@role='line-after']"/>

    A "true" indication of the after-side (bottom) of the box being lined.

  • <xsl:value-of select=".//entry[@role='line-start']"/>

    A "true" indication of the start-side (left) of the box being lined.

  • <xsl:value-of select=".//entry[@role='line-end']"/>

    A "true" indication of the end-side (right) of the box being lined.

  • <xsl:value-of select=".//para[@role='label']/text()"/>

    The string to use as a herald in the top left of the box.

  • <xsl:for-each select=".//entry[@role='xpath']">

    The absolute XPath address of one of the possibly many information items in the input instance that belong in this box. There is no implication of any presentation of the information items within the box that is drawn.

  • <xsl:if test=".//para[@role='manual']">

    This test is true when the form field is for manual entry. This distinguishes form fields that are in use by the end user on paper. These fields are not filled in by the print process. Other form fields that are not in use and not for manual entry might be distinguished visually in the layout (such as with a background).
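
The items listed above can be collected into a single record per box. The following is a minimal Python sketch using the standard library's ElementTree. The sample markup, identifier, label, and XPath address are hypothetical; the Box record and helper names are assumptions made for illustration; and treating any border value other than "true" (including an absent entry) as unlined is also an assumption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical, simplified box section (table markup reduced to a single row).
SAMPLE = """<section role="HISC" id="example-box">
  <para role="label"><emphasis>Label: </emphasis>Sender</para>
  <row>
    <entry role="row"><literal>1</literal></entry>
    <entry role="column"><literal>1</literal></entry>
    <entry role="height"><literal>4</literal></entry>
    <entry role="width"><literal>4</literal></entry>
    <entry role="line-before"><literal>true</literal></entry>
    <entry role="line-start"><literal>true</literal></entry>
    <entry role="xpath"><literal>/Example/Sender/Name</literal></entry>
  </row>
</section>"""


@dataclass
class Box:
    identifier: str   # None for a filler box
    row: int
    column: int
    height: int
    width: int
    lines: dict       # side name -> True when that side is lined
    label: str
    xpaths: list
    manual: bool


def entry_text(box, role):
    """Whole-node text of the first <entry> with the given role, or ''."""
    entry = box.find(f".//entry[@role='{role}']")
    return "".join(entry.itertext()).strip() if entry is not None else ""


def top_level_text(element):
    """Join the element's own text children, ignoring herald child elements."""
    return (element.text or "") + "".join(child.tail or "" for child in element)


def read_box(box):
    """Build a Box from a box section; int() raises if a grid value is missing."""
    label = box.find(".//para[@role='label']")
    return Box(
        identifier=box.get("id"),
        row=int(entry_text(box, "row")),
        column=int(entry_text(box, "column")),
        height=int(entry_text(box, "height")),
        width=int(entry_text(box, "width")),
        lines={side: entry_text(box, f"line-{side}") == "true"
               for side in ("before", "after", "start", "end")},
        label=top_level_text(label).strip() if label is not None else "",
        xpaths=["".join(e.itertext()).strip()
                for e in box.findall(".//entry[@role='xpath']")],
        manual=box.find(".//para[@role='manual']") is not None,
    )


box = read_box(ET.fromstring(SAMPLE))
print(box)
```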

6.  DocBook augmentation

The concept of processing a DocBook instance with XSLT is also used in the publishing process in order to augment the authored content with repetitive and algorithmic changes.

The fsdb2db.xsl (Formatting Specification DocBook to pure DocBook) stylesheet takes in an HISC formatting specification DocBook instance with annotations and produces an augmented DocBook instance for publishing.

The following augmentations are accomplished:

  • Editorial notes identified by <note role="editorial"> are exposed where authored, collected at the end of the document and exposed again in a summary, and hyperlinked from the summary to their authored location.

    This supports the editorial process of annotating a document with a reminder to be addressed later. It is expected that all editorial notes be removed from the final source.

  • The many tables in a specification that are annotated with role="xpath" are augmented with a header row giving the table a header without having to author the header repeatedly.

  • The many table entries in a specification that are annotated with role="xpath" are augmented by injecting zero-width spaces after the occurrence of every oblique ("/") character. This would be very laborious for an author, yet some of the XPath addresses are so long that they would otherwise overflow the page width without wrapping (causing loss of information), thus mandating the use of zero-width spaces.
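
The zero-width space injection described in the last item is a simple string transformation. The following is a minimal Python sketch of the idea, not the code of fsdb2db.xsl itself; the function name and the sample address are assumptions made for illustration.

```python
ZERO_WIDTH_SPACE = "\u200b"  # U+200B, an invisible line-break opportunity


def add_break_opportunities(xpath_text):
    """Insert a zero-width space after every oblique so long addresses can wrap."""
    return xpath_text.replace("/", "/" + ZERO_WIDTH_SPACE)


# Hypothetical XPath address used only as sample input.
wrapped = add_break_opportunities("/Example/Sender/Name")
print(len(wrapped))
```

Removing the zero-width spaces from the result restores the original address, so the augmentation loses no information.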

References

[DocBook] Norm Walsh, DocBook XML: The DocBook 4.4 Document Type, OASIS, January 27, 2005

[XSLT] James Clark, XSL Transformations (XSLT) Version 1.0, W3C Recommendation, 16 November 1999