XLIFF Profile for HTML

Preliminary Working Draft, 6 February 2004

This version:
<not applicable yet>
Latest version:
<not applicable yet>
Previous version:
<not applicable yet>
Editors:
Bryan Schnabel <bryan.s.schnabel@exgate.tek.com>
Yves Savourel <ysavourel@translate.com>
Copyright © The Organization for the Advancement of Structured Information Standards [OASIS] 2004. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to OASIS, except as needed for the purpose of developing OASIS specifications, in which case the procedures for copyrights defined in the OASIS Intellectual Property Rights document must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by OASIS or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and OASIS DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

 


Abstract

This document describes how HTML, in its different flavors, should be represented when extracted to an XLIFF document.

Status of this Document

This document is a Preliminary Working Draft of the committee. It is an OASIS draft document for review by OASIS members and other interested parties. Comments may be sent to xliff-comment@lists.oasis-open.org.

This document may be updated, replaced, or rendered obsolete by other documents at any time. It may also be discarded without further follow up. It is inappropriate to use this document as reference material other than "work in progress".

Table of Contents

1. Introduction
          1.1. Purpose
2. General Considerations
          2.n. HTML Flavors
          2.n. Server-Side Files
          2.n. Extraction Techniques
          2.n. Order of Extraction
          2.n. Identifiers
          2.n. Text Content
          2.n. Non-Text Content
          2.n. Marked Sections
          2.n. CDATA Sections
          2.n. Multilingual Documents
          2.n. Entity References
          2.n. Numeric Character References
          2.n. Comments
          2.n. Processing Instructions
          2.n. Segmentation
3. General Structure
          3.n.
4. Details by Element and Attribute
          4.n. <img> Element
          4.n. SVG Images
          4.n. HTML Forms
          4.n. XForms Forms
          4.n. Bidirectional Markers
          4.n.

Appendices

A. Contributions
B. References

Introduction

Different tools may provide different filters to extract the content of HTML documents. For interoperability, it is important that they all represent the extracted data in the same manner in the XLIFF document.

Purpose

The aim of this document is to describe a preferred way to represent HTML in an XLIFF document.

MORE needed

General Considerations

This section discusses the general considerations to take into account when extracting HTML data.

HTML Flavors

Many flavors of HTML are in use: HTML 4.01, XHTML, etc. There are also, probably in even greater quantity, many pages that are considered HTML but are not valid HTML. This document tries to address all these different flavors.

In this document the term "HTML" is used generically, to designate any of the flavors. If the text refers to a specific flavor, it uses the more complete name for that flavor: for example "HTML 4.01", "XHTML", "XHTML 1.0", etc.

Server-Side Files

Many HTML documents are generated dynamically, in some cases using server-side script files which are often made of a mixture of HTML constructs and server-side instructions written in one of the server-side languages such as PHP, JSP, ASP, or many others.

While such source documents are generally outside of the scope of this document, an effort is made to try to address some of the issues you may run into when extracting such source documents.

Do we do that?

Extraction Techniques

There are many ways to process a source HTML document and create its corresponding XLIFF output.

One interesting approach is to make use of XML standards, such as XSLT, XPath, or XSL-FO. Of these, XSLT is a particularly good tool for transforming HTML to XLIFF, and XLIFF back to HTML. See the Appendix "Example of XSLT Use to Process HTML" for a concrete example of how to go back and forth between HTML and XLIFF.

XSLT works on any well-formed XML document. So the input HTML document has to be a valid XHTML document or a well-formed HTML document.

If the input file is not at that stage, it can be pre-processed first, using tools such as Perl [Perl], HTML Tidy [HTMLTidy], or other utilities. They all can provide a good way to streamline the pre-processing task. See the Appendix "Pre-Processing HTML Files" for an example of how to use Perl to pre-process an HTML document that is not well-formed into a well-formed XML document.

Order of Extraction

The flow of the extracted data in the XLIFF document should be in the same order as the flow of data in the original HTML document, regardless of any layout placement. In other words, how the text is stored in the source document and how it is processed (most of the time, displayed) by the user agent are two different things. The extraction order should reflect the order of the data in the source document, and the author is responsible for grouping logical parts of the text together as much as possible.

Identifiers

The identifier used for matching, leveraging, and other ID-related functions is stored in the resname attribute. That means the ID-related attributes of an HTML document, such as name or id, should be stored, when appropriate, in the XLIFF attribute resname.

The required attribute id of an XLIFF <trans-unit> element is an identifier allowing extraction tools to merge back the data. Its value is determined by the filter and may or may not correspond to an HTML identifier.
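
The two identifier roles can be illustrated with a minimal Python sketch. The helper name make_trans_unit and the running-number id scheme are assumptions made for illustration only; the draft only requires that id be chosen by the filter and that HTML identifiers, when appropriate, be carried in resname.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical id scheme: a simple running number chosen by the filter.
_counter = itertools.count(1)

def make_trans_unit(html_attrs, text):
    """Build a <trans-unit> for one extracted HTML element."""
    # The XLIFF id is the filter's own key for merging the data back.
    unit = ET.Element("trans-unit", {"id": str(next(_counter))})
    # The HTML identifier (id first, then name) goes to resname, if present.
    html_id = html_attrs.get("id") or html_attrs.get("name")
    if html_id:
        unit.set("resname", html_id)
    ET.SubElement(unit, "source").text = text
    return unit

unit = make_trans_unit({"id": "intro"}, "Welcome")
print(ET.tostring(unit, encoding="unicode"))
```

Here the XLIFF id and the HTML id differ on purpose, to show that the two values are independent.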

Text Content

Should the restype value for normal text content be empty, or the name of the element it comes from?

Non-Text Content

The content of some HTML elements and attributes may be something else than simple text.

Styles

XLIFF provides an attribute css-style that allows you to carry directly any CSS style information applied to a specific item through the use of the HTML style attribute. The content of a stylesheet (or the content of the <style> element) may have translatable text, and should be processed accordingly.

Scripts

TODO

Other Types of Content

Such as XML data island, etc...

Marked Sections

The SGML syntax allows the use of marked sections such as the examples below. Their use in HTML is not recommended. Likewise, there is no provision in XLIFF to represent such constructs.

<![INCLUDE[
 <!-- this will be included -->
]]>

<![IGNORE[
 <!-- this will be ignored -->
]]>

Is it necessary to talk about marked section?

CDATA Sections

One notation allowed in both HTML and XHTML is the CDATA section. This construct permits special characters such as '&', '<', '>', etc. to be included in the text without being escaped. This can be useful when a paragraph contains a lot of such characters.

The use of CDATA is not recommended from the localization viewpoint:

From XML's point of view, CDATA is processed as if it were text. For example:

<p>This is an example <![CDATA[of <sgml> markup that is not <painful> to write with < and such]]>.</p>

is exactly the same as:

<p>This is an example of &lt;sgml&gt; markup that is not &lt;painful&gt; to write with &lt; and such.</p>

It is recommended to not use CDATA in XLIFF content. The second notation is preferable.
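
The equivalence of the two notations can be checked with a short Python sketch that uses the standard xml.etree.ElementTree module. It is only an illustration of how an XML parser sees both forms, not part of the profile.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<p>This is an example <![CDATA[of <sgml> markup that is not "
    "<painful> to write with < and such]]>.</p>"
)
escaped = ET.fromstring(
    "<p>This is an example of &lt;sgml&gt; markup that is not "
    "&lt;painful&gt; to write with &lt; and such.</p>"
)

# Both notations give the parser exactly the same text content.
same_text = with_cdata.text == escaped.text
print(same_text)

# When written back, the text comes out in the escaped notation.
serialized = ET.tostring(with_cdata, encoding="unicode")
print(serialized)
```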

Multilingual Documents

There are two kinds of multilingual files:

Multilingual HTML documents belong to the first category: there is a main language, and the parts in other languages belong to the same content flow.

Therefore, an XLIFF filter should extract all the text of an original HTML document, while keeping track of the language switches when they occur.

This causes a problem: we have no standard way to mark this up inside a <source>.

Entity References

HTML uses several types of entity references.

Character Entities

As a general rule, when extracted to XLIFF, character entity references should be resolved to their corresponding Unicode characters. If an entity reference is not converted, it should be treated as an inline code. For example, the following paragraph:

<p>&aacute;=a-acute</p>

Should be represented this way:

<source>á=a-acute</source>

Or, as a last resort (this is not the preferred solution), this way:

<source><ph id='1'>&amp;aacute;</ph>=a-acute</source>
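
A minimal Python sketch of this rule follows. It assumes the HTML 4 entity table shipped with Python (html.entities.name2codepoint); the function name resolve_entities, the <ph> numbering, and the choice to leave the XML-reserved references untouched are illustrative assumptions, not requirements of this profile.

```python
import re
from html.entities import name2codepoint

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# These must stay escaped, or the XLIFF would no longer be well-formed.
XML_RESERVED = {"amp", "lt", "gt", "quot", "apos"}

def resolve_entities(text):
    """Resolve known character entity references to Unicode characters.

    Unknown references are kept as inline codes in <ph> elements.
    """
    ph_ids = iter(range(1, 10000))  # hypothetical id scheme for <ph>

    def repl(match):
        name = match.group(1)
        if name in XML_RESERVED:
            return match.group(0)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Last resort: preserve the unresolved reference as an inline code.
        return "<ph id='%d'>&amp;%s;</ph>" % (next(ph_ids), name)

    return ENTITY.sub(repl, text)

resolved = resolve_entities("&aacute;=a-acute")
fallback = resolve_entities("&myentity;=custom")
print(resolved)
print(fallback)
```

The reference &myentity; is a made-up name, used only to exercise the fallback branch.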

Numeric Character References

Numeric character references, also known as NCRs, are the generic numeric code representation of characters. When extracted to XLIFF, numeric character references should be resolved to their corresponding Unicode characters, not represented as inline codes.

For example, the following paragraph:

<p>&#x00e0;=a-grave</p>

Should be represented as a raw character, as follows:

<source>à=a-grave</source>

Obviously, if the XLIFF document is in an encoding that does not support the character, the character must be preserved as an NCR.
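
Both cases can be sketched in Python with the standard library. Note that html.unescape also resolves named references, and that the xmlcharrefreplace error handler writes decimal NCRs; the helper name encode_for is an assumption for illustration.

```python
import html

# Resolve the NCR to the raw Unicode character.
text = html.unescape("&#x00e0;=a-grave")

def encode_for(text, encoding):
    """Encode text, falling back to NCRs for characters the encoding lacks."""
    return text.encode(encoding, errors="xmlcharrefreplace")

utf8_bytes = encode_for(text, "utf-8")    # raw character is kept
ascii_bytes = encode_for(text, "ascii")   # character is preserved as an NCR
print(utf8_bytes)
print(ascii_bytes)
```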

Comments

YS, Bryan: I think I confused us originally by mixing three different issues:
a) How to deal with HTML comments when mapping to XLIFF. ==> I think we should preserve them in <ph> (or <x/>).
b) How to deal with comments that are in the <source> or <target> element of XLIFF, regardless of how they ended up there. ==> That is a general XLIFF question, not something only for the HTML profile. We have to make a generic choice on how to deal with comments, PIs, and other legal constructs that could get into the XLIFF content. I suppose the logical thing to do for the XLIFF "consuming tool" would be to treat them like <ph>.
c) How to deal with what I called "localization directives". ==> This is a different topic I shouldn't have opened yet, and like issue b) it's not only for HTML input.
So looking at your input I rewrote the section, and hopefully it makes more sense now.

As a general rule from the localization viewpoint it is recommended to not have HTML Comments inside a text content (for example within a <p> element) because they create potential problems for translation memory matching. Comments outside text content are not an issue since they do not affect the markup of any translatable segments.

If an XLIFF filter tool finds comments inside a text content, the comment should be preserved, treated as inline code. For example, the following HTML paragraph:

<p>The team members had colorful nicknames, like
<!-- use volume 2 here -->
<i>picabo, big-air</i>, and <i>yard-sale </i>
<!-- back to volume 1 here -->
which the media had difficulty understanding.
</p>

should be mapped to a <trans-unit> where the comments are preserved inside <ph> (or <x/>) elements. The XLIFF output would look something like this:

<trans-unit resname="p" id="d0e3">
<source xml:lang="EN">The team members had colorful nicknames, like
<ph id="d0c5" ctype="x-xml-comment"> use volume 2 here </ph>
<mrk mtype="name" comment="i">picabo, big-air</mrk>, and
<mrk mtype="name" comment="i">yard-sale </mrk>
<ph id="d0c11" ctype="x-xml-comment"> back to volume 1 here </ph>
which the media had difficulty understanding.</source>
</trans-unit>
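
The comment handling can be sketched in Python with the standard html.parser module. This is a minimal illustration: the class name CommentToPh and the id numbering are assumptions, and inline elements such as <i> are deliberately left out to keep the sketch short.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

class CommentToPh(HTMLParser):
    """Collect text content, turning each HTML comment into a <ph> code."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.count = 0  # hypothetical running number used for <ph> ids

    def handle_data(self, data):
        self.parts.append(escape(data))

    def handle_comment(self, data):
        # data is the comment text without the <!-- and --> delimiters.
        self.count += 1
        self.parts.append(
            '<ph id="c%d" ctype="x-xml-comment">%s</ph>'
            % (self.count, escape(data))
        )

parser = CommentToPh()
parser.feed("like <!-- use volume 2 here --> picabo")
parser.close()
source = "<source>%s</source>" % "".join(parser.parts)
print(source)
```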

YS: Bryan, I removed the <target> element in the example, since it doesn't bring anything new, and that makes the example more readable.
--- One thing I found easier to work with are <ph> elements that have the full original codes (so it would be something like: <ph id="d0c5" ctype="x-xml-comment"><!-- use volume 2 here --></ph>). This allows tools to rewrite the original segment without worrying about what the <ph> is (comments, etc.). But I suppose it's an implementation choice: there are probably advantages to doing it the way you show.
--- A last thing: I'm not sure I agree with the use of <mrk>; I think <g> would be more appropriate. But that's another topic.

Processing Instructions

YS: Here, like for comments, I think we have different issues:
a) PI inside an HTML text content that we have to preserve. ==> By using <ph> (or <x/>)
b) What we do when a <source> or <target> content has a PI, regardless of how it ended up there.

[Bryan: Hmm, okay. But maybe we should recommend to translation tool vendors that if they’re going to claim to work with XML, they should follow rudimentary XML rules. Thereby recommending that they either process or ignore processing instructions, like other XML tools. Just a thought. I’d also be willing to just stick with the recommendation that processing instructions shouldn’t be in XLIFF files, if this is too difficult for translation tools to do. I’ll go ahead and write alternate code that specifies the <ph> element, in case we need it.]

YS: I guess the overall question is: If we have a recommendation on how to deal with comments and processing instructions that are in an XLIFF <source> or <target> (and I think we should), why should we bother to convert the ones coming from an HTML file into <ph>?
That's a good question... My first answer would be: to make a distinction between the comments/PIs extracted from the HTML/XML and the ones some tool would have put there. But I guess we need to discuss this more.

As a general rule, from the localization viewpoint it is recommended not to have processing instructions inside an HTML element with text content (for example a <p> element), because they create potential problems for translation memory matching. Processing instructions outside text content are not an issue, since they do not affect the markup of any translatable segments.

If an XLIFF filter tool finds a processing instruction inside a text content, it should be preserved, treated as inline code. For example, the following HTML paragraph:

<p>The team members had colorful nicknames, like
<?Trans-instruct use volume 2 here ?>
<i>picabo, big-air</i>, and <i>yard-sale </i>
<?Trans-instruct back to volume 1 here ?>
which the media
had difficulty understanding.
</p>

should be mapped to a <trans-unit> where the processing instructions are preserved inside <ph> (or <x/>) elements. The XLIFF output would look something like this:

<trans-unit resname="p" id="d0e3">
<source xml:lang="EN">The team members had colorful nicknames, like
<ph id="pi-d0p5" ctype="x-Trans-instruct">use volume 2 here </ph>
<mrk mtype="name" comment="i">picabo, big-air</mrk>, and <mrk mtype="name" comment="i">yard-sale </mrk>
<ph id="pi-d0p11" ctype="x-Trans-instruct">back to volume 1 here </ph>
which the media
had difficulty understanding.</source>
</trans-unit>
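
A similar Python sketch covers processing instructions. It relies on the documented behavior of html.parser.HTMLParser.handle_pi, which passes everything between "<?" and ">", so an XML-style instruction arrives with its trailing "?" still attached. The class name PiToPh and the id numbering are assumptions for illustration.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

class PiToPh(HTMLParser):
    """Collect text content, turning processing instructions into <ph>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.count = 0  # hypothetical running number used for <ph> ids

    def handle_data(self, data):
        self.parts.append(escape(data))

    def handle_pi(self, data):
        # Drop the trailing "?" of an XML-style PI, then split off the target.
        target, _, content = data.rstrip("?").partition(" ")
        self.count += 1
        self.parts.append(
            '<ph id="pi%d" ctype="x-%s">%s</ph>'
            % (self.count, escape(target), escape(content))
        )

pi_parser = PiToPh()
pi_parser.feed("like <?Trans-instruct use volume 2 here ?> picabo")
pi_parser.close()
pi_source = "<source>%s</source>" % "".join(pi_parser.parts)
print(pi_source)
```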

Segmentation

TODO: address the issue of segmentation by referring to another document, a work in progress, or whatever, but mention something.

General Structure

<file> Element

In an XLIFF document, an extracted HTML document is stored in a <file> element with the datatype set to html.

Whether a filter puts all the HTML files of a given project in a single XLIFF document (using several <file> elements) or utilizes one XLIFF document for each HTML source file is completely up to the tool.
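
The single-document option can be sketched in Python as follows. The function name build_xliff, the file names, and the id numbering are illustrative assumptions; the attributes original, source-language, and datatype are the ones XLIFF 1.1 requires on <file>.

```python
import xml.etree.ElementTree as ET

def build_xliff(documents, source_lang="en"):
    """Build one XLIFF document with one <file> per HTML source file.

    documents maps an original file name to its list of source strings.
    """
    xliff = ET.Element("xliff", {
        "version": "1.1",
        "xmlns": "urn:oasis:names:tc:xliff:document:1.1",
    })
    for original, strings in documents.items():
        file_el = ET.SubElement(xliff, "file", {
            "original": original,
            "source-language": source_lang,
            "datatype": "html",  # every extracted HTML document uses html
        })
        body = ET.SubElement(file_el, "body")
        for number, text in enumerate(strings, start=1):
            unit = ET.SubElement(body, "trans-unit", {"id": str(number)})
            ET.SubElement(unit, "source").text = text
    return xliff

doc = build_xliff({"index.html": ["Welcome"], "about.html": ["About us"]})
print(ET.tostring(doc, encoding="unicode"))
```

A tool that prefers one XLIFF document per HTML file would simply call the function once per file.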

 

Details by Elements and Attributes

The following table lists all the elements used in the various flavors of HTML and their properties:

In the table below, a dash (-) marks a cell that is empty in this draft.

Element      Inline?    Empty? PCDATA? Mixed?   Wrapper? Status      Notes
A            inline     -      -       -        wrapper  -           -
ABBR         inline     -      -       mixed    -        -           not inline in Trados5.5
ACRONYM      inline     -      -       mixed    -        -           -
ADDRESS      -          -      -       mixed    -        -           -
APPLET       inline     -      ???     ???      ???      deprecated  not inline in Trados5.5
AREA         -          empty  -       -        -        -           -
B            inline     -      -       mixed    -        -           -
BASE         -          empty  -       -        -        -           -
BASEFONT     -          empty  -       -        -        deprecated  -
BDO          inline     -      -       mixed    -        -           -
BGSOUND      -          ???    ???     ???      -        not 4.01    -
BIG          inline     -      -       mixed    -        -           -
BLOCKQUOTE   -          -      -       -        wrapper  -           -
BLINK        inline     -      -       mixed    -        not 4.01    not inline in Trados5.5
BODY         -          -      -       -        wrapper  -           -
BR           inline???  empty  -       -        -        -           -
BUTTON       inline     -      -       mixed    -        -           -
CAPTION      -          -      -       mixed    -        -           -
CENTER       -          ???    ???     ???      ???      deprecated  -
CITE         inline     -      -       mixed    -        -           -
CODE         inline     -      -       mixed    -        -           -
COL          -          empty  -       -        -        -           -
COLGROUP     -          -      -       -        wrapper  -           -
DD           -          -      -       mixed    -        -           -
DEL          inline???  -      -       mixed    -        -           not inline in Trados5.5
DFN          inline     -      -       mixed    -        -           -
DIR          -          -      -       ???      ???      deprecated  -
DIV          -          -      -       mixed    -        -           -
DL           -          -      -       -        wrapper  -           -
DT           -          -      -       mixed    -        -           -
EM           inline     -      -       mixed    -        -           -
EMBED        inline     -      ???     ???      ???      not 4.01    -
FIELDSET     -          -      -       mixed    -        -           -
FONT         inline     -      -       mixed    -        deprecated  -
FORM         -          -      -       -        wrapper  -           -
FRAME        -          empty  -       ???      ???      -           -
FRAMESET     -          -      -       ???      ???      -           -
H1           -          -      -       mixed    -        -           -
H2           -          -      -       mixed    -        -           -
H3           -          -      -       mixed    -        -           -
H4           -          -      -       mixed    -        -           -
H5           -          -      -       mixed    -        -           -
H6           -          -      -       mixed    -        -           -
HEAD         -          -      -       -        wrapper  -           -
HR           -          empty  -       -        -        -           -
HTML         -          -      -       -        wrapper  -           -
I            inline     -      -       mixed    -        -           -
IA           -          -      -       ???      ???      not 4.01    from Trados5.5 (no idea what it is)
IFRAME       inline     -      -       ???      ???      -           -
IMG          inline     empty  -       -        -        -           -
INPUT        inline     empty  -       -        -        -           not inline in Trados5.5
INS          inline     -      -       mixed    -        -           not inline in Trados5.5
ISINDEX      -          empty  -       -        -        deprecated  -
KBD          inline     -      -       mixed    -        -           -
LABEL        inline     -      -       mixed    -        -           not inline in Trados5.5
LEGEND       -          -      -       mixed    -        -           -
LI           -          -      -       mixed    -        -           -
LINK         -          empty  -       -        -        -           -
LISTING      -          -      -       ???      ???      not 4.01    -
MAP          inline     -      -       -        -        -           not inline in Trados5.5
MARQUEE      -          -      -       ???      ???      not 4.01    -
MENU         -          -      -       ???      ???      deprecated  inline in Trados5.5
META         -          empty  -       -        -        -           -
NOBR         inline???  -      -       -        -        not 4.01    -
NOEMBED      -          -      -       -        wrapper  not 4.01    -
NOFRAMES     -          -      -       -        wrapper  -           -
NOSCRIPT     -          -      -       -        wrapper  -           -
OBJECT       inline???  -      -       mixed??? -        -           -
OL           -          -      -       -        wrapper  -           -
OPTGROUP     -          -      -       -        wrapper  -           -
OPTION       -          -      PCDATA  -        -        -           -
P            -          -      -       mixed    -        -           -
PARAM        inline???  empty  -       -        -        -           -
PLAINTEXT    -          -      -       ???      ???      obsolete    do not use???
PRE          -          -      -       mixed    -        -           -
Q            inline     -      -       mixed    -        -           not inline in Trados5.5
RB           inline     -      -       mixed    -        ruby subset -
RBC          inline     -      -       mixed    -        ruby subset -
RP           inline     -      -       mixed    -        ruby subset -
RT           inline     -      -       mixed    -        ruby subset -
RTC          inline     -      -       mixed    -        ruby subset -
RUBY         inline     -      -       mixed    -        ruby subset -
S            inline     -      -       mixed    -        deprecated  -
SAMP         inline     -      -       mixed    -        -           not inline in Trados5.5
SCRIPT       inline???  -      PCDATA  -        -        -           not inline in Trados5.5
SELECT       inline     -      -       -        wrapper  -           not inline in Trados5.5 (makes sense to some extent: it's in a form as a list of entries...)
SMALL        inline     -      -       mixed    -        -           -
SPAN         inline     -      -       mixed    -        -           -
SPACER       ???        ???    -       ???      ???      not 4.01    -
STRIKE       inline     -      -       mixed    -        deprecated  -
STRONG       inline     -      -       mixed    -        -           -
STYLE        -          -      PCDATA  -        -        -           -
SUB          inline     -      -       mixed    -        -           -
SUP          inline     -      -       mixed    -        -           -
TABLE        -          -      -       -        wrapper  -           -
TBODY        -          -      -       -        wrapper  -           -
TD           -          -      -       mixed    -        -           -
TEXTAREA     inline     -      PCDATA  -        -        -           not inline in Trados5.5 (content is subflow)
TFOOT        -          -      -       -        wrapper  -           -
TH           -          -      -       mixed    -        -           -
THEAD        -          -      -       -        wrapper  -           -
TITLE        -          -      PCDATA  -        -        -           -
TR           -          -      -       mixed    -        -           -
TT           inline     -      -       mixed    -        -           -
U            inline     -      -       mixed    -        deprecated  -
UL           -          -      -       -        wrapper  -           -
VAR          inline     -      -       mixed    -        -           -
WBR          inline     empty  -       -        -        not 4.01    -
XML          -          -      -       ???      ???      not 4.01    -
XMP          -          -      -       ???      ???      not 4.01    -
<any other>  not inline -      -       -        -        -           -

 

Inline Elements

Inline elements are HTML elements that should be treated as codes embedded within a run of text, for example <b>, <em>, etc.

The following HTML elements should be treated as inline codes:

HTML DTD 'inline' elements: fontstyle: <tt>, <i>, <b>, <big>, <small> (+ deprecated: <strike>, <s>, <u>, <font>)
phrase: <em>, <strong>, <dfn>, <code>, <samp>, <kbd>, <var>, <cite>, <abbr>, <acronym>.
formctrl: <input>, <select>, <textarea>, <label>, <button>.
special: <a>, <img>, <object>, <br>, <script>, <map>, <q>, <sub>, <sup>, <span>, <bdo>.
and also: <ins>, <del>.
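
How such a list drives a filter can be sketched in Python. The sketch below breaks the text into units at every non-inline tag and keeps inline tags as codes. It uses <bpt>/<ept> for paired tags and <ph> for empty ones, which is only one of the possible strategies and is not settled by this draft; the class name Segmenter, the id numbering, and the short list of empty inline elements are assumptions for illustration.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

INLINE = {
    "tt", "i", "b", "big", "small", "strike", "s", "u", "font",
    "em", "strong", "dfn", "code", "samp", "kbd", "var", "cite",
    "abbr", "acronym", "input", "select", "textarea", "label", "button",
    "a", "img", "object", "br", "script", "map", "q", "sub", "sup",
    "span", "bdo", "ins", "del",
}
EMPTY_INLINE = {"br", "img", "input"}  # inline elements with no end tag

class Segmenter(HTMLParser):
    """Split HTML into text units at every non-inline tag."""

    def __init__(self):
        super().__init__()
        self.units, self.current, self.open_ids = [], [], []
        self.next_id = 1  # hypothetical running number for inline codes

    def flush(self):
        unit = "".join(self.current).strip()
        if unit:
            self.units.append(unit)
        self.current, self.open_ids = [], []

    def _new_id(self):
        code_id, self.next_id = self.next_id, self.next_id + 1
        return code_id

    def handle_starttag(self, tag, attrs):
        raw = escape(self.get_starttag_text())
        if tag in EMPTY_INLINE:
            self.current.append('<ph id="%d">%s</ph>' % (self._new_id(), raw))
        elif tag in INLINE:
            code_id = self._new_id()
            self.open_ids.append(code_id)
            self.current.append('<bpt id="%d">%s</bpt>' % (code_id, raw))
        else:
            self.flush()  # a non-inline tag ends the current unit

    def handle_endtag(self, tag):
        if tag in EMPTY_INLINE:
            return
        if tag in INLINE:
            if self.open_ids:
                self.current.append('<ept id="%d">%s</ept>'
                                    % (self.open_ids.pop(), escape("</%s>" % tag)))
        else:
            self.flush()

    def handle_data(self, data):
        self.current.append(escape(data))

def segment(html_text):
    parser = Segmenter()
    parser.feed(html_text)
    parser.close()
    parser.flush()
    return parser.units

units = segment("<p>Press <b>Start</b> now.</p><p>Done.</p>")
print(units)
```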

[Not sure how to treat 'specials': <script> is generally not seen as inline by translation tools... Also: <object> can have non-inline element (???)]

[Get a list of the practice in commercial tools. Default inline tags for TagEditor are not the same. How about SDLX, DejaVu, etc.? Need for a standard list of inline tags?]

Ambassador: a, abbr, acronym, b, basefont, bdo, big, blink, cite, code, dfn, em, font, i, img, kbd, nobr, q, s, samp, small, span, strike, strong, sub, sup, tt, u, var, wbr (+br optionally)

Default for Catalyst: h1, h2, h3, h4, h5, h6, code, img, i, b.

Rainbow 4: b, i, span, br, font, a, img, big, small, tt, em, strong, dfn, code, samp, kbd, var, cite, abbr, acronym, q, sub, sup, bdo, u, s, strike, blink, applet, iframe, del, ins, ruby, rt, rc, rp, rbc, rtc.

IMPORTANT:

There are several possible strategies for inline codes:

The first and third approaches keep an open-close structure that is reflected in the DOM tree, while <bpt>, etc. do not. Traditionally, translation tools have used the <bpt> model.

This also raises the question: how should translation tools react to a <mrk> they don't know about? Treat it like a <bpt>/<ept>?

<meta> Elements

The <meta> element can be used to carry many different kinds of information. During localization, some values need to remain untouched and some need to be translated.

Need a basic list of values for the attribute name that make the content attribute translatable.

Need a basic list of values for the attribute http-equiv that make the content attribute translatable.

In the table below, a dash (-) marks an attribute that is not set.

Value of name   Value of http-equiv   Value of content
-               keywords              To extract
-               content-language      Not to extract, but the filter should modify it if necessary
-               content-type          Not to extract, but the filter should modify it if necessary
-               <other value>         Not to extract [or to extract???]
generator       -                     Not to extract
author          -                     Not to extract???
progid          -                     Not to extract
date            -                     Not to extract???
<other values>  -                     To extract [or not???]

For example:

<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="keywords" content="localization tool, translation tool">
<meta name="test" content="meta data for test">

The HTML fragment above should be represented as follows:

restype correct values TBD

<trans-unit id="1" restype="x-meta-http-keywords">
 <source xml:lang="en">localization tool, translation tool</source>
</trans-unit>
<trans-unit id="2" restype="x-meta-name-test">
 <source xml:lang="en">meta data for test</source>
</trans-unit>
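
The decision logic for <meta> elements can be sketched in Python. The sets below follow the tentative choices discussed in this section; entries the draft still marks as open (author, date, and the defaults for other values) are assumptions here and may change. The restype values reuse the provisional x-meta-... names of the example and are likewise to be determined.

```python
from html.parser import HTMLParser

HTTP_EQUIV_EXTRACT = {"keywords"}
# generator and progid are firm; author and date are tentative in the draft.
NAME_SKIP = {"generator", "progid", "author", "date"}

class MetaFilter(HTMLParser):
    """Collect the translatable content of <meta> elements."""

    def __init__(self):
        super().__init__()
        self.extracted = []  # list of (restype, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        content = attrs.get("content")
        if not content:
            return
        http_equiv = (attrs.get("http-equiv") or "").lower()
        name = (attrs.get("name") or "").lower()
        if http_equiv:
            # Assumption: http-equiv values not listed are not extracted.
            if http_equiv in HTTP_EQUIV_EXTRACT:
                self.extracted.append(("x-meta-http-" + http_equiv, content))
        elif name and name not in NAME_SKIP:
            # Assumption: name values not listed are extracted.
            self.extracted.append(("x-meta-name-" + name, content))

meta_filter = MetaFilter()
meta_filter.feed(
    '<meta name="GENERATOR" content="Microsoft FrontPage 4.0">'
    '<meta name="ProgId" content="FrontPage.Editor.Document">'
    '<meta http-equiv="Content-Language" content="en-us">'
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
    '<meta http-equiv="keywords" content="localization tool, translation tool">'
    '<meta name="test" content="meta data for test">'
)
meta_filter.close()
print(meta_filter.extracted)
```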

Tables

Caption, row, column, cell, header row, footer row...

TODO

<img> Elements

The <img> element holds a reference to an image. The image itself is not stored in the XLIFF document, but its metadata and translatable text should be. In the following paragraph, the image has attributes that give the image source and an alternate text. The source should not be translated, but the alternate text might be: many browsers display the alternate text on mouse-over, so a case can be made that it is eligible for translation. Consider this example:

<p>My picture,
<img src="mthood.jpg" alt="This is a shot of Mount Hood" />
and there you have it.</p>

A good way to handle this is to start a new <ph> element for the <img> element, put the non-translatable attributes in namespaced attributes, and put the alternate text in a <sub> element, like this:

<trans-unit resname="p" id="d0e1">
<source xml:lang="EN">My picture,
<ph id="d0e3" ctype="image" tek:src="mthood.jpg"><sub ctype="x-alt">This is a shot of Mount Hood</sub></ph>
and there you have it.</source>
</trans-unit>

This is a good approach because the XLIFF schema allows extension attributes on the <ph> element.

The conversion can be achieved with the following XSLT template:

<xsl:template match="img">
 <ph xmlns="urn:oasis:names:tc:xliff:document:1.1"
  xmlns:tek="http://www.tektronix.com/TC"
  id="{generate-id()}"
  ctype="image">
  <xsl:for-each select="@*[not(name()='alt')]">
   <xsl:attribute name="{concat('tek:',local-name())}">
    <xsl:value-of select="." />
   </xsl:attribute>
  </xsl:for-each>
  <xsl:for-each select="@alt">
   <sub ctype="x-alt">
    <xsl:value-of select="." />
   </sub>
  </xsl:for-each>
 </ph>
</xsl:template>

Should the name of the image file be extracted as translatable?

SVG Images

Historically, images in HTML have been one of two raster formats, JPEG and GIF. This meant that the task of localizing text within images could prove complex. This was done:

All of those solutions usually meant additional overhead and the need for graphic artists to work on the images in addition to translators.

YS: Bryan, I've modified the text above to reflect my experience. I think not translating was never an option. We just have the PSD source file of the GIF/JPEG, or re-create it from scratch, or modify it in a bitmap-type editor. Also: XLIFF can help here too: there are PSD to XLIFF filters.

It is now possible to display Scalable Vector Graphics (.svg) images in browsers via HTML. SVG files are XML documents and have editable (therefore translatable) text. They can be used through an <object> element, or with XHTML, directly embedded in HTML using the XML namespace mechanism.

Such SVG data could be processed with XML standards such as XSLT. The representation of extracted SVG in XLIFF is outside the scope of this document, but a short example can show how easily SVG simplifies the translation tasks.

For instance, in an SVG image, the translatable strings are represented like this one:

<text id="XMLID_1_" transform="matrix(1 0 0 1 0 32.0503)">
 <tspan x="0" y="0" fill-rule="evenodd"
clip-rule="evenodd" font-family="'TimesNewRomanPSMT'"
font-size="9.25">TDR on indicator (80E04)</tspan></text>

And they are easily mapped to <trans-unit> elements like this:

<trans-unit id="A001-d0e499" resname="tspan">
<source>TDR on indicator (80E04)</source>
<target state="needs-translation">TDR on indicator (80E04)</target>
</trans-unit>
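
The extraction step can be sketched in Python with the standard xml.etree.ElementTree module. The function name extract_tspans and the id scheme are assumptions for illustration, and the SVG fragment is a shortened version of the example, wrapped in an <svg> root so that it parses on its own.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

svg = """<svg xmlns="http://www.w3.org/2000/svg">
 <text id="XMLID_1_" transform="matrix(1 0 0 1 0 32.0503)">
  <tspan x="0" y="0" font-size="9.25">TDR on indicator (80E04)</tspan></text>
</svg>"""

def extract_tspans(svg_text):
    """Map every non-empty <tspan> to a <trans-unit> element."""
    root = ET.fromstring(svg_text)
    units = []
    for number, tspan in enumerate(root.iter("{%s}tspan" % SVG_NS), start=1):
        text = (tspan.text or "").strip()
        if not text:
            continue
        # Hypothetical id scheme; resname records the source element name.
        unit = ET.Element("trans-unit",
                          {"id": "t%d" % number, "resname": "tspan"})
        ET.SubElement(unit, "source").text = text
        units.append(unit)
    return units

svg_units = extract_tspans(svg)
for unit in svg_units:
    print(ET.tostring(unit, encoding="unicode"))
```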

Yves: ideally we would need to have also the font, font size, etc.

After the <target> is localized, it can be transformed back into SVG:

<text id="A001-XMLID_1_" transform="matrix(1 0 0 1 0 32.0503)">
<tspan id="A001-d0e499" x="0" y="0" fill-rule="evenodd"
clip-rule="evenodd" font-family="Arial Unicode MS"
font-size="9.25">表示器(80E04 は) のTDR</tspan></text>

Objects and Param

Need a list of values for the name attribute that make the value attribute translatable

In some cases, not all the text in the value attribute is translatable (e.g. for HTMLHelp). How do we address this?

Subflow Text

Some inline HTML elements present the interesting challenge of having attributes that are translatable. For example, the <img> element has an optional alt attribute that holds the text description of the graphic.

Note: in some cases the content of inline elements may also be a subflow. Should we treat it as such?

HTML Forms

An HTML form can be likened to a dialog box. However, there are generally no coordinates associated with the controls. When extracted to XLIFF, each form should be mapped to a <group> element with its restype attribute set to dialog (or something else to avoid confusion?).

The <fieldset> element allows the HTML controls to be grouped together, using the content of the <legend> element to store the caption for the given group.

The <input> element can be used for different types of controls depending on its type attribute: text, password, checkbox, radio, submit, reset, file, hidden, image, and button. Translatable text can be found in the alt attribute. The label corresponding to the input control is in the same flow that contains the control.

The <button> element acts almost like the <input> element, but the caption text is the content of the element.

Should the content of <button> be treated like embedded text?

The <select> element represents a scrolling list, that can be rendered as a listbox, a drop-down menu, etc. It contains <optgroup> and <option> elements. In XLIFF, <select> is represented by a <group>, <optgroup> is also mapped to a <group>, and each <option> element has its corresponding <trans-unit>.
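
The mapping of <select> can be sketched in Python. The restype values (listbox, heading, listitem) are the ones used in the summary of form-related elements in this section; the function names and the id numbering are assumptions for illustration. The sketch expects well-formed input and does not handle the label attribute of <optgroup>.

```python
import xml.etree.ElementTree as ET

def map_select(select):
    """Map a <select> element to nested XLIFF <group> elements."""
    group = ET.Element("group", {"restype": "listbox"})
    _map_children(select, group, counter=[0])
    return group

def _map_children(parent, target, counter):
    for child in parent:
        if child.tag == "optgroup":
            # Each <optgroup> becomes a nested <group>.
            sub = ET.SubElement(target, "group", {"restype": "heading"})
            _map_children(child, sub, counter)
        elif child.tag == "option":
            # Each <option> gets its own <trans-unit>.
            counter[0] += 1
            unit = ET.SubElement(target, "trans-unit", {
                "id": str(counter[0]),  # hypothetical id scheme
                "restype": "listitem",
            })
            ET.SubElement(unit, "source").text = (child.text or "").strip()

select = ET.fromstring(
    '<select name="prof">'
    '<option value="none">--Please select a job title--</option>'
    '<optgroup label="Engineering">'
    '<option value="it_eng">Engineer</option>'
    '<option value="it_tech">Technician</option>'
    '</optgroup>'
    '</select>'
)
listbox = map_select(select)
print(ET.tostring(listbox, encoding="unicode"))
```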

The <textarea> element allows the user to input a block of text. A default text can be set in the content of the element.

The content of <textarea> seems to be pre-formatted (or is it because it's PCDATA only?)

The attributes rows and cols specify, respectively, the number of lines and columns allowed. Should they be carried in maxheight and maxwidth, or in another trans-unit attribute?

The element <isindex> is deprecated. An XLIFF filter should treat it as inline code. The prompt attribute of <isindex> is translatable.

The <label> element is used to associate a label with a control that does not have an implicit label. The attribute title is translatable. The attribute accesskey should also be localized.

Summary of the Form-related elements and attributes:

HTML Elements                       XLIFF Mapping
<form>                              <group restype='dialog'>
<fieldset>                          <group restype='groupbox'>
<legend>                            <trans-unit restype='caption'>
<input type="text">                 <ph> or <x/>
<input type="password">             <ph> or <x/>
<input type="checkbox">             <ph> or <x/>
<input type="radio">                <ph> or <x/>
<input type="submit">               <ph> or <x/>
<input type="reset">                <ph> or <x/>
<input type="file">                 <ph> or <x/>
<input type="hidden">               <ph> or <x/>
<input type="image">                <ph ctype='image'> or <x ctype='image'/>
<input type="button">               <ph> or <x/>
<button>                            <ph> or <x/>
<select>                            <group restype='listbox'>
<optgroup>                          <group restype='heading'>
<option>                            <trans-unit restype='listitem'>
<textarea cols='20' rows='2'>       <trans-unit restype='textbox' xml:space='preserve' size-unit='x-cell' maxwidth='20' maxheight='2'> (See note below table)
<isindex>                           <ph> or <x/>
<label>                             <trans-unit restype='label'>

HTML Attributes                     XLIFF Mapping
prompt                              <trans-unit> or <sub>
label                               <trans-unit> or <sub>
title                               <trans-unit> or <sub>
accesskey                           <trans-unit> or <sub>
standby                             <trans-unit> or <sub>
value for <option>                  Not translatable
value for <input type='text'>       <trans-unit> or <sub>
value for <input type='password'>   Not translatable
value for <input type='checkbox'>   Not translatable
value for <input type='radio'>      Not translatable
value for <input type='submit'>     <trans-unit> or <sub>
value for <input type='reset'>      <trans-unit> or <sub>
value for <input type='file'>       Not translatable
value for <input type='hidden'>     <trans-unit> or <sub> (In this case, value may or may not be translatable, depending on the intended use of <input>)
value for <input type='image'>      Not translatable
value for <input type='button'>     <trans-unit> or <sub>
size (input)                        in pixel except when type=text or password then in chars
maxlength (input)                   in pixel except when type=text or password then in chars

Note: problem with 1.1: having one size-unit value for row and another for col does not make sense, because both maxwidth and maxheight are in the same element and we would need two size-unit attributes to set the value. 'row' or 'col' needs to be replaced (or complemented) by 'rowcol' or something like that.

Here are some examples of forms and their representation in XLIFF. In the HTML code the extractable content is underlined, and its translatable parts are in black and bold. The translatable attributes are in blue and bold.

Example:

<form method="post">
 <fieldset>
  <legend accesskey='U'>Required Information</legend>
  <label for='gname' accesskey='G'>Given Name: </label>
  <input type='text' name='gname' value='--given name--'><br>
  <label for='fname' accesskey='F'>Family Name: </label>
  <input type='text' name='fname' value='--family name--'>
 </fieldset>
 <fieldset>
  <legend>Optional Information</legend>
  Job Title: <select name="prof" >
   <option selected label="Profession" value="none">--Please select a job title--</option>
   <optgroup label="Engineering">
    <option label="1.1. Engineer" value="it_eng">Engineer</option>
    <option label="1.2. Technician" value="it_tech">Technician</option>
   </optgroup>
   <optgroup label="Translation">
    <option label="2.1. Translator" value="tr_translator">Translator</option>
    <option label="2.2. Editor" value="tr_editor">Editor</option>
    <option label="2.3. Proofer" value="tr_proofer">Proofer</option>
   </optgroup>
   <optgroup label="Management">
    <option label="3.1. Project Manager" value="mg_pm">Project Manager</option>
    <option label="3.2. Translation Coordinator" value="mg_coord">Translation Coordinator</option>
    <option label="3.3. Project Assistant" value="tr_pa">Project Assistant</option>
   </optgroup>
  </select> 
 </fieldset>
 <p><input type="submit" value="Submit" name="B1"> <input type="reset" value="Reset" name="B2"></p>
</form>

Corresponding XLIFF representation:

<group restype='dialog'>
 TODO
</group>

XForms Forms

With the advent of XHTML the concept of forms has been extended and generalized through the creation of XForms [XForms].

XForms was designed to make the coding of HTML/XHTML forms easier and to be output independent. A well-designed XForms form can be rendered in VoiceXML as well as in HTML or XHTML, or as a valid, well-formed XML document. In addition, validation constraints can be specified for input values, something which requires extensive scripting in HTML.

The goal of XForms is to provide the 20% of necessary functionality in order to eliminate 80% of the need for scripting. An XForms processor is needed to render an XForms form into an instance.

XForms offers several controls:

Element HTML Equivalent Description
<input> <input type="text"> For entry of small amounts of text
<textarea> <textarea> For entry of large amounts of text
<secret> <input type="password"> For entry of sensitive information
<output> none For inline display of any instance data
<range> none For smooth "volume control" selection of a value
<upload> <input type="file"> For upload of file or device data
<trigger> <button> For activation of form events
<submit> <input type="submit"> For submission of form data
<select> <select multiple="multiple"> or multiple <input type="checkbox"> For selection of zero, one, or many options
<select1> <select> or multiple <input type="radio"> For selection of just one option among several

The content of the following XForms elements is translatable:

And the following items may be translatable depending on the context:

Example 1 of XForms entries:

<input ref="po/address/street1">
 <label>Street</label>
 <hint>Please enter the number and street name</hint>
</input>
TODO

Corresponding XLIFF extraction for example 1:

TODO

We also need to take into account how the XForms WG envisions localization being done, and present examples illustrating this.

Frames

TODO

Bidirectional Markers

Some languages, such as Arabic and Hebrew, may require the use of bidirectional ("bidi") markers to help user agents display the text correctly. Unicode defines five markers for this:

RLE Right-to-Left Embedding
LRE Left-to-Right Embedding
RLO Right-to-Left Override
LRO Left-to-Right Override
PDF Pop Display Formatting

Bidi functions can be performed using these Unicode special characters. However, when the text is stored in a marked-up document, it is strongly recommended to use markup rather than characters.

HTML provides the <bdo> element and the dir attribute for these functions:

dir="rtl" Right-To-Left
dir="ltr" Left-To-Right
<bdo dir="rtl"> Right-to-Left Override
<bdo dir="ltr"> Left-to-Right Override
</bdo> Pop Display Formatting
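
The correspondence listed above can be applied mechanically when a text still carries the Unicode control characters. The Python sketch below is a minimal illustration under stated assumptions: the function name bidi_chars_to_markup is hypothetical, <span> is chosen here as the carrier for the dir attribute of embeddings, and the input is assumed to be already HTML-escaped text.

```python
# Unicode bidi control characters and the markup pair used to replace them.
OPENERS = {
    "\u202B": ('<span dir="rtl">', "</span>"),  # RLE, Right-to-Left Embedding
    "\u202A": ('<span dir="ltr">', "</span>"),  # LRE, Left-to-Right Embedding
    "\u202E": ('<bdo dir="rtl">', "</bdo>"),    # RLO, Right-to-Left Override
    "\u202D": ('<bdo dir="ltr">', "</bdo>"),    # LRO, Left-to-Right Override
}
PDF = "\u202C"  # Pop Display Formatting


def bidi_chars_to_markup(text):
    """Replace bidi control characters with <span dir> / <bdo dir> markup."""
    out = []
    pending = []  # closing tags for the embeddings/overrides still open
    for ch in text:
        if ch in OPENERS:
            open_tag, close_tag = OPENERS[ch]
            out.append(open_tag)
            pending.append(close_tag)
        elif ch == PDF:
            # A PDF closes the most recent embedding or override;
            # an unmatched PDF is dropped.
            if pending:
                out.append(pending.pop())
        else:
            out.append(ch)
    # Close anything left open at the end of the text.
    while pending:
        out.append(pending.pop())
    return "".join(out)
```

The stack is needed because PDF does not say which marker it closes; only the nesting order determines whether </span> or </bdo> must be written.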

In order to carry the correct presentation information, XLIFF must provide a way to specify these bidi marks.

How to deal with the dir attribute, in table, paragraph, cell, etc.? ==> possible solution: xhtml:dir in group, trans-unit and target.

For example:

<p>The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" 
in Hebrew.</p>

The HTML fragment above should be represented as follows:

<trans-unit id="1">
 <source xml:lang="en">The title says "<bpt id="1" html:dir='rtl'>&lt;span dir="rtl"></bpt>text...<ept id="1">&lt;/span></ept>" in Hebrew.</source>
</trans-unit>

<br> Element

The <br> element is used to mark a line break. It should be mapped to a <ph> or <x/> element with the ctype attribute set to "lb".

For example:

<p>First line<br>second line</p>

The HTML fragment above should be represented as follows:

<source>First line<ph id="1" ctype="lb">&lt;br></ph>second line</source>

or, as follows:

<source>First line<x id="1" ctype="lb"/>second line</source>
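
As an illustration of the <x/> variant, the Python sketch below replaces each <br> in a fragment with a numbered placeholder. The function name br_to_x and the first_id parameter are assumptions of this example, and the surrounding text is assumed to be already XML-escaped; a real filter would work on parsed HTML rather than on a regular expression.

```python
import itertools
import re

# Matches <br>, <BR>, <br/> and <br /> (no attributes handled in this sketch).
BR = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_to_x(fragment, first_id=1):
    """Replace every <br> with <x id="n" ctype="lb"/>, numbering from first_id."""
    ids = itertools.count(first_id)
    return BR.sub(lambda m: '<x id="%d" ctype="lb"/>' % next(ids), fragment)
```

Each placeholder receives its own id so that a target segment can reorder or drop line breaks without ambiguity.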

Note that <br> is not processed as an isolated tag (<it>, which represents a beginning or end tag without its counterpart), because the <br> element is defined as empty (<br/> in XHTML).

Languages Switch

The content of an element can include runs of text in different languages. These can be marked up so that authoring tools can process the text accordingly, for example by switching dictionaries when spell-checking.

For example:

<p>She added that "<span lang='fr'>je ne sais quoi</span>" that made her casserole absolutely delicious.</p>

Corresponding XLIFF:

How do we mark change of language within the <source> elements? Is it needed?

<source xml:lang='en'>She added that "<bpt id='1'>&lt;span lang='fr'></bpt>je ne sais quoi<ept id='1'>&lt;/span></ept>" that made her casserole absolutely delicious.</source>
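
The inline encoding used in the fragment above (opening tag in <bpt>, closing tag in <ept>, both sharing one id) can be produced as in the following Python sketch. The function name wrap_inline is hypothetical. The sketch escapes both < and > through the standard xml.sax.saxutils.escape function, so its output shows &gt; where the example above leaves > as is; both forms are well-formed XML. It does not answer the open question of how the language change itself should be marked.

```python
from xml.sax.saxutils import escape


def wrap_inline(open_tag, content, close_tag, code_id):
    """Encode an inline HTML element as a <bpt>/<ept> pair around its content."""
    return '<bpt id="{0}">{1}</bpt>{2}<ept id="{0}">{3}</ept>'.format(
        code_id, escape(open_tag), escape(content), escape(close_tag)
    )
```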

Pre-formatted Content

The <pre> element must be marked with the xml:space="preserve" attribute.

For example, the following HTML fragment:

<pre>First line
and second line</pre>

Should be represented as:

<trans-unit id="1" xml:space="preserve">
 <source xml:lang="en">First line
and second line</source>
</trans-unit>
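
A minimal Python sketch of this mapping, using the standard xml.etree.ElementTree module, is given below. The function name pre_to_trans_unit and its parameters are assumptions of this example; the point is only that the text of <pre> is copied unchanged and that xml:space="preserve" is set on the <trans-unit>.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def pre_to_trans_unit(text, unit_id, source_lang="en"):
    """Build a <trans-unit> for the content of a <pre> element."""
    unit = ET.Element(
        "trans-unit",
        {"id": str(unit_id), "{%s}space" % XML_NS: "preserve"},
    )
    source = ET.SubElement(unit, "source", {"{%s}lang" % XML_NS: source_lang})
    source.text = text  # line breaks and spaces are kept exactly as given
    return ET.tostring(unit, encoding="unicode")
```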

How about normal paragraphs with a reference to a CSS style that has the preformatted style on?

Script Content

Script content can be found in the <script> element as well as in the following attributes: TODO

 

Style Content

TODO

A. Contributions

The following people have contributed to this document:

B. Example of XSLT Use to Process HTML

XSLT can be used to convert an HTML document into XLIFF and back.

Note: the actual HTML mapping to XLIFF will be modified to reflect the guidelines of this document. It is only a temporary example for now.

  1. HTML source document:
    ExampleXSLTUse_1_Source.htm
  2. XSL Transformation template to convert the HTML document into XLIFF:
    ExampleXSLTUse_2_xhtml2xliff.xsl
  3. Result of the transformation: the XLIFF document before translation:
    ExampleXSLTUse_3_BeforeTrans.xlf
  4. XLIFF document after the translation:
    ExampleXSLTUse_4_AfterTrans.xlf
  5. XSL Transformation template to convert the XLIFF document back into HTML:
    ExampleXSLTUse_5_xliff2xhtml.xsl
  6. Final result, the translated HTML document:
    ExampleXSLTUse_6_Translated.htm

C. Pre-Processing HTML Files

Perl is a powerful cross-platform programming language that can be used to fix up an HTML document that is not well-formed, for example by adding quotes to any unquoted attribute values.

  1. Original HTML document (invalid):
    ExamplePerlUse_1_Before.htm
  2. Perl script to fix up some of the issues:
    ExamplePerlUse_2_PerlFixer.htm
  3. Output file after being processed by the Perl script:
    ExamplePerlUse_3_After.htm
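
To give an idea of the kind of fix-up involved, the sketch below quotes unquoted attribute values. It is written in Python rather than Perl and is not the script listed above; the function name quote_unquoted_attributes is hypothetical. It is a regular-expression approximation that assumes no < or > appears inside attribute values, and a dedicated tool such as HTML Tidy [HTMLTidy] is more robust.

```python
import re

# A start tag: '<', a letter, then anything up to the next '>'.
TAG = re.compile(r"<[A-Za-z][^<>]*>")
# An attribute: name, '=', then a quoted value or a run of unquoted characters.
ATTR = re.compile(r"""(\s+[\w:-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")


def _quote(match):
    name, value = match.group(1), match.group(2)
    if value[0] in "\"'":
        return name + "=" + value  # already quoted: keep as is
    return name + '="' + value + '"'


def quote_unquoted_attributes(html):
    """Add double quotes around unquoted attribute values inside start tags."""
    return TAG.sub(lambda tag: ATTR.sub(_quote, tag.group(0)), html)
```

Quoted values are matched first so that an = sign inside a quoted value is never mistaken for a new attribute.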

D. References

[HTMLTidy]
HTML Tidy, HTML Clean-up Open source Utility
http://www.w3.org/People/Raggett/tidy/
[ISO]
International Organization for Standardization Web site.
[OASIS]
Organization for the Advancement of Structured Information Standards Web site.
[Perl]
The Perl Programming Language
http://www.perl.org/
[RFC 3066]
RFC 3066 Tags for the Identification of Languages. IETF (Internet Engineering Task Force), Jan 2001.
[Unicode]
Unicode Consortium Web site.
[W3CQA-bidi]
Q&A: (X)HTML and bidi formatting codes versus mark-up.
[XForms]
XForms, the Next Generation of Web Forms
http://www.w3.org/MarkUp/Forms/