Re: [xliff-comment] HTML extraction examples

On Tue, May 9, 2017 at 7:35 PM, Simone Chiaretta <simone@piyosailing.com> wrote:

One more thing:
By "extraction/merging best practices", do you mean this? http://docs.oasis-open.org/xliff/xliff-core/v2.1/csprd03/xliff-core-v2.1-csprd03.html#d0e10990

--
Simone Chiaretta
Microsoft MVP ASP.NET - ASPInsider
Blog: http://codeclimber.net.nz
RSS: http://feeds2.feedburner.com/codeclimber
twitter: @simonech

On 9 May 2017 at 19:00:28, Ján Husarčík (jan.husarcik@gmail.com) wrote:

Hello all,

here are my two cents as somebody on the "receiving end" (LSP).

The printed version of XLIFF 2.0 runs to 135 pages, compared to 71 for v1.2. Inserting examples directly into the spec would lengthen it further and might prove difficult to manage.

Putting them in SVN, alongside the existing test suite (as proposed in the original post), would be more maintainable. That way the collection could contain not just fragments but whole (commented) files at different stages of the life cycle. Different file types could be included as well.

Other than that, Yves has already listed the best practices, and I'm glad he put the CDATA point first :). Using CDATA might seem like a simple way to handle inline codes, but you lose all the advantages of proper extraction.
You can represent block elements using (nested) groups and units (table/row/cell as group/group/unit), and inline codes using <ph/>, a <pc> pair, or <sc/> and <ec/>. Please also consider the updated extraction/merging best practices in the latest XLIFF 2.1 draft.
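
A minimal sketch of the table/row/cell mapping described above, wrapped in Python only so that the XLIFF can be parsed and checked. The ids, the name values, and the sample HTML are invented for illustration; they are assumptions, not taken from the specification or from any existing extractor.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

# Hypothetical extraction of this HTML fragment:
#   <table><tr><td>Price in <b>euros</b></td><td>Ships <br/>worldwide</td></tr></table>
# table -> group, tr -> nested group, td -> unit; <b> -> <pc>, <br/> -> <ph/>.
TABLE_XLIFF = """\
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en">
 <file id="f1">
  <group id="g1" name="table">
   <group id="g2" name="tr">
    <unit id="u1" name="td">
     <segment>
      <source>Price in <pc id="1">euros</pc></source>
     </segment>
    </unit>
    <unit id="u2" name="td">
     <segment>
      <source>Ships <ph id="1"/>worldwide</source>
     </segment>
    </unit>
   </group>
  </group>
 </file>
</xliff>
"""


def unit_paths(xliff_text):
    """Return (unit id, [names of enclosing groups]) for each unit, in document order."""
    root = ET.fromstring(xliff_text)
    found = []

    def walk(node, groups):
        for child in node:
            tag = child.tag.split("}")[-1]  # strip the namespace part
            if tag == "group":
                walk(child, groups + [child.get("name")])
            elif tag == "unit":
                found.append((child.get("id"), groups))

    for file_el in root.findall("x:file", NS):
        walk(file_el, [])
    return found


print(unit_paths(TABLE_XLIFF))
```

The block structure of the HTML survives as group nesting, so a merger (or a translator's tool) can still tell that both units are cells of the same row, while only the inline markup appears inside <source>.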

Do not forget the editing hints, which help prevent technical issues during merging, and the context attributes (e.g. disp*, type), which make life easier for the translator.

However, a lot depends on your particular situation:
- CMS capabilities (content fragmentation, multimedia, available metadata; is the native format well-formed?)
- CAT tool/LSP capabilities (do they support the features/modules you plan to use?)
- Your technology stack (do you plan just to extract/merge? Do you have terminology management, use translation candidates, or run some kind of enricher? Can you make use of ITS metadata?)
- Will you merge with a skeleton or reconstruct the target file from what's available in the XLIFF?
- Will the extractor also perform segmentation (e.g. <segment>paragraph</segment> vs. <segment>sentence</segment>)?
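
To make the last point concrete, here is a small sketch of the same paragraph extracted with and without sentence segmentation. The unit and segment ids and the sample text are hypothetical, and each string is a bare <unit> fragment rather than a complete XLIFF document.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

# Extractor leaves segmentation to a later tool: one segment per paragraph.
PARAGRAPH_LEVEL = """\
<unit xmlns="urn:oasis:names:tc:xliff:document:2.0" id="u1">
 <segment>
  <source>First sentence. Second sentence.</source>
 </segment>
</unit>
"""

# Extractor segments by sentence: same unit, two segments.
SENTENCE_LEVEL = """\
<unit xmlns="urn:oasis:names:tc:xliff:document:2.0" id="u1">
 <segment id="s1">
  <source>First sentence. </source>
 </segment>
 <segment id="s2">
  <source>Second sentence.</source>
 </segment>
</unit>
"""


def segment_sources(unit_xml):
    """Return the source text of each segment of a unit, in order."""
    unit = ET.fromstring(unit_xml)
    return ["".join(src.itertext()) for src in unit.findall("x:segment/x:source", NS)]


print(segment_sources(PARAGRAPH_LEVEL))
print(segment_sources(SENTENCE_LEVEL))
```

Either way the unit carries the same paragraph; joining the segment sources of the second form gives back the text of the first. The difference is only in who decides where the sentence boundaries are.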

Regards,
Jan

On Tue, May 9, 2017 at 1:46 PM, Simone Chiaretta <simone@piyosailing.com> wrote:

Thank you very much for the pointers.

I had already found and used some of them, but since it takes a lot of hunting around, my suggestion was to add them more prominently somewhere in the spec.

It’s true that the general principles are the same in v1.2 and 2.0, but v2 adds a lot more possibilities, such as originalData, references, and the FS module. Adding possibilities is one thing; explaining how best to use them is another :)

Simone

On 9 May 2017 at 13:39:08, Yves (yves@opentag.com) wrote:

Hi Simone,

+1 on that. It’s true that there are probably not enough examples in the specification.

Some of the existing ones, however, do use HTML, especially in the section on inline codes.

For instance the examples for the sub-flows: http://docs.oasis-open.org/xliff/xliff-core/v2.0/xliff-core-v2.0.html#subflowsdesc

If it helps, a few other examples can be found in the Okapi Framework implementation.

There are two samples in HTML, with the originals and the XLIFF2 outputs:

https://bitbucket.org/okapiframework/xliff-toolkit/src/master/deployment/maven/data/xliff-lib/samples/

The few pointers I can think of, from experience:

- Do not just extract the HTML content into a CDATA section.

- Only inline codes should appear inside XLIFF units (as XLIFF inline codes): that is, <b> but not <p>.

- If possible, use sub-flows for text embedded in HTML tags (e.g. alt or title text).

- If possible, don’t use <ph/> for paired codes; use <pc>…</pc> instead.
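
A minimal sketch of what these pointers could look like for one paragraph, wrapped in Python only so the XLIFF can be parsed and checked. The ids, the file name btnStart.png, and the "[u1-alt]" placeholder inside the stored <img> tag are illustrative assumptions (a convention a hypothetical extractor might use to reinsert the translated alt text), not something the specification prescribes.

```python
import xml.etree.ElementTree as ET

XLIFF = "urn:oasis:names:tc:xliff:document:2.0"
NS = {"x": XLIFF}

# Hypothetical extraction of:
#   <p>Click <b>Start</b> to begin: <img src="btnStart.png" alt="Start button"/></p>
# <p> is structure, so it becomes the unit itself; <b> is a paired inline code (<pc>);
# the alt text is a sub-flow stored in its own unit and referenced via subFlows.
HTML_XLIFF = """\
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en">
 <file id="f1">
  <unit id="u1-alt">
   <segment>
    <source>Start button</source>
   </segment>
  </unit>
  <unit id="u1">
   <originalData>
    <data id="d1">&lt;b&gt;</data>
    <data id="d2">&lt;/b&gt;</data>
    <data id="d3">&lt;img src="btnStart.png" alt="[u1-alt]"/&gt;</data>
   </originalData>
   <segment>
    <source>Click <pc id="1" dataRefStart="d1" dataRefEnd="d2">Start</pc> to begin: <ph id="2" dataRef="d3" subFlows="u1-alt"/></source>
   </segment>
  </unit>
 </file>
</xliff>
"""


def inline_codes(xliff_text, unit_id):
    """Return (element name, id) for each inline code in the unit's first source."""
    root = ET.fromstring(xliff_text)
    for unit in root.iter("{%s}unit" % XLIFF):
        if unit.get("id") == unit_id:
            source = unit.find("x:segment/x:source", NS)
            return [(el.tag.split("}")[-1], el.get("id"))
                    for el in source.iter() if el is not source]
    return []


def subflow_text(xliff_text):
    """Map each inline code id that declares sub-flows to the referenced units' source text."""
    root = ET.fromstring(xliff_text)
    units = {u.get("id"): u for u in root.iter("{%s}unit" % XLIFF)}
    found = {}
    for code in root.iter():
        refs = code.get("subFlows")
        if refs:
            found[code.get("id")] = [
                "".join(units[ref].find("x:segment/x:source", NS).itertext())
                for ref in refs.split()
            ]
    return found


print(inline_codes(HTML_XLIFF, "u1"))
print(subflow_text(HTML_XLIFF))
```

Keeping the original tags in <originalData> is one option for merging without a skeleton lookup; an extractor that relies on a skeleton could omit the dataRef attributes and keep only the <pc> and <ph/> codes.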

Also, a draft of the old “XLIFF 1.2 Representation Guide for HTML” is available. It was written for XLIFF 1.2, but most of the principles carry over to 2.0. You can find it here: http://docs.oasis-open.org/xliff/v1.2/xliff-profile-html/xliff-profile-html-1.2-cd02.html

I hope that helps.

-yves

From: Simone Chiaretta [mailto:simone@piyosailing.com]
Sent: Tuesday, May 9, 2017 4:58 AM
To: xliff-comment@lists.oasis-open.org
Subject: [xliff-comment] HTML extraction examples

Dear all,

I’m implementing an extractor for a CMS, and from reading the specification it’s not very clear what the right way is to extract a piece of HTML to XLIFF.

I understand that extraction is a very personal and application-specific matter, so it probably shouldn’t be standardised in the spec. It would still be helpful to add, either as notes or in the test suite, examples of how HTML fragments are to be converted into XLIFF.

Regards,

Simone
