Re: Notes on localization - analysis of STIX and applicability to CSAF

FYI, I did some further research on localization, specifically by conferring with our globalization team.

Several insights, and some conclusions.

The previous proposal I outlined suffered from a fatal flaw: if a user of the CSAF JSON format adds customizations that need to be localized, how would a localization processor know which fields to handle?
Our globalization team typically takes as input a document to be localized and a specification of what is to be localized, runs them through their tooling and processes, and produces the localized form.

Documents themselves are not multi-lingual.
Our globalization team, at least, would not benefit from the CSAF team defining a format that captures localized data. To the extent that they need that, they already use XLIFF for intermediate steps - steps which don't correspond to anything talked about so far.

The localization that we do for a product is not the same as that which might occur for CSAF, because a product team controls the creation of translations that go with a release. With CSAF, translations may happen downstream from that which is translated.

This means that we need some way of indicating that a document is a translation rather than an original.

Conclusion / proposals:

Presuming a localization model wherein there's a master version of a document, and translations of that:

The CSAF JSON model should include a property defining the source language.
The model should also define a single property setting the current language.

We should investigate some means of formally specifying which properties of the document get translated. Perhaps this is an annotation in the JSON schema? Since downstream users of the spec may define additional fields, perhaps the document metadata has a field that enumerates additional translation targets?

These three conclusions seem radically simpler than trying to support a fully multi-lingual JSON format, don't reinvent the XLIFF model, and will probably work well with existing practices.
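
To make the first two proposals concrete, here is a minimal sketch of how a consumer might tell a translation from an original. The property names "source_lang", "lang", and "additional_translation_targets" are placeholders I made up for illustration; nothing like them exists in the current schema.

```python
import json

# Hypothetical CSAF-like document. All three metadata property names
# below are illustrative assumptions, not part of any agreed schema.
doc = json.loads("""
{
  "document": {
    "title": "Beispieltitel",
    "source_lang": "en",
    "lang": "de",
    "additional_translation_targets": ["/x_vendor_notes[]/text"]
  }
}
""")


def is_translation(doc):
    """Return True when the current language differs from the source language."""
    meta = doc["document"]
    # Assumption: a missing "lang" means the document is still in its source language.
    return meta.get("lang", meta["source_lang"]) != meta["source_lang"]


print(is_translation(doc))
```

A localization processor could read the list of additional targets alongside whatever the schema itself marks as translatable.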

Eric.

On Mon, Jun 4, 2018 at 6:21 PM, Eric Johnson <eric@tibco.com> wrote:

In furthering our knowledge of how best to do localization for CSAF, I've reviewed:
https://docs.google.com/document/d/1ShNq4c3e1CkfANmD9O--mdZ5H0O_GLnjN28a_yrEaco/edit#heading=h.cfz5hcantmvx

("Language content" - section 4.2 of STIX version 2.1 - last updated 27 April, 2018)

As noted in section 4.2, the language content must not have any relationships with other "data objects" of the STIX document. In fact, it is represented in such a way that it could be stored completely separately from the rest of a STIX document. The file representation for STIX (including language-content) is apparently defined in TAXII, which I've not reviewed for this updated STIX 2.1 version. So the STIX approach could definitely be similar to what we've discussed in the CSAF TC: that of having an external translation document.

The effort I've been working on is the JSON Schema representation of a CVRF 1.2 equivalent, whereas STIX defines an object model using JSON which is then separately mapped to a file format. It is definitely possible that some discrepancies could emerge based on my limited analysis. I checked, but did not find any "language-content" examples in the GitHub repository for the stix2 schema that might help clarify.

By way of example, supposing this is the STIX content to be localized:

{
  "type": "campaign",
  "id": "campaign--12a111f0-b824-4baf-a224-83b80237a094",
  "lang": "en",
  "created": "2017-02-08T21:31:22.007Z",
  "modified": "2017-02-08T21:31:22.007Z",
  "name": "Bank Attack",
  "description": "More information about bank attack"
}

... then this is the "language_content" providing a possible localization:

{
  "type": "language-content",
  "id": "language-content--b86bd89f-98bb-4fa9-8cb2-9ad421da981d",
  "created": "2017-02-08T21:31:22.007Z",
  "modified": "2017-02-08T21:31:22.007Z",
  "object_ref": "campaign--12a111f0-b824-4baf-a224-83b80237a094",
  "object_modified": "2017-02-08T21:31:22.007Z",
  "contents": {
    "de": {
      "name": "Bank Angriff 1",
      "description": "Weitere Informationen über Banküberfall"
    },
    "fr": {
      "name": "Attaque Bank 1",
      "description": "Plus d'informations sur la crise bancaire"
    }
  }
}
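
As a rough illustration of how a consumer might combine the two objects above, here is a small sketch. The function name and the decision to reject a translation whose "object_modified" does not match are my own assumptions about reasonable behavior, not something I have confirmed against the STIX text.

```python
campaign = {
    "type": "campaign",
    "id": "campaign--12a111f0-b824-4baf-a224-83b80237a094",
    "lang": "en",
    "created": "2017-02-08T21:31:22.007Z",
    "modified": "2017-02-08T21:31:22.007Z",
    "name": "Bank Attack",
    "description": "More information about bank attack",
}

language_content = {
    "type": "language-content",
    "object_ref": "campaign--12a111f0-b824-4baf-a224-83b80237a094",
    "object_modified": "2017-02-08T21:31:22.007Z",
    "contents": {
        "de": {
            "name": "Bank Angriff 1",
            "description": "Weitere Informationen über Banküberfall",
        }
    },
}


def apply_language_content(obj, lc, lang):
    """Return a copy of obj with the translated fields for lang overlaid."""
    if lc["object_ref"] != obj["id"]:
        raise ValueError("language-content refers to a different object")
    # Assumption: a translation only applies to the exact version it was made from.
    if lc["object_modified"] != obj["modified"]:
        raise ValueError("translation was made against a different version")
    translated = dict(obj)  # shallow copy, the original stays untouched
    fields = lc["contents"].get(lang)
    if fields:
        translated.update(fields)
        translated["lang"] = lang  # assumption: mark the copy with its new language
    return translated


german = apply_language_content(campaign, language_content, "de")
print(german["name"])
```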

Several design choices to note:
- STIX items apparently have a unique identifier (such as "campaign--12a..." in the example above)
- The modified time of the object is apparently critical for determining the appropriateness of a given translation
- There's a lot of metadata here about the source document, and the tracking of the translation itself
- Different languages are optionally combined into a single translation.

How does that align with CSAF?
- As mentioned above, the design allows for the possibility of storing translations externally from the document itself.
- There's a need to be able to identify that the translation corresponds to what was translated.
- Multiple languages supported

For context, here's what we might want to translate in CSAF JSON:
- /vulnerabilities[]/acknowledgements[]/description
- /vulnerabilities[]/id/text
- /vulnerabilities[]/involvements[]/description
- /vulnerabilities[]/notes[]/text
- /vulnerabilities[]/remediations[]/description
- /vulnerabilities[]/remediations[]/entitlements[]
- /vulnerabilities[]/threats[]/description
- /vulnerabilities[]/title
- /document_notes[]/text
- /document_title
- /document_references[]/description

(Note that the above refer to "document_notes" and "document_title", although our intent is to change these to "document/notes" and "document/title". They are written this way to be consistent with the current JSON schema available on GitHub.)
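
For what it's worth, collecting the strings at paths like these is straightforward. The following is a minimal sketch, assuming the "[]" suffix simply means "every element of this array"; the sample document is invented and only loosely follows the current schema.

```python
def extract(doc, path):
    """Yield every string found at a path such as '/vulnerabilities[]/notes[]/text'."""
    segments = [s for s in path.split("/") if s]
    yield from _walk(doc, segments)


def _walk(node, segments):
    if not segments:
        if isinstance(node, str):
            yield node
        return
    head, rest = segments[0], segments[1:]
    is_array = head.endswith("[]")
    key = head[:-2] if is_array else head
    if not isinstance(node, dict) or key not in node:
        return  # the path does not exist in this document, nothing to yield
    child = node[key]
    if is_array:
        if isinstance(child, list):
            for item in child:
                yield from _walk(item, rest)
    else:
        yield from _walk(child, rest)


# Invented sample document for illustration only.
sample = {
    "document_title": "Example advisory",
    "document_notes": [{"text": "Note one"}],
    "vulnerabilities": [
        {
            "title": "Vuln A",
            "notes": [{"text": "Details"}],
            "remediations": [
                {"description": "Upgrade", "entitlements": ["Gold support"]}
            ],
        }
    ],
}

paths = [
    "/vulnerabilities[]/notes[]/text",
    "/vulnerabilities[]/remediations[]/description",
    "/vulnerabilities[]/remediations[]/entitlements[]",
    "/vulnerabilities[]/title",
    "/document_notes[]/text",
    "/document_title",
]

strings = [s for p in paths for s in extract(sample, p)]
print(strings)
```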

I have a proposal for a very simple approach that perhaps radically simplifies the problem of localization. The challenge:
- Referencing an object requires one of: adding a unique identifier to the object, leveraging an existing unique characteristic (such as a CVE # on a vulnerability), or using JSON pointer references. All of these approaches have their failure modes.
- As STIX highlights, the time at which the source document was snapshotted for translation may differ from when the document was last modified.

Suggested simplified approach:
- Capture the *exact* strings that need to be localized, and one to N localized versions of the same.
- Don't worry about anything else.

So rather than trying to define a localization for a title (from the sample document) such as "Cisco IOS and IOS XE Software Smart Install Remote Code Execution Vulnerability" by identifying some sort of reference to the "document/title" property, just list the string. That is, a localization document might look like this:

[
  {
    "original": "Cisco IOS and IOS XE Software Smart Install Remote Code Execution Vulnerability",
    "translations": {
      "de": "Remotecodeausführung Sicherheitsanfälligkeit durch Cisco IOS- und IOS XE-Software Install"
    }
  }
]

(Sorry, translation courtesy of Google Translate plus my edits)
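
Applying such a localization document could be as simple as the sketch below. Note the simplification: it replaces any string value that matches an "original" exactly, wherever it appears, rather than restricting itself to the known translatable properties. The function names are mine.

```python
TITLE = "Cisco IOS and IOS XE Software Smart Install Remote Code Execution Vulnerability"

localization = [
    {
        "original": TITLE,
        "translations": {
            "de": "Remotecodeausführung Sicherheitsanfälligkeit durch Cisco IOS- und IOS XE-Software Install"
        },
    }
]


def build_lookup(entries, lang):
    """Map each original string to its translation for lang, where one exists."""
    return {
        e["original"]: e["translations"][lang]
        for e in entries
        if lang in e["translations"]
    }


def localize(node, lookup):
    """Return a copy of node with every exactly matching string value replaced."""
    if isinstance(node, str):
        return lookup.get(node, node)  # unmatched strings pass through unchanged
    if isinstance(node, list):
        return [localize(item, lookup) for item in node]
    if isinstance(node, dict):
        # Only values are translated, property names are left alone.
        return {key: localize(value, lookup) for key, value in node.items()}
    return node


doc = {"document_title": TITLE, "document_notes": [{"text": "Unrelated note"}]}
german = localize(doc, build_lookup(localization, "de"))
print(german["document_title"])
```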

Possible issues with this approach:
- Applying the localization then means that anything processing the CSAF JSON instance needs to be aware of where translatable strings might appear, and check for the translations.
- Failure mode: two text strings in a CSAF document that share the same original representation, but translate differently.
- Only possible to catch post-translation snapshot changes by way of identifying portions of the document not translated.

Benefits of this approach:
- Document revisions don't matter: either the string to be translated matches, or it doesn't.
- No need for references to the point in the original document where the translation is supposed to apply.
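
On the third issue above, detecting text that changed after the translation snapshot reduces to listing strings with no exact match. A minimal sketch, with a made-up helper name and made-up sample strings:

```python
def untranslated(strings, entries, lang):
    """Return the strings that have no exact-match translation for lang."""
    known = {e["original"] for e in entries if lang in e["translations"]}
    return [s for s in strings if s not in known]


entries = [{"original": "Upgrade to version 2", "translations": {"de": "Upgrade auf Version 2"}}]

# The second string was edited after the translation snapshot, so it no longer matches.
current_strings = ["Upgrade to version 2", "Upgrade to version 3"]
print(untranslated(current_strings, entries, "de"))
```
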
Thoughts?

Eric.
