OASIS Mailing List Archives

 



cti message



Subject: i18n (RE: MVP Discussion) - Yet Another Updated Proposal


Hi, 

Please find below yet another updated proposal for i18n.
I think that it is simple, minimal, coherent, consistent, one-way-of-doing-things, 
self-contained, and context-free and that it adds a lot of value
to the standards. 

As you already know there are two major design decisions.

(1) Language code for every text field
(2) Direct reference to text 

Both come from the same design principle: avoid dependence on object
structure and keep everything self-contained in the text itself.
Being self-contained, without dependence on object structure, has these benefits:

- It survives revisions and other changes made to objects so that it 
  protects investment in quality translations.

- The parser is simple to implement (the information is always available in place;
  you do not have to go through object structures or relations to find
  the language code or resolve references).

- It can accommodate many use cases and increase the standards' utility,
  as its self-containedness allows flexibility (for example, multiple
  languages in a single STIX file).

As for (2), I could do away with text_id for text fields 
by using the (hash of) text itself, but could not do away with (1).
But it is only 7 bytes extra and its utility is, from my point of
view, huge. 

Other minor design decisions

- {"en": ...} is used instead of {"lang": "en", "text_value": ...} both for 
  text fields and translations for efficiency and consistency. 

- I added to translations the necessary fields that core provides.
  (Thanks, Jeffrey)

- Changed Base64 encoding to Hexadecimal encoding for MD5 hash of the original text

Regards,

Ryu

------------------------------------------------------------
Internationalization - Another Updated Proposal - 20160426
------------------------------------------------------------

- STIX/CybOX should be UTF-8 Encoded.

- Always give the language code as the keyword for every text field.
  Only one text in a single language code is allowed. 

- Always give "text_ref" and the language code as the keyword for every translation.
  Use the hexadecimal-encoded MD5 hash of the original text in UTF-8 as the "text_ref" value to
  refer to the original text.

- One can also provide a translation of one of the translated texts rather than of the original text.


-----
- Pattern A - Translation given inside the same original package
-----

{
  "type": "package",
  ...
  "campaigns": [
    {
      "type": "campaign",
      "id": "campaign--a1201df6-c352-4a81-9c7c-5a6f896a4e31",
      "revision": 1,
      "spec_version": "stix-2.0",
      "created_at": "2015-12-03T13:13Z",
      "created_by_ref": "identity--69a17e1b-bb45-4657-9a9d-96db3faccdde",
      "title": {"en": "Dridex Campaign - Botnet 121"}, 
      "descriptions": {"en": "Dridex-based campaign leveraging Botnet 121"}, 
      "intended_effects": [
        {"value": "theft-identity-theft"}
      ],
      "status": "Ongoing"
    }
  ],
  "translations": [
    {"id":"trans-1",
     "type":"translation",
     "created_at":"2016-04-19",
     "created_by_refs":["CTI-Provider-1"],
     "version":1,
     "text_ref": "41cb32a0d74d5d07f5362b3e66f245c9", 
     "ja": "Dridex キャンペーン - ボットネット 121"}, 

    {"id":"trans-2",
     "type":"translation",
     "created_at":"2016-04-19",
     "created_by_refs":["CTI-Provider-1"],
     "version":1,
     "text_ref": "e8465d411f6580e8b67d778f25a78234", 
     "ja": "ボットネット 121 を活用する Dridex を元にしたキャンペーン"}, 

    {"id":"trans-3",
     "type":"translation",
     "created_at":"2016-04-19",
     "created_by_refs":["CTI-Provider-1"],
     "version":1,
     "text_ref": "41cb32a0d74d5d07f5362b3e66f245c9", 
     "de": "Some German Title"}, 

    {"id":"trans-4",
     "type":"translation",
     "created_at":"2016-04-19",
     "created_by_refs":["CTI-Provider-1"],
     "version":1,
     "text_ref": "e8465d411f6580e8b67d778f25a78234", 
     "de": "Some German Description"}
  ]
  ...
}
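
A consumer of Pattern A could resolve translations roughly as follows. This is only a sketch: index_translations() and localize() are hypothetical names, the set of metadata keys is taken from the example entries above, and the "text_ref" in the demo is computed at run time rather than copied from the example.

```python
import hashlib

# Keys that carry metadata in a translation entry; every other key
# is assumed to be a language code (assumption based on the example).
META_KEYS = {"id", "type", "created_at", "created_by_refs", "version", "text_ref"}


def text_ref(text):
    """Hexadecimal MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def index_translations(translations):
    """Map text_ref -> {language code: translated text}."""
    index = {}
    for entry in translations:
        by_lang = index.setdefault(entry["text_ref"], {})
        for key, value in entry.items():
            if key not in META_KEYS:
                by_lang[key] = value
    return index


def localize(text_field, index, lang):
    """Return the text in `lang` if known, else fall back to the original."""
    (orig_lang, text), = text_field.items()  # exactly one language per field
    if orig_lang == lang:
        return text
    return index.get(text_ref(text), {}).get(lang, text)


title = {"en": "Dridex Campaign - Botnet 121"}
translations = [
    {"id": "trans-1", "type": "translation",
     "text_ref": text_ref("Dridex Campaign - Botnet 121"),
     "ja": "Dridex キャンペーン - ボットネット 121"},
]
index = index_translations(translations)
print(localize(title, index, "ja"))
print(localize(title, index, "fr"))  # no French entry, so the original is returned
```

Note that the lookup never touches the campaign object's id or revision; the text alone is enough to find its translations.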
 
-----
- Pattern B - Translation given by a third-party in some external database
-----

{
  "translations": [
    {"id":"trans-A",
     "type":"translation",
     "created_at":"2016-04-19",
     "created_by_refs":["Translator-1"],
     "version":1,
     "text_ref": "41cb32a0d74d5d07f5362b3e66f245c9", 
     "es": "Some Spanish Title"}, 

    {"id":"trans-B",
     "type":"translation",
     "created_at":"2016-04-19",
     "created_by_refs":["Translator-1"],
     "version":1,
     "text_ref": "e8465d411f6580e8b67d778f25a78234", 
     "es": "Some Spanish Description"}, 

    {"id":"trans-C",
     "type":"translation",
     "created_at":"2016-04-19",
     "created_by_refs":["Translator-1"],
     "version":1,
     "text_ref": "41cb32a0d74d5d07f5362b3e66f245c9", 
     "fr": "Some French Title"}, 

    {"id":"trans-D",
     "type":"translation",
     "created_at":"2016-04-19",
     "created_by_refs":["Translator-1"],
     "version":1,
     "text_ref": "e8465d411f6580e8b67d778f25a78234", 
     "fr": "Some French Description"}
  ]
}
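
When translations arrive from several parties, as in Pattern B, a consumer may end up with more than one candidate for the same text and language. The sketch below shows one possible way to handle that; collect_candidates(), pick(), and the "trusted provider" ordering are illustrative assumptions, not something the proposal defines.

```python
from collections import defaultdict

# Metadata keys as used in the example entries; other keys are
# assumed to be language codes.
META_KEYS = {"id", "type", "created_at", "created_by_refs", "version", "text_ref"}


def collect_candidates(*translation_sets):
    """Group translated texts by (text_ref, language code)."""
    candidates = defaultdict(list)
    for translations in translation_sets:
        for entry in translations:
            for key, value in entry.items():
                if key not in META_KEYS:
                    candidates[(entry["text_ref"], key)].append(
                        (entry["created_by_refs"], value)
                    )
    return candidates


def pick(candidates, ref, lang, trusted):
    """Return the first candidate from the most trusted provider, or None."""
    for provider in trusted:
        for creators, text in candidates.get((ref, lang), []):
            if provider in creators:
                return text
    return None


in_package = [
    {"id": "trans-X", "type": "translation",
     "created_by_refs": ["CTI-Provider-1"],
     "text_ref": "41cb32a0d74d5d07f5362b3e66f245c9",
     "es": "Spanish title from the provider"},
]
third_party = [
    {"id": "trans-A", "type": "translation",
     "created_by_refs": ["Translator-1"],
     "text_ref": "41cb32a0d74d5d07f5362b3e66f245c9",
     "es": "Some Spanish Title"},
]
candidates = collect_candidates(in_package, third_party)
ref = "41cb32a0d74d5d07f5362b3e66f245c9"
print(pick(candidates, ref, "es", ["Translator-1", "CTI-Provider-1"]))
```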

-----
- Pattern C - A Japanese CTI Provider creates CTI with its title in English and description in Japanese
-----

{
  "type": "package",
  ...
  "campaigns": [
    {
      "type": "campaign",
      "id": "campaign--a1201df6-c352-4a81-9c7c-5a6f896a4e31",
      "revision": 1,
      "spec_version": "stix-2.0",
      "created_at": "2015-12-03T13:13Z",
      "created_by_ref": "identity--69a17e1b-bb45-4657-9a9d-96db3faccdde",
      "title": {"en": "Dridex Campaign - Botnet 121"}, 
      "descriptions": {"ja": "ボットネット 121 を活用する Dridex を元にしたキャンペーン"}, 
      "intended_effects": [
        {"value": "theft-identity-theft"}
      ],
      "status": "Ongoing"
    }
  ], 
  ...
}
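
Since each text field carries its own language code, a parser can read it without any surrounding context. A minimal sketch of such a parser, with the hypothetical name parse_text_field(), enforcing the "only one text in a single language code" rule:

```python
def parse_text_field(field):
    """Return (language code, text) for a text field such as {"en": "..."}.

    Raises ValueError if the field does not hold exactly one
    language-code/text pair.
    """
    if not isinstance(field, dict) or len(field) != 1:
        raise ValueError("a text field must hold exactly one language code")
    (lang, text), = field.items()
    if not isinstance(text, str):
        raise ValueError("the value of a text field must be a string")
    return lang, text


campaign = {
    "title": {"en": "Dridex Campaign - Botnet 121"},
    "descriptions": {"ja": "ボットネット 121 を活用する Dridex を元にしたキャンペーン"},
}
for name in ("title", "descriptions"):
    lang, text = parse_text_field(campaign[name])
    print(name, lang)
```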

------------------------------
Notes - Simple, minimal, coherent, consistent, self-contained, context-free, future-proofed
------------------------------

- Only seven additional bytes (without white spaces) for each text field.

- As it refers to the text itself, it does not break if there are
  revisions of the objects, as long as the text stays the same.

- As its scope is limited to text fields, it is self-contained:

  - It is very unlikely this impacts other parts of STIX and other standards. 

  - Very few (if any) considerations will be necessary
    for future standard developments/changes. 

  - It would be easy to implement, as the same context-free code can
    handle any text field. 

- There is only one way to express text fields and translations

- Resources spent on translation will not be wasted as long as the text stays the same.

  - Even if someone else reuses the same text, its translations are still applicable.
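
The seven-byte figure in the notes above can be checked with a few lines of Python. The sketch assumes a two-letter language code and compact JSON serialization without white space.

```python
import json

text = "Dridex Campaign - Botnet 121"

# Compact serialization: no spaces after "," or ":".
plain = json.dumps(text, separators=(",", ":"))
tagged = json.dumps({"en": text}, separators=(",", ":"))

# The wrapper adds {"en": before the string and } after it.
overhead = len(tagged.encode("utf-8")) - len(plain.encode("utf-8"))
print(overhead)  # 7
```

A longer language tag (for example one with a region subtag) would add correspondingly more bytes.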
 
------------------------------
Internationalization Use Cases
------------------------------

CN: Chinese
DE: German
EN: English
FR: French
JA: Japanese

------------------------------
(1) Providing text fields in multiple languages simultaneously at the time of creation.
------------------------------

  [ja/en (in case of Japan), en/fr/de (in case of EU countries), etc.]

This is the most likely use case (for me). The original CTI has titles/descriptions in 
multiple languages from the start. When you create a CTI file, you include 
both English and Japanese titles/descriptions for major objects in it
so that non-Japanese speaking people can at least find out what it is at the top level.

Or another use case is the CTI provider in Japan writes a CTI file with its
title in English and description in Japanese. This is because many Japanese
can read short English titles, but many Japanese have difficulty understanding
long and detailed descriptions in English. 

------------------------------
(2) CTI Database Receiving CTI from Multiple CTI Sources in Different Languages
------------------------------

This is a case where you receive CTI from an English CTI source and 
from another CTI source in Japanese. 
You put all the CTI into MongoDB or some other NoSQL database and 
would like to mix and match. I would like the CTI database to still 
be able to track the language code of each text field.
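
A sketch of how a store could keep track of languages once objects from different sources are mixed together. A plain Python list stands in for the database here, and texts_by_language() and the object ids are hypothetical; the field names "title" and "descriptions" follow the examples in the proposal.

```python
from collections import defaultdict


def texts_by_language(objects, text_keys=("title", "descriptions")):
    """Group (object id, field name, text) triples by language code."""
    grouped = defaultdict(list)
    for obj in objects:
        for key in text_keys:
            field = obj.get(key)
            if isinstance(field, dict):
                for lang, text in field.items():
                    grouped[lang].append((obj["id"], key, text))
    return dict(grouped)


# Objects from an English source and a Japanese source, stored together.
store = [
    {"id": "campaign--1", "title": {"en": "Dridex Campaign - Botnet 121"}},
    {"id": "campaign--2", "title": {"ja": "Dridex キャンペーン - ボットネット 121"}},
]
grouped = texts_by_language(store)
print(sorted(grouped))  # ['en', 'ja']
```

Because the language code travels with each text field, no per-source or per-package language setting has to be remembered after the merge.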

------------------------------
(3) EN CTI received by a Japanese entity, which provides JA translation
  (Or vice versa, JA CTI received by a US entity, which provides EN translation)
------------------------------

  A Japanese entity receives CTI information pieces in English.
  The entity determines that some of them are important/critical
  and worth translating into Japanese, adds descriptions in Japanese,
  and redistributes them to other Japanese entities (if redistribution is allowed).
  The CTIM (CTI Management System) of a receiving party displays
  the Japanese description whenever possible, while allowing access to
  the original English descriptions.

  Work Flow:
  1. Company 1 in EN creates an Indicator and TTP and shares them with Company 2 in JP.  
    It is important to note that the flow may be direct or may be through a series of brokers and other entities.  
    1. This Indicator and TTP have a producer of Company 1 and a version of 1.
  2. Company 2 builds a translated version of the TTP and Indicator and releases it.
    1. This new Indicator and TTP have a producer of Company 2 and a version of 2.  
    2. It is unrealistic to think that Company 2 can or will share the translated object back to Company 1, or that Company 1, if it gets the translated object, will do anything with it.  Its legal department will probably prohibit accepting third-party translations and then using them in its offerings.

------------------------------
(4) An English CTI report describing attacks against Japanese entities in EN  
------------------------------

  An English report on Cyber Attacks on Japan.
  There are filenames of lure attachments in Japanese (original/real) and their
  translations in English.  Another similar report in English might have an email title along with 
  its translation in English next to it. That report also has a Windows pathname 
  in Chinese (not Japanese) found in a binary along with its translation in English.

  These Japanese texts can be found in descriptions, not just 

  [Ex. Original File Name (JA): "医療費通知", Translated File Name (EN): "Medical expenses notice"]

  Note: This should probably be okay as long as the standards require use of UTF-8 for encoding.

------------------------------
(5) Email subject/body, supposed to be in JP, but includes CN characters (by mistake of the attackers)
------------------------------

  This can happen due to Chinese/Japanese/Korean sharing Unicode characters
  (CJK characters - https://en.wikipedia.org/wiki/CJK_characters.)

  This can be a very important clue as to the attackers.

  Note: This should probably be okay as long as the standards require use of UTF-8 for encoding.

------------------------------
(6) CTI translation service
------------------------------

  A CTI translation service provider keeps translations to target languages of text fields 
  from publicly available and/or commercial/private CTI sources.
  The service is available through some kind of online API.
  Consumers of this translation service will use this service to translate text fields
  in their CTI system through the API provided by the translation service provider.
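
A sketch of how a consumer might use such a service. TranslationService is an in-memory stand-in for the online API, and all names here are hypothetical; the proposal does not define the API. Only the keying by the hexadecimal MD5 of the UTF-8 text is taken from the proposal.

```python
import hashlib


def text_ref(text):
    """Hexadecimal MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class TranslationService:
    """In-memory stand-in for a remote translation API (hypothetical)."""

    def __init__(self):
        self._store = {}

    def add(self, original, lang, translated):
        self._store[(text_ref(original), lang)] = translated

    def lookup(self, ref, lang):
        """Return the stored translation, or None if there is none."""
        return self._store.get((ref, lang))


def translate_fields(obj, target, service, text_keys=("title", "descriptions")):
    """Return {field name: text in `target`}, falling back to the original."""
    result = {}
    for key in text_keys:
        field = obj.get(key)
        if not isinstance(field, dict):
            continue
        for lang, text in field.items():
            if lang == target:
                result[key] = text
            else:
                result[key] = service.lookup(text_ref(text), target) or text
    return result


service = TranslationService()
service.add("Dridex Campaign - Botnet 121", "ja",
            "Dridex キャンペーン - ボットネット 121")

campaign = {
    "title": {"en": "Dridex Campaign - Botnet 121"},
    "descriptions": {"en": "Dridex-based campaign leveraging Botnet 121"},
}
print(translate_fields(campaign, "ja", service))
```

In this sketch the consumer only sends the digest and the target language code, so the lookup works the same way whichever object the text came from.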

------------------------------
(7) CTI provider
------------------------------

  A CTI provider (in English) plans to penetrate the Japanese and other APAC markets
  and needs a standard way to add translations of their text fields.
  The CTI provider gives its customer a CTI package with all the translations in it
  or a CTI package with translations into the languages of the user's choosing.  

------------------------------------------------------------


