OASIS Mailing List Archives

cti-stix message



Subject: Re: [cti-stix] Vocab case sensitivity in STIX


By definition, a field value that comes from a vocabulary comes from a… wait for it… I know this is going to be a new idea… a vocabulary. I.e., it is an opaque stream of octets that means whatever the vocabulary says it is. We can define these vocabulary words to have meanings in English, like “Attack”, “Probe”, “Manager”, etc. Or, we can define these vocabulary words to have no meanings in English, like “Value1”, “Foobar”, “Mumble”. The point is that by definition, the vocabulary somewhere defines what these words mean. So, if the thought is we want to let someone say “Anschlag” or “攻击” and have it mean “Attack”, we are on a fool’s errand. The only way to have interoperability is to have one and only one identifier that a machine understands. A machine need not know English, German, or Chinese. A machine need only know that the string 0x41 0x74 0x74 0x61 0x63 0x6B is the identifier for the thing we call an “Attack”. You might call it an “Anschlag” or “攻击”. Feel free to let your UI translate the string 0x41 0x74 0x74 0x61 0x63 0x6B to whatever you need to for your customers to understand the message.
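To make the point above concrete, here is a minimal sketch in Python of an opaque identifier on the wire and a translation step that lives only in the UI. The names ATTACK, UI_LABELS, and display_label are hypothetical and come from no STIX specification or library; the translations are the ones mentioned above.

```python
# Hypothetical vocabulary identifier: the machine only ever compares this string.
ATTACK = "Attack"

# Hypothetical per-locale display table, owned by the UI and never sent on the wire.
UI_LABELS = {
    "en": {ATTACK: "Attack"},
    "de": {ATTACK: "Anschlag"},
    "zh": {ATTACK: "攻击"},
}


def display_label(term: str, locale: str) -> str:
    """Translate an identifier for display; fall back to the raw identifier."""
    return UI_LABELS.get(locale, {}).get(term, term)


# The identifier is exactly the octet sequence quoted in the message.
WIRE_BYTES = ATTACK.encode("utf-8")
assert WIRE_BYTES == bytes([0x41, 0x74, 0x74, 0x61, 0x63, 0x6B])
```

A consumer matches on the identifier itself and calls something like display_label only when rendering for a person, so the wire format never depends on the reader's language.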

Now if we are talking about free-text fields, like “Description of the indicator,” then it is a free-text field and we do not need to worry about case folding.

Talking about case folding, a number of recent IETF protocols are case sensitive. Guess what, interoperability improved. So, a sensible way to sidestep the issue altogether is to say that STIX is case sensitive. Attack does not equal attack does not equal ATTACK does not equal aTtaCK. Just say ‘attack’ or ‘ATTACK’ (pick one) and we are done. Likewise, if giving guidance for external vocabularies, we just note that STIX is case sensitive and, if your external vocabulary is not case sensitive, we would strongly recommend you say what case things should be in.
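As a rough illustration of the case-sensitive approach, a validator would reduce to an exact membership test with no folding step. The set VOCAB and the function is_valid_term below are hypothetical, and the lowercase spelling is an arbitrary choice for the sketch, not something the thread has decided.

```python
# Hypothetical controlled vocabulary; lowercase is picked arbitrarily for this sketch.
VOCAB = frozenset({"attack", "probe", "manager"})


def is_valid_term(value: str) -> bool:
    """Exact, case-sensitive match: no lower(), upper(), or casefold() involved."""
    return value in VOCAB
```

With this rule, "attack" is accepted while "Attack", "ATTACK", and "aTtaCK" are all rejected, and no Unicode case tables are needed to decide.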

On Jun 9, 2016, at 8:34 AM, Wunder, John A. <jwunder@mitre.org> wrote:

I don’t think we should mandate that values from extended vocabularies (either other values in open vocabs, or extension values in controlled vocabs) be in English…ignoring the issues of actually verifying that (either as a tool trying to produce valid content or as a validation program), it means that people doing STIX in other languages either need to have some ability to translate to English or can’t use extended vocab values because they can’t produce English text.
 
The values in vocabularies we define should all be in English. They’re pre-defined and tools can localize their interfaces with appropriate translations even in completely non-English ecosystems…they wouldn’t have that same ability for tool or user developed values.
 
Let’s schedule this topic for the call on Tuesday. If we aren’t able to resolve it then, it should probably go to a vote.
 
John
 
From: <cti-stix@lists.oasis-open.org> on behalf of Terry MacDonald <terry.macdonald@cosive.com>
Date: Wednesday, June 8, 2016 at 6:34 PM
To: John-Mark Gurney <jmg@newcontext.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, "Wunder, John A." <jwunder@mitre.org>
Subject: Re: [cti-stix] Vocab case sensitivity in STIX
 

One point for vocabs.... I thought we had decided that all controlled vocabularies would be defined in the standard as English, and that it was up to the local implementation to provide translations in other languages.

If this is still the case, does this also apply to open vocabs? If this is the case then I'd go option #3 (fallback #2). Otherwise if we are still going English only then option #1 seems logical.

Cheers
Terry MacDonald 

On 9/06/2016 8:13 AM, "John-Mark Gurney" <jmg@newcontext.com> wrote:

Jason Keirstead wrote this message on Wed, Jun 08, 2016 at 13:44 -0300:
> Case insensitivity can get extremely complicated with non-latin characters.
>
> The definitive example is Turkish -
> http://www.i18nguy.com/unicode/turkish-i18n.html

This is exactly why I support 3...  If we support 2, we need to define
either a limited character set (e.g. latin-1 only) with well-defined
rules, or well-defined rules on case sensitivity for ALL unicode
characters, and be willing to break other languages like Turkish...

The header on:
http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

Helps...  Also points out that some case transitions involve going
from one code point to two...
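A small sketch of both problems, using only Python's built-in str methods, which apply the default, locale-independent Unicode mappings. The helper name case_facts is made up for this example.

```python
def case_facts() -> dict:
    """Collect a few default Unicode case mappings that surprise people."""
    return {
        # One code point becomes two when uppercased or case-folded.
        "sharp_s_upper": "ß".upper(),
        "sharp_s_fold": "ß".casefold(),
        # U+0130 (capital I with dot above) lowercases to "i" plus U+0307.
        "dotted_capital_i_lower": "İ".lower(),
        # The default mapping lowercases "I" to "i"; Turkish expects dotless "ı".
        "plain_capital_i_lower": "I".lower(),
        # Dotless "ı" uppercases to plain "I", so a round trip loses the distinction.
        "dotless_i_upper": "ı".upper(),
    }
```

Under the default mappings, a case-insensitive compare would treat "ı" and "i" as different once folded but the same once uppercased, which is the kind of inconsistency that rules covering all of Unicode would have to pin down.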

Hmm... I did find W3C's case folding page:
https://www.w3.org/International/wiki/Case_folding

So, anyone who has an opinion on this topic should read it, and then
decide if they want to change their vote...

More info on case mapping from Unicode:
http://unicode.org/faq/casemap_charprop.html

Another fun example from the Unicode page:
"For example, while the default uppercase mapping of "a" is "A" and
the default mapping of "à" is "À", the uppercase conversion of
"Je vais à Paris" in some forms of French might be "JE VAIS A PARIS".
Notice how the "à" is uppercased as "A" in this case."

IMO, the spec should be 3, but we provide non-normative text on how
organizations and vendor products should allow such input.  If all
the tools follow the rules, then comparison becomes a
non-issue...

--
John-Mark
