cti-stix message


Subject: Re: [cti-stix] Supporting translations in STIX


All,

On the working call yesterday the suggestion was made to completely defer on translations until after 2.0, including holding off on even adding the “lang” tag. This will ensure that we don’t go down a path that precludes us from any particular approaches (i.e. if our chosen approach does not have a “lang” tag) and makes sure that we have time to get it right.

That said, this does mean that STIX 2.0 will not have any native support for internationalization.

Does this seem acceptable, in particular to those of you in Europe and Asia/Pacific?

John

On 6/28/16, 6:46 AM, "Eric Burger" <ewb25@georgetown.edu>, on behalf of Eric Burger <Eric.Burger@georgetown.edu>, via cti-stix@lists.oasis-open.org, wrote:

A few thoughts here:

If the reason we want to capture language is to tag indicators, like “look for a message in Russian,” then say so. Tagging a message with a Russian language tag is academically interesting, but not useful. Comparisons are going to be based on the octets seen and inferred, which depends more on charset than language. Again, if you do not happen to have your sample tagged already, (1) it is fairly easy to use machine language detection algorithms to tag it and (2) I am not sold on the utility of having the samples tagged.

If the reason we want to capture language is to tag analyst-entered descriptions, so a human reading it has a chance of either finding someone who knows the language or that a machine translation should be offered, then I would also offer that you (1) already have an idea of what language the *legacy* descriptions are in and (2) you can tag them manually.

Is there a scenario where tagging *all* analyst-entered descriptions (or any text) with a language tag imposes an undue burden on either developers or analysts? Would analyst-entered text not be a 95% default language setup kind of thing? For example, all text is entered in Mandarin for sharing, except for local notes which might be in Hangul?

> On Jun 27, 2016, at 9:54 PM, Allan Thomson <athomson@lookingglasscyber.com> wrote:
> 
> Eric - There is intelligence that has both English and has the original foreign language attached to it, that would be useful to share.
> 
> Threat actors don’t live or necessarily perform their acts in solely English speaking environments.
> 
> Allan
> 
> 
> From: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, Eric Burger <ewb25@georgetown.edu> on behalf of Eric Burger <Eric.Burger@georgetown.edu>
> Date: Monday, June 27, 2016 at 9:50 AM
> To: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
> Subject: Re: [cti-stix] Supporting translations in STIX
> 
> Snap poll time: is there any legacy data that someone wants to import into STIX that is not in English? I truly do not know, but that would inform my next suggestion.
> 
> Part of me wants to say that if STIX is not part of a protocol and just a representation, the absence of a language tag means the language is whatever the environment of the instance is. If you are in Friesland, the default language tag will be variously ‘fy’ or ‘nl’, depending on whether you are locally using Frisian or Dutch. If you are in Northern Canada, pick your favorite of ‘cr’, ‘oj’, or ‘en-CA’, depending on what your local environment is. Since it is your (or your customer’s) local environment, you will know what the right answer is.
> 
> But: since the point is to exchange STIX documents, we need to specify what the default language of the sending system is (pick your poison: in the transport layer or by hacking the STIX representation), because the receiving system needs to know what the default value is for that document. Note that this mechanism means that any integrity checking on the STIX document as a whole will fail.
> 
> Another part of me wants to say that since there are today an imperceptible number of documents in anything that will end up in STIX, and the vast majority of those documents are already in English, let’s draw a line in the sand and say you MUST specify a language tag, with no defaults. This will not be a burden on existing systems, because it should take 15 minutes to see if you have anything extant that is not in English. If everything is already in English, mark it that way. If everything is in Japanese, mark it that way. If 90% is in English and 10% is in Xhosa, great - manually do that 10%. I cannot imagine it is a lot of documents. This will not be a burden on new systems, because they already know they need to mark the entries appropriately. If you think your operators are lazy, grab one of the myriad language detection algorithms out there and run with it.
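Editorial sketch of the "MUST specify, no defaults" workflow described above: bulk-mark legacy objects with the majority language, hand-correct the exceptions, then check that nothing is left untagged. The function names, the dict-based object shape, and the `overrides` parameter are assumptions made for illustration; they do not come from any STIX draft.

```python
def tag_legacy(objects, majority_lang, overrides=None):
    """Return copies of the objects with a "lang" value filled in.

    objects:       list of dicts, each with at least an "id" key (assumed shape)
    majority_lang: tag applied to every object that has no "lang" yet
    overrides:     optional {object id: tag} for the manually reviewed minority
    """
    overrides = overrides or {}
    tagged = []
    for obj in objects:
        copy = dict(obj)
        if "lang" not in copy:
            copy["lang"] = overrides.get(copy["id"], majority_lang)
        tagged.append(copy)
    return tagged


def missing_lang(objects):
    """Return the ids of objects that would fail a 'lang is required' rule."""
    return [obj["id"] for obj in objects if not obj.get("lang")]
```

A producer could run `missing_lang` as a final gate before sharing; an empty list means every object carries an explicit tag.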
> 
> On Jun 27, 2016, at 12:11 PM, Wunder, John A. <jwunder@mitre.org> wrote:
> 
> It sounds America-centric but making the default “en” or even “en-US” seems fine to me. I talked to a team from Europe out at Seoul and they were saying that they wished the default was English. Lots of technical stuff is in English, it’s just a fact… Guido is Dutch and Python keywords are English. Matz is Japanese and Ruby keywords are English. More people speak Mandarin, but more of the technical content that subject matter experts need, and can access, is in English.
> 
> I do agree that marking content as “i-default” seems wrong based on the definition.
> 
> Though I will also say that making it optional and just saying the language is undefined also seems fine.
> 
> Let’s talk about it briefly on the call tomorrow. I don’t think this is a high-impact decision, we just need to pick something.
> 
> John
> 
> From: <cti-stix@lists.oasis-open.org> on behalf of Jason Keirstead <Jason.Keirstead@ca.ibm.com>
> Date: Monday, June 27, 2016 at 9:43 AM
> To: Dave Cridland <dave.cridland@surevine.com>
> Cc: Allan Thomson <athomson@lookingglasscyber.com>, "Wunder, John A." <jwunder@mitre.org>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, Bret Jordan <bret.jordan@bluecoat.com>
> Subject: Re: [cti-stix] Supporting translations in STIX
> 
> i-default is not for marking content though. STIX is a content specification, it is not a protocol specification. i-default is to be used inside of a protocol when no language has been negotiated. As STIX is not a protocol, it is inapplicable. A given piece of content is never "i-default", it is always *something*. If the person who authored the content did not specify it, you would have to guess - it is not sufficient to treat it as "i-default", because this has no meaning as "i-default" is defined to not be a language.
> 
> As to the default assumption being "en-US" - stating that the default assumption for an unspecified "lang" attribute is treated as en-US is *not* the same as specifying a "default language", rather it is specifying a fallback assumption. However, I would be just as happy with making "lang" mandatory.
> 
> -
> Jason Keirstead
> STSM, Product Architect, Security Intelligence, IBM Security Systems
> www.ibm.com/security | www.securityintelligence.com
> 
> Without data, all you are is just another person with an opinion - Unknown
> 
> 
> 
> From: Dave Cridland <dave.cridland@surevine.com>
> To: Jason Keirstead/CanEast/IBM@IBMCA
> Cc: Allan Thomson <athomson@lookingglasscyber.com>, "Wunder, John A." <jwunder@mitre.org>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, Bret Jordan <bret.jordan@bluecoat.com>
> Date: 06/27/2016 09:58 AM
> Subject: Re: [cti-stix] Supporting translations in STIX
> Sent by: <cti-stix@lists.oasis-open.org>
> ________________________________
> 
> 
> 
> No, you shouldn't really have any content explicitly marked i-default. But neither should you mandate a default language other than i-default. If we do, it ought to be Mandarin Chinese. If your argument is that most people speak at least *some* English, then that is also the argument of i-default...
> RFC 2277 details its use in section 4.5, but loosely, i-default is used when there's no content negotiation. When there is, or when there's a known language, that should be used instead. So what I'm leaning toward is:
> Objects SHOULD have an explicit "lang" attribute providing a language tag describing the language used by the human-readable text within the object. If this is absent, the language tag MUST be treated as "i-default", and the human-readable text SHOULD be understandable to an English speaker.
> The above text means that if you've got content written in US English but the lang attribute is missing, it'll be fine. If you want to add a language tag, then "en-US" is the right one, too. Mandating a particular language tag feels like walking into the same problems that HTTP did by mandating iso-8859-1 as the charset - it introduced a slew of problems that haven't ever been fully resolved.
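Editorial sketch of how a consumer might apply the proposed rule quoted above (an explicit "lang" wins; an absent one is treated as "i-default"). It assumes objects are plain dicts with an optional "lang" key, and the function name is hypothetical. It reflects this one proposal in the thread, not adopted STIX text.

```python
# Tag used when an object carries no explicit language (per the proposal above).
FALLBACK_LANG = "i-default"


def effective_lang(obj):
    """Return the language tag a consumer would act on for this object."""
    lang = obj.get("lang")
    # A missing or empty "lang" falls back rather than raising an error,
    # because the proposal makes the attribute a SHOULD, not a MUST.
    return lang if lang else FALLBACK_LANG
```

Under the competing proposals in this thread, only the constant changes ("en-US" as the fallback) or the fallback branch becomes a validation error (mandatory "lang").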
> On 27 Jun 2016 12:46, "Jason Keirstead" <Jason.Keirstead@ca.ibm.com> wrote:
> I don't think i-default is meant to be used in this way (to mark content).
> 
> i-default is not titled "English for an International Audience", it is titled "Default Language". The reason it exists in the language registry is for implementations to use as a place-holder until another language has been negotiated. Example:
> 
> A server that advertises this extension MUST use the language
>  "i-default" as described in [RFC2277<https://tools.ietf.org/html/rfc2277>] as its default language until
>  another supported language is negotiated by the client.
> 
> defaultLocale is the original language of the Context instance and will be used as the last fallback locale if other locales are registered. If it is undefined, or if registerLocales hasn't been called at all, the Context instance will create a special locale called i-default<http://www.iana.org/assignments/lang-tags/i-default> to be used as the default.
> 
> Since STIX is actually a piece of content, marking it as "i-default" doesn't make a lot of sense. A piece of content is always *something*.
> 
> If a lang attribute is to be added, my vote is to either make it mandatory, or make en-US the default.
> 
> 
> -
> Jason Keirstead
> STSM, Product Architect, Security Intelligence, IBM Security Systems
> www.ibm.com/security | www.securityintelligence.com
> 
> Without data, all you are is just another person with an opinion - Unknown
> 
> 
> 
> From: Dave Cridland <dave.cridland@surevine.com>
> To: Bret Jordan <bret.jordan@bluecoat.com>
> Cc: Allan Thomson <athomson@lookingglasscyber.com>, Jason Keirstead/CanEast/IBM@IBMCA, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, "Wunder, John A." <jwunder@mitre.org>
> Date: 06/25/2016 04:51 AM
> Subject: Re: [cti-stix] Supporting translations in STIX
> Sent by: <cti-stix@lists.oasis-open.org>
> ________________________________
> 
> 
> 
> That would be a little odd, given i-default is specifically intended for this. It's not deprecated.
> On 25 Jun 2016 00:25, "Jordan, Bret" <bret.jordan@bluecoat.com> wrote:
> Or drop all of the confusion with grandfathered things and just use "en" or "en-us" as the default.
> 
> Bret
> 
> Sent from my Commodore 64
> 
> On Jun 24, 2016, at 12:51 PM, Dave Cridland <dave.cridland@surevine.com> wrote:
> Jason,
> http://www.iana.org/assignments/lang-tags/i-default
> Grandfathered means it predates the registry, and wasn't added under the formal rules, I believe. I've only created a single IANA registry though, so I'm hardly an expert.
> Dave.
> On 24 Jun 2016 20:33, "Jason Keirstead" <Jason.Keirstead@ca.ibm.com> wrote:
> I am not an expert on this at all, but looking at the registry it says "i-default" is "grandfathered", not sure if that implies "deprecated" or not (?)
> 
> http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
> 
> %%
> Type: grandfathered
> Tag: i-default
> Description: Default Language
> Added: 1998-03-10
> %%
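Editorial sketch bearing on the "grandfathered vs. deprecated" question: in the IANA language subtag registry, `Type` and deprecation are separate things, and a deprecated entry carries its own `Deprecated:` field with a date. The parser below handles only the simple one-line "Field: value" shape of the record quoted above (no folded continuation lines, and a repeated field keeps only its last value). The second sample record is made up for illustration and is not in the real registry.

```python
SAMPLE = """%%
Type: grandfathered
Tag: i-default
Description: Default Language
Added: 1998-03-10
%%
Type: grandfathered
Tag: zz-example
Description: Hypothetical deprecated entry (not in the real registry)
Added: 2000-01-01
Deprecated: 2005-01-01
%%
"""


def parse_records(text):
    """Split registry text on '%%' and turn each record into a dict."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.splitlines():
            name, sep, value = line.partition(":")
            if sep:  # skip blank lines and anything without a field name
                record[name.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def is_deprecated(record):
    """A record is deprecated only if it has a Deprecated field."""
    return "Deprecated" in record
```

By this test the quoted `i-default` record is grandfathered but not deprecated, since it has no `Deprecated` field.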
> 
> -
> Jason Keirstead
> STSM, Product Architect, Security Intelligence, IBM Security Systems
> www.ibm.com/security | www.securityintelligence.com
> 
> Without data, all you are is just another person with an opinion - Unknown
> 
> 
> 
> From: Dave Cridland <dave.cridland@surevine.com>
> To: Allan Thomson <athomson@lookingglasscyber.com>
> Cc: cti-stix@lists.oasis-open.org, "Wunder, John A." <jwunder@mitre.org>
> Date: 06/24/2016 04:23 PM
> Subject: Re: [cti-stix] Supporting translations in STIX
> Sent by: <cti-stix@lists.oasis-open.org>
> ________________________________
> 
> 
> 
> Allan,
> As I recall, "i-default" is "English for an International Audience" or some such. So it's English of sorts. I'm sitting on the sofa in post-brexit shock, however, and may not have that *quite* right.
> In practice, "i-default" is either the C locale or US English, I believe.
> It's given as a special token to avoid a "better" English translation taking precedence.
> Dave.
> On 24 Jun 2016 20:18, "Allan Thomson" <athomson@lookingglasscyber.com> wrote:
> I would prefer optional with a default value of “English”.
> 
> If anyone cares about which English version then suggest EU-English. (Had to make that joke based on Brexit news).
> 
> allan
> 
> From: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org> on behalf of "Wunder, John" <jwunder@mitre.org>
> Date: Friday, June 24, 2016 at 12:06 PM
> To: Dave Cridland <dave.cridland@surevine.com>
> Cc: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
> Subject: Re: [cti-stix] Supporting translations in STIX
> 
> Writing normative text would help us a ton, thanks! I think we need two things:
> 
> 
> 1.      The row in the property table:
> 
> Property Name            | Type   | Description
> lang (optional/required) | string | ?????
> 
> 
> 
> 2.      A new 6.x section in the STIX Core document (sibling of versioning, object markings, etc.) with any other text we need (if any).
> 
> I would say we either make the field optional with a default of “i-default” or we make it required and force people to say what language they’re providing. We don’t want to tie STIX to TAXII but if there are transport considerations you think we should include at a more generic level we could do that in the 6.x section. I’d reach out to Bret and Mark on the TAXII side to include the Accept-Language stuff directly in those specs.
> 
> Thanks again,
> John
> 
> From: Dave Cridland <dave.cridland@surevine.com>
> Date: Friday, June 24, 2016 at 2:05 PM
> To: "Wunder, John A." <jwunder@mitre.org>
> Cc: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
> Subject: Re: [cti-stix] Supporting translations in STIX
> 
> 
> 2(a) for now. I'm assuming an IANA language tag here, with a default of "i-default" (from memory).
> 
> I think translation objects will work; an alternate design might be to duplicate the entire object (reference and all), or have a relationship to indicate equivalent objects (which allows for both translations and more complex equivalences).
> 
> We'd want TAXII to mention something about Accept-Language for HTTP, and maybe a note about other l10n capabilities in other transports (e.g., stream language in XMPP) and payload formats (Content-Language in HTTP and email, and the stanza language tag in XMPP).
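Editorial sketch of the Accept-Language idea mentioned above: given the header a client sent and the language tags a server has content in, pick the best match. This is a simplified matcher (quality values, exact match, then a primary-subtag prefix match) and not a full implementation of HTTP language negotiation. The function names are hypothetical, and using "i-default" as the fallback is an assumption borrowed from this thread.

```python
def parse_accept_language(header):
    """Return [(tag, q)] from an Accept-Language value, highest q first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a range with no q parameter has full weight
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag.strip().lower(), q))
    prefs.sort(key=lambda p: -p[1])  # stable: ties keep header order
    return prefs


def choose_language(header, available, fallback="i-default"):
    """Pick one of the available tags for the client, or the fallback."""
    by_lower = {tag.lower(): tag for tag in available}
    for wanted, q in parse_accept_language(header):
        if q <= 0:
            continue
        if wanted == "*" and by_lower:
            return next(iter(by_lower.values()))
        if wanted in by_lower:
            return by_lower[wanted]
        for low, original in by_lower.items():
            if low.startswith(wanted + "-"):  # "en" matches "en-GB"
                return original
    return fallback
```

For example, `choose_language("fr;q=0.8, en-US, de;q=0.5", ["en-US", "fr"])` selects `"en-US"` because it has the highest weight among the tags the server can supply.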
> 
> I can knock out some formal normative text if you like.
> 
> Dave.
> On 24 Jun 2016 16:28, "Wunder, John A." <jwunder@mitre.org> wrote:
> All,
> 
> You’re probably aware that we’ve had a bit of work over the past couple months on the best approach to support translations in STIX. As I alluded to in the prioritization e-mail, it’s getting to the point where we need to decide on an approach or we’re at risk of not making the July release date and having to postpone until Winter. As I see it, we have a couple options.
> 
> 
> 1.       We can decide on a general approach and try to prove that it will work for MVP. Ideally, it would be a fairly minimalist approach so that we can be confident in the flows.
> 
> a.       Along those lines, I wrote up some normative text on an approach we discussed on Slack. Translations are very minimal objects (not standard TLOs) and refer to other TLOs to translate their titles and descriptions. It’s here: https://docs.google.com/document/d/1wiG6RoNEFaE2lrblfgjpu3RTAJZOK2q0b5OxXCaCV14/edit#heading=h.aq3spklsm9m6
> 
> b.       If we think that approach is close enough to agree on by MVP we can continue to evolve that.
> 
> c.       If you have a different approach that you think we can agree on, please write up some normative text and submit it to the full list.
> 
> 2.       Alternatively, we can implement something super minimalist now and delay until winter (6 months) to make sure we get this right
> 
> a.       IMO if we add a “lang” property to all TLOs we can provide some immediate capability and build on it in the winter.
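Editorial sketch of what option 1(a) in the list above might look like in practice: a minimal translation object that points at another object and supplies its title and description in a different language. The field names ("object_ref", "lang") and the dict shapes are assumptions for illustration only; the linked draft text is not reproduced in this thread and may differ.

```python
def translated_field(obj, translations, lang, field):
    """Return obj[field] in the requested language when a translation exists.

    obj:          the original object, a dict with an "id" (assumed shape)
    translations: list of hypothetical translation objects, each a dict with
                  "object_ref", "lang", and any of "title" / "description"
    Falls back to the original text when no matching translation is found.
    """
    for t in translations:
        if (
            t.get("object_ref") == obj["id"]
            and t.get("lang") == lang
            and field in t
        ):
            return t[field]
    return obj.get(field)
```

Keeping the translation outside the original object means the original stays untouched, which is the property that separates this design from option 2(a), where each object simply declares one language of its own.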
> 
> My preference at this point is #2a. Let’s just add a “lang” tag to TLO common properties, put the discussion on hold while we finish MVP, and then resume in August. Then we can spend the fall making sure we get it right. At the same time, we enable an ecosystem where TLOs are in specific languages and so people can innovate and try out different approaches. That said, if people think #1 is close, I’m happy to continue trying to push that forward.
> 
> What do you think?
> 
> John
> 
> 
> 





