cti-stix message



Subject: RE: [cti-stix] Unicode, strings, and STIX


Maybe say instead: Any length SHOULD be permitted

 

Then maybe in the implementation guide say: suggested storage size is 8KB…

 

From: Mark Davidson [mailto:mdavidson@soltra.com]
Sent: Thursday, June 02, 2016 8:53 AM
To: Piazza, Rich <rpiazza@mitre.org>; Jordan, Bret <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

 

I guess I have also provided evidence that a spec can be widely implemented without specifying max lengths on important fields. The drawback, however, is that the max length will end up being the shortest supported value from major implementations, and it will only be discovered through painful research.

 

-Mark

 

From: <cti-stix@lists.oasis-open.org> on behalf of Mark Davidson <mdavidson@soltra.com>
Date: Thursday, June 2, 2016 at 8:49 AM
To: "Piazza, Rich" <rpiazza@mitre.org>, "Jordan, Bret" <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: Re: [cti-stix] Unicode, strings, and STIX

 

There needs to be a limit, even if it’s a SHOULD requirement. If we don’t specify it, we’ll get SO posts like this: http://stackoverflow.com/questions/686217/maximum-on-http-header-values

 

Thank you.

-Mark

 

From: <cti-stix@lists.oasis-open.org> on behalf of "Piazza, Rich" <rpiazza@mitre.org>
Date: Wednesday, June 1, 2016 at 2:17 PM
To: "Jordan, Bret" <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: RE: [cti-stix] Unicode, strings, and STIX

 

I think the spec would have to say something like – “Any length is permitted”

 

Then, implementers would have to make sure they could support that.

 

In STIX 1.2.1, the description field of all of the objects had this text in the specification documents. I’m not sure in which direction that will sway you :)

 

From: Jordan, Bret [mailto:bret.jordan@bluecoat.com]
Sent: Wednesday, June 01, 2016 1:38 PM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

 

If we do not define a max length then everyone will set their own.  And we will have problems.

 

Bret 

Sent from my Commodore 64


On Jun 1, 2016, at 8:08 AM, Piazza, Rich <rpiazza@mitre.org> wrote:

My +1 was for the idea that implementation details like this do not belong in the standard.

 

In addition, I kinda agree that the length of strings isn’t a “standards” issue, or an implementation issue that we need to comment on anywhere.

 

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Jason Keirstead
Sent: Wednesday, June 01, 2016 10:48 AM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: RE: [cti-stix] Unicode, strings, and STIX

 

RE the encoding language question, I posted some sample language to slack that I think solves the problem:  "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

I do not think the below proposal solves some of the other key questions JMG poses. The most critical question we have is with regards to all of these "max length" properties in the spec and how they will be validated. These things actually *can not* be validated in an encoding-independent way. I have asked a few times and will ask again: in 2016, is "max length" really anything we need to care about here? DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. Modern databases do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler.
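
One way to read the validation concern above is that any limit expressed as storage size depends on the encoding chosen. The following is a minimal Python sketch of that reading, not part of the original message; the sample string and the set of encodings compared are assumptions picked for illustration.

```python
# Illustrative sketch: the sample string and the encodings compared are assumptions.
sample = "\u00f8\u20ac\U0001d11e"  # ø, €, and MUSICAL SYMBOL G CLEF (outside the BMP)


def encoded_lengths(text):
    """Return the code point count and the byte length in three Unicode encodings."""
    return {
        "code_points": len(text),  # Python 3 str length counts code points
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
        "utf-32-le": len(text.encode("utf-32-le")),
    }


lengths = encoded_lengths(sample)
print(lengths)
```

The same three code points occupy a different number of bytes in each encoding, so a byte-based limit cannot be checked without first fixing the encoding.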

Also, the idea that we should say for example "a title should only be 255 code points long" is completely arbitrary IMO and imposing undue limits on the analyst.

-
Jason Keirstead
STSM, Product Architect, Security Intelligence, IBM Security Systems
www.ibm.com/security | www.securityintelligence.com

Without data, all you are is just another person with an opinion - Unknown



From: "Piazza, Rich" <rpiazza@mitre.org>
To: Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>
Cc: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Date: 06/01/2016 11:39 AM
Subject: RE: [cti-stix] Unicode, strings, and STIX
Sent by: <cti-stix@lists.oasis-open.org>





+1

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Terry MacDonald
Sent: Wednesday, June 01, 2016 6:09 AM
To: John-Mark Gurney <jmg@newcontext.com>
Cc: cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX


Hi John-Mark,

My issue with this is that it isn't simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmer's mentality. You have to be a programmer and understand all these terms and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible. And that got me thinking....

We should create a STIX v2.0 JSON serialization document that specifies the JSON-specific implementations in normative statements, and this should be separate from the STIX v2.0 standards document. JSON examples should absolutely be kept in the STIX v2.0 standards document to help readers conceptualise the standard, and to see how it would work in practice, but the examples in the standards document should only be for illustrative purposes.

Doing things this way we will achieve a few key benefits:

· The STIX v2.0 Standards document will be easier to read with plain language, and still have examples to clarify meaning to the reader.
· The STIX v2.0 Standards document will describe the standard itself, and will not have specific JSON implementation details in there, which will make it easier to apply to additional serialisation formats in the future.
· Detailed implementation requirements for the JSON MTI serialization will be in a JSON-specific document. This will ensure
· Using this structure will set ourselves up for the future, enabling creation of additional serializations if we want in the future (binary anyone?).


Cheers

Terry MacDonald | Chief Product Officer


M: +61-407-203-026
E: terry.macdonald@cosive.com
W: www.cosive.com




On Wed, Jun 1, 2016 at 5:55 AM, John-Mark Gurney <jmg@newcontext.com> wrote:
Hello,

In attempting to nail down the definition of the type string, there have been a few questions raised about the best definition. I do not believe there is any disagreement that Unicode will be used for the string representation, it is more how to address some of the things about handling the string type.

You may have heard various talk about character vs code point vs glyph vs grapheme, and I found a good post answering the distinction between them at http://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme . I will talk about encoding later.

So, at the most basic, a string is a sequence of Unicode code points. Some strings may have more code points than others even though they are equivalent: ø (1 code point) vs o w/ combining long solidus overlay (2 code points), though when normalized (NFC), they will be equal. Sadly, some other code points are ligatures, which are not expanded when normalized (NFC), so the fi ligature is not equal to the letter f followed by i after NFC normalization. NFKC will make them equal, but will destroy the meaning of other symbols; for example, a superscript 2 becomes a normal 2.
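
The normalization behaviour described above can be checked with Python's standard `unicodedata` module. This is a minimal sketch rather than anything from the original message. As an assumption for illustration it uses é (precomposed vs. e plus a combining acute accent) as the equivalent pair, because that pair is known to compose under NFC; ø appears to have no canonical decomposition in the Unicode Character Database, and the sketch checks that case separately.

```python
import unicodedata

# Assumed illustrative pair: é precomposed vs. e + COMBINING ACUTE ACCENT.
composed = "\u00e9"
decomposed = "e\u0301"
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# ø vs. o + COMBINING LONG SOLIDUS OVERLAY: checked separately.
stroke_equal = unicodedata.normalize("NFC", "o\u0338") == "\u00f8"

# The fi ligature survives NFC but is expanded by NFKC.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

# NFKC also folds a superscript two into a plain digit, losing meaning.
nfkc_superscript = unicodedata.normalize("NFKC", "\u00b2")

print(len(composed), len(decomposed), nfc_equal, stroke_equal)
print(nfc_ligature == ligature, nfkc_ligature, nfkc_superscript)
```

Which normalization form (if any) a specification should require is a design choice; the sketch only shows what each form does to these particular strings.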

1) Should we add length restrictions to (some?) fields? For example, should the title field be restricted in its length somehow? Or should people be able to put unlimited length text in the field? Some fields like description, I expect would possibly be unlimited sans some other overriding limit, such as total TLO size, etc.

2) If there are length limits, how should the length limit be defined? Should it be the number of graphemes displayed? Be careful of this, because things like Zalgo (http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work) make even a short ~25 grapheme string have ~292 code points, or 559 bytes when UTF-8 encoded. Though no language will normally use so many combining code points, some languages do require more than one. Normalization can help reduce a string's number of code points, but does not always help. Some languages, like Thai, will use more than one combining code point to make a single grapheme (consonant + vowel + tone mark: three code points for a single grapheme).
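
To make the three candidate units concrete, here is a minimal Python sketch (not from the original message) that measures a string in code points, UTF-8 bytes, and an approximate grapheme count. The grapheme count is a deliberate simplification and an assumption of this sketch: it treats every non-mark code point as the start of a new grapheme, whereas real segmentation follows the rules in the Unicode Text Segmentation report. The helper name `measure` and both sample strings are hypothetical.

```python
import unicodedata


def measure(text):
    """Return (code points, UTF-8 bytes, approximate graphemes) for text."""
    code_points = len(text)  # Python 3 str length is in code points
    utf8_bytes = len(text.encode("utf-8"))
    # Approximation (assumption): a code point whose general category is not
    # a mark (Mn, Mc, Me) starts a new grapheme. This is not full UAX #29.
    approx_graphemes = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return code_points, utf8_bytes, approx_graphemes


# Thai consonant KO KAI + vowel SARA II + tone mark MAI EK: one grapheme.
thai_syllable = "\u0e01\u0e35\u0e48"

# A small Zalgo-like string: one letter with ten stacked combining accents.
stacked = "a" + "\u0301" * 10

print(measure(thai_syllable))
print(measure(stacked))
```

Both samples display as a single grapheme, yet their code point and byte counts differ widely, which is the gap a length limit has to account for.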

If graphemes are used, it would require a validator to have a detailed table to decide how many graphemes are in the string. Using code points would not require as much work for the validator.

There is an additional issue of encoding, but this should be easy. It should use the underlying serialization format's encoding of Unicode. In the case of JSON, the default is UTF-8. In the case of XML, it can be specified by the document itself, and may even be in a non-UTF encoding, but it is assumed that if the document is in a different character set, that the processor will convert to Unicode code points properly.
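
As a minimal illustration of leaving encoding to the serialization format, the Python sketch below (not from the original message) writes the same string to JSON in two ways, as raw UTF-8 and as an ASCII `\u` escape, and shows that both decode to the same code points. The `title` key and its value are assumptions chosen for the example.

```python
import json

title = "\u00f8"  # ø, a single code point

# Raw UTF-8 bytes on the wire.
utf8_payload = json.dumps({"title": title}, ensure_ascii=False).encode("utf-8")

# ASCII-only JSON using a \u escape (json.dumps default behaviour).
escaped_payload = json.dumps({"title": title}).encode("utf-8")

decoded_utf8 = json.loads(utf8_payload.decode("utf-8"))
decoded_escaped = json.loads(escaped_payload.decode("utf-8"))

print(utf8_payload, escaped_payload)
print(decoded_utf8 == decoded_escaped, len(decoded_utf8["title"]))
```

The two payloads differ in byte length, but a consumer that decodes to code points sees the same one-code-point string either way.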

Additional Reading:
UNICODE TEXT SEGMENTATION http://unicode.org/reports/tr29/ -- has additional examples of grapheme and code points.
Internationalization for Turkish: Dotted and Dotless Letter "I" http://www.i18nguy.com/unicode/turkish-i18n.html -- More deals w/ complexities of locales than the above
Forms of Unicode http://www.icu-project.org/docs/papers/forms_of_unicode/ -- Good description of glyph vs characters vs ligatures and encoding info

My recommendations:
1) I do believe that limits should be defined for some fields. Things like title should not have the description in them, and leaving it undefined will allow it to happen.

2) My personal view (as a programmer of many years) is to go the simple route and limit it by code points. This is easiest for a programmer to do w/ existing tools. It also gives a clearer storage space limit (see the Zalgo example above).
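
A code point limit of the kind recommended above could be checked with very little code. The following Python sketch is an illustration only: the function name, the constant, and the 255 figure (borrowed from the example number mentioned elsewhere in this thread) are all hypothetical and not taken from any STIX specification.

```python
# Hypothetical limit for illustration; not a value from any STIX specification.
MAX_TITLE_CODE_POINTS = 255


def within_code_point_limit(value, limit=MAX_TITLE_CODE_POINTS):
    """Return True if value contains at most `limit` Unicode code points."""
    # In Python 3, len() on a str counts code points, not bytes or graphemes.
    return len(value) <= limit


short_title = "Phishing campaign"
zalgo_like = "a" + "\u0301" * 300  # one displayed grapheme, 301 code points

print(within_code_point_limit(short_title))
print(within_code_point_limit(zalgo_like))
```

Because the check counts code points, a string that displays as a single grapheme can still exceed the limit, which bounds storage more predictably than a grapheme count would.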

John-Mark
New Context



 


