cti-stix message

Subject: RE: [cti-stix] Unicode, strings, and STIX

From: "Jason Keirstead" <Jason.Keirstead@ca.ibm.com>
To: Terry MacDonald <terry.macdonald@cosive.com>
Date: Thu, 2 Jun 2016 09:17:10 -0300

There is simply no logical way to define a "max length" in a way that protects against "buffer overflow" problems with Unicode... so if buffer overflow is the main motivation for this

- If we say "max_length" of title means 255 *BYTES*, then in some languages that is going to result in a much shorter title than in other languages - and furthermore, you could be truncating it in the middle of a character (grapheme), making it all the more invalid for the person entering it on their screen.

- If we say "max_length" of title means 255 *code points*, then some languages will be allowed shorter titles than others, and it could also equal an arbitrary number of bytes, as it depends on the encoding and the language being encoded. And you still have the problem of truncating in the middle of a character (grapheme).

- If we say "max_length" of title means 255 *graphemes*, then all languages are allowed the same title length, and you have no problems truncating in the middle of a character. However, it means a title could equal an arbitrary number of bytes.
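
The byte and code point cases above can be illustrated with a minimal Python sketch. The helper names (`utf8_len`, `is_valid_utf8`), the 10-byte limit, and the sample strings are assumptions chosen for illustration; nothing here was proposed in the thread.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin_title = "a" * 10       # 10 code points, 1 byte each in UTF-8
thai_title = "\u0e01" * 10   # 10 code points (THAI CHARACTER KO KAI), 3 bytes each

# Same number of code points, very different number of bytes.
print(len(latin_title), utf8_len(latin_title))
print(len(thai_title), utf8_len(thai_title))

# A hard cut at 10 bytes lands inside the fourth Thai character,
# so the truncated value is no longer valid UTF-8.
truncated = thai_title.encode("utf-8")[:10]
print(is_valid_utf8(truncated))
```

A byte limit therefore needs truncation logic that is aware of character boundaries, which is part of why a byte-based definition is awkward.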

I say throw it out.

-
Jason Keirstead
STSM, Product Architect, Security Intelligence, IBM Security Systems
www.ibm.com/security | www.securityintelligence.com

Without data, all you are is just another person with an opinion - Unknown

From: Terry MacDonald <terry.macdonald@cosive.com>
To: Rich Piazza <rpiazza@mitre.org>
Cc: John-Mark Gurney <jmg@newcontext.com>, Jason Keirstead/CanEast/IBM@IBMCA, "Jordan, Bret" <bret.jordan@bluecoat.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Date: 06/01/2016 07:19 PM
Subject: RE: [cti-stix] Unicode, strings, and STIX
Sent by: <cti-stix@lists.oasis-open.org>

I think having built in maximum field size is pragmatic. We don't want to design buffer overflow susceptibility into all STIX services just because we couldn't agree where to place text limiting field lengths.

I personally think that maximum field length should be defined in the STIX standards doc for each STIX type (e.g. boolean, number), and that it should be sized in Unicode characters. Then in each serialisation document (e.g. in a JSON serialisation doc) we should convert that Unicode character length into whatever length definition makes sense for that serialisation format, e.g. JSON and the use of code points.

I really don't want to be responsible for creating threat intelligence hacks in 2-5 years from now because of a decision we made today.

Cheers
Terry MacDonald
Cosive

On 2/06/2016 04:17, "Piazza, Rich" <rpiazza@mitre.org> wrote:

I think the spec would have to say something like: “Any length is permitted”

Then, implementers would have to make sure they could support that.

In STIX 1.2.1, the description field of all of the objects had this text in the specification documents. I’m not sure in which direction that will sway you :)

From: Jordan, Bret [mailto:bret.jordan@bluecoat.com]
Sent: Wednesday, June 01, 2016 1:38 PM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

If we do not define a max length then everyone will set their own. And we will have problems.

Bret

Sent from my Commodore 64

On Jun 1, 2016, at 8:08 AM, Piazza, Rich <rpiazza@mitre.org> wrote:

My +1 was for the idea that implementation details like this do not belong in the standard.

In addition, I kinda agree that the length of strings isn’t a “standards” issue, or an implementation issue that we need to comment on anywhere.

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Jason Keirstead
Sent: Wednesday, June 01, 2016 10:48 AM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: RE: [cti-stix] Unicode, strings, and STIX

RE the encoding language question, I posted some sample language to slack that I think solves the problem: "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

I do not think the below proposal solves some of the other key questions JMG poses. The most critical question we have is with regards to all of these "max length" properties in the spec and how they will be validated. These things actually *cannot* be validated in an encoding-independent way. I have asked a few times and will ask again - in 2016, is "max length" really anything we need to care about here? DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. Modern databases do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler.

Also, the idea that we should say for example "a title should only be 255 code points long" is completely arbitrary IMO and imposing undue limits on the analyst.

-
Jason Keirstead
STSM, Product Architect, Security Intelligence, IBM Security Systems
www.ibm.com/security | www.securityintelligence.com

Without data, all you are is just another person with an opinion - Unknown

"Piazza, Rich" ---06/01/2016 11:39:45 AM---+1 From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Terry Mac

From: "Piazza, Rich" <rpiazza@mitre.org>
To: Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>
Cc: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Date: 06/01/2016 11:39 AM
Subject: RE: [cti-stix] Unicode, strings, and STIX
Sent by: <cti-stix@lists.oasis-open.org>

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Terry MacDonald
Sent: Wednesday, June 01, 2016 6:09 AM
To: John-Mark Gurney <jmg@newcontext.com>
Cc: cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

Hi John-Mark,

My issue with this is that it isn't simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmer's mentality. You have to be a programmer and understand all these terms and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible. And that got me thinking....

We should create a STIX v2.0 JSON serialization document that specifies the JSON specific implementations in normative statements, and this should be separate from the STIX v2.0 standards document. JSON examples should absolutely be kept in the STIX v2.0 standards document to help readers conceptualise the standard, and to see how it would work in practice, but the examples in the standards document should only be for illustrative purposes.

Doing things this way we will achieve a few key benefits:

- The STIX v2.0 Standards document will be easier to read with plain language, and still have examples to clarify meaning to the reader.

- The STIX v2.0 Standards document will describe the standard itself, and will not have specific JSON implementation details in there, which will make it easier to apply to additional serialisation formats in the future.

- Detailed implementation requirements for the JSON MTI serialization will be in a JSON specific document. This will ensure

Using this structure will set us up for the future, enabling creation of additional serializations if we want (binary anyone?).

Cheers

Terry MacDonald | Chief Product Officer
+61-407-203-026
terry.macdonald@cosive.com
www.cosive.com

On Wed, Jun 1, 2016 at 5:55 AM, John-Mark Gurney <jmg@newcontext.com> wrote:

Hello,

In attempting to nail down the definition of the type string, there have been a few questions raised about the best definition. I do not believe there is any disagreement that Unicode will be used for the string representation; the question is more how to address some aspects of handling the string type.

You may have heard various talk about character vs code point vs glyph vs grapheme, and I found a good post answering the distinction between them at http://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme. I will talk about encoding later.

So, at the most basic, a string is a sequence of Unicode code points. Some strings have more code points than others even though they are equivalent, e.g. é (1) vs e w/ combining acute accent (2); when normalized (NFC), they will be equal. Sadly, some other code points are ligatures, which are not expanded when normalized (NFC), so the fi ligature is not equal to the letter f followed by i when normalized (NFC). NFKC will make them equal, but will destroy the meaning of other symbols, e.g. a superscript 2 becomes a normal 2.
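
The normalization behaviour described above can be checked with Python's standard `unicodedata` module. This is only a sketch; the variable and helper names are assumptions for illustration. The pair é / e + U+0301 is used because it does compose under NFC, and the last line shows a pair that looks alike but has no canonical mapping.

```python
import unicodedata

composed = "\u00e9"          # é as a single code point
decomposed = "e\u0301"       # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"          # LATIN SMALL LIGATURE FI
superscript_two = "\u00b2"   # SUPERSCRIPT TWO


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


# Raw comparison fails, NFC makes the canonically equivalent forms equal.
print(composed == decomposed)
print(nfc(composed) == nfc(decomposed))

# NFC leaves the ligature alone, NFKC expands it to "f" + "i".
print(nfc(ligature) == "fi")
print(nfkc(ligature) == "fi")

# NFKC also flattens the superscript into a plain digit.
print(nfkc(superscript_two) == "2")

# Caveat: not every look-alike pair is canonically equivalent.
# ø (U+00F8) has no decomposition, so o + U+0338 stays distinct under NFC.
print(nfc("o\u0338") == "\u00f8")
```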

1) Should we add length restrictions to (some?) fields? For example, should the title field be restricted in its length somehow? Or should people be able to put unlimited length text in the field? Some fields like description, I expect would possibly be unlimited sans some other overriding limit, such as total TLO size, etc.

2) If there are length limits, how should the length limit be defined? Should it be the number of graphemes displayed? Be careful with this, because things like Zalgo (http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work) make even a short ~25 grapheme string have ~292 code points, or 559 bytes when UTF-8 encoded. Though no language will normally use so many combining code points, some languages do require more than one. Normalization can help reduce a string's number of code points, but does not always help. Some languages, like Thai, will use more than one combining code point to make a single grapheme (consonant + vowel + tone mark, for three code points in a single grapheme).
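
To make the grapheme / code point / byte gap concrete, here is a rough Python sketch. The function names are assumptions, and `approx_grapheme_count` is a deliberate simplification: it only skips combining marks (general categories Mn, Mc, Me) and does not implement the full segmentation rules of UAX #29.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Rough estimate only: count code points that are not combining marks."""
    return sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )


def measure(text: str) -> dict:
    """Report the three candidate 'lengths' of a string."""
    return {
        "graphemes_approx": approx_grapheme_count(text),
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Thai: consonant KO KAI + vowel SARA II + tone mark MAI EK.
thai_syllable = "\u0e01\u0e35\u0e48"

# Zalgo-style stacking: one base letter with ten combining acute accents.
stacked = "a" + "\u0301" * 10

print(measure(thai_syllable))
print(measure(stacked))
```

Both samples display as a single character, yet their code point and byte counts differ widely, which is the storage concern raised above.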

If graphemes are used, it would require a validator to have a detailed table to decide how many graphemes are in the string. Using code points would not require as much work for the validator.
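
A code point check, by contrast, needs no tables. The sketch below assumes Python, where `len()` on a `str` already counts code points; the function names and the 255 limit are illustrative assumptions (255 is just the example figure used elsewhere in this thread). In languages whose strings are UTF-16 based, the built-in length counts code units instead, so the same check needs more care there.

```python
def within_code_point_limit(value: str, max_code_points: int) -> bool:
    """True if the string has at most max_code_points Unicode code points."""
    return len(value) <= max_code_points


def utf16_code_units(value: str) -> int:
    """What a UTF-16 based string type would report as the length."""
    return len(value.encode("utf-16-le")) // 2


TITLE_LIMIT = 255  # hypothetical limit, for illustration only

print(within_code_point_limit("x" * 255, TITLE_LIMIT))
print(within_code_point_limit("x" * 256, TITLE_LIMIT))

# A code point outside the Basic Multilingual Plane is one code point
# but two UTF-16 code units (a surrogate pair).
emoji = "\U0001f600"
print(len(emoji), utf16_code_units(emoji))
```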

There is an additional issue of encoding, but this should be easy. It should use the underlying serialization format's encoding of Unicode. In the case of JSON, the default is UTF-8. In the case of XML, it can be specified by the document itself, and may even be in a non-UTF encoding, but it is assumed that if the document is in a different character set, that the processor will convert to Unicode code points properly.
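
As a small illustration of why the encoding can be left to the serialization layer, the Python sketch below writes the same JSON text in two Unicode encodings and parses both back. The document contents and variable names are assumptions for illustration; UTF-16 is used here only to show the contrast with UTF-8.

```python
import json

# Seven Thai code points as a sample title.
doc = {"title": "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22"}

# ensure_ascii=False keeps the Thai text as-is instead of \uXXXX escapes.
text = json.dumps(doc, ensure_ascii=False)

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

# The wire sizes differ between encodings...
print(len(as_utf8), len(as_utf16))

# ...but once decoded, the parser sees the same code points either way.
title_from_utf8 = json.loads(as_utf8.decode("utf-8"))["title"]
title_from_utf16 = json.loads(as_utf16.decode("utf-16-le"))["title"]
print(title_from_utf8 == title_from_utf16, len(title_from_utf8))
```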

Additional Reading:
UNICODE TEXT SEGMENTATION
http://unicode.org/reports/tr29/ -- has additional examples of grapheme and code points.
Internationalization for Turkish: Dotted and Dotless Letter "I"
http://www.i18nguy.com/unicode/turkish-i18n.html -- deals more w/ the complexities of locales than the above.
Forms of Unicode
http://www.icu-project.org/docs/papers/forms_of_unicode/ -- good description of glyph vs characters vs ligatures and encoding info.

My recommendations:
1) I do believe that limits should be defined for some fields. Things like title should not have the description in them, and leaving it undefined will allow it to happen.

2) My personal view (as a programmer of many years) is to go the simple route and limit it by code points. This is easiest for a programmer to do w/ existing tools. It also gives a clearer storage space limit (see the Zalgo example above).

John-Mark
New Context
