
cti-stix message



Subject: Re: [cti-stix] Unicode, strings, and STIX


HTTP sez:


"

 The HTTP protocol does not place any a priori limit on the length of
   a URI. Servers MUST be able to handle the URI of any resource they
   serve, and SHOULD be able to handle URIs of unbounded length if they
   provide GET-based forms that could generate such URIs. A server
   SHOULD return 414 (Request-URI Too Long) status if a URI is longer
   than the server can handle (see section 10.4.15).

      Note: Servers ought to be cautious about depending on URI lengths
      above 255 bytes, because some older client or proxy
      implementations might not properly support these lengths.
"


The focus of HTTP was not to define a schema (as in, "how long is a String type?"), but to promote interoperability via a standard API (as in, the 4 standard verbs: GET/POST/PUT/DELETE). In this snippet, we see how HTTP addresses the situation when the length of something is unknown and potentially too long for one party in the conversation.

Yes, as Mark Davidson points out, the globally-discoverable minimal length of HTTP Headers is really the smallest of any implementation. So, if you want your HTTP thing to interoperate with everybody with the least friction, you have to find and use that minimum length.

But, HTTP has you covered if you don't know (or can't know, really) what that globally-discoverable minimal length is. Simply put: HTTP gives the communicating parties a means to say, "Too big! Sorry!"
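
The "Too big! Sorry!" exchange can be sketched in a few lines. This is only an illustration, not anything defined by HTTP or STIX: `MAX_URI_BYTES` and `check_request_uri` are hypothetical names, and the 2048-byte limit is an arbitrary assumption standing in for whatever one implementation can handle.

```python
from http import HTTPStatus

# Hypothetical, implementation-specific limit; HTTP itself defines none.
MAX_URI_BYTES = 2048

def check_request_uri(uri: str, limit: int = MAX_URI_BYTES) -> HTTPStatus:
    """Return 414 if the URI is longer than this receiver can handle, else 200."""
    if len(uri.encode("utf-8")) > limit:
        return HTTPStatus.REQUEST_URI_TOO_LONG  # 414
    return HTTPStatus.OK

print(check_request_uri("/indicators?id=1").value)         # 200
print(check_request_uri("/search?q=" + "a" * 5000).value)  # 414
```

The sender never needs to know the limit in advance; it only needs to understand the refusal.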

Here's the question that most compels me: How can we avoid arguing about schema (lengths of strings and other datatype questions) and allow the communicating parties to tell each other "I can't handle that! Sorry!"

JSA


From: cti-stix@lists.oasis-open.org <cti-stix@lists.oasis-open.org> on behalf of Wunder, John A. <jwunder@mitre.org>
Sent: Thursday, June 2, 2016 10:17:30 AM
To: cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX
 

This struck me as the type of thing that must have been done before, so I did a little research on what other similar specifications (data models, not transport protocols) did:

 

-          IODEF: no max lengths specified

-          CIQ: no max lengths specified

-          HL7: specify a minimum length that specifications have to be able to handle, but no maximum length (could not find actual language here because the specs are not free, so I asked a colleague)

-          HDATA: no max lengths specified

-          SMTP: some fields have max length (in characters), some don’t

-          OASIS CAP: no max lengths, they have a MAY requirement for some fields suggesting a max size that would be appropriate

-          EDXL: no max lengths

 

To be honest I went into this thinking that we needed to specify max lengths, but based on this research maybe we shouldn’t? Rich’s approach below seems best to me.

 

Are there any other specs we could learn from? What did I miss?

 

John

 

 

From: <cti-stix@lists.oasis-open.org> on behalf of Jason Keirstead <Jason.Keirstead@ca.ibm.com>
Date: Thursday, June 2, 2016 at 10:10 AM
To: Rich Piazza <rpiazza@mitre.org>
Cc: Mark Davidson <mdavidson@soltra.com>, "Jordan, Bret" <bret.jordan@bluecoat.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: RE: [cti-stix] Unicode, strings, and STIX

 

If the consensus is that we *must* specify some length recommendation, then this is a good way to attempt to do so.


-
Jason Keirstead
STSM, Product Architect, Security Intelligence, IBM Security Systems
www.ibm.com/security | www.securityintelligence.com

Without data, all you are is just another person with an opinion - Unknown



From: "Piazza, Rich" <rpiazza@mitre.org>
To: Mark Davidson <mdavidson@soltra.com>, "Jordan, Bret" <bret.jordan@bluecoat.com>
Cc: Jason Keirstead/CanEast/IBM@IBMCA, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Date: 06/02/2016 10:24 AM
Subject: RE: [cti-stix] Unicode, strings, and STIX
Sent by: <cti-stix@lists.oasis-open.org>





Maybe say instead: Any length SHOULD be permitted

Then maybe in the implementation guide say: suggested storage size is 8KB…

From: Mark Davidson [mailto:mdavidson@soltra.com]
Sent: Thursday, June 02, 2016 8:53 AM
To: Piazza, Rich <rpiazza@mitre.org>; Jordan, Bret <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX


I guess I have also provided evidence that a spec can be widely implemented without specifying max lengths on important fields. The drawback, however, is that the max length will end up being the shortest supported value from major implementations, and it will only be discovered through painful research.

-Mark

From: <cti-stix@lists.oasis-open.org> on behalf of Mark Davidson <mdavidson@soltra.com>
Date: Thursday, June 2, 2016 at 8:49 AM
To: "Piazza, Rich" <rpiazza@mitre.org>, "Jordan, Bret" <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: Re: [cti-stix] Unicode, strings, and STIX


There needs to be a limit, even if it’s a SHOULD requirement. If we don’t specify it, we’ll get SO posts like this: http://stackoverflow.com/questions/686217/maximum-on-http-header-values

Thank you.
-Mark

From: <cti-stix@lists.oasis-open.org> on behalf of "Piazza, Rich" <rpiazza@mitre.org>
Date: Wednesday, June 1, 2016 at 2:17 PM
To: "Jordan, Bret" <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: RE: [cti-stix] Unicode, strings, and STIX


I think the spec would have to say something like – “Any length is permitted”

Then, implementers would have to make sure they could support that.

In STIX 1.2.1, the description field of all of the objects had this text in the specification documents. I’m not sure in which direction that will sway you :)

From: Jordan, Bret [mailto:bret.jordan@bluecoat.com]
Sent: Wednesday, June 01, 2016 1:38 PM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

If we do not define a max length then everyone will set their own. And we will have problems.

Bret

Sent from my Commodore 64


On Jun 1, 2016, at 8:08 AM, Piazza, Rich <rpiazza@mitre.org> wrote:

My +1 was for the idea that implementation details like this do not belong in the standard.

In addition, I kinda agree that the length of strings isn’t a “standards” issue, or an implementation issue that we need to comment on anywhere.

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Jason Keirstead
Sent: Wednesday, June 01, 2016 10:48 AM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: RE: [cti-stix] Unicode, strings, and STIX

RE the encoding language question, I posted some sample language to slack that I think solves the problem: "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

I do not think the below proposal solves some of the other key questions JMG poses. The most critical question we have is with regard to all of these "max length" properties in the spec and how they will be validated. These things actually *cannot* be validated in an encoding-independent way. I have asked a few times and will ask again: in 2016, is "max length" really anything we need to care about here? DBAs may have a bit of heartburn, but IMO it is not something we should be concerned with in STIX. Modern databases do not pre-allocate storage for columns anymore anyway. I would rather just forget about the idea. It makes things a lot simpler.
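
One way to read the encoding point: if a limit is checked against stored or transmitted size, the same string gives a different answer under each encoding. The sketch below is illustrative only; `encoded_lengths` is a hypothetical helper and the sample title is made up.

```python
def encoded_lengths(value: str) -> dict:
    """Byte length of the same string under three Unicode encodings."""
    return {enc: len(value.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

title = "na\u00efve caf\u00e9"  # "naïve café", 10 code points

print(len(title))              # 10
print(encoded_lengths(title))  # {'utf-8': 12, 'utf-16-le': 20, 'utf-32-le': 40}
```

A byte-based "max length" of 16 would accept this title in UTF-8 and reject it in UTF-16.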

Also, the idea that we should say for example "a title should only be 255 code points long" is completely arbitrary IMO and imposing undue limits on the analyst.

-
Jason Keirstead
STSM, Product Architect, Security Intelligence, IBM Security Systems
www.ibm.com/security | www.securityintelligence.com

Without data, all you are is just another person with an opinion - Unknown



From: "Piazza, Rich" <rpiazza@mitre.org>
To: Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>
Cc: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Date: 06/01/2016 11:39 AM
Subject: RE: [cti-stix] Unicode, strings, and STIX
Sent by: <cti-stix@lists.oasis-open.org>






+1

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Terry MacDonald
Sent: Wednesday, June 01, 2016 6:09 AM
To: John-Mark Gurney <jmg@newcontext.com>
Cc: cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

Hi John-Mark,


My issue with this is that it isn't simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmer's mentality. You have to be a programmer and understand all these terms and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible. And that got me thinking....


We should create a STIX v2.0 JSON serialization document that specifies the JSON-specific implementations in normative statements, and this should be separate from the STIX v2.0 standards document. JSON examples should absolutely be kept in the STIX v2.0 standards document to help readers conceptualise the standard, and to see how it would work in practice, but the examples in the standards document should only be for illustrative purposes.


Doing things this way we will achieve a few key benefits:

· The STIX v2.0 Standards document will be easier to read with plain language, and still have examples to clarify meaning to the reader.
· The STIX v2.0 Standards document will describe the standard itself, and will not have specific JSON implementation details in there, which will make it easier to apply to additional serialisation formats in the future.
· Detailed implementation requirements for the JSON MTI serialization will be in a JSON specific document. This will ensure
· Using this structure will set ourselves up for the future, enabling creation of additional serializations if we want in the future (binary anyone?).


Cheers


Terry MacDonald | Chief Product Officer

M: +61-407-203-026
E: terry.macdonald@cosive.com
W: www.cosive.com




On Wed, Jun 1, 2016 at 5:55 AM, John-Mark Gurney <jmg@newcontext.com> wrote:
Hello,

In attempting to nail down the definition of the type string, there have been a few questions raised about the best definition. I do not believe there is any disagreement that Unicode will be used for the string representation, it is more how to address some of the things about handling the string type.

You may have heard various talk about character vs code point vs glyph vs grapheme, and I found a good post answering the distinction between them at
http://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme . I will talk about encoding later.

So, at the most basic, a string is a sequence of Unicode code points. Some strings may have more code points than others even though they are equivalent: é as a single precomposed code point (1) vs e w/ combining acute accent (2). When normalized (NFC), they will be equal. Sadly, some other code points are ligatures, which are not expanded when normalized (NFC), so the fi ligature is not equal to the letter f followed by i after NFC normalization. NFKC will make them equal, but will destroy the meaning of other symbols; for example, a superscript 2 becomes a normal 2.
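
The normalization behavior described above can be checked with Python's standard `unicodedata` module. This is a small sketch for illustration: the `nfc` and `nfkc` helpers and the sample strings are my own, and it uses é (U+00E9) against e + U+0301 as the composed/decomposed pair because that character has a canonical decomposition.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
squared = "x\u00b2"      # x followed by SUPERSCRIPT TWO

print(len(composed), len(decomposed))    # 1 2
print(composed == decomposed)            # False
print(nfc(composed) == nfc(decomposed))  # True
print(nfc(ligature) == "fi")             # False: NFC keeps the ligature
print(nfkc(ligature) == "fi")            # True: NFKC expands it
print(nfkc(squared))                     # x2: the superscript meaning is lost
```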

1) Should we add length restrictions to (some?) fields? For example, should the title field be restricted in its length somehow? Or should people be able to put unlimited length text in the field? Some fields like description, I expect would possibly be unlimited sans some other overriding limit, such as total TLO size, etc.

2) If there are length limits, how should the length limit be defined? Should it be the number of graphemes displayed? Be careful with this, because things like Zalgo (http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work) make even a short ~25 grapheme string have ~292 code points, or 559 bytes when UTF-8 encoded. Though no language will normally use so many combining code points, some languages do require more than one. Normalization can help reduce a string's number of code points, but does not always help. Some languages, like Thai, will use more than one combining code point to make a single grapheme (consonant + vowel + tone mark: three code points for a single grapheme).

If graphemes are used, it would require a validator to have a detailed table to decide how many graphemes are in the string. Using code points would not require as much work for the validator.
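
To make the three possible units concrete, here is a rough sketch that counts code points, UTF-8 bytes, and an approximate number of graphemes for the same string. The grapheme count is a deliberate simplification, not the UAX #29 algorithm: it just skips code points whose general category is a mark (Mn, Mc, Me), which is why a real validator would need the detailed tables mentioned above. All function names and sample strings are hypothetical.

```python
import unicodedata

def code_points(s: str) -> int:
    return len(s)

def utf8_bytes(s: str) -> int:
    return len(s.encode("utf-8"))

def approx_graphemes(s: str) -> int:
    """Approximation only: count code points that are not combining marks."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

thai = "\u0e01\u0e35\u0e48"    # Thai consonant + vowel sign + tone mark
stacked = "a" + "\u0301" * 10  # one letter with ten stacked accents

for s in (thai, stacked):
    print(code_points(s), utf8_bytes(s), approx_graphemes(s))
# 3 9 1
# 11 21 1
```

Both strings display as a single grapheme, yet their code point and byte counts differ widely.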

There is an additional issue of encoding, but this should be easy. It should use the underlying serialization format's encoding of Unicode. In the case of JSON, the default is UTF-8. In the case of XML, it can be specified by the document itself, and may even be in a non-UTF encoding, but it is assumed that if the document is in a different character set, that the processor will convert to Unicode code points properly.
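
As a small illustration of leaving encoding to the serialization layer, the sketch below writes the same object as JSON two ways (raw UTF-8 and `\uXXXX` escapes) and shows that both decode to identical code points. The `to_wire` and `from_wire` helpers and the sample object are hypothetical; only the standard `json` module is used.

```python
import json

def to_wire(obj: dict, ascii_only: bool) -> bytes:
    """Serialize to JSON text, then encode that text as UTF-8 bytes."""
    return json.dumps(obj, ensure_ascii=ascii_only).encode("utf-8")

def from_wire(data: bytes) -> dict:
    return json.loads(data.decode("utf-8"))

report = {"title": "caf\u00e9 \u0e01\u0e35\u0e48"}  # made-up sample value

raw = to_wire(report, ascii_only=False)     # non-ASCII sent as UTF-8 bytes
escaped = to_wire(report, ascii_only=True)  # non-ASCII sent as \uXXXX escapes

print(raw == escaped)                                  # False
print(from_wire(raw) == from_wire(escaped) == report)  # True
print(len(from_wire(raw)["title"]))                    # 8
```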

Additional Reading:
UNICODE TEXT SEGMENTATION
http://unicode.org/reports/tr29/ -- has additional examples of grapheme and code points.
Internationalization for Turkish: Dotted and Dotless Letter "I"
http://www.i18nguy.com/unicode/turkish-i18n.html -- Deals more w/ the complexities of locales than the above
Forms of Unicode
http://www.icu-project.org/docs/papers/forms_of_unicode/ -- Good description of glyph vs characters vs ligatures and encoding info

My recommendations:
1) I do believe that limits should be defined for some fields. Things like title should not have the description in them, and leaving it undefined will allow it to happen.

2) My personal view (as a programmer of many years) is to go the simple route and limit it by code points. This is easiest for a programmer to do w/ existing tools. It also gives a clearer storage space limit (see the Zalgo example above).
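
A code-point limit of this kind might look like the sketch below. It is only an illustration under my own assumptions: `within_limit` is a hypothetical helper, the 255 figure is just the number floated earlier in the thread, and normalizing to NFC before counting is an optional choice rather than anything proposed for the spec.

```python
import unicodedata

def within_limit(value: str, max_code_points: int, normalize: bool = True) -> bool:
    """Check a string against a code-point limit, optionally after NFC."""
    if normalize:
        value = unicodedata.normalize("NFC", value)
    return len(value) <= max_code_points

decomposed_title = "e\u0301" * 255  # 510 code points before NFC, 255 after

print(within_limit(decomposed_title, 255))                   # True
print(within_limit(decomposed_title, 255, normalize=False))  # False
print(within_limit("a" * 256, 255))                          # False
```

The first two results show why the spec would also have to say whether the count applies before or after normalization.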

John-Mark
New Context

 

 


