[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: RE: [cti-stix] Unicode, strings, and STIX
If the consensus is that we *must* specify some length recommendation, then this is a good way to attempt to do so. Maybe say instead: "Any length SHOULD be permitted." Then maybe in the implementation guide say: "Suggested storage size is 8KB…"
Sent: Thursday, June 02, 2016 8:53 AM
To: Piazza, Rich <rpiazza@mitre.org>; Jordan, Bret <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

I guess I have also provided evidence that a spec can be widely implemented without specifying max lengths on important fields. The drawback, however, is that the max length will end up being the shortest supported value among the major implementations, and it will only be discovered through painful research.

-Mark

From: <cti-stix@lists.oasis-open.org> on behalf of Mark Davidson <mdavidson@soltra.com>
Date: Thursday, June 2, 2016 at 8:49 AM
To: "Piazza, Rich" <rpiazza@mitre.org>, "Jordan, Bret" <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: Re: [cti-stix] Unicode, strings, and STIX

There needs to be a limit, even if it’s a SHOULD requirement. If we don’t specify one, we’ll get Stack Overflow posts like this: http://stackoverflow.com/questions/686217/maximum-on-http-header-values

Thank you.

-Mark

From: <cti-stix@lists.oasis-open.org> on behalf of "Piazza, Rich" <rpiazza@mitre.org>
Date: Wednesday, June 1, 2016 at 2:17 PM
To: "Jordan, Bret" <bret.jordan@bluecoat.com>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>, Terry MacDonald <terry.macdonald@cosive.com>, John-Mark Gurney <jmg@newcontext.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: RE: [cti-stix] Unicode, strings, and STIX

I think the spec would have to say something like: “Any length is permitted.” Then, implementers would have to make sure they could support that. In STIX 1.2.1, the description field of all of the objects had this text in the specification documents.
I’m not sure in which direction that will sway you :-)
Sent: Wednesday, June 01, 2016 1:38 PM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Jason Keirstead <Jason.Keirstead@ca.ibm.com>; Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

If we do not define a max length, then everyone will set their own, and we will have problems.

Bret

Sent from my Commodore 64

On Jun 1, 2016, at 8:08 AM, Piazza, Rich <rpiazza@mitre.org> wrote:
In addition, I kinda agree that the length of strings isn’t a “standards” issue, or an implementation issue that we need to comment on anywhere.
Sent: Wednesday, June 01, 2016 10:48 AM
To: Piazza, Rich <rpiazza@mitre.org>
Cc: Terry MacDonald <terry.macdonald@cosive.com>; John-Mark Gurney <jmg@newcontext.com>; cti-stix@lists.oasis-open.org
Subject: RE: [cti-stix] Unicode, strings, and STIX

Re the encoding language question, I posted some sample language to Slack that I think solves the problem: "Any serialization of STIX MUST encode all String values in an encoding that follows the Unicode standard".

+1
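As an aside, the proposed normative statement is easy to satisfy in the JSON serialization, since JSON text exchanged between systems uses UTF-8 by default (RFC 8259). A minimal sketch (the object and field name here are hypothetical, purely for illustration):

```python
import json

# A STIX-like object with a non-ASCII string value (hypothetical field).
obj = {"title": "Ångström \u00e9tude"}

# Serialize and encode as UTF-8, a Unicode encoding, which satisfies
# the proposed MUST statement.
wire = json.dumps(obj, ensure_ascii=False).encode("utf-8")

# A receiver decodes the bytes back to the identical code point sequence.
assert json.loads(wire.decode("utf-8")) == obj
```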
Sent: Wednesday, June 01, 2016 6:09 AM
To: John-Mark Gurney <jmg@newcontext.com>
Cc: cti-stix@lists.oasis-open.org
Subject: Re: [cti-stix] Unicode, strings, and STIX

Hi John-Mark,

My issue with this is that it isn't simple enough language for people reading the STIX standard. Not everyone who reads the STIX standards document will be a programmer, or have a programmer's mentality. You have to be a programmer who understands all these terms and subjects before being able to comprehend what's going on within the standard. I firmly believe that we should use common terminology where possible within the standard, to make it as accessible as possible.

And that got me thinking.... We should create a STIX v2.0 JSON serialization document that specifies the JSON-specific implementation details in normative statements, and this should be separate from the STIX v2.0 standards document. JSON examples should absolutely be kept in the STIX v2.0 standards document to help readers conceptualise the standard, and to see how it would work in practice, but the examples in the standards document should only be for illustrative purposes. Doing things this way, we will achieve a few key benefits:
· The STIX v2.0 Standards document will describe the standard itself, without JSON-specific implementation details, which will make it easier to apply the standard to additional serialisation formats in the future.
· Detailed implementation requirements for the JSON MTI serialization will be in a JSON-specific document. This will ensure
· Using this structure will set us up for the future, enabling the creation of additional serializations if we want them (binary, anyone?).

Cheers

Terry MacDonald | Chief Product Officer
M: +61-407-203-026
E: terry.macdonald@cosive.com
W: www.cosive.com

On Wed, Jun 1, 2016 at 5:55 AM, John-Mark Gurney <jmg@newcontext.com> wrote:

Hello,

In attempting to nail down the definition of the string type, a few questions have been raised about the best definition. I do not believe there is any disagreement that Unicode will be used for the string representation; it is more a question of how to address some aspects of handling the string type.

You may have heard various talk about character vs code point vs glyph vs grapheme; I found a good post explaining the distinctions at http://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme . I will talk about encoding later.

So, at the most basic level, a string is a sequence of Unicode code points. Some strings may have more code points than others even though they are canonically equivalent, e.g. é (1 code point) vs e followed by a combining acute accent (2 code points); when normalized (NFC), they will compare equal. Sadly, some other code points are ligatures, which are not expanded by NFC, so the fi ligature is not equal to the letters f followed by i after NFC normalization. NFKC will make them equal, but will destroy the meaning of other symbols, e.g. a superscript 2 becomes a normal 2.

1) Should we add length restrictions to (some?) fields?
For example, should the title field be restricted in its length somehow, or should people be able to put text of unlimited length in the field? Some fields, like description, I expect could be unlimited save for some other overriding limit, such as total TLO size.

2) If there are length limits, how should the limit be defined? Should it be the number of graphemes displayed? Be careful with this: things like Zalgo text (http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work) can make even a short ~25-grapheme string contain ~292 code points, or 559 bytes when UTF-8 encoded. Though no language will normally use that many combining code points, some languages do require more than one. Normalization can help reduce a string's number of code points, but does not always help. Some languages, like Thai, use more than one combining code point to form a single grapheme (consonant + vowel + tone mark: three code points for a single grapheme). If graphemes are used, a validator would need a detailed table to decide how many graphemes a string contains; using code points would require much less work from the validator.

There is an additional issue of encoding, but this should be easy: use the underlying serialization format's encoding of Unicode. In the case of JSON, the default is UTF-8. In the case of XML, the encoding can be specified by the document itself, and may even be a non-UTF encoding, but it is assumed that if the document is in a different character set, the processor will convert it to Unicode code points properly.

Additional reading:

Unicode Text Segmentation http://unicode.org/reports/tr29/ -- has additional examples of graphemes and code points.
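The counting issues described above are easy to demonstrate. A minimal Python sketch of code points vs UTF-8 bytes, and of what NFC and NFKC do and do not fix:

```python
import unicodedata

# Canonical equivalence: "é" as one code point vs "e" + combining acute.
composed = "\u00e9"          # é, 1 code point
decomposed = "e\u0301"       # e + COMBINING ACUTE ACCENT, 2 code points
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# The fi ligature survives NFC but is expanded by NFKC...
ligature = "\ufb01"          # LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "fi"
# ...while NFKC destroys meaning elsewhere: superscript 2 becomes plain 2.
assert unicodedata.normalize("NFKC", "\u00b2") == "2"

# Combining marks inflate code point and byte counts well beyond the
# number of graphemes a reader sees (the Zalgo effect, in miniature).
zalgoish = "a\u0301\u0308\u0323"          # one visible grapheme
assert len(zalgoish) == 4                 # len() counts code points in Python 3
assert len(zalgoish.encode("utf-8")) == 7 # UTF-8 bytes, yet another count
```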
Internationalization for Turkish: Dotted and Dotless Letter "I" http://www.i18nguy.com/unicode/turkish-i18n.html -- deals more with the complexities of locales than the above.

Forms of Unicode http://www.icu-project.org/docs/papers/forms_of_unicode/ -- a good description of glyphs vs characters vs ligatures, plus encoding information.

My recommendations:

1) I do believe that limits should be defined for some fields. Things like title should not have the description in them, and leaving the length undefined will allow that to happen.

2) My personal view (as a programmer of many years) is to go the simple route and limit by code points. This is easiest for a programmer to do with existing tools. It also gives a clearer storage space limit (see the Zalgo example above).

John-Mark
New Context |
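Recommendation 2 (limit by code points) is straightforward to implement with existing tools. A sketch of such a check; the 256-code-point limit and the function name are made up for illustration, not taken from any STIX draft:

```python
import unicodedata

MAX_TITLE_CODE_POINTS = 256  # hypothetical limit, not from any STIX draft


def within_limit(value: str, max_code_points: int = MAX_TITLE_CODE_POINTS) -> bool:
    """Check a string field against a code point limit.

    Normalizing to NFC first means canonically equivalent spellings
    (precomposed vs decomposed accents) measure the same length.
    """
    normalized = unicodedata.normalize("NFC", value)
    return len(normalized) <= max_code_points  # len() counts code points

# Equivalent spellings measure identically after NFC:
assert within_limit("caf\u00e9", 4)   # café, precomposed: 4 code points
assert within_limit("cafe\u0301", 4)  # cafe + combining acute: NFC -> 4
```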