cti-stix message

Subject: Re: [cti-stix] Unicode, strings, and STIX

From: John-Mark Gurney <jmg@newcontext.com>
To: Jason Keirstead <Jason.Keirstead@ca.ibm.com>
Date: Thu, 2 Jun 2016 12:22:06 -0700

Jason Keirstead wrote this message on Thu, Jun 02, 2016 at 09:17 -0300:
> There is simply no logical way to define a "max length" in a way that
> protects against "buffer overflow" problems with Unicode... so if buffer
> overflow is the main motivation for this

I agree that max lengths are not a buffer overflow issue.  If you're not
using a language with high-level strings, that's your issue...

> - If we say "max_length" of title means 255 *BYTES*, then in some languages
> that is going to result in a very short title than other languages - and
> furthermore, you could be truncating it in the middle of a character
> (grapheme) making it all the more invalid for the person entering it on
> their screen.

And you also run into the issue that 255 *BYTES* means different things in
different encodings and serialization methods, so a title that is valid in
one encoding/serialization might not be valid in another...  We should
ensure that a STIX document is valid for all serialization methods...
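
To illustrate the point about byte limits, here is a minimal Python sketch.
The helper names (byte_lengths, fits) and the sample titles are made up for
this example, and the 255-byte limit is simply the figure discussed in this
thread, not anything taken from a STIX specification.  It shows the same
string passing a byte limit in one encoding and failing it in another:

```python
MAX_BYTES = 255  # the byte limit discussed in this thread (assumption)

def byte_lengths(text):
    """Return the encoded size of text, in bytes, for two encodings."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}

def fits(text):
    """Report, per encoding, whether text stays within MAX_BYTES."""
    return {enc: n <= MAX_BYTES for enc, n in byte_lengths(text).items()}

# Hypothetical titles, chosen only to make the sizes easy to check.
latin_title = "a" * 200        # 200 ASCII code points
hangul_title = "\ud55c" * 100  # 100 precomposed Hangul syllables (U+D55C)

for name, title in (("latin", latin_title), ("hangul", hangul_title)):
    print(name, byte_lengths(title), fits(title))
```

The ASCII title fits in UTF-8 but not in UTF-16, while the Hangul title
fits in UTF-16 but not in UTF-8, so neither encoding is uniformly the
"safe" one to define a byte limit against.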

> - If we say "max_length" of title means 255 *code points*, then in some
> languages it will result in shorter titles being allowed than others, and it
> also could equal an arbitrary number of bytes, as it depends on the
> encoding and language being encoded. And you still have the problem of
> truncating in the middle of a character (grapheme)

I do recognize this is an issue...  (I'm not an expert in all the
languages out there, so I might be missing some, or getting facts
incorrect.)  From my understanding, the "worst" case is Korean, which can
use multiple code points (3) per grapheme, but all possible combinations
can be normalized down to one code point.  I don't know of a single
language that averages more than 2 code points per grapheme after
normalization.  Even Thai, where it's sometimes three, is still well below
an average of 2 code points per grapheme (a quick test on a few random
sentences gave me 1.12).  Yes, this doesn't address languages like German
that have extremely long words...

I did try to find averages, but I couldn't find a good resource on it...
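
A rough way to estimate such a ratio is sketched below in Python.  The
function names are hypothetical, and the grapheme count is only an
approximation: it treats every code point that is not a combining mark
as the start of a grapheme, rather than implementing the full grapheme
cluster rules of Unicode Standard Annex #29.  Text is normalized to NFC
first, which is what collapses a decomposed Korean syllable into a
single code point:

```python
import unicodedata

COMBINING = ("Mn", "Mc", "Me")  # Unicode general categories for marks

def approx_graphemes(text):
    """Approximate grapheme count: code points that are not marks."""
    return sum(1 for ch in text if unicodedata.category(ch) not in COMBINING)

def code_points_per_grapheme(text):
    """Average code points per (approximate) grapheme after NFC."""
    normalized = unicodedata.normalize("NFC", text)
    graphemes = approx_graphemes(normalized)
    if graphemes == 0:
        return 0.0  # empty or marks-only input; avoid dividing by zero
    return len(normalized) / graphemes

# Three conjoining jamo (U+1112 U+1161 U+11AB) compose to one syllable.
korean_decomposed = "\u1112\u1161\u11ab"
# A Thai consonant followed by a vowel sign and a tone mark.
thai_cluster = "\u0e17\u0e35\u0e48"

print(len(korean_decomposed), len(unicodedata.normalize("NFC", korean_decomposed)))
print(code_points_per_grapheme(korean_decomposed))
print(code_points_per_grapheme(thai_cluster))
```

Running code_points_per_grapheme over a larger sample of real sentences
would give an average for a language, subject to the approximation noted
above.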

Saying that some languages will be allowed shorter/longer titles just
because they require more/fewer code points is incorrect, as each
language has a different information density per grapheme...

> - If we say "max_length" of title means 255 *graphemes*, then all languages
> are allowed the same title length, and you have no problems truncating in
> the middle of a character. However, it means a title could equal an
> arbitrary number of bytes.

Except that some languages like German have really long words, so their
titles will be (informationally) shorter than those in other languages,
like the CJK languages, which have a high information density per grapheme...

P.S. I'm fine w/ supporting a TC's decision to not have lengths.

-- 
John-Mark
