Re: [sarif] Change draft for #178 (column interpretation property)

Hi Larry,

I think this change draft doesn't quite solve the issue. In particular, I believe this restriction is wrong:

> for results reported in UTF-16-encoded files

It's not UTF-16-encoded files specifically that are the problem. The problem is that many languages represent strings internally as UTF-16, regardless of the original encoding of the file. For ease of implementation, static analysis tools written in these languages will often produce column numbers that count a surrogate pair as 2 columns and every other code point as 1 column; that is, they count columns in UTF-16 code units, regardless of the original encoding.

Given that this can apply even when the file is not UTF-16 encoded, I'm not sure specifying columns per surrogate pair is the right way to go.
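To illustrate the discrepancy, here is a rough Python sketch. The function names are made up for this example and do not come from any SARIF tooling; Python strings are indexed by code point, so the UTF-16 count is derived by encoding the preceding text.

```python
def column_in_code_points(line: str, index: int) -> int:
    """1-based column of line[index], counting Unicode code points."""
    return index + 1


def column_in_utf16_code_units(line: str, index: int) -> int:
    """1-based column of line[index], counting UTF-16 code units."""
    # A code point above U+FFFF is stored as a surrogate pair (2 code units),
    # so the encoded length in bytes divided by 2 gives the code unit count.
    preceding = line[:index]
    return len(preceding.encode("utf-16-le")) // 2 + 1


# Hypothetical source line containing one character outside the BMP.
line = 'x = "\U0001F600"; y'
idx = line.index("y")

print(column_in_code_points(line, idx))       # 10
print(column_in_utf16_code_units(line, idx))  # 11
```

The two counts differ by one for every surrogate pair that precedes the reported position, whatever encoding the file on disk uses.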

My thought when I submitted the issue (which I didn't fully explain, sorry!) was that we should introduce an enumerated property called something like "columnKind", with options of:

* unicodeCodePoint

* utf16CodeUnit

This is more flexible than the column counting approach, as it leaves us the option of adding other column kinds later (utf8CodeUnit, for example).
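As a sketch of how a consumer might use such a property, assuming the two values listed above (the class and function names here are hypothetical, not part of the draft):

```python
from enum import Enum


class ColumnKind(Enum):
    # Values mirror the proposed "columnKind" options; more could be added later.
    UNICODE_CODE_POINT = "unicodeCodePoint"
    UTF16_CODE_UNIT = "utf16CodeUnit"


def to_code_point_column(line: str, column: int, kind: ColumnKind) -> int:
    """Normalize a 1-based column of the given kind to a code point column."""
    if kind is ColumnKind.UNICODE_CODE_POINT:
        return column
    # Walk the line, spending one or two UTF-16 code units per code point.
    units_remaining = column - 1
    for i, ch in enumerate(line):
        if units_remaining <= 0:
            return i + 1
        units_remaining -= 2 if ord(ch) > 0xFFFF else 1
    # The column lies at or past the end of the line.
    return len(line) + 1


sample = 'x = "\U0001F600"; y'
print(to_code_point_column(sample, 11, ColumnKind.UTF16_CODE_UNIT))     # 10
print(to_code_point_column(sample, 10, ColumnKind.UNICODE_CODE_POINT))  # 10
```

Note that the conversion needs only the text of the line and the declared kind; the encoding of the file never enters into it. A column that falls in the middle of a surrogate pair is rounded up to the next code point in this sketch, which is an arbitrary choice.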

I also note that this text still occurs in the change draft (3.22.2):

> In text files encoded in UTF-16, a “surrogate pair” [UNICODE10] SHALL be considered as a single character.

It looks like it should just be removed.

Also in the same section, we have:

> A column number represents a count of characters.

With the change, this is no longer true. This should be updated to reference the new property. I also think it should be made clear that the interpretation of the column count does not depend in any way on the original encoding of the file.

Apologies for not sending this out sooner; I was away over the weekend.

Thanks,

Luke

On Sat, Jun 2, 2018 at 1:23 AM Larry Golding (Comcast) <larrygolding@comcast.net> wrote:

I pushed a change draft for Issue #178: Support a character or column interpretation property.

Documents/ChangeDrafts/Active/sarif-v2.0-issue-178-columnsPerSurrogatePair.docx

I will move its adoption at TC #19 on June 6th.

Here is the change in its entirety (the property is on the run object):
3.11.19 columnsPerSurrogatePair property
A run object MAY contain a property named columnsPerSurrogatePair whose value is an integer that specifies, for results reported in UTF-16-encoded files, the number of columns a surrogate pair [UNICODE10] is considered to occupy.
If columnsPerSurrogatePair is absent, it defaults to 1.

Thanks,
Larry
