I pushed a change draft for Issue #93, “Problems with regions”:
I will move its adoption at TC #17 on May 16th.
Although the discussion thread in the issue is long, we landed on a simple set of changes:
- Replace offset and length properties with charOffset, charLength, byteOffset, and byteLength.
- Add a statement that character-based properties are independent of the presence or absence of a BOM.
- Correct the statement that a surrogate pair consists of two characters. A surrogate pair is one character, although some editors count it as two.
That first bullet point greatly simplified the descriptions of the various properties and the relationships among them.
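To illustrate how the four properties relate, here is a minimal Python sketch, not spec text. The function name describe_region is hypothetical. It assumes that byte offsets count from the first byte of the file (so they include any BOM), which is my reading rather than something the draft is quoted as saying here. Character offsets skip the BOM, and a surrogate pair counts as one character because Python strings index by code point.

```python
import codecs

# BOMs for the encodings this sketch handles (illustrative subset).
_BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def describe_region(data: bytes, encoding: str, target: str) -> dict:
    """Find the first occurrence of `target` in the encoded file contents
    and report its character-based and byte-based region values."""
    bom = _BOMS[encoding]
    bom_len = len(bom) if data.startswith(bom) else 0

    # Decode without the BOM so character offsets do not depend on it.
    text = data[bom_len:].decode(encoding)
    char_offset = text.index(target)

    # Assumption: byte offsets are measured from the start of the file,
    # so the BOM length is added back in.
    prefix_bytes = len(text[:char_offset].encode(encoding))
    return {
        "charOffset": char_offset,
        "charLength": len(target),
        "byteOffset": bom_len + prefix_bytes,
        "byteLength": len(target.encode(encoding)),
    }


# U+1F600 needs a surrogate pair in UTF-16 but is a single character.
sample = "a\U0001F600b"
without_bom = sample.encode("utf-16-le")
with_bom = codecs.BOM_UTF16_LE + without_bom

region_plain = describe_region(without_bom, "utf-16-le", "b")
region_bom = describe_region(with_bom, "utf-16-le", "b")
region_pair = describe_region(without_bom, "utf-16-le", "\U0001F600")
```

In this sketch the charOffset of "b" is the same with or without the BOM, while its byteOffset shifts by the BOM's two bytes. The emoji has charLength 1 and byteLength 4.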
I suggest that you read Section 3.22, “Region object,” in its entirety as if it were new. Reading with Simple Markup will help a lot.
Jim, there are a few comments in there pointing out where I addressed some of your other concerns.
There are still two open questions about this change:
- Do we have to say anything about the set of acceptable line break characters? Jim is concerned that producers and consumers might differ in their interpretation of the LS character (U+2028). Michael and I are skeptical that there’s anything SARIF can do about that, but the discussion continues. I opened a separate issue for this: Issue #169, “Decide how to handle uncommon line break characters.”
- Section 3.22 contained a paragraph suggesting how a consumer should guess a file’s encoding so that it could count lines and columns correctly. That was very old text, from before we introduced file.encoding and run.defaultFileEncoding. Should we remove that paragraph? Should we specify an encoding-determination procedure that includes both the new properties and the old guidance? I sent an email on that yesterday, which I will now forward to all of you.
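To make the first open question concrete, here is a small Python sketch of how two tools could disagree about a line number when a file contains U+2028. Both "tools" are hypothetical stand-ins: one uses Python's str.splitlines, which treats U+2028 as a line boundary, and the other recognizes only CR, LF, and CRLF.

```python
import re


def line_of(text: str, target: str, unicode_breaks: bool) -> int:
    """Return the 1-based line number of the first line containing `target`.

    unicode_breaks=True  splits with str.splitlines, which treats
                         U+2028 (LS) as a line boundary, among others.
    unicode_breaks=False splits only on CRLF, CR, or LF.
    """
    if unicode_breaks:
        lines = text.splitlines()
    else:
        lines = re.split(r"\r\n|\r|\n", text)
    for number, line in enumerate(lines, start=1):
        if target in line:
            return number
    raise ValueError(f"{target!r} not found")


source = "first\u2028second\nthird"

# A producer that honors LS and a consumer that does not.
producer_line = line_of(source, "third", unicode_breaks=True)
consumer_line = line_of(source, "third", unicode_breaks=False)
```

Here the producer reports "third" on line 3 while the consumer places it on line 2, which is the kind of mismatch Jim is worried about.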