RE: [sarif] Re: Determining file encoding

Thanks for the discussion. I incorporated these suggestions into the change draft for Issue #93, “Problems with regions” (which I originally pushed on Friday, May 11th).

I added this text and removed ~~this paragraph~~:

3.22.2 Text regions

The line number of the first line in a text file SHALL be 1. The column number of the first character in each line SHALL be 1. The character offset of the first character in the file SHALL be 0.
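A minimal sketch of these conventions in Python, assuming "\n" is the only line separator and that a "character" is a Unicode code point. The function name position_of is hypothetical and not part of SARIF.

```python
def position_of(text: str, char_offset: int) -> tuple[int, int]:
    """Map a 0-based character offset to a 1-based (line, column) pair.

    Assumes "\n" is the only line separator (an assumption of this sketch).
    """
    if not 0 <= char_offset < len(text):
        raise ValueError("offset outside the file")
    prefix = text[:char_offset]
    line = prefix.count("\n") + 1              # the first line is 1
    line_start = prefix.rfind("\n") + 1        # offset of the current line's first character
    column = char_offset - line_start + 1      # the first column is 1
    return line, column


text = "ab\ncd\n"
print(position_of(text, 0))  # (1, 1)
print(position_of(text, 3))  # (2, 1)
```

The first character of the file has offset 0 but sits at line 1, column 1, which is the off-by-one a consumer has to keep straight.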

The values of text properties SHALL NOT depend on the presence or absence of a byte order mark (BOM) at the start of the file.
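One way a consumer might satisfy this, sketched in Python for the UTF-8 case only: decode with the standard "utf-8-sig" codec, which drops a leading BOM when one is present, so the resulting character offsets are the same either way. The sample content is made up for illustration.

```python
import codecs

with_bom = codecs.BOM_UTF8 + "x = 1\n".encode("utf-8")
without_bom = "x = 1\n".encode("utf-8")

# "utf-8-sig" strips a leading BOM if present and otherwise behaves like "utf-8".
text_a = with_bom.decode("utf-8-sig")
text_b = without_bom.decode("utf-8-sig")

print(text_a == text_b)    # True
print(text_a.index("="))   # 2 (0-based character offset, with or without the BOM)
```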

SARIF defines a column number as a count of characters. If a line in a text file contains tab characters, a SARIF viewer MAY choose to present column numbers that match the visual offset of each character from the beginning of the line. These “visual” column numbers might not match the column numbers contained in the SARIF file.
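A rough sketch of how a viewer might derive such a "visual" column, assuming fixed tab stops and that every non-tab character occupies one display cell. The function name visual_column and the default tab width of 4 are assumptions of this example, not anything the draft specifies.

```python
def visual_column(line: str, column: int, tab_width: int = 4) -> int:
    """Convert a 1-based SARIF column into a 1-based visual column."""
    cells = 0  # display cells consumed by the characters before `column`
    for ch in line[: column - 1]:
        if ch == "\t":
            cells += tab_width - (cells % tab_width)  # jump to the next tab stop
        else:
            cells += 1
    return cells + 1


line = "\tx = 1"
print(visual_column(line, 2))     # 5
print(visual_column(line, 2, 8))  # 9
```

The SARIF column of "x" is 2 in both cases; only the presentation differs.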

The range of bytes represented by a text region depends on the file’s character encoding. A SARIF consumer SHALL consider a file to have the encoding specified by file.encoding (§3.17.9) if present, or else by run.defaultFileEncoding (§3.11.17), if present. If neither is present, the consumer MAY use any heuristic or procedure to determine the encoding, including (for example) prompting the user.

NOTE: If a consumer incorrectly determines a file’s encoding, it might not display the file correctly. For example, when it attempts to highlight a region, it might highlight an incorrect range of characters.
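The lookup order above could be sketched like this in Python. The sketch assumes the file and run objects have already been parsed into dictionaries keyed by the property names in the draft; resolve_encoding and the fallback callable are hypothetical names.

```python
from typing import Callable, Optional


def resolve_encoding(
    file_obj: dict,
    run_obj: dict,
    fallback: Optional[Callable[[], Optional[str]]] = None,
) -> Optional[str]:
    """Return the encoding name to use for a file, or None if it stays unknown."""
    encoding = file_obj.get("encoding")
    if encoding is not None:
        return encoding
    encoding = run_obj.get("defaultFileEncoding")
    if encoding is not None:
        return encoding
    # Neither property is present: the consumer MAY use any heuristic,
    # for example prompting the user.
    return fallback() if fallback is not None else None


run = {"defaultFileEncoding": "UTF-8"}
print(resolve_encoding({"encoding": "UTF-16LE"}, run))   # UTF-16LE
print(resolve_encoding({}, run))                         # UTF-8
print(resolve_encoding({}, {}, lambda: "windows-1252"))  # windows-1252
print(resolve_encoding({}, {}))                          # None
```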

Depending on the file's character encoding, each character might be represented by one or more bytes. In particular, in files encoded in UTF-16, a “surrogate pair” [UNICODE10] SHALL be considered as a single character.
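To illustrate the surrogate pair rule, here is a small Python example. Python's str counts code points, so a character outside the Basic Multilingual Plane counts once even though UTF-16 stores it as two 16-bit code units. The sample string is made up for illustration.

```python
text = "a\U0001F600b"  # U+1F600 lies outside the Basic Multilingual Plane

utf16 = text.encode("utf-16-le")
code_units = len(utf16) // 2  # 'a', high surrogate, low surrogate, 'b'
characters = len(text)        # the surrogate pair counts as one character

print(code_units, characters)  # 4 3
print(text.index("b") + 1)     # 3 (1-based column of 'b')
```

A consumer that counted UTF-16 code units instead would place "b" in column 4 and highlight the wrong range.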

Programs such as viewers that process SARIF log files together with the analysis target files to which those log files refer SHOULD attempt to determine the character encoding of the target files. In the absence of internal information such as a Byte Order Mark, viewers MAY use external information (for example, command line arguments, project settings, or other configuration information) to determine the character encoding. If external information is also lacking, viewers SHOULD assume that each character occupies one byte.

...

Larry

From: sarif@lists.oasis-open.org <sarif@lists.oasis-open.org> On Behalf Of Larry Golding (Comcast)
Sent: Monday, May 14, 2018 12:57 PM
To: 'James A. Kupsch' <kupsch@cs.wisc.edu>; 'Michael Fanning' <Michael.Fanning@microsoft.com>; 'Luke Cartey' <luke@semmle.com>; sarif@lists.oasis-open.org
Subject: RE: [sarif] Re: Determining file encoding

Merging threads. I wrote in response to Michael:

I agree with what you’re saying. I think there is prior art to guide us here.

The HTTP Content-type header has an optional charset parameter:

Content-Type: text/html; charset=utf-8

… which allows any charset registered with IANA. Similarly, an HTML page can optionally specify its encoding:

But! If those items are absent, neither the HTTP nor HTML spec goes on to give you an algorithm for determining the encoding. You’re on your own.
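For illustration, the optional charset parameter can be read with Python's standard library. This is only a sketch of the "optional hint" idea; the helper name charset_from_content_type is made up for this example.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

When the result is None, the caller is in exactly the "on your own" situation described above.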

Likewise, SARIF can optionally tell you about file encodings (file.encoding or run.defaultFileEncoding, which allow any charset registered by IANA). But if they’re absent, you’re on your own.

My concern is that the spec as it stands is inconsistent in what it says about determining file encoding. It says one thing in the description of file.encoding, and another in the description of regions. I suggest we remove the paragraph in the description of regions. The spec tells you how you can use file.encoding and run.defaultFileEncoding, if they’re present – but it’s silent about what to do if they’re absent.

That is consistent with what you wrote:

SARIF should say that the encoding is known only if it is specified in the file. If it is known then that is the encoding. If it is unknown, then users and viewers will need to determine the encoding of files using heuristics, and the file may not be displayed correctly. I don't think that we need to say anything more than that (we don't need to talk about the BOM or how to guess).

I agree that snippets and regions don’t need to specify encoding. We’ve discussed that previously and reached that conclusion.

I can add the caveat you want about incorrect highlighting (it would be in the section on regions, not snippets).

I will make these changes in the change draft for #93, “Problems with regions” (since that’s where I came across the paragraph I’m going to remove).

Larry

-----Original Message-----
From: sarif@lists.oasis-open.org <sarif@lists.oasis-open.org> On Behalf Of James A. Kupsch
Sent: Monday, May 14, 2018 12:02 PM
To: Michael Fanning <Michael.Fanning@microsoft.com>; Larry Golding (Comcast) <larrygolding@comcast.net>; Luke Cartey <luke@semmle.com>; sarif@lists.oasis-open.org
Subject: [sarif] Re: Determining file encoding

Hi,

There is no way to automatically determine the encoding of the file.

The same sequence of bytes can be a valid byte sequence in multiple encodings, resulting in different character sequences (what is rendered to the user). For instance, the byte value C1 is GREEK CAPITAL LETTER ALPHA in ISO-8859-7, while it is LATIN CAPITAL LETTER A WITH ACUTE in ISO-8859-15. For files encoded with these encodings there is no indicator sequence in the file, so it is not that an editor inspects the file and determines which one is correct; rather, a human being manually configures their editor, maybe trying different encodings until one of them makes sense. A reasonable strategy today might be to look for a BOM and use an encoding of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE. If the BOM is not present, then use UTF-8. This does get it wrong if the file is encoded as UTF-16, UTF-32, ISO-8859-*, CP*, SJIS, ...
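Both points can be shown in a few lines of Python. The first part decodes the same byte under the two encodings mentioned above. The second part is a sketch of the "look for a BOM, else assume UTF-8" strategy; guess_encoding is a hypothetical name, and the result is a guess, not a determination.

```python
import codecs

raw = b"\xc1"
print(raw.decode("iso8859_7"))   # Α (GREEK CAPITAL LETTER ALPHA)
print(raw.decode("iso8859_15"))  # Á (LATIN CAPITAL LETTER A WITH ACUTE)

# UTF-32 BOMs are checked first because the UTF-32LE BOM starts with the
# same two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes) -> str:
    """Guess an encoding from a leading BOM, falling back to UTF-8."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return "utf-8"  # only a guess; wrong for ISO-8859-*, CP*, SJIS, ...


print(guess_encoding(codecs.BOM_UTF16_LE + "a".encode("utf-16-le")))  # utf-16-le
print(guess_encoding(raw))  # utf-8, which is wrong for either ISO-8859 file
```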

The encoding is an attribute of the file, so I do not think that there needs to be any encoding associated with a region (the encoding is the same as the file's). I think that we should encourage producers to include encoding information, and note that if it is not present, a viewer may not be able to correctly display the file contents to users.

For snippets we should say that if there is not a one-to-one mapping of character codes from the source encoding to Unicode, then the highlighting of results based on the snippet may not be correct.

Jim

On 05/14/2018 11:34 AM, Michael Fanning wrote:

> I’m uncomfortable having our spec become a place where we provide

> definitive guidance on how text processors (such as code editors)

> determine file encoding. This isn’t the north star of our effort and

> neither have we pulled together relevant industry experience. Leaving

> SARIF out of the equation, afaik, every editor today has only a file’s

> contents available in order to determine its encoding. So I’m not

> clear on the urgency around persisting relevant information to SARIF

> and/or advising editors on how to go about this.

> Larry, as we discussed offline, I’ve gone through several scenarios

> mentally to support the position above. Consider these:

> 1. Let’s say VS supports UTF16. It can parse utf16 surrogate pairs and
> display them, but somehow, VS skipped the part where it reads the
> UTF16 BOM on file open to determine endian-ness/how to parse/display
> (so SARIF needs to provide it). This seems unlikely.

> 2. Let’s say VS doesn’t support UTF16. The SARIF file helpfully
> provides this encoding information, but so what? VS can’t
> parse/display the SA results, so what’s the benefit?

> 3. Let’s say someone is building a new text file viewer. In 100% of
> non-SARIF cases, that viewer needs to inspect the file on file open
> to detect the encoding. So how is it that this information must be
> present in SARIF files or our scenarios fail?

> A sarif producer’s responsibility is to make sure region column/line

> details are in sync with a text file’s encoding. Any viewer that

> attempts to display sarif results needs to be able to detect and

> handle that file’s encoding if it attempts to display results. Why

> does sarif need to get involved?

> I may be missing something as I am not expert in this area. To be

> clear, I am not averse to leaving placeholders for ‘encoding’ and even

> Jim’s new line sequences data. Maybe it will give someone a leg up

> somewhere. I do object to spending lots of time explicating how to

> handle things in the spec. We need to close on solid, well-tested

> dynamic code flows, graphs, etc. That’s the special value we’re

> adding.

> Michael

> *From:* Larry Golding (Comcast) <larrygolding@comcast.net>
> *Sent:* Saturday, May 12, 2018 2:45 PM
> *To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>; sarif@lists.oasis-open.org
> *Subject:* RE: Determining file encoding

> +SARIF

> *From:* Larry Golding (Comcast) <larrygolding@comcast.net>
> *Sent:* Friday, May 11, 2018 5:33 PM
> *To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
> *Subject:* RE: Determining file encoding

> I might be able to finesse this point. I could remove the whole part

> of the “text regions” section that presents this (old and busted) way

> of determining encoding. Then I could say something like this:

> A SARIF producer *SHALL* only emit text-related region properties if

> it knows the character encoding of the file, in which case it *SHALL*

> also emit file.encoding (§3.17.9) or run.defaultFileEncoding

> (§3.11.17).

> In the section on fixes I’d say something like:

> If a SARIF consumer does not know the character encoding of a file, it

> *SHALL NOT* apply a fix unless the deletedRegion contains

> binary-related properties.

> Larry

> *From:* Larry Golding (Comcast) <larrygolding@comcast.net>
> *Sent:* Friday, May 11, 2018 3:27 PM
> *To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
> *Subject:* Determining file encoding
> *Importance:* High

> *The spec is inconsistent in how it tells a consumer to determine a

> file’s encoding.*

> The sections on file.encoding and run.defaultFileEncoding say:

> 1. First use file.encoding.

> 2. If it’s missing, use run.defaultFileEncoding.

> 3. If it’s missing, consider the encoding to be unknown.

> The section on “Text regions” (which was written *before we introduced

> **file.encoding and **run.defaultFileEncoding*) has a different idea.

> The reason this section cares about encoding is that it wants

> consumers to know how many bytes each character occupies, so they can

> correctly identify (and highlight) a text region:

> 1. Look for a BOM.

> 2. If it’s absent, use external information (command line arguments, project settings, …)

> 3. If none of that helps, assume each character represents one byte.

> (*NOTE*: Step 3 doesn’t actually identify an encoding, but it gives

> the consumer a best guess as to how to identify the region.)

> We need to rationalize these. It might look like this:

> 1. Look for a BOM. (IMO, the file is the final authority.)

> 2. If there’s no BOM, use file.encoding.

> 3. If it’s missing, use run.defaultFileEncoding.

> 4. If it’s missing, use external information (command line arguments, project settings, …)

> 5. Otherwise, sniff the file and make your best guess.

> A couple of things:

> * Step 5 is inconsistent with Luke’s dictum “consider it to be
> unknown”. *Luke*, tell me again please why it is unknown, and what
> would go wrong if I sniffed the file and guessed wrong.

> * In the case of fixes, you absolutely cannot afford to guess. You
> might even refuse to make the fix if the BOM is inconsistent with
> file.encoding.

> Thoughts?

> Larry

---------------------------------------------------------------------

To unsubscribe from this mail list, you must leave the OASIS TC that generates this mail. Follow this link to all your TCs in OASIS at:

https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
