RE: Determining file encoding

I agree with what you’re saying. I think there is prior art to guide us here.

The HTTP Content-Type header has an optional charset parameter:

Content-Type: text/html; charset=utf-8

… which allows any charset registered with IANA. Similarly, an HTML page can optionally specify its encoding, for example with a meta element.

But! If those items are absent, neither the HTTP nor HTML spec goes on to give you an algorithm for determining the encoding. You’re on your own.
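
As a rough illustration of that point (a sketch only, not something the HTTP spec prescribes), Python's standard email.message.Message can pull the optional charset parameter out of a Content-Type value. The helper name charset_from_content_type is made up for this example. When the parameter is absent the helper returns None, which is exactly the "you're on your own" case.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent.

    Hypothetical helper for illustration; it only reads the header and
    never inspects the body to guess an encoding.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=utf-8"))
    print(charset_from_content_type("text/html"))
```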

Likewise, SARIF can optionally tell you about file encodings (file.encoding or run.defaultFileEncoding, which allow any charset registered by IANA). But if they’re absent, you’re on your own.

My concern is that the spec as it stands is inconsistent in what it says about determining file encoding. It says one thing in the description of file.encoding, and another in the description of regions. I suggest we remove the paragraph in the description of regions. The spec tells you how you can use file.encoding and run.defaultFileEncoding, if they’re present – but it’s silent about what to do if they’re absent.

Would that be ok?

Larry

From: Michael Fanning <Michael.Fanning@microsoft.com>
Sent: Monday, May 14, 2018 9:35 AM
To: Larry Golding (Comcast) <larrygolding@comcast.net>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>; sarif@lists.oasis-open.org
Subject: RE: Determining file encoding

I’m uncomfortable having our spec become a place where we provide definitive guidance on how text processors (such as code editors) determine file encoding. This isn’t the north star of our effort and neither have we pulled together relevant industry experience. Leaving SARIF out of the equation, afaik, every editor today has only a file’s contents available in order to determine its encoding. So I’m not clear on the urgency around persisting relevant information to SARIF and/or advising editors on how to go about this.

Larry, as we discussed offline, I’ve gone through several scenarios mentally to support the position above. Consider these:

Let’s say VS supports UTF-16. It can parse UTF-16 surrogate pairs and display them, but somehow VS skipped the part where it reads the UTF-16 BOM on file open to determine endianness and how to parse and display the text (so SARIF needs to provide it). This seems unlikely.
Let’s say VS doesn’t support UTF-16. The SARIF file helpfully provides this encoding information, but so what? VS can’t parse or display the static analysis results, so what’s the benefit?
Let’s say someone is building a new text file viewer. In 100% of non-SARIF cases, that viewer needs to inspect the file on file open to detect the encoding. So how is it that this information must be present in SARIF files or our scenarios fail?

A SARIF producer’s responsibility is to make sure region line/column details are in sync with a text file’s encoding. Any viewer that attempts to display SARIF results needs to be able to detect and handle that file’s encoding. Why does SARIF need to get involved?

I may be missing something, as I am not an expert in this area. To be clear, I am not averse to leaving placeholders for ‘encoding’ and even Jim’s new line sequences data. Maybe it will give someone a leg up somewhere. I do object to spending lots of time explicating how to handle things in the spec. We need to close on solid, well-tested dynamic code flows, graphs, etc. That’s the special value we’re adding.

Michael

From: Larry Golding (Comcast) <larrygolding@comcast.net>
Sent: Saturday, May 12, 2018 2:45 PM
To: Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>; sarif@lists.oasis-open.org
Subject: RE: Determining file encoding

+SARIF

From: Larry Golding (Comcast) <larrygolding@comcast.net>
Sent: Friday, May 11, 2018 5:33 PM
To: Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
Subject: RE: Determining file encoding

I might be able to finesse this point. I could remove the whole part of the “text regions” section that presents this (old and busted) way of determining encoding. Then I could say something like this:

A SARIF producer SHALL only emit text-related region properties if it knows the character encoding of the file, in which case it SHALL also emit file.encoding (§3.17.9) or run.defaultFileEncoding (§3.11.17).

In the section on fixes I’d say something like:

If a SARIF consumer does not know the character encoding of a file, it SHALL NOT apply a fix unless the deletedRegion contains binary-related properties.
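
A consumer-side check for that rule might look like the sketch below. It assumes the region arrives as a plain dictionary and that the binary-related properties are named byteOffset and byteLength; those names come from later published versions of the format and are an assumption here, as is the helper name may_apply_fix.

```python
from typing import Optional

# Assumed names of the binary-related region properties (not confirmed by
# this thread; adjust to whatever the spec text ends up using).
BINARY_REGION_PROPERTIES = ("byteOffset", "byteLength")


def may_apply_fix(deleted_region: dict, known_encoding: Optional[str]) -> bool:
    """Decide whether a consumer is allowed to apply a fix.

    If the encoding is known, text-related properties can be mapped to
    bytes safely. If it is unknown, the fix is only allowed when the
    region carries binary-related properties.
    """
    if known_encoding is not None:
        return True
    return any(prop in deleted_region for prop in BINARY_REGION_PROPERTIES)


if __name__ == "__main__":
    text_region = {"startLine": 3, "startColumn": 5, "endColumn": 9}
    binary_region = {"byteOffset": 120, "byteLength": 4}
    print(may_apply_fix(text_region, None))    # encoding unknown, text only
    print(may_apply_fix(binary_region, None))  # encoding unknown, binary
```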

Larry

From: Larry Golding (Comcast) <larrygolding@comcast.net>
Sent: Friday, May 11, 2018 3:27 PM
To: Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
Subject: Determining file encoding
Importance: High

The spec is inconsistent in how it tells a consumer to determine a file’s encoding.

The sections on file.encoding and run.defaultFileEncoding say:

1. First use file.encoding.
2. If it’s missing, use run.defaultFileEncoding.
3. If it’s missing, consider the encoding to be unknown.
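
In code, that lookup order could be sketched as follows. The file and run objects are assumed to be plain dictionaries parsed from the SARIF JSON, and resolve_encoding is a hypothetical name; None stands for "unknown".

```python
from typing import Optional


def resolve_encoding(file_obj: dict, run_obj: dict) -> Optional[str]:
    """Apply the lookup order from the file.encoding sections.

    Returns the charset name, or None when the encoding is unknown.
    """
    # Step 1: the per-file property wins.
    encoding = file_obj.get("encoding")
    if encoding:
        return encoding
    # Step 2: fall back to the run-level default.
    default = run_obj.get("defaultFileEncoding")
    if default:
        return default
    # Step 3: nothing was declared, so the encoding is unknown.
    return None


if __name__ == "__main__":
    run = {"defaultFileEncoding": "utf-8"}
    print(resolve_encoding({"encoding": "utf-16"}, run))
    print(resolve_encoding({}, run))
    print(resolve_encoding({}, {}))
```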

The section on “Text regions” (which was written before we introduced file.encoding and run.defaultFileEncoding) has a different idea. The reason this section cares about encoding is that it wants consumers to know how many bytes each character occupies, so they can correctly identify (and highlight) a text region:

1. Look for a BOM.
2. If it’s absent, use external information (command line arguments, project settings, …)
3. If none of that helps, assume each character represents one byte.

(NOTE: Step 3 doesn’t actually identify an encoding, but it gives the consumer a best guess as to how to identify the region.)

We need to rationalize these. It might look like this:

1. Look for a BOM. (IMO, the file is the final authority.)
2. If there’s no BOM, use file.encoding.
3. If it’s missing, use run.defaultFileEncoding.
4. If it’s missing, use external information (command line arguments, project settings, …)
5. Otherwise, sniff the file and make your best guess.
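
To make the proposed order concrete, here is a minimal Python sketch. The function names, the parameters, and the returned source labels are all invented for illustration. The "sniff" in the last step is only a stand-in (it checks whether the bytes decode as strict UTF-8); a real consumer would need something better, and the label "guess" is there so a caller can tell a guess from a declared encoding.

```python
import codecs
from typing import Optional, Tuple

# UTF-32 BOMs are checked first because the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def determine_encoding(
    data: bytes,
    file_encoding: Optional[str] = None,   # value of file.encoding, if any
    run_default: Optional[str] = None,     # value of run.defaultFileEncoding
    external_hint: Optional[str] = None,   # command line, project settings, ...
) -> Tuple[Optional[str], str]:
    """Walk the five proposed steps; return (encoding, where it came from)."""
    bom = encoding_from_bom(data)
    if bom:
        return bom, "bom"
    if file_encoding:
        return file_encoding, "file.encoding"
    if run_default:
        return run_default, "run.defaultFileEncoding"
    if external_hint:
        return external_hint, "external"
    # Step 5 stand-in: accept UTF-8 only if the whole buffer decodes cleanly.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None, "unknown"
    return "utf-8", "guess"
```

Returning the source alongside the encoding lets a caller treat a guess differently from a BOM or a declared value, which matters for the fix scenario discussed next.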

A couple of things:

Step 5 is inconsistent with Luke’s dictum “consider it to be unknown”. Luke, tell me again please why it is unknown, and what would go wrong if I sniffed the file and guessed wrong.
In the case of fixes, you absolutely cannot afford to guess. You might even refuse to make the fix if the BOM is inconsistent with file.encoding.

Thoughts?

Larry
