RE: Determining file encoding

+SARIF

From: Larry Golding (Comcast) <larrygolding@comcast.net>
Sent: Friday, May 11, 2018 5:33 PM
To: Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
Subject: RE: Determining file encoding

I might be able to finesse this point. I could remove the whole part of the “text regions” section that presents this (old and busted) way of determining encoding. Then I could say something like this:

A SARIF producer SHALL only emit text-related region properties if it knows the character encoding of the file, in which case it SHALL also emit file.encoding (§3.17.9) or run.defaultFileEncoding (§3.11.17).
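
A minimal sketch of how a producer might enforce that rule. The property names in TEXT_REGION_PROPERTIES are my assumption about which region properties count as "text-related" (they may not match the current draft), and emit_region is a hypothetical helper, not anything in the spec:

```python
from typing import Optional

# Assumed names for the text-related region properties; adjust to the draft.
TEXT_REGION_PROPERTIES = (
    "startLine", "startColumn", "endLine", "endColumn",
    "charOffset", "charLength",
)


def emit_region(region: dict,
                file_encoding: Optional[str],
                run_default_encoding: Optional[str]) -> dict:
    """Return the region a producer would write to the log.

    If neither file.encoding nor run.defaultFileEncoding will be emitted,
    the text-related properties are dropped and only the rest survives.
    """
    if file_encoding is not None or run_default_encoding is not None:
        return dict(region)
    return {k: v for k, v in region.items() if k not in TEXT_REGION_PROPERTIES}
```

With no known encoding, a region that carries both line/column and byte information would be written with only the byte information.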

In the section on fixes I’d say something like:

If a SARIF consumer does not know the character encoding of a file, it SHALL NOT apply a fix unless the deletedRegion contains binary-related properties.
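
On the consumer side, the check could be as small as the sketch below. I am assuming "binary-related properties" means byteOffset and byteLength (names may differ in the draft), and can_apply_replacement is a hypothetical helper:

```python
from typing import Optional

# Assumed names for the binary-related region properties.
BINARY_REGION_PROPERTIES = ("byteOffset", "byteLength")


def can_apply_replacement(deleted_region: dict, encoding: Optional[str]) -> bool:
    """Decide whether a consumer may apply a fix to a file.

    encoding is the file's character encoding as far as the consumer
    knows it, or None if it is unknown.
    """
    if encoding is not None:
        return True
    # Unknown encoding: only a region expressed in bytes is safe to edit.
    return any(name in deleted_region for name in BINARY_REGION_PROPERTIES)
```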

Larry

From: Larry Golding (Comcast) <larrygolding@comcast.net>
Sent: Friday, May 11, 2018 3:27 PM
To: Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
Subject: Determining file encoding
Importance: High

The spec is inconsistent in how it tells a consumer to determine a file’s encoding.

The sections on file.encoding and run.defaultFileEncoding say:

1. First use file.encoding.
2. If it’s missing, use run.defaultFileEncoding.
3. If it’s missing, consider the encoding to be unknown.
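
In code, that lookup amounts to something like the following sketch. It assumes the file and run objects have already been parsed into dictionaries; declared_encoding is a hypothetical name:

```python
from typing import Optional


def declared_encoding(file_obj: dict, run_obj: dict) -> Optional[str]:
    """Return the encoding the log declares for a file, or None if unknown."""
    encoding = file_obj.get("encoding")
    if encoding is not None:
        return encoding
    # Fall back to the run-level default; a None result means "unknown".
    return run_obj.get("defaultFileEncoding")
```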

The section on “Text regions” (which was written before we introduced file.encoding and run.defaultFileEncoding) has a different idea. The reason this section cares about encoding is that it wants consumers to know how many bytes each character occupies, so they can correctly identify (and highlight) a text region:

1. Look for a BOM.
2. If it’s absent, use external information (command line arguments, project settings, …).
3. If none of that helps, assume each character represents one byte.
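
For reference, here is roughly what that older procedure looks like as code. This is only a sketch: the function names and the external_hint parameter are made up for illustration, and the BOM table covers only the UTF-8, UTF-16 and UTF-32 signatures from Python's codecs module:

```python
import codecs
from typing import Optional, Tuple

# Longest signatures first, so a UTF-32 LE BOM (FF FE 00 00) is not
# mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def text_region_encoding(data: bytes,
                         external_hint: Optional[str] = None
                         ) -> Tuple[Optional[str], str]:
    """BOM first, then external information, then one byte per character."""
    bom = detect_bom(data)
    if bom is not None:
        return bom, "bom"
    if external_hint is not None:
        return external_hint, "external"
    # No encoding is identified here; the caller just counts bytes.
    return None, "assume-one-byte-per-character"
```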

(NOTE: Step 3 doesn’t actually identify an encoding, but it gives the consumer a best guess as to how to identify the region.)

We need to rationalize these. It might look like this:

1. Look for a BOM. (IMO, the file is the final authority.)
2. If there’s no BOM, use file.encoding.
3. If it’s missing, use run.defaultFileEncoding.
4. If it’s missing, use external information (command line arguments, project settings, …).
5. Otherwise, sniff the file and make your best guess.
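
Sketched as code, the proposed order is a simple first-match chain. Everything here is illustrative: resolve_encoding and its parameters are hypothetical, the BOM result is assumed to have been computed already, and sniff stands in for whatever heuristic a consumer chooses:

```python
from typing import Callable, Optional, Tuple


def resolve_encoding(bom_encoding: Optional[str],
                     file_encoding: Optional[str],
                     run_default_encoding: Optional[str],
                     external_hint: Optional[str],
                     sniff: Callable[[], Optional[str]]
                     ) -> Tuple[Optional[str], str]:
    """Return (encoding, source), taking the first source that has an answer."""
    candidates = [
        ("bom", bom_encoding),
        ("file.encoding", file_encoding),
        ("run.defaultFileEncoding", run_default_encoding),
        ("external", external_hint),
    ]
    for source, encoding in candidates:
        if encoding is not None:
            return encoding, source
    # Last resort: a best guess, reported as such so callers can distrust it.
    guess = sniff()
    if guess is not None:
        return guess, "sniffed"
    return None, "unknown"
```

Returning the source alongside the encoding lets a caller treat a sniffed result differently from a declared one.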

A couple of things:

Step 5 is inconsistent with Luke’s dictum “consider it to be unknown”. Luke, tell me again please why it is unknown, and what would go wrong if I sniffed the file and guessed wrong.
In the case of fixes, you absolutely cannot afford to guess. You might even refuse to make the fix if the BOM is inconsistent with file.encoding.
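
To make the second point concrete, a consumer could refuse a text-based fix along these lines. This is a sketch under my own assumptions: may_apply_text_fix is a hypothetical helper, names are compared via Python's codec registry, and a declared "UTF-16" is treated as consistent with either a little-endian or a big-endian BOM:

```python
import codecs
from typing import Optional


def _normalize(name: str) -> Optional[str]:
    """Map an encoding name to Python's canonical codec name, or None."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def may_apply_text_fix(bom_encoding: Optional[str],
                       file_encoding: Optional[str]) -> bool:
    """Allow a text-based fix only when the encoding is known, not guessed."""
    if bom_encoding is None and file_encoding is None:
        return False  # nothing but a guess is available
    if bom_encoding is None or file_encoding is None:
        return True   # a single source, so nothing to contradict it
    bom = _normalize(bom_encoding)
    declared = _normalize(file_encoding)
    if bom is None or declared is None:
        return False  # unrecognized name: treat as inconsistent
    # "utf-16" is consistent with a "utf-16-le" or "utf-16-be" BOM.
    return bom == declared or bom.startswith(declared + "-")
```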

Thoughts?

Larry
