
Subject: Re: Determining file encoding


Hi,

There is no way to automatically determine the encoding of a file. The same sequence of bytes can be a valid byte sequence in multiple encodings that result in different character sequences (what is rendered to the user). For instance, the byte value C1 is GREEK CAPITAL LETTER ALPHA in ISO-8859-7, while it is LATIN CAPITAL LETTER A WITH ACUTE in ISO-8859-15. Files in these encodings carry no indicator sequence, so it is not that an editor inspects the file and determines which encoding is correct; rather, a human being manually configures their editor, perhaps trying different encodings until one of them makes sense. A reasonable strategy today might be to look for a BOM and use an encoding of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE; if no BOM is present, use UTF-8. This still gets it wrong if the file is encoded as BOM-less UTF-16 or UTF-32, or as ISO-8859-*, CP*, SJIS, ...

SARIF should say that the encoding is known only if it is specified in the SARIF file. If it is known, then that is the encoding. If it is unknown, then users and viewers will need to determine the encoding of the file using heuristics, and the file may not be displayed correctly. I don't think we need to say anything more than that (we don't need to talk about the BOM or how to guess).

The encoding is an attribute of the file, so I do not think there needs to be any encoding associated with a region (a region's encoding is the same as its file's). I think we should encourage producers to include encoding information, and note that if it is not present, viewers may not be able to correctly display the file contents to users.

For snippets, we should say that if there is not a one-to-one mapping of character codes from the source encoding to Unicode, then the highlighting of results based on the snippet may not be correct.
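One way a viewer might at least detect that situation (a sketch; the helper name and the idea of re-encoding the snippet are mine, not something the draft requires):

    def snippet_roundtrips(snippet: str, original_bytes: bytes, encoding: str) -> bool:
        """Return True if the decoded snippet maps back onto the original bytes.

        If this returns False, the mapping from the source encoding to Unicode
        was not one-to-one (or the wrong encoding was used), and character
        offsets computed against the snippet may highlight the wrong region.
        """
        try:
            return snippet.encode(encoding) == original_bytes
        except UnicodeEncodeError:
            return False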

Jim


On 05/14/2018 11:34 AM, Michael Fanning wrote:
I'm uncomfortable having our spec become a place where we provide definitive guidance on how text processors (such as code editors) determine file encoding. This isn't the north star of our effort, nor have we pulled together relevant industry experience. Leaving SARIF out of the equation, afaik, every editor today has only a file's contents available in order to determine its encoding. So I'm not clear on the urgency around persisting relevant information to SARIF and/or advising editors on how to go about this.

Larry, as we discussed offline, I’ve gone through several scenarios mentally to support the position above. Consider these:

1. Let's say VS supports UTF16. It can parse UTF16 surrogate pairs and display them, but somehow, VS skipped the part where it reads the UTF16 BOM on file open to determine endian-ness/how to parse/display (so SARIF needs to provide it). This seems unlikely.

2. Let's say VS doesn't support UTF16. The SARIF file helpfully provides this encoding information, but so what? VS can't parse/display the SA results, so what's the benefit?

3. Let's say someone is building a new text file viewer. In 100% of non-SARIF cases, that viewer needs to inspect the file on file open to detect the encoding. So how is it that this information must be present in SARIF files or our scenarios fail?

A SARIF producer's responsibility is to make sure region column/line details are in sync with a text file's encoding. Any viewer that attempts to display SARIF results needs to be able to detect and handle that file's encoding. Why does SARIF need to get involved?
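As a small illustration of why those column details depend on the encoding (Python; the Shift-JIS string is an arbitrary example, not from the spec):

    # Each Japanese character below is two bytes in a Shift-JIS file but a
    # single character once decoded, so a column counted in bytes does not
    # point at the same place as a column counted in characters.
    line_bytes = "エラー: x は未定義".encode("shift_jis")
    line_text = line_bytes.decode("shift_jis")
    print(len(line_bytes), len(line_text))  # the byte count exceeds the character count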

I may be missing something as I am not an expert in this area. To be clear, I am not averse to leaving placeholders for 'encoding' and even Jim's newline sequences data. Maybe it will give someone a leg up somewhere. I do object to spending lots of time explicating how to handle things in the spec. We need to close on solid, well-tested dynamic code flows, graphs, etc. That's the special value we're adding.

Michael

*From:* Larry Golding (Comcast) <larrygolding@comcast.net>
*Sent:* Saturday, May 12, 2018 2:45 PM
*To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>; sarif@lists.oasis-open.org
*Subject:* RE: Determining file encoding

+SARIF

*From:* Larry Golding (Comcast) <larrygolding@comcast.net>
*Sent:* Friday, May 11, 2018 5:33 PM
*To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
*Subject:* RE: Determining file encoding

I might be able to finesse this point. I could remove the whole part of the "text regions" section that presents this (old and busted) way of determining encoding. Then I could say something like this:

A SARIF producer *SHALL* only emit text-related region properties if it knows the character encoding of the file, in which case it *SHALL* also emit file.encoding (§3.17.9) or run.defaultFileEncoding (§3.11.17).
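For what it's worth, a minimal sketch of what that might look like, using the draft property names cited above (the surrounding structure is abbreviated from memory and may not match the draft exactly); written as a Python literal to keep the examples in this thread in one language:

    run = {
        "defaultFileEncoding": "windows-1252",   # run.defaultFileEncoding (§3.11.17)
        "files": {
            "file:///src/main.c": {
                "encoding": "utf-8",             # file.encoding (§3.17.9) overrides the default
            }
        },
    }
    # Under the proposed rule, a producer that does not know the encoding of a
    # file would omit text-related region properties (startLine, startColumn, ...)
    # for regions in that file and emit only binary-related ones.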

In the section on fixes I’d say something like:

If a SARIF consumer does not know the character encoding of a file, it *SHALL NOT* apply a fix unless the deletedRegion contains binary-related properties.
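A corresponding consumer-side check might look something like this (the helper name is mine, and byteOffset/byteLength stand in for whatever the draft calls the binary-related region properties):

    def may_apply_fix(encoding_known: bool, deleted_region: dict) -> bool:
        """Apply a fix only when it can be applied safely, per the proposed rule."""
        has_binary_region = ("byteOffset" in deleted_region
                             and "byteLength" in deleted_region)
        return encoding_known or has_binary_region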

Larry

*From:* Larry Golding (Comcast) <larrygolding@comcast.net>
*Sent:* Friday, May 11, 2018 3:27 PM
*To:* Michael Fanning <Michael.Fanning@microsoft.com>; 'James A. Kupsch' <kupsch@cs.wisc.edu>; Luke Cartey <luke@semmle.com>
*Subject:* Determining file encoding
*Importance:* High

*The spec is inconsistent in how it tells a consumer to determine a file’s encoding.*

The sections on file.encoding and run.defaultFileEncoding say:

1. First use file.encoding.
2. If it's missing, use run.defaultFileEncoding.
3. If it's missing, consider the encoding to be unknown.

The section on "Text regions" (which was written *before* we introduced file.encoding and run.defaultFileEncoding) has a different idea. The reason this section cares about encoding is that it wants consumers to know how many bytes each character occupies, so they can correctly identify (and highlight) a text region:

1. Look for a BOM.
2. If it's absent, use external information (command line arguments, project settings, …).
3. If none of that helps, assume each character represents one byte.

(*NOTE*: Step 3 doesn’t actually identify an encoding, but it gives the consumer a best guess as to how to identify the region.)

We need to rationalize these. It might look like this:

1. Look for a BOM. (IMO, the file is the final authority.)
2. If there's no BOM, use file.encoding.
3. If it's missing, use run.defaultFileEncoding.
4. If it's missing, use external information (command line arguments, project settings, …).
5. Otherwise, sniff the file and make your best guess.

A couple of things:

* Step 5 is inconsistent with Luke's dictum "consider it to be unknown". *Luke*, tell me again please why it is unknown, and what would go wrong if I sniffed the file and guessed wrong.
* In the case of fixes, you absolutely cannot afford to guess. You might even refuse to make the fix if the BOM is inconsistent with file.encoding.

Thoughts?

Larry


