I added the following comment to the issue, explaining why the spec already says as much as it can about file encoding:
I agree that the spec says as much as it can about encoding:
The SARIF log file must be encoded in UTF-8 (§3.1).
As a result, embedded file content (§3.2.2) must be UTF-8 (transcoded from the original file encoding if necessary).
file.encoding (§3.19.9) is optional, and if absent, the original file encoding is taken to be unknown.
I believe it's that last point that @katrinaoneil's colleague objects to, but it's unavoidable in some cases. For example, Semmle takes a snapshot of a code base, saves the snapshot in UTF-8, and then analyzes the snapshot. Once the snapshot is taken, Semmle does not remember the original file encoding.
That might seem to imply that the encoding in this case is UTF-8. The problem is that if the SARIF file includes fixes, those fixes might refer to the wrong portion of the original file if that file is in any other encoding. In this scenario, the SARIF log file needs to record the fact that it just doesn't know the original file encoding.
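To make the offset problem concrete, here is a minimal sketch. It assumes a fix that locates its target by byte offset, and it assumes the original file was Latin-1 while the analyzed snapshot is UTF-8. The file content, the byte_offset helper, and the choice of encodings are all hypothetical and are not taken from Semmle or from the spec.

```python
# Hypothetical file content, used only for illustration.
original_text = "naïve café = 1\nvalue = 2\n"


def byte_offset(text: str, needle: str, encoding: str) -> int:
    """Return the byte offset of `needle` when `text` is stored in `encoding`."""
    char_index = text.index(needle)
    return len(text[:char_index].encode(encoding))


# Offset of "value" as seen in the UTF-8 snapshot versus the Latin-1 original.
offset_utf8 = byte_offset(original_text, "value", "utf-8")
offset_latin1 = byte_offset(original_text, "value", "latin-1")

# Applying the snapshot's offset to the original bytes selects the wrong span,
# because each non-ASCII character takes two bytes in UTF-8 but one in Latin-1.
latin1_bytes = original_text.encode("latin-1")
wrong_span = latin1_bytes[offset_utf8:offset_utf8 + len("value")]
right_span = latin1_bytes[offset_latin1:offset_latin1 + len("value")]

print(offset_utf8, offset_latin1)
print(wrong_span, right_span)
```

The two offsets differ by one byte for every non-ASCII character that precedes the target, so a consumer that trusted an assumed UTF-8 encoding would edit the wrong bytes in the original file.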
I noted the closure in the Editor’s Report.