Subject: RE: Determining file encoding
I’m uncomfortable having our spec become a place where we provide definitive guidance on how text processors (such as code editors) determine file encoding. This isn’t the north star of our effort, nor have we pulled together the relevant industry experience. Leaving SARIF out of the equation, AFAIK every editor today has only a file’s contents available to determine its encoding. So I’m not clear on the urgency of persisting relevant information to SARIF and/or advising editors on how to go about this.

Larry, as we discussed offline, I’ve gone through several scenarios mentally to support the position above. Consider these:

A SARIF producer’s responsibility is to make sure region column/line details are in sync with a text file’s encoding.

Any viewer that attempts to display SARIF results needs to be able to detect and handle that file’s encoding. Why does SARIF need to get involved?

I may be missing something, as I am not an expert in this area. To be clear, I am not averse to leaving placeholders for ‘encoding’ and even Jim’s new line sequences data. Maybe it will give someone a leg up somewhere. I do object to spending lots of time explicating in the spec how to handle these things. We need to close on solid, well-tested dynamic code flows, graphs, etc. That’s the special value we’re adding.

Michael

From: Larry Golding (Comcast) <larrygolding@comcast.net>
+SARIF

From: Larry Golding (Comcast) <larrygolding@comcast.net>

I might be able to finesse this point. I could remove the whole part of the “text regions” section that presents this (old and busted) way of determining encoding. Then I could say something like this:

A SARIF producer SHALL only emit text-related region properties if it knows the character encoding of the file, in which case it SHALL also emit file.encoding (§3.17.9) or run.defaultFileEncoding (§3.11.17).

In the section on fixes I’d say something like:

If a SARIF consumer does not know the character encoding of a file, it SHALL NOT apply a fix unless the deletedRegion contains binary-related properties.

Larry

From: Larry Golding (Comcast) <larrygolding@comcast.net>
The spec is inconsistent in how it tells a consumer to determine a file’s encoding. The sections on file.encoding and run.defaultFileEncoding say:

The section on “Text regions” (which was written before we introduced file.encoding and run.defaultFileEncoding) has a different idea. The reason this section cares about encoding is that it wants consumers to know how many bytes each character occupies, so they can correctly identify (and highlight) a text region:

(NOTE: Step 3 doesn’t actually identify an encoding, but it gives the consumer a best guess as to how to identify the region.)

We need to rationalize these. It might look like this:
A couple of things:
Thoughts?

Larry
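
To make the point about bytes per character concrete, here is a minimal sketch of why a text region cannot be mapped onto a file without knowing its encoding, together with one possible reading of the proposed rule for fixes. Everything in it is an assumption for illustration: the helper names region_byte_span and may_apply_fix are hypothetical, columns are treated as 1-based code-point counts with an exclusive end column, and byteOffset/byteLength stand in for whatever the "binary-related properties" end up being called.

```python
from typing import Optional, Tuple


def region_byte_span(line: str, start_column: int, end_column: int,
                     encoding: str) -> Tuple[int, int]:
    """Return (byte offset, byte length) of a single-line text region.

    Assumption: columns are 1-based and counted in code points, and
    end_column is exclusive. The offset is relative to the start of the line.
    """
    prefix = line[: start_column - 1]
    region = line[start_column - 1 : end_column - 1]
    return len(prefix.encode(encoding)), len(region.encode(encoding))


def may_apply_fix(deleted_region: dict, known_encoding: Optional[str]) -> bool:
    """Sketch of the proposed consumer rule for fixes.

    If the encoding is unknown, only a region described in bytes is safe
    to act on. The property names below are assumed, not quoted.
    """
    if known_encoding is not None:
        return True
    return "byteOffset" in deleted_region and "byteLength" in deleted_region


if __name__ == "__main__":
    line = "na\u00efve = 1"  # the third character is U+00EF
    # The same columns (3 to 6) cover different bytes in each encoding.
    print(region_byte_span(line, 3, 6, "utf-8"))      # (2, 4)
    print(region_byte_span(line, 3, 6, "utf-16-le"))  # (4, 6)
    print(may_apply_fix({"startLine": 1, "startColumn": 3}, None))  # False
```

The same line and column numbers select bytes 2 through 5 in UTF-8 but bytes 4 through 9 in UTF-16 (little-endian, no BOM), which is why a consumer that guesses the encoding wrong would highlight or delete the wrong bytes.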