Subject: RE: Determining file encoding
I agree with what you’re saying. I think there is prior art to guide us here. The HTTP Content-type header has an optional charset parameter:

    Content-Type: text/html; charset=utf-8

… which allows any charset registered with IANA. Similarly, an HTML page can optionally specify its encoding:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

But! If those items are absent, neither the HTTP nor HTML spec goes on to give you an algorithm for determining the encoding. You’re on your own.

Likewise, SARIF can optionally tell you about file encodings (file.encoding or run.defaultFileEncoding, which allow any charset registered by IANA). But if they’re absent, you’re on your own.

My concern is that the spec as it stands is inconsistent in what it says about determining file encoding. It says one thing in the description of file.encoding, and another in the description of regions.

I suggest we remove the paragraph in the description of regions. The spec tells you how you can use file.encoding and run.defaultFileEncoding, if they’re present – but it’s silent about what to do if they’re absent.

Would that be ok?

Larry

From: Michael Fanning <Michael.Fanning@microsoft.com>

I’m uncomfortable having our spec become a place where we provide definitive guidance on how text processors (such as code editors) determine file encoding. This isn’t the north star of our effort and neither have we pulled together relevant industry experience.

Leaving SARIF out of the equation, afaik, every editor today has only a file’s contents available in order to determine its encoding. So I’m not clear on the urgency around persisting relevant information to SARIF and/or advising editors on how to go about this.

Larry, as we discussed offline, I’ve gone through several scenarios mentally to support the position above. Consider these:
A SARIF producer’s responsibility is to make sure region column/line details are in sync with a text file’s encoding. Any viewer that attempts to display SARIF results needs to be able to detect and handle that file’s encoding. Why does SARIF need to get involved? I may be missing something, as I am not an expert in this area.

To be clear, I am not averse to leaving placeholders for ‘encoding’ and even Jim’s new line sequences data. Maybe it will give someone a leg up somewhere. I do object to spending lots of time explicating how to handle things in the spec. We need to close on solid, well-tested dynamic code flows, graphs, etc. That’s the special value we’re adding.

Michael

From: Larry Golding (Comcast) <larrygolding@comcast.net>

+SARIF

From: Larry Golding (Comcast) <larrygolding@comcast.net>

I might be able to finesse this point. I could remove the whole part of the “text regions” section that presents this (old and busted) way of determining encoding. Then I could say something like this:

    A SARIF producer SHALL only emit text-related region properties if it knows the character encoding of the file, in which case it SHALL also emit file.encoding (§3.17.9) or run.defaultFileEncoding (§3.11.17).

In the section on fixes I’d say something like:

    If a SARIF consumer does not know the character encoding of a file, it SHALL NOT apply a fix unless the deletedRegion contains binary-related properties.

Larry

From: Larry Golding (Comcast) <larrygolding@comcast.net>

The spec is inconsistent in how it tells a consumer to determine a file’s encoding. The sections on file.encoding and run.defaultFileEncoding say:
The section on “Text regions” (which was written before we introduced file.encoding and run.defaultFileEncoding) has a different idea. The reason this section cares about encoding is that it wants consumers to know how many bytes each character occupies, so they can correctly identify (and highlight) a text region:
(NOTE: Step 3 doesn’t actually identify an encoding, but it gives the consumer a best guess as to how to identify the region.) We need to rationalize these. It might look like this:
A couple of things:
Thoughts?

Larry
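
For illustration only, the behavior discussed in this thread could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not spec text: the function names (resolve_encoding, char_to_byte_span, may_apply_fix) are hypothetical, the SARIF objects are assumed to be plain dicts as loaded from JSON, and the binary region property names byteOffset and byteLength are assumed (the draft discussed here may name them differently). It shows the lookup order (file.encoding, then run.defaultFileEncoding, then "you’re on your own"), why the encoding matters for locating a text region in bytes, and the proposed guard on applying fixes.

```python
from typing import Optional, Tuple

# Assumed names for the "binary-related" region properties.
BINARY_REGION_PROPS = ("byteOffset", "byteLength")


def resolve_encoding(file_obj: dict, run_obj: dict) -> Optional[str]:
    """Return the declared charset name, or None if SARIF is silent.

    file.encoding takes precedence over run.defaultFileEncoding.
    No guessing is attempted when both are absent.
    """
    return file_obj.get("encoding") or run_obj.get("defaultFileEncoding")


def char_to_byte_span(text: str, char_offset: int, char_length: int,
                      encoding: str) -> Tuple[int, int]:
    """Map a character-based span to (byte offset, byte length).

    Assumes a codec that does not prepend a BOM when encoding
    (for example "utf-8" or "utf-16-le").
    """
    prefix = text[:char_offset].encode(encoding)
    span = text[char_offset:char_offset + char_length].encode(encoding)
    return len(prefix), len(span)


def may_apply_fix(encoding: Optional[str], deleted_region: dict) -> bool:
    """Proposed rule: with an unknown encoding, only apply a fix whose
    deletedRegion carries binary-related properties."""
    if encoding is not None:
        return True
    return any(prop in deleted_region for prop in BINARY_REGION_PROPS)


if __name__ == "__main__":
    enc = resolve_encoding({}, {"defaultFileEncoding": "utf-8"})
    # The same one-character region occupies a different byte span
    # depending on the encoding of the file.
    print(enc, char_to_byte_span("a€b", 1, 1, "utf-8"))
    print("utf-16-le", char_to_byte_span("a€b", 1, 1, "utf-16-le"))
    print(may_apply_fix(None, {"startLine": 1}))
```

The sketch deliberately returns None rather than falling back to a guessed encoding, which mirrors the suggestion that the spec stay silent when neither property is present.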