Subject: Re: [relax-ng] Encoding declaration, MIME type

>> I certainly agree that we should use the media type when one is
>> provided, but what about something like a "file:" or "ftp:" URL, where
>> there is no media type?
> Some people believe that existing OSs should be revised so that they can
> provide the charset parameter.

That would be nice, but I don't think it's going to happen soon.  The
closest thing to this is to default to the same encoding-detection
approach as the system text editor for local files.  For example, on
Windows 2000:

- if there's a UTF-8 BOM, then it's UTF-8
- if there's a UTF-16 BOM, then it's UTF-16
- otherwise, it's the platform default encoding (windows-1252 or shift-jis 
or whatever)
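
As a rough sketch of that heuristic in Python (the function name
guess_local_encoding is made up for illustration, and
locale.getpreferredencoding() stands in for "the platform default
encoding"; this is not a claim about how any particular editor is
implemented):

```python
import codecs
import locale


def guess_local_encoding(data: bytes) -> str:
    """Guess the encoding of a local file from its leading bytes."""
    # A UTF-8 BOM (EF BB BF) means UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # A UTF-16 BOM in either byte order means UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise fall back to the platform default (an assumption:
    # here that is whatever the current locale reports).
    return locale.getpreferredencoding(False)
```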

>> Here's a strawman proposal:
>> 1. If you get the RNC as a MIME entity including information about the
>> charset, then use that charset.  Note that text/plain without a charset
>> parameter is equivalent to "text/plain; charset=us-ascii".
> I am happy with this.  By the way, if we stick to the HTTP RFC, the
> default is ISO-8859-1.  I certainly think that this default is
> ridiculous.

Doesn't the XML media type RFC use the standard default of US-ASCII (which
I agree is ridiculous)?  I guess the only advantage of US-ASCII is that it
makes it a little bit easier to reliably detect a missing charset
parameter: the presence of any 8-bit byte tells you that something is
wrong.
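
To make that check concrete, here is a minimal sketch in Python (the
helper name has_eight_bit_byte is hypothetical).  It only reports
whether some byte has the high bit set, which is the signal that the
entity cannot really be US-ASCII:

```python
def has_eight_bit_byte(data: bytes) -> bool:
    """Return True if any byte is outside the 7-bit US-ASCII range."""
    return any(b > 0x7F for b in data)
```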

>> 2. Otherwise, the RNC is in UTF-8 or UTF-16.  If it has a UTF-16 BOM,
>> it's UTF-16.  Otherwise it's UTF-8.
> I can live with this.  By the way, which UTF-8?  With or without the
> Unicode signature?  Or, both?  (Probably, both.)

With or without.

>> 3. A system may provide a way to allow a user to specify an alternative
>> encoding for local files.
> Again, I can live with this.
>> 4. After converting the sequence of bytes to a sequence of characters,
>> any initial BOM is discarded.
> Including the Unicode signature for UTF-8?  Probably, yes.

Including. Notepad on Windows 2000 puts a UTF-8 BOM automatically, and I 
don't want that to cause an error for RNC.
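
Putting steps 1 to 4 of the strawman together, a sketch in Python might
look like the following.  The function and parameter names (decode_rnc,
mime_charset, user_encoding) are invented for illustration, and letting
the MIME charset take priority over a user-specified encoding is my
assumption; the proposal only mentions the user override for local
files, where there is no MIME charset anyway.

```python
import codecs
from typing import Optional


def decode_rnc(data: bytes,
               mime_charset: Optional[str] = None,
               user_encoding: Optional[str] = None) -> str:
    """Convert the bytes of an RNC schema to characters."""
    if mime_charset is not None:
        # Step 1: the MIME entity told us the charset.
        encoding = mime_charset
    elif user_encoding is not None:
        # Step 3: the user specified an alternative encoding.
        encoding = user_encoding
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Step 2: a UTF-16 BOM means UTF-16.
        encoding = "utf-16"
    else:
        # Step 2: otherwise UTF-8, with or without the signature.
        encoding = "utf-8"

    text = data.decode(encoding)

    # Step 4: discard any initial BOM that survived decoding.
    # Python's "utf-16" codec already consumes it; "utf-8" does not.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text
```

With this, a file saved by Notepad with a UTF-8 BOM decodes to the same
characters as the same file saved without one.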

> Non-ascii users will probably say that we should provide some in-band
> encoding declarations.  But I'm reluctant to do so.

Me too.  Imagine if every single programming language provided its own
different in-band encoding declaration.  The result would be chaotic, and 
we would never see any better solution.

> If we need a specialized media type for our compact syntax, I think that
> application/vnd.oasis-open.rng with the charset parameter is probably
> acceptable.

Wouldn't "vnd.oasis-open.rnc" be preferable?  Or maybe 
"vnd.oasis-open.relax-ng.rnc" (following the OASIS organizational 
structure)?  It seems like there ought to be a standard OASIS convention
for MIME types in the vnd.oasis-open tree.  What's the procedure for 
registration in the "vnd" tree?


