Subject: [wsrp-markup] [wsrp][markup] How browsers handle form encoding
Note: extracted from an e-mail sent by David Ward, a colleague of mine....
You're right. When you actually get down to reading the note Javasoft uses to justify universal use of UTF-8 URL encoding, it's not quite as watertight as you would think.
I think the standards still haven't resolved this matter. There's a note in RFC 2070: HTML Internationalization that says the following:
5.2. Form submission

The HTML 2.0 form submission mechanism, based on the "application/x-www-form-urlencoded" media type, is ill-equipped with regard to internationalization. In fact, since URLs are restricted to ASCII characters, the mechanism is awkward even for ISO-8859-1 text. Section 2.2 of [RFC1738] specifies that octets may be encoded using the "%HH" notation, but text submitted from a form is composed of characters, not octets. Lacking a specification of a character encoding scheme, the "%HH" notation has no well defined meaning.

The best solution is to use the "multipart/form-data" media type described in [RFC1867] with the POST method of form submission. This mechanism encapsulates the value part of each name-value pair in a body-part of a multipart MIME body that is sent as the HTTP entity; each body part can be labeled with an appropriate Content-Type, including if necessary a charset parameter that specifies the character encoding scheme. The changes to the DTD necessary to support this method of form submission have been incorporated in the DTD included in this specification.

A less satisfactory solution is to add a MIME charset parameter to the "application/x-www-form-urlencoded" media type specification sent along with a POST method form submission, with the understanding that the URL encoding of [RFC1738] is applied on top of the specified character encoding, as a kind of implicit Content-Transfer-Encoding.

One problem with both solutions above is that current browsers do not generally allow for bookmarks to specify the POST method; this should be improved. Conversely, the GET method could be used with the form data transmitted in the body instead of in the URL. Nothing in the protocol seems to prevent it, but no implementations appear to exist at present. How the user agent determines the encoding of the text entered by the user is outside the scope of this specification.
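The RFC's point that "%HH" has no well defined meaning without a declared charset is easy to demonstrate with the standard java.net.URLDecoder: the very same percent-encoded bytes decode to different text depending on which character encoding you assume (a sketch, not anything from the RFC itself):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class HHAmbiguity {
    public static void main(String[] args) {
        // The two bytes 0xC2 0xA3, percent-encoded
        String encoded = "%C2%A3";

        // Interpreted as UTF-8, the bytes form one character: the pound sign
        System.out.println(URLDecoder.decode(encoded, StandardCharsets.UTF_8));      // £

        // Interpreted as ISO-8859-1, the same bytes are two characters
        System.out.println(URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1)); // Â£
    }
}
```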
Thus, for reliable transmission of multibyte character data, the recommendation is to set enctype="multipart/form-data" on your form, so that each field value is encoded in a separate part of a multipart MIME message. In practice, this encoding is normally only used for uploading files through forms.
Still, no one seems to use this encoding in practice (given that the results are non-bookmarkable), so I decided to investigate what actually happens in IE and Netscape.
I wrote a simple JSP as follows:
<%@page contentType="text/html; charset=UTF-8"%>
<html>
<body>
<form>
<table>
<tr>
<td>Enter some text: </td>
<td><input type="text" name="text"></td>
</tr>
<tr>
<td colspan="2" align="center"><input type="submit" name="submit"
value="Submit"></td>
</tr>
</table>
</form>
</body>
</html>
I found that in both Netscape and IE, when I submitted the text "£££", the following was appended to the URL:
text=%C2%A3%C2%A3%C2%A3
This shows that a UTF-8 URL encoding has been used, because £ is outside the ASCII character set, and thus requires two bytes in UTF-8.
I then changed the charset declaration in my JSP to ISO-8859-1.
Now, in both IE and Netscape, when I submitted "£££", the following was appended to the URL:
text=%A3%A3%A3
This shows that an ISO-8859-1 encoding has been used, as £ can be encoded as a single byte. This happened even though I had the "Always send URLs as UTF-8" checkbox set in my Internet Explorer settings; that option must only control behaviour in the "invalid link" case documented by the original W3C extract.
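Both observations can be reproduced outside the browser with java.net.URLEncoder, which applies the %HH notation on top of whatever character encoding you hand it (a sketch of the encoding step only, not the browsers' actual code path):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PoundEncoding {
    public static void main(String[] args) {
        // UTF-8: the pound sign (U+00A3) becomes the two bytes 0xC2 0xA3
        System.out.println(URLEncoder.encode("£££", StandardCharsets.UTF_8));
        // prints %C2%A3%C2%A3%C2%A3 — what both browsers sent for a UTF-8 page

        // ISO-8859-1: the pound sign is the single byte 0xA3
        System.out.println(URLEncoder.encode("£££", StandardCharsets.ISO_8859_1));
        // prints %A3%A3%A3 — what both browsers sent for a Latin-1 page
    }
}
```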
So, in conclusion, both IE and Netscape use the convention that form data is submitted using the character encoding of the page.
I was interested to see what would happen if I tried to submit characters that weren't encodable in the page character encoding, so I copied the following Japanese characters from http://www.yahoo.co.jp (I don't know what they mean) and submitted them:
万名追加
In IE, this resulted in the following. Obviously a multibyte encoding is being used, but which one?
text=%96%9C%96%BC%92%C7%89%C1
In Netscape 6.2, I just got the following. Each character seems to have been replaced with a literal '?' (0x3F), the usual substitute for characters that cannot be represented in the page encoding.
text=%3F%3F%3F%3F
Now, I thought I'd experiment to see what effect the accept-charset HTML form attribute has in each of the browsers, so I added accept-charset="UTF-8" to my form definition (remember my page encoding is still single byte ISO-8859-1).
Now, in both IE and Netscape I got the following results when I submitted the same characters.
text=%E4%B8%87%E5%90%8D%E8%BF%BD%E5%8A%A0
At last! Consistent results! It appears that a UTF-8 encoding has been used in both cases.
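Decoding that sequence as UTF-8 with java.net.URLDecoder confirms the round trip (a sketch of the decode step, using the same characters recovered above):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String submitted = "%E4%B8%87%E5%90%8D%E8%BF%BD%E5%8A%A0";

        // Decoding as UTF-8 recovers the original Japanese characters
        String decoded = URLDecoder.decode(submitted, StandardCharsets.UTF_8);
        System.out.println(decoded); // 万名追加

        // Re-encoding as UTF-8 gives back the identical percent-encoded string
        System.out.println(URLEncoder.encode(decoded, StandardCharsets.UTF_8).equals(submitted)); // true
    }
}
```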
However, what happens now with characters that are encodable in the page encoding? I tried re-submitting "£££".
In Netscape 6.2 I got UTF-8
text=%C2%A3%C2%A3%C2%A3
In IE I got ISO-8859-1
text=%A3%A3%A3
In summary, when no accept-charset attribute is present on an HTML form, both Netscape 6.2 and IE 6 use the page character encoding as the URL encoding. When an accept-charset attribute is present, Netscape 6.2 uses it to encode all strings, while IE 6 only uses it to encode strings outside the page character set.
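The practical consequence for the server side: if you declare accept-charset="UTF-8" and decode every submission as UTF-8, Netscape's data survives the round trip, but IE's single-byte fallback is corrupted, because a bare 0xA3 byte is not valid UTF-8. A sketch of that failure mode using the observed submissions (not actual servlet code):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeMismatch {
    public static void main(String[] args) {
        // Netscape 6.2 sent UTF-8 bytes; decoding as UTF-8 recovers the pounds
        System.out.println(URLDecoder.decode("%C2%A3%C2%A3%C2%A3", StandardCharsets.UTF_8));
        // prints £££

        // IE 6 sent ISO-8859-1 bytes; decoded as UTF-8, each invalid 0xA3 byte
        // becomes the replacement character U+FFFD, and the data is lost
        System.out.println(URLDecoder.decode("%A3%A3%A3", StandardCharsets.UTF_8));
        // prints three U+FFFD replacement characters
    }
}
```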
So to be compatible with both Netscape and IE, you are going to have to: