Subject: [wsrp-markup] [wsrp][markup] How browsers handle form encoding
Note: extracted from an e-mail sent by David Ward, a colleague of mine....
You're right. When you actually get down to reading the note Javasoft uses to justify universal use of UTF-8 URL encoding, it's not quite as watertight as you would think.
I think the standards still haven't resolved this matter. There's a note in RFC 2070: HTML Internationalization that says the following:
5.2. Form submission

The HTML 2.0 form submission mechanism, based on the "application/x-www-form-urlencoded" media type, is ill-equipped with regard to internationalization. In fact, since URLs are restricted to ASCII characters, the mechanism is awkward even for ISO-8859-1 text. Section 2.2 of [RFC1738] specifies that octets may be encoded using the "%HH" notation, but text submitted from a form is composed of characters, not octets. Lacking a specification of a character encoding scheme, the "%HH" notation has no well defined meaning.

The best solution is to use the "multipart/form-data" media type described in [RFC1867] with the POST method of form submission. This mechanism encapsulates the value part of each name-value pair in a body-part of a multipart MIME body that is sent as the HTTP entity; each body part can be labeled with an appropriate Content-Type, including if necessary a charset parameter that specifies the character encoding scheme. The changes to the DTD necessary to support this method of form submission have been incorporated in the DTD included in this specification.

A less satisfactory solution is to add a MIME charset parameter to the "application/x-www-form-urlencoded" media type specification sent along with a POST method form submission, with the understanding that the URL encoding of [RFC1738] is applied on top of the specified character encoding, as a kind of implicit Content-Transfer-Encoding.

One problem with both solutions above is that current browsers do not generally allow for bookmarks to specify the POST method; this should be improved. Conversely, the GET method could be used with the form data transmitted in the body instead of in the URL. Nothing in the protocol seems to prevent it, but no implementations appear to exist at present. How the user agent determines the encoding of the text entered by the user is outside the scope of this specification.
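The RFC's point that "%HH" has no well defined meaning without a declared charset is easy to demonstrate with the standard java.net.URLDecoder: the very same percent-encoded bytes decode to different text depending on which character encoding you assume (a sketch, not anything from the RFC itself):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class HHAmbiguity {
    public static void main(String[] args) {
        // The two bytes 0xC2 0xA3, percent-encoded
        String encoded = "%C2%A3";

        // Interpreted as UTF-8, the bytes form one character: the pound sign
        System.out.println(URLDecoder.decode(encoded, StandardCharsets.UTF_8));      // £

        // Interpreted as ISO-8859-1, the same bytes are two characters
        System.out.println(URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1)); // Â£
    }
}
```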
Thus, for reliable transmission of multibyte character data, the recommendation is to set enctype="multipart/form-data" on your form, so that each field value is encoded in a separate part of a multipart MIME message. In practice, this encoding is normally only used for uploading files through forms.
Still, no one seems to use this encoding in practice (given that the results are non-bookmarkable), so I decided to investigate what actually happens in IE and Netscape.
I wrote a simple JSP as follows:
<%@page contentType="text/html; charset=UTF-8"%>
<html>
<body>
<form>
<table>
<tr>
<td>Enter some text: </td>
<td><input type="text" name="text"></td>
</tr>
<tr>
<td colspan="2" align="center"><input type="submit" name="submit"
value="Submit"></td>
</tr>
</table>
</form>
</body>
</html>
I found that in both Netscape and IE, when I submitted the text "£££", the following was appended to the URL:
text=%C2%A3%C2%A3%C2%A3
This shows that a UTF-8 URL encoding has been used, because £ is outside the ASCII character set, and thus requires two bytes in UTF-8.
I then changed the charset declaration in my JSP to ISO-8859-1.
Now, in both IE and Netscape, when I submitted "£££", the following was appended to the URL:
text=%A3%A3%A3
This shows that an ISO-8859-1 encoding has been used, as £ can be encoded as a single byte. This happened even though I had the "Always send URLs as UTF-8" checkbox set in my Internet Explorer settings; that option must only control behaviour in the "invalid link" case documented by the original W3C extract.
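Both observations can be reproduced outside the browser with java.net.URLEncoder, which applies the %HH notation on top of whatever character encoding you hand it (a sketch of the encoding step only, not the browsers' actual code path):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PoundEncoding {
    public static void main(String[] args) {
        // UTF-8: the pound sign (U+00A3) becomes the two bytes 0xC2 0xA3
        System.out.println(URLEncoder.encode("£££", StandardCharsets.UTF_8));
        // prints %C2%A3%C2%A3%C2%A3 — what both browsers sent for a UTF-8 page

        // ISO-8859-1: the pound sign is the single byte 0xA3
        System.out.println(URLEncoder.encode("£££", StandardCharsets.ISO_8859_1));
        // prints %A3%A3%A3 — what both browsers sent for a Latin-1 page
    }
}
```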
So, in conclusion, both IE and Netscape use the convention that form data is submitted using the character encoding of the page.
I was interested to see what would happen if I tried to submit characters that weren't encodable in the page character encoding, so I copied the following Japanese characters from http://www.yahoo.co.jp (I don't know what they mean) and submitted them:
万名追加
In IE, this resulted in the following. Obviously a multibyte encoding is being used, but which one?
text=%96%9C%96%BC%92%C7%89%C1
In Netscape 6.2, I just got the following. Each character seems to have been replaced with a literal '?' (0x3F), the usual substitute for characters that cannot be represented in the page encoding.
text=%3F%3F%3F%3F
Now, I thought I'd experiment to see what effect the accept-charset HTML form attribute has in each of the browsers, so I added accept-charset="UTF-8" to my form definition (remember my page encoding is still single byte ISO-8859-1).
Now, in both IE and Netscape I got the following results when I submitted the same characters.
text=%E4%B8%87%E5%90%8D%E8%BF%BD%E5%8A%A0
At last! Consistent results! It appears that a UTF-8 encoding has been used in both cases.
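Decoding that sequence as UTF-8 with java.net.URLDecoder confirms the round trip (a sketch of the decode step, using the same characters recovered above):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String submitted = "%E4%B8%87%E5%90%8D%E8%BF%BD%E5%8A%A0";

        // Decoding as UTF-8 recovers the original Japanese characters
        String decoded = URLDecoder.decode(submitted, StandardCharsets.UTF_8);
        System.out.println(decoded); // 万名追加

        // Re-encoding as UTF-8 gives back the identical percent-encoded string
        System.out.println(URLEncoder.encode(decoded, StandardCharsets.UTF_8).equals(submitted)); // true
    }
}
```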
However, what happens now with characters that are encodable in the page encoding? I tried re-submitting "£££".
In Netscape 6.2 I got UTF-8
text=%C2%A3%C2%A3%C2%A3
In IE I got ISO-8859-1
text=%A3%A3%A3
In summary, when no accept-charset attribute is present on an HTML form, both Netscape 6.2 and IE 6 use the page character encoding as the URL encoding. When an accept-charset attribute is present, Netscape 6.2 uses it to encode all strings, while IE 6 only uses it to encode strings outside the page character set.
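The practical consequence for the server side: if you declare accept-charset="UTF-8" and decode every submission as UTF-8, Netscape's data survives the round trip, but IE's single-byte fallback is corrupted, because a bare 0xA3 byte is not valid UTF-8. A sketch of that failure mode using the observed submissions (not actual servlet code):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeMismatch {
    public static void main(String[] args) {
        // Netscape 6.2 sent UTF-8 bytes; decoding as UTF-8 recovers the pounds
        System.out.println(URLDecoder.decode("%C2%A3%C2%A3%C2%A3", StandardCharsets.UTF_8));
        // prints £££

        // IE 6 sent ISO-8859-1 bytes; decoded as UTF-8, each invalid 0xA3 byte
        // becomes the replacement character U+FFFD, and the data is lost
        System.out.println(URLDecoder.decode("%A3%A3%A3", StandardCharsets.UTF_8));
        // prints three U+FFFD replacement characters
    }
}
```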
So to be compatible with both Netscape and IE, you are going to have to: