
mqtt message


Subject: [OASIS Issue Tracker] Commented: (MQTT-44) Specific details for UTF-8 Strings

    [ http://tools.oasis-open.org/issues/browse/MQTT-44?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=34155#action_34155 ] 

Raphael Cohen commented on MQTT-44:

Additional comment from out-of-band discussion of MQTT-24

I'll try to give some context to these.

1) C0 / C1 control codes

ASCII NUL, also known as NULL, is U+0000 in Unicode. It is the very first of what are known in both US-ASCII and Unicode as the 'C0 control characters'. Unfortunately, this character is often reserved for special purposes in some programming languages and technologies:-

- In C, it terminates a string (C strings carry no length). All common C stdlib string functions (eg strlen()) scan a string looking for this character. Consequently, a naive implementation that used such code could easily be corrupted by an embedded U+0000. Indeed, this is one of the most common attacks: the C form of SQL injection.

- In Java, _some_ of the functions output a bastardised UTF-8 encoding called 'Modified UTF-8', usually the data serialisation (incl. object serialisation) APIs. Modified UTF-8 encodes U+0000 as the two bytes 0xC0 0x80 rather than as a single 0x00 byte. This has also been exploited, unsurprisingly.

- In SASL, especially SASL PLAIN (a very common auth technology, used in AMQP, Redis and XMPP), it is used as a delimiter.

- In POSIX, it is one of only two characters (the other being /) not allowed in a file name (or folder name).

What I want us to do in MQTT is _be aware_ of these potential issues, and avoid having MQTT implementers naively propagate wrong, bad or simply maliciously crafted data. An implementation may be capable of correctly reading an MQTT string using the length prefix, but it may then pass on such tainted data unawares into an internal library. This is exceedingly common in the C world, where much code makes use of other static libs (eg iconv).

A defensive implementation will reject data with this character in it because it can't be 100% certain of internal or downstream code's ability to cope. Additionally, many clients won't be able to easily store such data - take the POSIX example above - or will need special handling (eg if embedding SASL PLAIN inside the AMQP field for user name).

My proposal: 'An implementation MAY disconnect clients who use Unicode U+0000 in strings' as a defensive measure. Let's give implementers the chance to get this right from day one, and not give MQTT the perception of being an 'insecure protocol'.
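
A minimal sketch of that defensive check, in Python. It assumes the MQTT string layout of a two-byte big-endian length followed by that many bytes of UTF-8; the names `read_mqtt_string` and `MalformedString` are hypothetical and not taken from any MQTT library.

```python
import struct


class MalformedString(Exception):
    """Hypothetical error type; a broker would typically disconnect on it."""


def read_mqtt_string(buf, offset=0):
    """Read one length-prefixed string and return (text, next_offset)."""
    if len(buf) - offset < 2:
        raise MalformedString("truncated length prefix")
    (length,) = struct.unpack_from(">H", buf, offset)
    start, end = offset + 2, offset + 2 + length
    if end > len(buf):
        raise MalformedString("truncated string body")
    raw = buf[start:end]
    # The length prefix copes with an embedded NUL; the risk is in handing
    # the result on to NUL-terminated code. In UTF-8 the byte 0x00 can only
    # mean U+0000, so a byte-level test is enough.
    if b"\x00" in raw:
        raise MalformedString("U+0000 not accepted")
    try:
        text = raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        raise MalformedString(str(exc)) from exc
    return text, end
```

Rejecting at the point of parsing means no later layer ever sees the tainted value.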

1b) Remaining C0 / C1 control codes
Personally, I'd not say anything at all, as I think these are more minor, but we could use a 'SHOULD support C0 / C1 control codes with the exception of U+0000'. Why? Because in some environments C0 / C1 control codes are used for attacks against logging or file names.
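
As an illustration of the logging concern, here is a rough Python sketch that detects C0 / C1 control codes and escapes them before a string reaches a log line. It leans on the Unicode general category `Cc`, which covers U+0000..U+001F and U+007F..U+009F (so it also catches DEL); the function names are made up for this example.

```python
import unicodedata


def has_control_codes(text):
    """True if any character is in general category Cc (C0, DEL or C1)."""
    return any(unicodedata.category(ch) == "Cc" for ch in text)


def escape_for_log(text):
    """Replace each control code with a visible \\uXXXX escape."""
    return "".join(
        f"\\u{ord(ch):04x}" if unicodedata.category(ch) == "Cc" else ch
        for ch in text
    )
```

With this, a client identifier containing CR LF cannot start a forged second log line; it is written out as visible escapes instead.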

2) Invalid characters
These are all characters, code points or byte combinations that are not valid in _transmitted_ UTF-8 data. I strongly suggest that implementations: 'MUST disconnect immediately if any bytes, characters or code points of a UTF-8 string are found to be invalid or impermissible for transmission'. Trying to handle badly encoded data just makes things far worse in my experience.

2a) Surrogate Pairs
Encoded surrogate pairs are NOT allowed in UTF-8*, and we should explicitly say this, because there is much confusion. In particular, Modified UTF-8, CESU-8 and UCS-2 DO allow them. See 4 below.

* They occur when a UTF-16 surrogate pair (ie 2 UTF-16 code units) is naively encoded to UTF-8. In such situations the input data is WRONG. In my experience, the usual culprit is Java-originated data, where a programmer naively assumes one UTF-16 char is one UTF-8 character - some of the internal Sun/Oracle code does likewise. Even worse is when one gets half a surrogate pair encoded...
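
To make the difference concrete, a small Python sketch using U+1F600 as an arbitrary example of a character outside the BMP. Python's `surrogatepass` error handler is used here only to manufacture the bad bytes; the helper name `is_strict_utf8` is an assumption of this example.

```python
def is_strict_utf8(raw):
    """True if raw decodes as UTF-8 under Python's default strict rules."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Correct UTF-8: one four-byte sequence for the code point.
correct = "\U0001F600".encode("utf-8")

# Naive encoding: each UTF-16 surrogate encoded as its own 3-byte sequence.
naive_pair = "\ud83d\ude00".encode("utf-8", "surrogatepass")

# Half a surrogate pair, encoded the same naive way.
lone_half = "\ud83d".encode("utf-8", "surrogatepass")

print(correct.hex(), is_strict_utf8(correct))
print(naive_pair.hex(), is_strict_utf8(naive_pair))
print(lone_half.hex(), is_strict_utf8(lone_half))
```

A strict decoder accepts only the first form, which is the behaviour a 'MUST disconnect' rule would rely on.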

2b) In-transmissible Characters
U+FFFE, U+FFFF and others:-

'Not a character' / 'process internal usage'. Wikipedia: "Certain noncharacter code points are guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six noncharacters: U+FDD0..U+FDEF and any code point ending in the value FFFE or FFFF (i.e. U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined."

See   http://www.unicode.org/faq/private_use.html#noncharacters
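
The sixty-six noncharacters are easy to test for after decoding. A sketch in Python, with function names invented for this example:

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def contains_noncharacter(text):
    """True if any code point in an already-decoded string is a noncharacter."""
    return any(is_noncharacter(ord(ch)) for ch in text)


# Sanity check against the figure quoted above.
total = sum(is_noncharacter(cp) for cp in range(0x110000))
print(total)
```

Note that a strict UTF-8 decoder does not reject these on its own (their byte sequences are well formed), so this has to be a separate check.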

2c) Out-of-range characters
These are simply bad encodings, eg ones 6 bytes long or with invalid bits.
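
For example, the lead byte alone already rules out the old 5 and 6 byte forms. A Python sketch follows (the function name is hypothetical). It is only a first step: a full validator must also check the continuation bytes and the restricted second-byte ranges after 0xE0, 0xED, 0xF0 and 0xF4.

```python
def sequence_length(lead):
    """Length of the UTF-8 sequence a lead byte starts, or None if invalid."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # Everything else can never start a sequence:
    #   0x80..0xBF  continuation bytes
    #   0xC0, 0xC1  would only produce overlong 2-byte forms
    #   0xF5..0xFF  beyond U+10FFFF, including the 5 and 6 byte forms
    return None
```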

2d) PUA - Private Use Area and PUC - Private Use Character / Code
Should be fully allowed, so all we need to say is implementations 'SHOULD accept Private Use Area and Private Use Characters' (eg PU1).

3) BOMs
The UTF-8 encoded strings in MQTT do not have a three-byte BOM prefix (0xEF 0xBB 0xBF, the UTF-8 encoding of U+FEFF).

This commonly occurs when MS Notepad has been used to craft some data - it usually sticks a UTF-8 'BOM' on the front, even though the Unicode standard neither requires nor recommends one for UTF-8.

Explicitly state that UTF-8 strings are NOT BOM-prefixed (even though the reference to the RFC should be binding, it really needs hammering home).
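
A BOM is itself well-formed UTF-8, so a strict decoder lets it through and it has to be looked for explicitly. A short Python sketch; the function name is an assumption, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs


def has_utf8_bom(raw):
    """True if the byte string starts with EF BB BF."""
    return raw.startswith(codecs.BOM_UTF8)


with_bom = "topic".encode("utf-8-sig")  # this codec prepends the BOM
without_bom = "topic".encode("utf-8")   # plain UTF-8 never adds one

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))

# Strict decoding does not complain; the BOM just becomes a leading U+FEFF.
print(repr(with_bom.decode("utf-8")))
```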

4a) Modified UTF-8
I think we should make an explicit statement that Modified UTF-8 is incorrect. I'm not sure there's a normative reference. http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html is the best I can find.

I also think a 'call to action' to Java and Tcl programmers is in order here, eg 'Please take care if using Java or Tcl that you are correctly encoding to UTF-8 and not Modified UTF-8 (especially if using java.io.DataOutput)'.
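
To show what receivers are up against, here is a rough Python emulation of the two ways Modified UTF-8 departs from UTF-8: U+0000 becomes 0xC0 0x80, and characters outside the BMP are written as two separately encoded surrogates. It is an illustration only, not a reimplementation of java.io.DataOutput (it omits the length prefix, for one), and the function name is invented.

```python
def modified_utf8(text):
    """Emulate the Modified UTF-8 byte form of a string (no length prefix)."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    # Walk the UTF-16 code units, as a Java char-by-char encoder would.
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"  # overlong form instead of a 0x00 byte
        else:
            # surrogatepass lets surrogate code units through as 3 bytes each
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def is_strict_utf8(raw):
    """True if raw decodes as UTF-8 under Python's default strict rules."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For pure BMP text without U+0000 the two encodings produce identical bytes, which is why the problem tends to go unnoticed until unusual data turns up.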

4b) CESU-8

Likewise, we should state that the UTF-8 encoding is NOT CESU-8. (Wikipedia: "CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange."). Reference: http://www.unicode.org/reports/tr26/

4c) UCS-2

Likewise, not UCS-2.

> Specific details for UTF-8 Strings
> ----------------------------------
>                 Key: MQTT-44
>                 URL: http://tools.oasis-open.org/issues/browse/MQTT-44
>             Project: OASIS Message Queuing Telemetry Transport (MQTT) TC
>          Issue Type: Improvement
>          Components: core
>            Reporter: Rahul Gupta
> This issue is based on comments in MQTT-24, and is opened as a Core issue to discuss in the MQTT TC call. I had a discussion with my co-editor Andy and he suggested opening a core issue for TC discussion.
> from MQTT-24
> -------------------
> > We should also make a simple statement that UTF-8 encodings MUST NOT have a three character initial BOM.
> > A clarification that the encoding MUST NOT be Java's Modified UTF-8, and can contain ASCII NULL
> > At the same time, it's probably worth noting too that certain Unicode combinations are invalid in UTF-8 - the use of surrogate pairs from UTF-16 re-encoded and certain non-transmissible characters (eg U+FFFE from memory) - these normally delimit the last 2 characters in a multilingual plane. These restrictions are only a minor burden for Java implementations using the naive methods in string / character. These restrictions serve to stop propagation of bad data through a network of nodes.
> > Implementations MAY decide to not support the use of ASCII NUL and C0 / C1 control codes / MAY decide to place additional restrictions on supported characters


