
Subject: [OASIS Issue Tracker] Updated: (MQTT-44) Specific details for UTF-8 Strings
From: OASIS Issues Tracker <workgroup_mailer@lists.oasis-open.org>
To: mqtt@lists.oasis-open.org
Date: Thu, 14 Nov 2013 15:25:50 +0000 (UTC)
     [ http://tools.oasis-open.org/issues/browse/MQTT-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Niblett updated MQTT-44:
------------------------------

    Proposal: 
Add the following to section 2.1.2

The encoded data must be well-formed UTF-8 as defined by the Unicode spec [Unicode], and restated in RFC 3629 [RFC 3629]. In particular the encoded data MUST NOT include encodings of codepoints between U+D800 and U+DFFF.  If a receiver (server or client) receives a control packet containing ill-formed UTF-8 it MUST close the network connection.  

The data MUST NOT include an encoding of the null character U+0000.   If a receiver (server or client) receives a control packet containing U+0000 it MUST close the network connection.  
The data SHOULD NOT include encodings of the Unicode codepoints listed below. If a receiver (server or client) receives a control packet containing any of them it MAY close the network connection.

U+0001..U+001F control characters
U+007F..U+009F control characters
Codepoints defined in the Unicode spec to be noncharacters (for example U+0FFFF)

The UTF-8 encoded sequence 0xEF 0xBB 0xBF is always to be interpreted to mean U+FEFF  ("ZERO WIDTH NO-BREAK SPACE") wherever it appears in a string and must never be skipped over or stripped off by a packet receiver.
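As an illustration only, the receiver-side checks proposed above for section 2.1.2 might be sketched as follows. The names MalformedString, is_noncharacter and check_mqtt_string are hypothetical and do not come from the proposal or from any MQTT library. The sketch assumes the two-byte length prefix has already been removed and relies on Python's strict "utf-8" codec to reject ill-formed sequences.

```python
from typing import Tuple


class MalformedString(Exception):
    """Signals a MUST-level violation: the receiver closes the connection."""


def is_noncharacter(cp: int) -> bool:
    # Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points
    # of every plane (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_mqtt_string(data: bytes) -> Tuple[str, bool]:
    """Decode `data` and return (text, discouraged).

    Raises MalformedString for ill-formed UTF-8 or an encoded U+0000.
    `discouraged` is True when a SHOULD NOT code point is present, in
    which case the receiver MAY (but need not) close the connection.
    """
    try:
        # The strict codec rejects overlong forms and encoded surrogates
        # (U+D800..U+DFFF). Unlike "utf-8-sig", it does not strip a leading
        # 0xEF 0xBB 0xBF, so U+FEFF is preserved wherever it appears.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise MalformedString(str(exc)) from exc

    if "\x00" in text:
        raise MalformedString("U+0000 is not allowed")

    discouraged = any(
        0x0001 <= cp <= 0x001F or 0x007F <= cp <= 0x009F or is_noncharacter(cp)
        for cp in map(ord, text)
    )
    return text, discouraged
```

A caller would treat MalformedString as the MUST-close case and decide by local policy what to do when the second element of the result is True.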

Add the following to section 4.7.3 
When it performs subscription matching the server does not perform any normalization of Topic Names/Filters, or any modification or substitution of unrecognised characters. Each non-wildcarded level in the Topic Filter has to match the corresponding level in the Topic Name character for character for the match to succeed.

Non-normative comment. The UTF-8 encoding rules mean that the comparison of Topic Filter and Topic Name could be performed either by comparing the encoded UTF-8 bytes, or by comparing decoded Unicode characters.
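The matching rule above could be sketched roughly as follows, to show that comparing decoded characters and comparing encoded bytes give the same answer and that no normalization takes place. The function names are hypothetical. The handling of the "+" and "#" wildcards is a simplified assumption based on the general MQTT topic rules, not on this proposal, and topics beginning with "$" are ignored.

```python
def _match_levels(filter_levels, name_levels, plus, hash_):
    # Compare level by level; non-wildcard levels must be exactly equal.
    for i, level in enumerate(filter_levels):
        if level == hash_:
            return True  # multi-level wildcard matches everything that remains
        if i >= len(name_levels):
            return False
        if level != plus and level != name_levels[i]:
            return False
    return len(filter_levels) == len(name_levels)


def topic_matches(topic_filter: str, topic_name: str) -> bool:
    """Match on decoded Unicode characters, with no normalization."""
    return _match_levels(topic_filter.split("/"), topic_name.split("/"), "+", "#")


def topic_matches_bytes(topic_filter: bytes, topic_name: bytes) -> bool:
    """Match on the raw UTF-8 bytes.

    Splitting on 0x2F is safe because bytes of a multi-byte UTF-8
    sequence are never in the ASCII range.
    """
    return _match_levels(topic_filter.split(b"/"), topic_name.split(b"/"), b"+", b"#")


# Precomposed U+00E9 versus "e" followed by combining U+0301: these look
# alike when rendered but are different character sequences.
precomposed = "caf\u00e9/menu"
decomposed = "cafe\u0301/menu"
print(topic_matches(precomposed, decomposed))
print(topic_matches_bytes(precomposed.encode("utf-8"), decomposed.encode("utf-8")))
```

Both calls report no match, since the server is not expected to normalize either string before comparing.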

  was:
Add the following to section 2.1.2

The encoded data must be well-formed UTF-8 as defined by the Unicode spec, and restated in RFC 3629. In particular the encoded data MUST NOT include encodings of characters between U+D800 and U+DFFF.  If a receiver (server or client) receives a control packet containing ill-formed UTF-8 it MUST close the network connection.  

The data MUST NOT include an encoding of the null character U+0000.   If a receiver (server or client) receives a control packet containing U+0000 it MUST close the network connection.  
The data SHOULD NOT include encodings of the Unicode codepoints listed below. If a receiver (server or client) receives a control packet containing any of them it MAY close the network connection.

U+0001..U+001F control characters
U+007F..U+009F control characters
Codepoints defined in the Unicode spec to be noncharacters (for example U+0FFFF)

The UTF-8 encoded sequence 0xEF 0xBB 0xBF is always to be interpreted to mean U+FEFF  ("ZERO WIDTH NO-BREAK SPACE") wherever it appears in a string and must never be skipped over or stripped off by a packet receiver.

Add the following to section 4.7.3 
When it performs subscription matching the server does not perform any normalization of Topic Names/Filters, or any modification or substitution of unrecognised characters. Each non-wildcarded level in the Topic Filter has to match the corresponding level in the Topic Name character for character for the match to succeed.


I have made a couple of minor edits to the proposal and have added an additional non-normative comment at the end.

Non-normative comment. The UTF-8 encoding rules mean that the comparison of Topic Filter and Topic Name could be performed either by comparing the encoded UTF-8 bytes, or by comparing decoded Unicode characters.

> Specific details for UTF-8 Strings
> ----------------------------------
>
>                 Key: MQTT-44
>                 URL: http://tools.oasis-open.org/issues/browse/MQTT-44
>             Project: OASIS Message Queuing Telemetry Transport (MQTT) TC
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 3.1.1
>            Reporter: Rahul Gupta
>            Assignee: Peter Niblett
>
> This issue is based on comments in MQTT-24 and is opened as a Core issue for discussion on the MQTT TC call. I had a discussion with my co-editor Andy, and he suggested opening a core issue for TC discussion.
> from MQTT-24
> -------------------
> > We should also make a simple statement that UTF-8 encodings MUST NOT have a three character initial BOM.
> > A clarification that the encoding MUST NOT be Java's Modified UTF-8, and can contain ASCII NULL
> > At the same time, it's probably worth noting too that certain Unicode combinations are invalid in UTF-8: the use of surrogate pairs from UTF-16 re-encoded, and certain non-transmissible characters (e.g. U+FFFE, from memory), which normally delimit the last 2 characters in a multilingual plane. These restrictions are only a minor burden for Java implementations using the naive methods in String / Character. These restrictions serve to stop propagation of bad data through a network of nodes.
> > Implementations MAY decide to not support the use of ASCII NUL and C0 / C1 control codes / MAY decide to place additional restrictions on supported characters

-- 
This message is automatically generated by JIRA.