Subject: [OASIS Issue Tracker] Updated: (MQTT-44) Specific details for UTF-8 Strings
[ http://tools.oasis-open.org/issues/browse/MQTT-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Niblett updated MQTT-44:
------------------------------

Proposal:

Add the following to section 2.1.2

The encoded data must be well-formed UTF-8 as defined by the Unicode spec [Unicode], and restated in RFC 3629 [RFC 3629]. In particular the encoded data MUST NOT include encodings of codepoints between U+D800 and U+DFFF. If a receiver (server or client) receives a control packet containing ill-formed UTF-8 it MUST close the network connection.

The data MUST NOT include an encoding of the null character U+0000. If a receiver (server or client) receives a control packet containing U+0000 it MUST close the network connection.

The data SHOULD NOT include encodings of the Unicode codepoints listed below. If a receiver (server or client) receives a control packet containing any of them it MAY close the network connection.

  U+0001..U+001F control characters
  U+007F..U+009F control characters
  Codepoints defined in the Unicode spec to be noncharacters (for example U+0FFFF)

The UTF-8 encoded sequence 0xEF 0xBB 0xBF is always to be interpreted to mean U+FEFF ("ZERO WIDTH NO-BREAK SPACE") wherever it appears in a string and must never be skipped over or stripped off by a packet receiver.

Add the following to section 4.7.3

When it performs subscription matching the server does not perform any normalization of Topic Names/Filters, or any modification or substitution of unrecognised characters. Each non-wildcarded level in the Topic Filter has to match the corresponding level in the Topic Name character for character for the match to succeed.

Non-normative comment.
The UTF-8 encoding rules mean that the comparison of Topic Filter and Topic Name could be performed either by comparing the encoded UTF-8 bytes, or by comparing decoded Unicode characters.

was:

Add the following to section 2.1.2

The encoded data must be well-formed UTF-8 as defined by the Unicode spec, and restated in RFC 3629. In particular the encoded data MUST NOT include encodings of characters between U+D800 and U+DFFF. If a receiver (server or client) receives a control packet containing ill-formed UTF-8 it MUST close the network connection.

The data MUST NOT include an encoding of the null character U+0000. If a receiver (server or client) receives a control packet containing U+0000 it MUST close the network connection.

The data SHOULD NOT include encodings of the Unicode codepoints listed below. If a receiver (server or client) receives a control packet containing any of them it MAY close the network connection.

  U+0001..U+001F control characters
  U+007F..U+009F control characters
  Codepoints defined in the Unicode spec to be noncharacters (for example U+0FFFF)

The UTF-8 encoded sequence 0xEF 0xBB 0xBF is always to be interpreted to mean U+FEFF ("ZERO WIDTH NO-BREAK SPACE") wherever it appears in a string and must never be skipped over or stripped off by a packet receiver.

Add the following to section 4.7.3

When it performs subscription matching the server does not perform any normalization of Topic Names/Filters, or any modification or substitution of unrecognised characters. Each non-wildcarded level in the Topic Filter has to match the corresponding level in the Topic Name character for character for the match to succeed.

I have made a couple of minor edits to the proposal and have added an additional non-normative comment at the end.

Non-normative comment.
The UTF-8 encoding rules mean that the comparison of Topic Filter and Topic Name could be performed either by comparing the encoded UTF-8 bytes, or by comparing decoded Unicode characters.

> Specific details for UTF-8 Strings
> ----------------------------------
>
>                 Key: MQTT-44
>                 URL: http://tools.oasis-open.org/issues/browse/MQTT-44
>             Project: OASIS Message Queuing Telemetry Transport (MQTT) TC
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 3.1.1
>            Reporter: Rahul Gupta
>            Assignee: Peter Niblett
>
> This issue is based on comments in MQTT-24, and is opened as a Core issue for discussion on the MQTT TC call. I had a discussion with my co-editor Andy and he suggested opening a core issue for TC discussion.
>
> from MQTT-24
> -------------------
>
> We should also make a simple statement that UTF-8 encodings MUST NOT have a three character initial BOM.
>
> A clarification that the encoding MUST NOT be Java's Modified UTF-8, and can contain ASCII NULL
>
> At the same time, it's probably worth noting too that certain Unicode combinations are invalid in UTF-8 - the use of surrogate pairs from UTF-16 re-encoded and certain non-transmissible characters (e.g. U+FFFE from memory) - these normally delimit the last 2 characters in a multilingual plane. These restrictions are only a minor burden for Java implementations using the naive methods in String / Character. These restrictions serve to stop propagation of bad data through a network of nodes.
>
> Implementations MAY decide to not support the use of ASCII NUL and C0 / C1 control codes / MAY decide to place additional restrictions on supported characters
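
For illustration only, the receiver-side rules proposed for 2.1.2 and the matching rule proposed for 4.7.3 could be sketched in Python roughly as follows. This is not part of the proposal text. The function names check_mqtt_utf8 and topic_matches and the return values "close", "may_close" and "ok" are hypothetical, chosen for this sketch. The "+" and "#" wildcard handling is assumed from the MQTT specification rather than from this issue, and is simplified (for example, it ignores any special treatment of topics beginning with "$"). The sketch relies on Python's strict "utf-8" codec rejecting ill-formed input, including encoded surrogates and the overlong 0xC0 0x80 form used by Java's Modified UTF-8, and on that codec not stripping a leading 0xEF 0xBB 0xBF.

```python
def check_mqtt_utf8(data: bytes) -> str:
    """Classify an encoded string field (hypothetical helper).

    Returns "close" where the proposal says the receiver MUST close the
    connection, "may_close" where it MAY, and "ok" otherwise.
    """
    try:
        # Strict decoding rejects ill-formed UTF-8, including encodings of
        # U+D800..U+DFFF. The plain "utf-8" codec (unlike "utf-8-sig")
        # keeps a leading 0xEF 0xBB 0xBF as U+FEFF instead of stripping it.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "close"

    discouraged = False
    for ch in text:
        cp = ord(ch)
        if cp == 0x0000:
            return "close"                      # null character: MUST NOT
        if 0x0001 <= cp <= 0x001F or 0x007F <= cp <= 0x009F:
            discouraged = True                  # control characters
        elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            discouraged = True                  # Unicode noncharacters
    return "may_close" if discouraged else "ok"


def topic_matches(topic_filter: str, topic_name: str) -> bool:
    """Match level by level with no normalization (simplified sketch)."""
    filter_levels = topic_filter.split("/")
    name_levels = topic_name.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True                         # multi-level wildcard
        if i >= len(name_levels):
            return False
        # Non-wildcarded levels must be equal character for character.
        if level != "+" and level != name_levels[i]:
            return False
    return len(filter_levels) == len(name_levels)


if __name__ == "__main__":
    print(check_mqtt_utf8(b"\xef\xbb\xbfsensors/temp"))
    print(topic_matches("caf\u00e9/+", "cafe\u0301/menu"))
```

Because no normalization is applied, the precomposed and decomposed spellings of "café" in the last line are treated as different levels. For well-formed UTF-8, two strings are equal exactly when their encoded bytes are equal, so the same level comparison could be done on the raw bytes instead of on decoded characters.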