Subject: Proposal for Spreadsheets: New sort option "natural sort" (updated)
Hi,

below is the revised proposal for the "natural-sort" attribute of the <table:sort> element:

The attribute "table:embedded-number-behavior" specifies how string values that contain digits are sorted. If the attribute's value is "integer" or "double", string-prefixed numbers will be sorted in a "natural", number-aware way, i.e. A1, A2, A3, ..., A19, A20, instead of the normal, alpha-numeric behavior, i.e. A1, A10, A11, A12, ..., A19, A2, A20, A3, A4, ..., A8, A9.

<define name="table-sort-attlist" combine="interleave">
  <optional>
    <attribute name="table:embedded-number-behavior" a:defaultValue="alpha-numeric">
      <choice>
        <value>alpha-numeric</value>
        <value>integer</value>
        <value>double</value>
      </choice>
    </attribute>
  </optional>
</define>

The following illustrates how two strings shall be compared if the attribute value is "integer" or "double".

Step 1. First of all, the two strings are compared by using the normal string comparison algorithm to check whether they are equal. If they are equal, the function will return immediately with equality.

Step 2. Next, each of the two strings is divided into three parts:

1. Prefix substring
2. Numeric substring
3. Suffix substring

The prefix substring is determined by locating the first occurrence of a digit character; the substring from the very first character through the character preceding the first digit is considered the prefix. If the first digit happens to be the first character of the whole string, the prefix substring is empty. If either one of the compared strings contains no digit, the natural sort process will end and the normal string comparison will be performed instead. The digit determined herein is locale-aware, and therefore is not limited to ASCII digits. If the attribute value is "double", a decimal separator is considered a digit so that real numbers are supported if the appropriate conditions are met (see "Note" below).

Step 3. After the prefix substring is extracted from both of the compared strings, a normal string comparison is performed on the extracted prefixes. If they differ, the result is returned and the process will end. If they are equal, it will proceed to the next step of numeric string comparison.

Step 4. In this step, the numeric substring is determined by locating the first occurrence of a non-digit character after the first digit character; the substring from the first digit character through the character preceding the first non-digit (or through the end of the string if no non-digit follows) is considered the numeric substring. This substring is then converted into a double-precision variable. This step is performed on both of the compared strings, and the converted values are compared by simple numeric comparison. If these values differ, the result will be returned and the process will end. If they are equal, the process will proceed to the next step.

Step 5. After the numeric comparison returns equality, the suffix substring, which is simply the rest of the string that occurs after the last character of the numeric substring, will be extracted. This suffix substring will then replace the original string, and the whole process will repeat (i.e. back to Step 1).

This sorting process is illustrated in the picture below.

Note that the term "normal string comparison" mentioned in the algorithm description refers to a locale-specific string comparison; it does not refer to a simple ASCII string comparison. The locale is either given explicitly by the table:language and table:country attributes or, when these are not specified, is the default locale.

Note: Treatment of decimal separators:

If the attribute value is "integer", a decimal separator is not considered a digit.

If the attribute value is "double", the treatment of a decimal separator is context-dependent: when a decimal separator occurs adjacent to one or two digit characters, it is considered a digit character as long as it is the only occurrence in that given numeric substring. In other words, a second occurrence of a decimal separator in any numeric substring is treated as a non-digit character; the character immediately preceding that separator becomes the last character of the numeric substring, while the separator itself becomes the first character of the suffix substring.

Best regards

Michael

--
Michael Brauer, Technical Architect Software Engineering
StarOffice/OpenOffice.org
Sun Microsystems GmbH
Nagelsweg 55
D-20097 Hamburg, Germany
michael.brauer@sun.com
http://sun.com/staroffice
+49 40 23646 500
http://blogs.sun.com/GullFOSS
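
For illustration only, the comparison described in Steps 1 to 5 and in the note on decimal separators could be sketched in Python roughly as follows. This is not part of the proposal text. The names natural_compare, split_number and plain_compare are hypothetical, the decimal separator is assumed to be a single character passed in by the caller (None stands for the "integer" behavior), and plain Python string ordering stands in for the locale-specific "normal string comparison". Treating a separator that directly precedes the first digit as part of the number is one reading of "adjacent to one or two digit characters".

```python
from functools import cmp_to_key


def plain_compare(a, b):
    # Stand-in for the locale-specific "normal string comparison" (assumption).
    return (a > b) - (a < b)


def split_number(s, decimal_sep=None):
    """Split s into (prefix, numeric, suffix); numeric is '' if s has no digit."""
    start = next((i for i, ch in enumerate(s) if ch.isdecimal()), None)
    if start is None:
        return s, "", ""
    # Decimal-aware mode: a separator directly before the first digit is
    # treated as part of the number (one reading of the note).
    if decimal_sep and start > 0 and s[start - 1] == decimal_sep:
        start -= 1
    end = start
    seen_sep = False
    while end < len(s):
        ch = s[end]
        if ch.isdecimal():
            end += 1
        elif decimal_sep and ch == decimal_sep and not seen_sep:
            seen_sep = True  # only the first separator counts as a digit
            end += 1
        else:
            break
    return s[:start], s[start:end], s[end:]


def to_number(numeric, decimal_sep=None):
    # Step 4: convert the numeric substring to a double-precision value.
    if decimal_sep:
        numeric = numeric.replace(decimal_sep, ".")
    return float(numeric)


def natural_compare(a, b, decimal_sep=None, string_compare=plain_compare):
    """Return a negative, zero or positive value, like a classic comparator."""
    while True:
        # Step 1: equal strings compare as equal right away.
        if string_compare(a, b) == 0:
            return 0
        # Step 2: split both strings; fall back if either has no digit.
        prefix_a, num_a, suffix_a = split_number(a, decimal_sep)
        prefix_b, num_b, suffix_b = split_number(b, decimal_sep)
        if not num_a or not num_b:
            return string_compare(a, b)
        # Step 3: compare the prefixes as strings.
        result = string_compare(prefix_a, prefix_b)
        if result != 0:
            return result
        # Step 4: compare the numeric substrings as numbers.
        result = plain_compare(to_number(num_a, decimal_sep),
                               to_number(num_b, decimal_sep))
        if result != 0:
            return result
        # Step 5: repeat the whole process on the suffixes.
        a, b = suffix_a, suffix_b
```

With this sketch, sorted(["A1", "A10", "A11", "A12", "A19", "A2", "A20", "A3"], key=cmp_to_key(natural_compare)) gives A1, A2, A3, A10, A11, A12, A19, A20. Passing decimal_sep="." makes "v1.5" sort after "v1.10" (1.5 versus 1.1), whereas without a separator the same pair is ordered by the integers 5 and 10 found in the suffixes.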