office message

Subject: Re: [office] Proposal for Spreadsheets: New sort option "naturalsort"

From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: robert_weir@us.ibm.com
Date: Fri, 19 Jan 2007 17:33:20 -0500 (EST)

Robert Weir asked:
>How far do we take it?
>For example do we allow multiple levels, as in:
>A1.1, A1.2, A1.10, ... , A19.1, A20.3, etc.

All definitions that I know of handle an arbitrary number of levels.

I like the idea of it, it's VERY user-friendly, but there should be TWO natural sorts: case-sensitive and case-insensitive.

The trick is to make sure we carefully define "natural sort".  There seems to be several definitions floating around.  Of the ones around, Martin Pool's implementation seems promising:
* It seems to be the one most referred to, and the one most copied (e.g., by PHP)
* He provides an implementation in C with a zlib-style (open source BSD-style license), so many implementors could probably start with that code easily (easing implementation)
* It handles decimal numbers (with leading zeros) cleanly; not all other definitions do
* He defines BOTH case-sensitive and case-insensitive natural sorts.  That makes sense.

Pool's works like this.  To compare left, right, repeatedly do the following:
* Skip all leading whitespace of left and right
* If either begin with zero, compare the digits
   as left-aligned numbers  (consuming all consecutive digits of left and right)
   and return if not equal.
   This is basically the same as char-by-char comparisons of the digits starting from the left.
* If both begin with non-zero digits, compare the
   digits as right-aligned numbers (consuming all consecutive digits of left and right)
   and return if not equal.
   This means the longest number of digits wins, else digits are compared char-by-char
* Otherwise, compare the current character of left and right lexicographically
   (consuming one character of each) - this may be case-sensitive or not.
   Return if not equal.
* If we haven't hit a difference, go to next unconsumed chars of left and right,
   and repeat until all characters consumed.
   If all characters consumed in both left and right, return equal.

This works with the hidden assumption that an end-of-string is marked with a 0-byte (true for C).

Unfortunately, there are lots of variations.  How do you handle whitespace? Non-alpha characters? Leading zeros?  They seem to vary :-(.

Further info:
* Martin Pool has an implementation and links to other info:
http://sourcefrog.net/projects/natsort/
I particularly like this one because he carefully deals with leading zeros,
so that decimal numbers sort nicely. It looks like most other implementations are
based on this implementation/definition.

* PHP has one built-in, strnatcmp function, based on Martin Pool's:
http://us3.php.net/manual/en/function.strnatcmp.php

* Perl's CPAN has two implementations:
 http://search.cpan.org/~salva/Sort-Key/lib/Sort/Key/Natural.pm
 http://search.cpan.org/~sburke/Sort-Naturally-1.02/lib/Sort/Naturally.pm

* OOo implementation:
http://codesnippets.services.openoffice.org/Office/Office.SimpleNaturalSortAlgorithm.snip
http://kohei.us/ooo/nsort/index.html

* Stuart Cheshire's Mac Finder version: http://www.naturalordersort.org/
This appears to be the "grandaddy" of the idea, but it's not clear that this
handles leading 0's in a way that makes decimal numbers sort nicely.
Pool re-implemented the idea as a function usable by others.

--- David A. Wheeler

References:
- Proposal for Spreadsheets: New sort option "natural sort"
  - From: Michael Brauer - Sun Germany - ham02 - Hamburg <Michael.Brauer@Sun.COM>
- Re: [office] Proposal for Spreadsheets: New sort option "natural sort"
  - From: robert_weir@us.ibm.com