office-formula message

Subject: Re: [office-formula] Calculation setting for regular expression language?
From: Eike Rathke <erack@sun.com>
To: office-formula@lists.oasis-open.org
Date: Wed, 18 Oct 2006 16:36:48 +0200

Hi David,

Following up on an older topic:

On Saturday, 2006-08-05 23:08:02 -0400, David A. Wheeler wrote:

> I suggest that we create a new calculation setting
> to control which regular expression language to use -
> at least for the database criteria, and probably for SEARCH
> as well.  That will allow documents from various locations
> to move into OpenFormula, AND gives more flexibility.
> Anyone care to comment, discuss?

I think we should instead define one regex language, at least POSIX
EREs, maybe PCREs, with the addition of Unicode handling. Anything
else doesn't make much sense from an i18n point of view. This
effectively boils down to a language like the one used by ICU, as
described in http://icu.sourceforge.net/userguide/regexp.html
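
To illustrate why Unicode handling matters from an i18n point of view,
here is a small sketch. It uses Python's standard re module purely as
a stand-in (it is neither ICU nor the OOo engine), and the sample word
is my own choice:

```python
import re

# Sample word containing a non-ASCII letter ("Strasse" with a sharp s).
word = "Stra\u00dfe"

# Unicode-aware matching: \w covers letters beyond ASCII.
unicode_match = re.fullmatch(r"\w+", word)

# ASCII-only matching: the same pattern no longer covers the whole word.
ascii_match = re.fullmatch(r"\w+", word, re.ASCII)

print(unicode_match is not None)
print(ascii_match is not None)
```

An engine without Unicode handling behaves like the second case, which
is what makes it unsuitable for documents in most languages.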

Allowing multiple regex languages may seem to give more flexibility, but
IMHO it just adds to the confusion. Most spreadsheet applications would
implement only one regex language anyway, so documents that use
a different language could hardly be exchanged between applications.

In addition to the table:use-regular-expressions calculation setting,
ODF should be enhanced to include another setting, table:use-wildcards
or similar, to allow calculations using the MS Excel wildcards:
asterisk '*', question mark '?', and the tilde '~' escape character.
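
As a rough sketch of what such wildcard matching amounts to, the
following Python translates a wildcard pattern into a regular
expression. The function names are hypothetical, and whole-cell,
case-insensitive matching as well as treating a stray '~' as a literal
are assumptions of this sketch, not something any specification says:

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate a wildcard pattern ('*', '?', '~' escape) into a regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "~" and i + 1 < len(pattern) and pattern[i + 1] in "*?~":
            # '~' makes the following wildcard character literal.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")   # any run of characters, including none
        elif ch == "?":
            out.append(".")    # exactly one character
        else:
            out.append(re.escape(ch))  # everything else is literal
        i += 1
    return "".join(out)

def wildcard_match(pattern: str, text: str) -> bool:
    # Assumption: the pattern must cover the whole text, ignoring case.
    regex = wildcard_to_regex(pattern)
    return re.fullmatch(regex, text, re.IGNORECASE | re.DOTALL) is not None

print(wildcard_match("a*c", "abbbc"))
print(wildcard_match("a~*c", "abc"))
```

The translation is one-way and simple, which is one reason the two
settings can stay separate instead of becoming another regex dialect.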

The two settings table:use-regular-expressions and table:use-wildcards
would be mutually exclusive.

> OOo's is much more capable; below is its language per its help file.
> In fact, OOo's looks a whole lot like the POSIX standard's
> RE language. If it is, we probably ought to call it "POSIX"
> (as claimed above).  But I'm not SURE it is; I'd love to
> hear confirm/deny of it.  Should we call it POSIX? OOo?

The current implementation of OOo is mostly POSIX, though not strictly,
as you have noted:

> A quick comparison of OOo to the standard suggests that OOo
> _is_ the POSIX Extended RE set, except:
> * "." in POSIX matches any char; in OOo it
>   "Represents any single character except for a
>   line break or paragraph break."
> * "\>", "\<" in OOo Matches end/beginning of word.  Not in the spec,
>   this is an extension.
> * "\xXXXX" in OOo "Represents a special character based on its
>   four-digit hexadecimal code (XXXX)."  Not in the spec.
> We could just document the extensions.

I would rather call these "temporary flaws". They were
invented without any standard in mind (well, \< \> were probably
derived from sed's syntax), just to be compatible with an ancient
regex engine used by legacy versions of StarDivision's
StarOffice. I would refrain from nailing these down in an ODF standard.

> The different meaning
> of "." is more bothersome; if that's really important, maybe
> it shouldn't be called POSIX, but something else.

This is mainly to be seen in the context of the Writer word processor
application, where a paragraph is not actually delimited by a newline or
any other character, so a '.' cannot match a paragraph boundary.
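
The deviations discussed above can be shown with any engine that offers
both behaviours. The sketch below uses Python's standard re module only
as an illustration; it is not the OOo engine, and the sample text is
made up:

```python
import re

text = "first paragraph\nsecond paragraph"

# Default mode: '.' does not match the line break, similar to the OOo
# behaviour quoted above.
default_dot = re.search(r"paragraph.second", text)

# DOTALL mode: '.' matches any character, including the line break.
dotall_dot = re.search(r"paragraph.second", text, re.DOTALL)

# Python has no \< and \> operators; \b (word boundary) is the closest
# equivalent for anchoring at the start of a word.
words = re.findall(r"\bpara\w*", text)

# A four-digit hexadecimal code is written \uXXXX in Python, not \xXXXX.
hex_escape = re.search(r"caf\u00e9", "caf\u00e9")

print(default_dot is None, dotall_dot is not None)
print(words)
```

None of these notations is portable between engines, which is another
reason not to freeze them in a standard.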

> What should we do about this detail? Are there other
> differences I haven't noticed?

Not to my knowledge.

However, future versions of OOo will most likely switch to
the ICU regular expressions. ICU regex Unicode properties follow those
defined in Unicode Regular Expressions, so if we want to include
a reference, we should probably point to UTS #18,
http://www.unicode.org/unicode/reports/tr18/
Note that the latter does not define a concrete syntax and uses Perl
notation for its examples. The ICU syntax is also based on Perl, as is
the syntax of the Java package java.util.regex; both could be valid
pointers as well.

> I'd love to be able to reference another standard directly.

AFAIK there is no standard that includes Unicode _and_ defines a proper
syntax.

  Eike

-- 
Automatic string conversions considered dangerous. They are the GOTO statements
of spreadsheets.  --Robert Weir on the OpenDocument formula subcommittee's list.