office-formula message

Subject: Calculation setting for regular expression language?
From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Sun, 06 Aug 2006 09:07:15 -0400 (EDT)
(I sent this message earlier to the office-formula mailing list, but it
doesn't seem to have gone out.  I suggest that you keep a copy of all
messages you send to the list, in case you need to resend.)

I suggest that we create a new calculation setting
to control which regular expression language to use -
at least for the database criteria, and probably for SEARCH
as well. That will allow documents from various locations
to move into OpenFormula, AND gives more flexibility.
Anyone care to comment, discuss?

An alternative would be to have different function names,
depending on which RE language you used. For SEARCH, that's
probably the right thing to do - it's simple to do, and
very flexible. But all the database functions
use a pattern language, and I'm skeptical that we
really want a number of different database functions, which
only differ based on the RE language they use. Particularly
since over time there may be several in use! And is it really
likely that someone would use multiple RE pattern languages
in the database queries, in the same spreadsheet?
I think they'd find that very confusing.

So what would this setting look like? Here's one approach.

Basically, this would add an optional setting to the OpenDocument
rule "table-calculation-setting-attlist", so that there would be
an optional "table:regex-language" attribute. Here's the RELAX NG:
<define name="table-calculation-setting-attlist" combine="interleave">
 <optional>
  <attribute name="table:regex-language" a:defaultValue="POSIX">
   <ref name="string"/>
  </attribute>
 </optional>
</define>

I think it needs to be a string; there are many RE languages
out there in the world, and this lets us add new ones when necessary.
At the very least, we need to handle Excel's and OOo's.

Excel's language is pretty limited:
? - Matches any one character (like ".")
* - Matches one or more characters (like ".+")
~ - Escapes the next character (like "\")
Source: "Excel 2003 Formulas", Walkenbach, page 126;
Page 252 shows how this impacts the database language.
Confirmed with "Excel 2000 in a Nutshell" page 515.
Call this "Excel".

OOo's is much more capable; below is its language per its help file.
In fact, OOo's looks a whole lot like the POSIX standard's
RE language. If it is, we probably ought to call it "POSIX"
(as claimed above). But I'm not SURE it is; I'd love to
hear confirm/deny of it. Should we call it POSIX? OOo?

A quick comparison of OOo to the standard suggests that OOo
_is_ the POSIX Extended RE set, except:
* "." in POSIX matches any char; in OOo it
"Represents any single character except for a
line break or paragraph break."
* "\>", "\<" in OOo Matches end/beginning of word. Not in the spec,
this is an extension.
* "\xXXXX" in OOo "Represents a special character based on its
four-digit hexadecimal code (XXXX)." Not in the spec.
We could just document the extensions. The different meaning
of "." is more bothersome; if that's really important, maybe
it shouldn't be called POSIX, but something else.
What should we do about this detail? Are there other
differences I haven't noticed?

I'd love to be able to reference another standard directly.
Where possible, a standard should build on other standards.
I _love_ it that OpenDocument uses MathML, for example; that's
the way things SHOULD be. And it would mean that an implementor
could just reuse regexec(), for example, instead of building
their old system for matching.

More thoughts welcome. I lost this message and had to
recreate it (grumble), so please excuse me if I skipped over
a thought or three.

--- David A. Wheeler


==================================================================

I tried to compare the OOo language with the standards;
the Open Group, IEEE, and ISO have done alignments on
the POSIX standards.

Some useful information is here:
http://www.unix.org/version3/online.html
http://www.unix.org/version3/iso_std.html

Here's what the Open Group says about ISO/IEC 9945-2003, in its
forward to the Single Unix Specification:
The core of the Single UNIX Specification, Version 3 is also ISO/IEC
9945:2003. ISO/IEC 9945:2003 is the latest edition of ISO/IEC 9945
and incorporates the former ISO/IEC 9945-1:1996 (POSIX.1), and ISO/IEC
9945-2:1993 (POSIX.2) plus subsequent IEEE amendments, combined with
the core volumes of the Single UNIX Specification, Version 2. It is
technically identical to The Open Group, Base Specifications, Issue 6
and IEEE Std 1003.1, 2003 edition; they are one and the same documents,
the front cover having all the designations. It was published on August
18 2003.

I looked further at the Single Unix Spec, since it's
online (free registration). "The Open Group Base Specifications Issue 6
IEEE Std 1003.1, 2004 Edition", and looked at the regular
expression stuff. They define Basic REs and Extended Regular Expressions,
and it's the extended ones we care about (they support +, |, and so on,
just as OOo does).

A quick comparison of OOo to the standard suggests that OOo
_is_ the POSIX Extended RE set, except for the following:

* "." in OOo "Represents any single character except for a
line break or paragraph break. For example, the
search term "sh.rt" returns both "shirt" and "short"."

In the standard, "." matches any char except NUL.

Calc really does this too; this fails (Error):
=SEARCH("sh.rt";"osh" & CHAR(10) & "rt")
this succeeds:
=SEARCH("sh.rt";"osh" & CHAR(10) & "rt")

* "\>", "\<" in OOo Matches end/beginning of word. Not in the spec,
this is an extension.

* "\xXXXX" in OOo "Represents a special character based on its
four-digit hexadecimal code (XXXX)." Not in the spec.

Other than that, they seem to be identical.
There may be some subtle feature in the standard not in OOo,
but they SEEM to be the same otherwise, e.g., {lowcount,highcount},
and so on. I didn't as detailed a comparison backward from
the standard to OOo's documentation, and could certainly have missed
something.




==================================================================

Here's what OOo says its RE language is; I quote its help file:

Character Result/Use

Any character
Represents any single character unless otherwise specified.

.
Represents any single character except for a line break or paragraph break. For example, the search term "sh.rt" returns both "shirt" and "short".

^
Only finds the search term if the term is at the beginning of a paragraph. Special objects such as empty fields or character-anchored frames, at the beginning of a paragraph are ignored. Example: "^Peter".

$
Only finds the search term if the term appears at the end of a paragraph. Special objects such as empty fields or character-anchored frames at the end of a paragraph are ignored. Example: "Peter$".

*
Finds zero or more of the characters in front of the "*". For example, "Ab*c" finds "Ac", "Abc", "Abbc", "Abbbc", and so on.

+
Finds one or more of the characters in front of the "+". For example, "AX.+4" finds "AXx4", but not "AX4".
The longest possible string that matches this search pattern in a paragraph is always found. If the paragraph contains the string "AX 4 AX4", the entire passage is highlighted.

?
Finds zero or one of the characters in front of the "?". For example, "Texts?" finds "Text" and "Texts" and "x(ab|c)?y" finds "xy", "xaby", or "xcy".

\
Search interprets the special character that follows the "\" as a normal character and not as a regular expression (except for the combinations \n, \t, \>, and \<). For example, "tree\." finds "tree.", not "treed" or "trees".

\n
Represents a line break that was inserted with the Shift+Enter key combination. To change a line break into a paragraph break, enter \n in the Search for and Replace with boxes, and then perform a search and replace.

\t
Represents a tab. You can also use this expression in the Replace with box.

\>
Only finds the search term if it appears at the end of a word. For example, "book\>" finds "checkbook", but not "bookmark".

\<
Only finds the search term if it appears at the beginning of a word. For example, "\<book" finds "bookmark", but not "checkbook".

^$
Finds an empty paragraph.

^.
Finds the first character of a paragraph.

&
Adds the string that was found by the search criteria in the Search for box to the term in the Replace with box when you make a replacement.
For example, if you enter "window" in the Search for box and "&frame" in the Replace with box, the word "window" is replaced with "windowframe".
You can also enter an "&" in the Replace with box to modify the Attributes or the Format of the string found by the search criteria.

[abc123]
Represents one of the characters that are between the brackets.

[a-e]
Represents any of the characters that are between a and e.

[a-eh-x]
Represents any of the characters that are between a-e and h-x.

[^a-s]
Represents any character that is not between a and s.

\xXXXX
Represents a special character based on its four-digit hexadecimal code (XXXX).
The code for the special character depends on the font used. You can view the codes by choosing Insert - Special Character.

|
Finds the terms that occur before or after the "|". For example, "this|that" finds "this" and "that".

{2}
Defines the number of times that the character in front of the opening bracket occurs. For example, "tre{2}" finds "tree".

{1,2}
Defines the number of times that the character in front of the opening bracket can occur. For example, "tre{1,2}" finds both "tree" and "treated".

{1,}
Defines the minimum number of times that the character in front of the opening bracket can occur. For example, "tre{2,}" finds "tree", "treee", and "treeeee".

( )
Defines the characters inside the parentheses as a reference. You can then refer to the first reference in the current expression with "\1", to the second reference with "\2", and so on.
For example, if your text contains the number 13487889 and you search using the regular expression (8)7\1\1, "8788" is found.
You can also use () to group terms, for example, "a(bc)?d" finds "ad" or "abcd".

[:digit:]
Represents a decimal digit.

[:space:]
Represents a white space character such as space.

[:print:]
Represents a printable character.

[:cntrl:]
Represents a nonprinting character.

[:alnum:]
Represents an alphanumeric character ([:alpha:] and [:digit:]).

[:alpha:]
Represents an alphabetic character.

[:lower:]
Represents a lowercase character if Match case is selected in Options.

[:upper:]
Represents an uppercase character if Match case is selected in Options.




--- David A. Wheeler
Follow-Ups:
- Re: Calculation setting for regular expression language?
  - From: "Tomas Mecir" <mecirt@gmail.com>