office-formula message

Subject: Regular expression/wildcard language calculation setting?
From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Sat, 05 Aug 2006 13:10:15 -0400

OpenOffice.org uses a POSIX-based regular expression language
for pattern-matching.  Excel uses its own nonstandard language that
is not as capable, but there ARE documents that depend on its language.
I'd LIKE our results to be able to handle documents no matter
where they originally came from.

What do you think about adding a calculation setting
that lets you set the pattern-matching language to use?
Currently the OpenDocument specification includes
this calculation setting:
<define name="table-calculation-setting-attlist" combine="interleave">
  <optional>
    <attribute name="table:use-regular-expressions"
               a:defaultValue="true">
      <ref name="boolean"/>
    </attribute>
  </optional>
</define>


Perhaps we could add something like this:
  <define name="table-calculation-setting-attlist" combine="interleave">
    <optional>
      <attribute name="table:use-regular-expressions"
                 a:defaultValue="POSIX">
        <ref name="string"/>
      </attribute>
    </optional>
  </define>


Possible defined values (for now) would be:
* "POSIX".  I intend for this to match OOo's notation,
   which I believe is identical to the "Extended regular
   expressions" of POSIX 1003.2, section 2.8 (Regular
   Expression Notation), which I believe is in turn the
   same as ISO/IEC 9945-2:1993.
   I don't have the ISO document, but here's the
   Single Unix Spec (version 2) at the Open Group:
      http://opengroup.org/onlinepubs/007908799/xbd/re.html
   If someone could confirm that they're all the same, that'd
   be great; a standard should reference other standards
   where reasonable.  If the term "POSIX" should be replaced by
   something else, like "ISO9945", let me know that too.
* "Excel".  This has only the following, which is a mishmash
   of globbing notation and RE notation:
   ? - Matches any single character ("." in standard notation)
   * - Matches zero or more characters (".*" in standard notation)
   # - Matches any single digit ("[[:digit:]]" in standard notation)
   [...] - Matches any character in the list.
       (Amazing: standard notation!)
   [!...] - Matches any character NOT in the list
         ("[^...]" in standard notation).
   ~ - Escapes the following character ("\" in standard notation).

   Yes, it really is this limited.
   My sources: Walkenbach's "Excel 2003 Formulas" page 680, except it
   doesn't mention ~.  Simon's "Excel 2000 in a Nutshell" page 515
   mentions ~, ?, and *, though not the others.
   It'd be worth checking to make sure this is correct for Excel;
   many docs list only a subset of its nonstandard RE syntax.
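
To make the mapping above concrete, here is a minimal sketch that
translates the Excel-style notation into a conventional regular
expression.  It uses Python's "re" module as a stand-in for a POSIX
engine.  The names excel_pattern_to_regex and excel_match are
hypothetical, and several behaviors are assumptions not covered by the
list above: the pattern must match the whole text, matching is
case-sensitive, and an unterminated or empty "[" is treated as a
literal.

```python
import re


def excel_pattern_to_regex(pattern: str) -> str:
    """Translate the Excel-style notation into a Python regex string."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "~" and i + 1 < n:
            # "~" escapes the following character.
            out.append(re.escape(pattern[i + 1]))
            i += 2
        elif ch == "?":
            out.append(".")        # any single character
            i += 1
        elif ch == "*":
            out.append(".*")       # zero or more characters
            i += 1
        elif ch == "#":
            out.append("[0-9]")    # any single digit
            i += 1
        elif ch == "[":
            end = pattern.find("]", i + 1)
            body = pattern[i + 1:end] if end != -1 else ""
            negate = body.startswith("!")
            if negate:
                body = body[1:]
            if not body:
                # Assumption: an unterminated or empty list is a literal "[".
                out.append(re.escape(ch))
                i += 1
            else:
                # Escape characters that are special inside a regex class,
                # but keep "-" so that ranges such as a-e still work.
                body = re.sub(r"([\\^\[])", r"\\\1", body)
                out.append("[" + ("^" if negate else "") + body + "]")
                i = end + 1
        else:
            out.append(re.escape(ch))  # everything else is literal
            i += 1
    return "".join(out)


def excel_match(pattern: str, text: str) -> bool:
    """Assumption: the pattern has to match the entire text."""
    regex = excel_pattern_to_regex(pattern)
    return re.fullmatch(regex, text, re.DOTALL) is not None
```

With this sketch, excel_match("v#", "v7") is true while
excel_match("~*", "a") is false, because "~*" stands for a literal "*".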

I propose making this a string, not a boolean.
That lets us handle different versions of a language, languages
used by other applications, and so on.
For example, I expect Perl's RE features to spread
to other RE implementations in the future.
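
As a rough illustration of why a string is more flexible than a
boolean, the sketch below looks up a matcher by the setting's value and
lets an application register further languages later.  Everything here
is hypothetical: the names MATCHERS, register and select_matcher, the
"literal" value used in the usage example, and the use of Python's "re"
module as a stand-in for a POSIX engine are assumptions, not part of
any specification.

```python
import re
from typing import Callable, Dict

# A matcher takes (pattern, text) and reports whether the text matches.
Matcher = Callable[[str, str], bool]


def _posix_like(pattern: str, text: str) -> bool:
    # Assumption: Python's re stands in for a POSIX ERE engine here.
    return re.search(pattern, text) is not None


# Known values of the (string-valued) calculation setting.
MATCHERS: Dict[str, Matcher] = {"POSIX": _posix_like}


def register(name: str, matcher: Matcher) -> None:
    """Add support for another pattern language or language version."""
    MATCHERS[name] = matcher


def select_matcher(setting: str = "POSIX") -> Matcher:
    """Return the matcher for a setting value, defaulting to "POSIX"."""
    try:
        return MATCHERS[setting]
    except KeyError:
        raise ValueError(f"unsupported pattern language: {setting!r}")
```

An application that does not know a value can then report it instead
of silently misinterpreting patterns:

```python
register("literal", lambda pattern, text: pattern == text)  # hypothetical value
select_matcher("literal")("a.c", "abc")   # False: "." is not special here
select_matcher("POSIX")("a.c", "abc")     # True
```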

We could make support for this calculation setting optional;
after all, many docs don't care.  But some docs do, so let's
see if we can handle them all.

--- David A. Wheeler

====================================================================

Here's OOo's language, quoting from their help page:

Character Result/Use
==========================
Any character
Represents any single character unless otherwise specified.

.
Represents any single character except for a line break or paragraph 
break. For example, the search term "sh.rt" returns both "shirt" and 
"short".

^
Only finds the search term if the term is at the beginning of a 
paragraph. Special objects such as empty fields or character-anchored 
frames, at the beginning of a paragraph are ignored. Example: "^Peter".

$
Only finds the search term if the term appears at the end of a 
paragraph. Special objects such as empty fields or character-anchored 
frames at the end of a paragraph are ignored. Example: "Peter$".

*
Finds zero or more of the characters in front of the "*". For example, 
"Ab*c" finds "Ac", "Abc", "Abbc", "Abbbc", and so on.

+
Finds one or more of the characters in front of the "+". For example, 
"AX.+4" finds "AXx4", but not "AX4".
The longest possible string that matches this search pattern in a 
paragraph is always found. If the paragraph contains the string "AX 4 
AX4", the entire passage is highlighted.

?
Finds zero or one of the characters in front of the "?". For example, 
"Texts?" finds "Text" and "Texts" and "x(ab|c)?y" finds "xy", "xaby", or 
"xcy".

\
Search interprets the special character that follows the "\" as a normal 
character and not as a regular expression (except for the combinations 
\n, \t, \>, and \<). For example, "tree\." finds "tree.", not "treed" or 
"trees".

\n
Represents a line break that was inserted with the Shift+Enter key 
combination. To change a line break into a paragraph break, enter \n in 
the Search for and Replace with boxes, and then perform a search and 
replace.

\t
Represents a tab. You can also use this expression in the Replace with box.

\>
Only finds the search term if it appears at the end of a word. For 
example, "book\>" finds "checkbook", but not "bookmark".

\<
Only finds the search term if it appears at the beginning of a word. For 
example, "\<book" finds "bookmark", but not "checkbook".

^$
Finds an empty paragraph.

^.
Finds the first character of a paragraph.

&
Adds the string that was found by the search criteria in the Search for 
box to the term in the Replace with box when you make a replacement.
For example, if you enter "window" in the Search for box and "&frame" in 
the Replace with box, the word "window" is replaced with "windowframe".
You can also enter an "&" in the Replace with box to modify the 
Attributes or the Format of the string found by the search criteria.

[abc123]
Represents one of the characters that are between the brackets.

[a-e]
Represents any of the characters that are between a and e.

[a-eh-x]
Represents any of the characters that are between a-e and h-x.

[^a-s]
Represents any character that is not between a and s.

\xXXXX
Represents a special character based on its four-digit hexadecimal code 
(XXXX).
The code for the special character depends on the font used. You can 
view the codes by choosing Insert - Special Character.

|
Finds the terms that occur before or after the "|". For example, 
"this|that" finds "this" and "that".

{2}
Defines the number of times that the character in front of the opening 
bracket occurs. For example, "tre{2}" finds "tree".

{1,2}
Defines the number of times that the character in front of the opening 
bracket can occur. For example, "tre{1,2}" finds both "tree" and "treated".

{1,}
Defines the minimum number of times that the character in front of the 
opening bracket can occur. For example, "tre{2,}" finds "tree", "treee", 
and "treeeee".

( )
Defines the characters inside the parentheses as a reference. You can 
then refer to the first reference in the current expression with "\1", 
to the second reference with "\2", and so on.
For example, if your text contains the number 13487889 and you search 
using the regular expression (8)7\1\1, "8788" is found.
You can also use () to group terms, for example, "a(bc)?d" finds "ad" or 
"abcd".

[:digit:]
Represents a decimal digit.

[:space:]
Represents a white space character such as space.

[:print:]
Represents a printable character.

[:cntrl:]
Represents a nonprinting character.

[:alnum:]
Represents an alphanumeric character ([:alpha:] and [:digit:]).

[:alpha:]
Represents an alphabetic character.

[:lower:]
Represents a lowercase character if Match case is selected in Options.

[:upper:]
Represents an uppercase character if Match case is selected in Options.
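
Several of the examples quoted above can be checked mechanically.  The
sketch below runs them through Python's "re" module, which is only a
stand-in and not OOo's engine; it covers just the constructs the two
share.  Python's "re" does not support "\<", "\>" or the bare
"[:digit:]" style classes, so those entries are left out.  The names
EXAMPLES and first_match are hypothetical.

```python
import re

# (pattern, text, expected first match or None), taken from the help text.
EXAMPLES = [
    ("sh.rt", "shirt", "shirt"),
    ("sh.rt", "short", "short"),
    ("Ab*c", "Ac", "Ac"),
    ("Ab*c", "Abbbc", "Abbbc"),
    ("AX.+4", "AXx4", "AXx4"),
    ("AX.+4", "AX4", None),
    ("AX.+4", "AX 4 AX4", "AX 4 AX4"),   # the longest possible match
    ("Texts?", "Text", "Text"),
    ("x(ab|c)?y", "xcy", "xcy"),
    (r"tree\.", "treed", None),
    ("tre{2}", "tree", "tree"),
    ("tre{1,2}", "treated", "tre"),
    (r"(8)7\1\1", "13487889", "8788"),   # back-reference to the first group
    ("a(bc)?d", "ad", "ad"),
]


def first_match(pattern, text):
    """Return the first matching substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


for pattern, text, expected in EXAMPLES:
    assert first_match(pattern, text) == expected, (pattern, text)
```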