office-formula message

Subject: Re: Calculation setting for regular expression language?
From: "Tomas Mecir" <mecirt@gmail.com>
To: dwheeler@dwheeler.com
Date: Sun, 6 Aug 2006 15:39:05 +0200
Sounds like the usual regular expression syntax, with the exception of
the things mentioned, and the & character to match text from Replace
dialog.
I guess that one does not apply for the formula syntax, though ?

The problem with regular expressions is that there are, in fact, three
of them - the "basic" syntax, as used by GNU's grep, the "extended"
one, as used by GNU's egrep (or grep -e) - that's what you were
describing here, and then, there are Perl regular expressions, which
add even more convenience features (\d is a digit for example (same as
[0-9])).

In addition, the replace-like functions may want to use
back-references, which get commonly denoted as \0, \1, \2, ...,
although Perl, for example, uses $0, $1, $2, basically treating them
as variables.

Well, it's a complicated matter, although it mostly goes beyond the
scope of spreadsheets - it gets used in all sorts of different
situations.

As far as KSpread goes, to my understanding, we just use the simple
Excel-like language in database functions. Not 100% sure though. Here
the problem also is that the simple patterns, as used in Excel, for
example, are not compatible with regular expressions - if you give
such a pattern to the regular expression engine, the result will be
very much different from what you want.

/ Tomas

On 8/6/06, David A. Wheeler <dwheeler@dwheeler.com> wrote:
> (I sent this message earlier to the office-formula mailing list, but it
> doesn't seem to have gone out.  I suggest that you keep a copy of all
> messages you send to the list, in case you need to resend.)
>
> I suggest that we create a new calculation setting
> to control which regular expression language to use -
> at least for the database criteria, and probably for SEARCH
> as well. That will allow documents from various locations
> to move into OpenFormula, AND gives more flexibility.
> Anyone care to comment, discuss?
>
> An alternative would be to have different function names,
> depending on which RE language you used. For SEARCH, that's
> probably the right thing to do - it's simple to do, and
> very flexible. But all the database functions
> use a pattern language, and I'm skeptical that we
> really want a number of different database functions, which
> only differ based on the RE language they use. Particularly
> since over time there may be several in use! And is it really
> likely that someone would use multiple RE pattern languages
> in the database queries, in the same spreadsheet?
> I think they'd find that very confusing.
>
> So what would this setting look like? Here's one approach.
>
> Basically, this would add an optional setting to the OpenDocument
> rule "table-calculation-setting-attlist", so that there would be
> an optional "table:regex-language" attribute. Here's the RELAX NG:
> <define name="table-calculation-setting-attlist" combine="interleave">
>  <optional>
>   <attribute name="table:regex-language" a:defaultValue="POSIX">
>    <ref name="string"/>
>   </attribute>
>  </optional>
> </define>
>
> I think it needs to be a string; there are many RE languages
> out there in the world, and this lets us add new ones when necessary.
> At the very least, we need to handle Excel's and OOo's.
>
> Excel's language is pretty limited:
> ? - Matches any one character (like ".")
> * - Matches one or more characters (like ".+")
> ~ - Escapes the next character (like "\")
> Source: "Excel 2003 Formulas", Walkenbach, page 126;
> Page 252 shows how this impacts the database language.
> Confirmed with "Excel 2000 in a Nutshell" page 515.
> Call this "Excel".
>
> OOo's is much more capable; below is its language per its help file.
> In fact, OOo's looks a whole lot like the POSIX standard's
> RE language. If it is, we probably ought to call it "POSIX"
> (as claimed above). But I'm not SURE it is; I'd love to
> hear confirm/deny of it. Should we call it POSIX? OOo?
>
> A quick comparison of OOo to the standard suggests that OOo
> _is_ the POSIX Extended RE set, except:
> * "." in POSIX matches any char; in OOo it
> "Represents any single character except for a
> line break or paragraph break."
> * "\>", "\<" in OOo Matches end/beginning of word. Not in the spec,
> this is an extension.
> * "\xXXXX" in OOo "Represents a special character based on its
> four-digit hexadecimal code (XXXX)." Not in the spec.
> We could just document the extensions. The different meaning
> of "." is more bothersome; if that's really important, maybe
> it shouldn't be called POSIX, but something else.
> What should we do about this detail? Are there other
> differences I haven't noticed?
>
> I'd love to be able to reference another standard directly.
> Where possible, a standard should build on other standards.
> I _love_ it that OpenDocument uses MathML, for example; that's
> the way things SHOULD be. And it would mean that an implementor
> could just reuse regexec(), for example, instead of building
> their old system for matching.
>
> More thoughts welcome. I lost this message and had to
> recreate it (grumble), so please excuse me if I skipped over
> a thought or three.
>
> --- David A. Wheeler
>
>
> ==================================================================
>
> I tried to compare the OOo language with the standards;
> the Open Group, IEEE, and ISO have done alignments on
> the POSIX standards.
>
> Some useful information is here:
> http://www.unix.org/version3/online.html
> http://www.unix.org/version3/iso_std.html
>
> Here's what the Open Group says about ISO/IEC 9945-2003, in its
> forward to the Single Unix Specification:
> The core of the Single UNIX Specification, Version 3 is also ISO/IEC
> 9945:2003. ISO/IEC 9945:2003 is the latest edition of ISO/IEC 9945
> and incorporates the former ISO/IEC 9945-1:1996 (POSIX.1), and ISO/IEC
> 9945-2:1993 (POSIX.2) plus subsequent IEEE amendments, combined with
> the core volumes of the Single UNIX Specification, Version 2. It is
> technically identical to The Open Group, Base Specifications, Issue 6
> and IEEE Std 1003.1, 2003 edition; they are one and the same documents,
> the front cover having all the designations. It was published on August
> 18 2003.
>
> I looked further at the Single Unix Spec, since it's
> online (free registration). "The Open Group Base Specifications Issue 6
> IEEE Std 1003.1, 2004 Edition", and looked at the regular
> expression stuff. They define Basic REs and Extended Regular Expressions,
> and it's the extended ones we care about (they support +, |, and so on,
> just as OOo does).
>
> A quick comparison of OOo to the standard suggests that OOo
> _is_ the POSIX Extended RE set, except for the following:
>
> * "." in OOo "Represents any single character except for a
> line break or paragraph break. For example, the
> search term "sh.rt" returns both "shirt" and "short"."
>
> In the standard, "." matches any char except NUL.
>
> Calc really does this too; this fails (Error):
> =SEARCH("sh.rt";"osh" & CHAR(10) & "rt")
> this succeeds:
> =SEARCH("sh.rt";"osh" & CHAR(10) & "rt")
>
> * "\>", "\<" in OOo Matches end/beginning of word. Not in the spec,
> this is an extension.
>
> * "\xXXXX" in OOo "Represents a special character based on its
> four-digit hexadecimal code (XXXX)." Not in the spec.
>
> Other than that, they seem to be identical.
> There may be some subtle feature in the standard not in OOo,
> but they SEEM to be the same otherwise, e.g., {lowcount,highcount},
> and so on. I didn't as detailed a comparison backward from
> the standard to OOo's documentation, and could certainly have missed
> something.
>
>
>
>
> ==================================================================
>
> Here's what OOo says its RE language is; I quote its help file:
>
> Character Result/Use
>
> Any character
> Represents any single character unless otherwise specified.
>
> .
> Represents any single character except for a line break or paragraph break. For example, the search term "sh.rt" returns both "shirt" and "short".
>
> ^
> Only finds the search term if the term is at the beginning of a paragraph. Special objects such as empty fields or character-anchored frames, at the beginning of a paragraph are ignored. Example: "^Peter".
>
> $
> Only finds the search term if the term appears at the end of a paragraph. Special objects such as empty fields or character-anchored frames at the end of a paragraph are ignored. Example: "Peter$".
>
> *
> Finds zero or more of the characters in front of the "*". For example, "Ab*c" finds "Ac", "Abc", "Abbc", "Abbbc", and so on.
>
> +
> Finds one or more of the characters in front of the "+". For example, "AX.+4" finds "AXx4", but not "AX4".
> The longest possible string that matches this search pattern in a paragraph is always found. If the paragraph contains the string "AX 4 AX4", the entire passage is highlighted.
>
> ?
> Finds zero or one of the characters in front of the "?". For example, "Texts?" finds "Text" and "Texts" and "x(ab|c)?y" finds "xy", "xaby", or "xcy".
>
> \
> Search interprets the special character that follows the "\" as a normal character and not as a regular expression (except for the combinations \n, \t, \>, and \<). For example, "tree\." finds "tree.", not "treed" or "trees".
>
> \n
> Represents a line break that was inserted with the Shift+Enter key combination. To change a line break into a paragraph break, enter \n in the Search for and Replace with boxes, and then perform a search and replace.
>
> \t
> Represents a tab. You can also use this expression in the Replace with box.
>
> \>
> Only finds the search term if it appears at the end of a word. For example, "book\>" finds "checkbook", but not "bookmark".
>
> \<
> Only finds the search term if it appears at the beginning of a word. For example, "\<book" finds "bookmark", but not "checkbook".
>
> ^$
> Finds an empty paragraph.
>
> ^.
> Finds the first character of a paragraph.
>
> &
> Adds the string that was found by the search criteria in the Search for box to the term in the Replace with box when you make a replacement.
> For example, if you enter "window" in the Search for box and "&frame" in the Replace with box, the word "window" is replaced with "windowframe".
> You can also enter an "&" in the Replace with box to modify the Attributes or the Format of the string found by the search criteria.
>
> [abc123]
> Represents one of the characters that are between the brackets.
>
> [a-e]
> Represents any of the characters that are between a and e.
>
> [a-eh-x]
> Represents any of the characters that are between a-e and h-x.
>
> [^a-s]
> Represents any character that is not between a and s.
>
> \xXXXX
> Represents a special character based on its four-digit hexadecimal code (XXXX).
> The code for the special character depends on the font used. You can view the codes by choosing Insert - Special Character.
>
> |
> Finds the terms that occur before or after the "|". For example, "this|that" finds "this" and "that".
>
> {2}
> Defines the number of times that the character in front of the opening bracket occurs. For example, "tre{2}" finds "tree".
>
> {1,2}
> Defines the number of times that the character in front of the opening bracket can occur. For example, "tre{1,2}" finds both "tree" and "treated".
>
> {1,}
> Defines the minimum number of times that the character in front of the opening bracket can occur. For example, "tre{2,}" finds "tree", "treee", and "treeeee".
>
> ( )
> Defines the characters inside the parentheses as a reference. You can then refer to the first reference in the current expression with "\1", to the second reference with "\2", and so on.
> For example, if your text contains the number 13487889 and you search using the regular expression (8)7\1\1, "8788" is found.
> You can also use () to group terms, for example, "a(bc)?d" finds "ad" or "abcd".
>
> [:digit:]
> Represents a decimal digit.
>
> [:space:]
> Represents a white space character such as space.
>
> [:print:]
> Represents a printable character.
>
> [:cntrl:]
> Represents a nonprinting character.
>
> [:alnum:]
> Represents an alphanumeric character ([:alpha:] and [:digit:]).
>
> [:alpha:]
> Represents an alphabetic character.
>
> [:lower:]
> Represents a lowercase character if Match case is selected in Options.
>
> [:upper:]
> Represents an uppercase character if Match case is selected in Options.
>
>
>
>
> --- David A. Wheeler
>
>
References:
- Calculation setting for regular expression language?
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>