[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Re: Calculation setting for regular expression language?
Sounds like the usual regular expression syntax, with the exception of the things mentioned, and the & character to match text from Replace dialog. I guess that one does not apply for the formula syntax, though ? The problem with regular expressions is that there are, in fact, three of them - the "basic" syntax, as used by GNU's grep, the "extended" one, as used by GNU's egrep (or grep -e) - that's what you were describing here, and then, there are Perl regular expressions, which add even more convenience features (\d is a digit for example (same as [0-9])). In addition, the replace-like functions may want to use back-references, which get commonly denoted as \0, \1, \2, ..., although Perl, for example, uses $0, $1, $2, basically treating them as variables. Well, it's a complicated matter, although it mostly goes beyond the scope of spreadsheets - it gets used in all sorts of different situations. As far as KSpread goes, to my understanding, we just use the simple Excel-like language in database functions. Not 100% sure though. Here the problem also is that the simple patterns, as used in Excel, for example, are not compatible with regular expressions - if you give such a pattern to the regular expression engine, the result will be very much different from what you want. / Tomas On 8/6/06, David A. Wheeler <dwheeler@dwheeler.com> wrote: > (I sent this message earlier to the office-formula mailing list, but it > doesn't seem to have gone out. I suggest that you keep a copy of all > messages you send to the list, in case you need to resend.) > > I suggest that we create a new calculation setting > to control which regular expression language to use - > at least for the database criteria, and probably for SEARCH > as well. That will allow documents from various locations > to move into OpenFormula, AND gives more flexibility. > Anyone care to comment, discuss? > > An alternative would be to have different function names, > depending on which RE language you used. For SEARCH, that's > probably the right thing to do - it's simple to do, and > very flexible. But all the database functions > use a pattern language, and I'm skeptical that we > really want a number of different database functions, which > only differ based on the RE language they use. Particularly > since over time there may be several in use! And is it really > likely that someone would use multiple RE pattern languages > in the database queries, in the same spreadsheet? > I think they'd find that very confusing. > > So what would this setting look like? Here's one approach. > > Basically, this would add an optional setting to the OpenDocument > rule "table-calculation-setting-attlist", so that there would be > an optional "table:regex-language" attribute. Here's the RELAX NG: > <define name="table-calculation-setting-attlist" combine="interleave"> > <optional> > <attribute name="table:regex-language" a:defaultValue="POSIX"> > <ref name="string"/> > </attribute> > </optional> > </define> > > I think it needs to be a string; there are many RE languages > out there in the world, and this lets us add new ones when necessary. > At the very least, we need to handle Excel's and OOo's. > > Excel's language is pretty limited: > ? - Matches any one character (like ".") > * - Matches one or more characters (like ".+") > ~ - Escapes the next character (like "\") > Source: "Excel 2003 Formulas", Walkenbach, page 126; > Page 252 shows how this impacts the database language. > Confirmed with "Excel 2000 in a Nutshell" page 515. > Call this "Excel". > > OOo's is much more capable; below is its language per its help file. > In fact, OOo's looks a whole lot like the POSIX standard's > RE language. If it is, we probably ought to call it "POSIX" > (as claimed above). But I'm not SURE it is; I'd love to > hear confirm/deny of it. Should we call it POSIX? OOo? > > A quick comparison of OOo to the standard suggests that OOo > _is_ the POSIX Extended RE set, except: > * "." in POSIX matches any char; in OOo it > "Represents any single character except for a > line break or paragraph break." > * "\>", "\<" in OOo Matches end/beginning of word. Not in the spec, > this is an extension. > * "\xXXXX" in OOo "Represents a special character based on its > four-digit hexadecimal code (XXXX)." Not in the spec. > We could just document the extensions. The different meaning > of "." is more bothersome; if that's really important, maybe > it shouldn't be called POSIX, but something else. > What should we do about this detail? Are there other > differences I haven't noticed? > > I'd love to be able to reference another standard directly. > Where possible, a standard should build on other standards. > I _love_ it that OpenDocument uses MathML, for example; that's > the way things SHOULD be. And it would mean that an implementor > could just reuse regexec(), for example, instead of building > their old system for matching. > > More thoughts welcome. I lost this message and had to > recreate it (grumble), so please excuse me if I skipped over > a thought or three. > > --- David A. Wheeler > > > ================================================================== > > I tried to compare the OOo language with the standards; > the Open Group, IEEE, and ISO have done alignments on > the POSIX standards. > > Some useful information is here: > http://www.unix.org/version3/online.html > http://www.unix.org/version3/iso_std.html > > Here's what the Open Group says about ISO/IEC 9945-2003, in its > forward to the Single Unix Specification: > The core of the Single UNIX Specification, Version 3 is also ISO/IEC > 9945:2003. ISO/IEC 9945:2003 is the latest edition of ISO/IEC 9945 > and incorporates the former ISO/IEC 9945-1:1996 (POSIX.1), and ISO/IEC > 9945-2:1993 (POSIX.2) plus subsequent IEEE amendments, combined with > the core volumes of the Single UNIX Specification, Version 2. It is > technically identical to The Open Group, Base Specifications, Issue 6 > and IEEE Std 1003.1, 2003 edition; they are one and the same documents, > the front cover having all the designations. It was published on August > 18 2003. > > I looked further at the Single Unix Spec, since it's > online (free registration). "The Open Group Base Specifications Issue 6 > IEEE Std 1003.1, 2004 Edition", and looked at the regular > expression stuff. They define Basic REs and Extended Regular Expressions, > and it's the extended ones we care about (they support +, |, and so on, > just as OOo does). > > A quick comparison of OOo to the standard suggests that OOo > _is_ the POSIX Extended RE set, except for the following: > > * "." in OOo "Represents any single character except for a > line break or paragraph break. For example, the > search term "sh.rt" returns both "shirt" and "short"." > > In the standard, "." matches any char except NUL. > > Calc really does this too; this fails (Error): > =SEARCH("sh.rt";"osh" & CHAR(10) & "rt") > this succeeds: > =SEARCH("sh.rt";"osh" & CHAR(10) & "rt") > > * "\>", "\<" in OOo Matches end/beginning of word. Not in the spec, > this is an extension. > > * "\xXXXX" in OOo "Represents a special character based on its > four-digit hexadecimal code (XXXX)." Not in the spec. > > Other than that, they seem to be identical. > There may be some subtle feature in the standard not in OOo, > but they SEEM to be the same otherwise, e.g., {lowcount,highcount}, > and so on. I didn't as detailed a comparison backward from > the standard to OOo's documentation, and could certainly have missed > something. > > > > > ================================================================== > > Here's what OOo says its RE language is; I quote its help file: > > Character Result/Use > > Any character > Represents any single character unless otherwise specified. > > . > Represents any single character except for a line break or paragraph break. For example, the search term "sh.rt" returns both "shirt" and "short". > > ^ > Only finds the search term if the term is at the beginning of a paragraph. Special objects such as empty fields or character-anchored frames, at the beginning of a paragraph are ignored. Example: "^Peter". > > $ > Only finds the search term if the term appears at the end of a paragraph. Special objects such as empty fields or character-anchored frames at the end of a paragraph are ignored. Example: "Peter$". > > * > Finds zero or more of the characters in front of the "*". For example, "Ab*c" finds "Ac", "Abc", "Abbc", "Abbbc", and so on. > > + > Finds one or more of the characters in front of the "+". For example, "AX.+4" finds "AXx4", but not "AX4". > The longest possible string that matches this search pattern in a paragraph is always found. If the paragraph contains the string "AX 4 AX4", the entire passage is highlighted. > > ? > Finds zero or one of the characters in front of the "?". For example, "Texts?" finds "Text" and "Texts" and "x(ab|c)?y" finds "xy", "xaby", or "xcy". > > \ > Search interprets the special character that follows the "\" as a normal character and not as a regular expression (except for the combinations \n, \t, \>, and \<). For example, "tree\." finds "tree.", not "treed" or "trees". > > \n > Represents a line break that was inserted with the Shift+Enter key combination. To change a line break into a paragraph break, enter \n in the Search for and Replace with boxes, and then perform a search and replace. > > \t > Represents a tab. You can also use this expression in the Replace with box. > > \> > Only finds the search term if it appears at the end of a word. For example, "book\>" finds "checkbook", but not "bookmark". > > \< > Only finds the search term if it appears at the beginning of a word. For example, "\<book" finds "bookmark", but not "checkbook". > > ^$ > Finds an empty paragraph. > > ^. > Finds the first character of a paragraph. > > & > Adds the string that was found by the search criteria in the Search for box to the term in the Replace with box when you make a replacement. > For example, if you enter "window" in the Search for box and "&frame" in the Replace with box, the word "window" is replaced with "windowframe". > You can also enter an "&" in the Replace with box to modify the Attributes or the Format of the string found by the search criteria. > > [abc123] > Represents one of the characters that are between the brackets. > > [a-e] > Represents any of the characters that are between a and e. > > [a-eh-x] > Represents any of the characters that are between a-e and h-x. > > [^a-s] > Represents any character that is not between a and s. > > \xXXXX > Represents a special character based on its four-digit hexadecimal code (XXXX). > The code for the special character depends on the font used. You can view the codes by choosing Insert - Special Character. > > | > Finds the terms that occur before or after the "|". For example, "this|that" finds "this" and "that". > > {2} > Defines the number of times that the character in front of the opening bracket occurs. For example, "tre{2}" finds "tree". > > {1,2} > Defines the number of times that the character in front of the opening bracket can occur. For example, "tre{1,2}" finds both "tree" and "treated". > > {1,} > Defines the minimum number of times that the character in front of the opening bracket can occur. For example, "tre{2,}" finds "tree", "treee", and "treeeee". > > ( ) > Defines the characters inside the parentheses as a reference. You can then refer to the first reference in the current expression with "\1", to the second reference with "\2", and so on. > For example, if your text contains the number 13487889 and you search using the regular expression (8)7\1\1, "8788" is found. > You can also use () to group terms, for example, "a(bc)?d" finds "ad" or "abcd". > > [:digit:] > Represents a decimal digit. > > [:space:] > Represents a white space character such as space. > > [:print:] > Represents a printable character. > > [:cntrl:] > Represents a nonprinting character. > > [:alnum:] > Represents an alphanumeric character ([:alpha:] and [:digit:]). > > [:alpha:] > Represents an alphabetic character. > > [:lower:] > Represents a lowercase character if Match case is selected in Options. > > [:upper:] > Represents an uppercase character if Match case is selected in Options. > > > > > --- David A. Wheeler > >
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]