
search-ws message



Subject: RE: [search-ws-comment] Comments on http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/csprd01/part5-cql/searchRetrieve-v1.0-csprd01-part5-cql.html


Thank you, Martin, for your comments on the CQL spec.

> 1. In par. "3.4 Search Term" it is explained that the backslash (\) is
> used to escape quote (") as well as itself. The list of forbidden
> characters for simple-strings is specified as: left or right angle
> bracket, left or right parenthesis, equal, backslash, quote, whitespace.
> But in par. "4 CQL Query Syntax: ABNF" it is explained that
> simple-strings shouldn't contain
> "()/<=> or whitespace. These paragraphs are a bit confusing.

Let me start with a basic explanation of and reasoning behind the rules for
treating special characters.

CQL initially singles out four structural characters for special treatment
-- left and right parentheses, equal, and space -- because they are used for
tokenization. 

For example,
 -  in the expression:  
        title=cat
 the equal sign separates title and cat into separate tokens.

- in the expression
       cat AND dog
the two spaces result in three tokens.

- in the expression
       (cat AND dog)  OR (rat AND frog)  
the parentheses are used to group the tokens. (Without them the expression
would be evaluated differently: in effect it would be evaluated as
"((cat AND dog) OR rat) AND frog" and the tokens would be grouped
differently.)
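
To make the last example concrete, here is a minimal sketch of the
left-to-right grouping that applies when no parentheses are given. The
function name is hypothetical and the input is assumed to be a flat,
already-tokenized list alternating terms and booleans; it is an
illustration only, not a CQL parser.

```python
def group_left_to_right(tokens):
    """Fully parenthesize a flat [term, boolean, term, ...] list, left to right."""
    expr = tokens[0]
    # Walk the (boolean, term) pairs that follow the first term.
    for i in range(1, len(tokens), 2):
        boolean, term = tokens[i], tokens[i + 1]
        expr = f"({expr} {boolean} {term})"
    return expr


print(group_left_to_right("cat AND dog OR rat AND frog".split()))
# (((cat AND dog) OR rat) AND frog)
```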

Therefore, none of these four characters may occur in a search term without
indication that they are functioning as normal search term characters and
not in the roles illustrated above. 
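
As a rough illustration of how these four characters drive tokenization,
here is a small sketch. The name "tokenize" and the regular expression are
my own assumptions for illustration; the sketch deliberately ignores quoted
terms and treats only parentheses, equal, and whitespace as structural.

```python
import re

# A token is either a single structural character ( ) = or a run of
# characters that are neither structural nor whitespace.
TOKEN = re.compile(r"[()=]|[^\s()=]+")


def tokenize(query):
    """Split an unquoted query on the four structural characters."""
    return TOKEN.findall(query)


print(tokenize("title=cat"))    # ['title', '=', 'cat']
print(tokenize("cat AND dog"))  # ['cat', 'AND', 'dog']
print(tokenize("(cat AND dog)  OR (rat AND frog)"))
```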

The rule adopted is that if any of those four characters need to appear in a
search term, then the entire search term is to be enclosed in quotes.
Conversely if a search term is enclosed in quotes then any of the four
characters within that search term do not take on their special tokenizing
function.  

Hence quote is added as a "special character".

But then that raises the question, what if you need a quote within the
search term?  You have to indicate that it is part of the search term and is
not there to signal the end of the search term. So the rule adopted is that
a backslash is used to "escape" the quote. 

Hence backslash is added as a "special character".

Thus six special characters: equal, space, open paren, close paren, quote,
and backslash.   

Left and right angle brackets are added to the list because CQL should be
XML friendly.

Forward slash is also a special character, but more on that below.

So, backslash is used to escape quote, which raises the question, what if
you need a backslash within the search term?  The solution to that is,
backslash is used to escape a backslash. (Thus two backslashes in a search
term result in a single backslash. Four would result in two, etc.)
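
The quoting and escaping rules so far can be sketched as a pair of helper
functions. Both names are hypothetical. The sketch always emits the quoted
form, and on decoding it leaves any backslash that is not followed by a
quote or a backslash untouched, since whether that is legal depends on the
context set.

```python
def to_quoted_term(term):
    """Write a literal search term as a quoted CQL term."""
    # Escape backslashes first, so the backslashes added for quotes
    # are not themselves doubled.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def from_quoted_term(quoted):
    """Recover the literal term from a quoted CQL term."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted term")
    body = quoted[1:-1]
    out = []
    i = 0
    while i < len(body):
        # A backslash followed by a quote or backslash stands for that character.
        if body[i] == "\\" and i + 1 < len(body) and body[i + 1] in '"\\':
            out.append(body[i + 1])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


print(to_quoted_term('say "hi"'))  # "say \"hi\""
```

Decoding a quoted term whose body is four backslashes yields two, matching
the rule above.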

It is important to note that among these nine "special characters", only
quote and backslash itself are backslash-escaped (additional special
characters may be defined in context sets, and these may be backslash
escaped; more on that below). The others are "escaped" either by virtue of
being within a quote-enclosed term, or in the case of forward slash
.....




> a) What about the forward slash (/), is it  a special character or not?


Forward slash is a special character but it does not require special
treatment (i.e. escaping), because its role can be unambiguously determined
by the context.

To illustrate, consider the search:

	title =/relevant cat 

where the relation, = , is qualified by the relation qualifier "relevant".
How do we know the forward slash isn't part of the search term? The answer
is, the slash is used only within a relation, to signal that a relation
qualifier follows, and the syntax is designed such that the CQL parser will
never be confused about whether it is encountering a term or a relation,
because...
 
a CQL search is composed of search clauses linked by booleans. A search
clause is always either:

(a) index relation term

or

(b) term

in case (b) the index and relation assume system defaults. 

So the first token will always be either an index or term and upon
encountering the first token the parser doesn't know which, but if the
second token is other than a boolean, it knows that that first token is an
index and that the second is a relation (and the third had better be a
search term). Thus it is logically impossible for "=/relevant" in the above
example to be a search term. 
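
The lookahead described above can be sketched roughly as follows. The
function name, the dictionary layout, and the assumption that the tokens
are already whitespace-separated (so "=/relevant" arrives as one token) are
mine, for illustration only; the boolean list is and, or, not, prox.

```python
BOOLEANS = {"and", "or", "not", "prox"}


def parse_clause(tokens, pos=0):
    """Read one search clause starting at tokens[pos].

    Returns (clause, next_pos). If the token after the first is missing
    or is a boolean, the first token is a bare term; otherwise the first
    token is an index, the second a relation, and the third the term.
    """
    first = tokens[pos]
    if pos + 1 >= len(tokens) or tokens[pos + 1].lower() in BOOLEANS:
        # Case (b): index and relation assume system defaults.
        return {"index": None, "relation": None, "qualifiers": [], "term": first}, pos + 1
    if pos + 2 >= len(tokens):
        raise ValueError("a relation must be followed by a search term")
    # Case (a): a slash inside the relation introduces a relation qualifier.
    relation, *qualifiers = tokens[pos + 1].split("/")
    clause = {"index": first, "relation": relation,
              "qualifiers": qualifiers, "term": tokens[pos + 2]}
    return clause, pos + 3


print(parse_clause("title =/relevant cat".split()))
print(parse_clause("cat and dog".split()))
```

Because the slash is only ever examined in the relation position, it never
needs escaping inside a term.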




> b) Should backslash-escaping be done in unquoted strings as well?

Not for any of the characters mentioned above.  However, context sets may
define additional characters with special meaning that may be backslash
escaped.

So in fact where it says in 3.4 ...
"A search term  ... MUST be enclosed in double quotes if it contains any of
the following characters: left or right angle bracket, left or right
parenthesis, equal, backslash, quote, or whitespace..."

 .... I think backslash was included in that list in error. (I will need to
consult with the TC on this.)  While it's true that if backslash is used to
escape a quote then the entire search term must be quoted (not because of
the backslash but because of the quote), backslash may be used to escape,
for example, a masking character, and the search term would not need to be
quoted.




> c) Is it possible to escape other characters except double-quotes and
> backslashes, like "\a"?  .....

Yes, but only if that character is defined by the operational context set as
a special character. 


 .....I think there's nothing wrong with "\a" to be
> interpreted as "a". But quote par. "B.3.3 Matching": "Backslash (\) is
> used to escape '*', '?', quote (") and '^' , as well as itself.
> Backslash not followed immediately by one of these characters is an
> error.".

Yes, but that's from Annex B, "CQL Context Set", and applies only if that is
the operational context set.  

I can see where that would cause confusion.   We will add a note of
clarification in the next draft. 

To elaborate a bit about context sets .... . The CQL spec introduces the
concept of a context set, which allows different communities to formally
define their indexes, relations, and qualifiers. In fact the CQL document
includes the specifications of four context sets in annexes B, C, D, and E:
the CQL context set, the Sort context set, the Dublin Core context set, and
the Bib context set. The first, the CQL context set, is the most general and
is the default.  The point is, a context set may designate additional
characters to have special meaning; the CQL set does just that: it
designates asterisk, caret, and question mark, and so these three
characters need to be escaped if they are intended literally in a search
term.
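
For illustration, when the CQL context set is the operational one, the
escaping rule quoted above from B.3.3 could be sketched like this. The
names are hypothetical, and the sketch covers only the backslash escaping
of characters within a term, not the surrounding quotes.

```python
# Characters that B.3.3 says backslash may escape: the three masking
# characters, quote, and backslash itself.
CQL_SET_ESCAPABLE = set('*?^"\\')


def escape_for_cql_set(literal):
    """Backslash-escape every character that would otherwise be special."""
    return "".join("\\" + c if c in CQL_SET_ESCAPABLE else c for c in literal)


def check_escapes(term):
    """Raise ValueError if a backslash is not followed by an escapable character."""
    i = 0
    while i < len(term):
        if term[i] == "\\":
            if i + 1 >= len(term) or term[i + 1] not in CQL_SET_ESCAPABLE:
                raise ValueError(f"invalid escape at position {i}")
            i += 2  # skip the backslash and the character it escapes
        else:
            i += 1


print(escape_for_cql_set("what?"))  # what\?
check_escapes(r"what\?")            # passes silently
```

Under this rule check_escapes rejects "\a", which is the error case raised
in the question; under a different context set the set of escapable
characters would differ.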


 

> 1. Is CQL Unicode compliant? ....

I'll have to admit, we haven't thought deeply enough about it for me to be
able to answer that question reliably; I'll consult with the Committee and
we'll follow up.

.... For example what exactly is the definition
> of whitespace in CQL?

My answer, perhaps overly simplistic, is "one or more consecutive whitespace
characters".  If you are asking what is considered a whitespace character,
we have not attempted to answer that. Again, I'll need to consult with the
Committee. 

Thanks again for raising these questions.

--Ray








