Subject: RE: [search-ws-comment] Comments on http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/csprd01/part5-cql/searchRetrieve-v1.0-csprd01-part5-cql.html
Thank you, Martin, for your comments on the CQL spec.

> 1. In par. "3.4 Search Term" it is explained that the backslash (\) is
> used to escape quote (") as well as itself. The list of forbidden
> characters for simple-strings is specified as: left or right angle
> bracket, left or right parenthesis, equal, backslash, quote, whitespace.
> But in par. "4 CQL Query Syntax: ABNF" it is explained that
> simple-strings shouldn't contain "()/<=> or whitespace. These
> paragraphs are a bit confusing.

Let me start with a basic explanation of, and the reasoning behind, the rules for treating special characters.

CQL initially singles out four structural characters for special treatment -- left and right parentheses, equal, and space -- because they are used for tokenization. For example:

- In the expression
      title=cat
  the equal sign separates title and cat into separate tokens.
- In the expression
      cat AND dog
  the two spaces result in three tokens.
- In the expression
      (cat AND dog) OR (rat AND frog)
  the parentheses are used to order the tokens. (Without them the expression would be evaluated left to right, in effect as "((cat AND dog) OR rat) AND frog", and the tokens would be ordered differently.)

Therefore, none of these four characters may occur in a search term without some indication that it is functioning as a normal search term character and not in one of the roles illustrated above.

The rule adopted is that if any of those four characters needs to appear in a search term, then the entire search term is to be enclosed in quotes. Conversely, if a search term is enclosed in quotes, then none of the four characters within that search term takes on its special tokenizing function. Hence quote is added as a "special character".

But that raises the question: what if you need a quote within the search term? You have to indicate that it is part of the search term and is not there to signal the end of the search term.
So the rule adopted is that a backslash is used to "escape" the quote. Hence backslash is added as a "special character". Thus six special characters: equal, space, open paren, close paren, quote, and backslash. Left and right angle brackets are added to the list because CQL should be XML friendly. Forward slash is also a special character, but more on that below.

So, backslash is used to escape quote, which raises the question: what if you need a backslash within the search term? The solution is that backslash is used to escape a backslash. (Thus two backslashes in a search term result in a single backslash. Four would result in two, etc.)

It is important to note that among these nine "special characters", only quote and backslash itself are backslash escaped. (Additional special characters may be defined in context sets, and these may be backslash escaped; more on that below.) The others are "escaped" either by virtue of being within a quote-enclosed term or, in the case of forward slash ...

> a) What about the forward slash (/), is it a special character or not?

Forward slash is a special character, but it does not require special treatment (i.e. escaping), because its role can be unambiguously determined by the context. To illustrate, consider the search:

    title =/relevant cat

where the relation, = , is qualified by the relation qualifier "relevant". How do we know the forward slash isn't part of the search term? The answer is that the slash is used only within a relation, to signal that a relation qualifier follows, and the syntax is designed such that the CQL parser will never be confused about whether it is encountering a term or a relation, because a CQL search is composed of search clauses linked by booleans. A search clause is always either:

(a) index relation term
or
(b) term

In case (b) the index and relation assume system defaults.
So the first token will always be either an index or a term. Upon encountering the first token the parser doesn't know which, but if the second token is anything other than a boolean, it knows that the first token is an index and that the second is a relation (and the third had better be a search term). Thus it is logically impossible for "=/relevant" in the above example to be a search term.

> b) Should backslash-escaping be done in unquoted strings as well?

Not for any of the characters mentioned above. However, context sets may define additional characters with special meaning that may be backslash escaped. So in fact, where it says in 3.4 ...

    "A search term ... MUST be enclosed in double quotes if it contains any of the following characters: left or right angle bracket, left or right parenthesis, equal, backslash, quote, or whitespace..."

... I think backslash was included in that list in error. (I will need to consult with the TC on this.) While it's true that if backslash is used to escape a quote then the entire search term must be quoted (not because of the backslash but because of the quote), backslash may also be used to escape, for example, a masking character, and in that case the search term would not need to be quoted.

> c) Is it possible to escape other characters except double-quotes and
> backslashes, like "\a"? ...

Yes, but only if that character is defined by the operational context set as a special character.

> ... I think there's nothing wrong with "\a" to be
> interpreted as "a". But quote par. "B.3.3 Matching": "Backslash (\) is
> used to escape '*', '?', quote (") and '^', as well as itself.
> Backslash not followed immediately by one of these characters is an
> error."

Yes, but that's from Annex B, "CQL Context Set", and applies only if that is the operational context set. I can see where that would cause confusion. We will add a note of clarification in the next draft.

To elaborate a bit about context sets ...
The CQL spec introduces the concept of a context set, which allows different communities to formally define their indexes, relations, and qualifiers. In fact the CQL document includes the specifications of four context sets in annexes B, C, D, and E: the CQL context set, the Sort context set, the Dublin Core context set, and the Bib context set. The first, the CQL context set, is the most general and is the default.

The point is that a context set may designate additional characters to have special meaning. The CQL set does just that: it designates asterisk, caret, and question mark, and so these three characters need to be escaped if they are intended literally in a search term.

> 1. Is CQL Unicode compliant? ...

I'll have to admit, we haven't thought deeply enough about it for me to be able to answer that question reliably; I'll consult with the Committee and we'll follow up.

> ... For example what exactly is the definition
> of whitespace in CQL?

My answer, perhaps overly simplistic, is "one or more consecutive whitespace characters". If you are asking what is considered a whitespace character, we have not attempted to answer that. Again, I'll need to consult with the Committee.

Thanks again for raising these questions.

--Ray
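
As an illustration of the quoting and backslash rules described in this message, here is a minimal Python sketch. It is not taken from the spec: the function names (needs_quotes, to_cql_term, from_cql_term) are hypothetical, "whitespace" is assumed to mean whatever Python's str.isspace() accepts (the message leaves that definition open), and backslash is kept in the must-quote list as the conservative reading of section 3.4 even though it may have been listed in error. Context-set characters such as asterisk, caret, and question mark are deliberately not handled.

```python
# Characters that force quoting, following the section 3.4 list quoted in
# this message. Backslash is kept as the conservative choice (assumption).
MUST_QUOTE = set('<>()=\\"')


def needs_quotes(term: str) -> bool:
    """True if the term contains a must-quote character or whitespace.

    Assumption: whitespace is whatever str.isspace() accepts.
    """
    return any(ch in MUST_QUOTE or ch.isspace() for ch in term)


def to_cql_term(literal: str) -> str:
    """Render a literal string as a search term, quoting only when needed."""
    if not needs_quotes(literal):
        return literal
    # Escape backslashes first so the ones added for quotes are not doubled.
    escaped = literal.replace('\\', '\\\\').replace('"', '\\"')
    return '"' + escaped + '"'


def from_cql_term(term: str) -> str:
    """Recover the literal string from a term produced by to_cql_term."""
    if not term.startswith('"'):
        return term  # unquoted: nothing to undo in this sketch
    if len(term) < 2 or not term.endswith('"'):
        raise ValueError("unterminated quoted term")
    inner = term[1:-1]
    out = []
    i = 0
    while i < len(inner):
        ch = inner[i]
        if ch == '\\':
            if i + 1 >= len(inner):
                # The final quote was escaped, so the term never closes.
                raise ValueError("unterminated quoted term")
            nxt = inner[i + 1]
            if nxt in '"\\':
                out.append(nxt)        # \" becomes "  and  \\ becomes \
            else:
                out.append(ch + nxt)   # left for the context set to interpret
            i += 2
        elif ch == '"':
            raise ValueError("unescaped quote inside quoted term")
        else:
            out.append(ch)
            i += 1
    return ''.join(out)
```

With these helpers, to_cql_term('cat') stays unquoted, to_cql_term('cat dog') is wrapped in quotes, and a quoted term containing four consecutive backslashes decodes to two, matching the rule stated in this message. A backslash followed by any other character is passed through untouched, since only the operational context set can say what it means.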