OASIS Mailing List Archives

 



Subject: [relax-ng] Proposal 2 for RNG regular expressions


This is a proposal for extending RNG to support regular expressions
within content.  Essentially, a new type of pattern is introduced,
parallel to the value and data patterns: the regex pattern.
Various existing and novel elements can appear within a regex pattern
in order to specify exactly which characters (of attribute value or
character content) are allowed.

There are two main design principles.  The first is that everything
within a regex pattern (after reduction of ref and externalRef elements)
can be compiled into a Perl 5 regular expression, since this type of
r.e. is supported by many different libraries.  The second is that the
facilities provided are "in the spirit of RNG": that is, they provide
what is clearly necessary, but do not have much syntactic sugar, and
can be implemented using variants of the implementation for ordinary
patterns, if a Perl-compatible library is not being used.
Equivalence with XSD regexes was explicitly not a goal.
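
To make the first principle concrete, here is a rough sketch (not part of the proposal) of how a regex pattern might be compiled into a Perl-style expression. It uses Python's re module as a stand-in for a Perl-compatible library. The function name compile_re, the un-namespaced element names, and the choice of \A and \Z for the "string" anchors are all assumptions made for illustration. The cset, ref and externalRef elements are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Assumed mapping of begin/end elements to Perl-style assertions.
# For type="line", ^ and $ only act per line in multi-line mode.
ANCHORS = {
    ("begin", "word"): r"\b", ("end", "word"): r"\b",
    ("begin", "line"): "^", ("end", "line"): "$",
    ("begin", "string"): r"\A", ("end", "string"): r"\Z",
}

# Quantifier appended after a non-capturing group.
SUFFIX = {"group": "", "optional": "?", "zeroOrMore": "*", "oneOrMore": "+"}


def compile_re(node):
    """Translate one element of a regex pattern into regex source text."""
    tag = node.tag
    if tag == "regex":
        return "".join(compile_re(child) for child in node)
    if tag == "value":
        # Literal string; re.escape protects the regex metacharacters.
        return re.escape(node.text or "")
    if tag == "empty":
        return ""
    if tag in ("begin", "end"):
        return ANCHORS[(tag, node.get("type"))]
    if tag == "choice":
        return "(?:" + "|".join(compile_re(child) for child in node) + ")"
    if tag in SUFFIX:
        body = "".join(compile_re(child) for child in node)
        return "(?:" + body + ")" + SUFFIX[tag]
    raise ValueError("element not handled in this sketch: " + tag)


EXAMPLE = """
<regex>
  <begin type="string"/>
  <oneOrMore>
    <choice><value>ab</value><value>cd</value></choice>
  </oneOrMore>
  <optional><value>.</value></optional>
  <end type="string"/>
</regex>
"""

PATTERN = compile_re(ET.fromstring(EXAMPLE))
print(PATTERN)
```

Every compound element becomes a non-capturing group, so the translation stays purely structural: each element maps to one fragment, and no operator-precedence analysis is needed.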

Here is the (compact) syntax for the additions.  Anything not
defined here is defined in the compact syntax for RNG.
This grammar could be improved if the compact RNG grammar were
factored better so that there were rules for ref, externalRef, empty, value.

# Add regex as a new pattern type
pattern &= regex

# All regexes are packaged within a regex element
regex =	element regex { common & re+ }

re =	element group|choice|optional|zeroOrMore|oneOrMore
		{ common & re+ }		# Perl (?:...), |, ?, *, +
	| element ref { nameNCName, common }	# Local regex rule
	| element empty { common }		# No op
	| element value { commonAttributes, xsd:string }
		# Literal string: all non-alphabetics get escaped for Perl
	| element externalRef { href, common }	# Global regex rule
	| element begin|end { common, attribute type { "word"|"line"|"string" } }
		# \b, \b, ^, $, \A, \Z
	| cset					# Character sets

cset = element cset { common & ((cs+ & csexcept?)	# cs expression
		| attribute name { xsd:token }		# named character class
		| (attribute type { "chars"|"ranges" }, xsd:string)) }
			# chars: enumerate members of set
			# ranges: pairs of characters define inclusive ranges

cs = 	element choice|concur {common & cs+} 		# union, intersection
	| element ref { nameNCName, common }
	| element empty { common }
	| element externalRef { href, common }
	| cset

csexcept = element except { common & cs+ }		# cset difference
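
As an illustration of the character-set rules, the sketch below evaluates a cset into an explicit set of characters and then prints it as a bracketed class. It is only one possible reading of the grammar. The helper names charset and to_class are hypothetical, several cs children of one cset are assumed to be unioned, empty is assumed to denote the empty set, and named classes are left out because their values are not yet defined.

```python
import re
import xml.etree.ElementTree as ET


def charset(node):
    """Return the set of characters denoted by a cset or cs element."""
    tag = node.tag
    if tag == "cset":
        kind = node.get("type")
        text = node.text or ""
        if kind == "chars":
            # chars: the content enumerates the members of the set.
            return set(text)
        if kind == "ranges":
            # ranges: each pair of characters is an inclusive range.
            if len(text) % 2:
                raise ValueError("ranges needs an even number of characters")
            out = set()
            for lo, hi in zip(text[0::2], text[1::2]):
                out.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            return out
        if node.get("name") is not None:
            raise NotImplementedError("named classes are not defined yet")
        # cs expression: union of the cs children minus the except part.
        members, excluded = set(), set()
        for child in node:
            if child.tag == "except":
                for sub in child:
                    excluded |= charset(sub)
            else:
                members |= charset(child)
        return members - excluded
    if tag == "choice":  # union
        return set().union(*(charset(child) for child in node))
    if tag == "concur":  # intersection
        return set.intersection(*(charset(child) for child in node))
    if tag == "empty":
        return set()
    raise ValueError("element not handled in this sketch: " + tag)


def to_class(chars):
    """Render a non-empty set of characters as a bracketed class."""
    return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"


EXAMPLE = """
<cset>
  <concur>
    <cset type="ranges">az</cset>
    <cset type="ranges">mzAZ</cset>
  </concur>
  <except><cset type="chars">xyz</cset></except>
</cset>
"""

MEMBERS = charset(ET.fromstring(EXAMPLE))
print(to_class(MEMBERS))
```

Working with explicit sets sidesteps the fact that Perl 5 classes have no intersection or difference operators, at the cost of being impractical for very large Unicode ranges.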

The following paths are forbidden.  Essentially these restrictions force
patterns imported by ref or externalRef to conform to the re and cs rules.

regex//data
regex//list
regex//attribute
regex//ref
regex//interleave

cset//optional
cset//oneOrMore
cset//zeroOrMore
cset//group
cset//begin
cset//end
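
A checker for these forbidden paths could be as simple as the following sketch. It assumes the pattern has already been reduced (the sketch does no ref or externalRef expansion itself), ignores namespaces, and the name violations is made up for the example.

```python
import xml.etree.ElementTree as ET

# Ancestor element -> descendant elements that must not occur below it.
FORBIDDEN = {
    "regex": {"data", "list", "attribute", "ref", "interleave"},
    "cset": {"optional", "oneOrMore", "zeroOrMore", "group", "begin", "end"},
}


def violations(root):
    """List every forbidden ancestor//descendant path found under root."""
    found = []

    def walk(node, ancestors):
        # Report the node once for each distinct ancestor that forbids it.
        for anc in sorted(ancestors):
            if node.tag in FORBIDDEN.get(anc, ()):
                found.append(anc + "//" + node.tag)
        for child in node:
            walk(child, ancestors | {node.tag})

    walk(root, frozenset())
    return found


BAD = "<regex><cset><group><value>a</value></group></cset><list/></regex>"
GOOD = "<regex><oneOrMore><value>a</value></oneOrMore></regex>"

print(violations(ET.fromstring(BAD)))
print(violations(ET.fromstring(GOOD)))
```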

The licit values for cset/@name have yet to be defined.  Obvious candidates
are any, anyButNewline, lower, upper, alpha, digit, num, alphanum, punct,
graph, space, control.  Other possibilities are ICU character predicates
and perhaps Unicode block names.

-- 
John Cowan <jcowan@reutershealth.com>     http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_

