Subject: Re: [ciq] FW: Address Schema Questions
From: David RR Webber <david@drrw.info>
To: "TC, CAM OASIS," <cam@lists.oasis-open.org>
Date: Mon, 15 Nov 2004 07:49:44 -0500
Team,

Thought I'd share this exchange from the CIQ list - since John here makes
reference to some interesting rule processing needs.

The jCAM engine does not yet include these aspects of the specification,
but we have included facilities to be able to tackle some of these more
esoteric handling needs.

Next month I'm planning to start work on some CAM templates for CIQ examples
that Ram has already asked me for.

DW

Ram Kumar wrote:

> 
>From: John D. Putman [mailto:jdputman@scanningtech.fedex.com] 
>Sent: Saturday, November 13, 2004 5:47 AM
>To: Ram Kumar; John D. Putman
>Subject: RE: Address Schema Questions
>
>Briefly -
>
>o  I appreciate the reply and agree that CIQ address schemas afford a
>wide degree of latitude for address formats and data; my main concern
>was with regard to practical considerations involved in appropriate
>population of the more finely grained schema, which may indeed be
>largely outside of CIQ purview.
>
>o  See details for the distinction between and import of "raw" address
>parsing and address matching to address reference data.
>
>o  I understand that your committee's efforts are "open" / without
>restriction or royalties; but some synergy with address reference data
>and matching facilities providers (who DO impose fees / restrictions)
>could be critical to the successful application of the schemas (again
>see address parsing versus address matching results covered under
>Details).
>
>o  I would be pleased to participate in your committee's efforts, as I
>believe there could be mutual benefits in so doing - especially with
>regard to international address standards (see Details for some general
>points on which we might work and for an immediate and specific instance
>/ question - aspects of which could affect a particular schema field's
>use and definition).
>
>Details - 
>
>Thanks for the reply!  Overall, I understand and can see in your schema
>definitions that a great deal of necessary latitude is provided.  So,
>depending on the data available (very general or very finely grained),
>the schemas can handle that.  The concern / questions were directed to
>practical matters of how the finely grained address element breakup
>could be populated and so adequately leveraged.  Yes, that is probably
>largely outside the scope of the schema definitions; but again, if it is
>not possible, then the schema may never be used.  I would like to see
>that made possible.
>
>PARSING VERSUS MATCHING TO REFERENCE DATA
>
>Address parsing generally depends on some heuristic rules relative to a
>particular address format and address parts representation.  That
>parsing is also sometimes dependent on language recognition (see Canada
>where address parts depend on whether the address is in English or
>French).  Often the parsing is facilitated by static (but sometimes
>updated) lists of terms so that address parts can be recognized.  For
>instance, this can be lists of whole words, abbreviations, or symbols
>that are GENERALLY used for or in association with a particular part of
>an address - say directionals or street types for instance.  
>
>However, such "raw" parsing can and does make errors.  Consider the
>address "10 S Rayburn Ave".  A parsing engine will generally ASSUME that
>the "S" is a pre-directional for "South", which in most cases IS a
>correct assumption.
>HOWEVER, "S" may be the initial for "Sam" in the street name "S Rayburn"
>(Sam Rayburn was a Texas legislator and longtime speaker of the United
>States House of Representatives, for whom streets, reservoirs, etc. have
>been named).  It is matching to address reference data, which contains
>"S Rayburn" as the street name rather than "S" as a pre-directional and
>"Rayburn" as the street name, that will allow one to determine this
>difference.  Without that match-made correction / determination (ALONG
>with delivery by the matching engine of the address parts identified!),
>one could take a parsing engine's non-match-informed split up and
>incorrectly place the "S" in a pre-directional field and only "Rayburn"
>in the street name field.  
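
The difference between "raw" parsing and match-corrected parsing can be sketched in Python. This is only a minimal illustration under stated assumptions, not any vendor's algorithm: the term lists, the one-entry REFERENCE_STREETS set, and the function names raw_parse and match_corrected are all invented for the example.

```python
# Static term lists of the kind a "raw" parser relies on (illustrative only).
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
STREET_TYPES = {"AVE", "ST", "RD", "LN", "BLVD"}

# Hypothetical address reference data: (street name, street type) pairs.
REFERENCE_STREETS = {("S RAYBURN", "AVE")}


def raw_parse(address):
    """Heuristic split with no reference data: a leading 'S' is ASSUMED
    to be a pre-directional."""
    tokens = address.upper().split()
    parts = {"number": tokens[0], "pre_directional": "",
             "street_name": "", "street_type": ""}
    rest = tokens[1:]
    if len(rest) > 1 and rest[0] in DIRECTIONALS:
        parts["pre_directional"] = rest.pop(0)
    if len(rest) > 1 and rest[-1] in STREET_TYPES:
        parts["street_type"] = rest.pop()
    parts["street_name"] = " ".join(rest)
    return parts


def match_corrected(address):
    """Re-join the directional with the street name when the reference
    data lists the combined string as the actual street name."""
    parts = raw_parse(address)
    combined = (parts["pre_directional"] + " " + parts["street_name"]).strip()
    if parts["pre_directional"] and \
            (combined, parts["street_type"]) in REFERENCE_STREETS:
        parts["street_name"] = combined
        parts["pre_directional"] = ""
    return parts


print(raw_parse("10 S Rayburn Ave"))
print(match_corrected("10 S Rayburn Ave"))
```

The second call moves the "S" out of the pre-directional field and into the street name, which is the correction that only the reference data makes possible.
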
>
>Note that this is a case where whole string matching against address
>reference data can "save the day".  There may NOT be a street named just
>"Rayburn".  Of course, putting the "S" back together with the "Rayburn"
>in a human usable address block would make this distinction moot FOR
>human usability (though NOT for how the address was populated into some
>schema).
>BUT I HAVE seen examples where that would NOT be the case!
>
>On another note, "real world" addresses sometimes contain extraneous
>information.  This can be included by users by mistake or for other
>purposes.  For instance, a user may include "directions" as part of an
>address - even though those directions might NOT be needed for address
>uniqueness determination and deliverability.  As a consequence, the user
>supplied address data might look something like -
>	10 Shady Lane
>	1 km south of Farm Rd 10 and Hwy 3
>	Anytown, Anycountry PostalCode
>
>But suppose there is ALSO a "1 Farm Rd" and/or a "10 Hwy 3"!  It is
>SUCCESSFUL matching to the address reference data that allows one to get
>a unique and correct address - in this instance, say a match to "10
>Shady Lane" (determined by address format precedence order, lack of
>intervening extraneous data - the "km south of" and the "and", and/or
>the postal code - "10 Shady Lane" being associated with the postal code
>provided and the other possible addresses NOT).  A "raw" parser can
>become quite confused with such representations and return any or all or
>some mixture of address "parts" for the supplied address.  While
>precedence order considerations might "help" the parser ONLY populate
>address fields with "10 Shady Lane", I HAVE
>SEEN addresses where the directions come "first"; AND that might even be
>the case here if the parser analyzes the address from "bottom up"!
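
A rough Python sketch of how a reference-data match can pick out the deliverable line regardless of line order, while preserving (rather than dropping) the leftover text. The REFERENCE records, the placeholder postal codes, and the function name match_with_extras are assumptions made up for illustration; real matching engines are far more tolerant of spelling and format variation.

```python
# Hypothetical reference records; the postal codes are invented placeholders.
REFERENCE = [
    {"street_line": "10 SHADY LANE", "postal_code": "PC-A"},
    {"street_line": "1 FARM RD", "postal_code": "PC-B"},
    {"street_line": "10 HWY 3", "postal_code": "PC-C"},
]


def match_with_extras(lines, postal_code):
    """Return (matched record or None, lines not explained by the match)."""
    cleaned = [line.strip() for line in lines if line.strip()]
    hits = [
        rec for rec in REFERENCE
        if rec["postal_code"] == postal_code
        and any(line.upper() == rec["street_line"] for line in cleaned)
    ]
    if len(hits) != 1:
        # No unique match: there is no way to know what to drop, so keep all.
        return None, cleaned
    match = hits[0]
    extras = [line for line in cleaned if line.upper() != match["street_line"]]
    return match, extras


# Directions listed FIRST still resolve to the same street line.
supplied = ["1 km south of Farm Rd 10 and Hwy 3", "10 Shady Lane"]
print(match_with_extras(supplied, "PC-A"))
```
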
>
>Perhaps the CIQ schema might attempt to take a "neutral" stance on such
>address reference data determined "extraneous" information.  After all,
>in the absence of reference data and matching, one might want to or have
>to preserve that (no way to know what to "drop"!).  If so, have you
>considered how that might be attempted / allowed within the schema
>definitions (allow multiple address elements of the same type? a
>"directions" or "extra information" field?)?  Note that a similar
>situation can occur for company or organization internal "mail stop"
>information, which is often included with the more standard address
>elements and can just be a string of characters with no identifying
>"type" information (NO "Mail Stop" to go along with "XYZ-44/8").  But
>oddly enough, for a certain type of postal address in the USA, ONLY that
>mail stop string and a postal code may be provided (nothing more is
>required even for matching!).
>
>RECOGNITION OF OASIS AS AN OPEN STANDARD
>
>I did not mean to imply that OASIS was not an open standards effort, or
>that royalty charges might be levied in association with it.  What I meant to
>convey was that the organizations who control address reference data and
>matching facilities could be critical to proper schema population.
>Hopefully, the reason for this is clearer based on the points I made
>above with regard to improvements / corrections  afforded by matching to
>address reference data OVER "raw" address parsing.
>
>CIQ PARTICIPATION / MUTUAL BENEFITS
>
>As for me being an "expert", I will not lay claim to that title - rather
>I will just say that I am brutally experienced.  I would be pleased to
>apply that experience to the CIQ effort.  I believe I could justify that
>to my management (as they do largely determine the amount of time I can
>apply to such efforts) on the basis of mutual benefits.  From my side,
>some sharing of international address research would be beneficial; I am
>more highly experienced in USA address data.  When I have completed an
>analysis of international address data examples I can find, perhaps we
>could review that with regard to what your organization has discovered.
>
>But immediately, a particular case for some international addresses
>might be considered.  I have noticed that several countries use what is
>elsewhere used as a temperature or geographical degrees symbol /
>superscript ("o" above the character center line either before or after
>a "number", with
>or without an intervening space!).  I have noticed that symbol being
>used apparently in association with floors or apartments.  I am ASSUMING
>that that symbol actually means "number" in the context of that
>addressing, as in "apartment number" or "floor number".  Can you confirm
>that assumption?  
>
>As for the effect of this particular case on schema definition, one
>might consider whether -
>	+  The CONJOINED number and superscript representation should be
>placed in some unit / subpremise "number" field.
>	+  Some metadata rule or indication somehow "linked" to an
>address format or country should "clue" a schema user into the fact that
>that superscript should be added (and where / how) to a "raw" number, IF
>only a "raw" number is stored in the field.
>		NOTE : For the USA, a number without a unit or "subpremise"
>type (APT, STE, etc.), AND where that number cannot be matched to
>reference data (there is no Unit-type associated with the number
>provided in the address reference data), is given the "#" sign (meaning
>"number") for unit type.  A "default" record in the reference data
>specifically allows non-matched unit data for high-rise / multi-unit
>edifices.  This is done because it is recognized that not all
>subpremises will be known and included in reference data - as a
>practical matter, subpremises can "come and go" faster than reference
>data can be updated.
>	+  Whether the field should be termed a "number" at all might be
>questioned, since some "numbers" for floors or apartments can be alpha
>characters or a mixture of those and numbers ("Apartment 3B") - see
>"SubPremiseNumber" in the xAL schema terminology.  I tend to try to say
>Unit type (APT / apartment, STE / suite, etc.) and Unit identifier
>(rather than "number"), as this helps prevent any misunderstanding or
>WRONG typed data assumption.
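
The USA default described in the NOTE, together with the Unit type / Unit identifier split suggested in the last point, might be sketched as follows. KNOWN_UNITS and resolve_unit are hypothetical names, and the sample units are invented; real reference data would be keyed per building.

```python
# Hypothetical reference units for ONE multi-unit building: (type, identifier).
KNOWN_UNITS = {("APT", "3B"), ("STE", "200")}


def resolve_unit(unit_type, identifier):
    """Return (unit type, unit identifier).

    The identifier is kept as a string, since it may be alphabetic or mixed
    ("3B").  When no type was supplied and the reference data cannot supply
    one, fall back to "#" (meaning "number").
    """
    identifier = identifier.strip().upper()
    if unit_type:
        return unit_type.strip().upper(), identifier
    matches = [t for (t, i) in sorted(KNOWN_UNITS) if i == identifier]
    if len(matches) == 1:
        return matches[0], identifier
    return "#", identifier


print(resolve_unit(None, "3b"))   # type recovered from reference data
print(resolve_unit(None, "17"))   # unknown unit, default type applied
```
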
>
>Thank you,
>David Putman
>
>-----Original Message-----
>From: Ram Kumar [mailto:RKumar@msi.com.au]
>Sent: Thursday, November 11, 2004 8:30 PM
>To: John D. Putman
>Cc: Ram Kumar
>Subject: RE: Address Schema Questions
>
>
>Dear John,
>
>Thank you for your detailed email. I have attempted to answer your
>questions to the best of my knowledge.
>
>  
>
>>Having worked on address correction since 1993, I periodically review 
>>the "state of the art".  I recently came across some of your 
>>committee's work in the area of defining XML address structures.  I 
>>myself have made several "stabs" at various address holding structures
>>in several different mediums (IMS, RDBS, transactional, XML, etc.).  I
>>have not only analyzed and worked with "end-users" on those BUT ALSO
>>with address correction vendors / facilities AND address using
>>applications.  I am also in the process of ATTEMPTING to analyze
>>international addresses (the "parts" of which they are composed, what
>>those parts might mean, which are most significant / their hierarchy,
>>etc.).
>>
>>While I have yet to study IN DETAIL all of your proposed schemas, I 
>>have several operational and implementation questions that you or your
>>committee may have pondered -
>>
>>1.  Even if a completely comprehensive and consistent set of schemas 
>>can be defined (leaving aside the messy "real world" of addresses -
>>especially internationally AND ongoing changes in that!), how would one
>>populate those schemas?  Or is that something to HOPE that the producers
>>of address reference databases will do (and so be able to deliver such
>>sub-divisions back for use)?
>>    
>>
>
>XML Schemas defined by CIQ TC are to provide a consistent way of
>defining the metadata for addresses. There are tools in the market that
>can populate the address data into XML format that is validated against
>the schemas. This is the case with any XML usage. Formatting the data
>into XML format and retrieving the data from XML format is the work of
>the end users of the schemas.
>
>  
>
>>2.  Wouldn't schema population require one or all of the following -
>>	a.  Reference data that is fully segmented into all relevant and
>>possible parts / subparts of an address relative to what is defined in
>>the schema.
>>    
>>
>
>I am not sure what you mean by reference data. Any schema requires the
>necessary data to be represented in an XML format that is validated
>against the schema. Whether the data is to be fully segmented or not
>depends on the rules of the schema. For example, in CIQ, address data
>can be represented either as, say, address line 1, address line 2, ...
>or fully segmented into, say, country, region, state, postcode, street
>name, street number, etc. It is the choice of the end users how the
>schema should be used.
>
>  
>
>>	b.  Parsing and matching facilities that can adequately recognize
>>addresses not submitted in a fully schema normalized format / parts.
>>    
>>
>
>Parsing is definitely required to break the address into atomic
>components.
>I do not know why you require matching to transform the address data
>into XML format that is validated against the schema.
>
>  
>
>>	c.  Delivery of fully parsed address data back as either a pre-match
>>provisional structure (though matching does often "fix" "raw" parsing 
>>errors!), or as a structure developed in (successful) address matching
>>to reference data (unfortunately neither of these is something many 
>>vendors do OR they do NOT expose it - especially not for international
>>addresses).
>>    
>>
>
>I am not sure whether I understood your question here. Once address
>elements are defined in an XML structure, re-construction of address
>structures into the required format is not difficult. 
>
>  
>
>>	Note that it is highly unlikely that users and applications, with
>>their vast stores of address information, will very soon or EVER take
>>either the trouble or time to restructure their addresses into fully
>>parsed and finely defined parts - EVEN IF postal authorities do so,
>>someday and globally, for their reference databases.  This means that
>>having the above facilities available AND applied will probably be
>>necessary to consistently and adequately populate very finely grained
>>address element schemas!
>>    
>>
>
>Totally agree. But the objective of CIQ is to provide that option too. I
>have 15 years of experience working in data quality and, in particular,
>name and address. Many organisations break address structures into
>individual components to enable efficient matching. This is very much
>applicable in the postal address certification process, and a classical
>example of this is Australia Post's Address Matching Approval System
>Program. So, the objective of CIQ TC is to be "application independent
>and global". In this way, we cover any type of application (breaking
>address data into atomic components OR keeping it at an abstract
>level). Please note that there is no compulsion in CIQ to break address
>data into atomic elements.
>
>  
>
>>3.  Even once segmented into all possible and appropriate "parts", 
>>won't one have to further define address template schemas (maybe one 
>>for each country and perhaps several for some countries that have 
>>different languages or address formats - take India for one!)?  That 
>>is, won't that be necessary to be able to properly reconstruct the 
>>address data into a human readable, postally valid, and usable 
>>"address block"?
>>    
>>
>
>Agreed. Parsing address data is the most difficult bit. Once it is
>parsed, it means you understand the address structure. Then, templates
>are needed to reconstruct the address elements. This is outside the
>scope of CIQ TC. 
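
To illustrate what such reconstruction templates might look like once the parts are known, here is a small Python sketch. The TEMPLATES table, the part names, and format_block are hypothetical and are not part of CIQ or xAL; the two layouts only reflect the common conventions that US addresses put the house number before the street and German addresses put it after, with the postal code preceding the city.

```python
# Hypothetical per-country templates: one format string per output line.
TEMPLATES = {
    "US": ["{number} {street}", "{city}, {state} {postal_code}"],
    "DE": ["{street} {number}", "{postal_code} {city}"],
}


def format_block(parts, country):
    """Rebuild a human-readable address block from already-parsed parts."""
    return "\n".join(line.format(**parts) for line in TEMPLATES[country])


# Invented sample data, used only to show the ordering differences.
parts = {"number": "10", "street": "Shady Lane", "city": "Anytown",
         "state": "TX", "postal_code": "75001"}
print(format_block(parts, "US"))
print(format_block(parts, "DE"))
```
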
>
>  
>
>>	Note that I have discussed and worked with some users on address
>>entry by constituent parts; but that has never been accepted.  Not all
>>users can adequately understand (or be taught) how to do so, OR they 
>>refuse (for good operational and efficiency reasons) to take the time 
>>to ATTEMPT to enter addresses in that way (even if they DO understand 
>>how to).  This is somewhat similar to diagramming complex language
>>structures into their constituent parts - not everyone can do it and
>>FEW want to.
>>    
>>
>
>Agreed. Again, please note that breaking address into parts is entirely
>OPTIONAL in CIQ. When you look at the CIQ schemas carefully, you will
>note that it provides options to either break address into parts or not
>to break them. In V3.0 of CIQ that we are working on at the moment, we
>provide two versions of address schemas, one that is not broken into
>atomic elements and the other that is broken into atomic elements.
>
>  
>
>>4.  Won't one actually HAVE TO ATTEMPT to "parse" (and understand!)
>>various address-related language elements (abbreviations, phrases, even
>>single symbols in, for instance, ideogrammatic languages like Kanji) so
>>as to break them up into their relevant address schema parts?
>>	Note that as part of an address, certain "common" 
>>language elements can take on highly specialized or even different 
>>meanings / usages from their "common" ones!
>>    
>>
>
>Agreed. Good parsing engines do a very good job in parsing complex
>addresses.
>
>  
>
>>5.  Are you aware that SOMETIMES SOME matching algorithms work better 
>>on the "string as a whole" than they do on constituent parts?  There 
>>is a fine line in address correction that often requires partial or 
>>whole string matching to "get the best match" OR prevent false 
>>positives.
>>This is often due to "special words" like "South" or "Circle" 
>>being used for BOTH sub-portions of street names (directionals, street
>>types) AND as primary street names themselves.
>>    
>>
>
>Agreed. 
>
>  
>
>>	Note that, for very "well behaved" address structures, some such
>>address element segmentation already occurs in vendor reference DBs.  
>>See some of those for non-Puerto Rican USA addresses for instance.  
>>However, where address structures are more variable or less "well 
>>behaved", that is often not attempted or only attempted partially.  
>>Puerto Rican addresses are, in fact, a mild example of that and 
>>continue to give address correction vendors "fits".
>>		Be that as it may, the point is that the more finely grained
>>address element storage is, the more processing cost there will be for
>>putting addresses back together into partial or complete strings for
>>matching (when that needs to occur) - or, for that matter, into "human
>>usable" form.  Address correction vendors are VERY concerned about 
>>performance, as their products are often required to process millions 
>>OR hundreds of millions of addresses in a relatively short time (a few
>>days if not less than a day).  Even on a per-transaction basis,
>>practically negligible (sub-second) response time is required by
>>applications and users.
>>Consequently, both address correction vendors and their users are 
>>highly resistant to performance "hits".  Similarly, since postal 
>>authorities generally count on those vendors to help them get better 
>>addresses into their operations, the postal authorities share that 
>>performance concern.
>>    
>>
>
>Agreed. Since CIQ provides the option of either breaking addresses into
>atomic elements or keeping them at an abstract level, users have the
>choice of what to do. The beauty of CIQ is that it covers both
>extremes, while other standards do not. 
>
>  
>
>>6.  On a practical note, is there any coordination with postal 
>>authorities (USPS, Canada Post, UPU, etc.) AND address correction 
>>vendors so that
>>    
>>
>
>We are a data quality vendor and we have sophisticated name and address
>parsing engines for North America, Canada, Australia and other
>countries. Moreover, for the latest V3.0 specs that we are working on,
>we have taken all the UPU addresses and tested them against the schema.
>Standards are more a political issue than a technical issue. UPU, USPS,
>UK Post, UN/CEFACT, etc. are aware of our work and we are open to
>liaisons with them and they know it. Whether they want to work with us
>is the decision they have to make, as we have approached all these bodies
>to work with us. As I said before, creating standards is more a
>political agenda than a technical one. The technical bit is always easy. 
>
>  
>
>>	a.  Some similar or compatible definition and storage of reference
>>data is likely?
>>	b.  Matching rules, algorithms and logic compatible with very finely
>>grained postal elements are adequately available - especially for 
>>international addresses?
>>	c.  Delivery of such finely grained address elements will be provided
>>by postal authorities out of their reference DBs and/or address 
>>correction vendors in address parsing and matching using those 
>>reference DBs?
>>    
>>
>
>  
>
>>	Note that BOTH address correction vendors AND postal authorities (or
>>other address data providers) often view such detailed data and 
>>facilities as their "crown jewels" - for which they either want one to
>>pay dearly and accept stringent licensing restrictions, OR have 
>>decided to restrict from general availability at all!
>>    
>>
>
>This is where we differ. CIQ standards do not carry any IPRs, royalties,
>or license fees to use. By contrast, to use the address samples from
>UPU, we were asked to pay a fee! Anyone can contribute to CIQ and watch
>everything happening in CIQ without being a member. But this is not
>the case with other bodies like UPU. CIQ is more oriented towards users
>who use name and address for various purposes such as customer
>identification, customer views, registration, profiling, etc. and not
>just for postal services. This is in direct contrast to UPU or USPS, who
>come from a postal services point of view.  
>
>  
>
>>If you or your committee have approached or considered these issues, I
>>would very much like to know.  That would make the adoption of some 
>>very segmented / "normalized" address schema so much more likely and 
>>potentially beneficial.
>>    
>>
>
>  
>
>>Don't get me wrong, I would love to have the option of fully and
>>deterministically processing addresses into their constituent parts
>>when justified (and storing and transporting them that way, if not
>>presenting them to end users in that manner).
>>    
>>
>
>We need experts like you to contribute to the CIQ effort. It is hard to
>find experts in name and address data which to me, is a niche area.
>
>  
>
>>I agree that "getting something out there" to help move this along 
>>would be good; I am just concerned that one could incur some 
>>significant costs, lose some potential benefits, OR never have a very 
>>good set of schemas used if the above issues have not been approached 
>>and don't have some likely answers.
>>    
>>
>
>Regards,
>
>Ram
>Chair, CIQ TC
>  
>
>>Thank you,
>>David Putman
>>
>>    
>>
>