ciq message

Subject: RE: Address Schema Questions
From: "Ram Kumar" <RKumar@msi.com.au>
To: "John D. Putman" <jdputman@scanningtech.fedex.com>
Date: Thu, 18 Nov 2004 08:54:54 +1100
Hi John,

Since you have become an individual member, please
subscribe to the CIQ list (details in the CIQ page).
Once you have done so, you will automatically start
receiving all the email discussions we have in our TC.

Regards,

Ram

Ram Kumar
General Manager
Software R&D and Architecture
MSI BUSINESS SYSTEMS
Suite 204A, 244 Beecroft Road
Epping, NSW 2121, Australia
Direct: +61-2-9815 0226
Mobile: +61-412 758 025
Fax: +61-2-98150200
URL: www.msi.com.au 

 

> -----Original Message-----
> From: John D. Putman [mailto:jdputman@scanningtech.fedex.com] 
> Sent: Thursday, November 18, 2004 5:32 AM
> To: Ram Kumar; John D. Putman
> Subject: RE: Address Schema Questions
> 
> I have taken an individual membership.  Getting anything 
> through corporate procedures would have taken too long 
> otherwise.  I will have some discussion with our XML working 
> group; so they might opt for a corporate membership ...
> sometime.
> 
> For now, I am still working on the international address 
> examples analysis.
> Hopefully, I will have some very detailed breakdown of 
> possible individual elements by country by the end of this 
> month (will include what I know about USA and Canada 
> already).  I may also produce some stats relative to how many 
> countries appear to use those, where elements appear in 
> address lines, etc.
> 
> Then, I will see how those might relate to existing OASIS / 
> CIQ XML elements.  I am purposely holding those at a distance 
> to try and keep the current analysis unaffected / "blind" to 
> those - sometimes one "sees more"
> if one looks at things in isolation / without some other 
> organization structure supposed or affecting that effort.
> 
> Thank you,
> David Putman
> 
> -----Original Message-----
> From: Ram Kumar [mailto:RKumar@msi.com.au]
> Sent: Monday, November 15, 2004 12:03 AM
> To: John D. Putman
> Subject: RE: Address Schema Questions
> 
> 
> John,
> 
> > 
> > o  I would be pleased to participate in your committee's 
> efforts, as I 
> > believe there could be mutual benefits in so doing - 
> especially with 
> > regard to international address standards (see Details for some 
> > general points on which we might work and for an immediate and 
> > specific instance / question - aspects of which could affect a 
> > particular schema field's use and definition).
> 
> THIS IS GREAT.
> 
> > 
> > Details -
> > 
> > Thanks for the reply!  Overall, I understand and can see in your 
> > schema definitions that a great deal of necessary latitude is 
> > provided.  So, depending on the data available (very 
> general or very 
> > finely grained), the schemas can handle that.  The concern 
> / questions 
> > were directed to practical matters of how the finely 
> grained address 
> > element breakup could be populated and so adequately 
> leveraged.  Yes, 
> > that is probably largely outside the scope of the schema 
> definitions; 
> > but again, if it is not possible, then the schema may never 
> be used.  
> > I would like to see that made possible.
> 
> The objective is to provide holders for storing all the atomic
> elements of addresses. We cannot guarantee that everything is
> covered in the schema. It evolves over time as we come across
> issues users have with the schema.
> 
> > 
> > PARSING VERSUS MATCHING TO REFERENCE DATA
> > 
> > Address parsing generally depends on some heuristic rules 
> relative to 
> > a particular address format and address parts representation.  That 
> > parsing is also sometimes dependent on language recognition (see 
> > Canada where address parts depend on whether the address is 
> in English 
> > or French).  Often the parsing is facilitated by static 
> (but sometimes 
> > updated) lists of terms so that address parts can be 
> recognized.  For 
> > instance, this can be lists of whole words, abbreviations, 
> or symbols 
> > that are GENERALLY used for or in association with a 
> particular part 
> > of an address - say directionals or street types for instance.
> > 
> > However, such "raw" parsing can and does make errors.  
> > Consider the address "10 S Rayburn Ave".  A parsing engine will 
> > generally ASSUME that the "S" is a pre-directional for 
> "South", which 
> > in most cases IS a correct assumption.
> > HOWEVER, "S" may be the initial for "Sam" in the street name "S 
> > Rayburn"
> > (Sam Rayburn was a Texas legislator and longtime Speaker of the
> > United States House of Representatives, for whom streets, reservoirs, etc.
> > have been named).  It is matching to address reference data, which 
> > contains "S Rayburn" as the street name rather than "S" as a 
> > pre-directional and "Rayburn" as the street name, that will 
> allow one 
> > to determine this difference.  Without that match-made correction / 
> > determination (ALONG with delivery by the matching engine of the 
> > address parts identified!), one could take a parsing engine's 
> > non-match-informed split up and incorrectly place the "S" in a 
> > pre-directional field and only "Rayburn" in the street name field.
> 
> AGREED. AGAIN, THIS IS WHERE THE USER OF THE SCHEMA SHOULD 
> MAKE SURE HOW IT IS REPRESENTED. CIQ CAN ONLY DOCUMENT THE 
> ISSUES USERS MIGHT COME ACROSS AND HOW TO OVERCOME THEM.
> 
> > 
> > Note that this is a case where whole string matching 
> against address 
> > reference data can "save the day".  There may NOT be a street named 
> > just "Rayburn".  Of course, putting the "S"
> > back together with the "Rayburn" in a human usable address 
> block would 
> > make this distinction moot FOR human usability (though NOT 
> for how the 
> > address was populated into some schema).
> > BUT I HAVE seen examples where that would NOT be the case!
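
The "10 S Rayburn Ave" case discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's parsing or matching engine: the term lists, the REFERENCE_STREET_NAMES set, and both function names are hypothetical, and real reference data would be keyed by locality and postal code as well.

```python
# Hypothetical term lists of the kind a heuristic parser relies on.
DIRECTIONALS = {"N", "S", "E", "W"}
STREET_TYPES = {"AVE", "ST", "RD", "LN", "BLVD"}

# Hypothetical reference data: street names known to exist.
REFERENCE_STREET_NAMES = {"S RAYBURN", "MAIN"}


def raw_parse(address):
    """Heuristic split into number, pre-directional, street name, street type."""
    tokens = address.upper().split()
    parts = {"number": tokens[0], "pre_directional": "",
             "street_name": "", "street_type": ""}
    rest = tokens[1:]
    if rest and rest[-1] in STREET_TYPES:
        parts["street_type"] = rest[-1]
        rest = rest[:-1]
    # ASSUME a leading single-letter directional is a pre-directional.
    if len(rest) > 1 and rest[0] in DIRECTIONALS:
        parts["pre_directional"] = rest[0]
        rest = rest[1:]
    parts["street_name"] = " ".join(rest)
    return parts


def match_informed_parse(address):
    """Correct the raw split when the reference data knows the joined name."""
    parts = raw_parse(address)
    joined = (parts["pre_directional"] + " " + parts["street_name"]).strip()
    if parts["pre_directional"] and joined in REFERENCE_STREET_NAMES:
        parts["street_name"] = joined
        parts["pre_directional"] = ""
    return parts
```

With this toy data, raw_parse("10 S Rayburn Ave") places "S" in the pre-directional field, while match_informed_parse keeps "S RAYBURN" together as the street name. An address such as "10 S Main Ave" is left as the raw parser split it, because "S MAIN" is not in the reference set.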
> 
> > 
> > On another note, "real world" addresses sometimes contain 
> extraneous 
> > information.  This can be included by users by mistake or for other 
> > purposes.  For instance, a user may include "directions" as 
> part of an 
> > address - even though those directions might NOT be needed 
> for address 
> > uniqueness determination and deliverability.  As a consequence, the 
> > user supplied address data might look something like -
> > 	10 Shady Lane
> > 	1 km south of Farm Rd 10 and Hwy 3
> > 	Anytown, Anycountry PostalCode
> > 
> > But suppose there is ALSO a "1 Farm Rd" and/or a "10 Hwy 3"!  
> > It is SUCCESSFUL matching to the address reference data that allows 
> > one to get a unique and correct address - in this instance, say a 
> > match to "10 Shady Lane" (determined by address format precedence 
> > order, lack of intervening extraneous data - the "km south 
> of" and the 
> > "and", and/or the postal code - "10 Shady Lane" being 
> associated with 
> > the postal code provided and the other possible addresses NOT).
> > A "raw" parser can become quite confused with such 
> representations and 
> > return any or all or some mixture of address "parts" for 
> the supplied 
> > address.  While precedence order considerations might "help"
> > the parser ONLY populate address fields with "10 Shady 
> Lane", I HAVE 
> > SEEN addresses where the directions come "first"; AND that 
> might even 
> > be the case here if the parser analyzes the address from 
> "bottom up"!
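
The "10 Shady Lane" example above can be sketched as follows. The sketch assumes a tiny, invented REFERENCE table and invented postal codes; the function names are hypothetical. It shows how a loose reading of the lines yields several plausible addresses and how the postal code in the reference data narrows them to one.

```python
# Hypothetical reference records: (house number, street, postal code).
REFERENCE = [
    ("10", "SHADY LANE", "12345"),
    ("1", "FARM RD", "12399"),
    ("10", "HWY 3", "12388"),
]


def loose_candidates(lines):
    """Return every reference record whose number and street both occur
    somewhere in the same address line, ignoring intervening words."""
    found = []
    for line in lines:
        norm = " ".join(line.upper().split())
        tokens = norm.split()
        for number, street, code in REFERENCE:
            if number in tokens and f" {street} " in f" {norm} ":
                found.append((number, street, code))
    return found


def resolve(lines, postal_code):
    """Keep only the candidates that the reference data ties to the postal code."""
    return [c for c in loose_candidates(lines) if c[2] == postal_code]


lines = ["10 Shady Lane", "1 km south of Farm Rd 10 and Hwy 3"]
```

Here loose_candidates(lines) returns all three records, which is the confusion a "raw" parser faces, while resolve(lines, "12345") leaves only the "10 Shady Lane" record. Precedence order and the intervening words ("km south of", "and") could be used as further evidence but are left out of this sketch.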
> > 
> > Perhaps the CIQ schema might attempt to take a "neutral" 
> > stance on such address reference data determined "extraneous" 
> > information.  After all, in the absence of reference data and 
> > matching, one might want to or have to preserve that (no 
> way to know 
> > what to "drop"!).  If so, have you considered how that might be 
> > attempted / allowed within the schema definitions (allow multiple 
> > address elements of the same type? a "directions" or "extra 
> > information" field?)?  Note that a similar situation can occur for 
> > company or organization internal "mail stop" information, which is 
> > often included with the more standard address elements and 
> can just be 
> > a string of characters with no identifying "type"
> > information (NO "Mail Stop" to go along with "XYZ-44/8").  
> > But oddly enough, for a certain type of postal address in the USA, 
> > ONLY that mail stop string and a postal code may be 
> provided (nothing 
> > more is required even for matching!).
> > 
> 
> WE HAVE FIELDS FOR UNPARSED DATA THAT CAN BE USED TO 
> REPRESENT THESE TYPES OF ADDRESS LINES. THE FIELD PROVIDES A 
> "TYPE" ATTRIBUTE THAT IS USED TO DEFINE WHAT TYPE OF UNPARSED 
> DATA IT IS.
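
As a rough illustration of the reply above, the following Python sketch writes unparsed lines into XML with a "Type" attribute on each line. The element names AddressLines and AddressLine are modeled on the xAL unparsed address line, but they are given here as an assumption and without namespaces; the Type values "Directions" and "MailStop" are invented for the example and are not taken from the schema.

```python
import xml.etree.ElementTree as ET


def build_address_lines(lines):
    """lines is a list of (type_label, text) pairs; returns an XML string.

    Element and attribute names are assumptions modeled on xAL; check the
    schema version in use before relying on them.
    """
    root = ET.Element("AddressLines")
    for type_label, text in lines:
        element = ET.SubElement(root, "AddressLine")
        if type_label:
            # The label says what kind of unparsed data the line holds.
            element.set("Type", type_label)
        element.text = text
    return ET.tostring(root, encoding="unicode")


xml_text = build_address_lines([
    ("Directions", "1 km south of Farm Rd 10 and Hwy 3"),
    ("MailStop", "XYZ-44/8"),
])
```

The point of the design is that nothing has to be dropped when no reference data is available: the extra text is preserved as given, and the label records how it was understood.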
> 
> > RECOGNITION OF OASIS AS AN OPEN STANDARD
> > 
> > I did not mean to imply that OASIS was not an open 
> standards effort, 
> > or royalty charges might be levied in association with it.  What I 
> > meant to convey was that the organizations who control address 
> > reference data and matching facilities could be critical to proper 
> > schema population.
> > Hopefully, the reason for this is clearer based on the 
> points I made 
> > above with regard to improvements / corrections afforded by 
> matching 
> > to address reference data OVER "raw"
> > address parsing.
> > 
> > CIQ PARTICIPATION / MUTUAL BENEFITS
> > 
> > As for me being an "expert", I will not lay claim to that title - 
> > rather I will just say that I am brutally experienced.  I would be 
> > pleased to apply that experience to the CIQ effort.  I 
> believe I could 
> > justify that to my management (as they do largely determine 
> the amount 
> > of time I can apply to such efforts) on the basis of mutual 
> benefits.
> > From my side, some sharing of international address 
> research would be 
> > beneficial; I am more highly experienced in USA address 
> data.  When I 
> > have completed an analysis of international address data examples I 
> > can find, perhaps we could review that with regard to what your 
> > organization has discovered.
> 
> THIS IS GOOD. YES, YOU ARE WELCOME TO JOIN THE TC. CHECK WITH 
> YOUR ORGANISATION WHETHER FEDEX IS A MEMBER OF OASIS. THERE 
> ARE TWO OPTIONS TO BECOME A MEMBER: 1. FEDEX BECOMING A 
> MEMBER OF OASIS, OR 2. YOU BECOMING AN INDIVIDUAL MEMBER. GO 
> THROUGH THE OASIS SITE TO SEE HOW TO JOIN OASIS.
> 
> > 
> > But immediately, a particular case for some international addresses 
> > might be considered.  I have noticed that several countries 
> use what 
> > is elsewhere used as a temperature or geographical degrees symbol / 
> > superscript ("o"
> > above the character center line either before or after a "number", 
> > with or without an intervening space!).  I have noticed that symbol 
> > being used apparently in association with floors or 
> apartments.  I am 
> > ASSUMING that that symbol actually means "number" in the context of 
> > that addressing, as in "apartment number" or "floor 
> number".  Can you 
> > confirm that assumption?
> 
> YES. EXAMPLE:
> 
> SBN - Quadra 13 - Bloco B - 8º andar 		 
> BRASILIA-DF 					 
> 70002-900
> BRAZIL
> 
> HERE, 8 is the number and andar is floor. 
> xAL can store this "o" also.
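
A small Python sketch of how an application might separate the number from the ordinal symbol and record whether the symbol comes before or after it. The dictionary keys ("value", "indicator", "occurrence") and the function names are hypothetical and are not xAL field names; the degree sign is accepted alongside the ordinal indicators on the assumption that it is often typed in their place.

```python
import re

# Masculine and feminine ordinal indicators, plus the degree sign.
MARKS = "\u00ba\u00aa\u00b0"
PATTERN = re.compile(
    rf"^\s*(?P<before>[{MARKS}])?\s*(?P<value>[0-9A-Za-z]+)\s*(?P<after>[{MARKS}])?\s*$"
)


def split_identifier(text):
    """Split a floor or apartment token into value, indicator and position."""
    m = PATTERN.match(text)
    if m is None:
        return {"value": text.strip(), "indicator": "", "occurrence": ""}
    if m.group("before"):
        return {"value": m.group("value"), "indicator": m.group("before"),
                "occurrence": "Before"}
    if m.group("after"):
        return {"value": m.group("value"), "indicator": m.group("after"),
                "occurrence": "After"}
    return {"value": m.group("value"), "indicator": "", "occurrence": ""}


def join_identifier(parts):
    """Rebuild the printed form from the stored parts."""
    if parts["occurrence"] == "Before":
        return parts["indicator"] + parts["value"]
    return parts["value"] + parts["indicator"]
```

For the Brazilian example, split_identifier applied to the token before "andar" gives the value "8" with the indicator stored separately and the occurrence "After", and join_identifier restores the printed form. An alphanumeric identifier such as "3B" passes through with no indicator.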
>  
> > 
> > As for the effect of this particular case on schema definition, one 
> > might consider whether -
> > 	+  The CONJOINED number and superscript representation 
> should be 
> > placed in some unit / subpremise "number" field.
> 
> yes, we do this.
> 
> > 	+  Some metadata rule or indication somehow "linked" to 
> an address 
> > format or country should "clue" a schema user into the fact 
> that that 
> > superscript should be added (and where /
> > how) to a "raw" number, IF only a "raw" number is stored in 
> the field.
> 
> This is up to the application that uses CIQ. CIQ will 
> provide the fields to store the value ("o") and the location 
> of the value relative to the number (e.g. before or after).
> 
> > 		NOTE : For the USA, a number without a unit or 
> "subpremise"
> > type (APT, STE, etc.), AND where that number cannot be matched to 
> > reference data (there is no Unit-type associated with the number 
> > provided in the address reference data), is given the "#" sign 
> > (meaning "number") for unit type.  A "default" record in 
> the reference 
> > data specifically allows non-matched unit data for high-rise / 
> > multi-unit edifices.
> > This is done because it is recognized that not all 
> subpremises will be 
> > known and included in reference data - as a practical matter, 
> > subpremises can "come and  go"
> > faster than reference data can be updated.
> 
> 
> > 	+  Whether the field should be termed a "number" at all 
> might be 
> > questioned, since some "numbers" for floors or apartments 
> can be alpha 
> > characters or a mixture of those and numbers ("Apartment 3B") - see 
> > "SubPremiseNumber" in the xAL schema terminology.  I tend to try to 
> > say Unit type (APT / apartment, STE / suite, etc.) and Unit 
> identifier 
> > (rather than "number"), as this helps prevent any 
> misunderstanding or 
> > WRONG typed data assumption.
> 
> SubPremise is valid for most countries, and we keep it that 
> way because "Unit" is a very American term. Regarding 
> "number", we call it "Identifier" in the draft of version 3.0 
> of xAL that we are now working on.
> 
> Looking forward to hearing from you regarding joining CIQ TC.
> 
> Regards,
> 
> Ram
> > 
> > Thank you,
> > David Putman
> > 
> > -----Original Message-----
> > From: Ram Kumar [mailto:RKumar@msi.com.au]
> > Sent: Thursday, November 11, 2004 8:30 PM
> > To: John D. Putman
> > Cc: Ram Kumar
> > Subject: RE: Address Schema Questions
> > 
> > 
> > Dear John,
> > 
> > Thank you for your detailed email. I have attempted to answer your 
> > questions to the best of my knowledge.
> > 
> > > 
> > > Having worked on address correction since 1993, I
> > periodically review
> > > the "state of the art".  I recently came across some of your 
> > > committee's work in the area of defining XML address 
> structures.  I 
> > > myself have made several "stabs"
> > > at various address holding structures in several 
> different mediums 
> > > (IMS, RDBS, transactional, XML, etc.).  I have not only
> > analyzed and
> > > worked with "end-users" on those BUT ALSO with address correction 
> > > vendors / facilities AND address using applications.  I am
> > also in the
> > > process of ATTEMPTING to analyze international addresses
> > (the "parts" 
> > > of which they are composed, what those parts might mean, 
> which are 
> > > most significant / their hierarchy, etc.).
> > > 
> > > While I have yet to study IN DETAIL all of your proposed 
> schemas, I 
> > > have several operational and implementation questions that
> > you or your
> > > committee may have pondered -
> > > 
> > > 1.  Even if a completely comprehensive and consistent set
> > of schemas
> > > can be defined (leaving aside the messy "real world"
> > > of addresses - especially internationally AND ongoing changes in 
> > > that!), how would one populate those schema?  Or is that
> > something to
> > > HOPE that the producers of address reference databases will
> > do (and so
> > > be able to deliver such sub-divisions back for use)?
> > 
> > XML Schemas defined by CIQ TC are to provide a consistent way of 
> > defining the metadata for addresses. There are tools in the market 
> > that can populate the address data into XML format that is 
> validated 
> > against the schemas. This is the case with any XML usage. 
> Formatting 
> > the data into XML format and retrieving the data from XML format is 
> > the work of the end users of the schemas.
> > 
> > > 
> > > 2.  Wouldn't schema population require one or all of the 
> following -
> > > 	a.  Reference data that is fully segmented into all
> > relevant and
> > > possible parts / subparts of an address relative to what is
> > defined in
> > > the schema.
> > 
> > I am not sure what you mean by reference data. Any schema requires
> > the necessary data to be represented in an XML format that is
> > validated against the schema. Whether the data is to be fully
> > segmented or not depends on the rules of the schema. For example, in
> > CIQ, address data can be represented either as, say, address line 1,
> > address line 2, ... or fully segmented into, say, country, region,
> > state, postcode, street name, street number, etc.
> > It is the choice of the end users how the schema should be used.
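
The two levels of detail described in this reply can be pictured with a simple Python data structure. This is a sketch only: the class and field names are hypothetical and do not mirror the CIQ element names. It shows one record type that accepts either free-text lines or segmented parts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    # Coarse form: free-text lines stored exactly as supplied.
    lines: List[str] = field(default_factory=list)
    # Fine-grained form: every part is optional.
    country: Optional[str] = None
    region: Optional[str] = None
    postcode: Optional[str] = None
    street_name: Optional[str] = None
    street_number: Optional[str] = None

    def is_segmented(self) -> bool:
        """True when at least one fine-grained part has been populated."""
        parts = (self.country, self.region, self.postcode,
                 self.street_name, self.street_number)
        return any(p is not None for p in parts)


coarse = Address(lines=["10 Shady Lane", "Anytown, Anycountry PostalCode"])
fine = Address(country="Anycountry", postcode="PostalCode",
               street_name="Shady Lane", street_number="10")
```

Both records are valid instances of the same type; which form is populated is the choice of whoever produces the data.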
> > 
> > > 	b.  Parsing and matching facilities that can adequately
> > recognize
> > > addresses not submitted in a fully schema normalized 
> format / parts.
> > 
> > Parsing is definitely required to break the address into atomic 
> > components.
> > I do not know why you require matching to transform the 
> address data 
> > into XML format that is validated against the schema.
> > 
> > > 	c.  Delivery of fully parsed address data back as
> > either a pre-match
> > > provisional structure (though matching does often "fix" 
> > "raw" parsing
> > > errors!), or as a structure developed in (successful)
> > address matching
> > > to reference data (unfortunately neither of these is 
> something many 
> > > vendors do OR they do NOT expose it - especially not for
> > international
> > > addresses).
> > 
> > I am not sure whether I understood your question here. Once address 
> > elements are defined in an XML structure, re-construction of address 
> > structures into the required format is not difficult.
> > 
> > > 	Note that it is highly unlikely that users and
> > applications, with their
> > > vast stores of address information, will very soon or EVER
> > take either
> > > the trouble or time to restructure their addresses into
> > fully parsed
> > > and finely defined parts - EVEN IF postal authorities do
> > so, someday
> > > and globally, for their reference databases.  This means
> > that having
> > > the above facilities available AND applied will probably be
> > necessary
> > > to consistently and adequately populate very finely 
> grained address 
> > > element schemas!
> > 
> > Totally agree. But the objective of CIQ is to provide that option
> > too. I have 15 years of experience working in data quality and, in
> > particular, name and address. Many organisations break address
> > structures into individual components to enable efficient matching.
> > This is very much applicable in the postal address certification
> > process, and a classic example of this is Australia Post's Address
> > Matching Approval System Program. So, the objective of CIQ TC is to
> > be "application independent and global". In this way, we cover any
> > type of application (breaking address data into atomic components OR
> > keeping it at an abstract level). Please note that there is no
> > compulsion in CIQ to break address data into atomic elements.
> > 
> > > 
> > > 3.  Even once segmented into all possible and appropriate 
> "parts", 
> > > won't one have to further define address template schemas
> > (maybe one
> > > for each country and perhaps several for some countries that have 
> > > different languages or address formats - take India for
> > one!)?  That
> > > is, won't that be necessary to be able to properly 
> reconstruct the 
> > > address data into a human readable, postally valid, and usable 
> > > "address block"?
> > 
> > Agreed. Parsing address data is the most difficult bit. Once it is 
> > parsed, it means you understand the address structure.
> > Then, templates are needed to reconstruct the address 
> elements. This 
> > is outside the scope of CIQ TC.
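
A minimal Python sketch of the per-country template idea raised in question 3. The templates, the country keys, and the part names are illustrative assumptions only; they are not postally authoritative layouts and are not part of CIQ, which leaves reconstruction to the application.

```python
# Illustrative templates keyed by country code. Real layouts would come
# from each postal authority's formatting rules.
TEMPLATES = {
    "US": "{number} {street}\n{city}, {state} {postcode}\n{country}",
    "DE": "{street} {number}\n{postcode} {city}\n{country}",
}


def format_address(parts, country_code):
    """Rebuild a printable address block from already parsed parts."""
    return TEMPLATES[country_code].format(**parts)


us_parts = {"number": "10", "street": "Shady Lane", "city": "Anytown",
            "state": "TX", "postcode": "75001", "country": "USA"}
de_parts = {"number": "10", "street": "Schattenweg", "city": "Musterstadt",
            "postcode": "12345", "country": "GERMANY"}
```

The same parsed parts produce differently ordered blocks depending on the template chosen, which is why a country with several languages or address formats may need more than one template.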
> > 
> > > 	Note that I have discussed and worked with some users
> > on address
> > > entry by constituent parts; but that has never been
> > accepted.  Not all
> > > users can adequately understand (or be taught) how to do
> > so, OR they
> > > refuse (for good operational and efficiency reasons) to
> > take the time
> > > to ATTEMPT to enter addresses in that way (even if they DO
> > understand
> > > how to).
> > > This is somewhat similar to diagramming complex language 
> structures 
> > > into their constituent parts - not everyone can do it and
> > FEW want to.
> > 
> > Agreed. Again, please note that breaking address into parts is 
> > entirely OPTIONAL in CIQ. When you look at the CIQ schemas 
> carefully, 
> > you will note that it provides options to either break address into 
> > parts or not to break them. In V3.0 of CIQ that we are 
> working on at 
> > the moment, we provide two versions of address schemas, one that is 
> > not broken into atomic elements and the other that is broken into 
> > atomic elements.
> > 
> > > 
> > > 4.  Won't one actually HAVE TO ATTEMPT to "parse" (and
> > > understand!) various address-related language elements
> > (abbreviations,
> > > phrases, even single symbols in, for instance,
> > ideogrammatic languages
> > > like Kanji) so as to break them up into their relevant
> > address schema
> > > parts?
> > > 	Note that as part of an address, certain "common" 
> > > language elements can take on highly specialized or even 
> different 
> > > meanings / usages from their "common" ones!
> > 
> > Agreed. Good parsing engines do a very good job in parsing complex 
> > addresses.
> > 
> > > 
> > > 5.  Are you aware that SOMETIMES SOME matching algorithms
> > work better
> > > on the "string as a whole" than they do on constituent
> > parts?  There
> > > is a fine line in address correction that often requires 
> partial or 
> > > whole string matching to "get the best match" OR prevent false 
> > > positives.
> > > This is often due to "special words" like "South" or "Circle" 
> > > being used for BOTH sub-portions of street names
> > (directionals, street
> > > types) AND as primary street names themselves.
> > 
> > Agreed. 
> > 
> > > 	Note that, for very "well behaved" address structures,
> > some such
> > > address element segmentation already occurs in vendor
> > reference DBs.  
> > > See some of those for non-Puerto Rican USA addresses for 
> instance.  
> > > However, where address structures are more variable or less "well 
> > > behaved", that is often not attempted or only attempted partially.
> > > Puerto Rican addresses are, in fact, a mild example of that and 
> > > continue to give address correction vendors "fits".
> > > 		Be that as it may, the point is that the more
> > finely grained address
> > > element storage is, the more processing cost there will be
> > for putting
> > > addresses back together into partial or complete strings
> > for matching
> > > (when that needs to occur) - or, for that matter, into
> > "human usable" 
> > > form.  Address correction vendors are VERY concerned about 
> > > performance, as their products are often required to
> > process millions
> > > OR hundreds of millions of addresses in a relatively short
> > time (a few
> > > days if not less than a day).
> > > Even on a per-transaction basis, practically negligible
> > > (sub-second) response time is required by applications and users.
> > > Consequently, both address correction vendors and their users are 
> > > highly resistant to performance "hits".  Similarly, since postal 
> > > authorities generally count on those vendors to help them
> > get better
> > > addresses into their operations, the postal authorities 
> share that 
> > > performance concern.
> > 
> > Agreed. Because CIQ provides the option of either breaking addresses
> > into atomic elements or keeping them at an abstract level, users
> > have the choice of what to do. The beauty of CIQ is that it covers
> > both extremes while other standards do not.
> > 
> > > 
> > > 6.  On a practical note, is there any coordination with postal 
> > > authorities (USPS, Canada Post, UPU, etc.) AND address 
> correction 
> > > vendors so that
> > 
> > We are a data quality vendor and we have sophisticated name and
> > address parsing engines for North America, Canada, Australia and
> > other countries. Moreover, for the latest V3.0 specs that we are
> > working on, we have taken all the UPU addresses and tested them
> > against the schema. Standards are more a political issue than a
> > technical issue. UPU, USPS, UK Post, UN/CEFACT, etc. are aware of
> > our work, and we are open to liaisons with them and they know it.
> > Whether they want to work with us is the decision they have to make,
> > as we have approached all these bodies to work with us. As I said
> > before, creating standards is more a political agenda than a
> > technical one. The technical bit is always easy.
> > 
> > > 	a.  Some similar or compatible definition and storage
> > of reference
> > > data is likely?
> > > 	b.  Matching rules, algorithms and logic compatible
> > with very finely
> > > grained postal elements are adequately available - especially for 
> > > international addresses?
> > > 	c.  Delivery of such finely grained address elements
> > will be provided
> > > by postal authorities out of their reference DBs and/or address 
> > > correction vendors in address parsing and matching using those 
> > > reference DBs?
> > 
> > > 	Note that BOTH address correction vendors AND postal
> > authorities (or
> > > other address data providers) often view such detailed data and 
> > > facilities as their "crown jewels" - for which they either
> > want one to
> > > pay dearly and accept stringent licensing restrictions, OR have 
> > > decided to restrict from general availability at all!
> > 
> > This is where we differ. CIQ standards do not have any IPRs,
> > royalties, or license fees to use. For example, to use the address
> > samples from UPU, we were asked to pay a fee! Anyone can contribute
> > to CIQ and watch everything happening in CIQ without being a member.
> > But this is not the case with other bodies like UPU. CIQ is more
> > oriented towards users who use name and address for various purposes
> > such as customer identification, customer views, registration,
> > profiling, etc., and not just for postal services. This is in direct
> > contrast to UPU or USPS, who come from a postal services point of
> > view.
> > 
> > > 
> > > If you or your committee have approached or considered 
> > these issues, I 
> > > would very much like to know.  That would make the 
> adoption of some 
> > > very segmented / "normalized" address schema so much more 
> > likely and 
> > > potentially beneficial.
> > 
> > >Don't get me
> > > wrong, I would love to have the option of fully and  
> > deterministically 
> > >processing addresses into their constituent  parts when 
> > justified (and 
> > >storing and transporting them that  way, if not presenting 
> > them to end 
> > >users in that manner).
> > 
> > We need experts like you to contribute to the CIQ effort. It 
> > is hard to find experts in name and address data which to me, 
> > is a niche area.
> > 
> > >I
> > > agree that "getting something out there" to help move this  along 
> > >would be good; I am just concerned that one could incur  some 
> > >significant costs, lose some potential benefits, OR  never 
> > have a very 
> > >good set of schemas used if the above  issues have not been 
> > approached 
> > >and don't have some likely answers.
> > 
> > Regards,
> > 
> > Ram
> > Chair, CIQ TC
> > > 
> > > Thank you,
> > > David Putman
> > > 
> > 
>