ciq message

Subject: FW: Address Schema Questions
From: "Ram Kumar" <RKumar@msi.com.au>
To: <ciq@lists.oasis-open.org>
Date: Mon, 15 Nov 2004 17:03:48 +1100
From: Ram Kumar 
Sent: Friday, November 12, 2004 1:30 PM
To: 'John D. Putman'
Cc: Ram Kumar
Subject: RE: Address Schema Questions

Dear John,

Thank you for your detailed email. I have attempted to answer your
questions to the best of my knowledge.

> 
> Having worked on address correction since 1993, I periodically review 
> the "state of the art".  I recently came across some of your 
> committee's work in the area of defining XML address structures.  I 
> myself have made several "stabs"
> at various address holding structures in several different mediums 
> (IMS, RDBS, transactional, XML, etc.).  I have not only analyzed and 
> worked with "end-users" on those BUT ALSO with address correction 
> vendors / facilities AND address using applications.  I am also in the
> process of ATTEMPTING to analyze international addresses (the "parts" 
> of which they are composed, what those parts might mean, which are 
> most significant / their hierarchy, etc.).
> 
> While I have yet to study IN DETAIL all of your proposed schemas, I 
> have several operational and implementation questions that you or your
> committee may have pondered -
> 
> 1.  Even if a completely comprehensive and consistent set of schemas 
> can be defined (leaving aside the messy "real world"
> of addresses - especially internationally AND ongoing changes in 
> that!), how would one populate those schema?  Or is that something to 
> HOPE that the producers of address reference databases will do (and so
> be able to deliver such sub-divisions back for use)?

XML Schemas defined by CIQ TC are to provide a consistent way of
defining the metadata for addresses. There are tools in the market that
can populate the address data into XML format that is validated against
the schemas. This is the case with any XML usage. Formatting the data
into XML format and retrieving the data from XML format is the work of
the end users of the schemas.

> 
> 2.  Wouldn't schema population require one or all of the following -
> 	a.  Reference data that is fully segmented into all relevant and
> possible parts / subparts of an address relative to what is defined in
> the schema.

I am not sure what you mean by reference data. Any schema requires the
necessary data to be represented in an XML format that is validated
against the schema. Whether the data is to be fully segmented or not
depends on the rules of the schema. For example, in CIQ, address data
can be represented either as, say, address line 1, address line 2, ...
or fully segmented into, say, country, region, state, postcode, street
name, street number, etc. It is the choice of the end users how the
schema should be used.
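
As a rough illustration of the two options, here is a minimal Python
sketch that builds the same made-up address both ways. The element
names (Address, AddressLine, StreetNumber, etc.) and the sample address
are placeholders chosen for this example only; they are NOT taken from
the CIQ schemas.

```python
import xml.etree.ElementTree as ET

# Element names below are illustrative placeholders, NOT actual CIQ names.

def as_address_lines(lines):
    """Build an unsegmented address: one element per free-text line."""
    root = ET.Element("Address")
    for text in lines:
        ET.SubElement(root, "AddressLine").text = text
    return root

def as_segmented(parts):
    """Build a fully segmented address: one element per named component."""
    root = ET.Element("Address")
    for name, value in parts.items():
        ET.SubElement(root, name).text = value
    return root

# The same made-up address, kept abstract and then broken into parts.
loose = as_address_lines(["12 Example Street", "Sampletown NSW 2000"])
fine = as_segmented({
    "StreetNumber": "12",
    "StreetName": "Example Street",
    "Locality": "Sampletown",
    "State": "NSW",
    "Postcode": "2000",
    "Country": "Australia",
})

print(ET.tostring(loose, encoding="unicode"))
print(ET.tostring(fine, encoding="unicode"))
```

Either document could then be validated against whichever schema
variant the end user has chosen.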

> 	b.  Parsing and matching facilities that can adequately recognize
> addresses not submitted in a fully schema normalized format / parts.

Parsing is definitely required to break the address into atomic
components.
I do not know why you require matching to transform the address data
into XML format that is validated against the schema.
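
To show what "breaking the address into atomic components" means at
its very simplest, here is a toy Python sketch. It assumes a single
fixed layout ("<number> <street name>, <locality> <STATE> <postcode>")
and a made-up sample address; the function name and the pattern are
hypothetical, and a real parsing engine has to cope with far more
variation than this.

```python
import re

# Toy pattern for ONE assumed layout only:
#   "<number> <street name>, <locality> <STATE> <4-digit postcode>"
PATTERN = re.compile(
    r"^(?P<street_number>\d+)\s+(?P<street_name>[^,]+),\s*"
    r"(?P<locality>.+?)\s+(?P<state>[A-Z]{2,3})\s+(?P<postcode>\d{4})$"
)

def parse_address(raw):
    """Return a dict of components, or None if the layout does not fit."""
    match = PATTERN.match(raw.strip())
    return match.groupdict() if match else None

parsed = parse_address("12 Example Street, Sampletown NSW 2000")
unparsed = parse_address("PO Box 7")  # does not fit the assumed layout
print(parsed)
print(unparsed)
```

Note that no matching against reference data is involved here; the
parsed components can go straight into the XML structure.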

> 	c.  Delivery of fully parsed address data back as either a pre-match
> provisional structure (though matching does often "fix" "raw" parsing 
> errors!), or as a structure developed in (successful) address matching
> to reference data (unfortunately neither of these is something many 
> vendors do OR they do NOT expose it - especially not for international
> addresses).

I am not sure whether I understood your question here. Once address
elements are defined in an XML structure, reconstructing the address
into the required format is not difficult.

> 	Note that it is highly unlikely that users and applications, with their
> vast stores of address information, will very soon or EVER take either
> the trouble or time to restructure their addresses into fully parsed 
> and finely defined parts - EVEN IF postal authorities do so, someday 
> and globally, for their reference databases.  This means that having 
> the above facilities available AND applied will probably be necessary 
> to consistently and adequately populate very finely grained address 
> element schemas!

Totally agree. But the objective of CIQ is to provide that option too. I
have 15 years of experience working in data quality and, in particular,
name and address data. Many organisations break address structures into
individual components to enable efficient matching. This is very much
applicable in postal address certification processes, and a classic
example of this is Australia Post's Address Matching Approval System
program. So, the objective of CIQ TC is to be "application independent
and global". In this way, we cover any type of application (breaking
address data into atomic components OR keeping it at an abstract
level). Please note that there is no compulsion in CIQ to break address
data into atomic elements.

> 
> 3.  Even once segmented into all possible and appropriate "parts", 
> won't one have to further define address template schemas (maybe one 
> for each country and perhaps several for some countries that have 
> different languages or address formats - take India for one!)?  That 
> is, won't that be necessary to be able to properly reconstruct the 
> address data into a human readable, postally valid, and usable 
> "address block"?

Agreed. Parsing address data is the most difficult bit. Once it is
parsed, it means you understand the address structure. Then, templates
are needed to reconstruct the address elements. This is outside the
scope of CIQ TC. 
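
Purely as an illustration of what such templates could look like, here
is a small Python sketch. The country codes, template layouts and
sample data are simplified assumptions made up for this example; they
are not postal-authority formatting rules and are not part of CIQ.

```python
# Hypothetical per-country templates: each entry is a list of output
# lines, with placeholders named after the parsed address components.
TEMPLATES = {
    "AU": ["{street_number} {street_name}", "{locality} {state} {postcode}"],
    "DE": ["{street_name} {street_number}", "{postcode} {locality}"],
}

def format_address_block(parts, country_code):
    """Rebuild a human-readable address block from parsed components."""
    lines = [line.format(**parts) for line in TEMPLATES[country_code]]
    return "\n".join(lines)

au_block = format_address_block(
    {"street_number": "12", "street_name": "Example Street",
     "locality": "Sampletown", "state": "NSW", "postcode": "2000"},
    "AU",
)
de_block = format_address_block(
    {"street_number": "5", "street_name": "Beispielstrasse",
     "locality": "Musterstadt", "postcode": "12345"},
    "DE",
)
print(au_block)
print(de_block)
```

The same parsed components produce a different block per country,
which is why the hard part is the parsing and not the reconstruction.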

> 	Note that I have discussed and worked with some users on address
> entry by constituent parts; but that has never been accepted.  Not all
> users can adequately understand (or be taught) how to do so, OR they 
> refuse (for good operational and efficiency reasons) to take the time 
> to ATTEMPT to enter addresses in that way (even if they DO understand 
> how to).
> This is somewhat similar to diagramming complex language structures 
> into their constituent parts - not everyone can do it and FEW want to.

Agreed. Again, please note that breaking an address into parts is
entirely OPTIONAL in CIQ. When you look at the CIQ schemas carefully,
you will note that they provide options either to break an address into
parts or not to break it. In V3.0 of CIQ, which we are working on at the
moment, we provide two versions of the address schemas: one that is not
broken into atomic elements and one that is.

> 
> 4.  Won't one actually HAVE TO ATTEMPT to "parse" (and
> understand!) various address-related language elements (abbreviations,
> phrases, even single symbols in, for instance, ideogrammatic languages
> like Kanji) so as to break them up into their relevant address schema 
> parts?
> 	Note that as part of an address, certain "common" 
> language elements can take on highly specialized or even different 
> meanings / usages from their "common" ones!

Agreed. Good parsing engines do a very good job in parsing complex
addresses.

> 
> 5.  Are you aware that SOMETIMES SOME matching algorithms work better 
> on the "string as a whole" than they do on constituent parts?  There 
> is a fine line in address correction that often requires partial or 
> whole string matching to "get the best match" OR prevent false 
> positives.
> This is often due to "special words" like "South" or "Circle" 
> being used for BOTH sub-portions of street names (directionals, street
> types) AND as primary street names themselves.

Agreed. 
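
To make the "South" / "Circle" point concrete, here is a small Python
sketch. The word lists, the naive splitting rule and the reference
record are all assumptions invented for this example; they do not
describe any vendor's matching logic.

```python
# Assumed word lists for the sketch; real reference data is far larger.
DIRECTIONALS = {"North", "South", "East", "West"}
STREET_TYPES = {"Street", "Circle", "Road", "Avenue"}

def naive_split(street):
    """Label tokens naively: leading directional, trailing type, rest is name."""
    tokens = street.split()
    directional = tokens.pop(0) if tokens and tokens[0] in DIRECTIONALS else ""
    street_type = tokens.pop() if tokens and tokens[-1] in STREET_TYPES else ""
    return {"directional": directional, "name": " ".join(tokens), "type": street_type}

# Assumed reference record in which "South" is the primary street name.
reference = {"directional": "", "name": "South", "type": "Circle"}
candidate = "South Circle"

# Component-wise comparison fails: "South" was mislabelled as a directional.
parts_match = naive_split(candidate) == reference
# Whole-string comparison succeeds, because no labelling is needed.
whole_match = candidate == " ".join(v for v in reference.values() if v)

print(parts_match, whole_match)
```

Here the component comparison misses a record that the whole-string
comparison finds, which is the kind of case you describe.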

> 	Note that, for very "well behaved" address structures, some such
> address element segmentation already occurs in vendor reference DBs.  
> See some of those for non-Puerto Rican USA addresses for instance.  
> However, where address structures are more variable or less "well 
> behaved", that is often not attempted or only attempted partially.  
> Puerto Rican addresses are, in fact, a mild example of that and 
> continue to give address correction vendors "fits".
> 		Be that as it may, the point is that the more finely grained address
> element storage is, the more processing cost there will be for putting
> addresses back together into partial or complete strings for matching 
> (when that needs to occur) - or, for that matter, into "human usable" 
> form.  Address correction vendors are VERY concerned about 
> performance, as their products are often required to process millions 
> OR hundreds of millions of addresses in a relatively short time (a few
> days if not less than a day).
> Even on a per-transaction basis, practically negligible
> (sub-second) response time is required by applications and users.
> Consequently, both address correction vendors and their users are 
> highly resistant to performance "hits".  Similarly, since postal 
> authorities generally count on those vendors to help them get better 
> addresses into their operations, the postal authorities share that 
> performance concern.

Agreed. Because CIQ provides the option of either breaking addresses
into atomic elements or keeping them at an abstract level, users have
the choice of what to do. The beauty of CIQ is that it covers both
extremes, while other standards do not.

> 
> 6.  On a practical note, is there any coordination with postal 
> authorities (USPS, Canada Poste, UPU, etc.) AND address correction 
> vendors so that

We are a data quality vendor and we have sophisticated name and address
parsing engines for North America, Canada, Australia and other
countries. Moreover, in the latest V3.0 specs that we are working on, we
have taken all the UPU addresses and tested them against the schema.
Standards are more a political issue than a technical one. UPU, USPS,
UK Post, UN/CEFACT, etc. are aware of our work, and we are open to
liaisons with them and they know it. Whether they want to work with us
is a decision they have to make, as we have approached all these bodies
to work with us. As I said before, creating standards is more a
political agenda than a technical one. The technical bit is always easy.

> 	a.  Some similar or compatible definition and storage of reference
> data is likely?
> 	b.  Matching rules, algorithms and logic compatible with very finely
> grained postal elements are adequately available - especially for 
> international addresses?
> 	c.  Delivery of such finely grained address elements will be provided
> by postal authorities out of their reference DBs and/or address 
> correction vendors in address parsing and matching using those 
> reference DBs?

> 	Note that BOTH address correction vendors AND postal authorities (or
> other address data providers) often view such detailed data and 
> facilities as their "crown jewels" - for which they either want one to
> pay dearly and accept stringent licensing restrictions, OR have 
> decided to restrict from general availability at all!

This is where we differ. CIQ standards do not have any IPRs, royalties,
or license fees to use. By contrast, to use the address samples from
UPU, we were asked to pay a fee! Anyone can contribute to CIQ and watch
everything happening in CIQ without being a member. But this is not
the case with other bodies like UPU. CIQ is more oriented towards users
who use name and address data for various purposes such as customer
identification, customer views, registration, profiling, etc., and not
just for postal services. This is in direct contrast to UPU or USPS, who
come from a postal services point of view.

> 
> If you or your committee have approached or considered these issues, I
> would very much like to know.  That would make the adoption of some 
> very segmented / "normalized" address schema so much more likely and 
> potentially beneficial.

> Don't get me wrong, I would love to have the option of fully and
> deterministically processing addresses into their constituent parts
> when justified (and storing and transporting them that way, if not
> presenting them to end users in that manner).

We need experts like you to contribute to the CIQ effort. It is hard to
find experts in name and address data, which to me is a niche area.

> I agree that "getting something out there" to help move this along
> would be good; I am just concerned that one could incur some
> significant costs, lose some potential benefits, OR never have a very
> good set of schemas used if the above issues have not been approached
> and don't have some likely answers.

Regards,

Ram
Chair, CIQ TC
> 
> Thank you,
> David Putman
>