ciq message



Subject: FW: Address Schema Questions


 
From: John D. Putman [mailto:jdputman@scanningtech.fedex.com] 
Sent: Saturday, November 13, 2004 5:47 AM
To: Ram Kumar; John D. Putman
Subject: RE: Address Schema Questions

Briefly -

o  I appreciate the reply and agree that CIQ address schemas afford a
wide degree of latitude for address formats and data; my main concern
was with regard to practical considerations involved in appropriate
population of the more finely grained schema, which may indeed be
largely outside of CIQ purview.

o  See details for the distinction between and import of "raw" address
parsing and address matching to address reference data.

o  I understand that your committee's efforts are "open" / without
restriction or royalties; but some synergy with address reference data
and matching facilities providers (who DO impose fees / restrictions)
could be critical to the successful application of the schemas (again
see address parsing versus address matching results covered under
Details).

o  I would be pleased to participate in your committee's efforts, as I
believe there could be mutual benefits in so doing - especially with
regard to international address standards (see Details for some general
points on which we might work and for an immediate and specific instance
/ question - aspects of which could affect a particular schema field's
use and definition).

Details - 

Thanks for the reply!  Overall, I understand and can see in your schema
definitions that a great deal of necessary latitude is provided.  So,
depending on the data available (very general or very finely grained),
the schemas can handle that.  The concern / questions were directed to
practical matters of how the finely grained address element breakup
could be populated and so adequately leveraged.  Yes, that is probably
largely outside the scope of the schema definitions; but again, if it is
not possible, then the schema may never be used.  I would like to see
that made possible.

PARSING VERSUS MATCHING TO REFERENCE DATA

Address parsing generally depends on some heuristic rules relative to a
particular address format and address parts representation.  That
parsing is also sometimes dependent on language recognition (see Canada
where address parts depend on whether the address is in English or
French).  Often the parsing is facilitated by static (but sometimes
updated) lists of terms so that address parts can be recognized.  For
instance, this can be lists of whole words, abbreviations, or symbols
that are GENERALLY used for or in association with a particular part of
an address - say directionals or street types for instance.  
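
The list-driven parsing described here can be sketched roughly as follows. This is a hypothetical toy, not any vendor's engine; the term lists and field names are invented for illustration:

```python
# Minimal sketch of heuristic, list-driven address parsing: recognize
# parts purely from static term lists, with no reference data at all.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
STREET_TYPES = {"AVE", "ST", "RD", "BLVD", "LN", "DR", "CT"}

def parse_street_line(line):
    """Split a street line into number / pre-directional / name / type
    using only the static term lists above."""
    tokens = line.upper().split()
    parts = {"number": None, "predir": None, "name": None, "type": None}
    if tokens and tokens[0].isdigit():
        parts["number"] = tokens.pop(0)
    if tokens and tokens[0] in DIRECTIONALS:
        parts["predir"] = tokens.pop(0)   # the risky GENERAL assumption
    if tokens and tokens[-1] in STREET_TYPES:
        parts["type"] = tokens.pop()
    parts["name"] = " ".join(tokens)
    return parts

print(parse_street_line("10 S Rayburn Ave"))
# The parser assumes "S" is a pre-directional -- usually right, but not
# always, as the next section shows.
```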

However, such "raw" parsing can and does make errors.  Consider the
address "10 S Rayburn Ave".  A parsing engine will generally ASSUME that
the "S" is a pre-directional for "South", which in most cases IS a
correct assumption.
HOWEVER, "S" may be the initial for "Sam" in the street name "S Rayburn"
(Sam Rayburn was a Texas legislator and longtime speaker of the United
States House of Representatives, for whom streets, reservoirs, etc. have
been named).  It is matching to address reference data, which contains
"S Rayburn" as the street name rather than "S" as a pre-directional and
"Rayburn" as the street name, that will allow one to determine this
difference.  Without that match-made correction / determination (ALONG
with delivery by the matching engine of the address parts identified!),
one could take a parsing engine's non-match-informed split up and
incorrectly place the "S" in a pre-directional field and only "Rayburn"
in the street name field.  

Note that this is a case where whole string matching against address
reference data can "save the day".  There may NOT be a street named just
"Rayburn".  Of course, putting the "S" back together with the "Rayburn"
in a human usable address block would make this distinction moot FOR
human usability (though NOT for how the address was populated into some
schema).
BUT I HAVE seen examples where that would NOT be the case!
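
The correction that matching affords over raw parsing can be sketched like so. This is purely illustrative; the reference table and lookup scheme are invented, not any real vendor's data or API:

```python
# Sketch: resolve the "S Rayburn" ambiguity by re-joining the heuristic
# split and matching the whole street string against reference data.
REFERENCE_STREETS = {
    ("S RAYBURN", None),   # street NAMED "S Rayburn", no pre-directional
    # Note: no entry for plain "RAYBURN" -- matching "saves the day".
}

def match_street(name, predir):
    """Try the heuristic split first, then fall back to folding the
    supposed pre-directional back into the street name."""
    if (name, predir) in REFERENCE_STREETS:
        return {"name": name, "predir": predir}
    if predir and (f"{predir} {name}", None) in REFERENCE_STREETS:
        return {"name": f"{predir} {name}", "predir": None}
    return None

# A raw parser split "10 S Rayburn Ave" into predir="S", name="RAYBURN";
# the match corrects that split:
print(match_street("RAYBURN", "S"))
```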

On another note, "real world" addresses sometimes contain extraneous
information.  This can be included by users by mistake or for other
purposes.  For instance, a user may include "directions" as part of an
address - even though those directions might NOT be needed for address
uniqueness determination and deliverability.  As a consequence, the user
supplied address data might look something like -
	10 Shady Lane
	1 km south of Farm Rd 10 and Hwy 3
	Anytown, Anycountry PostalCode

But suppose there is ALSO a "1 Farm Rd" and/or a "10 Hwy 3"!  It is
SUCCESSFUL matching to the address reference data that allows one to get
a unique and correct address - in this instance, say a match to "10
Shady Lane" (determined by address format precedence order, lack of
intervening extraneous data - the "km south of" and the "and", and/or
the postal code - "10 Shady Lane" being associated with the postal code
provided and the other possible addresses NOT).  A "raw" parser can
become quite confused with such representations and return any or all or
some mixture of address "parts" for the supplied address.  While
precedence order considerations might "help"
the parser ONLY populate address fields with "10 Shady Lane", I HAVE
SEEN addresses where the directions come "first"; AND that might even be
the case here if the parser analyzes the address from "bottom up"!

Perhaps the CIQ schema might attempt to take a "neutral" stance on such
address reference data determined "extraneous" information.  After all,
in the absence of reference data and matching, one might want to or have
to preserve that (no way to know what to "drop"!).  If so, have you
considered how that might be attempted / allowed within the schema
definitions (allow multiple address elements of the same type? a
"directions" or "extra information" field?)?  Note that a similar
situation can occur for company or organization internal "mail stop"
information, which is often included with the more standard address
elements and can just be a string of characters with no identifying
"type" information (NO "Mail Stop" to go along with "XYZ-44/8").  But
oddly enough, for a certain type of postal address in the USA, ONLY that
mail stop string and a postal code may be provided (nothing more is
required even for matching!).
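
One way a schema *could* stay "neutral" here is to keep matched elements typed while carrying unmatched text in repeatable free-form line elements. The element names below are illustrative only, not taken verbatim from any published xAL version:

```python
# Sketch: preserve extraneous / untypable data alongside typed elements.
import xml.etree.ElementTree as ET

addr = ET.Element("Address")
ET.SubElement(addr, "StreetNumber").text = "10"
ET.SubElement(addr, "StreetName").text = "Shady Lane"
# The "directions" line no heuristic could type -- preserved, not dropped:
ET.SubElement(addr, "AddressLine").text = "1 km south of Farm Rd 10 and Hwy 3"
# Mail-stop string with no identifying "type" information:
ET.SubElement(addr, "AddressLine").text = "XYZ-44/8"

print(ET.tostring(addr, encoding="unicode"))
```

Whether repeatable untyped lines, or a dedicated "extra information" element, is the right vehicle is exactly the question posed above.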

RECOGNITION OF OASIS AS AN OPEN STANDARD

I did not mean to imply that OASIS was not an open standards effort, or
royalty charges might be levied in association with it.  What I meant to
convey was that the organizations who control address reference data and
matching facilities could be critical to proper schema population.
Hopefully, the reason for this is clearer based on the points I made
above with regard to improvements / corrections afforded by matching to
address reference data OVER "raw" address parsing.

CIQ PARTICIPATION / MUTUAL BENEFITS

As for me being an "expert", I will not lay claim to that title - rather
I will just say that I am brutally experienced.  I would be pleased to
apply that experience to the CIQ effort.  I believe I could justify that
to my management (as they do largely determine the amount of time I can
apply to such efforts) on the basis of mutual benefits.  From my side,
some sharing of international address research would be beneficial; I am
more highly experienced in USA address data.  When I have completed an
analysis of international address data examples I can find, perhaps we
could review that with regard to what your organization has discovered.

But immediately, a particular case for some international addresses
might be considered.  I have noticed that several countries use what is
elsewhere used as a temperature or geographical degrees symbol /
superscript ("o"
above the character center line either before or after a "number", with
or without an intervening space!).  I have noticed that symbol being
used apparently in association with floors or apartments.  I am ASSUMING
that that symbol actually means "number" in the context of that
addressing, as in "apartment number" or "floor number".  Can you confirm
that assumption?  
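
If that symbol is the Unicode ordinal indicator (U+00BA masculine, U+00AA feminine), as in "3º" for a third floor in some Romance-language addresses, one normalization approach might be the sketch below. That identification is an assumption, per the question above, and the function name is invented:

```python
# Sketch: strip an ordinal-like marker from a unit identifier and keep
# a flag, so a country-specific output template can re-attach it.
ORDINALS = {"\u00ba", "\u00aa", "\u00b0"}  # º, ª, and the degree sign

def split_unit_identifier(raw):
    """Return the bare identifier and whether a marker was present."""
    stripped = "".join(ch for ch in raw if ch not in ORDINALS).strip()
    had_marker = any(ch in ORDINALS for ch in raw)
    return stripped, had_marker

print(split_unit_identifier("3\u00ba"))   # ('3', True)
print(split_unit_identifier("3B"))        # ('3B', False)
```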

As for the effect of this particular case on schema definition, one
might consider whether -
	+  The CONJOINED number and superscript representation should be
placed in some unit / subpremise "number" field.
	+  Some metadata rule or indication somehow "linked" to an
address format or country should "clue" a schema user into the fact that
that superscript should be added (and where / how) to a "raw" number, IF
only a "raw" number is stored in the field.
		NOTE : For the USA, a number without a unit or "subpremise"
type (APT, STE, etc.), AND where that number cannot be matched to
reference data (there is no Unit-type associated with the number
provided in the address reference data), is given the "#" sign (meaning
"number") for unit type.  A "default" record in the reference data
specifically allows non-matched unit data for high-rise / multi-unit
edifices.  This is done because it is recognized that not all
subpremises will be known and included in reference data - as a
practical matter, subpremises can "come and go" faster than reference
data can be updated.
	+  Whether the field should be termed a "number" at all might be
questioned, since some "numbers" for floors or apartments can be alpha
characters or a mixture of those and numbers ("Apartment 3B") - see
"SubPremiseNumber" in the xAL schema terminology.  I tend to try to say
Unit type (APT / apartment, STE / suite, etc.) and Unit identifier
(rather than "number"), as this helps prevent any misunderstanding or
WRONG typed data assumption.

Thank you,
David Putman

-----Original Message-----
From: Ram Kumar [mailto:RKumar@msi.com.au]
Sent: Thursday, November 11, 2004 8:30 PM
To: John D. Putman
Cc: Ram Kumar
Subject: RE: Address Schema Questions


Dear John,

Thank you for your detailed email. I have attempted to answer your
questions to the best of my knowledge.

> 
> Having worked on address correction since 1993, I periodically review 
> the "state of the art".  I recently came across some of your 
> committee's work in the area of defining XML address structures.  I 
> myself have made several "stabs"
> at various address holding structures in several different mediums 
> (IMS, RDBS, transactional, XML, etc.).  I have not only analyzed and 
> worked with "end-users" on those BUT ALSO with address correction 
> vendors / facilities AND address using applications.  I am also in the
> process of ATTEMPTING to analyze international addresses (the "parts" 
> of which they are composed, what those parts might mean, which are 
> most significant / their hierarchy, etc.).
> 
> While I have yet to study IN DETAIL all of your proposed schemas, I 
> have several operational and implementation questions that you or your
> committee may have pondered -
> 
> 1.  Even if a completely comprehensive and consistent set of schemas 
> can be defined (leaving aside the messy "real world"
> of addresses - especially internationally AND ongoing changes in 
> that!), how would one populate those schema?  Or is that something to 
> HOPE that the producers of address reference databases will do (and so
> be able to deliver such sub-divisions back for use)?

XML Schemas defined by CIQ TC are to provide a consistent way of
defining the metadata for addresses. There are tools in the market that
can populate the address data into XML format that is validated against
the schemas. This is the case with any XML usage. Formatting the data
into XML format and retrieving the data from XML format is the work of
the end users of the schemas.

> 
> 2.  Wouldn't schema population require one or all of the following -
> 	a.  Reference data that is fully segmented into all relevant and
> possible parts / subparts of an address relative to what is defined in
> the schema.

I am not sure what you mean by reference data. Any schema requires the
necessary data to be represented into the XML format that is validated
against the schema. Whether the data is to be fully segmented or not
depends on the rules of the schema. For example, in CIQ, address data
can be represented either as, say, address line 1, address line 2, etc.,
or fully segmented into, say, country, region, state, postcode, street
name, street number, etc.
It is the choice of the end users how the schema should be used.
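
The two styles Ram describes can be contrasted concretely. Element names here are illustrative, not exact xAL/CIQ element names:

```python
# Sketch: the same address at two granularities.
import xml.etree.ElementTree as ET

# Style 1: abstract -- free address lines only
a1 = ET.Element("Address")
ET.SubElement(a1, "AddressLine").text = "10 Shady Lane"
ET.SubElement(a1, "AddressLine").text = "Anytown 00001"

# Style 2: fully segmented into atomic elements
a2 = ET.Element("Address")
for tag, text in [("StreetNumber", "10"), ("StreetName", "Shady Lane"),
                  ("Locality", "Anytown"), ("PostalCode", "00001")]:
    ET.SubElement(a2, tag).text = text

print(ET.tostring(a1, encoding="unicode"))
print(ET.tostring(a2, encoding="unicode"))
```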

> 	b.  Parsing and matching facilities that can adequately recognize
> addresses not submitted in a fully schema normalized format / parts.

Parsing is definitely required to break the address into atomic
components.
I do not know why you require matching to transform the address data
into XML format that is validated against the schema.

> 	c.  Delivery of fully parsed address data back as either a pre-match
> provisional structure (though matching does often "fix" "raw" parsing 
> errors!), or as a structure developed in (successful) address matching

> to reference data (unfortunately neither of these is something many 
> vendors do OR they do NOT expose it - especially not for international

> addresses).

I am not sure whether I understood your question here. Once address
elements are defined in an XML structure, re-construction of address
structures into the required format is not difficult. 

> 	Note that it is highly unlikely that users and applications, with
> their vast stores of address information, will very soon or EVER take
> either the trouble or time to restructure their addresses into fully parsed 
> and finely defined parts - EVEN IF postal authorities do so, someday 
> and globally, for their reference databases.  This means that having 
> the above facilities available AND applied will probably be necessary 
> to consistently and adequately populate very finely grained address 
> element schemas!

Totally agree. But the objective of CIQ is to provide that option too. I
have 15 years of experience working in data quality and in particular,
name and address. Many organisations break address structures into
individual components to enable efficient matching. This is very much
applicable in the postal address certification process, and a classic
example of this is Australia Post's Address Matching Approval System
Program. So, the objective of CIQ TC is to be "application independent
and global". In this way, we cover any type of application (breaking
address data into atomic components OR keeping it at an abstract
level). Please note that there is no compulsion in CIQ to break address
data into atomic elements.

> 
> 3.  Even once segmented into all possible and appropriate "parts", 
> won't one have to further define address template schemas (maybe one 
> for each country and perhaps several for some countries that have 
> different languages or address formats - take India for one!)?  That 
> is, won't that be necessary to be able to properly reconstruct the 
> address data into a human readable, postally valid, and usable 
> "address block"?

Agreed. Parsing address data is the most difficult bit. Once it is
parsed, it means you understand the address structure. Then, templates
are needed to reconstruct the address elements. This is outside the
scope of CIQ TC. 

> 	Note that I have discussed and worked with some users on address
> entry by constituent parts; but that has never been accepted.  Not all
> users can adequately understand (or be taught) how to do so, OR they 
> refuse (for good operational and efficiency reasons) to take the time 
> to ATTEMPT to enter addresses in that way (even if they DO understand 
> how to).
> This is somewhat similar to diagramming complex language structures 
> into their constituent parts - not everyone can do it and FEW want to.

Agreed. Again, please note that breaking address into parts is entirely
OPTIONAL in CIQ. When you look at the CIQ schemas carefully, you will
note that it provides options to either break address into parts or not
to break them. In V3.0 of CIQ that we are working on at the moment, we
provide two versions of address schemas, one that is not broken into
atomic elements and the other that is broken into atomic elements.

> 
> 4.  Won't one actually HAVE TO ATTEMPT to "parse" (and
> understand!) various address-related language elements (abbreviations,
> phrases, even single symbols in, for instance, ideogrammatic languages
> like Kanji) so as to break them up into their relevant address schema 
> parts?
> 	Note that as part of an address, certain "common" 
> language elements can take on highly specialized or even different 
> meanings / usages from their "common" ones!

Agreed. Good parsing engines do a very good job in parsing complex
addresses.

> 
> 5.  Are you aware that SOMETIMES SOME matching algorithms work better 
> on the "string as a whole" than they do on constituent parts?  There 
> is a fine line in address correction that often requires partial or 
> whole string matching to "get the best match" OR prevent false 
> positives.
> This is often due to "special words" like "South" or "Circle" 
> being used for BOTH sub-portions of street names (directionals, street
> types) AND as primary street names themselves.

Agreed. 

> 	Note that, for very "well behaved" address structures, some such
> address element segmentation already occurs in vendor reference DBs.  
> See some of those for non-Puerto Rican USA addresses for instance.  
> However, where address structures are more variable or less "well 
> behaved", that is often not attempted or only attempted partially.  
> Puerto Rican addresses are, in fact, a mild example of that and 
> continue to give address correction vendors "fits".
> 		Be that as it may, the point is that the more finely
> grained address element storage is, the more processing cost there
> will be for putting
> addresses back together into partial or complete strings for matching 
> (when that needs to occur) - or, for that matter, into "human usable" 
> form.  Address correction vendors are VERY concerned about 
> performance, as their products are often required to process millions 
> OR hundreds of millions of addresses in a relatively short time (a few
> days if not less than a day).
> Even on a per-transaction basis, practically negligible
> (sub-second) response time is required by applications and users.
> Consequently, both address correction vendors and their users are 
> highly resistant to performance "hits".  Similarly, since postal 
> authorities generally count on those vendors to help them get better 
> addresses into their operations, the postal authorities share that 
> performance concern.

Agreed. Since CIQ provides the option of either breaking an address into
atomic elements or keeping it at an abstract level, users have the
choice of what to do. The beauty of CIQ is that it covers both extremes
while other standards do not. 

> 
> 6.  On a practical note, is there any coordination with postal 
> authorities (USPS, Canada Poste, UPU, etc.) AND address correction 
> vendors so that

We are a data quality vendor and we have sophisticated name and address
parsing engines for North America, Canada, Australia and other
countries. Moreover, for the latest V3.0 specs that we are working on,
we have taken all the UPU addresses and tested them against the schema.
Standards are more a political issue than a technical one. UPU, USPS,
UK Post, UN/CEFACT, etc. are aware of our work and we are open for
liaisons with them, and they know it. Whether they want to work with us
is a decision they have to make, as we have approached all these bodies
to work with us. As I said before, creating standards is more a
political agenda than a technical one. The technical bit is always easy. 

> 	a.  Some similar or compatible definition and storage of
> reference data is likely?
> 	b.  Matching rules, algorithms and logic compatible with very
> finely grained postal elements are adequately available - especially for 
> international addresses?
> 	c.  Delivery of such finely grained address elements will be
> provided by postal authorities out of their reference DBs and/or address 
> correction vendors in address parsing and matching using those 
> reference DBs?

> 	Note that BOTH address correction vendors AND postal authorities
> (or other address data providers) often view such detailed data and 
> facilities as their "crown jewels" - for which they either want one to
> pay dearly and accept stringent licensing restrictions, OR have 
> decided to restrict from general availability at all!

This is where we differ. CIQ standards do not have any IPRs, royalties,
or license fees to use. For example, to use the address samples from
UPU, we were asked to pay a fee! Anyone can contribute to CIQ and watch
everything happening in CIQ without being a member. But this is not
the case with other bodies like UPU. CIQ is more oriented towards users
who use name and address for various purposes such as customer
identification, customer views, registration, profiling, etc., and not
just for postal services. This is in direct contrast to UPU or USPS, who
come from a postal services point of view.

> 
> If you or your committee have approached or considered these issues, I
> would very much like to know.  That would make the adoption of some 
> very segmented / "normalized" address schema so much more likely and 
> potentially beneficial.

> Don't get me wrong, I would love to have the option of fully and
> deterministically processing addresses into their constituent parts
> when justified (and storing and transporting them that way, if not
> presenting them to end users in that manner).

We need experts like you to contribute to the CIQ effort. It is hard to
find experts in name and address data, which, to me, is a niche area.

> I agree that "getting something out there" to help move this along
> would be good; I am just concerned that one could incur some
> significant costs, lose some potential benefits, OR never have a very
> good set of schemas used if the above issues have not been approached
> and don't have some likely answers.

Regards,

Ram
Chair, CIQ TC
> 
> Thank you,
> David Putman
> 

