Subject: FW: Address Schema Questions
CIQ TC,

FYI.

Regards,
Ram

Ram Kumar
General Manager, Software R&D and Architecture
MSI BUSINESS SYSTEMS
Suite 204A, 244 Beecroft Road
Epping, NSW 2121, Australia
Direct: +61-2-9815 0226 | Mobile: +61-412 758 025 | Fax: +61-2-9815 0200
URL: www.msi.com.au

-----Original Message-----
From: John D. Putman [mailto:jdputman@scanningtech.fedex.com]
Sent: Friday, November 12, 2004 9:56 AM
To: Ram Kumar
Cc: John D. Putman
Subject: Address Schema Questions

Having worked on address correction since 1993, I periodically review the "state of the art". I recently came across some of your committee's work on defining XML address structures. I myself have made several "stabs" at various address-holding structures in several different mediums (IMS, RDBMS, transactional, XML, etc.). I have analyzed and worked not only with "end users" of those, BUT ALSO with address correction vendors/facilities AND address-using applications. I am also in the process of ATTEMPTING to analyze international addresses (the "parts" of which they are composed, what those parts might mean, which are most significant and their hierarchy, etc.).

While I have yet to study all of your proposed schemas IN DETAIL, I have several operational and implementation questions that you or your committee may have pondered:

1. Even if a completely comprehensive and consistent set of schemas can be defined (leaving aside the messy "real world" of addresses - especially internationally - AND the ongoing changes in it!), how would one populate those schemas? Or is that something to HOPE that the producers of address reference databases will do (and so be able to deliver such subdivisions back for use)?

2. Wouldn't schema population require one or all of the following?

   a. Reference data that is fully segmented into all relevant and possible parts/subparts of an address, relative to what is defined in the schema.

   b.
Parsing and matching facilities that can adequately recognize addresses not submitted in fully schema-normalized format/parts.

   c. Delivery of fully parsed address data back, either as a pre-match provisional structure (though matching does often "fix" "raw" parsing errors!) or as a structure developed in (successful) address matching to reference data. (Unfortunately, many vendors do neither, OR they do NOT expose it - especially not for international addresses.)

   Note that it is highly unlikely that users and applications, with their vast stores of address information, will very soon or EVER take either the trouble or the time to restructure their addresses into fully parsed and finely defined parts - EVEN IF postal authorities someday do so, globally, for their reference databases. This means that having the above facilities available AND applied will probably be necessary to consistently and adequately populate very finely grained address element schemas!

3. Even once segmented into all possible and appropriate "parts", won't one have to further define address template schemas (maybe one for each country, and perhaps several for countries that have different languages or address formats - take India for one!)? That is, won't that be necessary to properly reconstruct the address data into a human-readable, postally valid, and usable "address block"? Note that I have discussed and worked with some users on address entry by constituent parts, but that has never been accepted. Not all users can adequately understand (or be taught) how to do so, OR they refuse (for good operational and efficiency reasons) to take the time to ATTEMPT to enter addresses that way (even if they DO understand how to). This is somewhat similar to diagramming complex language structures into their constituent parts - not everyone can do it, and FEW want to.

4. Won't one actually HAVE TO ATTEMPT to "parse" (and understand!)
various address-related language elements (abbreviations, phrases, even single symbols in, for instance, ideogrammatic languages like Kanji) so as to break them up into their relevant address schema parts? Note that, as part of an address, certain "common" language elements can take on highly specialized or even different meanings/usages from their "common" ones!

5. Are you aware that SOMETIMES SOME matching algorithms work better on the "string as a whole" than on its constituent parts? There is a fine line in address correction that often requires partial or whole-string matching to "get the best match" OR to prevent false positives. This is often due to "special words" like "South" or "Circle" being used BOTH for sub-portions of street names (directionals, street types) AND as primary street names themselves.

   Note that, for very "well behaved" address structures, some such address element segmentation already occurs in vendor reference DBs - see some of those for non-Puerto Rican USA addresses, for instance. However, where address structures are more variable or less "well behaved", that is often not attempted, or only attempted partially. Puerto Rican addresses are, in fact, a mild example of that and continue to give address correction vendors "fits".

   Be that as it may, the point is that the more finely grained address element storage is, the more processing cost there will be in putting addresses back together into partial or complete strings for matching (when that needs to occur) - or, for that matter, into "human usable" form. Address correction vendors are VERY concerned about performance, as their products are often required to process millions OR hundreds of millions of addresses in a relatively short time (a few days, if not less than a day). Even on a per-transaction basis, practically negligible (sub-second) response time is required by applications and users.
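To make the ambiguity in point 5 concrete, here is a minimal sketch (the function name and the directional/street-type keyword lists are illustrative assumptions, not any vendor's actual rules) of how a string like "South Circle Dr" admits more than one legitimate split into parts - which is exactly why committing to constituent parts too early can produce false matches:

```python
# Illustrative only: tiny keyword lists standing in for real reference data.
DIRECTIONALS = {"N", "S", "E", "W", "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"ST", "AVE", "DR", "CIR", "CIRCLE", "BLVD", "RD"}

def candidate_parses(street_string):
    """Return every plausible (directional, name, street_type) split.

    More than one candidate means a part-wise matcher could commit to
    the wrong split; matching the whole string defers that decision.
    """
    tokens = street_string.upper().split()
    parses = []
    for pre in (1, 0):           # with or without a leading directional
        for suf in (1, 0):       # with or without a trailing street type
            if pre + suf >= len(tokens):
                continue         # must leave at least one token for the name
            directional = tokens[0] if pre else None
            street_type = tokens[-1] if suf else None
            if directional and directional not in DIRECTIONALS:
                continue
            if street_type and street_type not in STREET_TYPES:
                continue
            name = " ".join(tokens[pre:len(tokens) - suf])
            parses.append((directional, name, street_type))
    return parses

# Is "SOUTH" a directional on street "CIRCLE", or part of the
# street name "SOUTH CIRCLE"? Both parses survive:
for parse in candidate_parses("South Circle Dr"):
    print(parse)
```

Under these toy lists the string yields both ("SOUTH", "CIRCLE", "DR") and (None, "SOUTH CIRCLE", "DR"), among others - the kind of ambiguity the letter describes for "special words".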
Consequently, both address correction vendors and their users are highly resistant to performance "hits". Similarly, since postal authorities generally count on those vendors to help them get better addresses into their operations, the postal authorities share that performance concern.

6. On a practical note, is there any coordination with postal authorities (USPS, Canada Post, UPU, etc.) AND address correction vendors so that:

   a. Some similar or compatible definition and storage of reference data is likely?

   b. Matching rules, algorithms, and logic compatible with very finely grained postal elements are adequately available - especially for international addresses?

   c. Delivery of such finely grained address elements will be provided by postal authorities out of their reference DBs, and/or by address correction vendors in address parsing and matching using those reference DBs?

   Note that BOTH address correction vendors AND postal authorities (or other address data providers) often view such detailed data and facilities as their "crown jewels" - for which they either want one to pay dearly and accept stringent licensing restrictions, OR which they have decided to restrict from general availability altogether!

If you or your committee have approached or considered these issues, I would very much like to know. That would make the adoption of some very segmented/"normalized" address schemas so much more likely and potentially beneficial. Don't get me wrong: I would love to have the option of fully and deterministically processing addresses into their constituent parts when justified (and storing and transporting them that way, if not presenting them to end users in that manner). I agree that "getting something out there" to help move this along would be good; I am just concerned that one could incur some significant costs, lose some potential benefits, OR never have a very good set of schemas in use if the above issues have not been approached and don't have some likely answers.
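The per-country "address template" idea raised in point 3 can be sketched in a few lines. This is a hypothetical illustration only - the template strings below are simplified assumptions, not actual postal formatting rules - but it shows how finely grained parts only become a usable "address block" once a country-specific layout is applied:

```python
# Hypothetical, simplified per-country layout templates (NOT postal rules):
# the same parsed parts render in different orders per country.
COUNTRY_TEMPLATES = {
    "US": "{recipient}\n{street}\n{locality}, {region} {postcode}",
    "AU": "{recipient}\n{street}\n{locality} {region} {postcode}",
    "DE": "{recipient}\n{street}\n{postcode} {locality}",  # postcode precedes locality
}

def render_address_block(parts, country):
    """Reassemble parsed address parts into a human-readable block."""
    return COUNTRY_TEMPLATES[country].format(**parts)

# Example parts, taken from the forwarding signature above:
parts = {
    "recipient": "Ram Kumar",
    "street": "Suite 204A, 244 Beecroft Road",
    "locality": "Epping",
    "region": "NSW",
    "postcode": "2121",
}
print(render_address_block(parts, "AU"))
```

Real templates would of course need the language/format variants per country that the letter mentions (India being the cited example), which is precisely the maintenance burden point 3 is asking about.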
Thank you,
David Putman