Subject: FW: Address Schema Questions
CIQ TC,

FYI.

Regards,
Ram

Ram Kumar
General Manager, Software R&D and Architecture
MSI BUSINESS SYSTEMS
Suite 204A, 244 Beecroft Road
Epping, NSW 2121, Australia
Direct: +61-2-9815 0226 | Mobile: +61-412 758 025 | Fax: +61-2-9815 0200
URL: www.msi.com.au

-----Original Message-----
From: John D. Putman [mailto:jdputman@scanningtech.fedex.com]
Sent: Friday, November 12, 2004 9:56 AM
To: Ram Kumar
Cc: John D. Putman
Subject: Address Schema Questions

Having worked on address correction since 1993, I periodically review the "state of the art". I recently came across some of your committee's work on defining XML address structures. I myself have made several "stabs" at various address-holding structures in several different mediums (IMS, RDBMS, transactional, XML, etc.). I have analyzed and worked not only with "end users" of those, BUT ALSO with address correction vendors/facilities AND address-using applications. I am also in the process of ATTEMPTING to analyze international addresses (the "parts" of which they are composed, what those parts might mean, which are most significant and their hierarchy, etc.).

While I have yet to study all of your proposed schemas IN DETAIL, I have several operational and implementation questions that you or your committee may have pondered:

1. Even if a completely comprehensive and consistent set of schemas can be defined (leaving aside the messy "real world" of addresses - especially internationally - AND the ongoing changes in it!), how would one populate those schemas? Or is that something to HOPE that the producers of address reference databases will do (and so be able to deliver such subdivisions back for use)?

2. Wouldn't schema population require one or all of the following?

   a. Reference data that is fully segmented into all relevant and possible parts/subparts of an address, relative to what is defined in the schema.

   b.
Parsing and matching facilities that can adequately recognize addresses not submitted in fully schema-normalized format/parts.

   c. Delivery of fully parsed address data back, either as a pre-match provisional structure (though matching does often "fix" "raw" parsing errors!) or as a structure developed in (successful) address matching to reference data. (Unfortunately, many vendors do neither, OR they do NOT expose it - especially not for international addresses.)

   Note that it is highly unlikely that users and applications, with their vast stores of address information, will very soon or EVER take either the trouble or the time to restructure their addresses into fully parsed and finely defined parts - EVEN IF postal authorities someday do so, globally, for their reference databases. This means that having the above facilities available AND applied will probably be necessary to consistently and adequately populate very finely grained address element schemas!

3. Even once segmented into all possible and appropriate "parts", won't one have to further define address template schemas (maybe one for each country, and perhaps several for countries that have different languages or address formats - take India for one!)? That is, won't that be necessary to properly reconstruct the address data into a human-readable, postally valid, and usable "address block"? Note that I have discussed and worked with some users on address entry by constituent parts, but that has never been accepted. Not all users can adequately understand (or be taught) how to do so, OR they refuse (for good operational and efficiency reasons) to take the time to ATTEMPT to enter addresses that way (even if they DO understand how to). This is somewhat similar to diagramming complex language structures into their constituent parts - not everyone can do it, and FEW want to.

4. Won't one actually HAVE TO ATTEMPT to "parse" (and understand!)
various address-related language elements (abbreviations, phrases, even single symbols in, for instance, ideogrammatic languages like Kanji) so as to break them up into their relevant address schema parts? Note that, as part of an address, certain "common" language elements can take on highly specialized or even different meanings/usages from their "common" ones!

5. Are you aware that SOMETIMES SOME matching algorithms work better on the "string as a whole" than on its constituent parts? There is a fine line in address correction that often requires partial or whole-string matching to "get the best match" OR to prevent false positives. This is often due to "special words" like "South" or "Circle" being used BOTH for sub-portions of street names (directionals, street types) AND as primary street names themselves.

   Note that, for very "well behaved" address structures, some such address element segmentation already occurs in vendor reference DBs - see some of those for non-Puerto Rican USA addresses, for instance. However, where address structures are more variable or less "well behaved", that is often not attempted, or only attempted partially. Puerto Rican addresses are, in fact, a mild example of that and continue to give address correction vendors "fits".

   Be that as it may, the point is that the more finely grained address element storage is, the more processing cost there will be in putting addresses back together into partial or complete strings for matching (when that needs to occur) - or, for that matter, into "human usable" form. Address correction vendors are VERY concerned about performance, as their products are often required to process millions OR hundreds of millions of addresses in a relatively short time (a few days, if not less than a day). Even on a per-transaction basis, practically negligible (sub-second) response time is required by applications and users.
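To make the ambiguity in point 5 concrete, here is a minimal sketch (the function name and the directional/street-type keyword lists are illustrative assumptions, not any vendor's actual rules) of how a string like "South Circle Dr" admits more than one legitimate split into parts - which is exactly why committing to constituent parts too early can produce false matches:

```python
# Illustrative only: tiny keyword lists standing in for real reference data.
DIRECTIONALS = {"N", "S", "E", "W", "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"ST", "AVE", "DR", "CIR", "CIRCLE", "BLVD", "RD"}

def candidate_parses(street_string):
    """Return every plausible (directional, name, street_type) split.

    More than one candidate means a part-wise matcher could commit to
    the wrong split; matching the whole string defers that decision.
    """
    tokens = street_string.upper().split()
    parses = []
    for pre in (1, 0):           # with or without a leading directional
        for suf in (1, 0):       # with or without a trailing street type
            if pre + suf >= len(tokens):
                continue         # must leave at least one token for the name
            directional = tokens[0] if pre else None
            street_type = tokens[-1] if suf else None
            if directional and directional not in DIRECTIONALS:
                continue
            if street_type and street_type not in STREET_TYPES:
                continue
            name = " ".join(tokens[pre:len(tokens) - suf])
            parses.append((directional, name, street_type))
    return parses

# Is "SOUTH" a directional on street "CIRCLE", or part of the
# street name "SOUTH CIRCLE"? Both parses survive:
for parse in candidate_parses("South Circle Dr"):
    print(parse)
```

Under these toy lists the string yields both ("SOUTH", "CIRCLE", "DR") and (None, "SOUTH CIRCLE", "DR"), among others - the kind of ambiguity the letter describes for "special words".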
Consequently, both address correction vendors and their users are highly resistant to performance "hits". Similarly, since postal authorities generally count on those vendors to help them get better addresses into their operations, the postal authorities share that performance concern.

6. On a practical note, is there any coordination with postal authorities (USPS, Canada Post, UPU, etc.) AND address correction vendors so that:

   a. Some similar or compatible definition and storage of reference data is likely?

   b. Matching rules, algorithms, and logic compatible with very finely grained postal elements are adequately available - especially for international addresses?

   c. Delivery of such finely grained address elements will be provided by postal authorities out of their reference DBs, and/or by address correction vendors in address parsing and matching using those reference DBs?

   Note that BOTH address correction vendors AND postal authorities (or other address data providers) often view such detailed data and facilities as their "crown jewels" - for which they either want one to pay dearly and accept stringent licensing restrictions, OR which they have decided to restrict from general availability altogether!

If you or your committee have approached or considered these issues, I would very much like to know. That would make the adoption of some very segmented/"normalized" address schemas so much more likely and potentially beneficial. Don't get me wrong: I would love to have the option of fully and deterministically processing addresses into their constituent parts when justified (and storing and transporting them that way, if not presenting them to end users in that manner). I agree that "getting something out there" to help move this along would be good; I am just concerned that one could incur some significant costs, lose some potential benefits, OR never have a very good set of schemas in use if the above issues have not been approached and don't have some likely answers.
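The per-country "address template" idea raised in point 3 can be sketched in a few lines. This is a hypothetical illustration only - the template strings below are simplified assumptions, not actual postal formatting rules - but it shows how finely grained parts only become a usable "address block" once a country-specific layout is applied:

```python
# Hypothetical, simplified per-country layout templates (NOT postal rules):
# the same parsed parts render in different orders per country.
COUNTRY_TEMPLATES = {
    "US": "{recipient}\n{street}\n{locality}, {region} {postcode}",
    "AU": "{recipient}\n{street}\n{locality} {region} {postcode}",
    "DE": "{recipient}\n{street}\n{postcode} {locality}",  # postcode precedes locality
}

def render_address_block(parts, country):
    """Reassemble parsed address parts into a human-readable block."""
    return COUNTRY_TEMPLATES[country].format(**parts)

# Example parts, taken from the forwarding signature above:
parts = {
    "recipient": "Ram Kumar",
    "street": "Suite 204A, 244 Beecroft Road",
    "locality": "Epping",
    "region": "NSW",
    "postcode": "2121",
}
print(render_address_block(parts, "AU"))
```

Real templates would of course need the language/format variants per country that the letter mentions (India being the cited example), which is precisely the maintenance burden point 3 is asking about.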
Thank you,
David Putman