Accents, characters, and unusual punctuation in CAP

Certain data was being processed in our internal Java system as UTF-8 for languages that need at least UTF-16 to handle. As a result, accented characters common in Spanish or French caused processing exceptions. Since Java uses Unicode internally, the fix to allow accented characters is not hard: you just need to set a value in a couple of places in the code.
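The exact value that was changed is not recorded here. As a rough illustration only, one common form of this kind of fix in Java is to name the charset explicitly wherever bytes are turned into text, rather than relying on a default. The class and method names below are hypothetical, and the ISO-8859-1 case is just an assumed example of a mismatched decoder.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: not the actual code from our system.
final class CapDecodeSketch {

    // Decode raw message bytes with an explicitly named charset.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // The same bytes read with a mismatched charset (assumed example):
    // each byte of a multi-byte UTF-8 sequence becomes its own character.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Qu\u00e9bec" is "Quebec" with an acute accent on the first e.
        byte[] raw = "Qu\u00e9bec".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(raw).length());   // 6 characters
        System.out.println(decodeLatin1(raw).length()); // 7 characters
    }
}
```

The accented letter takes two bytes in UTF-8, so the mismatched decoder produces one extra character; depending on what the downstream code expects, that kind of mismatch can surface as garbled text or as an exception.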

But... it brings up a bigger question. The language tag in the info block can be used to validate or determine how to read the Unicode data in CAP messages written in languages that use non-Roman characters or unusual accents on Roman characters. This would make translation on the receiving end much simpler and more consistent.

But how about mixed information? The simple example is Spanish or French place names in English text, where the accenting is not recognized. A certain laxness in processing can handle that for the most part. The more challenging case is something typical in Japan, for example, where the mixed use of character sets in written communication is quite common. Japanese written in Roman letters but using some Japanese characters is one example. Another example is text in Japanese characters except that a non-Japanese place name is written in its native character set instead of, or as well as, its katakana (Japanese characters used for foreign words) representation. I suspect that this might be the case in other languages as well.
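To make the mixed-text case concrete, here is a minimal sketch of how a receiver could at least detect which scripts appear in a text field, using Java's standard Character.UnicodeScript. The class and method names are hypothetical, and this is only one possible approach, not something CAP itself prescribes.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: reports which Unicode scripts occur in a string.
final class ScriptMixSketch {

    // Collect the scripts of all code points, skipping characters that are
    // shared across scripts (spaces, digits, punctuation, combining marks).
    static Set<UnicodeScript> scriptsIn(String text) {
        Set<UnicodeScript> found = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                found.add(script);
            }
        });
        return found;
    }

    // True when more than one script is present.
    static boolean isMixed(String text) {
        return scriptsIn(text).size() > 1;
    }
}
```

For example, scriptsIn("Montr\u00e9al") contains only LATIN, because accented Roman letters still belong to the Latin script, while scriptsIn("Tokyo \u6771\u4eac") contains both LATIN and HAN. Note that ordinary Japanese text already combines HAN, HIRAGANA, and KATAKANA, so a simple "more than one script" test would flag perfectly normal Japanese; any real check would have to allow a set of scripts per language.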

Question: should we validate info block content by language? Should we even process text content by language? Or is it just a translation problem on either end, to be left to user systems? (It may not be trivial.)
