Re: [legaldocml-comment] [COMMENT] Discrepancy in akn-media-v1.0-csprd01

Subject: Re: [legaldocml-comment] [COMMENT] Discrepancy in akn-media-v1.0-csprd01.pdf Document

Hi Fabio,

On Wed, Jun 24, 2015 at 5:03 AM, Fabio Vitali <fvitali@gmail.com> wrote:

> Thank you for pointing this out for us. This is clearly a copy/paste done wrong. The sentence should be:
>
> "There is no single initial octet sequence that is always present in Akoma Ntoso documents."

Grand. Many eyes make all bugs shallow.

> I would think that trying to deduce an XML media type from magic bytes is in general unreliable and overly complex. RFC 2376 (XML Media Types) [1] has this to say about magic numbers:

> Magic number(s): none
>
> Although no byte sequences can be counted on to always be present,
> XML entities in ASCII-compatible charsets (including UTF-8) often
> begin with hexadecimal 3C 3F 78 6D 6C ("<?xml"), and those in
> UTF-16 often begin with hexadecimal FE FF 00 3C 00 3F 00 78 00 6D
> or FF FE 3C 00 3F 00 78 00 6D 00 (the Byte Order Mark (BOM)
> followed by "<?xml"). For more information, see Annex F of [REC-
> XML].

Slightly off point, but the purpose here is that magic byte detection is only one of a number of detection methods we use within Tika. I understand that AKN files have no specific magic byte fingerprint, so that is OK... I can move on :)
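
For what it's worth, the byte prefixes listed in that RFC 2376 excerpt could be checked with something like the sketch below. It is plain Java with hypothetical names (XmlPrefixSniffer, startsWithXmlDeclaration); it is not Tika's API and not how Tika's detectors are actually implemented. It only reports whether one of the "often present" prefixes is there.

```java
/** Hypothetical helper for illustration only; not part of Tika or akomantoso-lib. */
final class XmlPrefixSniffer {

    // Prefixes as quoted from RFC 2376: "<?xml" in ASCII-compatible charsets,
    // then the BOM followed by the first bytes of "<?xml" in UTF-16 (big- and
    // little-endian), exactly as the RFC lists them.
    private static final int[][] PREFIXES = {
        {0x3C, 0x3F, 0x78, 0x6D, 0x6C},
        {0xFE, 0xFF, 0x00, 0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00, 0x6D},
        {0xFF, 0xFE, 0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00, 0x6D, 0x00},
    };

    private XmlPrefixSniffer() {}

    /** Returns true if the leading bytes match one of the RFC 2376 prefixes. */
    static boolean startsWithXmlDeclaration(byte[] head) {
        for (int[] prefix : PREFIXES) {
            if (matches(head, prefix)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matches(byte[] head, int[] prefix) {
        if (head.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((head[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A false result here proves nothing: an AKN document that omits the XML declaration is still well-formed XML, which is exactly why a check like this can only ever be one hint among several.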

Thanks Fabio, I'll update the AKN Google Group and this public list once the AKN parser for Tika is finished.

Excellent work @kohsah for driving development of the akomantoso-lib Java project. Kudos.

Lewis
