Re: [legaldocml-comment] [COMMENT] Discrepancy in akn-media-v1.0-csprd01

Subject: Re: [legaldocml-comment] [COMMENT] Discrepancy in akn-media-v1.0-csprd01.pdf Document

On Wed, Jun 24, 2015 at 11:47 AM, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:

Hi Fabio,

On Wed, Jun 24, 2015 at 5:03 AM, Fabio Vitali <fvitali@gmail.com> wrote:

Thank you for pointing this out for us. This is clearly a copy/paste done wrong.

A good friend today made a copy-and-paste error, prompting me to re-tell the story we all should remember. I embellish the narrative from memory; it goes back to Ted Nelson, sometime "Father of Hypertext". Approximately:

"Now, for example, they named this computer thing a "clipboard". Why "clipboard"? After all, when you have a real clipboard, you can see what you have placed on it. With a computer clipboard, it's invisible, so you have to remember what's on your clipboard. With a real clipboard, if you add something, that addition does not delete the papers already on your clipboard. With the computer clipboard, if you add something, it deletes what was there. With a real clipboard, if you move something off, it doesn't reappear back on the clipboard. With a computer clipboard, if you take what's on it and place it somewhere else, the thing is still on your clipboard. In all other respects, however, the computer clipboard is just like a real clipboard. Unfortunately: there are no other respects."

The sentence should be:

" There is no single initial octet sequence that is always present in Akoma Ntoso documents. "

Grand. Many eyes make all bugs shallow.

I would think that trying to deduce an XML media type from magic bytes is, in general, unreliable and overly complex. RFC 2376 (XML Media Types) [1] has this to say about magic numbers:

> Magic number(s): none
>
> Although no byte sequences can be counted on to always be present,
> XML entities in ASCII-compatible charsets (including UTF-8) often
> begin with hexadecimal 3C 3F 78 6D 6C ("<?xml"), and those in
> UTF-16 often begin with hexadecimal FE FF 00 3C 00 3F 00 78 00 6D
> or FF FE 3C 00 3F 00 78 00 6D 00 (the Byte Order Mark (BOM)
> followed by "<?xml"). For more information, see Annex F of
> [REC-XML].
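
To make the RFC's point concrete, here is a minimal sketch in Python (not Tika's Java code) that checks a byte string against the three prefixes quoted above. The function name sniff_xml_prefix and the labels are my own, purely for illustration. As the RFC says, a match is only a hint: a perfectly valid XML document without an XML declaration matches none of them.

```python
# Byte prefixes copied from the RFC 2376 text quoted above.
# A match suggests XML; the absence of a match proves nothing.
XML_PREFIXES = [
    (bytes.fromhex("3C 3F 78 6D 6C"), "ASCII-compatible (e.g. UTF-8)"),
    (bytes.fromhex("FE FF 00 3C 00 3F 00 78 00 6D"), "UTF-16 big-endian"),
    (bytes.fromhex("FF FE 3C 00 3F 00 78 00 6D 00"), "UTF-16 little-endian"),
]


def sniff_xml_prefix(data: bytes):
    """Return a label for the matching prefix, or None if none matches."""
    for prefix, label in XML_PREFIXES:
        if data.startswith(prefix):
            return label
    return None


if __name__ == "__main__":
    print(sniff_xml_prefix(b'<?xml version="1.0"?><doc/>'))
    # An XML document with no declaration: no prefix matches.
    print(sniff_xml_prefix(b"<doc/>"))
```

The second call returns None even though the input is well-formed XML, which is exactly why "Magic number(s): none" is the honest answer.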

Slightly off point, but my purpose here is to note that magic-byte detection is only one of several detection methods we use within Tika. I understand that AKN files have no specific magic-byte fingerprint, so that is OK... I can move on :)
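
As an illustration of a detection method that does not depend on magic bytes, here is a rough Python sketch that reads only the root element of an XML document and reports its namespace and local name. This is not Tika's implementation; the function name root_element and the namespace "urn:example:akn" are placeholders I made up for the example, and a real detector would compare against the actual Akoma Ntoso namespace.

```python
import io
import xml.etree.ElementTree as ET


def root_element(data: bytes):
    """Return (namespace, local_name) of the root element, or None if
    the bytes cannot be parsed as XML far enough to see a root element."""
    try:
        # Stop at the first "start" event: that is the root element.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            tag = elem.tag
            if tag.startswith("{"):
                namespace, _, local = tag[1:].partition("}")
                return namespace, local
            return "", tag
    except ET.ParseError:
        return None
    return None


if __name__ == "__main__":
    # "urn:example:akn" is a placeholder namespace, not the real one.
    sample = b'<akomaNtoso xmlns="urn:example:akn"><act/></akomaNtoso>'
    print(root_element(sample))
```

Checking the root element this way works whether or not the document starts with an XML declaration, at the cost of actually invoking an XML parser.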

Thanks Fabio, I'll update the AKN GoogleGroup and this public list once the AKN parser for Tika is finished.
Excellent work @kohsah for driving development of the akomantoso-lib Java project. Kudos.
Lewis

Robin Cover
OASIS, Director of Information Services
Editor, Cover Pages and XML Daily Newslink
Email: robin@oasis-open.org
Staff bio: http://www.oasis-open.org/people/staff/robin-cover
Cover Pages: http://xml.coverpages.org/
Newsletter: http://xml.coverpages.org/newsletterArchive.html
Tel: +1 972-296-1783
