OFFICE-2093: Using URL fragment identifiers for ODF media types

Dear TC,

Allow me to come back to a previous issue we talked about some weeks ago.

Michael has updated the JIRA issue OFFICE-2093 with the LibreOffice extensions that let a URL address not only the document itself but also jump directly to various inner parts of an ODF document.

Still, I would like to start a discussion here on the list about which further extensions would be useful for accessing content within a document via a URL fragment identifier.

The basics:

URLs are defined as part of URIs (URI is the superset of URL and URN) in RFC3986. Most important to us here is the suffix after the '#' of a URI, called the fragment identifier, which RFC3986 defines without detailed semantics. Instead, the owner of each media (MIME) type has to specify what a fragment identifier means for documents of that type. All such definitions have one thing in common: they define the kinds of fragments within a document that can be identified via the fragment identifier.

The basic design of RFC3986 is:

first, the document is retrieved via the URI (without the fragment) and handed to the calling application;
then the fragment identifier is resolved by the calling application itself (e.g. LibreOffice).
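
As a rough illustration of this two-step design, here is a minimal Python sketch. The URL is made up, and `split_for_retrieval` is a hypothetical helper name; only `urllib.parse.urldefrag` comes from the standard library.

```python
from urllib.parse import urldefrag


def split_for_retrieval(uri):
    """Split a URI into the part used for retrieval and the fragment.

    The first element is what gets fetched; the second is resolved
    afterwards by the calling application.
    """
    base, fragment = urldefrag(uri)
    return base, fragment


# Hypothetical example URL, not taken from any real server.
base, fragment = split_for_retrieval("http://example.org/docs/example.odp#page5")
print(base)      # http://example.org/docs/example.odp
print(fragment)  # page5
```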

What are the constraints?

We can refer to any part of the document we want. Of course, it makes most sense to refer to the parts users most often want to access or jump to.

The biggest constraint is the syntax: the set of characters that are allowed within the fragment.

The fragment starts after the '#' and ends with the end of the URI.

RFC3986 defines the following:

fragment = *( pchar / "/" / "?" )

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

pct-encoded = "%" HEXDIG HEXDIG

sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

In other words, a fragment is any arbitrary combination of the characters

ALPHA or DIGIT or "-" or "." or "_" or "~" or

"!" or "$" or "&" or "'" or "(" or ")" or "*" or "+" or "," or ";" or "=" or

":" or "@" or "/" or "?";

any other character has to be percent-encoded as hexadecimal digits.
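
To make the RFC3986 grammar quoted above concrete, here is a small sketch of a syntax check in Python. The function name `is_valid_fragment` is my own invention for illustration; the regular expression simply transcribes the fragment rule (pchar plus "/" and "?").

```python
import re

# Character sets transcribed from the RFC3986 rules quoted above.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# fragment = *( pchar / "/" / "?" ), with pchar also allowing ":" and "@"
FRAGMENT_RE = re.compile(
    r"(?:[" + UNRESERVED + SUB_DELIMS + r":@/?]|" + PCT_ENCODED + r")*"
)


def is_valid_fragment(fragment):
    """Return True if the text after '#' matches the fragment grammar."""
    return FRAGMENT_RE.fullmatch(fragment) is not None


print(is_valid_fragment("page5"))      # True
print(is_valid_fragment("Table%201"))  # True
print(is_valid_fragment("Table 1"))    # False, a raw space is not allowed
```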

For instance, the hexadecimal byte value of the space character is 20, so it is percent (pct) encoded as %20. If a character consists of multiple bytes, for instance when the URL is written in UTF-8, every byte is encoded. The German "Ä" is encoded as %C3%84 in UTF-8, or as %C4 when using ISO-8859-1.
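
The encodings mentioned here can be reproduced with Python's standard `urllib.parse.quote`. This is only a sketch: `encode_fragment` is a hypothetical helper, and the choice to leave all characters that RFC3986 permits in a fragment unencoded is my assumption.

```python
from urllib.parse import quote, unquote

# Characters RFC3986 allows in a fragment besides the unreserved ones.
FRAGMENT_SAFE = "!$&'()*+,;=:@/?"


def encode_fragment(text, encoding="utf-8"):
    """Percent-encode every byte that may not appear literally in a fragment."""
    return quote(text, safe=FRAGMENT_SAFE, encoding=encoding)


print(encode_fragment(" "))                         # %20
print(encode_fragment("Ä"))                         # %C3%84
print(encode_fragment("Ä", encoding="iso-8859-1"))  # %C4
print(unquote("%C3%84"))                            # Ä
```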

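One last anomaly, mentioned only for completeness since it is not important for us: in query strings produced by HTML forms (application/x-www-form-urlencoded), the space character is encoded as "+" rather than %20. This convention does not apply to the path or to the fragment, where "+" is a literal plus sign.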
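
Python's standard library happens to expose both conventions, which makes the difference easy to see. In this sketch, `quote_plus` and `unquote_plus` implement the form-style "+" convention, while `quote` and `unquote` do plain percent-encoding; the sample string is arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "first heading"

print(quote(text))       # first%20heading
print(quote_plus(text))  # first+heading

# Decoding only treats "+" as a space when the form convention is requested.
print(unquote("first+heading"))       # first+heading
print(unquote_plus("first+heading"))  # first heading
```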

Usability & Examples

In HTML, the fragment identifier refers to the HTML element whose ID attribute has the same value.

Following the paradigm of convention over configuration, I would suggest that even where there is no explicit ID, at least every heading, for instance, should be accessible by a fragment identifier (strictly, a space would have to be replaced by %20, but it would be nice if applications could be tolerant).

In addition, I would love accessing document parts to be as simple as possible for end users.
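
To illustrate what such a tolerant heading lookup could look like, here is a minimal Python sketch. Everything in it is hypothetical: `find_heading`, the list of headings, and the matching rule (percent-decode, then compare with whitespace normalized) are assumptions for discussion, not anything specified in ODF or implemented in LibreOffice.

```python
from urllib.parse import unquote


def normalize(text):
    """Collapse runs of whitespace so minor spacing differences do not matter."""
    return " ".join(text.split())


def find_heading(fragment, headings):
    """Return the index of the first heading matching the fragment, or None.

    Accepts both the strict form ("The%20basics") and the tolerant form
    with a literal space ("The basics").
    """
    wanted = normalize(unquote(fragment))
    for index, heading in enumerate(headings):
        if normalize(heading) == wanted:
            return index
    return None


# Hypothetical headings of a text document.
headings = ["Introduction", "The basics", "Usability & Examples"]

print(find_heading("The%20basics", headings))                # 1
print(find_heading("The basics", headings))                  # 1
print(find_heading("Usability%20%26%20Examples", headings))  # 2
print(find_heading("Missing", headings))                     # None
```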

For presentations, for instance, we might want to access the fifth slide of a presentation "example.odp" by "example.odp#5", alongside the currently implemented "example.odp#page5".

For spreadsheets, we might want to access cell ranges, cells, rows, or columns of particular sheets with fragment identifiers. I leave it to the spreadsheet folks to come up with the syntax. ;)
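
A resolver that accepts both spellings could be as small as the following Python sketch. `slide_number` is a hypothetical name, and the rule that slide numbers start at 1 is my assumption; this is not how LibreOffice actually parses the fragment.

```python
import re

# Optional "page" prefix followed by a positive slide number.
SLIDE_RE = re.compile(r"(?:page)?([1-9][0-9]*)")


def slide_number(fragment):
    """Return the 1-based slide number for "5" or "page5", else None."""
    match = SLIDE_RE.fullmatch(fragment)
    return int(match.group(1)) if match else None


print(slide_number("5"))      # 5
print(slide_number("page5"))  # 5
print(slide_number("intro"))  # None
```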

Finally the main question

Can you come up with examples of fragment identifiers for ODF content that would ease your life?

Thanks in advance,

Svante
