Binary oBIX Ideas

obix-xml message

There are two primary use cases that seem ideal for a compressed oBIX object tree encoding:

6LoWPAN: here our main constraints are bandwidth and limited devices. Ideal 6LoWPAN messages have payloads of ~80 bytes or less to avoid fragmentation at the 802.15.4 layer. The nodes processing the oBIX messages have little buffer space, so they will likely be producing and consuming fairly small oBIX messages, all processed in memory. In this use case a gateway translates between XML and binary oBIX.

Cellular: another use case that has come up for me a couple of times is cellular networking, where bandwidth is constrained or expensive. Cellular devices don't typically have the same computing/memory constraints and may be passing fairly large oBIX messages, but they still want compression to save bandwidth. In these cases we want to work efficiently off an I/O stream (which has implications for string dictionary lookups).

1. The binary encoding should have 100% fidelity with the oBIX model which means any oBIX specified element or attribute should have a corresponding binary encoding. It should be possible to round trip any oBIX XML document with no loss of information (caveats below).

2. The binary encoding will *not* be a compression of the XML. The oBIX model is a separate abstraction from its XML representation. In the XML representation it is possible to add additional meta-data such as custom namespaced elements and attributes. The binary encoding will not preserve this information.

3. The value space will not be truly lossless. In the case of XML, the value space of integers, reals, and times is encoded using a string with potentially infinite precision. Binary oBIX will be restricted to the precision found in common programming language primitives. Int will have precision up to a 64-bit signed integer, Real will have precision up to a 64-bit IEEE float, and time will most likely go down to milliseconds or perhaps nanoseconds.
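To make the precision caveat concrete, here is a minimal sketch in Python. The helper names `fits_int64` and `real_round_trip` are made up for illustration and are not part of any oBIX specification; the sketch only assumes the 64-bit signed Int and 64-bit IEEE Real limits stated above.

```python
import struct

# Bounds of a 64-bit signed integer, the Int limit proposed above.
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1

def fits_int64(text: str) -> bool:
    """True if an XML integer literal survives as a 64-bit signed int."""
    return INT64_MIN <= int(text) <= INT64_MAX

def real_round_trip(text: str) -> float:
    """Parse an XML real literal, pack it as a big-endian IEEE double, unpack it."""
    return struct.unpack(">d", struct.pack(">d", float(text)))[0]
```

For instance, the literal 0.1000000000000000000001 comes back as the same double as 0.1, so the trailing digits do not survive the round trip, and an integer literal of 9223372036854775808 is one past the largest value that fits.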

There are various compression techniques. The primary technique we will use is simply encoding the object and facet structure of oBIX into byte codes versus XML text elements and attributes. Because the oBIX model has a simple closed syntax, this is where we will get the bulk of our compression:

- Most facets are monomorphic - they have exactly one value type, for example unit is always a URI

Within each of the 10 value types, there might be multiple encodings. For example I might have 4 different ways to encode an int based on its magnitude:

These variations allow the optimal encoding to be chosen for a given integer. Likewise there are multiple encodings for reals, and we especially want multiple optimizations for string encodings to use some sort of lookup dictionary. The huge advantage a custom oBIX encoding has over other techniques is that we know how to interpret text data as booleans, ints, and floats.
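As a rough illustration of magnitude-based integer encoding, the sketch below picks the narrowest of four signed widths and prefixes it with a one-byte code. The width table, the code values, and the function names are all assumptions made for this example; the actual byte layout is exactly the bit packing puzzle still to be solved.

```python
import struct

# Hypothetical width table: (struct format, byte width). Not from any spec.
_INT_FORMATS = [(">b", 1), (">h", 2), (">i", 4), (">q", 8)]

def encode_int(value: int) -> bytes:
    """Pack value in the narrowest signed big-endian width that holds it."""
    for code, (fmt, width) in enumerate(_INT_FORMATS):
        limit = 1 << (8 * width - 1)
        if -limit <= value < limit:
            # One leading byte records which width follows.
            return bytes([code]) + struct.pack(fmt, value)
    raise OverflowError("value does not fit in a 64-bit signed integer")

def decode_int(data: bytes) -> int:
    """Read the width code, then unpack that many bytes."""
    fmt, width = _INT_FORMATS[data[0]]
    return struct.unpack(fmt, data[1:1 + width])[0]
```

A real design would probably fold the width code into the same byte that identifies the object or facet, rather than spending a whole byte on it as this sketch does.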

Given these aspects of the core oBIX model, it is really just a bit packing puzzle (to be solved).

One thing I'd like to throw out and get some feedback on is how to handle strings. One of the simplest compression techniques is to pre-assign strings to a numeric code (in fact this is sort of the basis of most general purpose compression algorithms). But strings in oBIX (such as contracts, names, display names, etc.) are open ended.

The key question is this: must string dictionaries be inlined into each message or do we allow out-of-band references?

The simplicity of inline is that given a binary message, I can faithfully turn it back into XML with no other information. The expense is that every string used within the message must be transported in the message (potentially expensive).
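A minimal sketch of the inline option follows. The layout (a one-byte count, length-prefixed UTF-8 strings, then one-byte indices) and the function names are invented for illustration, and it assumes fewer than 256 distinct strings, fewer than 256 references, and strings under 256 bytes each.

```python
def build_inline_dictionary(strings):
    """Return (table, refs): unique strings in first-seen order, plus one index per input."""
    index = {}
    table = []
    refs = []
    for s in strings:
        if s not in index:
            index[s] = len(table)
            table.append(s)
        refs.append(index[s])
    return table, refs

def serialize(table, refs):
    """Write the dictionary first, then the references into it."""
    out = bytearray([len(table)])
    for s in table:
        raw = s.encode("utf-8")
        out.append(len(raw))  # one-byte length prefix (assumed limit)
        out += raw
    out.append(len(refs))
    out += bytes(refs)
    return bytes(out)

def deserialize(data):
    """Rebuild the original string sequence from the message alone."""
    pos = 1
    table = []
    for _ in range(data[0]):
        n = data[pos]
        table.append(data[pos + 1:pos + 1 + n].decode("utf-8"))
        pos += 1 + n
    count = data[pos]
    refs = data[pos + 1:pos + 1 + count]
    return [table[i] for i in refs]
```

The message is self-describing, which is the appeal of inlining, but each distinct string still travels in full once per message, so the saving only comes from strings that repeat within that message.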

The alternative is to just extend oBIX's REST nature and allow an out-of-band string dictionary via any arbitrary URI. This would maintain very compact binary encodings, at the expense of an extra HTTP dereference to fully turn the message back into XML.
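The out-of-band option might look roughly like this on the decoding side. Everything here is hypothetical: the function name, the idea that a message carries a dictionary URI plus indices, and the `fetch` callable, which stands in for whatever HTTP client would actually retrieve the dictionary.

```python
def decode_with_external_dictionary(dict_uri, refs, fetch, cache):
    """Resolve string indices against a dictionary fetched from dict_uri.

    fetch is a caller-supplied function taking a URI and returning a list
    of strings; cache is a dict so each URI is dereferenced at most once.
    """
    if dict_uri not in cache:
        cache[dict_uri] = fetch(dict_uri)  # the extra dereference
    table = cache[dict_uri]
    return [table[i] for i in refs]
```

With a cache in place, the extra dereference is paid once per dictionary rather than once per message, which matters most for constrained links where the same small set of strings recurs.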

I was originally opposed to any out-of-band dictionaries, but then I started thinking about it more from a REST perspective, and I now think it is OK. In fact that is basically what we are already doing with units, ranges, and contracts by making them arbitrary URIs.