Subject: Encryption and data leakage

From: robert_weir@us.ibm.com
To: office@lists.oasis-open.org
Date: Tue, 11 May 2010 13:42:49 -0400

The approach we inherited from ODF 1.1 encrypts each file in the ZIP
independently.  Although the contents of the files are not viewable due to
the encryption, there are bits of information that can potentially "leak",
such as:

1) The file size
2) The file date
3) The file name
4) The file mime type
5) The hash of the first 1024 bytes of the file

For example, even in an encrypted document I could see a file named
"big-secret-takeover-june-3.jpg" and learn something that the person
who wrote the encrypted document might be rather surprised to see in the
open.
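
To make the point concrete, here is a minimal Python sketch (not taken from
any ODF implementation) of what any ZIP reader can see without the password.
The package it builds is hypothetical, and the "encrypted" content is just
random bytes standing in for ciphertext.

```python
import io
import os
import zipfile

def visible_metadata(package_bytes):
    """Return (name, size, timestamp) for every ZIP entry; no key is needed."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return [(i.filename, i.file_size, i.date_time) for i in zf.infolist()]

# Build a hypothetical package whose image entry holds stand-in ciphertext.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    info = zipfile.ZipInfo("Pictures/big-secret-takeover-june-3.jpg",
                           date_time=(2010, 5, 11, 13, 42, 48))
    zf.writestr(info, os.urandom(4096))

for name, size, stamp in visible_metadata(buf.getvalue()):
    print(name, size, stamp)
```

The name, size and date come straight from the ZIP directory, so encrypting
the entry's content does nothing to hide them.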

Although not required by ODF, a clever implementation can avoid some of
these leaks.  For example, the timestamp of the file can be set to the
time of encryption rather than the original time stamp.  And the file name
can be randomized rather than reveal the original file name.  This might
be fine for ODF, since these time stamps and file names do not need to be
preserved.  So long as we preserve referential integrity of the package,
the names of images are not significant.
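
A rough Python sketch of that idea, under the assumption that the writer
knows every reference to the renamed entry.  The helper names
(scrubbed_entry, rewrite_references) and the 16-byte random name are
hypothetical choices, not anything ODF specifies.

```python
import secrets
import time
import zipfile

def scrubbed_entry(original_name, mapping):
    """Create a ZIP entry header with a random name and the current time.

    `mapping` remembers original -> random names so references can be fixed up.
    """
    random_name = mapping.setdefault(original_name, secrets.token_hex(16))
    return zipfile.ZipInfo(random_name, date_time=time.localtime()[:6])

def rewrite_references(references, mapping):
    """Point every known reference at the randomized name (referential integrity)."""
    return [mapping[ref] for ref in references]

mapping = {}
info = scrubbed_entry("Pictures/big-secret-takeover-june-3.jpg", mapping)
new_refs = rewrite_references(["Pictures/big-secret-takeover-june-3.jpg"], mapping)
print(info.filename, new_refs)
```

This only works when every reference to the entry passes through
rewrite_references; a reference the writer does not know about would be
left dangling.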

However, we should still be concerned here.  First, the reason we split
Part 3 into its own part was the belief that it could be useful for
purposes other than just ODF 1.2.  Many of us hoped that it would have
other uses.  But I don't think we can assume that all uses can ignore the
original file names and time stamps.  These might be significant for some
uses.

Second, even within ODF, especially if we allow package extensions, we
might see items added to packages where the names of files (which may
ultimately be end-user-defined) cannot safely be changed to random names.
For example, there may be referential integrity constraints that a generic
ODF processor is not aware of.  Maybe there is RDF that points to a
contained image or other package resource.  In any case, the approach is
very fragile.

Finally, even without extensions, and with the use of randomized names, we
still leak information, based on knowing the size and hash of the first
1024 bytes of the file.  For example, if I have a copy of
"big-secret-takeover-june-3.jpg" I can easily check which encrypted
documents also contain that same image.  I can similarly probe for any
other resource where I know in advance its size and/or contents.
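
The probe is easy to sketch in Python.  This is an illustration only:
the use of SHA-1, the exact bytes that get hashed, and the
leaked_by_document structure are assumptions made for the example.

```python
import hashlib

def fingerprint(data):
    """(size, hash of the first 1024 bytes); SHA-1 is assumed for illustration."""
    return len(data), hashlib.sha1(data[:1024]).hexdigest()

def documents_containing(known_file, leaked_by_document):
    """Return the documents whose leaked (size, hash) pairs match a file we hold."""
    target = fingerprint(known_file)
    return [doc for doc, prints in leaked_by_document.items() if target in prints]

# Hypothetical data: what an observer could collect from two encrypted packages.
secret_image = b"\xff\xd8" + b"takeover plans" * 500
other_image = b"\xff\xd8" + b"holiday photos" * 500
leaked_by_document = {
    "memo.odt": {fingerprint(secret_image)},
    "newsletter.odt": {fingerprint(other_image)},
}
print(documents_containing(secret_image, leaked_by_document))
```

No key is involved anywhere: the attacker only compares values that the
package exposes in the clear.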

There are at least three ways of getting around this problem that come to
mind.  One is to keep a "shadow directory" for the ZIP that contains the
original names, time stamps, and sizes of the files.  Encrypt this "shadow
directory" when the document is encrypted.  For each encrypted file,
prepend some random bytes (not sure what amount is optimal) in order to
prevent leakage of the original size and of the hash of the first 1024
bytes.
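
The random-prefix part of this idea might look like the following Python
sketch.  The 4-byte length header, the 1024-byte minimum and the 4096-byte
spread are arbitrary assumptions; as noted, the optimal amount is an open
question.  The padding would be applied before the data is hashed and
encrypted.

```python
import secrets
import struct

def pad(plaintext, spread=4096):
    """Prepend a length header and at least 1024 random bytes."""
    n = 1024 + secrets.randbelow(spread)
    return struct.pack(">I", n) + secrets.token_bytes(n) + plaintext

def unpad(padded):
    """Strip the length header and the random prefix."""
    (n,) = struct.unpack(">I", padded[:4])
    return padded[4 + n:]

image = b"known image bytes" * 100
a, b = pad(image), pad(image)
print(len(image), len(a), len(b))   # padded sizes no longer equal the original
```

Because the first 1024 bytes of the padded data are almost entirely random,
two copies of the same file no longer share a size or a leading-bytes hash.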

Another approach is to encrypt the original full path of the file, appended
with its timestamp, using the original derived key, base64 encode that,
and then write that out as the full path for the ZIP entry. That way you
do not need another file in the ZIP.
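
A Python sketch of this second approach follows.  The cipher here is only
a stand-in (an HMAC-SHA256 keystream with a random nonce), chosen because
it needs nothing beyond the standard library; a real implementation would
use the package's actual cipher.  The names seal_name and open_name, the
newline separator and the key are hypothetical.  One detail worth noting:
the standard base64 alphabet contains "/", which ZIP treats as a path
separator, so the sketch uses the URL-safe alphabet.

```python
import base64
import hashlib
import hmac
import secrets

def _keystream(key, nonce, length):
    """Stand-in keystream: HMAC-SHA256 over nonce || counter (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def seal_name(key, path, timestamp):
    """Encrypt path + timestamp and return a string usable as a ZIP entry name."""
    plain = f"{path}\n{timestamp}".encode("utf-8")
    nonce = secrets.token_bytes(16)
    cipher = bytes(p ^ k for p, k in zip(plain, _keystream(key, nonce, len(plain))))
    return base64.urlsafe_b64encode(nonce + cipher).decode("ascii")

def open_name(key, entry_name):
    """Recover (path, timestamp) from a sealed entry name."""
    raw = base64.urlsafe_b64decode(entry_name)
    nonce, cipher = raw[:16], raw[16:]
    plain = bytes(c ^ k for c, k in zip(cipher, _keystream(key, nonce, len(cipher))))
    path, timestamp = plain.decode("utf-8").rsplit("\n", 1)
    return path, timestamp

key = b"derived key (hypothetical)"
sealed = seal_name(key, "Pictures/big-secret-takeover-june-3.jpg", "2010-05-11T13:42:49")
print(sealed)
print(open_name(key, sealed))
```

Note that the length of the sealed name still tracks the length of the
original path, so this hides the name but not how long it is.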

The third way is to move to a whole-package encryption method, rather than
trying to do this file-by-file.

-Rob

Follow-Ups:
- Re: [office] Encryption and data leakage
  - From: Malte Timmermann <Malte.Timmermann@Sun.COM>
- RE: [office] Encryption and data leakage
  - From: David LeBlanc <dleblanc@exchange.microsoft.com>