OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

office-collab message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: IRC log - ODF Collab SC call - 2017-07-12


Call Summary:
We talked about string comparison, some of their algorithms and change detection approaches on documents.

On the 26th of July will be our next meeting:
https://www.timeanddate.com/worldclock/meetingdetails.html?year=2017&month=07&day=26&hour=14&min=30&sec=0&p1=179&p2=37&p3=136&p4=234&iv=1800

The teleconference login data for next call will be found in the OASIS calendar event:
https://www.oasis-open.org/committees/event.php?event_id=39526

[16:34] Svante Schubert: Hello
[16:37] Svante Schubert: Regarding change-tracking I read more about String comparison
[16:37] Svante Schubert: Seems it is favorable to use the Burrows-Wheeler-Transformation over Suffix array
http://www.mi.fu-berlin.de/wiki/pub/ABI/RnaSeqP4/00_Protokoll_main.pdf
The comparison in chapter: 4.2 BWT and suffix arrays
explained both earlier - best suffix array explaination
[16:37] Svante Schubert: At least, what they are using when they are comparing DNA
[16:37] Svante Schubert: But nothing on runtime behavior written there.
[16:38] Svante Schubert: Regarding precise String finding (which might be related to finding differences) there is a funny movement to the GPU
[16:39] Svante Schubert: For instance this approach on the GPU: http://on-demand.gputechconf.com/gtc/2017/posters/images/1920x1607/GTC_2017_Algorithms_AL_02_P7202_WEB.png [16:39] Svante Schubert: from http://www.gputechconf.com/resources/poster-gallery/2017/algorithms
[16:40] Svante Schubert: In the paper above, they were able to explain SuffixArray in one pager (a little more on next page), that it becomes apparent for what the data structure was meant for.
[16:40] Svante Schubert: Also the example of Burrows-Wheeler is very good explained!
[16:41] Svante Schubert: Nevertheless the SuffixArray only makes sense, when the Longest Common Prefix is used upon it to find the similar substrings.
[16:42] Svante Schubert: There is even a race how fast this StringArray datastructure can be created. There is a paper on taxonomy about it:
[16:42] Svante Schubert: Especially the graphs within: http://www.cas.mcmaster.ca/~bill/best/algorithms/07Taxonomy.pdf
[16:42] Svante Schubert: But already 10 years old and even during the release new algorithms were found.
[16:43] Svante Schubert: String comparison became an interesting field because of DNA analysis science!
[16:45] Patrick: tries were invented in 1959 and named tries by Edward Fredkin two years later
[17:05] Svante Schubert: Concatenating all strings from all visible text in the document and run SuffixArray (SA) and LCP upon will provide the matches
[17:07] Svante Schubert: Basically there are graph and text comparison algorithms out there, the best results can be archieved by combining the best out of both worlds
[17:08] Patrick: insertion marks the end point of the substring where insertion starts, then the insertion with its end point
[17:09] Svante Schubert: In former times it was explicitly mark, while we are going to reference the start/end by counting
[17:09] Patrick: have a suffix for the start and suffix for the end, insertion goes between them
[17:10] Svante Schubert: yes
[17:10] Patrick: your location syntax + suffixes then its unique to the document
[17:12] Svante Schubert: Having a list of changes and references by counting allows overlapping work, while marks within the document as in former times make overlapping changes very complicated
[17:14] Svante Schubert: Patrick: As asynch operations are being applied, following changes have to be adopted
[17:15] Svante Schubert: Async changes from different people, are like multiple people working on source code with a revision system (as GIT). They are working on different branches
[17:15] Svante Schubert: The merge of branches is done one at a time
[17:16] Svante Schubert: First comes first, and the other branches have to apply previous applied branches to their change list first. Similar as the changes of others are being moved before (or through) their change list.
[17:23] Patrick: if entire text is treated as one string, including markup, then following changes in stack must increment/decrement by the length of the change. And, only valid changes can be entered, such as <text:p> must have </text:p> as part of the text string
[17:26] Svante Schubert: in the paper I have been reading, they made elements to graphs above the text nodes, only styles had been added within (as in Wiki documents)
[17:28] Patrick: In change list, I'm suggesting the entire document is a string, changes (including markup) are offsets into that string. Or into changes if you want to support that.
[17:28] Svante Schubert: In my example a branch is equal to a list of changes. Where every commit in the GIT world is equal to a user change (or ODF change operationg) in our world
[17:29] Svante Schubert: Oh, you are talking about the document model being in a single string.
[17:31] Svante Schubert: Well, graphs can be serialized to String, like StringArray and lexical ordered String Graphs are equivalent usable
[17:32] Svante Schubert: I only would suggest not to build the string based directly upon the XML markup (as it is too verbose, too noisy on changes as it is not normalized), but on what I would call user objects
[17:33] Patrick: So test nodes = user objects ?
[17:33] Patrick: text nodes
[17:34] Svante Schubert: here every character is in theory a user node, but in fact we were comparing text and concating text node only for sub-string comparison
[17:36] Patrick: are attribute values user nodes for tracking?
[17:40] Svante Schubert: depends :)
[17:40] Svante Schubert: automatic styles are resolved from current ODF applications to track changes on text, such as bold, italic, etc.
[17:40] Svante Schubert: but not in general
[17:48] Patrick: <span>blah<span>bleh</span>blah</span> could become <span>blah</span><span>bleh</span><span>blah</span>
[17:49] Svante Schubert: Normalizing the nested span to unnested is simplifying the work for office applications (at least the ODFDOM application I worked on, where XML was represented as DOM)
[17:49] Svante Schubert: There are different ways to have the ODF semantic model serialized. Interesting is to find criteria what are better application models.
[17:51] Svante Schubert: Obviously it is not good to have the same user feature being offered in different ways, as lists and numbered paragraphs, as it only provides more development time without user benefit
[17:52] Svante Schubert: Second, if there is a user change (e.g. by the GUI) the model needs to be updated. This update should be quick and consistent. My nightmare as ODF application working directly on the XML DOM was three list items.
[17:54] Svante Schubert: For instance, all three list items have list level 10 and now the middle should be moved to list level 3. The list have to be split and there is a quite amount of node changes involved.
[17:54] Svante Schubert: Instead the middle list might have simply the list level as integer, which is being changed from 10 to 3.
[17:55] Svante Schubert: Another important rule, documents are being loaded fastest, when the serialized model is close to the internal model.
[17:55] Svante Schubert: Anyway, we are over the time. Talk to you next time. We continue on the list..
[17:55] Svante Schubert: byebye



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]