

Subject: Re: [docbook-apps] Dynamic web serving of large Docbook



Michael, thanks for your extensive replies. I have been looking into this 
fairly extensively myself, and it sure is tricky. DocBook is a very attractive 
format to have underneath, and being able to use it smoothly in large web 
projects would make it even more powerful. I think this applies to many people, 
so a clean, thorough solution that is pushed upstream (into a CMS or the 
stylesheets) would benefit many.

It should be noted that I have no way of paying for help or using proprietary 
solutions, for several reasons; one is that this is for an open source 
project. Also, sorry about the late reply :|

On Wednesday 13 October 2004 13:29, Michael Smith wrote:
> Frans,
>
> Reading through your message a little more...
>
> [...]
>
> > The perfect solution, AFAICT, would be dynamic, cached generation.
> > When a certain section is requested, only that part is transformed, and
> > cached for future deliveries. It sounds nice, and sounds like it would be
> > fast.
> >
> > I looked at Cocoon (cocoon.apache.org) to help me with this, and it
> > does many things well; it caches XSLT sheets, the source files, and even
> > CIncludes (basically the same as XIncludes).
> >
> > However, AFAICT, DocBook doesn't make it easy:
> >
> > * If one section is to be transformed, the stylesheets must parse /all/
> > the sources in order to resolve references and so forth. There's no way
> > to work around this, right?
>
> It seems like your main requirement, as far as HTML output goes, is to be
> able to preserve stable cross-references among your rendered
> pages. And you would like to be able to dynamically regenerate
> just a certain HTML page without regenerating every HTML page that
> it needs to cross-reference.
>
> And, if I understand you right, your requirement for PDF output is
> to be able to generate a PDF file with the same content as each
> HTML chunk, without regenerating the whole set/book it belongs to.
> (At least that's what I take your mention of "chunked PDF" in your
> original message to mean.)

Yes, correct interpretation.

>
> (But -- this is just an incidental question -- in the case of the
> PDF chunks, you're not able to preserve cross-references between
> individual PDF files, right? There's no easy way to do that. Not
> that I know of at least.)

Nope, the PDF would simply contain the content of the viewed page, without any 
web specifics such as navigation; it would be used for printing. Example (upper 
right corner):
http://xml.apache.org/

>
> If the above is all an accurate description of your requirements,
> then I think a partial solution is
>
>   - set up the relationship between your source files and HTML
>     output such that the DocBook XML source for your parts are
>     stored as separate physical files that correspond one-to-one
>     with the HTML files in your chunked output
>
>   - use olinks for cross-references (instead of using xref or link)
>
>       http://www.sagehill.net/docbookxsl/Olinking.html
>
> If you were to do those two things, then maybe:
>
>  1. You could do an initial "transform everything" step of your
>     set/book file, with the individual XML files brought together
>     using XInclude or entities; that would generate your TOC &
>     index and one big PDF file for the whole set/book
>
>  2. You would then need to generate a target data file for each
>     of your individual XML files, using a unique filename value for
>     the targets.filename parameter for each one, and then
>     regenerate the HTML page for each individual XML file, and
>     also the corresponding PDF output file.
>
>  3. After doing that initial setup once, then each time an
>     individual part is requested (HTML page or individual PDF
>     file), you could regenerate just that from its corresponding
>     XML source file.
>
>     The cross-references in your HTML output will then be
>     preserved (as long as the relationship between files hasn't
>     changed and you use the target.database.document and
>     current.docid parameters when calling your XSLT engine).
>
> I _think_ that all would work. But Bob Stayton would know best.
> (He's the one who developed the olink implementation in the
> DocBook XSL stylesheets.)
>
> A limitation of it all is that, if a writer adds a new section to
> a document, you're still going to need to re-generate the whole
> set/book to get that new section to show up in the master TOC.
> Same thing if a writer adds an index marker, in order to get that
> marker to show up in the index.
>
> But one way to deal with that is, you could just do step 3 above
> on-demand, and have steps 1 and 2 re-run, via a cron job or
> equivalent, at some regular interval -- once a day or once an hour
> or at whatever the minimum interval is that you figure would be
> appropriate given how often writers are likely to add new sections
> or index markers.
>
> And during that interval, of course there would be some
> possibility of an end user not being aware of a certain newly
> added section because the TOC hasn't been regenerated yet, and
> similarly, not finding anything about that section in the index
> because it hasn't been regenerated yet.
>
> > * Cocoon specific: it cannot cache "a part" of a transformation, which
> > means the point above isn't worked around, right? Otherwise this would
> > mean the transformation of all the non-changed sources would be cached.
>
> Caching is something that you could do with or without Cocoon, and
> something that's entirely separate from the transformation phase. You
> wouldn't necessarily need Cocoon or anything Cocoon-like if you
> used the solution above (and if it would actually work as I
> think). And using Cocoon just to handle caching would probably be
> overkill. I think there are probably some lighter-weight ways to
> handle caching.
>
> Anyway, I think the solution I described would be some work to set
> up -- but you could hire some outside expertise to help you do
> that (Bob Stayton comes to mind for some reason...).


I looked at the solution of using an olink database, but perhaps I discarded 
it too quickly. Perhaps I'm setting the threshold too high (I am...), but I find 
it hackish; it isn't transparent, and above all it disturbs the creation of 
content: one can't use standard DocBook, and authors have to bother with 
technical problems. It's messy.
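
To make sure I understand it, here is roughly what that two-pass setup would 
look like driven from a small script. This is only a sketch of how I read your 
steps 1-3; the file names, paths and the way I invoke xsltproc are my own 
guesses, not something I have tested:

    # Rough sketch of the two-pass olink build described above. Assumptions:
    # xsltproc is on PATH, the DocBook XSL stylesheets live under ./docbook-xsl/,
    # the sources are part1.xml, part2.xml, ..., and olinkdb.xml is the
    # hand-written target database document described at
    # http://www.sagehill.net/docbookxsl/Olinking.html
    import subprocess

    PARTS = ["part1", "part2", "part3"]       # one source file per HTML page
    XSL = "docbook-xsl/html/docbook.xsl"      # hypothetical install path

    # Pass 1: collect cross-reference target data for every part.
    for part in PARTS:
        subprocess.run([
            "xsltproc", "--xinclude",
            "--output", "/dev/null",          # discard the HTML; only the
                                              # target data file is wanted here
            "--stringparam", "collect.xref.targets", "only",
            "--stringparam", "targets.filename", part + ".targets.xml",
            XSL, part + ".xml",
        ], check=True)

    # (olinkdb.xml then lists each <part>.targets.xml under a matching
    # targetdoc id, as in the sagehill description.)

    # Pass 2: render each part, resolving olinks against that database.
    for part in PARTS:
        subprocess.run([
            "xsltproc", "--xinclude",
            "--output", "html/" + part + ".html",
            "--stringparam", "target.database.document", "olinkdb.xml",
            "--stringparam", "current.docid", part,
            XSL, part + ".xml",
        ], check=True)

If that is the shape of it, then your step 3 is simply the second loop run for 
a single part, on demand, whenever its source changes or an uncached page is 
requested.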

One thing to keep in mind is that the source document doesn't have to be split 
in proportion to the pieces that are rendered; it only has to be kept in pieces 
small enough that performance is acceptable (a small detail, but from an editing 
perspective it can be practical to work with a document larger than what is 
viewed), /assuming/ the CMS (or whatever content generation mechanism is used) 
can map the generated output to a certain part of the source file (via XInclude, 
for example).
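
As an illustration of what I mean by mapping output back to the source, 
something along these lines should work (purely a sketch; I'm assuming lxml 
here, and the file names and the id are made up):

    # Sketch: given an id that appears in the rendered output, find which
    # physical source file it lives in, by resolving the XIncludes and
    # reading back the base URIs that the XInclude processing leaves behind.
    from lxml import etree

    tree = etree.parse("book.xml")   # master file pulling parts in via xi:include
    tree.xinclude()                  # resolve the includes in place

    def source_file_for(id_value):
        """Return the base URI (source file) of the element carrying this id."""
        matches = tree.getroot().xpath("//*[@id=$id]", id=id_value)
        return matches[0].base if matches else None

    print(source_file_for("installation-overview"))  # e.g. file:///.../part2.xml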

To recapitulate, the problem is the initial transformation of the requested 
content -- the stylesheets must traverse "all" the sources -- and that 
performance hit is the same regardless of whether the output is PDF or HTML, 
and regardless of how small the requested content is. Once it's generated, all 
is cool, since it's cached for later deliveries. That's the key problem -- 
everything depends on it.

Here are some possible solutions:


1. The olink approach you described. It works, but it's complex, constraining, 
and intrusive on content creation.

2. Truly static content (generated by cron). Not intrusive on content creation, 
but perhaps too simple (too dumb), and it can actually become a performance 
issue too; generating PDFs for each section means a lot of megabytes to write 
to disk each time the cron job runs.

3. Actually going for the long transformation we are trying to avoid, i.e. all 
the sources are transformed for each requested section. First of all, this long 
transformation only happens for the first request -- the first user -- and then 
the result is cached (a rough sketch of what I mean follows after this list). 
How long does it take then? Cocoon caches the includes and the source files, so 
when the cache becomes invalidated only one source file is reloaded (the one 
which has changed) while all the others and the DocBook XSLs (they're huge) are 
kept in memory (as DOM, I presume) -- perhaps that's enough to bring that first 
transformation down to a reasonable speed. I'm only speculating; no doubt the 
transformation itself takes the longest time (perhaps someone knows whether I'm 
being unrealistic, but otherwise real testing gives the definite answer). If 
this worked, it would be the best solution.
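
Here is the kind of thing I have in mind for 3), stripped of Cocoon entirely, 
just to show the caching logic. Everything in it is hypothetical -- the file 
names, the cache layout, and my use of the rootid stylesheet parameter to limit 
what gets formatted:

    # Minimal cache-on-first-request sketch for solution 3, without Cocoon.
    # The transformation is still the full, slow DocBook run over all the
    # sources; the point is only that it happens once per change, not once
    # per visitor.
    import os
    import subprocess

    SOURCES = ["book.xml", "part1.xml", "part2.xml"]   # whatever gets XIncluded
    XSL = "docbook-xsl/html/docbook.xsl"
    CACHE_DIR = "cache"

    def newest_source_mtime():
        return max(os.path.getmtime(path) for path in SOURCES)

    def html_for(section_id):
        """Return cached HTML for a section, regenerating it if any source changed."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        cached = os.path.join(CACHE_DIR, section_id + ".html")
        if os.path.exists(cached) and os.path.getmtime(cached) >= newest_source_mtime():
            with open(cached) as f:
                return f.read()            # cache hit: no transformation at all
        # Cache miss: the whole book is still parsed (that's the expensive part);
        # rootid only limits which element gets formatted into the output.
        subprocess.run([
            "xsltproc", "--xinclude",
            "--output", cached,
            "--stringparam", "rootid", section_id,
            XSL, "book.xml",
        ], check=True)
        with open(cached) as f:
            return f.read()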

These approaches can also be combined: the HTML output could be static 
(cron-generated) while the PDFs are dynamic. That way the performance trouble of 
2) is gone (writing tons of PDF files), and perhaps the delay is acceptable for 
PDF. From my shallow reading about Forrest, I have understood that it's good at 
combining dynamic serving with static generation, so perhaps it could be a way 
to pull it all together under one technical framework.
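
For the dynamic PDF half of that combination, the on-demand step could be as 
small as something like this (again only a sketch; I'm assuming xsltproc plus 
Apache FOP, and the paths and the id are invented):

    # Sketch: build a PDF for one section on demand, for a "print this page"
    # link, instead of writing every possible PDF to disk from a cron job.
    import subprocess

    FO_XSL = "docbook-xsl/fo/docbook.xsl"   # hypothetical stylesheet path

    def pdf_for(section_id, out_pdf):
        fo_file = section_id + ".fo"
        # DocBook XML -> XSL-FO, limited to the requested section via rootid.
        subprocess.run([
            "xsltproc", "--xinclude",
            "--output", fo_file,
            "--stringparam", "rootid", section_id,
            FO_XSL, "book.xml",
        ], check=True)
        # XSL-FO -> PDF with FOP.
        subprocess.run(["fop", "-fo", fo_file, "-pdf", out_pdf], check=True)

    pdf_for("installation-overview", "installation-overview.pdf")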


***

Another trouble with flexible website integration, or at least something which 
requires action, is navigation. As I see it, DocBook is tricky on that front -- 
the stylesheets are quite focused on static content generation, the chunked 
output for example. Since dynamic generation basically takes a node and 
transforms it with docbook.xsl, navigation must be hand-written, for example if 
one wants the TOC as a sidebar that changes depending on what is being viewed 
(flexible integration). I bet this is relatively easy to do, considering how the 
stylesheets are written, and it could be good to have in a generic form 
somewhere (Forrest, the DocBook XSLs, perhaps...).
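
What I mean by hand-written navigation is something in this direction: a 
sidebar TOC built outside the stylesheets, with the currently viewed page 
marked. Again just a sketch, assuming lxml and DocBook 4 style id attributes; 
every name in it is made up:

    # Sketch: build a sidebar TOC from the DocBook source itself, marking the
    # section that is currently being viewed. This lives outside the DocBook
    # XSLs; they would only be asked to render the section body.
    # (Titles are assumed to be plain text; no HTML escaping is done here.)
    from lxml import etree

    def sidebar_toc(book_file, current_id):
        tree = etree.parse(book_file)
        tree.xinclude()
        items = []
        for sect in tree.getroot().iter("chapter", "section"):
            sect_id = sect.get("id")
            title = sect.findtext("title")
            if sect_id is None or title is None:
                continue
            marker = ' class="current"' if sect_id == current_id else ""
            items.append('<li%s><a href="%s.html">%s</a></li>' % (marker, sect_id, title))
        return "<ul>\n" + "\n".join(items) + "\n</ul>"

    print(sidebar_toc("book.xml", "installation-overview"))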



Yes, speculations. When I write something, have actual numbers, a proof of 
concept, or actually know what I'm talking about, I will definitely share it on 
this list.

Hm.. That's as far as I see.


Cheers,

		Frans


