docbook message

Subject: Re: [docbook] DB4 to DB5 conversion stylesheet

From: Alexey Neyman <stilor@att.net>
To: docbook@lists.oasis-open.org
Date: Mon, 22 Apr 2013 11:05:08 -0700

Hi Bob,

Thanks. I've also made some other modifications in db4-upgrade.xsl, attached, so the full list of changes is:

- <productname>: do not discard all @class specifiers (but see below)

- <remark>: do not discard markup

- <title>, <subtitle>, <titleabbrev>: do not move titles into <info>. Instead, the <title> template checks whether the parent element XXX already contains a title inside XXX/XXXinfo; if it does, the stylesheet complains and skips the title outside of <XXXinfo>

- <productname>, <orderedlist>, <literallayout>: Suppress attributes with default values added by DTD (<literallayout class="normal">, etc.)

- Add an ability to set a custom @version via stylesheet parameter

- Slightly change the criteria used for determining where @version is set - set it on the topmost element without a namespace, or topmost element with DocBook namespace (we have a few documents that include DocBook inside some other elements using non-DocBook namespaces)
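
To illustrate the @version criterion in the last item, here is a minimal Python sketch of the selection rule. It is not the XSLT from the attached stylesheet; the function name version_targets, the wrapper namespace and the sample document are made up for the example, and it assumes the rule means "descend through foreign-namespace elements until an un-namespaced or DocBook-namespaced element is reached".

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"

def version_targets(elem):
    """Return the topmost elements that have no namespace or the DocBook namespace.

    Hypothetical helper: it only mirrors the rule described above.
    """
    tag = elem.tag
    # ElementTree spells namespaced tags as "{uri}local"; no "{" means no namespace.
    if not tag.startswith("{") or tag.startswith("{" + DOCBOOK_NS + "}"):
        return [elem]  # stop here: descendants are not "topmost"
    found = []
    for child in elem:
        found.extend(version_targets(child))
    return found

# DocBook embedded in elements from some other (made-up) namespace.
doc = ET.fromstring(
    "<w:page xmlns:w='http://example.org/wrapper'>"
    "<article xmlns='http://docbook.org/ns/docbook'>"
    "<section><title>T</title></section></article>"
    "<w:note><book><chapter/></book></w:note>"
    "</w:page>"
)
targets = [e.tag for e in version_targets(doc)]
print(targets)
```

In this sample the two candidates are the namespaced <article> and the un-namespaced <book>; their descendants are never visited.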

Also attached is a Python script I wrote for cloaking the entities and other content. It is inspired by the Perl script referenced in the DB4 to DB5 upgrade HOWTO, but differs in several ways:

- The Perl version is too greedy when looking for the internal subset (in fact, it assumes that an internal subset is always present). It treats everything from the "DOCTYPE" string to the next "]>" as the internal subset. When there is no internal subset, though, the next "]>" would be the end of the next CDATA section ("]]>").

- My version also preserves the content outside of the root element (which is lost with Perl version, as XSLT step drops it)

- My version also duplicates the <?xml ...?> directive and the DTD when cloaking, which may be needed by the XSLT step (the DTD may define namespace prefixes; <?xml ...?> may define the document encoding and XML version).
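
The greedy-match problem from the first item can be reproduced with a short sketch. The "naive" pattern below is only an assumed approximation of the behaviour described above, not the Perl script's actual expression; the "strict" pattern follows the same idea as find_dtd in the attached script (external ID and internal subset are both optional). The sample document is invented.

```python
import re

# A DB4 document with no internal subset, but with a CDATA section.
doc = (
    '<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"\n'
    '  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">\n'
    '<article><programlisting><![CDATA[a[i] > b]]></programlisting></article>\n'
)

# Assumed approximation of the naive approach: everything up to the next "]>".
naive = re.search(r'DOCTYPE.*?\]>', doc, flags=re.DOTALL)

# Stricter approach: quoted external identifiers, then an optional [...] subset.
strict = re.search(
    r'<!DOCTYPE\s+\S+\s+'
    r'(?:SYSTEM\s+(["\']).*?\1|PUBLIC\s+(["\']).*?\2\s+(["\']).*?\3)?'
    r'\s*(\[.*?\])?\s*>',
    doc, flags=re.DOTALL)

# The naive match runs into the document body and ends inside "]]>".
print('<article>' in naive.group())
# The strict match stops at the end of the DOCTYPE declaration.
print(strict.group().endswith('docbookx.dtd">'))
```

As the TBD comment in the script notes, the stricter pattern can still be fooled by "]" followed by ">" inside an internal subset.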

Feel free to use/publish this script.

Regards,

Alexey.

On Wednesday, April 17, 2013 02:06:01 PM Bob Stayton wrote:

Hi Alexey,

Thanks for pointing out these problems. We discussed them in the DocBook Technical Committee meeting today, and concluded that you are correct. The conversion stylesheet will get fixed. If you find any other issues, please let us know.

Bob Stayton
Sagehill Enterprises
bobs@sagehill.net

From: Alexey Neyman

Sent: Thursday, April 11, 2013 11:16 PM

To: docbook@lists.oasis-open.org

Subject: [docbook] DB4 to DB5 conversion stylesheet

Hi,

I am exploring the possibility of converting our documents from DB4 to DB5 using the db4-upgrade.xsl stylesheet provided at docbook.org. I have a few questions about what the stylesheet drops from DB4 elements.

1. Is there a particular reason why that stylesheet drops @class attribute from <productname/>?

<xsl:template match="productname[@class]" priority="200">
  <xsl:call-template name="emit-message">
    <xsl:with-param name="message">
      <xsl:text>Dropping class attribute from productname</xsl:text>
    </xsl:with-param>
  </xsl:call-template>
  <xsl:copy>
    <xsl:call-template name="copy.attributes">
      <xsl:with-param name="suppress" select="'class'"/>
    </xsl:call-template>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

As far as I can see from TDG5.1, it is perfectly legal in DocBook 5.0: http://www.docbook.org/tdg51/en/html/productname.html

At first I thought this removal was due to some element models changing before the final release of the DocBook 5.0 specification (the stylesheet claims conformance to 5.0CR5), but I don't see any mention of productname or remark in the specification change history.

2. A similar question: why is all markup removed from inside the <remark/> element?

<xsl:template match="remark" priority="200">
  <remark>
    <xsl:copy-of select="@*"/>
    <xsl:value-of select="."/>
  </remark>
</xsl:template>

Again, TDG5.1 says most mark-up is allowed in <remark/>: http://www.docbook.org/tdg51/en/html/remark.html

3. Is there a reason why all <title>xxx</title> elements are converted to <info><title>xxx</title></info>? As far as I can see, both are legal in DB5. Is <title> going to be deprecated as a direct child of <sectX/> elements?

We don't use anything but titles on lower level elements (such as sections), so <info> just adds some clutter to the documents.

Regards,

Alexey.

Attachment: db4-upgrade.xsl
Description: application/xslt

#!/usr/bin/python3

import sys
import re
import argparse

def do_cloak(ns, old, force):

	def pi(s):
		return '<?' + ns.pi_wrap + ' ' + s + '?>'

	def xml_escape(s):
		escapes = { '<' : 'LT', '>' : 'GT', '&' : 'AMP' }
		return re.sub(r'[<>&]', lambda x: pi('char ' + escapes[x.group()]), s)

	def make_escapes(dtd, s):
		if dtd != '':
			s = s.replace(dtd, xml_escape(dtd))
		s = re.sub(r'\<\?xml(.*?)\?\>', lambda m: pi('xml-pi !' + m.group(1)), s, flags=re.DOTALL)
		s = re.sub(r'\<\!\[CDATA\[(.*?)\]\]\>', lambda m: pi('cdata-start') + xml_escape(m.group(1)) + pi('cdata-end'), s, flags=re.DOTALL)
		# See NameStartChar/NameChar in XML spec; sorry for non-English entity names - add them if you use them.
		s = re.sub(r'\&([:_A-Za-z][-.:_A-Za-z0-9]*);', lambda m: pi('entity ' + m.group(1)), s)
		return s

	def find_dtd(s):
		if ns.no_dtd:
			return ""
		if "<!DOCTYPE" not in s:
			return ""
		# TBD: will fail if internal subset contains closing bracket followed by closing angle bracket (r'\]\s*\>'), e.g. inside an entity
		m = re.search(r'\<\!DOCTYPE\s+\S+\s+(SYSTEM\s+([\'"]).*?\2|PUBLIC\s+([\'"]).*?\3\s+([\'"]).*?\4)?\s*(\[.*?\])?\s*\>', s, flags=re.DOTALL)
		if m:
			return m.group()
		sys.stderr.write("DOCTYPE seen but not understood\n")
		# None (as opposed to "") tells do_cloak() to give up on this input
		return None


	def find_xmldecl(s):
		if ns.no_xmldecl:
			return ""
		m = re.search(r'\<\?xml\s+.*?\?\>', s)
		if m is None:
			return ""
		return m.group(0)

	def tmp_wrap(s):
		return ("<%s:wrapper xmlns:%s='%s'>" % (ns.wrap_pfx, ns.wrap_pfx, ns.wrap_uri)) + s + ("</%s:wrapper>" % ns.wrap_pfx)

	if not(force) and pi('cloak-marker') in old:
		return None
	xmldecl = find_xmldecl(old)
	dtd = find_dtd(old)
	if dtd is None:
		return None
	return xmldecl + dtd + tmp_wrap(make_escapes(dtd, old) + pi('cloak-marker'))

def do_uncloak(ns, old):
	escapes = { 'LT' : '<', 'GT' : '>', 'AMP' : '&' }
	pis = {
			'xml-pi' : lambda s: "<?xml%s?>" % s[1:],
			'char': lambda s: escapes[s],
			'entity': lambda s: "&%s;" % s,
			'cdata-start': lambda s: '<![CDATA[',
			'cdata-end': lambda s: ']]>',
			'cloak-marker': lambda s: ''
	}

	# Find the temporary wrapper
	start = old.find("<%s:wrapper" % ns.wrap_pfx)
	if start == -1:
		return None
	start = old.find(">", start)
	if start == -1:
		return None
	start += 1
	end = old.rfind("</%s:wrapper>" % ns.wrap_pfx)
	if end == -1 or end <= start:
		return None
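	# DOTALL: cloaked <?xml...?> PIs may span several lines (they are cloaked with DOTALL as well)
	new = re.sub(r'\<\?' + ns.pi_wrap + r'\s+(?P<pi>\S+?)(\s+(?P<arg>.*?))?\?\>', lambda m: pis[m.group('pi')](m.group('arg')), old[start:end], flags=re.DOTALL)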
	return new

if __name__ == '__main__':
	parser = argparse.ArgumentParser(description='Cloak/uncloak entities and other elements before XSLT processing')
	parser.add_argument('--cloak', action='store_true', help='Cloak document (default: determine automatically)')
	parser.add_argument('--uncloak', action='store_true', help='Uncloak document (default: determine automatically)')
	parser.add_argument('--pi-wrap', metavar='PI', default='xxx-escape', help='Name of PI used to escape elements')
	parser.add_argument('--wrap-pfx', metavar='PFX', default='tmp', help='Namespace prefix for top-level wrapper element')
	parser.add_argument('--wrap-uri', metavar='URI', default='http://localhost/temporary-wrapper-namespace', help='Namespace URI for top-level wrapper element')
	parser.add_argument('--no-dtd', action='store_true', help='Do not create a copy of DTD declaration')
	parser.add_argument('--no-xmldecl', action='store_true', help='Do not create a copy of XML declaration')
	parser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin, help='Input (stdin if omitted)')
	parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout, help='Output (stdout if omitted)')
	ns = parser.parse_args()
	old = ns.infile.read()
	if ns.uncloak:
		new = do_uncloak(ns, old)
	else:
		new = do_cloak(ns, old, ns.cloak)
		if new is None:
			new = do_uncloak(ns, old)
	if new is None:
		sys.stderr.write("script cannot handle input\n")
		sys.exit(1)
	ns.outfile.write(new)
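
For readers who only want the idea behind the script, here is a stripped-down sketch of the entity round trip. It is a simplification, not a replacement for the script above: the helper names cloak_entities and uncloak_entities are invented for this example, only entity references are handled (no DTD, CDATA or XML declaration), and the sample paragraph is made up. The PI name is the script's default --pi-wrap value.

```python
import re
import xml.etree.ElementTree as ET

PI_NAME = "xxx-escape"  # default value of --pi-wrap in the script above

def cloak_entities(text):
    # Turn &name; into a processing instruction so an XML parser leaves it alone.
    return re.sub(r'&([:_A-Za-z][-.:_A-Za-z0-9]*);',
                  lambda m: '<?%s entity %s?>' % (PI_NAME, m.group(1)), text)

def uncloak_entities(text):
    # Turn the processing instruction back into the original entity reference.
    return re.sub(r'<\?' + re.escape(PI_NAME) + r'\s+entity\s+(\S+?)\?>',
                  lambda m: '&%s;' % m.group(1), text)

src = '<para>&copy; 2013 &company-name;</para>'
cloaked = cloak_entities(src)
print(cloaked)

# The cloaked text parses without any entity declarations being available.
ET.fromstring(cloaked)

restored = uncloak_entities(cloaked)
print(restored == src)
```

Parsing src directly would fail on the undeclared entities, which is the reason for cloaking before the XSLT step.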

References:
- DB4 to DB5 conversion stylesheet
  - From: Alexey Neyman <stilor@att.net>
- Re: [docbook] DB4 to DB5 conversion stylesheet
  - From: "Bob Stayton" <bobs@sagehill.net>