docbook-apps message


Subject: Re: [docbook-apps] Auto-generating an index


On Fri, Feb 4, 2011 at 09:12, Ron Catterall <ron@catterall.net> wrote:
> Tom
> I generally have a whole lot of files in a hierarchy under a root directory
> (Xinclude) and do a global edit on the root directory (I use BBedit, but
> other editors can do this).  Of course Perl can do the same, I'd be
> interested in your Perl script as and when.

Thanks for your interest, Ron and Paul.  I'll post it when I have a
rudimentary system working.

First of all, I have all my db xml files in one directory, but your
arrangement should work, too.  The actual db source files for my doc
have a ".db.xml" extension so I can have miscellaneous xml files
around and ease my Makefile editing.

Here's what I've done so far:

I made a file containing lines from all my db files:

$ for f in *.db.xml ; do cat "$f" >> t.txt ; done

Then I sorted the list and removed dup lines:

$ sort t.txt | uniq > index-words.txt

I edit "index-words.txt" to select the words and phrases I want
indexed, one per line.  I put two special lines in the file for parsing:

1.  a header line just to check for the proper file beginning:

# index-words-and-phrases

2.  a line to indicate end of parsing so the file can be used as a
work in progress:

# end-index-parse

I hard-wired the file "index-words.txt" as my initial input file and
treat each line as a separate word or phrase to put in my dictionary
(a Perl hash).

Each line in the file gets whitespace normalization in the program
before it is put in the hash.  I'm still working on this part, but I
think I'm going to change my current simple hash of phrases to a hash
keyed on the number of words per line, where each key owns a hash of
the phrases with that word count.
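
For what it's worth, a minimal sketch of how that input pass might
look (the file name and the two marker lines are as described above;
the hash name and the exact normalization are my assumptions, not the
actual script):

<programlisting>
#!/usr/bin/perl
use strict;
use warnings;

# Build a hash of hashes keyed on word count from "index-words.txt"
# (a sketch of the pass described above, not the actual script).
my %phrases_by_count;
open my $fh, '<', 'index-words.txt' or die "index-words.txt: $!";

my $first = <$fh>;
die "bad header in index-words.txt\n"
    unless defined $first && $first =~ /^#\s*index-words-and-phrases/;

while (my $line = <$fh>) {
    last if $line =~ /^#\s*end-index-parse/;    # stop parsing here
    $line =~ s/^\s+|\s+$//g;                    # trim leading/trailing space
    $line =~ s/\s+/ /g;                         # collapse internal runs
    next unless length $line;
    my @words  = split ' ', $line;
    my $nwords = @words;
    $phrases_by_count{$nwords}{lc $line} = $line;   # keyed case-insensitively
}
close $fh;
</programlisting>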

Then I loop over each member (f) of @ARGV and on it use this pseudo code:

  if f is --debug or --force, set that flag and continue
  if f is not a readable file, continue
  if f is not a file with the correct extension, continue
  die if there are problems opening the file for reading

  make a new file name for output (ofil) by appending a hard-wired
    suffix to f (".index_markup")
  if ofil exists, continue unless the force flag is true

  open ofil and loop over every line of f and do:
     from most words to least in the phrase hash, look for matches
       (not case-significant)
        bound those found with index tags and eliminate that text
          from further searching
     output the [possibly modified] line to the output file
  close the output file
  if no lines have been modified
    unlink the output file
  else
    push the file name onto an array

at end of program, output the names of the files modified
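
A runnable sketch of that loop might look like the following; the flow
follows the pseudo code, but wrap_phrases() is a hypothetical stand-in
for the matching step (see the matching sketch further down):

<programlisting>
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the outer loop above, not the actual script.
my ($debug, $force) = (0, 0);
my @modified;

sub wrap_phrases { return $_[0] }    # stub for the matching step

for my $f (@ARGV) {
    if ($f eq '--debug') { $debug = 1; next; }
    if ($f eq '--force') { $force = 1; next; }
    next unless -r $f;                      # skip unreadable files
    next unless $f =~ /\.db\.xml$/;         # skip files without the extension

    my $ofil = "$f.index_markup";
    next if -e $ofil && !$force;            # don't overwrite without --force

    open my $in,  '<', $f    or die "$f: $!";
    open my $out, '>', $ofil or die "$ofil: $!";

    my $changed = 0;
    while (my $line = <$in>) {
        my $new = wrap_phrases($line);      # insert index tags where phrases match
        $changed++ if $new ne $line;
        print $out $new;
    }
    close $in;
    close $out;

    if ($changed) {
        push @modified, $f;                 # remember files that got markup
    } else {
        unlink $ofil;                       # nothing changed, drop the copy
    }
}

print "modified: $_\n" for @modified;
</programlisting>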

Then, for each modified file, I use a visual diff program to see the
results and modify the original file as needed.

I see this as an iterative process that I can work on when I'm looking
for some boring work for relaxation.

My current script execution without args looks like this:

<programlisting>
Usage: ./make_index_markup.pl [--debug] [--force] <DocBook xml file(s)>

Example:

  $ ./make_index_markup.pl *.db.xml

Uses file 'index-words.txt' as input to generate index markup.
Output files are named same as input with '.index_markup' suffix.
Existing files are not overwritten unless the '--force'
  option is used.
Options must be first in the arg list before files to take
  effect for the duration of the run.
</programlisting>

Obviously there is room for prepping the index input file to eliminate
tags, etc.  An xsl file could help, but my system relies on source line
orientation.  There is also room for using tmp files or an array for the
output, writing the array or tmp file only if some line was modified.

I have thought about using xmllint to canonicalize the xml source
first to help the situation.

The actual algorithm used for the line matches is certainly not
settled yet, but regexes will be used heavily (referring to my trusty
copy of "Perl Cookbook").  I also need to check that a phrase found is
not already indexed (since the process is iterative, the source file
may already have the index tags).
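
One way that matching step could go is sketched below.  Only the
longest-phrase-first order, the case-insensitive matching, and the
skip-if-already-indexed check come from the description above; the
<indexterm>/<primary> markup is standard DocBook, and the rest is my
assumption:

<programlisting>
# Sketch of the per-line matching step (the wrap_phrases() stub above).
# Assumes the %phrases_by_count hash of hashes built from index-words.txt.
our %phrases_by_count;

sub wrap_phrases {
    my ($line) = @_;

    # longest phrases first so shorter words can't mask longer phrases
    for my $nwords (sort { $b <=> $a } keys %phrases_by_count) {
        for my $phrase (values %{ $phrases_by_count{$nwords} }) {
            my $pat = quotemeta $phrase;

            # skip a phrase already indexed on an earlier, iterative run
            next if $line =~ /$pat\s*<indexterm\b/i;

            # case-insensitive match; wrap the first hit in index tags
            $line =~ s{($pat)}{$1<indexterm><primary>$phrase</primary></indexterm>}i;
        }
    }
    # still to do: eliminate matched text from further searching, as the
    # pseudo code above calls for
    return $line;
}
</programlisting>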

Comments and criticisms welcome.

Best regards,

-Tom

