Subject: Re: [docbook-apps] Auto-generating an index
On Fri, Feb 4, 2011 at 09:12, Ron Catterall <ron@catterall.net> wrote:
> Tom
> I generally have a whole lot of files in a hierarchy under a root directory
> (Xinclude) and do a global edit on the root directory (I use BBedit, but
> other editors can do this). Of course Perl can do the same, I'd be
> interested in your Perl script as and when.

Thanks for your interest, Ron and Paul. I'll post it when I have a rudimentary system working.

First of all, I have all my db xml files in one directory, but your arrangement should work, too. The actual db source files for my doc have a ".db.xml" extension so I can have miscellaneous xml files around and ease my Makefile editing.

Here's what I've done so far:

I made a file containing the lines from all my db files:

  $ for f in *.db.xml ; do cat "$f" >> t.txt ; done

Then I sorted the list and removed duplicate lines:

  $ sort t.txt | uniq > index-words.txt

I edit "index-words.txt" to select the words and phrases to index, one per line. I put two special lines in the file for parsing:

1. a header line, just to check for the proper file beginning:

  # index-words-and-phrases

2. a line to indicate the end of parsing, so the file can be used as a work in progress:

  # end-index-parse

I hard-wired file "index-words.txt" as my initial input file and treat each line as a separate word or phrase to put in my dictionary (a Perl hash). Each line in the file gets white space normalization in the program before it is put in the hash.

I'm still working on this part, but I think I'm going to change my current simple hash of phrases to one keyed on the number of words per line: a hash of word counts, each of which owns a hash of the phrases with that number of words.
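A minimal sketch of that planned two-level structure, written in Python rather than Perl purely for illustration (it is not the script described here). The function name `load_phrases`, the choice to key the inner dictionary on the lowercased phrase, and the skipping of blank lines are all assumptions; only the two marker lines and the white space normalization come from the description above.

```python
from collections import defaultdict

HEADER = "# index-words-and-phrases"   # required first line
END_MARK = "# end-index-parse"         # parsing stops at this line


def load_phrases(lines):
    """Group phrases by word count: {count: {lowercased phrase: phrase}}.

    `lines` is any iterable of text lines, e.g. an open file.
    """
    lines = iter(lines)
    first = next(lines, "")
    if " ".join(first.split()) != HEADER:
        raise ValueError("missing header line: " + HEADER)

    by_count = defaultdict(dict)
    for raw in lines:
        phrase = " ".join(raw.split())   # normalize white space
        if phrase == END_MARK:
            break                        # rest of file is work in progress
        if not phrase:
            continue                     # assumption: blank lines are skipped
        # Lowercased key so later matching can ignore case.
        by_count[len(phrase.split())][phrase.lower()] = phrase
    return dict(by_count)
```

Calling `load_phrases(open("index-words.txt"))` would give one inner dictionary per phrase length, which makes it easy to walk from the longest phrases to the shortest when matching.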
Then I loop over each member (f) of @ARGV and on it use this pseudo code:

  if f is --debug or --force, set that flag and continue
  if f is not a readable file, continue
  if f is not a file with the correct extension, continue
  die if there are problems opening the file for reading
  make a new file name for output (ofil) by appending a hard-wired
    suffix to f (".index_markup")
  if ofil exists, continue unless force flag is true
  open ofil and loop over every line of f and do:
    from most words to least in the phrase hash
      look for matches (not case-significant)
      bound those found with index tags and eliminate that text
        from further searching
    output the [possibly modified] line to the output file
  close the output file
  if no lines have been modified
    unlink the output file
  else
    push file name onto an array

  at end of program, output names of files modified

Then, for each file modified, use a visual diff prog to see the results and modify the original file as needed. I see this as an iterative process that I can work on when I'm looking for some boring work for relaxation.

My current script execution without args looks like this:

<programlisting>
Usage: ./make_index_markup.pl [--debug] [--force] <DocBook xml file(s)>

Example:

  $ ./make_index_markup.pl *.db.xml

Uses file 'index-words.txt' as input to generate index markup.
Output files are named same as input with '.index_markup' suffix.
Existing file are not over written unless the '--force' option is used.
Options must be first in the arg list before files to take effect
for the duration of the run.
</programlisting>

Obviously there is room for prepping the index input file to eliminate tags, etc. An xsl file could help, but my system relies on source line orientation. There is also much room for using tmp files or an array for output, and only writing the array or tmp file if any line was modified. I have thought about using xmllint to canonicalize the xml source first to help the situation.
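The per-line matching step in the pseudo code above could look roughly like the following Python sketch (again illustration only, not the Perl script). Several things here are assumptions: the function name `mark_line`, the `{count: {lowercased phrase: phrase}}` shape of the phrase table, and the choice to insert a DocBook `<indexterm><primary>...</primary></indexterm>` directly after each match as the "index tags". The sketch does not try to avoid matches inside existing XML tags.

```python
import re


def mark_line(line, by_count):
    """Return (new_line, changed) after adding index markup to one line.

    by_count maps word count -> {lowercased phrase: phrase as written}.
    """
    # Each segment is (text, consumed). Consumed text is never searched again.
    segments = [(line, False)]
    for count in sorted(by_count, reverse=True):      # most words first
        for key, phrase in by_count[count].items():
            words = r"\s+".join(re.escape(w) for w in key.split())
            pattern = re.compile(r"(?<!\w)" + words + r"(?!\w)", re.IGNORECASE)
            next_segments = []
            for text, consumed in segments:
                if consumed:
                    next_segments.append((text, True))
                    continue
                pos = 0
                for m in pattern.finditer(text):
                    next_segments.append((text[pos:m.start()], False))
                    tagged = (m.group(0) + "<indexterm><primary>"
                              + phrase + "</primary></indexterm>")
                    next_segments.append((tagged, True))
                    pos = m.end()
                next_segments.append((text[pos:], False))
            segments = next_segments
    new_line = "".join(text for text, _ in segments)
    return new_line, new_line != line
```

The `changed` flag is what a caller would use to decide whether an output file is worth writing at all, which fits the idea of collecting output in an array and only writing it when some line was modified.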
The actual algorithm used for the line matches is certainly not settled yet, but regexes will be used heavily (referring to my trusty copy of "Perl Cookbook"). I also need to check that a phrase found is not already indexed (since the process is iterative, the source file may already have the index tags).

Comments and criticisms welcome.

Best regards,

-Tom