Subject: Re: [docbook-apps] WebHelp, English stemmer, problems with specific words
Oops. Snapshots are available, but the DNS change hasn't happened yet.
I forgot that I'd added the following to my hosts file a while back:

50.56.245.89 snapshots.docbook.org snapshots

Feel free to go directly to http://50.56.245.89 in the meantime.

David

On 01/12/2012 09:53 AM, David Cramer wrote:
> Yes, Kasun, Peter, and I talked about it then, but are just now
> finding time to fix it.
>
> Btw., Mike Smith has the snapshot builds moved over to the new
> server, so you can again download a snapshot to test the latest
> functionality [1] or check out what the latest output looks like
> [2].
>
> David
>
> [1] http://snapshots.docbook.org/
> [2] http://snapshots.docbook.org/xsl/webhelp/docs/content/ch01.html
>
> On 01/11/2012 09:31 AM, Bill Burns wrote:
>> Thanks, David. I reported this same issue to Kasun about three
>> months ago.
>>
>> Bill Burns
>> Verbum Communications, Inc.
>> +1.208.336.6081
>> bburns@verbumcomm.com
>> http://www.verbumcomm.com
>>
>> -----Original Message-----
>> From: David Cramer [mailto:david@thingbag.net]
>> Sent: Tuesday, January 10, 2012 9:54 PM
>> To: Bort, Paul
>> Cc: docbook-apps@lists.oasis-open.org
>> Subject: Re: [docbook-apps] WebHelp, English stemmer, problems
>> with specific words
>>
>> Hi Paul,
>> Funny you should mention that. I've also been working on the
>> client-side stemmer recently to address the same issue you
>> mention and some others. The problem was that all words ending
>> in vowel+y (relay, array, key, say, day) were being stemmed to -i
>> (relai, arrai, kei, sai, dai) by the client-side stemmer but not
>> by the build-time indexer. I'm mostly done, but I think it still
>> overstems words like arsenal.
>>
>> http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/stemmers/en_stemmer.js?r1=9067&r2=9178
>>
>> Basically, nothing from the section "Exceptional forms in
>> general" was implemented and step 1c was incorrectly implemented:
>> http://snowball.tartarus.org/algorithms/english/stemmer.html
>>
>> Regarding nucleus etc., I've also committed a fix from a
>> colleague that should always check the index for the full
>> unstemmed word to catch those Latinate terms that are handled
>> correctly by the indexer but not the client-side stemmer:
>>
>> http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/nwSearchFnt.js?r1=9105&r2=9172
>>
>> He's also working on always searching the index for things that
>> look like filenames (e.g. build.xml, which it currently
>> tokenizes to 'build' and 'xml').
>>
>> Here's a demo of the current state of things:
>>
>> http://www.thingbag.net/docbook/docs/content/ch05s01.html
>>
>> You can grab the en_stemmer.js and use it now. The
>> nwSearchFnt.js file also has changes related to adding search
>> weighting to the results, so you'd need to take changes from it
>> more carefully.
>>
>> We should have a release of the xsls out before too long though.
>>
>> Thanks,
>> David
>>
>> On 01/10/2012 07:33 PM, Bort, Paul wrote:
>>> Hi,
>>>
>>> I found the conversation about problems with the stemmer used
>>> with English at
>>> http://lists.oasis-open.org/archives/docbook-apps/201103/msg00040.html
>>> very informative in tracking down the problem I'm having with
>>> the stemmer, which is similar. In my case, the word that isn't
>>> being stemmed correctly is "relay". (It comes out as "relai".)
>>> This does break searches: searching for "relay" in a document
>>> that should have six matches returns an error "Your search
>>> returned no results for relai".
>>>
>>> The solution that I've implemented locally, and offer below
>>> for your consideration, is a list of words to be stemmed
>>> manually. I've tried to follow your coding style but I'm not a
>>> serious JavaScript hacker so I may have stepped on some toes
>>> inadvertently.
>>>
>>> Regards,
>>> Paul Bort
>>> Systems Engineer
>>> TMW Systems, Inc.
>>> pbort@tmwsystems.com
>>>
>>> ----------------------------------
>>>
>>> --- en_stemmer.js
>>> +++ en_stemmer.js
>>> @@ -54,6 +54,14 @@
>>>          meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$",  // [C]VC[V] is m=1
>>>          mgr1 = "^(" + C + ")?" + V + C + V + C,            // [C]VCVC... is m>1
>>>          s_v = "^(" + C + ")?" + v;                         // vowel in stem
>>> +
>>> +    var exceptionWords = {
>>> +        "relay":"relay",
>>> +        "relaying":"relay",
>>> +        "relays":"relay",
>>> +        "nucleus":"nucleus",
>>> +        "zeus":"zeus"
>>> +    };
>>>
>>>      return function (w) {
>>>          var stem,
>>> @@ -67,6 +75,8 @@
>>>
>>>          if (w.length < 3) { return w; }
>>>
>>> +        if (w in exceptionWords) { return exceptionWords[w]; }
>>> +
>>>          firstch = w.substr(0,1);
>>>          if (firstch == "y") {
>>>              w = firstch.toUpperCase() + w.substr(1);
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: docbook-apps-unsubscribe@lists.oasis-open.org
>> For additional commands, e-mail: docbook-apps-help@lists.oasis-open.org
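
For readers following along, here is a rough sketch of the two fixes discussed in the thread: the step 1c rule as the Snowball English stemmer page describes it (a final y becomes i only after a non-vowel that is not the first letter of the word), and a lookup that also checks the index for the full unstemmed word. The names step1c and lookup, and the plain word-to-hits object used as the index, are assumptions made for illustration; this is not the code in en_stemmer.js or nwSearchFnt.js.

```javascript
// Illustrative sketch only; names and index shape are hypothetical.
var VOWELS = "aeiouy";

// Snowball English step 1c: replace a final y/Y with i only when the
// preceding letter is a non-vowel that is not the first letter of the word.
function step1c(w) {
    var last = w.charAt(w.length - 1);
    if (w.length > 2 && (last === "y" || last === "Y") &&
            VOWELS.indexOf(w.charAt(w.length - 2)) === -1) {
        return w.slice(0, -1) + "i";
    }
    return w;
}

// Look up a query word in a { word: [hits] } index. The stemmed form is
// tried first, then the full unstemmed word, so terms that the client-side
// stemmer mangles can still match what the indexer stored.
function lookup(index, word, stem) {
    var full = word.toLowerCase();
    var keys = [stem(full)];
    if (keys[0] !== full) { keys.push(full); }
    var hits = [];
    keys.forEach(function (k) {
        if (Object.prototype.hasOwnProperty.call(index, k)) {
            hits = hits.concat(index[k]);
        }
    });
    return hits;
}
```

With this rule, step1c("relay") and step1c("say") leave the word unchanged, while step1c("cry") gives "cri". The lookup helper takes the stemmer as a parameter, so it can be tried against any stemming function and any small test index.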