docbook-apps message

Subject: RE: [docbook-apps] WebHelp, English stemmer, problems with specific words

Thanks, David. I reported this same issue to Kasun about three months ago.

Bill Burns
Verbum Communications, Inc.

-----Original Message-----
From: David Cramer [mailto:david@thingbag.net] 
Sent: Tuesday, January 10, 2012 9:54 PM
To: Bort, Paul
Cc: docbook-apps@lists.oasis-open.org
Subject: Re: [docbook-apps] WebHelp, English stemmer, problems with specific words

Hi Paul,
Funny you should mention that. I've also been working on the client-side stemmer recently to address the issue you describe and some others. The problem was that every word ending in vowel+y (relay, array, key, say, day) was stemmed to -i (relai, arrai, kei, sai, dai) by the client-side stemmer but not by the build-time indexer. I'm mostly done, but I think it still overstems words like arsenal.
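To illustrate the vowel+y case, here is a minimal sketch of how step 1c reads in the Snowball English (Porter2) description: a trailing y becomes i only when the letter before it is a non-vowel that is not the first letter of the word. The function name step1c is hypothetical, and this is not the code committed to en_stemmer.js; it only shows the rule in isolation.

```javascript
// Hypothetical helper, not taken from en_stemmer.js.
// Porter2-style step 1c: replace a trailing y/Y with i only when it is
// preceded by a non-vowel that is not the first letter of the word.
function step1c(word) {
    var vowels = "aeiouy";
    var last = word.length - 1;
    // For words of one or two letters the preceding letter would be the
    // first letter, so nothing is replaced (by -> by).
    if (last < 2) { return word; }
    var ch = word.charAt(last);
    if (ch !== "y" && ch !== "Y") { return word; }
    // vowel + y is left alone (relay, key, say, day)
    if (vowels.indexOf(word.charAt(last - 1)) !== -1) { return word; }
    return word.substr(0, last) + "i";
}

step1c("relay"); // "relay"
step1c("cry");   // "cri"
step1c("by");    // "by"
```

With this rule the client side would produce "relay" rather than "relai", so a search term and an index entry stemmed the same way would match again.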


Basically, nothing from the section "Exceptional forms in general" was implemented and step 1c was incorrectly implemented:

Regarding nucleus etc., I've also committed a fix from a colleague that should always check the index for the full unstemmed word to catch those Latinate terms that are handled correctly by the indexer but not the client side stemmer:
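The idea of always checking the index for the full unstemmed word can be sketched roughly as follows. All names here are assumptions for illustration: index is taken to be a plain object mapping indexed terms to arrays of page ids, and stem stands in for the client-side stemmer. The real lookup in the committed code may be organised quite differently.

```javascript
// Hypothetical sketch; `index`, `stem` and `lookupTerm` are illustrative names.
// Look up both the word as typed and its stem, and merge the hits.
function lookupTerm(index, stem, word) {
    var candidates = [word, stem(word)];   // unstemmed word first, then the stem
    var hits = [];
    for (var i = 0; i < candidates.length; i++) {
        var term = candidates[i];
        // hasOwnProperty avoids matching inherited keys such as "constructor"
        if (!Object.prototype.hasOwnProperty.call(index, term)) { continue; }
        var pages = index[term];
        for (var j = 0; j < pages.length; j++) {
            // skip pages already collected for the other candidate
            if (hits.indexOf(pages[j]) === -1) { hits.push(pages[j]); }
        }
    }
    return hits;
}
```

If the indexer stored "nucleus" unchanged but the client-side stemmer turns it into something else, the first candidate still finds the entry.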


He's also working on always searching the index for things that look like filenames (e.g. build.xml, which it currently tokenizes to 'build' and 'xml').
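A rough sketch of the filename idea, under stated assumptions: the pattern used to decide what "looks like a filename" and the function name searchTerms are both hypothetical, and the sketch assumes the index would also hold such names whole. It keeps the intact token as an extra search term alongside the pieces the tokenizer would normally produce.

```javascript
// Hypothetical sketch; the regular expression and function name are assumptions.
// Keep tokens that look like filenames (e.g. build.xml) as a whole term,
// in addition to the usual split into 'build' and 'xml'.
function searchTerms(query) {
    var terms = [];
    var words = query.split(/\s+/);
    for (var i = 0; i < words.length; i++) {
        var word = words[i].toLowerCase();
        if (word === "") { continue; }
        // name, a dot, then an extension: treated as filename-like
        if (/^[\w-]+\.[a-z0-9]+$/.test(word)) { terms.push(word); }
        // ordinary tokenization on anything that is not a letter or digit
        var parts = word.split(/[^a-z0-9]+/);
        for (var j = 0; j < parts.length; j++) {
            if (parts[j] !== "") { terms.push(parts[j]); }
        }
    }
    return terms;
}

searchTerms("build.xml"); // ["build.xml", "build", "xml"]
```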

Here's a demo of the current state of things:


You can grab the en_stemmer.js and use it now. The nwSearchFnt.js file also has changes related to adding search weighting to the results, so you'd need to take changes from it more carefully.

We should have a release of the xsls out before too long though.


On 01/10/2012 07:33 PM, Bort, Paul wrote:
> Hi,
>
> I found the conversation about problems with the stemmer used with
> English at
> http://lists.oasis-open.org/archives/docbook-apps/201103/msg00040.html
> very informative in tracking down the problem I'm having with the
> stemmer, which is similar. In my case, the word that isn't being
> stemmed correctly is "relay". (It comes out as "relai".) This does
> break searches: searching for "relay" in a document that should have
> six matches returns an error "Your search returned no results for
> relai".
>
> The solution that I've implemented locally, and offer below for your
> consideration, is a list of words to be stemmed manually. I've tried
> to follow your coding style but I'm not a serious JavaScript hacker so
> I may have stepped on some toes inadvertently.
>
> Regards,
> Paul Bort
> Systems Engineer
> TMW Systems, Inc.
> pbort@tmwsystems.com
> ----------------------------------
> --- en_stemmer.js
> +++ en_stemmer.js
> @@ -54,6 +54,14 @@
>          meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$",  // [C]VC[V] is m=1
>          mgr1 = "^(" + C + ")?" + V + C + V + C,       // [C]VCVC... is m>1
>          s_v = "^(" + C + ")?" + v;                   // vowel in stem
> +
> +    var exceptionWords = {
> +            "relay":"relay",
> +            "relaying":"relay",
> +            "relays":"relay",
> +            "nucleus":"nucleus",
> +            "zeus":"zeus"
> +        };
>      return function (w) {
>          var     stem,
> @@ -67,6 +75,8 @@
>          if (w.length < 3) { return w; }
> +        if (w in exceptionWords) { return exceptionWords[w]; }
> +
>          firstch = w.substr(0,1);
>          if (firstch == "y") {
>              w = firstch.toUpperCase() + w.substr(1);
