
docbook-apps message


Subject: Re: [docbook-apps] WebHelp, English stemmer, problems with specific words


Hi Paul,
Funny you should mention that. I've also been working on the client
side stemmer recently to address the same issue you mention and some
others. The problem was with all words ending with vowel+y (relay,
array, key, say, day) being stemmed to -i (relai, arrai, kei, sai, dai)
by the client side stemmer but not by the build-time indexer. I'm
mostly done, but I think it still overstems words like arsenal.
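
For illustration, here is a minimal sketch of how the Snowball English (Porter2) algorithm treats a trailing y in its step 1c: the y becomes i only when the letter before it is a non-vowel that is not the first letter of the word. The function name step1c is hypothetical and this is not the code from en_stemmer.js; it also ignores the algorithm's earlier marking of consonantal y as "Y", except to accept it as a final letter.

```javascript
// Sketch of the Porter2 (Snowball English) rule for a trailing "y".
// step1c is a hypothetical name, not a function from en_stemmer.js.
function step1c(word) {
    var vowels = "aeiouy";
    var last = word.length - 1;

    // The preceding letter must not be the first letter: "by" stays "by".
    if (last < 2) { return word; }

    var ch = word.charAt(last);
    if (ch !== "y" && ch !== "Y") { return word; }

    // Vowel + y is left alone: "relay", "key" and "say" are unchanged.
    if (vowels.indexOf(word.charAt(last - 1)) !== -1) { return word; }

    // Non-vowel + y becomes i: "cry" -> "cri", "happy" -> "happi".
    return word.slice(0, last) + "i";
}
```

With this rule, step1c("relay") returns "relay", while step1c("cry") returns "cri".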


Basically, nothing from the section "Exceptional forms in general" was
implemented and step 1c was incorrectly implemented:

Regarding nucleus etc., I've also committed a fix from a colleague
that should always check the index for the full unstemmed word to
catch those Latinate terms that are handled correctly by the indexer
but not the client side stemmer:
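
A rough sketch of that kind of fallback, not the committed fix itself: look up the word exactly as typed first, then its stem, and merge the results. The index shape (an object mapping indexed terms to arrays of document ids) and the names lookupWithFallback and stem are assumptions made for this example.

```javascript
// Hypothetical index shape: { term: [docId, ...] }.
// lookupWithFallback and the stem argument are illustrative names only.
function lookupWithFallback(index, word, stem) {
    var has = Object.prototype.hasOwnProperty;
    var candidates = [word, stem(word)];   // unstemmed form first, then the stem
    var hits = [];

    for (var i = 0; i < candidates.length; i++) {
        var term = candidates[i];
        // hasOwnProperty avoids false matches on inherited keys such as "constructor".
        if (!has.call(index, term)) { continue; }
        var docs = index[term];
        for (var j = 0; j < docs.length; j++) {
            if (hits.indexOf(docs[j]) === -1) { hits.push(docs[j]); }
        }
    }
    return hits;
}
```

If the client side stemmer turns "nucleus" into something the indexer never produced, the unstemmed lookup still finds the pages indexed under "nucleus".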


He's also working on always searching the index for things that look
like filenames (e.g. build.xml, which it currently tokenizes to
'build' and 'xml').
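
One way this could work, purely as a sketch and not his actual code: keep any query term that looks like a filename as a whole token, in addition to the pieces the tokenizer already produces. The function name tokenizeQuery and the filename pattern are assumptions for illustration.

```javascript
// Hypothetical tokenizer: keeps filename-like terms whole and also emits their parts.
function tokenizeQuery(query) {
    // Assumed heuristic: name segments joined by dots, ending in a short extension.
    var fileLike = /^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*\.[A-Za-z][A-Za-z0-9]{0,4}$/;
    var tokens = [];
    var parts = query.split(/\s+/);

    for (var i = 0; i < parts.length; i++) {
        var part = parts[i];
        if (part === "") { continue; }

        // Keep "build.xml" as a single search term.
        if (fileLike.test(part)) { tokens.push(part); }

        // Also emit the plain word pieces ("build", "xml").
        var pieces = part.split(/[^A-Za-z0-9]+/);
        for (var j = 0; j < pieces.length; j++) {
            if (pieces[j] !== "") { tokens.push(pieces[j]); }
        }
    }
    return tokens;
}
```

For the query "edit build.xml now" this yields the tokens edit, build.xml, build, xml and now.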

Here's a demo of the current state of things:


You can grab the en_stemmer.js and use it now. The nwSearchFnt.js file
also has changes related to adding search weighting to the results, so
you'd need to take changes from it more carefully.

We should have a release of the xsls out before too long though.


On 01/10/2012 07:33 PM, Bort, Paul wrote:
> Hi,
> I found the conversation about problems with the stemmer used with 
> English at 
> http://lists.oasis-open.org/archives/docbook-apps/201103/msg00040.html
> very informative in tracking down the problem I'm having with the
> stemmer, which is similar. In my case, the word that isn't being
> stemmed correctly is "relay". (It comes out as "relai".) This does
> break searches: searching for "relay" in a document that should
> have six matches returns an error "Your search returned no results
> for relai".
> The solution that I've implemented locally, and offer below for
> your consideration, is a list of words to be stemmed manually. I've
> tried to follow your coding style but I'm not a serious JavaScript
> hacker so I may have stepped on some toes inadvertently.
> Regards,
> Paul Bort
> Systems Engineer
> TMW Systems, Inc.
> pbort@tmwsystems.com
> ----------------------------------
> --- en_stemmer.js
> +++ en_stemmer.js
> @@ -54,6 +54,14 @@
>          meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$",  // [C]VC[V] is m=1
>          mgr1 = "^(" + C + ")?" + V + C + V + C,       // [C]VCVC... is m>1
>          s_v = "^(" + C + ")?" + v;                   // vowel in stem
> +
> +    var exceptionWords = {
> +            "relay":"relay",
> +            "relaying":"relay",
> +            "relays":"relay",
> +            "nucleus":"nucleus",
> +            "zeus":"zeus"
> +        };
>      return function (w) {
>          var     stem,
> @@ -67,6 +75,8 @@
>          if (w.length < 3) { return w; }
> +        if (w in exceptionWords) { return exceptionWords[w]; }
> +
>          firstch = w.substr(0,1);
>          if (firstch == "y") {
>              w = firstch.toUpperCase() + w.substr(1);


