docbook-apps message



Subject: RE: [docbook-apps] WebHelp, English stemmer, problems with specific words


Thanks, David. I reported this same issue to Kasun about three months ago.

Bill Burns
Verbum Communications, Inc.
+1.208.336.6081
bburns@verbumcomm.com
http://www.verbumcomm.com


-----Original Message-----
From: David Cramer [mailto:david@thingbag.net] 
Sent: Tuesday, January 10, 2012 9:54 PM
To: Bort, Paul
Cc: docbook-apps@lists.oasis-open.org
Subject: Re: [docbook-apps] WebHelp, English stemmer, problems with specific words

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Paul,
Funny you should mention that. I've also been working on the client-side stemmer recently to address the same issue you mention and some others. The problem was that all words ending in vowel+y (relay, array, key, say, day) were being stemmed to -i (relai, arrai, kei, sai, dai) by the client-side stemmer but not by the build-time indexer. I'm mostly done, but I think it still overstems words like arsenal.

http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/stemmers/en_stemmer.js?r1=9067&r2=9178

Basically, nothing from the section "Exceptional forms in general" was implemented and step 1c was incorrectly implemented:
http://snowball.tartarus.org/algorithms/english/stemmer.html
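
For reference, step 1c of the Snowball English algorithm replaces a final y (or Y) with i only when the preceding letter is a non-vowel that is not the first letter of the word. The following is a minimal sketch of that one rule in isolation; the function name step1c is made up for illustration and this is not the code in en_stemmer.js:

```javascript
// Sketch of Snowball English step 1c only (hypothetical helper name).
// Replace a trailing y/Y with i if the letter before it is a non-vowel
// and that letter is not the first letter of the word.
function step1c(word) {
  var VOWELS = "aeiouy";
  var n = word.length;
  // With fewer than 3 letters, the letter before y would be the first letter.
  if (n < 3) { return word; }
  var last = word.charAt(n - 1);
  var prev = word.charAt(n - 2);
  if ((last === "y" || last === "Y") && VOWELS.indexOf(prev) === -1) {
    return word.slice(0, n - 1) + "i";
  }
  return word;
}
```

With this rule, "relay", "say" and "by" are left alone, while "cry" becomes "cri".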

Regarding nucleus etc., I've also committed a fix from a colleague that should always check the index for the full unstemmed word, to catch those Latinate terms that are handled correctly by the indexer but not by the client-side stemmer:

http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/nwSearchFnt.js?r1=9105&r2=9172
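
The idea of checking both the unstemmed word and its stem could look roughly like the sketch below. This is not the nwSearchFnt.js code: the function name lookup, the shape of the index (a plain object mapping terms to arrays of page names) and the stem callback are all assumptions made for illustration.

```javascript
// Hypothetical sketch: look up the raw word first, then its stem,
// and merge the hits without duplicates.
// `index` is assumed to map a term to an array of page names.
function lookup(index, word, stem) {
  var candidates = [word, stem(word)];
  var seen = Object.create(null); // no inherited keys to collide with
  var hits = [];
  for (var i = 0; i < candidates.length; i++) {
    var term = candidates[i];
    if (!Object.prototype.hasOwnProperty.call(index, term)) { continue; }
    var docs = index[term];
    for (var j = 0; j < docs.length; j++) {
      if (!seen[docs[j]]) {
        seen[docs[j]] = true;
        hits.push(docs[j]);
      }
    }
  }
  return hits;
}
```

If the indexer stored "nucleus" unchanged but the client-side stemmer produces something else, the first candidate still finds the page.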

He's also working on always searching the index for things that look like filenames (e.g. build.xml, which it currently tokenizes to 'build' and 'xml').
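
How that work detects filenames isn't shown in this thread. As a purely hypothetical sketch, a token such as build.xml could be kept whole alongside its parts with a simple pattern test; the regular expression and the function name queryTerms below are assumptions, not the committed code.

```javascript
// Hypothetical heuristic: word characters or hyphens, a dot,
// then a short alphanumeric extension.
var FILENAME_RE = /^[\w-]+\.[A-Za-z0-9]{1,5}$/;

// Return the terms to search for: the usual tokenized parts, plus the
// whole token when it looks like a filename.
function queryTerms(token) {
  var parts = token.split(/[^A-Za-z0-9_]+/).filter(function (p) {
    return p.length > 0;
  });
  if (FILENAME_RE.test(token) && parts.indexOf(token) === -1) {
    return [token].concat(parts);
  }
  return parts;
}
```

Here queryTerms("build.xml") yields "build.xml", "build" and "xml", while an ordinary word such as "relay" yields only itself.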

Here's a demo of the current state of things:

http://www.thingbag.net/docbook/docs/content/ch05s01.html

You can grab the en_stemmer.js and use it now. The nwSearchFnt.js file also has changes related to adding search weighting to the results, so you'd need to take changes from it more carefully.

We should have a release of the xsls out before too long though.

Thanks,
David

On 01/10/2012 07:33 PM, Bort, Paul wrote:
> Hi,
> 
> I found the conversation about problems with the stemmer used with 
> English at 
> http://lists.oasis-open.org/archives/docbook-apps/201103/msg00040.html
> very informative in tracking down the problem I'm having with the
> stemmer, which is similar. In my case, the word that isn't being 
> stemmed correctly is "relay". (It comes out as "relai".) This does 
> break searches: searching for "relay" in a document that should have 
> six matches returns an error "Your search returned no results for 
> relai".
> 
> The solution that I've implemented locally, and offer below for your 
> consideration, is a list of words to be stemmed manually. I've tried 
> to follow your coding style but I'm not a serious JavaScript hacker so 
> I may have stepped on some toes inadvertently.
> 
> Regards,
> Paul Bort
> Systems Engineer
> TMW Systems, Inc.
> pbort@tmwsystems.com
> 
> ----------------------------------
> 
> --- en_stemmer.js
> +++ en_stemmer.js
> @@ -54,6 +54,14 @@
>          meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$",  // [C]VC[V] is m=1
>          mgr1 = "^(" + C + ")?" + V + C + V + C,       // [C]VCVC... is m>1
>          s_v = "^(" + C + ")?" + v;                   // vowel in stem
> +
> +    var exceptionWords = {
> +            "relay":"relay",
> +            "relaying":"relay",
> +            "relays":"relay",
> +            "nucleus":"nucleus",
> +            "zeus":"zeus"
> +        };
> 
>      return function (w) {
>          var     stem,
> @@ -67,6 +75,8 @@
> 
>          if (w.length < 3) { return w; }
> 
> +        if (w in exceptionWords) { return exceptionWords[w]; }
> +
>          firstch = w.substr(0,1);
>          if (firstch == "y") {
>              w = firstch.toUpperCase() + w.substr(1);
> 
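
One caveat with the "w in exceptionWords" test in the patch above: the in operator also matches properties inherited from Object.prototype, so a search for a word like "constructor" would hit the exception branch and return a function rather than a string. A minimal sketch of a safer lookup follows; the helper name stemException is made up for illustration, and the word list is the one from the patch.

```javascript
// Exception list taken from the patch above.
var exceptionWords = {
  "relay": "relay",
  "relaying": "relay",
  "relays": "relay",
  "nucleus": "nucleus",
  "zeus": "zeus"
};

// Hypothetical helper: return the manual stem for w, or null if w is
// not in the list. hasOwnProperty ignores inherited keys such as
// "constructor" or "toString".
function stemException(w) {
  if (Object.prototype.hasOwnProperty.call(exceptionWords, w)) {
    return exceptionWords[w];
  }
  return null;
}
```

The stemmer would call stemException(w) right after the length check and return its result when it is not null.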

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJPDRXxAAoJEMHeSXG7afUh9iQH/2wcuq+ovkT5gjhhJq58ZFXm
hy9jcNruCQMRO9Nw8iozUKZjvqcaG4rHfZpmO6pyT574FQ5n4IBJRam24AcJZVrj
gY2LMeckMwkQzIuuH9xvKAXUCp13bxdL66R1ZrsPowQ/vGpxMqUZmPg8bAsJu9DL
4vxFR5vt7S2T5xLAh2kWMHz+uKC33QNl7kuh9bpVZDi/EmZIG91gvNGsFGDGqMVY
bniHVYDqYxJwYYzTHcD+lmylIwfyeqjFzrO+FDzH5/TJ/lCxyhd365je+FdMia1g
0QK0H5j90sSHBtkIPro5HVyv+sw2RTs7eB9GCROLUJKDX310efNcOLTPk3uWmuc=
=Zvxp
-----END PGP SIGNATURE-----
