OASIS Mailing List Archives

docbook-apps message


Subject: Re: [docbook-apps] WebHelp, English stemmer, problems with specific words


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Yes, Kasun, Peter, and I talked about it then, but we are only now
finding time to fix it.

By the way, Mike Smith has moved the snapshot builds over to the new
server, so you can again download a snapshot to test the latest
functionality [1] or check out what the latest output looks like [2].

David

[1] http://snapshots.docbook.org/
[2] http://snapshots.docbook.org/xsl/webhelp/docs/content/ch01.html

On 01/11/2012 09:31 AM, Bill Burns wrote:
> Thanks, David. I reported this same issue to Kasun about three
> months ago.
> 
> Bill Burns Verbum Communications, Inc. +1.208.336.6081 
> bburns@verbumcomm.com http://www.verbumcomm.com
> 
> 
> -----Original Message----- From: David Cramer
> [mailto:david@thingbag.net] Sent: Tuesday, January 10, 2012 9:54
> PM To: Bort, Paul Cc: docbook-apps@lists.oasis-open.org Subject:
> Re: [docbook-apps] WebHelp, English stemmer, problems with specific
> words
> 
> Hi Paul, Funny you should mention that. I've also been working on
> the client-side stemmer recently to address the same issue you
> mention and some others. The problem was that all words ending in
> vowel+y (relay, array, key, say, day) were being stemmed to -i (relai,
> arrai, kei, sai, dai) by the client-side stemmer but not by the
> build-time indexer. I'm mostly done, but I think it still overstems
> words like arsenal.
> 
> http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/stemmers/en_stemmer.js?r1=9067&r2=9178
>
>  Basically, nothing from the section "Exceptional forms in general"
> was implemented and step 1c was incorrectly implemented: 
> http://snowball.tartarus.org/algorithms/english/stemmer.html
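The step 1c rule referred to above can be sketched in isolation. This is a minimal illustration of the rule as the Snowball page words it (replace a final y or Y with i only when it follows a non-vowel that is not the first letter of the word), not the code in en_stemmer.js; the names VOWELS and step1c are made up for the example.

```javascript
// Vowels per the Snowball English stemmer definition: a e i o u y.
// An uppercase "Y" (a y marked as a consonant) is deliberately not listed.
var VOWELS = "aeiouy";

// Hypothetical helper: applies only step 1c to an already-lowercased word.
function step1c(word) {
    var last = word.length - 1;
    // The letter before the final y must exist and must not be the first letter.
    if (last < 2) { return word; }
    var tail = word.charAt(last);
    if (tail !== "y" && tail !== "Y") { return word; }
    // A vowel before the y (relay, key, say) leaves the word unchanged.
    if (VOWELS.indexOf(word.charAt(last - 1)) !== -1) { return word; }
    return word.substring(0, last) + "i";
}
```

With this rule, "relay" and "say" keep their y, "by" is too short to change, and "cry" becomes "cri".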
> 
> Regarding nucleus etc., I've also committed a fix from a colleague
> that makes the search always check the index for the full unstemmed
> word as well, to catch those Latinate terms that are handled
> correctly by the indexer but not by the client-side stemmer:
> 
> http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/nwSearchFnt.js?r1=9105&r2=9172
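The idea of also checking the index for the unstemmed word can be sketched as follows. This is not the code in nwSearchFnt.js; the index shape (a plain object mapping each term to an array of document ids) and the name lookupTerm are assumptions made for the illustration.

```javascript
// Hypothetical lookup: try the word as typed first, then its stem,
// and merge the hits without duplicates.
// `index` is assumed to map a term to an array of document ids.
// `stem` is whatever stemming function the search uses.
function lookupTerm(index, word, stem) {
    var hasOwn = Object.prototype.hasOwnProperty;
    var seen = {};
    var hits = [];
    [word, stem(word)].forEach(function (key) {
        if (!hasOwn.call(index, key)) { return; }
        index[key].forEach(function (doc) {
            if (!hasOwn.call(seen, doc)) {
                seen[doc] = true;
                hits.push(doc);
            }
        });
    });
    return hits;
}
```

If the indexer stored "nucleus" but the client-side stemmer produces something else, the first lookup still finds the entry.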
>
>  He's also working on always searching the index for things that
> look like filenames (e.g. build.xml, which it currently tokenizes
> to 'build' and 'xml').
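One way to keep filename-like terms whole is to test each whitespace-separated chunk before splitting it further. This is only a sketch of the idea, not the work in progress mentioned above; the pattern FILENAME_LIKE and the function tokenizeQuery are assumptions, and the pattern is intentionally narrow.

```javascript
// Assumed pattern: a name, one dot, and a short alphanumeric extension.
var FILENAME_LIKE = /^[A-Za-z0-9_-]+\.[A-Za-z0-9]{1,5}$/;

// Hypothetical tokenizer: filename-like chunks stay whole,
// everything else is split on non-alphanumeric characters.
function tokenizeQuery(query) {
    var tokens = [];
    query.split(/\s+/).forEach(function (chunk) {
        if (chunk === "") { return; }
        if (FILENAME_LIKE.test(chunk)) {
            tokens.push(chunk.toLowerCase());
            return;
        }
        chunk.split(/[^A-Za-z0-9]+/).forEach(function (part) {
            if (part !== "") { tokens.push(part.toLowerCase()); }
        });
    });
    return tokens;
}
```

Here "edit build.xml now" yields the tokens "edit", "build.xml", and "now" rather than splitting the filename into 'build' and 'xml'.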
> 
> Here's a demo of the current state of things:
> 
> http://www.thingbag.net/docbook/docs/content/ch05s01.html
> 
> You can grab the en_stemmer.js and use it now. The nwSearchFnt.js
> file also has changes related to adding search weighting to the
> results, so you'd need to take changes from it more carefully.
> 
> We should have a release of the xsls out before too long though.
> 
> Thanks, David
> 
> On 01/10/2012 07:33 PM, Bort, Paul wrote:
>> Hi,
> 
>> I found the conversation about problems with the stemmer used
>> with English at
>> http://lists.oasis-open.org/archives/docbook-apps/201103/msg00040.html
>> very informative in tracking down the problem I'm having with the
>> stemmer, which is similar. In my case, the word that isn't being
>> stemmed correctly is "relay". (It comes out as "relai".) This
>> does break searches: searching for "relay" in a document that
>> should have six matches returns the error "Your search returned no
>> results for relai".
> 
>> The solution that I've implemented locally, and offer below for
>> your consideration, is a list of words to be stemmed manually.
>> I've tried to follow your coding style but I'm not a serious
>> JavaScript hacker so I may have stepped on some toes
>> inadvertently.
> 
>> Regards, Paul Bort Systems Engineer TMW Systems, Inc. 
>> pbort@tmwsystems.com
> 
>> ----------------------------------
> 
>> --- en_stemmer.js
>> +++ en_stemmer.js
>> @@ -54,6 +54,14 @@
>>          meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$",  // [C]VC[V] is m=1
>>          mgr1 = "^(" + C + ")?" + V + C + V + C,       // [C]VCVC... is m>1
>>          s_v = "^(" + C + ")?" + v;                   // vowel in stem
>> +
>> +    var exceptionWords = {
>> +            "relay":"relay",
>> +            "relaying":"relay",
>> +            "relays":"relay",
>> +            "nucleus":"nucleus",
>> +            "zeus":"zeus"
>> +        };
>>  
>>      return function (w) {
>>          var     stem,
>> @@ -67,6 +75,8 @@
>>  
>>          if (w.length < 3) { return w; }
>>  
>> +        if (w in exceptionWords) { return exceptionWords[w]; }
>> +
>>          firstch = w.substr(0,1);
>>          if (firstch == "y") {
>>              w = firstch.toUpperCase() + w.substr(1);
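A standalone sketch of the exception-list approach in the patch above, with one cautious change: the `in` operator also matches inherited property names such as "constructor", so this version tests own properties only. The wrapper name stemWithExceptions and the `stem` parameter are made up for the example; the word list is the one from the patch.

```javascript
// Word list taken from the patch; the keys are words as typed, the values their stems.
var exceptionWords = {
    "relay": "relay",
    "relaying": "relay",
    "relays": "relay",
    "nucleus": "nucleus",
    "zeus": "zeus"
};

// Hypothetical wrapper: consult the exception list before the real stemmer.
// `stem` stands in for the function that en_stemmer.js returns.
function stemWithExceptions(w, stem) {
    // Own-property check, so a search for "constructor" falls through to the stemmer.
    if (Object.prototype.hasOwnProperty.call(exceptionWords, w)) {
        return exceptionWords[w];
    }
    return stem(w);
}
```

A manual list like this only covers the words someone has thought to add, so it complements a corrected step 1c rather than replacing it.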
> 
> 
> 
> ---------------------------------------------------------------------
>
> 
To unsubscribe, e-mail: docbook-apps-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail:
> docbook-apps-help@lists.oasis-open.org
> 
> 
> 
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJPDwILAAoJEMHeSXG7afUhDxUIAJBPziKE2UPiwqONYFUxaERp
LtNGMNX9dK2urJBumDEln2DgXVbrTxoBVZqHTDPl1i5+jFZesCohWSSQrDIqigLv
7A5bQCjEZ7f86EG9Rer9WXClB7yA1hOqRx2Kh0hyld8oXYA75kip5sVl++1tb6Z+
+YDqYuT4XlvBW8/v9DKnXxDV3NDoxX3M5nRKl7MGIVeanpAIVw4YJ9dZKLaqnNXU
0tIFOd/cCTYY0NUADvf2nyBC/f7NVOaDb/hABLNBsRae82KpAQknYFk782Q30YWs
o48Fvj1KmFpTvjxqCCSaVvdOsxD/gVH+X5YmXsuGZmySRMGHY2RtGyPOxBR0/6c=
=8jzZ
-----END PGP SIGNATURE-----

