OASIS Mailing List Archives

docbook-apps message



Subject: Re: [docbook-apps] WebHelp, English stemmer, problems with specific words



Oops. Snapshots are available, but the DNS change hasn't happened yet.
I forgot that I'd added the following to my hosts file a while back:

50.56.245.89 snapshots.docbook.org snapshots

Feel free to go directly to http://50.56.245.89 in the meantime.

David

On 01/12/2012 09:53 AM, David Cramer wrote:
> Yes, Kasun, Peter, and I talked about it then, but are just now 
> finding time to fix it.
> 
> Btw., Mike Smith has the snapshot builds moved over to the new
> server, so you can again download a snapshot to test the latest
> functionality [1] or check out what the latest output looks like
> [2].
> 
> David
> 
> [1] http://snapshots.docbook.org/
> [2] http://snapshots.docbook.org/xsl/webhelp/docs/content/ch01.html
> 
> On 01/11/2012 09:31 AM, Bill Burns wrote:
>> Thanks, David. I reported this same issue to Kasun about three 
>> months ago.
> 
>> Bill Burns
>> Verbum Communications, Inc.
>> +1.208.336.6081
>> bburns@verbumcomm.com
>> http://www.verbumcomm.com
> 
> 
>> -----Original Message-----
>> From: David Cramer [mailto:david@thingbag.net]
>> Sent: Tuesday, January 10, 2012 9:54 PM
>> To: Bort, Paul
>> Cc: docbook-apps@lists.oasis-open.org
>> Subject: Re: [docbook-apps] WebHelp, English stemmer, problems with specific words
> 
>> Hi Paul,
>> Funny you should mention that. I've also been working on the
>> client-side stemmer recently to address the issue you mention
>> and some others. The problem was that the client-side stemmer,
>> but not the build-time indexer, stemmed every word ending in
>> vowel+y (relay, array, key, say, day) to -i (relai, arrai, kei,
>> sai, dai). I'm mostly done, but I think it still overstems
>> words like arsenal.
> 
>> http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/stemmers/en_stemmer.js?r1=9067&r2=9178
>
>> Basically, nothing from the section "Exceptional forms in
>> general" was implemented, and step 1c was implemented incorrectly:
>> http://snowball.tartarus.org/algorithms/english/stemmer.html
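
For reference, step 1c of the Snowball English algorithm replaces a final y or Y with i only when the preceding letter is a non-vowel that is not the first letter of the word. A minimal JavaScript sketch of that rule follows. The names step1c and VOWELS are illustrative assumptions, not identifiers taken from en_stemmer.js, and the sketch covers this one step only, not the full stemmer.

```javascript
// Vowels as defined by the Snowball English stemmer (y counts as a vowel;
// an uppercase Y marks a y that is treated as a consonant).
var VOWELS = "aeiouy";

// Hypothetical helper: apply step 1c to a single lowercase word.
function step1c(word) {
    var last = word.charAt(word.length - 1);
    if (last !== "y" && last !== "Y") {
        return word;
    }
    // The letter before the y must not be the first letter of the word,
    // so anything shorter than three characters is left alone (by -> by).
    if (word.length < 3) {
        return word;
    }
    var prev = word.charAt(word.length - 2);
    if (VOWELS.indexOf(prev) !== -1) {
        // vowel + y: relay, array, key, say, day are left unchanged
        return word;
    }
    // non-vowel + y: cry -> cri, happy -> happi
    return word.slice(0, -1) + "i";
}
```

With this rule, words ending in vowel+y keep their y, which is the behaviour the build-time indexer is described as having above.
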
> 
>> Regarding nucleus etc., I've also committed a fix from a
>> colleague that should always check the index for the full
>> unstemmed word as well, to catch those Latinate terms that the
>> indexer handles correctly but the client-side stemmer does not:
> 
>> http://docbook.svn.sourceforge.net/viewvc/docbook/trunk/xsl/webhelp/template/content/search/nwSearchFnt.js?r1=9105&r2=9172
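
The idea of also checking the index for the unstemmed word can be sketched as below. This is not the code from nwSearchFnt.js: the function lookupWord, the shape of the index (a plain object mapping an indexed term to an array of page ids), and the stem parameter are all assumptions made for illustration.

```javascript
// Hypothetical lookup: try the stemmed form first, then the word exactly
// as typed, and merge the hits without duplicates.
// index: object mapping an indexed term to an array of page ids (assumed shape)
// stem:  any function that maps a word to its stem
function lookupWord(index, word, stem) {
    var candidates = [stem(word), word];
    var seen = Object.create(null); // page ids already collected
    var hits = [];
    for (var i = 0; i < candidates.length; i++) {
        var term = candidates[i];
        if (!Object.prototype.hasOwnProperty.call(index, term)) {
            continue;
        }
        var pages = index[term];
        for (var j = 0; j < pages.length; j++) {
            if (!seen[pages[j]]) {
                seen[pages[j]] = true;
                hits.push(pages[j]);
            }
        }
    }
    return hits;
}
```

If the client-side stemmer turns "nucleus" into a form the indexer never produced, the second candidate still finds the entry stored under the full word.
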
>
>>  He's also working on always searching the index for things that 
>> look like filenames (e.g. build.xml, which it currently
>> tokenizes to 'build' and 'xml').
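
One way to picture the filename handling is a tokenizer that keeps a filename-like chunk whole in addition to splitting it. The sketch below is only an illustration of that idea. The function tokenizeQuery and the FILENAME_LIKE pattern (name, dot, extension of one to five characters) are assumptions, not the colleague's implementation.

```javascript
// Assumption: "looks like a filename" means word characters or hyphens,
// a dot, then a short alphanumeric extension, e.g. build.xml.
var FILENAME_LIKE = /^[\w-]+\.[A-Za-z0-9]{1,5}$/;

// Hypothetical tokenizer: returns the terms to search the index for.
function tokenizeQuery(query) {
    var terms = [];
    var chunks = query.split(/\s+/);
    for (var i = 0; i < chunks.length; i++) {
        var chunk = chunks[i].toLowerCase();
        if (chunk === "") {
            continue;
        }
        if (FILENAME_LIKE.test(chunk)) {
            terms.push(chunk); // keep build.xml as a single term
        }
        // Also emit the individual parts, as the current tokenizer does.
        var parts = chunk.split(/[^a-z0-9]+/);
        for (var j = 0; j < parts.length; j++) {
            if (parts[j] !== "") {
                terms.push(parts[j]);
            }
        }
    }
    return terms;
}
```

For the query "edit build.xml" this yields the terms edit, build.xml, build, and xml, so a search can match the whole filename as well as its parts.
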
> 
>> Here's a demo of the current state of things:
> 
>> http://www.thingbag.net/docbook/docs/content/ch05s01.html
> 
>> You can grab the en_stemmer.js and use it now. The
>> nwSearchFnt.js file also has changes related to adding search
>> weighting to the results, so you'd need to take changes from it
>> more carefully.
> 
>> We should have a release of the xsls out before too long though.
> 
>> Thanks, David
> 
>> On 01/10/2012 07:33 PM, Bort, Paul wrote:
>>> Hi,
> 
>>> I found the conversation about problems with the stemmer used
>>> with English at
>>> http://lists.oasis-open.org/archives/docbook-apps/201103/msg00040.html
>>> very informative in tracking down the problem I'm having with
>>> the stemmer, which is similar. In my case, the word that isn't
>>> being stemmed correctly is "relay". (It comes out as "relai".)
>>> This does break searches: searching for "relay" in a document
>>> that should have six matches returns an error "Your search
>>> returned no results for relai".
> 
>>> The solution that I've implemented locally, and offer below
>>> for your consideration, is a list of words to be stemmed
>>> manually. I've tried to follow your coding style but I'm not a
>>> serious JavaScript hacker so I may have stepped on some toes 
>>> inadvertently.
> 
>>> Regards,
>>> Paul Bort
>>> Systems Engineer
>>> TMW Systems, Inc.
>>> pbort@tmwsystems.com
> 
>>> ----------------------------------
> 
>>> --- en_stemmer.js
>>> +++ en_stemmer.js
>>> @@ -54,6 +54,14 @@
>>>          meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$",  // [C]VC[V] is m=1
>>>          mgr1 = "^(" + C + ")?" + V + C + V + C,       // [C]VCVC... is m>1
>>>          s_v = "^(" + C + ")?" + v;                   // vowel in stem
>>> +
>>> +    var exceptionWords = {
>>> +        "relay":"relay",
>>> +        "relaying":"relay",
>>> +        "relays":"relay",
>>> +        "nucleus":"nucleus",
>>> +        "zeus":"zeus"
>>> +        };
>>>
>>>      return function (w) {
>>>          var     stem,
>>> @@ -67,6 +75,8 @@
>>>
>>>          if (w.length < 3) { return w; }
>>>
>>> +        if (w in exceptionWords) { return exceptionWords[w]; }
>>> +
>>>          firstch = w.substr(0,1);
>>>          if (firstch == "y") {
>>>              w = firstch.toUpperCase() + w.substr(1);
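
The exception-list approach in the patch above can also be written as a standalone wrapper around any stemmer function. The sketch below is a hedged variation, not part of the patch: withExceptions is a hypothetical helper, and it uses hasOwnProperty instead of the in operator because in also matches inherited properties such as "constructor", so a search for that word would otherwise get a function back instead of a stem.

```javascript
// Words whose stems are fixed by hand (same entries as in the patch).
var exceptionWords = {
    "relay": "relay",
    "relaying": "relay",
    "relays": "relay",
    "nucleus": "nucleus",
    "zeus": "zeus"
};

// Hypothetical helper: wrap a stemmer so that listed words bypass it.
function withExceptions(stem, exceptions) {
    return function (w) {
        // Only own properties count, so "constructor" or "toString"
        // fall through to the real stemmer.
        if (Object.prototype.hasOwnProperty.call(exceptions, w)) {
            return exceptions[w];
        }
        return stem(w);
    };
}
```

Usage would look like var safeStem = withExceptions(someStemmer, exceptionWords); where someStemmer stands for whatever stemmer function is in use.
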
> 
> 
> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> docbook-apps-unsubscribe@lists.oasis-open.org
>> For additional commands, e-mail:
>> docbook-apps-help@lists.oasis-open.org
> 

