Subject: [xtm-wg] Sketch of a Possible Algorithm for Fragment Grabbing
Hi,

Here is a very rough sketch of a possible algorithm for grabbing fragments of TMs n' stuff over the web. It's bound to be missing stuff; I don't claim to be any sort of expert with this.

Assume that there is a client machine that makes an initial topic name request to a search server.

1. Client issues a request to the search server, e.g. the string "Flying Boats".

2. Two cases:
   a) The search server uses a topic map, in which case the string is appended to an XPath-style query, and the retrieval crawler starts with this on the server's TM.
   b) The search server uses a traditional compressed inverted index.
      b.1) Take each href found in the traditional index and crawl for TMs by reading MIME headers and looking for clues in XML docs.
      b.2) Meanwhile, construct the XPath-style query and hold it in memory.
      b.3) For each TM found, set the crawler off with the XPath query.

3. A "Flying Boats" topic is located by the crawler.

4. The <topic>...</topic> string is determined and copied, and the assigned characteristics of the "Flying Boats" topic are analysed.

5. Topic analysis stages:
   a) Get the list of scopes, then for each scope get the names in scope and the occurrences in scope.
   b) (Hoping these are at the top or bottom of the doc, but not scattered.) Get any addthms and store them.
   c) For each href in the occurrences within each scope, request the page/object at that location.
   d) Read the MIME type in the HTTP header.
   e) If text and XML, scan for clues to TM-ness.
   f) If it is a TM, store it in TMList.
   g) Store anything else in a bag under the relevant scope.
   h) Assuming a pre-specified crawl depth/distance:
      h.1) For each TM in TMList: for each topic in the TM: return to (3) and repeat the steps to here.

cheers
Peter
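
Steps 3 to 5 above can be sketched roughly in Python. This is only an illustration under several assumptions, not a definitive implementation: the element names (`topicmap`, `topic`, `name`, `occurrence`) and the `scope`/`href` attributes are a made-up simplified syntax, the `FAKE_WEB` dictionary is a hypothetical stand-in for real HTTP requests, the "clue to TM-ness" is reduced to checking the root element, and addthms handling (5b) is left out. The function names `fetch`, `parse_topic_map`, `analyse_topic` and `grab` are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (MIME type, body). Stands in for HTTP.
FAKE_WEB = {
    "http://example.org/tm1.xml": ("text/xml", """
<topicmap>
  <topic id="fb">
    <name scope="en">Flying Boats</name>
    <occurrence scope="en" href="http://example.org/catalina.html"/>
    <occurrence scope="history" href="http://example.org/tm2.xml"/>
  </topic>
</topicmap>"""),
    "http://example.org/tm2.xml": ("text/xml", """
<topicmap>
  <topic id="sh">
    <name scope="en">Short Sunderland</name>
    <occurrence scope="en" href="http://example.org/sunderland.jpg"/>
  </topic>
</topicmap>"""),
    "http://example.org/catalina.html": ("text/html", "<html>PBY Catalina</html>"),
    "http://example.org/sunderland.jpg": ("image/jpeg", ""),
}


def fetch(url):
    """Steps 5c-5d: return (mime_type, body), or (None, None) if not found."""
    return FAKE_WEB.get(url, (None, None))


def parse_topic_map(mime, body):
    """Step 5e: return the root element if the resource looks like a TM, else None."""
    if mime not in ("text/xml", "application/xml"):
        return None
    try:
        root = ET.fromstring(body.strip())
    except ET.ParseError:
        return None
    # Assumed "clue to TM-ness": the root element is called topicmap.
    return root if root.tag == "topicmap" else None


def analyse_topic(topic, depth, seen, results):
    """Step 5: group characteristics by scope, then follow occurrence hrefs."""
    record = defaultdict(lambda: {"names": [], "occurrences": [], "bag": []})
    for name in topic.findall("name"):
        record[name.get("scope", "unscoped")]["names"].append(name.text)

    tm_list = []  # step 5f: topic maps discovered via this topic
    for occ in topic.findall("occurrence"):
        scope, href = occ.get("scope", "unscoped"), occ.get("href")
        record[scope]["occurrences"].append(href)
        if href in seen:  # avoid fetching the same resource twice
            continue
        seen.add(href)
        mime, body = fetch(href)
        tm = parse_topic_map(mime, body)
        if tm is not None:
            tm_list.append(tm)
        elif mime is not None:
            record[scope]["bag"].append((href, mime))  # step 5g

    results[topic.get("id")] = dict(record)

    # Step 5h: recurse into every topic of every TM found, up to the crawl depth.
    if depth > 0:
        for tm in tm_list:
            for sub_topic in tm.findall("topic"):
                analyse_topic(sub_topic, depth - 1, seen, results)


def grab(start_url, topic_name, depth=1):
    """Steps 3-5: locate the named topic in the starting TM and analyse it."""
    results = {}
    root = parse_topic_map(*fetch(start_url))
    if root is None:
        return results
    seen = {start_url}
    for topic in root.findall("topic"):
        if any(n.text == topic_name for n in topic.findall("name")):
            analyse_topic(topic, depth, seen, results)
    return results
```

Calling `grab("http://example.org/tm1.xml", "Flying Boats", depth=1)` on this toy data analyses the "fb" topic, stores the HTML page in the bag for the "en" scope, recognises `tm2.xml` as a second topic map, and then analyses its "sh" topic as well. With `depth=0` only the starting topic is analysed. The `seen` set is an addition not mentioned in the steps above; without something like it, topic maps that point at each other would be fetched repeatedly.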