Subject: Re[3]: [cgmo-webcgm] Text searching
From: Benoit Bezaire <benoit@itedo.com>
To: cgmo-webcgm@lists.oasis-open.org
Date: Fri, 5 May 2006 09:34:25 -0400
See inline...

Thursday, May 4, 2006, 8:26:17 PM, you wrote:
> Hi Benoit,

> Some technical replies for you (and Dieter)...

> At 06:31 PM 5/3/2006 -0400, Benoit Bezaire wrote:
>>I'm seeing the emails coming in about this topic. And I have to state
>>that I don't understand how people get to such an understanding of the
>>feature by reading what is in the specification. More inline...
>>
>>Wednesday, May 3, 2006, 6:04:20 PM, you wrote:
>> > Benoit,
>>
>> > I think the example does not reflect the intentions of the authors.
>>
>> > It should be like this
>> >>   (approx syntax)
>> >>   BEGAPS 'myPara'
>> >>    APSATTR 'content' 'Hello World';
>> >>    ...
>> >>    BEGAPS 'mySubpara'
>> >>     APSATTR 'content' 'World';
>> >>     ...
>> >>    ENDAPS;
>> >>   ENDAPS;
>>
>> > Hence the content attribute of the para would contain all the text of the
>> > para, whereas the attribute of the subpara would contain the text of
>> > the subpara only.
>>Hmmm. Isn't this an assumption? I could see it used this way when using
>>para/subpara on a raster; but that may not always be the case.

> Perhaps it is an assumption, but it seems to me to be at least
> hinted by the text of 3.2.1.3, 3.2.1.4, and 3.2.2.8.  (Or ...
> perhaps I'm too biased by what the 1.0 authors meant to say, but
> that they didn't express unambiguously.)
Sorry, I disagree.

There's no hint in there which says 'content' on a para MUST contain
all text strings found in all subpara 'content's. I see things like
'may be used to identify text', 'can potentially enable text search',
'identifying matches [...] is not specified in WebCGM.', 'may be used
to identify smaller fragments', 'This enables, for example [...]'.

What para/subpara/content is supposed to do is far from clear. And
by recent W3C standards, it would not make it into the spec if not
corrected.

>>Regardless, doesn't Chris' question still stand?
> That question is: is para a block and subpara an inline?  Yes, we're
> going to have to answer the question somehow.  There are a couple of
> problems here. 

> First problem, para and subpara (as you pointed out in your proposed
> reply) are APS objects which group stuff which might not even be
> text.  So the question, as it stands, seems meaningless.  However,
> para+content could be viewed as a surrogate for or abstraction of
> the textual-related thingy inside its APS, and similarly for
> subpara+content.  Then you could phrase the question about those
> "surrogates".
You are playing with words here!
On the call we explained to Chris that para/subpara were not text
elements but APS. But his question still stands and has now become:
is para+'content' a block and subpara+'content' an inline.
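
To make the two readings concrete, here is a minimal sketch (not
anything defined by WebCGM). The Aps class, the function names, and the
choice to join nested 'content' values with a single space are all
assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Aps:
    """Hypothetical stand-in for an APS of type 'para' or 'subpara'."""
    name: str
    content: str = ""
    children: List["Aps"] = field(default_factory=list)


def inline_text(aps: Aps) -> str:
    """Inline reading: a child's 'content' continues the parent's text.

    Assumption: values are joined with one space; the spec says nothing
    about how (or whether) nested values would be combined.
    """
    parts = [aps.content] + [inline_text(c) for c in aps.children]
    return " ".join(p for p in parts if p)


def inline_hit(aps: Aps, query: str) -> bool:
    """Search the combined text of the APS and its descendants."""
    return query in inline_text(aps)


def block_hit(aps: Aps, query: str) -> bool:
    """Block reading: each 'content' value is searched on its own."""
    return query in aps.content or any(block_hit(c, query) for c in aps.children)


para = Aps("myPara", "Hello", [Aps("mySubpara", "World")])
print(inline_hit(para, "Hello World"))  # True
print(block_hit(para, "Hello World"))   # False
```

Under the inline reading Chris gets his hit; under the block reading he
does not. That difference is the whole question.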

> Second problem, I still don't know what block and inline mean (Chris
> is consulting with an i18n guy before sending more info).
I agree.

> But from XHTML, a block element is like a 'p' and an inline element
> is like a 'span'.
Yes.

> Let's suppose HTML had a 'content' attribute (maybe you could do
> this example with 'title' attribute, which is typically used for a
> tooltip). 

> <p content="???">Hello <span content="world">world</span></p>

> Would you expect ??? to reflect the entire content of the <p>
> element, or only that portion of the <p> element that is outside of
> the <span>? I would expect the first, i.e., ??? should be "Hello
> world". 
I would have no expectation. I don't know any specification that puts
restrictions on character data for an attribute. It's either a
predefined set of values or plain character data.

I think using HTML 'alt' would be a better comparison... and you will
notice that it can only be specified on IMG, AREA, APPLET, and INPUT.
It cannot be used on <p> and <span>, thus most (if not all) the WebCGM
problems related to this do not exist in HTML.

> This is the way I think about para and subpara (and apparently some
> others do as well).  However, from the example that Chris posed, I
> may be entirely off base as to the meaning of "block" and "inline".
I don't think we are way off on the block/inline thing. But I do think
that putting an attribute (content) on APSs, which can be nested and
whose text may already be readable, is a mistake.

> More...

>> >> -----Original Message-----
>> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
>> >> Sent: Wednesday, May 03, 2006 11:53 PM
>> >> To: cgmo-webcgm@lists.oasis-open.org
>> >> Subject: [cgmo-webcgm] Text searching
>> >>
>> >> Hi,
>> >>
>> >>   On the call today, Chris asked me the following question... Assume
>> >>   we have:
>> >>
>> >>   (approx syntax)
>> >>   BEGAPS 'myPara'
>> >>    APSATTR 'content' 'Hello';
>> >>    ...
>> >>    BEGAPS 'mySubpara'
>> >>     APSATTR 'content' 'World';
>> >>     ...
>> >>    ENDAPS;
>> >>   ENDAPS;
>> >>
>> >>   And he does a text search on the string "Hello World", will he get a
>> >>   hit, yes or no?
>> >>
>> >>   I believe this to be an indirect way of asking/answering if
>> >>   'subpara' is an inline or a block.
>> >>
>> >>   If we say, yes there's a hit, then we've defined 'subpara' as
>> >>   inline, if we say, no there's no hit, it's a block.

> I'd say "no hit".  But the problem here is that the 1.0 authors designed
> this with a very specific ad hoc semantic in mind -- like <p> and <span> --
> and the question is ... well, baffling to me still.

> That doesn't mean that we can't answer it, once we know what block and
> inline mean, but we need to be a little careful of adding semantic that
> wasn't there and not intended in 1.0.

> Btw, we have other under-spec problems as well.  In this example

> BEGAPS 'myPara'
>    APSATTR 'content' 'Hello World';
>    ...
>    BEGAPS 'mySubpara'
>       APSATTR 'content' 'World';
>       ...
>    ENDAPS;
> ENDAPS;

> Does a search on "World" return the para or the subpara?  (I would say the
> subpara -- "closest to leaf" -- and I think this is what users like Dave
> would expect.)
I don't know what kind of searching you guys have in mind. But the
search functionality that I use on a daily basis (Dev Studio, email
search, PDF search, HTML/browser search)... would generate two hits;
the user then picks the one which is most relevant to him.
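
The two behaviours being contrasted here can be sketched as follows.
This is an illustration only: the Aps class and both function names
are hypothetical, and neither policy is prescribed by the
specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Aps:
    """Hypothetical stand-in for an APS carrying a 'content' attribute."""
    name: str
    content: str = ""
    children: List["Aps"] = field(default_factory=list)


def all_hits(aps: Aps, query: str) -> List[str]:
    """Report every APS whose own 'content' contains the query."""
    hits = [aps.name] if query in aps.content else []
    for child in aps.children:
        hits.extend(all_hits(child, query))
    return hits


def leaf_hits(aps: Aps, query: str) -> List[str]:
    """'Closest to leaf': drop a match when a descendant also matches."""
    deeper: List[str] = []
    for child in aps.children:
        deeper.extend(leaf_hits(child, query))
    if deeper:
        return deeper
    return [aps.name] if query in aps.content else []


para = Aps("myPara", "Hello World", [Aps("mySubpara", "World")])
print(all_hits(para, "World"))   # ['myPara', 'mySubpara']
print(leaf_hits(para, "World"))  # ['mySubpara']
print(leaf_hits(para, "Hello"))  # ['myPara']
```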

>> >>   What's the answer?
>> >>   The specification says the following (for para)... The WebCGM
>> >>   prescription for priority of text search matching is: 'para' with
>> >>   matching 'content' (1st priority match); 'para' without 'content'
>> >>   but with recognizable single-element RESTRICTED TEXT match (2nd
>> >>   priority match); or, single-element RESTRICTED TEXT match, outside
>> >>   of any 'para' (3rd priority match).
>> >>   And for subpara: See 3.2.1.3, 'para'.
>> >>
>> >>   In other words, it's not specified :(
>> > I think that Chris wants to build a logical relationship between the
>> > attributes where there is none. You search ONE attribute at a time,
>> > not a combination of nested attributes.
>>I don't get to the same conclusion. The above wording doesn't even say
>>how to perform a search within RESTRICTED TEXT and APPEND TEXT
>>(without the 'content' attribute).
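
For reference, the priority wording quoted above could be read roughly
as in the sketch below. The function name and parameters are
hypothetical, and the treatment of a para whose 'content' does not
match (but whose text does) is an assumption, since the quoted wording
does not cover that case.

```python
from typing import Optional


def match_priority(query: str, in_para: bool,
                   content: Optional[str],
                   rt_text: Optional[str]) -> Optional[int]:
    """Return 1, 2 or 3 per the quoted ordering, or None for no match.

    content: the para's 'content' attribute, or None if absent.
    rt_text: a recognizable single-element RESTRICTED TEXT string, or None.
    """
    if in_para and content is not None:
        # Assumption: a para that carries 'content' is judged on it alone.
        return 1 if query in content else None
    if rt_text is not None and query in rt_text:
        # 'para' without 'content' is 2nd priority; outside any para is 3rd.
        return 2 if in_para else 3
    return None


print(match_priority("World", True, "Hello World", None))  # 1
print(match_priority("World", True, None, "World"))        # 2
print(match_priority("World", False, None, "World"))       # 3
```

Note that nothing in this reading says how nested 'content' values, or
RESTRICTED TEXT followed by APPEND TEXT, should be handled.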

> As I suggested yesterday, perhaps that search-priority specification
> should be made into recommendations for search applications,
> non-normative, along with some clarification/guidance for how we
> expect 'content' to be used on para and subpara?  (Hello World on
> para, and just World on subpara).

> More about RT and AT below.

>> >>
>> >>   Chris made it relatively clear that if we want to have these APS
>> >>   types in WebCGM 2, we need to improve how they are specified.

> Reluctantly agree.  But I think (as I said above), we need to be
> careful about adding (e.g., from some W3C CharMod model) some
> concepts or semantics that are unrelated to the original purpose of
> para/subpara/content.

> Question for Dave: did this stuff derive from something in ATA?

>> > I agree that this is all underspecified, however, the entire search
>> > is still wide open, no syntax, nothing.
>>I'm not sure what you mean by syntax? I would expect this to be a
>>vendor feature (like the Search functionality in Web Browsers).
>>
>> > The only way to get access is limited by the DOM functions, which don't
>> > allow you to access the RESTRICTED TEXT anyway if I remember this
>> > correctly.
>>
>> > So right now, whoever wants to search, can retrieve the content
>> > attribute of a para or subpara using the DOM, and he can then do
>> > whatever he wants to perform a search therein.
>>That sounds quite difficult to perform from a user's perspective.
>>
>> > I want to point out that I brought up this issue several times, it
>> > is an important requirement of the Navy, but the group decided to
>> > turn this down and to not define text search in WebCGM 2.0.
>>Well, maybe it will have to be defined after all.
>>
>>Kind regards,
>>  Benoit   mailto:benoit@itedo.com
>>
>> > Regards,
>> > Dieter
>> >>
>> >>   So here are some thoughts...
>> >>   I see RESTRICTED TEXT as a block.
>> >>   I see APPEND TEXT as an inline.

> That's a novel view!  Seriously, it is an intriguing idea.  But it
> diverges from the conventional ISO CGM:1999 picture of RT and AT.
> AT is a syntactic artifice, invented solely for the purpose of
> changing text attributes within a single text primitive.
Yes, exactly like <span> in HTML. And, as you said, <span> is an
inline.

> If you look at pages 108-111 of CGM:1999, you'll see that only a
> handful of things -- basically just text attributes -- are allowed
> between RT and AT.  So for example this is illegal: 
I know.

> BEGAPS 'myPara'
>     APSATTR 'content' 'Hello World';
>     RestrText (x,y,width,height) "Hello ";
>     BEGAPS 'mySubpara'
>        APSATTR 'content' 'World';
>        ApndText final "World";
>     ENDAPS;
> ENDAPS;

> Which is not to say that we couldn't put some search semantics, or
> impose a block/inline model, on a sequence of RT+AT+... +AT(final).
> But I'd prefer that we don't go there.

>> >>
>> >>   So regardless of para/subpara/content... If 'Hello' is in a
>> >>   RESTRICTED TEXT and 'World' in a child APPEND TEXT, a search on
>> >>   "Hello World" would generate a hit. Does anyone agree with me?

> Well, if there were a 'content' match, then 1.0 says that generates
> the hit (1st priority).
That wasn't the question.

> But assuming no content match, RT"Hello " + AT"World" would generate
> a hit for Hello World, IMO.  But I say that because, in my reading
> of CGM:1999, RT+AT+...+AT is logically a single, single-line text
> primitive.
Let's wait for the definition of block/inline... but I think you've
just explained your own definition (i.e., it's a single line of text).
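
The "single line of text" reading can be sketched like this. It is only
an illustration: the function name and the list-of-strings
representation of an RT+AT sequence are assumptions, not CGM or WebCGM
constructs.

```python
from typing import List


def find_in_primitive(segments: List[str], query: str) -> List[int]:
    """Search an RT+AT+...+AT sequence as one logical line of text.

    segments: the RESTRICTED TEXT string followed by its APPEND TEXT
    strings. Returns the indexes of the segments that the first match
    touches, or [] when there is no match.
    """
    line = "".join(segments)
    start = line.find(query)
    if not query or start < 0:
        return []
    end = start + len(query)
    touched, pos = [], 0
    for i, seg in enumerate(segments):
        seg_end = pos + len(seg)
        # A segment is touched when its span overlaps the match span.
        if pos < end and seg_end > start:
            touched.append(i)
        pos = seg_end
    return touched


print(find_in_primitive(["Hello ", "World"], "Hello World"))  # [0, 1]
print(find_in_primitive(["Hello ", "World"], "World"))        # [1]
print(find_in_primitive(["Hello ", "World"], "Goodbye"))      # []
```

A match that spans both segments is exactly the case an inline model
would have to allow.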

> Not because of a block-inline model (which I don't yet understand).

>> >>   I would be tempted to use the same logic on 'content'. I.e., if
>> >>   'content' is specified on a para, it's a block. If it's specified on
>> >>   a child subpara, it's an inline. However, I don't know if the
>> >>   current search functionality provided by vendors adopts the same
>> >>   logic?!

> I think it does not.  But the vendors and users are the ones to
> consult on this -- some have spoken, like Forrest and Dave (whom I
> associate with the origin of this stuff, for Boeing and/or ATA
> application)
I've asked in a previous email... is this stuff even used in the real
world? A concrete example would be nice.

>> >>   I'm still waiting for more information from Chris about this, but
>> >>   why not get the conversation started right away within the group?

> Okay.

> Btw, how would you define block and inline?  You seem to be getting a 
> pretty good working sense of them.
At the moment, I'm assuming that Chris is coming from an HTML and SVG
background. Which means <p> and <span>; <text> and <tspan>.

> Best,
> -Lofton.

-- 
Regards,
 Benoit   mailto:benoit@itedo.com

Follow-Ups:
- Re[3]: [cgmo-webcgm] Text searching
  - From: Lofton Henderson <lofton@rockynet.com>
References:
- RE: [cgmo-webcgm] Text searching
  - From: Dieter Weidenbrück <dieter@itedo.com>
- Text searching
  - From: Benoit Bezaire <benoit@itedo.com>
- Re[2]: [cgmo-webcgm] Text searching
  - From: Lofton Henderson <lofton@rockynet.com>