cgmo-webcgm message

Subject: RE: Re[3]: [cgmo-webcgm] Text searching
From: "Cruikshank, David W" <david.w.cruikshank@boeing.com>
To: "Lofton Henderson" <lofton@rockynet.com>, "Benoit Bezaire" <benoit@itedo.com>, <cgmo-webcgm@lists.oasis-open.org>
Date: Sat, 6 May 2006 21:31:34 -0700
I'll try to address how "content" is used at Boeing (albeit our
implementation of CGM V4 predates WebCGM and even the ATA IGExchange
stuff, although that spec contained the concept of para and subpara with
content attributes ... as unspecified as it is in WebCGM).  In our IETM
(PMA) we do capture content on para.  Subpara is really implemented as a
substring with pointers to capture and reference in the middle of a
"para".  The para really contains the content.  The implementation of
searchable text in PMA is that when we "build" the document for PMA all
the "content" strings are externalized and run through the text indexer
along with all the other SGML text in the document, so when you do a
full text search on the document you get hits in both the text and
graphics.

This is not the same as a DOM call to a viewer to search text within a
single CGM file.  The major use case for searching in a single CGM file
is a complicated graphic like a wiring diagram.  You may know that a
wire is on a particular diagram, but it is very difficult to spot it by
looking.  The ability to search a bunch of graphics for a string is
outside the scope of a WebCGM viewer, but the content attribute on para
facilitates that in the IETM.  Content on subpara is probably of less
importance.
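
As a rough illustration of the build step described above (externalized
"content" strings indexed together with the document text), here is a
minimal Python sketch.  It is not how PMA is implemented; the function
names, the ID shapes and the simple word-level tokenizer are all
assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case word tokens; a stand-in for whatever a real indexer does."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(sgml_text_units, graphic_content):
    """Build one inverted index over document text and graphic 'content'.

    sgml_text_units: {unit_id: text}                 (hypothetical shape)
    graphic_content: {(cgm_file, aps_id): content}   (hypothetical shape)
    """
    index = defaultdict(set)
    for unit_id, text in sgml_text_units.items():
        for token in tokenize(text):
            index[token].add(("text", unit_id))
    for (cgm_file, aps_id), content in graphic_content.items():
        for token in tokenize(content):
            index[token].add(("graphic", cgm_file, aps_id))
    return index

def search(index, query):
    """Return every location that contains all of the query's tokens."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits
```

With one text unit and one para that both mention the same wire number,
a single query returns a hit in the text and a hit in the graphic, which
is the behavior described for the full text search.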

Dave


Technical Fellow - Graphics/Digital Data Interchange
Boeing Commercial Airplane
206.544.3560, fax 206.662.3734  <-- NEW NUMBERS
david.w.cruikshank@boeing.com

-----Original Message-----
From: Lofton Henderson [mailto:lofton@rockynet.com] 
Sent: Saturday, May 06, 2006 11:34 AM
To: Benoit Bezaire; cgmo-webcgm@lists.oasis-open.org
Subject: Re[3]: [cgmo-webcgm] Text searching

Let me extract a couple of points up front


1.)  Process/procedural:  About the current under-defined state of
para/subpara/content, you said, "And according to recent W3C standards
would not make it into the spec if not corrected."  An interesting
question about this:  if W3C approved this twice as Rec (1999 and 2001),
and if it has been in 1.0 for 7 years, and in the field and
implementations (if indeed anyone is using it) ... to what degree is it
appropriate for W3C to try to force this legacy 1.0 stuff to be revised
and cleaned up to current standards?  I.e., how hard will they push on
it?

Regardless of the answer, we should definitely do something -- Chris's
question must be answered.  But how far should we go?  It seems that
this has progressed somewhat like the drawing-model issue.  We have
gotten much deeper into it than answering Chris's question.  That's
good, we need to understand the situation fully and clearly.  But then
we need to decide what to do about it.

2.) This is probably premature until we have Chris's clarification about
the block/inline comment.  But pretty soon it would be nice to progress
to a concrete proposal for an answer to Chris and for changes to the
(1.0) specification.

I'll make a few more comments inline...

At 09:34 AM 5/5/2006 -0400, Benoit Bezaire wrote:
>See inline...
>
>Thursday, May 4, 2006, 8:26:17 PM, you wrote:
> > Hi Benoit,
>
> > Some technical replies for you (and Dieter)...
>
> > At 06:31 PM 5/3/2006 -0400, Benoit Bezaire wrote:
> >>I'm seeing the emails coming in about this topic. And I have to 
> >>state that I don't understand how people get to such an 
> >>understanding of the feature by reading what is in the
> >>specification. More inline...
> >>
> >>Wednesday, May 3, 2006, 6:04:20 PM, you wrote:
> >> > Benoit,
> >>
> >> > I think the example does not reflect the intentions of the
> >> > authors.
> >>
> >> > It should be like this
> >> >>   (approx syntax)
> >> >>   BEGAPS 'myPara'
> >> >>    APSATTR 'content' 'Hello World';
> >> >>    ...
> >> >>    BEGAPS 'mySubpara'
> >> >>     APSATTR 'content' 'World';
> >> >>     ...
> >> >>    ENDAPS;
> >> >>   ENDAPS;
> >>
> >> > Hence the content attribute of the para would contain all the
> >> > text of the para, whereas the attribute of the subpara would
> >> > contain the text of the subpara only.
> >>Hmmm. Isn't this an assumption? I could see it used this way when 
> >>using para/subpara on a raster; but that may not always be the case.
>
> > Perhaps it is an assumption, but it seems to me to be at least 
> > hinted by the text of 3.2.1.3, 3.2.1.4, and 3.2.2.8.  (Or ...
> > perhaps I'm too biased by what the 1.0 authors meant to say, but 
> > that they didn't express unambiguously.)
>Sorry, I disagree.
>
>There's no hint in there which says 'content' on a para MUST contain 
>all text strings found in all subpara 'content's. I see things like 
>'may be used to identify text', 'can potentially enable text search', 
>'identifying matches [...] is not specified in WebCGM.', 'may be used 
>to identify smaller fragments', 'This enables, for example [...]'.
>
>What para/subpara/content is supposed to do is far from clear. And 
>according to recent W3C standards would not make it into the spec if 
>not corrected.

I'll accept that the document doesn't explain it clearly.  The fact that
four "old-timers" have expressed the same view of it probably means that
we all talked about it in 1999 and over the years, and evolved something
of a common understanding, but that cannot be divined from the 1.0 text.
Fair enough.

(Don't be put off by my use of "old timers" -- it is only meant to
signify those who have been around from the beginning, and share a
common but poorly written understanding.)


> >>Regardless, doesn't Chris' question still stand?
> > That question is: is para a block and subpara an inline?  Yes, we're
> > going to have to answer the question somehow.  There are a couple of
> > problems here.
>
> > First problem, para and subpara (as you pointed out in your proposed
> > reply) are APS objects which group stuff which might not even be 
> > text.  So the question, as it stands, seems meaningless.  However,
> > para+content could be viewed as a surrogate for or abstraction of
> > the textual-related thingy inside its APS, and similarly for
> > subpara+content.  Then you could phrase the question about those
> > "surrogates".
>You are playing with words here!
>On the call we explained to Chris that para/subpara were not text 
>elements but APS. But his question still stands and has now become:
>is para+'content' a block and subpara+'content' an inline?
>
> > Second problem, I still don't know what block and inline mean (Chris
> > is consulting with an i18n guy before sending more info).
>I agree.
>
> > But from XHTML, a block element is like a 'p' and an inline element 
> > is like a 'span'.
>Yes.
>
> > Let's suppose HTML had a 'content' attribute (maybe you could do 
> > this example with 'title' attribute, which is typically used for a 
> > tooltip).
>
> > <p content="???">Hello <span content="world">world</span></p>
>
> > Would you expect ??? to reflect the entire content of the <p> 
> > element, or only that portion of the <p> element that is outside of 
> > the <span>? I would expect the first, i.e., ??? should be "Hello 
> > world".
>I would have no expectation. I don't know any specification that puts 
>restrictions on character data for an attribute. It's either a 
>predefined set of values or plain character data.
>
>I think using HTML 'alt' would be a better comparison... and you will 
>notice that it can only be specified on IMG, AREA, APPLET, and INPUT.
>It cannot be used on <p> and <span>, thus most (if not all) the WebCGM 
>problems related to this do not exist in HTML.
>
> > This is the way I think about para and subpara (and apparently some 
> > others do as well).  However, from the example that Chris posed, I 
> > may be entirely off base as to the meaning of "block" and "inline".
>I don't think we are way off on the block/inline thing. But I do think 
>that using an attribute (content) on APS which can be nested and 
>possibly already readable is a mistake.

I disagree.  There is a perfectly simple explanation:  'content' on
'para' should reflect the text content of the entire 'para' APS;
'content' on 'subpara' should reflect the text content of the entire
'subpara' APS.  Period.  (By "text content", I mean the RT elements, or
the text that is drawn by the filled polybeziers, rasters, etc).

I claim that is the common understanding.  I believe it originated as
an ad hoc solution to the needs of Boeing and/or ATA, that made its way
into WebCGM 1.0.  (On this thread I have asked Dave to confirm or refute
that, but he hasn't replied.)

To be clear about "nested" ... as you know, 'subpara' (and only
'subpara') can be nested in 'para', and nothing can be nested in
'subpara'.
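
To make that reading concrete, here is a small Python sketch of the two
rules as stated in this message: the nesting rule, and the informal
expectation that a subpara's 'content' is part of its para's 'content'.
The Aps class and both function names are invented for the
illustration, and the substring check reflects only the common
understanding described here, not anything the 1.0 text requires.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Aps:
    """Hypothetical in-memory stand-in for a 'para' or 'subpara' APS."""
    aps_type: str                      # 'para' or 'subpara'
    aps_id: str
    content: Optional[str] = None      # value of the 'content' APS attribute
    children: List["Aps"] = field(default_factory=list)

def nesting_problems(aps):
    """Only 'subpara' may nest in 'para'; nothing may nest in 'subpara'."""
    problems = []
    for child in aps.children:
        if aps.aps_type == "subpara":
            problems.append(f"{aps.aps_id}: nothing may be nested in a subpara")
        elif aps.aps_type == "para" and child.aps_type != "subpara":
            problems.append(f"{aps.aps_id}: only subpara may be nested in a para")
        problems.extend(nesting_problems(child))
    return problems

def content_problems(para):
    """Informal expectation: each subpara 'content' occurs in the para 'content'."""
    problems = []
    for child in para.children:
        if child.content is None:
            continue
        if para.content is None or child.content not in para.content:
            problems.append(
                f"{child.aps_id}: content {child.content!r} is not part of "
                f"{para.aps_id} content {para.content!r}"
            )
    return problems
```

Under this sketch the "Hello World" / "World" example passes both
checks, while Chris's "Hello" / "World" example passes the nesting check
but is flagged by content_problems.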


> > More...
>
> >> >> -----Original Message-----
> >> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
> >> >> Sent: Wednesday, May 03, 2006 11:53 PM
> >> >> To: cgmo-webcgm@lists.oasis-open.org
> >> >> Subject: [cgmo-webcgm] Text searching
> >> >>
> >> >> Hi,
> >> >>
> >> >>   On the call today, Chris asked me the following question...
> >> >>   Assume we have:
> >> >>
> >> >>   (approx syntax)
> >> >>   BEGAPS 'myPara'
> >> >>    APSATTR 'content' 'Hello';
> >> >>    ...
> >> >>    BEGAPS 'mySubpara'
> >> >>     APSATTR 'content' 'World';
> >> >>     ...
> >> >>    ENDAPS;
> >> >>   ENDAPS;
> >> >>
> >> >>   And he does a text search on the string "Hello World", will he
> >> >>   get a hit, yes or no?
> >> >>
> >> >>   I believe this to be an indirect way of asking/answering if
> >> >>   'subpara' is an inline or a block.
> >> >>
> >> >>   If we say, yes there's a hit, then we've defined 'subpara' as
> >> >>   inline, if we say, no there's no hit, it's a block.
>
> > I'd say "no hit".  But the problem here is that the 1.0 authors
> > designed this with a very specific ad hoc semantic in mind -- like
> > <p> and <span> -- and the question is ... well, baffling to me still.
>
> > That doesn't mean that we can't answer it, once we know what block
> > and inline mean, but we need to be a little careful of adding
> > semantic that wasn't there and not intended in 1.0.
>
> > Btw, we have other under-spec problems as well.  In this example
>
> > BEGAPS 'myPara'
> >    APSATTR 'content' 'Hello World';
> >    ...
> >    BEGAPS 'mySubpara'
> >       APSATTR 'content' 'World';
> >       ...
> >    ENDAPS;
> > ENDAPS;
>
> > Does a search on "World" return the para or the subpara?  (I would
> > say the subpara -- "closest to leaf" -- and I think this is what
> > users like Dave would expect.)
>I don't know what kind of searching you guys have in mind. But the
>search functionality that I use on a daily basis (Dev Studio, email
>search, PDF search, HTML/browser search)... would generate two hits;
>the user then picks the one which is most relevant to him.

One could treat this either like one of those searches, or like the 
generation of mouse hits from nested APSs.  I was espousing the 
latter.  But I don't care much, and I would actually like it best if we 
could avoid this depth of detailed specification.
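
For anyone who wants to see the two behaviors side by side, here is a
short Python sketch.  The dictionary layout, the function names and the
plain case-sensitive substring match are all assumptions for the
example; neither function is something WebCGM 1.0 specifies.

```python
def all_hits(aps, query, path=()):
    """Browser-style search: report every APS whose 'content' matches."""
    here = path + (aps["id"],)
    hits = [here] if query in aps.get("content", "") else []
    for child in aps.get("children", []):
        hits.extend(all_hits(child, query, here))
    return hits

def closest_to_leaf_hits(aps, query, path=()):
    """Mouse-hit-style search: report a match only at the deepest level."""
    here = path + (aps["id"],)
    nested = []
    for child in aps.get("children", []):
        nested.extend(closest_to_leaf_hits(child, query, here))
    if nested:
        # A nested APS already matched, so the enclosing APS stays silent.
        return nested
    return [here] if query in aps.get("content", "") else []

my_para = {
    "id": "myPara",
    "content": "Hello World",
    "children": [{"id": "mySubpara", "content": "World"}],
}

print(all_hits(my_para, "World"))
# [('myPara',), ('myPara', 'mySubpara')]
print(closest_to_leaf_hits(my_para, "World"))
# [('myPara', 'mySubpara')]
```

A search on "Hello World" gives the same single hit on myPara either
way, so the two approaches only differ when a nested 'content' repeats
part of its parent's 'content'.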


> >> >>   What's the answer?
> >> >>   The specification says the following (for para)... The WebCGM
> >> >>   prescription for priority of text search matching is: 'para'
> >> >>   with matching 'content' (1st priority match); 'para' without
> >> >>   'content' but with recognizable single-element RESTRICTED TEXT
> >> >>   match (2nd priority match); or, single-element RESTRICTED TEXT
> >> >>   match, outside of any 'para' (3rd priority match).
> >> >>   And for subpara: See 3.2.1.3, 'para'.
> >> >>
> >> >>   In other words, it's not specified :(
> >> > I think that Chris wants to build a logical relationship between
> >> > the attributes where there is none. You search ONE attribute at a
> >> > time, not a combination of nested attributes.
> >>I don't get to the same conclusion. The above wording doesn't even
> >>say how to perform a search within RESTRICTED TEXT and APPEND TEXT
> >>(without the 'content' attribute).
>
> > As I suggested yesterday, perhaps that search-priority specification
> > should be made into recommendations for search applications,
> > non-normative, along with some clarification/guidance for how we
> > expect 'content' to be used on para and subpara?  (Hello World on
> > para, and just World on subpara).
>
> > More about RT and AT below.
>
> >> >>
> >> >>   Chris made it relatively clear that if we want to have these
> >> >>   APS types in WebCGM 2, we need to improve how they are specified.
>
> > Reluctantly agree.  But I think (as I said above), we need to be
> > careful about adding (e.g., from some W3C CharMod model) some
> > concepts or semantics that are unrelated to the original purpose of
> > para/subpara/content.
>
> > Question for Dave: did this stuff derive from something in ATA?
>
> >> > I agree that this is all underspecified, however, the entire
> >> > search is still wide open, no syntax, nothing.
> >>I'm not sure what you mean by syntax? I would expect this to be a
> >>vendor feature (like the Search functionality in Web Browsers).
> >>
> >> > The only way to get access is limited by the DOM functions, which
> >> > don't allow you to access the RESTRICTED TEXT anyway if I remember
> >> > this correctly.
> >>
> >> > So right now, whoever wants to search, can retrieve the content
> >> > attribute of a para or subpara using the DOM, and he can then do
> >> > whatever he wants to perform a search therein.
> >>That sounds quite difficult to perform from a user's perspective.
> >>
> >> > I want to point out that I brought up this issue several times,
> >> > it is an important requirement of the Navy, but the group decided
> >> > to turn this down and to not define text search in WebCGM 2.0.
> >>Well, maybe it will have to be defined after all.
> >>
> >>Kind regards,
> >>  Benoit   mailto:benoit@itedo.com
> >>
> >> > Regards,
> >> > Dieter
> >> >>
> >> >>   So here are some thoughts...
> >> >>   I see RESTRICTED TEXT as a block.
> >> >>   I see APPEND TEXT as an inline.
>
> > That's a novel view!  Seriously, it is an intriguing idea.  But it
> > diverges from the conventional ISO CGM:1999 picture of RT and AT.
> > AT is a syntactic artifice, invented solely for the purpose of
> > changing text attributes within a single text primitive.
>Yes, exactly like <span> in HTML. And, as you said, <span> is an
>inline.
>
> > If you look at pages 108-111 of CGM:1999, you'll see that only a
> > handful of things -- basically just text attributes -- are allowed
> > between RT and AT.  So for example this is illegal:
>I know.
>
> > BEGAPS 'myPara'
> >     APSATTR 'content' 'Hello World';
> >     RestrText (x,y,width,height) "Hello ";
> >     BEGAPS 'mySubpara'
> >        APSATTR 'content' 'World';
> >        ApndText final "World";
> >     ENDAPS;
> > ENDAPS;
>
> > Which is not to say that we couldn't put some search semantics, or
> > impose a block/inline model, on a sequence of RT+AT+... +AT(final).
> > But I'd prefer that we don't go there.
>
> >> >>
> >> >>   So regardless of para/subpara/content... If 'Hello' is in a
> >> >>   RESTRICTED TEXT and 'World' in a child APPEND TEXT, a search
> >> >>   on "Hello World" would generate a hit. Does anyone agree with me?
>
> > Well, if there were a 'content' match, then 1.0 says that generates
> > the hit (1st priority).
>That wasn't the question.
>
> > But assuming no content match, RT"Hello " + AT"World" would generate
> > a hit for Hello World, IMO.  But I say that because, in my reading
> > of CGM:1999, RT+AT+...+AT is logically a single, single-line text
> > primitive.
>Let's wait for the definition of block/inline... but I think you've
>just explained your own definition (i.e., it's a single line of text).
>
> > Not because of a block-inline model (which I don't yet understand).
>
> >> >>   I would be tempted to use the same logic on 'content'. I.e., if
> >> >>   'content' is specified on a para, it's a block. If it's
> >> >>   specified on a child subpara, it's an inline. However, I don't
> >> >>   know if the current search functionality provided by vendors
> >> >>   adopts the same logic?!
>
> > I think it does not.  But the vendors and users are the ones to
> > consult on this -- some have spoken, like Forrest and Dave (whom I
> > associate with the origin of this stuff, for Boeing and/or ATA
> > application)
>I've asked in a previous email... is this stuff even used in the real
>world? A concrete example would be nice.

Good question.


> >> >>   I'm still waiting for more information from Chris about this,
> >> >>   but why not get the conversation started right away within the
> >> >>   group?
>
> > Okay.
>
> > Btw, how would you define block and inline?  You seem to be getting
> > a pretty good working sense of them.
>At the moment, I'm assuming that Chris is coming from an HTML and SVG
>background. Which means <p> and <span>; <text> and <tspan>.

It seems to be taking a long time to answer.  (I know he went back to
discuss it with Richard Ishida, so there must be some subtlety and
nuance about the concepts in the original question.)

-Lofton.