Re: [cti] Database Subcommittee / conceptual/logical model subcommittee

On Jun 24, 2015, at 4:45 PM, Cory Casanave <cory-c@MODELDRIVEN.COM> wrote:

Team,
I purposely did not suggest a particular language for expressing the conceptual/logical model, as that is a worthy topic of discussion for the group. In the related OMG activity we are using a profile of UML that adds more semantic capabilities but has the tooling, established base and graphic support of UML. This profile is currently going through the standards process, and OWL can then be generated from it. You can say 90% of what you can say in OWL with less complexity. We have also used OWL for other projects; it has some valuable features but is also far from perfect. This is a good topic for discussion. But we get ahead of ourselves: the purpose and scope should drive such choices.

What I have found in every similar activity is that once people start looking at terms and concepts from a model perspective instead of as XML (or SQL, etc.) data structures, they discover issues, complexities, simplifications and opportunities that are not very apparent when looking at schemas. This activity represents a different viewpoint that, when combined with the more “bottom up” implementation and representation concerns, makes the specification that much better. For this reason my suggestion is that such a viewpoint should drive the vocabulary and semantics, working in concert with, but not being the same as, the team that focuses on the best representation and implementation in XML or a DBMS.

In the best scenario the former would then generate the latter based on transformation rules that map the terms, structure and semantics onto the technology framework of choice. The existing schemas provide a valuable resource to start with, whereas the models provide a better way to evolve and certainly a better way to support multiple technologies. This can then be considered a candidate strategy for phase 2; it is a different SDLC than starting with XML schema. Coming to consensus on our approach and SDLC should, perhaps, precede forming subcommittees to start the work.
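
As a rough illustration of the "model generates schema" idea above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Entity` and `Attribute` classes, the two type-mapping tables, and the simplified `Indicator` entity are hypothetical and are not taken from the STIX schema or from any OMG tooling.

```python
from dataclasses import dataclass

# Hypothetical, technology-independent model elements (illustration only).
@dataclass(frozen=True)
class Attribute:
    name: str
    type: str              # abstract type: "string" or "integer"
    required: bool = True

@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple

# Transformation rules: abstract type -> technology-specific type.
XSD_TYPES = {"string": "xs:string", "integer": "xs:integer"}
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER"}

def to_xsd(entity):
    """Render the entity as an XML Schema complexType fragment."""
    lines = [f'<xs:complexType name="{entity.name}">', "  <xs:sequence>"]
    for a in entity.attributes:
        min_occurs = "1" if a.required else "0"
        lines.append(
            f'    <xs:element name="{a.name}" '
            f'type="{XSD_TYPES[a.type]}" minOccurs="{min_occurs}"/>'
        )
    lines += ["  </xs:sequence>", "</xs:complexType>"]
    return "\n".join(lines)

def to_sql(entity):
    """Render the same entity as a SQL CREATE TABLE statement."""
    cols = []
    for a in entity.attributes:
        null = " NOT NULL" if a.required else ""
        cols.append(f"  {a.name} {SQL_TYPES[a.type]}{null}")
    return f"CREATE TABLE {entity.name} (\n" + ",\n".join(cols) + "\n);"

# A hypothetical, heavily simplified entity; not the real STIX Indicator.
indicator = Entity(
    "Indicator",
    (Attribute("title", "string"),
     Attribute("confidence", "integer", required=False)),
)

print(to_xsd(indicator))
print(to_sql(indicator))
```

The point of the sketch is only that one model feeds both generators, so a change to the model propagates to every target technology through the mapping rules.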

Regards,
Cory Casanave

From: cti@lists.oasis-open.org [mailto:cti@lists.oasis-open.org] On Behalf Of Jane Ginn - jg@ctin.us
Sent: Wednesday, June 24, 2015 2:05 PM
To: sbarnum@mitre.org; Jerome Athias; Cory Casanave
Cc: Eric.Burger@georgetown.edu; cti@lists.oasis-open.org
Subject: Re: [cti] Database Subcommittee / conceptual/logical model subcommittee

All:
Building on Cory's suggestion... Jerome's observations... and Sean's note about using OWL or RDFS....
Would it make sense to establish a Sub-Committee that combines some of the issues associated with database design that have been discussed previously (RDBMS vs. NoSQL) with this need for clarification at the abstract level (conceptual & logical)?
If so.... would the scope of such a Sub-Committee also cover implementation and tooling issues as was earlier suggested by Patrick?
Further, what would be the tangible outputs, and how would they map to the STIX, TAXII & CybOX Sub-Committees?
Jane Ginn, MSIA, MRP
Cyber Threat Intelligence Network, Inc.
jg@ctin.us

-------- Original Message --------
From: "Barnum, Sean D." <sbarnum@mitre.org>
Sent: Wednesday, June 24, 2015 10:41 AM
To: Jerome Athias <athiasjerome@gmail.com>,Cory Casanave <cory-c@modeldriven.com>
Subject: Re: [cti] Database Subcommittee / conceptual/logical model subcommittee
CC: Eric Burger <Eric.Burger@georgetown.edu>,"cti@lists.oasis-open.org " <cti@lists.oasis-open.org>
I just wanted to add a note of clarification here for the intent/scope of STIX and CybOX to date.
STIX and CybOX are intended to be Languages for expressing cyber threat information and cyber observable information respectively.
As such, they are more than simple data models or schemas. They also involve the conceptual model for their scope.
To date, the emergent and exploratory nature of this community, which has sought not only to formalize expressive representations for cyber threat information but also to work collaboratively and iteratively to figure out what that even meant, led to some necessary choices to work from the bottom up.

This is why the language has initially been developed, refined and defined in the form of XML schema. The schematic level of abstraction gave us something concrete with which to discuss and model specific technical details, and to experiment with real-world data and implementations in order to iterate and improve. XML schema was chosen not because it is some magical answer that everyone everywhere should use but rather because it is ubiquitous, is supported by a mature body of tooling and synergistic standards (XPath, XPointer, XQuery, etc.) and provides a powerful formal schema language to explicitly constrain syntax while enabling necessary flexibility. All of these things were needed to model and evolve a representation of an emergent knowledge space among a very diverse set of players.

This approach served us well and got us where we are today, but it has always been recognized that specifying the language at this level of abstraction has significant downsides. First, it is difficult to define semantics and high-level concepts effectively at this level. Second, choosing any particular technical implementation (XML, JSON, etc.) inherently introduces technology-specific characteristics that really are not part of the more generalized language.

In recognition of this, it has always been the plan to move the specification of the languages to a more general form once an appropriate level of maturity and stability had been reached (very similar to the plan to move to a formal standards body at the appropriate time). The first steps toward this were put into motion several months back when work began on an implementation-independent specification for STIX and a separate but related one for CybOX. It was decided that, based on community needs and maturity, the appropriate first step in generalization would be to capture language structure and syntax in the form of a UML model accompanied by a set of textual specifications that explain and characterize the UML model in a more human-consumable form. The draft set of these specifications for STIX 1.1.1 is currently available in the STIXProject on GitHub, and the versions updated to STIX 1.2 should be completed within the next couple of weeks. This will be the primary normative contribution to the CTI TC. There is also a UML model available for CybOX, but the accompanying set of full textual specs similar to those for STIX will not be created before transition to the CTI TC, so that work will likely fall to the CybOX SC.

While UML models are formal and are abstracted from particular syntactic implementations (XML, JSON, etc.), they are not in all honesty really built to convey high-level conceptual models or explicit semantics of knowledge. They can be somewhat twisted to serve this purpose (as we have done in the implementation independent specs) but the fact that they were designed to serve a systems engineering rather than knowledge engineering purpose leads to some shortcomings. The inability of UML models to effectively convey high-level conceptual models and explicit knowledge semantics in a formal fashion is one of the key reasons the textual specification documents are required in addition to the UML. They not only provide more human-consumable characterizations of what is in the UML but they are also needed to explain semantics that cannot effectively be expressed in the UML. The upside is that some of these semantics can now be explicit in the documents but it is in an informal form and still open to human interpretation. What is ultimately needed for the language specs is a way to formally express the full range of language semantics and structure.

I have personally asserted for a long while, and I know many in the community agree, that the long term solution for specifications of the languages is to define and express them using mechanisms purpose built to define languages like this. That is, utilizing semantic forms of specification such as OWL and RDFS. These forms while less familiar to many (part of the reason we decided to work from the bottom up) provide a way to clearly, explicitly, unambiguously and formally specify the high-level conceptual model for the languages, directly map it to any number of more detailed conceptual models, and then directly map it to specific syntactic/schematic representations (logical models).
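
To make the idea of explicit, machine-processable semantics a little more concrete, here is a minimal Python sketch of two standard RDFS entailment rules (transitivity of `rdfs:subClassOf`, and inheritance of `rdf:type` along it) applied to a handful of triples. The `ex:` class and instance names are hypothetical placeholders and are not drawn from STIX or CybOX; a real effort would presumably use an RDF/OWL toolkit rather than hand-rolled sets of tuples.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical vocabulary, for illustration only.
triples = {
    ("ex:Malware", SUBCLASS_OF, "ex:Tool"),
    ("ex:Tool", SUBCLASS_OF, "ex:Resource"),
    ("ex:sample1", TYPE, "ex:Malware"),
}

def infer(triples):
    """Apply two RDFS entailment rules until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for (a, p, b) in inferred:
            for (c, q, d) in inferred:
                # Only chain through "b subClassOf d" statements.
                if b != c or q != SUBCLASS_OF:
                    continue
                if p == SUBCLASS_OF:
                    new.add((a, SUBCLASS_OF, d))  # subclass transitivity
                elif p == TYPE:
                    new.add((a, TYPE, d))         # type inheritance
        if new <= inferred:
            return inferred
        inferred |= new

closure = infer(triples)
print(("ex:sample1", TYPE, "ex:Resource") in closure)
```

Nothing in the input states that `ex:sample1` is an `ex:Resource`; that fact follows from the formally defined semantics of the vocabulary, which is the kind of thing a UML model plus prose cannot give a machine.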
Many members of the community have been eager to begin working at this level, but it was deemed important to first complete the abstraction work to the UML/textual specification level to serve as an XML-bias-free basis for initial semantic modeling. I propose that some of the CTI TC's early work should be focused on these activities. In fact, I would fairly strongly assert that many of the refactoring issues on the table for STIX 2.0 (e.g., abstraction of several embedded structures (relationships, sources, assets, victims, etc.) to separate constructs) will require semantic modeling in order to fully understand and get right. I think the semantic discussions and modeling as part of these activities could serve as some great initial steps towards more formal specifications for the languages, serving not only better integration for each language across abstraction levels (conceptual to logical) but also better alignment and integration with related information representations within the cyber security sphere (MAEC, CAPEC, CVRF, OVAL, OpenIOC, etc.) and outside it.

So, that was a long contextual way of saying that I strongly agree with the need to understand and specify these languages across the abstraction spectrum (conceptual to lexical) but strongly feel that this should/must be done within the context of each language (i.e., within the STIX and CybOX SCs with cross-coordination via the TC) rather than as a separate activity.

Sean

On 6/24/15, 11:39 AM, "Jerome Athias" <athiasjerome@gmail.com> wrote:

I'm a great fan of conceptual models!

I skipped this step while reading the specifications to go directly to
a relational data model, but I can see a lot of benefits in producing a
CMap, especially for new adopters (simply because a picture is worth a
thousand words). It's also easy to share (e.g. with CmapTools).

The issue that I think we would encounter is not so much about the
level of abstraction (multiple CMaps could resolve that), and there
are not that many concepts there (in CTI). (I used to do CMaps for
complex systems.)
It is mainly, AGAIN, related to the taxonomy.
You can see that when dealing with the extension points and figuring
out what would be the most appropriate standard/specification to map
CTI to: things that are around CTI and that you have to deal with,
such as Assets, Vulnerabilities, Exploits, Shellcodes, etc.
But I assure you that it's fascinating ;)
And since all these things are somehow linked together, it is quite
difficult to choose how to -split- this into multiple models.
(you could look at it in many ways, like asset-centric, risk-based,
vulnerability-based, etc.)

My 2c

2015-06-24 18:18 GMT+03:00 Cory Casanave <cory-c@modeldriven.com>:
There is certainly a value in a DBMS capability, perhaps one that can be implemented across multiple technologies. This may then also relate to the "conceptual model" initiatives which have already started. A conceptual model can bridge the exchange and repository viewpoints and also allow for greater flexibility in implementation technologies. We have had great success in generating schema as well as transformations between them from models.

With this in mind perhaps a conceptual and/or logical model subcommittee should be considered. Depending on the approach this could provide some of the value that is being sought for the database. A separation of concerns would allow for the definition of the database in models with implementation in one or more chosen technologies. Such implementation would probably be another activity.

There is some grey area in what people call conceptual and logical models and the levels of abstraction each represents. For me (and many others), a conceptual model is a model of how the world is understood - it is then a model of the terms and concepts of the world, not a data model. An "instance" of a person in a conceptual model is a real person - not data. A logical model is then a technology independent data model about the world where choices are made as to structure and representation. An "instance" of a person in a logical model is data. An initial activity of a conceptual/logical model subcommittee could be to define the purpose, scope and appropriate level of abstraction.
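
The distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Concept`, `Relation` and `PersonRecord` classes and all field names are hypothetical, chosen to show where representation choices enter, and are not part of any existing CTI model.

```python
from dataclasses import dataclass

# Conceptual level: terms and concepts, with no representation choices.
# An "instance" of Person at this level is an actual person, not data.
@dataclass(frozen=True)
class Concept:
    term: str
    definition: str

@dataclass(frozen=True)
class Relation:
    subject: Concept
    verb: str
    object: Concept

person = Concept("Person", "A human being")
organization = Concept("Organization", "A group of people with a shared purpose")
works_for = Relation(person, "works for", organization)

# Logical level: a technology-independent data structure ABOUT a person.
# Structure and representation are now chosen (one name field, employer
# referenced by name), but no XML, JSON or SQL is implied yet.
@dataclass(frozen=True)
class PersonRecord:
    describes: Concept    # the concept this record is data about
    full_name: str        # representation choice: a single name field
    employer_name: str    # representation choice: employer referenced by name

record = PersonRecord(person, "Ada Example", "Example Corp")
print(f"{record.full_name} is data about a {record.describes.term}")
```

A different logical model could split the name into parts or reference the employer by identifier while still being data about the same concept, which is why the two levels are worth keeping apart.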

Of course the model activity is just as relevant to the exchange schema and can help make them more understandable as well as provide a basis for support of other technologies (essentially a model driven architecture approach).  This works best when the models are the normative definition and technology schema are generated from them. Since this tends to introduce more change (as well as more consistency), it would best be coupled with the second phase.

There has already been work on conceptual models, so this direction seems consistent with the community's direction. With the above in mind we may want to consider a conceptual and/or logical model subcommittee.

Regards,
Cory Casanave
Representing OMG

-----Original Message-----
From: cti@lists.oasis-open.org [mailto:cti@lists.oasis-open.org] On Behalf Of Jerome Athias
Sent: Wednesday, June 24, 2015 7:06 AM
To: Eric Burger
Cc: cti@lists.oasis-open.org
Subject: Re: [cti] Database Subcommittee

I wonder if providing consumer-oriented XQuery examples (maybe with the STIX idioms) would help provide guidance and test/validation cases.

2015-06-22 14:20 GMT+03:00 Eric Burger <Eric.Burger@georgetown.edu>:
Jerome (as he often does) gets this right in one (how about that - use a British colloquialism instead of a US one!).

We just submitted a paper for publication at MILCOM looking at STIX/TAXII/CybOX versus IODEF/RID from the perspective of humans versus machines doing the processing. My guess is you can guess the end of the story: STIX/TAXII/CybOX is much better for machines. IODEF/RID is much better for people. Since the goal is for inter-machine communication, you get the point.

It does mean there is a lot riding on VERY clear, implementable, interoperable specifications. Debugging this stuff is going to be a nightmare, especially if the language is so nuanced that there are dozens of ways of saying the same thing.

---------------------------------------------------------------------
To unsubscribe from this mail list, you must leave the OASIS TC that generates this mail.  Follow this link to all your TCs in OASIS at:
https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
