Re: [cti] Database Subcommittee / conceptual/logical model subcommittee

I understand the concern about building a reference implementation for, say, MongoDB and then everyone believing that it is the only way to go. However, I also see the other side of the coin: small shops, startups, and open-source developers are not going to really understand what is needed on the backend to store all of this data. It seems like we need to help them figure that out somehow.

I think the UML models will greatly help this effort, but some best practices for a relational database and a document database would be a good idea, along with some things to watch out for and reasons why you might choose one over the other.

Thanks,

Bret

Bret Jordan CISSP

Director of Security Architecture and Standards | Office of the CTO

Blue Coat Systems

PGP Fingerprint: 63B4 FC53 680A 6B7D 1447 F2C0 74F8 ACAE 7415 0050

"Without cryptography vihv vivc ce xhrnrw, however, the only thing that can not be unscrambled is an egg."

On Jul 7, 2015, at 10:59, Barnum, Sean D. <sbarnum@mitre.org> wrote:

I would tend to agree with Eric here on the risks of trying to specify
actual detailed implementations (e.g., a SQL schema that could be directly
instantiated). As Eric points out, it could easily be assumed to be "the"
way you are supposed to do it rather than "a possible" way to do it. It is
also likely to be tied to specific environmental/configuration assumptions
that may not hold true across all SQL environments. As such it would
likely be somewhat brittle and difficult to maintain.

My thinking, and what I plan to propose as part of the CTI STIX SC
roadmap, is an approach where the SCs could take on work product efforts
(if desired) to specify binding specs for particular database approaches
(e.g. SQL, No-SQL document-centric, etc.) similar to the sorts of binding
specs that will be created for particular schematic implementations (XML,
JSON, etc.). These binding specs do not define the language but rather
provide details of how you should implement various characteristics of the
language (defined in the language specs) in the specific technology being
bound to. For example, an XML binding spec for STIX would need to specify
how to implement the Vocabularies data model (defined in the language
specs) using XML Schema. It would also need to specify how to implement
the Controlled_Structure for data markings in XML. I think similar sorts
of constructive guidance could be provided for best practices (hopefully
from real world lessons learned) for structuring the language in various
database technologies. I think this would be very useful as the sort of
head start Bret is looking for (though certainly not turnkey) but would
also remain more flexible and easier to maintain across various specific
environments or for minor language spec changes.

What do you think?

sean

On 7/7/15, 7:54 AM, "Eric Burger" <Eric.Burger@georgetown.edu> wrote:

I do not see this as in scope for CTI. This is precisely the kind of
thing we do not want to specify. The data format and protocol are fixed,
so that if you use MySQL for your backend, I use NoSQL for my backend,
and someone else uses 1,000 Bletchley Park 'computers' with note pads for
their backend, we can still exchange information. Innovation occurs under
and behind STIX/CybOX.

I would offer that a *few* independent, open source implementations
would be helpful in jumpstarting the ecosystem. A lot of the success of
SMTP, SNMP, and IPFIX is because there were lots of freely available
implementations for people to pick up and play with.

Creating database schemas directly means making attendant technology
choices. To people outside the CTI group, it will look like these are the
*only* acceptable technology choices. No matter how much we say "these
are examples only, we expect you to innovate on your own," it will not
happen. The same thing happens when we include sample code or protocol
snippets in the specifications. No matter how wrong the samples are and
no matter how explicit we are in the specification that implementors need
to follow the specification and not the samples, people will take the
snippets as gospel and not bother to read the specification.

I am not sure if OASIS has this rule, but the IETF has a rule that a
protocol cannot become an Internet Standard unless there are two
independent, interoperable implementations. TIFF never progressed to
Internet Standard because everyone used Adobe's free libraries, and as
such there was never a second independent implementation. I fear that if
we have an official data base schema, that will become *the* data base
schema. That would not be good for the ecosystem.

All of that said, one of the things that made SIP successful was a SIP
Implementors group that sat *outside of the IETF*, although it was
mentored by a lot of IETF folks. That was where newbies could go to ask
for basic advice ("How do I parse XML?" "How do I spell STIX?" "What is
a data base?") as well as more intricate questions ("I read the
specification and expected 'foo' but I got 'bar' instead. What do you
think I did wrong?").

The separation of SIP Implementors (not IETF) and SIP (IETF) was driven
by two factors. The first one was the SIP list became overwhelmed with
questions like "Should I use C or C++ to build a stack?" That has nothing
to do with protocol development. The signal to noise ratio fell
precipitously, and creating a SIP Implementors group significantly
improved the S/N ratio for the people building, correcting, and adding to
the SIP spec. The second factor is the inverse S/N issue for
implementors. If you are a development manager who wants to know what
data base to run on the back end of your aggregation system, you probably
could not care less if we are modeling a threat actor's left hand's
acceleration as an entity or an attribute of a relation. You also
probably do not care if we are looking at the impact of delay tolerant
networks for the interplanetary Internet. That is something important for
*us* to think about as we get closer to a mission to Mars, but is most
likely not important to someone who wants to deliver product this year.

I have no problem with there being a CTI Implementors group. At this very
early stage in CTI's development, I would offer it could be beneficial
for OASIS CTI to host the implementors group as a subcommittee. The
reason is we, as the protocol's developers, are learning a lot from early
implementations. However, I would also offer we spin it out as soon as
practical, so that the guidance provided is NOT interpreted as gospel.

On Jul 6, 2015, at 5:49 PM, Jordan, Bret <bret.jordan@BLUECOAT.COM>
wrote:

Since I proposed the idea of this working group 12+ months ago, and
begged Eric to run with it, a lot of what I was originally wanting and
asking for has now been lost in really weird discussions about the
object model.

So let's rewind 12 months and get back to what I was asking for in the
first place...

What I want out of this group is some guidelines and database schemas
for developers wanting to write TAXII / STIX / CybOX implementations.
Basically, from a backend database standpoint, how do they get started?
Which database systems should they use based on various implementation
strategies? What should the base configurations be for said databases?
And maybe even some example implementations.

For example, say an open source app developer was going to write a
basic STIX/CybOX Indicator/Observable UI that could read data in from a
TAXII server, add comments and context, and spit it back out. What type
of database should he/she use to store the data, and what should it look
like?

For very simple things, where you are NOT doing all of STIX and CybOX,
maybe a relational database would work fine and in fact work really well.
So in these cases, it would be nice if there was a .sql file the
developer could pull down that would build all of the necessary table
structure for him/her. If they are doing something more complex, maybe
they need a document database. Then it would be really great if we
produced some documents / papers / implementation guides / and maybe
even some examples that could help them get up and running faster.
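
To make the relational-versus-document trade-off a little more concrete,
here is a minimal sketch, not a recommended schema. It assumes a heavily
simplified indicator record; the table names, column names, and the
record itself are hypothetical and do not come from any STIX or CybOX
specification. It stores the same record both ways in SQLite so the two
styles can be compared side by side.

```python
import json
import sqlite3

# Hypothetical, heavily simplified indicator record; real STIX indicators
# carry many more fields than this.
indicator = {
    "id": "example:indicator-1",
    "title": "Suspicious domain",
    "observable": {"type": "domain", "value": "bad.example.com"},
    "comments": ["seen in phishing mail"],
}


def open_store():
    """Create an in-memory database holding both storage styles."""
    db = sqlite3.connect(":memory:")
    # Relational style: one column per field, child rows for repeated values.
    db.execute(
        "CREATE TABLE indicator (id TEXT PRIMARY KEY, title TEXT, "
        "observable_type TEXT, observable_value TEXT)"
    )
    db.execute(
        "CREATE TABLE indicator_comment (indicator_id TEXT "
        "REFERENCES indicator(id), body TEXT)"
    )
    # Document style: the whole record kept as one JSON text blob.
    db.execute("CREATE TABLE indicator_doc (id TEXT PRIMARY KEY, body TEXT)")
    return db


def save(db, ind):
    """Write the same record into the relational and the document tables."""
    obs = ind["observable"]
    db.execute(
        "INSERT INTO indicator VALUES (?, ?, ?, ?)",
        (ind["id"], ind["title"], obs["type"], obs["value"]),
    )
    db.executemany(
        "INSERT INTO indicator_comment VALUES (?, ?)",
        [(ind["id"], comment) for comment in ind["comments"]],
    )
    db.execute(
        "INSERT INTO indicator_doc VALUES (?, ?)", (ind["id"], json.dumps(ind))
    )


def find_by_observable(db, value):
    """Relational lookup: filter on a known column."""
    rows = db.execute(
        "SELECT id FROM indicator WHERE observable_value = ?", (value,)
    )
    return [row[0] for row in rows]


def load_document(db, ind_id):
    """Document lookup: return the record in the shape it was stored."""
    row = db.execute(
        "SELECT body FROM indicator_doc WHERE id = ?", (ind_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


db = open_store()
save(db, indicator)
```

The relational tables make it easy to query on a known field but need a
new column or table for every structure added; the document table keeps
nested structure intact at the cost of harder field-level queries. Which
matters more depends on how much of STIX and CybOX is being implemented.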

The problems I see are:

1) STIX is massive and very complex. Just trying to learn it and
figure out how much of it you have to implement is a monumental task.

2) Then if you need something in, say, Objective-C, you have to write the
API to support STIX.

3) Then you need to write a basic TAXII system to do what you want.

4) Then you need to figure out what kind of database you are going to
use to store the data.

If we could help out on #4, that might just help make things easier for
people getting started.

Thanks,

Bret

Bret Jordan CISSP
Director of Security Architecture and Standards | Office of the CTO
Blue Coat Systems
PGP Fingerprint: 63B4 FC53 680A 6B7D 1447 F2C0 74F8 ACAE 7415 0050
"Without cryptography vihv vivc ce xhrnrw, however, the only thing that
can not be unscrambled is an egg."

On Jul 6, 2015, at 15:33, Barnum, Sean D. <sbarnum@mitre.org> wrote:

Eric, I definitely agree that there will need to be considerable
coordination between STIX and CybOX efforts. I would hope that that is
not a surprise to anyone. :-)

sean

From: Eric Burger <Eric.Burger@georgetown.edu>
Date: Monday, July 6, 2015 at 5:30 PM
To: "cti@lists.oasis-open.org" <cti@lists.oasis-open.org>
Subject: Re: [cti] Database Subcommittee / conceptual/logical model
subcommittee

Even if it means I'm writing myself out of a 'chair,' I agree with
Sean. The most important task that people are talking about for the
"database subcommittee" is the formal modeling of STIX/CybOX (and to a
lesser extent, TAXII). If that has to be a part of STIX or CybOX SCs,
so be it.

The downside is that, based on the work we have done so far at
Georgetown, it is very difficult to build a model considering STIX and
CybOX as separate entities.

On Jul 6, 2015, at 5:13 PM, Barnum, Sean D. <sbarnum@mitre.org> wrote:

We agree that these sub-topics have value and should be managed
appropriately to ensure they are addressed consistently with minimal
impact to other areas and sub-topics.

As the co-chairs for the CTI STIX SC we wanted to express our
thoughts on how these various sub-topics might be addressed most
effectively.

We propose that issues relevant to specific languages (language
specifications at all levels of abstraction (ontology, logical
data-model, specific schematic implementations (xml, json, etc.)),
database and implementation guidance, etc.) should be managed from
within the appropriate STIX SC or CybOX SC. The concept of the
language specifications existing at different levels of abstraction is
already part of our way of doing things. For the last year or so we
have been working to lift the STIX specification from just XML Schema
to a set of implementation-independent specifications based on a UML
model and explanatory textual documents. These specification documents
will form the normative basis of the language specification that will
be transferred to OASIS. It has also been the goal to eventually
evolve the implementation-independent specification and lift it to a
more formal and explicit semantic form. All of these differing levels
of abstraction are part of the effort to specify the language. For
consistency's sake they should all be part of a single evolutionary
thread for the language and not separate parallel efforts. Similarly,
specific guidance on database approaches or other implementations
would practically be tied to the language they are implemented to
support and as such likely should fall within the scope of the SC
working on those languages. Within the language SCs these topics can
be broken down and managed using different work products as
appropriate.

Issues that tend to be relevant to the broader ecosystem (engagement,
interoperability, etc.) may best be managed as separate SCs under the
TC.

We believe this approach will yield the best balance between focusing
on specific issues, ensuring the right people are involved in the
right efforts and achieving consistency across efforts and at this
time will likely improve our focus and support more rapid progress. If
at some future time the TC decides that a different approach is
needed, it will be possible to modify the approach at that time.

The first CTI STIX SC meeting next week will likely flesh out in a
bit more detail how we see this approach taking form for the STIX SC.

We appreciate your consideration of our thoughts on the matter.

Sean Barnum and Aharon Chernin
CTI STIX SC Co-chairs

From: Patrick Maroney <Pmaroney@Specere.org>
Date: Monday, July 6, 2015 at 1:03 AM
To: Eric Burger <Eric.Burger@georgetown.edu>,
"cti@lists.oasis-open.org" <cti@lists.oasis-open.org>
Subject: Re: [cti] Database Subcommittee / conceptual/logical model
subcommittee

[+1] "I have not been comfortable with calling this group the
'database subcommittee' specifically because it is the data model, not
the data model implementation, that needs focus."

[+1] "...once people start looking at terms and concepts from a
model perspective instead of XML (or SQL, etc) data structures they
discover issues, complexities, simplifications and opportunities that
are not very apparent looking at schema. This activity represents a
different viewpoint that when combined with the more 'bottom up'
implementation and representation concerns makes the specification
that much better. For this reason it would be my suggestion that such
a viewpoint should drive the vocabulary and semantics and work in
concert with but not be the same as the team that focuses on the best
representation and implementation in XML or a DBMS. "

Patrick Maroney
Office: (856)983-0001
Cell: (609)841-5104
pmaroney@specere.org
From: cti@lists.oasis-open.org <cti@lists.oasis-open.org> on behalf of
Eric Burger <Eric.Burger@georgetown.edu>
Sent: Sunday, July 5, 2015 3:17:55 AM
To: cti@lists.oasis-open.org
Subject: Re: [cti] Database Subcommittee / conceptual/logical model
subcommittee

I have not been comfortable with calling this group the "database
subcommittee" specifically because it is the data model, not the data
model implementation, that needs focus. Cory nails it in one (second
paragraph below). In order to build real data migration tools, you
really need to understand what you are migrating. I would offer the
first task (as opposed to a parallel sub-subcommittee) is to do the
modeling.

That is why we have been working on an OWL model for STIX/CybOX at
Georgetown. Our purpose was for a different goal, but the result could
be generally useful.

On Jun 24, 2015, at 4:45 PM, Cory Casanave <cory-c@MODELDRIVEN.COM>
wrote:

Team,
I purposely did not suggest a particular language for expressing the
conceptual/logical model as that is a worthy topic of discussion for
the group. In the related OMG activity we are using a profile of UML
that adds more semantic capabilities but has the tooling, established
base and graphic support of UML. This profile is currently going
through the standards process and is then able to generate OWL. You
can say 90% of what you can say in OWL with less complexity. We have
also used OWL for other projects as it also has some valuable
features, but is also far from perfect. This is a good topic for
discussion. But we get ahead of ourselves; the purpose and scope
should drive such choices.

What I have found in every similar activity is that once people
start looking at terms and concepts from a model perspective instead
of XML (or SQL, etc) data structures they discover issues,
complexities, simplifications and opportunities that are not very
apparent looking at schema. This activity represents a different
viewpoint that when combined with the more "bottom up" implementation
and representation concerns makes the specification that much better.
For this reason it would be my suggestion that such a viewpoint
should drive the vocabulary and semantics and work in concert with
but not be the same as the team that focuses on the best
representation and implementation in XML or a DBMS.

In the best scenario the former would then generate the latter based
on transformation rules that map the terms, structure and semantics
onto the technology framework of choice. The existing schema provide
a valuable resource to start with whereas the models provide a better
way to evolve and certainly a better way to support multiple
technologies. This then can be considered a candidate strategy for
phase 2; it is a different SDLC than starting with XML schema. Coming
to consensus on our approach and SDLC should, perhaps, precede
forming subcommittees to start the work.
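
As a rough illustration of a model generating technology-specific
schema through transformation rules, here is a minimal sketch. It
assumes a toy logical model with two primitive types; the `Entity` and
`Attribute` classes, the `Indicator` fragment, and both mapping tables
are hypothetical and are not taken from the STIX UML model or from any
OMG tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    kind: str  # "string" or "integer" in this sketch
    required: bool = True


@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple


# Hypothetical fragment of a technology-independent logical model.
INDICATOR = Entity(
    "Indicator",
    (
        Attribute("id", "string"),
        Attribute("title", "string"),
        Attribute("confidence", "integer", required=False),
    ),
)

# Transformation rules: one mapping table per target technology.
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER"}
JSON_TYPES = {"string": "string", "integer": "integer"}


def to_sql(entity):
    """Map the model onto a relational table definition."""
    columns = []
    for attr in entity.attributes:
        nullability = " NOT NULL" if attr.required else ""
        columns.append(f"{attr.name} {SQL_TYPES[attr.kind]}{nullability}")
    return f"CREATE TABLE {entity.name.lower()} ({', '.join(columns)});"


def to_json_schema(entity):
    """Map the same model onto a JSON-Schema-style dictionary."""
    return {
        "title": entity.name,
        "type": "object",
        "properties": {
            attr.name: {"type": JSON_TYPES[attr.kind]}
            for attr in entity.attributes
        },
        "required": [a.name for a in entity.attributes if a.required],
    }


print(to_sql(INDICATOR))
print(to_json_schema(INDICATOR))
```

The point of the sketch is only that the model is written once and each
target is derived from it, so a change to the model propagates to every
generated schema.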

Regards,
Cory Casanave

From: cti@lists.oasis-open.org [mailto:cti@lists.oasis-open.org] On
Behalf Of Jane Ginn - jg@ctin.us
Sent: Wednesday, June 24, 2015 2:05 PM
To: sbarnum@mitre.org; Jerome Athias; Cory Casanave
Cc: Eric.Burger@georgetown.edu; cti@lists.oasis-open.org
Subject: Re: [cti] Database Subcommittee / conceptual/logical model
subcommittee

All:

Building on Cory's suggestion... Jerome's observations... and Sean's
note about using OWL or RDFS....

Would it make sense to establish a Sub-Committee that combines some
of the issues associated with database design that have been
discussed previously (RDBMS vs. NoSQL) with this need for
clarification at the abstract level (conceptual & logical)?

If so.... would the scope of such a Sub-Committee also cover
implementation and tooling issues as was earlier suggested by Patrick?

Further, what would be the tangible outputs, and how would they map
to the STIX, TAXII, and CybOX Sub-Committees?

Jane Ginn, MSIA, MRP
Cyber Threat Intelligence Network, Inc.
jg@ctin.us

-------- Original Message --------
From: "Barnum, Sean D." <sbarnum@mitre.org>
Sent: Wednesday, June 24, 2015 10:41 AM
To: Jerome Athias <athiasjerome@gmail.com>,Cory Casanave
<cory-c@modeldriven.com>
Subject: Re: [cti] Database Subcommittee / conceptual/logical model
subcommittee
CC: Eric Burger
<Eric.Burger@georgetown.edu>,"cti@lists.oasis-open.org "
<cti@lists.oasis-open.org>

I just wanted to add a note of clarification here for the
intent/scope of STIX and CybOX to date.
STIX and CybOX are intended to be Languages for expressing cyber
threat information and cyber observable information respectively.
As such, they are more than simple data models or schemas. They also
involve the conceptual model for their scope.
To date, the emergent and exploratory nature of this community
seeking not only to formalize expressive representations for cyber
threat information but to work collaboratively and iteratively to
even figure out what that meant led to some necessary choices to work
from the bottom up.

This is why the language has initially been developed, refined and
defined in the form of XML schema. The schematic level of abstraction
gave us something concrete to discuss, model specific technical
details and to experiment with real world data and implementations in
order to iterate and improve. XML schema was chosen not because it is
some magical answer that everyone everywhere should use but rather
because it is ubiquitous, supported by a mature body of tooling and
synergistic standards (XPath, XPointer, XQuery, etc.) and provides a
powerful formal schema language to explicitly constrain syntax while
enabling necessary flexibility. All of these things were needed to
model and evolve a representation of an emergent knowledge space
among a very diverse set of players.

This approach served us well to successfully get us where we are
today but it has always been recognized that specifying the language
at this level of abstraction has significant downsides. First, it is
difficult to define semantics and high level concepts effectively at
this level and choosing any particular technical implementation (XML,
JSON, etc.) inherently introduces technology-specific characteristics
that really are not part of the more generalized language.

In recognition of this, it has always been the plan to move the
specification of the languages to a more general form once an
appropriate level of maturity and stability had been reached (very
similar to the plan to move to a formal standards body at the
appropriate time). The first steps toward this were put into motion
several months back when work began on an implementation independent
specification for STIX and a separate but related one for CybOX. It
was decided that based on community needs and maturity the
appropriate first step in generalization would be to capture language
structure and syntax in the form of a UML model that would be
accompanied by a set of textual specifications to explain and
characterize the UML model in a more human consumable form. The draft
set of these specifications for STIX 1.1.1 are currently available in
the STIXProject on GitHub and the updated versions to STIX 1.2 should
be completed within the next couple of weeks. This will be the primary
normative contribution to the CTI TC. There is a UML model for CybOX
also available but the set of accompanying full textual specs similar
to STIX will not be created before transition to the CTI TC so that
work will likely fall to the CybOX SC.

While UML models are formal and are abstracted from particular
syntactic implementations (XML, JSON, etc.), they are not in all
honesty really built to convey high-level conceptual models or
explicit semantics of knowledge. They can be somewhat twisted to
serve this purpose (as we have done in the implementation independent
specs) but the fact that they were designed to serve a systems
engineering rather than knowledge engineering purpose leads to some
shortcomings. The inability of UML models to effectively convey
high-level conceptual models and explicit knowledge semantics in a
formal fashion is one of the key reasons the textual specification
documents are required in addition to the UML. They not only provide
more human-consumable characterizations of what is in the UML but
they are also needed to explain semantics that cannot effectively be
expressed in the UML. The upside is that some of these semantics can
now be explicit in the documents but it is in an informal form and
still open to human interpretation. What is ultimately needed for the
language specs is a way to formally express the full range of
language semantics and structure.

I have personally asserted for a long while, and I know many in the
community agree, that the long term solution for specifications of
the languages is to define and express them using mechanisms purpose
built to define languages like this. That is, utilizing semantic
forms of specification such as OWL and RDFS. These forms while less
familiar to many (part of the reason we decided to work from the
bottom up) provide a way to clearly, explicitly, unambiguously and
formally specify the high-level conceptual model for the languages,
directly map it to any number of more detailed conceptual models, and
then directly map it to specific syntactic/schematic representations
(logical models).
Many members of the community have been eager to begin working at
this level but it was deemed important to first complete the
abstraction work to the UML/textual specification level to serve as an
XML-bias-free basis for initial semantic modeling. I propose that
some of the CTI TC's early work should be focused on these activities.
In fact, I would fairly strongly assert that many of the refactoring
issues on the table for STIX 2.0 (e.g., abstraction of several
embedded structures (relationships, sources, assets, victims, etc.)
to separate constructs) will require semantic modeling in order to
fully understand and get right. I think the semantic discussions and
modeling as part of these activities could serve as some great
initial steps towards more formal specifications for the languages
that serve not only better integration for each language across
abstraction levels (conceptual to logical) but also better alignment
and integration with related information representations within the
cyber security sphere (MAEC, CAPEC, CVRF, OVAL, OpenIOC, etc.) and
outside the cyber security sphere.

So, that was a long contextual way of saying that I strongly agree
with the need to understand and specify these languages across the
abstraction spectrum (conceptual to lexical) but strongly feel that
this should/must be done within the context of each language (i.e.,
within the STIX and CybOX SCs with cross coordination via the TC)
rather than as a separate activity.

Sean

On 6/24/15, 11:39 AM, "Jerome Athias" <athiasjerome@gmail.com> wrote:

I'm a great fan of conceptual models!

I skipped this step while reading the specifications to go directly to
a relational data model, but I can see a lot of benefits in producing a
CMap, especially for new adopters (just because one picture can tell
thousands of words). It's easy to share also (e.g. CmapTools).

The issue that I think we would encounter is not so much about the
level of abstraction (multiple CMaps could resolve that), as there
are not that many concepts there (in CTI). (I used to do CMaps for
complex systems.)
It is mainly, AGAIN, related to the taxonomy.
You could see that when dealing with the extension points, figuring
out what would be the most appropriate standard/specification to map
CTI to. Things that are around CTI and that you have to deal with,
such as Assets, Vulnerabilities, Exploits, Shellcodes, etc.
But I assure you that it's fascinating ;)
And while all these things are somehow linked together, it is quite
difficult to choose how to -split- this into multiple models.
(You could look at it in many ways, like asset-centric, risk-based,
vulnerability-based, etc.)

My 2c

2015-06-24 18:18 GMT+03:00 Cory Casanave <cory-c@modeldriven.com>:
There is certainly a value in a DBMS capability, perhaps one that
can be implemented across multiple technologies. This may then also
relate to the "conceptual model" initiatives which have already
started. A conceptual model can bridge the exchange and repository
viewpoints and also allow for greater flexibility in implementation
technologies. We have had great success in generating schema as
well as transformations between them from models.

With this in mind perhaps a conceptual and/or logical model
subcommittee should be considered. Depending on the approach this
could provide some of the value that is being sought for the
database. A separation of concerns would allow for the definition
of the database in models with implementation in one or more chosen
technologies. Such implementation would probably be another
activity.

There is some grey area in what people call conceptual and logical
models and the levels of abstraction each represents. For me (and
many others), a conceptual model is a model of how the world is
understood - it is then a model of the terms and concepts of the
world, not a data model. An "instance" of a person in a conceptual
model is a real person - not data. A logical model is then a
technology independent data model about the world where choices are
made as to structure and representation. An "instance" of a person
in a logical model is data. An initial activity of a
conceptual/logical model subcommittee could be to define the
purpose, scope and appropriate level of abstraction.

Of course the model activity is just as relevant to the exchange
schema and can help make them more understandable as well as
provide a basis for support of other technologies (essentially a
model driven architecture approach). This works best when the
models are the normative definition and technology schema are
generated from them. Since this tends to introduce more change (as
well as more consistency), it would best be coupled with the second
phase.

There has already been work on conceptual models, and this direction
seems consistent with the community's direction. With the above in
mind we may want to consider a conceptual and/or logical model
subcommittee.

Regards,
Cory Casanave
Representing OMG

-----Original Message-----
From: cti@lists.oasis-open.org [mailto:cti@lists.oasis-open.org]
On Behalf Of Jerome Athias
Sent: Wednesday, June 24, 2015 7:06 AM
To: Eric Burger
Cc: cti@lists.oasis-open.org
Subject: Re: [cti] Database Subcommittee

I wonder if providing consumer-oriented XQuery examples (maybe
with the STIX idioms) would help provide guidance and
test/validation cases.
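
A consumer-oriented query that doubles as a validation case might look
roughly like the sketch below. It is not XQuery: it uses the XPath
subset in Python's standard `xml.etree.ElementTree` as a stand-in. The
sample document is hypothetical and deliberately simplified (no
namespaces, invented element names), so it is not valid STIX.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real STIX XML is namespaced and richer.
SAMPLE = """
<Package>
  <Indicator id="example:indicator-1">
    <Title>Suspicious domain</Title>
    <Type>Domain Watchlist</Type>
  </Indicator>
  <Indicator id="example:indicator-2">
    <Title>Known bad address</Title>
    <Type>IP Watchlist</Type>
  </Indicator>
</Package>
"""


def titles_of_type(xml_text, indicator_type):
    """Return (id, title) pairs for indicators whose Type matches."""
    root = ET.fromstring(xml_text)
    # Predicate on a child element's text; part of ElementTree's XPath subset.
    matches = root.findall(f".//Indicator[Type='{indicator_type}']")
    return [(m.get("id"), m.findtext("Title")) for m in matches]


# Used as a validation case: a consumer expects exactly one domain watchlist.
result = titles_of_type(SAMPLE, "Domain Watchlist")
assert result == [("example:indicator-1", "Suspicious domain")]
```

Pairing each query with an expected result, as in the last two lines, is
what would let the same example serve as both guidance and a test case.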

2015-06-22 14:20 GMT+03:00 Eric Burger
<Eric.Burger@georgetown.edu>:
Jerome (as he often does) gets this right in one (how about that
- use a British colloquialism instead of a US one!).

We just submitted a paper for publication at MILCOM looking at
STIX/TAXII/CybOX versus IODEF/RID from the perspective of humans
versus machines doing the processing. My guess is you can guess
the end of the story: STIX/TAXII/CybOX is much better for
machines. IODEF/RID is much better for people. Since the goal is
for inter-machine communication, you get the point.

It does mean there is a lot riding on VERY clear, implementable,
interoperable specifications. Debugging this stuff is going to be
a nightmare, especially if the language is so nuanced there
are dozens of ways of saying the same thing.

---------------------------------------------------------------------
To unsubscribe from this mail list, you must leave the OASIS TC that
generates this mail. Follow this link to all your TCs in OASIS at:
https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php

