Subject: Re: [cti-stix] STIX: Messaging Standard vs. Document Standard


In the interest of forward progress and collaboration, we should try to focus on the things we agree on and treat differences of opinion simply as “things we haven’t agreed on yet”. If we interpret general statements as personal affronts, we risk shifting the discussion away from ideas and toward confrontation.

We should all be careful – and I think leadership primarily bears this responsibility – to be accepting of all opinions, even if they are different than ours. Leadership sets the tone for conversations and communication, and the tone should be of rational objectivity and inclusiveness.

Thank you.
-Mark

From: <cti-stix@lists.oasis-open.org> on behalf of "Barnum, Sean D." <sbarnum@mitre.org>
Date: Wednesday, December 9, 2015 at 12:12 PM
To: Eric Burger <Eric.Burger@georgetown.edu>
Cc: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: Re: [cti-stix] STIX: Messaging Standard vs. Document Standard

I don’t think your characterization of the recent topic discussions on the list is accurate.
Other than some folks pushing the MTI ballot for deciding an “official” MTI serialization, the large majority of discussion has been around information model issues (sightings, timestamps, relationships, data markings, assets, investigations/incidents, versioning, etc.), not around bits on the wire. Some folks may think about these things at a serialization level due to their own experiential context, and it is fine for them to contribute their input at that level, but they truly are information model issues.

>However, saying that the information model is necessarily the storage format and the protocol on the wire is simply not borne out in reality.

To clarify, this is not what I said or implied. In fact, I was asserting the exact opposite. 
I was saying that the information model is NOT necessarily "the storage format and the protocol on the wire”. 

My point is that these are three different things: the information model, the exchange/message protocol/serialization, and the storage format. 
Because of this, no one of these should dictate considerations for the others.
The information model should not dictate what the exchange/message protocol/serialization looks like as long as its information content is semantically aligned with the information model and can be explicitly mapped to it. 
Similarly, the operational concerns of any exchange/message protocol/serialization may offer input to the information model, but they should not unduly limit or constrain it; rather, each should just use the parts of the model that are relevant, in a way appropriate to the exchange/message context.
Lastly, the storage format is really the most independent variable of all, as its only interoperability requirement with any other storage format is that it semantically (though not necessarily lexically or syntactically) aligns with the information model. You point out that it could be implemented in many different ways in a relational database, and I agree. Further, it could be implemented in a JSON-centric NoSQL database, an XML-centric database, an RDF graph database, etc. It does not really matter, as long as those implementations are semantically aligned to the same information model and can be explicitly mapped to it. If they are, then you can take content from any one of them, transform it to an exchange/message protocol/serialization format, transmit it to another implementation, transform it to the specific storage implementation format there, and ingest it and analyze or act on it, knowing that you understand the meaning of the information in the way the originator did.
The exchange/message protocol/serialization acting as the middle piece there is very important for sharing but is meaningless unless it is mapped to an underlying information model that can support analysis and action at the ends of the sharing.
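The storage-to-exchange-to-storage pipeline described above can be sketched in a few lines. The record shape and field names here are hypothetical illustrations, not actual STIX structures; the point is only that two different storage shapes can round-trip through one exchange form when both map to the same model.

```python
import json

# Hypothetical model-level content: an indicator with an id, a pattern, a confidence.
# Producer's storage: a document-style record, as a JSON-centric store might hold it.
doc_record = {"id": "ind-1", "pattern": "evil.example.com", "confidence": 80}

# Exchange step: serialize the model-level content for the wire.
wire = json.dumps(doc_record)

# Consumer's storage: a relational-style row (column tuple) with its own local shape.
received = json.loads(wire)
row = (received["id"], received["pattern"], received["confidence"])

# Both shapes carry the same model-level meaning, so the mapping back is lossless.
assert {"id": row[0], "pattern": row[1], "confidence": row[2]} == doc_record
```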

The key to all of this being possible is the common information model.
The information model is and has always been the core of what we are developing/defining in the STIX efforts, as it forms the basis for consistent derivation of whatever exchange/message protocols/serializations need to exist based on their context, and for any storage format implementations desired to operate within a CTI ecosystem.

As a community and standards TC we need to develop both the information model as the language standard and one or more exchange/message protocol/serialization standards layered on top of the language standard to support interoperability in sharing. I would assert strongly that these two things are not the same thing. I would suggest that we do not necessarily have to worry about standardizing specific storage formats. That can be left up to various implementers. As long as they can align their implementation to the language standard (information model) and can translate to and from the exchange/message protocol/serialization standard (also aligned with the language standard) then they should be able to support sharing (production & consumption) and support the analysis and action that such sharing is intended to assist.


>STIX might be a nice ontology to represent cyber threat intelligence, but I think my count of systems that store cyber threat intelligence in STIX is around zero. Moreover, I doubt the world needs or wants a standard format for cyber threat intelligence documents. This is not an indictment of STIX. It simply says people do not need STIX for that.

These statements are conflating "a standard format for cyber threat intelligence documents” with an underlying information model for STIX. They are NOT the same thing.
I would agree with you as I have stated above that the world does not necessarily need a document format standard.
They, however, DO absolutely need an information model (language) standard for cyber threat information that is not biased to any particular document storage or messaging context.
There is a very large body of players (including numerous FIs, large multinationals and integrators, governments, threat intelligence providers, etc.) that are looking to and implementing solutions based on STIX for its value at the information model level, to support analysis and not just sharing and exchange.

>What the world does need is a method for sharing cyber threat intelligence. I.e., how do I convey the information on the wire.
>That is a messaging standard.

Yes, they do need a messaging standard, but they need MORE than this. This alone does not solve the problem. They need this layered on top of an information model (language) standard for cyber threat information.

>If the format could happen to be used for storing the information, great. However, document-centric issues in STIX should, IMHO, carry about 0% weight. This is the issue that I would like to see inform our debates on what STIX needs to look like and how it evolves.

If I correctly interpret your meaning of “document-centric” in the statement "document-centric issues in STIX should, IMHO, carry about 0% weight” as a continued conflation between document format and underlying information model, then I could not disagree more with this assertion. I believe it is completely inaccurate and precisely what we need to carefully avoid. The information model is certainly about more than just bits on the wire.
If I am not interpreting it correctly then I apologize.

>If anything, the JSON vote is a resounding vote from the community that STIX is only about transport on the wire - JSON is utterly useless as a document format, for all the reasons that people say they want to use it. It has no semantics, it is not extensible, it is not open for discovery, and it is not portable outside of implementations that strictly agree on what the JSON means.

I see no rational basis for such a conclusion about the ballot. The MTI ballot in no way raised such a question, issue or conclusion.
The ballot was simply a non-binding statement of consensus on a particular technology to pursue for an MTI serialization choice.
I heard from numerous people who voted yes on the ballot simply to stop the issue from repeatedly being raised on the list and distracting from progress on information model issues.
In fact, this was the primary reason given by the ballot authors for the timing of the ballot. I heard from others who truly prefer JSON but in no way limited that preference to only transport on the wire.
From what we have heard from a large number of implementations, the most popular document/storage format being pursued is JSON. This directly contradicts your assertion/conclusion above. Further, those arguing for JSON as an MTI have long been asserting that JSON CAN be used to support semantics, extensibility and portability across implementations in an open environment. If this proves untrue, then the MTI decision will almost certainly need to be revisited in the future. If it proves true, then great: we can all move forward productively.
Either way, I do not see any justification or value in attempting to fundamentally change the nature of what STIX is and has always been to now only be about transport on the wire.



In brief summary, I view this attempt to make STIX an either-or between message format and storage format as yet another false dichotomy.
STIX is NOT an either-or between these two things.
STIX is the information model that underlies both and ensures they can effectively work together.
Buying into the false dichotomy means pushing the information model to be strongly biased toward one perspective or the other. I would assert that doing so has a high potential to break the ecosystem.
Truly, it is unnecessary. As a group we can work on the information model together, driven by the information needs of the overall CTI community, and then people who are very concerned with the messaging-specific aspects of STIX usage can work together to reach consensus on a serialization of the model that is optimized for specific messaging context needs.


I apologize if any of my wording choices above read as overly confrontational. 
I do not wish this to be confrontational at a personal level, but I do believe that the issues described here are fundamental to the success of our efforts, and that attempts to change and limit STIX to only issues around transport on the wire are likely to break the ecosystem and community we have built and have very negative consequences overall.

Thanks for reading my long post.

Sean

From: Eric Burger <ewb25@georgetown.edu> on behalf of Eric Burger <Eric.Burger@georgetown.edu>
Date: Tuesday, December 8, 2015 at 8:47 PM
To: "Barnum, Sean D." <sbarnum@mitre.org>
Cc: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: Re: [cti-stix] STIX: Messaging Standard vs. Document Standard

Yes, and…

I would buy the argument IFF the 1,000 messages in the past four weeks were arguing over the UML and whether a particular entity really was related to another, and whether that thing is really a related entity or is instead an attribute of the base entity.

However, the 1,000 messages in the past four weeks are arguing about bits on the wire.

I would also offer that I fully agree, for a messaging protocol to convey information, one must understand what that information is. I.e., the information model referenced below.

However, saying that the information model is necessarily the storage format and the protocol on the wire is simply not borne out in reality. The source of much joy and gnashing of teeth is that there are a lot of isomorphisms between different instantiations of information models. For example, see how many ways one can implement a simple, 13-element ERD in a relational database. Life gets even more interesting if one is implementing an object database.

However, there is no logical conclusion that because I happen to implement my storage in a relational database, I need to be sending ODBC tables back and forth. [Although as a controversial side note, this does inform us as to why CSV, the granddaddy of table transfer on the wire, may be our last, best hope for peace.]

STIX might be a nice ontology to represent cyber threat intelligence, but I think my count of systems that store cyber threat intelligence in STIX is around zero. Moreover, I doubt the world needs or wants a standard format for cyber threat intelligence documents. This is not an indictment of STIX. It simply says people do not need STIX for that.

What the world does need is a method for sharing cyber threat intelligence. I.e., how do I convey the information on the wire.

That is a messaging standard.

If the format could happen to be used for storing the information, great. However, document-centric issues in STIX should, IMHO, carry about 0% weight. This is the issue that I would like to see inform our debates on what STIX needs to look like and how it evolves.

If anything, the JSON vote is a resounding vote from the community that STIX is only about transport on the wire - JSON is utterly useless as a document format, for all the reasons that people say they want to use it. It has no semantics, it is not extensible, it is not open for discovery, and it is not portable outside of implementations that strictly agree on what the JSON means.


On Dec 7, 2015, at 12:01 PM, Barnum, Sean D. <sbarnum@mitre.org> wrote:

>I would offer that we are unequivocally, unquestionably, incontrovertibly working on a message format. 

Eric, I would have to respectfully disagree.
Though not just with this statement here but really with the false dichotomy that I think this thread posits.
While I certainly recognize the tradeoffs and contention between messaging-centric and document-centric perspectives, I disagree that STIX is inherently one or the other.
STIX is not just a messaging standard and STIX is not just a document standard. It is not an either-or between these two choices.

STIX is the information model (language) for cyber threat information.

STIX at its core is not targeted at telling you how you must lexically/syntactically structure your messages.
Similarly, at its core it is not targeted at telling you the bits and bytes you must use to store your content.

Whether you are exchanging CTI across messages or storing it within documents, the information involved and its meaning are the same.
Structures and formats for exchange or storage may differ or vary, as long as each maps back to the same information model so that everyone can understand what is there.
This is the reason that the specification for STIX is a data model (currently UML) that is separate from specifications binding that data model to any particular serialization format.
If the binding specifications are done in a way that offers high assurance in the integrity of the mapping to the underlying information model, then the real nut of the discussion here becomes more tractable. Messaging-centric vs document-centric is a serialization issue, not a forced choice on the underlying information model.
Different serialization options (JSON, XML, protocol buffers, etc.) and how they are particularly applied are relevant not just for messaging (which is where the battles have been going on for a long while, resulting in agreement to choose an MTI) but also between things like messaging and document/storage.
In other words, with well-mapped bindings you can define a binding for messaging (which could be an MTI) that is tuned to the particular needs of messaging, and you can define a different binding for document/storage that is tuned for its particular needs, or you might not define a standardized one at all for document/storage, leaving that up to each implementer (I think this last part may be part of Eric’s statements below).
This can be done without biasing things either direction and forcing either side to make unnecessary compromises. I could pull content from my repository in one form (document-centric), send it to you in another form (messaging-centric) and you could receive it and store it in your repository in a document-centric form. This is all possible when the information itself is mapped to the same underlying information model.
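The two-bindings-one-model idea can be sketched concretely: below, one hypothetical model-level fact is serialized through a terse messaging-tuned binding and a self-describing document-tuned binding, and both map explicitly back to the same content. The key names are invented for illustration, not actual STIX bindings.

```python
import json

# One model-level fact (hypothetical fields, not real STIX):
model = {"indicator_id": "ind-1", "pattern": "evil.example.com"}

# Messaging-tuned binding: terse keys to keep wire payloads small.
MSG_KEYS = {"indicator_id": "i", "pattern": "p"}
msg = json.dumps({MSG_KEYS[k]: v for k, v in model.items()})

# Document-tuned binding: nested, self-describing keys for storage and query.
doc = json.dumps({"indicator": {"id": model["indicator_id"],
                                "pattern": model["pattern"]}})

# Both bindings map explicitly back to the same model-level content.
back_from_msg = {k: json.loads(msg)[v] for k, v in MSG_KEYS.items()}
back_from_doc = {"indicator_id": json.loads(doc)["indicator"]["id"],
                 "pattern": json.loads(doc)["indicator"]["pattern"]}
assert back_from_msg == back_from_doc == model
```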

I think we need to be very careful to avoid coming down on one side or the other in a false battle between messaging-centric and document-centric camps, and letting that drive attempts to bias the actual underlying information model. The underlying information model should be driven by the information people/systems need to express about cyber threat information, whether that information is being messaged, stored or otherwise handled. For folks concerned with messaging-centric needs, let's focus that tuning primarily at the binding level and not at the information model level, where it can negatively impact the broader set of use cases.

This layered approach should be self-evident in the active work products we have in play currently, with a spec for the language itself and separate binding specs. The XML binding spec for STIX 1.2.1 (representing our pre-2.0 status quo) is still in progress but should be finished soon. For v2.0 there will be a similar binding spec, but it would be for the JSON MTI serialization.


sean



From: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, Eric Burger <ewb25@georgetown.edu> on behalf of Eric Burger <Eric.Burger@georgetown.edu>
Date: Sunday, December 6, 2015 at 8:47 AM
To: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>
Subject: Re: [cti-stix] STIX: Messaging Standard vs. Document Standard

Going back to Jason's question that started this thread: are we building a document format or are we building a protocol suite, of which STIX is the message format?

I would offer that we are unequivocally, unquestionably, incontrovertibly working on a message format. To be possibly just slightly controversial, I would offer that if we think STIX is the document format, then we will move cyber threat analysis forward about a year or two for a year or three, and then will irrevocably keep cyber threat analysis frozen in the mid-2010s for the next ten years.

Saying STIX is the document format implies everyone has the same needs for processing the data, or that the document format has to cover everybody’s needs. We seem to be ‘working’ towards that goal, which may be part of why it takes weeks to define what a date is. Worse, since it is so hard to make everyone happy, once we make a decision, it is cast in stone. The evil in this result is not that STIX becomes inflexible and brittle. The evil is that if people think they have to store STIX as STIX, it means if someone comes up with a better way to look at or analyze threat data, they are S.O.L.

Saying STIX is the message format means we can relax - so long as I can express the transfer of information in STIX, you can store it in whatever way you want. Likewise, so long as I can express information in STIX, I can generate it from whatever format I happen to have it in.

I am not knocking Soltra - it is really cool they can process and store native STIX as STIX. However, I would think that, as folks involved just a little in the cyber security environment, we might shy away from saying that every CTI platform needs to look like Soltra. Said differently, there is nothing inherently wrong with someone saying, “STIX looks cool - I’ll base my world on STIX.” However, there is something majorly wrong with someone saying, “STIX looks cool - everyone must base their world on STIX.”

I would offer one modification to Jason’s observations for a messaging standard: maximum byte efficiency is explicitly not a goal. If it were a goal, all protocols would use something like ASN.1 PER or handcrafted binary. No one would use keyword-value (e.g., SMTP, S/MIME and HTTP), XML (e.g., IODEF, SIMPLE), or even JSON. All of those carry too much overhead. So, why do we use them? Because I can use off-the-shelf parsers and, in the case of XML, have access to tons of tooling.
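The overhead tradeoff is easy to see with a toy record. The sketch below compares a keyword-value encoding, JSON, and a handcrafted binary packing; the record and its fields are hypothetical, chosen only to make the size difference visible.

```python
import json
import struct

# Hypothetical record: a numeric indicator id and a confidence score.
record = {"id": 1234, "confidence": 80}

# Keyword-value encoding, in the style of SMTP/HTTP header lines.
text_kv = f"id: {record['id']}\r\nconfidence: {record['confidence']}\r\n".encode()

# JSON encoding: self-describing, but every key is repeated in every record.
text_json = json.dumps(record).encode()

# Handcrafted binary: a 4-byte unsigned int plus a 1-byte unsigned int = 5 bytes.
binary = struct.pack("!IB", record["id"], record["confidence"])

# The binary encoding is a fraction of the size of either text encoding,
# but it is opaque without an out-of-band schema, which is the tradeoff.
assert len(binary) < len(text_json) and len(binary) < len(text_kv)
```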

One last observation: if the goal is for STIX to be a document format, then there is one and only one reasonable encoding: XML. With XML, you can use XQuery to ask about the data and variations on XPath to send updates on the data. Done in one.


On Nov 30, 2015, at 7:50 PM, Cory Casanave <cory-c@MODELDRIVEN.COM> wrote:

Bret,
I will answer your questions below [cbc], but perhaps we will then agree to disagree and let the process work. I doubt more needs to be said.
-Cory
 
 
 
From: Jordan, Bret [mailto:bret.jordan@bluecoat.com] 
Sent: Monday, November 30, 2015 7:13 PM
To: Cory Casanave
Cc: Jason Keirstead; Richard Struse; cti-stix@lists.oasis-open.org; Wunder, John A.
Subject: Re: [cti-stix] STIX: Messaging Standard vs. Document Standard
 
1: Definition method
Bret: The specification is English prose.
Cory: The specification is a machine readable model that includes English prose.
 
How is this an issue?  The generality that there will be problems is vague, and I am not sure how it applies to this specification.
[cbc] Well, it has been a huge issue in my own attempt to understand STIX and map it. Some things still make no sense. It is well documented that there are multiple ways to say the same thing – will all the implementations work together? Perhaps I can find some time to document some of the “WTF” questions that came up as I looked at STIX-1. When prose specifications are interpreted differently you get very expensive and hard-to-resolve BUGS. You get systems under the same standards that don’t work together. To me, this is a problem.
 
When STIX moves to Cap'n Proto (i.e., binary) there will be no more English field names,
[cbc] English field names are not required. What is required is a programmatic way to go from instance to specification. This can be done in binary.
so how is this an issue? HTTP and HTML being English-centric seem to have worked well.  A specification is a specification.  Building unit tests to test compliance is a relatively easy thing to do.
[cbc] Wow, you must be really good. Compliance has been hard for most specifications.
This will guarantee interoperability.  
[cbc] Trouble is, it does not. Interoperability is hard.
And if one vendor's product (like Internet Explorer 6) comes out that breaks the ecosystem, then consumers should not buy that product, thereby forcing that vendor to change.
 
2: Schema production
Bret: The field names and structure are hand crafted.
Cory: The field names and structure are produced from the model.
 
Organizations and development shops will always produce their own APIs to generate STIX content.  
[cbc] That would be unfortunate for wide-scale adoption.
 
Some may use community-built modules / APIs, depending on the licensing and intellectual property aspects.  It is very easy to build compliance and unit-testing to verify that what someone produces will match the specification.
[cbc] So do we have that for STIX 1? Ask Oasis about ease of conformance suites.
 
STIX is not that big.  
[cbc] STIX and all it imports is thousands of terms. What is big to you? Or, are you assuming a much reduced scope? If so, the scope question should be #1!
 
I built an API to do all of the indicators and TTP stuff in a few days.  I would argue that the best thing we could do would be to present a text document from the UML that listed out each field name by idiom.  Then developers can just copy and paste the entire list.  This way there will be no typos.  But once again, a simple unit test will pick up any issues.
[cbc] I think we have it on a key point – “idioms”. Idioms are examples, not specifications. Coding to an idiom would be very fragile and would then not interoperate with others who coded to other idioms that utilize the same or overlapping data.
By the way, since you will copy/paste the field names, I’m not sure why the introduction of a namespace prefix is such an issue; it would have zero development cost and inconsequential runtime overhead.
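The kind of compliance unit test discussed above is a few lines to write. The spec table and field names below are hypothetical stand-ins, not actual STIX fields; the sketch only shows how such a check catches a typo'd field before it reaches the wire.

```python
# Hypothetical spec: the allowed field names for an "indicator" object.
SPEC_FIELDS = {"indicator": {"id", "type", "pattern", "confidence"}}

def check_conformance(obj_type, produced):
    """Return any produced field the spec does not define (a likely typo)."""
    allowed = SPEC_FIELDS[obj_type]
    return [field for field in produced if field not in allowed]

# A producer typo'd "pattren"; the unit test flags it.
good = {"id": "ind-1", "type": "indicator", "pattern": "x"}
bad = {"id": "ind-2", "type": "indicator", "pattren": "x"}
assert check_conformance("indicator", good) == []
assert check_conformance("indicator", bad) == ["pattren"]
```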
 
 
3: Namespaces
Bret: The tag names in the data are implicitly mapped to the schema by name
Cory: The tag names are explicitly mapped to their schema and definition by name and explicit namespace
 
I disagree.  In the UML it is very easy to see in the 20 items for each idiom if we have re-used the same name more than once.  
[cbc] Again, Idioms are irrelevant. We need to look at all the terms that could be used in any STIX message. I agree it is easier to see in UML. So it is easier to get agreement on the content.
 
Once again, we are trying to solve a problem that is not there.  Using the same name for a field in a different idiom is not an issue.  Higher-level code will easily handle this, and vendors and developers map those data fields into their own datasets and then do something with them. Namespaces allow people to artificially extend a schema and do things that will BREAK compatibility.
[cbc] Interesting assertion. I don’t see how namespaces allow people to break interoperability. Namespaces provide for interoperability. My guess is you are postulating externally introduced namespaces? It is up to the policy of the specification as to the extensibility of new namespaces. I would suggest that some (controlled) extensibility is required for agility. But that is a choice independent of namespaces. CTI could forbid any new namespaces.
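The claim that namespaces provide for, rather than break, interoperability can be sketched with plain prefixed keys. The prefixes and field names below are hypothetical illustrations, not anything defined by STIX.

```python
# Two schemas both define a "name" field; an explicit namespace prefix keeps
# them distinct within one record. Prefixes and fields here are hypothetical.
record = {"stix:name": "APT-X", "vendor:name": "internal-label-42"}

def fields_in_namespace(record, prefix):
    """Extract just the fields belonging to one namespace, stripping the prefix."""
    return {k.split(":", 1)[1]: v
            for k, v in record.items() if k.startswith(prefix + ":")}

# A consumer that only understands the "stix" namespace simply ignores the rest.
assert fields_in_namespace(record, "stix") == {"name": "APT-X"}
assert fields_in_namespace(record, "vendor") == {"name": "internal-label-42"}
```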
 
4: Variability
Bret: I am only concerned with a specific and very structured exchange schema.
Cory: There will be multiple patterns of exchange for different use cases based on the same underlying model.
 
Once again I disagree.  It is just as easy for me to fill out every field and send the blob of data as it is to only fill out one to three fields and send it.  I am not only concerned with sending minimal data.  If I send several blobs of data (some TTPs, some ThreatActors, some Indicators), receiving code can easily handle this by saying:
if type == "indicator" then foo
elsif type == "ttp" then foo1
elsif type == "threatactor" then foo2
etc.
One group may only be able to send indicators with certain data, and other vendors may be able to send something else.  Great, my code will consume and do things with all of it.
[cbc] And ignore what it doesn’t need, right? So what you are saying is that there is one large schema, no idioms and everything is optional? You may want to layer some required interaction profiles on top of that. In any case, CTI and the expectations of using it will change over time – better to plan for it.
 
5: Development
Bret: All I need is a text editor and I will type in my implementation.
Cory: Reading, writing, mapping and even presenting the data will be heavily assisted with automation. Only special algorithms will be coded.
 
This is a problem that vendors will solve.  This is not a standards track issue.  Vendors will produce neat and interesting tools that make use of the data.  The vendors that do the best job, will make the most money and get the most sales. 
[cbc] What you are suggesting is disenfranchising a large set of vendors that do not implement the way you do. It is up to the standard to provide the artifacts to enable a large community, not to presuppose particular implementation styles, idioms and use cases.
 
To answer your question: I am not against a solid UML specification or model or whatever you call it.  In my mind a UML model is such a wonderful thing to have.  It makes it so much easier to learn and understand STIX.  When I first started playing with STIX, I built my own UML model as there wasn't one.  I needed to do that to make heads or tails of what was going on.  So yes, we need a UML specification / model.
 
Where I believe we fundamentally disagree is on the idea of code being auto-generated from the model.
[cbc] So you work in machine code, no compilers? No virtual machines? No code gen from schema? No visualization tools? No analytics engines? No mapping tools. How are things in the 60’s?
So how ‘bout this? We agree on a small subset model and a JSON representation of it. We then see if that can be generated; if so, there should be no issue.
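Generating code from a machine-readable model, as proposed here, can be sketched in a few lines. The model fragment, type names and fields below are hypothetical, not drawn from the actual STIX UML; the point is only that carrier classes can be emitted mechanically from a model table.

```python
# Hypothetical machine-readable model fragment: object types and their fields.
MODEL = {"Indicator": ["id", "pattern"], "ThreatActor": ["id", "name"]}

def generate_class(name, fields):
    """Emit Python source for a simple carrier class from the model."""
    args = ", ".join(fields)
    body = "\n".join(f"        self.{f} = {f}" for f in fields)
    return f"class {name}:\n    def __init__(self, {args}):\n{body}\n"

# Generate and load the classes, then use one of them.
namespace = {}
for name, fields in MODEL.items():
    exec(generate_class(name, fields), namespace)

ind = namespace["Indicator"](id="ind-1", pattern="evil.example.com")
assert ind.pattern == "evil.example.com"
```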
 
Some people may use this, but this is NOT a requirement for the standard, IMHO.  
[cbc] Again, it is for a standard that enables a larger community.
 
A nice and clean UML specification
[cbc] Ok, lets start on that now and stop spending so much time on one of multiple syntaxes.
 
and a super-easy-to-implement binding in JSON is all we need at this point.
[cbc] I really want you to have that as well!
Long-term I see the need for moving to a binary representation in, say, Cap'n Proto, but that will be 3-5 years from now if we are successful.
 
 
Thanks,
 
Bret


