Subject: Re: [cti-stix] [cti-users] MTI Binding


I’m with John here – at the end of the day, with our current data model, no matter which serialization we use, it will likely require some custom code to work properly on a backend. I think this boils down to a number of factors: model complexity, semantic consistency, and, more generally, the dynamic nature of the cyber domain and the associated difficulties in modeling it. IMO, tackling these issues should be our focus – the MTI is an important consideration, but it should not overshadow the discussions around the model itself. This is why I’m in favor of reaching consensus on a straw-man MTI (likely JSON) that we can revisit later in STIX 2.0 (and CybOX 3.0) development.

-Ivan

From: <cti-stix@lists.oasis-open.org> on behalf of John Wunder
Date: Tuesday, October 6, 2015 at 8:23 AM
To: "cti-users@lists.oasis-open.org", "cti-stix@lists.oasis-open.org"
Subject: Re: [cti-stix] [cti-users] MTI Binding

I wanted to comment on this paragraph specifically because I think it gets to the heart of the big divide we’re seeing now:

== quote ==
The above is true if you are expecting an environment where every use of data is custom programmed and programmers read the docs, write code, and hope they get it right. I am hoping for a bit more, where we can use a great deal of information without any custom code at all – perhaps they can bring it into their analytics tool, pivot, match, etc. and come out with some valuable knowledge. That requires that the information is linked to its definition and that the definition is also machine readable. The software is doing the work. Also, when we do write code it can be validated against the definition. Nothing new here, we have been doing this for decades. Also best practice.
== end quote ==

This seems to me to be the fundamental disagreement between the “pure JSON” and “JSON-LD” camps. Speaking for myself, I think custom software will almost always need to be written to do interesting things with this type of information. As such, my preferred optimization is for developers writing code against these formats. In those cases, JSON-LD seems a bit redundant compared to JSON…it adds extra fields that you don’t really need.

I do agree that some people may want to use non-specialized analytic tools that work with RDF-based data to analyze STIX. But in those cases, couldn’t you just translate the native JSON to the JSON-LD (or RDF/XML or whatever) version, since we have the full model defined already? I feel like using an RDF-based exchange format adds fields and redundancy optimized for a specific type of analysis that, in reality, not many tools will take advantage of.
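(For instance – purely a hedged sketch, with a hypothetical context URL – that translation could be as thin as wrapping the native JSON with a context:)

// native JSON as exchanged
{
    "name": "Poison Ivy"
}

// the same content lifted to JSON-LD for RDF-based tooling (context URL hypothetical)
{
    "@context": "http://example.org/stix-context.jsonld",
    "name": "Poison Ivy"
}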

John

On Oct 5, 2015, at 7:36 PM, Cory Casanave <cory-c@modeldriven.com> wrote:

Bret,

Re: When a user gets a blob of JSON-based STIX 2.0 data, and uses some tooling to parse it / dump it into a MongoDB, they will know what it is that they are parsing, as the content is fixed by the specification / data model / ontology, etc.

 

STIX and all it imports define several thousand terms. These terms may be organized in various ways to represent various data packages for various use cases – there is not one single STIX “table” or one definition of “name”. In current STIX (and any schema-based language), terms are scoped, at least, to the namespace where they are defined. Without such a namespace you would have to either de-conflict across your entire domain, which then becomes one giant namespace (which in a small, very controlled environment you may well do), or assume the consuming software will “just know” based on some context. JSON has no built-in notion of such namespaces (as it comes from use in small, very controlled environments). I think it is well established that information-sharing terms should be scoped to their definitional namespace. JSON-LD adds this by use of “context”. Other JSON schemes for open information sharing may add it in other ways. XML, RDF, and SQL have it built in.
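(For illustration – a minimal hedged sketch, not from the original message – of how a JSON-LD “context” scopes an otherwise bare JSON key to its defining namespace; the namespace URL is hypothetical:)

// "@context" maps the short key "name" to a namespaced IRI
{
    "@context": {
        "stix": "http://example.org/stix#",
        "name": "stix:name"
    },
    "name": "Poison Ivy"
}

A consumer that ignores the context still sees plain JSON; a context-aware consumer expands "name" to http://example.org/stix#name.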

 

Re: If a developer does not understand what, say, an InformationSourceType is, they can look it up in the UML or OWL model. Once they have it figured out, meaning, once they know what an InformationSourceType is, they can work with it.

 

The above is true if you are expecting an environment where every use of data is custom programmed and programmers read the docs, write code, and hope they get it right. I am hoping for a bit more, where we can use a great deal of information without any custom code at all – perhaps they can bring it into their analytics tool, pivot, match, etc. and come out with some valuable knowledge. That requires that the information is linked to its definition and that the definition is also machine readable. The software is doing the work. Also, when we do write code it can be validated against the definition. Nothing new here, we have been doing this for decades. Also best practice.

 

So I think the question isn’t if we are going to link data to its definition, it’s how, and how rich and flexible that definition is. Such is our discussion. JSON-LD is a data serialization format based on the RDF data model (which is very simple). The “schema” is also RDF, which can be in JSON-LD or any number of other serialization formats such as RDF/XML or Turtle.
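(To make that concrete – a hedged sketch with hypothetical URIs – the single RDF statement “indicator 1234 has the name Poison Ivy” could be serialized as JSON-LD like so:)

// one RDF triple: subject ("@id"), predicate (the key), object (the value)
{
    "@id": "http://example.org/indicator/1234",
    "http://example.org/stix#name": "Poison Ivy"
}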

 

When you are dealing with one very specific JSON format, say a list of suspect IP addresses, you would only have to know the name-strings used – those name-strings have the namespace of definition encoded in them, but they are still just JSON name-strings. So you can look these strings up in the documentation and write your code. Others may interpret the references inside the name-strings and process information based on the defining schema, making smarter software.
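(A hedged sketch of what such namespaced name-strings might look like; the "stix:" prefix and property names are hypothetical:)

// to a simple consumer these keys are just strings to look up in the docs;
// a schema-aware consumer can expand the "stix:" prefix to the defining namespace
{
    "stix:address_value": "198.51.100.7",
    "stix:condition": "Equals"
}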

 

 

-Cory Casanave

 

From: cti-stix@lists.oasis-open.org [mailto:cti-stix@lists.oasis-open.org] On Behalf Of Jordan, Bret
Sent: Monday, October 05, 2015 6:33 PM
To: cti-users@lists.oasis-open.org; cti-stix@lists.oasis-open.org; Sean D. Barnum
Subject: Re: [cti-stix] [cti-users] MTI Binding

 

I get what you are saying...  Let me rephrase and ask the question differently.....

 

1) Yes we need to have a specification for what STIX is.  You call it a bunch of terms dealing with lexicality, syntax, semantics, ontology, & data model.  I agree we need this, regardless of what it is called...  Right now it is in UML, maybe in the future it will be OWL, maybe further out it will be StandAndSit Ontology Markup, whatever.  The fact of the matter is we need this and it needs to be rock solid, well understood, and easy to comprehend.  

 

2) The specification will be represented in a serialization format that products, devices, and software will actually use.  Hopefully that is JSON; today it is XML.  

 

3) When a user gets a blob of JSON-based STIX 2.0 data, and uses some tooling to parse it / dump it into a MongoDB, they will know what it is that they are parsing, as the content is fixed by the specification / data model / ontology, etc.  If a developer does not understand what, say, an InformationSourceType is, they can look it up in the UML or OWL model.  Once they have it figured out, meaning, once they know what an InformationSourceType is, they can work with it.  

 

JSON-LD and other things like it are for telling remote software how to guess at what the data actually is, because there is no standard form.  For example, say you have two organizations that have data they want to share.....

 

// Twitter as an example
{
    "name": "Barney",
    "color": "Purple"
}

 

 

vs

// Facebook as an example
{
    "name": "bny00989",
    "color": "Purple",
    "size": "3 meters"
}

 

 

Now the "name" value in both blobs is related but does not contain the same information.  One appears to contain a real name while the other appears to contain a username.  What JSON-LD and other things like this allow you to do is put some context around the "name" so that some software can understand it correctly.  The reason this is needed is because there is no standard to define what "name" should be.  But we do not have that problem in STIX because we have a specification that tells you what everything is.  

 

So in JSON-LD you would have something like:

 

{
    "@context": "http://schema.org/",
    "@type": "Person",
    "name": "Barney",
    "jobTitle": "Professor",
    "telephone": "(425) 123-4567",
    "url": "http://www.janedoe.com",
    "size": "3 meters"
}

 

 

Thus you have the ability to assign a TYPE and CONTEXT, aka a schema location, for the blob...  This way software can hopefully better understand arbitrary data that is coming across.  Now this seems like it would be cool to do with STIX.  But given that we HAVE a data model and a specification that everyone will be using, why do we need to tell software what something is when it is already defined in the specification?  

 

I could see it if we were trying to build a model that allowed anyone to create any type of schema blob – then YES, this would get us there. Then STIX could become a standard for how you define other CTI standards. Then everyone could create their own STIX Lite, publish their own schema, and use JSON-LD to tell someone else what their schema is and how to interpret it.  I could see JSON-LD being used to "overload" the "InformationSourceType": say Company FOO does not like the OASIS version of "InformationSourceType", so they define a new "InformationSourceType", and in order to understand their data blob you need to go to their schema location programmatically and figure it out.  
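(A hedged sketch of what that overloading might look like; Company FOO's schema URL is hypothetical:)

// the context remaps "InformationSourceType" away from the OASIS definition
// to Company FOO's own schema location
{
    "@context": {
        "InformationSourceType": "http://cti.foo.example.com/schema#InformationSourceType"
    },
    "InformationSourceType": "Internal Honeypot"
}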

 

If we want to allow organizations to overload our STIX specification, then yes, let's do JSON-LD.  Otherwise the data that will be shared will be known, because it will be in the specification.  

 

 

Thanks,

 

Bret

 

 

 

Bret Jordan CISSP

Director of Security Architecture and Standards | Office of the CTO

Blue Coat Systems

PGP Fingerprint: 63B4 FC53 680A 6B7D 1447  F2C0 74F8 ACAE 7415 0050

"Without cryptography vihv vivc ce xhrnrw, however, the only thing that can not be unscrambled is an egg." 

 

On Oct 5, 2015, at 15:07, Barnum, Sean D. <sbarnum@mitre.org> wrote:

 

Comments inline

 

From: "cti-stix@lists.oasis-open.org" on behalf of "Jordan, Bret"
Date: Monday, October 5, 2015 at 1:41 PM
To: "Bush, Jonathan"
Cc: "Barnum, Sean D.", Jane Ginn, John Wunder, "cti-users@lists.oasis-open.org", "cti-stix@lists.oasis-open.org"
Subject: Re: [cti-stix] [cti-users] MTI Binding

 

I have been reading a lot about JSON-LD, and I get how and why it might be interesting in a website context when you are sharing unknown data back and forth.  Meaning, there is no standard for the data you are sharing.  Think user profiles shared between Google, Twitter, Facebook, etc.  But, unless I am mistaken, the purpose of STIX is to define a standard for CTI so that we all share the same data.  

 

Can someone explain why JSON-LD is needed in the CTI context?  I just do not see why anyone that is building an application to use CTI would care, since all of the data that will be shared between them is KNOWN and in a standard, well-known form, aka STIX...  Please help me understand this use case. 

 

[sean] The only way that STIX can be KNOWN as a standard for cyber threat information (CTI) is for the lexicality, syntax and semantics (i.e. the meaning) of CTI to be explicitly codified. This is the purpose of the ontology/data-model for STIX. It defines the concepts, meanings, properties, relationships, etc. for CTI as a knowledge domain irrespective of any particular technologies chosen to implement any particular use case. In this way the “language” of STIX is portable between any parties, vendors, systems, etc.

When it comes time to implement a specific use case using some particular appropriate technology, you will need a way to serialize particular content to a given format (e.g. JSON) such that the implementation can read it in, write it out, or exchange it. Any particular chosen serialization format allows the implementation to utilize standards-conformant (to that particular serialization format) tooling to parse in the content. Such a serialization format choice does not, however, tell you what that content means. To understand what the content means you need a layer of mapping between the end serialization and the overall ontology/data-model that provides meaning to the content. This middle layer involves some formalized constraint on the end serialization format for use within the knowledge domain in question (e.g. XSD or JSON Schema) so that implementations can verify whether or not serialized content is conformant. But it also requires some sort of mapping assurance that the formalized schematic constraint itself actually is conformant with the higher-level ontology/data-model.
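(As a hedged sketch of that middle layer – the property names below are hypothetical, not proposed STIX fields – a fragment of a JSON Schema constraint might look like:)

// minimal JSON Schema (draft-04) constraining one hypothetical CTI object;
// validators can check whether instance content conforms to it
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "information_source": { "type": "string" }
    },
    "required": ["name"]
}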

This is the stack we have been working towards:

  • Ontology/data-model: 

Defines the various concepts for the CTI space, as well as their properties, structures and relationships with each other. This also defines how we can deal with uncertainty, diversity and change over time through things like extension points (recognizing that we do not, and likely never will, have the full picture), vocabularies, etc.

  • Binding specifications:

Define a mapping from the Ontology/data-model to a particular representation format (JSON, XML, etc.). These allow a given format to be used by those who support it to represent content according to the ontology/data-model. If these binding specifications are accurate and fully cover the ontology/data-model with explicit mappings then it should be possible to losslessly translate from one bound serialization format to another.

  • Representation format implementation:

An explicit schematic specification (e.g. XSD) for representing CTI content according to the Ontology/data-model as bound by the corresponding binding specification. This will allow implementations that only care about the end serialized content and not the domain meaning of the content to parse and output CTI content in a validatable and interoperable way.

  • Actual instance CTI content expressed in a particular representation format:

Actual instance CTI data.

 

JSON-LD would basically fit into this stack at the binding specification and representation format levels.

The “context” structure of JSON-LD lets you do the sort of mappings from the ontology/data-model to a particular representation that are the purpose of the binding specifications. In this case the “context” (which can be expressed in a separate referenceable file rather than only inline with the content) would capture the binding specification rules for a JSON format implementation, and the “context” file(s) itself would form the JSON representation format implementation specification.

At that point instance CTI content could be expressed in JSON with the referenced JSON-LD “context” providing the mechanism for interpreting it.
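(A hedged sketch of what that could look like; the context URL is hypothetical:)

// "@context" points at a separately published context file carrying the
// binding rules; the rest is plain JSON instance content
{
    "@context": "http://example.org/cti/stix-context.jsonld",
    "@type": "Indicator",
    "name": "Poison Ivy"
}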

I have not personally worked directly with JSON-LD nor done any sort of very detailed analysis of its capabilities. It is unclear whether or not JSON-LD has adequate expressivity to fully map our domain or the capability to provide automated validation. It may. It may not. That would be one dimension we would need to explore if we wish to consider JSON-LD as an option (which I would personally support).

 

In other words, JSON-LD would not simply be something pursued in addition to STIX. It would/could be HOW STIX is defined (KNOWN) for use within a JSON technology stack. It does not replace the need for the data-model and it does not replace the end serialization as pure JSON. Rather, it provides a way to explicitly define the two middle layers of the stack.

 

 



