RE: [cti-stix] RE: [cti-users] MTI Binding

Has any one considered:

PMML stands for "Predictive Model Markup Language". It is the de facto standard to represent predictive solutions. A PMML file may contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.

Because it is a standard, PMML allows for different statistical and data mining tools to speak the same language. In this way, a predictive solution can be easily moved among different tools and applications without the need for custom coding. For example, it may be developed in one application and directly deployed on another.

Traditionally, the deployment of a predictive solution could take months, since after building it, the data scientist team had to write a document describing the entire solution. This document was then passed to the IT engineering team, which would then recode it into the production environment to make the solution operational. With PMML, that double effort is no longer required since the predictive solution as a whole (data transformations + predictive model) is simply represented as a PMML file which is then used as is for production deployment. What took months before, now takes hours or minutes with PMML.

PMML is developed by the Data Mining Group (DMG), a consortium of commercial and open-source data mining companies. The latest version of PMML, version 4.1, was released by the DMG in December 2011.

Since PMML is XML-based, it is not rocket science. Its structure follows a set of pre-defined elements and attributes which reflect the inner structure of a predictive workflow: data manipulations followed by one or more predictive models.

What are the benefits of PMML?

PMML makes it extremely easy for any predictive solution to be moved from one data mining system to another. For example, once represented as a PMML file, a predictive solution can be operationally deployed right away, without the need for custom code. In this way, PMML transforms predictive analytic solutions into dynamic assets that can be put to work immediately.

For big companies with many in-house statistical and data mining tools, PMML works as the common denominator, since whenever the solution is built, it is immediately represented as a PMML file. This allows companies to use "best of breed" tools to build the best possible solutions.

Since PMML is a standard, it also fosters transparency and best practices. Transparency comes from the fact that the predictive solution is no longer a black box. Open the box and understanding what is inside, the analytics team can easily recognize past decisions and establish practices that work.

What kind of predictive techniques are supported by PMML?

PMML defines specific elements for several predictive techniques, including neural networks, decision trees, and clustering models, to name just a few. New techniques just recently supported are k-Nearest Neighbors and Scorecards, which include reason codes.

PMML also defines an element for representing multiple models. That is, PMML can be used to represent model segmentation, composition, chaining, cascading, and ensemble, including Random Forest Models.

To review all the elements supported by PMML, take a look at the language specification at the DMG website (see Resources below).

Can PMML represent data pre- and post-processing?

PMML has several built-in functions, such as IF-THEN-ELSE and arithmetic functions, that allow for extensive data manipulation. It also defines specific elements for the most common pre-processing tasks such as normalization, discretization, and value mapping. To review all the pre-processing capabilities PMML has to offer, refer to the PMML pre-processing primer.

With PMML 4.1, all the capabilities available for data pre-processing were also made available for post-processing. In this case, a PMML file can now also contain a set of business rules that define actions or decisions to be taken based on the outcome of the predictive model. A PMML file can represent the entire predictive solution, from raw data and model to business decisions.

Resources

Websites

DMG website: Complete PMML specification
Zementis PMML resources page: Links to PMML tools and examples
PMML Wikipedia page

Book: PMML in Action (2nd Edition) - Available on Amazon.com

From: cti-users@lists.oasis-open.org [mailto:cti-users@lists.oasis-open.org] On Behalf Of Barnum, Sean D.
Sent: Friday, October 02, 2015 12:06 PM
To: Cory Casanave; Davidson II, Mark S; John K. Smith; Shawn Riley
Cc: Wunder, John A.; Jordan, Bret; cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org; Jim Logan (jlogan@nomagic.com)
Subject: [cti-users] Re: [cti-stix] RE: [cti-users] MTI Binding

I agree completely with your first assertion.

It is the second one we disagree on. I definitely agree that nothing is perfect (including OWL) but my personal experience leads me to have lower confidence that UML will be adequate.

As we agreed, we can still agree to disagree. :-) How very meta. ;-)

On the model import, are you trying to import the .emx or the more general UML2.2 form? Are you having issues with both?

sean

From: Cory Casanave <cory-c@modeldriven.com>
Date: Friday, October 2, 2015 at 11:57 AM
To: "Barnum, Sean D." <sbarnum@mitre.org>, Mark Davidson <mdavidson@mitre.org>, "John K. Smith" <jsmith@liveaction.com>, Shawn Riley <shawn.p.riley@gmail.com>
Cc: John Wunder <jwunder@mitre.org>, "Jordan, Bret" <bret.jordan@bluecoat.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, "cti-users@lists.oasis-open.org" <cti-users@lists.oasis-open.org>, "Jim Logan (jlogan@nomagic.com)" <jlogan@nomagic.com>
Subject: RE: [cti-stix] RE: [cti-users] MTI Binding

Sean,

Re: I think we can continue to agree to disagree on this and still make progress. :-)

Yes we can. I have tried to make 2 separate assertions: 1: You need some high-level model as your foundation. 2: UML works for this, but there are other approaches. Through personal experience I have confidence the UML approach works. I also know it is far from perfect, but so is everything else I have tried (including OWL).

Re: Would you be interested in using your Nomagic to give it a shot?

Yes, I may even get Nomagic to help. The sticking point today is effective model import from RSA. Some work may be required on the models to make them appropriate for this purpose.

From: Barnum, Sean D. [mailto:sbarnum@mitre.org]
Sent: Friday, October 02, 2015 11:50 AM
To: Cory Casanave; Davidson II, Mark S; John K. Smith; Shawn Riley
Cc: Wunder, John A.; Jordan, Bret; cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org; Jim Logan (jlogan@nomagic.com)
Subject: Re: [cti-stix] RE: [cti-users] MTI Binding

Thank you for the clarification.

I still think you and I will continue to disagree on whether or not UML is an appropriate choice to attempt to represent the semantics needed for STIX and CybOX. I continue to hold the opinion that UML was/is not designed to specify the semantics of a language or information representation like we are developing. It can be be useful for capturing much of what we need but not all.

In my opinion, UML is an appropriate stepping stone to get us where we need to go but not necessarily the best end solution.

I think we can continue to agree to disagree on this and still make progress. :-)

BTW, I would love to see a test of auto generating from the current models. I also don’t know the extent of RSA capability on this front. Would you be interested in using your Nomagic to give it a shot?

sean

From: Cory Casanave <cory-c@modeldriven.com>
Date: Friday, October 2, 2015 at 11:35 AM
To: "Barnum, Sean D." <sbarnum@mitre.org>, Mark Davidson <mdavidson@mitre.org>, "John K. Smith" <jsmith@liveaction.com>, Shawn Riley <shawn.p.riley@gmail.com>
Cc: John Wunder <jwunder@mitre.org>, "Jordan, Bret" <bret.jordan@bluecoat.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, "cti-users@lists.oasis-open.org" <cti-users@lists.oasis-open.org>, "Jim Logan (jlogan@nomagic.com)" <jlogan@nomagic.com>
Subject: RE: [cti-stix] RE: [cti-users] MTI Binding

Sean,

Good summary and context.

I would disagree with one point:

Re: It should be pointed out that for this option to really be practical we would need to move our ontology/data-model spec from its current UML model with text docs form to a full semantic form with dereferenceable structures (such that the JSON-LD context could do the appropriate mappings in an LD fashion.

The JASON-LD schema, which is essentially RDF, can be generated from well-formed UML just like XML schema can*. If there were semantics you could not fully represent in JSON-LD the UML model (and/or generated OWL) could also be referenced to add such semantics. There are a few things you can say in full OWL you can’t say directly in UML and a few things in UML you can’t say directly in OWL – but it seems like the fundamental needs of CTI are captured in both, and they can be mapped.

*Here is a product that does so: http://www.nomagic.com/products/magicdraw-addons/cameo-concept-modeler-plugin.html

It would not be much of a stretch to test generating from your current models to a schema that can be used with JSON-LD (Don’t know if RSA does this, but Nomagic does). Nomagic copied for comment.

-Cory

From: Barnum, Sean D. [mailto:sbarnum@mitre.org]
Sent: Friday, October 02, 2015 11:13 AM
To: Davidson II, Mark S; John K. Smith; Shawn Riley; Cory Casanave
Cc: Wunder, John A.; Jordan, Bret; cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org
Subject: Re: [cti-stix] RE: [cti-users] MTI Binding

Without going to far down the rabbit hole right now let me take a VERY simple stab at providing some context on your question.

Out stack for the CTI TC ecosystem language specs (STIX and CybOX) looks something like:

Ontology/data-model:

Defines the various concepts for the CTI space, as well as their properties, structures and relationships with each other. This also defines how we can deal with uncertainty, diversity and change over time through things like extension points (recognizing that we do not nor ever likely will have the full picture), vocabularies, etc.

Binding specifications:

Define a mapping from the Ontology/data-model to a particular representation format (JSON, XML, etc.). These allow a given format to be used by those who support it to represent content according to the ontology/data-model. If these binding specifications are accurate and fully cover the ontology/data-model with explicit mappings then it should be possible to losslessly translate from one bound serialization format to another.

Representation format implementation:

An explicit schematic specification (e.g. XSD) for representing CTI content according to the Ontology/data-model as bound by the corresponding binding specification. This will allow implementations that only care about the end serialized content and not the domain meaning of the content to parse and output CTI content in a validatable and interoperable way.

Actual instance CTI content expressed in a particular representation format:

Actual instance CTI data.

JSON-LD would basically fit into this stack at the binding specification and representation format levels.

The “context” structure of JSON-LD lets you do the sort of mappings from the ontology/data-model to a particular representation that are the purpose of the binding specifications. In this case the “context” (which can be expressed in a separate reference able file rather than only inline with the content) would capture the binding specification rules for a JSON format implementation and the “context” file(s) itself would form the JSON representation format implementation specification.

At that point instance CTI content could be expressed in JSON with the referenced JSON-LD “context” providing the mechanism for interpreting it.

I have not personally worked directly with JSON-LD nor done any sort of very detailed analysis of its capabilities. It is unclear whether or not JSON-LD has adequate expressivity to fully map our domain or the capability to provide automated validation. It may. It may not. That would be one dimension we would need to explore if we wish to consider JSON-LD as an option (which I would personally support).

It should be pointed out that for this option to really be practical we would need to move our ontology/data-model spec from its current UML model with text docs form to a full semantic form with dereferenceable structures (such that the JSON-LD context could do the appropriate mappings in an LD fashion. This is something we have talked about for quite a while as our ultimate goal for many reasons but it has not to date been something we have put on the roadmap to tackle for STIX 2.0.

Does that help put it in context?

Anyone familiar with JSON-LD please feel free to point out any errors in my explanation.

sean

From: "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org> on behalf of Mark Davidson <mdavidson@mitre.org>
Date: Friday, October 2, 2015 at 9:42 AM
To: "John K. Smith" <jsmith@liveaction.com>, Shawn Riley <shawn.p.riley@gmail.com>, Cory Casanave <cory-c@modeldriven.com>
Cc: John Wunder <jwunder@mitre.org>, "Jordan, Bret" <bret.jordan@bluecoat.com>, "cti-stix@lists.oasis-open.org" <cti-stix@lists.oasis-open.org>, "cti-users@lists.oasis-open.org" <cti-users@lists.oasis-open.org>
Subject: [cti-stix] RE: [cti-users] MTI Binding

How does something like JSON-LD fit into the serialization discussion? For the MTI format discussion we are talking about the thing that products will send to each other (I think, anyway). I did some quick reading on RDF / JSON-LD (complete newbie, forgive my ignorance), and I didn’t get a clear picture on how it would fit.

For instance, as a completely trivial example, imagine a tool sending indicators out to sensors:

{ ‘type’: ‘indicator’, ‘content-type’: ‘snort-signature’, ‘signature’: ‘alert any any’}

Would JSON-LD (or something like it) take the place of the JSON listed above? Or would JSON-LD get automagically translated into something that takes the place of the JSON listed above? Or am I completely off-base in my questions?

Thank you.

-Mark

From: John K. Smith [mailto:jsmith@liveaction.com]
Sent: Thursday, October 01, 2015 7:00 PM
To: Shawn Riley <shawn.p.riley@gmail.com>; Cory Casanave <cory-c@modeldriven.com>
Cc: Davidson II, Mark S <mdavidson@mitre.org>; Wunder, John A. <jwunder@mitre.org>; Jordan, Bret <bret.jordan@bluecoat.com>; cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org
Subject: RE: [cti-users] MTI Binding

Just my 2 cents … having used RDF, TTL etc for security ontologies, I think leveraging something like JSON-LD will help better adoption by broader group.

Seems like schema.org is using JSON-LD but I’m not sure to what extent.

Thanks,

JohnS

From:cti-users@lists.oasis-open.org [mailto:cti-users@lists.oasis-open.org] On Behalf Of Shawn Riley
Sent: Friday, October 02, 2015 2:45 AM
To: Cory Casanave <cory-c@modeldriven.com>
Cc: Davidson II, Mark S <mdavidson@mitre.org>; Wunder, John A. <jwunder@mitre.org>; Jordan, Bret <bret.jordan@bluecoat.com>; cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org
Subject: Re: [cti-users] MTI Binding

Just wanted to share a couple links that might be of interest here for RDF translation.

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.

https://github.com/RDFLib/rdflib

JSON-LD parser and serializer plugins for RDFLib (Python 2.5+)

https://github.com/RDFLib/rdflib-jsonld

Here is a online example of a RDF to multi-format translator.

http://rdf-translator.appspot.com/

On Thu, Oct 1, 2015 at 1:39 PM, Cory Casanave <cory-c@modeldriven.com> wrote:

Mark,

Do I see it today? no. There may be some but I don’t know of it.

Could it be used – sure. If you have very atomic data, like a sensor data, RDF can be VERY compact and understandable.

Since I NEVER program to the data syntax (Libraries and MDA magic do that) I really don’t care if the data is in JSON or XML, but some do, and I could see a sensor hard coded like that. So the reason I am suggesting looking at the JSON/RDF (JSON-LD) format is that it reads better (and easier to parse) than the same thing encoded in XML while supporting the requirements I mentioned.

I should have referenced the “standard” name: Json-ld

Other note: I have no vested interest in RDF technologies, its something I use where it is the best choice.

Here is some info on Wikipedia: https://en.wikipedia.org/wiki/JSON-LD

Other note: I’m not entirely convinced a single “MTI” is a good idea, but if it is a distributed graph structure is the only thing that would scale from a sensor report to a query across millions of data points.

From: Davidson II, Mark S [mailto:mdavidson@mitre.org]
Sent: Thursday, October 01, 2015 1:24 PM
To: Cory Casanave; Wunder, John A.
Cc: Jordan, Bret; cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org
Subject: RE: [cti-users] MTI Binding

Cory,

I’m a little unfamiliar with RDF, so I have a clarifying question. In terms of RDF in JSON, is that something that you see security products using directly to interoperate? E.g., my SIEM uses TAXII + STIX/RDF/JSON to talk to my Sensor?

Thank you.

-Mark

From:cti-users@lists.oasis-open.org [mailto:cti-users@lists.oasis-open.org] On Behalf Of Cory Casanave
Sent: Thursday, October 01, 2015 11:09 AM
To: Wunder, John A. <jwunder@mitre.org>
Cc: Jordan, Bret <bret.jordan@bluecoat.com>; cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org
Subject: RE: [cti-users] MTI Binding

John,

With respect to RDF in JSON, logical data models and other options, I will respond here but also look at updating the wiki. Sorry in advance for the long message – but I think it an important point.

JSON has come from an environment of “server applications” supplying data to their “client applications”, where the client applications tended to be coupled and implemented in _javascript_. The use has, of course, broadened, but that is the foundation and what it is very good at. What makes it “easy” is:

·         There is a well defined relationship between the client and server applications, usually under control of the same entity.

·         The server application is primarily in control of what the user will see through the client and how they interact.

·         There is a “dominate decomposition” of the data because it is serving a specific restricted set of use cases that the data structure and applications are tuned for. A strict data hierarchy works just fine. (Look up “dominate decomposition” – there is a lot of good information on the topic)

·         Data is coming from a single source and can be “bundled” for the next step in the applications workflow. Not much need to reference data from other sources or across interactions.

·         The semantics and restrictions of the data are understood within the small team(s) implementing this “client server” relationship – fancy schema or semantic representations are not needed.

·         The data source is the complete authority, at least for the client application.

·         Things don’t change much and when they do it is under a controlled revision on “both ends”.

·         The application technology is tuned to the data structure – _javascript_ directly reflects JSON.

A good example may be the “weather channel” application on your phone and web browser. It is all managed by the weather channel developers (and perhaps their partners) for users (specific stakeholder group) to get weather information (specific information for a purpose) for a region (the dominate decomposition). I don’t know if they use JSON, but it would be a natural choice. This set of clients is served by servers designed for the above purpose.

RDF & the “semantic web” stack has been designed with a very different set of assumptions:

·         Data providers and data consumers are independent and from different organizations, countries and communities.

·         Data providers and data consumers are independently managed.

·         Data providers have no idea what data consumers will use the data for, the consumer is more in control of what they consume and how they use it

·         There are numerous use cases, purposes and viewpoints being served – there is no dominate decomposition.

·         Data may come from multiple sources and the consumer may follow links to get more information, perhaps from the same or different sources. No static fixed “bundles” are practical.

·         Due to the distributed community the data semantics, relations and restrictions must be clearly communicated in a machine readable form.

·         Things change all the time and at different rates

·         No data source is complete, clients may use multiple sources

·         Any number of technology stacks will be used for both data providers and consumers.

An example could be the position and path of all airliners, worldwide.

This difference in design intent results in some specific differences in the technology:

·         RDF (and similar structures) are “data graphs” – information points to information without a dominate decomposition.

·         JSON is a strict hierarchy, essentially nested name/value pairs

·         RDF has as its core a type system with ways to describe those types

·         JSON has no type system, everything is a string. There is an assumption that “everyone knows what the tags mean”

·         RDF depends on URIs to reference data – this works within a “document” and across the web. This is where the “Linked data” term comes from (note: linked data may or may not be “open”)

·         JSON has no reference system at all, you can invent ways to encode references (locally or remote) in strings but they are ad-hoc and tend to be untyped

·         RDF is a data model with multiple syntax representations (XML, JSON, Turtle, etc)

·         JSON is a data syntax

Here is the rub: Programming any application for a more general, more distributed, less “dominate”, less managed and less coupled environment is going to be harder than coding for the coupled, dominate managed and technology tuned one. Changing the syntax is not going to change that. Encoding the RDF model in JSON does allow a simpler syntax (than RDF-XML or, I think, current STIX) and does allow it to be consumed more easily in many clients, but the developer will still have to cope with references, distribution and “creating their viewpoint” in the application rather than having it handed to them. The flexibility has this cost and the community has to decide if and how to handle it.

As I have suggested earlier, the best case is to make sure the description of your information (as understood by stakeholders) is represented in precise high-level machine readable models that will work with different decompositions and different syntaxes. It this is not the “singe source of the truth” for what your data means, you will be stuck in a technology – even if it is RDF.

If there is going to be one “required” syntax it best be one that can reflect this general model well, serve diverse communities, support different technology stacks and is friendly to differing decompositions (no dominate decomposition). Of course, it then has to be as easy to understand and implement as is possible under these constraints.

Where such general structures are encoded in XML it becomes complex. This is a combination of the need for the generality and the limits of XML schema. But, don’t blame XML for complexity that is inherent in the generality of CTI. The same complaint is levied on other general XML formats, like NIEM.

RDF in JSON syntax provides the type system, reference system and allows for a structured composition but does not require it – it is more friendly to this general structure than XML Schema. This seems like a good option. It would be a very good option if generated from a high level model that would serve to bind all the technologies.

Regards,

Cory Casanave

From: Wunder, John A. [mailto:jwunder@mitre.org]
Sent: Thursday, October 01, 2015 9:18 AM
To: Cory Casanave
Cc: Jordan, Bret; cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org
Subject: Re: [cti-users] MTI Binding

Can you elaborate a little, Cory? What are the advantages of RDF in JSON vs. either native JSON, native XML, or RDF in XML? What are the disadvantages?

If you could fill it out on the wiki that would be awesome, but if not then e-mail is fine too.

John

https://github.com/STIXProject/schemas/wiki/MTI-Format-Analysis

On Sep 30, 2015, at 8:20 PM, Cory Casanave <cory-c@modeldriven.com> wrote:

What about RDF in JSON? This then has a well defined schema.

From:cti-users@lists.oasis-open.org [mailto:cti-users@lists.oasis-open.org] On Behalf Of Jordan, Bret
Sent: Wednesday, September 30, 2015 6:56 PM
To: cti-stix@lists.oasis-open.org; cti-users@lists.oasis-open.org
Subject: [cti-users] MTI Binding

From the comments so far on the github wiki [1], the consensus right now from the community is for JSON to be used as the MTI (mandatory to implement) binding for STIX. For those that agree or disagree or have a different opinion, please update at least the final Conclusions section with your opinion.

[1] https://github.com/STIXProject/schemas/wiki/MTI-Format-Analysis

Thanks,

Bret

Bret Jordan CISSP

Director of Security Architecture and Standards | Office of the CTO

Blue Coat Systems

PGP Fingerprint: 63B4 FC53 680A 6B7D 1447 F2C0 74F8 ACAE 7415 0050

"Without cryptography vihv vivc ce xhrnrw, however, the only thing that can not be unscrambled is an egg."

cti-users message