Re: [cti-cybox] simplifying the data model

Subject: Re: [cti-cybox] simplifying the data model

From: "Kirillov, Ivan A." <ikirillov@mitre.org>

To: Cedric Le Roux <cleroux@splunk.com>, "cti-cybox@lists.oasis-open.org" <cti-cybox@lists.oasis-open.org>

Date: Mon, 3 Aug 2015 14:31:55 +0000

Great points and feedback, Cedric! Comments inline below. Also, would you mind if I added these to our issues tracker on GitHub [1]? That way they won’t get lost in the email void.

[1] https://github.com/CybOXProject/schemas/issues

Regards,

Ivan

From: <cti-cybox@lists.oasis-open.org> on behalf of Cedric Le Roux
Date: Friday, July 31, 2015 at 7:11 AM
To: "cti-cybox@lists.oasis-open.org"
Subject: [cti-cybox] simplifying the data model

Hello the list,

I tried to catch up on previous discussions before submitting, and this idea seems to have been identified already [0][1], so I’m going to emphasize it with practical examples. We all have different experiences with this new standard, but the real success of STIX/CybOX is that sharing has happened across communities, and I think simplifying the standard will make adoption easier and faster.

This submission is based on my own experience of actually implementing the standard, meaning writing code. I have spent time on this standard, but I don’t claim to be an expert in it, and I surely have some misunderstandings. So yes, I’m just sharing thoughts and comments here :)

1. Too many options for one single indicator.

The standard permits too many ways of describing one single indicator. Because of this, it’s really difficult to implement the standard and be fully compliant with it, so it’s a blocker to wider adoption (how many vendors are actually talking about supporting STIX/CybOX, and how many really do it?).

Example:

To describe an IPv4 address, we can today have the following representations:

- 127.0.0.1,

- 127.0.0.1/32,

- 127.0.0.1/255.255.255.0,

- 127.0.0.1-127.0.0.2,

- the awful ‘##’ notation (or sometimes a comma separated list of values),

- etc, and probably others.

If the standard allowed only one way to describe IPv4 addresses, such as CIDR notation (127.0.0.1/32), it would be super easy for anyone to actually implement the standard. And there is no loss of information, because CIDR notation covers all possible IPs or ranges of IPs. Eventually, for convenience, we may want to have 2 formats: the CIDR notation and the single IP notation, but no more than that.

Don’t get me wrong, I’m not saying an analyst shouldn’t be able to input different IPv4 formats in the software they use (like Soltra Edge, for example); I’m saying that this particular need is out of the scope of the standard. It is the job of the software to do the transformation to what the standard is expecting (the CIDR notation).
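As a rough illustration of that transformation step, here is a minimal Python sketch that normalizes the representations listed above to CIDR notation with the standard `ipaddress` module. The function name `to_cidr` and the delimiter strings are assumptions made for this example, not something defined by CybOX. Note that an arbitrary range may need more than one CIDR block to be expressed exactly.

```python
import ipaddress

# Assumed list delimiters for this sketch; adjust to whatever the input uses.
LIST_DELIMITERS = ("##comma##", ",")


def to_cidr(value):
    """Return a list of CIDR strings equivalent to the given IPv4 expression."""
    value = value.strip()

    # Delimited list: normalize each item separately and concatenate.
    for delimiter in LIST_DELIMITERS:
        if delimiter in value:
            result = []
            for part in value.split(delimiter):
                result.extend(to_cidr(part))
            return result

    # Explicit range "first-last": summarize into the minimal set of blocks.
    if "-" in value:
        first_text, last_text = value.split("-", 1)
        first = ipaddress.IPv4Address(first_text.strip())
        last = ipaddress.IPv4Address(last_text.strip())
        return [str(net) for net in ipaddress.summarize_address_range(first, last)]

    # Single address, prefix length, or dotted netmask.
    # strict=False masks out host bits instead of raising an error.
    return [str(ipaddress.IPv4Network(value, strict=False))]


if __name__ == "__main__":
    for text in ("127.0.0.1", "127.0.0.1/255.255.255.0", "127.0.0.1-127.0.0.2"):
        print(text, "->", to_cidr(text))
```

With something like this in the software layer, the standard itself would only ever need to carry the CIDR form.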

In short, I think the problem here is that the standard covers some things that should be part of software specifications and not part of the standard itself.

[Ivan] Agreed – I think many of the existing components of CybOX were designed with the broadest set of use cases in mind, without considering that many of these cases can be handled in other ways, such as through software specifications. This will be a balancing act, but I think there’s a strong community consensus towards reducing the number of ways of capturing atomic entities such as IP Addresses, and as such Trey and I are making this a high priority for CybOX v3.0.

2. Logic errors

Because current objects are not atomic, they can lead to logic errors and a lot of confusion, as in the following example. According to the current specifications, the following object is valid (and validated by the script stix_validator.py):

<cybox:Properties xsi:type="AddressObj:AddressObjectType" category="e-mail">

<AddressObj:Address_Value>pouet@whatever.tld</AddressObj:Address_Value>

<AddressObj:VLAN_Name>This is the name of a VLAN</AddressObj:VLAN_Name>

</cybox:Properties>

This is a valid object mixing an email definition and a VLAN name, which, in my understanding, has no meaning. Note also that I left the “Address_Value” in for demonstration purposes, but the very same object is still valid without this field, which is even more awkward.

[Ivan] Good point. In most cases, this is due to us creating relatively abstract Objects that are intended to capture different types of entities – the upside is that we end up with a single Object, but the downside is that this makes semantic validation impossible to do in the schema itself (without additional rules via schematron or other methods). This can likely be addressed by making Objects more atomic entities, as you’ve discussed below.
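As a rough sketch of the kind of additional rule mentioned above, the following Python checks an already-parsed object against a per-category table of fields. The table contents and the function name `semantic_errors` are hypothetical and only serve the example; the real set of meaningful combinations would have to come from the specification.

```python
# Hypothetical rule table: which fields make sense for each address category.
# The combinations listed here are assumptions for illustration only.
ALLOWED_FIELDS = {
    "e-mail": {"Address_Value"},
    "ipv4-addr": {"Address_Value", "VLAN_Name", "VLAN_Num"},
}


def semantic_errors(category, fields):
    """Return a list of human-readable problems for one parsed object."""
    allowed = ALLOWED_FIELDS.get(category)
    if allowed is None:
        return [f"unknown category: {category}"]

    # Flag every field that the rule table does not allow for this category.
    errors = [
        f"{name} is not meaningful for category {category}"
        for name in sorted(fields)
        if name not in allowed
    ]

    # In this sketch, an address object without a value is also an error.
    if "Address_Value" not in fields:
        errors.append("Address_Value is missing")
    return errors


if __name__ == "__main__":
    example = {
        "Address_Value": "pouet@whatever.tld",
        "VLAN_Name": "This is the name of a VLAN",
    }
    print(semantic_errors("e-mail", example))
```

The point of the sketch is that such rules live outside the schema, so every implementer has to rediscover and maintain them.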

3. Too many objects

Another problem is that we can describe the same information with different objects. If we keep the previous example of the e-mail, to define an email address we can at least choose between an EmailMessageObject or an AddressObject.

This is even more complicated with DNS related objects: HostnameObject, AddressObject, DomainNameObject, DNSCacheObject, DNSQueryObject, DNSRecordObject, etc.

[Ivan] I also concur – there are too many overlapping Objects at this point. Some of these we’ve known about and have intended to fix (e.g., the AddressObject and AS Object can both capture AS names), while others were created for specific use cases (e.g., DNS Cache and DNS Query). This will require some analysis by ourselves and the CybOX community, but my hope is that we can eliminate any true redundancies while also retaining the ability to target the use cases that some of the Objects were initially created for.

4. Wide objects but still missing coverage

To my understanding, the only way to describe a MAC address in CybOX is by using an AddressObject with the field category set to “mac”. This covers the definition of a MAC address, but it doesn’t tell me the format of the MAC address itself (is the separator a hyphen? a colon? none? dots? are the characters grouped by 2 or 4? which manufacturer is associated with the first 3 bytes of the MAC? etc).

By extension, updating an atomic indicator over time seems easier than updating complex types like today’s AddressObject. For example, if we extend the AddressObject type for full coverage of MAC addresses, we probably have a greater chance of side effects in existing products than if we were using a dedicated atomic object.

[Ivan] Completely agree. One of the main benefits of atomic Objects is that we can much more effectively constrain and therefore validate the data that is captured in the Object, which benefits both producers and consumers.
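To illustrate how a dedicated atomic object can constrain and validate its data, here is a minimal Python sketch of a MAC address type. The class name `MacAddress`, its methods, and the choice of lowercase colon-separated form as the canonical representation are all assumptions for this example, not part of CybOX.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MacAddress:
    """Hypothetical atomic object holding one MAC address in canonical form."""

    value: str  # lowercase, colon-separated, e.g. "00:1a:2b:3c:4d:5e"

    @classmethod
    def parse(cls, text):
        # Drop the common separators (hyphen, colon, dot), then validate.
        # This is deliberately permissive about where separators appear.
        digits = re.sub(r"[-:.]", "", text.strip()).lower()
        if not re.fullmatch(r"[0-9a-f]{12}", digits):
            raise ValueError(f"not a MAC address: {text!r}")
        pairs = [digits[i:i + 2] for i in range(0, 12, 2)]
        return cls(":".join(pairs))

    @property
    def oui(self):
        # The first 3 bytes identify the manufacturer.
        return self.value[:8]


if __name__ == "__main__":
    mac = MacAddress.parse("00-1A-2B-3C-4D-5E")
    print(mac.value, mac.oui)
```

Because the object accepts only one kind of value, producers and consumers can agree on a single representation and reject anything else early.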

5. Lists of Objects

Another fact is that CybOX allows the notation ‘##’ as an attempt to describe a list of objects. I think this notation is anything but efficient or convenient, so I see 2 options here:

1- We don’t need lists: the standard already allows describing multiple objects of the same nature multiple times within the same IOC file, so there is no need for this in the standard at all. So no lists are needed, and this notation disappears.

2- We need lists, which means we need a proper object to handle lists, and not a trick like the current notation. For example, something like <ListObj name="myList"><obj1>,<obj2>, ... </ListObj> (or maybe the “relatedTo” could do the job?)

[Ivan] Agree that lists need revisiting and that the current implementation is painful to deal with. I think the core issue is whether we need to support list-based indicator matching (e.g., matching against a list of File Names), or whether this should instead be performed using another method (e.g., Boolean composition).
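To make the two options concrete, here is a minimal Python sketch that matches a file name either through a delimited list or through a Boolean composition of atomic tests. The `##comma##` delimiter and the helper names (`equals`, `any_of`, `from_delimited`) are assumptions for illustration and are not taken from the CybOX schemas.

```python
from typing import Callable

Predicate = Callable[[str], bool]

LIST_DELIMITER = "##comma##"  # assumed delimiter for this sketch


def equals(expected: str) -> Predicate:
    """Atomic test: the observed value equals one expected value."""
    return lambda observed: observed == expected


def any_of(*predicates: Predicate) -> Predicate:
    """Boolean OR composition of atomic tests."""
    return lambda observed: any(p(observed) for p in predicates)


def from_delimited(value: str) -> Predicate:
    """Expand a delimited list into an OR of atomic equality tests."""
    return any_of(*(equals(item) for item in value.split(LIST_DELIMITER)))


# List-based form, expanded by software before matching.
matcher = from_delimited("evil.exe##comma##dropper.dll")

# Equivalent explicit composition, with no list notation involved.
explicit = any_of(equals("evil.exe"), equals("dropper.dll"))

if __name__ == "__main__":
    print(matcher("evil.exe"), explicit("notepad.exe"))
```

In this sketch both forms behave the same, which suggests the list notation could be treated as an input convenience rather than something the standard must carry.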

6. Benefits

To my understanding, the benefits of this reduction or simplification of the standard are:

- Easier implementation, whether it’s from scratch or using existing libraries

- Wider adoption, because of the previous point

- Objects become building blocks that are easier to work with when starting to build real logic within the IOCs.

- By having atomic objects, we avoid logic errors.

- Faster code execution due to less conditional branching required in the code.

To follow up on Trey’s proposal on creating working groups, I would be happy to join the one on Simplifying the Data Model. I would like to see the standard going in a direction where atomic objects are defined, just like we have atomic types in C (int, char, char *, etc). Based only on those few types, we can build a whole operating system with complex rendering. Of course there are many implications of such a move, starting with backward compatibility, but I think it’s good for the adoption of the standard by a wider audience.

To conclude, and as a generic rule, indicators should be atomic building blocks, meaning they should be in a form that cannot be reduced any further, and they shouldn’t be ambiguous. In short, keep it simple :)

Happy to discuss during BH/Defcon.

[0] https://lists.oasis-open.org/archives/cti/201507/msg00142.html

[1] https://wiki.oasis-open.org/cti/July%2030th%202015

Thanks,

Cedric

Cédric Le Roux

Principal Security Engineer

Minister of Segfault

Splunk Inc.

cleroux@splunk.com