Subject: Re: [cti-comment] How to determine if a STIX indicator is unique with a large dataset?


Hey Julian,

I don't use Python or the STIX2 library, but it should have what you need to get started. The expression comparison functions seem to work by (there's a short usage sketch after this list):

1. Parsing the pattern into an abstract syntax tree (AST)
2. Canonicalizing the ASTs, possibly transforming them
3. Comparing the canonicalized ASTs
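
For what it's worth, the library already wraps those three steps up as equivalent_patterns() in the stix2.equivalence.pattern module from the docs you linked below; something like this should do a pairwise check:

from stix2.equivalence.pattern import equivalent_patterns

# Both patterns are parsed to ASTs, canonicalized, and compared;
# True means their canonical forms are identical.
print(equivalent_patterns(
    "[url:value = 'http://example.com/foo' OR url:value = 'http://example.com/bar']",
    "[url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']",
))  # True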

So, I'd use the STIX2 library guts to canonicalize the ASTs of patterns, add the canonical form as a field/column/key in a database of some sort, and rely on indexes to efficiently detect duplicates using a constraint. I included some horrible example code that you really shouldn't use. It copy/pastes code from the expression comparison functions to generate normalized ASTs and then converts those ASTs back to expressions. But if you're using this in a database on millions of records, I wouldn't store the canonicalized expressions; I'd just store their hash. You may want to make a pull request to surface this capability in the STIX2 public API.
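
For the hash itself, something like SHA-256 over the canonical text should be plenty (sketch only; pattern_key() is a made-up helper, and you'd feed it the output of a canonicalizer like canonical_pattern() from the snippet below):

import hashlib

def pattern_key(canonical_pattern_text):
    # Fixed-size key for the unique index; the canonical text itself
    # doesn't need to be stored alongside it.
    return hashlib.sha256(canonical_pattern_text.encode("utf-8")).hexdigest()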

It's important to realize that completely identifying duplicate signatures would be hard. For example, MD5 '09474b4a679e781775c66d70db5e1b94' and SHA1 '47b95372a52671df9eb0d4b995440868eb991dfa' refer to the same file, but it'd be hard to handle that case when normalizing expressions.

There are opportunities to improve canonicalization and some cases that are beyond canonicalization that would require a solver:

* Using the fact that file size is an integer, expressions like [file:size >= 371713] can be normalized to [file:size > 371712] (a toy sketch of this rewrite follows the list). We can't do the same thing with floats.
* The following expressions are semantically equivalent, but don't canonicalize the same: [process:name IN ('proccy', 'proximus', 'badproc')], [process:name IN ('proccy', 'badproc', 'proximus')]
* Every match of [directory:path LIKE '/var/_foo'] is also a match of [directory:path LIKE '/var/%foo'], but not vice versa
* [file:size != 4112] is subsumed by [file:size = 1], [file:size < 999], etc. 
* I wouldn't even look at canonicalizing/solving regular expressions. I don't think they'll really all be PCRE either.
* The sub/superset stuff would need a solver
* I think [file:size > 371712 OR process:name = 'calc'] subsumes [process:name = 'calc'] when looking for an indicator. I'd be more sure after coffee.
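
To illustrate just the first bullet, here's a toy string-level version of that integer rewrite (made-up helper, not part of stix2; a real implementation would rewrite the comparison nodes in the parsed AST instead):

import re

# Toy rewrite for integer-valued properties only: "x >= n" matches the same
# values as "x > n-1", and "x <= n" the same as "x < n+1".
INTEGER_PROPS = {"file:size"}  # hypothetical allow-list of integer properties

def normalize_integer_bounds(pattern):
    def repl(match):
        prop, op, value = match.group(1), match.group(2), int(match.group(3))
        if prop not in INTEGER_PROPS:
            return match.group(0)
        if op == ">=":
            return "{} > {}".format(prop, value - 1)
        return "{} < {}".format(prop, value + 1)  # op is "<="
    return re.sub(r"([\w:.]+)\s*(>=|<=)\s*(\d+)", repl, pattern)

print(normalize_integer_bounds("[file:size >= 371713]"))
# [file:size > 371712]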

A canonicalized expression hash column with a unique constraint is likely good enough at first. 
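
Roughly what I have in mind, using SQLite just for illustration (made-up table/column names; any database with unique indexes works the same way):

import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: keep the original pattern plus a hash of its canonical
# form, and let the UNIQUE constraint reject duplicates at insert time.
conn.execute(
    "CREATE TABLE indicator ("
    " id TEXT PRIMARY KEY,"
    " pattern TEXT NOT NULL,"
    " canonical_hash TEXT NOT NULL UNIQUE)"
)

def insert_indicator(indicator_id, pattern, canonical_hash):
    try:
        with conn:
            conn.execute(
                "INSERT INTO indicator (id, pattern, canonical_hash) VALUES (?, ?, ?)",
                (indicator_id, pattern, canonical_hash),
            )
        return True
    except sqlite3.IntegrityError:
        # Same canonical hash already stored: treat as a duplicate indicator.
        return False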

Thanks,
Drew

- Python3 Snippet -
import stix2
import stix2.pattern_visitor
from stix2.equivalence.pattern.transform import ChainTransformer, SettleTransformer
from stix2.equivalence.pattern.transform.observation import (
    AbsorptionTransformer, CanonicalizeComparisonExpressionsTransformer,
    DNFTransformer, FlattenTransformer, OrderDedupeTransformer,
)

# Ripped off from the STIX2 library (its internal pattern canonicalizer)
def canonical_pattern(pattern, stix_version=stix2.DEFAULT_VERSION):
    # Simplify observation expressions, repeating until the result settles
    obs_simplify = ChainTransformer(
        FlattenTransformer(), OrderDedupeTransformer(), AbsorptionTransformer())
    obs_settle_simplify = SettleTransformer(obs_simplify)
    # Canonicalize comparisons, convert to DNF, then simplify again
    pattern_canonicalizer = ChainTransformer(
        CanonicalizeComparisonExpressionsTransformer(), obs_settle_simplify,
        DNFTransformer(), obs_settle_simplify)
    # Parse the pattern text into an AST, canonicalize it, render it back to text
    pattern_ast = stix2.pattern_visitor.create_pattern_object(pattern, version=stix_version)
    canonicalized_pattern, _ = pattern_canonicalizer.transform(pattern_ast)
    return str(canonicalized_pattern)

print(canonical_pattern("[url:value = 'http://example.com/foo' OR url:value = 'http://example.com/bar']"))
# [url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']
print(canonical_pattern("[url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']"))
# [url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']
print(canonical_pattern("[file:size > 371712]"))
# [file:size > 371712]
print(canonical_pattern("[file:size >= 371713]"))
# [file:size >= 371713]
print(canonical_pattern("[process:name IN ('proccy', 'proximus', 'badproc')]"))
# [process:name IN ('proccy', 'proximus', 'badproc')]
print(canonical_pattern("[process:name IN ('proccy', 'badproc', 'proximus')]"))
# [process:name IN ('proccy', 'badproc', 'proximus')]

- End Snippet -

> On Dec 23, 2020, at 6:43 PM, julian kÃster Larsen <jullefis@gmail.com> wrote:
> 
> Hi, I would appreciate advice on how to store millions (~10 million) of STIX indicators in a way that prevents duplicates.
> As patterns can be written in various ways, I have yet to come up with a solution myself.
> 
> My current idea was to make use of the find_equivalent_patterns() method from the python stix2 library:
> https://stix2.readthedocs.io/en/latest/api/equivalence/stix2.equivalence.pattern.html#module-stix2.equivalence.pattern
> With this solution, however, I would potentially have to iterate over a lot of STIX indicators to determine if a given STIX indicator, A, is unique.
> My current (bad) solution queries the DB and collects STIX indicators that contain the same object paths and/or constants as A, and then uses find_equivalent_patterns() to verify whether any of these patterns are equal.
> To avoid confusion here is an example of a STIX indicator pattern:
> https://imgur.com/MsG39x8
> 
> Regards Julian
