Hey Julian,
I don’t use Python or the STIX2 library, but it should have what you need to get started. The expression comparison functions seem to work by:
1. Parsing the pattern into an abstract syntax tree (AST)
2. Canonicalizing the ASTs, possibly transforming them
3. Comparing the canonicalized ASTs
So, I’d use the STIX2 library guts to canonicalize the ASTs of patterns, add the result as a field/column/key in a database of some sort, and rely on an index with a unique constraint to detect duplicates efficiently. I included some horrible example code that you really shouldn’t use. It copy/pastes code from the expression comparison functions to generate normalized ASTs and then converts those ASTs back to expressions. But if you’re using this in a database on millions of records, I wouldn’t store the canonicalized expressions. I’d just store their hash. You may want to make a pull request to surface this capability in the STIX2 public API.
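A minimal sketch of the hash-plus-unique-constraint idea, assuming SQLite and SHA-256 purely for illustration. The table layout and the names pattern_key, open_store, and add_indicator are made up here; the canonicalize argument is whatever function turns a pattern into its canonical text (for instance the canonical_pattern function from the snippet at the end of this message).

```python
import hashlib
import sqlite3
from typing import Callable


def pattern_key(canonical_text: str) -> str:
    """SHA-256 hex digest of an already-canonicalized pattern string."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database whose schema (hypothetical) rejects duplicate hashes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicator ("
        " id INTEGER PRIMARY KEY,"
        " pattern TEXT NOT NULL,"
        " pattern_hash TEXT NOT NULL UNIQUE)"  # UNIQUE creates the index
    )
    return conn


def add_indicator(
    conn: sqlite3.Connection,
    pattern: str,
    canonicalize: Callable[[str], str],
) -> bool:
    """Store the pattern; return False if an equivalent one already exists."""
    key = pattern_key(canonicalize(pattern))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO indicator (pattern, pattern_hash) VALUES (?, ?)",
                (pattern, key),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint fired: same canonical form already stored.
        return False
```

With this shape the duplicate check is a single index lookup per insert, so nothing has to be compared pairwise against the existing records.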
It’s important to realize that completely identifying duplicate signatures would be hard. For example, MD5 “09474b4a679e781775c66d70db5e1b94” and SHA1 “47b95372a52671df9eb0d4b995440868eb991dfa” refer to the same file, but it’d be hard to handle that case when normalizing expressions.
There are opportunities to improve canonicalization and some cases that are beyond canonicalization that would require a solver:
* Using the fact that file size is an integer, expressions like [file:size >= 371713] can be normalized to [file:size > 371712]. We can’t do the same thing with floats.
* The following expressions are semantically equivalent, but don’t canonicalize the same: [process:name IN ('proccy', 'proximus', 'badproc')], [process:name IN ('proccy', 'badproc', 'proximus')]
* The pattern [directory:path LIKE '/var/_foo'] duplicates [directory:path LIKE '/var/%foo'], but not vice versa (_ matches exactly one character, % matches zero or more)
* [file:size != 4112] subsumes [file:size = 1], [file:size < 999], etc.
* I wouldn’t even look at canonicalizing/solving regular expressions. I don’t think they’ll really all be PCRE either.
* The sub/superset stuff would need a solver
* I think [file:size > 371712 OR process:name = 'calc'] subsumes [process:name = 'calc'] when looking for an indicator. I’d be more sure after coffee.
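To make the first two bullets concrete, here is a toy sketch. It works on plain (path, operator, value) tuples rather than the STIX2 library's AST classes, and INTEGER_PATHS and normalize_comparison are made-up names, so treat it as an illustration of the rewrite rules and not as something the library offers. The symmetric <= rule is my addition by the same integer argument.

```python
# Hypothetical list of object paths known to hold integers.
INTEGER_PATHS = {"file:size"}


def normalize_comparison(path, op, value):
    """Return a normalized (path, op, value) tuple for one comparison."""
    is_int = isinstance(value, int) and not isinstance(value, bool)
    if path in INTEGER_PATHS and is_int:
        # For integers, x >= n is the same as x > n - 1.
        if op == ">=":
            return (path, ">", value - 1)
        # Likewise, x <= n is the same as x < n + 1.
        if op == "<=":
            return (path, "<", value + 1)
    if op == "IN":
        # Set membership ignores order and repeats, so sort and dedupe.
        return (path, "IN", tuple(sorted(set(value))))
    # Floats and everything else are left untouched.
    return (path, op, value)


a = normalize_comparison("file:size", ">=", 371713)
b = normalize_comparison("file:size", ">", 371712)
c = normalize_comparison("process:name", "IN", ("proccy", "proximus", "badproc"))
d = normalize_comparison("process:name", "IN", ("proccy", "badproc", "proximus"))
print(a == b, c == d)
```

The LIKE, != and OR cases in the later bullets are not handled here; those are the ones that need a solver.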
A canonicalized expression hash column with a unique constraint is likely good enough at first.
Thanks,
Drew
- Python3 Snippet -
import stix2
from stix2.equivalence.patterns.transform import (
    ChainTransformer,
    SettleTransformer,
)
from stix2.equivalence.patterns.transform.observation import (
    AbsorptionTransformer,
    CanonicalizeComparisonExpressionsTransformer,
    DNFTransformer,
    FlattenTransformer,
    OrderDedupeTransformer,
)


# Ripped off from the STIX2 library
def canonical_pattern(pattern, stix_version=stix2.DEFAULT_VERSION):
    # Build the same transformer chain the comparison functions use.
    obs_simplify = ChainTransformer(
        FlattenTransformer(), OrderDedupeTransformer(), AbsorptionTransformer()
    )
    obs_settle_simplify = SettleTransformer(obs_simplify)
    pattern_canonicalizer = ChainTransformer(
        CanonicalizeComparisonExpressionsTransformer(),
        obs_settle_simplify,
        DNFTransformer(),
        obs_settle_simplify,
    )
    # Parse to an AST, canonicalize it, then turn it back into text.
    pattern_ast = stix2.pattern_visitor.create_pattern_object(pattern, version=stix_version)
    canonicalized_pattern, _ = pattern_canonicalizer.transform(pattern_ast)
    return str(canonicalized_pattern)


print(canonical_pattern("[url:value = 'http://example.com/foo' OR url:value = 'http://example.com/bar']"))
# [url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']
print(canonical_pattern("[url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']"))
# [url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']
print(canonical_pattern("[file:size > 371712]"))
# [file:size > 371712]
print(canonical_pattern("[file:size >= 371713]"))
# [file:size >= 371713]
print(canonical_pattern("[process:name IN ('proccy', 'proximus', 'badproc')]"))
# [process:name IN ('proccy', 'proximus', 'badproc')]
print(canonical_pattern("[process:name IN ('proccy', 'badproc', 'proximus')]"))
# [process:name IN ('proccy', 'badproc', 'proximus')]
- End Snippet -
> On Dec 23, 2020, at 6:43 PM, julian køster Larsen <jullefis@gmail.com> wrote:
>
> Hi, I would appreciate advice on how to store millions (~10) of STIX indicators in a way that prevents duplicates.
> As patterns can be written in various ways, I have yet to come up with a solution myself.
>
> My current idea was to make use of the find_equivalent_patterns() method from the Python stix2 library:
> https://stix2.readthedocs.io/en/latest/api/equivalence/stix2.equivalence.pattern.html#module-stix2.equivalence.pattern
> With this solution, however, I would potentially have to iterate over a lot of STIX indicators to determine whether a given STIX indicator, A, is unique.
> My current bad solution queries the DB and collects STIX indicators that contain the same object paths and/or constants as A, and then uses find_equivalent_patterns() to verify whether any of these patterns are equal.
> To avoid confusion, here is an example of a STIX indicator pattern:
> https://imgur.com/MsG39x8
>
> Regards Julian