Subject: [EXTERNAL EMAIL] - Re: [cti-comment] How to determine if a STIX indicator is unique with a large dataset?


Julian,

This issue is one of the primary reasons to use an RDF graph database. Because all storage is reduced to a triple representation and all elements can be indexed, the duplication issue is essentially eliminated... a very elegant solution.

When you attempt to add a triple that already exists in a pattern, it can simply be ignored; only the differing portion is added. Your database maintenance issues will also be all but eliminated. The example relating the SHA1 and MD5 hashes highlights the issue: if the MD5 were missing, it would simply be added to the file's pattern (subgraph), giving you another way to query the same file. Any two generated indicator IDs representing the same data would automatically share the overlapping pattern, so both IDs become aliases for the same indicator. If you need to, you can attribute elements added under one ID or the other (or both) by using reification for provenance.

Look into Semantic Web solutions, as STIX data really represents shared graph data. It might take a bit to understand the technology, but the payoff is tremendous. Free versions of AllegroGraph, GraphDB, and Fuseki (part of Apache Jena) can help you get started, and there are Python libraries for RDF. To help with data transformations, use OpenRefine with the RDF extension.
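
For a concrete feel of that idempotent-add behaviour, here is a rough sketch using the rdflib Python library (the example.org namespace and the sha1/md5 property names are just made up for illustration):

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/cti/")  # illustrative namespace, not a standard vocabulary

g = Graph()
file_node = EX["file-1"]

# First source reports the SHA1 hash of the file.
g.add((file_node, EX.sha1, Literal("47b95372a52671df9eb0d4b995440868eb991dfa")))

# A second source reports the same triple; a Graph is a set of triples,
# so re-adding it is simply ignored.
g.add((file_node, EX.sha1, Literal("47b95372a52671df9eb0d4b995440868eb991dfa")))

# The second source also supplies the MD5, which extends the file's subgraph
# and gives another way to query the same file.
g.add((file_node, EX.md5, Literal("09474b4a679e781775c66d70db5e1b94")))

print(len(g))  # 2 -- the duplicate triple was not stored twice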

FYI: STIX is an interesting "transport" format, but it has real storage and practical business-use issues. Threat intelligence really needs to live in a knowledge graph, built on knowledge graph technology, to give organizations actionable business intelligence. Semantic Web technology does that.

Keven

On Dec 24, 2020 12:05 AM, Drew Varner <drew.varner@ninefx.com> wrote:
Hey Julian,

I don’t use Python or the STIX2 library, but it should have what you need to get started. The expression comparison functions seem to work by:

1. Parsing the pattern into an abstract syntax tree (AST)
2. Canonicalizing the ASTs, possibly transforming them
3. Comparing the canonicalized ASTs

So, I’d use the STIX2 library’s guts to canonicalize the ASTs of patterns, add them as a field/column/key in a database of some sort, and rely on indexes and a uniqueness constraint to detect duplicates efficiently. I included some horrible example code that you really shouldn’t use. It copy/pastes code from the expression comparison functions to generate normalized ASTs and then converts those ASTs back to expressions. But if you’re using this in a database with millions of records, I wouldn’t store the canonicalized expressions; I’d just store their hash. You may want to make a pull request to surface this capability in the STIX2 public API.

It’s important to realize that completely identifying duplicate signatures would be hard. For example, MD5 “09474b4a679e781775c66d70db5e1b94” and SHA1 “47b95372a52671df9eb0d4b995440868eb991dfa” refer to the same file, but it’d be hard to handle that case when normalizing expressions.

There are opportunities to improve canonicalization, and some cases go beyond canonicalization and would require a solver:

* Using the fact that file size is an integer, expressions like [file:size >= 371713] can be normalized to [file:size > 371712]; a toy version of this rewrite is sketched just after this list. We can't do the same thing with floats.
* The following expressions are semantically equivalent, but don’t canonicalize the same: [process:name IN ('proccy', 'proximus', 'badproc')], [process:name IN ('proccy', 'badproc', 'proximus')]
* The pattern [directory:path LIKE '/var/_foo'] duplicates [directory:path LIKE '/var/%foo'], but not vice versa
* [file:size != 4112] subsumes more specific patterns like [file:size = 1], [file:size < 999], etc.
* I wouldn’t even look at canonicalizing/solving regular expressions. I don’t think they’ll really all be PCRE either.
* The sub/superset stuff would need a solver
* I think [file:size > 371712 OR process:name = 'calc'] subsumes [process:name = 'calc'] when looking for an indicator. I’d be more sure after coffee.
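
As a toy illustration of the integer-bound rewrite in the first bullet (this helper is not part of the stix2 library, and a real version would have to apply it only to fields known to be integers):

import re

def tighten_integer_bounds(pattern):
    # Rewrite ">= N" as "> N-1" and "<= N" as "< N+1"; only valid for integer fields.
    pattern = re.sub(r">=\s*(\d+)\b", lambda m: "> " + str(int(m.group(1)) - 1), pattern)
    pattern = re.sub(r"<=\s*(\d+)\b", lambda m: "< " + str(int(m.group(1)) + 1), pattern)
    return pattern

print(tighten_integer_bounds("[file:size >= 371713]"))  # [file:size > 371712]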

A canonicalized expression hash column with a unique constraint is likely good enough at first.

Thanks,
Drew

- Python3 Snippet -
import stix2
from stix2.pattern_visitor import create_pattern_object
from stix2.equivalence.patterns.transform import ChainTransformer, SettleTransformer
from stix2.equivalence.patterns.transform.observation import (
    AbsorptionTransformer, CanonicalizeComparisonExpressionsTransformer,
    DNFTransformer, FlattenTransformer, OrderDedupeTransformer,
)

# Ripped off from the STIX2 library's pattern-equivalence internals
def canonical_pattern(pattern, stix_version=stix2.DEFAULT_VERSION):
    # Simplify observation expressions, repeating until they stop changing
    obs_simplify = ChainTransformer(
        FlattenTransformer(), OrderDedupeTransformer(), AbsorptionTransformer(),
    )
    obs_settle_simplify = SettleTransformer(obs_simplify)
    # Canonicalize comparison expressions, convert to DNF, then simplify again
    pattern_canonicalizer = ChainTransformer(
        CanonicalizeComparisonExpressionsTransformer(),
        obs_settle_simplify, DNFTransformer(), obs_settle_simplify,
    )
    # Parse the pattern string into an AST, transform it, and serialize it back
    pattern_ast = create_pattern_object(pattern, version=stix_version)
    canonicalized_pattern, _ = pattern_canonicalizer.transform(pattern_ast)
    return str(canonicalized_pattern)

print(canonical_pattern("[url:value = 'http://example.com/foo' OR url:value = 'http://example.com/bar']"))
# [url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']
print(canonical_pattern("[url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']"))
# [url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']
print(canonical_pattern("[file:size > 371712]"))
# [file:size > 371712]
print(canonical_pattern("[file:size >= 371713]"))
# [file:size >= 371713]
print(canonical_pattern("[process:name IN ('proccy', 'proximus', 'badproc')]"))
# [process:name IN ('proccy', 'proximus', 'badproc')]
print(canonical_pattern("[process:name IN ('proccy', 'badproc', 'proximus')]"))
# [process:name IN ('proccy', 'badproc', 'proximus')]

- End Snippet -
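
To make the hash-plus-unique-constraint idea concrete, here is a rough sketch building on the snippet above. It uses Python’s built-in hashlib and sqlite3; the table and column names are invented, and it assumes the canonical_pattern() helper defined above:

- Python3 Snippet -
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicator (id TEXT PRIMARY KEY, pattern_hash TEXT UNIQUE)")

def insert_indicator(indicator_id, pattern):
    # Hash the canonicalized pattern rather than storing the full text.
    digest = hashlib.sha256(canonical_pattern(pattern).encode("utf-8")).hexdigest()
    try:
        conn.execute(
            "INSERT INTO indicator (id, pattern_hash) VALUES (?, ?)",
            (indicator_id, digest),
        )
        return True
    except sqlite3.IntegrityError:
        # UNIQUE constraint violation: an equivalent pattern is already stored.
        return False

print(insert_indicator("indicator--1", "[url:value = 'http://example.com/foo' OR url:value = 'http://example.com/bar']"))  # True
print(insert_indicator("indicator--2", "[url:value = 'http://example.com/bar' OR url:value = 'http://example.com/foo']"))  # False
- End Snippet -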

> On Dec 23, 2020, at 6:43 PM, julian køster Larsen <jullefis@gmail.com> wrote:
>
> Hi, I would appreciate advice on how to store millions (~10 million) of STIX indicators in a way that prevents duplicates.
> As patterns can be written in various ways, I have yet to come up with a solution myself.
>
> My current idea was to make use of the find_equivalent_patterns() method from the python stix2 library:
> https://stix2.readthedocs.io/en/latest/api/equivalence/stix2.equivalence.pattern.html#module-stix2.equivalence.pattern
> With this solution, however, I would potentially have to iterate over a lot of STIX indicators to determine if a STIX indicator, A, is unique.
> My current (bad) solution queries the DB and collects STIX indicators that contain the same object paths and/or constants as A, and then uses find_equivalent_patterns() to verify whether any of these patterns are equal.
> To avoid confusion, here is an example of a STIX indicator pattern:
> https://imgur.com/MsG39x8
>
> Regards, Julian


