OASIS Mailing List Archives

cti-users message


Subject: Re: [cti-users] Parsing corrupt STIXPackages in python-stix


Hi Matthew,

I cannot speak for the Python parser, since I have only used Java to
parse STIX/CybOX.  But if the Python parsers are generated from the .xsd
schema files, as mine were for Java, then an input document that is not
'valid' against the XML schema does indeed give you one big error
condition, halting the entire ingest of the input doc and leaving you
with nothing.

In my experience, the way around this is a truly awful pre-processing
step where you 'sanitize' your input docs with awk or sed and only THEN
pass them to the XML parser stage.  Of course these tools are
line-oriented and do not grok XML content at all.  Better would
probably be XPath or XSLT to fix the input, something I have seen done
but never done myself.
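
For the Python side, an XML-aware version of that sanitizing step might
look like the sketch below.  It uses only the standard library and
assumes the only damage is a trailing unit after the number in
Size_In_Bytes, as in the '380058 bytes' example quoted further down.
The function name sanitize_sizes is made up for illustration, and
elements are matched by local name so the sketch does not depend on any
particular namespace URI.

```python
import re
import xml.etree.ElementTree as ET

# Leading run of ASCII digits, e.g. "380058" in "380058 bytes".
_LEADING_INT = re.compile(r"^\s*([0-9]+)")
_PURE_INT = re.compile(r"\s*[0-9]+\s*")


def sanitize_sizes(xml_text):
    """Return (cleaned_xml, number_of_fixes).

    Hypothetical helper: rewrites any Size_In_Bytes element whose text is
    a number followed by junk so that only the number remains.
    """
    root = ET.fromstring(xml_text)
    fixed = 0
    for elem in root.iter():
        # Tags look like "{namespace-uri}LocalName"; compare the local part.
        if elem.tag.rsplit("}", 1)[-1] != "Size_In_Bytes":
            continue
        text = elem.text or ""
        if _PURE_INT.fullmatch(text):
            continue  # already a clean integer, leave it alone
        match = _LEADING_INT.match(text)
        if match:
            elem.text = match.group(1)
            fixed += 1
    return ET.tostring(root, encoding="unicode"), fixed
```

One caveat: ElementTree writes the document back out with generated
namespace prefixes (ns0, ns1, ...) unless each prefix is registered with
ET.register_namespace first.  STIX documents often carry prefixes inside
xsi:type attribute values, so the prefixes would need to be preserved
before handing the result to a STIX parser.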

If you are feeling really ambitious and know the Data Binding
technologies fairly well (which for Java means JAXB+xjc, not sure of the
Python equiv) you could amend the .xsd files to 'accommodate' your input
docs, again a hack.

The whole notion of schemas as a rigorous definition of allowable
documents for a vocabulary is a double-edged sword.  Great when they
work, but awkward in exactly your situation.

If I find any details on how to do 'note failure, continue parse', in
the Java tools at least, I'll follow up.  As I say, Python is not my
arena.
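
As a rough illustration of what 'note failure, continue parse' could
mean at the per-entity level, here is a generic Python sketch.  It is
not python-stix code: parse_best_effort and the parse_one callback are
hypothetical names, and parse_one stands in for whatever per-entity
parsing you have available.  Whether python-stix can be invoked on a
single fragment like this is something I cannot confirm.

```python
import xml.etree.ElementTree as ET


def parse_best_effort(xml_text, local_name, parse_one):
    """Parse every element named local_name, collecting failures.

    parse_one is a caller-supplied function (an assumption of this
    sketch) that turns one element into a parsed value or raises.
    Returns (parsed_values, failures), where failures is a list of
    (element, exception) pairs.
    """
    root = ET.fromstring(xml_text)
    parsed, failures = [], []
    for elem in root.iter():
        # Match on the local name, ignoring the "{namespace-uri}" part.
        if elem.tag.rsplit("}", 1)[-1] != local_name:
            continue
        try:
            parsed.append(parse_one(elem))
        except (ValueError, TypeError) as exc:
            # Note the failure and carry on with the next entity.
            failures.append((elem, exc))
    return parsed, failures
```

Called with local_name="Size_In_Bytes" and a parse_one that does
int(elem.text), the '380058 bytes' element would land in failures while
the well-formed ones are still returned.  Note this only helps with
bad values inside well-formed XML; a document that is not well-formed
still fails at ET.fromstring.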


Stuart


On 05/03/2016 04:00 PM, Matthew Hall wrote:
> I am running into issues parsing STIX Packages containing corrupted Indicators 
> and/or Observables reliably with python-stix.
> 
> Performing some research on the python-stix code, it appears there is not a 
> good way to catch exceptions at a very granular, per-entity level.
> 
> There is some code in the stix.utils.parser module, which in theory seems like 
> it would help with this, but it doesn't appear to have granular 
> exception-catching capability either.
> 
> Therefore, when the code comes across a CybOX FileObj w/ a bogus 
> Size_In_Bytes, the exception disrupts parsing the entire STIX Package not just 
> the corrupted / invalid entity:
> 
> <FileObj:Size_In_Bytes condition="Equals">380058 bytes</FileObj:Size_In_Bytes>
> 
> ValueError: invalid literal for long() with base 10: '380058 bytes'
> File ".../venv/lib/python2.7/site-packages/cybox/common/properties.py", line 514, in _parse_value
>   return long(value, 0)
> 
> How can I perform a best-effort parse with python-stix in order to operate as 
> properly as possible in such situations?

