Subject: Re: [cti-users] Parsing corrupt STIXPackages in python-stix
Hi Matthew,

I cannot speak for the Python parser, since I have only used Java to parse STIX/CybOX. But if the Python parsers are generated from the .xsd schema files, as mine were for Java, then I think an input document that is not 'valid' against the XML schema does indeed produce one big error condition, halting the entire ingest of the input doc and leaving you with nothing.

In my experience, the way around this is a truly awful pre-processing step where you 'sanitize' your input docs with awk and sed, and THEN pass them to the XML parser stage. Of course these tools are line-oriented and do not grok XML content at all. Better would probably be XPath or XSLT to fix the input, something I have seen but never done.

If you are feeling really ambitious and know the data binding technologies fairly well (which for Java means JAXB+xjc; not sure of the Python equivalent), you could amend the .xsd files to 'accommodate' your input docs. Again, a hack.

The whole notion of schemas as a rigorous definition of allowable documents for a vocabulary is a double-edged sword. Great when they work, but awkward in exactly your situation.

If I find any details on how to do 'note failure, continue parse', in the Java tools at least, I'll follow up. Like I say, Python is not my arena.

Stuart

On 05/03/2016 04:00 PM, Matthew Hall wrote:
> I am running into issues parsing STIX Packages containing corrupted Indicators
> and/or Observables reliably with python-stix.
>
> Performing some research on the python-stix code, it appears there is not a
> good way to catch exceptions at a very granular, per-entity level.
>
> There is some code in the stix.utils.parser module, which in theory seems like
> it would help with this, but it doesn't appear to have granular
> exception-catching capability either.
>
> Therefore, when the code comes across a CybOX FileObj w/ a bogus
> Size_In_Bytes, the exception disrupts parsing the entire STIX Package not just
> the corrupted / invalid entity:
>
> <FileObj:Size_In_Bytes condition="Equals">380058 bytes</FileObj:Size_In_Bytes>
>
> ValueError: invalid literal for long() with base 10: '380058 bytes'
>   File ".../venv/lib/python2.7/site-packages/cybox/common/properties.py", line 514, in _parse_value
>     return long(value, 0)
>
> How can I perform a best-effort parse with python-stix in order to operate as
> properly as possible in such situations?
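
For what it's worth, the 'sanitize before parsing' step mentioned in the reply can be done with a tree-aware pass instead of awk/sed. The following is only a rough sketch, not anything python-stix provides: the helper names (`local_name`, `parses_as_int`, `repair_integer_fields`) are made up for illustration, the namespace URI in the sample is a placeholder rather than the real CybOX one, and the policy (keep the leading digits, otherwise drop the element) is just one possible choice. It matches on the element's local name, so it does not depend on which prefix the document uses.

```python
import re
import xml.etree.ElementTree as ET

LEADING_INT = re.compile(r"\s*(\d+)")

# Placeholder document; the namespace URI is NOT the real CybOX one.
SAMPLE = """<root xmlns:FileObj="http://example.com/file">
  <FileObj:Size_In_Bytes condition="Equals">380058 bytes</FileObj:Size_In_Bytes>
  <FileObj:Size_In_Bytes>1024</FileObj:Size_In_Bytes>
  <FileObj:Size_In_Bytes>unknown</FileObj:Size_In_Bytes>
</root>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def parses_as_int(text):
    """True if the text already looks like an integer literal."""
    stripped = text.strip()
    try:
        int(stripped, 0)  # base 0 also accepts forms such as 0x1F
        return True
    except ValueError:
        return re.fullmatch(r"\d+", stripped) is not None


def repair_integer_fields(root, names=("Size_In_Bytes",)):
    """Repair or drop integer-valued elements in place.

    Returns a list of (old_text, new_text) pairs; new_text is None when
    the element was removed because no leading digits could be found.
    """
    parents = {child: parent for parent in root.iter() for child in parent}
    changes = []
    for el in list(root.iter()):  # snapshot, since elements may be removed
        # Comments and processing instructions have non-string tags.
        if not isinstance(el.tag, str) or local_name(el.tag) not in names:
            continue
        text = el.text or ""
        if parses_as_int(text):
            continue
        match = LEADING_INT.match(text)
        if match:
            el.text = match.group(1)
            changes.append((text, match.group(1)))
        elif el in parents:
            parents[el].remove(el)
            changes.append((text, None))
    return changes


root = ET.fromstring(SAMPLE)
for old, new in repair_integer_fields(root):
    print("repaired %r -> %r" % (old, new))
```

One caveat if you try this on real STIX: the standard-library ElementTree may rewrite namespace prefixes when it serializes the tree again, which can matter for documents that carry prefixed names inside attribute values such as xsi:type. lxml elements expose the same iter()/tag/text interface and keep the original prefixes, so the function above should work on an lxml tree too, but I have not tested it against python-stix itself.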