RE: [sarif] partialFingerprints: the words the world has been waiting for

Hi Michael,

Before reading Yekaterina’s follow-on, I wanted to say that I understand your approach. You are saying that once you have decided that a result in today's build is logically the same as a result in yesterday's build, there's no need to persist a “fingerprint” that essentially captures the result of your comparison. Instead, you just stamp the two “logically identical” results with the same id.

If we settle on this model, I suggest that we not use result.id for this purpose. Instead, I would introduce a new property, result.correlationId. Every result in every run would still have a unique result.id; otherwise, a result management system could store only one of a set of “logically identical” results.

I would modify your step 5 as follows:

5) For each result in the current run: if it does not match a result in the baseline run, generate a new GUID and assign it to result.correlationId. If it does match a result in the baseline run, copy the baseline result’s correlationId to the new result. In either case, update result.baselineState in the current result appropriately.
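A minimal sketch of the modified step 5, assuming results are plain dictionaries and that the matching decision is supplied by the caller. The function name, the `matches` callable, and the "new" / "existing" strings used for baselineState are illustrative assumptions, not text from the spec:

```python
import uuid


def assign_correlation_ids(current_results, baseline_results, matches):
    """Stamp each current result with a correlationId and a baselineState.

    `matches(current, baseline)` is a hypothetical callable that returns
    True when the two results are judged logically identical.
    """
    for result in current_results:
        # Find the first baseline result that the matcher accepts, if any.
        matched = next(
            (b for b in baseline_results if matches(result, b)), None
        )
        if matched is None:
            # No counterpart in the baseline: mint a fresh GUID.
            result["correlationId"] = str(uuid.uuid4())
            result["baselineState"] = "new"  # assumed value
        else:
            # Same logical result: carry the baseline's correlationId over.
            result["correlationId"] = matched["correlationId"]
            result["baselineState"] = "existing"  # assumed value
    return current_results
```

Note that result.id is never touched here, so every result keeps its own unique id while logically identical results share a correlationId.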

Now to read Yekaterina’s message.

Larry

From: Michael Fanning <Michael.Fanning@microsoft.com>
Sent: Wednesday, May 2, 2018 5:47 PM
To: Larry Golding (Comcast) <larrygolding@comcast.net>; sarif@lists.oasis-open.org
Subject: RE: [sarif] partialFingerprints: the words the world has been waiting for

My thinking, which I tried to articulate in today’s discussion (more or less successfully), is that result matching is not a matter of comparing one previously computed fingerprint to another. Instead, result matching is a complex algorithm that tries to stitch various results together. If it fails to produce an exact match, the algorithm may fall back to partial fingerprints, which are essentially logical- and physical-location-free things that may still help determine issue identity. (In practice, a result matcher might still have a notion of two files that should be compared for the match, but have lost all other useful intra-file location details.)

With the definition above, a partial fingerprint is partial in the sense that it is a speculative match that doesn’t benefit from other data that would increase confidence in a match. It is also a contribution, as per our previous definition, in the sense that you might try to glue this information to whatever else you have (such as a file name, where you’ve lost the location details).

I think the most significant impact of the reorientation above is on how we think of result.fingerprints. This data now truly becomes mostly a placeholder for data produced by legacy formats. We wouldn’t expect fingerprints to be populated by a result management system. Instead, this is what we’d see:

1) A SARIF baseline is loaded, which the result management system has populated with instance ids (a GUID, for example).
2) A new SARIF log is loaded. The stable ids match between these, so they are candidates to compare.
3) The result matcher runs an elaborate algorithm to try to correlate results. This includes things like remapping file names, loading the files, and running standard line-level diff algorithms to find matched, moved, new and deleted lines.
4) After identifying exact matches based on file diff (and other precise locators such as fully qualified logical name), the result matching algorithm falls back to partial fingerprints (such as surrounding context region) to make a match.
5) For all matches, when found, the instance id from the baseline flows to the newer SARIF. We also update the baseline state.

And that’s it. At no point does it seem critical to populate the fingerprints object. You could imagine the fingerprints of the baseline log file containing some fingerprints that will always match if file name + physical location details haven’t changed. But how useful is that? (We already have file hashes to tell us this.) If you have to diff two files anyway to overcome line churn, the extra work of prepopulating and storing fingerprints might not pay for itself.
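The matching steps above can be sketched roughly as follows. This is a simplified illustration, not a real result matcher: it assumes a single file, results as dictionaries with hypothetical "ruleId", 0-based "line", and "partialFingerprints" keys, and it uses Python's difflib as a stand-in for the line-level diff. File-name remapping and logical-name matching are left out.

```python
import difflib


def map_lines(old_lines, new_lines):
    """Map baseline line numbers to current line numbers for unchanged lines."""
    mapping = {}
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


def match_results(baseline, current, old_lines, new_lines):
    """Return {index in current: index in baseline} for matched results."""
    line_map = map_lines(old_lines, new_lines)
    pairs = {}
    free = set(range(len(baseline)))  # baseline results not yet claimed

    # Stage 1: precise match, same rule on a line the diff says is unchanged.
    for i, cur in enumerate(current):
        for j in sorted(free):
            base = baseline[j]
            mapped = line_map.get(base.get("line"))
            if (base["ruleId"] == cur["ruleId"]
                    and mapped is not None
                    and mapped == cur.get("line")):
                pairs[i] = j
                free.discard(j)
                break

    # Stage 2: fall back to partial fingerprints for whatever is left.
    for i, cur in enumerate(current):
        if i in pairs:
            continue
        for j in sorted(free):
            base = baseline[j]
            fp = base.get("partialFingerprints")
            if (base["ruleId"] == cur["ruleId"]
                    and fp
                    and fp == cur.get("partialFingerprints")):
                pairs[i] = j
                free.discard(j)
                break
    return pairs
```

For every pair returned, the baseline's instance id would then be copied to the current result and the baseline state updated; unmatched current results would be treated as new.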

Michael

From: sarif@lists.oasis-open.org <sarif@lists.oasis-open.org> On Behalf Of Larry Golding (Comcast)
Sent: Wednesday, May 2, 2018 5:23 PM
To: sarif@lists.oasis-open.org
Subject: [sarif] partialFingerprints: the words the world has been waiting for

For a long time we’ve agreed that partialFingerprints shouldn’t include information that’s deducible from the SARIF file, but the spec has never said so. As part of the “fingerprints” draft that I just merged and pushed, Appendix B now says the magic words:

An analysis tool SHALL NOT include in partialFingerprints information that a result management system could deduce from other information in the SARIF file, for example, file hashes. Rather, the result management would use such information, along with partialFingerprints, in its computation of fingerprints.

I understand that our vision of partialFingerprints is still evolving, but this will do for now.
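As an illustration of the division of labor that the Appendix B text describes, here is a sketch of how a result management system might fold a file hash into a fingerprint alongside the tool-supplied partialFingerprints. The function name, the choice of SHA-256, and the JSON canonicalization are all assumptions made for the example; the spec does not prescribe any of them.

```python
import hashlib
import json


def compute_fingerprint(rule_id, file_hash, partial_fingerprints):
    """Combine deducible data (the file hash) with partialFingerprints.

    The tool supplies only `partial_fingerprints`; the result management
    system adds `file_hash`, which it can read from the SARIF file itself.
    """
    payload = json.dumps(
        {
            "ruleId": rule_id,
            "fileHash": file_hash,
            "partialFingerprints": partial_fingerprints,
        },
        sort_keys=True,  # stable serialization regardless of key order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```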

Larry
