[analyzer] exploration strategies and paths


George Karpenkov via cfe-dev
Hi All,

I was recently investigating bug reports with very long analyzer paths (more than a few hundred nodes).
In many such cases the path is long for no good reason: namely, the analyzer goes around a loop three times before
going further.
The issue is surprisingly common, and it was exacerbated by a recent bump of the analyzer thresholds.

The problem is reproduced on the following file:

```
extern int coin();

int foo() {
    int *x = 0;
    while (coin()) {
        if (coin())
            return *x;
    }
    return 0;
}

void bar() {
    while(coin())
        if (coin())
            foo();
}
```

While the shortest path to the error does not go around the loop at all, the current version of the analyzer
goes around the loop three times before going further
(and we are quite fortunate that the unrolling limit for loops is three; otherwise it would keep going
until the unrolling limit is reached).

Multiple issues were discovered during the investigation.

1. The analyzer worklist has no concept of priority and performs a simple DFS by default.
Thus, if the successor of the if-branch under the loop in "bar" that contains the desired destination is generated second,
it will not be evaluated until the loop exploration limit is exhausted.

2. The previous issue slows down the exploration, but on its own it is not enough to produce the pathological behavior of ultra-long paths.
The second problem is a combination of:
a) The block counter is not part of a node's identity, so a node A with a small block counter can be merged into a node B with a large block counter,
and the resulting node keeps the block counter of B.
b) The issue in (a) is triggered by our heuristic of abandoning a function's exploration and switching to conservative evaluation
if the block limit is reached while we are already *inside* the function.

Issue (1) combined with (2-b) causes the problematic behavior: the bug is discovered on the longest path first,
and by the time the shortest path gets to "bar", the block limit has already been reached, so the switch to conservative evaluation is performed.
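
To make the interaction concrete, below is a toy model (illustrative C++ only, not the analyzer's actual worklist code) that simulates a LIFO worklist exploring the while-loop in "bar" with an unroll limit of 4:

```
// Toy model: each state records which loop iteration it belongs to.
// On every iteration, if(coin()) produces a true-branch successor
// (which enters foo()) and a false-branch successor (which goes back
// to the loop). The false branch is pushed last, so LIFO pops it first.
#include <iostream>
#include <stack>
#include <vector>

struct State {
  int Iteration;
  bool CallsFoo; // true-branch of if(coin()): enters foo()
};

int main() {
  const int UnrollLimit = 4;
  std::stack<State> Worklist; // DFS: last in, first out
  Worklist.push({1, true});
  Worklist.push({1, false}); // popped first

  std::vector<int> FooEntryOrder;
  while (!Worklist.empty()) {
    State S = Worklist.top();
    Worklist.pop();
    if (S.CallsFoo) {
      FooEntryOrder.push_back(S.Iteration);
    } else if (S.Iteration < UnrollLimit) {
      // Back edge: the next iteration's branches land on top of the stack.
      Worklist.push({S.Iteration + 1, true});
      Worklist.push({S.Iteration + 1, false});
    }
  }
  for (int I : FooEntryOrder)
    std::cout << I << ' '; // prints: 4 3 2 1
  std::cout << '\n';
}
```

The toy prints "4 3 2 1": the true-branch that enters "foo" is reached on the deepest iteration first, which is exactly the longest-path-first behavior described above.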

Thus there are two mitigation strategies currently being evaluated:

i) Remove the heuristic in (2-b).
ii) Use a priority queue to hold the nodes that should be explored, preferring nodes that give new source-code coverage over others
(or, alternatively, preferring nodes with the smallest loop-stack depth).

Artem and I have evaluated option (i), and the results were surprisingly good: some reports disappear, and slightly more new reports appear.
The quality of the new reports seems slightly better, and I am still trying to figure out the exact reasons.
I suspect the merges resulting from heuristic (2-b) cause us to lose some actually valid reports.

Option (ii) has not been fully evaluated yet, but current experiments show slightly more reports (5-10%) and a radical decline in report lengths
(e.g. from 400+ nodes to under 100 for the largest reports).

Are there any thoughts on the matter?

Personally, I think we should do both (i) and (ii), even if they shake up the results.
- The original idea behind heuristic (2-b) was to be able to produce a report even when we are out of budget, but since it actually results in fewer reports,
I think the data does not validate the approach.

- Option (ii) is, AFAIK, how most similar engines work, and it should get us much larger coverage (and shorter paths) for the same node budget,
even at the cost of the O(log N) overhead of the priority queue. Moreover, not having a priority queue will bite us later if we ever decide to further
increase the analyzer budget or the unroll limit.

George

 


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev

On 29/01/2018 4:12 PM, George Karpenkov via cfe-dev wrote:
> Hi All,
>
> I was recently investigating bug reports with very long analyzer paths (more than a few hundred nodes).
> In many such cases the path is long for no good reason: namely, the analyzer goes around a loop three times before
> going further.
> The issue is surprisingly common, and it was exacerbated by a recent bump of the analyzer thresholds.

Yeah, I guess everybody who has used the analyzer has seen some of those
nasty reports that iterate over a loop 4 times. Why does it
find the issue on the last iteration rather than on the first,
given that we use a depth-first strategy? So it's a great, long-overdue
thing to fix.

George, do you have any non-internal before/after HTML reports to attach?

> [...the reproduction example snipped; see above...]
>
> Multiple issues were discovered during the investigation.
>
> 1. The analyzer worklist has no concept of priority and performs a simple DFS by default.
> Thus, if the successor of the if-branch under the loop in "bar" that contains the desired destination is generated second,
> it will not be evaluated until the loop exploration limit is exhausted.
>
> 2. The previous issue slows down the exploration, but on its own it is not enough to produce the pathological behavior of ultra-long paths.
> The second problem is a combination of:
> a) The block counter is not part of a node's identity, so a node A with a small block counter can be merged into a node B with a large block counter,
> and the resulting node keeps the block counter of B.
> b) The issue in (a) is triggered by our heuristic of abandoning a function's exploration and switching to conservative evaluation
> if the block limit is reached while we are already *inside* the function.
>
> Issue (1) combined with (2-b) causes the problematic behavior: the bug is discovered on the longest path first,
> and by the time the shortest path gets to "bar", the block limit has already been reached, so the switch to conservative evaluation is performed.

2-a is not even required here.

With our DFS exploration order, on every iteration of the while-loop
within bar(), we'd take the false-branch of the if() within bar() from the
worklist, see that it goes back to the loop, and end up with the new
true-branch and false-branch nodes of the next iteration on top of the
worklist. Then we pop the false-branch again, etc., until we exhaust the
block count limit while having 4 true-branches in the worklist. Those
therefore evaluate in the opposite order, so the first time we enter
foo() we are on the 4th iteration.

By slightly modifying the example, this situation can be made to happen
regardless of the order in which we evaluate the if()-branches. So even
if the idea in the previous paragraph is unclear, it should still be
obvious that we'd sometimes run into a function call on the longer path
earlier than on a shorter path.

Now, once we enter foo() and immediately find the bug, we also run out
of the block count limit within foo(). Recall that we are on the 4th
iteration of the while-loop in bar(), and this is where the bug is
found. Once the evaluation of foo() is over, we record that we failed
to fully inline it, conclude that it's probably too complex, and decide
to evaluate it conservatively from now on.

It means that on the 3rd, 2nd, and 1st iterations we won't be able to find the
bug, because foo() is evaluated conservatively. So we're stuck with the
long report forever.

> Thus there are two mitigation strategies currently being evaluated:
>
> i) Remove the heuristic in (2-b).
> ii) Use a priority queue to hold the nodes that should be explored, preferring nodes that give new source-code coverage over others
> (or, alternatively, preferring nodes with the smallest loop-stack depth).
>
> Artem and I have evaluated option (i), and the results were surprisingly good: some reports disappear, and slightly more new reports appear.
> The quality of the new reports seems slightly better, and I am still trying to figure out the exact reasons.

Yeah, I guess some explanation is necessary here. The skew of results is
pretty huge, and it's surprising that the number of reports actually
increases.

Just to be clear, both the replay-without-inlining and the
dont-inline-again-after-bailout heuristics were disabled in this test.

> I suspect the merges resulting from heuristic (2-b) cause us to lose some actually valid reports.

Because replay-without-inlining is disabled, there should not be many
merges.

> Option (ii) has not been fully evaluated yet, but current experiments show slightly more reports (5-10%) and a radical decline in report lengths
> (e.g. from 400+ nodes to under 100 for the largest reports).
>
> Are there any thoughts on the matter?
>
> Personally, I think we should do both (i) and (ii), even if they shake up the results.
> - The original idea behind heuristic (2-b) was to be able to produce a report even when we are out of budget, but since it actually results in fewer reports,
> I think the data does not validate the approach.
>
> - Option (ii) is, AFAIK, how most similar engines work, and it should get us much larger coverage (and shorter paths) for the same node budget,
> even at the cost of the O(log N) overhead of the priority queue. Moreover, not having a priority queue will bite us later if we ever decide to further
> increase the analyzer budget or the unroll limit.

In the example above, (ii) means evaluating the first true-branch of the
if() in bar() before the second false-branch of the if() in bar(),
simply because it's *on an earlier loop iteration*. This indeed sounds
like the right thing to do; hopefully we'll be able to confirm it with a
more careful evaluation.


Re: [analyzer] exploration strategies and paths

Péter Szécsi via cfe-dev
Hi George and Artem,

2018-01-30 1:44 GMT+01:00 Artem Dergachev via cfe-dev <[hidden email]>:

> [...George's original message and the beginning of Artem's reply snipped...]
>
> Now, once we enter foo() and immediately find the bug, we also run out of the
> block count limit within foo(). Recall that we are on the 4th iteration of the
> while-loop in bar(), and this is where the bug is found. Once the evaluation
> of foo() is over, we record that we failed to fully inline it, conclude that
> it's probably too complex, and decide to evaluate it conservatively from now on.
>
> It means that on the 3rd, 2nd, and 1st iterations we won't be able to find the
> bug, because foo() is evaluated conservatively. So we're stuck with the long
> report forever.
I do not exactly see why. If I'm not mistaken, you described the replay-without-inlining heuristic. However, I believe this information is stored in the state, which means it only affects one path (in this example, the path that finds the bug on the 4th iteration). But whenever we simulate the path of the 3rd/2nd/1st iteration where if(coin()) is true, that is another path. Maybe I am just missing something trivial or misunderstood something, but could you explain why it affects other paths?


> [...snip...]
>
> Yeah, I guess some explanation is necessary here. The skew of results is
> pretty huge, and it's surprising that the number of reports actually increases.
Just to make sure: does the number of actually *different* reports increase? With a missing uniquing location, a checker could generate a lot of "spam", i.e. find and report the same bug on different paths. (And turning off these heuristics could lead to that. You have probably already checked this, but it seems really suspicious.)

Another interesting statistic for (i) would be how many times we hit the maximum ExplodedGraph size (maximum number of steps), and how many more steps we take because these heuristics are turned off. I feel like this change should result in higher analysis time but less coverage overall; however, I am not sure how significant these changes are. So if you manage to find more valuable bugs, then these changes are worth it, I guess.
 
> [...snip...]
>
> In the example above, (ii) means evaluating the first true-branch of the if()
> in bar() before the second false-branch of the if() in bar(), simply because
> it's *on an earlier loop iteration*. This indeed sounds like the right thing
> to do; hopefully we'll be able to confirm it with a more careful evaluation.
Option (ii) is something I personally really support and would like to see implemented in the analyzer. I was already thinking about this change earlier, but it did not seem easy to come up with a reliable heuristic for it. The aim is clear and great, but how would you do it? Would you rank the possibilities based on the number of visits to the blocks that would be analyzed next?


Thanks,
Peter



Re: [analyzer] exploration strategies and paths

George Karpenkov via cfe-dev

> Option (ii) is something I personally really support and would like to see implemented in the analyzer. I was already thinking about this change earlier, but it did not seem easy to come up with a reliable heuristic for it. The aim is clear and great, but how would you do it? Would you rank the possibilities based on the number of visits to the blocks that would be analyzed next?

In the queue we already have a counter for the number of times a given block was visited along the path.
The implementation I am currently playing with, when comparing two paths, simply picks the one that has visited its current CFGBlock fewer times along its path.
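
As a sketch, the ordering could look roughly like the following (illustrative names only, under the assumption that each worklist entry can report how many times its path has visited its current CFGBlock):

```
#include <queue>
#include <vector>

// Illustrative stand-in for a worklist entry; the analyzer tracks the
// per-path visit count in its block counters, but the names here are
// made up for the sketch.
struct WorkItem {
  unsigned BlockVisitsOnPath; // times this path has visited its CFGBlock
  unsigned NodeId;            // stand-in for the actual node payload
};

// Explore the entry whose CFGBlock has been visited *fewer* times along
// its own path first.
struct PreferUnexplored {
  bool operator()(const WorkItem &A, const WorkItem &B) const {
    // std::priority_queue pops the largest element, so a higher visit
    // count must compare as lower priority.
    return A.BlockVisitsOnPath > B.BlockVisitsOnPath;
  }
};

using CoverageFirstQueue =
    std::priority_queue<WorkItem, std::vector<WorkItem>, PreferUnexplored>;
```

With this ordering, the true-branch node from the 1st loop iteration is popped before the false-branch node of a later iteration, which is exactly what option (ii) asks for.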






Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev


On 29/01/2018 5:36 PM, Péter Szécsi wrote:

> [...snip...]
>
> I do not exactly see why. If I'm not mistaken, you described the
> replay-without-inlining heuristic. However, I believe this information is
> stored in the state, which means it only affects one path (in this example,
> the path that finds the bug on the 4th iteration). But whenever we simulate
> the path of the 3rd/2nd/1st iteration where if(coin()) is true, that is
> another path. Maybe I am just missing something trivial or misunderstood
> something, but could you explain why it affects other paths?

Nope, it's not part of the program state. It's in FunctionSummaries,
which is a field in CoreEngine. So whenever we find a function we were
unable to inline even once during analysis, we'd never inline it anymore
on any branch. See stuff around markReachedMaxBlockCount().


Re: [analyzer] exploration strategies and paths

Péter Szécsi via cfe-dev


2018-01-30 2:43 GMT+01:00 Artem Dergachev <[hidden email]>:


> [...snip...]
>
> Nope, it's not part of the program state. It's in FunctionSummaries, which is
> a field in CoreEngine. So whenever we find a function we were unable to
> inline even once during analysis, we'd never inline it anymore on any branch.
> See stuff around markReachedMaxBlockCount().
Hmm, I had missed that, but I've checked it now. Thanks! (I'm a little confused then about why the ReplayWithoutInlining data is stored in the state as well. It seems it is used only for sanity checks, but I haven't dug deep into it yet.) In that case I would suggest an alternative idea: instead of removing the heuristic, the flag could go into the ProgramState. It does make sense to me not to re-inline functions that will probably exceed the block count (again). However, this would still solve the above-mentioned problem, since we would find the bug on the shorter path as well. (If the reports are deduplicated as the same bug, the analyzer reports the one with the shorter bug path, I believe.) What do you think?
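
As a sketch of the idea (REGISTER_SET_WITH_PROGRAMSTATE is the macro checkers use to attach a set to the ProgramState; the surrounding names and hooks are made up for illustration, and whether the engine itself could reuse this machinery is an open question):

```
// Hypothetical sketch: record "do not re-inline" decisions per path
// instead of engine-wide.
#include "clang/AST/Decl.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"

using namespace clang;
using namespace ento;

// A per-path set of functions we have bailed out of on *this* path.
REGISTER_SET_WITH_PROGRAMSTATE(BailedOutFunctions, const FunctionDecl *)

// On bailout: mark the function in the current path's state only.
static ProgramStateRef markBailedOut(ProgramStateRef State,
                                     const FunctionDecl *FD) {
  return State->add<BailedOutFunctions>(FD);
}

// Inlining decision: other paths, whose states lack the entry, may still
// inline foo() and find the bug on the shorter path.
static bool mayInline(ProgramStateRef State, const FunctionDecl *FD) {
  return !State->contains<BailedOutFunctions>(FD);
}
```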
 


Thanks,
Peter





Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev


On 29/01/2018 6:24 PM, Péter Szécsi wrote:

> [...snip...]
>
> Hmm, I had missed that, but I've checked it now. Thanks! (I'm a little
> confused then about why the ReplayWithoutInlining data is stored in the
> state as well. It seems it is used only for sanity checks, but I haven't
> dug deep into it yet.)

I guess there are two parts to the heuristic: the
replay-without-inlining-like-I-mean-right-now part and the
don't-ever-try-to-inline-this-thing-again-I'm-sick-of-it-already part. The
former is what we need to do immediately on the current path, so it
needs support on the program-state side to remember that we need to
re-inline the call. I'm not sure it's *actually* needed, though. The latter
is path-insensitive.


> In that case I would suggest an alternative idea: instead of removing the
> heuristic, the flag could go into the ProgramState. It does make sense to me
> not to re-inline functions that will probably exceed the block count (again).
> However, this would still solve the above-mentioned problem, since we would
> find the bug on the shorter path as well. (If the reports are deduplicated as
> the same bug, the analyzer reports the one with the shorter bug path, I
> believe.) What do you think?

This is indeed a more careful, middle-ground change than cutting out the
heuristic entirely, but I guess the path-insensitivity is an intentional
part of the heuristic: we are unlikely to be able to fully explore the
function on the other path anyway, so why bother. We'd still need to
understand why the positives skew in a particular direction, but that's
definitely an option.


Re: [analyzer] exploration strategies and paths

Gábor Horváth via cfe-dev
Hi George, Artem,

I am glad that you are looking into this problem!

On 30 January 2018 at 01:12, George Karpenkov via cfe-dev <[hidden email]> wrote:
> [...snip...]
>
> 2. The previous issue slows down the exploration, but on its own it is not
> enough to produce the pathological behavior of ultra-long paths.
> The second problem is a combination of:
> a) The block counter is not part of a node's identity, so a node A with a
> small block counter can be merged into a node B with a large block counter,
> and the resulting node keeps the block counter of B.

Sorry for the questions; I just wanted to clarify some things. By nodes, you mean ExplodedNodes? And by merge, do you mean the same thing as "caching-out"?
 
> [...snip...]
>
> Artem and I have evaluated option (i), and the results were surprisingly
> good: some reports disappear, and slightly more new reports appear.
> The quality of the new reports seems slightly better, and I am still trying
> to figure out the exact reasons.
> I suspect the merges resulting from heuristic (2-b) cause us to lose some
> actually valid reports.

I also find the results surprising. If you learn more about the reasons, please do not forget to follow up in this thread. We are curious :)
 

Option (ii) has not been evaluated fully yet, but current experiments show slightly more reports (5-10%), and a radical decline in report lengths
(e.g. from 400+ to <100 for largest reports)

[...]

- Option (ii) is AFAIK how most similar engines work, and should get us much larger coverage (and shorter paths) for the same node budget,
even at the cost of log(N) overhead of the priority queue. Moreover, not having the priority queue will bite us later if we ever decide to further
increase the analyzer budget or to increase the unroll limit.

I wonder what the performance implications will be. But I also like the idea of having a priority queue. If we find that we get more and better reports
but also have worse performance, we can consider reducing the analysis budget slightly.

Regards,
Gábor
 


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev


On 30/01/2018 12:40 PM, Gábor Horváth via cfe-dev wrote:

> [...]
> Sorry for the questions, just wanted to clarify some things. You mean
> ExplodedNodes? By merge, you mean the same thing as "caching-out"?

Yeah, George noticed that if we construct the same ExplodedNode on two
different paths that have different block counts, we'd cache out on the
latter path, while the worklist element of the first path would still
possess the original block count.

Which happens a lot when we're evaluating foo() conservatively in this
example.

This isn't directly related to our problem though, as I noticed in
http://lists.llvm.org/pipermail/cfe-dev/2018-January/056719.html .
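For illustration, here is a minimal standalone sketch of the cache-out mechanics described above. All the names (NodeKey, WorkItem, the program points) are invented for this sketch and do not correspond to the actual ExplodedGraph/WorkList classes; node identity is modeled as just (program point, state), with the block count living only in the worklist entry:

```cpp
// Minimal sketch: the block count lives in the worklist entry, not in the
// node identity, so a later path with a smaller count "caches out".
#include <deque>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Hypothetical node identity: (program point, symbolic state).
using NodeKey = std::pair<std::string, std::string>;

struct WorkItem {
  NodeKey Node;
  int BlockCount; // per-path counter, deliberately not part of NodeKey
};

int main() {
  // node -> block count of the path that created it first
  std::map<NodeKey, int> Graph;
  std::deque<WorkItem> Worklist = {
      {{"bar:loop-head", "x == 0"}, 7}, // long path, generated first (DFS)
      {{"bar:loop-head", "x == 0"}, 1}, // short path, generated later
  };

  while (!Worklist.empty()) {
    WorkItem W = Worklist.front();
    Worklist.pop_front();
    auto [It, Inserted] = Graph.try_emplace(W.Node, W.BlockCount);
    if (!Inserted) {
      // Same (point, state) already exists: this path caches out, and the
      // node keeps the *first* path's (larger) block count.
      std::cout << "cached out; node keeps count " << It->second
                << ", path's count " << W.BlockCount << " is lost\n";
      continue;
    }
    std::cout << "new node explored with block count " << W.BlockCount << "\n";
  }
}
```

Running this, the long path creates the node with block count 7 and the short path caches out against it, which is exactly the merge described in (2-a).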



Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev
Preliminary evaluation of a patch which prefers to first explore nodes associated with statements that have not been seen before (a minimal sketch of the idea follows the numbers below):

On openssl:

 - Adds four reports
 - Removes four reports 
 - Path lengths before: 317, 75, 75, 72, 70, 58, 50, 50, 44, 36, 23, 23, 21, 21, 20, 20, 19, 19, 19, 19, 18, 18, 18, 16, 15, 15, 15, 14, 13, 13, 12, 11, 11, 9, 7, 7, 6, 4
 - Path lengths after: 72, 60, 59, 53, 53, 52, 46, 38, 37, 30, 29, 28, 23, 21, 20, 19, 19, 19, 19, 19, 18, 16, 15, 15, 15, 15, 13, 13, 12, 12, 11, 9, 8, 7, 7, 7, 6, 4

The quality of the added reports seems higher, mainly because the paths are shorter.

On postgresql:

 - Added 80 reports
 - Removed 154 reports
   -> Out of those, 72 are reports on yacc/bison autogenerated files, so whatever the cause, it is good that they are gone
 - The overall number of reports is 1188
 - Path lengths are lower overall, but not as dramatically
 - For many reports, I am quite confused as to why they got removed

On sqlite:

 - 7 inserted, 7 removed
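
For reference, here is roughly what "prefer statements which weren't seen before" could look like, as a self-contained sketch; the two-queue structure and all names here are invented for illustration and are not taken from the patch:

```cpp
// Sketch of a two-tier worklist: nodes whose statement has not been
// covered yet are dequeued before nodes that only revisit old statements
// (e.g. extra trips around a loop).
#include <deque>
#include <iostream>
#include <set>
#include <string>

struct WorkItem {
  std::string Stmt; // stand-in for the node's associated statement
};

class CoverageFirstWorkList {
  std::deque<WorkItem> Fresh;    // statements seen for the first time
  std::deque<WorkItem> Revisits; // everything else
  std::set<std::string> Seen;

public:
  void enqueue(WorkItem W) {
    if (Seen.insert(W.Stmt).second)
      Fresh.push_back(std::move(W)); // new coverage: explore first
    else
      Revisits.push_back(std::move(W));
  }
  bool empty() const { return Fresh.empty() && Revisits.empty(); }
  WorkItem dequeue() {
    std::deque<WorkItem> &Q = Fresh.empty() ? Revisits : Fresh;
    WorkItem W = Q.front();
    Q.pop_front();
    return W;
  }
};

int main() {
  CoverageFirstWorkList WL;
  WL.enqueue({"while (coin())"}); // loop head, first trip
  WL.enqueue({"while (coin())"}); // second trip: a revisit, deferred
  WL.enqueue({"return *x;"});     // new statement: jumps ahead of the revisit
  while (!WL.empty())
    std::cout << WL.dequeue().Stmt << "\n";
}
```

A two-tier scheme like this keeps O(1) queue operations; a general priority queue over arbitrary priorities would instead pay the log(N) overhead mentioned earlier in the thread.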


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev
In reply to this post by Artem Dergachev via cfe-dev


> On Jan 29, 2018, at 4:12 PM, George Karpenkov via cfe-dev <[hidden email]> wrote:
> [...]
> Personally I think we should do both (i) and (ii), even if they would shake up the results.
> - The original idea for heuristics (2-b) was to be able to produce a report even if we are out of budget, but since it actually results in less reports,
> I think the data does not validate the approach.
Attached is some of the data I used to justify this change in early 2012 when we just turned on inlining and were looking at heuristics to increase coverage. I did see more reports (10% increase in reports per second). There were more heuristics added after that.

>
> - Option (ii) is AFAIK how most similar engines work, and should get us much larger coverage (and shorter paths) for the same node budget,
> even at the cost of log(N) overhead of the priority queue. Moreover, not having the priority queue will bite us later if we ever decide to further
> increase the analyzer budget or to increase the unroll limit.
>

What is the memory and performance impact of using the priority queue?





Attachment: Inlining_numbers.pdf (29K)

Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev
In reply to this post by Artem Dergachev via cfe-dev
After fixing a few bugs, another evaluation of the approach shows considerably better results.

On openssl:

 - 9 reports added
 - 1 report removed

Note on histograms (here and below)

 -> Histograms only show the ratio for the same bugs (compared using issue hash);
that is, if the histogram says “decrease by a factor of three”, it means the new approach finds the *same* bug
with a path one third the size of the original
 -> Histograms omit data points where the path length has remained the same
 (as otherwise they completely dominate the histogram)
 -> Relative histograms are provided as both ratio and logarithm of the ratio.
Logarithms of the ratio are convenient as they are symmetric in case changes balance out
(e.g. log(1/2) = -log(2/1))
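
To make the log-ratio convention concrete, here is a throwaway sketch with made-up sample data (this is not the evaluation tooling used here; base-2 logarithm is chosen arbitrarily, any base gives the same symmetry):

```cpp
// Bucket same-bug path length changes by log2(before / after);
// the transform is symmetric: log2(1/2) == -log2(2/1).
#include <cmath>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

int main() {
  // (path length before, path length after) per issue hash -- sample data.
  std::vector<std::pair<int, int>> Lengths = {
      {317, 72}, {75, 60}, {20, 20}, {13, 26}};
  std::map<long, int> Histogram; // rounded log2 ratio -> count
  for (auto [Before, After] : Lengths) {
    if (Before == After)
      continue; // unchanged lengths are omitted, as noted above
    double LogRatio = std::log2(double(Before) / After);
    ++Histogram[std::lround(LogRatio)];
  }
  for (auto [Bucket, Count] : Histogram)
    std::cout << "log2 ratio ~ " << Bucket << ": " << Count << "\n";
}
```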

Relative path length change for same bugs (path size before / path size after): [histogram image omitted]

Log of relative path length change (log(path size before / path size after)): [histogram image omitted]

Absolute change in path length (path size before - path size after): [histogram image omitted]
On postgresql:

 - 377 reports added
 - 43 reports removed

Relative path length change (path size before / path size after): [histogram image omitted]

Log of relative path length change (log(path size before / path size after)): [histogram image omitted]

Absolute change in path length (path size before - path size after): [histogram image omitted]

On sqlite3 + a few other misc files:

 - 239 reports added
 - 1 report removed

Relative path length change (path size before / path size after): [histogram image omitted]

Log of relative path length change (log(path size before / path size after)): [histogram image omitted]

Absolute path length change (path size before - path size after): [histogram image omitted]


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev
In reply to this post by Artem Dergachev via cfe-dev
The list did not like a posting with many images,
so I have posted an evaluation to phabricator: https://reviews.llvm.org/M1

The text part was:

[... same evaluation text as in the previous message ...]


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev
Patch available at https://reviews.llvm.org/D42775


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev
In reply to this post by Artem Dergachev via cfe-dev

On Jan 31, 2018, at 5:00 PM, George Karpenkov <[hidden email]> wrote:

The list did not like a posting with many images,
so I have posted an evaluation to phabricator: https://reviews.llvm.org/M1

[...]
On sqlite3 + a few other misc files:

  • 239 reports added
This is a lot of additional reports! Are there actually that many bugs in that codebase that need to be fixed? I think we need to manually evaluate these reports.
Also, we have to make sure uniquing works. Do all regression tests pass?
  • 1 report removed


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev


On Jan 31, 2018, at 8:46 PM, Anna Zaks <[hidden email]> wrote:


On Jan 31, 2018, at 5:00 PM, George Karpenkov <[hidden email]> wrote:

[...]
On sqlite3 + a few other misc files:

  • 239 reports added
This is a lot of additional reports! Are there actually that many bugs in that codebase that need to be fixed?

Out of 239 or in general? The 239 reports mostly come from non-sqlite files.
In general, for most practical purposes C codebases provide an infinite supply of bugs :P

On a more serious note, I don’t think this is very surprising.
The previous emails in this chain have shown that for loops, the analyzer has a very high chance of entering
a degenerate behavior where the longest path through the loop will be evaluated first,
and a large chunk of the analyzer budget is then spent on going around the loop in circles.

Under the new exploration strategy, paths which increase coverage are explored first, and then we can actually find
bugs with the budget which is no longer spent going in circles.

I think we need to manually evaluate these reports.

Yes, I’ve looked through quite a few of them. The false positive ratio actually seems to be getting lower:
the probability that a report is a false positive grows with the path length,
and path lengths are getting much shorter.

Also, we have to make sure uniquing works. Do all regression tests pass?

Yes.
  • 1 report removed

Note on histograms (here and below)

-> Histograms only show the ratio for same bugs (compared using issue hash),
that is, if the histogram says “decrease by a factor of three”, it means the new approach finds the *same* bug
with a path size 1/3d of the original
-> Histograms omit data points where the path length has remained the same
(as otherwise they completely dominate the histogram)
-> Relative histograms are provided as both ratio and logarithm of the ratio.
Logarithms of the ratio are convenient as they are symmetric in case changes balance out
(e.g. log(1/2) = -log(2/1))

On Jan 30, 2018, at 4:23 PM, George Karpenkov <[hidden email]> wrote:

Preliminary evaluation of a patch which prefers exploring nodes associated with statements which weren’t seen before first:

On openssl:

 - Adds four reports
 - Removes four reports 
 - Path lengths before: 317, 75, 75, 72, 70, 58, 50, 50, 44, 36, 23, 23, 21, 21, 20, 20, 19, 19, 19, 19, 18, 18, 18, 16, 15, 15, 15, 14, 13, 13, 12, 11, 11, 9, 7, 7, 6, 4
 - Path lengths after: 72, 60, 59, 53, 53, 52, 46, 38, 37, 30, 29, 28, 23, 21, 20, 19, 19, 19, 19, 19, 18, 16, 15, 15, 15, 15, 13, 13, 12, 12, 11, 9, 8, 7, 7, 7, 6, 4

The quality of the added reports seems higher, mainly due to the fact that report length is shorter.

On postgresql:

 - Added 80 reports
 - Removed 154 reports
   -> Out of those, 72 are reports on the yacc/bison autogenerated files, so whatever the cause is, good thing they are gone
 - The overall number of reports is 1188
 - Path lengths are lower on overall, but not in such a dramatic way
 - For many reports, I am quite confused as to why they got removed

On sqlite:

 - 7 inserted, 7 removed

On Jan 30, 2018, at 1:10 PM, Artem Dergachev <[hidden email]> wrote:



On 30/01/2018 12:40 PM, Gábor Horváth via cfe-dev wrote:
Hi George, Artem,

I am glad that you are looking into this problem!

On 30 January 2018 at 01:12, George Karpenkov via cfe-dev <[hidden email] <[hidden email]>> wrote:

   Hi All,

   I was investigating recently bug reports with very long analyzer
   paths (more than a few hundred nodes).
   In many of such cases the path is long for no good reason: namely,
   the analyzer would go 3 times around the loop before
   going further.
   The issue is surprisingly common, and it was exacerbated with a
   recent bump of analyzer thresholds.

   The problem is reproduced on the following file:

   ```
   extern int coin();

   int foo() {
       int *x = 0;
       while (coin()) {
           if (coin())
               return *x;
       }
       return 0;
   }

   void bar() {
       while(coin())
           if (coin())
               foo();
   }
   ```

   While a shortest path to the error does not loop around, the
   current version of the analyzer
   will go around the loop three times before going further.
   (and we are quite fortunate that the unrolling limit for loops is
   three, otherwise it would keep going
   until the unrolling limit is reached).

   Multiple issues were discovered during the investigation.

   1. Analyzer queue does not have a concept of priority, and
   performs a simple DFS by default.
   Thus if the successor of the if-branch under the loop in “bar"
   containing the desired destination is generated second,
   it will never be evaluated until the loop exploration limit is
   exhausted.

   2. The previous issue slows down the exploration, but is not
   enough to get a pathological behavior of ultra-long paths.
   The second problem is a combination of:
   a) Block counter is not a part of a node's identity, and node A
   with a small block counter can be merged into a node B with a
   large block counter,
   and the resulting node will have a block counter associated with B.


Sorry for the questions, just wanted to clarify some things. You mean ExplodedNodes? By merge, you mean the same thing as "caching-out"?

Yeah, George noticed that if we construct the same ExplodedNode on two different paths that have different block counts, we'd cache out on the latter path, while the worklist element of the first path would still carry the original block count.

Which happens a lot when we're evaluating foo() conservatively in this example.

This isn't directly related to our problem though, as I noticed in http://lists.llvm.org/pipermail/cfe-dev/2018-January/056719.html.
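
Condensing the caching-out interaction above into code, a minimal sketch, assuming simplified stand-ins for node identity and the worklist (none of these types match the real ExplodedGraph API):

```
// Node identity is only (program point, state); the block count travels on
// the worklist item instead of being part of the node itself.
#include <queue>
#include <set>
#include <utility>

using NodeKey = std::pair<int /*point*/, int /*stateID*/>;

struct WorkItem {
  NodeKey node;
  int blockCount; // deliberately NOT part of the node's identity
};

std::set<NodeKey> generated;   // nodes already added to the graph
std::queue<WorkItem> worklist; // items still waiting to be processed

void addNode(NodeKey n, int blockCount) {
  // A second path reaching the same (point, state) "caches out" here, even
  // if it arrives with a smaller block count: the in-flight work item keeps
  // the first path's (possibly larger) count.
  if (!generated.insert(n).second)
    return;
  worklist.push({n, blockCount});
}
```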


   b) The issue in (a) is triggered due to our heuristic to abandon
   the function’s exploration and switch to conservative evaluation
   if we are already *inside* the function and the block limit has
   been reached.

   Issue (1) combined with (2-b) causes the problematic behavior: the
   issue is discovered on the longest path first,
   and by the time the shortest path gets to “bar”, the block limit
   is already reached, and the switch to conservative evaluation is
   performed.

   Thus there are two mitigation strategies currently being evaluated:

   i) Remove the heuristic in (2-b)
   ii) Use a priority queue to hold nodes which should be explored;
   prefer nodes which give new source code coverage over others
   (or alternatively prefer nodes with least depth of loop stack)

   Me and Artem have evaluated the option (i) and the results were
   surprisingly good: some reports disappear, and slightly more
   reports reappear.
   The quality of the new reports seems to be slightly better, and I
   am still trying to figure out exact reasons.
   I suspect merges resulting from heuristic (2-b) cause us to lose
   some actually valid reports.


I also find the results surprising. If you have more information about the reasons please do not forget to follow up this thread. We are curious :)


   Option (ii) has not been evaluated fully yet, but current
   experiments show slightly more reports (5-10%), and a radical
   decline in report lengths
   (e.g. from 400+ to <100 for largest reports)

   Are there any thoughts on the matter?

   Personally I think we should do both (i) and (ii), even if they
   would shake up the results.
   - The original idea for heuristics (2-b) was to be able to produce
   a report even if we are out of budget, but since it actually
   results in less reports,
   I think the data does not validate the approach.

   - Option (ii) is AFAIK how most similar engines work, and should
   get us much larger coverage (and shorter paths) for the same node
   budget,
   even at the cost of log(N) overhead of the priority queue.
   Moreover, not having the priority queue will bite us later if we
   ever decide to further
   increase the analyzer budget or to increase the unroll limit.


I wonder what will the performance implication be. But I also like the idea of having a priority queue. If we find that we get more and better report
but also have worse performance, we can also consider reducing the analysis budget slightly.

Regards,
Gábor


   George




Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev


On Jan 31, 2018, at 9:41 PM, George Karpenkov <[hidden email]> wrote:



On Jan 31, 2018, at 8:46 PM, Anna Zaks <[hidden email]> wrote:


On Jan 31, 2018, at 5:00 PM, George Karpenkov <[hidden email]> wrote:

The list did not like a posting with many images,
so I have posted an evaluation to phabricator: https://reviews.llvm.org/M1

The text part was:

After fixing a few bugs, another evaluation of the approach shows considerably better results.

On openssl:

  • 9 reports added
  • 1 report removed

On postgresql:

  • 377 reports added
  • 43 reports removed

On sqlite3 + a few other misc files:

  • 239 reports added
This is a lot of additional reports! Are there actually that many bugs in that codebase that need to be fixed?

Out of 239 or in general? The 239 reports mostly come from non-sqlite files.

Can you provide data just for sqlite?

For postgresql the number of additional reports also seems very high.

In general, for most practical purposes C codebases provide an infinite supply of bugs :P

On a more serious note, I don’t think this is very surprising.
The previous emails in this chain have shown that for loops, the analyzer has a very high chance of entering
a degenerate behavior where the longest path through the loop will be evaluated first,
and a large chunk of the analyzer budget is then spent on going around the loop in circles.

Under the new exploration strategy, paths which increase coverage are explored first, and then we can actually find
bugs with the budget which is no longer spent going in circles.


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev
On sqlite there are 8 new reports and one removed report.

I think it’s natural that the number is high. In many cases we are saving 3x of the analyzer budget,
which is then spent finding new bugs.


Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev


On Jan 31, 2018, at 9:59 PM, George Karpenkov <[hidden email]> wrote:

On sqlite there are 8 new reports and one removed report.

I think it’s natural that the number is high. In many cases we are saving 3x of the analyzer budget,
which is then spent finding new bugs.

A high jump in basic-block coverage could be explained by your argument. (By the way, have you measured that?)

However, whenever I see that we start reporting hundreds of new bugs on a single codebase, it looks suspicious. Will the developers really want and need to fix more than three hundred bugs on postgresql? Often when we dig deeper, we find out that either we became too aggressive in reporting what people will view as false positives, or we are reporting duplicate reports that have the same underlying cause. It’s possible postgresql is really that buggy, but we really need to be careful here.



Re: [analyzer] exploration strategies and paths

Artem Dergachev via cfe-dev


On Jan 31, 2018, at 10:15 PM, Anna Zaks <[hidden email]> wrote:



On Jan 31, 2018, at 9:59 PM, George Karpenkov <[hidden email]> wrote:

On sqlite there are 8 new reports and one removed report.

I think it’s natural that the number is high. In many cases we are saving 3x of the analyzer budget,
which is then spent finding new bugs.

A high jump in basic-block coverage could be explained by your argument. (By the way, have you measured that?)

However, whenever I see that we start reporting hundreds of new bugs on a single codebase, it looks suspicious. Will the developers really want and need to fix more than three hundred bugs on postgresql?

Right, I see your point.
I think that the right metric here is the proportion of false positives (after all, it’s easy to minimize the absolute number of false positives by not returning any results at all).

Coming back to postgres: out of 377 new results, 289 are on the auto-generated yacc/bison files.
We have seen those reports before, albeit in lesser quantity.
I have previously investigated them, and the root cause is basically that yacc/bison maintains complex invariants
(e.g. branch A can only be taken if branch B was taken before, because the global state was set up in a certain way)
which the analyzer cannot pick up.
This problem is not specific to any exploration strategy, and the mitigation is simply to disable analysis on those files (after all, it would not make sense to fix the auto-generated code).
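
For illustration, here is a boiled-down sketch of the kind of invariant involved; this is a hypothetical reduction, not code taken from the generated parsers. In real yacc/bison output the correlated state lives in global tables mutated across functions, which is exactly the part the analyzer fails to model.

```
/* Branch A (the dereference) is feasible only if branch B (the setup) ran
 * first, because both are guarded by the same global state. An engine that
 * loses this correlation explores the infeasible path where `state` is set
 * but `ptr` is still null, and reports a null dereference. */
static int state;
static int *ptr;

void action_b(void) { /* "branch B": establishes the invariant */
  state = 1;
  ptr = &state;       /* invariant: state != 0 implies ptr != 0 */
}

void action_a(void) { /* "branch A": relies on the invariant */
  if (state)
    *ptr = 42;        /* safe, but only because of the invariant */
}
```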

Often when we dig deeper, we find out that either we became too aggressive in reporting what people will view as false positives, or we are reporting duplicate reports that have the same underlying cause. It’s possible postgresql is really that buggy, but we really need to be careful here.

Of course, that’s why I’m looking into those files. So far, no problems have been noticed.

I have also evaluated the results on XNU and uploaded them to https://reviews.llvm.org/M1

In terms of report counts, I get:
  • 133 reports added
  • 57 reports removed
More interesting is the change in report size, shown at https://reviews.llvm.org/M1/18/

We can see ~8 reports (for the same bugs) getting shorter by a factor of ~3, and ~3 reports getting shorter by a factor of ~12.


