Process in UncheckedReturn

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Process in UncheckedReturn

Lei Zhang
Hi clang,

About the path explotion problems, I think it's a big job. So IMO i should get involved in it later.
Here is another problem, it's about how to store path information.
After the checker ownership changed, i want all the statistic information stored in the checker itself. So i designed some densemaps to store these information:
   
typedef llvm::DenseMap<const CallExpr*, bool > LocalMapTy;
// The "local" DenseMap to record checked/unchecked state of all CallExprs in
// one path.
LocalMapTy *LocalPathMap;

// The "local" DenseMap to record checked/unchecked state of all CallExpr
// uses.
LocalMapTy *LocalUseMap;

// First int indicates the checked count, and the secnd indicates the
// unchecked count.
typedef llvm::DenseMap<const NamedDecl*, std::pair<int, int> > GlobalMapTy;
// The "global" DenseMap to keep statistics of a callee decl.
GlobalMapTy *GlobalMap;

// The visited map to record whether a function body was already analyzed.
typedef llvm::DenseMap<const NamedDecl*, bool> VisitedMapTy;
VisitedMapTy *VisitedMap;

And a struct to keep checkedreturn information in GRState:
   
struct CheckedReturnState {
enum Kind { Unchecked, Checked } K;
const Stmt *S;
...
};

My process:
  1. While we do path simulation, record CheckedReturnState in GRState. And update it to LocalPathMap at some program point(checkdeadsymbol, endpath, brach).
  2. While in endpath analysis, analyze LocalPathMap(currently using a pessimistic way, a CE is checked unless it's checked in all the paths), then update the LocalUseMap. Of course, this step is optional.
  3. If we use LocalPathMap, update the GlobalMap in checkEndPath; if LocalUseMap, update the GlobalMap in checkEndAnalysis.
  4. In checkEndOfTranslationUnit, do the math and emit bug report.

Soon i found i was wrong in step 1, i should not use any structure in checker to record path-sensitive states. So i need other ways to handle this:
  1. It seems ok that only to update LocalPathMap when end path(so we can record in GRState until the path analysis finished), but how should we take care of the dead symbols? Maybe i could keep that dead symbols instead of removing them?
  2. Keep the <CE, CheckedReturnState> in GRState, so we get it over. But how to effectively keep these information.
I'll appreciate it if there are any advice.   

--
Best regards!

Lei Zhang

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

UncheckedReturnChecker.cpp (20K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Process in UncheckedReturn

Ted Kremenek
On May 30, 2011, at 10:48 PM, 章磊 wrote:

Soon i found i was wrong in step 1, i should not use any structure in checker to record path-sensitive states. So i need other ways to handle this:
  1. It seems ok that only to update LocalPathMap when end path(so we can record in GRState until the path analysis finished), but how should we take care of the dead symbols? Maybe i could keep that dead symbols instead of removing them?
  2. Keep the <CE, CheckedReturnState> in GRState, so we get it over. But how to effectively keep these information.



Hi Lei,

Besides aggregating statistics in checkEndPath, another option is to also doing this in checkDeadSymbols.  You then don't need to worry about symbols that have been removed from GRState by the time you reach checkEndPath, since you've already done the aggregation.  Of course this influences your counting, but in all fairness we aren't enumerating all paths explicitly anyway, so your counts are directly influenced by how we explore paths, which paths are explored, and how paths are coalesced.

Ted

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Process in UncheckedReturn

Lei Zhang
 
2011/6/2 Ted Kremenek <[hidden email]>
On May 30, 2011, at 10:48 PM, 章磊 wrote:

Soon i found i was wrong in step 1, i should not use any structure in checker to record path-sensitive states. So i need other ways to handle this:
  1. It seems ok that only to update LocalPathMap when end path(so we can record in GRState until the path analysis finished), but how should we take care of the dead symbols? Maybe i could keep that dead symbols instead of removing them?
  2. Keep the <CE, CheckedReturnState> in GRState, so we get it over. But how to effectively keep these information.



Hi Lei,

Besides aggregating statistics in checkEndPath, another option is to also doing this in checkDeadSymbols.  You then don't need to worry about symbols that have been removed from GRState by the time you reach checkEndPath, since you've already done the aggregation.  Of course this influences your counting, but in all fairness we aren't enumerating all paths explicitly anyway, so your counts are directly influenced by how we explore paths, which paths are explored, and how paths are coalesced.

Ted
 
Hi Ted,
 
It's ok to do aggregating in checkDeadSymbols, and that's how i do in my patch. My problem is if we do so, the deadsymbols are path-related, so we must keep the statistics in GRState instead of checker itself.
 
Is it ok to add some GRState like llvm::immutablemap<CE, CheckedReturnState>, without related to some symbols? Or we must keep the GRState like llvm::immutablemap<sym, llvm::DenseMap<CE, CheckedReturnState>>?

--
Best regards!

Lei Zhang

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Process in UncheckedReturn

Ted Kremenek
On Jun 1, 2011, at 7:04 PM, 章磊 wrote:

 
2011/6/2 Ted Kremenek <[hidden email]>
On May 30, 2011, at 10:48 PM, 章磊 wrote:

Soon i found i was wrong in step 1, i should not use any structure in checker to record path-sensitive states. So i need other ways to handle this:
  1. It seems ok that only to update LocalPathMap when end path(so we can record in GRState until the path analysis finished), but how should we take care of the dead symbols? Maybe i could keep that dead symbols instead of removing them?
  2. Keep the <CE, CheckedReturnState> in GRState, so we get it over. But how to effectively keep these information.



Hi Lei,

Besides aggregating statistics in checkEndPath, another option is to also doing this in checkDeadSymbols.  You then don't need to worry about symbols that have been removed from GRState by the time you reach checkEndPath, since you've already done the aggregation.  Of course this influences your counting, but in all fairness we aren't enumerating all paths explicitly anyway, so your counts are directly influenced by how we explore paths, which paths are explored, and how paths are coalesced.

Ted
 
Hi Ted,
 
It's ok to do aggregating in checkDeadSymbols, and that's how i do in my patch.

Hi Lei,

I don't see that.  I see a call to updateCRState, and that sets an entry in LocalPathMap, but I don't see that as counting the number of times a given CallExpr's return value is checked.  Is it meant just to record that the value was checked on a given path?

I'm concerned that LocalPathMap isn't the right model of what you want.  The analyzer doesn't trace out all paths individually; it will coalesce paths together as it sees them having the same state.  It is possible by the time you reach a call to 'checkEndPath' that an infinite number of paths were coalesced together into one by the time you hit the end of a function.

My problem is if we do so, the deadsymbols are path-related, so we must keep the statistics in GRState instead of checker itself.

Why?  I'm really confused here.

I think trying to collect statistics in GRState itself is inherently the wrong approach.  Statistics should be about aggregating what you have observed; the checker itself should only focus on the semantics, which in this case is whether or not a return value was checked.

Is it ok to add some GRState like llvm::immutablemap<CE, CheckedReturnState>, without related to some symbols?

Yes, but what would it record?  What would it represent?

Or we must keep the GRState like llvm::immutablemap<sym, llvm::DenseMap<CE, CheckedReturnState>>?

No, that would not be a good approach.  DenseMap is not an immutable data structure, and should not be a value in ImmutableMap.

Stepping back…  how are you trying to count things?  A few concerns come to mind…

1) If you are trying to count statistics along individual paths, your current approach isn't really working as you intend.  You are essentially relying not he analyzer exploring the state space in a particular way, and (as I've said before) you aren't actually enumerating individual paths.  This means the attempt to use LocalPathMap to count such statistics is inherently not measuring per-path information.

2) Even if we could enumerate all paths, is counting them individually really the right answer?  There could be an infinite number of paths.

3) Paths don't measure everything, and they can really over count the same cases.

I think you can make counting far more simple, and simplify your checker.

Here's my suggestion:

1) Have the checker track the CheckedReturnState for each symbol, as it does now.

2) When a symbol is dead, or we reach the end of a path, count it in the following way…

 void processSymbol(const GRState *state, SymbolRef Sym) {
   const CheckedReturnState *CS = state->get<CheckedReturnState>(Sym))
   if (!CS)
     return;
   std::pair<unsigned,unsigned> &stats = CheckedSymStatistics[Sym];
   if (CS->isChecked())
     stats.first++;
   else
     stats.second++;
 }

This will cause you to get statistics for each symbol.  The next stage is then to aggregate them per call site.

  void aggregateByCallSite() {
    for (llvm::DenseMap<SymbolRef, std::pair<unsigned, unsigned> >::iterator i = CheckedStatistics.begin(); … ) {
       SymbolRef sym = i->first;
       std::pair<unsigned, unsigned> stats = i->second;
       
       // Compute normalized statistics for that symbol.
       double sum = stats.first + stats.second;
       std::pair<double, double> normalizedStats(stats.first / sum, stats.second / sum);

       // Aggregate that count into the call site count.
       const CallExpr *CE = getCallSite(sym);
       std::pair<double, double> &callsiteStat = CheckedCallsiteStatistics[CE];
       callsiteStat.first += normalizedStats.first;
       callsiteStat.second += normalizedStats.second;
   }
   // then loop over CheckedCallSiteStatistics and normalize the statistics for each entry
 }

I've left some details out, but basically there are a variety of ways to go from a symbol to the original call site.  Either we can record it in a DenseMap (on the side), or if it is a conjured symbol, we can look at the symbol itself to figure out where it came from.

How this counting works:

(a) Paths help us figure out the behavior of a given symbol.  Since counting paths is problematic, we just normalize the counts for a given symbol.  This essentially gives us a probability distribution.

(b) Aggregating normalized counts for symbols into their CallExprs gives us "counts" for a specific CallExpr.  This is an arbitrary choice to weigh symbols in this way, but it causes spurious paths to not be over counted.

(c) Normalizing the aggregated counts for CallExprs gives us another probability distribution.  It is basically the probability, for a given CallExpr, whether its value was checked, factoring out some redundancy from multiple paths.

(d) Using the normalized counts (probability distribution) for CallExprs, each CallExpr then has a single count, which you can then use to count whether or not a given function has it's return value checked or not.

This counting strategy obviously biases towards inferring properties of functions that are *called in more places* versus *appear in more paths*, which I think is a better balance than trying to make sense of the different paths explored by the analyzer.  It also allows you to skirt the path explosion problem entirely.  As long as you think you have traced a "representative set of paths", the statistics will yield a confidence measure in how likely a given function's return value should be checked, regardless of whether or not you were able to evaluate all paths (which IMO isn't terribly relevant, as long as you are able to collect representative evidence of behavioral trends in the code).

What do you think?  Sorry for sounding so aggressive; it's late here and I'm tired (so my wording isn't as finessed as I would like), but I wanted to strongly get across that I think counting needs to not be so biased by how we count paths.  Of course you are free to disagree.

Cheers,
Ted

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Process in UncheckedReturn

Lei Zhang


在 2011年6月2日 下午3:36,Ted Kremenek <[hidden email]>写道:
On Jun 1, 2011, at 7:04 PM, 章磊 wrote:

 
2011/6/2 Ted Kremenek <[hidden email]>
On May 30, 2011, at 10:48 PM, 章磊 wrote:

Soon i found i was wrong in step 1, i should not use any structure in checker to record path-sensitive states. So i need other ways to handle this:
  1. It seems ok that only to update LocalPathMap when end path(so we can record in GRState until the path analysis finished), but how should we take care of the dead symbols? Maybe i could keep that dead symbols instead of removing them?
  2. Keep the <CE, CheckedReturnState> in GRState, so we get it over. But how to effectively keep these information.



Hi Lei,

Besides aggregating statistics in checkEndPath, another option is to also doing this in checkDeadSymbols.  You then don't need to worry about symbols that have been removed from GRState by the time you reach checkEndPath, since you've already done the aggregation.  Of course this influences your counting, but in all fairness we aren't enumerating all paths explicitly anyway, so your counts are directly influenced by how we explore paths, which paths are explored, and how paths are coalesced.

Ted
 
Hi Ted,
 
It's ok to do aggregating in checkDeadSymbols, and that's how i do in my patch.

Hi Lei,

I don't see that.  I see a call to updateCRState, and that sets an entry in LocalPathMap, but I don't see that as counting the number of times a given CallExpr's return value is checked.  Is it meant just to record that the value was checked on a given path?

I'm concerned that LocalPathMap isn't the right model of what you want.  The analyzer doesn't trace out all paths individually; it will coalesce paths together as it sees them having the same state.  It is possible by the time you reach a call to 'checkEndPath' that an infinite number of paths were coalesced together into one by the time you hit the end of a function.

My problem is if we do so, the deadsymbols are path-related, so we must keep the statistics in GRState instead of checker itself.

Why?  I'm really confused here.

I think trying to collect statistics in GRState itself is inherently the wrong approach.  Statistics should be about aggregating what you have observed; the checker itself should only focus on the semantics, which in this case is whether or not a return value was checked.

Is it ok to add some GRState like llvm::immutablemap<CE, CheckedReturnState>, without related to some symbols?

Yes, but what would it record?  What would it represent?

Or we must keep the GRState like llvm::immutablemap<sym, llvm::DenseMap<CE, CheckedReturnState>>?

No, that would not be a good approach.  DenseMap is not an immutable data structure, and should not be a value in ImmutableMap.

Stepping back…  how are you trying to count things?  A few concerns come to mind…

1) If you are trying to count statistics along individual paths, your current approach isn't really working as you intend.  You are essentially relying not he analyzer exploring the state space in a particular way, and (as I've said before) you aren't actually enumerating individual paths.  This means the attempt to use LocalPathMap to count such statistics is inherently not measuring per-path information.

2) Even if we could enumerate all paths, is counting them individually really the right answer?  There could be an infinite number of paths.

3) Paths don't measure everything, and they can really over count the same cases.

I think you can make counting far more simple, and simplify your checker.

Here's my suggestion:

1) Have the checker track the CheckedReturnState for each symbol, as it does now.

2) When a symbol is dead, or we reach the end of a path, count it in the following way…

 void processSymbol(const GRState *state, SymbolRef Sym) {
   const CheckedReturnState *CS = state->get<CheckedReturnState>(Sym))
   if (!CS)
     return;
   std::pair<unsigned,unsigned> &stats = CheckedSymStatistics[Sym];
   if (CS->isChecked())
     stats.first++;
   else
     stats.second++;
 }

This will cause you to get statistics for each symbol.  The next stage is then to aggregate them per call site.

  void aggregateByCallSite() {
    for (llvm::DenseMap<SymbolRef, std::pair<unsigned, unsigned> >::iterator i = CheckedStatistics.begin(); … ) {
       SymbolRef sym = i->first;
       std::pair<unsigned, unsigned> stats = i->second;
       
       // Compute normalized statistics for that symbol.
       double sum = stats.first + stats.second;
       std::pair<double, double> normalizedStats(stats.first / sum, stats.second / sum);

       // Aggregate that count into the call site count.
       const CallExpr *CE = getCallSite(sym);
       std::pair<double, double> &callsiteStat = CheckedCallsiteStatistics[CE];
       callsiteStat.first += normalizedStats.first;
       callsiteStat.second += normalizedStats.second;
   }
   // then loop over CheckedCallSiteStatistics and normalize the statistics for each entry
 }

I've left some details out, but basically there are a variety of ways to go from a symbol to the original call site.  Either we can record it in a DenseMap (on the side), or if it is a conjured symbol, we can look at the symbol itself to figure out where it came from.

How this counting works:

(a) Paths help us figure out the behavior of a given symbol.  Since counting paths is problematic, we just normalize the counts for a given symbol.  This essentially gives us a probability distribution.

(b) Aggregating normalized counts for symbols into their CallExprs gives us "counts" for a specific CallExpr.  This is an arbitrary choice to weigh symbols in this way, but it causes spurious paths to not be over counted.

(c) Normalizing the aggregated counts for CallExprs gives us another probability distribution.  It is basically the probability, for a given CallExpr, whether its value was checked, factoring out some redundancy from multiple paths.

(d) Using the normalized counts (probability distribution) for CallExprs, each CallExpr then has a single count, which you can then use to count whether or not a given function has it's return value checked or not.

This counting strategy obviously biases towards inferring properties of functions that are *called in more places* versus *appear in more paths*, which I think is a better balance than trying to make sense of the different paths explored by the analyzer.  It also allows you to skirt the path explosion problem entirely.  As long as you think you have traced a "representative set of paths", the statistics will yield a confidence measure in how likely a given function's return value should be checked, regardless of whether or not you were able to evaluate all paths (which IMO isn't terribly relevant, as long as you are able to collect representative evidence of behavioral trends in the code).

What do you think?  Sorry for sounding so aggressive; it's late here and I'm tired (so my wording isn't as finessed as I would like), but I wanted to strongly get across that I think counting needs to not be so biased by how we count paths.  Of course you are free to disagree.

Cheers,
Ted

Hi Ted,

Thank you very much for your reply and sorry for misunderstanding the word "aggregate".

I thought that we can do aggregation only when a path is fully analyzed, so when you mention the aggregation in checkDeadSymbols, i didn't get what you mean to do.

I think this counting strategy is great, but i have a little problem:

IMO, it's a trade-off between counting statistics along individual paths and counting statistics in your way. When there is no path explosion, they give the same counts. But when the path explodes, they works different.

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. But we will leave out some paths because of the step limit and the analyzer should work in a particular way.
In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path"). And could this results be the "representative evidence of behavioral trends" ?

If the second way is ok, and since we should not force the analyzer to
exploring the state space in a particular way, i will do it in the second way.

--
Best regards!

Lei Zhang

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Process in UncheckedReturn

Ted Kremenek
On Jun 6, 2011, at 8:14 PM, 章磊 wrote:

Hi Ted,

Thank you very much for your reply and sorry for misunderstanding the word "aggregate".

I thought that we can do aggregation only when a path is fully analyzed, so when you mention the aggregation in checkDeadSymbols, i didn't get what you mean to do.

I think this counting strategy is great, but i have a little problem: 

IMO, it's a trade-off between counting statistics along individual paths and counting statistics in your way. When there is no path explosion, they give the same counts. But when the path explodes, they works different.

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. But we will leave out some paths because of the step limit and the analyzer should work in a particular way.
In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path"). And could this results be the "representative evidence of behavioral trends" ?

If the second way is ok, and since we should not force the analyzer to 
exploring the state space in a particular way, i will do it in the second way.

Hi Lei,

Let me focus around this particular comment:

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. 

I think this is a misconception.  We never will fully analyze part of all the paths in a manner where each count in our statistics can be considered to have come from a single path.  Consider:


  void foo() {
    int x = bar();
    if (x > 0)
     x++;
    else
     x--;
   // never use 'x' afterwards
   ...
 }

The exact operations here are not important.  What I want to stress is that, regardless of path exploration, we will merge the two paths after the if..else... because 'x' is not used afterwards.  When a variable is no longer live, the analysis engine prunes that variable binding, and any associated symbolic constraints, from the state.  In the example above, this means that the GRState along the two paths (one for the true branch, the other for the false branch) will be the same.  At this point, the paths are *merged* in the ExplodedGraph, because analyzing either of them from that point on would produce the same analysis results.  This is a key optimization in the analyzer engine to reduce the exponential growth from path exploration.

My point is that "counting paths" just doesn't make sense.  Paths get merged all the time, or we may come up with other ways to cut out paths from the ExplodedGraph that don't need to be explicitly represented or explicitly explored.  Definitely this impacts how we count statistics, but I'm suggesting we pick a scheme that seems less sensitive to how we enumerate paths, and more on the behavior we observed from those paths.

The motivation behind my suggesting for how to aggregate statistics was simple: forgo any notion that we are enumerating paths, and instead accumulate distinct observations where behavior was consistent or different.  Here I actually try and marginalize out the effects of path explosion or merging, and instead focus on the behavior at particular program points.

In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path").

I don't fully understand this comment.  All I am suggesting is that aggregate statistics when a symbol no longer is being tracked.  When a symbol "dies" early, we do that aggregation in checkDeadSymbols.  If a symbol lives to the end of a path, when do it in checkEndPath.

Can you clarify, precisely, what you mean by 'but in checkEndPath some symbols may be checked after this "step limited end path"'?  checkEndPath will never be called when we are step limited.  It is only called when we actually hit the end of a function.  When we are step limited, we simply stop analyzing a path.

Cheers,
Ted

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Process in UncheckedReturn

Lei Zhang
Hi Ted,

在 2011年6月9日 上午7:38,Ted Kremenek <[hidden email]>写道:
On Jun 6, 2011, at 8:14 PM, 章磊 wrote:

Hi Ted,

Thank you very much for your reply and sorry for misunderstanding the word "aggregate".

I thought that we can do aggregation only when a path is fully analyzed, so when you mention the aggregation in checkDeadSymbols, i didn't get what you mean to do.

I think this counting strategy is great, but i have a little problem: 

IMO, it's a trade-off between counting statistics along individual paths and counting statistics in your way. When there is no path explosion, they give the same counts. But when the path explodes, they works different.

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. But we will leave out some paths because of the step limit and the analyzer should work in a particular way.
In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path"). And could this results be the "representative evidence of behavioral trends" ?

If the second way is ok, and since we should not force the analyzer to 
exploring the state space in a particular way, i will do it in the second way.

Hi Lei,

Let me focus around this particular comment:

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. 

I think this is a misconception.  We never will fully analyze part of all the paths in a manner where each count in our statistics can be considered to have come from a single path.  Consider:


  void foo() {
    int x = bar();
    if (x > 0)
     x++;
    else
     x--;
   // never use 'x' afterwards
   ...
 }

The exact operations here are not important.  What I want to stress is that, regardless of path exploration, we will merge the two paths after the if..else... because 'x' is not used afterwards.  When a variable is no longer live, the analysis engine prunes that variable binding, and any associated symbolic constraints, from the state.  In the example above, this means that the GRState along the two paths (one for the true branch, the other for the false branch) will be the same.  At this point, the paths are *merged* in the ExplodedGraph, because analyzing either of them from that point on would produce the same analysis results.  This is a key optimization in the analyzer engine to reduce the exponential growth from path exploration.

My point is that "counting paths" just doesn't make sense.  Paths get merged all the time, or we may come up with other ways to cut out paths from the ExplodedGraph that don't need to be explicitly represented or explicitly explored.  Definitely this impacts how we count statistics, but I'm suggesting we pick a scheme that seems less sensitive to how we enumerate paths, and more on the behavior we observed from those paths.
 
 
Thank you very much to expound this and sorry for taking this long time to get your idea.
 
The motivation behind my suggesting for how to aggregate statistics was simple: forgo any notion that we are enumerating paths, and instead accumulate distinct observations where behavior was consistent or different.  Here I actually try and marginalize out the effects of path explosion or merging, and instead focus on the behavior at particular program points.

In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path").

I don't fully understand this comment.  All I am suggesting is that aggregate statistics when a symbol no longer is being tracked.  When a symbol "dies" early, we do that aggregation in checkDeadSymbols.  If a symbol lives to the end of a path, when do it in checkEndPath.

Can you clarify, precisely, what you mean by 'but in checkEndPath some symbols may be checked after this "step limited end path"'?  checkEndPath will never be called when we are step limited.  It is only called when we actually hit the end of a function.  When we are step limited, we simply stop analyzing a path.
 
Sorry, my bad, I mistaken the checkEndAnalysis as checkEndPath.
 

Cheers,
Ted

I still corcern about that how the second way really works, so i will take this way and do some test.

--
Best regards!

Lei Zhang

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Process in UncheckedReturn

Ted Kremenek

On Jun 8, 2011, at 7:08 PM, 章磊 wrote:

Hi Ted,

在 2011年6月9日 上午7:38,Ted Kremenek <[hidden email]>写道:
On Jun 6, 2011, at 8:14 PM, 章磊 wrote:

Hi Ted,

Thank you very much for your reply and sorry for misunderstanding the word "aggregate".

I thought that we can do aggregation only when a path is fully analyzed, so when you mention the aggregation in checkDeadSymbols, i didn't get what you mean to do.

I think this counting strategy is great, but i have a little problem: 

IMO, it's a trade-off between counting statistics along individual paths and counting statistics in your way. When there is no path explosion, they give the same counts. But when the path explodes, they works different.

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. But we will leave out some paths because of the step limit and the analyzer should work in a particular way.
In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path"). And could this results be the "representative evidence of behavioral trends" ?

If the second way is ok, and since we should not force the analyzer to 
exploring the state space in a particular way, i will do it in the second way.

Hi Lei,

Let me focus around this particular comment:

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. 

I think this is a misconception.  We never will fully analyze part of all the paths in a manner where each count in our statistics can be considered to have come from a single path.  Consider:


  void foo() {
    int x = bar();
    if (x > 0)
     x++;
    else
     x--;
   // never use 'x' afterwards
   ...
 }

The exact operations here are not important.  What I want to stress is that, regardless of path exploration, we will merge the two paths after the if..else... because 'x' is not used afterwards.  When a variable is no longer live, the analysis engine prunes that variable binding, and any associated symbolic constraints, from the state.  In the example above, this means that the GRState along the two paths (one for the true branch, the other for the false branch) will be the same.  At this point, the paths are *merged* in the ExplodedGraph, because analyzing either of them from that point on would produce the same analysis results.  This is a key optimization in the analyzer engine to reduce the exponential growth from path exploration.

My point is that "counting paths" just doesn't make sense.  Paths get merged all the time, or we may come up with other ways to cut out paths from the ExplodedGraph that don't need to be explicitly represented or explicitly explored.  Definitely this impacts how we count statistics, but I'm suggesting we pick a scheme that seems less sensitive to how we enumerate paths, and more on the behavior we observed from those paths.
 
 
Thank you very much to expound this and sorry for taking this long time to get your idea.

No need to apologize.  This is something very worth discussing.

BTW, my opinion is just my opinion.  You are free to experiment with this any way you wish.


 
The motivation behind my suggesting for how to aggregate statistics was simple: forgo any notion that we are enumerating paths, and instead accumulate distinct observations where behavior was consistent or different.  Here I actually try and marginalize out the effects of path explosion or merging, and instead focus on the behavior at particular program points.

In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path").

I don't fully understand this comment.  All I am suggesting is that aggregate statistics when a symbol no longer is being tracked.  When a symbol "dies" early, we do that aggregation in checkDeadSymbols.  If a symbol lives to the end of a path, when do it in checkEndPath.

Can you clarify, precisely, what you mean by 'but in checkEndPath some symbols may be checked after this "step limited end path"'?  checkEndPath will never be called when we are step limited.  It is only called when we actually hit the end of a function.  When we are step limited, we simply stop analyzing a path.
 
Sorry, my bad, I mistaken the checkEndAnalysis as checkEndPath.
 

Cheers,
Ted

I still corcern about that how the second way really works, so i will take this way and do some test.

Absolutely.  With these kinds of statistical tricks, experimentation is really key to finding what works best.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Process in UncheckedReturn

Lei Zhang
Hi Ted,

Sorry for this late reply.

I modified the UncheckedReturnChecker as you suggested, and test it on some small code base(larbin, cgicc, skia, vcg...). It gives some bugs when i simply set the unchecked ratio to 75%. It seems OK but still has some problems can not fix right now.

1. Because lack of taint analysis in the engine, we can not pass some checkedreturn state through SVals. Here are some examples:

bool b = f() == 0;

In this example, we can not pass the checked state(f() was checked) to decl b right now.

int a = f();
if (-a) {
  ...
}
return;

In this example, decl "a" has a SVal bind to it. I use this SVal as a symbol and record the checked state in the form DenseMap<symbol, bool> . When our engine visit the unaryoperator "-a", it will remove the binding form SVal to the decl "a" because there is no further use of decl "a". So later when the engine visit the branch condition, there is no binding which i need to get the SVal as a symbol.

Right now, i leave these cases behind and they will make some miscount.

2. Bug report and count strategy need to be improved.

What you think about these?

在 2011年6月9日 上午11:06,Ted Kremenek <[hidden email]>写道:

On Jun 8, 2011, at 7:08 PM, 章磊 wrote:

Hi Ted,

在 2011年6月9日 上午7:38,Ted Kremenek <[hidden email]>写道:
On Jun 6, 2011, at 8:14 PM, 章磊 wrote:

Hi Ted,

Thank you very much for your reply and sorry for misunderstanding the word "aggregate".

I thought that we can do aggregation only when a path is fully analyzed, so when you mention the aggregation in checkDeadSymbols, i didn't get what you mean to do.

I think this counting strategy is great, but i have a little problem: 

IMO, it's a trade-off between counting statistics along individual paths and counting statistics in your way. When there is no path explosion, they give the same counts. But when the path explodes, they works different.

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. But we will leave out some paths because of the step limit and the analyzer should work in a particular way.
In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path"). And could this results be the "representative evidence of behavioral trends" ?

If the second way is ok, and since we should not force the analyzer to 
exploring the state space in a particular way, i will do it in the second way.

Hi Lei,

Let me focus around this particular comment:

In the first way, we will fully analyze part of all the paths and we can give the accurate results of those paths. 

I think this is a misconception.  We never will fully analyze part of all the paths in a manner where each count in our statistics can be considered to have come from a single path.  Consider:


  void foo() {
    int x = bar();
    if (x > 0)
     x++;
    else
     x--;
   // never use 'x' afterwards
   ...
 }

The exact operations here are not important.  What I want to stress is that, regardless of path exploration, we will merge the two paths after the if..else... because 'x' is not used afterwards.  When a variable is no longer live, the analysis engine prunes that variable binding, and any associated symbolic constraints, from the state.  In the example above, this means that the GRState along the two paths (one for the true branch, the other for the false branch) will be the same.  At this point, the paths are *merged* in the ExplodedGraph, because analyzing either of them from that point on would produce the same analysis results.  This is a key optimization in the analyzer engine to reduce the exponential growth from path exploration.

My point is that "counting paths" just doesn't make sense.  Paths get merged all the time, or we may come up with other ways to cut out paths from the ExplodedGraph that don't need to be explicitly represented or explicitly explored.  Definitely this impacts how we count statistics, but I'm suggesting we pick a scheme that seems less sensitive to how we enumerate paths, and more on the behavior we observed from those paths.
 
 
Thank you very much to expound this and sorry for taking this long time to get your idea.

No need to apologize.  This is something very worth discussing.

BTW, my opinion is just my opinion.  You are free to experiment with this any way you wish.


 
The motivation behind my suggesting for how to aggregate statistics was simple: forgo any notion that we are enumerating paths, and instead accumulate distinct observations where behavior was consistent or different.  Here I actually try and marginalize out the effects of path explosion or merging, and instead focus on the behavior at particular program points.

In the second way, we do not fully analyze a path, we will call checkEndPath while we are not meet the actual path end, so we can not guarantee the states we are counting are totally right(the aggregation in checkDeadSymbols is ok, since all the dead symbols will not be checked; but in checkEndPath some symbols may be checked after this "step limited end path").

I don't fully understand this comment.  All I am suggesting is that aggregate statistics when a symbol no longer is being tracked.  When a symbol "dies" early, we do that aggregation in checkDeadSymbols.  If a symbol lives to the end of a path, when do it in checkEndPath.

Can you clarify, precisely, what you mean by 'but in checkEndPath some symbols may be checked after this "step limited end path"'?  checkEndPath will never be called when we are step limited.  It is only called when we actually hit the end of a function.  When we are step limited, we simply stop analyzing a path.
 
Sorry, my bad, I mistaken the checkEndAnalysis as checkEndPath.
 

Cheers,
Ted

I still corcern about that how the second way really works, so i will take this way and do some test.

Absolutely.  With these kinds of statistical tricks, experimentation is really key to finding what works best.



--
Best regards!

Lei Zhang

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

UncheckedReturn.patch (27K) Download Attachment