Quantcast

[GSoC 2017] Improve loop execution model of the Clang Static Analyzer

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[GSoC 2017] Improve loop execution model of the Clang Static Analyzer

Keane, Erich via cfe-dev
Hello,

I would love to work on a Clang Static Analyzer project during the summer by joining GSoC.
Let me introduce myself:
I'm Peter Szecsi, a third-year BSc student at Eötvös Loránd University, Budapest. It would be my first Summer of Code project but I have already contributed to clang:
During the last summer I have uploaded 2 patches:
- An accepted and merged clang-tidy checker [1]
- An accepted Clang SA checker [2]
Since then I am working on cross translational unit analysis as a university project (and because of that I`ve send some patches about the ASTImporter [3]). Furthermore I participated in the preparations of a talk that was accepted at the EuroLLVM conference. [4]
I found SA interesting since I like to work on algorithmic challenges and enjoy participating in programming competitions and there is a lot of algorithms in the Static Analyzer which could be optimized/made more precise by heuristics.

That is the reason why I would be most interested in project "Implement generalized loop execution modeling" from the Clang SA Open Projects page [5]. Hopefully it is a suitable project for GSoC and it is possible to achieve useful improvements during this time frame.
I have read the discussion at the "loop widening" patch [6] and the most important enhancements (and milestones in the project) would be invalidating only the possible modified values. In addition to that I was thinking to have some heuristics which would approximate if a loop is too complex to even use this kind of strategy because it would still lead to invalidate most of the values and result in a lot of false positives. Another small heuristic could be: when we already reached the maximum loop unroll number then in case we are sure that there is only a few loopsteps ahead we could unroll it to the end since probably this algorithm will not consume too much time/memory and the result state will not be much more complex.

What do you think about these ideas?

(I have CC'd everyone from the "loop widening" patch discussion.)

Best regards,
Peter

[1] https://reviews.llvm.org/D22507
[2] https://reviews.llvm.org/D24246
[3] https://reviews.llvm.org/D29612, https://reviews.llvm.org/D30876, https://reviews.llvm.org/D30877
[4] http://llvm.org/devmtg/2017-03//2017/02/20/accepted-sessions.html#7
[5] http://clang-analyzer.llvm.org/open_projects.html
[6] https://reviews.llvm.org/D12358

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [GSoC 2017] Improve loop execution model of the Clang Static Analyzer

Keane, Erich via cfe-dev
Hi Péter,

Better modeling of loops is definitely an interesting topic. My main concern would be having a specific proposal on what improvements you could complete in the GSoC timeframe. The next step here is to write down more specifics about what you intend to work on. Please, let me, Devin, and Artem know if you’d like to talk to us while figuring this out.

Cheers!
Anna.


On Mar 13, 2017, at 9:43 AM, Péter Szécsi <[hidden email]> wrote:

Hello,

I would love to work on a Clang Static Analyzer project during the summer by joining GSoC.
Let me introduce myself:
I'm Peter Szecsi, a third-year BSc student at Eötvös Loránd University, Budapest. It would be my first Summer of Code project but I have already contributed to clang:
During the last summer I have uploaded 2 patches:
- An accepted and merged clang-tidy checker [1]
- An accepted Clang SA checker [2]
Since then I am working on cross translational unit analysis as a university project (and because of that I`ve send some patches about the ASTImporter [3]). Furthermore I participated in the preparations of a talk that was accepted at the EuroLLVM conference. [4]
I found SA interesting since I like to work on algorithmic challenges and enjoy participating in programming competitions and there is a lot of algorithms in the Static Analyzer which could be optimized/made more precise by heuristics.

That is the reason why I would be most interested in project "Implement generalized loop execution modeling" from the Clang SA Open Projects page [5]. Hopefully it is a suitable project for GSoC and it is possible to achieve useful improvements during this time frame.
I have read the discussion at the "loop widening" patch [6] and the most important enhancements (and milestones in the project) would be invalidating only the possible modified values. In addition to that I was thinking to have some heuristics which would approximate if a loop is too complex to even use this kind of strategy because it would still lead to invalidate most of the values and result in a lot of false positives. Another small heuristic could be: when we already reached the maximum loop unroll number then in case we are sure that there is only a few loopsteps ahead we could unroll it to the end since probably this algorithm will not consume too much time/memory and the result state will not be much more complex.

What do you think about these ideas?

(I have CC'd everyone from the "loop widening" patch discussion.)

Best regards,
Peter

[1] https://reviews.llvm.org/D22507
[2] https://reviews.llvm.org/D24246
[3] https://reviews.llvm.org/D29612, https://reviews.llvm.org/D30876, https://reviews.llvm.org/D30877
[4] http://llvm.org/devmtg/2017-03//2017/02/20/accepted-sessions.html#7
[5] http://clang-analyzer.llvm.org/open_projects.html
[6] https://reviews.llvm.org/D12358


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [GSoC 2017] Improve loop execution model of the Clang Static Analyzer

Keane, Erich via cfe-dev
Hello Péter,

Sounds good to me. I'd also be interested in knowing the specifics of what you are planning to do.

Very happy to help at any point.

Thanks,

Sean Eveson
SN Systems - Sony Interactive Entertainment Group

On Thu, Mar 16, 2017 at 12:49 AM, Anna Zaks <[hidden email]> wrote:
Hi Péter,

Better modeling of loops is definitely an interesting topic. My main concern would be having a specific proposal on what improvements you could complete in the GSoC timeframe. The next step here is to write down more specifics about what you intend to work on. Please, let me, Devin, and Artem know if you’d like to talk to us while figuring this out.

Cheers!
Anna.


On Mar 13, 2017, at 9:43 AM, Péter Szécsi <[hidden email]> wrote:

Hello,

I would love to work on a Clang Static Analyzer project during the summer by joining GSoC.
Let me introduce myself:
I'm Peter Szecsi, a third-year BSc student at Eötvös Loránd University, Budapest. It would be my first Summer of Code project but I have already contributed to clang:
During the last summer I have uploaded 2 patches:
- An accepted and merged clang-tidy checker [1]
- An accepted Clang SA checker [2]
Since then I am working on cross translational unit analysis as a university project (and because of that I`ve send some patches about the ASTImporter [3]). Furthermore I participated in the preparations of a talk that was accepted at the EuroLLVM conference. [4]
I found SA interesting since I like to work on algorithmic challenges and enjoy participating in programming competitions and there is a lot of algorithms in the Static Analyzer which could be optimized/made more precise by heuristics.

That is the reason why I would be most interested in project "Implement generalized loop execution modeling" from the Clang SA Open Projects page [5]. Hopefully it is a suitable project for GSoC and it is possible to achieve useful improvements during this time frame.
I have read the discussion at the "loop widening" patch [6] and the most important enhancements (and milestones in the project) would be invalidating only the possible modified values. In addition to that I was thinking to have some heuristics which would approximate if a loop is too complex to even use this kind of strategy because it would still lead to invalidate most of the values and result in a lot of false positives. Another small heuristic could be: when we already reached the maximum loop unroll number then in case we are sure that there is only a few loopsteps ahead we could unroll it to the end since probably this algorithm will not consume too much time/memory and the result state will not be much more complex.

What do you think about these ideas?

(I have CC'd everyone from the "loop widening" patch discussion.)

Best regards,
Peter

[1] https://reviews.llvm.org/D22507
[2] https://reviews.llvm.org/D24246
[3] https://reviews.llvm.org/D29612, https://reviews.llvm.org/D30876, https://reviews.llvm.org/D30877
[4] http://llvm.org/devmtg/2017-03//2017/02/20/accepted-sessions.html#7
[5] http://clang-analyzer.llvm.org/open_projects.html
[6] https://reviews.llvm.org/D12358



_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [GSoC 2017] Improve loop execution model of the Clang Static Analyzer

Keane, Erich via cfe-dev

Hello,


thank you for your replies!

I collected my ideas about it but at the end of the email I put my concerns in perspective which dependency would be important and maybe it should be done first. I would gladly take it for the summer since it was mentioned in the GSoC project page earlier.


My initial thoughts in details about handling loops in a better way:
First, we have a control number N which determines how many steps to unroll during the analysis of a loop. Instead of this, in my opinion it would be better to have a number M which determines the maximum times a loop can branch the ExplodedGraph. In this case simulating a simple (non-branching) loop to the end would not be too complex. This means we could analyze the following example without a problem:


vector<int> v;

for(int i = 1; i<SIZE;++i)

{

                v[i] = f(i);

}

 

But we would not waste much time to analyze a complex loop for „nothing”:


for(auto it = begin();it!=end();it++ ) 
{

                if(it->complexfun1()) …
                if(it->complexfun2()) …
                else …

}

 

And this could be Milestone1. (That is quite optional but I think this would be a great improvement.)


The next step is the LoopWidening so whenever we stop the simulation - since the loop would result in a too complex/huge ExplodedGraph - we would continue the analysis and invalidate only the MemRegions of which binded values can be changed in the loop. In order to avoid false positives, however, an important restriction would be that whenever we would invalidate too much MemRegions then it is just better to stop the analyzing process. It can be tricky to say when exactly it is too much but determining it would be part of the project.


So how the invalidation should work? I believe a great approch would be the following: we would iterate over the statements of the loop's body and they would be handled one by one. Implementing an ASTVisitor could solve this and the statements which are not handled at the moment would result in the „conservative analysis” so it would cut the processing of the given path.

This way it can be an extensible and incremental solution since new callbacks can be implemented in an easy way to cover new statements. (It could work like in the ASTImporter.)

Milestone2 could be implementing the ASTVisitor described above.

The next step could be implementing the following nodes: function calls (CallExpr) of which parameters are const or passed by value, and the assignement operator in some trivial cases (no pointer/reference/array on the left side).
So whenever we stop the analysis of a loop which contains only those nodes, we have the information which regions to invalidate. This way the analysis can be continued safely.

For example:


int f(int);
int h(int,int);
int a,b,c = 0;
for(int i =0;i<100;i++)

{

                a = f(i);
                b = h(a,i);

}


int d = 12/c; // The analyzer should be able to find this and report the divByZero bug since the foor loop contains only those statements which are implemented.


However, in the next example:


int f(int);
int h(int*,int);
int a,b,c = 0;
for(int i =0;i<100;i++)

{

                a = f(i);
                b = h(&a,i);

}

int d = 12/c;

 

Here we have a function call which is not implemented (pointer parameter) so we just end this path’s analysis and not spot the bug after the loop.
This would be Milestone3.

After these fundamentals I would investigate which statements are commonly used in a loop, testing this feature on open source projects and I would aim to implement these visitor callbacks which are important according to these tests.


AAAAnd that is the point where I think we have a problem because usage of pointers and function calls by reference is quite common in a loop. In the naive and conservative way all of the typeT variable’s regions should be invalidated whenever we visit an assignment of which left hand side variable is typeT*. Moreover it could be casted to another type so maybe it should result invalidating all MemRegions. That is why I think that a points-to analysis would be critical in order to make this whole approach work on analysis of real projects. Because of that I am considering to apply for the project “Implement points-to analysis” which I have seen it on http://llvm.org/OpenProjects.html 2 weeks ago but it mysteriously disappeared.


My questions:
1. What do you think of this whole approach? (Please let me know if the whole idea is broken because I missed something.)
2. Do you agree that arguing about pointer's possible values is important for achieving this?
3. If that is the case could I still apply for that project or is it removed because of good reasons?


Cheers,

Peter

2017-03-16 13:25 GMT+01:00 Sean Eveson <[hidden email]>:
Hello Péter,

Sounds good to me. I'd also be interested in knowing the specifics of what you are planning to do.

Very happy to help at any point.

Thanks,

Sean Eveson
SN Systems - Sony Interactive Entertainment Group

On Thu, Mar 16, 2017 at 12:49 AM, Anna Zaks <[hidden email]> wrote:
Hi Péter,

Better modeling of loops is definitely an interesting topic. My main concern would be having a specific proposal on what improvements you could complete in the GSoC timeframe. The next step here is to write down more specifics about what you intend to work on. Please, let me, Devin, and Artem know if you’d like to talk to us while figuring this out.

Cheers!
Anna.


On Mar 13, 2017, at 9:43 AM, Péter Szécsi <[hidden email]> wrote:

Hello,

I would love to work on a Clang Static Analyzer project during the summer by joining GSoC.
Let me introduce myself:
I'm Peter Szecsi, a third-year BSc student at Eötvös Loránd University, Budapest. It would be my first Summer of Code project but I have already contributed to clang:
During the last summer I have uploaded 2 patches:
- An accepted and merged clang-tidy checker [1]
- An accepted Clang SA checker [2]
Since then I am working on cross translational unit analysis as a university project (and because of that I`ve send some patches about the ASTImporter [3]). Furthermore I participated in the preparations of a talk that was accepted at the EuroLLVM conference. [4]
I found SA interesting since I like to work on algorithmic challenges and enjoy participating in programming competitions and there is a lot of algorithms in the Static Analyzer which could be optimized/made more precise by heuristics.

That is the reason why I would be most interested in project "Implement generalized loop execution modeling" from the Clang SA Open Projects page [5]. Hopefully it is a suitable project for GSoC and it is possible to achieve useful improvements during this time frame.
I have read the discussion at the "loop widening" patch [6] and the most important enhancements (and milestones in the project) would be invalidating only the possible modified values. In addition to that I was thinking to have some heuristics which would approximate if a loop is too complex to even use this kind of strategy because it would still lead to invalidate most of the values and result in a lot of false positives. Another small heuristic could be: when we already reached the maximum loop unroll number then in case we are sure that there is only a few loopsteps ahead we could unroll it to the end since probably this algorithm will not consume too much time/memory and the result state will not be much more complex.

What do you think about these ideas?

(I have CC'd everyone from the "loop widening" patch discussion.)

Best regards,
Peter

[1] https://reviews.llvm.org/D22507
[2] https://reviews.llvm.org/D24246
[3] https://reviews.llvm.org/D29612, https://reviews.llvm.org/D30876, https://reviews.llvm.org/D30877
[4] http://llvm.org/devmtg/2017-03//2017/02/20/accepted-sessions.html#7
[5] http://clang-analyzer.llvm.org/open_projects.html
[6] https://reviews.llvm.org/D12358




_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [GSoC 2017] Improve loop execution model of the Clang Static Analyzer

Keane, Erich via cfe-dev

On Mar 25, 2017, at 7:33 AM, Péter Szécsi <[hidden email]> wrote:

Hello,


thank you for your replies!

I collected my ideas about it but at the end of the email I put my concerns in perspective which dependency would be important and maybe it should be done first. I would gladly take it for the summer since it was mentioned in the GSoC project page earlier.


My initial thoughts in details about handling loops in a better way:

Thanks for the writeup!
It would be very useful if you could prepend each idea with motivation explaining which problem you are trying to address.

One of the issues we’d like to solve with generalized modeling of loops is that the analyzer looses coverage since it only simulates the first N iterations of the loop. This means that it does not cover paths that represent all other iterations such as the last iteration of the loop. One problem is that the loops that are known by the analyzer to have a constant bound will end exploration of a path. Note that the main reason to have a bound in the first place is that the analyzer usually does not know how many iterations of the loop to “execute” and will keep going forever. It will create all these branches:
 - simulate that exactly one iteration is executed
 - simulate that exactly two iterations are executed
 - simulate that exactly three iterations are executed
 -….

 First, we have a control number N which determines how many steps to unroll during the analysis of a loop. Instead of this, in my opinion it would be better to have a number M which determines the maximum times a loop can branch the ExplodedGraph.

Could you explain how this number will be computed and give examples on what it will be at least for the cases below? And more importantly, how will this number be used? (I don’t think the intent here is that M will replace N in all cases, correct?)

In this case simulating a simple (non-branching) loop to the end would not be too complex. This means we could analyze the following example without a problem:


Are you saying M=0 for the case below? Is SIZE assumed to be a constant?

vector<int> v;

for(int i = 1; i<SIZE;++i)

{

                v[i] = f(i);

} 

But we would not waste much time to analyze a complex loop for „nothing”:


for(auto it = begin();it!=end();it++ ) 
{

                if(it->complexfun1()) …
                if(it->complexfun2()) …
                else …

}

 

And this could be Milestone1. (That is quite optional but I think this would be a great improvement.)


The next step is the LoopWidening so whenever we stop the simulation - since the loop would result in a too complex/huge ExplodedGraph - we would continue the analysis and invalidate only the MemRegions of which binded values can be changed in the loop. In order to avoid false positives, however, an important restriction would be that whenever we would invalidate too much MemRegions then it is just better to stop the analyzing process. It can be tricky to say when exactly it is too much but determining it would be part of the project.


So how the invalidation should work? I believe a great approch would be the following: we would iterate over the statements of the loop's body and they would be handled one by one. Implementing an ASTVisitor could solve this and the statements which are not handled at the moment would result in the „conservative analysis” so it would cut the processing of the given path.

This way it can be an extensible and incremental solution since new callbacks can be implemented in an easy way to cover new statements. (It could work like in the ASTImporter.)

Milestone2 could be implementing the ASTVisitor described above.

The next step could be implementing the following nodes: function calls (CallExpr) of which parameters are const or passed by value, and the assignement operator in some trivial cases (no pointer/reference/array on the left side).
So whenever we stop the analysis of a loop which contains only those nodes, we have the information which regions to invalidate. This way the analysis can be continued safely.

For example:


int f(int);
int h(int,int);
int a,b,c = 0;
for(int i =0;i<100;i++)

{

                a = f(i);
                b = h(a,i);

}


int d = 12/c; // The analyzer should be able to find this and report the divByZero bug since the foor loop contains only those statements which are implemented.


However, in the next example:


int f(int);
int h(int*,int);
int a,b,c = 0;
for(int i =0;i<100;i++)

{

                a = f(i);
                b = h(&a,i);

}

int d = 12/c;

 

Here we have a function call which is not implemented (pointer parameter) so we just end this path’s analysis and not spot the bug after the loop.
This would be Milestone3.

After these fundamentals I would investigate which statements are commonly used in a loop, testing this feature on open source projects and I would aim to implement these visitor callbacks which are important according to these tests.


AAAAnd that is the point where I think we have a problem because usage of pointers and function calls by reference is quite common in a loop. In the naive and conservative way all of the typeT variable’s regions should be invalidated whenever we visit an assignment of which left hand side variable is typeT*. Moreover it could be casted to another type so maybe it should result invalidating all MemRegions. That is why I think that a points-to analysis would be critical in order to make this whole approach work on analysis of real projects. Because of that I am considering to apply for the project “Implement points-to analysis” which I have seen it on http://llvm.org/OpenProjects.html 2 weeks ago but it mysteriously disappeared.


My questions:
1. What do you think of this whole approach? (Please let me know if the whole idea is broken because I missed something.)
2. Do you agree that arguing about pointer's possible values is important for achieving this?
3. If that is the case could I still apply for that project or is it removed because of good reasons?


Cheers,

Peter

2017-03-16 13:25 GMT+01:00 Sean Eveson <[hidden email]>:
Hello Péter,

Sounds good to me. I'd also be interested in knowing the specifics of what you are planning to do.

Very happy to help at any point.

Thanks,

Sean Eveson
SN Systems - Sony Interactive Entertainment Group

On Thu, Mar 16, 2017 at 12:49 AM, Anna Zaks <[hidden email]> wrote:
Hi Péter,

Better modeling of loops is definitely an interesting topic. My main concern would be having a specific proposal on what improvements you could complete in the GSoC timeframe. The next step here is to write down more specifics about what you intend to work on. Please, let me, Devin, and Artem know if you’d like to talk to us while figuring this out.

Cheers!
Anna.


On Mar 13, 2017, at 9:43 AM, Péter Szécsi <[hidden email]> wrote:

Hello,

I would love to work on a Clang Static Analyzer project during the summer by joining GSoC.
Let me introduce myself:
I'm Peter Szecsi, a third-year BSc student at Eötvös Loránd University, Budapest. It would be my first Summer of Code project but I have already contributed to clang:
During the last summer I have uploaded 2 patches:
- An accepted and merged clang-tidy checker [1]
- An accepted Clang SA checker [2]
Since then I am working on cross translational unit analysis as a university project (and because of that I`ve send some patches about the ASTImporter [3]). Furthermore I participated in the preparations of a talk that was accepted at the EuroLLVM conference. [4]
I found SA interesting since I like to work on algorithmic challenges and enjoy participating in programming competitions and there is a lot of algorithms in the Static Analyzer which could be optimized/made more precise by heuristics.

That is the reason why I would be most interested in project "Implement generalized loop execution modeling" from the Clang SA Open Projects page [5]. Hopefully it is a suitable project for GSoC and it is possible to achieve useful improvements during this time frame.
I have read the discussion at the "loop widening" patch [6] and the most important enhancements (and milestones in the project) would be invalidating only the possible modified values. In addition to that I was thinking to have some heuristics which would approximate if a loop is too complex to even use this kind of strategy because it would still lead to invalidate most of the values and result in a lot of false positives. Another small heuristic could be: when we already reached the maximum loop unroll number then in case we are sure that there is only a few loopsteps ahead we could unroll it to the end since probably this algorithm will not consume too much time/memory and the result state will not be much more complex.

What do you think about these ideas?

(I have CC'd everyone from the "loop widening" patch discussion.)

Best regards,
Peter

[1] https://reviews.llvm.org/D22507
[2] https://reviews.llvm.org/D24246
[3] https://reviews.llvm.org/D29612, https://reviews.llvm.org/D30876, https://reviews.llvm.org/D30877
[4] http://llvm.org/devmtg/2017-03//2017/02/20/accepted-sessions.html#7
[5] http://clang-analyzer.llvm.org/open_projects.html
[6] https://reviews.llvm.org/D12358





_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Loading...