[analyzer][RFC] Get info from the LLVM IR for precision

Gábor Márton via cfe-dev
Hi,

I have been working on a prototype that makes it possible to access the IR from the components of the Clang Static Analyzer.

There are many important and useful analyses in the LLVM layer that we can use during the path-sensitive analysis. Most notably, the "readnone" and "readonly" function attributes (https://llvm.org/docs/LangRef.html), which can be used to identify "pure" functions (those without side effects). In the prototype I use the pureness info from the IR to avoid invalidating any variables during conservative evaluation (when we evaluate a pure function). There are cases where we get false positives precisely because of this overly conservative invalidation.
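To make the invalidation point concrete, here is a hypothetical C++ snippet (not from the prototype; the function and variable names are invented for illustration). LLVM's attribute inference can mark `square` as `readnone`, which would license the analyzer to keep what it knows about `len` across the call:

```cpp
#include <cassert>

// A side-effect-free function: it neither reads nor writes memory,
// so LLVM's function-attribute inference can mark it `readnone`.
int square(int x) { return x * x; }

int demo() {
    int len = 5;
    int s = square(len); // conservative evaluation without IR info would
                         // invalidate `len` here; with `readnone` the
                         // analyzer can keep the constraint len == 5
    assert(len == 5);
    return s;
}
```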

Some further ideas to use info from the IR:
- We should invalidate only the arg regions for functions with "argmemonly" attribute.
- Use the smarter invalidation in cross translation unit analysis too. We can get the IR for the other TUs as well.
- Run the Attributor passes on the IR. We could get value ranges for return values or for arguments. These ranges could then be fed to StdLibraryFunctionsChecker to make the proper assumptions. We could do this in CTU mode too; the attributes would form a sort of summary of these functions. Note that I don't expect a meaningful summary for more than a few percent of all available functions.
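As a sketch of the "argmemonly" idea (hypothetical code; names invented for illustration): a callee carrying that attribute may only access memory reachable through its pointer arguments, so a conservative call would need to invalidate only the argument regions, not globals or unrelated locals:

```cpp
#include <cassert>

int global = 1;

// Touches memory only through its pointer argument; LLVM can infer
// `argmemonly` for it, so only the pointed-to region would need
// invalidation at a conservatively evaluated call site.
void store42(int *p) { *p = 42; }

int demo() {
    int local = 0;
    store42(&local);     // only `local` may change here...
    assert(global == 1); // ...`global` provably keeps its value
    return local;
}
```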

Please let me know if you have any further ideas about how we could use IR attributes (or anything else) during the symbolic execution.

There are some concerns as well. There may be source code that we cannot CodeGen but can still analyse with the current CSA. That is why I suppress CodeGen diagnostics in the prototype. But in the worst case we may run into assertions in CodeGen, and that could regress the whole analysis experience. This may happen especially when we get a compile_commands.json from a project that is compiled only with, e.g., GCC.

Thanks,
Gabor

 

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Re: [analyzer][RFC] Get info from the LLVM IR for precision

Artem Dergachev via cfe-dev
I'm excited that this is actually moving somewhere!

Let's see what consequences we have here. I have some thoughts, but I don't immediately see any architecturally catastrophic consequences; you're "just" generating an llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right? I'd love to hear more opinions. Here's what I see:

1. We can no longer mutate the AST for analysis purposes without the risk of screwing up subsequent codegen. And the risk would be pretty high because hand-crafting ASTs is extremely difficult. Good thing we aren't actually doing this.
    1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.

2. Ok, yeah, we may now have crashes in CodeGen during analysis. Normally they shouldn't be that bad, because they would mean that CodeGen crashes during normal compilation as well, and that's rare; codegen crashes are much rarer than analyzer crashes. Of course a difference can be triggered by #ifndef __clang_analyzer__, but even then the crash remains proof of valid crashing code, so that should be rare.
    2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.

Let's also talk about the benefits. First of all, *we still need the source code available during analysis*. This isn't about peeking into binary dependencies, and it doesn't immediately aid CTU in any way; this is entirely about improving upon conservative evaluation on the currently available AST, for functions that are already available for inlining but are not being inlined for whatever reason. In fact, in some cases we may later prefer such LLVM IR-based evaluation to inlining, which may improve analysis performance (i.e., less path explosion) *and* correctness (e.g., avoid unjustified state splits).

On 05.08.2020 08:29, Gábor Márton via cfe-dev wrote:
Re: [analyzer][RFC] Get info from the LLVM IR for precision

Artem Dergachev via cfe-dev
Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? The way I imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?) We should probably not be optimizing the IR at all in the process(?)

On 05.08.2020 12:17, Artem Dergachev wrote:
Re: [analyzer][RFC] Get info from the LLVM IR for precision

Gábor Márton via cfe-dev
> you're "just" generating llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right?
This works differently. We generate the LLVM code for the whole translation unit during parsing: it is the Parser and Sema that call into the callbacks of the CodeGenerator via the ASTConsumer interface. This is the exact same mechanism used for the backend (see BackendConsumer). We register both the CodeGenerator AST consumer and the AnalysisAstConsumer with the AnalysisAction (via a MultiplexConsumer). By the time we start the symbolic execution in AnalysisConsumer::HandleTranslationUnit, CodeGen is already done (CodeGen is added first to the MultiplexConsumer, so its HandleTranslationUnit and other callbacks are invoked earlier). As for caching: the LLVM code is generated only once; then, during function call evaluation, we look the function up in the llvm::Module using its mangled name as the key (we don't cache the mangled names now, but we could).
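The ordering guarantee described above (CodeGen finishing before the analyzer starts, because it is registered first) can be sketched with a toy multiplexer. This is only an analogy: the real code uses clang::ASTConsumer and clang::MultiplexConsumer; the class names below are invented for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for clang::ASTConsumer.
struct Consumer {
    virtual ~Consumer() = default;
    virtual void HandleTranslationUnit(std::vector<std::string> &log) = 0;
};

struct CodeGenLike : Consumer {
    void HandleTranslationUnit(std::vector<std::string> &log) override {
        log.push_back("codegen"); // IR for the whole TU exists after this
    }
};

struct AnalysisLike : Consumer {
    void HandleTranslationUnit(std::vector<std::string> &log) override {
        log.push_back("analysis"); // symbolic execution can now query the IR
    }
};

// Toy stand-in for clang::MultiplexConsumer: forwards each callback to
// the registered consumers in registration order.
struct Multiplex : Consumer {
    std::vector<std::unique_ptr<Consumer>> consumers;
    void HandleTranslationUnit(std::vector<std::string> &log) override {
        for (auto &c : consumers)
            c->HandleTranslationUnit(log);
    }
};

std::vector<std::string> runPipeline() {
    Multiplex m;
    m.consumers.push_back(std::make_unique<CodeGenLike>()); // registered first
    m.consumers.push_back(std::make_unique<AnalysisLike>());
    std::vector<std::string> log;
    m.HandleTranslationUnit(log);
    return log; // {"codegen", "analysis"}
}
```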
It would be possible to call the callbacks of the CodeGenerator directly, on demand, without registering it with the FrontendAction. In fact, my first attempt was to call HandleTopLevelDecl for a given FunctionDecl on demand whenever we needed the LLVM code. However, that approach turned out to be a dead end, for two reasons: (1) I could not support ObjC/C++, because I could not get all the information that Sema has when it calls HandleTopLevelDeclInObjCContainer. In fact, I don't think calling these callbacks directly is supported at all, only indirectly through a registered ASTConsumer, because we cannot know how the Parser and Sema call into them. (2) It is not enough to get the LLVM code for a function in isolation. E.g., for the "readonly" attribute we must enable alias analysis on global variables (see GlobalsAAResult), so we must emit LLVM code for the global variables as well.
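A hypothetical snippet (names invented) illustrating point (2): a function that reads a global is at best "readonly", never "readnone", and proving even "readonly" needs module-level alias information about the global, which is why the globals themselves must be emitted:

```cpp
#include <cassert>

int g = 7;

// Reads the global `g`, so at best `readonly`. Proving that no memory
// is written requires whole-module knowledge of how `g` can be aliased
// and modified, which is what GlobalsAAResult provides.
int getG() { return g; }

// Touches no memory at all: a `readnone` candidate, provable from the
// function body alone, with no module-level information needed.
int pureSquare(int x) { return x * x; }
```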

> 1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.
> 2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.
We should not run CodeGen on a merged AST; ASTImporter does not support the ASTConsumer interface. In the case of CTU, I think we should generate the IR for each TU in isolation. We should probably extend the CrossTranslationUnit interface to hand back the llvm::Function for a given FunctionDecl. Or we could make this more transparent, and the IRContext in this prototype could be CTU-aware.

> Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? 
There is one dependency we will never be able to get rid of: CodeGen emits lifetime markers only when the optimization level is greater than or equal to 2 (-O2, -O3). These lifetime markers are needed to get precise pureness info out of GlobalsAA.
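For illustration, a hypothetical input (names invented; the IR shapes in the comments reflect the behavior described above, where lifetime markers appear only at higher optimization levels):

```cpp
// A block-scope local whose scope ends before the function returns.
// Per the discussion above, CodeGen brackets such locals with
// llvm.lifetime.start / llvm.lifetime.end intrinsics only at higher
// optimization levels; those markers are what let analyses such as
// GlobalsAA reason precisely about the local's liveness.
int sumUpTo(int n) {
    int total = 0;
    {
        int step = 1;              // -O2 IR: llvm.lifetime.start on step's alloca
        for (int i = 0; i < n; ++i)
            total += step;
    }                              // -O2 IR: llvm.lifetime.end
    return total;
}
```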

> The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?)
Yes, but we need to set the optimization level so that CodeGen emits lifetime markers. Indeed, there are many LLVM analyses that do not change the IR at all and just populate their results, and we could simply use those results in the CSA.
> We should probably not be optimizing the IR at all in the process(?)
Some LLVM passes may invalidate the results of previous analyses, and then we need to rerun those. I am not an expert, but I think that if we rerun an analysis after a transform pass has simplified the IR, our results can be more precise. That is why we see multiple runs of the same analyses when we do optimizations, and orchestrating this is perhaps exactly the PassManager's job (?).
There are also passes that add information to the IR (e.g., InferFunctionAttrsPass). Strictly speaking we may not need these, but I really don't know how the different analyses use the function attributes.
Maybe we need the IR both in unoptimized and in optimized form. We may also want our own CSA-specific pipeline, but using the default O2 pipeline seems to simplify things.

On Wed, Aug 5, 2020 at 11:22 PM Artem Dergachev <[hidden email]> wrote:

Re: [analyzer][RFC] Get info from the LLVM IR for precision

Artem Dergachev via cfe-dev
Umm, ok!~

Static analysis is commonly run on debug builds, and those are typically unoptimized. It is not common for a project to have a release+asserts build, yet analysis relies on asserts, so debug builds are what people usually analyze. If your project completely ignores debug builds, its usefulness drops a lot.

Sounds like we want to disconnect this new fake codegen from compiler flags entirely. The AST will depend on compiler flags, but we should not take -O flags into account at all; instead we should pick some default, say -O2, regardless of the flags. Ideally all codegen-related flags would be ignored by default, to keep the experience as consistent as possible.

You'd also have to make sure that running CodeGen doesn't have unwanted side effects such as emitting a .o file.

Would something like that actually work?

And if it would, would this also address the usual concerns about making warnings depend on optimizations? Because the optimizations now remain consistent: they no longer depend on the optimization flags used for actual code generation, nor interact with code generation; they're now simply another analysis performed on the AST that depends solely on the AST.

On 8/6/20 2:06 AM, Gábor Márton wrote:

Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
I like the idea of piggybacking on some analyses in the LLVM IR. However, I have some concerns as well. I am not well versed in the LLVM optimizer, but I do see potential side effects. E.g., what if a static function is inlined into ALL call sites, so that the original function can be removed? We would then no longer be able to get the useful info for that function. It would be unfortunate if the analysis results depended on inlining heuristics; that would make the analyzer even harder to debug and understand.
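The concern can be made concrete with a hypothetical snippet (names invented for illustration). The program's behavior is unchanged by inlining, but the IR symbol the analyzer would want to query may no longer exist in the optimized module:

```cpp
#include <cassert>

namespace {
// Internal linkage: invisible outside this TU. If the optimizer inlines
// it into every caller, the now-dead body may be deleted from the
// llvm::Module, and a later lookup by mangled name would find nothing.
int twice(int x) { return 2 * x; }
} // namespace

int callsite(int v) { return twice(v) + 1; }
```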

On Thu, 6 Aug 2020 at 19:20, Artem Dergachev via cfe-dev <[hidden email]> wrote:
Umm, ok!~

Static analysis is commonly run in debug builds and those are typically unoptimized. It is not common for a project to have a release+asserts build but we are relying on asserts for analysis, so debug builds are commonly used for analysis. If your project completely ignores debug builds its usefulness drops a lot.

Sounds like we want to disconnect this new fake codegen from compiler flags entirely. Like, the AST will depend on compiler flags, but we should not be taking -O flags into account at all, but pick some default -O2 regardless of flags; and ideally all flags should be ignored by default, to ensure experience as consistent as possible.

You'd also have to make sure that running CodeGen doesn't have unwanted side effects such as emitting a .o file.

Would something like that actually work?

And if it would, would this also address the usual concerns about making warnings depend on optimizations? Because, like, optimizations now remain consistent and no longer depend on optimization flags used for actual code generation or interact with code generation; they're now simply another analysis performed on the AST that depends solely on the AST.

On 8/6/20 2:06 AM, Gábor Márton wrote:
> you're "just" generating llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right?
This works differently. We generate the LLVM code for the whole translation unit during parsing. It is the Parser and the Sema that call into the callbacks of the CodeGenerator via the ASTConsumer interface; this is the exact same mechanism that is used for the Backend (see the BackendConsumer). We register both the CodeGenerator AST consumer and the AnalysisAstConsumer with the AnalysisAction (via a MultiplexConsumer). By the time we start the symbolic execution in AnalysisConsumer::HandleTranslationUnit, the CodeGen is already done (CodeGen is added first to the MultiplexConsumer, so its HandleTranslationUnit and other callbacks are called earlier). About caching: the LLVM code is cached, we generate it only once; then, during function call evaluation, we search for it in the llvm::Module using the mangled name as the key (we don't cache the mangled names yet, but we could).
It would be possible to directly call the callbacks of the CodeGenerator on demand, without registering it with the FrontendAction. In fact, my first attempt was to call HandleTopLevelDecl for a given FunctionDecl on demand, whenever we needed the LLVM code. However, that approach is a dead end, for two reasons: (1) I could not support ObjC/C++, because I could not reconstruct all the information the Sema has when it calls HandleTopLevelDeclInObjCContainer. In fact, I think these callbacks are not meant to be called directly, only indirectly through a registered ASTConsumer, because we cannot know exactly how the Parser and the Sema call them. (2) It is not enough to get the LLVM code for a function in isolation. E.g., for the "readonly" attribute we must enable alias analysis on global variables (see GlobalsAAResult), so we must emit LLVM code for global variables as well.

> 1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.
> 2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.
We should not call CodeGen on a merged AST; ASTImporter does not support the ASTConsumer interface. In the case of CTU, I think we should generate the IR for each TU in isolation. We should probably extend the CrossTranslationUnit interface to hand back the llvm::Function for a given FunctionDecl. Or we could make this more transparent, and the IRContext in this prototype could be made CTU-aware.

> Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? 
There is a dependency we will never be able to get rid of: CodeGen generates lifetime markers only when the optimization level is greater than or equal to 2 (-O2, -O3). These lifetime markers are needed to get precise pureness info out of GlobalsAA.

> The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?)
Yes, but we need to set the optimization level so CodeGen generates lifetime markers. Indeed, there are many llvm analyses that simply do not change the IR and just populate their results. And we could simply use the results in CSA.
> We should probably not be optimizing the IR at all in the process(?)
Some LLVM passes may invalidate the results of previous analyses, and then we need to rerun those. I am not an expert, but I think if we run an analysis again after a transformation pass has simplified the IR, our results could be more precise. That is why we see multiple runs of the same analyses when we optimize, and perhaps orchestrating exactly this is the job of the PassManager (?).
There are also passes that annotate the IR (e.g., InferFunctionAttrsPass); strictly speaking we may not need these, but I really don't know how the different analyses use the function attributes.
Maybe we need the IR both in unoptimized form and in optimized form. Also, we may want to have our own CSA-specific pipeline, but using the default O2 pipeline seems to simplify things.

On Wed, Aug 5, 2020 at 11:22 PM Artem Dergachev <[hidden email]> wrote:
Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?) We should probably not be optimizing the IR at all in the process(?)

On 05.08.2020 12:17, Artem Dergachev wrote:
I'm excited that this is actually moving somewhere!

Let's see what consequences we have here. I have some thoughts, but I don't immediately see any architecturally catastrophic consequences; you're "just" generating an llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right??? I'd love to hear more opinions. Here's what I see:

1. We can no longer mutate the AST for analysis purposes without the risk of screwing up subsequent codegen. And the risk would be pretty high because hand-crafting ASTs is extremely difficult. Good thing we aren't actually doing this.
    1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.

2. Ok, yeah, we now may have crashes in CodeGen during analysis. Normally they shouldn't be that bad because this would mean that CodeGen would crash during normal compilation as well. And that's rare; codegen crashes are much more rare than analyzer crashes. Of course a difference can be triggered by #ifndef __clang_analyzer__ but it still remains a proof of valid crashing code, so that should be rare.
    2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.

Let's also talk about the benefits. First of all, *we still need the source code available during analysis*. This isn't about peeking into binary dependencies and it doesn't immediately aid CTU in any way; this is entirely about improving upon conservative evaluation on the currently available AST, for functions that are already available for inlining but are not being inlined for whatever reason. In fact, in some cases we may later prefer such LLVM IR-based evaluation to inlining, which may improve analysis performance (i.e., less path explosion) *and* correctness (e.g., avoid unjustified state splits).


Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
Yes, this is a good point, and a reason to assemble our own CSA-specific LLVM pipeline to avoid such removal of static functions. We may want to skip the inliner pass. Or, since I assume it is a separate module pass that removes unused static functions, a better alternative could be to skip that pass from the pipeline.


Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
> Sounds like we want to disconnect this new fake codegen from compiler flags entirely. The AST will still depend on compiler flags, but we should not take -O flags into account at all; instead, pick some default -O2 regardless of the flags. Ideally, all flags should be ignored by default, to keep the experience as consistent as possible.

Yes I agree. Just added a TODO in the patch for that.

> You'd also have to make sure that running CodeGen doesn't have unwanted side effects such as emitting a .o file. Would something like that actually work?
That's easy, we simply don't add the Emit pass. And the `buildPerModuleDefaultPipeline()` that I use in the prototype does not add any emit pass to the built pipeline.

> would this also address the usual concerns about making warnings depend on optimizations? Because, like, optimizations now remain consistent and no longer depend on optimization flags used for actual code generation or interact with code generation; they're now simply another analysis performed on the AST that depends solely on the AST.
Yes, absolutely. The solution I have in the prototype for this is to suppress all diagnostics for the DiagEngine of the CodeGenerator: `CodeGenDiags->setSuppressAllDiagnostics(true);`

p.s.: Artem, I forgot to "reply all" previously, so you receive this twice, sorry.

On Thu, Aug 6, 2020 at 7:20 PM Artem Dergachev <[hidden email]> wrote:
Umm, ok!~

Static analysis is commonly run in debug builds and those are typically unoptimized. It is not common for a project to have a release+asserts build but we are relying on asserts for analysis, so debug builds are commonly used for analysis. If your project completely ignores debug builds its usefulness drops a lot.

Sounds like we want to disconnect this new fake codegen from compiler flags entirely. Like, the AST will depend on compiler flags, but we should not be taking -O flags into account at all, but pick some default -O2 regardless of flags; and ideally all flags should be ignored by default, to ensure experience as consistent as possible.

You'd also have to make sure that running CodeGen doesn't have unwanted side effects such as emitting a .o file.

Would something like that actually work?

And if it would, would this also address the usual concerns about making warnings depend on optimizations? Because, like, optimizations now remain consistent and no longer depend on optimization flags used for actual code generation or interact with code generation; they're now simply another analysis performed on the AST that depends solely on the AST.

On 8/6/20 2:06 AM, Gábor Márton wrote:
> you're "just" generating llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right?
This works differently. We generate the llvm code for the whole translation unit during parsing. It is the Parser and the Sema that calls into the callbacks of the CodeGenerator via the ASTConsumer interface. This is the exact same mechanism that is used for the Backend (see the BackendConsumer). We register both the CodeGenerator ast consumer and the AnalysisAstConsumer with the AnalysisAction (we use a MultiplexConsumer). By the time we start the symbolic execution in AnalysisConsumer::HandleTranslationUnit, the CodeGen is already done (since CodeGen is added first to the MultiplexConsumer so its HandleTranslationUnit and other callbacks are called back earlier). About caching, the llvm code is cached, we generate that only once, then during the function call evaluation we search it in the llvm::Module using the mangled name as the key (we don't cache the mangled names now, but we could).
It would be possible to directly call the callbacks of the CodeGenerator on-demand, without registering that to the FrontendAction. Actually, my first attempt was to call HandleTopLevelDecl for a given FunctionDecl on demand when we needed the llvm code. However, this is a false attempt for the following reasons: (1) Could not support ObjC/C++ because I could not get all the information that the Sema has when it calls to HandleTopLevelDeclInObjCContainer. In fact, I think it is not supported to call these callbacks directly, just indirectly through a registered ASTConsumer because we may not know how the Parser and the Sema calls to these. (2) It is not enough to get the llvm code for a function in isolation. E.g., for the "readonly" attribute we must enable alias analysis on global variables (see GlobalsAAResult), so we must emit llvm code for global variables.

> 1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.
> 2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.
We should not call the CodeGen on a merged AST. ASTImporter does not support the ASTConsumer interface. In the case of CTU, I think we should generate the IR for each TU in isolation. And we should probably want to extend the CrossTranslationUnit interface to give back the llvm::Function for a given FunctionDecl. Or we could make this more transparent and the IRContext in this prototype could be CTU aware.

> Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? 
There is a dependency we will never be able to get rid of: CodeGen generates lifetime markers only when the optimization level is greater or eq to 2 (-O2, -O3) .These lifetime markers are needed to get the precise pureness info out of GlobalsAA.

> The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?)
Yes, but we need to set the optimization level so CodeGen generates lifetime markers. Indeed, there are many llvm analyses that simply do not change the IR and just populate their results. And we could simply use the results in CSA.
> We should probably not be optimizing the IR at all in the process(?)
Some llvm passes may invalidate the results of previous analyses and then we need to rerun those. I am not an expert, but I think if we run an analysis again after another analysis that optimizes the IR (i.e truncates it) then our results could be more precise. And that is the reason why we see multiple passes for the same analyses when we do optimizations. And perhaps this is the exact job of the PassManager to orchestrate this (?). 
There are passes that extend the IR (e.g InferFunctionAttrsPass), we may not need these strictly speaking, but I really don't know how the different analyses use the function attributes.
Maybe we need the IR both in unoptimized form and in optimized form. Also, we may want to have our own CSA specific pipeline, but having the default O2 pipeline seems to simplify things.

On Wed, Aug 5, 2020 at 11:22 PM Artem Dergachev <[hidden email]> wrote:
Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?) We should probably not be optimizing the IR at all in the process(?)

On 05.08.2020 12:17, Artem Dergachev wrote:
I'm excited that this is actually moving somewhere!

Let's see what consequences do we have here. I have some thoughts but i don't immediately see any architecturally catastrophic consequences; you're "just" generating llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right??? I'd love to hear more opinions. Here's what i see:

1. We can no longer mutate the AST for analysis purposes without the risk of screwing up subsequent codegen. And the risk would be pretty high because hand-crafting ASTs is extremely difficult. Good thing we aren't actually doing this.
    1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.

2. Ok, yeah, we now may have crashes in CodeGen during analysis. Normally they shouldn't be that bad because this would mean that CodeGen would crash during normal compilation as well. And that's rare; codegen crashes are much more rare than analyzer crashes. Of course a difference can be triggered by #ifndef __clang_analyzer__ but it still remains a proof of valid crashing code, so that should be rare.
    2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.

Let's also talk about the benefits. First of all, *we still need the source code available during analysis*. This isn't about peeking into binary dependencies and it doesn't immediately aid CTU in any way; this is entirely about improving upon conservative evaluation on the currently available AST, for functions that are already available for inlining but are not being inlined for whatever reason. In fact, in some cases we may later prefer such LLVM IR-based evaluation to inlining, which may improve analysis performance (i.e., less path explosion) *and* correctness (eg., avoid unjustified state splits).

On 05.08.2020 08:29, Gábor Márton via cfe-dev wrote:
Hi,

I have been working on a prototype that makes it possible to access the IR from the components of the Clang Static Analyzer.

There are many important and useful analyses in the LLVM layer that we can use during the path sensitive analysis. Most notably, the "readnone" and "readonly" function attributes (https://llvm.org/docs/LangRef.html) which can be used to identify "pure" functions (those without side effects). In the prototype I am using the pureness info from the IR to avoid invalidation of any variables during conservative evaluation (when we evaluate a pure function). There are cases when we get false positives exactly because of the too conservative invalidation.

Some further ideas to use info from the IR:
- We should invalidate only the arg regions for functions with "argmemonly" attribute.
- Use the smarter invalidation in cross translation unit analysis too. We can get the IR for the other TUs as well.
- Run the Attributor passes on the IR. We could get range values for return values or for arguments. These range values then could be fed to StdLibraryFunctionsChecker to make the proper assumptions. And we could do this in CTU mode too, these attributes could form some sort of a summary of these functions. Note that I don't expect a meaningful summary for more than a few percent of all the available functions.

Please let me know if you have any further ideas about how we could use IR attributes (or anything else) during the symbolic execution.

There are some concerns as well. There may be some source code that we cannot CodeGen, but we can still analyse with the current CSA. That is why I suppress CodeGen diagnostics in the prototype. But in the worst case we may run into assertions in the CodeGen and this may cause regression in the whole analysis experience. This may be the case especially when we get a compile_commands.json from a project that is compiled only with e.g. GCC.

Thanks,
Gabor

 

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
In reply to this post by Hollman, Daisy Sophia via cfe-dev
Speaking of the pipeline, I think we should strive for a general architecture.

Basically, the proposal is using the analyses of LLVM IR as an oracle for certain properties of conservatively evaluated functions. In the (possibly far) future, Clang might get additional analyses based on the CFG or a new middle level IR. With the optimal solution it should be possible to replace, add, remove, or maybe even combine oracles easily. I do not insist on large efforts for generalizing as we do not have multiple oracles to verify the approach, but whenever we make a design decision I think this is something that we want to keep in mind.

On Thu, 6 Aug 2020 at 21:42, Gábor Márton <[hidden email]> wrote:
Yes, this is a good point. And a reason to assemble our own CSA-specific LLVM pipeline that avoids such removal of static functions. We may want to skip the inliner pass. Or, better: I assume there is a module pass that removes unused static functions, so we could skip that pass from the pipeline instead.

On Thu, Aug 6, 2020 at 8:24 PM Gábor Horváth <[hidden email]> wrote:
I like the idea of piggybacking some analysis on the LLVM IR. However, I have some concerns as well. I am not well versed in the LLVM optimizer, but I do see potential side effects. E.g., what if a static function is inlined into ALL call sites, so the original function can be removed? Then we would no longer be able to get the useful info for that function. It would be unfortunate if the analysis results depended on inlining heuristics; it would make the analyzer even harder to debug and understand.

On Thu, 6 Aug 2020 at 19:20, Artem Dergachev via cfe-dev <[hidden email]> wrote:
Umm, ok!~

Static analysis is commonly run on debug builds, and those are typically unoptimized. It is not common for a project to have a release+asserts build, yet we rely on asserts for analysis, so debug builds are what gets analyzed in practice. If the analyzer effectively ignores debug builds, its usefulness drops a lot.

Sounds like we want to disconnect this new fake codegen from compiler flags entirely. Like, the AST will depend on compiler flags, but we should not be taking -O flags into account at all; instead, pick some default -O2 regardless of flags. Ideally all flags should be ignored by default, to make the experience as consistent as possible.

You'd also have to make sure that running CodeGen doesn't have unwanted side effects such as emitting a .o file.

Would something like that actually work?

And if it would, would this also address the usual concerns about making warnings depend on optimizations? Because, like, optimizations now remain consistent and no longer depend on optimization flags used for actual code generation or interact with code generation; they're now simply another analysis performed on the AST that depends solely on the AST.

On 8/6/20 2:06 AM, Gábor Márton wrote:
> you're "just" generating llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right?
This works differently. We generate the LLVM code for the whole translation unit during parsing. It is the Parser and the Sema that call into the callbacks of the CodeGenerator via the ASTConsumer interface; this is the exact same mechanism that is used for the backend (see the BackendConsumer). We register both the CodeGenerator AST consumer and the AnalysisAstConsumer with the AnalysisAction (we use a MultiplexConsumer). By the time we start the symbolic execution in AnalysisConsumer::HandleTranslationUnit, the CodeGen is already done (since CodeGen is added first to the MultiplexConsumer, its HandleTranslationUnit and other callbacks are called earlier). About caching: the LLVM code is cached, we generate it only once; then during function call evaluation we search for it in the llvm::Module using the mangled name as the key (we don't cache the mangled names now, but we could).
It would be possible to call the callbacks of the CodeGenerator directly, on demand, without registering it with the FrontendAction. Actually, my first attempt was to call HandleTopLevelDecl for a given FunctionDecl on demand, when we needed the LLVM code. However, this turned out to be a false start, for the following reasons: (1) I could not support ObjC/C++, because I could not get all the information that the Sema has when it calls HandleTopLevelDeclInObjCContainer. In fact, I think it is not supported to call these callbacks directly, only indirectly through a registered ASTConsumer, because we may not know how the Parser and the Sema call them. (2) It is not enough to get the LLVM code for a function in isolation. E.g., for the "readonly" attribute we must enable alias analysis on global variables (see GlobalsAAResult), so we must emit LLVM code for global variables too.

> 1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.
> 2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.
We should not call the CodeGen on a merged AST; ASTImporter does not support the ASTConsumer interface. In the case of CTU, I think we should generate the IR for each TU in isolation. And we should probably extend the CrossTranslationUnit interface to return the llvm::Function for a given FunctionDecl. Or we could make this more transparent, and the IRContext in this prototype could be CTU-aware.

> Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? 
There is a dependency we will never be able to get rid of: CodeGen generates lifetime markers only when the optimization level is greater than or equal to 2 (-O2, -O3). These lifetime markers are needed to get precise pureness info out of GlobalsAA.
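For readers unfamiliar with them, the markers in question are the llvm.lifetime intrinsics that delimit when a stack slot holds a live object (hypothetical function; pre-opaque-pointer syntax):

```llvm
define void @f() {
  %buf = alloca [4 x i8]
  %p = bitcast [4 x i8]* %buf to i8*
  call void @llvm.lifetime.start.p0i8(i64 4, i8* %p)
  ; ... %buf is live and may be used here ...
  call void @llvm.lifetime.end.p0i8(i64 4, i8* %p)
  ret void
}

declare void @llvm.lifetime.start.p0i8(i64, i8*)
declare void @llvm.lifetime.end.p0i8(i64, i8*)
```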

> The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?)
Yes, but we need to set the optimization level so that CodeGen generates lifetime markers. Indeed, there are many LLVM analyses that do not change the IR at all and just populate their results, and we could simply use those results in the CSA.
> We should probably not be optimizing the IR at all in the process(?)
Some LLVM passes may invalidate the results of previous analyses, and then we need to rerun those. I am not an expert, but I think that if we run an analysis again after a pass that optimizes (i.e., simplifies) the IR, then our results could be more precise. That is the reason why we see multiple runs of the same analyses when we do optimizations. And perhaps orchestrating this is exactly the job of the PassManager(?).
There are passes that extend the IR (e.g., InferFunctionAttrsPass); strictly speaking we may not need these, but I really don't know how the different analyses use the function attributes.
Maybe we need the IR both in unoptimized form and in optimized form. Also, we may want to have our own CSA-specific pipeline, but using the default -O2 pipeline seems to simplify things.

On Wed, Aug 5, 2020 at 11:22 PM Artem Dergachev <[hidden email]> wrote:
Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?) We should probably not be optimizing the IR at all in the process(?)

On 05.08.2020 12:17, Artem Dergachev wrote:
I'm excited that this is actually moving somewhere!

Let's see what consequences we have here. I have some thoughts, but I don't immediately see any architecturally catastrophic consequences; you're "just" generating an llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right? I'd love to hear more opinions. Here's what I see:

1. We can no longer mutate the AST for analysis purposes without the risk of screwing up subsequent codegen. And the risk would be pretty high because hand-crafting ASTs is extremely difficult. Good thing we aren't actually doing this.
    1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.

2. Ok, yeah, we now may have crashes in CodeGen during analysis. Normally they shouldn't be that bad, because they would mean that CodeGen would crash during normal compilation as well, and that's rare; codegen crashes are much rarer than analyzer crashes. Of course, a difference can be triggered by #ifndef __clang_analyzer__, but that still requires valid code that crashes codegen, so it should be rare.
    2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.

Let's also talk about the benefits. First of all, *we still need the source code available during analysis*. This isn't about peeking into binary dependencies and it doesn't immediately aid CTU in any way; this is entirely about improving upon conservative evaluation on the currently available AST, for functions that are already available for inlining but are not being inlined for whatever reason. In fact, in some cases we may later prefer such LLVM IR-based evaluation to inlining, which may improve analysis performance (i.e., less path explosion) *and* correctness (e.g., avoid unjustified state splits).


Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
I have updated the patch and hopefully addressed all concerns. Now the pipeline contains only those passes that are needed to get the pureness information (GlobalsAA and PostOrderFunctionAttrs), so we have our CSA-specific pipeline now. I also added some unit tests (and changed the lit test) to demonstrate that we can get attributes for static functions.
Inlining is now omitted from our pipeline, but I have a gut feeling that this could result in less precise results for some other LLVM analyses which we might want to run in the future. For now, though, let's keep the pipeline to the minimum; later we may have several pipelines for different needs.
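For reference, a roughly equivalent standalone invocation with opt might look like this (new pass manager syntax; pass names vary across LLVM versions, and this is an approximation of the patch's pipeline, not its actual code):

```
opt -passes='require<globals-aa>,function-attrs' -S input.ll -o annotated.ll
```

Here `require<globals-aa>` forces the GlobalsAA module analysis to be computed, and `function-attrs` runs the post-order function-attribute inference that deduces attributes such as readnone/readonly.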

> Clang might get additional analyses based on the CFG or a new middle level IR.
Whenever we have a middle-level IR, we can build middle-level pipelines with similar changes to the architecture: adding a new ASTConsumer for the middle-level codegen. But this does not seem likely to happen in the foreseeable future, so I'd suggest we focus on the LLVM IR for now.



Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
Artem, John,

How should we proceed with this?

John, you mention in the patch that this is a huge architectural change. Could you please elaborate? Are you concerned about the additional libs being linked into the static analyzer libraries? The clang binary already depends on the LLVM libs, and both CodeGen and the CSA are built into the clang binary. Are you concerned about having a MultiplexConsumer as the ASTConsumer? ... I am open to any suggestions, but I need more input from you.

Many thanks,
Gabor

On Tue, Aug 11, 2020 at 5:49 PM Gábor Márton <[hidden email]> wrote:
I have updated the patch and addressed all concerns hopefully. Now the pipeline contains only those passes that are needed to get the pureness information (GlobalsAA and PostOrderFunctionAttrs). So, we have our CSA specific pipeline now. I also added some unittest (and changed the lit test) to demonstrate that we can get attributes for static functions.
Inlining is now omitted from our pipeline, but I have a gut feeling that this could result in less precise results for some other llvm analyses which we might want to run in the future. But, for now let's keep the pipeline to the minimum, later we may have several pipelines for different needs.

> Clang might get additional analyses based on the CFG or a new middle level IR.
Whenever we'll have middle level IR, then we could build middle level pipelines with similar changes in the architecture: adding a new ASTConsumer for the middle level codegen. But this does not seem to happen in the foreseeable future, so I'd suggest let's focus on the LLVM IR for now.


On Fri, Aug 7, 2020 at 10:35 AM Gábor Horváth <[hidden email]> wrote:
Speaking of the pipeline, I think we should strive for a general architecture.

Basically, the proposal is using the analyses of LLVM IR as an oracle for certain properties of conservatively evaluated functions. In the (possibly far) future, Clang might get additional analyses based on the CFG or a new middle level IR. With the optimal solution it should be possible to replace, add, remove, or maybe even combine oracles easily. I do not insist on large efforts for generalizing as we do not have multiple oracles to verify the approach, but whenever we make a design decision I think this is something that we want to keep in mind.

On Thu, 6 Aug 2020 at 21:42, Gábor Márton <[hidden email]> wrote:
Yes, this is a good point. And a reason to assemble our own CSA specific llvm pipeline to avoid such removal of the static functions. We may want to skip the inliner pass. Or ... I assume there is a module pass that removes the unused static function, so as a better alternative, we could skip that from the pipeline.

On Thu, Aug 6, 2020 at 8:24 PM Gábor Horváth <[hidden email]> wrote:
I like the idea of piggybacking some analysis in the LLVM IR. However, I have some concerns as well. I am not well versed in the LLVM optimizer, but I do see potential side effects. E.g. what if a static function is inlined to ALL call sites, thus the original function can be removed. We will no longer be able to get all the useful info for that function? It would be unfortunate if the analysis result would depend on inlining heuristics. It would make the analyzer even harder to debug or understand.

On Thu, 6 Aug 2020 at 19:20, Artem Dergachev via cfe-dev <[hidden email]> wrote:
Umm, ok!~

Static analysis is commonly run in debug builds and those are typically unoptimized. It is not common for a project to have a release+asserts build but we are relying on asserts for analysis, so debug builds are commonly used for analysis. If your project completely ignores debug builds its usefulness drops a lot.

Sounds like we want to disconnect this new fake codegen from compiler flags entirely. Like, the AST will depend on compiler flags, but we should not be taking -O flags into account at all, but pick some default -O2 regardless of flags; and ideally all flags should be ignored by default, to ensure experience as consistent as possible.

You'd also have to make sure that running CodeGen doesn't have unwanted side effects such as emitting a .o file.

Would something like that actually work?

And if it would, would this also address the usual concerns about making warnings depend on optimizations? Because, like, optimizations now remain consistent and no longer depend on optimization flags used for actual code generation or interact with code generation; they're now simply another analysis performed on the AST that depends solely on the AST.

On 8/6/20 2:06 AM, Gábor Márton wrote:
> you're "just" generating llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right?
This works differently. We generate the llvm code for the whole translation unit during parsing. It is the Parser and the Sema that calls into the callbacks of the CodeGenerator via the ASTConsumer interface. This is the exact same mechanism that is used for the Backend (see the BackendConsumer). We register both the CodeGenerator ast consumer and the AnalysisAstConsumer with the AnalysisAction (we use a MultiplexConsumer). By the time we start the symbolic execution in AnalysisConsumer::HandleTranslationUnit, the CodeGen is already done (since CodeGen is added first to the MultiplexConsumer so its HandleTranslationUnit and other callbacks are called back earlier). About caching, the llvm code is cached, we generate that only once, then during the function call evaluation we search it in the llvm::Module using the mangled name as the key (we don't cache the mangled names now, but we could).
It would be possible to directly call the callbacks of the CodeGenerator on-demand, without registering that to the FrontendAction. Actually, my first attempt was to call HandleTopLevelDecl for a given FunctionDecl on demand when we needed the llvm code. However, this is a false attempt for the following reasons: (1) Could not support ObjC/C++ because I could not get all the information that the Sema has when it calls to HandleTopLevelDeclInObjCContainer. In fact, I think it is not supported to call these callbacks directly, just indirectly through a registered ASTConsumer because we may not know how the Parser and the Sema calls to these. (2) It is not enough to get the llvm code for a function in isolation. E.g., for the "readonly" attribute we must enable alias analysis on global variables (see GlobalsAAResult), so we must emit llvm code for global variables.

> 1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.
> 2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.
We should not call the CodeGen on a merged AST. ASTImporter does not support the ASTConsumer interface. In the case of CTU, I think we should generate the IR for each TU in isolation. And we should probably want to extend the CrossTranslationUnit interface to give back the llvm::Function for a given FunctionDecl. Or we could make this more transparent and the IRContext in this prototype could be CTU aware.

> Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? 
There is a dependency we will never be able to get rid of: CodeGen generates lifetime markers only when the optimization level is greater than or equal to 2 (-O2, -O3). These lifetime markers are needed to get the precise pureness info out of GlobalsAA.

> The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?)
Yes, but we need to set the optimization level so CodeGen generates lifetime markers. Indeed, there are many llvm analyses that simply do not change the IR and just populate their results. And we could simply use the results in CSA.
> We should probably not be optimizing the IR at all in the process(?)
Some llvm passes may invalidate the results of previous analyses, and then we need to rerun those. I am not an expert, but I think if we run an analysis again after a pass that optimizes the IR (i.e., shrinks it) then our results could be more precise. And that is the reason why we see multiple runs of the same analyses when we do optimizations. And perhaps orchestrating this is exactly the job of the PassManager (?).
There are passes that extend the IR (e.g., InferFunctionAttrsPass); strictly speaking we may not need these, but I really don't know how the different analyses use the function attributes.
Maybe we need the IR both in unoptimized form and in optimized form. Also, we may want to have our own CSA specific pipeline, but having the default O2 pipeline seems to simplify things.

On Wed, Aug 5, 2020 at 11:22 PM Artem Dergachev <[hidden email]> wrote:
Just to be clear, we should definitely avoid having our analysis results depend on optimization levels. It should be possible to avoid that, right? The way i imagined this, we're only interested in picking up LLVM analyses, which can be run over unoptimized IR just fine(?) We should probably not be optimizing the IR at all in the process(?)

On 05.08.2020 12:17, Artem Dergachev wrote:
I'm excited that this is actually moving somewhere!

Let's see what consequences we have here. I have some thoughts but i don't immediately see any architecturally catastrophic consequences; you're "just" generating an llvm::Function for a given AST FunctionDecl "real quick" and looking at the attributes. This is happening on-demand and cached, right? I'd love to hear more opinions. Here's what i see:

1. We can no longer mutate the AST for analysis purposes without the risk of screwing up subsequent codegen. And the risk would be pretty high because hand-crafting ASTs is extremely difficult. Good thing we aren't actually doing this.
    1.1. But it sounds like for the CTU users it may amplify the imperfections of ASTImporter.

2. Ok, yeah, we now may have crashes in CodeGen during analysis. Normally they shouldn't be that bad, because that would mean CodeGen would crash during normal compilation as well. And that's rare; codegen crashes are much rarer than analyzer crashes. Of course a difference can be triggered by #ifndef __clang_analyzer__, but the crash would still be proof of valid code that crashes codegen, so that should be rare.
    2.1. Again, it's worse with CTU because imported ASTs have so far never been tested for compatibility with CodeGen.

Let's also talk about the benefits. First of all, *we still need the source code available during analysis*. This isn't about peeking into binary dependencies and it doesn't immediately aid CTU in any way; this is entirely about improving upon conservative evaluation on the currently available AST, for functions that are already available for inlining but are not being inlined for whatever reason. In fact, in some cases we may later prefer such LLVM IR-based evaluation to inlining, which may improve analysis performance (i.e., less path explosion) *and* correctness (e.g., avoid unjustified state splits).

On 05.08.2020 08:29, Gábor Márton via cfe-dev wrote:
Hi,

I have been working on a prototype that makes it possible to access the IR from the components of the Clang Static Analyzer.

There are many important and useful analyses in the LLVM layer that we can use during the path sensitive analysis. Most notably, the "readnone" and "readonly" function attributes (https://llvm.org/docs/LangRef.html) which can be used to identify "pure" functions (those without side effects). In the prototype I am using the pureness info from the IR to avoid invalidation of any variables during conservative evaluation (when we evaluate a pure function). There are cases when we get false positives exactly because of the too conservative invalidation.

Some further ideas to use info from the IR:
- We should invalidate only the arg regions for functions with "argmemonly" attribute.
- Use the smarter invalidation in cross translation unit analysis too. We can get the IR for the other TUs as well.
- Run the Attributor passes on the IR. We could get range values for return values or for arguments. These range values then could be fed to StdLibraryFunctionsChecker to make the proper assumptions. And we could do this in CTU mode too, these attributes could form some sort of a summary of these functions. Note that I don't expect a meaningful summary for more than a few percent of all the available functions.

Please let me know if you have any further ideas about how we could use IR attributes (or anything else) during the symbolic execution.

There are some concerns as well. There may be some source code that we cannot CodeGen, but we can still analyse with the current CSA. That is why I suppress CodeGen diagnostics in the prototype. But in the worst case we may run into assertions in the CodeGen and this may cause regression in the whole analysis experience. This may be the case especially when we get a compile_commands.json from a project that is compiled only with e.g. GCC.

Thanks,
Gabor

 

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev





Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
On 13 Aug 2020, at 10:15, Gábor Márton wrote:

> Artem, John,
>
> How should we proceed with this?
>
> John, you mention in the patch that this is a huge architectural
> change.
> Could you please elaborate? Are you concerned about the additional
> libs
> that are being linked to the static analyzer libraries? The clang
> binary is
> already dependent on LLVM libs and on the CodeGen and CSA is builtin
> to the
> clang binary. Are you concerned about having a MultiplexConsumer as an
> ASTConsumer? ... I am open to any suggestions, but I need more input
> from you.

Well, it’s adding a major new dependency to the static analyzer and a
major new client to IRGen.  In both cases, the dependency/client happens
to be another part of Clang, but still, it seems like a huge deal for
static analysis to start depending on potentially arbitrary details of
code generation and LLVM optimization.  It’s also pretty expensive.  
Is this really the most reasonable way to get the information you want?

John.

Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
John, thank you for your reply. 

> Is this really the most reasonable way to get the information you want?
Here is a list of information we would like to have access to (this is non-comprehensive, Artem could probably extend it) :
1) Is a function pure? 
2) Does a function read/write only the memory pointed to by its arguments?
3) Does a callee make any copies of the pointer argument that outlive the callee itself?
4) Value ranges.
5) Is a loop dead?
6) Is a parameter or return pointer dereferenceable?

How could we use this information?
With 1-3 we could make the analysis more precise by improving the over-approximation done by invalidation during conservative evaluation.
Using the info from 1-4 we could create "summaries" for functions and skip the inlining-based evaluation of them. This would be really beneficial in the case of cross-translation-unit analysis, where the inlining stack can grow really deep.
With 5, we could skip the analysis of dead loops and thus could spare the budget for the symbolic execution in CSA.
By using 6, we could eliminate some false-positive reports, thereby improving correctness.

Some of the analyses that provide the needed information can be implemented properly only by using the SSA form. For example, value range propagation. We could do our own way of lowering to SSA, or our own implementation of alias analysis for the pureness info, but that would be repeating the work that had already been done and well tested in LLVM.

> It’s also pretty expensive.
I completely agree that we should not pay for those optimization passes whose results we cannot use in the CSA. In the first version of the patch I used the whole O2 pipeline, but lately I updated it to use only those passes that are needed to get the pureness information (GlobalsAA and PostOrderFunctionAttrs).
Also, static analysis is generally considered to be slower than compilation, even compilation with optimizations enabled. We even advertise this on our official webpage (here). And this extension will never be more expensive than a regular O2/O3 compilation. So this implies that a 2-4x slowdown of CSA compared to an O2 compilation could become a 3-5x slowdown. In CTU mode the analysis is even slower currently, so the additional CodeGen would be less noticeable. The slowdown may not be affordable for some clients, so users must explicitly request CodeGen in CSA via a command-line switch. I plan to provide precise results on open-source projects to measure the slowdown. On top of that, it would be interesting to see for what fraction of all functions (all call sites, all loops) we can get the desired information.

Gabor.



On Fri, Aug 14, 2020 at 6:46 AM John McCall <[hidden email]> wrote:

Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
Technically all these analyses can be conducted in source-based manner.

And as John says, that'd have the advantage of being more predictable; we'd no longer have to investigate sudden changes in analysis results that are in fact caused by backend changes. We already are quite unpredictable with our analysis, with arbitrary things affecting arbitrary things, and that's ok, but it doesn't mean we should make it worse. In particular i'm worried for people who treat analyzer warnings as errors in their builds; for them any update in the compiler would now cause their build to fail, even if we didn't change anything in the static analyzer. (Well, for the same reason i generally don't recommend treating analyzer warnings as errors).

So i believe that implementing as many of these analyses over the Clang CFG (or in many cases it might be over the AST as well) would be beneficial and should be done regardless of this experiment. Gabor, how much did you try that? Because i believe you should try that and compare the results, at least for some analyses that are easy to implement.

The reason why the use of LLVM IR in the static analyzer gets really interesting is that there is already a huge number of analyses implemented over it, and getting access to them "for free" (in terms of implementation cost) is fairly tempting. I think that's the only real reason; it's a pretty big reason though, because the amount of work we're saving ourselves this way may be pretty large if we put a lot of those analyses to good use.

On 14.08.2020 04:19, Gábor Márton wrote:

Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev

On Sun, 16 Aug 2020 at 21:57, Artem Dergachev <[hidden email]> wrote:

So i believe that implementing as many of these analyses over the Clang CFG (or in many cases it might be over the AST as well) would be beneficial and should be done regardless of this experiment.

While I do agree that this would be awesome, I think many of those analyses are quite painful to implement on our current CFG compared to an already lowered representation like the LLVM IR which can be canonicalized and there are fewer corner cases and peculiarities to handle compared to the C++ language. Having the option to derive certain information from a representation that is easier to work with for some purposes might be useful for future analyses as well, not only for leveraging currently implemented analyses. Having a proper Clang IR could of course void this argument.


Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
> And as John says, that'd have the advantage of being more predictable; we'd no longer have to investigate sudden changes in analysis results that are in fact caused by backend changes.
I believe that all individual LLVM passes are implemented in a way that we can reuse them in any exotic pipeline. Of course there are dependencies between the passes, but besides that I don't think that Clang backend changes should matter that much. Otherwise, custom pipelines would be a nightmare to maintain.

> In particular i'm worried for people who treat analyzer warnings as errors in their builds; for them any update in the compiler would now cause their build to fail
Well, we could protect them by swallowing all the diags from the CodeGen part. And if CodeGen has diags then we could omit the IR.

> So i believe that implementing as many of these analyses over the Clang CFG (or in many cases it might be over the AST as well) would be beneficial and should be done regardless of this experiment. Gabor, how much did you try that? Because i believe you should try that and compare the results, at least for some analyses that are easy to implement.
Yeah, I agree that it is worth trying to implement at least the simplest ones in the Clang CFG. Thus we would see if anything is missing from our infra in the CSA and we could compare the results and their performance. I am thinking about starting with the pureness info, that involves implementing GlobalsModRef over the Clang CFG.

> The reason why the use of LLVM IR in the static analyzer gets really interesting is because there are already a huge lot of analyses already implemented over it and getting access to them "for free" (in terms of implementation cost) is fairly tempting. I think that's the only real reason;
There is another reason (as G. Horvath mentions as well): many of the analyses are quite painful to implement on our current CFG compared to an already lowered representation like the LLVM IR. However, I agree that maybe it should not be the LLVM IR that we need to lower. There is a desire/attempt to use MLIR in Clang. I can't wait to hear the presentation about CIL (Common MLIR Dialect for C/C++ and Fortran) in the upcoming LLVM dev meeting, it would be great to know the status. 
Still, I think it could take years until we can have a proper Clang Intermediate Language incorporated into the Clang CFG. Contrary to this, we could immediately start to use already implemented analyses on top of the LLVM IR.

Gabor


On Mon, Aug 17, 2020 at 1:31 PM Gábor Horváth <[hidden email]> wrote:


Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
Continuing with this topic, I have spent some time investigating the GlobalsModRef analysis in detail. Below are my findings about what is currently missing from our infrastructure to implement the 'pureness' analysis effectively:
1) The Clang CFG does not handle strongly connected components (SCCs). GlobalsModRef should do a bottom-up SCC traversal of the call graph: first we must analyze the leaf SCCs, and only then can we propagate the data up the call graph. Functions that mutually call each other (i.e., form an SCC in the graph) must be handled in one step.
2) We cannot effectively get the "use information" from the Clang AST. During the mod/ref analysis we should look at all the uses of a given global variable (see e.g. GlobalsAAResult::AnalyzeUsesOfPointer). In LLVM, every `Value` has a "use list" that keeps track of which other Values are using it.

Solving 1) seems quite straightforward. We could implement the Kosaraju-Sharir algorithm on the CFG: compute a DFS finish order, then run a standard DFS over the reverse CFG in decreasing finish order.

2) seems a bit more complex. 
- One idea could be to do an AST visitation and build up the "user info" during that.
- Another, more efficient approach could be to register a custom ASTMutationListener and override the DeclarationMarkedUsed method. This has the advantage that we don't have to traverse the whole AST to gather the "user" info; rather, we can build it up during parsing. The drawback is that this could be a serious architectural change.

Artem, Gabor, 
What do you think, and what do you suggest: is it worth going ahead and implementing SCC support for the Clang CFG? Regarding 2), is it worth putting much effort into implementing that, and which method would you prefer?
Perhaps implementing all these things would be wasted effort if we later realize that we do want to go forward with CIL (Clang Intermediate Language). Should we take a few steps back and rather put (huge) effort into CIL, what do you think? In my opinion, CIL could take years to become usable, so it does seem worth starting to implement the missing parts for the Clang CFG and AST.

Thanks,
Gabor

On Tue, Aug 25, 2020 at 9:37 AM Gábor Márton <[hidden email]> wrote:
> And as John says, that'd have the advantage of being more predictable; we'd no longer have to investigate sudden changes in analysis results that are in fact caused by backend changes.
I believe that all individual LLVM passes are implemented in a way that we can reuse them in any exotic pipeline. Of course there are dependencies between the passes, but besides that I don't think that Clang backend changes should matter that much. Otherwise, custom pipelines would be a nightmare to maintain.

> In particular i'm worried for people who treat analyzer warnings as errors in their builds; for them any update in the compiler would now cause their build to fail
Well, we could protect them by swallowing all the diags from the CodeGen part. And if CodeGen has diags then we could omit the IR.

> So i believe that implementing as many of these analyses over the Clang CFG (or in many cases it might be over the AST as well) would be beneficial and should be done regardless of this experiment. Gabor, how much did you try that? Because i believe you should try that and compare the results, at least for some analyses that are easy to implement.
Yeah, I agree that it is worth trying to implement at least the simplest ones in the Clang CFG. Thus we would see if anything is missing from our infra in the CSA and we could compare the results and their performance. I am thinking about starting with the pureness info, that involves implementing GlobalsModRef over the Clang CFG.

> The reason why the use of LLVM IR in the static analyzer gets really interesting is because there are already a huge lot of analyses already implemented over it and getting access to them "for free" (in terms of implementation cost) is fairly tempting. I think that's the only real reason;
There is another reason (as G. Horvath mentions as well): many of the analyses are quite painful to implement on our current CFG compared to an already lowered representation like the LLVM IR. However, I agree that maybe it should not be the LLVM IR that we need to lower. There is a desire/attempt to use MLIR in Clang. I can't wait to hear the presentation about CIL (Common MLIR Dialect for C/C++ and Fortran) in the upcoming LLVM dev meeting, it would be great to know the status. 
Still, I think it could take years until we can have a proper Clang Intermediate Language incorporated into the Clang CFG. Contrary to this, we could immediately start to use already implemented analyses on top of the LLVM IR.

Gabor


On Mon, Aug 17, 2020 at 1:31 PM Gábor Horváth <[hidden email]> wrote:

On Sun, 16 Aug 2020 at 21:57, Artem Dergachev <[hidden email]> wrote:

So i believe that implementing as many of these analyses over the Clang CFG (or in many cases it might be over the AST as well) would be beneficial and should be done regardless of this experiment.

While I do agree that this would be awesome, I think many of those analyses are quite painful to implement on our current CFG compared to an already lowered representation like the LLVM IR which can be canonicalized and there are fewer corner cases and peculiarities to handle compared to the C++ language. Having the option to derive certain information from a representation that is easier to work with for some purposes might be useful for future analyses as well, not only for leveraging currently implemented analyses. Having a proper Clang IR could of course void this argument.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [analyzer][RFC] Get info from the LLVM IR for precision

Hollman, Daisy Sophia via cfe-dev
When you read through my previous email, please replace all occurrences of "CFG" with "CG". I wanted to refer to the CallGraph (CG) and not to the ControlFlowGraph (CFG).

On Wed, Sep 9, 2020 at 5:01 PM Gábor Márton <[hidden email]> wrote:
Continuing with this topic, I have spent some time and investigated the GlobalsModRef analysis in detail. Below are my findings about what is missing currently from our infrastructure to effectively implement the 'pureness' analysis:
1) Clang CFG does not handle strongly connected components (SCC). Globals modref should do a bottom-up SCC traversal of the call graph. First we must analyze the leaf SCCs and then can we populate up the data in the call graph. We shall handle functions that strictly call each other (they form an SCC in the graph) in one step.
2) We cannot effectively get the "use information" from the Clang AST. During the mod/ref analysis we should take a look at all the uses of one global variable (see for e.g. GlobalsAAResult::AnalyzeUsesOfPointer). In LLVM Every `Value` has a "use list" that keeps track of which other Values are using the Value.

Solving 1) seems quite straightforward. We could implement the Kosaraju-Sharir algorithm on the CFG: have a topological sort on the reverse CFG and then run a standard DFS.

2) seems a bit more complex. 
- One idea could be to do an AST visitation and build up the "user info" during that.
- Another, more effective approach could be to register a custom ASTMutationListener and then overwrite the DeclarationMarkedUsed method. This method has the advantage that we don't have to traverse the whole AST to gather the "user" info, rather we can build that up during the parsing. The detrimental effects are that this could be a serious architectural change.

Artem, Gabor, 
What do you think, what do you suggest, is it worth going on and implementing SCCs into the Clang CFG? Regarding 2) is it worth putting much effort implementing that, which method would you prefer?
Perhaps implementing all these things would be a wasted effort, once we realize that we do want to go forward with CIL (Clang Intermediate Language). Should we take a few steps back and rather put (huge) efforts into CIL, what do you think? In my opinion, CIL could take years until usable, so it does seem as if it is worth to start implementing the missing parts for the Clang CFG and AST.

Thanks,
Gabor

On Tue, Aug 25, 2020 at 9:37 AM Gábor Márton <[hidden email]> wrote:
> And as John says, that'd have the advantage of being more predictable; we'd no longer have to investigate sudden changes in analysis results that are in fact caused by backend changes.
I believe that all individual LLVM passes are implemented in a way that we can reuse them in any exotic pipeline. Of course there are dependencies between the passes, but besides that I don't think that Clang backend changes should matter that much. Otherwise, custom pipelines would be a nightmare to maintain.

> In particular i'm worried for people who treat analyzer warnings as errors in their builds; for them any update in the compiler would now cause their build to fail
Well, we could protect them by swallowing all diagnostics from the CodeGen part, and if CodeGen produces any diagnostics we could simply omit the IR.

> So i believe that implementing as many of these analyses over the Clang CFG (or in many cases it might be over the AST as well) would be beneficial and should be done regardless of this experiment. Gabor, how much did you try that? Because i believe you should try that and compare the results, at least for some analyses that are easy to implement.
Yeah, I agree that it is worth trying to implement at least the simplest ones in the Clang CFG. That way we would see whether anything is missing from our infrastructure in the CSA, and we could compare both the results and their performance. I am thinking about starting with the pureness info, which involves implementing something like GlobalsModRef over the Clang CFG.

> The reason why the use of LLVM IR in the static analyzer gets really interesting is because there are already a huge lot of analyses already implemented over it and getting access to them "for free" (in terms of implementation cost) is fairly tempting. I think that's the only real reason;
There is another reason (as G. Horvath mentions as well): many of the analyses are quite painful to implement on our current CFG compared to an already lowered representation like LLVM IR. However, I agree that maybe LLVM IR is not the representation we should lower to. There is a desire/attempt to use MLIR in Clang. I can't wait to hear the presentation about CIL (Common MLIR Dialect for C/C++ and Fortran) at the upcoming LLVM dev meeting; it would be great to know its status.
Still, I think it could take years until we can have a proper Clang Intermediate Language incorporated into the Clang CFG. Contrary to this, we could immediately start to use already implemented analyses on top of the LLVM IR.

Gabor


On Mon, Aug 17, 2020 at 1:31 PM Gábor Horváth <[hidden email]> wrote:

On Sun, 16 Aug 2020 at 21:57, Artem Dergachev <[hidden email]> wrote:

So i believe that implementing as many of these analyses over the Clang CFG (or in many cases it might be over the AST as well) would be beneficial and should be done regardless of this experiment.

While I do agree that this would be awesome, I think many of those analyses are quite painful to implement on our current CFG compared to an already lowered representation like LLVM IR, which can be canonicalized and has far fewer corner cases and peculiarities to handle than the C++ language itself. Having the option to derive certain information from a representation that is easier to work with for some purposes might be useful for future analyses as well, not only for leveraging currently implemented analyses. Having a proper Clang IR could of course void this argument.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [analyzer][RFC] Get info from the LLVM IR for precision

Disclaimer: I haven't read the entire thread.

On 8/14/20 6:19 AM, Gábor Márton via cfe-dev wrote:

>> Is this really the most reasonable way to get the information you want?
> Here is a list of information we would like to have access to (this is
> non-comprehensive, Artem could probably extend it) :
> 1) Is a function pure?
> 2) Does a function read/write only the memory pointed to by its arguments?
> 3) Does a callee make any copies of the pointer argument that outlive the
> callee itself?
> 4) Value ranges.
> 5) Is a loop dead?
> 6) Is a parameter or return pointer dereferenceable?

FWIW, if you run the Attributor you get 1), 2), 3), 4), and 6).
We are working on 5), and there are *a lot* of other things you
could get out of the results.
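As an illustration, here is a hand-written sketch of what such inferred attributes look like in the IR. The exact attribute names and placement vary across LLVM versions (newer releases spell memory effects differently, e.g. `memory(none)` instead of `readnone`), so treat this as indicative rather than exact output:

```llvm
; A trivially pure function. Running the attribute-inference passes, e.g.
;   opt -passes=attributor -S input.ll
; can annotate it roughly along these lines:
define i32 @square(i32 %x) #0 {
entry:
  %r = mul nsw i32 %x, %x
  ret i32 %r
}

; Inferred: no memory access, no unwinding, guaranteed to return --
; exactly the "pureness" information the analyzer wants for 1).
attributes #0 = { nofree nosync nounwind readnone willreturn }
```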

We could introduce a mode in which we don't delete IR, so the mapping
from AST to IR remains intact. Right now the Attributor would, for
example, delete internal functions that turn out to be unnecessary.

Let me know if this sounds interesting and we can discuss it further :)

~ Johannes
