[analyzer][tooling] Architectural questions about Clang and ClangTooling


[analyzer][tooling] Architectural questions about Clang and ClangTooling

Endre Fülöp via cfe-dev

Hi!

 

Question:

Why is a dependency on ClangTooling considered ill-advised inside ClangSA (and therefore in the Clang binary itself)?

 

Context:

Currently I am working on an alternative way to import external TU ASTs during analysis (https://reviews.llvm.org/D75665).

In order to produce ASTs, I use a compilation database to extract the necessary flags, and finally use ClangTool::buildAST.

I am aware that I have other options for this as well (such as hand-coding the compdb handling for my specific case as a first step, or even dumping ASTs as PCHs into an in-memory buffer), but consuming a JSONCompilationDatabase is simply too convenient. I would not want to introduce another format when the compilation database is just right for this purpose.
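For readers unfamiliar with the format, the flag-extraction step amounts to looking up the file's entry in compile_commands.json. Here is a minimal Python sketch of that lookup; it is a simplification of what JSONCompilationDatabase actually does, and the naive shell-splitting is for illustration only:

```python
import json

def compile_command_for(compdb_path, source_file):
    """Return the argv recorded for source_file in a compile_commands.json,
    or None if the file has no entry. Simplified, happy-path lookup only."""
    with open(compdb_path) as f:
        entries = json.load(f)
    for entry in entries:
        if entry["file"] == source_file:
            # An entry carries either an already-split "arguments" list or
            # a shell-quoted "command" string; prefer the structured form.
            if "arguments" in entry:
                return entry["arguments"]
            # Naive split; the real implementation performs proper shell
            # un-quoting and resolves paths against entry["directory"].
            return entry["command"].split()
    return None
```

The real class additionally handles relative paths via the entry's "directory" field and multiple entries per file, which is exactly where the complications discussed later in this thread begin.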

 

Elaboration:

While I understand that introducing dependencies has its downsides, not being able to reuse code from Tooling is not ideal either.

I would very much like to hear from someone more familiar with the architectural decisions already made as to why this is the case, and optionally how I could proceed with my efforts so that I end up with the most fitting solution, i.e. not a hack.

 

Thanks,

Endre Fülöp


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [analyzer][tooling] Architectural questions about Clang and ClangTooling

Artem Dergachev via cfe-dev
Hey!

1. I'm glad that we're finally trying to avoid dumping PCHs on disk!

2. As far as I understand, dependencies are mostly about Clang binary
size. I don't know for sure but that's what I had to consider when I was
adding libASTMatchers into the Clang binary a few years ago.

3. I strongly disagree that the JSON compilation database is "just right
for this purpose". I don't mind having explicit, improved support for it,
but I would definitely prefer not to hardcode it as the only possible
option. Compilation databases are very limited, and we cannot drop
projects or entire build systems simply because they can't be
represented accurately via a compilation database. So I believe that
this is not the right solution for CTU in particular. Instead, an
external tool like scan-build should guide CTU analysis and
coordinate the work of different Clang instances, so as to abstract
Clang away from the build system.

On 4/28/20 11:31 AM, Endre Fülöp via cfe-dev wrote:
> [...]

Re: [analyzer][tooling] Architectural questions about Clang and ClangTooling

David Blaikie via cfe-dev
On Tue, Apr 28, 2020 at 3:09 AM Artem Dergachev via cfe-dev <[hidden email]> wrote:
> Hey!
>
> 1. I'm glad that we're finally trying to avoid dumping PCH-s on disk!
>
> 2. As far as I understand, dependencies are mostly about Clang binary
> size. I don't know for sure but that's what I had to consider when I was
> adding libASTMatchers into the Clang binary a few years ago.
>
> 3. I strongly disagree that JSON compilation database is "just right for
> this purpose". I don't mind having explicit improved support for it but
> I would definitely prefer not to hardcode it as the only possible
> option. Compilation databases are very limited and we cannot drop
> projects or entire build systems simply because they can't be
> represented accurately via a compilation database. So I believe that
> this is not the right solution for CTU in particular. Instead, an
> external tool like scan-build should be guiding CTU analysis and
> coordinate the work of different Clang instances so that to abstract
> Clang away from the build system.

What functionality do you picture the scan-build-like tool having that couldn't be supported if that tool instead built a compilation database and the CTU/CSA was powered by the database? (That would separate concerns, build command discovery from execution, and make the scan-build-like tool more general-purpose, rather than specific only to the CSA.)
 

On 4/28/20 11:31 AM, Endre Fülöp via cfe-dev wrote:
> [...]

Re: [analyzer][tooling] Architectural questions about Clang and ClangTooling

Artem Dergachev via cfe-dev
On 4/28/20 6:23 PM, David Blaikie wrote:

> On Tue, Apr 28, 2020 at 3:09 AM Artem Dergachev via cfe-dev
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     [...]
>
>
> What functionality do you picture the scan-build-like tool having that
> couldn't be supported if that tool instead built a compilation
> database & the CTU/CSA was powered by the database? (that would
> separate concerns: build command discovery from execution, and make
> scan-build-like tool more general purpose, rather than specific only
> to the CSA)

Here are a few examples (please let me know if I'm unaware of the latest
developments in the area of compilation databases!):

- Suppose the project uses precompiled headers. In order to analyze a
file that includes a PCH, we need to first rebuild the PCH with the
clang that's used for analysis, and only then try to analyze the file.
This introduces a notion of dependency between compilation database
entries; unless entries are ordered in their original compilation order
and we're analyzing with -j1, race conditions will inevitably cause us
to occasionally fail to find the PCH. I didn't try to figure out what
happens when modules are used, but I suspect it's worse. But if analysis
is conducted alongside compilation, and the build system waits for the
analysis to finish like it waits for compilation to finish before
compiling dependent translation units, race conditions are eliminated.
This is how scan-build currently works: it substitutes the compiler with
a fake compiler that invokes both the original compiler and clang for
analysis. Of course, cross-translation-unit analysis won't be conducted
in parallel with compilation; it's multi-pass by design. The problem is
the same though: it should compile PCH files first, but there's no notion
of "compile this first" in an unstructured compilation database.

- Suppose the project builds the same translation unit multiple times,
say with different flags, say for different architectures. When we try
to look up such a file in the compilation database, how do we figure
out which instance to take? If we are to ever solve this problem, we
have to introduce a notion of a "shipped binary" (an ultimate linking
target) in the compilation database and perform cross-translation-unit
analysis of one shipped binary at a time.
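The ambiguity is easy to demonstrate in code: a by-file lookup over the database is inherently one-to-many, and nothing in the format says which entry a CTU import should pick (the flags below are made up for illustration):

```python
def commands_for(entries, source_file):
    """All compile commands recorded for one file. With per-architecture
    builds this legitimately returns several entries, and the format
    offers no way to choose among them."""
    return [e for e in entries if e["file"] == source_file]

entries = [
    {"file": "norm.c", "arguments": ["clang", "-DARCH_ARM", "-c", "norm.c"]},
    {"file": "norm.c", "arguments": ["clang", "-DARCH_X86", "-c", "norm.c"]},
]
# Two candidate commands for the same file; which one does CTU take?
candidates = commands_for(entries, "norm.c")
```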

- There are a variety of hacks that people can introduce in their
projects if they add arbitrary scripts to their build system. For
instance, they can mutate the contents of an autogenerated header in the
middle of the build. We can always say "Well, you shouldn't do that",
but people will do that anyway. This makes me believe that no purely
declarative compilation database format will ever be able to handle such
Turing-complete hacks, and that there is no way to integrate analysis
into the build perfectly other than by letting the build system guide
the analysis.

I'm also all for separation of concerns and I don't think any of this is
specific to our static analysis.
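The compiler-substitution trick described earlier in this message can be sketched like this; the command names are placeholders, not scan-build's actual implementation:

```python
import subprocess

def fake_compiler(real_compiler, analyzer, argv):
    """Stand-in the build system invokes instead of the compiler: run
    the real compile first (so the build proceeds and dependent TUs
    wait on it), then run the analyzer with the same flags. scan-build's
    real wrapper additionally injects its analysis flags at this point."""
    rc = subprocess.call([real_compiler] + argv)
    if rc == 0:
        # Analysis happens at exactly the point in the build where the
        # real compile happened, so all generated inputs already exist.
        subprocess.call([analyzer] + argv)
    return rc  # the build only ever sees the real compiler's exit code
```

Because the wrapper returns the real compiler's exit code, a failing analysis never breaks the build, while a failing compile skips the analysis.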

>     On 4/28/20 11:31 AM, Endre Fülöp via cfe-dev wrote:
>     > [...]

Re: [analyzer][tooling] Architectural questions about Clang and ClangTooling

David Blaikie via cfe-dev
(+Sam, who works on clang tooling)

On Tue, Apr 28, 2020 at 10:38 AM Artem Dergachev <[hidden email]> wrote:
> [...]
>
> Here are a few examples (please let me know if I'm unaware of the latest
> developments in the area of compilation databases!)
>
> - Suppose the project uses precompiled headers. In order to analyze a
> file that includes a pch, we need to first rebuild the pch with the
> clang that's used for analysis, and only then try to analyze the file.
> This introduces a notion of dependency between compilation database
> entries; unless entries are ordered in their original compilation order
> and we're analyzing with -j1, race conditions will inevitably cause us
> to occasionally fail to find the pch. I didn't try to figure out what
> happens when modules are used, but I suspect it's worse.

Google certainly uses clang tooling, with a custom compilation database, on a build that uses explicit modules - I believe the way that's done is to ignore/strip the modules-related flags, so the clang tooling uses a non-modules compilation. But I could be wrong there. You could do some analysis to see the inputs/outputs - or reuse the existing outputs from the original build.
 
> But if analysis
> is conducted alongside compilation and the build system waits for the
> analysis to finish like it waits for compilation to finish before
> compiling dependent translation units, race conditions are eliminated.
> This is how scan-build currently works: it substitutes the compiler with
> a fake compiler that both invokes the original compiler and clang for
> analysis. Of course, cross-translation-unit analysis won't be conducted
> in parallel with compilation; it's multi-pass by design. The problem is
> the same though: it should compile pch files first but there's no notion
> of "compile this first" in an unstructured compilation database.

> - Suppose the project builds the same translation unit multiple times,
> say with different flags, say for different architectures. When we're
> trying to lookup such file in the compilation database, how do we figure
> out which instance do we take? If we are to ever solve this problem, we
> have to introduce a notion of a "shipped binary" (an ultimate linking
> target) in the compilation database and perform cross-translation-unit
> analysis of one shipped binary at a time.

I believe in that case the compilation database would include both compilations of the file - and presumably the static analyzer would want to build all of them (or scan-build would have to have some logic for filtering them out/deciding which one is the interesting one; the same sort of thing would have to be done on the compilation database).
 
> - There is a variety of hacks that people can introduce in their
> projects if they add arbitrary scripts to their build system. For
> instance, they can mutate contents of an autogenerated header in the
> middle of the build. We can always say "Well, you shouldn't do that",
> but people will do that anyway. This makes me believe that no purely
> declarative compilation database format will ever be able to handle such
> Turing-complete hacks and there's no other way to integrate analysis
> into build perfectly other than by letting the build system guide the
> analysis.

Yep: with a mutating build where you need to observe the state before/after such mutations, etc., there's not much you could do about it. (& how would CTU SA work in that sort of case? You'd have to run the whole build multiple times?)

Clang tools are essentially static analysis, so it seems weird that we have two different approaches to static analysis discovery/lookup in the Clang project - but it's not wholly unacceptable; there are potentially different goals, etc.
 
> I'm also all for separation of concerns and I don't think any of this is
> specific to our static analysis.

>     On 4/28/20 11:31 AM, Endre Fülöp via cfe-dev wrote:
>     > [...]

Re: [analyzer][tooling] Architectural questions about Clang and ClangTooling

Sam McCall via cfe-dev
On Tue, Apr 28, 2020 at 11:53 PM David Blaikie <[hidden email]> wrote:
> (+Sam, who works on clang tooling)

> On Tue, Apr 28, 2020 at 10:38 AM Artem Dergachev <[hidden email]> wrote:
> > [...]
> > I didn't try to figure out what happens when modules are used, but I
> > suspect it's worse.
>
> Google certainly uses clang tooling, with a custom compilation database on a build that uses explicit modules - I believe the way that's done is to ignore/strip the modules-related flags so the clang tooling uses non-modules related compilation.
Yeah. In fact we put the build system into a mode where it generates compile commands for a non-modules build; I'm not sure it's always easy to do this given only a modular compile command.

I suspect that to make C++20 modular builds work with tools (and we must), we'll need an extension or peer to compile_commands.json that build systems can use to define the modules in a project. That will at least give tools what they need to maintain a module cache.
 
> But I could be wrong there. You could do some analysis to see any inputs/outputs - or reusing the existing outputs in the original build.
The problem with reusing existing outputs (at least serialized ASTs like PCH or modules) is that there's no stable format, so you need to version-lock your tool to your compiler (which must be clang!) to ensure compatibility.
I think this is what Artem was alluding to by "with the clang that's used for analysis".
 
> But if analysis
> is conducted alongside compilation and the build system waits for the
> analysis to finish like it waits for compilation to finish before
> compiling dependent translation units, race conditions are eliminated.
> This is how scan-build currently works: it substitutes the compiler with
> a fake compiler that both invokes the original compiler and clang for
> analysis. Of course, cross-translation-unit analysis won't be conducted
> in parallel with compilation; it's multi-pass by design. The problem is
> the same though: it should compile pch files first but there's no notion
> of "compile this first" in an unstructured compilation database.
FWIW our internal database plugin (not compile_commands.json) prepares the inputs for a target when its compile command is queried. This works as long as your build system doesn't delete files or give them different content based on what you're building.
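That "prepare the inputs when the command is queried" behaviour can be sketched as a thin wrapper over any lookup function. The interface below is entirely hypothetical, loosely modelled on the shape of clang::tooling::CompilationDatabase, not the internal plugin being described:

```python
class PreparingDatabase:
    """Compilation-database wrapper that runs a prepare step (e.g.
    materializing generated headers for the target) before handing out
    a compile command, so inputs exist by the time a tool parses."""

    def __init__(self, lookup, prepare):
        self._lookup = lookup    # source file -> compile command (a list)
        self._prepare = prepare  # source file -> None; builds its inputs
        self.prepared = []       # record of what we prepared, for callers

    def get_compile_command(self, source_file):
        self._prepare(source_file)  # inputs exist before we hand out the command
        self.prepared.append(source_file)
        return self._lookup(source_file)
```

As noted, this only works while the build system neither deletes files nor varies their content depending on the target being built.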

> [...]

Re: [analyzer][tooling] Architectural questions about Clang and ClangTooling

Fangrui Song via cfe-dev
For point 2 of Artem, I can chime in with the first-ever assignment I had when I started working on our team: it was to make the "jump to definition" logic be guided by the link graph. The exec() logger tool that CodeChecker uses to create the compilation database logs linker invocations (or at least did so at some point in time, and at least on a Linux system), and I succeeded in implementing an understanding of this in our sister project (made by the same people, though), the CodeCompass code navigator, with pretty good results, actually.

Unfortunately, putting similar logic into the CTU resolution still hasn't succeeded. Neither has extending the JSON CDB to officially support linker commands, nor making CMake emit, and/or libbear (for scan-build) log, linker commands from the build.

To ensure symbols are resolved unambiguously for CTU (or for ASTImporter in general), we just... need the build system's support in some way (I tried to put Modules into CMake last year, but it was really problematic, because CMake is also running layers upon layers of subprocesses with no full sight, etc.), or a total understanding of the project at hand, so that we know which files of "flag set A" and "flag set B" are actually used together.

Artem Dergachev via cfe-dev <[hidden email]> wrote (on Tue, 28 Apr 2020, 19:38):
On 4/28/20 6:23 PM, David Blaikie wrote:
> On Tue, Apr 28, 2020 at 3:09 AM Artem Dergachev via cfe-dev
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Hey!
>
>     1. I'm glad that we're finally trying to avoid dumping PCH-s on disk!
>
>     2. As far as I understand, dependencies are mostly about Clang binary
>     size. I don't know for sure but that's what I had to consider when
>     I was
>     adding libASTMatchers into the Clang binary a few years ago.
>
>     3. I strongly disagree that JSON compilation database is "just
>     right for
>     this purpose". I don't mind having explicit improved support for
>     it but
>     I would definitely prefer not to hardcode it as the only possible
>     option. Compilation databases are very limited and we cannot drop
>     projects or entire build systems simply because they can't be
>     represented accurately via a compilation database. So I believe that
>     this is not the right solution for CTU in particular. Instead, an
>     external tool like scan-build should be guiding CTU analysis and
>     coordinating the work of different Clang instances, so as to abstract
>     Clang away from the build system.
>
>
> What functionality do you picture the scan-build-like tool having that
> couldn't be supported if that tool instead built a compilation
> database & the CTU/CSA was powered by the database? (that would
> separate concerns: build command discovery from execution, and make
> scan-build-like tool more general purpose, rather than specific only
> to the CSA)

Here are a few examples (please let me know if I'm unaware of the latest
developments in the area of compilation databases!)

- Suppose the project uses precompiled headers. In order to analyze a
file that includes a pch, we need to first rebuild the pch with the
clang that's used for analysis, and only then try to analyze the file.
This introduces a notion of dependency between compilation database
entries; unless entries are ordered in their original compilation order
and we're analyzing with -j1, race conditions will inevitably cause us
to occasionally fail to find the pch. I didn't try to figure out what
happens when modules are used, but I suspect it's worse. But if analysis
is conducted alongside compilation and the build system waits for the
analysis to finish like it waits for compilation to finish before
compiling dependent translation units, race conditions are eliminated.
This is how scan-build currently works: it substitutes the compiler with
a fake compiler that both invokes the original compiler and clang for
analysis. Of course, cross-translation-unit analysis won't be conducted
in parallel with compilation; it's multi-pass by design. The problem is
the same though: it should compile pch files first but there's no notion
of "compile this first" in an unstructured compilation database.

- Suppose the project builds the same translation unit multiple times,
say with different flags, say for different architectures. When we're
trying to look up such a file in the compilation database, how do we figure
out which instance to take? If we are to ever solve this problem, we
have to introduce a notion of a "shipped binary" (an ultimate linking
target) in the compilation database and perform cross-translation-unit
analysis of one shipped binary at a time.

- There is a variety of hacks that people can introduce in their
projects if they add arbitrary scripts to their build system. For
instance, they can mutate contents of an autogenerated header in the
middle of the build. We can always say "Well, you shouldn't do that",
but people will do that anyway. This makes me believe that no purely
declarative compilation database format will ever be able to handle such
Turing-complete hacks and there's no other way to integrate analysis
into build perfectly other than by letting the build system guide the
analysis.
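The "compile this first" problem from the first example could in principle be addressed by a structured database that records what each entry produces and consumes; entries could then be ordered topologically before analysis. A minimal std-only sketch using Kahn's algorithm (the Produces/Consumes fields are hypothetical extensions, not part of today's JSON compilation database):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

// A compilation database entry, extended (hypothetically) with the
// artifacts it produces and the precompiled inputs it consumes.
struct Entry {
  std::string File;                  // main source file
  std::vector<std::string> Produces; // e.g. {"pre.h.pch"}
  std::vector<std::string> Consumes; // e.g. {"pre.h.pch"}
};

// Order entries so every producer precedes its consumers.
std::vector<std::string> compileOrder(const std::vector<Entry> &Entries) {
  std::map<std::string, size_t> Producer; // artifact -> producing entry
  for (size_t I = 0; I < Entries.size(); ++I)
    for (const std::string &P : Entries[I].Produces)
      Producer[P] = I;
  std::vector<std::vector<size_t>> Succ(Entries.size());
  std::vector<size_t> InDeg(Entries.size(), 0);
  for (size_t I = 0; I < Entries.size(); ++I)
    for (const std::string &C : Entries[I].Consumes)
      if (auto It = Producer.find(C); It != Producer.end()) {
        Succ[It->second].push_back(I);
        ++InDeg[I];
      }
  std::queue<size_t> Ready;
  for (size_t I = 0; I < Entries.size(); ++I)
    if (InDeg[I] == 0)
      Ready.push(I);
  std::vector<std::string> Order;
  while (!Ready.empty()) {
    size_t I = Ready.front();
    Ready.pop();
    Order.push_back(Entries[I].File);
    for (size_t S : Succ[I])
      if (--InDeg[S] == 0)
        Ready.push(S);
  }
  return Order; // shorter than Entries.size() means a dependency cycle
}
```

With such ordering, PCH-producing entries could be scheduled before their consumers even under a parallel analysis run, rather than relying on original database order and -j1.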

I'm also all for separation of concerns and I don't think any of this is
specific to our static analysis.

>     On 4/28/20 11:31 AM, Endre Fülöp via cfe-dev wrote:
>     >
>     > Hi!
>     >
>     > Question:
>     >
>     > Why is the dependency on ClangTooling ill-advised inside ClangSA
>     (also
>     > meaning the Clang binary) itself ?
>     >
>     > Context:
>     >
>     > Currently I am working on an alternative way to import external TU
>     > AST-s during analysis ( https://reviews.llvm.org/D75665 ).
>     >
>     > In order to produce AST-s, I use a compilation database to
>     extract the
>     > necessary flags, and finally use ClangTool::buildAST.
>     >
>     > I am aware that I have other options for this as well (like
>     manually
>     > coding the compdb handling for my specific case for the
>     >
>     > first step, and maybe even dumping ASTs as pch-s into an in-memory
>     > buffer), but still consuming JSONCompilationDatabase
>     >
>     > is just too convenient. I would not want to introduce another
>     format
>     > when compilation database is just right for this purpose.
>     >
>     > Elaboration:
>     >
>     > While I understand that introducing dependencies has its downsides,
>     > but not being able to reuse code from Tooling is also not ideal.
>     >
>     > I would very much like to be enlightened by someone more
>     familiar with
>     > architectural decision already made why this is the case,
>     >
>     > and optionally how I could proceed with my efforts so that I can
>     come
>     > up with the most fitting solution i.e. not a hack.
>     >
>     > Thanks,
>     >
>     > Endre Fülöp
>     >
>     >
>     > _______________________________________________
>     > cfe-dev mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>     _______________________________________________
>     cfe-dev mailing list
>     [hidden email] <mailto:[hidden email]>
>     https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


Re: [analyzer][tooling] Architectural questions about Clang and ClangTooling

Fangrui Song via cfe-dev
In reply to this post by Fangrui Song via cfe-dev
On Tue, Apr 28, 2020 at 3:43 PM Sam McCall <[hidden email]> wrote:
On Tue, Apr 28, 2020 at 11:53 PM David Blaikie <[hidden email]> wrote:
(+Sam, who works on clang tooling)

On Tue, Apr 28, 2020 at 10:38 AM Artem Dergachev <[hidden email]> wrote:
On 4/28/20 6:23 PM, David Blaikie wrote:
> On Tue, Apr 28, 2020 at 3:09 AM Artem Dergachev via cfe-dev
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Hey!
>
>     1. I'm glad that we're finally trying to avoid dumping PCH-s on disk!
>
>     2. As far as I understand, dependencies are mostly about Clang binary
>     size. I don't know for sure but that's what I had to consider when
>     I was
>     adding libASTMatchers into the Clang binary a few years ago.
>
>     3. I strongly disagree that JSON compilation database is "just
>     right for
>     this purpose". I don't mind having explicit improved support for
>     it but
>     I would definitely prefer not to hardcode it as the only possible
>     option. Compilation databases are very limited and we cannot drop
>     projects or entire build systems simply because they can't be
>     represented accurately via a compilation database. So I believe that
>     this is not the right solution for CTU in particular. Instead, an
>     external tool like scan-build should be guiding CTU analysis and
>     coordinating the work of different Clang instances, so as to abstract
>     Clang away from the build system.
>
>
> What functionality do you picture the scan-build-like tool having that
> couldn't be supported if that tool instead built a compilation
> database & the CTU/CSA was powered by the database? (that would
> separate concerns: build command discovery from execution, and make
> scan-build-like tool more general purpose, rather than specific only
> to the CSA)

Here are a few examples (please let me know if I'm unaware of the latest
developments in the area of compilation databases!)

- Suppose the project uses precompiled headers. In order to analyze a
file that includes a pch, we need to first rebuild the pch with the
clang that's used for analysis, and only then try to analyze the file.
This introduces a notion of dependency between compilation database
entries; unless entries are ordered in their original compilation order
and we're analyzing with -j1, race conditions will inevitably cause us
to occasionally fail to find the pch. I didn't try to figure out what
happens when modules are used, but I suspect it's worse.

Google certainly uses clang tooling, with a custom compilation database on a build that uses explicit modules - I believe the way that's done is to ignore/strip the modules-related flags so the clang tooling uses non-modules related compilation.
Yeah. In fact we put the build system into a mode where it generates compile commands for a non-modules build, I'm not sure whether it's always easy to do this given a modular compile command.

Ah, fair enough *nod*
 
I suspect that to make C++20 modular builds work with tools (and we must), we'll need an extension or peer to compile_commands.json that build systems can use to define the modules in a project. That will at least give tools what they need to maintain a module cache.

Yeah, if we end up going with the... I forget the current name of the prototype idea GCC has - the sort of build oracle (the compiler interacts over a socket to request certain prerequisite modules be built - trivially, that could invoke the compiler directly or it could inform a build system like ninja to queue that work, knowing that the current process will be idle until the work can be provided (so less resource starvation than Clang's current implicit modules)), a compiler shim like scan-build could also insert itself in as an oracle proxy to retrieve those module build requests and record them (& the dependency information - this module build request came from the compilation of this other file, etc) as well.
 
 
But I could be wrong there. You could do some analysis to see any inputs/outputs - or reusing the existing outputs in the original build.
The problem with reusing existing outputs (at least serialized ASTs like PCH or modules) is that there's no stable format: you need to version-lock your tool to your compiler (which must be clang!) to ensure compatibility.
I think this is what Artem was alluding to by "with the clang that's used for analysis".

Oh, right!
 
But if analysis
is conducted alongside compilation and the build system waits for the
analysis to finish like it waits for compilation to finish before
compiling dependent translation units, race conditions are eliminated.
This is how scan-build currently works: it substitutes the compiler with
a fake compiler that both invokes the original compiler and clang for
analysis. Of course, cross-translation-unit analysis won't be conducted
in parallel with compilation; it's multi-pass by design. The problem is
the same though: it should compile pch files first but there's no notion
of "compile this first" in an unstructured compilation database.
FWIW our internal database plugin (not compile_commands.json) prepares the inputs for a target when its compile command is queried. This works as long as your build system doesn't delete files or give them different content based on what you're building.

- Suppose the project builds the same translation unit multiple times,
say with different flags, say for different architectures. When we're
trying to look up such a file in the compilation database, how do we figure
out which instance to take? If we are to ever solve this problem, we
have to introduce a notion of a "shipped binary" (an ultimate linking
target) in the compilation database and perform cross-translation-unit
analysis of one shipped binary at a time.

I believe in that case the compilation database would include both compilations of the file - and presumably for the static analyzer, it would want to build all of them (or scan-build would have to have some logic for filtering them out/deciding which one is the interesting one - same sort of thing would have to be done on the compilation database)
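As a sketch of what such filtering could look like (the Target field is a hypothetical extension along the lines of Artem's "shipped binary" suggestion, not part of compile_commands.json today):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A compile command, extended (hypothetically) with the shipped binary
// it contributes to.
struct CompileCommand {
  std::string File;
  std::string Flags;
  std::string Target; // hypothetical "shipped binary" field
};

// Plain lookup: every compilation of the file, possibly several.
std::vector<CompileCommand>
getCommands(const std::vector<CompileCommand> &DB, const std::string &File) {
  std::vector<CompileCommand> Out;
  for (const CompileCommand &C : DB)
    if (C.File == File)
      Out.push_back(C);
  return Out;
}

// Target-aware lookup: analyze one shipped binary at a time, so the
// ambiguity between e.g. per-architecture compilations disappears.
std::vector<CompileCommand>
getCommandsForTarget(const std::vector<CompileCommand> &DB,
                     const std::string &File, const std::string &Target) {
  std::vector<CompileCommand> Out;
  for (const CompileCommand &C : DB)
    if (C.File == File && C.Target == Target)
      Out.push_back(C);
  return Out;
}
```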
 
- There is a variety of hacks that people can introduce in their
projects if they add arbitrary scripts to their build system. For
instance, they can mutate contents of an autogenerated header in the
middle of the build. We can always say "Well, you shouldn't do that",
but people will do that anyway. This makes me believe that no purely
declarative compilation database format will ever be able to handle such
Turing-complete hacks and there's no other way to integrate analysis
into build perfectly other than by letting the build system guide the
analysis.

Yep, a mutating build where you need to observe the state before/after such mutations, etc, not much you could do about it. (& how would CTU SA work in that sort of case? You have to run the whole build multiple times?)

Clang tools are essentially static analysis - so it seems weird that we have two different approaches to static analysis discovery/lookup in the Clang project, but not wholly unacceptable, potentially different goals, etc.
 



Re: [analyzer][tooling] Architectural questions about Clang and ClangTooling

Fangrui Song via cfe-dev

Hi,

 

Thank you all for your input!

 

I will evaluate the possibilities of driving the analyzer in a way that leads to the most straightforward solution (the least amount of dependency-tangling 😊).

As an approximation step, I have experimented with dumping clang command-line invocations for source files in a YAML format (the parser in LLVM seemed convenient to use), and using that to drive the CompilerInvocation object creation manually. I think there could be some refactoring done on the functions in Tooling.cpp (there is a FIXME that also suggests this), but there may be efforts there that I am unaware of.
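To illustrate the kind of plumbing involved: each stored invocation line has to be split into argv-style arguments before anything like a CompilerInvocation can be created from it. A simplified stand-in (a toy tokenizer that honors double quotes but ignores shell escapes, unlike LLVM's real command-line handling):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split one stored invocation line into argv-style arguments, honoring
// double quotes. Simplified: no backslash escapes, no single quotes.
std::vector<std::string> splitCommandLine(const std::string &Line) {
  std::vector<std::string> Args;
  std::string Cur;
  bool InQuotes = false, HaveArg = false;
  for (char C : Line) {
    if (C == '"') {
      InQuotes = !InQuotes; // toggling also marks an (even empty) argument
      HaveArg = true;
    } else if (C == ' ' && !InQuotes) {
      if (HaveArg)
        Args.push_back(Cur);
      Cur.clear();
      HaveArg = false;
    } else {
      Cur.push_back(C);
      HaveArg = true;
    }
  }
  if (HaveArg)
    Args.push_back(Cur);
  return Args;
}
```

The resulting argument vector is what would then be handed to the CompilerInvocation-building machinery.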

 

Cheers,

Endre

 

 
