[analyzer] Regression testing for the static analyzer

[analyzer] Regression testing for the static analyzer

Valeriy Savchenko via cfe-dev
Hi everyone,

this thread is mainly for the static analyzer developers, but, of course,
everyone else is most welcome to join and share their view on this topic.

I have a few thoughts, ideas, and a bit of prototyping.

INTRO & MOTIVATION

Quite a big portion of patches needs to be checked on real projects as opposed
to the synthetic tests we have in the repo.  Right now it all comes down to
manual testing.  A person has to find at least a couple of projects, build them
natively, and check them with the analyzer.  So the first problem that I really
want to solve is to eliminate all this hassle.  It should be dead simple, maybe
as simple as running `lit` tests.

Another point of interest is reproducibility.  We at Apple regularly check for
differences in results on a set of projects.  I believe that other parts of the
community have similar CI setups.  So, when we need to come back to the
community with undesired changes, we have to produce a reproducible example.
Even if it is a well-known open-source project, it is not guaranteed that
another developer will be able to get even somewhat similar results.  The
analyzer is extremely susceptible to differences in the environment.  The OS,
its version, and the versions of the installed libraries can change the
warnings that the analyzer produces.  That said, the second problem that has to
be solved is the stability of results: every developer should get exactly the
same results.

MAIN IDEA

One way to solve both of the aforementioned problems is to use `docker`.  It is
available on Linux, Windows, and macOS.  It is pretty widespread, so it is quite
probable that a developer already has some experience with docker.  It is also
used for other parts of the LLVM project.  And it is fairly easy to run scripts
in docker and make it seem like they are executed outside of it.
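
To give a rough idea of that last point, here is a minimal sketch (not the
actual SATest.py implementation; the image name and mount path are made up) of
how a thin wrapper can forward a command into a container with the local LLVM
checkout mounted, so the run feels like it happens on the host:

    #!/usr/bin/env python3
    # Illustrative sketch only: forward a command into a docker container
    # while mounting the local LLVM checkout, so results land back on the host.
    import os
    import subprocess
    import sys

    IMAGE = "satest-image"  # hypothetical image name
    CHECKOUT = os.path.abspath("llvm-project")  # hypothetical checkout location

    def run_in_docker(argv):
        cmd = ["docker", "run", "--rm",
               "-v", "{}:/llvm-project".format(CHECKOUT),  # mount host checkout
               IMAGE] + argv
        return subprocess.call(cmd)

    if __name__ == "__main__":
        sys.exit(run_in_docker(sys.argv[1:]))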

WHAT IS DONE

There is a series of revisions starting from https://reviews.llvm.org/D81571
that implements a first working version of this approach.

Short summary of what is there (a sketch of what a project entry might look like follows the list):
  * Info on 15 open-source projects to analyze, most of which are pretty small
  * Dockerfile with fixed versions of dependencies for these projects
  * Python interface that abstracts away user interaction with docker
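
For illustration, a project entry could look roughly like the following
(a hypothetical sketch, not the actual format from D81571; the field names and
the example project are made up):

    # Hypothetical sketch of per-project information; the real format may differ.
    PROJECTS = [
        {
            "name": "example-project",                       # made-up project
            "origin": "https://github.com/example/project",  # where to clone it from
            "commit": "0123abcd",                            # pinned revision for reproducibility
            "build_command": "cmake . && make",              # how to build it for analysis
        },
    ]

Pinning a specific revision, together with the fixed dependency versions in the
Dockerfile, is what makes the results reproducible across machines.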

WHAT DOES IT TAKE TO RUN IT RIGHT NOW

The system has two dependencies: python (2 or 3) and docker.  

Right now the prototype of the system is not feature-complete, but it supports
the following workflow for testing a new patch for crashes and result changes
against master (some options are left off for clarity; a sketch that chains
these steps into a single script follows the list):

1. Build docker image
./SATest.py docker --build-image

2. Build LLVM in docker
./SATest.py docker -- --build-llvm-only

3. Collect reference results for master
./SATest.py docker -- build -r

4. Make changes to the analyzer

5. Incrementally re-build LLVM in docker
./SATest.py docker -- --build-llvm-only

6. Collect results and compare them with reference
./SATest.py docker -- build --strictness 2
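
For convenience, the whole loop can be chained into a single script.  A purely
illustrative sketch that just shells out to the commands listed above:

    #!/usr/bin/env python3
    # Illustrative sketch: chain the patch-testing workflow described above.
    import subprocess

    def satest_docker(*args):
        # Shell out to the documented command-line interface.
        subprocess.check_call(["./SATest.py", "docker"] + list(args))

    satest_docker("--build-image")              # 1. build the docker image
    satest_docker("--", "--build-llvm-only")    # 2. build LLVM in docker
    satest_docker("--", "build", "-r")          # 3. collect reference results

    input("Apply your changes to the analyzer, then press Enter...")  # 4.

    satest_docker("--", "--build-llvm-only")    # 5. incrementally re-build LLVM
    satest_docker("--", "build", "--strictness", "2")  # 6. collect and compare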

HOW IS IT DIFFERENT FROM OTHER SOLUTIONS

There are two main contestants here: SATestBuild and csa-testbench:

SATestBuild is a set of scripts that already exists in the repo
(clang/utils/analyzer) and is essentially a foundation for the new system.
  + already exists and works
  + lives in the tree
  + doesn't have external dependencies
  - doesn't have a pre-defined set of projects and their dependencies
  - doesn't provide a fast setup good for the newcomers
  - doesn't guarantee stable results on different machines
  - doesn't have benchmarking tools

csa-testbench (https://github.com/Xazax-hun/csa-testbench) is a much richer set
of scripts in terms of functionality.
  + already exists and works
  + has an existing pre-defined set of projects
  + has support for coverage
  + compares various statistics
  + has a nice visualization
  - depends on `CodeChecker`, which is not used by all of the analyzer's
    developers and has to be installed separately
  - doesn't live in the repo, so it's harder to find
  - the user still has to deal with project dependencies, which makes the
    initial setup longer and harder for newcomers
  - doesn't guarantee stable results on different machines

(I am not a `csa-testbench` user, so please correct me if I'm wrong here)

DIRECTIONS

In this section, I want to cover all the things I want to see in this testing
system.

  * I want it to cover all basic needs of the developer:
      - analyze a bunch of projects and show results
      - compare two given revisions
      - benchmark and compare performance

  * I want all commands to be as simple as possible, e.g.:
      - ./SATest.py docker analyze
      - ./SATest.py docker compare HEAD^1 HEAD
      - ./SATest.py docker benchmark --project ABC
    Try to minimize the number of options and actions required.

  * I want to have a community-supported CI bot that will test it.
    We can keep the current reference results in master and the bot can check
    against those.  This can help reduce the amount of time spent on testing,
    as the reference results are already there.

  * I want to have a separate Phabricator-friendly output to post results

DISCUSSION

Please tell me what you think about this topic and this particular solution and
help me to answer these questions:

  * Would you use a system like this?

  * Does the proposed solution seem reasonable in this situation?

  * What do you think about the directions?

  * What other features do you want to see in the system?

  * What are the priorities for the project and what is the minimal feature
    scope to start using it?

Thank you for taking your time and reading through this!


Re: [analyzer] Regression testing for the static analyzer

Gábor Horváth via cfe-dev
Hi!

I'm glad that someone picked this up. Making it easier to test the analyzer on real-world projects is an important task that can ultimately make it much easier to contribute to the analyzer.
See some of my comments inline.

On Thu, 11 Jun 2020 at 16:23, Valeriy Savchenko via cfe-dev <[hidden email]> wrote:

A person has to find at least a couple of projects, build them natively, and
check them with the analyzer. ... It should be dead simple, maybe as simple as
running `lit` tests.

While I think this is a great idea we also should not forget that the tested projects should exercise the right parts of the analyzer. For instance, a patch adding exception support should be tested on projects that are using exceptions extensively. Having a static set of projects will not solve this problem. Nevertheless, this is something that is far less important to solve. First, we need something that is very close to what you proposed.
 

Another point of interest is reproducibility.

Huge +1. Actually, I'd be even glad to see more extremes like running the analyzer multiple times making sure that the number of exploded graphs and other statistics are stable to avoid introducing non-deterministic behavior.
 

Short summary of what is there:
  * Info on 15 open-source projects to analyze, most of which are pretty small
  * Dockerfile with fixed versions of dependencies for these projects 

Dependencies are the bane of C++ at the moment. I'd love to see some other solutions for this problem. Some that come to mind:
* Piggybacking on the source repositories of Linux distributions. We could easily install all the build dependencies using the package manager automatically. The user would only need to specify the name of the source package; the rest could be automated without having to manually search for the names of the dependent packages.
* Supporting C++ package managers. There are Conan, vcpkg, and some CMake-based ones. We could use a base docker image that already has these installed.
 

The system has two dependencies: python (2 or 3) and docker. 

How long do we want to retain Python 2 compatibility? I'm all in favor of not supporting it for long (or at all).
 

(I am not a `csa-testbench` user, so please correct me if I'm wrong here)

Your assessment is 100% correct here. We always wanted to add docker support and support for rebuilding source deb packages to solve most of the issues you mentioned.
 

  * I want it to cover all basic needs of the developer:
      - analyze a bunch of projects and show results
      - compare two given revisions
      - benchmark and compare performance

I think one very important feature is to collect/compare not only the analysis results but more fine-grained information like the statistics emitted by the analyzer (number of refuted reports in case of refutation, number of exploded nodes, and so on).
It would be nice to be able to retrieve anything crash-related like call stacks and have an easy way to ssh into the docker image to debug the crash within the image.
Also, the csa-testbench has a feature to define regular expressions and collect the matching lines of the analyzer output. This can be useful to count/collect log messages.
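
To illustrate the idea, the mechanism is roughly the following (a sketch, not
csa-testbench's actual configuration format; the patterns are made up):

    # Sketch: collect and count analyzer output lines matching user-defined regexes.
    import re
    from collections import Counter

    PATTERNS = {
        "refutation": re.compile(r"refut", re.IGNORECASE),   # made-up pattern
        "assertion": re.compile(r"Assertion .* failed"),     # made-up pattern
    }

    def collect_matches(log_path):
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                for name, pattern in PATTERNS.items():
                    if pattern.search(line):
                        counts[name] += 1
        return counts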
 

  * I want all commands to be as simple as possible, e.g.:

While I see the value of having a minimal interface, I wonder if it will be a bit limiting to the power users in the end (see extracting statistics and logs based on regexps).
 


  * Would you use a system like this?

If it supports my needs, definitely. As you mentioned, there are multiple contenders here: csa-testbench and SATest. I do see why the testbench is not desirable (mainly because of the dependencies), but I wonder if it would make sense to have compatible configurations, i.e. one could copy and paste a project from one to the other and have it working without any additional effort.
 

  * Does the proposed solution seem reasonable in this situation?

Looks good to me.
 

  * What do you think about the directions?

+1
 

  * What other features do you want to see in the system?

See my other inlines above.
 

  * What are the priorities for the project and what is the minimal feature
    scope to start using it?

If we can run it reliably on big projects, I'd say we should have a build bot as soon as possible (one that only triggers when crashes are introduced). I think it could have prevented many errors.
 

Thank you for taking your time and reading through this!


Re: [analyzer] Regression testing for the static analyzer

Kristóf Umann via cfe-dev
+Ericsson gang

Endre and Gábor Márton in particular have worked a lot on buildbots (CTU-related ones especially), so I wouldn't risk summarizing our current stance/progress on this issue.

What I will say, however, from my perspective is that I find committing stressful for all the reasons you mentioned. While I do my best to contribute non-breaking code, the tedious process of jumping on the company VPN and finding an appropriate server that isn't under heavy load to run a thorough enough analysis sometimes leaves me committing seemingly miscellaneous patches after only running check-clang-analysis, which on occasion comes back to bite me. Things like changes in the report count (in drastic cases, changes in the bug reports themselves, such as new notes), side effects on other platforms, etc. make this process really error-prone as well, not to mention that it's at the point where I'm just itching to commit and move on. While the responsibility for the committed or soon-to-be-committed code still falls on the contributor, the lack of buildbots on a variety of platforms still makes this process very inconvenient and downright hostile to non-regulars. Not to mention the case where I fill the role of the reviewer.

All in all, I really appreciate this project and agree strongly with your goals!


Re: [analyzer] Regression testing for the static analyzer

Artem Dergachev via cfe-dev



On Thu, 11 Jun 2020 at 17:51, Gábor Horváth via cfe-dev <[hidden email]> wrote:
Hi!

I'm glad that someone picked this up. Making it easier to test the analyzer on real-world projects is an important task that can ultimately make it much easier to contribute to the analyzer.
See some of my comments inline.

On Thu, 11 Jun 2020 at 16:23, Valeriy Savchenko via cfe-dev <[hidden email]> wrote:

A person has to find at least a couple of projects, build them natively, and
check them with the analyzer. ... It should be dead simple, maybe as simple as
running `lit` tests.

While I think this is a great idea we also should not forget that the tested projects should exercise the right parts of the analyzer. For instance, a patch adding exception support should be tested on projects that are using exceptions extensively. Having a static set of projects will not solve this problem. Nevertheless, this is something that is far less important to solve. First, we need something that is very close to what you proposed.
 

Another point of interest is reproducibility.

Huge +1. Actually, I'd be even glad to see more extremes like running the analyzer multiple times making sure that the number of exploded graphs and other statistics are stable to avoid introducing non-deterministic behavior.


This one's not just about nondeterminism, it's also about reproducibility across machines with different systems and system headers. Like you'll be able to say "hey we broke something in our docker tests, take a look" and you'll no longer need to extract and send to me a preprocessed file. That's a lot if we try to collectively keep an eye on the effects of our changes on a single benchmark (or even if you have your own benchmark it's easy to share the project config because it's basically just a link to the project on github).


Short summary of what is there:
  * Info on 15 open-source projects to analyze, most of which are pretty small
  * Dockerfile with fixed versions of dependencies for these projects 

Dependencies are the bane of C++ at the moment. I'd love to see some other solutions for this problem. Some that come to mind:
* Piggybacking on the source repositories of Linux distributions. We could easily install all the build dependencies using the package manager automatically. The user would only need to specify the name of the source package; the rest could be automated without having to manually search for the names of the dependent packages.
* Supporting C++ package managers. There are Conan, vcpkg, and some CMake-based ones. We could use a base docker image that already has these installed.


Just curious: given that it's Debian under the hood, can we replace our make scripts with "scan-build apt-build" or something like that?


The system has two dependencies: python (2 or 3) and docker. 

How long do we want to retain Python 2 compatibility? I'm all in favor of not supporting it for long (or at all).
 


As far as I understand, we still haven't "officially" transitioned to Python 3 in LLVM. I don't think it actually matters for these scripts; it's not like they're run every day on an ancient buildbot that still doesn't have Python 3 (in fact, as of now I don't think anybody uses them at all except us). In any case, the only script that really needs to be Python 2 compatible, up to all possible formal requirements, is `SATest.py` itself, which is a trivial wrapper that parses some arguments and forwards them into docker; for everything else there's docker, and you don't care what's within it.



(I am not a `csa-testbench` user, so please correct me if I'm wrong here)

Your assessment is 100% correct here. We always wanted to add docker support and support for rebuilding source deb packages to solve most of the issues you mentioned.
 

  * I want it to cover all basic needs of the developer:
      - analyze a bunch of projects and show results
      - compare two given revisions
      - benchmark and compare performance

I think one very important feature is to collect/compare not only the analysis results but more fine-grained information like the statistics emitted by the analyzer (number of refuted reports in case of refutation, number of exploded nodes, and so on).
It would be nice to be able to retrieve anything crash-related like call stacks and have an easy way to ssh into the docker image to debug the crash within the image.
Also, the csa-testbench has a feature to define regular expressions and collect the matching lines of the analyzer output. This can be useful to count/collect log messages.
 

  * I want all commands to be as simple as possible, e.g.:

While I see the value of having a minimal interface, I wonder if it will be a bit limiting to the power users in the end (see extracting statistics and logs based on regexps).


I think it's totally worth it to have both. When a newcomer tries to test their first checker there's nothing better than a simple one-liner that we can tell them to run. But on the other hand having fine-grained commands for controlling every step of the process is absolutely empowering and not going anywhere.



Re: [analyzer] Regression testing for the static analyzer

Valeriy Savchenko via cfe-dev
First of all, thank you a lot for engaging in this conversation, sharing your ideas, and, of course, for your kind words :-)

Huge +1. Actually, I'd be even glad to see more extremes like running the analyzer multiple times making sure that the number of exploded graphs and other statistics are stable to avoid introducing non-deterministic behavior. 

This one's not just about nondeterminism, it's also about reproducibility across machines with different systems and system headers. Like you'll be able to say "hey we broke something in our docker tests, take a look" and you'll no longer need to extract and send to me a preprocessed file. That's a lot if we try to collectively keep an eye on the effects of our changes on a single benchmark (or even if you have your own benchmark it's easy to share the project config because it's basically just a link to the project on github).

Yes, and yes!  I thought about running the analyzer multiple times on one project for benchmarking.  In that mode we can also check for variations in stats between seemingly identical runs.
Sharing is probably the best part of this approach.
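
A rough sketch of what such a check could look like (purely illustrative;
`run_analysis` is a hypothetical helper that performs one analysis and returns
a dict of analyzer statistics):

    # Sketch: run the same analysis several times and flag statistics that vary
    # between seemingly identical runs -- a sign of non-determinism.
    def find_unstable_stats(run_analysis, times=5):
        runs = [run_analysis() for _ in range(times)]
        unstable = {}
        for stat in runs[0]:
            values = [run[stat] for run in runs]
            if len(set(values)) > 1:
                unstable[stat] = (min(values), max(values))
        return unstable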

Dependencies are the bane of C++ at the moment. I'd love to see some other solutions for this problem. Some that come to mind:
* Piggybacking on the source repositories of Linux distributions. We could easily install all the build dependencies using the package manager automatically. The user would only need to specify the name of the source package; the rest could be automated without having to manually search for the names of the dependent packages.
* Supporting C++ package managers. There are Conan, vcpkg, and some CMake-based ones. We could use a base docker image that already has these installed.

Unfortunately, package managers cannot be a solution across different platforms if we need reproducible results: one of Conan’s main features is seamlessly downloading packages appropriate for the current user’s setup, and it does so consistently across different setups.  However, this means that the packages themselves are still different.

We could use a base docker image that already has these installed.

This is probably a good solution, but so far I haven’t had problems building natively in docker.

While I think this is a great idea we also should not forget that the tested projects should exercise the right parts of the analyzer. For instance, a patch adding exception support should be tested on projects that are using exceptions extensively. Having a static set of projects will not solve this problem. Nevertheless, this is something that is far less important to solve. First, we need something that is very close to what you proposed.

I do agree, it should definitely be on the roadmap.  I also planned to have a tag system for projects, like “tiny”, “math”, “C++14”, “web”, and so on.  It is not exactly what you have in mind, but I think it could be a good first step in that direction.

How long do we want to retain Python 2 compatibility? I'm all in favor of not supporting it for long (or at all).

As far as I understand, we still haven't "officially" transitioned to Python 3 in LLVM. I don't think it actually matters for these scripts; it's not like they're run every day on an ancient buildbot that still doesn't have Python 3 (in fact, as of now I don't think anybody uses them at all except us). In any case, the only script that really needs to be Python 2 compatible, up to all possible formal requirements, is `SATest.py` itself, which is a trivial wrapper that parses some arguments and forwards them into docker; for everything else there's docker, and you don't care what's within it.

So, yes, as Artem said, SATest.py is the only script that is compatible with both versions.  All other parts of the system have been migrated to Python 3.

I think one very important feature is to collect/compare not only the analysis results but more fine-grained information like the statistics emitted by the analyzer (number of refuted reports in case of refutation, number of exploded nodes, and so on). 
It would be nice to be able to retrieve anything crash-related like call stacks and have an easy way to ssh into the docker image to debug the crash within the image. 
Also, the csa-testbench has a feature to define regular expressions and collect the matching lines of the analyzer output. This can be useful to count/collect log messages.

The usability of this system is probably the main point for me, so the ease of debugging should be the priority.  I already introduced a `--shell` option that provides an easy way to get a shell inside the docker container without thinking too much about the docker nature of things (whether the container is running and how to clean it up afterwards).  And I believe that pattern-matching certain erroneous situations and producing a short summary out of them can be a really good feature.
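
Roughly, the idea behind it is something like the following (a simplified
sketch, not the actual implementation; the image name is made up):

    # Sketch: drop the user into an interactive shell inside the image and let
    # docker clean the container up automatically on exit.
    import subprocess

    def open_shell(image="satest-image"):  # hypothetical image name
        subprocess.call(["docker", "run", "--rm", "-it", image, "/bin/bash"])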

While I see the value of having a minimal interface, I wonder if it will be a bit limiting to the power users in the end (see extracting statistics and logs based on regexps).

I think it's totally worth it to have both. When a newcomer tries to test their first checker there's nothing better than a simple one-liner that we can tell them to run. But on the other hand having fine-grained commands for controlling every step of the process is absolutely empowering and not going anywhere.

That’s exactly right.  I believe that it can be achieved with a set of very reasonable defaults.  All additional tweaking should probably be categorized and documented in the README.

While I do my best to contribute non-breaking code, the tedious process of jumping on the company VPN and finding an appropriate server that isn't under heavy load to run a thorough enough analysis sometimes leaves me committing seemingly miscellaneous patches after only running check-clang-analysis, which on occasion comes back to bite me.

Yeah, it’s sometimes hard to predict what can break on real code.  And if making changes is pretty scary for experienced users, what can be said about newcomers?

side effects on other platforms

It is pretty hard to avoid those; I guess we’ll have to keep our separate CI setups for that, but at least we’ll be able to get rid of the majority of issues that are common across platforms.


Re: [analyzer] Regression testing for the static analyzer

Gábor Márton via cfe-dev
Hi Valeriy,

I can just second the opinions that Kristóf and Gábor expressed previously; this is indeed a great initiative!

About the docker images: perhaps we need a hierarchy of images. A base image that makes it possible to analyze locally (e.g. on a laptop), and another image for the build bots (Jenkins) that builds on top of the base image.

Note that we have a publicly available build bot that analyzes many open-source projects:
- Tmux (C)
- Curl (C)
- Redis (C)
- Xerces (C++14)
- Bitcoin (C++11)
- Protobuf (C++11/C++14)
But the focus is mostly on CTU. See http://lists.llvm.org/pipermail/cfe-dev/2019-November/063945.html for more details.
This build bot uses csa-testbench and CodeChecker.
In the past, this bot indicated a crash introduced by a change in the CSA related to CXXInheritedConstructors; at first I thought we had a CTU-related error, but then it turned out it was not. So, yes, we need build bots that analyze open-source projects.

Thanks,
Gabor

