RFC: End-to-end testing

RFC: End-to-end testing

David Greene via cfe-dev
[ I am initially copying only a few lists since they seem like
  the most impacted projects and I didn't want to spam all the mailing
  lists.  Please let me know if other lists should be included. ]

I submitted D68230 for review but this is not about that patch per se.
The patch allows update_cc_test_checks.py to process tests that should
check target asm rather than LLVM IR.  We use this facility downstream
for our end-to-end tests.  It strikes me that it might be useful for
upstream to do similar end-to-end testing.
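
To make that concrete, the kind of end-to-end test I have in mind is just a lit test whose RUN line goes all the way from C source to target asm, roughly like the sketch below (hand-written for illustration; update_cc_test_checks.py would normally generate the CHECK lines, and the exact mnemonic checked depends on the target and compiler version):

    // RUN: %clang -O2 -target x86_64-unknown-linux-gnu -S -o - %s | FileCheck %s

    // CHECK-LABEL: saxpy:
    // CHECK: mulps
    void saxpy(float * restrict y, const float * restrict x, float a, int n) {
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }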

Now that the monorepo is about to become the canonical source of truth,
we have an opportunity for convenient end-to-end testing that we didn't
easily have before with svn (yes, it could be done but in an ugly way).
AFAIK the only upstream end-to-end testing we have is in test-suite and
many of those codes are very large and/or unfocused tests.

With the monorepo we have a place to put lit-style tests that exercise
multiple subprojects, for example tests that ensure the entire clang
compilation pipeline executes correctly.  We could, for example, create
a top-level "test" directory and put end-to-end tests there.  Some of
the things that could be tested include:

- Pipeline execution (debug-pass=Executions)
- Optimization warnings/messages
- Specific asm code sequences out of clang (e.g. ensure certain loops
  are vectorized)
- Pragma effects (e.g. ensure loop optimizations are honored)
- Complete end-to-end PGO (generate a profile and re-compile)
- GPU/accelerator offloading
- Debuggability of clang-generated code

Each of these things is tested to some degree within their own
subprojects, but AFAIK there are currently no dedicated tests ensuring
such things work through the entire clang pipeline flow and with other
tools that make use of the results (debuggers, etc.).  It is relatively
easy to break the pipeline while the individual subproject tests
continue to pass.
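
As one illustration, the "complete end-to-end PGO" item above could be a small lit test that builds with instrumentation, runs the binary, merges the profile, and recompiles with it; a hypothetical sketch follows (it assumes a lit setup that can execute the produced binary through a %run substitution, and checking for function_entry_count metadata is just one way to confirm the profile was consumed):

    // RUN: %clang -O2 -fprofile-instr-generate %s -o %t
    // RUN: env LLVM_PROFILE_FILE=%t.profraw %run %t
    // RUN: llvm-profdata merge -o %t.profdata %t.profraw
    // RUN: %clang -O2 -fprofile-instr-use=%t.profdata -emit-llvm -S -o - %s | FileCheck %s

    // CHECK: function_entry_count
    int work(int N) {
      int S = 0;
      for (int I = 0; I < N; ++I)
        S += I;
      return S;
    }

    int main(void) { return work(1000) == 499500 ? 0 : 1; }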

I realize that some folks prefer to work on only a portion of the
monorepo (for example, they just hack on LLVM).  I am not sure how to
address those developers WRT end-to-end testing.  On the one hand,
requiring them to run end-to-end testing means they will have to at
least check out and build the monorepo.  On the other hand, it seems
less than ideal to have people developing core infrastructure and not
running tests.

I don't yet have a formal proposal but wanted to put this out to spur
discussion and gather feedback and ideas.  Thank you for your interest
and participation!

                        -David

Re: RFC: End-to-end testing

Paul Robinson via cfe-dev



I agree with all your points.  End-to-end testing is a major hole in the
project infrastructure; it has been largely left up to the individual
vendors/packagers/distributors.  The Clang tests verify that Clang will
produce some sort of not-unreasonable IR for given situations; the LLVM
tests verify that some (other) set of input IR will produce something
that looks not-unreasonable on the target side.  Very little connects
the two.

There is more than nothing:
- test-suite has some quantity of code that is compiled end-to-end for
  some targets.
- lldb can be set up to use the just-built Clang to compile its tests,
  but those are focused on debug info and are nothing like comprehensive.
- libcxx likely also can use the just-built Clang, although I've never
  tried it so I don't know for sure. It obviously exercises just the
  runtime side of things.
- compiler-rt likewise. The sanitizer tests in particular I'd expect to
  be using the just-built Clang.
- debuginfo-tests also uses the just-built Clang but is a pathetically
  small set, and again focused on debug info.

I'm not saying the LLVM Project should invest in a commercial suite
(although I'd expect vendors to do so; Sony does).  But a place to *put*
end-to-end tests seems entirely reasonable and useful.  Although I would
resist calling it simply "tests" (we have too many directories with that
name already).

>
> I realize that some folks prefer to work on only a portion of the
> monorepo (for example, they just hack on LLVM).  I am not sure how to
> address those developers WRT end-to-end testing.  On the one hand,
> requiring them to run end-to-end testing means they will have to at
> least check out and build the monorepo.  On the other hand, it seems
> less than ideal to have people developing core infrastructure and not
> running tests.

People should obviously be running the tests for the project(s) they're
modifying.  People aren't expected to run everything.  That's why...

Bots.  "Don't argue with the bots."  I don't checkout and build and test
everything, and I've broken LLDB, compiler-rt, and probably others from
time to time.  Probably everybody has broken other projects unexpectedly.
That's what bots are for, to run the combinations and the projects that
I don't have the infrastructure or resources to do myself.  It's not up
to me to run everything possible before committing; it IS up to me to
respond promptly to bot failures for my changes.  I don't see a new
end-to-end test project being any different in that respect.

>
> I don't yet have a formal proposal but wanted to put this out to spur
> discussion and gather feedback and ideas.  Thank you for your interest
> and participation!

Thanks for bringing it up!  It's been a pebble in my shoe for a long time.
--paulr

>
>                         -David

Re: RFC: End-to-end testing

David Blaikie via cfe-dev
I have a bit of concern about this sort of thing - worrying it'll lead to people being less cautious about writing the more isolated tests. That said, clearly there's value in end-to-end testing for all the reasons you've mentioned (& we do see these problems in practice - recently DWARF indexing broke when support for more nuanced language codes was added to Clang).

Dunno if they need a new place or should just be more stuff in test-suite, though.


Re: [Openmp-dev] RFC: End-to-end testing

David Greene via cfe-dev
David Blaikie via Openmp-dev <[hidden email]> writes:

> I have a bit of concern about this sort of thing - worrying it'll lead to
> people being less cautious about writing the more isolated tests.

That's a fair concern.  Reviewers will still need to insist on small
component-level tests to go along with patches.  We don't have to
sacrifice one to get the other.

> Dunno if they need a new place or should just be more stuff in test-suite,
> though.

There are at least two problems I see with using test-suite for this:

- It is a separate repository and thus is not as convenient as tests
  that live with the code.  One cannot commit an end-to-end test
  atomically with the change meant to be tested.

- It is full of large codes which is not the kind of testing I'm talking
  about.

Let me describe how I recently added some testing in our downstream
fork.

- I implemented a new feature along with a C source test.

- I used clang to generate asm from that test and captured the small
  piece of it I wanted to check in an end-to-end test.

- I used clang to generate IR just before the feature kicked in and
  created an opt-style test for it.  Generating this IR is not always
  straightforward and it would be great to have better tools to do this,
  but that's another discussion.

- I took the IR out of opt (after running my feature) and created an
  llc-style test out of it to check the generated asm.  The checks are
  the same as in the original C end-to-end test.

So the tests are checking at each stage that the expected input is
generating the expected output and the end-to-end test checks that we go
from source to asm correctly.
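
For instance, the llc-style companion from the last step is just a small .ll test; a deliberately trivial stand-in (not our actual downstream test) might look like:

    ; RUN: llc -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s

    define <4 x float> @vadd(<4 x float> %a, <4 x float> %b) {
    ; CHECK-LABEL: vadd:
    ; CHECK: addps
      %r = fadd <4 x float> %a, %b
      ret <4 x float> %r
    }

The end-to-end C test checks the same asm sequence, but starting from the original source and running through the full clang pipeline.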

These are all really small tests, easily runnable as part of the normal
"make check" process.

                     -David

Re: [Openmp-dev] RFC: End-to-end testing

David Blaikie via cfe-dev


On Tue, Oct 8, 2019 at 12:46 PM David Greene <[hidden email]> wrote:
David Blaikie via Openmp-dev <[hidden email]> writes:

> I have a bit of concern about this sort of thing - worrying it'll lead to
> people being less cautious about writing the more isolated tests.

That's a fair concern.  Reviewers will still need to insist on small
component-level tests to go along with patches.  We don't have to
sacrifice one to get the other.

> Dunno if they need a new place or should just be more stuff in test-suite,
> though.

There are at least two problems I see with using test-suite for this:

- It is a separate repository and thus is not as convenient as tests
  that live with the code.  One cannot commit an end-to-end test
  atomically with the change meant to be tested.

- It is full of large codes which is not the kind of testing I'm talking
  about.

Oh, right - I'd forgotten that the test-suite wasn't part of the monorepo (due to size, I can understand why) - fair enough. Makes sense to me to have lit-style lightweight, targeted, but intentionally end-to-end tests.
 



Re: [llvm-dev] RFC: End-to-end testing

Mehdi AMINI via cfe-dev

I have a bit of concern about this sort of thing - worrying it'll lead to people being less cautious about writing the more isolated tests.

I have the same concern. I really believe we need to be careful about testing at the right granularity to keep things both modular and the testing maintainable (for instance checking vectorized ASM from a C++ source through clang has always been considered a bad FileCheck practice).
(Not saying that there is no space for better integration testing in some areas).
 
That said, clearly there's value in end-to-end testing for all the reasons you've mentioned (& we do see these problems in practice - recently DWARF indexing broke when support for more nuanced language codes were added to Clang).

Dunno if they need a new place or should just be more stuff in test-suite, though.

On Tue, Oct 8, 2019 at 9:50 AM David Greene via cfe-dev <[hidden email]> wrote:
[ I am initially copying only a few lists since they seem like
  the most impacted projects and I didn't want to spam all the mailing
  lists.  Please let me know if other lists should be included. ]

I submitted D68230 for review but this is not about that patch per se.
The patch allows update_cc_test_checks.py to process tests that should
check target asm rather than LLVM IR.  We use this facility downstream
for our end-to-end tests.  It strikes me that it might be useful for
upstream to do similar end-to-end testing.

Now that the monorepo is about to become the canonical source of truth,
we have an opportunity for convenient end-to-end testing that we didn't
easily have before with svn (yes, it could be done but in an ugly way).
AFAIK the only upstream end-to-end testing we have is in test-suite and
many of those codes are very large and/or unfocused tests.

With the monorepo we have a place to put lit-style tests that exercise
multiple subprojects, for example tests that ensure the entire clang
compilation pipeline executes correctly. 

I don't think I agree with the relationship to the monorepo: there was nothing that prevented tests inside the clang project from exercising the full pipeline already. I don't believe that the SVN repo structure was really a factor in the way the testing was set up; instead it was a deliberate choice in the way we structure our testing.
For instance I remember asking about implementing a test based on checking whether some loops written in a C source file were properly vectorized by the -O2 / -O3 pipeline, and it was deemed the kind of test that we don't want to maintain: instead I was pointed at the test-suite to add better benchmarks there for the end-to-end story. What is interesting is that the test-suite is not gonna be part of the monorepo!

To be clear: I'm not saying here that we can't change our way of testing, I just don't think the monorepo has anything to do with it, and it should be carefully motivated and scoped into what belongs/doesn't belong to integration tests.

 
We could, for example, create
a top-level "test" directory and put end-to-end tests there.  Some of
the things that could be tested include:

- Pipeline execution (debug-pass=Executions)
- Optimization warnings/messages
- Specific asm code sequences out of clang (e.g. ensure certain loops
  are vectorized)
- Pragma effects (e.g. ensure loop optimizations are honored)
- Complete end-to-end PGO (generate a profile and re-compile)
- GPU/accelerator offloading
- Debuggability of clang-generated code

Each of these things is tested to some degree within their own
subprojects, but AFAIK there are currently no dedicated tests ensuring
such things work through the entire clang pipeline flow and with other
tools that make use of the results (debuggers, etc.).  It is relatively
easy to break the pipeline while the individual subproject tests
continue to pass.


I'm not sure I really see much in your list that isn't purely about testing clang itself here? 
Actually the first one seems more of a pure LLVM test.
 
I realize that some folks prefer to work on only a portion of the
monorepo (for example, they just hack on LLVM).  I am not sure how to
address those developers WRT end-to-end testing.  On the one hand,
requiring them to run end-to-end testing means they will have to at
least check out and build the monorepo.  On the other hand, it seems
less than ideal to have people developing core infrastructure and not
running tests.

I think we already expect LLVM developers to update clang when LLVM APIs change? And we revert LLVM patches when clang testing is broken. So I believe the expectation to maintain the other in-tree projects isn't really new. It is true that the monorepo will make it easier for everyone to reproduce most failures locally, and to find all the uses of an API across projects (which was given as a motivation to move to a monorepo model: https://llvm.org/docs/Proposals/GitHubMove.html#monorepo ).

-- 
Mehdi


Re: [llvm-dev] RFC: End-to-end testing

David Greene via cfe-dev
Mehdi AMINI via cfe-dev <[hidden email]> writes:

>> I have a bit of concern about this sort of thing - worrying it'll lead to
>> people being less cautious about writing the more isolated tests.
>>
>
> I have the same concern. I really believe we need to be careful about
> testing at the right granularity to keep things both modular and the
> testing maintainable (for instance checking vectorized ASM from a C++
> source through clang has always been considered a bad FileCheck practice).
> (Not saying that there is no space for better integration testing in some
> areas).

I absolutely disagree about vectorization tests.  We have seen
vectorization loss in clang even though related LLVM lit tests pass,
because something else in the clang pipeline changed that caused the
vectorizer to not do its job.  We need both kinds of tests.  There are
many asm tests of value beyond vectorization and they should include
component as well as end-to-end tests.

> For instance I remember asking about implementing test based on checking if
> some loops written in C source file were properly vectorized by the -O2 /
> -O3 pipeline and it was deemed like the kind of test that we don't want to
> maintain: instead I was pointed at the test-suite to add better benchmarks
> there for the end-to-end story. What is interesting is that the test-suite
> is not gonna be part of the monorepo!

And it shouldn't be.  It's much too big.  But there is a place for small
end-to-end tests that live alongside the code.

>>> We could, for example, create
>>> a top-level "test" directory and put end-to-end tests there.  Some of
>>> the things that could be tested include:
>>>
>>> - Pipeline execution (debug-pass=Executions)
>>>
>>> - Optimization warnings/messages
>>> - Specific asm code sequences out of clang (e.g. ensure certain loops
>>>   are vectorized)
>>> - Pragma effects (e.g. ensure loop optimizations are honored)
>>> - Complete end-to-end PGO (generate a profile and re-compile)
>>> - GPU/accelerator offloading
>>> - Debuggability of clang-generated code
>>>
>>> Each of these things is tested to some degree within their own
>>> subprojects, but AFAIK there are currently no dedicated tests ensuring
>>> such things work through the entire clang pipeline flow and with other
>>> tools that make use of the results (debuggers, etc.).  It is relatively
>>> easy to break the pipeline while the individual subproject tests
>>> continue to pass.
>>>
>>
>
> I'm not sure I really see much in your list that isn't purely about testing
> clang itself here?

Debugging and PGO involve other components, no?  If we want to put clang
end-to-end tests in the clang subdirectory, that's fine with me.  But we
need a place for tests that cut across components.

I could also imagine llvm-mca end-to-end tests through clang.

> Actually the first one seems more of a pure LLVM test.

Definitely not.  It would test the pipeline as constructed by clang,
which is very different from the default pipeline constructed by
opt/llc.  The old and new pass managers also construct different
pipelines.  As we have seen with various mailing list messages, this is
surprising to users.  Best to document and check it with testing.
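
A sketch of what such a pipeline check could look like (the flag spellings and the pass name in the CHECK line are assumptions based on the new pass manager's -fdebug-pass-manager output):

    // RUN: %clang -O2 -fexperimental-new-pass-manager -fdebug-pass-manager \
    // RUN:   -emit-llvm -S -o /dev/null %s 2>&1 | FileCheck %s

    // CHECK: Running pass: LoopVectorizePass
    int sum(const int *A, int N) {
      int S = 0;
      for (int I = 0; I < N; ++I)
        S += A[I];
      return S;
    }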

                  -David

Re: [llvm-dev] RFC: End-to-end testing

Mehdi AMINI via cfe-dev


On Wed, Oct 9, 2019 at 8:12 AM David Greene <[hidden email]> wrote:
Mehdi AMINI via cfe-dev <[hidden email]> writes:

>> I have a bit of concern about this sort of thing - worrying it'll lead to
>> people being less cautious about writing the more isolated tests.
>>
>
> I have the same concern. I really believe we need to be careful about
> testing at the right granularity to keep things both modular and the
> testing maintainable (for instance checking vectorized ASM from a C++
> source through clang has always been considered a bad FileCheck practice).
> (Not saying that there is no space for better integration testing in some
> areas).

I absolutely disagree about vectorization tests.  We have seen
vectorization loss in clang even though related LLVM lit tests pass,
because something else in the clang pipeline changed that caused the
vectorizer to not do its job. 

Of course, and as I mentioned I tried to add these tests (probably 4 or 5 years ago), but someone (I think Chandler?) was asking me at the time: does it affect a benchmark's performance? If so, why isn't it tracked there? And if not, does it matter?
The benchmark was presented as the actual way to check this invariant (because you're only vectorizing to get performance, not for the sake of it).
So I never pursued, even if I'm a bit puzzled that we don't have such tests.


 
We need both kinds of tests.  There are
many asm tests of value beyond vectorization and they should include
component as well as end-to-end tests.

> For instance I remember asking about implementing test based on checking if
> some loops written in C source file were properly vectorized by the -O2 /
> -O3 pipeline and it was deemed like the kind of test that we don't want to
> maintain: instead I was pointed at the test-suite to add better benchmarks
> there for the end-to-end story. What is interesting is that the test-suite
> is not gonna be part of the monorepo!

And it shouldn't be.  It's much too big.  But there is a place for small
end-to-end tests that live alongside the code.

>>> We could, for example, create
>>> a top-level "test" directory and put end-to-end tests there.  Some of
>>> the things that could be tested include:
>>>
>>> - Pipeline execution (debug-pass=Executions)
>>>
>>> - Optimization warnings/messages
>>> - Specific asm code sequences out of clang (e.g. ensure certain loops
>>>   are vectorized)
>>> - Pragma effects (e.g. ensure loop optimizations are honored)
>>> - Complete end-to-end PGO (generate a profile and re-compile)
>>> - GPU/accelerator offloading
>>> - Debuggability of clang-generated code
>>>
>>> Each of these things is tested to some degree within their own
>>> subprojects, but AFAIK there are currently no dedicated tests ensuring
>>> such things work through the entire clang pipeline flow and with other
>>> tools that make use of the results (debuggers, etc.).  It is relatively
>>> easy to break the pipeline while the individual subproject tests
>>> continue to pass.
>>>
>>
>
> I'm not sure I really see much in your list that isn't purely about testing
> clang itself here?

Debugging and PGO involve other components, no?

I don't think that you need anything else than LLVM core (which is a dependency of clang) itself?

Things like PGO (unless you're using frontend instrumentation) don't even have anything to do with clang, so we may get into the situation David mentioned where we would rely on clang to test LLVM features, which I find non-desirable.

 
  If we want to put clang
end-to-end tests in the clang subdirectory, that's fine with me.  But we
need a place for tests that cut across components.

I could also imagine llvm-mca end-to-end tests through clang.

> Actually the first one seems more of a pure LLVM test.

Definitely not.  It would test the pipeline as constructed by clang,
which is very different from the default pipeline constructed by
opt/llc. 

I am not convinced it is "very" different (they are using the PassManagerBuilder AFAIK), I am only aware of very subtle differences.
But more fundamentally: *should* they be different? I would want `opt -O3` to be able to reproduce 1-1 the clang pipeline.
Isn't it the role of LLVM PassManagerBuilder to expose what is the "-O3" pipeline?
If we see the PassManagerBuilder as the abstraction for the pipeline, then I don't see what testing belongs to clang here, this seems like a layering violation (and maintaining the PassManagerBuilder in LLVM I wouldn't want to have to update the tests of all the subproject using it because they retest the same feature).

 
The old and new pass managers also construct different
pipelines.  As we have seen with various mailing list messages, this is
surprising to users.  Best to document and check it with testing.

Yes: both old and new pass managers are LLVM components, so hopefully they are documented and tested in LLVM :)

-- 
Mehdi
 


Re: [llvm-dev] RFC: End-to-end testing

David Greene via cfe-dev
Mehdi AMINI via llvm-dev <[hidden email]> writes:

>> I absolutely disagree about vectorization tests.  We have seen
>> vectorization loss in clang even though related LLVM lit tests pass,
>> because something else in the clang pipeline changed that caused the
>> vectorizer to not do its job.
>
> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
> years ago), but someone (I think Chandler?) was asking me at the time: does
> it affect a benchmark performance? If so why isn't it tracked there? And if
> not does it matter?
> The benchmark was presented as the actual way to check this invariant
> (because you're only vectorizing to get performance, not for the sake of it).
> So I never pursued, even if I'm a bit puzzled that we don't have such tests.

Thanks for explaining.

Our experience is that relying solely on performance tests to uncover
such issues is problematic for several reasons:

- Performance varies from implementation to implementation.  It is
  difficult to keep tests up-to-date for all possible targets and
  subtargets.
 
- Partially as a result, but also for other reasons, performance tests
  tend to be complicated, either in code size or in the numerous code
  paths tested.  This makes such tests hard to debug when there is a
  regression.

- Performance tests don't focus on the why/how of vectorization.  They
  just check, "did it run fast enough?"  Maybe the test ran fast enough
  for some other reason but we still lost desired vectorization and
  could have run even faster.

With a small asm test one can document why vectorization is desired
and how it comes about right in the test.  Then when it breaks it's
usually pretty obvious what the problem is.

They don't replace performance tests, they complement each other, just
as end-to-end and component tests complement each other.

>> Debugging and PGO involve other components, no?
>
> I don't think that you need anything else than LLVM core (which is a
> dependency of clang) itself?

What about testing that what clang produces is debuggable with lldb?
debuginfo tests do that now but AFAIK they are not end-to-end.

> Things like PGO (unless you're using frontend instrumentation) don't even
> have anything to do with clang, so we may get into the situation David
> mentioned where we would rely on clang to test LLVM features, which I find
> non-desirable.

We would still expect component-level tests.  This would be additional
end-to-end testing, not replacing existing testing methodology.  I agree
the concern is valid but shouldn't code review discover such problems?

>> > Actually the first one seems more of a pure LLVM test.
>>
>> Definitely not.  It would test the pipeline as constructed by clang,
>> which is very different from the default pipeline constructed by
>> opt/llc.
>
> I am not convinced it is "very" different (they are using the
> PassManagerBuilder AFAIK), I am only aware of very subtle difference.

I don't think clang exclusively uses PassManagerBuilder but it's been a
while since I looked at that code.

> But more fundamentally: *should* they be different? I would want `opt -O3`
> to be able to reproduce 1-1 the clang pipeline.

Which clang pipeline?  -O3?  -Ofast?  opt currently can't do -Ofast.  I
don't *think* -Ofast affects the pipeline itself but I am not 100%
certain.

> Isn't it the role of LLVM PassManagerBuilder to expose what is the "-O3"
> pipeline?

Ideally, yes.  In practice, it's not.

> If we see the PassManagerBuilder as the abstraction for the pipeline, then
> I don't see what testing belongs to clang here, this seems like a layering
> violation (and maintaining the PassManagerBuilder in LLVM I wouldn't want
> to have to update the tests of all the subproject using it because they
> retest the same feature).

If nothing else, end-to-end testing of the pipeline would uncover
layering violations.  :)

>> The old and new pass managers also construct different
>> pipelines.  As we have seen with various mailing list messages, this is
>> surprising to users.  Best to document and check it with testing.
>>
>
> Yes: both old and new pass managers are LLVM components, so hopefully that
> are documented and tested in LLVM :)

But we have nothing to guarantee that what clang does matches what opt
does.  Currently they do different things.

                         -David

Re: [llvm-dev] RFC: End-to-end testing

Philip Reames via cfe-dev

The two major concerns I see are a potential decay in component test quality, and an increase in difficulty changing components.  The former has already been discussed a bit downstream, so let me focus on the latter.

A challenge we already have - as in, I've broken these tests and had to fix them - is that an end to end test which checks either IR or assembly ends up being extraordinarily fragile.  Completely unrelated profitable transforms create small differences which cause spurious test failures.  This is a very real issue today with the few end-to-end clang tests we have, and I am extremely hesitant to expand those tests without giving this workflow problem serious thought.  If we don't, this could bring development on middle end transforms to a complete stop.  (Not kidding.)

A couple of approaches we could consider:

  1. Simply restrict end to end tests to crash/assert cases.  (i.e. no property of the generated code is checked, other than that it is generated)  This isn't as restrictive as it sounds when combined w/coverage guided fuzzer corpuses.
  2. Auto-update all diffs, but report them to a human user for inspection.  This ends up meaning that tests never "fail" per se, but that individuals who have expressed interest in particular tests get an automated notification and a chance to respond on list with a reduced example. 
  3. As a variant on the former, don't auto-update tests, but only inform the *contributor* of an end-to-end test of a failure.  Responsibility for determining failure vs false positive lies solely with them, and normal channels are used to report a failure after it has been confirmed/analyzed/explained.

I really think this is a problem we need to have thought through and found a workable solution before end-to-end testing as proposed becomes a practically workable option. 

Philip



Re: [llvm-dev] RFC: End-to-end testing

Mehdi AMINI via cfe-dev


On Wed, Oct 9, 2019 at 2:31 PM David Greene <[hidden email]> wrote:
Mehdi AMINI via llvm-dev <[hidden email]> writes:

>> I absolutely disagree about vectorization tests.  We have seen
>> vectorization loss in clang even though related LLVM lit tests pass,
>> because something else in the clang pipeline changed that caused the
>> vectorizer to not do its job.
>
> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
> years ago), but someone (I think Chandler?) was asking me at the time: does
> it affect a benchmark performance? If so why isn't it tracked there? And if
> not does it matter?
> The benchmark was presented as the actual way to check this invariant
> (because you're only vectorizing to get performance, not for the sake of it).
> So I never pursued, even if I'm a bit puzzled that we don't have such tests.

Thanks for explaining.

Our experience is that relying solely on performance tests to uncover
such issues is problematic for several reasons:

- Performance varies from implementation to implementation.  It is
  difficult to keep tests up-to-date for all possible targets and
  subtargets.

- Partially as a result, but also for other reasons, performance tests
  tend to be complicated, either in code size or in the numerous code
  paths tested.  This makes such tests hard to debug when there is a
  regression.

- Performance tests don't focus on the why/how of vectorization.  They
  just check, "did it run fast enough?"  Maybe the test ran fast enough
  for some other reason but we still lost desired vectorization and
  could have run even faster.

With a small asm test one can document why vectorization is desired
and how it comes about right in the test.  Then when it breaks it's
usually pretty obvious what the problem is.

They don't replace performance tests, they complement each other, just
as end-to-end and component tests complement each other.

>> Debugging and PGO involve other components, no?
>
> I don't think that you need anything else than LLVM core (which is a
> dependency of clang) itself?

What about testing that what clang produces is debuggable with lldb?
debuginfo tests do that now but AFAIK they are not end-to-end.

> Things like PGO (unless you're using frontend instrumentation) don't even
> have anything to do with clang, so we may get into the situation David
> mentioned where we would rely on clang to test LLVM features, which I find
> non-desirable.

We would still expect component-level tests.  This would be additional
end-to-end testing, not replacing existing testing methodology.  I agree
the concern is valid but shouldn't code review discover such problems?

Yes I agree, this concern is not a blocker for doing end-to-end testing, but more a "we will need to be careful about scoping the role of the end-to-end testing versus component level testing".

 

>> > Actually the first one seems more of a pure LLVM test.
>>
>> Definitely not.  It would test the pipeline as constructed by clang,
>> which is very different from the default pipeline constructed by
>> opt/llc.
>
> I am not convinced it is "very" different (they are using the
> PassManagerBuilder AFAIK), I am only aware of very subtle difference.

I don't think clang exclusively uses PassManagerBuilder but it's been a
while since I looked at that code.


All the extension points where passes are hooked in are likely places where the clang pipeline would differ from LLVM's default.
 

> But more fundamentally: *should* they be different? I would want `opt -O3`
> to be able to reproduce 1-1 the clang pipeline.

Which clang pipeline?  -O3?  -Ofast?  opt currently can't do -Ofast.  I
don't *think* -Ofast affects the pipeline itself but I am not 100%
certain.

If -Ofast affects the pipeline, I would expose it on the PassManagerBuilder I think.
 

> Isn't it the role of LLVM PassManagerBuilder to expose what is the "-O3"
> pipeline?

Ideally, yes.  In practice, it's not.

> If we see the PassManagerBuilder as the abstraction for the pipeline, then
> I don't see what testing belongs to clang here, this seems like a layering
> violation (and maintaining the PassManagerBuilder in LLVM I wouldn't want
> to have to update the tests of all the subproject using it because they
> retest the same feature).

If nothing else, end-to-end testing of the pipeline would uncover
layering violations.  :)

>> The old and new pass managers also construct different
>> pipelines.  As we have seen with various mailing list messages, this is
>> surprising to users.  Best to document and check it with testing.
>>
>
> Yes: both old and new pass managers are LLVM components, so hopefully that
> are documented and tested in LLVM :)

But we have nothing to guarantee that what clang does matches what opt
does.  Currently they do different things.

My point is that this should be guaranteed by refactoring and using the right APIs, not by duplicating the testing. But I may also be misunderstanding what it is that you would test on the clang side. For instance I wouldn't want to duplicate testing the O3 pass pipeline which is covered here: https://github.com/llvm/llvm-project/blob/master/llvm/test/Other/opt-O3-pipeline.ll
But testing that a specific pass is added with respect to a particular clang option is fair, and actually this is *already* what we do I believe, like here: https://github.com/llvm/llvm-project/blob/master/clang/test/CodeGen/thinlto-debug-pm.c#L11-L14

I don't think these particular tests are the most controversial though, and it is really still fairly "focused" testing. I'm much more curious about the larger end-to-end scope: for instance, since you mention debug info and LLDB, what about a test that would verify that LLDB can print a particular variable's content, starting from a source program? Such tests are valuable in the absolute, but it isn't clear to me that we could in practice block any commit that would break such a test: this is because a bug fix or an improvement in one of the passes may be perfectly correct in isolation but make the test fail by exposing a bug where we are already losing some debug info precision in a totally unrelated part of the codebase.
I wonder how you see this managed in practice: would you gate any change to InstCombine (or another mid-level pass) on not regressing any of the debug-info quality tests on any of the backends, and from any frontend (not only clang)? Or worse: a middle-end change that would end up with a slightly different DWARF construct on this particular test, which would trip LLDB but not GDB (basically exposing a bug in LLDB). Should we require the contributor of inst-combine to debug LLDB and fix it first?

Best,

-- 
Mehdi




Re: [llvm-dev] RFC: End-to-end testing

David Greene via cfe-dev
Mehdi AMINI via cfe-dev <[hidden email]> writes:

> I don't think these particular tests are the most controversial though, and
> it is really still fairly "focused" testing. I'm much more curious about
> larger end-to-end scope: for instance since you mention debug info and
> LLDB, what about a test that would verify that LLDB can print a particular
> variable content from a test that would come as a source program for
> instance. Such test are valuable in the absolute, it isn't clear to me that
> we could in practice block any commit that would break such test though:
> this is because a bug fix or an improvement in one of the pass may be
> perfectly correct in isolation but make the test fail by exposing a bug
> where we are already losing some debug info precision in a totally
> unrelated part of the codebase.
> I wonder how you see this managed in practice: would you gate any change on
> InstCombine (or other mid-level pass) on not regressing any of the
> debug-info quality test on any of the backend, and from any frontend (not
> only clang)? Or worse: a middle-end change that would end-up with a
> slightly different Dwarf construct on this particular test, which would
> trip LLDB but not GDB (basically expose a bug in LLDB). Should we require
> the contributor of inst-combine to debug LLDB and fix it first?

Good questions!  I think for situations like this I would tend toward
allowing the change and the test would alert us that something else is
wrong.  At that point it's probably a case-by-case decision.  Maybe we
XFAIL the test.  Maybe the fix is easy enough that we just do it and the
test starts passing again.  What's the policy for breaking current tests
when the change itself is fine but exposes a problem elsewhere (adding
an assert, for example)?

                       -David

Re: [llvm-dev] RFC: End-to-end testing

David Greene via cfe-dev
Philip Reames via cfe-dev <[hidden email]> writes:

> A challenge we already have - as in, I've broken these tests and had to
> fix them - is that an end to end test which checks either IR or assembly
> ends up being extraordinarily fragile.  Completely unrelated profitable
> transforms create small differences which cause spurious test failures. 
> This is a very real issue today with the few end-to-end clang tests we
> have, and I am extremely hesitant to expand those tests without giving
> this workflow problem serious thought.  If we don't, this could bring
> development on middle end transforms to a complete stop.  (Not kidding.)

Do you have a pointer to these tests?  We literally have tens of
thousands of end-to-end tests downstream and while some are fragile, the
vast majority are not.  A test that, for example, checks the entire
generated asm for a match is indeed very fragile.  A test that checks
whether a specific instruction/mnemonic was emitted is generally not, at
least in my experience.  End-to-end tests require some care in
construction.  I don't think update_llc_test_checks.py-type operation is
desirable.

Still, you raise a valid point and I think present some good options
below.

> A couple of approaches we could consider:
>
>  1. Simply restrict end to end tests to crash/assert cases.  (i.e. no
>     property of the generated code is checked, other than that it is
>     generated)  This isn't as restrictive as it sounds when combined
>     w/coverage guided fuzzer corpuses.

I would be pretty hesitant to do this but I'd like to hear more about
how you see this working with coverage/fuzzing.

>  2. Auto-update all diffs, but report them to a human user for
>     inspection.  This ends up meaning that tests never "fail" per se,
>     but that individuals who have expressed interest in particular tests
>     get an automated notification and a chance to respond on list with a
>     reduced example.

That's certainly workable.

>  3. As a variant on the former, don't auto-update tests, but only inform
>     the *contributor* of an end-to-end test of a failure. Responsibility
>     for determining failure vs false positive lies solely with them, and
>     normal channels are used to report a failure after it has been
>     confirmed/analyzed/explained.

I think I like this best of the three but it raises the question of what
happens when the contributor is no longer contributing.  Who's
responsible for the test?  Maybe it just sits there until someone else
claims it.

> I really think this is a problem we need to have thought through and
> found a workable solution before end-to-end testing as proposed becomes
> a practically workable option.

Noted.  I'm very happy to have this discussion and work the problem.

                     -David

Re: [llvm-dev] RFC: End-to-end testing

Florian Hahn via cfe-dev
Hi David,

Thanks for kicking off a discussion on this topic!


> On Oct 9, 2019, at 22:31, David Greene via llvm-dev <[hidden email]> wrote:
>
> Mehdi AMINI via llvm-dev <[hidden email]> writes:
>
>>> I absolutely disagree about vectorization tests.  We have seen
>>> vectorization loss in clang even though related LLVM lit tests pass,
>>> because something else in the clang pipeline changed that caused the
>>> vectorizer to not do its job.
>>
>> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
>> years ago), but someone (I think Chandler?) was asking me at the time: does
>> it affect a benchmark performance? If so why isn't it tracked there? And if
>> not does it matter?
>> The benchmark was presented as the actual way to check this invariant
>> (because you're only vectorizing to get performance, not for the sake of it).
>> So I never pursued, even if I'm a bit puzzled that we don't have such tests.
>
> Thanks for explaining.
>
> Our experience is that relying solely on performance tests to uncover
> such issues is problematic for several reasons:
>
> - Performance varies from implementation to implementation.  It is
>  difficult to keep tests up-to-date for all possible targets and
>  subtargets.

Could you expand a bit more on what you mean here? Are you concerned about having to run the performance tests on different kinds of hardware? In what way do the existing benchmarks need to be kept up to date?

With tests checking ASM, wouldn’t we end up with lots of checks for various targets/subtargets that we need to keep up to date? Just considering AArch64 as an example, people might want to check the ASM for different architecture versions and different vector extensions, and different vendors might want to make sure that the ASM on their specific cores does not regress.

>
> - Partially as a result, but also for other reasons, performance tests
>  tend to be complicated, either in code size or in the numerous code
>  paths tested.  This makes such tests hard to debug when there is a
>  regression.

I am not sure they have to. Have you considered adding the small test functions/loops as micro-benchmarks using the existing google benchmark infrastructure in test-suite?

I think that might be able to address the points here relatively adequately. The separate micro-benchmarks would be relatively small and we should be able to track down regressions in a similar fashion to a stand-alone file we compile and whose ASM we then analyze. Plus, we can easily run them and verify the performance on actual hardware.
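
A rough sketch of such a micro-benchmark, using the google/benchmark API that the test-suite's MicroBenchmarks directory already uses (names and sizes here are purely illustrative):

    #include "benchmark/benchmark.h"
    #include <vector>

    // The loop we want to keep vectorized, isolated in its own benchmark.
    static void BM_Saxpy(benchmark::State &State) {
      std::vector<float> X(4096, 1.0f), Y(4096, 2.0f);
      for (auto _ : State) {
        for (size_t I = 0, E = X.size(); I != E; ++I)
          Y[I] = 3.0f * X[I] + Y[I];
        benchmark::DoNotOptimize(Y.data());
        benchmark::ClobberMemory();
      }
    }
    BENCHMARK(BM_Saxpy);

    BENCHMARK_MAIN();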
 
>
> - Performance tests don't focus on the why/how of vectorization.  They
>  just check, "did it run fast enough?"  Maybe the test ran fast enough
>  for some other reason but we still lost desired vectorization and
>  could have run even faster.
>

If you would add a new micro-benchmark, you could check that it produces the desired result when adding it. The runtime-tracking should cover cases where we lost optimizations. I guess if the benchmarks are too big, additional optimizations in one part could hide lost optimizations somewhere else. But I would assume this to be relatively unlikely, as long as the benchmarks are isolated.

Also, checking the assembly for vector code also does not guarantee that the vector code will actually be executed. So for example by just checking the assembly for certain vector instructions, we might miss that we regressed performance, because we messed up the runtime checks guarding the vector loop.

Cheers,
Florian


Re: [llvm-dev] RFC: End-to-end testing

Florian Hahn via cfe-dev


> On Oct 9, 2019, at 16:12, David Greene via llvm-dev <[hidden email]> wrote:
>
> Mehdi AMINI via cfe-dev <[hidden email]> writes:
>
>>> I have a bit of concern about this sort of thing - worrying it'll lead to
>>> people being less cautious about writing the more isolated tests.
>>>
>>
>> I have the same concern. I really believe we need to be careful about
>> testing at the right granularity to keep things both modular and the
>> testing maintainable (for instance checking vectorized ASM from a C++
>> source through clang has always been considered a bad FileCheck practice).
>> (Not saying that there is no space for better integration testing in some
>> areas).
>
> I absolutely disagree about vectorization tests.  We have seen
> vectorization loss in clang even though related LLVM lit tests pass,
> because something else in the clang pipeline changed that caused the
> vectorizer to not do its job.  We need both kinds of tests.  There are
> many asm tests of value beyond vectorization and they should include
> component as well as end-to-end tests.


Have you considered alternatives to checking the assembly for ensuring vectorization or other transformations? For example, instead of checking the assembly, we could check LLVM’s statistics or optimization remarks. If you want to ensure a loop got vectorized, you could check the loop-vectorize remarks, which should give you the position of the loop in the source and the vectorization/interleave factors used. There are a few other things that could go wrong later on that would prevent vector instruction selection, but I think it should be sufficient to guard against most cases where we lose vectorization and should be much more robust to unrelated changes. If there are additional properties you want to ensure, they potentially could be added to the remark as well.
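
For example, a remark-based check might look like the sketch below (-Rpass=loop-vectorize is an existing clang flag, but treat the exact remark wording in the CHECK line as an assumption):

    // RUN: %clang -O2 -Rpass=loop-vectorize -c -o /dev/null %s 2>&1 | FileCheck %s

    void saxpy(float * restrict y, const float * restrict x, float a, int n) {
      // CHECK: vectorized loop (vectorization width: {{[0-9]+}}, interleaved count: {{[0-9]+}})
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }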

This idea of leveraging statistics and optimization remarks to track the impact of changes on overall optimization results is nothing new, and I think several people have already discussed it in various forms. For regular benchmark runs, in addition to tracking the existing benchmarks, we could also track selected optimization remarks (e.g. loop-vectorize, but not necessarily noisy ones like gvn) and statistics. Comparing those run-to-run could potentially highlight new end-to-end issues on a much larger scale, across all existing benchmarks integrated in test-suite. We might be able to detect loss of vectorization pro-actively, instead of requiring someone to file a bug report and then adding an isolated test after the fact.

But building something like this would be much more work of course….

Cheers,
Florian

Re: [llvm-dev] RFC: End-to-end testing

Kristof Beyls via cfe-dev
On Thu, 10 Oct 2019 at 10:56, Florian Hahn <[hidden email]> wrote:
> Have you considered alternatives to checking the assembly for ensuring vectorization or other transformations? For example, instead of checking the assembly, we could check LLVM’s statistics or optimization remarks. If you want to ensure a loop got vectorized, you could check the loop-vectorize remarks, which should give you the position of the loop in the source and the vectorization/interleave factors used. There are a few other things that could go wrong later on that would prevent vector instruction selection, but I think it should be sufficient to guard against most cases where we lose vectorization, and it should be much more robust to unrelated changes. If there are additional properties you want to ensure, they could potentially be added to the remark as well.

We used to have lots of them, at least in the initial implementation
of the loop vectoriser (I know, many years ago).

The thread has enough points that I won't repeat them here, but I guess the main
one is to make sure we don't duplicate tests and increase CI cost
(especially on slower hardware).

I'd recommend trying to move any e2e tests into the test-suite and making
it easier to run, leaving only focused component tests in the repo (to
guarantee the independence of components).

The last thing we want is to create direct paths from front-ends to
back-ends and make LLVM IR transformation less flexible.

cheers,
--renato

Re: [llvm-dev] RFC: End-to-end testing

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev


> -----Original Message-----
> From: llvm-dev <[hidden email]> On Behalf Of David Greene
> via llvm-dev
> Sent: Wednesday, October 09, 2019 9:17 PM
> To: Mehdi AMINI <[hidden email]>
> Cc: [hidden email]; [hidden email]; openmp-
> [hidden email]; [hidden email]
> Subject: Re: [llvm-dev] [cfe-dev] RFC: End-to-end testing
>
> Mehdi AMINI via cfe-dev <[hidden email]> writes:
>
> > I don't think these particular tests are the most controversial though, and
> > it is really still fairly "focused" testing. I'm much more curious about
> > larger end-to-end scope: for instance since you mention debug info and
> > LLDB, what about a test that would verify that LLDB can print a particular
> > variable content from a test that would come as a source program for
> > instance. Such test are valuable in the absolute, it isn't clear to me that
> > we could in practice block any commit that would break such test though:
> > this is because a bug fix or an improvement in one of the pass may be
> > perfectly correct in isolation but make the test fail by exposing a bug
> > where we are already losing some debug info precision in a totally
> > unrelated part of the codebase.
> > I wonder how you see this managed in practice: would you gate any change on
> > InstCombine (or other mid-level pass) on not regressing any of the
> > debug-info quality test on any of the backend, and from any frontend (not
> > only clang)? Or worse: a middle-end change that would end-up with a
> > slightly different Dwarf construct on this particular test, which would
> > trip LLDB but not GDB (basically expose a bug in LLDB). Should we require
> > the contributor of inst-combine to debug LLDB and fix it first?
>
> Good questions!  I think for situations like this I would tend toward
> allowing the change and the test would alert us that something else is
> wrong.  At that point it's probably a case-by-case decision.  Maybe we
> XFAIL the test.  Maybe the fix is easy enough that we just do it and the
> test starts passing again.  What's the policy for breaking current tests
> when the change itself is fine but exposes a problem elsewhere (adding
> an assert, for example)?

For debug info in particular, we already have the debuginfo-tests project,
which is separate because it requires executing the test program; this is
something the clang/llvm test suites specifically do NOT require.  There
is of course also the LLDB test suite, which I believe can be configured
to use the just-built clang to compile its test programs.

Regarding breakage policy, it's just like anything else: do what's needed
to make the bots happy.  What exactly that means will depend on the exact
situation.  I can cite a small patch that was held off for a ridiculously
long time, like around a year, because Chromium had some environmental
problem that they were slow to address.  That wasn't even an LLVM bot!
But eventually it got sorted out and our patch went in.

My point here is, that kind of thing happens already, adding a new e2e
test project won't inherently change any policy or how the community
responds to breakage.
--paulr


Re: [llvm-dev] RFC: End-to-end testing

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev
David Greene, will you be at the LLVM Dev Meeting? If so, could you sign
up for a Round Table session on this topic?  Obviously lots to discuss
and concerns to be addressed.

In particular I think there are two broad categories of tests that would
have to be segregated just by the nature of their requirements:

(1) Executable tests. These obviously require an execution platform; for
feasibility reasons this means host==target and the guarantee of having
a linker (possibly but not necessarily LLD) and a runtime (possibly but
not necessarily including libcxx).  Note that the LLDB tests and the
debuginfo-tests project already have this kind of dependency, and in the
case of debuginfo-tests, this is exactly why it's a separate project.

(2) Non-executable tests.  These are near-identical in character to the
existing clang/llvm test suites and I'd expect lit to drive them.  The
only material difference from the majority(*) of existing clang tests is
that they are free to depend on LLVM features/passes.  The only difference
from the majority of existing LLVM tests is that they have [Obj]{C,C++} as
their input source language.
(*) I've encountered clang tests that I feel depend on too much within LLVM,
and it's common for new contributors to provide a C/C++ test that needs to
be converted to a .ll test.  Some of them go in anyway.

More comments/notes below.

> -----Original Message-----
> From: lldb-dev <[hidden email]> On Behalf Of David Greene
> via lldb-dev
> Sent: Wednesday, October 09, 2019 9:25 PM
> To: Philip Reames <[hidden email]>; [hidden email];
> [hidden email]; [hidden email]; [hidden email]
> Subject: Re: [lldb-dev] [cfe-dev] [llvm-dev] RFC: End-to-end testing
>
> Philip Reames via cfe-dev <[hidden email]> writes:
>
> > A challenge we already have - as in, I've broken these tests and had to
> > fix them - is that an end to end test which checks either IR or assembly
> > ends up being extraordinarily fragile.  Completely unrelated profitable
> > transforms create small differences which cause spurious test failures.
> > This is a very real issue today with the few end-to-end clang tests we
> > have, and I am extremely hesitant to expand those tests without giving
> > this workflow problem serious thought.  If we don't, this could bring
> > development on middle end transforms to a complete stop.  (Not kidding.)
>
> Do you have a pointer to these tests?  We literally have tens of
> thousands of end-to-end tests downstream and while some are fragile, the
> vast majority are not.  A test that, for example, checks the entire
> generated asm for a match is indeed very fragile.  A test that checks
> whether a specific instruction/mnemonic was emitted is generally not, at
> least in my experience.  End-to-end tests require some care in
> construction.  I don't think update_llc_test_checks.py-type operation is
> desirable.

Sony likewise has a rather large corpus of end-to-end tests.  I expect any
vendor would.  When they break, we fix them or report/fix the compiler bug.
It has not been an intolerable burden on us, and I daresay if it were at
all feasible to put these upstream, it would not be an intolerable burden
on the community.  (It's not feasible because host!=target and we'd need
to provide test kits to the community and our remote-execution tools. We'd
rather just run them internally.)

Philip, what I'm actually hearing from your statement is along the lines of,
"Our end-to-end tests are really fragile, therefore any end-to-end test
will be fragile, and that will be an intolerable burden."

That's an understandable reaction, but I think the community literally
would not tolerate too-fragile tests.  Tests that are too fragile will
be made more robust or removed.  This has been community practice for a
long time.  There's even an entire category of "noisy bots" that certain
people take care of and don't bother the rest of the community.  The
LLVM Project as a whole would not tolerate a test suite that "could
bring development ... to a complete stop" and I hope we can ease your
concerns.

More comments/notes/opinions below.

>
> Still, you raise a valid point and I think present some good options
> below.
>
> > A couple of approaches we could consider:
> >
> >  1. Simply restrict end to end tests to crash/assert cases.  (i.e. no
> >     property of the generated code is checked, other than that it is
> >     generated)  This isn't as restrictive as it sounds when combined
> >     w/coverage guided fuzzer corpuses.
>
> I would be pretty hesitant to do this but I'd like to hear more about
> how you see this working with coverage/fuzzing.

I think this is way too restrictive.

>
> >  2. Auto-update all diffs, but report them to a human user for
> >     inspection.  This ends up meaning that tests never "fail" per se,
> >     but that individuals who have expressed interest in particular tests
> >     get an automated notification and a chance to respond on list with a
> >     reduced example.
>
> That's certainly workable.

This is not different in principle from the "noisy bot" category, and if
it's a significant concern, the e2e tests can start out in that category.
Experience will tell us whether they are inherently fragile.  I would not
want to auto-update tests.

>
> >  3. As a variant on the former, don't auto-update tests, but only inform
> >     the *contributor* of an end-to-end test of a failure. Responsibility
> >     for determining failure vs false positive lies solely with them, and
> >     normal channels are used to report a failure after it has been
> >     confirmed/analyzed/explained.
>
> I think I like this best of the three but it raises the question of what
> happens when the contributor is no longer contributing.  Who's
> responsible for the test?  Maybe it just sits there until someone else
> claims it.

This is *exactly* the "noisy bot" tactic, and bots are supposed to have
owners who are active.

>
> > I really think this is a problem we need to have thought through and
> > found a workable solution before end-to-end testing as proposed becomes
> > a practically workable option.
>
> Noted.  I'm very happy to have this discussion and work the problem.
>
>                      -David

Re: [llvm-dev] RFC: End-to-end testing

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev
Florian Hahn via llvm-dev <[hidden email]> writes:

>> - Performance varies from implementation to implementation.  It is
>>  difficult to keep tests up-to-date for all possible targets and
>>  subtargets.
>
> Could you expand a bit more what you mean here? Are you concerned
> about having to run the performance tests on different kinds of
> hardware? In what way do the existing benchmarks require keeping
> up-to-date?

We have to support many different systems and those systems are always
changing (new processors, new BIOS, new OS, etc.).  Performance can vary
widely day to day from factors completely outside the compiler's
control.  As the performance changes you have to keep updating the tests
to expect the new performance numbers.  Relying on performance
measurements to ensure something like vectorization is happening just
isn't reliable in our experience.

> With tests checking ASM, wouldn’t we end up with lots of checks for
> various targets/subtargets that we need to keep up to date?

Yes, that's true.  But the only thing that changes the asm generated is
the compiler.

> Just considering AArch64 as an example, people might want to check the
> ASM for different architecture versions and different vector
> extensions and different vendors might want to make sure that the ASM
> on their specific cores does not regress.

Absolutely.  We do a lot of that sort of thing downstream.
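
To give a flavor, such a check is typically much closer to the following sketch than to a full asm diff (a simplified, made-up example rather than one of our actual tests):

// RUN: %clang -O2 --target=aarch64-linux-gnu -S -o - %s | FileCheck %s
//
// Only require that some 4 x float NEON add shows up in the function;
// the exact instruction sequence is deliberately not checked.
// CHECK-LABEL: vadd:
// CHECK: fadd v{{[0-9]+}}.4s
extern "C" void vadd(float *__restrict a, const float *__restrict b,
                     const float *__restrict c, int n) {
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}

Tests written at that granularity have held up well for us; it's the full-output comparisons that rot quickly.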

>> - Partially as a result, but also for other reasons, performance tests
>>  tend to be complicated, either in code size or in the numerous code
>>  paths tested.  This makes such tests hard to debug when there is a
>>  regression.
>
> I am not sure they have to. Have you considered adding the small test
> functions/loops as micro-benchmarks using the existing google
> benchmark infrastructure in test-suite?

We have tried nightly performance runs using LNT/test-suite and have
found it to be very unreliable, especially the microbenchmarks.

> I think that might be able to address the points here relatively
> adequately. The separate micro benchmarks would be relatively small
> and we should be able to track down regressions in a similar fashion
> as if it would be a stand-alone file we compile and then analyze the
> ASM. Plus, we can easily run it and verify the performance on actual
> hardware.

A few of my colleagues really struggled to get consistent results out of
LNT.  They asked for help and discussed with a few upstream folks, but
in the end were not able to get something reliable working.  I've talked
to a couple of other people off-list and they've had similar
experiences.  It would be great if we have a reliable performance suite.
Please tell us how to get it working!  :)

But even then, I still maintain there is a place for the kind of
end-to-end testing I describe.  Performance testing would complement it.
Neither is a replacement for the other.

>> - Performance tests don't focus on the why/how of vectorization.  They
>>  just check, "did it run fast enough?"  Maybe the test ran fast enough
>>  for some other reason but we still lost desired vectorization and
>>  could have run even faster.
>>
>
> If you would add a new micro-benchmark, you could check that it
> produces the desired result when adding it. The runtime-tracking
> should cover cases where we lost optimizations. I guess if the
> benchmarks are too big, additional optimizations in one part could
> hide lost optimizations somewhere else. But I would assume this to be
> relatively unlikely, as long as the benchmarks are isolated.

Even then, I have seen small performance tests vary widely due to system
issues (see above).  Again, there is a place for them but they are not
sufficient.

> Also, checking the assembly for vector code does also not guarantee
> that the vector code will be actually executed. So for example by just
> checking the assembly for certain vector instructions, we might miss
> that we regressed performance, because we messed up the runtime checks
> guarding the vector loop.

Oh absolutely.  Presumably such checks would be included in the test or
would be checked by a different test.  As always, tests have to be
constructed intelligently.  :)

                      -David

Re: [llvm-dev] RFC: End-to-end testing

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev
Florian Hahn via cfe-dev <[hidden email]> writes:

> Have you considered alternatives to checking the assembly for ensuring
> vectorization or other transformations? For example, instead of
> checking the assembly, we could check LLVM’s statistics or
> optimization remarks.

Yes, absolutely.  We have tests that do things like that.  I don't want
to focus on the asm bit; that's just one type of test.  The larger issue
is end-to-end tests that ensure the compiler and other tools are working
correctly, whether by checking messages, statistics, asm, or something
else.

> This idea of leveraging statistics and optimization remarks to track
> the impact of changes on overall optimization results is nothing new
> and I think several people already discussed it in various forms. For
> regular benchmark runs, in addition to tracking the existing
> benchmarks, we could also track selected optimization remarks
> (e.g. loop-vectorize, but not necessarily noisy ones like gvn) and
> statistics. Comparing those run-to-run could potentially highlight new
> end-to-end issues on a much larger scale, across all existing
> benchmarks integrated in test-suite. We might be able to detect loss
> in vectorization pro-actively, instead of requiring someone to file a
> bug report and then we add an isolated test after the fact.

That's an interesting idea!  I would love to get more use out of
test-suite.

                       -David