[RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info


Hubert Tong via cfe-dev
We would like to propose a new feature that disables optimizations on IR functions considered “cold” by PGO profiles. The primary goal of this work is to improve code optimization speed (which also improves compilation and LTO speed) without significantly impacting target code performance.

The mechanism is pretty simple: in the second phase (i.e. the optimization phase) of PGO, we add the `optnone` attribute to functions that are considered “cold”, that is, functions with low profiling counts. A similar approach can be applied to loops. The rationale behind this idea is equally simple: if a given IR function will not be executed frequently, we shouldn’t waste time optimizing it. Similar approaches can be found in modern JIT compilers for dynamic languages (e.g. JavaScript and Python) that adopt a multi-tier compilation model: only “hot” functions or execution traces are brought to higher-tier compilers for aggressive optimization.
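To make this concrete, the core of the transformation is roughly the following (a simplified sketch, not the actual patch; see D87337 for the real implementation and exact API usage):

    // Simplified sketch: mark functions whose PGO entry count is zero as
    // 'optnone'. The real patch hooks into the PGO use phase and also
    // supports a percentile-based threshold (described below).
    #include "llvm/IR/Function.h"
    #include "llvm/IR/Module.h"
    using namespace llvm;

    static void deoptimizeColdFunctions(Module &M) {
      for (Function &F : M) {
        if (F.isDeclaration())
          continue;
        auto Count = F.getEntryCount();          // entry count from the profile
        if (Count && Count->getCount() == 0) {   // never executed during profiling
          // 'optnone' is only valid together with 'noinline'.
          F.addFnAttr(Attribute::NoInline);
          F.addFnAttr(Attribute::OptimizeNone);
        }
      }
    }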

In addition to de-optimizing functions whose profiling counts are exactly zero (`-fprofile-deopt-cold`), we also provide a knob (`-fprofile-deopt-cold-percent=<X percent>`) to adjust the “cold threshold”: after sorting the profiling counts of all functions, this knob de-optimizes the functions whose counts sit in the lower X percent.
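The cutoff for the percentile mode can be computed along these lines (again just an illustrative sketch; the helper name is made up and not taken from the patches):

    // Illustrative only: the entry-count value below which the coldest
    // 'Percent' percent of profiled functions fall.
    #include "llvm/ADT/STLExtras.h"      // llvm::sort
    #include "llvm/ADT/SmallVector.h"
    #include "llvm/IR/Module.h"
    #include <algorithm>
    #include <cstdint>

    static uint64_t computeColdCutoff(llvm::Module &M, unsigned Percent) {
      llvm::SmallVector<uint64_t, 64> Counts;
      for (llvm::Function &F : M)
        if (auto C = F.getEntryCount())
          Counts.push_back(C->getCount());
      if (Counts.empty())
        return 0;
      llvm::sort(Counts);
      size_t Idx = Counts.size() * Percent / 100;
      return Counts[std::min(Idx, Counts.size() - 1)];
    }

Functions whose counts fall at or below the returned cutoff then get the same `optnone` / `noinline` treatment as in the zero-count case.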

We evaluated this feature on the LLVM Test Suite (the Bitcode, SingleSource, and MultiSource sub-folders were selected). Both compilation speed and target program performance were measured by the number of instructions reported by Linux perf. The table below shows the compilation speed improvement and target performance overhead relative to a baseline that only uses (instrumentation-based) PGO.

Experiment Name            Compile Speedup    Target Overhead
DeOpt Cold Zero Count      5.13%              0.02%
DeOpt Cold 25%             8.06%              0.12%
DeOpt Cold 50%             13.32%             2.38%
DeOpt Cold 75%             17.53%             7.07%

(The “DeOpt Cold Zero Count” experiment only disables optimizations on functions whose profiling counts are exactly zero. The rest of the experiments disable optimizations on functions whose profiling counts are in the lower X%.)

We also ran evaluations with FullLTO; here are the numbers:

Experiment Name            Link Time Speedup    Target Overhead
DeOpt Cold Zero Count      10.87%               1.29%
DeOpt Cold 25%             18.76%               1.50%
DeOpt Cold 50%             30.16%               3.94%
DeOpt Cold 75%             38.71%               8.97%

(The link time presented here includes the LTO and code generation time. We omitted the compile time numbers since they are not particularly interesting in an LTO setup.)

From the above experiments we observed that the compilation / link time improvement scaled roughly linearly with the percentage of cold functions we skipped. Even if we only skipped functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already got a 5~10% “free ride” on compilation / linking speed with barely any target performance penalty.

We believe the above numbers justify this patch as a useful way to improve build time with little overhead.

Here are the patches for review:
* Modifications on LLVM instrumentation-based PGO: https://reviews.llvm.org/D87337
* Modifications on Clang driver: https://reviews.llvm.org/D87338

Credit: This project was originally started by Paul Robinson <[hidden email]> and Edward Dawson <[hidden email]> from the Sony PlayStation compiler team. I picked it up when I was interning there this summer.

Thank you for reading.
-Min
--
Min-Yih Hsu
Ph.D Student in ICS Department, University of California, Irvine (UCI).


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
On Wed, 9 Sep 2020 at 01:21, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:
From the above experiments we observed that compilation / link time improvement scaled linearly with the percentage of cold functions we skipped. Even if we only skipped functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already had 5~10% of “free ride” on compilation / linking speed improvement and barely had any target performance penalty.

Hi Min (Paul, Edd),

This is great work! Small, clear patch, substantial impact, virtually no downsides.

Just looking at your test-suite numbers, not optimising functions "never used" during the profile run sounds like an obvious "default PGO behaviour" to me. The flag defining the percentage range is a good option for development builds.

I imagine you guys have run this on internal programs and found it beneficial, too, not just the LLVM test-suite (which is very small and non-representative). It would be nice if other groups that already use PGO could try it locally and spot any issues.

cheers,
--renato


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
This sounds very interesting and the compile time gains in the conservative range (say under 25%) seem quite promising.

One concern that comes to mind is whether performance could degrade severely in the situation where a function has a hot call site (where it gets inlined) and some non-zero number of cold call sites (where it does not get inlined). When we decorate the function with `optnone, noinline`, it will presumably no longer be inlined into the hot call site and will furthermore be unoptimized.
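For concreteness, the kind of shape I have in mind (purely a toy illustration, not code from the patch):

    // Toy example: 'helper' has one hot call site and one cold one.
    static int helper(int x) { return x * x + 1; }   // small, normally inlined

    int hot_path(const int *a, int n) {
      int s = 0;
      for (int i = 0; i < n; ++i)   // executed millions of times in the profile
        s += helper(a[i]);          // hot call site: we want this inlined
      return s;
    }

    int cold_error_path(int x) {
      return helper(x);             // cold call site: rarely executed
    }

    // If helper's own entry counter ends up low (for example because the hot
    // call was already inlined when the profile was collected), decorating
    // helper with optnone+noinline would also block inlining at the hot site.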
Have you considered such a case and if so, is it something that cannot happen (i.e. inlining has already happened, etc.) or something that we can mitigate in the future?

A more aesthetic comment I have is that personally, I would prefer a single option with a default percentage (say 0%) rather than having to specify two options.
Also, it might be useful to add an option to dump the names of functions that are decorated so the user can track an execution count of such functions when running their code. But of course, the debug messages may be adequate for this purpose.

Nemanja

On Wed, Sep 9, 2020 at 6:26 AM Tobias Hieta via llvm-dev <[hidden email]> wrote:
Hello,

We use PGO to optimize clang itself. I can see if I have time to give this patch some testing. Anything special to look out for except compile benchmark and time to build clang, do you expect any changes in code size?

On Wed, Sep 9, 2020, 10:03 Renato Golin via llvm-dev <[hidden email]> wrote:
On Wed, 9 Sep 2020 at 01:21, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:
From the above experiments we observed that compilation / link time improvement scaled linearly with the percentage of cold functions we skipped. Even if we only skipped functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already had 5~10% of “free ride” on compilation / linking speed improvement and barely had any target performance penalty.

Hi Min (Paul, Edd),

This is great work! Small, clear patch, substantial impact, virtually no downsides.

Just looking at your test-suite numbers, not optimising functions "never used" during the profile run sounds like an obvious "default PGO behaviour" to me. The flag defining the percentage range is a good option for development builds.

I imagine you guys have run this on internal programs and found beneficial, too, not just the LLVM test-suite (which is very small and non-representative). It would be nice if other groups that already use PGO could try that locally and spot any issues.

cheers,
--renato

Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
On Wed, 9 Sep 2020 at 14:27, Nemanja Ivanovic <[hidden email]> wrote:
A more aesthetic comment I have is that personally, I would prefer a single option with a default percentage (say 0%) rather than having to specify two options.

0% doesn't mean "don't do it", just means "only do that to functions I didn't see running at all", which could be misrepresented in the profiling run.

If we agree this should be *always* enabled, then only one option is needed. Otherwise, we'd need negative percentages to mean "don't do that" and that would be weird. :)

 
Also, it might be useful to add an option to dump the names of functions that are decorated so the user can track an execution count of such functions when running their code. But of course, the debug messages may be adequate for this purpose.

Remark options should be enough for that.

--renato 


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
Hi Renato,

On Wed, Sep 9, 2020 at 1:03 AM Renato Golin <[hidden email]> wrote:
On Wed, 9 Sep 2020 at 01:21, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:
From the above experiments we observed that compilation / link time improvement scaled linearly with the percentage of cold functions we skipped. Even if we only skipped functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already had 5~10% of “free ride” on compilation / linking speed improvement and barely had any target performance penalty.

Hi Min (Paul, Edd),

This is great work! Small, clear patch, substantial impact, virtually no downsides.

Thank you :-) 

Just looking at your test-suite numbers, not optimising functions "never used" during the profile run sounds like an obvious "default PGO behaviour" to me. The flag defining the percentage range is a good option for development builds. 

I imagine you guys have run this on internal programs and found beneficial, too, not just the LLVM test-suite (which is very small and non-representative). It would be nice if other groups that already use PGO could try that locally and spot any issues.

Good point! We are aware that the LLVM Test Suite is too "SPEC-alike" and leans toward scientific computation rather than real-world use cases. So we also ran experiments on the V8 JavaScript engine, which is a huge code base and a good real-world example. It showed a 10~13% speed improvement in optimization + codegen time with up to 4% target performance overhead. (Note that, due to some hacky reasons, for many of the V8 source files over 80% or even 95% of compilation time is spent in the frontend, so measuring total compilation time would be heavily skewed and would not reflect the impact of this feature.)

Best
-Min
 

cheers,
--renato


--
Min-Yih Hsu
Ph.D Student in ICS Department, University of California, Irvine (UCI).


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
Hi Tobias and Dominique,

I didn't evaluate the impact on code size in the first place since it was not my primary goal, but thanks to the design of the LLVM Test Suite benchmarking infrastructure, I can pull those numbers out right away.

(Non-LTO)
Experiment Name            Code Size Increase
DeOpt Cold Zero Count      5.2%
DeOpt Cold 25%             6.8%
DeOpt Cold 50%             7.0%
DeOpt Cold 75%             7.0%

(FullLTO)
Experiment Name            Code Size Increase
DeOpt Cold Zero Count      4.8%
DeOpt Cold 25%             6.4%
DeOpt Cold 50%             6.2%
DeOpt Cold 75%             5.3%

For non-LTO, the increase caps out at around 7%. For FullLTO things got a little more interesting: code size actually decreased as we increased the cold threshold, but overall I'd say it's around 6%. Diving a little deeper, the majority of the increased code size (not surprisingly) came from the .text section; the PLT section contributed a little bit, and the rest of the sections barely changed.

Though the overhead on code size is higher than the target performance overhead, I think it's still acceptable in normal cases. In addition, David mentioned in D87337 that LLVM has used similar techniques for code size (I'm not sure what he was referencing; my guess is something related to hot-cold code splitting). So I think the feature we're proposing here can complement that one.

Finally: Tobias, thanks for evaluating the impact on Clang; I'm really interested to see the results.

Best,
Min

On Wed, Sep 9, 2020 at 3:26 AM Tobias Hieta <[hidden email]> wrote:
Hello,

We use PGO to optimize clang itself. I can see if I have time to give this patch some testing. Anything special to look out for except compile benchmark and time to build clang, do you expect any changes in code size?

On Wed, Sep 9, 2020, 10:03 Renato Golin via llvm-dev <[hidden email]> wrote:
On Wed, 9 Sep 2020 at 01:21, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:
From the above experiments we observed that compilation / link time improvement scaled linearly with the percentage of cold functions we skipped. Even if we only skipped functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already had 5~10% of “free ride” on compilation / linking speed improvement and barely had any target performance penalty.

Hi Min (Paul, Edd),

This is great work! Small, clear patch, substantial impact, virtually no downsides.

Just looking at your test-suite numbers, not optimising functions "never used" during the profile run sounds like an obvious "default PGO behaviour" to me. The flag defining the percentage range is a good option for development builds.

I imagine you guys have run this on internal programs and found beneficial, too, not just the LLVM test-suite (which is very small and non-representative). It would be nice if other groups that already use PGO could try that locally and spot any issues.

cheers,
--renato


--
Min-Yih Hsu
Ph.D Student in ICS Department, University of California, Irvine (UCI).


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
On Wed, 9 Sep 2020 at 18:15, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:
David mentioned in D87337 that LLVM has used similar techniques on code size (not sure what he was referencing, my guess will be something related to hot-cold code splitting).

IIUC, it's just using optsize instead of optnone. The idea is that, if the code really doesn't run often/at all, then the performance impact of reducing the size is negligible, but the size impact is considerable.

I'd wager that optsize could even be faster than optnone, as it would delete a lot of useless code... but not noticeable, as it wouldn't run much.

This is an idea that we (Verona Language) are interested in, too.


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev


On Wed, Sep 9, 2020 at 10:11 AM Renato Golin <[hidden email]> wrote:
On Wed, 9 Sep 2020 at 14:27, Nemanja Ivanovic <[hidden email]> wrote:
A more aesthetic comment I have is that personally, I would prefer a single option with a default percentage (say 0%) rather than having to specify two options.

0% doesn't mean "don't do it", just means "only do that to functions I didn't see running at all", which could be misrepresented in the profiling run.

If we agree this should be *always* enabled, then only one option is needed. Otherwise, we'd need negative percentages to mean "don't do that" and that would be weird. :)

I am not sure I follow. My suggestion was to have one option that would give you a default of 0% (i.e. only add the attribute on functions that were never called). So the semantics would be fairly straightforward (illustrative sketch after the list):
- Default (i.e. no -profile-deopt-cold): do nothing
- Option with no arg (i.e. -profile-deopt-cold): add attribute only to functions that have an execution count of zero
- Option with an arg (i.e. -profile-deopt-cold=<N>): add attribute to functions that account for <N>% of total execution counts
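To illustrate the above on the LLVM side (just a sketch; the option name and defaults here are made up, not taken from D87337/D87338):

    // Illustrative sketch only. -1 (the default) disables the feature
    // entirely, 0 restricts it to functions with a zero execution count,
    // and N > 0 covers the coldest N percent of functions.
    #include "llvm/Support/CommandLine.h"

    static llvm::cl::opt<int> ProfileDeoptCold(
        "profile-deopt-cold", llvm::cl::init(-1),
        llvm::cl::desc("De-optimize functions in the coldest <N> percent of "
                       "the profile (0 = only never-executed functions)"));

    static bool deoptColdEnabled() { return ProfileDeoptCold >= 0; }

The bare no-value form could presumably be handled with something like cl::ValueOptional or in the driver, but the point is that a single knob covers all three cases.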

 
Also, it might be useful to add an option to dump the names of functions that are decorated so the user can track an execution count of such functions when running their code. But of course, the debug messages may be adequate for this purpose.

Remark options should be enough for that.

--renato 


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
On Wed, 9 Sep 2020 at 19:26, Nemanja Ivanovic <[hidden email]> wrote:
- Default (i.e. no -profile-deopt-cold): do nothing
- Option with no arg (i.e. -profile-deopt-cold): add attribute only to functions that have an execution count of zero
- Option with an arg (i.e. -profile-deopt-cold=<N>): add attribute to functions that account for <N>% of total execution counts

I see. This looks confusing to me, but perhaps it's just me. 

Though, I'm not sure we can get this behaviour from the same flag, as you need to provide a default value if the flag isn't passed (usually boolean or integer, not both).


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
On Wed, Sep 9, 2020 at 3:10 PM Renato Golin via cfe-dev <[hidden email]> wrote:
On Wed, 9 Sep 2020 at 19:26, Nemanja Ivanovic <[hidden email]> wrote:
- Default (i.e. no -profile-deopt-cold): do nothing
- Option with no arg (i.e. -profile-deopt-cold): add attribute only to functions that have an execution count of zero
- Option with an arg (i.e. -profile-deopt-cold=<N>): add attribute to functions that account for <N>% of total execution counts

I see. This looks confusing to me, but perhaps it's just me. 

It's not just you. :)  Assuming "account for <N>% of total execution counts" means "account for <N>% or less of total execution counts," then it seems like the proposed -profile-deopt-cold does the same thing as -profile-deopt-cold=0.

Also, for build-system-friendliness, IMHO every positive option should have a negative option — i.e., the default behavior should be regainable via an option such as -profile-no-deopt-cold.  (Or -fno-profile-deopt-cold, if there was a missing `f` in all of the above.)  That seems easier to do if the whole thing is controlled by just one option instead of two.

my $.02,
–Arthur


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
Hi All,

On Wed, Sep 9, 2020 at 12:23 PM Arthur O'Dwyer via cfe-dev <[hidden email]> wrote:
On Wed, Sep 9, 2020 at 3:10 PM Renato Golin via cfe-dev <[hidden email]> wrote:
On Wed, 9 Sep 2020 at 19:26, Nemanja Ivanovic <[hidden email]> wrote:
- Default (i.e. no -profile-deopt-cold): do nothing
- Option with no arg (i.e. -profile-deopt-cold): add attribute only to functions that have an execution count of zero
- Option with an arg (i.e. -profile-deopt-cold=<N>): add attribute to functions that account for <N>% of total execution counts

I see. This looks confusing to me, but perhaps it's just me. 

It's not just you. :)  Assuming "account for <N>% of total execution counts" means "account for <N>% or less of total execution counts," then it seems like the proposed -profile-deopt-cold does the same thing as -profile-deopt-cold=0

Also, for build-system-friendliness, IMHO every positive option should have a negative option — i.e., the default behavior should be regainable via an option such as -profile-no-deopt-cold.  (Or -fno-profile-deopt-cold, if there was a missing `f` in all of the above.)  That seems easier to do if the whole thing is controlled by just one option instead of two.

Actually, there has always been a `-fno-profile-deopt-cold` driver flag in my second Phabricator review (D87338).
But to sum up, I think it's a good idea to have only one driver flag, or even one LLVM CLI option.
 

my $.02,
–Arthur


--
Min-Yih Hsu
Ph.D Student in ICS Department, University of California, Irvine (UCI).


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev

FYI, David is referring to PGSO (profile-guided size optimization), which exists directly under that name; see https://reviews.llvm.org/D67120. And yeah, using PGSO selects optsize while this change selects optnone.

 

On 9/9/20, 10:58 AM, "llvm-dev on behalf of Tobias Hieta via llvm-dev" <[hidden email] on behalf of [hidden email]> wrote:

 

Would it make sense to have a flag to select optnone or optsize? We would probably also do the tradeoff for a smaller binary. 

 

On Wed, Sep 9, 2020, 19:28 Renato Golin <[hidden email]> wrote:

On Wed, 9 Sep 2020 at 18:15, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:

David mentioned in D87337 that LLVM has used similar techniques on code size (not sure what he was referencing, my guess will be something related to hot-cold code splitting).

 

IIUC, it's just using optsize instead of optnone. The idea is that, if the code really doesn't run often/at all, then the performance impact of reducing the size is negligible, but the size impact is considerable.

 

I'd wager that optsize could even be faster than optnone, as it would delete a lot of useless code... but not noticeable, as it wouldn't run much.

 

This is an idea that we (Verona Language) are interested in, too.



Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev

The 1.29% is pretty considerable for functions that should never be hit according to the profile information. This can indicate that there might be something amiss with the profile quality and that certain hot functions are not getting caught. Alternatively, given the ~5% code size increase you mention in the other thread, the cold code may not be getting moved out to a cold page, so i-cache pollution ends up being a factor. I think it would be worthwhile to dig deeper into why there's any performance degradation at all on functions that should never be called.

 

Also, if you're curious how to build clang itself with PGO, the documentation is here: https://llvm.org/docs/HowToBuildWithPGO.html

 

On 9/8/20, 5:21 PM, "llvm-dev on behalf of Min-Yih Hsu via llvm-dev" <[hidden email] on behalf of [hidden email]> wrote:

We also did evaluations on FullLTO, here are the numbers:

Experiment Name            Link Time Speedup    Target Overhead
DeOpt Cold Zero Count      10.87%               1.29%
DeOpt Cold 25%             18.76%               1.50%
DeOpt Cold 50%             30.16%               3.94%
DeOpt Cold 75%             38.71%               8.97%

 



Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev

I think calling PGSO a size opt is probably a bit misleading, though. It's more of an adaptive opt strategy, and it can improve performance too due to better locality. We have something similar internally for selecting the opt level based on profile hotness under AutoFDO.

 

Perhaps similar implementations can all be unified under a profile-guided "adaptive optimization" framework to avoid duplication (rough sketch after the list):

  • A unified way of setting hot/cold cutoff percentile (e.g. through PSI that’s already used by all PGO/FDO).
  • A unified way of selecting opt strategy for cold functions: default, none, size, minsize.
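Roughly the shape I have in mind (illustrative only, not existing LLVM code):

    // Illustrative sketch of a unified "cold function" opt strategy.
    #include "llvm/IR/Function.h"

    enum class ColdFuncOptStrategy { Default, None, Size, MinSize };

    static void applyColdStrategy(llvm::Function &F, ColdFuncOptStrategy S) {
      using llvm::Attribute;
      switch (S) {
      case ColdFuncOptStrategy::Default:
        break;                                    // leave the function alone
      case ColdFuncOptStrategy::None:
        F.addFnAttr(Attribute::NoInline);         // optnone requires noinline
        F.addFnAttr(Attribute::OptimizeNone);
        break;
      case ColdFuncOptStrategy::Size:
        F.addFnAttr(Attribute::OptimizeForSize);
        break;
      case ColdFuncOptStrategy::MinSize:
        F.addFnAttr(Attribute::MinSize);
        break;
      }
    }

The cutoff query itself (which functions count as cold) would come from PSI, as in the first bullet.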

 

Thanks,

Wenlei

 

From: llvm-dev <[hidden email]> on behalf of Modi Mo via llvm-dev <[hidden email]>
Reply-To: Modi Mo <[hidden email]>
Date: Wednesday, September 9, 2020 at 5:55 PM
To: Tobias Hieta <[hidden email]>, Renato Golin <[hidden email]>
Cc: "[hidden email]" <[hidden email]>, llvm-dev <[hidden email]>, "cfe-dev ([hidden email])" <[hidden email]>
Subject: Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

 

FYI David is referring to PGSO (profile-guided size optimization) as it exists directly under that name, see: https://reviews.llvm.org/D67120. And yeah using PGSO is selecting optsize while this change is selecting optnone.

 

On 9/9/20, 10:58 AM, "llvm-dev on behalf of Tobias Hieta via llvm-dev" <[hidden email] on behalf of [hidden email]> wrote:

 

Would it make sense to have a flag to select optnone or optsize? We would probably also do the tradeoff for a smaller binary. 

 

On Wed, Sep 9, 2020, 19:28 Renato Golin <[hidden email]> wrote:

On Wed, 9 Sep 2020 at 18:15, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:

David mentioned in D87337 that LLVM has used similar techniques on code size (not sure what he was referencing, my guess will be something related to hot-cold code splitting).

 

IIUC, it's just using optsize instead of optnone. The idea is that, if the code really doesn't run often/at all, then the performance impact of reducing the size is negligible, but the size impact is considerable.

 

I'd wager that optsize could even be faster than optnone, as it would delete a lot of useless code... but not noticeable, as it wouldn't run much.

 

This is an idea that we (Verona Language) are interested in, too.



Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev

1%+ overhead is indeed interesting. If you use lld as the linker (together with the new pass manager), you should be able to get good profile-guided, function-level layout so that dead functions are moved out of the hot pages.

 

This may also be related to a subtle pass-ordering issue. Pre-inline counts may not be super accurate, but we can't use post-inline counts either, given that the CGSCC inliner runs halfway through the opt pipeline. Looking at the patch, it seems the decision is made at PGO annotation time, which is between the pre-instrumentation inliner and the CGSCC inliner.

 

From: llvm-dev <[hidden email]> on behalf of Modi Mo via llvm-dev <[hidden email]>
Reply-To: Modi Mo <[hidden email]>
Date: Wednesday, September 9, 2020 at 6:18 PM
To: Min-Yih Hsu <[hidden email]>, llvm-dev <[hidden email]>, "cfe-dev ([hidden email])" <[hidden email]>, Hongtao Yu <[hidden email]>
Subject: Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

 

The 1.29% is pretty considerable on functions that should never be hit according to profile information. This can indicate that there might be something amiss with the profile quality and that certain hot functions are not getting caught. Alternatively, given the ~5% code size increase you mention in the other thread the cold code may not be being moved out to a cold page so i-cache pollution ends up being a factor. I think it would be worthwhile to dig deeper into why there’s any performance degradation on functions that should never be called.

 

Also if you’re curious on how to build clang itself with PGO the documentation is here: https://llvm.org/docs/HowToBuildWithPGO.html

 

On 9/8/20, 5:21 PM, "llvm-dev on behalf of Min-Yih Hsu via llvm-dev" <[hidden email] on behalf of [hidden email]> wrote:

We also did evaluations on FullLTO, here are the numbers:

Experiment Name            Link Time Speedup    Target Overhead
DeOpt Cold Zero Count      10.87%               1.29%
DeOpt Cold 25%             18.76%               1.50%
DeOpt Cold 50%             30.16%               3.94%
DeOpt Cold 75%             38.71%               8.97%


 



Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev


On Wed, Sep 9, 2020 at 9:23 PM Wenlei He via llvm-dev <[hidden email]> wrote:

I think calling PGSO size opt is probably a bit misleading though. It’s more of an adaptive opt strategy, and it can improve performance too due to better locality. We have something similar internally for selecting opt level based on profile hotness too under AutoFDO.

 

Perhaps similar implementations can all be unified under a profile guided “adaptive optimization” framework to avoid duplication:

  • A unified way of setting hot/cold cutoff percentile (e.g. through PSI that’s already used by all PGO/FDO).
  • A unified way of selecting opt strategy for cold functions: default, none, size, minsize.

 

Thanks,

Wenlei

 

From: llvm-dev <[hidden email]> on behalf of Modi Mo via llvm-dev <[hidden email]>
Reply-To: Modi Mo <[hidden email]>
Date: Wednesday, September 9, 2020 at 5:55 PM
To: Tobias Hieta <[hidden email]>, Renato Golin <[hidden email]>
Cc: "[hidden email]" <[hidden email]>, llvm-dev <[hidden email]>, "cfe-dev ([hidden email])" <[hidden email]>
Subject: Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

 

FYI David is referring to PGSO (profile-guided size optimization) as it exists directly under that name, see: https://reviews.llvm.org/D67120. And yeah using PGSO is selecting optsize while this change is selecting optnone.


PGSO looks at the block-level profile, too.

 

On 9/9/20, 10:58 AM, "llvm-dev on behalf of Tobias Hieta via llvm-dev" <[hidden email] on behalf of [hidden email]> wrote:

 

Would it make sense to have a flag to select optnone or optsize? We would probably also do the tradeoff for a smaller binary. 

 

On Wed, Sep 9, 2020, 19:28 Renato Golin <[hidden email]> wrote:

On Wed, 9 Sep 2020 at 18:15, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:

David mentioned in D87337 that LLVM has used similar techniques on code size (not sure what he was referencing, my guess will be something related to hot-cold code splitting).

 

IIUC, it's just using optsize instead of optnone. The idea is that, if the code really doesn't run often/at all, then the performance impact of reducing the size is negligible, but the size impact is considerable.

 

I'd wager that optsize could even be faster than optnone, as it would delete a lot of useless code... but not noticeable, as it wouldn't run much.

 

This is an idea that we (Verona Language) are interested in, too.


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
Hi,

Thanks for all the feedback related to code size.

From my understanding, the current PGSO (D59514, D67120) has two parts (rough usage sketch after the list):
1. A new llvm::shouldOptimizeForSize framework that leverages BFI and PSI to provide block-level and function-level assessments of whether we should optimize for size.
2. Passes (mostly MachinePasses) change certain behaviors (e.g. whether to add padding or not) if llvm::shouldOptimizeForSize returns true OR there is an `optsize` or `minsize` attribute.
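As a rough example of how a pass consults it (sketch from memory; exact signatures may differ between LLVM versions):

    // Sketch: prefer size optimizations if the function is attributed
    // optsize/minsize, or if PGSO says the function is cold enough.
    #include "llvm/Analysis/BlockFrequencyInfo.h"
    #include "llvm/Analysis/ProfileSummaryInfo.h"
    #include "llvm/IR/Function.h"
    #include "llvm/Transforms/Utils/SizeOpts.h"

    static bool preferSize(const llvm::Function &F,
                           llvm::ProfileSummaryInfo *PSI,
                           llvm::BlockFrequencyInfo *BFI) {
      return F.hasOptSize() || llvm::shouldOptimizeForSize(&F, PSI, BFI);
    }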

I totally agree with Wenlei that (somewhere in the future) we should have a unified FDO framework for both code size and compilation time. And I think Renato and Tobias's suggestion to do the same thing for the size-oriented attributes (i.e. `minsize` and `optsize`) is low-hanging fruit we can support in a short time.
Engineering-wise, I'd prefer to send out a separate review for the size-oriented attributes work, since `minsize` / `optsize` are somewhat in conflict with `optnone`, so I don't think it's a good idea to put them into one flag / feature set.

Best,
Min

On Thu, Sep 10, 2020 at 9:18 AM Hiroshi Yamauchi via llvm-dev <[hidden email]> wrote:


On Wed, Sep 9, 2020 at 9:23 PM Wenlei He via llvm-dev <[hidden email]> wrote:

I think calling PGSO size opt is probably a bit misleading though. It’s more of an adaptive opt strategy, and it can improve performance too due to better locality. We have something similar internally for selecting opt level based on profile hotness too under AutoFDO.

 

Perhaps similar implementations can all be unified under a profile guided “adaptive optimization” framework to avoid duplication:

  • A unified way of setting hot/cold cutoff percentile (e.g. through PSI that’s already used by all PGO/FDO).
  • A unified way of selecting opt strategy for cold functions: default, none, size, minsize.

 

Thanks,

Wenlei

 

From: llvm-dev <[hidden email]> on behalf of Modi Mo via llvm-dev <[hidden email]>
Reply-To: Modi Mo <[hidden email]>
Date: Wednesday, September 9, 2020 at 5:55 PM
To: Tobias Hieta <[hidden email]>, Renato Golin <[hidden email]>
Cc: "[hidden email]" <[hidden email]>, llvm-dev <[hidden email]>, "cfe-dev ([hidden email])" <[hidden email]>
Subject: Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

 

FYI David is referring to PGSO (profile-guided size optimization) as it exists directly under that name, see: https://reviews.llvm.org/D67120. And yeah using PGSO is selecting optsize while this change is selecting optnone.


PGSO looks at the block-level profile, too.

 

On 9/9/20, 10:58 AM, "llvm-dev on behalf of Tobias Hieta via llvm-dev" <[hidden email] on behalf of [hidden email]> wrote:

 

Would it make sense to have a flag to select optnone or optsize? We would probably also do the tradeoff for a smaller binary. 

 

On Wed, Sep 9, 2020, 19:28 Renato Golin <[hidden email]> wrote:

On Wed, 9 Sep 2020 at 18:15, Min-Yih Hsu via llvm-dev <[hidden email]> wrote:

David mentioned in D87337 that LLVM has used similar techniques on code size (not sure what he was referencing, my guess will be something related to hot-cold code splitting).

 

IIUC, it's just using optsize instead of optnone. The idea is that, if the code really doesn't run often/at all, then the performance impact of reducing the size is negligible, but the size impact is considerable.

 

I'd wager that optsize could even be faster than optnone, as it would delete a lot of useless code... but not noticeable, as it wouldn't run much.

 

This is an idea that we (Verona Language) are interested in, too.



--
Min-Yih Hsu
Ph.D Student in ICS Department, University of California, Irvine (UCI).


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
On Thu, 10 Sep 2020 at 18:22, Min-Yih Hsu <[hidden email]> wrote:
I totally agree with Wenlei that (somewhere in the future) we should have a unified FDO framework for both code size and compilation time. And I think Renato and Tobias's suggestions to do the same thing for size-oriented attributes (i.e. `minsize` and `optsize`) is the low-hanging fruit we can support in a short time.
Engineering-wised I'll prefer to send out a separate review for the size-oriented attributes work, since `minsize` / `optsize` are kind of in conflict with `optnone` so I don't think it's a good idea to put them into one flag / feature set.

I'm happy for this unification to happen at a later stage. Just not too much later.

I worry that exposing the flags will get people to use them, and then we'll change them. The longer we leave it, the more people will be hit by the subtle change.

Worse still if we release with one behaviour now and then with a different behaviour in the next release. 

Having conditional paths in build systems for different versions of the compiler isn't fun.

--renato


Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

Hubert Tong via cfe-dev
Hi Renato,

On Thu, Sep 10, 2020 at 10:54 AM Renato Golin <[hidden email]> wrote:
On Thu, 10 Sep 2020 at 18:22, Min-Yih Hsu <[hidden email]> wrote:
I totally agree with Wenlei that (somewhere in the future) we should have a unified FDO framework for both code size and compilation time. And I think Renato and Tobias's suggestions to do the same thing for size-oriented attributes (i.e. `minsize` and `optsize`) is the low-hanging fruit we can support in a short time.
Engineering-wised I'll prefer to send out a separate review for the size-oriented attributes work, since `minsize` / `optsize` are kind of in conflict with `optnone` so I don't think it's a good idea to put them into one flag / feature set.

I'm happy for this unification to happen at a later stage. Just not too long later.

I worry exposing the flags will get people to use it and then we'll change it. The longer we leave it, the more people will be hit by the subtle change.

Worse still if we release with one behaviour now and then with a different behaviour in the next release. 

Yeah, totally understandable. I will try not to let that happen.
I'm willing to implement support for optsize / minsize after this patch, as well as provide some experiment numbers to justify it.

Best,
Min

Having conditional paths in build systems for different versions of the compiler isn't fun.

--renato


--
Min-Yih Hsu
Ph.D Student in ICS Department, University of California, Irvine (UCI).
