Varying per function optimisation based on include path?


Varying per function optimisation based on include path?

Jon Chesterfield via cfe-dev
Hello Clang,

There's a proposal in SG15 (http://wg21.link/p1832) with a suggestion to more aggressively inline functions included via -isystem when building for -Og. The goal is to improve runtime performance for debug builds that make heavy use of the STL.

Does clang already set various attributes based on whether a function was found via -I or -isystem, and if not, does that seem a reasonable extension?

Thanks,

Jon

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Varying per function optimisation based on include path?

David Blaikie via cfe-dev


On Mon, Aug 19, 2019 at 8:34 AM Jon Chesterfield via cfe-dev <[hidden email]> wrote:
Hello Clang,

There's a proposal in SG15 (http://wg21.link/p1832) with a suggestion to more aggressively inline functions included via isystem when building for -Og.

To me that sounds pretty outside the scope of the C++ standards committee, but I don't know too much about how they deal with things on the fringe like this. It would seem simpler/more direct to provide patches/discuss development in existing compilers to demonstrate the value and leave it up to "market forces" to handle this sort of quality-of-implementation thing.
 
The goal is to improve runtime performance for debug builds that make heavy use of the STL.

Does clang already set various attributes based on whether a function was found via -I or -isystem, and if not, does that seem a reasonable extension?

Nope - pretty sure it doesn't & probably tries fairly hard not to vary code generation (or even diagnostics, except for a very big slice around system headers specifically to avoid warnings users can't fix) depending on where the code was written.

There is work in Clang/LLVM to try to make -Og/-O1 (currently synonymous and hope to keep them that way as long as possible) be a good/better fast/debuggable tradeoff. But mostly that centers around avoiding destructive optimizations & keeping as much debug-ability-related state as possible.

- Dave
 

Thanks,

Jon

Re: Varying per function optimisation based on include path?

Arthur O'Dwyer via cfe-dev
On Mon, Aug 19, 2019 at 1:25 PM David Blaikie via cfe-dev <[hidden email]> wrote:
On Mon, Aug 19, 2019 at 8:34 AM Jon Chesterfield via cfe-dev <[hidden email]> wrote:

There's a proposal in SG15 (http://wg21.link/p1832) with a suggestion to more aggressively inline functions included via isystem when building for -Og.

To me that sounds pretty outside the scope of the C++ standards committee, but I don't know too much about how they deal with things on the fringe like this. It would seem simpler/more direct to provide patches/discuss development in existing compilers to demonstrate the value and leave it up to "market forces" to handle this sort of quality-of-implementation thing.

I think "discuss development in existing compilers" was the purpose of Jon's opening this thread, yeah?
Does P1832's approach seem like a reasonable extension in the context of Clang? 

The goal is to improve runtime performance for debug builds that make heavy use of the STL.

Does clang already set various attributes based on whether a function was found via -I or -isystem, and if not, does that seem a reasonable extension?

Nope - pretty sure it doesn't & probably tries fairly hard not to vary code generation (or even diagnostics, except for a very big slice around system headers specifically to avoid warnings users can't fix) depending on where the code was written.

In -Og mode, it seems that it would equally make sense to take "a very big slice around system headers specifically to avoid" debug symbols for code that users can't debug.

There is work in Clang/LLVM to try to make -Og/-O1 (currently synonymous and hope to keep them that way as long as possible) be a good/better fast/debuggable tradeoff. But mostly that centers around avoiding destructive optimizations & keeping as much debug-ability-related state as possible.

Are "debug symbols / lack of inlining" for any piece of code always "debuggability-related"?
P1832 takes the position that for code in system headers, performing inlining and other optimizations doesn't have much impact on "debuggability" because that system code is never going to be debugged by the user anyway.

Naïvely, it seems to me like a reasonable extension, in the context of Clang.

my $.02,
–Arthur


Re: Varying per function optimisation based on include path?

David Blaikie via cfe-dev


On Mon, Aug 19, 2019 at 10:36 AM Arthur O'Dwyer <[hidden email]> wrote:
On Mon, Aug 19, 2019 at 1:25 PM David Blaikie via cfe-dev <[hidden email]> wrote:
On Mon, Aug 19, 2019 at 8:34 AM Jon Chesterfield via cfe-dev <[hidden email]> wrote:

There's a proposal in SG15 (http://wg21.link/p1832) with a suggestion to more aggressively inline functions included via isystem when building for -Og.

To me that sounds pretty outside the scope of the C++ standards committee, but I don't know too much about how they deal with things on the fringe like this. It would seem simpler/more direct to provide patches/discuss development in existing compilers to demonstrate the value and leave it up to "market forces" to handle this sort of quality-of-implementation thing.

I think "discuss development in existing compilers" was the purpose of Jon's opening this thread, yeah?

The discussion here seems fine - I was/am a bit confused by this involving a paper to the C++ committee.
 
Does P1832's approach seem like a reasonable extension in the context of Clang? 

Wording pedantry, etc - but I wouldn't describe this as an extension because it's not part of the observable behavior as far as C++ is concerned. (it doesn't make code that would be invalid (or unspecified, IB, or UB) valid with some specific semantics)

I'd probably describe it as a clang feature request - one I personally would be pretty hesitant about, but data could help motivate the decision - performance data combined with some attempt to quantify debuggability.
 
The goal is to improve runtime performance for debug builds that make heavy use of the STL.

Does clang already set various attributes based on whether a function was found via -I or -isystem, and if not, does that seem a reasonable extension?

Nope - pretty sure it doesn't & probably tries fairly hard not to vary code generation (or even diagnostics, except for a very big slice around system headers specifically to avoid warnings users can't fix) depending on where the code was written.

In -Og mode, it seems that it would equally make sense to take "a very big slice around system headers specifically to avoid" debug symbols for code that users can't debug.

That seems different to me - users can debug into templates and it can be useful - if they've corrupted the state of a container (yeah, other tools might be better there, like sanitizers) or the library is doing the right thing but it's surprising because the user didn't realize what they were asking for.
 

There is work in Clang/LLVM to try to make -Og/-O1 (currently synonymous and hope to keep them that way as long as possible) be a good/better fast/debuggable tradeoff. But mostly that centers around avoiding destructive optimizations & keeping as much debug-ability-related state as possible.

Are "debug symbols / lack of inlining" for any piece of code always "debuggability-related"?
P1832 takes the position that for code in system headers, performing inlining and other optimizations doesn't have much impact on "debuggability" because that system code is never going to be debugged by the user anyway.

Once you inline though, the code you've inlined can get jumbled up with the other code - potentially placing a container in an invalid state inside that user code - perhaps they try to print the contents of the container on a line after something was added but it doesn't show up because of instruction reordering, for instance.
 

Naïvely, it seems to me like a reasonable extension, in the context of Clang.

my $.02,
–Arthur


Re: Varying per function optimisation based on include path?

Jon Chesterfield via cfe-dev
 On Mon, Aug 19, 2019 at 10:36 AM Arthur O'Dwyer <[hidden email]>
wrote:

> On Mon, Aug 19, 2019 at 1:25 PM David Blaikie via cfe-dev <
> [hidden email]> wrote:
>
>> On Mon, Aug 19, 2019 at 8:34 AM Jon Chesterfield via cfe-dev <
>> [hidden email]> wrote:
>>
>>>
>>> There's a proposal in SG15 (http://wg21.link/p1832) with a suggestion
>>> to more aggressively inline functions included via isystem when building
>>> for -Og.
>>>
>>
>> To me that sounds pretty outside the scope of the C++ standards
>> committee, but don't know too much about how they deal with things on the
>> fringe like this. It would seem simpler/more direct to provide
>> patches/discuss development in existing compilers to demonstrate the value
>> and leave it up to "market forces" to handle this sort of
>> quality-of-implementation thing.
>>
>
> I think "discuss development in existing compilers" was the purpose of
> Jon's opening this thread, yeah?
>

The discussion here seems fine - I was/am a bit confused by this involving
a paper to the C++ committee.

It's context. Most of the time I work on compiler back ends, but periodically I play with C++. John's paper looked enough like a compiler bug report that I thought I'd raise it here. The paper references GCC, but LLVM is more familiar to me.


> Does P1832's approach seem like a reasonable extension *in the context of
> Clang?*
>

Wording pedantry, etc - but I wouldn't describe this as an extension
because it's not part of the observable behavior as far as C++ is
concerned. (it doesn't make code that would be invalid (or unspecified, IB,
or UB) valid with some specific semantics)

I'd probably describe it as a clang feature request - one I personally
would be pretty hesitant about, but data could help motivate the decision -
performance data combined with some attempt to quantify debuggability.

Feature request is fair. I was imagining marking "application" functions as optnone and STL/SDK functions as normal, but don't have a good handle on how clang distinguishes functions from different origins.


> The goal is to improve runtime performance for debug builds that make
>>> heavy use of the STL.
>>>
>>> Does clang already set various attributes based on whether a function
>>> was found via -i or -isystem, and if not, does that seem a reasonable
>>> extension?
>>>
>>
>> Nope - pretty sure it doesn't & probably tries fairly hard not to vary
>> code generation (or even diagnostics, except for a very big slice around
>> system headers specifically to avoid warnings users can't fix) depending on
>> where the code was written.
>>
>
> In -Og mode, it seems that it would equally make sense to take "a very big
> slice around system headers specifically to avoid" debug symbols for code
> that users can't debug.
>

That seems different to me - users can debug into templates and it can be
useful - if they've corrupted the state of a container (yeah, other tools
might be better there, like sanitizers) or the library is doing the right
thing but it's surprising because the user didn't realize what they were
asking for.

Mixed. The attitude of "the bug is in my code, not the standard library" is usually worth encouraging. So I sort of like the idea of optimising library code while keeping application code simpler. Splitting the two by include paths was a new idea for me.


>
> There is work in Clang/LLVM to try to make -Og/-O1 (currently synonymous
>> and hope to keep them that way as long as possible) be a good/better
>> fast/debuggable tradeoff. But mostly that centers around avoiding
>> destructive optimizations & keeping as much debug-ability-related state as
>> possible.
>>
>
> Are "debug symbols / lack of inlining" for any piece of code always
> "debuggability-related"?
> P1832 takes the position that for code in system headers, performing
> inlining and other optimizations doesn't have much impact on
> "debuggability" because that system code is never going to be debugged by
> the user anyway.
>

Once you inline though, the code you've inlined can get jumbled up with the
other code - potentially placing a container in an invalid state inside
that user code - perhaps they try to print the contents of the container on
a line after something was added but it doesn't show up because of
instruction reordering, for instance.

Sure. This suggestion is a rough line between code that is "trusted" and "suspect", at slightly different granularity to compiling different translation units with different optimisation levels.

>
> Naïvely, it seems to me like a reasonable extension, in the context of
> Clang.
>
> my $.02,
> –Arthur

Cheers. Naively seems reasonable is where I am on this. It feels like something games dev people would like. And it feels more QoI than ISO. Good debugging experience of optimised code is, uh, difficult.

Jon


Re: Varying per function optimisation based on include path?

Paul Robinson via cfe-dev

> In -Og mode, it seems that it would equally make sense to take "a very big
> slice around system headers specifically to avoid" debug symbols for code
> that users can't debug.

 

Our users seem to like to be able to dump their STL containers, which definitely requires debug symbols for "code they can't debug."

OTOH being able to more aggressively optimize system-header code even in –Og mode seems reasonable.

OTOOH most of the system-header code is templates or otherwise inlineable early, and after inlining the distinction between app and sys code really goes away.

 

So, I see this as something that you could find interesting to play around with, but seems unlikely to have much practical effect.

--paulr

 



Re: Varying per function optimisation based on include path?

Arthur O'Dwyer via cfe-dev
On Tue, Aug 20, 2019 at 9:42 AM via cfe-dev <[hidden email]> wrote:

> In -Og mode, it seems that it would equally make sense to take "a very big
> slice around system headers specifically to avoid" debug symbols for code
> that users can't debug.

 

Our users seem to like to be able to dump their STL containers, which definitely requires debug symbols for "code they can't debug."


Hmm, I may have muddled things up by mentioning "debug symbols" without fully understanding what people mean by that phrase precisely. I meant "line-by-line debugging information enabling single-step through a bunch of templates that the user doesn't care about and would prefer to see inlined away." Forget debug symbols and focus on inlining, if that'll help avoid my confusion. :)
 

OTOH being able to more aggressively optimize system-header code even in –Og mode seems reasonable.

OTOOH most of the system-header code is templates or otherwise inlineable early, and after inlining the distinction between app and sys code really goes away.


I believe we'd like to get "inlining early," but the problem is that `-Og` disables inlining. So there is no "after inlining" at the moment.
Here's a very concrete example: https://godbolt.org/z/5tTgO4

int foo(std::tuple<int, int> t) {
    return std::get<0>(t);
}


At `-Og` this produces the assembly code

_Z3fooSt5tupleIJiiEE:
  pushq %rax
  callq _ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_
  movl (%rax), %eax
  popq %rcx
  retq
_ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_:
  jmp _ZSt12__get_helperILm0EiJiEERT0_RSt11_Tuple_implIXT_EJS0_DpT1_EE
_ZSt12__get_helperILm0EiJiEERT0_RSt11_Tuple_implIXT_EJS0_DpT1_EE:
  jmp _ZNSt11_Tuple_implILm0EJiiEE7_M_headERS0_
_ZNSt11_Tuple_implILm0EJiiEE7_M_headERS0_:
  addq $4, %rdi
  jmp _ZNSt10_Head_baseILm0EiLb0EE7_M_headERS0_
_ZNSt10_Head_baseILm0EiLb0EE7_M_headERS0_:
  movq %rdi, %rax
  retq


I believe that if John McFarlane's proposal were adopted by Clang, so that inlining-into-system-functions were allowed at `-Og`, then the resulting assembly code would look like this instead, for a much better experience in both debugging and runtime performance:

_Z3fooSt5tupleIJiiEE:
  pushq %rax
  callq _ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_
  movl (%rax), %eax
  popq %rcx
  retq
_ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_:
  leaq 4(%rdi), %rax
  retq


Notice that we still aren't inlining `std::get` into `foo`, because `foo` (as a user function) gets no inlining optimizations at `-Og`. But we do inline and collapse the whole chain of function-template helpers into `std::get` (because `std::get` is a function defined in a system header). This inlining creates new optimization opportunities, such as combining the `add` and `mov` into a single `lea`.

HTH,
–Arthur


Re: Varying per function optimisation based on include path?

Paul Robinson via cfe-dev

Ah, I'd forgotten that Og prefers not to inline.

Distinguishing optimization levels within one translation unit is tricky given the current way we build optimization pipelines. They are *not* designed to handle function-level differences in optimization levels.  Trying to (essentially) mix O1 and O2 in the same translation unit is a radical departure from how LLVM thinks about optimization.  ('optnone' is a special case where passes effectively disable themselves when presented with an 'optnone' function. Generalizing that to more optimization levels is a seriously invasive proposition.)

 

Re the "symbols" confusion, broadly speaking you can separate debug info into that which describes the source (types, variables, etc), and that which describes the generated code (to a first approximation, the instruction<->source mapping).  So the suggestion in this thread is to retain the former but not the latter.

In this exercise, if we genuinely want to *prevent* debugging of defined-in-system-header functions (which seems like a highly questionable feature) it could be done with judicious application of the 'nodebug' attribute.  Not hard, really.

--paulr

 


Re: Varying per function optimisation based on include path?

Ben Craig via cfe-dev

I think a question was glossed over.  Exactly which directions should be inlined…

  1. User callee into user caller (definitely not)
  2. System callee into system caller (yes)
  3. User callee into system caller (maybe?)
  4. System callee into user caller (maybe?)

 

Perhaps number 3 should be prohibited because then a breakpoint on “my” function would either not get hit, or turn into multiple breakpoints.

 

Perhaps number 4 should be prohibited because it makes stepping across loop iterations in things like std::transform more difficult.

 

 


Re: Varying per function optimisation based on include path?

John McFarlane via cfe-dev
I deliberately don't bring up call direction because I don't think it's important and I want to keep things simple:
- If the code is user code, it should not be inlined.
- If the code is system code (including non-standard 3rd-party dependencies), it should be inlined.

Take a hypothetical call stack from this example program:

?: std::strncmp(....) <- C standard library function, possibly an intrinsic, inline
?: std::string::operator==(char const*)  <- standard library function, inline
8: [](std::string const&) {....} <- lambda, don't inline
?: std::find_if(....) <- standard library taking lambda, inline
4: contains_sg15 <- user function containing lambda, don't inline
13: main <- user's main function, don't inline

I think whether to inline the lambda is mildly contentious but I think we both agree it should not be inlined. The real question is whether to inline `std::find_if`. I guess you're suggesting we don't inline it. I think -- on balance -- we probably should.

Let's step through this program and let's assume that I only use the "step into" functionality of the debugger, because that's an easy way to 'visit' an entire program. Let's assume `find_if` is a single function.

When I step into the call to `std::find_if`, I think it should skip through `std::find_if` and take me directly to line 9 in the lambda function in `contains_sg15`. Then, when I "step into" in the lambda, it'll step over the `std::string` call and take me directly to line 10. (That's what it does in GDB at least.) Then when I "step into" again, it should take me back to line 9 in the second lambda invocation.

Now, if that last action doesn't work, I think that's where you have the concern. I'm not entirely sure what'll happen once `std::find_if` is inlined. I know that without inlining, "step into" eventually does take me back to line 9 of the lambda body. If it doesn't, it's going to kick me back out into the body of `contains_sg15` which is a problem.

In that case, yes, maybe saying that functions which call non-`-isystem` functions should not be inlined is a rule to consider. But that instantly makes things more complicated: we now have a new reason why the same function might be both inlined and not inlined in the same TU. I don't know if that's a problem. We now send the user into the guts of a standard library call. That is a problem. That code looks frightening and obfuscated to the average developer.

A workaround to the problem would be to put a breakpoint in the lambda function. I personally would be perfectly happy with that compromise. We're looking at a problem which game developers (among others) experience when in-header functions are defined in the user's TU. In C code and non-modern C++, the offending code is defined in a separate .cpp file.

Imagine if `std::find_if` could be compiled in a separate TU. And imagine that TU was compiled with `-O2`. What would happen then? You might experience the same debugging difficulty. The same workaround, using a breakpoint, would work there also. We'd at least have parity with non-modern C++. That's the main aim here: not to improve the state of the art for C and C++, merely to help C++ catch up with C.

HTH,
John


 

On Tue, Aug 20, 2019 at 9:42 AM via cfe-dev <[hidden email]> wrote:

> In -Og mode, it seems that it would equally make sense to take "a very big
> slice around system headers specifically to avoid" debug symbols for code
> that users can't debug.

 

Our users seem to like to be able to dump their STL containers, which definitely requires debug symbols for "code they can't debug."

 

Hmm, I may have muddled things up by mentioning "debug symbols" without fully understanding what people mean by that phrase precisely. I meant "line-by-line debugging information enabling single-step through a bunch of templates that the user doesn't care about and would prefer to see inlined away." Forget debug symbols and focus on inlining, if that'll help avoid my confusion. :)

 

OTOH being able to more aggressively optimize system-header code even in -Og mode seems reasonable.

OTOOH most of the system-header code is templates or otherwise inlineable early, and after inlining the distinction between app and sys code really goes away.

 

I believe we'd like to get "inlining early," but the problem is that `-Og` disables inlining. So there is no "after inlining" at the moment.

Here's a very concrete example: https://godbolt.org/z/5tTgO4

 

int foo(std::tuple<int, int> t) {
    return std::get<0>(t);
}

 

At `-Og` this produces the assembly code

 

_Z3fooSt5tupleIJiiEE:
  pushq %rax
  callq _ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_
  movl (%rax), %eax
  popq %rcx
  retq
_ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_:
  jmp _ZSt12__get_helperILm0EiJiEERT0_RSt11_Tuple_implIXT_EJS0_DpT1_EE
_ZSt12__get_helperILm0EiJiEERT0_RSt11_Tuple_implIXT_EJS0_DpT1_EE:
  jmp _ZNSt11_Tuple_implILm0EJiiEE7_M_headERS0_
_ZNSt11_Tuple_implILm0EJiiEE7_M_headERS0_:
  addq $4, %rdi
  jmp _ZNSt10_Head_baseILm0EiLb0EE7_M_headERS0_
_ZNSt10_Head_baseILm0EiLb0EE7_M_headERS0_:
  movq %rdi, %rax
  retq

 

I believe that if John McFarlane's proposal were adopted by Clang, so that inlining-into-system-functions were allowed at `-Og`, then the resulting assembly code would look like this instead, for a much better experience in both debugging and runtime performance:

 

_Z3fooSt5tupleIJiiEE:
  pushq %rax
  callq _ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_
  movl (%rax), %eax
  popq %rcx
  retq
_ZSt3getILm0EJiiEERNSt13tuple_elementIXT_ESt5tupleIJDpT0_EEE4typeERS4_:
  leaq 4(%rdi), %rax
  retq

 

Notice that we still aren't inlining `std::get` into `foo`, because `foo` (as a user function) gets no inlining optimizations at `-Og`. But we do inline and collapse the whole chain of function-template helpers into `std::get` (because `std::get` is a function defined in a system header). This inlining creates new optimization opportunities, such as combining the `add` and `mov` into a single `lea`.

 

HTH,

–Arthur



Re: Varying per function optimisation based on include path?

Lubos Lunak-2
In reply to this post by Tom Stellard via cfe-dev
On Tuesday 20 of August 2019, via cfe-dev wrote:
> Ah, I'd forgotten that Og prefers not to inline.
> Distinguishing optimization levels within one translation unit is tricky
> given the current way we build optimization pipelines. They are *not*
> designed to handle function-level differences in optimization levels.
> Trying to (essentially) mix O1 and O2 in the same translation unit is a
> radical departure from how LLVM thinks about optimization.  ('optnone' is a
> special case where passes effectively disable themselves when presented
> with an 'optnone' function. Generalizing that to more optimization levels
> is a seriously invasive proposition.)


 It seems to me you are reading more into this than there is. Read the
original mail again: "a suggestion to more aggressively inline functions
included via isystem when building for -Og." That, in other words, can also
be stated as "treat all functions from -isystem as if they have
__attribute__((always_inline)) by default". Or just read the green section in
http://wg21.link/p1832 , which is yet another way of putting it.

 And that's it. And it's not a radical departure from anything LLVM does,
because __attribute__((always_inline)) already works. Defaulting -isystem
functions to having the attribute is probably a few lines of code, plus a
few more if it were controlled by a command-line option.

--
 Lubos Lunak

Re: Varying per function optimisation based on include path?

Tom Stellard via cfe-dev
In reply to this post by Tom Stellard via cfe-dev
On Wed, Aug 21, 2019 at 1:40 AM John McFarlane <[hidden email]> wrote:
> On Tue, 20 Aug 2019 at 19:34, Ben Craig <[hidden email]> wrote:
>>
>> I think a question was glossed over.  Exactly which directions should be inlined…
>>
>> User callee into user caller (definitely not)
>> System callee into system caller (yes)
>> User callee into system caller (maybe?)
>> System callee into user caller (maybe?)
>>
>> Perhaps number 3 should be prohibited because then a breakpoint on “my” function would either not get hit, or turn into multiple breakpoints.
>> Perhaps number 4 should be prohibited because it makes stepping across loop iterations in things like std::transform more difficult.
>
> I deliberately don't bring up call direction because I don't think it's important and I want to keep things simple:
> - If the code is user code, it should not be inlined.
> - If the code is system code (including non-standard 3rd-party dependencies), it should be inlined.

I think Ben Craig's point about call direction is important. Or at least, his intuition about what you meant matches my intuition, whereas the words you just used in the paragraph above do not match my intuition about what you had meant in P1832.

> Take a hypothetical call stack from this example program:
>
> ?: std::strncmp(....) <- C standard library function, possibly an intrinsic, inline
> ?: std::string::operator==(char const*)  <- standard library function, inline
> 8: [](std::string const&) {....} <- lambda, don't inline
> ?: std::find_if(....) <- standard library taking lambda, inline
> 4: contains_sg15 <- user function containing lambda, don't inline
> 13: main <- user's main function, don't inline
>
> I think whether to inline the lambda is mildly contentious but I think we both agree
> it should not be inlined. The real question is whether to inline `std::find_if`. I guess
> you're suggesting we don't inline it. I think -- on balance -- we probably should.
The verb "inline" doesn't operate on a node of this graph; it operates on an edge between two nodes. My intuition is that -Og should enable inlining on every edge where both endpoint nodes are in "system code," and on no other edges. With your above callstack:

Yes inline strncmp into operator==  (system into system)
No don't inline operator== into the user's lambda  (system into user)
No don't inline the user's lambda into std::find_if  (user into system)
No don't inline std::find_if into contains_sg15  (system into user)
No don't inline contains_sg15 into main  (user into user)

My intuition is that we don't want to inline system code into user code because that would mix "not-my-code" into the codegen for "my code." Imagine the user trying to step through `contains_sg15`, and suddenly they're seeing backward branches and loops. Where did those loops come from? Ah, the compiler inlined a bunch of crap from <algorithm>. That's not the debugging experience we want (says my intuition).  So we should not inline `std::find_if` into `contains_sg15`.

What about inlining the user's lambda into std::find_if?  I think we have always agreed on this one.  The user must be able to set a single breakpoint in the lambda, and run `contains_sg15`, and see the breakpoint get hit. If there's a second copy of the lambda's code inlined into `std::find_if`, then this scenario falls apart. So, we should not inline the user's lambda into `std::find_if`. In fact, the same argument implies that we should never inline user code into anywhere. One line of user source code should continue to map to one point in the compiled binary.


> Let's step through this program and let's assume that I only use the "step into"
> functionality of the debugger, because that's an easy way to 'visit' an entire
> program. Let's assume `find_if` is a single function.

I think you have to consider the workflow for people who use the "disassemble" functionality of the debugger, as well.

Also, I'm not sure how the "step into/ single-step" functionality works, but I would have thought that it just means "continue forward until the current instruction is no longer associated with the same source line of code." Which means that it would still be possible to "step into" system code, because system code does still have line numbers associated with it.

I intuit that it should be easy to "step over" calls to library functions; which means that you must ensure that the call to `std::find_if` shows up as a call instruction in the binary. If we inline `std::find_if` into `contains_sg15`, then you can't "step over" it anymore.

–Arthur


Re: Varying per function optimisation based on include path?

Tom Stellard via cfe-dev
On Sat, 24 Aug 2019 at 14:56, Arthur O'Dwyer <[hidden email]> wrote:
On Wed, Aug 21, 2019 at 1:40 AM John McFarlane <[hidden email]> wrote:
> On Tue, 20 Aug 2019 at 19:34, Ben Craig <[hidden email]> wrote:
>>
>> I think a question was glossed over.  Exactly which directions should be inlined…
>>
>> User callee into user caller (definitely not)
>> System callee into system caller (yes)
>> User callee into system caller (maybe?)
>> System callee into user caller (maybe?)
>>
>> Perhaps number 3 should be prohibited because then a breakpoint on “my” function would either not get hit, or turn into multiple breakpoints.
>> Perhaps number 4 should be prohibited because it makes stepping across loop iterations in things like std::transform more difficult.
>
> I deliberately don't bring up call direction because I don't think it's important and I want to keep things simple:
> - If the code is user code, it should not be inlined.
> - If the code is system code (including non-standard 3rd-party dependencies), it should be inlined.

I think Ben Craig's point about call direction is important. Or at least, his intuition about what you meant matches my intuition, whereas the words you just used in the paragraph above do not match my intuition about what you had meant in P1832.

Thanks for providing a perspective of which I was unaware.

> Take a hypothetical call stack from this example program:
>
> ?: std::strncmp(....) <- C standard library function, possibly an intrinsic, inline
> ?: std::string::operator==(char const*)  <- standard library function, inline
> 8: [](std::string const&) {....} <- lambda, don't inline
> ?: std::find_if(....) <- standard library taking lambda, inline
> 4: contains_sg15 <- user function containing lambda, don't inline
> 13: main <- user's main function, don't inline
>
> I think whether to inline the lambda is mildly contentious but I think we both agree
> it should not be inlined. The real question is whether to inline `std::find_if`. I guess
> you're suggesting we don't inline it. I think -- on balance -- we probably should.
The verb "inline" doesn't operate on a node of this graph; it operates on an edge between two nodes. My intuition is that -Og should enable inlining on every edge where both endpoint nodes are in "system code," and on no other edges. With your above callstack:

Yes inline strncmp into operator==  (system into system)
No don't inline operator== into the user's lambda  (system into user)
No don't inline the user's lambda into std::find_if  (user into system)
No don't inline std::find_if into contains_sg15  (system into user)
No don't inline contains_sg15 into main  (user into user)

My intuition is that we don't want to inline system code into user code because that would mix "not-my-code" into the codegen for "my code." Imagine the user trying to step through `contains_sg15`, and suddenly they're seeing backward branches and loops. Where did those loops come from? Ah, the compiler inlined a bunch of crap from <algorithm>. That's not the debugging experience we want (says my intuition).  So we should not inline `std::find_if` into `contains_sg15`.

Either your formula or mine for choosing what to inline improves the situation. That makes me happy. But I'm far from sold on yours. (I'll concentrate on `find_if`.)

Little of what I imagine as the UX for this feature comes from intuition. It comes from my experience as a C developer, a 'trad' C++ developer and as a games developer. Imagine that we 'degenericize' the STL and that `find_if` and all the other library calls are plain ol' functions defined in a separate TU. The TU is compiled into a .lib file at `-O2` and linked into the user's debug build. Q: What happens now when I step into `contains_sg15`? A: Exactly the same behaviour. So there's a consistency argument to be made. It's the 'User Expectation' from the title of P1832.

Secondly, I wonder if you overestimate just how much a typical game dev cares about being able to step through the loop inside a standard algorithm. For a number of reasons, that code is scary and intimidating for most humble devs. When they step into standard library code by accident, it interrupts their concentration and presents them with an _Uglified mess.

Thirdly, I explicitly simplified my description of `find_if` for the sake of argument. But now my simplification gets in the way. If you look at the real `find_if`, you'll see a great many levels of function calls. (I assume they are your nodes.) The first one when you step into `find_if` has no `for` loop. It is a wrapper to the implementation's algorithm. The last one before you step into the lambda also contains no `for` loop but is a function dedicated to invoking lambdas. So your formula for when to inline library functions will only degrade the experience. Instead, you'd need to say that *everything* between `contains_sg15` and the lambda must not be inlined. At this point, you're losing a lot of your performance gains which brings me on to...

Finally, because some of these abstractions boil down to very few (sometimes zero!) assembler instructions, any additional assembler you get from relaxing the criteria for inlining can heavily affect run-time performance and binary size. I suspect that on balance, the target audience here won't want to pay that price.

What about inlining the user's lambda into std::find_if?  I think we have always agreed on this one.  The user must be able to set a single breakpoint in the lambda, and run `contains_sg15`, and see the breakpoint get hit. If there's a second copy of the lambda's code inlined into `std::find_if`, then this scenario falls apart. So, we should not inline the user's lambda into `std::find_if`. In fact, the same argument implies that we should never inline user code into anywhere. One line of user source code should continue to map to one point in the compiled binary.

Yes, we always want to facilitate breakpoints etc. in user code. 


> Let's step through this program and let's assume that I only use the "step into"
> functionality of the debugger, because that's an easy way to 'visit' an entire
> program. Let's assume `find_if` is a single function.

I think you have to consider the workflow for people who use the "disassemble" functionality of the debugger, as well.

I'd be interested to learn more.

Also, I'm not sure how the "step into/ single-step" functionality works, but I would have thought that it just means "continue forward until the current instruction is no longer associated with the same source line of code." Which means that it would still be possible to "step into" system code, because system code does still have line numbers associated with it.

That's roughly my understanding, yes. You can test this with `forceinline`. There are debugging facilities (including Jeff Trull's) which I mention in the paper and which help avoid stepping into the 'wrong' functions.

I intuit that it should be easy to "step over" calls to library functions; which means that you must ensure that the call to `std::find_if` shows up as a call instruction in the binary. If we inline `std::find_if` into `contains_sg15`, then you can't "step over" it anymore.

So, are you saying that if we inline `find_if` AND we generate debugging symbols, then the debugger's just going to jump us to `find_if` when we step over, right? I'm not sure if that's what happens. Again, you can test this with `forceinline`. I think the debugger's smart enough to skip over those lines. But I still expect my proposed solution to have some major flaw I didn't think about such as this!

Thanks,
John

–Arthur


Re: Varying per function optimisation based on include path?

Tom Stellard via cfe-dev
On Sun, Aug 25, 2019 at 9:48 AM John McFarlane <[hidden email]> wrote:
On Sat, 24 Aug 2019 at 14:56, Arthur O'Dwyer <[hidden email]> wrote:
On Wed, Aug 21, 2019 at 1:40 AM John McFarlane <[hidden email]> wrote:
> On Tue, 20 Aug 2019 at 19:34, Ben Craig <[hidden email]> wrote:
>>
>> I think a question was glossed over.  Exactly which directions should be inlined…
>>
>> User callee into user caller (definitely not)
>> System callee into system caller (yes)
>> User callee into system caller (maybe?)
>> System callee into user caller (maybe?)
>>
>> Perhaps number 3 should be prohibited because then a breakpoint on “my” function would either not get hit, or turn into multiple breakpoints.
>> Perhaps number 4 should be prohibited because it makes stepping across loop iterations in things like std::transform more difficult.
>
> I deliberately don't bring up call direction because I don't think it's important and I want to keep things simple:
> - If the code is user code, it should not be inlined.
> - If the code is system code (including non-standard 3rd-party dependencies), it should be inlined.

I think Ben Craig's point about call direction is important. Or at least, his intuition about what you meant matches my intuition, whereas the words you just used in the paragraph above do not match my intuition about what you had meant in P1832.

Thanks for providing a perspective of which I was unaware.

> Take a hypothetical call stack from this example program:
>
> ?: std::strncmp(....) <- C standard library function, possibly an intrinsic, inline
> ?: std::string::operator==(char const*)  <- standard library function, inline
> 8: [](std::string const&) {....} <- lambda, don't inline
> ?: std::find_if(....) <- standard library taking lambda, inline
> 4: contains_sg15 <- user function containing lambda, don't inline
> 13: main <- user's main function, don't inline
>
> I think whether to inline the lambda is mildly contentious but I think we both agree
> it should not be inlined. The real question is whether to inline `std::find_if`. I guess
> you're suggesting we don't inline it. I think -- on balance -- we probably should.
The verb "inline" doesn't operate on a node of this graph; it operates on an edge between two nodes. My intuition is that -Og should enable inlining on every edge where both endpoint nodes are in "system code," and on no other edges. With your above callstack:

Yes inline strncmp into operator==  (system into system)
No don't inline operator== into the user's lambda  (system into user)
No don't inline the user's lambda into std::find_if  (user into system)
No don't inline std::find_if into contains_sg15  (system into user)
No don't inline contains_sg15 into main  (user into user)

My intuition is that we don't want to inline system code into user code because that would mix "not-my-code" into the codegen for "my code." Imagine the user trying to step through `contains_sg15`, and suddenly they're seeing backward branches and loops. Where did those loops come from? Ah, the compiler inlined a bunch of crap from <algorithm>. That's not the debugging experience we want (says my intuition).  So we should not inline `std::find_if` into `contains_sg15`.

Either your formula or mine for choosing what to inline improves the situation. That makes me happy. But I'm far from sold on yours. (I'll concentrate on `find_if`.)

Little of what I imagine as the UX for this feature comes from intuition. It comes from my experience as a C developer, a 'trad' C++ developer and as a games developer. Imagine that we 'degenericize' the STL and that `find_if` and all the other library calls are plain ol' functions defined in a separate TU. The TU is compiled into a .lib file at `-O2` and linked into the user's debug build. Q: What happens now when I step into `contains_sg15`? A: Exactly the same behaviour. So there's a consistency argument to be made. It's the 'User Expectation' from the title of P1832.

Secondly, I wonder if you overestimate just how much a typical game dev cares about being able to step through the loop inside a standard algorithm. For a number of reasons, that code is scary and intimidating for most humble devs. When they step into standard library code by accident, it interrupts their concentration and presents them with an _Uglified mess.

I think maybe you got my suggestion backwards.  I am suggesting that STL code should never be mixed in with user code. For example, if the user writes straight-line code like this...

    bool contains_sg15(const std::vector<std::string>& vec) {
        auto it = std::find_if(vec.begin(), vec.end(), [](const auto& elt) { return elt == "sg15"; });
        return it != vec.end();
    }

...then I would consider it a bad idea for the compiler to inline the loop from `std::find_if` into the user's code. That would lead to the user's being forced to "step through the loop inside a standard algorithm," as you say. I would intuitively expect that at -Og, the user's `contains_sg15` function would codegen to exactly what it does today:

    - CALL to begin()
    - CALL to end()
    - CALL to std::find_if
    - CALL to end()
    - CALL to operator!=

If the compiler inlined any of those calls, then the assembly code for `contains_sg15` would suddenly contain assembly code corresponding to C++ code that the user didn't write. And the user doesn't want that. So it's good that `-Og` does not inline system code into user code.

Thirdly, I explicitly simplified my description of `find_if` for the sake of argument. But now my simplification gets in the way. If you look at the real `find_if`, you'll see a great many levels of function calls. (I assume they are your nodes.)

That's right. That's why it's important that P1832 says "let's permit the compiler to inline system code into system code."
By optimizing the body of `std::find_if`, we can reduce its code size by about half (from 182 lines to 93 lines in this test case).  Notice that I am not inlining find_if into contains_sg15, and I am not inlining the body of the lambda into find_if. We can get P1832's huge performance boost without ever mixing user and system code together.
 
The first one when you step into `find_if` has no `for` loop. It is a wrapper to the implementation's algorithm. The last one before you step into the lambda also contains no `for` loop but is a function dedicated to invoking lambdas. So your formula for when to inline library functions will only degrade the experience. Instead, you'd need to say that *everything* between `contains_sg15` and the lambda must not be inlined. At this point, you're losing a lot of your performance gains which brings me on to...

I don't understand the sentence "Instead, you'd need to say that *everything* between `contains_sg15` and the lambda must not be inlined." That's 100% exactly the edges that we do want to inline!
 
Finally, because some of these abstractions boil down to very few (sometimes zero!) assembler instructions, any additional assembler you get from relaxing the criteria for inlining can heavily affect run-time performance and binary size. I suspect that on balance, the target audience here won't want to pay that price.

Can you be specific about what code you're talking about? I am sure that you are misunderstanding my suggestion, but I can't figure out what you think the suggestion is.


> Let's step through this program and let's assume that I only use the "step into"
> functionality of the debugger, because that's an easy way to 'visit' an entire
> program. Let's assume `find_if` is a single function.

I think you have to consider the workflow for people who use the "disassemble" functionality of the debugger, as well.

I'd be interested to learn more.

This wasn't intended as a teaser; I meant it to be a summation of the suggestions in my email. Inlining loopy system code into the middle of straight-line user code is a bad idea because it messes with people's ability to understand the disassembly.


I intuit that it should be easy to "step over" calls to library functions; which means that you must ensure that the call to `std::find_if` shows up as a call instruction in the binary. If we inline `std::find_if` into `contains_sg15`, then you can't "step over" it anymore.

So, are you saying that if we inline `find_if` AND we generate debugging symbols, then the debugger's just going to jump us to `find_if` when we step over, right?

No, not at all.  You're thinking of "step into."  "Step into" steps down into function calls. "Step over" steps over function calls. The metaphor here is that a function call is like a pothole or trapdoor down into the next lower level of the code. You can either step "into" the hole, or step "over" it.

HTH,
Arthur


Re: Varying per function optimisation based on include path?

Tom Stellard via cfe-dev
On Sun, Aug 25, 2019 at 10:26 AM Arthur O'Dwyer <[hidden email]> wrote:
On Sun, Aug 25, 2019 at 9:48 AM John McFarlane <[hidden email]> wrote:
On Sat, 24 Aug 2019 at 14:56, Arthur O'Dwyer <[hidden email]> wrote:

I intuit that it should be easy to "step over" calls to library functions; which means that you must ensure that the call to `std::find_if` shows up as a call instruction in the binary. If we inline `std::find_if` into `contains_sg15`, then you can't "step over" it anymore.

So, are you saying that if we inline `find_if` AND we generate debugging symbols, then the debugger's just going to jump us to `find_if` when we step over, right?

No, not at all.  You're thinking of "step into."  "Step into" steps down into function calls. "Step over" steps over function calls. The metaphor here is that a function call is like a pothole or trapdoor down into the next lower level of the code. You can either step "into" the hole, or step "over" it.

Oops, strike that!
You asked: if we do inline `find_if`, will "step over" actually skip over the whole inlined loop, or will it act like "step into"? I have no intuition either way. I would even bet that different debuggers do different things.
(I had initially misread your sentence as "...if we don't inline `find_if`...".)

–Arthur



Re: Varying per function optimisation based on include path?

Tom Stellard via cfe-dev
On Sun, 25 Aug 2019 at 18:35, Arthur O'Dwyer <[hidden email]> wrote:
(Did you drop the list intentionally?)

Oops. Thanks. 

On Sun, Aug 25, 2019 at 11:43 AM John McFarlane <[hidden email]> wrote:

It's been mentioned before that the decision to inline here is comparable to a conditional `__forceinline` or `always_inline` directive where the predicate it is what is under question.

My formula:

    __forceinline(included_via_isystem(callee)) auto find_if(....) { .... }

Your formula (perhaps?):

    __forceinline(included_via_isystem(callee)&&included_via_isystem(caller)) auto find_if(....) { .... }

Yes, that's essentially it.
However, I repeat that the primitive operation with inlining is "collapsing an edge of the call graph." We do not inline a function definition; we inline a specific call (from caller to callee).  So I would say that the compiler's algorithm for inlining currently looks like

    bool ShouldInline(Func caller, Func callee) { // and we might add "SourceLocation callsite", which is unused here
        bool heuristic = (callee.is_small() || callee.has_forceinline_attribute());
        return heuristic && (-O2 or greater);
    }

and that we want it to instead look like

    bool ShouldInline(Func caller, Func callee) {
        bool heuristic = (callee.is_small() || callee.has_forceinline_attribute());
        bool is_system_to_system = (callee.is_from_system_header() && caller.is_from_system_header());
        return heuristic && (
            (-O2 or greater) || (-Og && is_system_to_system)
        );
    }

I would stay away from trying to frame the solution as "adding attributes to certain function definitions," because the intended behavior is not a property of function definitions (nodes in the call graph) — it's a property of calls (edges in the call graph). Namely, the property is, do we want to inline that call or not.

The attribute in question is conditional; otherwise, wouldn't forceinline be all we need? So it's kind of like `noexcept(predicate)`, but dependent on the caller/callee rather than on `T`. So I think the above predicate is a helpful thought experiment.

Again, for all the arguments I gave previously, I think that `find_if` and `operator==` in our example should be inlined. Your `is_system_to_system` test would fail for those. 

John

–Arthur
