
Lambdas and the ABI?


Lambdas and the ABI?

Hal Finkel via cfe-dev
Hi Richard, Chandler, John, et al.,

"Quick" question: What aspects, if any, of a C++ lambda (e.g. size,
layout, alignment) leak into the ABI or are (potentially) semantically
visible?

Context...

We're investigating late lowering/outlining schemes to improve code
generation for OpenMP and other parallel programming models. One
important advantage of outlining late is that the IR-level optimizer
still has access to the pointer-aliasing and loop information from the
containing function. There are natural benefits to C++ lambdas as well.
Lambdas that are used "in place" (i.e. only have one call site) should
always be inlined, so the issues don't come up there, but for lambdas
that have multiple call sites, or worse, escape (via std::function or
some other type-erasure mechanism), we can get suboptimal optimization
results for the body of the lambda. It would seem sad to fix this for
OpenMP but not for lambdas.

However, optimizing before outlining could mean changes in what
variables need to be captured (not semantically, but at the IR level),
and so I'd like to think through what would constrain the optimizer's
freedom to act in this regard.

John, I'm curious about how this might apply to Objective C blocks as well.

Thanks again,

Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: Lambdas and the ABI?

Hubert Tong via cfe-dev
This question was explored in another context here:
http://sourcerytools.com/pipermail/cxx-abi-dev/2013-January/002544.html

-- HT


Re: Lambdas and the ABI?

Hal Finkel via cfe-dev


On 03/21/2017 08:45 AM, Hubert Tong wrote:
This question was explored in another context here:
http://sourcerytools.com/pipermail/cxx-abi-dev/2013-January/002544.html

Very interesting, thanks! So the conclusion was that "The ABI will need to specify the layout of closure types with weak linkage." In some cases (enumerated here: https://itanium-cxx-abi.github.io/cxx-abi/abi.html#closure-types) the layout must be fixed (because it must be consistent across translation units).

I imagine that size/alignment might also need to be fixed in cases where the source program explicitly depends on them (e.g. doing sizeof(lambda), new auto(lambda)).
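
For concreteness, a minimal sketch of how a program can observe those properties directly (the function name here is invented):

#include <cstddef>

int observe_closure_layout() {
  int a = 1, b = 2;
  auto lam = [a, b] { return a + b; };       // closure capturing two ints by value
  std::size_t sz = sizeof(lam);              // the closure type's size is observable
  std::size_t al = alignof(decltype(lam));   // ... and so is its alignment
  auto *copy = new auto(lam);                // new auto(lambda) allocates based on that size/alignment
  int r = (*copy)();
  delete copy;
  return r + static_cast<int>(sz + al);
}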

 -Hal




-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: Lambdas and the ABI?

Chandler Carruth via cfe-dev
On Tue, Mar 21, 2017 at 6:06 PM Hal Finkel via cfe-dev <[hidden email]> wrote:


On 03/21/2017 08:45 AM, Hubert Tong wrote:
This question was explored in another context here:
http://sourcerytools.com/pipermail/cxx-abi-dev/2013-January/002544.html

Very interesting, thanks! So the conclusion was that "The ABI will need to specify the layout of closure types with weak linkage."

Are you basing that on what the document does or the discussion in the thread? Because I don't really see a conclusion in the thread. I see John pointing out three options and suggesting one, which is not what you claim the conclusion is here.

I don't see any real disagreement but I also don't see any real discussion past that email and brief follow-ups....


Re: Lambdas and the ABI?

Richard Smith via cfe-dev
On 21 Mar 2017 12:23 pm, "Chandler Carruth" <[hidden email]> wrote:
On Tue, Mar 21, 2017 at 6:06 PM Hal Finkel via cfe-dev <[hidden email]> wrote:


On 03/21/2017 08:45 AM, Hubert Tong wrote:
This question was explored in another context here:
http://sourcerytools.com/pipermail/cxx-abi-dev/2013-January/002544.html

Very interesting, thanks! So the conclusion was that "The ABI will need to specify the layout of closure types with weak linkage."

Are you basing that on what the document does or the discussion in the thread? Because I don't really see a conclusion in the thread. I see John pointing out three options and suggesting one, which is not what you claim the conclusion is here.

I don't see any real disagreement but I also don't see any real discussion past that email and brief follow-ups....

I expect John's #1 is most likely to be the eventual outcome; #2 and #3 are too restrictive. Plus the language is moving in a direction that makes the capture set much easier to reliably compute, so the cost of #1 is going to become much more like "don't do those optimisations you weren't really doing much anyway" and much less "accurately implement these arcane and formally unobservable rules or you violate the ABI".


Re: Lambdas and the ABI?

Richard Smith via cfe-dev
On 21 Mar 2017 10:06 am, "Hal Finkel" <[hidden email]> wrote:


On 03/21/2017 08:45 AM, Hubert Tong wrote:
This question was explored in another context here:
http://sourcerytools.com/pipermail/cxx-abi-dev/2013-January/002544.html

Very interesting, thanks! So the conclusion was that "The ABI will need to specify the layout of closure types with weak linkage." In some cases (enumerated here: https://itanium-cxx-abi.github.io/cxx-abi/abi.html#closure-types) the layout must be fixed (because it must be consistent across translation units).

I imagine that size/alignment might also need to be fixed in cases where the source program explicitly depends on them (e.g. doing sizeof(lambda), new auto(lambda)).

I don't think that's necessary for lambda types that do not escape a single TU. But obviously a portable program should avoid depending on the size or alignment of a closure type.



Re: Lambdas and the ABI?

John McCall via cfe-dev
> On Mar 21, 2017, at 9:03 AM, Hal Finkel <[hidden email]> wrote:
> Hi Richard, Chandler, John, et al.,
>
> "Quick" question: What aspects, if any, of a C++ lambda (e.g. size, layout, alignment) leak into the ABI or are (potentially) semantically visible?

C++ lambdas are anonymous local types that are constrained to appear in function bodies (including implicit "function" bodies, e.g. the initializers of global variables).  The main semantic rule for local types like lambdas is that they're supposed to be the "same type" in all translation units that see the same function body, meaning (among many other things) that the ABI properties of the type must be the same.  Now, most function bodies can appear in only a single translation unit, so this rule is trivially satisfied and the compiler has total flexibility.  But the body of an inline or template function can usually appear in multiple translation units, and so the ABI has to be consistent.
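
For instance, a minimal sketch (the header and function names are hypothetical):

// adder.h -- hypothetical header included by several translation units
inline auto make_adder(int base) {
  // The closure type below has to be the "same type", with the same size,
  // layout, and alignment, in every TU that sees this function body.
  return [base](int x) { return base + x; };
}

// tu1.cpp and tu2.cpp can both write:
//   #include "adder.h"
//   auto add3 = make_adder(3);              // copies the closure object
//   using Adder = decltype(make_adder(0));  // names the closure type directly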

> Context...
>
> We're investigating late lowering/outlining schemes to improve code generation for OpenMP and other parallel programming models. One important advantage of outlining late is that the IR-level optimizer still has access to the pointer-aliasing and loop information from the containing function. There are natural benefits to C++ lambdas as well. Lambdas that are used "in place" (i.e. only have one call site) should always be inlined, so the issues don't come up there, but for lambdas that have multiple call sites, or worse, escape (via std::function or some other type-erasure mechanism), we can get suboptimal optimization results for the body of the lambda. It would seem sad to fix this for OpenMP but not for lambdas.
>
> However, optimizing before outlining could mean changes in what variables need to be captured (not semantically, but at the IR level), and so I'd like to think through what would constrain the optimizer's freedom to act in this regard.

"Subfunction" representations are very useful for coroutine-esque situations where an inner function's control flow interactions with the outer function are essentially well-understood.  That is, to be useful, you really want the blocks of the inner function to fit into the CFG of the outer function at a unique and appropriate place; otherwise you're just contorting the representation of the outer function in a way that will probably massively hamper optimization.

I can definitely see how there might be many situations in OpenMP that have this kind of CFG knowledge.  Swift has some situations like this as well.  Unfortunately, C++ lambdas and ObjC blocks are not a case where we have this knowledge, because the closure can be passed around and invoked arbitrarily many times and at arbitrary later points, which is not something that can ever fit cleanly into the CFG; pre-emptive "outlining" is just a superior representation for such cases.

John.

Re: Lambdas and the ABI?

Hal Finkel via cfe-dev


On 03/21/2017 02:47 PM, Richard Smith wrote:
On 21 Mar 2017 12:23 pm, "Chandler Carruth" <[hidden email]> wrote:
On Tue, Mar 21, 2017 at 6:06 PM Hal Finkel via cfe-dev <[hidden email]> wrote:


On 03/21/2017 08:45 AM, Hubert Tong wrote:
This question was explored in another context here:
http://sourcerytools.com/pipermail/cxx-abi-dev/2013-January/002544.html

Very interesting, thanks! So the conclusion was that "The ABI will need to specify the layout of closure types with weak linkage."

Are you basing that on what the document does or the discussion in the thread? Because I don't really see a conclusion in the thread. I see John pointing out three options and suggesting one, which is not what you claim the conclusion is here.

I don't see any real disagreement but I also don't see any real discussion past that email and brief follow-ups....

I expect John's #1 is most likely to be the eventual outcome; #2 and #3 are too restrictive. Plus the language is moving in a direction that makes the capture set much easier to reliably compute, so the cost of #1 is going to become much more like "don't do those optimisations you weren't really doing much anyway" and much less "accurately implement these arcane and formally unobservable rules or you violate the ABI".

The follow-up message is not linked properly in the web archive: http://sourcerytools.com/pipermail/cxx-abi-dev/2013-September/002618.html - Hubert wrote, "CWG has taken the position that issue CA 2 on this topic in N3771 is NAD. The ABI will need to specify the layout of closure types with weak linkage."

 -Hal
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: Lambdas and the ABI?

Hal Finkel via cfe-dev

On 03/21/2017 03:02 PM, John McCall wrote:

> C++ lambdas are anonymous local types that are constrained to appear in function bodies (including implicit "function" bodies, e.g. the initializers of global variables).  The main semantic rule for local types like lambdas is that they're supposed to be the "same type" in all translation units that see the same function body, meaning (among many other things) that the ABI properties of the type must be the same.  Now, most function bodies can appear in only a single translation unit, so this rule is trivially satisfied and the compiler has total flexibility.  But the body of an inline or template function can usually appear in multiple translation units, and so the ABI has to be consistent.
>
> "Subfunction" representations are very useful for coroutine-esque situations where an inner function's control flow interactions with the outer function are essentially well-understood.  That is, to be useful, you really want the blocks of the inner function to fit into the CFG of the outer function at a unique and appropriate place; otherwise you're just contorting the representation of the outer function in a way that will probably massively hamper optimization.
>
> I can definitely see how there might be many situations in OpenMP that have this kind of CFG knowledge.  Swift has some situations like this as well.  Unfortunately, C++ lambdas and ObjC blocks are not a case where we have this knowledge, because the closure can be passed around and invoked arbitrarily many times and at arbitrary later points, which is not something that can ever fit cleanly into the CFG; pre-emptive "outlining" is just a superior representation for such cases.

This is a very good point. We only know that the lambda will execute,
and perhaps execute multiple times, after some given point. The same is
true (to some extent) for some OpenMP features. For example, we can
generate OpenMP tasks in a function, and we know only that they'll run
sometime before the end of the containing parallel region, taskwait
directive, etc. This might be outside the scope of the function.

One interesting question is: To what extent does this affect our ability to
use analysis (AA, SCEV, etc.) of the variables at the point of capture to
optimize the body of the "subfunction." The aliasing properties of
pointers, for example, and the bounds of captured induction variables,
should be fine. However, if there's anything that might be rendered
invalid by some later change in state (e.g. a write to memory), that
wouldn't be usable. We also need to make sure that pointers
appropriately appear to be captured (just as they would be if we used an
outlined function representation). Luckily, I suppose, we need to do
that anyway.
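
A small illustration of that distinction (assuming, for the example only, that
p and q are known not to alias at the capture point):

int example(int *p, int *q) {      // assume analysis proves p and q don't alias here
  *p = 0;
  auto body = [p] { return *p; };  // aliasing facts about p still hold inside body
  *q = 7;                          // cannot change *p, so that fact remains usable
  *p = 1;                          // but the value fact "*p == 0" is now invalid
  return body();                   // must return 1; folding in the stale 0 would be wrong
}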

Thanks again,
Hal

>
> John.

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: Lambdas and the ABI?

John McCall via cfe-dev
> On Mar 21, 2017, at 7:24 PM, Hal Finkel <[hidden email]> wrote:
> This is a very good point. We only know that the lambda will execute, and perhaps execute multiple times, after some given point. The same is true (to some extent) for some OpenMP features. For example, we can generate OpenMP tasks in a function, and we know only that they'll run sometime before the end of the containing parallel region, taskwait directive, etc. This might be outside the scope of the function.
>
> One interesting question is: To what extent does this affect our ability to use analysis (AA, SCEV, etc.) of the variables at the point of capture to optimize the body of the "subfunction." The aliasing properties of pointers, for example, and the bounds of captured induction variables, should be fine. However, if there's anything that might be rendered invalid by some later change in state (e.g. a write to memory), that wouldn't be usable. We also need to make sure that pointers appropriately appear to be captured (just as they would be if we used an outlined function representation). Luckily, I suppose, we need to do that anyway.

Right.  I would say that there are three interesting levels of closure-capture optimization:

1. We don't know when the inner function will be called; the outer function's frame may not still be intact.  This is generally true of blocks and lambdas because they can be copied off the stack.  If the block/lambda captures a reference to something in the outer function, U.B. might let us assume that the frame still exists, but I think only if the block/lambda actually uses that captured reference.  Richard, let us know if there's some special rule that would help us here.

It's hard to see what optimizations would be possible here.  Maybe, if we controlled the copying process, we could delay copying captured values into the block/lambda temporary, and instead just access them directly in the inner function?  But the inner function would have to be compiled two ways, one for when the variables have been copied and one when they're still in-place.  This is a lot of extra complexity.
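
A minimal C++ sketch of case 1 (the function name is made up): the closure escapes, and only an actual use of the by-reference capture touches the enclosing frame:

#include <functional>

std::function<int()> make_escaping() {
  int counter = 0;
  // The closure captures 'counter' by reference and is then copied off the
  // stack into the std::function, outliving make_escaping's frame.
  return [&counter] {
    // Calling this after make_escaping returns uses a dangling reference (UB);
    // that use is the only thing that could justify assuming the frame is live.
    return ++counter;
  };
}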

2. We don't know exactly when or how often the inner function will be called, but there's a definite point in the outer function past which the inner function will no longer be used.  This is true of some OpenMP features, "noescape" blocks in ObjC, and noescape blocks/closures in Swift.

We can optimize this by eliminating copies and indirections: instead of aggressively copying values and local addresses into the block/lambda, we can just store the enclosing frame pointer.  If a capture is semantically "by reference", i.e. changes to it in the inner function are reflected in the outer and vice-versa, then we have to temporarily pin it into memory and treat its address as having escaped; we can then just access it relative to the captured frame pointer.  If a capture is semantically "by value", i.e. it copies the value of a variable at a specific point, then we just have to keep that current value live for a while in the outer frame; in many cases, we can prove that neither function will modify it, so this can safely degrade to the by-reference implementation.
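
A sketch of what case 2 can look like at the source level (apply_to_each and its never-stores-the-closure contract are hypothetical):

#include <vector>

// Hypothetical contract: 'f' is only invoked during the call and never stored.
template <typename F>
void apply_to_each(std::vector<int> &v, F f) {
  for (int &x : v) f(x);
}

int sum_scaled(std::vector<int> &v, int scale) {
  int sum = 0;
  // 'sum' is captured by reference and 'scale' by value, but since the closure
  // never outlives this call, both could in principle be reached through a
  // single pointer to sum_scaled's frame instead of being copied.
  apply_to_each(v, [&sum, scale](int x) { sum += x * scale; });
  return sum;
}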

3. We know exactly when the inner function will be called and it can be fit into the outer function's CFG.  This is true of some OpenMP features and some very specific Swift features.

We can still do normal data-flow and control-flow analyses between the outer and inner functions; the only constraint is that we (probably) have to treat the beginning and end of the inner function as opaque barriers.  So the unlowered function looks something like:

  %initial_result = call %initial_result_t @begin_subfunction(%initial_args_t %initial_args)
  // subfunction goes here
  %final_result = call %final_result_t @end_subfunction(%final_args_t %final_args)

and eventually that gets lowered into something like:

  %final_result = call %final_result_t @do_with_subfunction(%initial_args, @subfunction)

  define %final_args_t @subfunction(%initial_result_t %initial_result) {
    // subfunction goes here
    ret %final_args_t %final_args
  }

with data flow in and out handled basically like in case #2.  This is one class of thing that people want to be able to do with the coroutine proposal.

John.

Re: Lambdas and the ABI?

Reid Kleckner via cfe-dev
On Tue, Mar 21, 2017 at 9:01 PM, John McCall via cfe-dev <[hidden email]> wrote:
3. We know exactly when the inner function will be called and it can be fit into the outer function's CFG.  This is true of some OpenMP features and some very specific Swift features.

We can still do normal data-flow and control-flow analyses between the outer and inner functions; the only constraint is that we (probably) have to treat the beginning and end of the inner function as opaque barriers.  So the unlowered function looks something like:

I'd also throw in that our implementation of MSVC EH is a good example of this type of inner function outlining in LLVM. The token-producing block leader and terminator instructions create single-entry single-exit regions that we can outline into funclets as part of CodeGen.

By keeping the funclets in the normal CFG of the parent function, we are able to SROA aggregates with destructors and do things like heap-to-stack conversion of std::unique_ptr.

#include <memory>
void may_throw(int);
void foo() {
  std::unique_ptr<int> p(new int(42));
  may_throw(*p);
}

Getting back to Hal's original question, I think that these kinds of optimizations are impossible for C++ lambdas. They frequently appear in headers and it's really easy to leak the class type with decltype and auto, and that class definition had better be the same across all TUs.
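
For instance, a minimal sketch (the header is hypothetical) of how little it takes for the closure type to leak out of its defining function:

// pred.h -- hypothetical header included by many TUs
inline auto is_positive() { return [](int x) { return x > 0; }; }

using Positive = decltype(is_positive());
// 'Positive' names the closure type; it can now appear in signatures, as a
// template argument, in class members, etc., so its definition (layout,
// special members) has to be identical in every TU that includes this header.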


Re: Lambdas and the ABI?

John McCall via cfe-dev
On Mar 22, 2017, at 1:31 PM, Reid Kleckner <[hidden email]> wrote:
On Tue, Mar 21, 2017 at 9:01 PM, John McCall via cfe-dev <[hidden email]> wrote:
3. We know exactly when the inner function will be called and it can be fit into the outer function's CFG.  This is true of some OpenMP features and some very specific Swift features.

We can still do normal data-flow and control-flow analyses between the outer and inner functions; the only constraint is that we (probably) have to treat the beginning and end of the inner function as opaque barriers.  So the unlowered function looks something like:

I'd also throw in that our implementation of MSVC EH is a good example of this type of inner function outlining in LLVM. The token-producing block leader and terminator instructions create single-entry single-exit regions that we can outline into funclets as part of CodeGen.

Very good point.

By keeping the funclets in the normal CFG of the parent function, we are able to SROA aggregates with destructors and do things like heap-to-stack conversion of std::unique_ptr.

#include <memory>
void may_throw(int);
void foo() {
  std::unique_ptr<int> p(new int(42));
  may_throw(*p);
}

Getting back to Hal's original question, I think that these kinds of optimizations are impossible for C++ lambdas. They frequently appear in headers and it's really easy to leak the class type with decltype and auto, and that class definition had better be the same across all TUs.

It's okay for optimizations to only apply to "special cases", especially when those special cases are likely to be dominant in practice.  Optimizing lambdas would absolutely be worthwhile even if we could not apply the optimizations to lambdas in inline functions.

John.


Re: Lambdas and the ABI?

Richard Smith via cfe-dev
On 21 March 2017 at 21:01, John McCall <[hidden email]> wrote:

Right.  I would say that there are three interesting levels of closure-capture optimization:

1. We don't know when the inner function will be called; the outer function's frame may not still be intact.  This is generally true of blocks and lambdas because they can be copied off the stack.  If the block/lambda captures a reference to something in the outer function, U.B. might let us assume that the frame still exists, but I think only if the block/lambda actually uses that captured reference.  Richard, let us know if there's some special rule that would help us here.

Any use of a non-reference captured by reference would require the captured variable to still exist, and so we can in principle use a pointer to the outer function's stack frame for those captures (but not by-copy captures nor reference captures of references). But we can't assume that the outer function's stack frame will be active when such a lambda is called, only when such a capture is used.
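
A quick sketch of the three capture kinds being distinguished (names invented):

void capture_kinds(int value, int &bound_elsewhere) {
  int local = 0;
  auto f = [local, &value, &bound_elsewhere] {
    // 'local' is captured by copy: no connection to the enclosing frame.
    // '&value' captures a non-reference by reference: any use requires this
    //   frame to still exist, so it could be reached via a frame pointer.
    // '&bound_elsewhere' is a reference capture of a reference: it refers to
    //   whatever the reference was bound to, which may not be in this frame.
    return local + value + bound_elsewhere;
  };
  f();
}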

It's hard to see what optimizations would be possible here.  Maybe, if we controlled the copying process, we could delay copying captured values into the block/lambda temporary, and instead just access them directly in the inner function?  But the inner function would have to be compiled two ways, one for when the variables have been copied and one when they're still in-place.  This is a lot of extra complexity.

One possibly-interesting optimization would be constant propagation of captures: we statically know at the point where we emit the construction of the lambda and its body what all the uses of each capture are, so this isn't the normal somewhat-unbounded problem of constant propagation through class members.
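
A small sketch of that idea (illustrative names only):

#include <functional>

int call_through_erasure(const std::function<int(int)> &f) { return f(10); }

int capture_constant() {
  const int scale = 4;  // value is known at the point the closure is built
  auto g = [scale](int x) { return x * scale; };
  // Every use of the 'scale' capture inside g is visible right here, so the
  // constant could be propagated into the body (making it x * 4) without the
  // open-ended analysis normally needed for stores to a class member.
  return call_through_erasure(g);
}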
 



Re: Lambdas and the ABI?

John McCall via cfe-dev
On Mar 22, 2017, at 2:41 PM, Richard Smith <[hidden email]> wrote:
On 21 March 2017 at 21:01, John McCall <[hidden email]> wrote:
Right.  I would say that there are three interesting levels of closure-capture optimization:

1. We don't know when the inner function will be called; the outer function's frame may not still be intact.  This is generally true of blocks and lambdas because they can be copied off the stack.  If the block/lambda captures a reference to something in the outer function, U.B. might let us assume that the frame still exists, but I think only if the block/lambda actually uses that captured reference.  Richard, let us know if there's some special rule that would help us here.

Any use of a non-reference captured by reference would require the captured variable to still exist, and so we can in principle use a pointer to the outer function's stack frame for those captures (but not by-copy captures nor reference captures of references). But we can't assume that the outer function's stack frame will be active when such a lambda is called, only when such a capture is used.

Right, that's what I was afraid of.  That does add an extra obstacle for optimizing lambda representations; I wonder how much of a problem it would be in practice.

It's hard to see what optimizations would be possible here.  Maybe, if we controlled the copying process, we could delay copying captured values into the block/lambda temporary, and instead just access them directly in the inner function?  But the inner function would have to be compiled two ways, one for when the variables have been copied and one when they're still in-place.  This is a lot of extra complexity.

One possibly-interesting optimization would be constant propagation of captures: we statically know at the point where we emit the construction of the lambda and its body what all the uses of each capture are, so this isn't the normal somewhat-unbounded problem of constant propagation through class members.

Yeah.  There are also a number of possible minor optimizations where one value is known to be re-derivable from another.

John.
 


