[RFC] ASM Goto With Output Constraints

Re: [llvm-dev] [RFC] ASM Goto With Output Constraints

Cameron McInally via cfe-dev
On 6/28/19 9:20 PM, Bill Wendling wrote:
On Fri, Jun 28, 2019 at 5:39 PM Finkel, Hal J. <[hidden email]> wrote:

On 6/28/19 5:35 PM, James Y Knight via llvm-dev wrote:

On Fri, Jun 28, 2019 at 5:53 PM Bill Wendling <[hidden email]> wrote:
On Fri, Jun 28, 2019 at 1:48 PM James Y Knight <[hidden email]> wrote:
On Fri, Jun 28, 2019 at 3:00 PM Bill Wendling <[hidden email]> wrote:
On Thu, Jun 27, 2019 at 1:44 PM Bill Wendling <[hidden email]> wrote:
On Thu, Jun 27, 2019 at 1:29 PM James Y Knight <[hidden email]> wrote:
I think this is fine, except that it stops at the point where things actually start to get interesting and tricky.

How will you actually handle the flow of values from the callbr into the error blocks? A callbr can specify requirements on where its outputs live. So, what if two callbr, in different branches of code, specify _different_ constraints for the same output, and list the same block as a possible error successor? How can the resulting phi be codegened?

This is where I fall back on the statement about how "the programmer knows what they're doing". Perhaps I'm being too cavalier here? My concern, if you want to call it that, is that we not be too restrictive on the new behavior. For example, the "asm goto" may set a register to an error value (made up on the spot; may not be a common use). But if there's no real reason to have the value be valid on the abnormal path, then sure, we can declare that it's not valid on the abnormal path.

I think I should explain my "programmer knows what they're doing" statement a bit better. I'm specifically referring to inline asm here. The more general "callbr" case may still need to be considered (see Reid's reply).

When a programmer uses inline asm, they're implicitly telling the compiler that they *do* know what they're doing (I know this is common knowledge, but I wanted to reiterate it). In particular, either they need to reference an instruction not readily available from the compiler (e.g. "cpuid") or the compiler isn't able to give them the needed performance in a critical section. I'm extending this sentiment to callbr with output constraints. Let's take your example below and write it as "normal" asm statements, one on each branch of an if-then-else (please ignore any syntax errors):

if:
  br i1 %cmp, label %true, label %false

true:
  %0 = call { i32, i32 } asm sideeffect "poetry $0, $1", "={r8},={r9}" ()
  br label %end

false:
  %1 = call { i32, i32 } asm sideeffect "poetry2 $0, $1", "={r10},={r11}" ()
  br label %end

end:
  %vals = phi { i32, i32 } [ %0, %true ], [ %1, %false ]

How is this handled in codegen? Is it an error or does the back-end handle it? Whatever's done today for "normal" inline asm is what I *think* should be the behavior for the inline asm callbr variant. If this doesn't seem sensible (and I realize that I may be thinking of an "in a perfect world" scenario), then we'll need to come up with a more sensible solution which may be to disallow the values on the error block until we can think of a better way to handle them.

This example is no problem, because instructions can be emitted between what's emitted by "call asm" and the end of the block (be it a fallthrough or a jump instruction). What gets emitted there is a move of the output register to another location -- either a register or the stack. Therefore, at the beginning of the "end" block, "%vals" is always in a consistent location, no matter how you got to that block.

But in the callbr case, there is not a location at which those moves can be emitted, after the callbr, before the jump to "error".
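
To make the difference concrete, here is a minimal C model of the ordinary-asm case. It is a sketch, not real codegen: the `regs` array and the `R8`, `R10`, and `CANON` slots are hypothetical stand-ins for physical registers, and the constants stand in for whatever the asm would produce. Each branch copies its output into one agreed location before reaching the merge point, which is the step a callbr has no room for on its indirect edge.

```c
/* Hypothetical register file: the names are illustrative only. */
enum { R8, R10, CANON, NREGS };

/* Models two plain "call asm" statements whose outputs meet in a phi. */
static int merge_after_plain_asm(int cond) {
    int regs[NREGS] = {0};
    if (cond) {
        regs[R8] = 11;             /* the asm writes its output to r8 */
        regs[CANON] = regs[R8];    /* copy emitted after the asm, before the branch */
    } else {
        regs[R10] = 22;            /* the other asm writes its output to r10 */
        regs[CANON] = regs[R10];   /* same copy, different source register */
    }
    return regs[CANON];            /* the "end" block reads one agreed location */
}
```

With a callbr, the jump to the shared block can happen from inside the asm itself, so neither copy above would be guaranteed to run.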

I see what you mean. Let's say we create a pseudo-instruction (similar to landingpad, et al) that needs to be lowered by the backend in a reasonable manner. The EH stuff has an external process/library that performs the actual unwinding and which sets the values accordingly. We won't have this.

 
What we could do instead is split the edges and insert the copy-to-<wherever> statements there.

Exactly -- except that doing that is potentially an invalid transform, because the address is being used as a value, not simply a jump target. The label list is just a list of _possible_ jump targets; changing those won't actually affect anything. You'd instead need to change the blockaddress constant, but in the general case you don't know where that address came from (and it may therefore be required that you have the same address for two separate callbr instructions).

I guess this kinda touches on some of the same issues as in the other discussion about the handling of the blockaddress in callbr and inlining, etc...

I wonder if we could put some validity restrictions on the IR structure, rather than trying to fix things up after the fact by attempting to split blocks. E.g., we could state that it's invalid to have a phi which uses the value defined by a callbr, if it's conditioned on that same block as predecessor.  That is: it's valid to use _other_ values defined in the block ending in callbr, because they can be moved prior to the callbr. It's also valid to use the value defined by the callbr in a phi conditioned on some other intermediate block as predecessor, because then any required moves can happen in the intermediate block.

I believe such an IR restriction should be sufficient to make it possible to emit valid code from the IR in all cases, but I'm a bit afraid of how badly adding such odd edge-cases might screw up the rest of the compiler and optimizer.

That may be a reasonable restriction to place on the code.
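
As a rough illustration of the restriction James describes, the check could look like the following C sketch over a toy IR. The `struct value` and `struct phi_incoming` types and the `incoming_is_valid` helper are hypothetical and exist only to state the rule; they are not LLVM APIs.

```c
#include <stdbool.h>

/* Toy IR, hypothetical: just enough structure to state the rule. */
struct value {
    int def_block;            /* id of the block that defines the value */
    bool defined_by_callbr;   /* true if the value is a callbr result */
};

struct phi_incoming {
    const struct value *val;  /* incoming value */
    int pred_block;           /* predecessor block for this incoming edge */
};

/* A phi may not take a callbr result directly from the callbr's own block. */
static bool incoming_is_valid(const struct phi_incoming *in) {
    if (!in->val->defined_by_callbr)
        return true;          /* other values can be moved before the callbr */
    return in->pred_block != in->val->def_block;
}
```

An incoming edge that passes through an intermediate block is accepted, because that block gives the backend a place to emit the required moves.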

Allow me to wildly speculate a bit. What I would like to have happen is to generate assembly akin to this:

Lasm.goto.dest:           ; The original blockaddress destination.
Lasm.goto.dest.bb1:
  mov %..., %...
  jmp Lasm.goto.dest.body
Lasm.goto.dest.bb2:
  mov %..., %...
  jmp Lasm.goto.dest.body ; This would be elided, of course.
Lasm.goto.dest.body:
  ...


This preserves the blockaddress value. If we create a new instruction, let's say `indirectval' (a horrible name, but used for this example), it could save us from having to deal with edge splitting. It could take values similar to a phi node:


  <val> = indirectval <ty> [%v1, label %bb1], [%v2, label %bb2]


where %v1 and %v2 are from callbr instructions. When we are converting the IR into machine instructions, we can generate something similar to the example above:


asm.goto.dest:
  BR asm.goto.dest.bb1

asm.goto.dest.bb1:
  MOV ...
  BR asm.goto.dest.body

asm.goto.dest.bb2:
  MOV ...
  BR asm.goto.dest.body

asm.goto.dest.body:
  ...


The one issue is the precise instruction to add to access the values. Perhaps they could be inserted directly before the indirectval inst.


I don't understand how you can perform the renaming in general. Can't the block address to which the code jumps be provided as a pointer value (presumably generated using the labels-as-values extension)?



I think that your fear is justified.

In any case, if we're going to support forming this kind of callbr in Clang, then Clang still needs a place to put the stack stores after the inline asm in order to represent the output constraints - which are specified in terms of source-level variables and those are always in stack locations when Clang is generating IR. I think that we can make all of this work if we say that the output constraints, and thus the outputs of the callbr, dominate only uses on the normal "fallthrough" branch. Then the compiler has a single place to put the stores (and, later, a place to put register copies, etc.).

Hal, are you saying that values should not be used on indirect branches?


Yes.

 -Hal



-bw 
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Re: [llvm-dev] [RFC] ASM Goto With Output Constraints

Cameron McInally via cfe-dev


On Fri, Jun 28, 2019 at 7:27 PM Finkel, Hal J. <[hidden email]> wrote:
I don't understand how you can perform the renaming in general. Can't the block address to which the code jumps be provided as a pointer value (presumably generated using the labels-as-values extension)?

I made an error and essentially just restated my other proposal, but introduced an instruction instead of splitting edges (I blame the lack of a sufficient amount of caffeine). The problem still persists though.

How to control the extraction of values in the uncommon branch:

1. Simply don't allow values to be used there. I'm not a fan, because this seems too restrictive.
2. Allow the values to be used, but require that if two callbr instructions point to the same uncommon block, the values used in that block must have compatible (identical?) output constraints. This seems like a nice compromise. However, it does put the onus back on the programmer to ensure correctness.
3. Do something similar to ARM32's EH handling, where a stack slot is allocated and a value is inserted into it before each callbr. Then we could branch according to that value. It's workable, but introduces stack stores/loads which aren't ideal.
4. James's suggestion above about the valid use of callbr return values.
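
For option 2, a compatibility check might be sketched in C as below. This assumes the strictest reading, "identical" constraint strings; the `struct callbr_site` type and `constraints_compatible` helper are made up for illustration and do not correspond to anything in LLVM.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical summary of one callbr: its output constraints and the
   id of the uncommon block it may jump to. */
struct callbr_site {
    const char *out_constraints;  /* e.g. "={r8},={r9}" */
    int indirect_target;          /* id of the shared uncommon block */
};

/* Two callbrs conflict only if they share an uncommon block and their
   output constraints differ. */
static bool constraints_compatible(const struct callbr_site *a,
                                   const struct callbr_site *b) {
    if (a->indirect_target != b->indirect_target)
        return true;              /* no shared block, nothing to reconcile */
    return strcmp(a->out_constraints, b->out_constraints) == 0;
}
```

A looser definition of "compatible" (for example, same register class) would need more than a string comparison.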

-bw


Re: [RFC] ASM Goto With Output Constraints

Cameron McInally via cfe-dev
On Fri, Jun 28, 2019 at 3:35 PM James Y Knight <[hidden email]> wrote:
I wonder if we could put some validity restrictions on the IR structure, rather than trying to fix things up after the fact by attempting to split blocks. E.g., we could state that it's invalid to have a phi which uses the value defined by a callbr, if it's conditioned on that same block as predecessor.  That is: it's valid to use _other_ values defined in the block ending in callbr, because they can be moved prior to the callbr. It's also valid to use the value defined by the callbr in a phi conditioned on some other intermediate block as predecessor, because then any required moves can happen in the intermediate block.

I believe such an IR restriction should be sufficient to make it possible to emit valid code from the IR in all cases, but I'm a bit afraid of how badly adding such odd edge-cases might screw up the rest of the compiler and optimizer.


I want to revisit this. Here are the situations we're confronted with:

  1. The goto-target can be jumped to by 1 callbr instruction,
  2. The goto-target can be jumped to by N callbr instructions, which don't need a PHI node, and
  3. The goto-target can be jumped to by N callbr instructions, which *do* need a PHI node.

I'm going to plug the instruction I created out of thin air a few emails back, but explain it better (I'm using an instruction instead of an intrinsic because we want that instruction to be right after all non-PHI instructions in the goto-target block). I'm _not_ suggesting we need this instruction. It's just for demonstration purposes.


Situations (1) and (2) don't encounter a problem. Any value used in the goto-target can be handled by inserting the code to extract that value in the goto-target block:


bb1:
  ...
  %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough1 [label %goto.target]

fallthrough1:
  ...

bb2:
  ...
  %y.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 %y, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough2 [label %goto.target]

fallthrough2:
  ...

goto.target:
  %x.goto = <extract value from %x.bb1>
  %y.goto = <extract value from %y.bb2>
  ... <uses of %x.goto and %y.goto> ...


This leaves situation (3), which is far more complex, as we've seen. To reiterate, the issue here is that we need to extract the values returned by callbr. This would typically be done with a PHI node, but LLVM may want to split critical edges or push the calculation back to the predecessor block. That won't work with the callbr asm, because it could branch out of the asm at any point, thus skipping the extraction. So we can't use PHI nodes for these values. There are three classes of solutions to this:

  1. Don't allow the values to be used in goto-targets, or
  2. Allow them, but with significant restrictions, or
  3. Allow them without using PHI nodes.

Each has its benefits and drawbacks. As I've stated before, I think that (1) is too restrictive, but if we can't come up with a good solution, it may be our only option. Solution (2) could be a good compromise. However, I want to propose a potential solution to (3).

The core of my proposal is to replace the PHI node with code that will replicate its behavior without code lowering trying to modify the CFG (at least not in ways that may invalidate the asm). Here is example code:


bb1:
  store i8* blockaddress(@bar, %bb1), i8** %src
  %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough1 [label %goto.target]

fallthrough1:
  ...

bb2:
  store i8* blockaddress(@bar, %bb2), i8** %src
  %x.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough2 [label %goto.target]

fallthrough2:
  ...

goto.target:
  %x1 = indirectval i8** %src, i32 [%x.bb1, %bb1], [%x.bb2, %bb2]
  <extract values from %x1>
  ...


This can be lowered to this:

bb1:
  store i8* blockaddress(@bar, %bb1), i8** %src
  %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough1 [label %goto.target]

fallthrough1:
  ...

bb2:
  store i8* blockaddress(@bar, %bb2), i8** %src
  %x.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough2 [label %goto.target]

fallthrough2:
  ...

goto.target:
  %src1 = load i8**, i8*** @src
  %src.bb = load i8*, i8** %src1
  switch i64 %src.bb, label %goto.target.body [ ; or if-then-else blocks
      i64 ptrtoint i8* blockaddress(@bar, %bb1) to i64, label %goto.target.bb1
      i64 ptrtoint i8* blockaddress(@bar, %bb2) to i64, label %goto.target.bb2
  ]

goto.target.bb1:
  %x1 = <extract value from %x.bb1>
  br label %goto.target.body

goto.target.bb2:
  %x2 = <extract value from %x.bb2>
  br label %goto.target.body

goto.target.body:
  %x.merge = phi i32 [ %x1, %goto.target.bb1 ], [ %x2, %goto.target.bb2 ]
  ...


With this, we don't change any values used by the callbr instructions, and the return values are extracted correctly. This has the unsavory issue of using stores and loads, but this may be the price we need to pay.
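
The control flow of that lowering can be mimicked in plain C as a sanity check. This is only a model under stated assumptions: `src` plays the role of the stored block address, the `jump` flag stands in for the asm deciding to branch to the goto-target, and the constants 10 and 20 stand in for the asm outputs. None of these names come from the proposal itself.

```c
/* Tags standing in for blockaddress(@bar, %bb1) and blockaddress(@bar, %bb2). */
enum src_block { SRC_NONE = 0, SRC_BB1 = 1, SRC_BB2 = 2 };

/* Returns the merged value on the goto-target path, or a negative
   marker when the modelled asm falls through instead. */
static int run(int take_bb1, int jump) {
    enum src_block src = SRC_NONE;
    int x_bb1 = 0, x_bb2 = 0, x_merge;

    if (take_bb1) {
        src = SRC_BB1;           /* store emitted before the callbr */
        x_bb1 = 10;              /* stand-in for the asm output in bb1 */
        if (jump) goto goto_target;
        return -1;               /* fallthrough1 */
    } else {
        src = SRC_BB2;           /* store emitted before the callbr */
        x_bb2 = 20;              /* stand-in for the asm output in bb2 */
        if (jump) goto goto_target;
        return -2;               /* fallthrough2 */
    }

goto_target:
    switch (src) {               /* dispatch on the recorded source block */
    case SRC_BB1: x_merge = x_bb1; break;   /* goto.target.bb1 */
    case SRC_BB2: x_merge = x_bb2; break;   /* goto.target.bb2 */
    default:      x_merge = 0;     break;   /* not reachable in this model */
    }
    return x_merge;              /* goto.target.body */
}
```

The extraction happens after the jump has landed, so nothing has to be inserted on the edge out of the asm; the cost is the extra store on every path into a callbr.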


Thoughts?


-bw

Re: [llvm-dev] [RFC] ASM Goto With Output Constraints

Cameron McInally via cfe-dev


On 7/1/19 1:38 PM, Bill Wendling via llvm-dev wrote:
I want to revisit this. Here are the situations we're confronted with:

  1. The goto-target can be jumped to by 1 callbr instruction,
  2. The goto-target can be jumped to by N callbr instructions, which don't need a PHI node, and
  3. The goto-target can be jumped to by N callbr instructions, which *do* need a PHI node.

I'm going to plug the instruction I created out of thin air a few emails back, but explain it better (I'm using an instruction instead of an intrinsic because we want that instruction to be right after all non-PHI instructions in the goto-target block). I'm _not_ suggesting we need this instruction. It's just for demonstration purposes.


Situations (1) and (2) don't encounter a problem. Any value used in the goto-target can be handled by inserting the code to extract that value in the goto-target block:


bb1:
  ...
  %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough1 [label %goto.target]

fallthrough1:
  ...

bb2:
  ...
  %y.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 %y, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough2 [label %goto.target]

fallthrough2:
  ...

goto.target:
  %x.goto = <extract value from %x.bb1>
  %y.goto = <extract value from %y.bb2>
  ... <uses of %x.goto and %y.goto> ...


This leaves situation (3), which is far more complex, as we've seen. To reiterate, the issue here is that we need to extract the values returned by callbr. This would typically be done by using a PHI node, but LLVM may want to split critical edges or push the calculation back to the predecessor block, which won't work with the callbr asm, because it could branch out of the asm at any point, thus skipping the extraction. So we can't use PHI nodes for these values. There are three classes of solutions to this:

  1. Don't allow the values to be used in goto-targets, or
  2. Allow them, but with significant restrictions, or
  3. Allow them without using PHI nodes.

Each has its benefits and drawbacks. As I've stated before, I think that (1) is too restrictive, but if we can't come up with a good solution, it may be our only option. Solution (2) could be a good compromise. However, I want to propose a potential solution to (3).

The core of my proposal is to replace the PHI node with code that will replicate its behavior without code lowering trying to modify the CFG (at least not in ways that may invalidate the asm). Here is example code:


bb1:
  store i8* blockaddress(@bar, %bb1), i8** %src
  %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough1 [label %goto.target]

fallthrough1:
  ...

bb2:
  store i8* blockaddress(@bar, %bb2), i8** %src
  %x.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough2 [label %goto.target]

fallthrough2:
  ...

goto.target:
  %x1 = indirectval i8** %src, i32 [%x.bb1, %bb1], [%x.bb2, %bb2]
  <extract values from %x1>
  ...


This can be lowered to this:

bb1:
  store i8* blockaddress(@bar, %bb1), i8** %src
  %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough1 [label %goto.target]

fallthrough1:
  ...

bb2:
  store i8* blockaddress(@bar, %bb2), i8** %src
  %x.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8* blockaddress(@bar, %goto.target))
          to label %fallthrough2 [label %goto.target]

fallthrough2:
  ...

goto.target:
  %src.bb = load i8*, i8** %src
  %src.int = ptrtoint i8* %src.bb to i64
  switch i64 %src.int, label %goto.target.body [ ; or if-then-else blocks
      i64 ptrtoint (i8* blockaddress(@bar, %bb1) to i64), label %goto.target.bb1
      i64 ptrtoint (i8* blockaddress(@bar, %bb2) to i64), label %goto.target.bb2
  ]

goto.target.bb1:
  %x1 = <extract value from %x.bb1>
  br label %goto.target.body

goto.target.bb2:
  %x2 = <extract value from %x.bb2>
  br label %goto.target.body

goto.target.body:
  %x.merge = phi i32 [ %x1, %goto.target.bb1 ], [ %x2, %goto.target.bb2 ]
  ...


With this, we don't change any values used by the callbr instructions, and the return values are extracted correctly. This has the unsavory issue of using stores and loads, but this may be the price we need to pay.
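
To make the control flow of that lowering easier to follow, here is a rough model of it in portable C. This is only a sketch under assumptions: the asm goto edge is simulated with an ordinary goto, the enum tags stand in for the blockaddress values stored in %src, and the names run_lowered, take_bb1 and jump are made up for illustration. It says nothing about how registers would actually be assigned.

```c
/* Hypothetical tags standing in for blockaddress(@bar, %bb1) and
 * blockaddress(@bar, %bb2); not part of the proposal itself. */
enum src_block { SRC_NONE = 0, SRC_BB1, SRC_BB2 };

/*
 * take_bb1 selects which predecessor block runs; jump says whether the
 * simulated asm branches to goto_target instead of falling through.
 * Returns the value seen at goto_target, or -1 on a fallthrough path.
 */
int run_lowered(int take_bb1, int jump, int x_bb1, int x_bb2)
{
    enum src_block src = SRC_NONE;   /* models the %src alloca */
    int x_merge;

    if (take_bb1) {
        src = SRC_BB1;               /* store emitted before the callbr */
        if (jump)
            goto goto_target;        /* simulated asm goto edge */
        return -1;                   /* fallthrough1 */
    } else {
        src = SRC_BB2;
        if (jump)
            goto goto_target;
        return -1;                   /* fallthrough2 */
    }

goto_target:
    switch (src) {                   /* dispatch on the recorded predecessor */
    case SRC_BB1: x_merge = x_bb1; break;   /* goto.target.bb1 */
    case SRC_BB2: x_merge = x_bb2; break;   /* goto.target.bb2 */
    default:      x_merge = 0;     break;   /* no callbr recorded */
    }
    return x_merge;                  /* goto.target.body */
}
```

For instance, run_lowered(0, 1, 10, 20) takes the bb2 path, jumps, and picks up bb2's value at the target. The point of the model is that the selection happens entirely inside the target, so nothing has to be inserted on the callbr edge.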


Thoughts?


The non-fallthrough blocks can have other predecessors, right? If so, I imagine that you need to also do the following:

 1. Store zero (or -1 or some other distinguishable value) into the %src alloca in the entry block.

 2. Store this same distinguishable value into the %src alloca after the "value extraction" is performed.

 3. Include this distinguishable value in the switch statement.
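
A small C model of those three steps, purely as a sketch: the loop stands in for control reaching the goto target repeatedly, EDGE_PLAIN stands in for an ordinary (non-callbr) predecessor, and sum_arrivals, the enums, and the fallback parameter are hypothetical names invented for this illustration.

```c
/* Hypothetical tags; SRC_NONE is the distinguishable "no callbr" value. */
enum src_block { SRC_NONE = 0, SRC_BB1, SRC_BB2 };
enum edge { EDGE_BB1, EDGE_BB2, EDGE_PLAIN };

/*
 * Each element of edges is one arrival at the goto target. Arrivals from a
 * simulated callbr contribute that callbr's output; arrivals from an
 * ordinary predecessor contribute fallback. Returns the sum.
 */
int sum_arrivals(const enum edge *edges, int n,
                 int x_bb1, int x_bb2, int fallback)
{
    enum src_block src = SRC_NONE;   /* step 1: initialise in the entry block */
    int total = 0;

    for (int i = 0; i < n; i++) {
        switch (edges[i]) {
        case EDGE_BB1:   src = SRC_BB1; break;  /* store before callbr in bb1 */
        case EDGE_BB2:   src = SRC_BB2; break;  /* store before callbr in bb2 */
        case EDGE_PLAIN: break;                 /* ordinary edge: no store */
        }

        /* goto target: the switch now has a case for the sentinel (step 3) */
        switch (src) {
        case SRC_BB1:  total += x_bb1;    break;
        case SRC_BB2:  total += x_bb2;    break;
        case SRC_NONE: total += fallback; break;
        }

        src = SRC_NONE;              /* step 2: reset after the extraction */
    }
    return total;
}
```

Without the reset in step 2, a plain arrival that follows a callbr arrival would wrongly reuse the stale tag and pick up the earlier callbr's value.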

While Clang does not normally insert phi nodes, in this case perhaps the problem is self-contained enough for this to be reasonable. However, I'm not sure that this is worthwhile. This is a performance feature generally, and if the user really wants to use these outputs, are they going to want the extra expense of the branches and jump tables and all of the rest of it? Maybe in the common case the extraction blocks will be trivial and get merged, but the default case will still be problematic?

 -Hal


-bw

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [llvm-dev] [RFC] ASM Goto With Output Constraints

Cameron McInally via cfe-dev
On Mon, Jul 1, 2019 at 6:25 PM Finkel, Hal J. <[hidden email]> wrote:

On 7/1/19 1:38 PM, Bill Wendling via llvm-dev wrote:

The non-fallthrough blocks can have other predecessors, right? If so, I imagine that you need to also do the following:

Good point! 

 1. Store zero (or -1 or some other distinguishable value) into the %src alloca in the entry block.

It should be a null value, as that's not a valid block address. Then again, if we use the "switch" instruction the default branch should suffice. We will probably want to reset the value after the callbr values are extracted.

 2. Store this same distinguishable value into the %src alloca after the "value extraction" is performed.

 3. Include this distinguishable value in the switch statement.

While Clang does not normally insert phi nodes, in this case perhaps the problem is self-contained enough for this to be reasonable. However, I'm not sure that this is worthwhile. This is a performance feature generally, and if the user really wants to use these outputs, are they going to want the extra expense of the branches and jump tables and all of the rest of it? Maybe in the common case the extraction blocks will be trivial and get merged, but the default case will still be problematic?

There are ways to avoid the branches, et al, mostly by writing the code in the form of situations (1) and (2) by using lead-in blocks:

true_branch:
  goto body;

false_branch:
  goto body;

body:
  <use of common values here>
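
Spelled out a little more as a C model (a sketch only: the asm goto edges are simulated with plain gotos, and run_lead_in, take_bb1 and jump are names invented for the illustration), each simulated asm gets its own lead-in label, so every goto target has exactly one callbr predecessor and the merge happens in ordinary code:

```c
/*
 * take_bb1 selects which block runs; jump says whether the simulated asm
 * branches to its lead-in label. Returns the value seen in body, or -1 on
 * a fallthrough path.
 */
int run_lead_in(int take_bb1, int jump, int x_bb1, int x_bb2)
{
    int x;

    if (take_bb1) {
        if (jump)
            goto true_branch;    /* only this asm can reach true_branch */
        return -1;               /* fallthrough1 */
    } else {
        if (jump)
            goto false_branch;   /* only this asm can reach false_branch */
        return -1;               /* fallthrough2 */
    }

true_branch:
    x = x_bb1;                   /* extract the first asm's output */
    goto body;

false_branch:
    x = x_bb2;                   /* extract the second asm's output */
    goto body;

body:
    return x;                    /* use of the common value */
}
```

Since the lead-in blocks end in normal branches, any copies needed for the merge can be placed there, which is the situation (1) shape described earlier.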

-bw 


Re: [llvm-dev] [RFC] ASM Goto With Output Constraints

Cameron McInally via cfe-dev
On 7/2/19 12:42 AM, Bill Wendling wrote:
On Mon, Jul 1, 2019 at 6:25 PM Finkel, Hal J. <[hidden email]> wrote:

On 7/1/19 1:38 PM, Bill Wendling via llvm-dev wrote:

The non-fallthrough blocks can have other predecessors, right? If so, I imagine that you need to also do the following:

Good point! 

 1. Store zero (or -1 or some other distinguishable value) into the %src alloca in the entry block.

It should be a null value, as that's not a valid block address. Then again, if we use the "switch" instruction the default branch should suffice. We will probably want to reset the value after the callbr values are extracted.


Actually, I'm not sure this works because the code in "<extract value from %x.bb1>" must be dominated by %x.bb1, and in this case it's not.


 2. Store this same distinguishable value into the %src alloca after the "value extraction" is performed.

 3. Include this distinguishable value in the switch statement.

While Clang does not normally insert phi nodes, in this case perhaps the problem is self-contained enough for this to be reasonable. However, I'm not sure that this is worthwhile. This is a performance feature generally, and if the user really wants to use these outputs, are they going to want the extra expense of the branches and jump tables and all of the rest of it? Maybe in the common case the extraction blocks will be trivial and get merged, but the default case will still be problematic?

There are ways to avoid the branches, et al, mostly by writing the code in the form of situations (1) and (2) by using lead-in blocks:

true_branch:
  goto body;

false_branch:
  goto body;

body:
  <use of common values here>


One specific concern is that if it turns out that the non-fallthrough-block is not reachable except via callbrs, then you'd want the optimizer to be able to eliminate the checks for the "null value" case.

 -Hal



-bw 
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
