[RFC] Late (OpenMP) GPU code "SPMD-zation"


[RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
Where we are
------------

Currently, when we generate OpenMP target offloading code for GPUs, we
use sufficient syntactic criteria to decide between two execution modes:
  1)      SPMD -- All target threads (in an OpenMP team) run all the code.
  2) "Guarded" -- The master thread (of an OpenMP team) runs the user
                  code. If an OpenMP distribute region is encountered,
                  that is, if all threads (in the OpenMP team) are
                  supposed to execute the region, the master wakes up
                  the idling worker threads and points them to the
                  correct piece of code for distributed execution.

For a variety of reasons we (generally) prefer the first execution mode.
However, depending on the code, that might not be valid, or we might
simply not be able to tell during the Clang code generation phase.

The implementation of the "guarded" execution mode follows roughly the
state machine description in [1], though the implementation is different
(more general) nowadays.


What we want
------------

Increase the amount of code executed in SPMD mode and the use of
lightweight "guarding" schemes where appropriate.


How we (could) get there
------------------------

We propose the following two modifications in order:

  1) Move the state machine logic into the OpenMP runtime library. That
     means in SPMD mode all device threads will start the execution of
     the user code, thus emerge from the runtime, while in guarded mode
     only the master will escape the runtime and the other threads will
     idle in their state machine code that is now just "hidden".

     Why:
     - The state machine code cannot be (reasonably) optimized anyway;
       moving it into the library shouldn't hurt runtime performance and
       might even improve compile time a little bit.
     - The change should also simplify the Clang code generation as we
       would generate structurally the same code for both execution modes
       but only the runtime library calls, or their arguments, would
       differ between them.
     - The reason we should not "just start in SPMD mode" and "repair"
       it later is simple: this way we always have semantically correct
       and executable code.
     - Finally, and most importantly, there is then only a small
       difference (see above) between the two modes in the code
       generated by Clang. If we later analyze the code to decide
       whether we can use SPMD mode instead of guarded mode, the
       analysis and transformation become much simpler.

 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
    e.g., through the runtime library calls used, and that tries to
    convert it into the SPMD mode potentially by introducing lightweight
    guards in the process.

    Why:
    - After the inliner, and the canonicalizations, we have a clearer
      picture of the code that is actually executed in the target
      region and all the side effects it contains. Thus, we can make an
      educated decision on the required amount of guards that prevent
      unwanted side effects from happening after a move to SPMD mode.
    - At this point we can more easily introduce different schemes to
      avoid side effects by threads that were not supposed to run. We
      can decide if a state machine is needed, conditionals should be
      employed, masked instructions are appropriate, or "dummy" local
      storage can be used to hide the side effect from the outside
      world.


None of this has been implemented yet, but we plan to start in the
immediate future. Any comments, ideas, and criticism are welcome!


Cheers,
  Johannes


P.S. [2-4] provide further information on the implementation and features.

[1] https://ieeexplore.ieee.org/document/7069297
[2] https://dl.acm.org/citation.cfm?id=2833161
[3] https://dl.acm.org/citation.cfm?id=3018870
[4] https://dl.acm.org/citation.cfm?id=3148189


--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

[hidden email]

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev


-------------
Best regards,
Alexey Bataev
22.01.2019 13:17, Doerfert, Johannes Rudolf wrote:
[...]

     - Finally, and most importantly, there is now only little
       difference (see above) between the two modes in the code
       generated by clang. If we later analyze the code trying to decide
       if we can use SPMD mode instead of guarded mode the analysis and
       transformation becomes much simpler.

The last item is wrong, unfortunately. A lot of things in the codegen depend on the execution mode, e.g., correct support of data sharing. Of course, we can try to generalize the codegen and rely completely on the runtime, but the performance is going to be very poor.

We still need static analysis in the compiler. I agree that it is better to move this analysis to the backend, at least after inlining, but at the moment it is not possible. We need support for late outlining, which will allow us to implement better detection of SPMD constructs and improve performance.


[...]




Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
Could you elaborate on what you are referring to w.r.t. data sharing? What do we currently do in the Clang code generation that we could not effectively implement in the runtime, potentially with the support of an LLVM pass?

Thanks,
  James


From: Alexey Bataev <[hidden email]>
Sent: Tuesday, January 22, 2019 12:34:01 PM
To: Doerfert, Johannes Rudolf; [hidden email]
Cc: [hidden email]; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
 


[...]




Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev

The globalization of local variables, for example. It must be implemented in the compiler, not in the runtime, to get good performance.


-------------
Best regards,
Alexey Bataev
22.01.2019 13:43, Doerfert, Johannes Rudolf wrote:
[...]




Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
We could still do that in Clang, couldn't we?


From: Alexey Bataev <[hidden email]>
Sent: Tuesday, January 22, 2019 12:52:42 PM
To: Doerfert, Johannes Rudolf; [hidden email]
Cc: [hidden email]; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
 

[...]




Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
But we need to know the execution mode, SPMD or "guarded".

-------------
Best regards,
Alexey Bataev
22.01.2019 13:54, Doerfert, Johannes Rudolf wrote:
[...]



_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
We would still know that. We can do exactly the same reasoning as we do now.

I think the important question is, how different is the code generated for either mode and can we hide (most of) the differences in the runtime.


If I understand you correctly, you say the data sharing code looks very different and the differences cannot be hidden, correct?

It would be helpful for me to understand your point if you could give me a piece of OpenMP code for which the data sharing in SPMD mode and "guarded" mode are as different as possible. I can compile it in both modes myself, so high-level OpenMP is fine (I will disable SPMD mode manually in the source if necessary).


Thanks,

  Johannes





From: llvm-dev <[hidden email]> on behalf of Alexey Bataev via llvm-dev <[hidden email]>
Sent: Tuesday, January 22, 2019 13:10
To: Doerfert, Johannes Rudolf
Cc: Alexey Bataev; LLVM-Dev; Arpith Chacko Jacob; [hidden email]; [hidden email]
Subject: Re: [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
 
But we need to know the execution mode, SPMD or "guarded".

-------------
Best regards,
Alexey Bataev
On 22.01.2019 13:54, Doerfert, Johannes Rudolf wrote:
We could still do that in clang, couldn't we?


From: Alexey Bataev [hidden email]
Sent: Tuesday, January 22, 2019 12:52:42 PM
To: Doerfert, Johannes Rudolf; [hidden email]
Cc: [hidden email]; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
 

The globalization of local variables, for example. It must be implemented in the compiler to get good performance, not in the runtime.


-------------
Best regards,
Alexey Bataev
On 22.01.2019 13:43, Doerfert, Johannes Rudolf wrote:
Could you elaborate on what you refer to w.r.t. data sharing? What do we currently do in the Clang code generation that we could not effectively implement in the runtime, potentially with the support of an LLVM pass?

Thanks,
  Johannes



Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev

No, we don't. We need to perform a different kind of analysis for SPMD-mode constructs and for non-SPMD ones.

For SPMD mode we need to globalize only reduction/lastprivate variables. For non-SPMD mode, we need to globalize all private/local variables that may escape their declaration context in the construct.

-------------
Best regards,
Alexey Bataev

Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev

What do you refer to with "No, we don't"?


Again, I do not propose to remove the SPMD "detection" in Clang. We will still identify SPMD mode based on the syntactic criteria we have now.

The Clang analysis is also not affected. Thus, we will globalize/localize the same variables as we do now; I don't see why this should be any different.





Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev

After an IRC discussion, I think Alexey and I are pretty much in agreement (on the general feasibility at least).


I try to sketch the proposed idea again below, as the initial RFC was simply not descriptive enough. After that, I briefly summarize how I see these changes being developed and committed so that we

 - never have any regressions, and

 - can make an educated decision before removing any existing code.



What we want to do:

The intermediate goal is that the code generated by clang for the SPMD and non-SPMD (earlier denoted as "guarded") cases is conceptually/structurally very similar. The current non-SPMD code is, however, a state machine generated into the user code module. This state machine is very hard to analyze and optimize. If the code looked like the SPMD code but *behaved the same way it does now*, we could "easily" switch from the non-SPMD to the SPMD version after a (late) analysis determined legality. To make the code look the same but behave differently, we propose to hide the semantic difference in the runtime library calls. That is, the runtime calls emitted in the two modes are (slightly) different, or there is a flag which indicates the (initial) mode. If that mode is SPMD, the runtime behavior does not change compared to the way it is now. If that mode is non-SPMD, the runtime would separate the master and worker threads, as we do now in the user code module, and keep the workers in an internal state machine waiting for the master to provide them with work. Only the master would return from the runtime call, and the mechanism to distribute work to the worker threads would (for now) stay the same.





Preliminary implementation (and integration) steps:

1) Design and implement the necessary runtime extensions and determine feasibility.

2) Allow the Clang codegen to use the new runtime extensions if explicitly chosen by the user.

2b) Performance comparison of the unoptimized new code path vs. the original code path on test cases and real use cases.

3) Implement the middle-end pass to analyze and optimize the code using the runtime extensions.

3b) Performance comparison of the optimized new code path vs. the original code path on real use cases.

4) If no conceptual problem was found and 2b)/3b) determined that the new code path is superior, switch to the new code path by default.

5) If no regressions/complaints are reported after a grace period, remove the old code path from the clang front-end.




Again, this is an early design RFC for which I welcome any feedback!


Thanks,

  Johannes



From: Doerfert, Johannes Rudolf
Sent: Tuesday, January 22, 2019 1:50:51 PM
To: Alexey Bataev
Cc: [hidden email]; [hidden email]
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
 

What do you refer to with: "No, we don't". 


Again, I do not propose to remove the SPMD "detection" in Clang. We will still identify SPMD mode based on the syntactic criteria we have now.

The Clang analysis is also not affected.  Thus, we will globalize/localize the same variables as we do now. I don't see why this should be any different.




From: llvm-dev <[hidden email]> on behalf of Alexey Bataev via llvm-dev <[hidden email]>
Sent: Tuesday, January 22, 2019 1:46:39 PM
To: Doerfert, Johannes Rudolf
Cc: llvm-dev; [hidden email]; [hidden email]
Subject: Re: [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
 

No, we don't. We need to perform the different kind of the analysis for SPMD mode constructs and Non-SPMD.

For SPMD mode we need to globalize only reduction/lastprivate variables. For Non-SPMD mode, we need to globalize all the private/local variables, that may escape their declaration context in the construct.

-------------
Best regards,
Alexey Bataev
22.01.2019 14:29, Doerfert, Johannes Rudolf пишет:
We would still know that. We can do exactly the same reasoning as we do now.

I think the important question is, how different is the code generated for either mode and can we hide (most of) the differences in the runtime.


If I understand you correctly, you say the data sharing code looks very different and the differences cannot be hidden, correct?

It would be helpful for me to understand your point if you could give me a piece of OpenMP for which the data sharing in SPMD mode and "guarded"

mode are as different as possible. I can compile it in both modes myself so high-level OpenMP is fine (I will disable SPMD mode manually in the source if necessary).


Thanks,

  Johannes





From: llvm-dev [hidden email] on behalf of Alexey Bataev via llvm-dev [hidden email]
Sent: Tuesday, January 22, 2019 13:10
To: Doerfert, Johannes Rudolf
Cc: Alexey Bataev; LLVM-Dev; Arpith Chacko Jacob; [hidden email]; [hidden email]
Subject: Re: [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
 
But we need to know the execution mode, SPMD or "guarded"

-------------
Best regards,
Alexey Bataev
22.01.2019 13:54, Doerfert, Johannes Rudolf пишет:
We could still do that in clang, couldn't we?


From: Alexey Bataev [hidden email]
Sent: Tuesday, January 22, 2019 12:52:42 PM
To: Doerfert, Johannes Rudolf; [hidden email]
Cc: [hidden email]; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
 

The globalization for the local variables, for example. It must be implemented in the compiler to get the good performance, not in the runtime.


-------------
Best regards,
Alexey Bataev
22.01.2019 13:43, Doerfert, Johannes Rudolf пишет:
Could you elaborate on what you refer to wrt data sharing. What do we currently do in the clang code generation that we could not effectively implement in the runtime, potentially with support of an llvm pass.

Thanks,
  James


From: Alexey Bataev [hidden email]
Sent: Tuesday, January 22, 2019 12:34:01 PM
To: Doerfert, Johannes Rudolf; [hidden email]
Cc: [hidden email]; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
 


-------------
Best regards,
Alexey Bataev
22.01.2019 13:17, Doerfert, Johannes Rudolf пишет:
Where we are
------------

Currently, when we generate OpenMP target offloading code for GPUs, we
use sufficient syntactic criteria to decide between two execution modes:
  1)      SPMD -- All target threads (in an OpenMP team) run all the code.
  2) "Guarded" -- The master thread (of an OpenMP team) runs the user
                  code. If an OpenMP distribute region is encountered, thus
                  if all threads (in the OpenMP team) are supposed to
                  execute the region, the master wakes up the idling
                  worker threads and points them to the correct piece of
                  code for distributed execution.

For a variety of reasons we (generally) prefer the first execution mode.
However, depending on the code, that might not be valid, or we might
just not know if it is in the Clang code generation phase.

The implementation of the "guarded" execution mode follows roughly the
state machine description in [1], though the implementation is different
(more general) nowadays.


What we want
------------

Increase the amount of code executed in SPMD mode and the use of
lightweight "guarding" schemes where appropriate.


How we get (could) there
------------------------

We propose the following two modifications in order:

  1) Move the state machine logic into the OpenMP runtime library. That
     means in SPMD mode all device threads will start the execution of
     the user code, thus emerge from the runtime, while in guarded mode
     only the master will escape the runtime and the other threads will
     idle in their state machine code that is now just "hidden".

     Why:
     - The state machine code cannot be (reasonably) optimized anyway,
       moving it into the library shouldn't hurt runtime but might even
       improve compile time a little bit.
     - The change should also simplify the Clang code generation as we
       would generate structurally the same code for both execution modes
       but only the runtime library calls, or their arguments, would
       differ between them.
     - The reason we should not "just start in SPMD mode" and "repair"
       it later is simple, this way we always have semantically correct
       and executable code.
     - Finally, and most importantly, there is now only little
       difference (see above) between the two modes in the code
       generated by clang. If we later analyze the code trying to decide
       if we can use SPMD mode instead of guarded mode the analysis and
       transformation becomes much simpler.

The last item is wrong, unfortunately. A lot of things in the codegen depend on the execution mode, e.g. correct support of the data-sharing. Of course, we can try to generalize the codegen and rely completely on the runtime, but the performance is going to be very poor.

We still need static analysis in the compiler. I agree that it is better to move this analysis to the backend, at least after inlining, but at the moment it is not possible. We need support for late outlining, which will allow us to implement better detection of SPMD constructs and improve performance.

 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
    e.g., through the runtime library calls used, and that tries to
    convert it into the SPMD mode potentially by introducing lightweight
    guards in the process.

    Why:
    - After the inliner, and the canonicalizations, we have a clearer
      picture of the code that is actually executed in the target
      region and all the side effects it contains. Thus, we can make an
      educated decision on the required amount of guards that prevent
      unwanted side effects from happening after a move to SPMD mode.
    - At this point we can more easily introduce different schemes to
      avoid side effects by threads that were not supposed to run. We
      can decide if a state machine is needed, conditionals should be
      employed, masked instructions are appropriate, or "dummy" local
      storage can be used to hide the side effect from the outside
      world.


None of this has been implemented yet, but we plan to start in the
immediate future. Any comments, ideas, or criticism are welcome!


Cheers,
  Johannes


P.S. [2-4] provide further information on implementation and features.

[1] https://ieeexplore.ieee.org/document/7069297
[2] https://dl.acm.org/citation.cfm?id=2833161
[3] https://dl.acm.org/citation.cfm?id=3018870
[4] https://dl.acm.org/citation.cfm?id=3148189



_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
Hi Doru,

[+ llvm-dev and cfe-dev]

On 01/30, Gheorghe-Teod Bercea wrote:
> Hi Johannes,
>
> First of all thanks for looking into the matter of improving non-SPMD mode!
>
> I have a question regarding the state machine that you said you'd like to
> replace/improve. There are cases (such as target regions that span multiple
> compilation units) where the switch statement is required. Is this something
> that your changes will touch in any way?

There will not be a difference. Let me explain in some detail as there
seems to be a lot of confusion on this state machine topic:

Now:

Build a state machine in the user code (module) with all the parallel
regions as explicit targets of the switch statement and a fallback
default that does an indirect call to the requested parallel region.


Proposed, after Clang:

Use the runtime state machine implementation [0] which reduces the
switch to the default case, thus an indirect call to the requested
parallel region. This will always work, regardless of the translation
unit that contained the parallel region (pointer).

Proposed, after OpenMP-Opt pass in LLVM (assuming SPMD wasn't achieved):

All reachable parallel regions in a kernel are collected and used to
create the switch statement in the user code (module) [1, line 111] with
a fallback if there are potentially [1, line 212] hidden parallel
regions.


Does that make sense?


[0] https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz
[1] https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B


> My next question is, for the workloads which are in the same compilation unit
> there is a trick that code gen performs (or could perform I'm not sure if this
> has been upstreamed) where it can check for the specific name of an outlined
> function and then just call it directly thus making that function inline-able
> (thus erasing most if not all the overhead of having the state machine in the
> first place). In other words the "worst" part of the switch statement will only
> apply to outlined functions from other compilation units. With this in mind
> what would the impact of your changes be in the end? If this part isn't clear I
> can do some digging to find out how this actually works in more details it's
> been too long since I've had to look at this part.
See the answer above.


> Can you share some performance numbers given an example you have been looking
> at? I see you have one that uses "#pragma omp atomic". I would avoid using
> something like that since it may have other overheads not related to your
> changes. I would put together an example with this directive structure:
>
> #pragma omp target teams distribute
> for(...){
>   <code1>
>   #pragma omp parallel for
>   for(...) {
>     <code2>
>   }
>   <code3>
> }
>
> which forces the use of the master-worker scheme (non-SPMD mode) without any
> other distractions.
I used the atomic stuff to determine correctness. I haven't yet looked
at performance. I will do so now and inform you of my results.


> It would then be interesting to understand how you plan to change the LLVM code
> generated for this,

The examples show how the LLVM-IR is supposed to look, right?

> what the overheads that you're targeting are (register usage,
> synchronization cost etc), and then what the performance gain is
> compared to the current scheme.

I can also compare register usage in addition to performance but there
is no difference in synchronization. The number and (relative) order of
original runtime library calls stays the same. The number of user code
-> runtime library calls is even decreased.


Please let me know if this helps and what questions remain.

Thanks,
  Johannes


 

> From:        "Doerfert, Johannes" <[hidden email]>
> To:        Alexey Bataev <[hidden email]>
> Cc:        Guray Ozen <[hidden email]>, Gheorghe-Teod Bercea
> <[hidden email]>, "[hidden email]"
> <[hidden email]>, "Finkel, Hal J." <[hidden email]>,
> "[hidden email]" <[hidden email]>, "[hidden email]"
> <[hidden email]>
> Date:        01/30/2019 04:14 PM
> Subject:        Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
>
>
> I don't really see "many ifs and maybes", actually none.
>
> Anyway, I will now work on a patch set that adds the new functionality under a
> cmd flag
> in order to showcase correctness and performance on real code.
>
> If you, or somebody else, have interesting examples, please feel free to point
> me at them.
>
> Thanks,
>   Johannes
>
>
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
> From: Alexey Bataev <[hidden email]>
> Sent: Wednesday, January 30, 2019 2:18:19 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> J.; [hidden email]; [hidden email]
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>  
> Currently, there are too many "if"s and "maybe"s. If you can provide solution
> that does not break anything and does not affect the performance, does not
> require changes in the backend - then go ahead with the patches.
>
> -------------
> Best regards,
> Alexey Bataev
> 30.01.2019 14:49, Doerfert, Johannes      :
> No, SPMD mode will not be affected at all.
>
> The "worse" part is the following:
>   If we inline runtime library calls before the openmp-opt pass had a chance to
> look at the code,
>   we will not have a customized state machine for the __non-SPMD__ case. That
> is, the if-cascade
>   checking the work function pointer is not there.
>
> Avoiding this potential performance decline is actually very easy. While we do
> not have the "inline_late" capability,
> run the openmp-opt pass __before__ the inliner and we will not get "worse"
> code. We might however miss out on
> _new_ non-SPMD -> SPMD transformations.
>
>
> Does that make sense?
>
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
> From: Alexey Bataev <[hidden email]>
> Sent: Wednesday, January 30, 2019 1:44:10 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> J.; [hidden email]; [hidden email]
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>  
> Any "worse" is not a good idea. We need to avoid it. It would be good that the
> new code did not affect the performance, especially for SPMD mode (I think,
> this "worse" will affect exactly SPMD mode, no?)
>
> -------------
> Best regards,
> Alexey Bataev
> 30.01.2019 14:38, Doerfert, Johannes      :
> The LLVM optimization (openmp-opt), which does non-SPMD -> SPMD and custom
> state machine generation, will not fire if
> the __kernel_general_... calls are "missing". Thus if we inline "too early", we
> are "stuck" with the non-SPMD choice (not worse than
> what we have now!) and the default library state machine ("worse" than what we
> have right now). Does that make sense?
>
> The second option described what I want to see us do "later" in order to avoid
> the above scenario and always get both,
> openmp-opt and inlining of the runtime and work functions.
>
>
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
> From: Alexey Bataev <[hidden email]>
> Sent: Wednesday, January 30, 2019 1:25:42 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> J.; [hidden email]; [hidden email]
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>  
> Sorry, did not understand your answer correctly. But you wrote:
> for now, not doing the optimization is just fine.
> What do you mean?
>
> -------------
> Best regards,
> Alexey Bataev
> 30.01.2019 14:23, Doerfert, Johannes      :
> Alexey,
>
> I'm not sure how to interpret "Bad idea!". but I think there is again a
> misunderstanding.
> To help me understand, could you try to elaborate a bit?
>
> To make my last email clear:
> I __do__ want inlining. Both answers to your earlier inlining questions do
> actually assume the runtime library calls __are eventually inlined__,
> that is why I mentioned LTO and the runtime as bitcode.
> .
> Cheers,
>   Johannes
>
>
>
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
> From: Alexey Bataev <[hidden email]>
> Sent: Wednesday, January 30, 2019 1:14:56 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> J.; [hidden email]; [hidden email]
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>  
> Bad idea!
>
> -------------
> Best regards,
> Alexey Bataev
> 30.01.2019 14:11, Doerfert, Johannes      :
> Sure I do. Why do you think I don't?
>
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
> From: Alexey Bataev <[hidden email]>
> Sent: Wednesday, January 30, 2019 1:00:59 PM
> To: Doerfert, Johannes
> Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> J.; [hidden email]; [hidden email]
> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>  
> You don't want to do the inlining?
>
> -------------
> Best regards,
> Alexey Bataev
> 30.01.2019 13:59, Doerfert, Johannes      :
> - for now, not doing the optimization is just fine. The whole idea is that code
> is always valid.
>
>
--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

[hidden email]


Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
Hi Johannes,

Thank you for the explanation.

I think we need to clarify some details about code generation in Clang today:

1. non-SPMD mode, or generic mode, uses the master-worker code gen scheme where the master thread and the worker threads are disjoint sets of threads (when one set runs, the other set is blocked and doesn't participate in the execution):

workers  |  master
====================
BLOCKED  | RUNNING
------- sync -------
RUNNING  | BLOCKED
------- sync -------
BLOCKED  | RUNNING


2. the worker threads, in their RUNNING state above, contain a state machine which chooses the parallel region to be executed. Today this choice happens in one of two ways: explicit targets (where you know what outlined region you are calling and you just call it) and indirect targets (via function pointer set by master thread in one of its RUNNING regions):

workers  |  master
====================
BLOCKED  | RUNNING
------- sync -------
RUNNING  |
 state   | BLOCKED
machine  |
------- sync -------
BLOCKED  | RUNNING


Your intended changes (only target the RUNNING state machine of the WORKERS):
- remove explicit targets from current code gen. (by itself this is a major step back!!)
- introduce a pass in LLVM which will add back the explicit targets.

Can you point out any major improvements this will bring compared to the current state?
From your answer below you mention a lower number of function calls. Since today we inline everything anyway, how does that help?
If you haven't considered performance so far, what led you to propose all these changes?


In SPMD mode all threads execute the same code. Using the notation in the schemes above you can depict this as:

    all threads
====================
      RUNNING

No state machine being used, no disjoint sets of threads. This is as if you're executing CUDA code.

Could you explain what your proposed changes are in this context?
Could you also explain what you mean by "assuming SPMD wasn't achieved"?
Do you expect to write another LLVM pass which will transform the master-worker scheme + state machine into an SPMD scheme?

Thanks,

--Doru








Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
Hi Doru,

maybe I should clarify something I mentioned in an earlier email already
but it seems there are things getting lost in this thread:

  While the prototype replaces code generation parts in Clang, the
  actual patches will add alternative code generation paths, guarded
  under a cmd flag. Once, and obviously only if, everything is in place
  and has been shown to improve the current situation, the default path
  would be switched.


On 01/31, Gheorghe-Teod Bercea wrote:
> Hi Johannes,
>
> Thank you for the explanation.
>
> I think we need to clarify some details about code generation in Clang today:

I'm not really sure why you feel the need to do that but OK.


> 1. non-SPMD mode, or generic mode, uses the master-worker code gen scheme where
> the master thread and the worker threads are disjoint sets of threads (when one
> set runs the other set is blocked and doesn't participate in the execution):
>
> workers  |  master
> ====================
> BLOCKED  | RUNNING
> ------- sync -------
> RUNNING  | BLOCKED
> ------- sync -------
> BLOCKED  | RUNNING
I agree, and for the record, this is not changed by my prototype, see
[1, line 295].

[1] https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz


> 2. the worker threads, in their RUNNING state above, contain a state machine
> which chooses the parallel region to be executed. Today this choice happens in
> one of two ways: explicit targets (where you know what outlined region you are
> calling and you just call it) and indirect targets (via function pointer set by
> master thread in one of its RUNNING regions):
>
> workers  |  master
> ====================
> BLOCKED  | RUNNING
> ------- sync -------
> RUNNING  |
>  state   | BLOCKED
> machine  |
> ------- sync -------
> BLOCKED  | RUNNING
Partially agreed. Afaik, it will always be decided through a function
pointer set by the master thread and communicated to the workers through
the runtime. The workers use a switch, or in fact an if-cascade, to
check if the function pointer points to a known parallel region. If so,
it will be called directly; otherwise, there is the fallback indirect
call through the function pointer.

> Your intended changes (only target the RUNNING state machine of the WORKERS):
> - remove explicit targets from current code gen. (by itself this is a major
> step back!!)
> - introduce a pass in LLVM which will add back the explicit targets.

Simplified but correct. From my perspective this is not a problem
because in production I will always run the LLVM passes after Clang.
Even if you do not run the LLVM passes, the below reasoning might be
enough to convince people to run a similar pass in their respective
pipeline. If that is not enough, we can also keep the Clang state
machine generation around (see the top comment).


> Can you point out any major improvements this will bring compared to the
> current state?

Sure, I'll give you three for now:

[FIRST]
Here is the original motivation from the first RFC mail (in case you
have missed it):

 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
    e.g., through the runtime library calls used, and that tries to
    convert it into the SPMD mode potentially by introducing lightweight
    guards in the process.

    Why:
    - After the inliner, the canonicalizations, dead code elimination,
      code movement [2, Section 7 on page 8], we have a clearer picture
      of the code that is actually executed in the target region and all
      the side effects it contains. Thus, we can make an educated
      decision on the required amount of guards that prevent unwanted
      side effects from happening after a move to SPMD mode.
    - At this point we can more easily introduce different schemes to
      avoid side effects by threads that were not supposed to run. We
      can decide if a state machine is needed, conditionals should be
      employed, masked instructions are appropriate, or "dummy" local
      storage can be used to hide the side effect from the outside
      world.

[2] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf


Let me give you the canonical example that shows the need for this:

  #pragma omp target teams
  {
    foo(i + 0);
    foo(i + 1);
    foo(i + 2);
  }

  void foo(int i) {
  #pragma omp parallel
  ...
  }

The target region can be executed in SPMD mode but we cannot decide that
syntactically when the region is encountered. Agreed?



[SECOND]
Now there are other benefits with regards to the above mentioned state
machine. In the LLVM pass we can analyze the kernel code
interprocedurally and detect all potentially executed parallel regions,
together with a relation between them, and the need for the fallback
case. That means we can build a state machine that __takes control
dependences into account__, __after inlining and dead code elimination
have canonicalized the kernel__.

If inlining and code canonicalization resulted in the following
structure, the state machine we can build late can know that after
section0 the workers will execute section1, potentially multiple times,
before they move on to section3. In today's scheme, this is something we
cannot simply do, causing us to traverse the if-cascade from top to
bottom every time (which grows linearly with the number of parallel
regions).

  if (...) {
    #pragma omp parallel
    section0(...)
    do {
      #pragma omp parallel
      section1(...)
    } while (...)
  }
  #pragma omp parallel
  section3(...)



[THIRD]
Depending on the hardware, we need to make sure, or at least try really
hard, that there is no fallback case in the state machine, which is an
indirect function call. This can be done best at link time which
requires us to analyze the kernel late and modify the state machine at
that point anyway.



> From your answer below you mention a lower number of function calls. Since
> today we inline everything anyway how does that help?

If we inline, it doesn't matter for performance. If we do not inline,
it does. In either case, it helps to simplify middle-end analyses and
transformations that work on kernels. Finally, it prevents us from
wasting compile time looking at the (unoptimizable) state machine of
every target region.

Maybe it is worth asking the opposite question:
  What are the reasons against these general runtime calls that hide the
  complexity we currently emit into the user code module?
[Note that I discuss the only drawback I came up with, a non-customized
state machine, already above.]


> If you haven't considered performance so far how come you're proposing all
> these changes? What led you to propose all these changes?

See above.


> In SPMD mode all threads execute the same code. Using the notation in the
> schemes above you can depict this as:
>
>     all threads
> ====================
>       RUNNING
>
> No state machine being used, no disjoints sets of threads. This is as
> if you're executing CUDA code.

Agreed.


> Could you explain what your proposed changes are in this context?

None, at least after inlining the runtime library calls there is
literally the same code executed before and after the changes.


> Could you also explain what you mean by "assuming SPMD wasn't achieved"?

That is one of the two motivations for the whole change. I explained
that in the initial RFC and again above. The next comment points you to
the code that tries to achieve SPMD mode for inputs that were generated
in the non-SPMD mode (master-worker + state machine) by Clang.


> Do you expect to write another LLVM pass which will transform the
> master-worker scheme + state machine into an SPMD scheme?

I did already, as that was the main motivation for the whole thing.
It is part of the prototype, see [3, line 321].

[3] https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B 


Cheers,
  Johannes


> From:        "Doerfert, Johannes" <[hidden email]>
> To:        Gheorghe-Teod Bercea <[hidden email]>
> Cc:        Alexey Bataev <[hidden email]>, Guray Ozen <[hidden email]>,
> "[hidden email]" <[hidden email]>, "Finkel, Hal J."
> <[hidden email]>, "[hidden email]" <[hidden email]>,
> "[hidden email]" <[hidden email]>, LLVM-Dev
> <[hidden email]>, "[hidden email]" <[hidden email]>
> Date:        01/30/2019 07:56 PM
> Subject:        Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
>
>
> Hi Doru,
>
> [+ llvm-dev and cfe-dev]
>
> On 01/30, Gheorghe-Teod Bercea wrote:
> > Hi Johannes,
> >
> > First of all thanks for looking into the matter of improving non-SPMD mode!
> >
> > I have a question regarding the state machine that you said you'd like to
> > replace/improve. There are cases (such as target regions that span multiple
> > compilation units) where the switch statement is required. Is this something
> > that your changes will touch in any way?
>
There will not be a difference. Let me explain in some detail as there
seems to be a lot of confusion on this state machine topic:
>
> Now:
>
Build a state machine in the user code (module) with all the parallel
regions as explicit targets of the switch statement and a fallback
default that does an indirect call to the requested parallel region.
>
>
> Proposed, after Clang:
>
> Use the runtime state machine implementation [0] which reduces the
> switch to the default case, thus an indirect call to the requested
> parallel region. This will always work, regardless of the translation
> unit that contained the parallel region (pointer).
>
> Proposed, after OpenMP-Opt pass in LLVM (assuming SPMD wasn't achieved):
>
> All reachable parallel regions in a kernel are collected and used to
> create the switch statement in the user code (module) [1, line 111] with
> a fallback if there are potentially [1, line 212] hidden parallel
> regions.
>
>
> Does that make sense?
>
>
> [0] https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz
> [1] https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B
>
>
> > My next question is, for the workloads which are in the same compilation unit
> > there is a trick that code gen performs (or could perform, I'm not sure if this
> > has been upstreamed) where it can check for the specific name of an outlined
> > function and then just call it directly thus making that function inline-able
> > (thus erasing most if not all the overhead of having the state machine in the
> > first place). In other words the "worst" part of the switch statement will only
> > apply to outlined functions from other compilation units. With this in mind
> > what would the impact of your changes be in the end? If this part isn't clear I
> > can do some digging to find out how this actually works in more detail; it's
> > been too long since I've had to look at this part.
>
> See the answer above.
>
>
> > Can you share some performance numbers given an example you have been looking
> > at? I see you have one that uses "#pragma omp atomic". I would avoid using
> > something like that since it may have other overheads not related to your
> > changes. I would put together an example with this directive structure:
> >
> > #pragma omp target teams distribute
> > for(...){
> >   <code1>
> >   #pragma omp parallel for
> >   for(...) {
> >     <code2>
> >   }
> >   <code3>
> > }
> >
> > which forces the use of the master-worker scheme (non-SPMD mode) without any
> > other distractions.
>
> The atomic stuff I used to determine correctness. I haven't yet looked
> at performance. I will do so now and inform you on my results.
>
>
> > It would then be interesting to understand how you plan to change the LLVM
> > code generated for this,
>
> The examples show how the LLVM-IR is supposed to look, right?
>
> > what the overheads that you're targeting are (register usage,
> > synchronization cost etc), and then what the performance gain is
> > compared to the current scheme.
>
> I can also compare register usage in addition to performance but there
> is no difference in synchronization. The number and (relative) order of
> original runtime library calls stays the same. The number of user code
> -> runtime library calls is even decreased.
>
>
> Please let me know if this helps and what questions remain.
>
> Thanks,
>  Johannes
>
>
>
> > From:        "Doerfert, Johannes" <[hidden email]>
> > To:        Alexey Bataev <[hidden email]>
> > Cc:        Guray Ozen <[hidden email]>, Gheorghe-Teod Bercea
> > <[hidden email]>, "[hidden email]"
> > <[hidden email]>, "Finkel, Hal J." <[hidden email]>,
> > "[hidden email]" <[hidden email]>, "[hidden email]"
> > <[hidden email]>
> > Date:        01/30/2019 04:14 PM
> > Subject:        Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> > ________________________________
> >
> >
> >
> > I don't really see "many ifs and maybes", actually none.
> >
> > Anyway, I will now work on a patch set that adds the new functionality under a
> > cmd flag in order to showcase correctness and performance on real code.
> >
> > If you, or somebody else, have interesting examples, please feel free to point
> > me at them.
> >
> > Thanks,
> >   Johannes
> >
> >
> > ________________________________
> >
> > From: Alexey Bataev <[hidden email]>
> > Sent: Wednesday, January 30, 2019 2:18:19 PM
> > To: Doerfert, Johannes
> > Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> > J.; [hidden email]; [hidden email]
> > Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> >  
> > Currently, there are too many "if"s and "maybe"s. If you can provide a
> > solution that does not break anything, does not affect the performance, and
> > does not require changes in the backend - then go ahead with the patches.
> >
> > -------------
> > Best regards,
> > Alexey Bataev
> > 30.01.2019 14:49, Doerfert, Johannes wrote:
> > No, SPMD mode will not be affected at all.
> >
> > The "worse" part is the following:
> >   If we inline runtime library calls before the openmp-opt pass had a chance to
> >   look at the code, we will not have a customized state machine for the
> >   __non-SPMD__ case. That is, the if-cascade checking the work function
> >   pointer is not there.
> >
> > Avoiding this potential performance decline is actually very easy. While we do
> > not have the "inline_late" capability, run the openmp-opt pass __before__ the
> > inliner and we will not get "worse" code. We might however miss out on
> > _new_ non-SPMD -> SPMD transformations.
> >
> >
> > Does that make sense?
> >
> > ________________________________
> >
> > From: Alexey Bataev <[hidden email]>
> > Sent: Wednesday, January 30, 2019 1:44:10 PM
> > To: Doerfert, Johannes
> > Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> > J.; [hidden email]; [hidden email]
> > Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> >  
> > Any "worse" is not a good idea. We need to avoid it. It would be good if the
> > new code did not affect the performance, especially for SPMD mode (I think
> > this "worse" will affect exactly SPMD mode, no?)
> >
> > -------------
> > Best regards,
> > Alexey Bataev
> > 30.01.2019 14:38, Doerfert, Johannes wrote:
> > The LLVM optimization (openmp-opt), which does non-SPMD -> SPMD and custom
> > state machine generation, will not fire if the __kernel_general_... calls are
> > "missing". Thus if we inline "too early", we are "stuck" with the non-SPMD
> > choice (not worse than what we have now!) and the default library state
> > machine ("worse" than what we have right now). Does that make sense?
> >
> > The second option described what I want to see us do "later" in order to avoid
> > the above scenario and always get both: openmp-opt and inlining of the runtime
> > and work functions.
> >
> >
> > ________________________________
> >
> > From: Alexey Bataev <[hidden email]>
> > Sent: Wednesday, January 30, 2019 1:25:42 PM
> > To: Doerfert, Johannes
> > Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> > J.; [hidden email]; [hidden email]
> > Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> >  
> > Sorry, did not understand your answer correctly. But you wrote:
> > for now, not doing the optimization is just fine.
> > What do you mean?
> >
> > -------------
> > Best regards,
> > Alexey Bataev
> > 30.01.2019 14:23, Doerfert, Johannes wrote:
> > Alexey,
> >
> > I'm not sure how to interpret "Bad idea!", but I think there is again a
> > misunderstanding. To help me understand, could you try to elaborate a bit?
> >
> > To make my last email clear:
> > I __do__ want inlining. Both answers to your earlier inlining questions do
> > actually assume the runtime library calls __are eventually inlined__;
> > that is why I mentioned LTO and the runtime as bitcode.
> > Cheers,
> >   Johannes
> >
> >
> >
> > ________________________________
> >
> > From: Alexey Bataev <[hidden email]>
> > Sent: Wednesday, January 30, 2019 1:14:56 PM
> > To: Doerfert, Johannes
> > Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> > J.; [hidden email]; [hidden email]
> > Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> >  
> > Bad idea!
> >
> > -------------
> > Best regards,
> > Alexey Bataev
> > 30.01.2019 14:11, Doerfert, Johannes wrote:
> > Sure I do. Why do you think I don't?
> >
> > ________________________________
> >
> > From: Alexey Bataev <[hidden email]>
> > Sent: Wednesday, January 30, 2019 1:00:59 PM
> > To: Doerfert, Johannes
> > Cc: Guray Ozen; Gheorghe-Teod Bercea; [hidden email]; Finkel, Hal
> > J.; [hidden email]; [hidden email]
> > Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
> >  
> > You don't want to do the inlining?
> >
> > -------------
> > Best regards,
> > Alexey Bataev
> > 30.01.2019 13:59, Doerfert, Johannes wrote:
> > - for now, not doing the optimization is just fine. The whole idea is that
> > code is always valid.
> >
> >
>
> --
>
> Johannes Doerfert
> Researcher
>
> Argonne National Laboratory
> Lemont, IL 60439, USA
>
> [hidden email]
> [attachment "signature.asc" deleted by Gheorghe-Teod Bercea/US/IBM]
>
>
--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

[hidden email]

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

David Blaikie via cfe-dev
Hi Johannes,

Your clarifications helped a lot; having all the details gathered in one place helped me understand better what you are proposing.

Thanks a lot for taking the time to explain.

Thanks,

--Doru



