CUDA Dynamic Parallelism


Eric Fiselier via cfe-dev
Hi everyone,

I have been planning to implement CUDA dynamic parallelism in clang for
a while now. I recently found some time to dig into clang's code in
order to grasp how clang handles calls to kernels (thanks to Justin
Lebar and Artem Belevich for pointing me to the most important
locations!). I was wondering whether any experts on the CUDA code on
this list would be willing to help me get a better understanding and
clear up any misconceptions I might have.


Allow me to summarize how it works right now. It appears that host-side
kernel calls work something like this:

During parsing of function calls, when the <<< >>> syntax is
encountered, a node is built that calls cudaConfigureCall with the
execution configuration parameters. Then a CUDAKernelCallExpr is built
from the remaining function parameters, and the CallExpr for
cudaConfigureCall is passed into that node.
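
For illustration (a toy example of my own, not code from clang), a
launch like this:

   __global__ void kernel( int a, float b );

   void launch_from_host() {
       // The values in <<< >>> become the arguments of cudaConfigureCall;
       // the parenthesized arguments become the CUDAKernelCallExpr's args.
       kernel<<< dim3(16), dim3(256), 0, 0 >>>( 1, 2.0f );
   }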

During CodeGen, the CUDAKernelCallExpr is translated into something like
this:

   if ( cudaConfigureCall( .... ) == cudaSuccess )
   {
       <call to stub for the specific kernel referenced by the call expr>
   }

The stub for the kernel looks something like this:

   __host__ <something> kernel( arg0, arg1, ... argn ) {
       if( cudaSetupArgument( <stuff for arg0> ) == cudaSuccess ) {
           :
           if( cudaSetupArgument( <stuff for argn> ) == cudaSuccess ) {
               cudaLaunch( <kernelName> );
           }
           :
       }
   }
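
To make that concrete, here is a hand-written sketch of what the stub
could look like for the toy kernel above (the stub name, the argument
offsets, and the error handling are my guesses; clang emits the real
thing directly as IR):

   void __device_stub__kernel( int a, float b ) {
       // Copy each argument into the pending launch that cudaConfigureCall
       // set up; offsets follow the natural alignment of the parameters.
       if( cudaSetupArgument( &a, sizeof(a), 0 ) == cudaSuccess )
           if( cudaSetupArgument( &b, sizeof(b), 4 ) == cudaSuccess )
               // The handle must match what was registered with
               // __cudaRegisterFunction for this kernel.
               cudaLaunch( (const void *)&__device_stub__kernel );
   }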

Did I miss anything important?

On a side note, it would also seem that each kernel call is parsed twice
(once with CUDAIsDevice set to true, once with it set to false).



For a device-side kernel call, I would have to construct something like
this instead (see e.g. the CUDA Dynamic Parallelism Programming Guide
[1]), with the first four parameters being the execution configuration:

   __device__ <something> kernel( gridDim, blockDim, sharedMem, stream,
                                  arg0, arg1, ..., argn )
   {
       <compute overall size and alignment for storing args to a buffer>

       void * buf = cudaGetParameterBuffer( <alignment>, <size> );

       <copy args into buffer at proper offset and alignment>

       cudaLaunchDevice( <kernel name>, buf, gridDim, blockDim, sharedMem,
                         stream );
   }
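
And a hand-written device-side equivalent for the same toy kernel might
look like this (cudaGetParameterBuffer and cudaLaunchDevice are the
documented device-runtime entry points [1]; the stub name and the layout
computation are mine):

   __device__ void __device_stub__kernel( dim3 gridDim, dim3 blockDim,
                                          unsigned sharedMem,
                                          cudaStream_t stream,
                                          int a, float b ) {
       // int at offset 0, float at offset 4: total size 8, alignment 4.
       void *buf = cudaGetParameterBuffer( 4, 8 );
       if( buf ) {
           *(int *)buf                 = a;
           *(float *)((char *)buf + 4) = b;
           cudaLaunchDevice( (void *)kernel, buf, gridDim, blockDim,
                             sharedMem, stream );
       }
   }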


It would appear that I cannot set up the execution configuration before
the call to the stub, so I need to pass it into the stub instead. Thus,
I would have to store the expressions for evaluating it in the
CUDAKernelCallExpr instead of in the call to cudaConfigureCall; the
latter would have to be moved into the host stub. The __device__
attribute would also prevent us from calling that overload of the stub
from __host__ __device__ functions (the host version would be changed
to also take the exec config as parameters). I do not know how to mark
the stubs as __host__ or __device__, though.

Would that be a reasonable approach? Would the current implementation
then be able to correctly infer which stub to call based on the
__host__ and __device__ attributes? Is it possible to have the
device-side stub inlined? Function calls tend to cost a lot of
registers in my experience, so actually calling a function like that
might be quite expensive. Also, I would have to check that we are
compiling for at least sm_35 before emitting the device stub at all.

I would appreciate any feedback!

Regards,

  Andre Reichenbach


[1]
https://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf


Re: CUDA Dynamic Parallelism

Eric Fiselier via cfe-dev
> On a side note, it would also seem that each kernel call is parsed twice (once with CUDAIsDevice set to true, once with it set to false).

Yes, in two separate invocations of clang -cc1 -- one for the host, one for the device.
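
(You can see the two jobs by passing -### to the driver; heavily
abbreviated, the output looks roughly like this:)

   $ clang++ -### -c axpy.cu
   "clang" "-cc1" ... "-triple" "nvptx64-nvidia-cuda" "-fcuda-is-device" ...
   ...
   "clang" "-cc1" ... "-triple" "x86_64-unknown-linux-gnu" ...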

The "__device__" would also prevent us from calling that overload of the stub from __host__ __device__ functions (the host version would be changed to also take the exec config in as params).

Specifically, it's OK to call __device__ functions from __host__ __device__ functions, but only if the __host__ __device__ function is never codegen'ed for the host.  Infrastructure for doing this already exists, so you shouldn't have to do anything.
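
E.g., a toy illustration:

   __device__ int device_only( int x ) { return 2 * x; }

   __host__ __device__ int wrapper( int x ) {
       // A "wrong-side" call: only diagnosed if wrapper is ever
       // codegen'ed for the host.
       return device_only( x );
   }

   __global__ void kernel( int *out ) { *out = wrapper( 21 ); }  // fine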

> I do not know how to mark the stubs as __host__ or __device__ though.

It's __attribute__((device)), aka CUDADeviceAttr.  See e.g. SemaDecl.cpp.
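
A minimal sketch of applying it to a synthesized stub declaration
(hypothetical helper; the exact CreateImplicit signature varies between
clang versions):

   #include "clang/AST/Attr.h"
   #include "clang/Sema/Sema.h"

   // Hypothetical: mark a generated stub FunctionDecl as __device__.
   static void markStubAsDevice( clang::Sema &S, clang::FunctionDecl *Stub ) {
       if( !Stub->hasAttr<clang::CUDADeviceAttr>() )
           Stub->addAttr( clang::CUDADeviceAttr::CreateImplicit( S.Context ) );
   }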

> Would the current implementation then be able to correctly infer which stub to call based on __host__ and __device__ attributes?

Yes.  These attributes are part of the function signature, and clang chooses which function to call based on (somewhat convoluted) overloading rules.  It should DTRT.
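
I.e., with a pair of target-overloaded stubs, each caller resolves to
its own side (toy example):

   __host__   int which() { return 0; }
   __device__ int which() { return 1; }

   __global__ void kernel( int *out ) { *out = which(); }  // __device__ overload
   int host_caller()                  { return which(); }  // __host__ overload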

> Is it possible to have the device side stub be inlined?

Yes, LLVM will do that for you automatically if it's profitable.  If it turns out it's not inlining the function when it should, I'd recommend we look into fixing the heuristics rather than slapping an always_inline on the function.

> Also, I would have to check whether we are compiling with at least sm_35 before emitting the device stub at all.

Right.
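
(For comparison, user code expresses the same constraint with an
architecture guard; clang would gate emission of the device stub on the
target SM version the same way:)

   #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 350
   #error "device-side kernel launch requires sm_35 or newer"
   #endif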

> Would that be a reasonable approach?

SGTM!
