openmp 4.5 and cuda streams

openmp 4.5 and cuda streams

Alessandro Gabbana via cfe-dev (Oct 30, 2019)
Dear All,

I'm using Clang 9.0.0 to compile a code that offloads some of its sections
to a GPU using the OpenMP target construct.
I also use the nowait clause to overlap the execution of certain kernels
and/or host<->device memory transfers.
However, the NVIDIA profiler shows that when I compile the code with Clang
only one CUDA stream is active, so the execution is serialized.
When I compile with XLC, on the other hand, the kernels are executed
on different streams. I can't tell whether this is the expected
behavior (e.g. the nowait clause is currently not supported)
or whether I'm missing something. I'm using an NVIDIA Tesla P100 GPU and
compiling with the following options:

-target x86_64-pc-linux-gnu -fopenmp
-fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda
-march=sm_60

best wishes

Alessandro
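
For readers who want to reproduce the setup, the pattern described in this message looks roughly like the following minimal sketch. The function name `scale_both` and its arguments are illustrative assumptions, not taken from the poster's code. Each `target` region with `nowait` becomes a deferred target task, so an implementation is allowed, but not required, to run the two kernels and their transfers concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scales two independent arrays in two deferred
   target regions, so a runtime with multi-stream support may overlap them. */
void scale_both(double *a, double *b, size_t n, double s)
{
    assert(a != NULL && b != NULL);

    /* nowait turns each region into a deferred target task; the host
       thread does not block here. */
    #pragma omp target teams distribute parallel for nowait map(tofrom: a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] *= s;

    #pragma omp target teams distribute parallel for nowait map(tofrom: b[0:n])
    for (size_t i = 0; i < n; ++i)
        b[i] *= s;

    /* Wait for both deferred target tasks before the host reads a and b. */
    #pragma omp taskwait
}
```

Profiling a call such as `scale_both(a, b, n, 2.0)` shows whether the two regions land on one stream or on several.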

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: openmp 4.5 and cuda streams

Hal Finkel via cfe-dev (Oct 30, 2019)
[+Ye, Johannes]

I recall that we've also observed this behavior. Ye, Johannes, we had a
work-around and a patch, correct?

  -Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: openmp 4.5 and cuda streams

Alexey Bataev via cfe-dev (Oct 30, 2019)

I don't think it will be very easy. It requires some additional work in libomptarget plus some fixes in Clang itself; otherwise there might be race conditions.

-------------
Best regards,
Alexey Bataev

Re: openmp 4.5 and cuda streams

Hal Finkel via cfe-dev (Oct 30, 2019)
On 10/30/19 1:48 PM, GMail wrote:
>
> I don't think it will be very easy. It requires some additional work
> in libomptarget + some fixes in the clang itself. Otherwise there
> might be some race conditions.
>

Can you be more specific? I thought that the mapping table, etc. were
already appropriately protected.

As a general thought, we should probably have a mode in which the
runtime is compiled with ThreadSanitizer to check for these kinds of things.

Thanks again,

Hal



Re: openmp 4.5 and cuda streams

Alexey Bataev via cfe-dev

Hal, it seems to me that not everything is protected. Some buffers are reused for different kernels, I assume. It's better to ask Alex Eichenberger, who knows more about it; I did not investigate this problem.

As for Clang, we try to reduce the size of the buffers in global memory for the reduction/lastprivate/etc. variables, which may escape their declaration context. These buffers cannot be combined in streams mode; we need to allocate a unique buffer for each kernel. That is not very hard to do, it is just not implemented yet.

-------------
Best regards,
Alexey Bataev
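
To make the kind of variable under discussion concrete, here is a minimal sketch of a target region with a reduction. The function `dot` and its parameters are illustrative assumptions; which variables Clang actually places in a global-memory buffer is an internal code-generation decision and is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: dot product computed on the device.
   The reduction variable `sum` is combined across teams and threads,
   the kind of variable that may need a buffer in global memory. */
double dot(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;

    #pragma omp target teams distribute parallel for \
        reduction(+: sum) map(to: x[0:n], y[0:n]) map(tofrom: sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

If two such kernels shared one reduction buffer and ran at the same time on different streams, they could overwrite each other's partial results, which is why a unique buffer per kernel would be needed.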

Re: openmp 4.5 and cuda streams

Ye Luo via cfe-dev (Oct 31, 2019)
Hi Hal,
My experience with LLVM/Clang so far shows:
1. All target offload is blocking and synchronous, using the default stream. nowait is not supported.
2. All memory transfers invoke cudaMemcpy. There are no async calls.
3. In a past experiment I turned on CUDA_API_PER_THREAD_DEFAULT_STREAM in libomptarget and used multiple host threads, each doing its own blocking synchronous offload. I got it sort of running and saw multiple streams, but the code crashed with memory corruption, probably due to a data race in libomptarget.
Best,
Ye
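
Point 3 can be pictured with the following minimal sketch of the application side of that experiment: several host threads, each issuing its own blocking target region. The function `offload_chunks` and the chunking scheme are illustrative assumptions. Whether the regions end up on separate CUDA streams depends on how libomptarget was built (in the experiment above, with CUDA_API_PER_THREAD_DEFAULT_STREAM turned on), not on anything in this code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits x into nchunks pieces and lets each host
   thread offload one piece with a blocking (no nowait) target region. */
void offload_chunks(double *x, size_t n, int nchunks)
{
    assert(x != NULL && nchunks > 0);

    /* One host thread per chunk; each thread blocks on its own offload. */
    #pragma omp parallel for num_threads(nchunks) schedule(static, 1)
    for (int c = 0; c < nchunks; ++c) {
        size_t lo = n * (size_t)c / (size_t)nchunks;
        size_t hi = n * (size_t)(c + 1) / (size_t)nchunks;
        size_t len = hi - lo;
        double *chunk = x + lo;

        #pragma omp target teams distribute parallel for map(tofrom: chunk[0:len])
        for (size_t i = 0; i < len; ++i)
            chunk[i] += 1.0;
    }
}
```

The chunks are disjoint, so the application code itself has no data race; any corruption seen with this pattern would have to come from shared state inside the runtime.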



Re: openmp 4.5 and cuda streams

Hal Finkel via cfe-dev (Oct 31, 2019)


Thanks, Ye. That's consistent with Alexey's comments.

Is there already a bug open on this? If not, we should open one.

Alexey, how much memory/overhead do the buffer-reuse optimizations in Clang that you mentioned save? Is it worth keeping them in some mode?

 -Hal



Re: openmp 4.5 and cuda streams

Alexey Bataev via cfe-dev (Oct 31, 2019)

Hope to send this message from the main dev e-mail this time :)


Well, about the memory: it depends on the number of kernels you have. All the memory in the kernels that must be globalized is squashed into a union. With streams, we need to use a separate structure for each kernel. Plus, we can no longer use shared memory for this buffer, again because of possible conflicts.


We can add a new compiler option to compile only some files with streams support and use a unique memory buffer for the globalized variables. Plus, some work in libomptarget is required, of course.


-------------
Best regards,
Alexey Bataev
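
The memory trade-off can be illustrated in plain C. This is a minimal sketch with made-up record types (`kernel_a_globals`, `kernel_b_globals`); it does not reproduce Clang's actual generated layout. A union costs as much as its largest member, while separate per-kernel structures cost the sum of all members.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-kernel records of variables that must be globalized. */
struct kernel_a_globals { double partial_sum[32]; };
struct kernel_b_globals { int last_index; float scratch[16]; };

/* Shared layout: one buffer reused by every kernel.
   Only safe if kernels never run at the same time. */
union shared_globals {
    struct kernel_a_globals a;
    struct kernel_b_globals b;
};

/* Per-kernel layout: each kernel owns its storage, so kernels may
   overlap, at the cost of the summed size. */
struct per_kernel_globals {
    struct kernel_a_globals a;
    struct kernel_b_globals b;
};

size_t shared_bytes(void)     { return sizeof(union shared_globals); }
size_t per_kernel_bytes(void) { return sizeof(struct per_kernel_globals); }

/* Bytes the union layout saves compared with the per-kernel layout. */
size_t bytes_saved(void)
{
    assert(per_kernel_bytes() >= shared_bytes());
    return per_kernel_bytes() - shared_bytes();
}
```

With only two kernels the saving is modest, but it grows with every additional kernel whose globalized variables would otherwise share the same buffer.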

Re: openmp 4.5 and cuda streams

Hal Finkel via cfe-dev (Oct 31, 2019)


Do we also need some kind of libomptarget API change in order to communicate the fact that it's allowed to run multiple target regions concurrently?


Thanks again,

Hal




Re: openmp 4.5 and cuda streams

Alexey Bataev via cfe-dev

I'm not sure about the API; most probably just some internal work is required. It's better to ask Alex Eichenberger, who knows more about this.

-------------
Best regards,
Alexey Bataev