[RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev
Hi,

OpenCL spec requires that a pointer should be aligned to at least the pointee type. Therefore, if a device-side enqueued kernel has a local int* argument, it should be aligned to 4 bytes.

Since these buffers in local addr space are allocated by __enqueue_kernel, it needs to know the alignment of these buffers, not just their sizes.

Although such information is not passed to the original OpenCL builtin function enqueue_kernel, it can be obtained by checking the prototype of the block invoke function at compile time.

I would like to create a patch to pass this information to __enqueue_kernel. Otherwise, __enqueue_kernel has to either allocate an unaligned local buffer, which degrades performance, or allocate a local buffer with extra alignment, which wastes memory space.
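A minimal sketch of this trade-off, in plain C rather than OpenCL C. The helper names align_up and layout_local_buffers are hypothetical and do not correspond to anything in Clang or an OpenCL runtime; the sketch only illustrates that packing buffers at their own alignment uses less local memory than padding each one to a worst-case alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical layout helper: place n local buffers back to back, each aligned
   to aligns[i]. Writes each buffer's start into offsets[i] and returns the
   total number of bytes of local memory consumed. */
size_t layout_local_buffers(const size_t *sizes, const size_t *aligns,
                            size_t n, size_t *offsets) {
    size_t cursor = 0;
    for (size_t i = 0; i < n; ++i) {
        cursor = align_up(cursor, aligns[i]);
        offsets[i] = cursor;
        cursor += sizes[i];
    }
    return cursor;
}
```

For three buffers of 4, 16, and 4 bytes, packing at alignments 4, 16, and 4 takes 36 bytes, while forcing a 128-byte alignment on each takes 260 bytes.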

Any comments?

Thanks.

Sam

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev


> OpenCL spec requires that a pointer should be aligned to at least the pointee type.


So a pointer to int16 would be 64 byte aligned? Seems strange though. Can you give me the spec reference?

> Otherwise, __enqueue_kernel has to either allocate an unaligned local buffer, which degrades performance, or allocate a local buffer with extra alignment, which wastes memory space.

Can you explain this in more detail, please?

Cheers,
Anastasia

From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 01 December 2017 19:45
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian
Subject: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions
 


Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

Spec reference:

 

OpenCL v2.0 s6.1.5

The OpenCL compiler is responsible for aligning data items to the appropriate alignment as required by the data type. For arguments to a __kernel function declared to be a pointer to a data type, the OpenCL compiler can assume that the pointee is always appropriately aligned as required by the data type. The behavior of an unaligned load or store is undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in section 6.13.7.

 

s6.2.5

Casting a pointer to a new type represents an unchecked assertion that the address is correctly aligned.

 

The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states

 

A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

 

For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. Passing a pointer aligned to 1 byte may result in undefined behavior. Most hardware can still load from the unaligned memory but will incur a performance hit. If the runtime wants to avoid the performance hit, it has to allocate the buffer at the maximum possible alignment, e.g. 32 bytes, which will result in wasted memory.
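To make the int4 example concrete, here is a small host-side C sketch. int4_like is a hypothetical stand-in for OpenCL's int4 (not the real type), and is_aligned is an illustrative helper. It only shows that an address one byte past a 16-byte boundary fails the 16-byte alignment check that a typed load would rely on.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for OpenCL's int4:
   four 32-bit ints with a 16-byte alignment requirement. */
typedef struct { alignas(16) int32_t s[4]; } int4_like;

/* Returns nonzero if p is a multiple of align (align must be a power of two). */
int is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

Given an int4_like object v, is_aligned(&v, 16) holds, while is_aligned((const char *)&v + 1, 16) does not.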

 

Sam

 

From: Anastasia Stulova [mailto:[hidden email]]
Sent: Friday, December 15, 2017 10:40 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

 


Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

> For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. 


Perhaps I am missing something, but I still don't see anything in the spec that requires pointers themselves to take their alignment from the pointee type. In your example, int4* should be aligned to the pointer size (either 4 or 8 bytes) while int4 should be 16-byte aligned. Clang will set the alignment of load and store operations correctly according to the data types specified in the source code (which is mainly inherited from the C implementation, apart from some special data types like vectors). The arguments passed to kernels are allocated elsewhere, and the OpenCL compiler has no control over this.
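The distinction between the alignment of a pointer object and the alignment of what it points to can be sketched in host-side C. int4_like is again a hypothetical stand-in for OpenCL's int4, and both function names are illustrative only.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for OpenCL's int4. */
typedef struct { alignas(16) int32_t s[4]; } int4_like;

/* Alignment of the pointer object itself (typically 4 or 8 bytes). */
size_t pointer_alignment(void) { return alignof(int4_like *); }

/* Alignment required of the memory the pointer refers to. */
size_t pointee_alignment(void) { return alignof(int4_like); }
```

pointee_alignment() is 16 here, whereas pointer_alignment() depends only on the platform's pointer representation.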

Regarding enqueued kernels: as far as I understand, you are suggesting adding block argument alignment info to the builtin? Even though it shouldn't be strictly necessary, I believe some implementations can indeed be made more efficient using this, so I don't see any problem with adding it. However, the spec (s6.13.17.2) mandates that the enqueued block function only has void* types as parameters: "Each argument must be declared to be a void pointer to local memory." So could you please elaborate on where exactly you plan to get the optimal alignment from?

Thanks,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 15 December 2017 19:08
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions
 


Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

My comments are below.

 

From: Anastasia Stulova [mailto:[hidden email]]
Sent: Tuesday, December 19, 2017 10:21 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

> For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. 

 

Perhaps I am missing something, but I still don't see anything in the spec that requires pointers themselves to take their alignment from the pointee type. In your example, int4* should be aligned to the pointer size (either 4 or 8 bytes) while int4 should be 16-byte aligned. Clang will set the alignment of load and store operations correctly according to the data types specified in the source code (which is mainly inherited from the C implementation, apart from some special data types like vectors). The arguments passed to kernels are allocated elsewhere, and the OpenCL compiler has no control over this.

  Sam: The spec (v2.0 s6.1.5) requires “the pointee is always appropriately aligned as required by the data type”, which means the pointee of a kernel argument of type int4* should be aligned to 16 bytes.


Regarding enqueued kernels: as far as I understand, you are suggesting adding block argument alignment info to the builtin? Even though it shouldn't be strictly necessary, I believe some implementations can indeed be made more efficient using this, so I don't see any problem with adding it. However, the spec (s6.13.17.2) mandates that the enqueued block function only has void* types as parameters: "Each argument must be declared to be a void pointer to local memory." So could you please elaborate on where exactly you plan to get the optimal alignment from?

  Sam: The block function is passed to the builtin. The argument of the block function has the proper data type instead of void* type. Clang can deduce the alignment of the pointee of the kernel argument from the block function type.


Thanks,
Anastasia



Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

Hi Sam,


There is a restriction in the OpenCL spec that I referenced in my previous email (s6.13.17.2), which is implemented by Clang. If you look in the file test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see that block_B is rejected when passed to enqueue_kernel because it has a parameter which isn't "local void*". If you think this is wrong, perhaps it would make sense to revisit this and work out whether the current spec should be changed to allow more efficient implementations. But in the current state, I don't think we can implement what you are suggesting, because only one argument type is allowed for a block in enqueue.
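As a plain-C analogue of this restriction (not OpenCL C, and sum_ints is a hypothetical name): when the parameter is declared as a void pointer, the signature carries no element type, so the body has to cast before use and a caller has no type from which to read an alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Analogue of a block taking `local void *`: the parameter type says
   nothing about the element type or its alignment. */
int sum_ints(void *scratch, size_t count) {
    /* The cast is an unchecked assertion that scratch is aligned for int. */
    int *values = (int *)scratch;
    int total = 0;
    for (size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}
```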


Cheers,

Anastasia




From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 08 January 2018 22:25
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions
 


Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

From my perspective, this restriction is nonsense. OpenCL kernel local* arguments are not required to point to void. Why must block local* arguments point to void? They have to be cast to actually be useful; this is an unnecessary extra step. And unless the actual type is available, the kernel enqueue mechanism has no choice but to align the storage to 128 bytes, since any local void * could actually be a local ulong16 *.
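The 128-byte figure follows from how OpenCL sizes its vector types: a vector is aligned to its own size, and a 3-component vector takes the size of the 4-component one. A small C sketch of that arithmetic, with cl_vector_alignment as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Alignment in bytes of an OpenCL vector type with the given element size
   and component count (1, 2, 3, 4, 8, or 16). A 3-component vector is
   treated as a 4-component vector. */
size_t cl_vector_alignment(size_t elem_size, unsigned components) {
    unsigned padded = (components == 3) ? 4u : components;
    return elem_size * padded;
}
```

cl_vector_alignment(8, 16) gives 128 for ulong16, and cl_vector_alignment(4, 16) gives 64 for int16, the case asked about earlier in the thread.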

 

Thanks,

Brian

 

From: Anastasia Stulova [mailto:[hidden email]]
Sent: Wednesday, January 10, 2018 9:55 AM
To: Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Hi Sam,

 

There is a restriction in OpenCL spec I have referenced in my previous email - s6.13.17.2, which is implemented by Clang. If you look in the file test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see that block_B is rejected to be passed into enqueue_kernel because it has a parameter which isn't "local void*". If you think this is wrong perhaps it would make sense to revisit this bit and understand whether the current spec should be changed to allow more optimal implementations to exist. But as for the current state, I don't think we can implement what you are suggesting because we can only have one block argument type for a block in enqueue.

 

Cheer,

Anastasia

 


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 08 January 2018 22:25
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

My comments are below.

 

From: Anastasia Stulova [[hidden email]]
Sent: Tuesday, December 19, 2017 10:21 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

> For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. 

 

Perhaps I am missing something but I still don't see anything in the spec that requires pointers themselves to take alignment from the pointee type. In your example int4* should be aligned to the pointer size (either 4 or 8 bites) while int4 should be 16 byte aligned. Clang will set the alignment of load and store operations correctly according to their data types specified in the source code (which is mainly inherited from C implementation apart from some special data types like vectors). The arguments passed to kernels are allocated elsewhere and OpenCL compiler has no control over this.

  Sam: The spec (v2.0s6.1.5) requires “the pointee is always appropriately aligned as required by the data type”, which means the pointee of the kernel argument of int4* type should be aligned at 16 bytes.


Regarding enqueued kernels as far as I understand you suggest to add block argument alignment info into builtin? Even though it shouldn't be strictly necessary I believe some implementation can indeed be done more efficiently using this. So I don't see any problem adding this. However, spec (s6.13.17.2) mandates that the enqueued block function only has void* types as parameters: "Each argument must be declared to be a void pointer to local memory."  So could you elaborate please where exactly do you plan to get the optimal alignment from?

  Sam: The block function is passed to the builtin. The argument of the block function has the proper data type instead of void* type. Clang can deduce the alignment of the pointee of the kernel argument from the block function type.


Thanks,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 15 December 2017 19:08
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Spec reference:

 

OpenCL v2.0 s6.1.5

The OpenCL compiler is responsible for aligning data items to the appropriate alignment as required by the data type. For arguments to a __kernel function declared to be a pointer to a data type, the OpenCL compiler can assume that the pointee is always appropriately aligned as required by the data type. The behavior of an unaligned load or store is undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in section 6.13.7.

 

s6.2.5

Casting a pointer to a new type represents an unchecked assertion that the address is correctly aligned.

 

The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states

 

A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

 

For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. Passing a pointer aligned to 1 byte may result in undefined behavior. Most hardware can still load from the unaligned memory but will a performance hit. If runtime wants to avoid the performance hit, it has to allocate the buffer at maximum possible alignment e.g. 32 bytes, which will result in wasted memory.

 

Sam

 




Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

Hi Brian,


Considering the current implementation, there is no reason we couldn't generate code with arbitrary pointer types instead of void; this is implemented as a custom check anyway. I don't know, though, whether there might be limitations when using different compilation toolchains, although I imagine this would require a custom implementation everywhere. Should we clarify this in the spec?


Cheers,

Anastasia


From: Sumner, Brian <[hidden email]>
Sent: 10 January 2018 18:26:44
To: Anastasia Stulova; Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions
 

From my perspective, this restriction is nonsense.  OpenCL kernel local* arguments are not required to point to void.  Why must block local* arguments point to void?  They have to be cast to actually be useful; this is an unnecessary extra step.  And unless the actual type is available, the kernel enqueue mechanism has no choice but to align the storage to 128 bytes, since any local void * could actually be a local ulong16 *.

 

Thanks,

Brian

 

From: Anastasia Stulova [mailto:[hidden email]]
Sent: Wednesday, January 10, 2018 9:55 AM
To: Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Hi Sam,

 

There is a restriction in the OpenCL spec that I referenced in my previous email, s6.13.17.2, which is implemented by Clang. If you look in the file test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see that block_B is rejected when passed to enqueue_kernel because it has a parameter which isn't "local void*". If you think this is wrong, perhaps it would make sense to revisit this and understand whether the current spec should be changed to allow more optimal implementations. But in the current state, I don't think we can implement what you are suggesting, because we can only have one block argument type for a block in enqueue.

 

Cheers,

Anastasia

 


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 08 January 2018 22:25
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

My comments are below.

 

From: Anastasia Stulova [[hidden email]]
Sent: Tuesday, December 19, 2017 10:21 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

> For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. 

 

Perhaps I am missing something, but I still don't see anything in the spec that requires pointers themselves to take their alignment from the pointee type. In your example int4* should be aligned to the pointer size (either 4 or 8 bytes) while int4 should be 16-byte aligned. Clang will set the alignment of load and store operations correctly according to their data types specified in the source code (which is mainly inherited from the C implementation, apart from some special data types like vectors). The arguments passed to kernels are allocated elsewhere and the OpenCL compiler has no control over this.

  Sam: The spec (v2.0 s6.1.5) requires “the pointee is always appropriately aligned as required by the data type”, which means the pointee of a kernel argument of type int4* should be aligned to 16 bytes.


Regarding enqueued kernels, as far as I understand you suggest adding block argument alignment info to the builtin? Even though it shouldn't be strictly necessary, I believe some implementations can indeed be made more efficient using this, so I don't see any problem adding it. However, the spec (s6.13.17.2) mandates that the enqueued block function only has void* types as parameters: "Each argument must be declared to be a void pointer to local memory."  So could you please elaborate on where exactly you plan to get the optimal alignment from?

  Sam: The block function is passed to the builtin. The argument of the block function has the proper data type instead of void* type. Clang can deduce the alignment of the pointee of the kernel argument from the block function type.
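
To make the idea concrete, here is a rough sketch in plain C11 (not OpenCL C) of the kind of per-argument alignment table that could be derived from typed block parameters. Everything named here is hypothetical: int4_like is a stand-in for OpenCL's int4, typed_block_fn stands in for a block with typed local pointer parameters, and local_arg_align is a hand-written table standing in for what the compiler would compute. It does not show how Clang actually lowers enqueue_kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for OpenCL's int4, which is 16 bytes in size and alignment.
   This is an illustration type, not the real OpenCL definition. */
typedef struct { _Alignas(16) int s[4]; } int4_like;

/* Stand-in for a block whose local pointer parameters are typed
   instead of being void*. */
typedef void (*typed_block_fn)(int *a, int4_like *b);

/* Per-argument alignments a compiler could derive from the parameter
   types above and hand to the runtime next to the existing sizes. */
static const size_t local_arg_align[] = {
    _Alignof(int),
    _Alignof(int4_like),
};

/* Compile-time check that the stand-in really is 16-byte aligned. */
static_assert(_Alignof(int4_like) == 16,
              "int4 stand-in must be 16-byte aligned");
```

If the parameters were declared as void*, there would be no pointee type to query, which is the crux of the question above.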


Thanks,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 15 December 2017 19:08
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Spec reference:

 

OpenCL v2.0 s6.1.5

The OpenCL compiler is responsible for aligning data items to the appropriate alignment as required by the data type. For arguments to a __kernel function declared to be a pointer to a data type, the OpenCL compiler can assume that the pointee is always appropriately aligned as required by the data type. The behavior of an unaligned load or store is undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in section 6.13.7.

 

s6.2.5

Casting a pointer to a new type represents an unchecked assertion that the address is correctly aligned.

 

The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states

 

A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

 

For example, if a block kernel has an argument of type local int4*, its alignment should be 16 bytes. Passing a pointer aligned to 1 byte may result in undefined behavior. Most hardware can still load from unaligned memory but will incur a performance hit. If the runtime wants to avoid the performance hit, it has to allocate the buffer at the maximum possible alignment, e.g. 32 bytes, which wastes memory.

 

Sam

 




Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev
I'm still a little bit confused about the background of this. I understand that the actual use case here may not be something that can be shared, but perhaps at least some part of the underlying problem can be shared to help with understanding what the issue is here...

The approach I've taken is to allocate every local argument with the "largest alignment requirement" (in other words 128 bytes - this may of course vary depending on the HW available in the GPU).

As I see it, this wouldn't lead to THAT much overhead in the allocations, as local storage is per work-group, and the number of local arguments is, hopefully, not very large.

Whilst I'm all for saving memory when possible, I'm not sure that adding a set of alignment values to the argument list of enqueue_kernel for calls that have local arguments, with the extra complexity that brings (even if it's not large), is worth the local memory saved. I'd really like to understand why a single large alignment doesn't work in this case.

I'm completely aware that this may be my lack of understanding of something - hopefully I will learn something new, if that's the case...

--
Mats




Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

Hi Anastasia,

 

There is a 2.0 extension, cl_khr_device_enqueue_local_arg_types, that we requested when we first encountered this problem.  Clang should implement this if it hasn’t already, and ideally this would eventually become the default in the spec.

 

Thanks,

Brian

 




Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

The workgroup size is usually 64 or 128. The number of workgroups can be quite large if the global size is large. If we waste 124 bytes for each local array, the total waste could be quite large, considering that local memory is a precious resource on a GPU.

 

On the other hand, passing the alignment info and using it is pretty straightforward.
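
A rough sketch of the trade-off, in plain host-side C: one layout places each local buffer at its own alignment, the other forces every buffer to a single conservative alignment, as a runtime has to when it only sees void* arguments. The function names align_up, layout_typed and layout_uniform are invented for this illustration and do not correspond to any real runtime; the bump-allocation scheme itself is an assumption about how such buffers might be laid out.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `align` (a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Bytes of local memory consumed when buffer i is placed at the next
   offset that satisfies align[i]. */
static size_t layout_typed(const size_t *size, const size_t *align, size_t n)
{
    size_t offset = 0;
    for (size_t i = 0; i < n; ++i)
        offset = align_up(offset, align[i]) + size[i];
    return offset;
}

/* Same, but every buffer is forced to one conservative alignment, as a
   runtime must do when it only sees void* arguments. */
static size_t layout_uniform(const size_t *size, size_t n, size_t max_align)
{
    size_t offset = 0;
    for (size_t i = 0; i < n; ++i)
        offset = align_up(offset, max_align) + size[i];
    return offset;
}
```

For three 4-byte buffers with 4-byte alignment, layout_typed uses 12 bytes, while layout_uniform with a 128-byte alignment uses 260 bytes: 124 bytes of padding follow each of the first two buffers.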

 

Sam

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of mats petersson
Sent: Thursday, January 11, 2018 7:47 AM
To: Anastasia Stulova <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

I'm still a little bit confused about the background of this. And I understand that the actual usecase here may not be something that can be shared, but perhaps at least some part of the underlying problem can be shared to help with the understanding of the issue is here...

The approach I've taken is to allocate every local argument with the "largest alignment requirement" (in other words 128 bytes - this may of course vary depending on the HW available in the GPU).

As I see it, this wouldn't lead to THAT much overhead in the allocations, as local storage is per work-group, and the number of llocal arguments is, hopefully, not a very large number.

Whilst I'm all for saving memory when possible, I'm not sure adding a set of alignment values to the argument list of enqueue_kernel, for calls that have local arguments, and the extra complexity, even if it's not large, is worth the saving of local memory allocations. I'd really like to understand why a single large alignment doesn't work in this case.

I'm completely aware that this may be my lack of understanding of something - hopefully I will learn something new, if that's the case...

--

Mats

 

On 11 January 2018 at 12:02, Anastasia Stulova via cfe-dev <[hidden email]> wrote:

Hi Brian,

 

Considering the current implementation there is no reason we couldn't generate code with arbitrary pointer types instead of void. This is anyways implemented as a custom check. I don't know though if there might be limitations if using different compilation toolchains or so. Although I can imagine this will require custom implementation anywhere. Should we clarify this in spec?

 

Cheers,

Anastasia


From: Sumner, Brian <[hidden email]>
Sent: 10 January 2018 18:26:44
To: Anastasia Stulova; Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

From my perspective, this restriction is nonsense.  OpenCL kernel local* arguments are not required to point to void.  Why must block local* arguments point to void?  They have to be cast to actually be useful; this is an unnecessary extra step.  And unless the actual type is available, the kernel enqueue mechanism has no choice but to align the storage to 128 bytes, since any local void * could actually be a local ulong16 *.

 

Thanks,

Brian

 

From: Anastasia Stulova [mailto:[hidden email]]
Sent: Wednesday, January 10, 2018 9:55 AM
To: Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Hi Sam,

 

There is a restriction in the OpenCL spec that I referenced in my previous email, s6.13.17.2, which is implemented by Clang. If you look in the file test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see that block_B is rejected when passed to enqueue_kernel because it has a parameter which isn't "local void*". If you think this is wrong, perhaps it would make sense to revisit this and consider whether the current spec should be changed to allow more optimal implementations. But as things stand, I don't think we can implement what you are suggesting, because we can only have one block argument type for a block in enqueue.

 

Cheers,

Anastasia

 


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 08 January 2018 22:25
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

My comments are below.

 

From: Anastasia Stulova [[hidden email]]
Sent: Tuesday, December 19, 2017 10:21 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

> For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. 

 

Perhaps I am missing something, but I still don't see anything in the spec that requires pointers themselves to take their alignment from the pointee type. In your example int4* should be aligned to the pointer size (either 4 or 8 bytes) while int4 should be 16-byte aligned. Clang will set the alignment of load and store operations correctly according to their data types specified in the source code (which is mainly inherited from the C implementation apart from some special data types like vectors). The arguments passed to kernels are allocated elsewhere and the OpenCL compiler has no control over this.

  Sam: The spec (v2.0 s6.1.5) requires “the pointee is always appropriately aligned as required by the data type”, which means the pointee of the kernel argument of int4* type should be aligned at 16 bytes.


Regarding enqueued kernels, as far as I understand, you suggest adding block argument alignment info to the builtin? Even though it shouldn't be strictly necessary, I believe some implementations can indeed be made more efficient using this, so I don't see any problem with adding it. However, the spec (s6.13.17.2) mandates that the enqueued block function only has void* types as parameters: "Each argument must be declared to be a void pointer to local memory."  So could you please elaborate on where exactly you plan to get the optimal alignment from?

  Sam: The block function is passed to the builtin. The argument of the block function has the proper data type instead of void* type. Clang can deduce the alignment of the pointee of the kernel argument from the block function type.


Thanks,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 15 December 2017 19:08
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Spec reference:

 

OpenCL v2.0 s6.1.5

The OpenCL compiler is responsible for aligning data items to the appropriate alignment as required by the data type. For arguments to a __kernel function declared to be a pointer to a data type, the OpenCL compiler can assume that the pointee is always appropriately aligned as required by the data type. The behavior of an unaligned load or store is undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in section 6.13.7.

 

s6.2.5

Casting a pointer to a new type represents an unchecked assertion that the address is correctly aligned.

 

The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states

 

A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

 

For example, if a block kernel has an argument of type local int4*, its alignment should be 16 bytes. Passing a pointer aligned to 1 byte may result in undefined behavior. Most hardware can still load from the unaligned memory but will take a performance hit. If the runtime wants to avoid the performance hit, it has to allocate the buffer at the maximum possible alignment, e.g. 32 bytes, which will result in wasted memory.
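
As a rough illustration of what "appropriately aligned" means for an address, here is a small sketch in plain C (not OpenCL C). The helper name is_aligned is made up for this example, and the 16-byte figure simply stands in for the alignment of int4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if address p is a multiple of align.
   align must be a nonzero power of two. */
static int is_aligned(const void *p, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A buffer handed to the kernel as local int4* would be expected to pass is_aligned(ptr, 16); an address one byte past a 16-byte boundary would not.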

 

Sam

 

From: Anastasia Stulova [[hidden email]]
Sent: Friday, December 15, 2017 10:40 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

 

> OpenCL spec requires that a pointer should be aligned to at least the pointee type.

 

So a pointer to int16 would be 64 byte aligned? Seems strange though. Can you give me the spec reference?

> Otherwise, __enqueue_kernel has to either allocate unaligned local buffer, which degrades performance, or allocates local buffer with extra alignment therefore wasted memory space.

Can you explain this in more detail, please?

Cheers,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 01 December 2017 19:45
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian
Subject: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Hi,

 

OpenCL spec requires that a pointer should be aligned to at least the pointee type. Therefore, if a device-side enqueued kernel has a local int* argument, it should be aligned to 4 bytes.

 

Since these buffers in local addr space are allocated by __enqueue_kernel, it needs to know the alignment of these buffers, not just their sizes.

 

Although such information is not passed to the original OpenCL builtin function enqueue_kernel, it can be obtained by checking the prototype of the block invoke function at compile time.

 

I would like to create a patch to pass this information to __enqueue_kernel. Otherwise, __enqueue_kernel has to either allocate an unaligned local buffer, which degrades performance, or allocate a local buffer with extra alignment, which wastes memory space.
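
The trade-off can be sketched on the runtime side roughly as follows. This is plain C for illustration only, not the actual __enqueue_kernel implementation; align_up, layout_local_args and the array parameters are hypothetical names, and the sketch assumes the local buffers are simply placed back to back.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (a nonzero power of two). */
static size_t align_up(size_t offset, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical layout routine: place n local buffers back to back,
   each at its own alignment, and return the total bytes used.
   offsets[i] receives the start offset of buffer i. */
static size_t layout_local_args(const size_t *sizes, const size_t *aligns,
                                size_t n, size_t *offsets) {
    size_t cursor = 0;
    for (size_t i = 0; i < n; ++i) {
        cursor = align_up(cursor, aligns[i]);
        offsets[i] = cursor;
        cursor += sizes[i];
    }
    return cursor;
}
```

For three buffers of 4, 16 and 4 bytes, per-argument alignments of 4, 16 and 4 give a total of 36 bytes under this scheme, while a uniform 128-byte alignment gives 260 bytes.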

 

Any comments?

 

Thanks.

 

Sam


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 



Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev
In reply to this post by Richard Smith via cfe-dev

Unfortunately, we don't have this extension implemented upstream. But if AMD is willing to add it, that would allow the optimization that Sam was suggesting earlier.


Anastasia

From: Sumner, Brian <[hidden email]>
Sent: 11 January 2018 14:00
To: Anastasia Stulova; Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions
 

Hi Anastasia,

 

There is a 2.0 extension, cl_khr_device_enqueue_local_arg_types, that we requested when we first encountered this problem.  Clang should implement this if it hasn’t already, and ideally this would eventually become the default in the spec.

 

Thanks,

Brian

 


Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

We will implement our proposal under the cl_khr_device_enqueue_local_arg_types extension. Thanks.

 

Sam

 


Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev
In reply to this post by Richard Smith via cfe-dev


On 11 January 2018 at 16:39, Liu, Yaxun (Sam) <[hidden email]> wrote:

The workgroup size is usually 64 or 128. The number of workgroups can be quite large if the global size is large. If we waste 124 bytes for each local array, the total waste could be quite large, considering that local memory is a precious resource on a GPU.


It is, if you have a local argument that is just 4 bytes. But is that really a typical practical use case? Doing a local memory allocation in the first place to store 4 bytes seems a bit excessive.

Also, a simplification would be to do something like this:

alignment = min(round_to_nearest_power_of_2(size), max_alignment),

so you either align to the size of the argument [because there is no CL type where the alignment is greater than the size of the type itself], or to the maximum alignment. This doesn't require any further arguments to be passed, but gives a reasonable alignment. Sure, it's going to align an array of 6 int values to 32 bytes, but that's not the same loss as rounding everything to 128 bytes, and it can be done without changing anything.
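
A minimal sketch of that heuristic in plain C, assuming "round to nearest power of 2" means rounding up (which matches the 6-int example: 24 bytes becomes 32). The function names are made up for illustration and are not part of any runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two >= size (returns 1 for size 0 or 1).
   The loop stops before the shift could overflow. */
static size_t round_up_pow2(size_t size) {
    size_t p = 1;
    while (p < size && p <= SIZE_MAX / 2)
        p <<= 1;
    return p;
}

/* Heuristic: alignment = min(round_up_pow2(size), max_alignment).
   max_alignment must be a nonzero power of two. */
static size_t guess_alignment(size_t size, size_t max_alignment) {
    assert(max_alignment != 0 &&
           (max_alignment & (max_alignment - 1)) == 0);
    size_t p = round_up_pow2(size);
    return p < max_alignment ? p : max_alignment;
}
```

With max_alignment set to 128, a 4-byte buffer gets 4-byte alignment, a 24-byte buffer gets 32, and anything of 128 bytes or more is capped at 128.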

--
Mats

 

On the other hand, passing the alignment info and using it is pretty straightforward.

 

Sam

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of mats petersson
Sent: Thursday, January 11, 2018 7:47 AM
To: Anastasia Stulova <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

I'm still a little bit confused about the background of this. I understand that the actual use case here may not be something that can be shared, but perhaps at least some part of the underlying problem can be shared to help with understanding what the issue is here...

The approach I've taken is to allocate every local argument with the "largest alignment requirement" (in other words 128 bytes - this may of course vary depending on the HW available in the GPU).

As I see it, this wouldn't lead to THAT much overhead in the allocations, as local storage is per work-group, and the number of local arguments is, hopefully, not a very large number.

Whilst I'm all for saving memory when possible, I'm not sure adding a set of alignment values to the argument list of enqueue_kernel, for calls that have local arguments, and the extra complexity, even if it's not large, is worth the saving of local memory allocations. I'd really like to understand why a single large alignment doesn't work in this case.

I'm completely aware that this may be my lack of understanding of something - hopefully I will learn something new, if that's the case...

--

Mats

 

On 11 January 2018 at 12:02, Anastasia Stulova via cfe-dev <[hidden email]> wrote:

Hi Brian,

 

Considering the current implementation, there is no reason we couldn't generate code with arbitrary pointer types instead of void. This is implemented as a custom check anyway. I don't know, though, whether there might be limitations when using different compilation toolchains. Although I can imagine this will require a custom implementation anywhere. Should we clarify this in the spec?

 

Cheers,

Anastasia


From: Sumner, Brian <[hidden email]>
Sent: 10 January 2018 18:26:44
To: Anastasia Stulova; Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: nd


Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

From my perspective, this restriction is nonsense.  OpenCL kernel local* arguments are not required to point to void.  Why must block local* arguments point to void?  They have to be cast to actually be useful; this is an unnecessary extra step.  And unless the actual type is available, the kernel enqueue mechanism has no choice but to align the storage to 128 bytes, since any local void * could actually be a local ulong16 *.

 

Thanks,

Brian

 

From: Anastasia Stulova [mailto:[hidden email]]
Sent: Wednesday, January 10, 2018 9:55 AM
To: Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Hi Sam,

 

There is a restriction in the OpenCL spec that I referenced in my previous email - s6.13.17.2, which is implemented by Clang. If you look in the file test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see that block_B is rejected when passed to enqueue_kernel because it has a parameter which isn't "local void*". If you think this is wrong, perhaps it would make sense to revisit this bit and understand whether the current spec should be changed to allow more optimal implementations to exist. But as for the current state, I don't think we can implement what you are suggesting because we can only have one block argument type for a block in enqueue.

 

Cheers,

Anastasia

 


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 08 January 2018 22:25
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

My comments are below.

 

From: Anastasia Stulova [[hidden email]]
Sent: Tuesday, December 19, 2017 10:21 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

> For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. 

 

Perhaps I am missing something but I still don't see anything in the spec that requires pointers themselves to take alignment from the pointee type. In your example int4* should be aligned to the pointer size (either 4 or 8 bytes) while int4 should be 16 byte aligned. Clang will set the alignment of load and store operations correctly according to their data types specified in the source code (which is mainly inherited from C implementation apart from some special data types like vectors). The arguments passed to kernels are allocated elsewhere and the OpenCL compiler has no control over this.

  Sam: The spec (v2.0s6.1.5) requires “the pointee is always appropriately aligned as required by the data type”, which means the pointee of the kernel argument of int4* type should be aligned at 16 bytes.


Regarding enqueued kernels as far as I understand you suggest to add block argument alignment info into builtin? Even though it shouldn't be strictly necessary I believe some implementation can indeed be done more efficiently using this. So I don't see any problem adding this. However, spec (s6.13.17.2) mandates that the enqueued block function only has void* types as parameters: "Each argument must be declared to be a void pointer to local memory."  So could you elaborate please where exactly do you plan to get the optimal alignment from?

  Sam: The block function is passed to the builtin. The argument of the block function has the proper data type instead of void* type. Clang can deduce the alignment of the pointee of the kernel argument from the block function type.


Thanks,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 15 December 2017 19:08
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Spec reference:

 

OpenCL v2.0 s6.1.5

The OpenCL compiler is responsible for aligning data items to the appropriate alignment as required by the data type. For arguments to a __kernel function declared to be a pointer to a data type, the OpenCL compiler can assume that the pointee is always appropriately aligned as required by the data type. The behavior of an unaligned load or store is undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in section 6.13.7.

 

s6.2.5

Casting a pointer to a new type represents an unchecked assertion that the address is correctly aligned.

 

The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states

 

A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

 

For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. Passing a pointer aligned to 1 byte may result in undefined behavior. Most hardware can still load from unaligned memory but will incur a performance hit. If the runtime wants to avoid the performance hit, it has to allocate the buffer at the maximum possible alignment, e.g. 32 bytes, which will result in wasted memory.
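
To make the alignment requirement concrete, here is a minimal C sketch of the check a runtime could apply to a candidate buffer address. The helper name `is_aligned` is hypothetical and is not part of OpenCL, Clang, or any runtime discussed in this thread; the 16-byte figure for int4 is the one quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if addr is a multiple of align.
 * align must be a nonzero power of two. */
bool is_aligned(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (uintptr_t)(align - 1)) == 0;
}
```

Under this check, an address such as 20 is acceptable for a local int* (4-byte alignment) but not for a local int4* (16-byte alignment), which is why the allocator needs to know the pointee type or its alignment.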

 

Sam

 

From: Anastasia Stulova [[hidden email]]
Sent: Friday, December 15, 2017 10:40 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

 

> OpenCL spec requires that a pointer should be aligned to at least the pointee type.

 

So a pointer to int16 would be 64 byte aligned? Seems strange though. Can you give me the spec reference?

> Otherwise, __enqueue_kernel has to either allocate unaligned local buffer, which degrades performance, or allocates local buffer with extra alignment therefore wasted memory space.

Can you explain in more detail here, please?

Cheers,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 01 December 2017 19:45
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian
Subject: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Hi,

 

OpenCL spec requires that a pointer should be aligned to at least the pointee type. Therefore, if a device-side enqueued kernel has a local int* argument, it should be aligned to 4 bytes.

 

Since these buffers in local addr space are allocated by __enqueue_kernel, it needs to know the alignment of these buffers, not just their sizes.

 

Although such information is not passed to the original OpenCL builtin function enqueue_kernel, it can be obtained by checking the prototype of the block invoke function at compile time.

 

I would like to create a patch to pass this information to __enqueue_kernel. Otherwise, __enqueue_kernel has to either allocate an unaligned local buffer, which degrades performance, or allocate a local buffer with extra alignment, which wastes memory space.

 

Any comments?

 

Thanks.

 

Sam


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 




Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev

My comments are below.

 

Sam

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of mats petersson
Sent: Friday, January 12, 2018 8:32 AM
To: Liu, Yaxun (Sam) <[hidden email]>
Cc: Anastasia Stulova <[hidden email]>; Sumner, Brian <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

 

 

On 11 January 2018 at 16:39, Liu, Yaxun (Sam) <[hidden email]> wrote:

The workgroup size is usually 64 or 128. The number of workgroups can be quite large if the global size is large. If for each local array we waste 124 bytes, the total waste could be quite large, considering local memory is a precious resource on a GPU.

 

It is, if you have a local argument that is just 4 bytes. But is that really a typical practical use case? Doing a local memory allocation in the first place to store 4 bytes seems a bit excessive.

[Sam] The waste of memory could happen to an integer array of any size, e.g. int a[10], which only needs to be aligned at 4 bytes. Aligning it to 128 bytes wastes 124 bytes.
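
The 124-byte figure can be read as the worst-case padding in front of a buffer, when the preceding data ends 4 bytes past a 128-byte boundary. A minimal C sketch of that arithmetic follows; `padding_for` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of padding bytes needed to move offset
 * up to the next multiple of align. align must be a nonzero power of two. */
size_t padding_for(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (offset & (align - 1))) & (align - 1);
}
```

With a current offset of 4, a 128-byte alignment needs 124 bytes of padding before the next buffer, while a 4-byte alignment needs none.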

Also, a simplification would be to do something like this:

alignment = min(round_to_nearest_power_of_2(size), max_alignment),

[Sam] We cannot predict how users will use local memory. In certain cases the above approach still wastes considerable local memory. I think it is better to let users fully utilize their local memory, considering the implementation effort is moderate.


so you either align to the size of the argument [because there is no CL type where the alignment is greater than the size of the type itself], or the maximum alignment. This doesn't require any further arguments to be passed, but gives a reasonable alignment. Sure, it's going to align an array of 6 int values to 32 bytes, but it's not the same loss as rounding everything to 128 bytes, and can be done without changing anything.

 

--

Mats

 

On the other hand, passing the alignment info and using it is pretty straightforward.

 

Sam

 


 

 



Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

Richard Smith via cfe-dev


On 12 January 2018 at 19:32, Liu, Yaxun (Sam) <[hidden email]> wrote:

My comments are below.

 

Sam

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of mats petersson
Sent: Friday, January 12, 2018 8:32 AM
To: Liu, Yaxun (Sam) <[hidden email]>
Cc: Anastasia Stulova <[hidden email]>; Sumner, Brian <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

 

 

On 11 January 2018 at 16:39, Liu, Yaxun (Sam) <[hidden email]> wrote:

The workgroup size is usually 64 or 128. The number of workgroups can be quite large if the global size is large. If for each local array we waste 124 bytes, the total waste could be quite large, considering local memory is a precious resource on a GPU.

 

It is, if you have a local argument that is just 4 bytes. But is that really a typical practical use case? Doing a local memory allocation in the first place to store 4 bytes seems a bit excessive.

[Sam] The waste of memory could happen to an integer array of any size, e.g. int a[10], which only needs to be aligned at 4 bytes. Aligning it to 128 bytes wastes 124 bytes.

Clearly not: in this case the local argument is 40 bytes, and thus the wastage is at most 88 bytes (128-40). I'm not arguing that this is not wasted; I'm trying to understand what the use case is where the user uses local memory in such a small amount per workgroup.

Also, a simplification would be to do something like this:

alignment = min(round_to_nearest_power_of_2(size), max_alignment),

[Sam] We cannot predict how users will use local memory. In certain cases the above approach still wastes considerable local memory. I think it is better to let users fully utilize their local memory, considering the implementation effort is moderate.


With the above suggestion, the implementation cost is nearly zero, and you CAN assume that the user will not access outside the range of the actual allocated space [that would be UB]. For an int [10], the  "loss" would be 24 bytes, because the rounding up would be to 64 bytes, and the worst possible case for small buffers is for int [17], which would waste 60 bytes. For large buffers, the worst case can of course still be 124 bytes.
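
The one-line rule can be sketched in C as below. This is only an illustration: `heuristic_alignment` is a hypothetical name, and the rounding is assumed to go up to the next power of two, since that is what the 40 -> 64 and 68 -> 128 numbers above imply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of: alignment = min(round_up_pow2(size), max_align).
 * max_align must be a nonzero power of two. */
size_t heuristic_alignment(size_t size, size_t max_align)
{
    assert(max_align != 0 && (max_align & (max_align - 1)) == 0);
    if (size >= max_align)
        return max_align;      /* clamp large buffers to the maximum */
    size_t a = 1;
    while (a < size)           /* smallest power of two >= size */
        a <<= 1;
    return a;
}
```

For a maximum of 128 bytes this gives 64 for int [10] (40 bytes), 128 for int [17] (68 bytes) and 32 for an array of 6 int values (24 bytes), matching the figures in the discussion.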

You could potentially do something like (I have not validated this - and it still needs clamping to 128 or something of course)

     /* largest power of two that is <= size */
     rounded_size = round_to_nearest_smaller_power_2(size);
     if (size != rounded_size)
     {
         alignment = size % rounded_size;
     }
     else
     {
         alignment = rounded_size;
     }

This will give you an alignment of 8 for int [10], and 4 for int [17].
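
One caveat with the remainder-based rule is that `size % rounded_size` is not always a power of two (for a 44-byte buffer it yields 12). A possible variant, sketched below under that observation, takes the largest power of two that divides the size and clamps it. This is a swapped-in technique rather than the rule as written, and `divisor_alignment` is a hypothetical name; it agrees with the two examples given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant: alignment = largest power of two dividing size,
 * clamped to max_align. size must be nonzero and max_align must be a
 * nonzero power of two. */
size_t divisor_alignment(size_t size, size_t max_align)
{
    assert(size != 0);
    assert(max_align != 0 && (max_align & (max_align - 1)) == 0);
    size_t a = size & (~size + 1);   /* isolate the lowest set bit */
    return a < max_align ? a : max_align;
}
```

This yields 8 for int [10] (40 bytes) and 4 for int [17] (68 bytes), and 4 rather than 12 for a 44-byte buffer. The same caveat about vector loads on part of the buffer still applies.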

This does of course assume that someone doesn't try to load 16 of the int [17] in a vector-instruction that requires alignment of 64, and then load one element on its own. That wouldn't work well, but that would only work if the user-call supplied the alignment, which I don't think is the proposed solution.

Of course, if you have a bunch of different local arguments, of varying sizes, this will still potentially lead to wasted space, but less so. For example int [1], int [10], int[1], int [32], int [12] would lead to several gaps of varying sizes. If this is what is expected - and I don't really know what use cases there are out there that use local memory combined with device-side enqueue - then I would say, it may be worth doing this.
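
To compare policies on the example list above, the following C sketch lays buffers out back to back, each at its own alignment, and reports the total local memory used. The function names are hypothetical, and it assumes the first buffer starts at offset 0 and that int is 4 bytes with 4-byte alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds offset up to the next multiple of align (a nonzero power of two). */
size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical sketch: places n buffers consecutively, aligning buffer i
 * to aligns[i], and returns the total bytes used including padding. */
size_t layout_total(const size_t *sizes, const size_t *aligns, size_t n)
{
    size_t offset = 0;
    for (size_t i = 0; i < n; ++i) {
        offset = align_up(offset, aligns[i]);  /* padding before buffer i */
        offset += sizes[i];                    /* the buffer itself */
    }
    return offset;
}
```

For int [1], int [10], int [1], int [32], int [12] (4, 40, 4, 128, 48 bytes), natural 4-byte alignment uses 224 bytes, the one-line power-of-two rule (alignments 4, 64, 4, 128, 64) uses 304 bytes, and aligning everything to 128 uses 560 bytes.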

Have you investigated some work-loads with regard to how much space you gain from "the tightest possible packing", compared to my above solution, the one-line solution, and "round everything to 128"?
Without revealing what the work-loads are, perhaps you could show something like:
Kernel A: 12, 36, 18, 128 bytes
Kernel B: 116, 236, 240, 256 bytes
Kernel C: ...
[I just made those numbers up, and I don't really expect the numbers to make any sense compared to real applications and numbers]

Not quite a single-line, but still trivial compared to passing and handling an array of extra arguments, which requires modification of several different files, adding new test-cases, etc. [Although you may want to add some test-cases for this implementation, of course].

--
Mats


so you either align to the size of the argument [because there is no CL type where the alignment is greater than the size of the type itself], or the maximum alignment. This doesn't require any further arguments to be passed, but

gives a reasonable alignment. Sure, it's going to align an array of 6 int values to 32 bytes, but it's not the same loss as rounding everything to 128 bytes, and can be done without changing anything.

 

--

Mats

 

On the other hand, passing the alignment info and using it is pretty straight forward.

 

Sam

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of mats petersson
Sent: Thursday, January 11, 2018 7:47 AM
To: Anastasia Stulova <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

I'm still a little bit confused about the background of this. And I understand that the actual usecase here may not be something that can be shared, but perhaps at least some part of the underlying problem can be shared to help with the understanding of the issue is here...

The approach I've taken is to allocate every local argument with the "largest alignment requirement" (in other words 128 bytes - this may of course vary depending on the HW available in the GPU).

As I see it, this wouldn't lead to THAT much overhead in the allocations, as local storage is per work-group, and the number of llocal arguments is, hopefully, not a very large number.

Whilst I'm all for saving memory when possible, I'm not sure adding a set of alignment values to the argument list of enqueue_kernel, for calls that have local arguments, and the extra complexity, even if it's not large, is worth the saving of local memory allocations. I'd really like to understand why a single large alignment doesn't work in this case.

I'm completely aware that this may be my lack of understanding of something - hopefully I will learn something new, if that's the case...

--

Mats

 

On 11 January 2018 at 12:02, Anastasia Stulova via cfe-dev <[hidden email]> wrote:

Hi Brian,

 

Considering the current implementation there is no reason we couldn't generate code with arbitrary pointer types instead of void. This is anyways implemented as a custom check. I don't know though if there might be limitations if using different compilation toolchains or so. Although I can imagine this will require custom implementation anywhere. Should we clarify this in spec?

 

Cheers,

Anastasia


From: Sumner, Brian <[hidden email]>
Sent: 10 January 2018 18:26:44
To: Anastasia Stulova; Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: nd


Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

From my perspective, this restriction is nonsense.  OpenCL kernel local* arguments are not required to point to void.  Why must block local* arguments point to void?  They have to be cast to actually be useful; this is an unnecessary extra step.  And unless the actual type is available, the kernel enqueue mechanism has no choice to align the storage to 128 bytes since any local void * could actually be a local ulong16 *.

 

Thanks,

Brian

 

From: Anastasia Stulova [mailto:[hidden email]]
Sent: Wednesday, January 10, 2018 9:55 AM
To: Liu, Yaxun (Sam); cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Hi Sam,

 

There is a restriction in OpenCL spec I have referenced in my previous email - s6.13.17.2, which is implemented by Clang. If you look in the file test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see that block_B is rejected to be passed into enqueue_kernel because it has a parameter which isn't "local void*". If you think this is wrong perhaps it would make sense to revisit this bit and understand whether the current spec should be changed to allow more optimal implementations to exist. But as for the current state, I don't think we can implement what you are suggesting because we can only have one block argument type for a block in enqueue.

 

Cheer,

Anastasia

 


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 08 January 2018 22:25
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

My comments are below.

 

From: Anastasia Stulova [[hidden email]]
Sent: Tuesday, December 19, 2017 10:21 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

> For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. 

 

Perhaps I am missing something but I still don't see anything in the spec that requires pointers themselves to take alignment from the pointee type. In your example int4* should be aligned to the pointer size (either 4 or 8 bites) while int4 should be 16 byte aligned. Clang will set the alignment of load and store operations correctly according to their data types specified in the source code (which is mainly inherited from C implementation apart from some special data types like vectors). The arguments passed to kernels are allocated elsewhere and OpenCL compiler has no control over this.

  Sam: The spec (v2.0s6.1.5) requires “the pointee is always appropriately aligned as required by the data type”, which means the pointee of the kernel argument of int4* type should be aligned at 16 bytes.


Regarding enqueued kernels as far as I understand you suggest to add block argument alignment info into builtin? Even though it shouldn't be strictly necessary I believe some implementation can indeed be done more efficiently using this. So I don't see any problem adding this. However, spec (s6.13.17.2) mandates that the enqueued block function only has void* types as parameters: "Each argument must be declared to be a void pointer to local memory."  So could you elaborate please where exactly do you plan to get the optimal alignment from?

  Sam: The block function is passed to the builtin. The argument of the block function has the proper data type instead of void* type. Clang can deduce the alignment of the pointee of the kernel argument from the block function type.


Thanks,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 15 December 2017 19:08
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian; nd
Subject: RE: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Spec reference:

 

OpenCL v2.0 s6.1.5

The OpenCL compiler is responsible for aligning data items to the appropriate alignment as required by the data type. For arguments to a __kernel function declared to be a pointer to a data type, the OpenCL compiler can assume that the pointee is always appropriately aligned as required by the data type. The behavior of an unaligned load or store is undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in section 6.13.7.

 

s6.2.5

Casting a pointer to a new type represents an unchecked assertion that the address is correctly aligned.

 

The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states

 

A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

 

For example, if a block kernel has argument local int4*. Its alignment should be 16 bytes. Passing a pointer aligned to 1 byte may result in undefined behavior. Most hardware can still load from the unaligned memory but will incur a performance hit. If the runtime wants to avoid the performance hit, it has to allocate the buffer at the maximum possible alignment, e.g. 32 bytes, which will result in wasted memory.
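
The wasted-memory point can be made concrete with a small sketch in plain C. It assumes the runtime packs the local buffers one after another, rounding each start offset up to the buffer's alignment; align_up and local_bytes_needed are hypothetical helpers written for this illustration, not part of any OpenCL runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (a non-zero power of two). */
static size_t align_up(size_t offset, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Total bytes of local memory needed when n buffers are packed in order,
   each starting at an offset that satisfies its own alignment. */
static size_t local_bytes_needed(const size_t *sizes, const size_t *aligns,
                                 size_t n) {
    size_t offset = 0;
    for (size_t i = 0; i < n; ++i) {
        offset = align_up(offset, aligns[i]);  /* pad up to the alignment */
        offset += sizes[i];                    /* then place the buffer */
    }
    return offset;
}
```

With three buffers of 4, 16 and 4 bytes, using their natural alignments of 4, 16 and 4 bytes needs 36 bytes in total, while forcing every buffer to a 32-byte alignment needs 68 bytes for the same data.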

 

Sam

 

From: Anastasia Stulova [[hidden email]]
Sent: Friday, December 15, 2017 10:40 AM
To: Liu, Yaxun (Sam) <[hidden email]>; cfe-dev ([hidden email]) <[hidden email]>; Bader, Alexey ([hidden email]) <[hidden email]>
Cc: Sumner, Brian <[hidden email]>; nd <[hidden email]>
Subject: Re: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

 

> OpenCL spec requires that a pointer should be aligned to at least the pointee type.

 

So a pointer to int16 would be 64 byte aligned? Seems strange though. Can you give me the spec reference?

> Otherwise, __enqueue_kernel has to either allocate unaligned local buffer, which degrades performance, or allocates local buffer with extra alignment therefore wasted memory space.

Can you explain in more details here, please.

Cheers,
Anastasia


From: Liu, Yaxun (Sam) <[hidden email]>
Sent: 01 December 2017 19:45
To: Anastasia Stulova; cfe-dev ([hidden email]); Bader, Alexey ([hidden email])
Cc: Sumner, Brian
Subject: [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

 

Hi,

 

OpenCL spec requires that a pointer should be aligned to at least the pointee type. Therefore, if a device-side enqueued kernel has a local int* argument, it should be aligned to 4 bytes.

 

Since these buffers in local addr space are allocated by __enqueue_kernel, it needs to know the alignment of these buffers, not just their sizes.

 

Although such information is not passed to the original OpenCL builtin function enqueue_kernel, it can be obtained by checking the prototype of the block invoke function at compile time.

 

I would like to create a patch to pass this information to  __enqueue_kernel. Otherwise, __enqueue_kernel has to either allocate unaligned local buffer, which degrades performance, or allocates local buffer with extra alignment therefore wasted memory space.

 

Any comments?

 

Thanks.

 

Sam


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 

 


