[RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev
Hello all,

At Intel, we have developed an implementation of C++17 execution policies
for algorithms (often referred to as Parallel STL). We hope to contribute it
to libc++/LLVM, so would like to ask the community for comments on this.

The code is already published at GitHub (https://github.com/intel/parallelstl).
It supports the C++17 standard execution policies (seq, par, par_unseq) as well as
the experimental unsequenced policy (unseq) for SIMD execution. At the moment,
about half of the C++17 standard algorithms that must support execution policies
are implemented; a few more will be ready soon, and the work continues.
The tests that we use are also available at GitHub; needless to say we will
contribute those as well.

The implementation is not specific to Intel’s hardware. For thread-level parallelism
it uses TBB* (https://www.threadingbuildingblocks.org/) but abstracts it with
an internal API which can be implemented on top of other threading/parallel solutions –
so it is for the community to decide which ones to use. For SIMD parallelism
(unseq, par_unseq) we use #pragma omp simd directives; they are vendor-neutral and
do not require any OpenMP runtime support.
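
To make the idea of an internal threading API concrete, here is a minimal sketch of what such an abstraction layer could look like. It is not the interface used in the published code: the names serial_backend, std_thread_backend, parallel_for and add_one are hypothetical, and std::thread merely stands in for TBB or any other threading solution.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical backend interface: each backend provides
// parallel_for(n, body), which calls body(first, last) on sub-ranges
// that together cover [0, n).

// Backend 1: no threading at all, a single chunk on the calling thread.
struct serial_backend {
    template <class Body>
    static void parallel_for(std::size_t n, Body body) {
        if (n > 0) body(std::size_t{0}, n);
    }
};

// Backend 2: one std::thread per chunk (assumed stand-in for TBB).
struct std_thread_backend {
    template <class Body>
    static void parallel_for(std::size_t n, Body body) {
        if (n == 0) return;
        std::size_t workers = std::thread::hardware_concurrency();
        if (workers == 0) workers = 2;  // hardware_concurrency() may return 0
        workers = std::min(workers, n);
        const std::size_t chunk = (n + workers - 1) / workers;

        std::vector<std::thread> pool;
        for (std::size_t first = 0; first < n; first += chunk) {
            const std::size_t last = std::min(first + chunk, n);
            // body outlives the threads because they are joined below.
            pool.emplace_back([&body, first, last] { body(first, last); });
        }
        for (auto& t : pool) t.join();
    }
};

// An algorithm written once against the backend interface.
template <class Backend>
void add_one(std::vector<int>& v) {
    int* data = v.data();
    Backend::parallel_for(v.size(), [data](std::size_t first, std::size_t last) {
        for (std::size_t i = first; i < last; ++i) data[i] += 1;
    });
}
```

Calling add_one<serial_backend>(v) and add_one<std_thread_backend>(v) gives the same result; only the scheduling differs, which is the property that lets the choice of threading layer stay open.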

The current implementation meets the spirit but not always the letter of
the standard, because it has to be separate from but also coexist with
implementations of standard C++ libraries. While preparing the contribution,
we will address inconsistencies, adjust the code to meet community standards,
and better integrate it into the standard library code.

We are also proposing that our implementation be included in libstdc++/GCC.
Compatibility between the implementations seems useful as it can potentially
reduce the amount of work for everyone. We hope to keep the code mostly identical,
and would like to know if you think it’s too optimistic to expect.

Obviously we plan to use appropriate open source licenses to meet the different
projects’ requirements.

We expect to keep developing the code and will take the responsibility for
maintaining it (with community contributions, of course). If there are other
community efforts to implement parallel algorithms, we are willing to collaborate.

We look forward to your feedback, both for the overall idea and – if supported –
for the next steps we should take.

Regards,
- Alexey Kukanov

* Note that TBB itself is highly portable (it has been ported by the community to Power
and ARM architectures) and permissively licensed, so it could be the base for the threading
infrastructure. But the Parallel STL implementation itself does not require TBB.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev
On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey <[hidden email]> wrote:
> Hello all,
>
> At Intel, we have developed an implementation of C++17 execution policies
> for algorithms (often referred to as Parallel STL). We hope to contribute it
> to libc++/LLVM, so would like to ask the community for comments on this.


Thank you very much for the offer!

-- Marshall
 



Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev
It would be nice to keep PSTL and OpenMP orthogonal, even if _Pragma("omp simd") does not require runtime support. It should be trivial to use _Pragma("clang loop vectorize(assume_safety)") instead, by wrapping all of the different compiler vectorization pragmas in preprocessor logic. I similarly recommend _Pragma("GCC ivdep") for GCC and _Pragma("vector always") for ICC. This requires O(n_compilers) effort instead of O(1), but orthogonality is worth it.
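
The preprocessor wrapping described above might look roughly like the following sketch. The macro name SKETCH_PRAGMA_SIMD and the function axpy are made up for illustration; only the three pragma strings come from this message, and whether each one gives the desired vectorization behaviour on a given compiler version is an assumption to verify.

```cpp
#include <cstddef>

// Hypothetical macro: selects one vectorization hint per compiler.
// Clang and the classic Intel compiler also define __GNUC__,
// so they have to be tested before the GCC branch.
#if defined(__clang__)
#  define SKETCH_PRAGMA_SIMD _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__INTEL_COMPILER)
#  define SKETCH_PRAGMA_SIMD _Pragma("vector always")
#elif defined(__GNUC__)
#  define SKETCH_PRAGMA_SIMD _Pragma("GCC ivdep")
#else
#  define SKETCH_PRAGMA_SIMD  // unknown compiler: no hint, plain loop
#endif

// Computes y[i] += a * x[i]. The caller promises that x and y do not
// overlap, which is what makes the "assume safety" style hints valid.
inline void axpy(float a, const float* x, float* y, std::size_t n) {
    SKETCH_PRAGMA_SIMD
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

The loop body is written once; supporting another compiler means adding one more branch to the macro definition, which is the O(n_compilers) cost mentioned above.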

While OpenMP is vendor/compiler-agnostic, users should not be required to use -fopenmp or similar to enable vectorization from PSTL, nor should the compiler enable any OpenMP pragma by default.  I know of cases where merely using the -fopenmp flag alters code generation in a performance-visible manner, and enabling the OpenMP "simd" pragma by default may surprise some users, particularly if no other OpenMP pragmas are enabled by default.

Best,

Jeff
(who works for Intel but not on any software products and has been a heavy user of Intel PSTL since it was released, if anyone is keeping track of conflicts-of-interest)

--
Jeff Hammond
[hidden email]
http://jeffhammond.github.io/


Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev
Hello,
 
_Pragma("omp simd") is semantically quite different from _Pragma("clang loop vectorize(assume_safety)"), _Pragma("GCC ivdep") and _Pragma("vector always"), so I am not sure the latter will all work as expected in all cases. They definitely won't provide any vectorization guarantees, which slightly defeats the purpose of using the corresponding execution policy.

I support the idea of keeping OpenMP orthogonal, and having -fopenmp enabled by default is definitely not an option. The Intel compiler has a separate -qopenmp-simd option which doesn't affect performance outside explicitly marked loops, but even this is not enabled by default. I would say that multiple implementations of the unsequenced policy might exist: the original OpenMP SIMD based implementation may be more powerful, while one based on other pragmas could be the default and hint at the existence of the faster option. Later on, someone may be brave enough to add a SIMD template library and implement the default unsequenced policy using it. (Such an implementation is possible even now using vector types, but it would be extremely complex if it attempted to target all base data types, vector widths and target SIMD architectures that clang supports. Even with the library this may be quite tedious.)

Without any standard way of expressing SIMD parallelism in pure C++, any implementer of a SIMD execution policy has to rely on the means available for the platform/compiler, so it is not totally unnatural to ask the user to enable OpenMP SIMD for efficient support of the corresponding execution policy.
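
One way to picture the "default implementation plus an opt-in OpenMP SIMD implementation" arrangement is the sketch below. Everything named here is hypothetical: SKETCH_USE_OPENMP_SIMD is an invented macro that the user would define alongside a compiler flag that enables the simd pragma (such as -qopenmp-simd for the Intel compiler), and simd_backend() is only a stand-in for the "hint about the faster option".

```cpp
#include <cstddef>

// SKETCH_USE_OPENMP_SIMD is a hypothetical opt-in macro. The user would
// define it together with a flag that makes the compiler honour
// "#pragma omp simd" without pulling in the OpenMP runtime.
#if defined(SKETCH_USE_OPENMP_SIMD)
#  define SKETCH_SIMD_LOOP _Pragma("omp simd")
constexpr const char* simd_backend() { return "omp simd"; }
#else
#  define SKETCH_SIMD_LOOP  // default: plain loop, left to the auto-vectorizer
constexpr const char* simd_backend() {
    return "plain loop (define SKETCH_USE_OPENMP_SIMD for the OpenMP SIMD path)";
}
#endif

// Multiplies every element by factor; the same source serves both paths.
inline void scale(double* data, std::size_t n, double factor) {
    SKETCH_SIMD_LOOP
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}
```

Built without the macro, the code depends on nothing from OpenMP; built with it, the loop carries the stronger OpenMP SIMD annotation. The results are the same either way, only the vectorization behaviour may differ.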
 
Regards,
Serge Preis

(Who was once part of the Intel Compiler Vectorizer team and drove OpenMP SIMD efforts within icc and beyond, if anyone is keeping track of conflicts-of-interest)
 
 

Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev


On 12/03/2017 10:09 PM, Serge Preis via cfe-dev wrote:
> Hello,
>
> _Pragma("omp simd") is semantically quite different from _Pragma("clang loop vectorize(assume_safety)"), _Pragma("GCC ivdep") and _Pragma("vector always"), so I am not sure the latter will all work as expected in all cases. They definitely won't provide any vectorization guarantees, which slightly defeats the purpose of using the corresponding execution policy.
>
> I support the idea of keeping OpenMP orthogonal, and having -fopenmp enabled by default is definitely not an option. The Intel compiler has a separate -qopenmp-simd option which doesn't affect performance outside explicitly marked loops,

A similar flag is being worked on for Clang (see https://reviews.llvm.org/D31417).

Maybe what we really need for this is some kind of '#pragma GCC push_options' thing so that we can force OpenMP SIMD support on in particular regions of code?

 -Hal
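
For readers unfamiliar with the mechanism being alluded to, the sketch below shows how the existing GCC pragmas scope an option change to a region of code. It uses an optimization level purely as an example; whether OpenMP SIMD support could be switched on in the same way is exactly the open question, and nothing here claims that it can. The function name dot is made up for illustration.

```cpp
#include <cstddef>

// Existing GCC mechanism: save the current options, change them for a
// region, then restore them. Guarded so that other compilers skip it.
#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER)
#  pragma GCC push_options
#  pragma GCC optimize("O3")  // affects only functions defined before pop_options
#endif

// Dot product of two arrays of length n, compiled with the region's options.
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER)
#  pragma GCC pop_options  // code after this point uses the command-line options again
#endif
```

A library header could, in principle, wrap its SIMD loops in such a region so that users would not need to pass an extra flag, provided the compiler offered a region-scoped switch for the simd pragma.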


-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev

> A similar flag is being worked on for Clang (see https://reviews.llvm.org/D31417).

Thank you for the reference: I saw this once but was unable to locate it today.

> Maybe what we really need for this is some kind of '#pragma GCC push_options' thing so that we can force OpenMP SIMD support on in particular regions of code?

This sounds like a good idea to me.
 
 
-- 
Serge Preis
 
 

Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev
ICC implements a very aggressive interpretation of the OpenMP standard, and this interpretation is not shared by everyone in the OpenMP community. ICC is correct, but other implementations may be far less aggressive, so _Pragma("omp simd") doesn't guarantee vectorization unless the compiler documentation says that is how it is implemented. All the standard says is that vectorization is _permitted_.

Given that the practical meaning of _Pragma("omp simd") isn't guaranteed to be consistent across different implementations, I don't really know how to compare it to compiler-specific pragmas unless we define everything explicitly.

In any case, my fundamental point remains: do not use OpenMP pragmas here, but instead use whatever the appropriate compiler-specific pragma is, or create a new one that meets the need.

Best,

Jeff

On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis <[hidden email]> wrote:
Hello,
 
_Pragma("omp simd") is semantically quite different from _Pragma("clang loop vectorize(assume_safety)"), _Pragma("GCC ivdep") and _Pragma("vector always"), so I am not sure all latter will work as expected in all cases. They definitely won't provide any vectorization guarantees which slightly defeat the purpose of using corresponding execution policy.
 
I support the idea of having OpenMP orthogonal and definitely having -fopenmp enabled by default is not an option. Intel compiler has separate -qopenmp-simd option which doesn't affect performance outside explicitly marked loops, but even this is not enabled by default. I would say that there might exist multiple implementations of unordered policy, originally OpenMP SIMD based implementation may be more powerful and one based on other pragmas being default, but hinting about existence of faster option. Later on one may be brave enough to add some SIMD template library and implement default unordered policy using it (such implementation is possible even now using vector types, but it will be extremely complex if attempt to target all base data types, vector widths and target SIMD architectures clang supports. Even with the library this may be quite tedious).
 
Without any standard way of expressing SIMD perallelism in pure C++ any implementer of SIMD execution policy is to rely on means avaialble for plaform/compiler and so it is not totaly unnatural to ask user to enable OpenMP SIMD for efficient support of corresponding execution policy.
 
Reagrds,
Serge Preis
 
(Who once was part of Intel Compiler Vectorizer team and driven OpenMP SIMD efforts within icc and beyond, if anyone is keeping track of conflicts-of-interest)
 
 
04.12.2017, 08:46, "Jeff Hammond via cfe-dev" <[hidden email]>:
It would be nice to keep PSTL and OpenMP orthogonal, even if _Pragma("omp simd") does not require runtime support.  It should be trivial to use _Pragma("clang loop vectorize(assume_safety)") instead, by wrapping all of the different compiler vectorization pragmas in preprocessor logic.  I similarly recommend _Pragma("GCC ivdep") for GCC and _Pragma("vector always") for ICC.  While this requires O(n_compilers) effort instead of O(1), but orthogonality is worth it.
 
While OpenMP is vendor/compiler-agnostic, users should not be required to use -fopenmp or similar to enable vectorization from PSTL, nor should the compiler enable any OpenMP pragma by default.  I know of cases where merely using the -fopenmp flag alters code generation in a performance-visible manner, and enabling the OpenMP "simd" pragma by default may surprise some users, particularly if no other OpenMP pragmas are enabled by default.

Best,
 
Jeff
(who works for Intel but not on any software products and has been a heavy user of Intel PSTL since it was released, if anyone is keeping track of conflicts-of-interest)

--
Jeff Hammond
[hidden email]
http://jeffhammond.github.io/
 

Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev
I agree that the guarantees provided by ICC may be stronger than with other compilers, so yes, under OpenMP terms vectorization is permitted and cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some private (vector) and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then I, as a regular application developer, would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since the 'unordered' execution policy is currently not part of the C++ standard, I don't care much how it will be implemented in libc++, if it is. I would just like to ask the Intel guys and the community here to make the implementation extensible, in the sense that a custom OpenMP-SIMD-based execution policy, along with algorithm implementations (as specializations for the policy), can be used with the libc++ library. And I would additionally like to ask the Intel guys to provide a complete and compatible extension on GitHub for developers like me to use.
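
To make the point about variable semantics concrete, here is a small sketch using standard OpenMP SIMD clauses. The function names dot and gather_even are hypothetical, chosen only for this example. The clauses spell out the role of each variable, which the dependency-only pragmas cannot express. Built without -fopenmp or -fopenmp-simd, the pragmas are ignored and the loops run as ordinary scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product. `sum` is declared a reduction variable, `i` is the loop
// induction variable, and `prod` (declared in the body) is private per lane.
inline double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* px = x.data();
    const double* py = y.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    double sum = 0.0;
    #pragma omp simd reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const double prod = px[i] * py[i];
        sum += prod;
    }
    return sum;
}

// Copy every second element. `j` is a secondary induction variable that
// advances by 2 per iteration, which linear(j : 2) states explicitly.
// `in` and `out` are assumed not to overlap.
inline void gather_even(const float* in, float* out, int n) {
    assert(n >= 0);
    int j = 0;
    #pragma omp simd linear(j : 2)
    for (int i = 0; i < n; ++i) {
        out[i] = in[j];
        j += 2;
    }
}
```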
 
Regards,
Serge.
 
 
 
04.12.2017, 12:07, "Jeff Hammond" <[hidden email]>:
ICC implements a very aggressive interpretation of the OpenMP standard, and this interpretation is not shared by everyone in the OpenMP community.  ICC is correct but other implementations may be far less aggressive, so _Pragma("omp simd") doesn't guarantee vectorization unless the compiler documentation says that is how it is implemented.  All the standard says is that vectorization is _permitted_.
 
Given that the practical meaning of _Pragma("omp simd") isn't guaranteed to be consistent across different implementations, I don't really know how to compare it to compiler-specific pragmas unless we define everything explicitly.
 
In any case, my fundamental point remains: do not use OpenMP pragmas here, but instead use whatever the appropriate compiler-specific pragma is, or create a new one that meets the need.
 
Best,
 
Jeff
 
 
On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis <[hidden email]> wrote:
Hello,
 
_Pragma("omp simd") is semantically quite different from _Pragma("clang loop vectorize(assume_safety)"), _Pragma("GCC ivdep") and _Pragma("vector always"), so I am not sure the latter will all work as expected in all cases. They definitely won't provide any vectorization guarantees, which slightly defeats the purpose of using the corresponding execution policy.
 


Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev


On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
I agree that the guarantees provided by ICC may be stronger than with other compilers, so yes, under OpenMP terms vectorization is permitted and cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some private (vector) and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then I, as a regular application developer, would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since the 'unordered' execution policy is currently not part of the C++ standard

std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.
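
As a rough picture of that mapping, and only as a sketch: the helper name for_each_par_unseq_sketch below is hypothetical and is not how any standard library implements std::for_each. It shows what a par_unseq loop would look like if written by hand with the combined OpenMP directive. Without -fopenmp the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>

// Hand-written stand-in for std::for_each(std::execution::par_unseq, ...)
// over a contiguous range. As par_unseq itself requires, `f` must be safe
// to call concurrently and in an interleaved (vectorized) fashion.
template <class T, class F>
void for_each_par_unseq_sketch(T* first, T* last, F f) {
    assert(first <= last);
    const std::ptrdiff_t n = last - first;
    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i)
        f(first[i]);
}
```

A call such as for_each_par_unseq_sketch(v.data(), v.data() + v.size(), [](int& x) { x *= 2; }) doubles every element of a std::vector<int> v.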

I don't care much how it will be implemented in libc++, if it is. I would just like to ask the Intel guys and the community here to make the implementation extensible, in the sense that a custom OpenMP-SIMD-based execution policy, along with algorithm implementations (as specializations for the policy), can be used with the libc++ library. And I would additionally like to ask the Intel guys to provide a complete and compatible extension on GitHub for developers like me to use.

In the end, I think we want the following:

 1. A design for libc++ that allows the thread-level parallelism to be implemented in terms of different underlying providers (e.g., OpenMP, GCD, Work Queues on Windows, whatever else).
 2. To follow the same philosophy with respect to standards as we do everywhere else: Use standards where possible with compiler/system-specific extensions as necessary.

 -Hal
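
One possible shape for point 1, as a minimal sketch: every name here (serial_backend, openmp_backend, parallel_for, transform_in_place) is hypothetical and is not the internal API of the Intel implementation. The algorithm is written once against a small backend interface, and the provider is selected as a template parameter. The OpenMP backend falls back to serial execution when built without -fopenmp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed backend interface: parallel_for(first, last, grain, body) calls
// body(begin, end) on half-open chunks that together cover [first, last).

struct serial_backend {
    template <class Body>
    static void parallel_for(std::ptrdiff_t first, std::ptrdiff_t last,
                             std::ptrdiff_t /*grain*/, Body body) {
        if (first < last) body(first, last);  // one chunk, calling thread
    }
};

struct openmp_backend {
    template <class Body>
    static void parallel_for(std::ptrdiff_t first, std::ptrdiff_t last,
                             std::ptrdiff_t grain, Body body) {
        assert(grain > 0);
        const std::ptrdiff_t chunks = (last - first + grain - 1) / grain;
        #pragma omp parallel for
        for (std::ptrdiff_t c = 0; c < chunks; ++c) {
            const std::ptrdiff_t begin = first + c * grain;
            const std::ptrdiff_t end = std::min(begin + grain, last);
            body(begin, end);
        }
    }
};

// Algorithm code sees only the interface, never the provider.
template <class Backend, class T, class F>
void transform_in_place(std::vector<T>& v, F f) {
    T* data = v.data();
    Backend::parallel_for(0, static_cast<std::ptrdiff_t>(v.size()), 1024,
        [=](std::ptrdiff_t begin, std::ptrdiff_t end) {
            for (std::ptrdiff_t i = begin; i < end; ++i)
                data[i] = f(data[i]);
        });
}
```

A GCD, Windows or TBB provider would be another struct with the same parallel_for signature, leaving transform_in_place untouched.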

 
Regards,
Serge.
 
 
 


-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev


On 11/30/2017 11:48 AM, Marshall Clow wrote:
On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey <[hidden email]> wrote:
Hello all,

At Intel, we have developed an implementation of C++17 execution policies
for algorithms (often referred to as Parallel STL). We hope to contribute it
to libc++/LLVM, so would like to ask the community for comments on this.


Thank you very much for the offer!

Alexey, I suggest breaking this into separately-testable pieces for review (and then posting the relevant patches on reviews.llvm.org).

Do you currently have only a TBB-backed implementation, or do you have an OpenMP-based implementation as well?

Marshall, preferences on how we proceed in general?

Thanks again,
Hal


-- Marshall
 


-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev

On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <[hidden email]> wrote:



std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.


Do you expect par/par_unseq to nest? Nesting omp-parallel is generally regarded as a Bad Idea.

Jeff



On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey via cfe-dev <[hidden email]> wrote:
>
> Hello all,
>
> At Intel, we have developed an implementation of C++17 execution policies
> for algorithms (often referred to as Parallel STL). We hope to contribute it
> to libc++/LLVM, so would like to ask the community for comments on this.
>
> The code is already published at GitHub (https://github.com/intel/parallelstl).
> It supports the C++17 standard execution policies (seq, par, par_unseq) as well as
> the experimental unsequenced policy (unseq) for SIMD execution. At the moment,
> about half of the C++17 standard algorithms that must support execution policies
> are implemented; a few more will be ready soon, and the work continues.
> The tests that we use are also available at GitHub; needless to say we will
> contribute those as well.
>
> The implementation is not specific to Intel’s hardware. For thread-level parallelism
> it uses TBB* (https://www.threadingbuildingblocks.org/) but abstracts it with
> an internal API which can be implemented on top of other threading/parallel solutions –
> so it is for the community to decide which ones to use. For SIMD parallelism
> (unseq, par_unseq) we use #pragma omp simd directives; it is vendor-neutral and
> does not require any OpenMP runtime support.
>
> The current implementation meets the spirit but not always the letter of
> the standard, because it has to be separate from but also coexist with
> implementations of standard C++ libraries. While preparing the contribution,
> we will address inconsistencies, adjust the code to meet community standards,
> and better integrate it into the standard library code.
>
> We are also proposing that our implementation is included into libstdc++/GCC.
> Compatibility between the implementations seems useful as it can potentially
> reduce the amount of work for everyone. We hope to keep the code mostly identical,
> and would like to know if you think it’s too optimistic to expect.
>
> Obviously we plan to use appropriate open source licenses to meet the different
> projects’ requirements.
>
> We expect to keep developing the code and will take the responsibility for
> maintaining it (with community contributions, of course). If there are other
> community efforts to implement parallel algorithms, we are willing to collaborate.
>
> We look forward to your feedback, both for the overall idea and – if supported –
> for the next steps we should take.
>
> Regards,
> - Alexey Kukanov
>
> * Note that TBB itself is highly portable (and ported by community to Power and ARM
> architectures) and permissively licensed, so could be the base for the threading
> infrastructure. But the Parallel STL implementation itself does not require TBB.
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev




--
Jeff Hammond
[hidden email]
http://jeffhammond.github.io/
 

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
--


Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev


On 12/06/2017 10:23 PM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <[hidden email]> wrote:


On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
I agree that the guarantees provided by ICC may be stronger than those of other compilers, so yes, under OpenMP terms vectorization is permitted but cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some private (vector), and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then I, as a regular application developer, would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since the 'unordered' execution policy is currently not part of the C++ standard

std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.


Do you expect par/par_unseq to nest?

Yes.

Nesting omp-parallel is generally regarded as a Bad Idea.

Agreed. I suspect we'll want the mapping to be more like '#pragma omp taskloop simd'.
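
For readers following the mapping discussed above, here is a minimal sketch of what a par_unseq-style loop could lower to with '#pragma omp parallel for simd'. The function name for_each_par_unseq_sketch is hypothetical and is not taken from Intel's code or from libc++. Without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstddef>

// Hypothetical stand-in for what for_each(par_unseq, ...) over a contiguous
// range might lower to. Not an actual libc++ or Parallel STL function.
template <class T, class F>
void for_each_par_unseq_sketch(T* first, std::ptrdiff_t n, F f) {
    // Each call opens its own parallel region. If this function is called
    // from inside another parallel algorithm, the regions nest, which is
    // the situation questioned in the exchange above.
    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        f(first[i]);
    }
}
```

The sketch assumes the iterations are independent, which is what the par_unseq policy already requires of the caller.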

 -Hal


Jeff


I don't much care how it is implemented in libc++, if it is. I would just like to ask the Intel guys and the community here to make the implementation extensible, in the sense that a custom OpenMP-SIMD-based execution policy, along with algorithm implementations (as specializations for the policy), can be used with the libc++ library. I would additionally like to ask the Intel guys to provide a complete and compatible extension on GitHub for developers like me to use.

In the end, I think we want the following:

 1. A design for libc++ that allows the thread-level parallelism to be implemented in terms of different underlying providers (e.g., OpenMP, GCD, Work Queues on Windows, whatever else).
 2. To follow the same philosophy with respect to standards as we do everywhere else: Use standards where possible with compiler/system-specific extensions as necessary.
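
A rough sketch of the layering described in point 1 might look like the following. All names (serial_backend, chunked_backend, parallel_for, for_each_n_sketch) are hypothetical and only illustrate the idea of writing algorithms once against a small backend primitive. To stay dependency-free, chunked_backend is merely a stand-in for a real provider: it splits the range the way a threading runtime would, but runs the chunks in order on the calling thread.

```cpp
#include <algorithm>
#include <cstddef>

// A backend exposes one primitive: call body(begin, end) on sub-ranges
// that together cover [0, n) exactly once.
struct serial_backend {
    template <class Body>
    void parallel_for(std::size_t n, Body body) const {
        body(std::size_t{0}, n);  // a single chunk on the calling thread
    }
};

// Stand-in for a real provider (TBB, OpenMP, GCD, ...). It only shows the
// chunking contract; a real backend would hand the chunks to worker threads.
struct chunked_backend {
    std::size_t grain;  // maximum chunk size

    template <class Body>
    void parallel_for(std::size_t n, Body body) const {
        const std::size_t step = (grain == 0) ? 1 : grain;
        for (std::size_t b = 0; b < n; b += step) {
            body(b, std::min(n, b + step));
        }
    }
};

// Algorithm layer: written once, independent of the backend in use.
template <class Backend, class T, class F>
void for_each_n_sketch(const Backend& backend, T* data, std::size_t n, F f) {
    backend.parallel_for(n, [=](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            f(data[i]);
        }
    });
}
```

Swapping the provider then means swapping the backend type, while the algorithm code stays untouched.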

 -Hal


 
Regards,
Serge.
 
 
 
04.12.2017, 12:07, "Jeff Hammond" [hidden email]:
ICC implements a very aggressive interpretation of the OpenMP standard, and this interpretation is not shared by everyone in the OpenMP community.  ICC is correct but other implementations may be far less aggressive, so _Pragma("omp simd") doesn't guarantee vectorization unless the compiler documentation says that is how it is implemented.  All the standard says is that vectorization is _permitted_.
 
Given that the practical meaning of _Pragma("omp simd") isn't guaranteed to be consistent across different implementations, I don't really know how to compare it to compiler-specific pragmas unless we define everything explicitly.
 
In any case, my fundamental point remains: do not use OpenMP pragmas here, but instead use whatever the appropriate compiler-specific pragma is, or create a new one that meets the need.
 
Best,
 
Jeff
 
 
On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis <[hidden email]> wrote:
Hello,
 
_Pragma("omp simd") is semantically quite different from _Pragma("clang loop vectorize(assume_safety)"), _Pragma("GCC ivdep") and _Pragma("vector always"), so I am not sure the latter will all work as expected in all cases. They definitely won't provide any vectorization guarantees, which somewhat defeats the purpose of using the corresponding execution policy.
 
I support the idea of keeping OpenMP orthogonal, and having -fopenmp enabled by default is definitely not an option. The Intel compiler has a separate -qopenmp-simd option which doesn't affect performance outside explicitly marked loops, but even this is not enabled by default. I would say that multiple implementations of the unordered policy might exist: initially, an OpenMP-SIMD-based implementation may be the more powerful one, with one based on other pragmas being the default but hinting at the existence of the faster option. Later on, someone may be brave enough to add a SIMD template library and implement the default unordered policy using it (such an implementation is possible even now using vector types, but it would be extremely complex if it attempted to target all the base data types, vector widths, and target SIMD architectures that clang supports. Even with the library this may be quite tedious).
 
Without any standard way of expressing SIMD parallelism in pure C++, any implementer of a SIMD execution policy has to rely on the means available for the platform/compiler, so it is not totally unnatural to ask the user to enable OpenMP SIMD for efficient support of the corresponding execution policy.
 
Regards,
Serge Preis
 
(Who once was part of the Intel Compiler Vectorizer team and drove OpenMP SIMD efforts within icc and beyond, if anyone is keeping track of conflicts-of-interest)
 
 
04.12.2017, 08:46, "Jeff Hammond via cfe-dev" <[hidden email]>:
It would be nice to keep PSTL and OpenMP orthogonal, even if _Pragma("omp simd") does not require runtime support.  It should be trivial to use _Pragma("clang loop vectorize(assume_safety)") instead, by wrapping all of the different compiler vectorization pragmas in preprocessor logic.  I similarly recommend _Pragma("GCC ivdep") for GCC and _Pragma("vector always") for ICC.  This requires O(n_compilers) effort instead of O(1), but orthogonality is worth it.
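
The preprocessor wrapping suggested in this paragraph could be sketched roughly as follows. The macro name PSTL_SKETCH_VECTORIZE and the function add_into are hypothetical; only the three pragma strings come from the message itself. As noted elsewhere in the thread, these pragmas are hints and do not carry the same semantics as the OpenMP simd directive.

```cpp
#include <cstddef>

// Hypothetical macro, not part of any library. The order of the checks
// matters because ICC and clang also define __GNUC__.
#if defined(__INTEL_COMPILER)
#  define PSTL_SKETCH_VECTORIZE _Pragma("vector always")
#elif defined(__clang__)
#  define PSTL_SKETCH_VECTORIZE _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__GNUC__)
#  define PSTL_SKETCH_VECTORIZE _Pragma("GCC ivdep")
#else
#  define PSTL_SKETCH_VECTORIZE  // unknown compiler: plain scalar loop
#endif

// Adds x[i] into y[i] for i in [0, n). The caller promises that x and y do
// not overlap, which is what makes the dependency hints above legitimate.
void add_into(float* y, const float* x, std::size_t n) {
    PSTL_SKETCH_VECTORIZE
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += x[i];
    }
}
```

No -fopenmp or similar flag is needed to compile this, which is the orthogonality being argued for.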
 
While OpenMP is vendor/compiler-agnostic, users should not be required to use -fopenmp or similar to enable vectorization from PSTL, nor should the compiler enable any OpenMP pragma by default.  I know of cases where merely using the -fopenmp flag alters code generation in a performance-visible manner, and enabling the OpenMP "simd" pragma by default may surprise some users, particularly if no other OpenMP pragmas are enabled by default.

Best,
 
Jeff
(who works for Intel but not on any software products and has been a heavy user of Intel PSTL since it was released, if anyone is keeping track of conflicts-of-interest)


-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev
 
 
 
07.12.2017, 11:24, "Jeff Hammond" <[hidden email]>:
 
On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <[hidden email]> wrote:

 

On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
I agree that the guarantees provided by ICC may be stronger than those of other compilers, so yes, under OpenMP terms vectorization is permitted but cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some private (vector), and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then I, as a regular application developer, would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since the 'unordered' execution policy is currently not part of the C++ standard
std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.
 
 
Do you expect par/par_unseq to nest? Nesting omp-parallel is generally regarded as a Bad Idea.
 
Sorry for answering a question that was not addressed to me.
 
I don't see any OpenMP parallel implementation in the sources published by Intel on GitHub, so as far as I understand, par_unseq in the currently published implementation is a mix of TBB and #pragma omp simd (though others may exist inside Intel).
 
In my personal opinion, execution policies in their current form were not designed with nesting in mind. There is no concept of execution resource management, so nesting will either lead to oversubscription or rely on some global resource management hidden in the implementation, basically precluding any other form of parallelism in the program. To ease this issue in a hypothetical OpenMP implementation, I think OpenMP tasks are a better fit as the internal machinery than plain 'omp parallel for'.
 
The inability to pass a thread pool in any form as part of the std::execution::par policy is a major roadblock to adopting parallel policies in the company I work for.
 
Nesting something parallel into par_unseq (and, I believe, Intel's unseq) is explicitly prohibited.
 
 
 
Jeff
 
 

Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev

On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <[hidden email]> wrote:


On 12/06/2017 10:23 PM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <[hidden email]> wrote:


On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
I agree that the guarantees provided by ICC may be stronger than those of other compilers, so yes, under OpenMP terms vectorization is permitted but cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some private (vector), and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then I, as a regular application developer, would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since the 'unordered' execution policy is currently not part of the C++ standard

std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.


Do you expect par/par_unseq to nest?

Yes.


Nesting omp-parallel is generally regarded as a Bad Idea.

Agreed. I suspect we'll want the mapping to be more like '#pragma omp taskloop simd'.


That won’t run in parallel unless it is in an omp-parallel-master region. That means an OpenMP-based PSTL won’t be parallel unless the user knows to add back-end-specific code around the PSTL calls.

What I’m trying to say is that OpenMP is a poor target for PSTL in its current form. Nested parallel regions are the only thing that works, and they are likely to work poorly.
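
To make the point about taskloop concrete, here is a minimal sketch, with the hypothetical name for_each_taskloop_sketch, of the structure that would be needed for '#pragma omp taskloop simd' to actually run on several threads. It is only an illustration of the OpenMP constructs mentioned above, not code from any PSTL implementation. Without OpenMP enabled, all pragmas are ignored and the loop runs serially.

```cpp
#include <cstddef>

// Hypothetical sketch. The taskloop only creates tasks; they are spread over
// several threads only if a team already exists, so the loop is wrapped in a
// parallel region where one thread (master) generates the tasks.
template <class T, class F>
void for_each_taskloop_sketch(T* first, std::ptrdiff_t n, F f) {
    #pragma omp parallel
    {
        #pragma omp master
        {
            #pragma omp taskloop simd
            for (std::ptrdiff_t i = 0; i < n; ++i) {
                f(first[i]);
            }
        }
    }
}
```

If the library omits the enclosing parallel/master wrapper, the user has to supply it around the call. If the library adds it on every call, as in this sketch, calls made from inside another parallel algorithm again produce nested parallel regions.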

Jeff


 -Hal



Jeff



Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev


On 12/07/2017 11:35 AM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <[hidden email]> wrote:


On 12/06/2017 10:23 PM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <[hidden email]> wrote:


On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
I agree that the guarantees provided by ICC may be stronger than those of other compilers, so yes, under OpenMP terms vectorization is permitted and cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some are private (vector), and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and it makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then, as a regular application developer I would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since 'unordered' execution policy is currently not part of C++ standard

std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.


Do you expect par/par_unseq to nest?

Yes.


Nesting omp-parallel is generally regarded as a Bad Idea.

Agreed. I suspect we'll want the mapping to be more like '#pragma omp taskloop simd'.


That won’t run in parallel unless in an omp-parallel-master region.

Yes.

That means OpenMP-based PSTL won’t be parallel unless the user knows to add back-end specific code about the PSTL.

That obviously wouldn't be acceptable.


What I’m trying to say is that OpenMP is a poor target for PSTL in its current form. Nesting parallel regions is the only thing that works, and it is likely to work poorly.

I'm not sure that's true, but the technique may not be trivial. I believe that it is possible, however. For example, the mapping might be to something like:

if (omp_in_parallel()) {
#pragma omp taskloop simd
  for (size_t i = 0; i < N; ++i)
    F(X[i]);
} else {
#pragma omp parallel
#pragma omp single  // one thread creates the tasks; the rest of the team runs them
  {
#pragma omp taskloop simd
    for (size_t i = 0; i < N; ++i)
      F(X[i]);
  }
}

The fact that we'd need to use this kind of pattern is a bit unfortunate, but it can be easily abstracted into a template function, so it just becomes some implementation detail of the library.
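
As a rough illustration of the abstraction mentioned above, the pattern could be wrapped in a function template along the following lines. This is only a sketch under stated assumptions: the names taskloop_for_each and example are hypothetical and are not part of Intel's PSTL, libc++, or OpenMP. It uses an omp single construct so that only one thread of a newly created team generates the tasks, and it falls back to a plain serial loop when the compiler is not in OpenMP mode (_OPENMP undefined).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper, for illustration only: applies f to every element
// of v, creating OpenMP tasks whether or not a parallel region is active.
template <class T, class F>
void taskloop_for_each(std::vector<T>& v, F f) {
  T* x = v.data();
  const std::size_t n = v.size();
#ifdef _OPENMP
  if (omp_in_parallel()) {
    // Already inside an active parallel region: create tasks for the
    // existing team instead of opening a nested region.
#pragma omp taskloop simd
    for (std::size_t i = 0; i < n; ++i)
      f(x[i]);
  } else {
    // No active region: start a team and let one thread create the tasks.
#pragma omp parallel
#pragma omp single
    {
#pragma omp taskloop simd
      for (std::size_t i = 0; i < n; ++i)
        f(x[i]);
    }
  }
#else
  // Assumed fallback when OpenMP is disabled: plain serial loop.
  for (std::size_t i = 0; i < n; ++i)
    f(x[i]);
#endif
}

// Usage sketch: every element is visited exactly once.
inline void example() {
  std::vector<int> v(1000, 1);
  taskloop_for_each(v, [](int& e) { e += 1; });
  for (int e : v)
    assert(e == 2);
}
```

Callers would see only the function template; the choice between the two branches stays inside the library.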

Thanks again,
Hal


Jeff


 -Hal



Jeff


I don't care much about how it will be implemented in libc++, if it is. I would just like to ask the Intel guys and the community here to make the implementation extensible, in the sense that a custom OpenMP-SIMD-based execution policy, along with algorithm implementations (as specializations for the policy), can be used with the libc++ library. I would additionally like to ask the Intel guys to provide a complete and compatible extension on GitHub for developers like me to use.

In the end, I think we want the following:

 1. A design for libc++ that allows the thread-level parallelism to be implemented in terms of different underlying providers (e.g., OpenMP, GCD, Work Queues on Windows, whatever else).
 2. To follow the same philosophy with respect to standards as we do everywhere else: Use standards where possible with compiler/system-specific extensions as necessary.

 -Hal


 
Regards,
Serge.
 
 
 
04.12.2017, 12:07, "Jeff Hammond" [hidden email]:
ICC implements a very aggressive interpretation of the OpenMP standard, and this interpretation is not shared by everyone in the OpenMP community.  ICC is correct, but other implementations may be far less aggressive, so _Pragma("omp simd") doesn't guarantee vectorization unless the compiler documentation says that is how it is implemented.  All the standard says is that vectorization is _permitted_.
 
Given that the practical meaning of _Pragma("omp simd") isn't guaranteed to be consistent across different implementations, I don't really know how to compare it to compiler-specific pragmas unless we define everything explicitly.
 
In any case, my fundamental point remains: do not use OpenMP pragmas here, but instead use whatever the appropriate compiler-specific pragma is, or create a new one that meets the need.
 
Best,
 
Jeff
 
 
On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis <[hidden email]> wrote:
Hello,
 
_Pragma("omp simd") is semantically quite different from _Pragma("clang loop vectorize(assume_safety)"), _Pragma("GCC ivdep") and _Pragma("vector always"), so I am not sure all of the latter will work as expected in all cases. They definitely won't provide any vectorization guarantees, which somewhat defeats the purpose of using the corresponding execution policy.

I support the idea of keeping OpenMP orthogonal, and having -fopenmp enabled by default is definitely not an option. The Intel compiler has a separate -qopenmp-simd option which doesn't affect performance outside explicitly marked loops, but even this is not enabled by default. I would say that multiple implementations of the unordered policy might exist: the original OpenMP-SIMD-based implementation may be more powerful, while one based on other pragmas is the default but hints at the existence of a faster option. Later on, someone may be brave enough to add a SIMD template library and implement the default unordered policy using it (such an implementation is possible even now using vector types, but it would be extremely complex if it attempted to target all base data types, vector widths and target SIMD architectures clang supports. Even with the library this may be quite tedious).

Without any standard way of expressing SIMD parallelism in pure C++, any implementer of a SIMD execution policy has to rely on the means available for the platform/compiler, so it is not totally unnatural to ask the user to enable OpenMP SIMD for efficient support of the corresponding execution policy.

Regards,
Serge Preis

(Who once was part of the Intel Compiler Vectorizer team and drove OpenMP SIMD efforts within icc and beyond, if anyone is keeping track of conflicts-of-interest)
 
 
04.12.2017, 08:46, "Jeff Hammond via cfe-dev" <[hidden email]>:
It would be nice to keep PSTL and OpenMP orthogonal, even if _Pragma("omp simd") does not require runtime support.  It should be trivial to use _Pragma("clang loop vectorize(assume_safety)") instead, by wrapping all of the different compiler vectorization pragmas in preprocessor logic.  I similarly recommend _Pragma("GCC ivdep") for GCC and _Pragma("vector always") for ICC.  This requires O(n_compilers) effort instead of O(1), but orthogonality is worth it.
 
While OpenMP is vendor/compiler-agnostic, users should not be required to use -fopenmp or similar to enable vectorization from PSTL, nor should the compiler enable any OpenMP pragma by default.  I know of cases where merely using the -fopenmp flag alters code generation in a performance-visible manner, and enabling the OpenMP "simd" pragma by default may surprise some users, particularly if no other OpenMP pragmas are enabled by default.
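
The preprocessor wrapping described above might look roughly like the following. This is a sketch, not code from Intel's PSTL or libc++: the macro name PSTL_SKETCH_PRAGMA_SIMD and the function axpy_unseq are hypothetical, and the compiler-detection order is an assumption (ICC is tested first because it also defines __GNUC__). The three pragmas are the ones named in this thread; they are hints with differing semantics and, as noted elsewhere in the discussion, none of them guarantees vectorization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical macro: selects a compiler-specific vectorization hint
// without requiring -fopenmp or any OpenMP pragma.
#if defined(__INTEL_COMPILER)
#define PSTL_SKETCH_PRAGMA_SIMD _Pragma("vector always")
#elif defined(__clang__)
#define PSTL_SKETCH_PRAGMA_SIMD _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__GNUC__)
#define PSTL_SKETCH_PRAGMA_SIMD _Pragma("GCC ivdep")
#else
#define PSTL_SKETCH_PRAGMA_SIMD  // unknown compiler: no hint
#endif

// Illustrative loop: y[i] += a * x[i]. The hint asserts that iterations
// are independent, so x and y must be distinct vectors.
inline void axpy_unseq(float a, const std::vector<float>& x,
                       std::vector<float>& y) {
  assert(x.size() == y.size());
  const float* px = x.data();
  float* py = y.data();
  const std::size_t n = y.size();
  PSTL_SKETCH_PRAGMA_SIMD
  for (std::size_t i = 0; i < n; ++i)
    py[i] += a * px[i];
}
```

The result is the same whether or not the compiler acts on the hint; only the generated code differs.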

Best,
 
Jeff
(who works for Intel but not on any software products and has been a heavy user of Intel PSTL since it was released, if anyone is keeping track of conflicts-of-interest)


Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev


On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <[hidden email]> wrote:


On 12/07/2017 11:35 AM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <[hidden email]> wrote:


On 12/06/2017 10:23 PM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <[hidden email]> wrote:


On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
I agree that the guarantees provided by ICC may be stronger than those of other compilers, so yes, under OpenMP terms vectorization is permitted and cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some are private (vector), and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and it makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then, as a regular application developer I would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since 'unordered' execution policy is currently not part of C++ standard

std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.


Do you expect par/par_unseq to nest?

Yes.


Nesting omp-parallel is generally regarded as a Bad Idea.

Agreed. I suspect we'll want the mapping to be more like '#pragma omp taskloop simd'.


That won’t run in parallel unless in an omp-parallel-master region.

Yes.

That means OpenMP-based PSTL won’t be parallel unless the user knows to add back-end specific code about the PSTL.

That obviously wouldn't be acceptable.


What I’m trying to say is that OpenMP is a poor target for PSTL in its current form. Nesting parallel regions is the only thing that works, and it is likely to work poorly.

I'm not sure that's true, but the technique may not be trivial. I believe that it is possible, however. For example, the mapping might be to something like:

if (omp_in_parallel()) {
#pragma omp taskloop simd
  for (size_t i = 0; i < N; ++i)
    F(X[i]);
} else {
#pragma omp parallel
#pragma omp single  // one thread creates the tasks; the rest of the team runs them
  {
#pragma omp taskloop simd
    for (size_t i = 0; i < N; ++i)
      F(X[i]);
  }
}

The fact that we'd need to use this kind of pattern is a bit unfortunate, but it can be easily abstracted into a template function, so it just becomes some implementation detail of the library.


You are right and that is probably the best way to do it with OpenMP.  I am concerned about the absolute performance, based upon my observations of omp-taskloop vs omp-for and tbb::parallel_for in the PRK project, but at least it is sane from a semantic perspective.  Having motivating use cases like PSTL should lead to improvements in OpenMP runtime performance w.r.t. taskloop.

https://i.stack.imgur.com/MVd5j.png is a snapshot of the performance of PRK stencil (https://github.com/ParRes/Kernels/tree/master/Cxx11), which shows taskloop loses to TBB-based PSTL, OpenMP for, and tbb::parallel_for (pure TBB beats TBB-based PSTL because I use tbb::blocked_range2d, which improves cache utilization).  I think the taskloop grainsize was tuned for those results as well, so they may be an optimistic representation of taskloop in general usage.
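
For readers unfamiliar with the grainsize tuning mentioned above, a minimal sketch follows. It is not taken from the PRK code: the function name scale_taskloop and its parameters are hypothetical. The grainsize clause of the OpenMP taskloop construct controls how many loop iterations are grouped into each generated task, which is the knob that was tuned. Without OpenMP enabled, the pragmas are ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: multiplies every element of v by a, asking the
// OpenMP runtime to group the iterations into tasks of about `grain`
// iterations each (the value the clause is given).
inline void scale_taskloop(std::vector<double>& v, double a,
                           std::size_t grain) {
  assert(grain > 0);  // the grainsize clause requires a positive value
  double* p = v.data();
  const std::size_t n = v.size();
#pragma omp parallel
#pragma omp single
  {
#pragma omp taskloop simd grainsize(grain)
    for (std::size_t i = 0; i < n; ++i)
      p[i] *= a;
  }
}
```

A small grain creates many tasks and more scheduling overhead; a large grain creates few tasks and may leave threads idle, so the best value depends on the workload.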

I'll see if I can prototype this in RAJA or Intel PSTL.  It's not hard to get results directly from the PRK tests, if the former attempts fail.

Best,

Jeff
 

Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev


On 12/08/2017 03:55 PM, Jeff Hammond wrote:


On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <[hidden email]> wrote:


On 12/07/2017 11:35 AM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <[hidden email]> wrote:


On 12/06/2017 10:23 PM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <[hidden email]> wrote:


On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
I agree that the guarantees provided by ICC may be stronger than those of other compilers, so yes, under OpenMP terms vectorization is permitted and cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some are private (vector), and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and it makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then, as a regular application developer I would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since 'unordered' execution policy is currently not part of C++ standard

std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.


Do you expect par/par_unseq to nest?

Yes.


Nesting omp-parallel is generally regarded as a Bad Idea.

Agreed. I suspect we'll want the mapping to be more like '#pragma omp taskloop simd'.


That won’t run in parallel unless in an omp-parallel-master region.

Yes.

That means OpenMP-based PSTL won’t be parallel unless the user knows to add back-end specific code about the PSTL.

That obviously wouldn't be acceptable.


What I’m trying to say is that OpenMP is a poor target for PSTL in its current form. Nesting parallel regions is the only thing that works, and it is likely to work poorly.

I'm not sure that's true, but the technique may not be trivial. I believe that it is possible, however. For example, the mapping might be to something like:

if (omp_in_parallel()) {
#pragma omp taskloop simd
  for (size_t i = 0; i < N; ++i)
    F(X[i]);
} else {
#pragma omp parallel
#pragma omp single  // one thread creates the tasks; the rest of the team runs them
  {
#pragma omp taskloop simd
    for (size_t i = 0; i < N; ++i)
      F(X[i]);
  }
}

The fact that we'd need to use this kind of pattern is a bit unfortunate, but it can be easily abstracted into a template function, so it just becomes some implementation detail of the library.


You are right and that is probably the best way to do it with OpenMP.  I am concerned about the absolute performance, based upon my observations of omp-taskloop vs omp-for and tbb::parallel_for

Have you tried this recently? There was a recursive task-stealing strategy added to our OpenMP library in July of this year (r308338) which should have made the performance of taskloop better.

in the PRK project, but at least it is sane from a semantic perspective.  Having motivating use cases like PSTL should lead to improvements in OpenMP runtime performance w.r.t. taskloop.

Indeed :-)


https://i.stack.imgur.com/MVd5j.png is a snapshot of the performance of PRK stencil (https://github.com/ParRes/Kernels/tree/master/Cxx11), which shows taskloop loses to TBB-based PSTL, OpenMP for, and tbb::parallel_for (pure TBB beats TBB-based PSTL because I use tbb::blocked_range2d, which improves cache utilization).  I think the taskloop grainsize was tuned for those results as well, so they may be an optimistic representation of taskloop in general usage.

Interesting.


I'll see if I can prototype this in RAJA or Intel PSTL.  It's not hard to get results directly from the PRK tests, if the former attempts fail.

Thanks!

 -Hal


On Wed, Nov 29, 2017 at 4:21 AM, Kukanov, Alexey via cfe-dev <[hidden email]> wrote:
>
> Hello all,
>
> At Intel, we have developed an implementation of C++17 execution policies
> for algorithms (often referred to as Parallel STL). We hope to contribute it
> to libc++/LLVM, so would like to ask the community for comments on this.
>
> The code is already published at GitHub (https://github.com/intel/parallelstl).
> It supports the C++17 standard execution policies (seq, par, par_unseq) as well as
> the experimental unsequenced policy (unseq) for SIMD execution. At the moment,
> about half of the C++17 standard algorithms that must support execution policies
> are implemented; a few more will be ready soon, and the work continues.
> The tests that we use are also available at GitHub; needless to say we will
> contribute those as well.
>
> The implementation is not specific to Intel’s hardware. For thread-level parallelism
> it uses TBB* (https://www.threadingbuildingblocks.org/) but abstracts it with
> an internal API which can be implemented on top of other threading/parallel solutions –
> so it is for the community to decide which ones to use. For SIMD parallelism
> (unseq, par_unseq) we use #pragma omp simd directives; it is vendor-neutral and
> does not require any OpenMP runtime support.
>
> The current implementation meets the spirit but not always the letter of
> the standard, because it has to be separate from but also coexist with
> implementations of standard C++ libraries. While preparing the contribution,
> we will address inconsistencies, adjust the code to meet community standards,
> and better integrate it into the standard library code.
>
> We are also proposing that our implementation is included into libstdc++/GCC.
> Compatibility between the implementations seems useful as it can potentially
> reduce the amount of work for everyone. We hope to keep the code mostly identical,
> and would like to know if you think it’s too optimistic to expect.
>
> Obviously we plan to use appropriate open source licenses to meet the different
> projects’ requirements.
>
> We expect to keep developing the code and will take the responsibility for
> maintaining it (with community contributions, of course). If there are other
> community efforts to implement parallel algorithms, we are willing to collaborate.
>
> We look forward to your feedback, both for the overall idea and – if supported –
> for the next steps we should take.
>
> Regards,
> - Alexey Kukanov
>
> * Note that TBB itself is highly portable (and ported by community to Power and ARM
> architectures) and permissively licensed, so could be the base for the threading
> infrastructure. But the Parallel STL implementation itself does not require TBB.
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev




--
Jeff Hammond
[hidden email]
http://jeffhammond.github.io/
 
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Re: [RFC] Proposal to contribute Intel’s implementation of C++17 parallel algorithms

Robinson, Paul via cfe-dev


On Fri, Dec 8, 2017 at 2:34 PM, Hal Finkel <[hidden email]> wrote:


On 12/08/2017 03:55 PM, Jeff Hammond wrote:


On Fri, Dec 8, 2017 at 1:13 PM, Hal Finkel <[hidden email]> wrote:


On 12/07/2017 11:35 AM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 8:57 PM Hal Finkel <[hidden email]> wrote:


On 12/06/2017 10:23 PM, Jeff Hammond wrote:

On Wed, Dec 6, 2017 at 4:23 PM Hal Finkel <[hidden email]> wrote:


On 12/04/2017 10:48 PM, Serge Preis via cfe-dev wrote:
I agree that the guarantees provided by ICC may be stronger than with other compilers, so yes, in OpenMP terms vectorization is permitted and cannot be assumed. However, OpenMP clearly defines the semantics of variables used within an OpenMP region: some are shared (scalar), some private (vector), and some are inductions. This goes far beyond typical compiler-specific pragmas about dependencies and cost modelling, and makes vectorization a much simpler task with more predictable and robust results if properly implemented (admittedly, even the ICC implementation is far from perfect). I hope Intel's efforts to standardize something like this in core C++ will eventually come to fruition. Until then, I as a regular application developer would appreciate an OpenMP-SIMD-based execution policy (hoping for good support for OpenMP SIMD in clang), but it shouldn't necessarily be part of libc++. Since the 'unordered' execution policy is currently not part of the C++ standard

std::execution::par_unseq is part of C++17, and that essentially maps to '#pragma omp parallel for simd'.


Do you expect par/par_unseq to nest?

Yes.


Nesting omp-parallel is generally regarded as a Bad Idea.

Agreed. I suspect we'll want the mapping to be more like '#pragma omp taskloop simd'.


That won’t run in parallel unless in an omp-parallel-master region.

Yes.

That means OpenMP-based PSTL won’t be parallel unless the user knows to add back-end-specific code around the PSTL calls.

That obviously wouldn't be acceptable.


What I’m trying to say is that OpenMP is a poor target for PSTL in its current form. Nested parallel regions are the only thing that works, and they are likely to work poorly.

I'm not sure that's true, but the technique may not be trivial. I believe that it is possible, however. For example, the mapping might be to something like:

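if (omp_in_parallel()) {
  // Already inside a parallel region: generate tasks for the existing team.
#pragma omp taskloop simd
  for (size_t i = 0; i < N; ++i)
    F(X[i]);
} else {
  // Not in a parallel region: create a team, then let exactly one thread
  // generate the tasks so the loop is not executed once per thread.
#pragma omp parallel
#pragma omp single
  {
#pragma omp taskloop simd
     for (size_t i = 0; i < N; ++i)
       F(X[i]);
  }
}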

The fact that we'd need to use this kind of pattern is a bit unfortunate, but it can be easily abstracted into a template function, so it just becomes some implementation detail of the library.


You are right and that is probably the best way to do it with OpenMP.  I am concerned about the absolute performance, based upon my observations of omp-taskloop vs omp-for and tbb::parallel_for

Have you tried this recently? There was a recursive task-stealing strategy added to our OpenMP library in July of this year (r308338) which should have made the performance of taskloop better.


I ran those benchmarks this summer with Intel 18 beta.  Tom from LLNL mentioned that a stealing-based implementation of OpenMP taskloop was feasible but I didn't investigate whether it was used.  Obviously, I know some people who can help me answer questions about the LLVM OpenMP runtime ;-)
 
in the PRK project, but at least it is sane from a semantic perspective.  Having motivating use cases like PSTL should lead to improvements in OpenMP runtime performance w.r.t. taskloop.

Indeed :-)


https://i.stack.imgur.com/MVd5j.png is a snapshot of the performance of PRK stencil (https://github.com/ParRes/Kernels/tree/master/Cxx11), which shows taskloop loses to TBB-based PSTL, OpenMP for, and tbb::parallel_for (pure TBB beats TBB-based PSTL because I use tbb::blocked_range2d, which improves cache utilization).  I think those results tuned taskloop grainsize as well, so they may be an optimistic representation of taskloop in general usage.

Interesting.
 
I should try to figure out how to recreate what TBB does with PSTL since it's clearly beneficial, at least on KNL.  Obviously, I can block loops manually as I do with raw OpenMP code, but I'm sure there's a nicer way.
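For readers unfamiliar with the manual blocking mentioned above, here is a small serial sketch of iterating a 2D index space tile by tile. The helper name for_each_tiled and the single square tile size are assumptions made for illustration; this is not what tbb::blocked_range2d does internally, only the same general idea of visiting nearby (i, j) pairs together.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: call f(i, j) for every point of an n x m index space,
// walking it in tile x tile blocks so that each block stays cache-resident.
template <class F>
void for_each_tiled(std::size_t n, std::size_t m, std::size_t tile, F f) {
  if (tile == 0) tile = 1;  // avoid an infinite loop on a zero tile size
  for (std::size_t ib = 0; ib < n; ib += tile) {
    for (std::size_t jb = 0; jb < m; jb += tile) {
      // Clamp the block so partial tiles at the edges are handled.
      const std::size_t iend = std::min(ib + tile, n);
      const std::size_t jend = std::min(jb + tile, m);
      for (std::size_t i = ib; i < iend; ++i)
        for (std::size_t j = jb; j < jend; ++j)
          f(i, j);
    }
  }
}
```

In a parallel version the two outer loops over blocks would be the ones handed to the threading layer, while the inner loops stay serial or vectorized.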

I'll see if I can prototype this in RAJA or Intel PSTL.  It's not hard to get results directly from the PRK tests, if the former attempts fail.
Correction: I'll see if I can prototype for_each.  The rest will be left as an exercise for the reader :-D

Jeff
 
Thanks!

 -Hal





I don't care much about how it will be implemented in libc++, if it is. I would just like to ask the Intel guys and the community here to make the implementation extensible, in the sense that a custom OpenMP-SIMD-based execution policy, along with algorithm implementations (as specializations for the policy), can be used with the libc++ library. I would additionally like to ask the Intel guys to provide a complete and compatible extension on GitHub for developers like me to use.

In the end, I think we want the following:

 1. A design for libc++ that allows the thread-level parallelism to be implemented in terms of different underlying providers (e.g., OpenMP, GCD, Work Queues on Windows, whatever else).
 2. To follow the same philosophy with respect to standards as we do everywhere else: Use standards where possible with compiler/system-specific extensions as necessary.
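One way to picture the first point is a thin backend interface that the algorithms call instead of any specific threading runtime. The sketch below is only an illustration under that assumption: serial_backend, std_thread_backend, and for_each_index are hypothetical names, not libc++'s or Intel PSTL's actual interface, and plain std::thread stands in for a real provider such as OpenMP, GCD, or TBB.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical provider that runs everything on the calling thread.
struct serial_backend {
  template <class Body>
  static void parallel_for(std::size_t first, std::size_t last, Body body) {
    for (std::size_t i = first; i < last; ++i) body(i);
  }
};

// Hypothetical provider that splits the range into one chunk per thread.
struct std_thread_backend {
  template <class Body>
  static void parallel_for(std::size_t first, std::size_t last, Body body) {
    if (first >= last) return;
    const std::size_t n = last - first;
    const std::size_t hw =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t nthreads = std::min(n, hw);
    const std::size_t chunk = (n + nthreads - 1) / nthreads;  // ceiling division
    std::vector<std::thread> workers;
    for (std::size_t begin = first; begin < last; begin += chunk) {
      const std::size_t end = std::min(begin + chunk, last);
      workers.emplace_back([=] {
        for (std::size_t i = begin; i < end; ++i) body(i);
      });
    }
    for (auto& w : workers) w.join();
  }
};

// Algorithm layer: written once, independent of the chosen provider.
template <class Backend, class Body>
void for_each_index(std::size_t first, std::size_t last, Body body) {
  Backend::parallel_for(first, last, body);
}
```

Swapping providers is then a matter of changing one template argument, for example `for_each_index<std_thread_backend>(0, n, f)` instead of `for_each_index<serial_backend>(0, n, f)`.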

 -Hal


 
Regards,
Serge.
 
 
 
04.12.2017, 12:07, "Jeff Hammond" [hidden email]:
ICC implements a very aggressive interpretation of the OpenMP standard, and this interpretation is not shared by everyone in the OpenMP community.  ICC is correct but other implementations may be far less aggressive, so _Pragma("omp simd") doesn't guarantee vectorization unless the compiler documentation says that is how it is implemented.  All the standard says is that vectorization is _permitted_.
 
Given that the practical meaning of _Pragma("omp simd") isn't guaranteed to be consistent across different implementations, I don't really know how to compare it to compiler-specific pragmas unless we define everything explicitly.
 
In any case, my fundamental point remains: do not use OpenMP pragmas here, but instead use whatever the appropriate compiler-specific pragma is, or create a new one that meets the need.
 
Best,
 
Jeff
 
 
On Sun, Dec 3, 2017 at 8:09 PM, Serge Preis <[hidden email]> wrote:
Hello,
 
_Pragma("omp simd") is semantically quite different from _Pragma("clang loop vectorize(assume_safety)"), _Pragma("GCC ivdep") and _Pragma("vector always"), so I am not sure the latter will all work as expected in all cases. They definitely won't provide any vectorization guarantees, which slightly defeats the purpose of using the corresponding execution policy.
 
I support the idea of keeping OpenMP orthogonal, and having -fopenmp enabled by default is definitely not an option. The Intel compiler has a separate -qopenmp-simd option which doesn't affect performance outside explicitly marked loops, but even this is not enabled by default. I would say that multiple implementations of the unordered policy could exist: the original OpenMP SIMD based implementation may be more powerful, while one based on other pragmas would be the default and hint at the existence of the faster option. Later on, someone may be brave enough to add a SIMD template library and implement the default unordered policy using it (such an implementation is possible even now using vector types, but it would be extremely complex if it attempted to target all base data types, vector widths, and target SIMD architectures clang supports; even with the library this may be quite tedious).
 
Without any standard way of expressing SIMD parallelism in pure C++, any implementer of a SIMD execution policy has to rely on the means available for the platform/compiler, so it is not totally unnatural to ask the user to enable OpenMP SIMD for efficient support of the corresponding execution policy.
 
Regards,
Serge Preis
 
(Who was once part of the Intel Compiler Vectorizer team and drove OpenMP SIMD efforts within icc and beyond, if anyone is keeping track of conflicts-of-interest)
 
 
04.12.2017, 08:46, "Jeff Hammond via cfe-dev" <[hidden email]>:
It would be nice to keep PSTL and OpenMP orthogonal, even if _Pragma("omp simd") does not require runtime support.  It should be trivial to use _Pragma("clang loop vectorize(assume_safety)") instead, by wrapping all of the different compiler vectorization pragmas in preprocessor logic.  I similarly recommend _Pragma("GCC ivdep") for GCC and _Pragma("vector always") for ICC.  This requires O(n_compilers) effort instead of O(1), but orthogonality is worth it.
 
While OpenMP is vendor/compiler-agnostic, users should not be required to use -fopenmp or similar to enable vectorization from PSTL, nor should the compiler enable any OpenMP pragma by default.  I know of cases where merely using the -fopenmp flag alters code generation in a performance-visible manner, and enabling the OpenMP "simd" pragma by default may surprise some users, particularly if no other OpenMP pragmas are enabled by default.

Best,
 
Jeff
(who works for Intel but not on any software products and has been a heavy user of Intel PSTL since it was released, if anyone is keeping track of conflicts-of-interest)




