[libc++] Working on the parallel STL algorithms

[libc++] Working on the parallel STL algorithms

Erik Pilkington via cfe-dev
Hello all,

I’m an intern at Apple this summer, and my project is to work on the C++17 parallel algorithms library. In a recent thread, Hal mentioned [0] that he had spoken with Marshall about the design of the library. I’m currently working on a prototype that is pretty similar to what was committed into lib/Support in r302748. I’m wondering what you folks talked about, and whether any consensus with respect to design has been reached. Even if not, I’d love to hear about what thoughts were exchanged, or any reference material that could bring me up to speed!

It might also be worth having a conversation about how to divide up the work to be done here. Were any of you intending to start working on the library?



Re: [libc++] Working on the parallel STL algorithms

Hal Finkel via cfe-dev

Hi, Erik,

That's great!

Gor, Marshall, and I discussed this after some past committee meeting. We wanted to architect the implementation so that we could provide different underlying concurrency mechanisms, including:

   a. A self-contained thread-pool-based implementation using a work-stealing scheme.

   b. An implementation that wraps Grand Central Dispatch (for Mac and any other platforms providing libdispatch).

   c. An implementation that uses OpenMP.

The OpenMP-based provider is essential for interoperability in applications using OpenMP, GCD seemed most natural on Mac, and the self-contained implementation is available for anyone else. CUDA+UVM should be possible in the future as well. All of these mechanisms can export an interface for scheduling tasks, and we specifically discussed the interface having a "bulk enqueue" capability. This is essential for efficiently mapping onto several potential underlying technologies (e.g. OpenMP). You don't want to generate individual tasks for each item, and for underlying technologies that can do intelligent things, you want the underlying technology to make the coarsening decisions.
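
For a sense of the shape we had in mind, here is a minimal sketch (hypothetical names only; nothing here is an agreed-upon interface) of a "bulk enqueue" provider: it is handed the whole index range at once and makes the coarsening decisions itself.

  // Hypothetical provider interface (a sketch, not a committed design).
  #include <cstddef>
  #include <functional>

  struct __bulk_executor {
    // Invoke __work(__first, __last) on disjoint chunks covering [0, __n),
    // in parallel if possible, and return once every chunk has completed.
    // The provider (thread pool, GCD, OpenMP, ...) picks the chunk sizes.
    virtual void __bulk_enqueue(
        std::size_t __n,
        std::function<void(std::size_t, std::size_t)> __work) = 0;
    virtual ~__bulk_executor() = default;
  };

  // Trivial serial provider, useful as a fallback and for testing.
  struct __serial_executor final : __bulk_executor {
    void __bulk_enqueue(
        std::size_t __n,
        std::function<void(std::size_t, std::size_t)> __work) override {
      __work(0, __n); // one chunk: the entire range
    }
  };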

par_unseq should map, via pragmas (either Clang-specific and/or OpenMP), to enabling vectorization. As a longer-term exercise, we might need additional compiler extensions to make this work well for some algorithms (such as sorting), dealing with exceptions, etc.
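
For example, a rough sketch (assumed shape only) of how an unsequenced backend loop might carry such a pragma; both "#pragma omp simd" and the Clang-specific loop hint mentioned in the comment exist today:

  #include <cstddef>

  template <class _Tp, class _UnaryOp>
  void __unseq_transform(const _Tp* __first, const _Tp* __last,
                         _Tp* __result, _UnaryOp __op) {
    const std::ptrdiff_t __n = __last - __first;
    // With -fopenmp (or -fopenmp-simd) this requests vectorization without
    // introducing threads; without OpenMP, a Clang-specific hint such as
    //   #pragma clang loop vectorize(enable)
    // could be used instead.
  #pragma omp simd
    for (std::ptrdiff_t __i = 0; __i < __n; ++__i)
      __result[__i] = __op(__first[__i]);
  }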

 -Hal



-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [libc++] Working on the parallel STL algorithms

C Bergström via cfe-dev


On Tue, May 16, 2017 at 2:50 PM, Hal Finkel via cfe-dev <[hidden email]> wrote:

Hi, Erik,

That's great!

Gor, Marshall, and I discussed this after some past committee meeting. We wanted to architect the implementation so that we could provide different underlying concurrency mechanisms; including:

   a. A self-contained thread-pool-based implementation using a work-stealing scheme.

   b. An implementation that wraps Grand Central Dispatch (for Mac and any other platforms providing libdispatch).

   c. An implementation that uses OpenMP.


Sorry to butt in, but I'm kinda curious how these will be substantially different under the hood

"OpenMP" is a pretty vague term and I'm curious what that means in terms of actual directives used. All non-accelerator OpenMP implementations lower down to threading currently. (Even if you use tasks it still ends up being a thread)

GCD (libdispatch) is essentially a task-based execution model, but again on non-OSX platforms lowers to threads. (I doubt that GCD offers any performance benefit over native threads or the Intel OMP runtime on OSX.)

How would the above offer any benefit over a native thread pool? Would you be just duplicating code which is already working?
--------------
I'm no OMP advocate, but I'd find it significantly more sane to target the Intel OMP runtime API directly.
* Production ready
* Portable across CPU (Intel, ARM, Power8)
* Likely provides the interface needed for parallelism
* Single approach
* Already part of the llvm infrastructure without external dependencies.

I don't know how well the API will map to accelerators, but for something quick and easy it's likely to be the easiest.

Bryce I think even mentioned he had used it before with positive results?

In contrast the other approaches will loosely couple things to external dependencies and be more difficult to debug and support long term. It will introduce additional build dependencies which will likely add barriers to others contributing.

I'm not writing the code and just trying to offer another pragmatic point of view..



Re: [libc++] Working on the parallel STL algorithms

Hal Finkel via cfe-dev
On 05/16/2017 02:54 AM, C Bergström wrote:


On Tue, May 16, 2017 at 2:50 PM, Hal Finkel via cfe-dev <[hidden email]> wrote:

Hi, Erik,

That's great!

Gor, Marshall, and I discussed this after some past committee meeting. We wanted to architect the implementation so that we could provide different underlying concurrency mechanisms; including:

   a. A self-contained thread-pool-based implementation using a work-stealing scheme.

   b. An implementation that wraps Grand Central Dispatch (for Mac and any other platforms providing libdispatch).

   c. An implementation that uses OpenMP.


Sorry to butt in, but I'm kinda curious how these will be substantially different under the hood

No need to be sorry; this is a good question. I think that there are a few high-level goals here:

 1. Provide a solution that works for everybody

 2. Take advantage of compiler technology as appropriate

 3. Provide useful interoperability. In practice: don't oversubscribe the system.

The motivation for providing an implementation based on a libc++ thread pool is to satisfy (1). Your suggestion of using our OpenMP runtime's low-level API directly is a good one. Personally, I really like this idea. It does imply, however, that organizations that distribute libc++ will also end up distributing libomp. If libomp has matured (in the open-source sense) to the point where this is a suitable solution, then we should do this. As I recall, however, we still have at least several organizations that ship Clang/LLVM/libc++-based toolchains that don't ship libomp, and I don't know how generally comfortable people will be with this dependency.

That having been said, to point (2), using the OpenMP compiler directives is superior to calling the low-level API directly. OpenMP directives do translate into API calls, as you point out, but they also provide optimization hints to the compiler (e.g. about lack of loop-carried dependencies). Over the next couple of years, I expect to see a lot more in the compiler optimization capabilities around OpenMP (and perhaps other parallelism) directives (parallel-region fusion, etc.). OpenMP also provides a standard way to access many of the relevant vectorization hints, and taking advantage of this is useful for compiling with Clang and also other compilers.

Regarding why you'd use GCD on Mac, and similarly why it is important for many users to use OpenMP underneath: it is important, to the extent possible, to use the same underlying thread pool as other things in the application. This is to avoid over-subscription and other issues associated with conflicting threading runtimes. If parts of the application are already using GCD, then we probably want to do this too (or at least not compete with it). Otherwise, OpenMP's runtime is probably better ;)
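
To make the GCD case concrete, a rough sketch (assuming libdispatch is available; the wrapper name is made up): dispatch_apply_f takes the whole iteration space, blocks until every iteration has run, and lets GCD do the chunking on its process-wide thread pool.

  #include <cstddef>
  #include <dispatch/dispatch.h>

  template <class _Func>
  void __gcd_parallel_for(std::size_t __n, _Func __func) {
    // dispatch_apply_f is synchronous: it returns only after all __n
    // iterations have run on GCD's global concurrent queue.
    dispatch_apply_f(
        __n,
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
        &__func,
        [](void* __ctx, std::size_t __i) {
          (*static_cast<_Func*>(__ctx))(__i);
        });
  }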



"OpenMP" is a pretty vague term and I'm curious what that means in terms of actual directives used. All non-accelerator OpenMP implementations lower down to threading currently. (Even if you use tasks it still ends up being a thread)

I had in mind basic host-level OpenMP directives (i.e. OpenMP 3 style plus simd directives for vectorization, although using taskloop is a good thing to consider as well). I don't think we can transparently use OpenMP accelerator directives in their current state because we can't identify the memory dependencies. When OpenMP grows some way to deal with accelerators in a global address space (e.g. the new NVIDIA UVM technology), then we should be able to use that too. CUDA+UVM will be an option in the shorter term here as well, however. Given that Clang can function as a CUDA compiler, this is definitely worth exploring.
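
For what it's worth, a rough sketch of the taskloop-style variant (OpenMP 4.5; the helper name is made up). The runtime picks the task granularity, which tends to compose better when the algorithm is called from code that is already parallel:

  #include <cstddef>

  template <class _Func>
  void __omp_taskloop_for(std::ptrdiff_t __n, _Func __func) {
  #pragma omp parallel
  #pragma omp single
    {
      // The runtime decides how many iterations each task receives; a
      // grainsize()/num_tasks() clause could be added to steer that.
  #pragma omp taskloop
      for (std::ptrdiff_t __i = 0; __i < __n; ++__i)
        __func(__i);
    }
  }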

Thanks again,
Hal




-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [libc++] Working on the parallel STL algorithms

C Bergström via cfe-dev


On Wed, May 17, 2017 at 12:20 AM, Hal Finkel <[hidden email]> wrote:
On 05/16/2017 02:54 AM, C Bergström wrote:


On Tue, May 16, 2017 at 2:50 PM, Hal Finkel via cfe-dev <[hidden email]> wrote:

Hi, Erik,

That's great!

Gor, Marshall, and I discussed this after some past committee meeting. We wanted to architect the implementation so that we could provide different underlying concurrency mechanisms; including:

   a. A self-contained thread-pool-based implementation using a work-stealing scheme.

   b. An implementation that wraps Grand Central Dispatch (for Mac and any other platforms providing libdispatch).

   c. An implementation that uses OpenMP.


Sorry to butt in, but I'm kinda curious how these will be substantially different under the hood

No need to be sorry; this is a good question. I think that there are a few high-level goals here:

 1. Provide a solution that works for everybody

 2. Take advantage of compiler technology as appropriate

 3. Provide useful interoperability. In practice: don't oversubscribe the system.

The motivation for providing an implementation based on a libc++ thread pool is to satisfy (1). Your suggestion of using our OpenMP runtime's low-level API directly is a good one. Personally, I really like this idea. It does imply, however, that organizations that distribute libc++ will also end up distributing libomp. If libomp has matured (in the open-source sense) to the point where this is a suitable solution, then we should do this. As I recall, however, we still have at least several organizations that ship Clang/LLVM/libc++-based toolchains that don't ship libomp, and I don't know how generally comfortable people will be with this dependency.

If "people" aren't comfortable with llvm-openmp then kick it out as a project. I use it and I know other projects that use it just fine. I can maybe claim the title of OpenMP hater and yet I don't know any legitimate reason against having this as a dependency. It's a portable parallel runtime that exposes an API and works.. I hope someone does speak up about specific concerns if they exist.
 

That having been said, to point (2), using the OpenMP compiler directives is superior to calling the low-level API directly. OpenMP directives do translate into API calls, as you point out, but they also provide optimization hints to the compiler (e.g. about lack of loop-carried dependencies). Over the next couple of years, I expect to see a lot more in the compiler optimization capabilities around OpenMP (and perhaps other parallelism) directives (parallel-region fusion, etc.). OpenMP also provides a standard way to access many of the relevant vectorization hints, and taking advantage of this is useful for compiling with Clang and also other compilers.

If projects can't even ship llvm-openmp runtime then I have a very strong concern with bootstrap dependencies which may start relying on external tools.

Further, I'm not sure I understand your point here. The directives wouldn't be in the end user code, but would be in the STL implementation side. Wouldn't that implementation stuff be fixed and an abstract layer exposed to the end user? It almost sounds like you're expressing the benefits of OMP here and not the parallel STL side. (Hmm.. in the distance I hear.. "premature optimization is the root of all evil")

Once llvm OpenMP can do things like handle nested parallelism and a few more advanced things properly all this might be fun (We can go down a big list if anyone wants to digress)
 

Regarding why you'd use GCD on Mac, and similarly why it is important for many users to use OpenMP underneath: it is important, to the extent possible, to use the same underlying thread pool as other things in the application. This is to avoid over-subscription and other issues associated with conflicting threading runtimes. If parts of the application are already using GCD, then we probably want to do this too (or at least not compete with it). Otherwise, OpenMP's runtime is probably better ;)

Again this detail isn't visible to the end user? We pick an implementation that makes sense. If other applications use GCD and we use OpenMP, and multiple thread-heavy applications are running, over-subscription would be a kernel issue and not userland. I don't see how you can always avoid that situation and creating two implementations to try kinda seems funny. btw GCD is a marketing term and libdispatch is really what I'm talking about here. It's been quite a while since I worked hands-on with it, but I wonder how much the API overlaps with similar interfaces to llvm-openmp. If the interfaces are similar and the "cost" in terms of complexity is low, who cares, but I don't remember that being the case. (side note: I worked on an older version of libdispatch and ported it to Solaris. I also played around and benchmarked OMP tasks lowering directly down to libdispatch calls across multiple platforms. At the time our runtime always beat it in performance. Maybe newer versions of libdispatch are better)

I'm not trying to be combative, but your points just don't make sense....... (I take the blame and must be missing something)
-----------------
All this aside - I'm happy to help if needed - GPU (NVIDIA or AMD) and/or llvm-openmp direct runtime API implementation. I've been involved with sorta similar projects (C++AMP) and based on that experience may be able to help avoid some gotchas.



Re: [libc++] Working on the parallel STL algorithms

Hal Finkel via cfe-dev


On 05/16/2017 11:57 AM, C Bergström wrote:


On Wed, May 17, 2017 at 12:20 AM, Hal Finkel <[hidden email]> wrote:
On 05/16/2017 02:54 AM, C Bergström wrote:


On Tue, May 16, 2017 at 2:50 PM, Hal Finkel via cfe-dev <[hidden email]> wrote:

Hi, Erik,

That's great!

Gor, Marshall, and I discussed this after some past committee meeting. We wanted to architect the implementation so that we could provide different underlying concurrency mechanisms; including:

   a. A self-contained thread-pool-based implementation using a work-stealing scheme.

   b. An implementation that wraps Grand Central Dispatch (for Mac and any other platforms providing libdispatch).

   c. An implementation that uses OpenMP.


Sorry to butt in, but I'm kinda curious how these will be substantially different under the hood

No need to be sorry; this is a good question. I think that there are a few high-level goals here:

 1. Provide a solution that works for everybody

 2. Take advantage of compiler technology as appropriate

 3. Provide useful interoperability. In practice: don't oversubscribe the system.

The motivation for providing an implementation based on a libc++ thread pool is to satisfy (1). Your suggestion of using our OpenMP runtime's low-level API directly is a good one. Personally, I really like this idea. It does imply, however, that organizations that distribute libc++ will also end up distributing libomp. If libomp has matured (in the open-source sense) to the point where this is a suitable solution, then we should do this. As I recall, however, we still have at least several organizations that ship Clang/LLVM/libc++-based toolchains that don't ship libomp, and I don't know how generally comfortable people will be with this dependency.

If "people" aren't comfortable with llvm-openmp then kick it out as a project. I use it and I know other projects that use it just fine. I can maybe claim the title of OpenMP hater and yet I don't know any legitimate reason against having this as a dependency. It's a portable parallel runtime that exposes an API and works.. I hope someone does speak up about specific concerns if they exist.
 

That having been said, to point (2), using the OpenMP compiler directives is superior to calling the low-level API directly. OpenMP directives do translate into API calls, as you point out, but they also provide optimization hints to the compiler (e.g. about lack of loop-carried dependencies). Over the next couple of years, I expect to see a lot more in the compiler optimization capabilities around OpenMP (and perhaps other parallelism) directives (parallel-region fusion, etc.). OpenMP also provides a standard way to access many of the relevant vectorization hints, and taking advantage of this is useful for compiling with Clang and also other compilers.

If projects can't even ship llvm-openmp runtime then I have a very strong concern with bootstrap dependencies which may start relying on external tools.

Further, I'm not sure I understand your point here. The directives wouldn't be in the end user code, but would be in the STL implementation side. Wouldn't that implementation stuff be fixed and an abstract layer exposed to the end user? It almost sounds like you're expressing the benefits of OMP here and not the parallel STL side. (Hmm.. in the distance I hear.. "premature optimization is the root of all evil")

That's correct. The OpenMP pragmas would be an implementation detail. However, we'd design this so that the lambda that gets passed into the algorithm can be inlined into the code that has the compiler directives, thus reaping the benefit of OpenMP's compiler integration.
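
Roughly like this (hypothetical internal names, only to illustrate the layering): the user's functor is a template parameter of the backend loop, so its body can be inlined into the very loop that carries the directive.

  #include <cstddef>

  // Backend detail: the loop that carries the directive.
  template <class _Func>
  void __omp_parallel_for(std::ptrdiff_t __n, _Func __func) {
  #pragma omp parallel for
    for (std::ptrdiff_t __i = 0; __i < __n; ++__i)
      __func(__i); // the user's lambda can inline here
  }

  // Algorithm layer: no pragmas, just forwards into the backend.
  template <class _RandomIt, class _UnaryFunc>
  void __for_each_par(_RandomIt __first, _RandomIt __last, _UnaryFunc __f) {
    __omp_parallel_for(__last - __first,
                       [&](std::ptrdiff_t __i) { __f(__first[__i]); });
  }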


Once llvm OpenMP can do things like handle nested parallelism and a few more advanced things properly all this might be fun (We can go down a big list if anyone wants to digress)

This is why I said we might consider using taskloop ;) -- There are other ways of handling nesting as well (colleagues of mine work on one: http://www.bolt-omp.org/), but we should probably have a separate thread on OpenMP and nesting to discuss this aspect of things.

 

Regarding why you'd use GCD on Mac, and similarly why it is important for many users to use OpenMP underneath: it is important, to the extent possible, to use the same underlying thread pool as other things in the application. This is to avoid over-subscription and other issues associated with conflicting threading runtimes. If parts of the application are already using GCD, then we probably want to do this too (or at least not compete with it). Otherwise, OpenMP's runtime is probably better ;)

Again this detail isn't visible to the end user? We pick an implementation that makes sense. If other applications use GCD and we use OpenMP, and multiple thread-heavy applications are running, over-subscription would be a kernel issue and not userland. I don't see how you can always avoid that situation and creating two implementations to try kinda seems funny. btw GCD is a marketing term and libdispatch is really what I'm talking about here. It's been quite a while since I worked hands-on with it, but I wonder how much the API overlaps with similar interfaces to llvm-openmp. If the interfaces are similar and the "cost" in terms of complexity is low, who cares, but I don't remember that being the case. (side note: I worked on an older version of libdispatch and ported it to Solaris. I also played around and benchmarked OMP tasks lowering directly down to libdispatch calls across multiple platforms. At the time our runtime always beat it in performance. Maybe newer versions of libdispatch are better)

The detail is invisible to the user at the source-code level. Obviously they might notice if we're oversubscribing the system. Yes, on many systems the kernel can manage oversubscription, but that does not mean it will perform well. As I'm sure you understand, because of cache locality and many other effects, just running a bunch of threads and letting the kernel switch them is often much slower than running a smaller number of threads and having them pull from a task queue. There are exceptions worth mentioning, however, such as when the threads are mostly themselves blocked on I/O.


I'm not trying to be combative, but your points just don't make sense....... (I take the blame and must be missing something)
-----------------
All this aside - I'm happy to help if needed - GPU (NVIDIA or AMD) and/or llvm-openmp direct runtime API implementation. I've been involved with sorta similar projects (C++AMP) and based on that experience may be able to help avoid some gotchas.

Sounds great.

 -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [libc++] Working on the parallel STL algorithms

Bryce Adelstein Lelbach via cfe-dev
On Mon, May 15, 2017 at 9:54 AM, Erik Pilkington
<[hidden email]> wrote:

> Hello all,
>
> I’m an intern at Apple this summer, and my project is to work on the C++17
> parallel algorithms library. In a recent thread, Hal mentioned [0] that he
> had spoken with Marshall about the design of the library. I’m currently
> working on a prototype that is pretty similar to what was committed into
> lib/Support in r302748. I’m wondering what you folks talked about, and if
> any consensus with respect to design has been reached. Even if not, I’d love to hear
> about what thoughts were exchanged, or any reference material that could
> bring me up to speed!

Awesome!

You may want to read my (long) post in the other thread if you haven't
already - I described some of my thoughts about the design of C++17
parallel algorithm runtimes.

http://llvm.1065342.n5.nabble.com/llvm-dev-PSA-Parallel-STL-algorithms-available-in-LLVM-td108822.html#a108873


> It might also be worth having a conversation about how to divide up the work
> to be done here. Were any of you intending to start working on the library?

Yes, I am hoping to contribute to this effort starting sometime in
June. I need to speak with Marshall to figure out how I can be most
effective here.

I'd be happy to collaborate. Some of the other HPX developers may be
interested in contributing as well.

--
Bryce Adelstein Lelbach aka wash
Lawrence Berkeley National Laboratory
ISO C++ Committee Member
CppCon and C++Now Program Chair

Compiler ICE Hunter
--

Re: [libc++] Working on the parallel STL algorithms

Bryce Adelstein Lelbach via cfe-dev
Some thoughts:

* I pretty much agree with Hal - his vision for what we should do here
closely matches mine.
* I agree with Chris that we may want to target the Intel OMP Runtime
interface - but Hal is right that this may be sub-optimal.
Additionally, it may be quite tricky to do correctly.
* I am a little concerned that it will be hard to de-couple the
"parallel engine" and "algorithms" components of the library if we're
using OpenMP pragmas. I am also a little concerned about using pragmas
because they don't always interact well with C++ abstractions.
* I want to make sure we end up with one generic "algorithms" core and
O(3) backends that libc++ supports/maintains, and an interface that
3rd parties can implement to plug their "parallel engines" into
libc++'s algorithms. I'm a little worried that the pragma-based OpenMP
one might be tightly coupled (although 3rd parties could plug in by
implementing the OpenMP runtime API, that'd be sub-optimal).


--
Bryce Adelstein Lelbach aka wash
Lawrence Berkeley National Laboratory
ISO C++ Committee Member
CppCon and C++Now Program Chair

Compiler ICE Hunter
--

Re: [libc++] Working on the parallel STL algorithms

Hal Finkel via cfe-dev

On 05/17/2017 09:50 AM, Bryce Lelbach wrote:

> Some thoughts:
>
> * I pretty much agree with Hal - his vision for what we should do here
> closely matches mine.
> * I agree with Chris that we may want to target the Intel OMP Runtime
> interface - but Hal is right that this may be sub-optimal.
> Additionally, it may be quite tricky to do correctly.
> * I am a little concerned that it will be hard to de-couple the
> "parallel engine" and "algorithms" components of the library if we're
> using OpenMP pragmas. I am also a little concerned about using pragmas
> because they don't always interact well with C++ abstractions.
> * I want to make sure we end up with one generic "algorithms" core and
> O(3) backends that libc++ supports/maintains, and an interface that
> 3rd parties can implement to plug their "parallel engines" into
> libc++'s algorithms. I'm a little worried that the pragma-based OpenMP
> one might be tightly coupled (although 3rd parties could plug in by
> implementing the OpenMP runtime API, that'd be sub-optimal).

I don't see why you have that concern. We have plenty of existing
parallel abstraction libraries (e.g. Kokkos) that have multiple
backends, including OpenMP, and don't have this problem. You can
certainly have a set of templates that implement bulk task dispatch,
etc. in terms of OpenMP and a set of templates that implement that
interface in terms of other things. The key thing here is to reuse the
thread pool and get the added compiler optimizations that the pragmas
can imply (when OpenMP is available). To be clear, I'm not proposing
that we put OpenMP pragmas on the algorithm implementations themselves,
only that we use them for an underlying parallelism-execution provider.
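
For instance, a rough sketch of that shape (hypothetical names, shown only to illustrate the decoupling): the algorithm core is written once against a small "bulk dispatch" interface, and the OpenMP-pragma backend is just one interchangeable implementation of it.

  #include <cstddef>
  #include <thread>
  #include <vector>

  struct __openmp_backend {
    template <class _Func>
    static void __bulk_for(std::ptrdiff_t __n, _Func __func) {
  #pragma omp parallel for
      for (std::ptrdiff_t __i = 0; __i < __n; ++__i)
        __func(__i);
    }
  };

  // Stand-in for any third-party "parallel engine" plugged in instead;
  // deliberately naive (spawns threads per call rather than pooling them).
  struct __thread_backend {
    template <class _Func>
    static void __bulk_for(std::ptrdiff_t __n, _Func __func) {
      unsigned __p = std::thread::hardware_concurrency();
      if (__p == 0) __p = 1;
      std::vector<std::thread> __workers;
      for (unsigned __t = 0; __t < __p; ++__t)
        __workers.emplace_back([=] {
          for (std::ptrdiff_t __i = __t; __i < __n; __i += __p)
            __func(__i);
        });
      for (auto& __w : __workers) __w.join();
    }
  };

  // Written once; knows nothing about how the backend runs the chunks.
  template <class _Backend, class _RandomIt, class _UnaryFunc>
  void __for_each_impl(_RandomIt __first, _RandomIt __last, _UnaryFunc __f) {
    _Backend::__bulk_for(__last - __first,
                         [&](std::ptrdiff_t __i) { __f(__first[__i]); });
  }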

What is true, AFAIK, is that the abstractions don't play well with
OpenMP pragmas that take variable names (reductions, for example) in a
general way. Thus, for example, it is not straightforward/possible to
have a variadic template that takes a list of variables on which to
perform a reduction and transform that into a loop with a pragma with
the right list of variables in a reduction clause (in general). You can
handle specific cases (e.g. one reduction variable for specific
operations). For the general case, we need to do our own thing using
underlying primitives (which is what we'd need to do for the non-OpenMP
implementations as well). I don't even know that we run into this
problem with the current set of standardized algorithms.
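
To make the "specific cases" point concrete, a rough sketch (hypothetical name; arithmetic element types assumed) of the one case that does map cleanly, a single accumulator with a known operation (roughly the std::reduce-with-std::plus situation). An arbitrary user-supplied binary operation has no spelling in the clause, so the general case needs hand-rolled per-thread partial results instead.

  #include <cstddef>

  template <class _Tp>
  _Tp __reduce_plus(const _Tp* __first, const _Tp* __last, _Tp __init) {
    _Tp __sum = __init;
    const std::ptrdiff_t __n = __last - __first;
  #pragma omp parallel for reduction(+ : __sum)
    for (std::ptrdiff_t __i = 0; __i < __n; ++__i)
      __sum += __first[__i];
    return __sum;
  }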

  -Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [libc++] Working on the parallel STL algorithms

Michael Spencer via cfe-dev
On Tue, May 16, 2017 at 12:50 AM, Hal Finkel via cfe-dev <[hidden email]> wrote:

Hi, Erik,

That's great!

Gor, Marshall, and I discussed this after some past committee meeting. We wanted to architect the implementation so that we could provide different underlying concurrency mechanisms; including:

   a. A self-contained thread-pool-based implementation using a work-stealing scheme.

   b. An implementation that wraps Grand Central Dispatch (for Mac and any other platforms providing libdispatch).

   c. An implementation that uses OpenMP.

The OpenMP-based provider is essential for interoperability in applications using OpenMP, GCD seemed most natural on Mac, and the self-contained implementation is available for anyone else. CUDA+UVM should be possible in the future as well. All of these mechanisms can export an interface for scheduling tasks, and we specifically discussed the interface having a "bulk enqueue" capability. This is essential for efficiently mapping onto several potential underlying technologies (e.g. OpenMP). You don't want to generate individual tasks for each item, and for underlying technologies that can do intelligent things, you want the underlying technology to make the coarsening decisions.

par_unseq should map, via pragmas (either Clang-specific and/or OpenMP), to enabling vectorization. As a longer-term exercise, we might need additional compiler extensions to make this work well for some algorithms (such as sorting), dealing with exceptions, etc.

 -Hal


Don't forget about ConcRT on Windows. 

- Michael Spencer
 


Re: [libc++] Working on the parallel STL algorithms

Bryce Adelstein Lelbach via cfe-dev
On Wed, May 17, 2017 at 9:13 AM, Hal Finkel <[hidden email]> wrote:
> I don't see why you have that concern. We have plenty of existing parallel
> abstraction libraries (e.g. Kokkos) that have multiple backends, including
> OpenMP, and don't have this problem. You can certainly have a set of
> templates that implement bulk task dispatch, etc. in terms of OpenMP and a
> set of templates that implement that interface in terms of other things. The
> key thing here is to reuse the thread pool and get the added compiler
> optimizations that the pragmas can imply (when OpenMP is available).

Upon reflection, I think you're correct. I probably let some of my
prior experience with OMP (pragmas sprinkled all over the place in
applications, historical avoidance of nested OMP and composition, and
some of the rough edges that you get when using OMP on iterator loops
with certain compilers) bias my thinking a bit. With the right
approach, like the design you suggested, makes sense to me, and as you
mentioned there's plenty of prior art.

You make an excellent point about leveraging existing OpenMP compiler
optimizations (this is actually one of the reasons that we wrote an
OpenMP compatibility layer for HPX!).

> To be clear, I'm not proposing that we put OpenMP pragmas on the algorithm
> implementations themselves, only use it for an underlying
> parallelism-execution provider.

+1 - I think my concern was mostly to make sure we do this.

> What is true, AFAIK, is that the abstractions don't play well with OpenMP
> pragmas that take variable names (reductions, for example) in a general way.
> Thus, for example, it is not straightforward/possible to have a variadic
> template that takes a list of variables on which to perform a reduction and
> transform that into a loop with a pragma with the right list of variables in
> a reduction clause (in general). You can handle specific cases (e.g. one
> reduction variable for specific operations).

I think this is the only case we have in the current algorithms.

> I don't even know that we run into this problem with the current set of standardized algorithms.

Yep, I believe this is correct.



--
Bryce Adelstein Lelbach aka wash
Lawrence Berkeley National Laboratory
ISO C++ Committee Member
CppCon and C++Now Program Chair

Compiler ICE Hunter
--