[OT?] real-world interest of the polly optimiser

[OT?] real-world interest of the polly optimiser

René J.V. Bertin via cfe-dev
Hi,

Apologies if this isn't the best place. I've been looking for some information (understandable by the average user) about the real-world benefits of the polly optimiser, but have found only either very broad and vague claims or specialist research papers.

What I'd like to get an idea of is what benefits Polly brings, under what conditions, at what cost, and how it is enabled (are any special compiler options needed?).

Also, given I'm installing clang via MacPorts: does clang pick up Polly's presence automatically after I add the libpolly binary (i.e. port:llvm with the +polly install variant) or do I need to rebuild clang too?

Thanks,
René

Re: [OT?] real-world interest of the polly optimiser

Michael Kruse via cfe-dev
Hi René,

2017-05-19 10:38 GMT+02:00 René J.V. Bertin via cfe-dev
<[hidden email]>:
> Hi,
>
> Apologies if this isn't the best place.

Polly has its own mailing list here:
https://groups.google.com/forum/#!forum/polly-dev
[hidden email]


> I've been looking for some information (understandable by the average user) about the real-world benefits of the polly optimiser, but have found only either very broad and vague claims or specialist research papers.

As a researcher, I can speak to the research we are doing. We
currently have a paper under review about optimizing gemm in which we
reach 85% of the performance of a vendor-provided BLAS implementation,
which is 20x the speed of the program compiled by clang without Polly.

We know Samsung, Qualcomm and Xilinx are using Polly on a regular basis.

Polly can automatically generate OpenMP and CUDA code. The benefits
depend a lot on what you use it for, in particular on whether your
code consists of for-loop nests over dense arrays. In other cases you
only get increased compilation time.
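
To make "for-loop nests over dense arrays" concrete, the kind of
kernel Polly's polyhedral model is designed for looks roughly like the
following sketch (the file name and the -O3 level are illustrative;
only the "-mllvm -polly" switch mentioned below is Polly-specific):

  /* gemm.c - a naive dense matrix multiplication. This is the classic
   * example of a static control part (SCoP): affine loop bounds,
   * affine array subscripts, and no side-effecting calls in the loop
   * body, which is roughly what Polly's polyhedral analysis needs. */
  #define N 1024

  void gemm(double A[N][N], double B[N][N], double C[N][N])
  {
      for (int i = 0; i < N; ++i)
          for (int j = 0; j < N; ++j) {
              double sum = 0.0;
              for (int k = 0; k < N; ++k)
                  sum += A[i][k] * B[k][j];
              C[i][j] = sum;
          }
  }

  /* Compile with e.g.:  clang -O3 -mllvm -polly -c gemm.c
   * Code dominated by pointer chasing, irregular control flow or
   * inline assembly falls outside this model and only pays the extra
   * compile time. */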


> What I'd like to get an idea of is what benefits Polly brings, under what conditions, at what cost, and how it is enabled (are any special compiler options needed?).
>
> Also, given I'm installing clang via MacPorts: does clang pick up Polly's presence automatically after I add the libpolly binary (i.e. port:llvm with the +polly install variant) or do I need to rebuild clang too?

I don't have a Mac, so I don't know how it works there; I can only
explain how to do it from source:

Check out the Polly source into LLVM's tools directory, then recompile
opt and clang. Add "-mllvm -polly" to the clang command line to enable
Polly.
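
For concreteness, those steps look roughly like this (a sketch only:
the clone URL, directory layout and build targets are assumptions
about a typical pre-monorepo checkout, so adapt them to your setup):

  # add Polly next to clang inside an existing LLVM source tree
  cd llvm/tools
  git clone http://llvm.org/git/polly.git
  cd ../..

  # reconfigure and rebuild; opt and clang pick Polly up when it is
  # present in the tools directory
  mkdir -p build && cd build
  cmake -DCMAKE_BUILD_TYPE=Release ../llvm
  make -j4 clang opt

  # Polly is off by default; enable it explicitly when compiling
  clang -O3 -mllvm -polly -c file.c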

Since Polly is currently a research project, I would not expect a
sudden improvement in execution time. Performance-critical "real-world"
code is often already optimized manually, simply because general-purpose
compilers do not optimize code aggressively enough on their own. Many
such manual optimizations are incompatible with Polly, e.g. parts
written in assembler.


Michael

Re: [OT?] real-world interest of the polly optimiser

C Bergström via cfe-dev


On Mon, May 22, 2017 at 10:08 PM, Michael Kruse via cfe-dev <[hidden email]> wrote:
> As a researcher, I can speak to the research we are doing. We
> currently have a paper under review about optimizing gemm in which we
> reach 85% of the performance of a vendor-provided BLAS implementation,
> which is 20x the speed of the program compiled by clang without Polly.

Sorry, but please don't:
1) provide numbers when comparing against a weak baseline.

And please do:
2) if you have a valid performance comparison or claim, provide enough
information so that a complete picture is presented.

Your statement just came across as saying either that LLVM's loop
optimizer is so bad that Polly is required, or that Polly is hitting a
corner case which happens to be its sweet spot.
------------
Also, I'd kindly ask that if you do have such specific examples of clang
doing a rather poor job on performance, please file a bug report and
include as much detail as you have time for. It's unlikely that Polly is
doing anything that a traditional loop optimizer can't do, or at least
hasn't attempted.




Re: [OT?] real-world interest of the polly optimiser

Philip via cfe-dev
In reply to this post by René J.V. Bertin via cfe-dev
To my knowledge, polly is not in use in any production setting.  It is
used for research purposes, but I don't believe it has been
productionized at this time.

Philip

On 05/19/2017 01:38 AM, René J.V. Bertin via cfe-dev wrote:

> Hi,
>
> Apologies if this isn't the best place. I've been looking for some information (understandable by the average user) about the real-world benefits of the polly optimiser, but have found only either very broad and vague claims or specialist research papers.
>
> What I'd like to get an idea of is what benefits Polly brings, under what conditions, at what cost, and how it is enabled (are any special compiler options needed?).
>
> Also, given I'm installing clang via MacPorts: does clang pick up Polly's presence automatically after I add the libpolly binary (i.e. port:llvm with the +polly install variant) or do I need to rebuild clang too?
>
> Thanks,
> René

Re: [OT?] real-world interest of the polly optimiser

Krzysztof Parzyszek via cfe-dev
In reply to this post by C Bergström via cfe-dev
On 5/22/2017 9:17 AM, C Bergström via cfe-dev wrote:
> Your statement just came across as saying either that LLVM's loop
> optimizer is so bad that Polly is required

In applications like linear algebra a lot of performance comes from
optimizing loop nests for cache locality. Doing things like loop
interchange, loop nest distribution, unroll and jam, etc. helps a lot
with it, and to the best of my knowledge LLVM does none of that. There
is some basic support for loop fusion and distribution, but I don't
think it works on the nest level. Given how important that is in
high-performance computing, the 20x difference sounds believable.
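
To illustrate just one of those transformations, loop interchange,
with a hypothetical kernel (not taken from any benchmark discussed
here): in row-major C, making the innermost loop run over the last
array index turns a strided traversal into a contiguous one, and a
polyhedral optimizer can prove from the dependence information that
the swap is legal and apply it automatically.

  #define N 4096

  /* Before: the inner loop walks down a column, so successive accesses
   * are N doubles apart and almost every one misses the cache. */
  void scale_cols_first(double A[N][N], double s)
  {
      for (int j = 0; j < N; ++j)
          for (int i = 0; i < N; ++i)
              A[i][j] *= s;
  }

  /* After interchange: the inner loop walks along a row, so accesses
   * are contiguous and the hardware prefetcher can keep up. */
  void scale_rows_first(double A[N][N], double s)
  {
      for (int i = 0; i < N; ++i)
          for (int j = 0; j < N; ++j)
              A[i][j] *= s;
  }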

I don't know what the general plan is: whether there is any interest in
loop nest optimizations in LLVM itself, or whether this task will be
delegated to Polly at some point. In any case, without those
optimizations there is only so much that can be done.

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Re: [OT?] real-world interest of the polly optimiser

Michael Kruse via cfe-dev
2017-06-01 21:52 GMT+02:00 Krzysztof Parzyszek via cfe-dev
<[hidden email]>:
> In applications like linear algebra a lot of performance comes from
> optimizing loop nests for cache locality. Doing things like loop
> interchange, loop nest distribution, unroll and jam, etc. helps a lot with
> it, and to the best of my knowledge LLVM does none of that. There is some
> basic support for loop fusion and distribution, but I don't think it works
> on the nest level. Given how important that is in high-performance
> computing, the 20x difference sounds believable.

We implemented it recently, but only for gemm-like kernels, basically
the techniques from
http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

We sent a paper for review to ACM TACO. As it is under review, and I
am not the main author, I think I cannot just share it publicly (yet).

Michael