RFC: User-directed code transformations with #pragma clang transform

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

RFC: User-directed code transformations with #pragma clang transform

suyash singh via cfe-dev
Dear Clang community,

I am working on adding more powerful compiler optimization directives
that allows the programmer to steer optimization decisions in mid-end.
Most recently presented this work at the LLVM Dev Meeting [2] and I am
looking for reviewers https://reviews.llvm.org/D69088.

------------------------------------------------------------------------------------------------------------
The motivation

is the same as for "#pragma clang loop" and "#pragma omp simd":

 * Separate semantics and optimizations

 * Better maintainability by lowering the code complexity

 * Only switch the pragmas (e.g. using the preprocessor or #pragma omp
metadirective) to optimize the same code base for different targets

 * Easily try out different optimizations without needing to rewrite
the source base

 * Or use an autotuner which inserts #pragma clang transform
directives to find the transformation sequence that has the best
performance.

Such directives already exist in the form of "#pragma clang loop" (aka
loop hints) and "#pragma omp simd", but have several problems:

 * OpenMP is currently limited to Vectorization

 * Loop hints and OpenMP are incompatible to each other

 * Multiple loops hints can be applied to the same loop, but the
transformation order cannot be modified

 * There are no front-end diagnostics when a transformation does not
make sense, e.g. unroll-and-jam without an inner loop

 * Hints are emitted in form of metadata (MDNode) that can be dropped
by mid-end optimizers

 * Loop hints have been added ad-hoc and have inconsistent syntax and
semantics (e.g. whether and how vectorize_width/unroll_count
implicitly enables vectorization/unrolling)

I am also currently working in the OpenMP language committee to add
more loop transformations with defined transformation order that are
compatible with the other OpenMP constructs. So far, tiling has been
added to the TR8 [0] and I hope to also add unrolling to OpenMP 5.1,
with more transformations to follow in OpenMP 6.0. For more details on
the motivation, also see references [2-5].

------------------------------------------------------------------------------------------------------------
The suggested syntax is the following:

    #pragma clang transform unroll [full/partial(n)]
    #pragma clang transform unrollandjam [full/partial(n)]
    #pragma clang transform distribute
    #pragma clang transform vectorize [width(n)]
    #pragma clang transform interleave [factor(n)]

The selection is currently limited to the passes LLVM currently
supports. I am working on more transformations that currently are only
picked-up by Polly. The largest difference to loop hints is that it
allows to specify in which order the transformations are applied,
which is ignored by clang's current LoopHint attribute. That is, the
following for reverses the loop, then unrolls it.

    #pragma clang transform unroll partial(2)
    #pragma clang transform reverse
    for (int i = 0; i < 128; i+=1)
      body(i);

Whereas the following first unrolls the loop, then reverses it:

    #pragma clang transform reverse
    #pragma clang transform unroll partial(2)
    for (int i = 0; i < 128; i+=1)
      body(i);

Observe that the results are different: The former will execute the
loop body in the order

    body(127); body(126); body(125); body(124); ...

whereas the latter should execute

    body(126); body(127); body(124); body(125); ...

If during optimizing performance the programmer reaches the point for
force the compiler to use specific transformations, they will not want
to leave the choice in which order to apply them to the compiler as
their performance characteristics can be quite different.

Furthermore, I intend to implement assigning identifiers to loop to
reference them in followup transformations (e.g. tile a loop,
parallelize the generated outer loop and vectorize the inner), make
then combinable with OpenMP directives, and add options to control
whether to only override the compiler's profitability heuristic or
also the legality check (as "#pragma clang loop
vectorize(assume_safety)" does). Of course, I would like to implement
more transformations  (tiling, fusion, interchange, array packing,
reversal, wavefronting, peeling, splitting, space-filling curves,
unswitching, collapsing, strip/strip-mining, blocking, ...) and
options.

There will always be differences between the OpenMP directives and
"#pragma clang transform": First, we cannot implement clang-specific
non-OpenMP transformations using the OpenMP syntax. Second, OpenMP
follows an "the user is always right" approach without checking
whether the program semantics are preserved. By contrast, loop hints
historically and by default do not change the code's semantics but
take the choice of which transformation out of the legal ones to use.
This, for instance, is important for autotuning to avoid that the
machine learning algorithm chooses an optimization that yields wrong
results.

------------------------------------------------------------------------------------------------------------
This is a followup to a previous RFC [1] Unfortunately, I did not get
any response directly on the mailing list (However, I got responses
off the mailing list). Since then, I uploaded diffs for reviews (see
below). Some reviewers requested an updated RFC [Lex] on the topic
which this is intended to be.

So far, I uploaded the following patches:

* [Lex] https://reviews.llvm.org/D69088
  Also adds an -fexperimental-transform-pragma switch to enable
processing of #pragma clang transform

* [Parser] https://reviews.llvm.org/D69089
  Also adds classes to represent a transformation such as LoopUnrollTransform

 * [Sema] https://reviews.llvm.org/D69091
   Adds the AST nodes kinds: the Stmt representing the #pragma
(TransformExecutableDirective) and the clauses (TransformClause).
Moreover, the AnalysisTransform component constructs a loop nest tree
to which transformations are applied to such that Sema can warn about
inconsistencies, e.g. there is no inner or ambiguous loops for
unrollandjam.

 * [CodeGen] https://reviews.llvm.org/D69092
   The codegen part uses the same AnalysisTransform to determine which
loop metadata nodes to emit.

 * [Serialization] https://reviews.llvm.org/D70572

 * [CIndex] https://reviews.llvm.org/D71447:

 * [Docs] https://reviews.llvm.org/D70032:

The main change since the previous RFC is the syntax: In [1] I
suggested, as did some reviewers in [Lex], to re-use the "#pragma
clang loop" syntax. However, I am now convinced that is is not a good
idea because syntax as well as semantics of #pragma clang loop are
different. Ensuring compatibility with existing code requires quirky
exceptions. Also, I eventually would also like to add transformations
that do apply on code that are not on loops (such as code motion,
task-parallelism, SLP vectorization, etc).

Previously, I used an implementation based on attributes [7] (also
used by #pragma clang loop) but I don't think these are powerful
enough to meet all requirements. For instance, transformations that
apply to loop identifiers for generated loops do not have a
syntactical loop that they could be attached to.

------------------------------------------------------------------------------------------------------------
Applying the transformations

My patches for the front-end part do not apply the transformations,
but add metadata to inform LLVM passes to apply the transformations.
[8] and [9] are about encoding of the transformation order as loop
metadata. Patch [9] has been applied some time ago.

However, the applicability of transformations in LLVM is limited by
the passes that have been implemented and the order in which they are
processed by the pass manager. The new pass manager has an updater
that allows re-processing of loops if a pass applied a transformation,
but the pipeline has to be restructured to allow this. I am using an
extension to Polly [12] to apply transformations in any order since
Polly makes it a lot easier to implement a transformation compared to
implement on from scratch on the IR level.

To not require Polly to implement OpenMP, an alternative to emitting
metadata is to use the OpenMPIRBuilder [13] to directly generate the
transformed IR. However, we will not want to do safety analysis on
this level such that this implementation is limited to OpenMP and
assume_safety transformations.

In the long run, I would like to introduce a framework for more
general transformations [10-11] for which we are currently preparing
another RFC.


With kind regards,
Michael Kruse




[0] OpenMP TR8
https://www.openmp.org/wp-content/uploads/openmp-TR8.pdf

[1] RFC: Extending #pragma clang loop
https://lists.llvm.org/pipermail/cfe-dev/2018-May/058141.html

[2] Loop-transformation #pragmas in the front-end
http://llvm.org/devmtg/2019-10/talk-abstracts.html#tech20

[3] A Proposal for Loop-Transformation Pragmas (IWOMP'19)
https://arxiv.org/abs/1805.03374

[4] User-Directed Loop-Transformations in Clang (LLVM-HPC'18)
https://arxiv.org/abs/1811.00624

[5] Design and Use of Loop-Transformation Pragmas (IWOMP'19)
https://arxiv.org/abs/1910.02375

[6] Prototype implementation using #pragma clang loop and attributes
https://github.com/SOLLVE/llvm-project/tree/pragma-clang-loop

[7] Implementation uploaded for review
https://github.com/SOLLVE/llvm-project/tree/pragma-clang-transform

[8] RFC: Extending loop metadata
https://lists.llvm.org/pipermail/llvm-dev/2018-May/123690.html

[9] D57978: Metadata for follow-up transformations
https://reviews.llvm.org/D57978

[10] Loop Optimization Framework (LCPC'18)
https://arxiv.org/abs/1811.00632

[11] Loop Transformations in LLVM (LLVM DevMtg'18)
https://youtu.be/QpvZt9w-Jik?t=813

[12] Implementation for applying transformations using Polly
https://github.com/SOLLVE/polly/tree/pragma

[13] OpenMPIRBuilder
https://reviews.llvm.org/D69785
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: User-directed code transformations with #pragma clang transform

suyash singh via cfe-dev
Hi Michael, thanks for this RFC!  I just have a few questions and
comments to start off.

Are these pragmas meant to be advisory or prescriptive (when legal)?
From your description and motivation I assume the latter but I wanted to
double-check.

>  * Hints are emitted in form of metadata (MDNode) that can be dropped
> by mid-end optimizers

Below you state that metadata will be used to encode the transformations
and order.  Doesn't this suffer from the same problem?

> The selection is currently limited to the passes LLVM currently
> supports. I am working on more transformations that currently are only
> picked-up by Polly. The largest difference to loop hints is that it
> allows to specify in which order the transformations are applied,
> which is ignored by clang's current LoopHint attribute. That is, the
> following for reverses the loop, then unrolls it.
>
>     #pragma clang transform unroll partial(2)
>     #pragma clang transform reverse
>     for (int i = 0; i < 128; i+=1)
>       body(i);

This seems unintuitive to me.  I would expect this to unroll first and
then reverse.  I get the "inner-to-outer" ordering you present here, but
I wonder if it will be too easy for users to get something unexpected.

Will it be possible to list multiple transformations with one directive?

     #pragma clang transform reverse, unroll partial(2)

If so, then the "inner-to-outer" ordering seems even more problematic:

     #pragma clang transform reverse, unroll partial(2)
     #pragma clang transform distribute, vectorize

In what order are the transformations applied?

> Furthermore, I intend to implement assigning identifiers to loop to
> reference them in followup transformations (e.g. tile a loop,
> parallelize the generated outer loop and vectorize the inner),

Do you have an example of this idea?

                   -David
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: User-directed code transformations with #pragma clang transform

suyash singh via cfe-dev
Am Mi., 18. Dez. 2019 um 09:37 Uhr schrieb David Greene <[hidden email]>:
>
> Hi Michael, thanks for this RFC!  I just have a few questions and
> comments to start off.
>
> Are these pragmas meant to be advisory or prescriptive (when legal)?
> From your description and motivation I assume the latter but I wanted to
> double-check.

For loop hint style directives, they are advisory. However, I would
expect the compiler to able to emit a diagnostic (not an error) if
applying it fails, as #pragma clang loop vectorize(enable) already
does.
Determining the safety of a transformations depends on compiler
capability (which may change between versions of the compiler), hence
cannot be prescriptive.

For OpenMP style directives (and possibly assume_safey), they should
be prescriptive.


> >  * Hints are emitted in form of metadata (MDNode) that can be dropped
> > by mid-end optimizers
>
> Below you state that metadata will be used to encode the transformations
> and order.  Doesn't this suffer from the same problem?

I did not claim to be able to solve every problem ;-)
But we should work towards removing sources where the metadata can be
lost (e.g. https://reviews.llvm.org/D53876 and
https://reviews.llvm.org/D66892), but it will always be best effort by
the definition of MDNode.

However, emitting already transformed IR using the OpenMPIRBuilder
would eliminate this source of transformations being forgotten.


> > The selection is currently limited to the passes LLVM currently
> > supports. I am working on more transformations that currently are only
> > picked-up by Polly. The largest difference to loop hints is that it
> > allows to specify in which order the transformations are applied,
> > which is ignored by clang's current LoopHint attribute. That is, the
> > following for reverses the loop, then unrolls it.
> >
> >     #pragma clang transform unroll partial(2)
> >     #pragma clang transform reverse
> >     for (int i = 0; i < 128; i+=1)
> >       body(i);
>
> This seems unintuitive to me.  I would expect this to unroll first and
> then reverse.  I get the "inner-to-outer" ordering you present here, but
> I wonder if it will be too easy for users to get something unexpected.

I find this order more intuitive and matches the OpenMP semantics. For instance,

    #pragma omp parallel for
    for (int i = 0; i < 128; i+=1)

has the same semantics as

    #pragma omp parallel
    #pragma omp for
   for (int i = 0; i < 128; i+=1)

which is the same as

    #pragma omp parallel
    {
       #pragma omp for
       for (int i = 0; i < 128; i+=1)
       ..
    }

I think the difference in interpretation comes from either seeing the pragmas as
1. a collection of attributes to the next statement (like
AttributedStmt/LoopHint)

or

2. as an statement taking another statement as argument (like
OMPExecutableDirective)

In trying unify both implementations, I will have to use the latter.
It also avoid the problems you mentioned below. Moreover, for
transformations that do not apply on loops, this interpretation makes
it clear that it consumes a statement with its own scope:

   #pragma clang transform offload // Compared to "#pragma omp
target", the compiler has to to a legality analysis
   {
     do_something();
     do_something_else();
   }


>
> Will it be possible to list multiple transformations with one directive?

No, and one of the reason I decided against reusing #pragma clang loop syntax.


> > Furthermore, I intend to implement assigning identifiers to loop to
> > reference them in followup transformations (e.g. tile a loop,
> > parallelize the generated outer loop and vectorize the inner),
>
> Do you have an example of this idea?

The same example as code:

    #pragma clang transform vectorize on(innername) width(4)
    #pragma clang transform parallelize_thread on(outername)

    #pragma clang transform tile sizes(32) floor(id(outername))
tile(id(innername))
    for (int i = 0; i < 128; i+=1)

Without ids, by writing more transformation in the clauses, it could
also be written as

    #pragma clang transform parallelize_thread // applies to the
outermost loop of the previous transformation
    #pragma clang transform tile sizes(32) tile(vectorize width(4))
    for (int i = 0; i < 128; i+=1)

Ids are more required when a follow-up transformation applies to
multiple loops, or handy for writing transformations for a specific
target together.

    #ifdef OPTIMIZE_FOR_COARSE_GRAIN
      #pragma clang transform interchange on(j,thefloor) permutation(thefloor,j)
    #endif

    #pragma clang transform id(j)
    for (int j = 0; j < n; j+=1) {
      #pragma clang transform tile size(32) floor(id(thefloor))
      for (int i = 0; i < 128; i+=1)


Michael
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: User-directed code transformations with #pragma clang transform

suyash singh via cfe-dev
Hi Michael,

Thanks for the RFC!  While I see that the identifier is needed to be able to apply transform directives to new loop (which is an important capability), I'm wondering if the identifier is meant to generically separate the source location from the optimization directive.  Will I be able to add identifiers to my loops and have the optimization directives live elsewhere?  If that is the case, then the ordering of the transformations could be difficult to understand.

Thanks again,
Brian

On Wed, Dec 18, 2019 at 11:49 AM Michael Kruse via cfe-dev <[hidden email]> wrote:
Am Mi., 18. Dez. 2019 um 09:37 Uhr schrieb David Greene <[hidden email]>:
>
> Hi Michael, thanks for this RFC!  I just have a few questions and
> comments to start off.
>
> Are these pragmas meant to be advisory or prescriptive (when legal)?
> From your description and motivation I assume the latter but I wanted to
> double-check.

For loop hint style directives, they are advisory. However, I would
expect the compiler to able to emit a diagnostic (not an error) if
applying it fails, as #pragma clang loop vectorize(enable) already
does.
Determining the safety of a transformations depends on compiler
capability (which may change between versions of the compiler), hence
cannot be prescriptive.

For OpenMP style directives (and possibly assume_safey), they should
be prescriptive.


> >  * Hints are emitted in form of metadata (MDNode) that can be dropped
> > by mid-end optimizers
>
> Below you state that metadata will be used to encode the transformations
> and order.  Doesn't this suffer from the same problem?

I did not claim to be able to solve every problem ;-)
But we should work towards removing sources where the metadata can be
lost (e.g. https://reviews.llvm.org/D53876 and
https://reviews.llvm.org/D66892), but it will always be best effort by
the definition of MDNode.

However, emitting already transformed IR using the OpenMPIRBuilder
would eliminate this source of transformations being forgotten.


> > The selection is currently limited to the passes LLVM currently
> > supports. I am working on more transformations that currently are only
> > picked-up by Polly. The largest difference to loop hints is that it
> > allows to specify in which order the transformations are applied,
> > which is ignored by clang's current LoopHint attribute. That is, the
> > following for reverses the loop, then unrolls it.
> >
> >     #pragma clang transform unroll partial(2)
> >     #pragma clang transform reverse
> >     for (int i = 0; i < 128; i+=1)
> >       body(i);
>
> This seems unintuitive to me.  I would expect this to unroll first and
> then reverse.  I get the "inner-to-outer" ordering you present here, but
> I wonder if it will be too easy for users to get something unexpected.

I find this order more intuitive and matches the OpenMP semantics. For instance,

    #pragma omp parallel for
    for (int i = 0; i < 128; i+=1)

has the same semantics as

    #pragma omp parallel
    #pragma omp for
   for (int i = 0; i < 128; i+=1)

which is the same as

    #pragma omp parallel
    {
       #pragma omp for
       for (int i = 0; i < 128; i+=1)
       ..
    }

I think the difference in interpretation comes from either seeing the pragmas as
1. a collection of attributes to the next statement (like
AttributedStmt/LoopHint)

or

2. as an statement taking another statement as argument (like
OMPExecutableDirective)

In trying unify both implementations, I will have to use the latter.
It also avoid the problems you mentioned below. Moreover, for
transformations that do not apply on loops, this interpretation makes
it clear that it consumes a statement with its own scope:

   #pragma clang transform offload // Compared to "#pragma omp
target", the compiler has to to a legality analysis
   {
     do_something();
     do_something_else();
   }


>
> Will it be possible to list multiple transformations with one directive?

No, and one of the reason I decided against reusing #pragma clang loop syntax.


> > Furthermore, I intend to implement assigning identifiers to loop to
> > reference them in followup transformations (e.g. tile a loop,
> > parallelize the generated outer loop and vectorize the inner),
>
> Do you have an example of this idea?

The same example as code:

    #pragma clang transform vectorize on(innername) width(4)
    #pragma clang transform parallelize_thread on(outername)

    #pragma clang transform tile sizes(32) floor(id(outername))
tile(id(innername))
    for (int i = 0; i < 128; i+=1)

Without ids, by writing more transformation in the clauses, it could
also be written as

    #pragma clang transform parallelize_thread // applies to the
outermost loop of the previous transformation
    #pragma clang transform tile sizes(32) tile(vectorize width(4))
    for (int i = 0; i < 128; i+=1)

Ids are more required when a follow-up transformation applies to
multiple loops, or handy for writing transformations for a specific
target together.

    #ifdef OPTIMIZE_FOR_COARSE_GRAIN
      #pragma clang transform interchange on(j,thefloor) permutation(thefloor,j)
    #endif

    #pragma clang transform id(j)
    for (int j = 0; j < n; j+=1) {
      #pragma clang transform tile size(32) floor(id(thefloor))
      for (int i = 0; i < 128; i+=1)


Michael
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: User-directed code transformations with #pragma clang transform

suyash singh via cfe-dev
Am Mi., 19. Feb. 2020 um 13:45 Uhr schrieb Brian Homerding <[hidden email]>:
> Thanks for the RFC!  While I see that the identifier is needed to be able to apply transform directives to new loop (which is an important capability), I'm wondering if the identifier is meant to generically separate the source location from the optimization directive.  Will I be able to add identifiers to my loops and have the optimization directives live elsewhere?  If that is the case, then the ordering of the transformations could be difficult to understand.

The identifiers also disambiguate the transformation order because a
loop can only be 'consumed' by a transformation. Any transformation
that should apply to after the the first transformation must use a
different loop name. That is,

#pragma clang transform unroll on(myloop)
#pragma clang transform vectorize on(myloop)
...
#pragma clang transform id(myloop)
for (int i = 0; i < n; ++i)
  Body(i);

is a compiler error since that transformation order is ambiguous.
Instead, only one of the transformations applies on myloop, while the
output of that transformation must be given a new name:

#pragma clang transform unroll on(mysimdloop)
#pragma clang transform vectorize on(myloop) apply(vectorized:id(mysimdloop))

In which order the pragmas appear in the source code is up to the
programmer. I hope they try to do it in a way that improved
understandably but I think misinterpreting the transformation order is
a lesser concern.


Michael
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev