[OpenMP] Redundant store inside reduction loop body

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[OpenMP] Redundant store inside reduction loop body

shirley breuer via cfe-dev
Hey,

I've encountered a peculiar code generation issue in a parallel-for-reduction.
Inside the per-thread reduction loop, I see a store to a stack slot that happens
*every iteration*, clobbering the value written by the previous one. The address
of that stack slot is later taken to pass to the inter-thread
reduction, but the store
is *not* hoisted outside the loop, while I'd expect it to happen just
once outside it.

Attached below are the code example that triggers this, and an annotated x86-64
assembly snippet from the outlined OMP function. The code was compiled using
clang++-12 Ubuntu focal unstable branch, using the command-line:
clang++-12 -fopenmp -O3 main.cpp -o main

I'm wondering whether this is some sort of bug, and in which component.

Regards,
~Itay

// main.cpp
#include <cstdio>
#include <memory>

double compute_dot_product(size_t n, double *xv, double *yv)
{
  double local = 0.0;
  #pragma omp parallel for reduction (+:local)
  for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
  return local;
}

int main(int argc, char **argv)
{
  constexpr size_t n = 0x1000;
  auto xv = std::make_unique<double[]>(n);
  auto yv = std::make_unique<double[]>(n);

  double result = compute_dot_product(n, xv.get(), yv.get());
  printf("result = %e\n", result);
  return 0;
}

// Disassembly excerpt from objdump -d --no-show-raw-insn main,
function <...>omp_outlined<...>
4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head
4012f5: mulsd (%rsi,%rcx,8),%xmm1
4012fa: addsd %xmm1,%xmm0
4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration
401303: add $0x1,%rcx
401307: add $0xffffffffffffffff,%rax
40130b: jne 4012f0 <.omp_outlined.+0xc0>  ; <--- Non-unrolled loop latch
40130d: cmp $0x3,%rbp
401311: jb 40138b <.omp_outlined.+0x15b> ; <--- x4-unrolled loop guard
401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
401316: lea (%rsi,%rcx,8),%rsi
40131a: add $0x18,%rsi
40131e: lea (%rdx,%rcx,8),%rcx
401322: add $0x18,%rcx
401326: mov $0xffffffffffffffff,%rdx
40132d: nopl (%rax)
401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head
401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
40133c: addsd %xmm0,%xmm1
401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
401345: movsd -0x8(%rcx,%rdx,8),%xmm0
40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
401351: addsd %xmm1,%xmm0
401355: movsd %xmm0,(%rsp)  ; <--- Weird store #2
40135a: movsd (%rcx,%rdx,8),%xmm1
40135f: mulsd (%rsi,%rdx,8),%xmm1
401364: addsd %xmm0,%xmm1
401368: movsd %xmm1,(%rsp)  ; <--- Weird store #3
40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
401379: addsd %xmm1,%xmm0
40137d: movsd %xmm0,(%rsp)  ; <--- Weird store #4
401382: add $0x4,%rdx
401386: cmp %rdx,%rdi
401389: jne 401330 <.omp_outlined.+0x100> ; <--- x4-unrolled loop latch
40138b: mov $0x402028,%edi
401390: mov %r14d,%esi
401393: callq 401040 <__kmpc_for_static_fini@plt>
401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to
reduction logic
40139b: mov %rax,0x20(%rsp)
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [OpenMP] Redundant store inside reduction loop body

shirley breuer via cfe-dev
Hi Itay,

the problem is the "escape" of `local` later in the function (into the
OpenMP reduction runtime call).

There are various solutions to this, on the compiler side, which
I'll try to describe briefly below. If you need a shortstop solution
for this issue, you can introduce a secondary privatization variable
yourself https://godbolt.org/z/3fxxTT (untested). Each thread will use
the user provided variable for accumulation and the OpenMP reduction
facility is used for the final reduction across threads. Obviously, this
is what the compiler should generate, so here we go:

1) Use `PointerMayBeCapturedBefore` instead of `PointerMayBeCaptured` in
    BasicAliasAnalysis.cpp (the call is done via
`isNonEscapingLocalObject`).
    The reason we don't do this, I assume, is compile time. Maybe someone
    will try it out and measure the compile and runtime implications but for
    not I assume this to be a solution that might not be enacted (though
simple
    and generic).

2) Modify the OpenMP reduction output to introduce the second privatization
    location. This should be relatively easy but it will only address
the problem
    at hand. Similar problems pop up more often and I would prefer 1) or
3). That
    said, we can for now enact 2) if someone writes the code ;)

3) Introduce an attribute/annotation that indicates this is actually not
an escaping
    use. Such uses happen all the time (at least in the OpenMP runtime)
and it would
    be ideal if we could indicate the memory store is not causing the
pointer to escape.
    I'm thinking about this a bit more now, any input is welcome :)

I hope this helps, feel free to reach out if you have more questions or
comments.

~ Johannes


On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote:

> Hey,
>
> I've encountered a peculiar code generation issue in a parallel-for-reduction.
> Inside the per-thread reduction loop, I see a store to a stack slot that happens
> *every iteration*, clobbering the value written by the previous one. The address
> of that stack slot is later taken to pass to the inter-thread
> reduction, but the store
> is *not* hoisted outside the loop, while I'd expect it to happen just
> once outside it.
>
> Attached below are the code example that triggers this, and an annotated x86-64
> assembly snippet from the outlined OMP function. The code was compiled using
> clang++-12 Ubuntu focal unstable branch, using the command-line:
> clang++-12 -fopenmp -O3 main.cpp -o main
>
> I'm wondering whether this is some sort of bug, and in which component.
>
> Regards,
> ~Itay
>
> // main.cpp
> #include <cstdio>
> #include <memory>
>
> double compute_dot_product(size_t n, double *xv, double *yv)
> {
>    double local = 0.0;
>    #pragma omp parallel for reduction (+:local)
>    for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
>    return local;
> }
>
> int main(int argc, char **argv)
> {
>    constexpr size_t n = 0x1000;
>    auto xv = std::make_unique<double[]>(n);
>    auto yv = std::make_unique<double[]>(n);
>
>    double result = compute_dot_product(n, xv.get(), yv.get());
>    printf("result = %e\n", result);
>    return 0;
> }
>
> // Disassembly excerpt from objdump -d --no-show-raw-insn main,
> function <...>omp_outlined<...>
> 4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head
> 4012f5: mulsd (%rsi,%rcx,8),%xmm1
> 4012fa: addsd %xmm1,%xmm0
> 4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration
> 401303: add $0x1,%rcx
> 401307: add $0xffffffffffffffff,%rax
> 40130b: jne 4012f0 <.omp_outlined.+0xc0>  ; <--- Non-unrolled loop latch
> 40130d: cmp $0x3,%rbp
> 401311: jb 40138b <.omp_outlined.+0x15b> ; <--- x4-unrolled loop guard
> 401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
> 401316: lea (%rsi,%rcx,8),%rsi
> 40131a: add $0x18,%rsi
> 40131e: lea (%rdx,%rcx,8),%rcx
> 401322: add $0x18,%rcx
> 401326: mov $0xffffffffffffffff,%rdx
> 40132d: nopl (%rax)
> 401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head
> 401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
> 40133c: addsd %xmm0,%xmm1
> 401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
> 401345: movsd -0x8(%rcx,%rdx,8),%xmm0
> 40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
> 401351: addsd %xmm1,%xmm0
> 401355: movsd %xmm0,(%rsp)  ; <--- Weird store #2
> 40135a: movsd (%rcx,%rdx,8),%xmm1
> 40135f: mulsd (%rsi,%rdx,8),%xmm1
> 401364: addsd %xmm0,%xmm1
> 401368: movsd %xmm1,(%rsp)  ; <--- Weird store #3
> 40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
> 401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
> 401379: addsd %xmm1,%xmm0
> 40137d: movsd %xmm0,(%rsp)  ; <--- Weird store #4
> 401382: add $0x4,%rdx
> 401386: cmp %rdx,%rdi
> 401389: jne 401330 <.omp_outlined.+0x100> ; <--- x4-unrolled loop latch
> 40138b: mov $0x402028,%edi
> 401390: mov %r14d,%esi
> 401393: callq 401040 <__kmpc_for_static_fini@plt>
> 401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to
> reduction logic
> 40139b: mov %rax,0x20(%rsp)
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [OpenMP] Redundant store inside reduction loop body

shirley breuer via cfe-dev

I think 3 is the best option in general, no?

12/8/2020 7:23 PM, Johannes Doerfert via cfe-dev пишет:
Hi Itay,

the problem is the "escape" of `local` later in the function (into the
OpenMP reduction runtime call).

There are various solutions to this, on the compiler side, which
I'll try to describe briefly below. If you need a shortstop solution
for this issue, you can introduce a secondary privatization variable
yourself https://godbolt.org/z/3fxxTT (untested). Each thread will use
the user provided variable for accumulation and the OpenMP reduction
facility is used for the final reduction across threads. Obviously, this
is what the compiler should generate, so here we go:

1) Use `PointerMayBeCapturedBefore` instead of `PointerMayBeCaptured` in
   BasicAliasAnalysis.cpp (the call is done via `isNonEscapingLocalObject`).
   The reason we don't do this, I assume, is compile time. Maybe someone
   will try it out and measure the compile and runtime implications but for
   not I assume this to be a solution that might not be enacted (though simple
   and generic).

2) Modify the OpenMP reduction output to introduce the second privatization
   location. This should be relatively easy but it will only address the problem
   at hand. Similar problems pop up more often and I would prefer 1) or 3). That
   said, we can for now enact 2) if someone writes the code ;)

3) Introduce an attribute/annotation that indicates this is actually not an escaping
   use. Such uses happen all the time (at least in the OpenMP runtime) and it would
   be ideal if we could indicate the memory store is not causing the pointer to escape.
   I'm thinking about this a bit more now, any input is welcome :)

I hope this helps, feel free to reach out if you have more questions or comments.

~ Johannes


On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote:
Hey,

I've encountered a peculiar code generation issue in a parallel-for-reduction.
Inside the per-thread reduction loop, I see a store to a stack slot that happens
*every iteration*, clobbering the value written by the previous one. The address
of that stack slot is later taken to pass to the inter-thread
reduction, but the store
is *not* hoisted outside the loop, while I'd expect it to happen just
once outside it.

Attached below are the code example that triggers this, and an annotated x86-64
assembly snippet from the outlined OMP function. The code was compiled using
clang++-12 Ubuntu focal unstable branch, using the command-line:
clang++-12 -fopenmp -O3 main.cpp -o main

I'm wondering whether this is some sort of bug, and in which component.

Regards,
~Itay

// main.cpp
#include <cstdio>
#include <memory>

double compute_dot_product(size_t n, double *xv, double *yv)
{
   double local = 0.0;
   #pragma omp parallel for reduction (+:local)
   for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
   return local;
}

int main(int argc, char **argv)
{
   constexpr size_t n = 0x1000;
   auto xv = std::make_unique<double[]>(n);
   auto yv = std::make_unique<double[]>(n);

   double result = compute_dot_product(n, xv.get(), yv.get());
   printf("result = %e\n", result);
   return 0;
}

// Disassembly excerpt from objdump -d --no-show-raw-insn main,
function <...>omp_outlined<...>
4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head
4012f5: mulsd (%rsi,%rcx,8),%xmm1
4012fa: addsd %xmm1,%xmm0
4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration
401303: add $0x1,%rcx
401307: add $0xffffffffffffffff,%rax
40130b: jne 4012f0 <.omp_outlined.+0xc0>  ; <--- Non-unrolled loop latch
40130d: cmp $0x3,%rbp
401311: jb 40138b <.omp_outlined.+0x15b> ; <--- x4-unrolled loop guard
401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
401316: lea (%rsi,%rcx,8),%rsi
40131a: add $0x18,%rsi
40131e: lea (%rdx,%rcx,8),%rcx
401322: add $0x18,%rcx
401326: mov $0xffffffffffffffff,%rdx
40132d: nopl (%rax)
401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head
401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
40133c: addsd %xmm0,%xmm1
401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
401345: movsd -0x8(%rcx,%rdx,8),%xmm0
40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
401351: addsd %xmm1,%xmm0
401355: movsd %xmm0,(%rsp)  ; <--- Weird store #2
40135a: movsd (%rcx,%rdx,8),%xmm1
40135f: mulsd (%rsi,%rdx,8),%xmm1
401364: addsd %xmm0,%xmm1
401368: movsd %xmm1,(%rsp)  ; <--- Weird store #3
40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
401379: addsd %xmm1,%xmm0
40137d: movsd %xmm0,(%rsp)  ; <--- Weird store #4
401382: add $0x4,%rdx
401386: cmp %rdx,%rdi
401389: jne 401330 <.omp_outlined.+0x100> ; <--- x4-unrolled loop latch
40138b: mov $0x402028,%edi
401390: mov %r14d,%esi
401393: callq 401040 <__kmpc_for_static_fini@plt>
401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to
reduction logic
40139b: mov %rax,0x20(%rsp)
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [OpenMP] Redundant store inside reduction loop body

shirley breuer via cfe-dev
In reply to this post by shirley breuer via cfe-dev

I think 3 is the best option in general.

-------------
Best regards,
Alexey Bataev
12/8/2020 7:23 PM, Johannes Doerfert via cfe-dev пишет:
Hi Itay,

the problem is the "escape" of `local` later in the function (into the
OpenMP reduction runtime call).

There are various solutions to this, on the compiler side, which
I'll try to describe briefly below. If you need a shortstop solution
for this issue, you can introduce a secondary privatization variable
yourself https://godbolt.org/z/3fxxTT (untested). Each thread will use
the user provided variable for accumulation and the OpenMP reduction
facility is used for the final reduction across threads. Obviously, this
is what the compiler should generate, so here we go:

1) Use `PointerMayBeCapturedBefore` instead of `PointerMayBeCaptured` in
   BasicAliasAnalysis.cpp (the call is done via `isNonEscapingLocalObject`).
   The reason we don't do this, I assume, is compile time. Maybe someone
   will try it out and measure the compile and runtime implications but for
   not I assume this to be a solution that might not be enacted (though simple
   and generic).

2) Modify the OpenMP reduction output to introduce the second privatization
   location. This should be relatively easy but it will only address the problem
   at hand. Similar problems pop up more often and I would prefer 1) or 3). That
   said, we can for now enact 2) if someone writes the code ;)

3) Introduce an attribute/annotation that indicates this is actually not an escaping
   use. Such uses happen all the time (at least in the OpenMP runtime) and it would
   be ideal if we could indicate the memory store is not causing the pointer to escape.
   I'm thinking about this a bit more now, any input is welcome :)

I hope this helps, feel free to reach out if you have more questions or comments.

~ Johannes


On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote:
Hey,

I've encountered a peculiar code generation issue in a parallel-for-reduction.
Inside the per-thread reduction loop, I see a store to a stack slot that happens
*every iteration*, clobbering the value written by the previous one. The address
of that stack slot is later taken to pass to the inter-thread
reduction, but the store
is *not* hoisted outside the loop, while I'd expect it to happen just
once outside it.

Attached below are the code example that triggers this, and an annotated x86-64
assembly snippet from the outlined OMP function. The code was compiled using
clang++-12 Ubuntu focal unstable branch, using the command-line:
clang++-12 -fopenmp -O3 main.cpp -o main

I'm wondering whether this is some sort of bug, and in which component.

Regards,
~Itay

// main.cpp
#include <cstdio>
#include <memory>

double compute_dot_product(size_t n, double *xv, double *yv)
{
   double local = 0.0;
   #pragma omp parallel for reduction (+:local)
   for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
   return local;
}

int main(int argc, char **argv)
{
   constexpr size_t n = 0x1000;
   auto xv = std::make_unique<double[]>(n);
   auto yv = std::make_unique<double[]>(n);

   double result = compute_dot_product(n, xv.get(), yv.get());
   printf("result = %e\n", result);
   return 0;
}

// Disassembly excerpt from objdump -d --no-show-raw-insn main,
function <...>omp_outlined<...>
4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head
4012f5: mulsd (%rsi,%rcx,8),%xmm1
4012fa: addsd %xmm1,%xmm0
4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration
401303: add $0x1,%rcx
401307: add $0xffffffffffffffff,%rax
40130b: jne 4012f0 <.omp_outlined.+0xc0>  ; <--- Non-unrolled loop latch
40130d: cmp $0x3,%rbp
401311: jb 40138b <.omp_outlined.+0x15b> ; <--- x4-unrolled loop guard
401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
401316: lea (%rsi,%rcx,8),%rsi
40131a: add $0x18,%rsi
40131e: lea (%rdx,%rcx,8),%rcx
401322: add $0x18,%rcx
401326: mov $0xffffffffffffffff,%rdx
40132d: nopl (%rax)
401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head
401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
40133c: addsd %xmm0,%xmm1
401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
401345: movsd -0x8(%rcx,%rdx,8),%xmm0
40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
401351: addsd %xmm1,%xmm0
401355: movsd %xmm0,(%rsp)  ; <--- Weird store #2
40135a: movsd (%rcx,%rdx,8),%xmm1
40135f: mulsd (%rsi,%rdx,8),%xmm1
401364: addsd %xmm0,%xmm1
401368: movsd %xmm1,(%rsp)  ; <--- Weird store #3
40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
401379: addsd %xmm1,%xmm0
40137d: movsd %xmm0,(%rsp)  ; <--- Weird store #4
401382: add $0x4,%rdx
401386: cmp %rdx,%rdi
401389: jne 401330 <.omp_outlined.+0x100> ; <--- x4-unrolled loop latch
40138b: mov $0x402028,%edi
401390: mov %r14d,%esi
401393: callq 401040 <__kmpc_for_static_fini@plt>
401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to
reduction logic
40139b: mov %rax,0x20(%rsp)
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [OpenMP] Redundant store inside reduction loop body

shirley breuer via cfe-dev
I agree, 3 is preferable, but also the hardest :D

I'll continue to think about this. As I said, there is a user
code short stop and a clang short stop solution if this is time
critical.

I'll send an RFC for 3) as soon as I have a coherent model;
suggestions welcome.

~ Johannes


On 12/9/20 8:10 AM, Alexey.Bataev wrote:

> I think 3 is the best option in general.
>
> -------------
> Best regards,
> Alexey Bataev
>
> 12/8/2020 7:23 PM, Johannes Doerfert via cfe-dev пишет:
>> Hi Itay,
>>
>> the problem is the "escape" of `local` later in the function (into the
>> OpenMP reduction runtime call).
>>
>> There are various solutions to this, on the compiler side, which
>> I'll try to describe briefly below. If you need a shortstop solution
>> for this issue, you can introduce a secondary privatization variable
>> yourself https://godbolt.org/z/3fxxTT (untested). Each thread will use
>> the user provided variable for accumulation and the OpenMP reduction
>> facility is used for the final reduction across threads. Obviously, this
>> is what the compiler should generate, so here we go:
>>
>> 1) Use `PointerMayBeCapturedBefore` instead of `PointerMayBeCaptured` in
>>     BasicAliasAnalysis.cpp (the call is done via
>> `isNonEscapingLocalObject`).
>>     The reason we don't do this, I assume, is compile time. Maybe someone
>>     will try it out and measure the compile and runtime implications
>> but for
>>     not I assume this to be a solution that might not be enacted
>> (though simple
>>     and generic).
>>
>> 2) Modify the OpenMP reduction output to introduce the second
>> privatization
>>     location. This should be relatively easy but it will only address
>> the problem
>>     at hand. Similar problems pop up more often and I would prefer 1)
>> or 3). That
>>     said, we can for now enact 2) if someone writes the code ;)
>>
>> 3) Introduce an attribute/annotation that indicates this is actually
>> not an escaping
>>     use. Such uses happen all the time (at least in the OpenMP runtime)
>> and it would
>>     be ideal if we could indicate the memory store is not causing the
>> pointer to escape.
>>     I'm thinking about this a bit more now, any input is welcome :)
>>
>> I hope this helps, feel free to reach out if you have more questions
>> or comments.
>>
>> ~ Johannes
>>
>>
>> On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote:
>>> Hey,
>>>
>>> I've encountered a peculiar code generation issue in a
>>> parallel-for-reduction.
>>> Inside the per-thread reduction loop, I see a store to a stack slot
>>> that happens
>>> *every iteration*, clobbering the value written by the previous one.
>>> The address
>>> of that stack slot is later taken to pass to the inter-thread
>>> reduction, but the store
>>> is *not* hoisted outside the loop, while I'd expect it to happen just
>>> once outside it.
>>>
>>> Attached below are the code example that triggers this, and an
>>> annotated x86-64
>>> assembly snippet from the outlined OMP function. The code was
>>> compiled using
>>> clang++-12 Ubuntu focal unstable branch, using the command-line:
>>> clang++-12 -fopenmp -O3 main.cpp -o main
>>>
>>> I'm wondering whether this is some sort of bug, and in which component.
>>>
>>> Regards,
>>> ~Itay
>>>
>>> // main.cpp
>>> #include <cstdio>
>>> #include <memory>
>>>
>>> double compute_dot_product(size_t n, double *xv, double *yv)
>>> {
>>>     double local = 0.0;
>>>     #pragma omp parallel for reduction (+:local)
>>>     for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
>>>     return local;
>>> }
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     constexpr size_t n = 0x1000;
>>>     auto xv = std::make_unique<double[]>(n);
>>>     auto yv = std::make_unique<double[]>(n);
>>>
>>>     double result = compute_dot_product(n, xv.get(), yv.get());
>>>     printf("result = %e\n", result);
>>>     return 0;
>>> }
>>>
>>> // Disassembly excerpt from objdump -d --no-show-raw-insn main,
>>> function <...>omp_outlined<...>
>>> 4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head
>>> 4012f5: mulsd (%rsi,%rcx,8),%xmm1
>>> 4012fa: addsd %xmm1,%xmm0
>>> 4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration
>>> 401303: add $0x1,%rcx
>>> 401307: add $0xffffffffffffffff,%rax
>>> 40130b: jne 4012f0 <.omp_outlined.+0xc0>  ; <--- Non-unrolled loop latch
>>> 40130d: cmp $0x3,%rbp
>>> 401311: jb 40138b <.omp_outlined.+0x15b> ; <--- x4-unrolled loop guard
>>> 401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
>>> 401316: lea (%rsi,%rcx,8),%rsi
>>> 40131a: add $0x18,%rsi
>>> 40131e: lea (%rdx,%rcx,8),%rcx
>>> 401322: add $0x18,%rcx
>>> 401326: mov $0xffffffffffffffff,%rdx
>>> 40132d: nopl (%rax)
>>> 401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head
>>> 401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
>>> 40133c: addsd %xmm0,%xmm1
>>> 401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
>>> 401345: movsd -0x8(%rcx,%rdx,8),%xmm0
>>> 40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
>>> 401351: addsd %xmm1,%xmm0
>>> 401355: movsd %xmm0,(%rsp)  ; <--- Weird store #2
>>> 40135a: movsd (%rcx,%rdx,8),%xmm1
>>> 40135f: mulsd (%rsi,%rdx,8),%xmm1
>>> 401364: addsd %xmm0,%xmm1
>>> 401368: movsd %xmm1,(%rsp)  ; <--- Weird store #3
>>> 40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
>>> 401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
>>> 401379: addsd %xmm1,%xmm0
>>> 40137d: movsd %xmm0,(%rsp)  ; <--- Weird store #4
>>> 401382: add $0x4,%rdx
>>> 401386: cmp %rdx,%rdi
>>> 401389: jne 401330 <.omp_outlined.+0x100> ; <--- x4-unrolled loop latch
>>> 40138b: mov $0x402028,%edi
>>> 401390: mov %r14d,%esi
>>> 401393: callq 401040 <__kmpc_for_static_fini@plt>
>>> 401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to
>>> reduction logic
>>> 40139b: mov %rax,0x20(%rsp)
>>> _______________________________________________
>>> cfe-dev mailing list
>>> [hidden email]
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev