Re: [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Sanjay Patel via cfe-dev
I think this is caused by a front-end change (cc'ing clang-dev) because the IR with "-Xclang -disable-llvm-optzns" shows the difference.

But independently of that, there's a missing IR canonicalization - instcombine doesn't currently do anything with either version.

And the version where we trunc later survives through the backend and produces worse code even for x86 with AVX2:
before (clang 3.9):
    vmovd    %edi, %xmm1
    vpmovzxwq    %xmm1, %xmm1 
    vpsraw    %xmm1, %xmm0, %xmm0
    retq

after (clang 4.0/trunk):
    vmovd    %edi, %xmm1
    vpbroadcastd    %xmm1, %ymm1
    vmovdqa    LCPI1_0(%rip), %ymm2  
    vpshufb    %ymm2, %ymm1, %ymm1
    vpermq    $232, %ymm1, %ymm1    
    vpmovzxwd    %xmm1, %ymm1  
    vpmovsxwd    %xmm0, %ymm0
    vpsravd    %ymm1, %ymm0, %ymm0
    vpshufb    %ymm2, %ymm0, %ymm0
    vpermq    $232, %ymm0, %ymm0  
    vzeroupper


So this example may have won the bug lottery by exposing front-end, middle-end, and back-end bugs all at once. :)



On Fri, Feb 17, 2017 at 9:38 AM, Saurabh Verma via llvm-dev <[hidden email]> wrote:
Correction in the C snippet:

typedef signed short v8i16_t   __attribute__((ext_vector_type(8)));

v8i16_t foo (v8i16_t a, int n)
{
   return a >> n;
}

Best regards
Saurabh



On 17 February 2017 at 16:21, Saurabh Verma <[hidden email]> wrote:
Hello,

We are investigating a difference in code generation for vector splat instructions between llvm-3.9 and llvm-4.0, which could lead to a performance regression for our target. Here is the C snippet:

typedef signed v8i16_t __attribute__((ext_vector_type(8)))

v8i16_t foo (v8i16 a, int n)
{
   return result = a >> n;
}

With llvm-3.9, the generated sequence does a trunc followed by splat, but with llvm-4.0 it is reversed to a splat to a bigger vector followed by a v8i32->v8i16 trunc. Is this by design? The earlier code sequence is definitely better for our target, but are there known scenarios where the new sequence would lead to better code?

Here are the instruction sequences generated in the two cases:

With llvm 3.9:

define <8 x i16> @foo(<8 x i16>, i32) #0 {
  %3 = trunc i32 %1 to i16
  %4 = insertelement <8 x i16> undef, i16 %3, i32 0
  %5 = shufflevector <8 x i16> %4, <8 x i16> undef, <8 x i32> zeroinitializer
  %6 = ashr <8 x i16> %0, %5
  ret <8 x i16> %6
}


With llvm 4.0:

define <8 x i16> @foo(<8 x i16>, i32) #0 {
  %3 = insertelement <8 x i32> undef, i32 %1, i32 0
  %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> zeroinitializer
  %5 = trunc <8 x i32> %4 to <8 x i16>
  %6 = ashr <8 x i16> %0, %5
  ret <8 x i16> %6
}

Best regards
Saurabh Verma


Re: [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Saurabh Verma via cfe-dev
Thanks Sanjay. Interestingly, for me -disable-llvm-optzns did not make a difference in the way the shift was handled. Does the initial IR generated for you show this difference when the option is passed?

Best regards
Saurabh


Re: [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Sanjay Patel via cfe-dev
Yes, there is an IR difference between clang 3.9.1 and clang trunk before any IR transforms are done:
https://godbolt.org/g/FuBqIb

We can't solve this problem (moving a trunc ahead of other vector ops) in general in IR because we take a conservative approach to vector transforms in IR. That means the burden for solving the general problem falls on the front-end or the back-end. If you can bisect to find the clang commit where this changed, that would be very helpful.

However, I think we can handle one very specific case (a "too fat" splat) in instcombine, and that will resolve this exact example. It will take a couple of patches to restore your example. Here's a proposal for the first one:
https://reviews.llvm.org/D30123
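
For illustration only, here is a minimal sketch of the target rewrite using LLVM's C++ IRBuilder API. This is not the D30123 patch itself; the helper name and its parameters are hypothetical. Given the wide scalar shift amount and the narrow vector type of the shift, it builds the scalar trunc first and only then splats, which is exactly the <8 x i16> form llvm 3.9 produced above:

// Sketch only (not the actual instcombine change in D30123); the helper
// name and its parameters are hypothetical.
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Rebuild a "too fat splat" as trunc-then-splat: given the wide scalar shift
// amount (the i32 %n) and the narrow vector type of the shift (<8 x i16>),
// emit the narrow canonical form:
//   %t = trunc i32 %n to i16
//   %s = splat of %t to <8 x i16>   (insertelement + zero-mask shufflevector)
static Value *buildNarrowSplat(IRBuilder<> &B, Value *WideScalar,
                               VectorType *NarrowVecTy) {
  Value *NarrowScalar =
      B.CreateTrunc(WideScalar, NarrowVecTy->getElementType());
  return B.CreateVectorSplat(NarrowVecTy->getNumElements(), NarrowScalar);
}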


Re: [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Sanjay Patel via cfe-dev
It would still be good to understand if the clang change was intentional or if that was a side effect that can be limited.

Re: [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Akira Hatanaka via cfe-dev
There were several patches (r278501 was the first) that fixed vector shift bugs. I don’t think the IR changes were intentional.

I’m not sure if it’s the right solution, but inserting an integral cast before the CK_VectorSplat cast in checkVectorShift makes IRGen emit the trunc before the splat.
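
As a rough sketch of that idea (a sketch only, assuming the SemaExpr.cpp helper keeps roughly this shape; the surrounding operand checks are elided and the exact signature is not verified against trunk), the shift amount would be converted to the vector's element type before the splat cast:

#include "clang/Sema/Sema.h"

using namespace clang;

// Sketch only: narrow the scalar RHS to the vector element type *before*
// splatting it, so IRGen emits "trunc i32 %n to i16" followed by an
// <8 x i16> splat instead of an <8 x i32> splat followed by a vector trunc.
static QualType checkVectorShift(Sema &S, ExprResult &LHS, ExprResult &RHS,
                                 SourceLocation Loc, bool IsCompAssign) {
  // ... existing checks on the operand types elided ...
  QualType LHSType = LHS.get()->getType();
  QualType EltTy = LHSType->castAs<VectorType>()->getElementType();

  // Integral cast of the scalar shift amount to the element type first...
  RHS = S.ImpCastExprToType(RHS.get(), EltTy, CK_IntegralCast);
  // ...and only then the CK_VectorSplat to the full vector type.
  RHS = S.ImpCastExprToType(RHS.get(), LHSType, CK_VectorSplat);

  return LHSType;
}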

Re: [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Sanjay Patel via cfe-dev
Thanks, Akira.

I don't know enough about vectors in the front-end to be much use here. cc'ing authors/reviewers of some of the patches that might be related:
https://reviews.llvm.org/rL284579
https://reviews.llvm.org/rL281669
https://reviews.llvm.org/rL278501
