food for optimizer developers


Ralf W. Grosse-Kunstleve
I wrote a Fortran to C++ conversion program that I used to convert selected
LAPACK sources. Comparing runtimes with different compilers I get:

                          absolute  relative
ifort 11.1.072             1.790s     1.00
gfortran 4.4.4             2.470s     1.38
g++ 4.4.4                  2.922s     1.63
clang++ 2.8 (trunk 108205) 6.487s     3.62

This is under Fedora 13, 64-bit, on a 12-core 2.2 GHz Opteron.

All files to easily reproduce the results are here:

  http://cci.lbl.gov/lapack_fem/

See the README file or the example commands below.

Questions:

- Why is the code generated by clang++ so much slower than the g++ code?

- Is there anything I could do in the C++ code generation or in the "fem"
  Fortran EMulation library to help runtime performance?

Ralf


wget http://cci.lbl.gov/lapack_fem/lapack_fem_001.tgz
tar zxf lapack_fem_001.tgz
cd lapack_fem_001
clang++ -o dsyev_test_clang++ -I. -O3 -ffast-math dsyev_test.cpp
time ./dsyev_test_clang++
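
For a side-by-side comparison, the analogous g++ run would look something like this (a sketch based on the clang++ command above; the README may use different output names):

g++ -o dsyev_test_g++ -I. -O3 -ffast-math dsyev_test.cpp
time ./dsyev_test_g++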

Re: food for optimizer developers

Alexei Svitkine
Have you tried profiling the resulting program?

-Alexei



Re: food for optimizer developers

Ralf W. Grosse-Kunstleve
> Have you tried profiling the resulting program?
>
> -Alexei

I profiled the ifort and g++ builds, just enough to convince myself there isn't
something silly going on due to the conversion to C++.
50% of the time is spent in two lines of code.
I haven't profiled the clang++ build, mainly because I don't think I could do much
about the 2x speed difference compared to g++ anyway.
Ralf
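
For anyone who does want to profile the clang++ build, a minimal sketch using Linux perf (my assumption; any sampling profiler would do, and the binary name mirrors the commands from the original post) is:

clang++ -o dsyev_test_clang++ -I. -O3 -g -ffast-math dsyev_test.cpp
perf record ./dsyev_test_clang++
perf report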




Re: food for optimizer developers

Robert Purves
In reply to this post by Ralf W. Grosse-Kunstleve

> I wrote a Fortran to C++ conversion program that I used to convert selected
> LAPACK sources. Comparing runtimes with different compilers I get:
>
>                          absolute  relative
> ifort 11.1.072             1.790s     1.00
> gfortran 4.4.4             2.470s     1.38
> g++ 4.4.4                  2.922s     1.63
> clang++ 2.8 (trunk 108205) 6.487s     3.62

> - Why is the code generated by clang++ so much slower than the g++ code?

A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr()

  FEM_DO(i, 1, m) {
    temp = a(i, j + 1);
    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
    a(i, j) = stemp * temp + ctemp * a(i, j);
  }

For the loop body, g++ (4.2) emits unsurprising code.
loop:
movsd    (%rcx), %xmm2
movapd   %xmm3, %xmm0
mulsd    %xmm2, %xmm0
movapd   %xmm4, %xmm1
mulsd    (%rax), %xmm1
subsd    %xmm1, %xmm0
movsd    %xmm0, (%rcx)
movapd   %xmm3, %xmm0
mulsd    (%rax), %xmm0
mulsd    %xmm4, %xmm2
addsd    %xmm2, %xmm0
movsd    %xmm0, (%rax)
incl     %esi
addq     $8, %rcx
addq     $8, %rax
cmpl     %esi, +0(%r13)
jge      loop

clang++ (2.8) misses major optimizations when accessing the 'a' array, and performs no fewer than three laborious address calculations.
loop:
movq     %rax, %rdi
subq     %rdx, %rdi
imulq    %r14, %rdi
subq     %rcx, %rdi
addq     %rsi, %rdi
movq     +0(%r13), %r8
movsd    (%r8, %rdi, 8), %xmm3
mulsd    %xmm1, %xmm3
movq     %rbx, %rdi
subq     %rdx, %rdi
imulq    %r14, %rdi
subq     %rcx, %rdi
addq     %rsi, %rdi
movsd    (%r8, %rdi, 8), %xmm4
movapd   %xmm2, %xmm5
mulsd    %xmm4, %xmm5
subsd    %xmm3, %xmm5
movsd    %xmm5, (%r8, %rdi, 8)
movq     +32(%r13), %rdx
movq     %rax, %rdi
subq     %rdx, %rdi
movq     +0(%r13), %r8
movq     +8(%r13), %r14
imulq    %r14, %rdi
movq     +24(%r13), %rcx
subq     %rcx, %rdi
addq     %rsi, %rdi
movsd    (%r8, %rdi, 8), %xmm3
mulsd    %xmm2, %xmm3
mulsd    %xmm1, %xmm4
addsd    %xmm3, %xmm4
movsd    %xmm4, (%r8, %rdi, 8)
incq     %rsi
cmpl     (%r15), %esi
jle      loop

Presumably clang++, in its present state of development, is not smart enough to notice the underlying simple sequential access pattern, when the array is declared
arr_ref<double, 2> a

I think clang has no trouble optimizing properly for arrays like this:
double  a[800][800];
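
To make the contrast concrete, here is a minimal self-contained sketch (hypothetical names; this is not the actual fem arr_ref implementation) of the kind of operator()-based 2-D view whose index arithmetic the optimizer has to see through:

#include <cstddef>

// Hypothetical 1-based, column-major 2-D view, loosely modeled on what a
// Fortran-emulation arr_ref<double, 2> might look like.
struct view2d {
  double* data;
  std::size_t lda;  // leading dimension
  double& operator()(std::size_t i, std::size_t j) {
    // The (i-1) + (j-1)*lda arithmetic is what the optimizer must
    // recognize as a simple sequential walk over each column.
    return data[(i - 1) + (j - 1) * lda];
  }
};

// The same rotation as the dlasr() hot spot, written against the view.
void rotate(view2d a, int m, int j, double ctemp, double stemp) {
  for (int i = 1; i <= m; ++i) {
    double temp = a(i, j + 1);
    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
    a(i, j)     = stemp * temp + ctemp * a(i, j);
  }
}

With a plain double a[800][800] the addressing is visible directly in the loop; with the wrapper, the compiler has to look through operator() and the stored lda to recover the same simple stride.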

Robert P.



Re: food for optimizer developers

Douglas Gregor

On Aug 10, 2010, at 3:59 AM, Robert Purves wrote:

>
>> I wrote a Fortran to C++ conversion program that I used to convert selected
>> LAPACK sources. Comparing runtimes with different compilers I get: [...]
>
>> - Why is the code generated by clang++ so much slower than the g++ code?
>
> A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr()
>
>  FEM_DO(i, 1, m) {
>    temp = a(i, j + 1);
>    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
>    a(i, j) = stemp * temp + ctemp * a(i, j);
>  }
>
> [g++ and clang++ assembly listings snipped]
>
> Presumably clang++, in its present state of development, is not smart enough to notice the underlying simple sequential access pattern, when the array is declared
> arr_ref<double, 2> a


This would make a *wonderful* bug report against the LLVM optimizer... http://llvm.org/bugs/ :)

        - Doug

Re: food for optimizer developers

Chris Lattner
In reply to this post by Robert Purves

On Aug 10, 2010, at 3:59 AM, Robert Purves wrote:

>
>> I wrote a Fortran to C++ conversion program that I used to convert selected
>> LAPACK sources. Comparing runtimes with different compilers I get:
>>
>>                         absolute  relative
>> ifort 11.1.072             1.790s     1.00
>> gfortran 4.4.4             2.470s     1.38
>> g++ 4.4.4                  2.922s     1.63
>> clang++ 2.8 (trunk 108205) 6.487s     3.62
>
>> - Why is the code generated by clang++ so much slower than the g++ code?
>
> A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr()
>
>  FEM_DO(i, 1, m) {
>    temp = a(i, j + 1);
>    a(i, j + 1) = ctemp * temp - stemp * a(i, j);
>    a(i, j) = stemp * temp + ctemp * a(i, j);
>  }

Please file a bug with the reduced .cpp testcase.  My wild guess is that this is a failure because we don't have TBAA yet, which isn't being worked on.  What flags are you passing to the compiler?  Anything like -ffast-math?  Note that ifort defaults to "fast and loose" numerics iirc.

-Chris


Re: food for optimizer developers

Chris Lattner

On Aug 10, 2010, at 8:42 AM, Chris Lattner wrote:

> Please file a bug with the reduced .cpp testcase.  My wild guess is that this is a failure because we don't have TBAA yet, which isn't being worked on.  What flags are you passing to the compiler?  Anything like -ffast-math?  Note that ifort defaults to "fast and loose" numerics iirc.

Rather, "which *is* being worked on".  You can quickly verify this assumption by seeing if gcc generates similar code to llvm when you pass -fno-strict-aliasing to gcc.

-Chris
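
For concreteness, the check Chris suggests (g++ without strict aliasing versus clang++) could look something like this, reusing the flags from the original post (output names are illustrative):

g++ -o dsyev_test_g++_noalias -I. -O3 -ffast-math -fno-strict-aliasing dsyev_test.cpp
time ./dsyev_test_g++_noalias
clang++ -o dsyev_test_clang++ -I. -O3 -ffast-math dsyev_test.cpp
time ./dsyev_test_clang++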

Re: food for optimizer developers

Ralf W. Grosse-Kunstleve
In reply to this post by Chris Lattner
Chris Lattner wrote:


> Please file a bug with the reduced .cpp testcase.

http://llvm.org/bugs/show_bug.cgi?id=7868

> What flags are you passing to the compiler?

-O3 -ffast-math

> Note that ifort defaults to "fast and loose" numerics iirc.

Which is exactly what I'm hoping to get from C++, too, one day,
if I ask for it via certain options.

I think speed will be the major argument against using the C++ code
generated by the fable converter. If the generated C++ code could somehow
be made to run nearly as fast as the original Fortran (compiled with ifort),
there wouldn't be any good reason left to keep developing in Fortran,
or to bother with the complexities of mixing languages.

Ralf

Re: food for optimizer developers

Robert Purves
In reply to this post by Douglas Gregor

Douglas Gregor wrote:

>>> I wrote a Fortran to C++ conversion program that I used to convert selected
>>> LAPACK sources. Comparing runtimes with different compilers I get: [...]
>>
>> A "hot spot" in your benchmark dsyev_test.cpp is this loop in dlasr() [...]
>>
>> Presumably clang++, in its present state of development, is not smart enough to notice the underlying simple sequential access pattern, when the array is declared
>> arr_ref<double, 2> a
>
> This would make a *wonderful* bug report against the LLVM optimizer... http://llvm.org/bugs/ :)

I believe that would require the cooperation of the OP, because it is his Fortran -> C++ converter. Are you interested, Ralf?
I've started the ball rolling with a much reduced test case.


cat test.cpp
/*
  Background:
  <http://lists.cs.uiuc.edu/pipermail/cfe-dev/2010-August/010258.html>

  Relevant files, including benchmark dsyev_test.cpp:
  <http://cci.lbl.gov/lapack_fem/>

  This file (test.cpp) is a reduced case of dsyev_test.cpp.
  It sheds light on the performance issue with clang++.

  $ clang++ -c -I. -O3 test.cpp -save-temps

  Examine test.s, in which the two inner loops of interest
  are easily identified by their 'subsd' instruction.
  Contrary to expectation, assembly code for loops A and B
  is different. Loop B contains laborious and redundant
  address calculations.

  clang --version
  clang version 2.8 (trunk 110653)

  By contrast, g++ (4.2) emits identical assembler for loops A and B.
*/

#include <fem/major_types.hpp>

namespace lapack_dsyev_fem {

  using namespace fem::major_types;

  void
  test(
    int x,
    int const& m,
    int const& n,
    arr_cref<double> c,
    arr_cref<double> s,
    arr_ref<double, 2> a,
    int const& lda)
  {
    c(dimension(star));
    s(dimension(star));
    a(dimension(lda, star));

    int i, j;
    double ctemp, stemp, temp;

    if (x) {
      for (j = m - 1; j >= 1; j--) {
        ctemp = c(j);
        stemp = s(j);
        // loop A, identical with loop B below
        for (i = 1; i <= n; i++) {
          temp = a(j + 1, i);
          a(j + 1, i) = ctemp * temp - stemp * a(j, i);
          a(j, i) = stemp * temp + ctemp * a(j, i);
        }
      }
    }
    else {
      for (j = m - 1; j >= 1; j--) {
        ctemp = c(j);
        stemp = s(j);
        // loop B, identical with loop A above
        for (i = 1; i <= n; i++) {
          temp = a(j + 1, i);
          a(j + 1, i) = ctemp * temp - stemp * a(j, i);
          a(j, i) = stemp * temp + ctemp * a(j, i);
        }
      }
    }
  }

} // namespace lapack_dsyev_fem
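
One convenient way to locate the two inner loops in test.s for comparison (a small extra step on top of the commands in the comment above):

clang++ -c -I. -O3 test.cpp -save-temps
grep -n subsd test.s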


Robert P.




Re: food for optimizer developers

Robert Purves
In reply to this post by Chris Lattner

Chris Lattner wrote:

>> My wild guess is that this is a failure because we don't have TBAA yet, which isn't being worked on.  What flags are you passing to the compiler?  Anything like -ffast-math?  Note that ifort defaults to "fast and loose" numerics iirc.
>
> Rather, "which *is* being worked on".  You can quickly verify this assumption by seeing if gcc generates similar code to llvm when you pass -fno-strict-aliasing to gcc.

Passing -fno-strict-aliasing makes no difference to the code generated by g++. It is still twice the speed of code from clang.

Robert P.



Re: food for optimizer developers

Ralf W. Grosse-Kunstleve
In reply to this post by Robert Purves
Hi Robert,

> I believe that would require the cooperation of the OP, because it is his
> Fortran -> C++ converter. Are you interested, Ralf?


Definitely. Let me know how I could help by changing the C++ code generator.

Ralf