Clang ignoring --fast-math for complex division, serious performance hit

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
Similarly to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls anywhere complex division occurs, irrespective of the -ffast-math setting. This can cause a single division (which should be one or at most two divss instructions, with minimal performance impact) to slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving) significantly. I’m not sure how long this has been the case - for some time now, in my critical loop, I instead call an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.

Can we fix this (again) and maybe add a code comment wherever it needs to be such that it doesn’t keep getting broken after a couple years?
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
> On Nov 5, 2017, at 12:43 PM, Richard Campbell via cfe-dev <[hidden email]> wrote:
> Similarly to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls anywhere complex division occurs, irrespective of the -ffast-math setting. This can cause a single division (which should be one or at most two divss instructions, with minimal performance impact) to slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving) significantly. I’m not sure how long this has been the case - for some time now, in my critical loop, I instead call an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.
>
> Can we fix this (again) and maybe add a code comment wherever it needs to be such that it doesn’t keep getting broken after a couple years?

These don't sound like the same bug; they're just the same overall problem observed in different patterns of source code.  The best way to prevent regressions in cases like this is to add some tests that the source patterns generate the desired output.

I can't imagine how the general case of complex division in general could possibly be implemented in "one or two divss instructions", so I suspect you're talking about the special case of dividing a complex number by a scalar (either real or imaginary).  I wouldn't be that surprised if attempts got broken if we didn't have a good, exhaustive set of tests for it.

John.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
__attribute__((always_inline)) static inline float _Complex __divsc3(const float ar, const float ai, const float br, const float bi) {
    const float one_over_denominator = 1.0f / (br * br + bi * bi);
    return (float _Complex){ (ar * br + ai * bi) * one_over_denominator, (ai * br - ar * bi) * one_over_denominator };
}

This contains a single real divide, and is general. I proposed it two years ago for -ffast-math/-freciprocal-math but didn’t get any takers. Clang and gcc both normally (until this regression) implement a complex divide using two divide instructions.

> On Nov 5, 2017, at 9:28 PM, John McCall <[hidden email]> wrote:
>
>> On Nov 5, 2017, at 12:43 PM, Richard Campbell via cfe-dev <[hidden email]> wrote:
>> Similarly to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls anywhere complex division occurs, irrespective of the -ffast-math setting. This can cause a single division (which should be one or at most two divss instructions, with minimal performance impact) to slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving) significantly. I’m not sure how long this has been the case - for some time now, in my critical loop, I instead call an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.
>>
>> Can we fix this (again) and maybe add a code comment wherever it needs to be such that it doesn’t keep getting broken after a couple years?
>
> These don't sound like the same bug; they're just the same overall problem observed in different patterns of source code.  The best way to prevent regressions in cases like this is to add some tests that the source patterns generate the desired output.
>
> I can't imagine how the general case of complex division in general could possibly be implemented in "one or two divss instructions", so I suspect you're talking about the special case of dividing a complex number by a scalar (either real or imaginary).  I wouldn't be that surprised if attempts got broken if we didn't have a good, exhaustive set of tests for it.
>
> John.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev

> On Nov 6, 2017, at 12:43 AM, Richard Campbell <[hidden email]> wrote:
> On Nov 5, 2017, at 9:28 PM, John McCall <[hidden email]> wrote:
>>
>>> On Nov 5, 2017, at 12:43 PM, Richard Campbell via cfe-dev <[hidden email]> wrote:
>>> Similarly to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls anywhere complex division occurs, irrespective of the -ffast-math setting. This can cause a single division (which should be one or at most two divss instructions, with minimal performance impact) to slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving) significantly. I’m not sure how long this has been the case - for some time now, in my critical loop, I instead call an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.
>>>
>>> Can we fix this (again) and maybe add a code comment wherever it needs to be such that it doesn’t keep getting broken after a couple years?
>>
>> These don't sound like the same bug; they're just the same overall problem observed in different patterns of source code.  The best way to prevent regressions in cases like this is to add some tests that the source patterns generate the desired output.
>>
>> I can't imagine how the general case of complex division in general could possibly be implemented in "one or two divss instructions", so I suspect you're talking about the special case of dividing a complex number by a scalar (either real or imaginary).  I wouldn't be that surprised if attempts got broken if we didn't have a good, exhaustive set of tests for it.
>
> __attribute__((always_inline)) static inline float _Complex __divsc3(const float ar, const float ai, const float br, const float bi) {
>    const float one_over_denominator = 1.0f / (br * br + bi * bi);
>    return (float _Complex){ (ar * br + ai * bi) * one_over_denominator, (ai * br - ar * bi) * one_over_denominator };
> }
>
> This contains a single real divide, and is general. I proposed it two years ago for -ffast-math/-freciprocal-math but didn’t get any takers. Clang and gcc both normally (until this regression) implement a complex divide using two divide instructions.

I see.  You're saying that, under -ffast-math, we can implement a full complex division with one reciprocal plus a bunch of other operations, not that we can implement it with just one division.  I believe you're correct that that's acceptable under the usual assumptions of -ffast-math, as well as that it's likely to generally be faster to do that, but I'm not an expert; I've CC'ed a few people who are.  One of those people, Chandler, happens to be the person who last changed this (with r219557 in October 2014).  If they sign off on the idea, I'd be happy to accept a patch making Clang emit an inlined version under -ffast-math using a single reciprocal.

John.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
On Nov 5, 2017, at 22:48, John McCall <[hidden email]> wrote:

>
>
>> On Nov 6, 2017, at 12:43 AM, Richard Campbell <[hidden email]> wrote:
>>> On Nov 5, 2017, at 9:28 PM, John McCall <[hidden email]> wrote:
>>>
>>>> On Nov 5, 2017, at 12:43 PM, Richard Campbell via cfe-dev <[hidden email]> wrote:
>>>> Similarly to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls anywhere complex division occurs, irrespective of the -ffast-math setting. This can cause a single division (which should be one or at most two divss instructions, with minimal performance impact) to slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving) significantly. I’m not sure how long this has been the case - for some time now, in my critical loop, I instead call an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.
>>>>
>>>> Can we fix this (again) and maybe add a code comment wherever it needs to be such that it doesn’t keep getting broken after a couple years?
>>>
>>> These don't sound like the same bug; they're just the same overall problem observed in different patterns of source code.  The best way to prevent regressions in cases like this is to add some tests that the source patterns generate the desired output.
>>>
>>> I can't imagine how the general case of complex division in general could possibly be implemented in "one or two divss instructions", so I suspect you're talking about the special case of dividing a complex number by a scalar (either real or imaginary).  I wouldn't be that surprised if attempts got broken if we didn't have a good, exhaustive set of tests for it.
>>
>> __attribute__((always_inline)) static inline float _Complex __divsc3(const float ar, const float ai, const float br, const float bi) {
>>   const float one_over_denominator = 1.0f / (br * br + bi * bi);
>>   return (float _Complex){ (ar * br + ai * bi) * one_over_denominator, (ai * br - ar * bi) * one_over_denominator };
>> }
>>
>> This contains a single real divide, and is general. I proposed it two years ago for -ffast-math/-freciprocal-math but didn’t get any takers. Clang and gcc both normally (until this regression) implement a complex divide using two divide instructions.
>
> I see.  You're saying that, under -ffast-math, we can implement a full complex division with one reciprocal plus a bunch of other operations, not that we can implement it with just one division.  I believe you're correct that that's acceptable under the usual assumptions of -ffast-math, as well as that it's likely to generally be faster to do that, but I'm not an expert; I've CC'ed a few people who are.  One of those people, Chandler, happens to be the person who last changed this (with r219557 in October 2014).  If they sign off on the idea, I'd be happy to accept a patch making Clang emit an inlined version under -ffast-math using a single reciprocal.

Yes, this is fine for fast-math.
- Steve
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
The much bigger issue is not on division or two, but rather zero function calls or one. The function call overhead, and the resulting inability to make any other refactoring optimisations, far outweighs the choice of instructions used.

> On Nov 5, 2017, at 23:06, Steve Canon <[hidden email]> wrote:
>
>> On Nov 5, 2017, at 22:48, John McCall <[hidden email]> wrote:
>>
>>
>>>> On Nov 6, 2017, at 12:43 AM, Richard Campbell <[hidden email]> wrote:
>>>>> On Nov 5, 2017, at 9:28 PM, John McCall <[hidden email]> wrote:
>>>>>
>>>>> On Nov 5, 2017, at 12:43 PM, Richard Campbell via cfe-dev <[hidden email]> wrote:
>>>>> Similarly to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls anywhere complex division occurs, irrespective of the -ffast-math setting. This can cause a single division (which should be one or at most two divss instructions, with minimal performance impact) to slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving) significantly. I’m not sure how long this has been the case - for some time now, in my critical loop, I instead call an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.
>>>>>
>>>>> Can we fix this (again) and maybe add a code comment wherever it needs to be such that it doesn’t keep getting broken after a couple years?
>>>>
>>>> These don't sound like the same bug; they're just the same overall problem observed in different patterns of source code.  The best way to prevent regressions in cases like this is to add some tests that the source patterns generate the desired output.
>>>>
>>>> I can't imagine how the general case of complex division in general could possibly be implemented in "one or two divss instructions", so I suspect you're talking about the special case of dividing a complex number by a scalar (either real or imaginary).  I wouldn't be that surprised if attempts got broken if we didn't have a good, exhaustive set of tests for it.
>>>
>>> __attribute__((always_inline)) static inline float _Complex __divsc3(const float ar, const float ai, const float br, const float bi) {
>>>  const float one_over_denominator = 1.0f / (br * br + bi * bi);
>>>  return (float _Complex){ (ar * br + ai * bi) * one_over_denominator, (ai * br - ar * bi) * one_over_denominator };
>>> }
>>>
>>> This contains a single real divide, and is general. I proposed it two years ago for -ffast-math/-freciprocal-math but didn’t get any takers. Clang and gcc both normally (until this regression) implement a complex divide using two divide instructions.
>>
>> I see.  You're saying that, under -ffast-math, we can implement a full complex division with one reciprocal plus a bunch of other operations, not that we can implement it with just one division.  I believe you're correct that that's acceptable under the usual assumptions of -ffast-math, as well as that it's likely to generally be faster to do that, but I'm not an expert; I've CC'ed a few people who are.  One of those people, Chandler, happens to be the person who last changed this (with r219557 in October 2014).  If they sign off on the idea, I'd be happy to accept a patch making Clang emit an inlined version under -ffast-math using a single reciprocal.
>
> Yes, this is fine for fast-math.
> - Steve
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
> On Nov 6, 2017, at 2:47 AM, Richard Campbell <[hidden email]> wrote:
>> On Nov 5, 2017, at 23:06, Steve Canon <[hidden email]> wrote:
>>> On Nov 5, 2017, at 22:48, John McCall <[hidden email]> wrote:
>>>>> On Nov 6, 2017, at 12:43 AM, Richard Campbell <[hidden email]> wrote:
>>>>>> On Nov 5, 2017, at 9:28 PM, John McCall <[hidden email]> wrote:
>>>>>>
>>>>>> On Nov 5, 2017, at 12:43 PM, Richard Campbell via cfe-dev <[hidden email]> wrote:
>>>>>> Similarly to a problem that occurred two years ago with multiplication (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html), production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now issuing __divsc3 function calls anywhere complex division occurs, irrespective of the -ffast-math setting. This can cause a single division (which should be one or at most two divss instructions, with minimal performance impact) to slow down a performance-critical section of code (single-precision complex tridiagonal matrix solving) significantly. I’m not sure how long this has been the case - for some time now, in my critical loop, I instead call an inline function which explicitly does the single divss required. When I take out this hack, the speed of the entire application is cut in half.
>>>>>>
>>>>>> Can we fix this (again) and maybe add a code comment wherever it needs to be such that it doesn’t keep getting broken after a couple years?
>>>>>
>>>>> These don't sound like the same bug; they're just the same overall problem observed in different patterns of source code.  The best way to prevent regressions in cases like this is to add some tests that the source patterns generate the desired output.
>>>>>
>>>>> I can't imagine how the general case of complex division in general could possibly be implemented in "one or two divss instructions", so I suspect you're talking about the special case of dividing a complex number by a scalar (either real or imaginary).  I wouldn't be that surprised if attempts got broken if we didn't have a good, exhaustive set of tests for it.
>>>>
>>>> __attribute__((always_inline)) static inline float _Complex __divsc3(const float ar, const float ai, const float br, const float bi) {
>>>> const float one_over_denominator = 1.0f / (br * br + bi * bi);
>>>> return (float _Complex){ (ar * br + ai * bi) * one_over_denominator, (ai * br - ar * bi) * one_over_denominator };
>>>> }
>>>>
>>>> This contains a single real divide, and is general. I proposed it two years ago for -ffast-math/-freciprocal-math but didn’t get any takers. Clang and gcc both normally (until this regression) implement a complex divide using two divide instructions.
>>>
>>> I see.  You're saying that, under -ffast-math, we can implement a full complex division with one reciprocal plus a bunch of other operations, not that we can implement it with just one division.  I believe you're correct that that's acceptable under the usual assumptions of -ffast-math, as well as that it's likely to generally be faster to do that, but I'm not an expert; I've CC'ed a few people who are.  One of those people, Chandler, happens to be the person who last changed this (with r219557 in October 2014).  If they sign off on the idea, I'd be happy to accept a patch making Clang emit an inlined version under -ffast-math using a single reciprocal.
>>
>> Yes, this is fine for fast-math.

> The much bigger issue is not on division or two, but rather zero function calls or one. The function call overhead, and the resulting inability to make any other refactoring optimisations, far outweighs the choice of instructions used.

By "refactoring optimizations", do you mean reordering and potentially CSE'ing the component arithmetic with operations outside of the division, or do you mean the compiler-barrier costs of emitting an opaque function call in the frontend instead of something that can be CSE'ed / reordered itself?  Because the latter is a problem that can be fixed for non-fast-math arithmetic as well.

My general impression is that there is a lot of low-hanging fruit in optimizing complex math in LLVM for one simple reason: it's not widely used, so it's an accordingly low priority for most of our current contributors.  If this is something that interests you, we'd be very open to contributions.


John.


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
In reply to this post by David Chisnall via cfe-dev
On Sun, Nov 05, 2017 at 09:43:56AM -0800, Richard Campbell via cfe-dev wrote:
> Similarly to a problem that occurred two years ago with multiplication
> (http://lists.llvm.org/pipermail/cfe-dev/2015-July/043792.html),
> production Clang (Apple LLVM version 9.0.0 (clang-900.0.38)) is now
> issuing __divsc3 function calls anywhere complex division occurs,
> irrespective of the -ffast-math setting.

Are multiplication and division really the same thing for -ffast-math?
There are a lot of cases were division will trivially result in horrible
loss of precision. I think the situation is far worse than for
multiplication.

Joerg
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
In reply to this post by David Chisnall via cfe-dev

> On Nov 6, 2017, at 12:20 AM, John McCall <[hidden email]> wrote:
>
>> On Nov 6, 2017, at 2:47 AM, Richard Campbell <[hidden email]> wrote:
>> The much bigger issue is not on division or two, but rather zero function calls or one. The function call overhead, and the resulting inability to make any other refactoring optimisations, far outweighs the choice of instructions used.
>
> By "refactoring optimizations", do you mean reordering and potentially CSE'ing the component arithmetic with operations outside of the division, or do you mean the compiler-barrier costs of emitting an opaque function call in the frontend instead of something that can be CSE'ed / reordered itself?  Because the latter is a problem that can be fixed for non-fast-math arithmetic as well.
>
> My general impression is that there is a lot of low-hanging fruit in optimizing complex math in LLVM for one simple reason: it's not widely used, so it's an accordingly low priority for most of our current contributors.  If this is something that interests you, we'd be very open to contributions.
>
>
> John.

I suppose I mean both of those optimisations, although I don’t know the actual breakdown of the performance hit of one vs the other vs just the fact of the function call. When one writes a critical inner loop that doesn’t contain any function calls, one should reasonably expect the compiler not to add them.

While there may be more low hanging fruit, I don’t want it to get in the way of fixing this. My main concern is that there not be noticeable regressions. This particular regression has the potential to result in certain calculations taking HOURS longer than expected, if I hadn’t been hacking my way around it already. I would greatly prefer to write simple maintainable code and let the compiler do the right thing on the hardware of today and tomorrow.


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
> On Nov 6, 2017, at 12:21 PM, Richard Campbell <[hidden email]> wrote:
>> On Nov 6, 2017, at 12:20 AM, John McCall <[hidden email]> wrote:
>>
>>> On Nov 6, 2017, at 2:47 AM, Richard Campbell <[hidden email]> wrote:
>>> The much bigger issue is not on division or two, but rather zero function calls or one. The function call overhead, and the resulting inability to make any other refactoring optimisations, far outweighs the choice of instructions used.
>>
>> By "refactoring optimizations", do you mean reordering and potentially CSE'ing the component arithmetic with operations outside of the division, or do you mean the compiler-barrier costs of emitting an opaque function call in the frontend instead of something that can be CSE'ed / reordered itself?  Because the latter is a problem that can be fixed for non-fast-math arithmetic as well.
>>
>> My general impression is that there is a lot of low-hanging fruit in optimizing complex math in LLVM for one simple reason: it's not widely used, so it's an accordingly low priority for most of our current contributors.  If this is something that interests you, we'd be very open to contributions.
>>
>>
>> John.
>
> I suppose I mean both of those optimisations, although I don’t know the actual breakdown of the performance hit of one vs the other vs just the fact of the function call. When one writes a critical inner loop that doesn’t contain any function calls, one should reasonably expect the compiler not to add them.

Complex divide is a large, complicated operation when full precision and infinity-correctness is required.  We appreciate that you have performance constraints, but implementing it with an outlined function is not an unreasonable choice.

> While there may be more low hanging fruit, I don’t want it to get in the way of fixing this. My main concern is that there not be noticeable regressions. This particular regression has the potential to result in certain calculations taking HOURS longer than expected, if I hadn’t been hacking my way around it already. I would greatly prefer to write simple maintainable code and let the compiler do the right thing on the hardware of today and tomorrow.

Richard, let me be clear about your options here.  If you're interested in working on this, that would be great, and I'd be happy to review your patches.  If you're not interested in working on this, then you should file a bug and hope that someone else has the motivation to pick it up.

John.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev

> On Nov 6, 2017, at 9:59 AM, John McCall <[hidden email]> wrote:
>
>> On Nov 6, 2017, at 12:21 PM, Richard Campbell <[hidden email]> wrote:
>>  When one writes a critical inner loop that doesn’t contain any function calls, one should reasonably expect the compiler not to add them.
>
> Complex divide is a large, complicated operation when full precision and infinity-correctness is required.  We appreciate that you have performance constraints, but implementing it with an outlined function is not an unreasonable choice.

Perhaps not, except that the compiler was previously doing the right thing without the associated very serious performance hit.

>
>> While there may be more low hanging fruit, I don’t want it to get in the way of fixing this. My main concern is that there not be noticeable regressions. This particular regression has the potential to result in certain calculations taking HOURS longer than expected, if I hadn’t been hacking my way around it already. I would greatly prefer to write simple maintainable code and let the compiler do the right thing on the hardware of today and tomorrow.
>
> Richard, let me be clear about your options here.  If you're interested in working on this, that would be great, and I'd be happy to review your patches.  If you're not interested in working on this, then you should file a bug and hope that someone else has the motivation to pick it up.

I’d be happy to submit a patch, but don’t know where to begin with that process. I expect the solution will look nearly identical to the solution to the same problem when it occurred with __mulsc3 in July 2015. That fix seems to have happened with no explanation on the mailing list, or I’d have a head start on fixing this myself.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
In reply to this post by David Chisnall via cfe-dev
[+Alex]

On 11/06/2017 11:59 AM, John McCall wrote:

>> On Nov 6, 2017, at 12:21 PM, Richard Campbell <[hidden email]> wrote:
>>> On Nov 6, 2017, at 12:20 AM, John McCall <[hidden email]> wrote:
>>>
>>>> On Nov 6, 2017, at 2:47 AM, Richard Campbell <[hidden email]> wrote:
>>>> The much bigger issue is not on division or two, but rather zero function calls or one. The function call overhead, and the resulting inability to make any other refactoring optimisations, far outweighs the choice of instructions used.
>>> By "refactoring optimizations", do you mean reordering and potentially CSE'ing the component arithmetic with operations outside of the division, or do you mean the compiler-barrier costs of emitting an opaque function call in the frontend instead of something that can be CSE'ed / reordered itself?  Because the latter is a problem that can be fixed for non-fast-math arithmetic as well.
>>>
>>> My general impression is that there is a lot of low-hanging fruit in optimizing complex math in LLVM for one simple reason: it's not widely used, so it's an accordingly low priority for most of our current contributors.  If this is something that interests you, we'd be very open to contributions.
>>>
>>>
>>> John.
>> I suppose I mean both of those optimisations, although I don’t know the actual breakdown of the performance hit of one vs the other vs just the fact of the function call. When one writes a critical inner loop that doesn’t contain any function calls, one should reasonably expect the compiler not to add them.
> Complex divide is a large, complicated operation when full precision and infinity-correctness is required.  We appreciate that you have performance constraints, but implementing it with an outlined function is not an unreasonable choice.
>
>> While there may be more low hanging fruit, I don’t want it to get in the way of fixing this. My main concern is that there not be noticeable regressions. This particular regression has the potential to result in certain calculations taking HOURS longer than expected, if I hadn’t been hacking my way around it already. I would greatly prefer to write simple maintainable code and let the compiler do the right thing on the hardware of today and tomorrow.
> Richard, let me be clear about your options here.  If you're interested in working on this, that would be great, and I'd be happy to review your patches.  If you're not interested in working on this, then you should file a bug and hope that someone else has the motivation to pick it up.

I'd like to add that Alex L. looked at this in some detail in 2013. For
some relevant notes, see PR17248 (and how divide is handled in
https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp).
There are indeed more- and less-numerically-stable ways to implement
complex division. For an extended discussion, I recommend looking at
https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions
that are reasonable to inline, especially in the fast-math context, and
I support doing so. Alex found that we had to use Smith's algorithm in
order to pass LAPACK's regression tests.

  -Hal

>
> John.

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
In reply to this post by David Chisnall via cfe-dev

> On Nov 6, 2017, at 1:11 PM, Richard Campbell <[hidden email]> wrote:
>
>
>> On Nov 6, 2017, at 9:59 AM, John McCall <[hidden email]> wrote:
>>
>>> On Nov 6, 2017, at 12:21 PM, Richard Campbell <[hidden email]> wrote:
>>> When one writes a critical inner loop that doesn’t contain any function calls, one should reasonably expect the compiler not to add them.
>>
>> Complex divide is a large, complicated operation when full precision and infinity-correctness is required.  We appreciate that you have performance constraints, but implementing it with an outlined function is not an unreasonable choice.
>
> Perhaps not, except that the compiler was previously doing the right thing without the associated very serious performance hit.

It was doing an acceptable thing for -ffast-math.  Chandler's patch was largely a correctness fix.

>>> While there may be more low hanging fruit, I don’t want it to get in the way of fixing this. My main concern is that there not be noticeable regressions. This particular regression has the potential to result in certain calculations taking HOURS longer than expected, if I hadn’t been hacking my way around it already. I would greatly prefer to write simple maintainable code and let the compiler do the right thing on the hardware of today and tomorrow.
>>
>> Richard, let me be clear about your options here.  If you're interested in working on this, that would be great, and I'd be happy to review your patches.  If you're not interested in working on this, then you should file a bug and hope that someone else has the motivation to pick it up.
>
> I’d be happy to submit a patch, but don’t know where to begin with that process. I expect the solution will look nearly identical to the solution to the same problem when it occurred with __mulsc3 in July 2015. That fix seems to have happened with no explanation on the mailing list, or I’d have a head start on fixing this myself.

I would look into the commit history for Clang's lib/CodeGen/CGExprComplex.cpp.

John.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
In reply to this post by David Chisnall via cfe-dev
(actually cc'ing Alex this time)

On 11/06/2017 12:18 PM, Hal Finkel via cfe-dev wrote:

> [+Alex]
>
> On 11/06/2017 11:59 AM, John McCall wrote:
>>> On Nov 6, 2017, at 12:21 PM, Richard Campbell <[hidden email]>
>>> wrote:
>>>> On Nov 6, 2017, at 12:20 AM, John McCall <[hidden email]> wrote:
>>>>
>>>>> On Nov 6, 2017, at 2:47 AM, Richard Campbell
>>>>> <[hidden email]> wrote:
>>>>> The much bigger issue is not on division or two, but rather zero
>>>>> function calls or one. The function call overhead, and the
>>>>> resulting inability to make any other refactoring optimisations,
>>>>> far outweighs the choice of instructions used.
>>>> By "refactoring optimizations", do you mean reordering and
>>>> potentially CSE'ing the component arithmetic with operations
>>>> outside of the division, or do you mean the compiler-barrier costs
>>>> of emitting an opaque function call in the frontend instead of
>>>> something that can be CSE'ed / reordered itself? Because the latter
>>>> is a problem that can be fixed for non-fast-math arithmetic as well.
>>>>
>>>> My general impression is that there is a lot of low-hanging fruit
>>>> in optimizing complex math in LLVM for one simple reason: it's not
>>>> widely used, so it's an accordingly low priority for most of our
>>>> current contributors.  If this is something that interests you,
>>>> we'd be very open to contributions.
>>>>
>>>>
>>>> John.
>>> I suppose I mean both of those optimisations, although I don’t know
>>> the actual breakdown of the performance hit of one vs the other vs
>>> just the fact of the function call. When one writes a critical inner
>>> loop that doesn’t contain any function calls, one should reasonably
>>> expect the compiler not to add them.
>> Complex divide is a large, complicated operation when full precision
>> and infinity-correctness is required.  We appreciate that you have
>> performance constraints, but implementing it with an outlined
>> function is not an unreasonable choice.
>>
>>> While there may be more low hanging fruit, I don’t want it to get in
>>> the way of fixing this. My main concern is that there not be
>>> noticeable regressions. This particular regression has the potential
>>> to result in certain calculations taking HOURS longer than expected,
>>> if I hadn’t been hacking my way around it already. I would greatly
>>> prefer to write simple maintainable code and let the compiler do the
>>> right thing on the hardware of today and tomorrow.
>> Richard, let me be clear about your options here.  If you're
>> interested in working on this, that would be great, and I'd be happy
>> to review your patches.  If you're not interested in working on this,
>> then you should file a bug and hope that someone else has the
>> motivation to pick it up.
>
> I'd like to add that Alex L. looked at this in some detail in 2013.
> For some relevant notes, see PR17248 (and how divide is handled in
> https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp).
> There are indeed more- and less-numerically-stable ways to implement
> complex division. For an extended discussion, I recommend looking at
> https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions
> that are reasonable to inline, especially in the fast-math context,
> and I support doing so. Alex found that we had to use Smith's
> algorithm in order to pass LAPACK's regression tests.
>
>  -Hal
>
>>
>> John.
>

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev

On 11/06/2017 12:29 PM, Hal Finkel via cfe-dev wrote:

> (actually cc'ing Alex this time)
>
> On 11/06/2017 12:18 PM, Hal Finkel via cfe-dev wrote:
>> [+Alex]
>>
>> On 11/06/2017 11:59 AM, John McCall wrote:
>>>> On Nov 6, 2017, at 12:21 PM, Richard Campbell
>>>> <[hidden email]> wrote:
>>>>> On Nov 6, 2017, at 12:20 AM, John McCall <[hidden email]> wrote:
>>>>>
>>>>>> On Nov 6, 2017, at 2:47 AM, Richard Campbell
>>>>>> <[hidden email]> wrote:
>>>>>> The much bigger issue is not on division or two, but rather zero
>>>>>> function calls or one. The function call overhead, and the
>>>>>> resulting inability to make any other refactoring optimisations,
>>>>>> far outweighs the choice of instructions used.
>>>>> By "refactoring optimizations", do you mean reordering and
>>>>> potentially CSE'ing the component arithmetic with operations
>>>>> outside of the division, or do you mean the compiler-barrier costs
>>>>> of emitting an opaque function call in the frontend instead of
>>>>> something that can be CSE'ed / reordered itself? Because the
>>>>> latter is a problem that can be fixed for non-fast-math arithmetic
>>>>> as well.
>>>>>
>>>>> My general impression is that there is a lot of low-hanging fruit
>>>>> in optimizing complex math in LLVM for one simple reason: it's not
>>>>> widely used, so it's an accordingly low priority for most of our
>>>>> current contributors.  If this is something that interests you,
>>>>> we'd be very open to contributions.
>>>>>
>>>>>
>>>>> John.
>>>> I suppose I mean both of those optimisations, although I don’t know
>>>> the actual breakdown of the performance hit of one vs the other vs
>>>> just the fact of the function call. When one writes a critical
>>>> inner loop that doesn’t contain any function calls, one should
>>>> reasonably expect the compiler not to add them.
>>> Complex divide is a large, complicated operation when full precision
>>> and infinity-correctness is required.  We appreciate that you have
>>> performance constraints, but implementing it with an outlined
>>> function is not an unreasonable choice.
>>>
>>>> While there may be more low hanging fruit, I don’t want it to get
>>>> in the way of fixing this. My main concern is that there not be
>>>> noticeable regressions. This particular regression has the
>>>> potential to result in certain calculations taking HOURS longer
>>>> than expected, if I hadn’t been hacking my way around it already. I
>>>> would greatly prefer to write simple maintainable code and let the
>>>> compiler do the right thing on the hardware of today and tomorrow.
>>> Richard, let me be clear about your options here.  If you're
>>> interested in working on this, that would be great, and I'd be happy
>>> to review your patches.  If you're not interested in working on
>>> this, then you should file a bug and hope that someone else has the
>>> motivation to pick it up.
>>
>> I'd like to add that Alex L. looked at this in some detail in 2013.
>> For some relevant notes, see PR17248 (and how divide is handled in
>> https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp).
>> There are indeed more- and less-numerically-stable ways to implement
>> complex division. For an extended discussion, I recommend looking at
>> https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions
>> that are reasonable to inline, especially in the fast-math context,
>> and I support doing so. Alex found that we had to use Smith's
>> algorithm in order to pass LAPACK's regression tests.

One more thing, we can use the cheaper (but less numerically-stable
formula) when we have #pragma STDC CX_LIMITED_RANGE ON.

  -Hal

>>
>>  -Hal
>>
>>>
>>> John.
>>
>

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev


> On Nov 9, 2017, at 4:05 PM, Hal Finkel <[hidden email]> wrote:
>
>
> On 11/06/2017 12:29 PM, Hal Finkel via cfe-dev wrote:
>> (actually cc'ing Alex this time)
>>
>> On 11/06/2017 12:18 PM, Hal Finkel via cfe-dev wrote:
>>> [+Alex]
>>>
>>> On 11/06/2017 11:59 AM, John McCall wrote:
>>> I'd like to add that Alex L. looked at this in some detail in 2013. For some relevant notes, see PR17248 (and how divide is handled in https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp). There are indeed more- and less-numerically-stable ways to implement complex division. For an extended discussion, I recommend looking at https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions that are reasonable to inline, especially in the fast-math context, and I support doing so. Alex found that we had to use Smith's algorithm in order to pass LAPACK's regression tests.
>
> One more thing, we can use the cheaper (but less numerically-stable formula) when we have #pragma STDC CX_LIMITED_RANGE ON.
>
> -Hal

Again, by far the biggest performance problem right now is that it’s making a function call, rather than the specifics of the arithmetic. One would hope that simply inlining the existing __divsc3() and allowing the compiler to eliminate the inf and nan branches (which it should do automatically with -ffinite-math-only or CX_LIMITED_RANGE) would result in performance not noticeably slower than the previous baseline.

Unfortunately, no one seems to be able to tell why this problem went away for __mulsc3 in July 2015 (a change in CGExprComplex.cpp does NOT seem to have been involved) after Chandler’s earlier mods in r219557 originally caused __mulsc3 to start issuing function calls. If we knew why __mulsc3 started being inlined again, we’d be able to apply the same fix to __divsc3.

Richard
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
Sorry about the delay, I saw the thread only today.

There's an existing bug report about this already btw: https://bugs.llvm.org/show_bug.cgi?id=31872. When I looked into it last time ICC and GCC had different fast division implementations. With -fp-model fast=1 ICC promotes floats to doubles and doubles to 80-bit long doubles to avoid loss of precision. With -fp-model fast=2 ICC uses the original type, but does one division for floats to get a reciprocal instead of doing two divisions, and two (in one instruction) divisions for complex double. GCC just uses two divisions every time without any type promotion. 

On 9 November 2017 at 16:19, Richard Campbell via cfe-dev <[hidden email]> wrote:


> On Nov 9, 2017, at 4:05 PM, Hal Finkel <[hidden email]> wrote:
>
>
> On 11/06/2017 12:29 PM, Hal Finkel via cfe-dev wrote:
>> (actually cc'ing Alex this time)
>>
>> On 11/06/2017 12:18 PM, Hal Finkel via cfe-dev wrote:
>>> [+Alex]
>>>
>>> On 11/06/2017 11:59 AM, John McCall wrote:
>>> I'd like to add that Alex L. looked at this in some detail in 2013. For some relevant notes, see PR17248 (and how divide is handled in https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp). There are indeed more- and less-numerically-stable ways to implement complex division. For an extended discussion, I recommend looking at https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions that are reasonable to inline, especially in the fast-math context, and I support doing so. Alex found that we had to use Smith's algorithm in order to pass LAPACK's regression tests.

Smith's might be too slow for -ffast-math though, especially if __divsc3 is doing a regular division.
 
>
> One more thing, we can use the cheaper (but less numerically-stable formula) when we have #pragma STDC CX_LIMITED_RANGE ON.
>
> -Hal

Again, by far the biggest performance problem right now is that it’s making a function call, rather than the specifics of the arithmetic. One would hope that simply inlining the existing __divsc3() and allowing the compiler to eliminate the inf and nan branches (which it should do automatically with -ffinite-math-only or CX_LIMITED_RANGE) would result in performance not noticeably slower than the previous baseline.

Unfortunately, no one seems to be able to tell why this problem went away for __mulsc3 in July 2015 (a change in CGExprComplex.cpp does NOT seem to have been involved) after Chandler’s earlier mods in r219557 originally caused __mulsc3 to start issuing function calls. If we knew why __mulsc3 started being inlined again, we’d be able to apply the same fix to __divsc3.

Richard
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Clang ignoring --fast-math for complex division, serious performance hit

David Chisnall via cfe-dev
In reply to this post by David Chisnall via cfe-dev

On 11/09/2017 06:19 PM, Richard Campbell wrote:

>
>> On Nov 9, 2017, at 4:05 PM, Hal Finkel <[hidden email]> wrote:
>>
>>
>> On 11/06/2017 12:29 PM, Hal Finkel via cfe-dev wrote:
>>> (actually cc'ing Alex this time)
>>>
>>> On 11/06/2017 12:18 PM, Hal Finkel via cfe-dev wrote:
>>>> [+Alex]
>>>>
>>>> On 11/06/2017 11:59 AM, John McCall wrote:
>>>> I'd like to add that Alex L. looked at this in some detail in 2013. For some relevant notes, see PR17248 (and how divide is handled in https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp). There are indeed more- and less-numerically-stable ways to implement complex division. For an extended discussion, I recommend looking at https://arxiv.org/pdf/1210.4539v2.pdf -- There are certainly versions that are reasonable to inline, especially in the fast-math context, and I support doing so. Alex found that we had to use Smith's algorithm in order to pass LAPACK's regression tests.
>> One more thing, we can use the cheaper (but less numerically-stable formula) when we have #pragma STDC CX_LIMITED_RANGE ON.
>>
>> -Hal
> Again, by far the biggest performance problem right now is that it’s making a function call, rather than the specifics of the arithmetic.

I think that we're all well aware of that.

>   One would hope that simply inlining the existing __divsc3() and allowing the compiler to eliminate the inf and nan branches (which it should do automatically with -ffinite-math-only or CX_LIMITED_RANGE) would result in performance not noticeably slower than the previous baseline.

To be clear. CX_LIMITED_RANGE and -ffinite-math-only are different in
this regard. C says, "The usual mathematical formulas for complex
multiply, divide, and absolute value are problematic because of their
treatment of infinities and because of undue overflow and underflow."
It's the numerical stability, the "undue overflow and underflow" part,
that makes them different.

>
> Unfortunately, no one seems to be able to tell why this problem went away for __mulsc3 in July 2015 (a change in CGExprComplex.cpp does NOT seem to have been involved) after Chandler’s earlier mods in r219557 originally caused __mulsc3 to start issuing function calls. If we knew why __mulsc3 started being inlined again, we’d be able to apply the same fix to __divsc3.

I vaguely recall when this happened, and I'm pretty sure that there was
indeed a change to CodeGen somewhere. I do plan to look at this.

  -Hal

>
> Richard

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev