Handling of FP denormal values


Handling of FP denormal values

Kristof Beyls via cfe-dev

Hi all,

 

While reviewing a recent clang documentation change, I became aware of an issue with the way that clang is handling FP denormals. There is currently some support for variations in the way denormals are handled, but it isn't consistent across architectures and generally feels kind of half-baked. I'd like to discuss possible solutions to this problem.

 

First, there is a clang command line option:

 

    -fdenormal-fp-math=<arg>

 

    Select which denormal numbers the code is permitted to require.

 

    Valid values are: ieee, preserve-sign, and positive-zero, which

    correspond to IEEE 754 denormal numbers, the sign of a flushed-to-zero

    number is preserved in the sign of 0, denormals are flushed to positive

    zero, respectively.

 

A quick survey of the code leads me to believe this has no effect for targets other than ARM. For X86 targets we may want different options. I'll say more about that below. The wording of the documentation is sufficiently ambiguous that I’m not entirely certain whether it is intended to control the target hardware or just the optimizer.

 

In addition, when either -Ofast or -ffast-math is used, we attempt to link 'crtfastmath.o' if it can be found. For X86 targets, this object file adds a static constructor that sets the DAZ and FTZ bits of the MXCSR register. I expect that it has analogous behavior for other architectures when it is available. This object file is typically available on Linux systems, and possibly also with things like MinGW. If it isn't found, the denormal control flags will be left in their default state.

 

There is also a CUDA-specific option, -f[no-]cuda-flush-denormals-to-zero. I don't know how this is implemented, but the documentation says it is specific to CUDA device mode.

 

Finally, there is an OpenCL-specific option, -cl-denorms-are-zero. Again, I don't know how it is implemented.

 

So.... I'd like to talk about how we can corral all of this into some interface that is consistent (or at least consistently sensible) across architectures.

 

The problems I see are:

 

1. -fdenormal-fp-math needs to handle all scenarios needed by all architectures (or needs to be limited to a common subset).

2. -fdenormal-fp-math needs to be reconciled with -ffast-math and its variants.

3. -fdenormal-fp-math needs to be consistent about whether or not it imposes hardware changes when applicable.

I can only really speak to X86, so I'll say a few words about that to start the discussion.

 

The current choices for -fdenormal-fp-math are: ieee, preserve-sign, and positive-zero. With X86, you get ieee behavior if neither DAZ nor FTZ is set. If FTZ is set, you get 'preserve sign' behavior -- i.e. denormal results are flushed to zero and the sign of the result is kept. There is no way to get 'positive zero' behavior with X86. At the hardware level, modern X86 processors have separate controls for FTZ (results are flushed to zero) and DAZ (inputs are flushed to zero before calculations), but I doubt that they are used independently often enough to distinguish them at the command line option level.

 

Also, any X87 instructions that happen to be generated (such as when the code contains 'long double' data on Linux) will ignore the FTZ and DAZ settings. There are some early Pentium 4 processors that don't support DAZ, but I hope we can safely ignore that fact.

 

Linking in crtfastmath.o when -Ofast or -ffast-math are used is consistent with GCC's behavior. However, it implicitly ignores -fdenormal-fp-math, which GCC doesn't have. In most cases if a user sets a fast math option they probably also want DAZ and FTZ, but there might be some reason why an advanced user would want to treat them separately. This can be done with intrinsics, of course, but if we have an option to control it, we should respect that option. Also, it is possible to construct fast math behavior cafeteria-style (i.e. setting some fast math flags and not others) so we should probably have a way to add ftz behaviors a la carte.

 

FWIW, ICC sets the FTZ and DAZ flags from a function call that is inserted into main depending on the options used to compile the file containing main.

 

Trying to go back to the general case, I'd like to solicit information about whether other targets have or need different denormal options than those described above. Further, I'd suggest that for any architecture that supports FTZ behavior, a well-documented default be automatically set when fast math is enabled via -Ofast, -ffast-math, or -funsafe-math-optimizations, unless that option is turned off by a subsequent -fno-fast-math/-fno-unsafe-math-optimizations option or overridden by a subsequent -fdenormal-fp-math option. If -fdenormal-fp-math is used, some code will be emitted to set the relevant hardware controls.

 

I don't have a strong opinion on whether it is better to emit a static constructor or to inject a call into main. The latter seems more predictable. I’d like to avoid a dependency on crtfastmath.o either way.

 

Do we need an ftz fast-math flag?

 

Are there any other facets to this problem that I've overlooked?

 

Thanks,

Andy

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Handling of FP denormal values

Kristof Beyls via cfe-dev
On Mon, Sep 16, 2019 at 7:58 PM Kaylor, Andrew via cfe-dev <[hidden email]> wrote:


I don't have a strong opinion on whether it is better to emit a static constructor or to inject a call into main. The latter seems more predictable. I’d like to avoid a dependency on crtfastmath.o either way.


I would like to see it called from .init_array (or equivalent) with the highest init_priority. That way, dynamic initializers get the benefit too. If we're requesting DAZ+FTZ on the command line, there's no need for a slow start-up.

Digressing a bit, but I don't like how some implementations of crtfastmath.o clear all the flags while setting the DAZ+FTZ flags (e.g. AArch64). Seems unnecessary and makes its position on the link line significant.
 

 

Do we need an ftz fast-math flag?

 

Are there any other facets to this problem that I've overlooked?

 

Thanks,

Andy

 



Re: [llvm-dev] Handling of FP denormal values

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev


On Sep 16, 2019, at 19:57, Kaylor, Andrew via llvm-dev <[hidden email]> wrote:

 
Do we need an ftz fast-math flag?

This would be useful for matching a handful of AMDGPU instructions (an fmad that always flushes being the most important). We have a dedicated intrinsic to allow flushing in this case when denormals are enabled.

 
Are there any other facets to this problem that I've overlooked?

For AMDGPU we need to split -denormal-fp-math into per-FP type flags (and the corresponding IR attribute). The denorm mode register has separate fields for f32, and f64+f16. The default for each of these is different depending on the subtarget/language combination. Mostly we want f64+f16 to always be on, and only change the f32 mode. The current naming implies changing all of the modes.

The different sign-of-zero modes that exist now aren't available. There are, however, separate flags for enabling flushing on input and output. This isn't particularly important, and currently we just set both bits at the same time, but it might be something to think about if this is being expanded.

-Matt


Re: [llvm-dev] Handling of FP denormal values

Kristof Beyls via cfe-dev
On Mon, Sep 16, 2019 at 9:43 PM Matt Arsenault via cfe-dev <[hidden email]> wrote:


On Sep 16, 2019, at 19:57, Kaylor, Andrew via llvm-dev <[hidden email]> wrote:

 
Do we need an ftz fast-math flag?

This would be useful for matching a handful of AMDGPU instructions (an fmad that always flushes being the most important). We have a dedicated intrinsic to allow flushing in this case when denormals are enabled.

+1 

For FTZ/DAZ, we're currently getting cases like this incorrect:

  %add = fadd nnan ninf nsz float %a, 0.000000e+00

That cannot be safely optimized to 'a' with FTZ/DAZ enabled. Admittedly, there's only a small chance of problems in practice, since a following FP operation would flush the denormal anyway, but here be dragons.

Are there any other facets to this problem that I've overlooked?

For AMDGPU we need to split -denormal-fp-math into per-FP type flags (and the corresponding IR attribute). The denorm mode register has separate fields for f32, and f64+f16. The default for each of these is different depending on the subtarget/language combination. Mostly we want f64+f16 to always be on, and only change the f32 mode. The current naming implies changing all of the modes.

The different sign-of-zero modes that exist now aren't available. There are, however, separate flags for enabling flushing on input and output. This isn't particularly important, and currently we just set both bits at the same time, but it might be something to think about if this is being expanded.

At the command-line level, I don't see a lot of value in separating the two flags. At the Function/Loop/Block/Instruction level, separating the two would be more useful though. E.g. normalizing input/output; or sacrificing accuracy to speed up a hot loop. 


Re: [llvm-dev] Handling of FP denormal values

Kristof Beyls via cfe-dev


On Tue, Sep 17, 2019 at 11:07 AM Cameron McInally <[hidden email]> wrote:

At the command-line level, I don't see a lot of value in separating the two flags. At the Function/Loop/Block/Instruction level, separating the two would be more useful though. E.g. normalizing input/output; or sacrificing accuracy to speed up a hot loop. 

EDIT: 'accuracy' should be 'precision'. 


Re: [llvm-dev] Handling of FP denormal values

Kristof Beyls via cfe-dev

>> At the command-line level, I don't see a lot of value in separating the two flags. At the Function/Loop/Block/Instruction level, separating the two would be more useful though. E.g. normalizing input/output; or sacrificing accuracy to speed up a hot loop.

> EDIT: 'accuracy' should be 'precision'.

 

How do you imagine that being specified in the local scope? Two ways that come to mind would be a pragma or an intrinsic. The pragma would probably be the cleanest, though more work for the front end. I suspect most architectures already have intrinsics to control this locally, but we could possibly add a target-independent intrinsic that would be better for the optimizer. But I think you want this to set or clear a flag on individual operations to help with instruction selection, right?

 

 


Re: [llvm-dev] Handling of FP denormal values

Kristof Beyls via cfe-dev
On Tue, Sep 17, 2019 at 12:27 PM Kaylor, Andrew <[hidden email]> wrote:
>
> >> At the command-line level, I don't see a lot of value in separating the two flags. At the Function/Loop/Block/Instruction level, separating the two would be more useful though. E.g. normalizing input/output; or sacrificing accuracy to speed up a hot loop.
>
> > EDIT: 'accuracy' should be 'precision'.
>
>
>
> How do you imagine that being specified in the local scope? Two ways that come to mind would be a pragma or an intrinsic. The pragma would probably be the cleanest, though more work for the front end. I suspect most architectures already have intrinsics to control this locally, but we could possibly add a target-independent intrinsic that would be better for the optimizer.

Good question. I haven't thought about it. I don't know if I have a
strong opinion either. It's pretty clear that something will be
needed, since tracking bits being flipped in the control register is
dubious.

It's probably a question for the CFE experts. Assuming that we add
FTZ/DAZ fast math flags, what would be the best way to attach a
FTZ/DAZ fast math flag to an individual IR instruction? Is that
currently done for other FMFs? Or are they just toggled by the higher
level -ffast-math and friends?

> But I think you want this to set or clear a flag on individual operations to help with instruction selection, right?

I think that would be useful. Well, at least I imagine it could be
useful. My personal experience is that users want all-or-nothing
regarding DAZ+FTZ, so a command-line switch would be sufficient.