_Float16 support

_Float16 support

David Blaikie via cfe-dev

I'd like to start a discussion about how clang supports _Float16 for target architectures that don't have direct support for 16-bit floating point arithmetic.

The current clang language extensions documentation says, "If half-precision instructions are unavailable, values will be promoted to single-precision, similar to the semantics of __fp16 except that the results will be stored in single-precision." This is somewhat vague (to me) as to what is meant by promotion of values, and the part about results being stored in single-precision isn't what actually happens.

Consider this example:

_Float16 x;
_Float16 f(_Float16 y, _Float16 z) {
  x = y * z;
  return x;
}

When compiling with “-march=core-avx2” that results (after some trivial cleanup) in this IR:

@x = global half 0xH0000, align 2
define half @f(half, half) {
  %3 = fmul half %0, %1
  store half %3, half* @x
  ret half %3
}

That’s not too unreasonable I suppose, except for the fact that it hasn’t taken the lack of target support for half-precision arithmetic into account yet. That will happen in the selection DAG. The assembly code generated looks like this (with my annotations):

f:                                      # @f
# %bb.0:
        vcvtps2ph       xmm1, xmm1, 4           # Convert argument 1 from single to half
        vcvtph2ps       xmm1, xmm1              # Convert argument 1 back to single
        vcvtps2ph       xmm0, xmm0, 4           # Convert argument 0 from single to half
        vcvtph2ps       xmm0, xmm0              # Convert argument 0 back to single
        vmulss          xmm0, xmm0, xmm1        # xmm0 = xmm0*xmm1 (single precision)
        vcvtps2ph       xmm1, xmm0, 4           # Convert the single precision result to half
        vmovd           eax, xmm1               # Move the half precision result to eax
        mov             word ptr [rip + x], ax  # Store the half precision result in the global, x
        ret                                     # Return the single precision result still in xmm0
.Lfunc_end0:
                                        # -- End function

Something odd has happened here, and it may not be obvious what it is. This code begins by converting xmm0 and xmm1 from single to half and then back to single. The first conversion is happening because the back end decided that it needed to change the types of the parameters to single precision but the function body is expecting half precision values. However, since the target can’t perform the required computation with half precision values they must be converted back to single for the multiplication. The single precision result of the multiplication is converted to half precision to be stored in the global value, x, but the result is returned as single precision (via xmm0).
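
The behaviour described above can be modelled in portable C without a _Float16-capable compiler. This is only a sketch under stated assumptions: round_to_half_precision, f_model and x_model are hypothetical names that do not appear in clang or LLVM, float is assumed to be IEEE binary32, and the rounding helper keeps 11 significant bits (ties to even) while ignoring the half-precision exponent range, so overflow and subnormals are not modelled and NaN is rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round a binary32 value to 11 significant bits, ties to even.
 * Assumption of this sketch: the binary16 exponent range is ignored. */
float round_to_half_precision(float v) {
    uint32_t bits;
    assert(v == v);                          /* NaN is outside this sketch */
    memcpy(&bits, &v, sizeof bits);
    bits += 0x0FFFu + ((bits >> 13) & 1u);   /* round to nearest, ties to even */
    bits &= ~0x1FFFu;                        /* drop the 13 low fraction bits */
    memcpy(&v, &bits, sizeof v);
    return v;
}

float x_model;                               /* stands in for the global x */

/* Mirrors the listing: arguments and return value travel as float. */
float f_model(float y, float z) {
    y = round_to_half_precision(y);          /* vcvtps2ph + vcvtph2ps on xmm0 */
    z = round_to_half_precision(z);          /* vcvtps2ph + vcvtph2ps on xmm1 */
    float product = y * z;                   /* vmulss */
    x_model = round_to_half_precision(product); /* vcvtps2ph, then the store */
    return product;                          /* ret: xmm0 was never rounded */
}
```

With y = z = 1.0009765625 (1 + 2^-10, exactly representable in half precision), x_model ends up as 1.001953125 while f_model returns 1.00195407867431640625, so the value the caller receives is not the value that was stored in x.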

I’m not primarily worried about the extra conversions here. We can’t get rid of them because we can’t prove they aren’t rounding, but that’s a secondary issue. What I’m worried about is that we allowed/required the back end to improvise an ABI to satisfy the incoming IR, and the choice it made is questionable.

For a point of comparison, I looked at what gcc does. Currently, gcc only allows _Float16 in C, not C++, and if you try to use it with a target that doesn’t have native support for half-precision arithmetic, it tells you “’_Float16’ is not supported on this target.” That seems preferable to making up an ABI on the fly.

I haven’t looked at what happens with clang when compiling for other targets that don’t have native support for half-precision arithmetic, but I would imagine that similar problems exist.

Thoughts?

Thanks,
Andy


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: _Float16 support

David Blaikie via cfe-dev
While looking at the codegen Andy showed, I notice that the initial SelectionDAG looks like this for x86-64.

  t0: ch = EntryToken
      t2: f32,ch = CopyFromReg t0, Register:f32 %0
    t6: f16 = fp_round t2, TargetConstant:i64<1>
      t4: f32,ch = CopyFromReg t0, Register:f32 %1
    t7: f16 = fp_round t4, TargetConstant:i64<1>
  t8: f16 = fmul t6, t7
  t10: i64 = Constant<0>
    t12: ch = store<(store 2 into @x)> t0, t8, GlobalAddress:i64<half* @x> 0, undef:i64
    t13: f32 = fp_extend t8
  t16: ch,glue = CopyToReg t12, Register:f32 $xmm0, t13
  t17: ch = X86ISD::RET_FLAG t16, TargetConstant:i32<0>, Register:f32 $xmm0, t16:1

The FP_ROUNDs for the arguments each have the flag set that indicates that the fp_round doesn't lose any information. This is the TargetConstant:i64<1> as the second operand.

As far as I can tell, any caller of this would have an FP_EXTEND from f16 to f32 in their initial selection dag for calling this function. When the FP_EXTENDs are type legalized by DAGTypeLegalizer::PromoteFloatOp_FP_EXTEND, the FP_EXTEND will be removed completely with no replacement operations. I believe this means there is no guarantee that the f32 value passed in doesn't contain precision beyond the range of f16. So the fp_round nodes saying no information is lost in the callee are not accurate.
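
To make the concern concrete, here is a small C model of the property that the flag on fp_round asserts. It is a sketch, not LLVM code: all function names are hypothetical, float is assumed to be IEEE binary32, the rounding helper only models the 11-bit significand of half precision (not its exponent range), and callee_trusting_flag assumes that some later transformation relies on the flag to skip the rounding, which the thread does not confirm happens for this function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Round a binary32 value to 11 significant bits, ties to even. */
float round_to_half_precision(float v) {
    uint32_t bits;
    assert(v == v);                          /* NaN is outside this sketch */
    memcpy(&bits, &v, sizeof bits);
    bits += 0x0FFFu + ((bits >> 13) & 1u);
    bits &= ~0x1FFFu;
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* True when rounding to half precision leaves v unchanged. This is what
 * the second operand of fp_round being 1 claims about the incoming value. */
bool rounding_is_lossless(float v) {
    return round_to_half_precision(v) == v;
}

/* Hypothetical callee that believes the flag and uses the floats as-is. */
float callee_trusting_flag(float y, float z) {
    return y * z;
}

/* Hypothetical callee that rounds its arguments regardless of the flag. */
float callee_rounding(float y, float z) {
    return round_to_half_precision(y) * round_to_half_precision(z);
}
```

If a caller passes 1.000244140625 (1 + 2^-12), rounding_is_lossless returns false and the two callee models disagree, which is the situation that arises when nothing on the caller side guarantees the f32 value fits in f16.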

~Craig


Re: [llvm-dev] _Float16 support

David Blaikie via cfe-dev
On Wed, 23 Jan 2019, 11:27 Craig Topper via llvm-dev, <[hidden email]> wrote:
> As far as I can tell, any caller of this would have an FP_EXTEND from f16 to f32 in their initial selection dag for calling this function. When the FP_EXTENDs are type legalized by DAGTypeLegalizer::PromoteFloatOp_FP_EXTEND, the FP_EXTEND will be removed completely with no replacement operations. I believe this means there is no guarantee that the f32 value passed in doesn't contain precision beyond the range of f16. So the fp_round nodes saying no information is lost in the callee are not accurate.

That seems wrong to me from an ABI perspective; I would expect the burden to be on the caller to only pass a valid "half" value to a "half" parameter. But this leads back to Andy's point: we're inventing an ABI rule here.

Re: [llvm-dev] _Float16 support

David Blaikie via cfe-dev
The issue isn't limited to calls either. If a half is live out of one basic block and used by another basic block, we'll emit an fp_round with 1 for the second operand in the receiving basic block, but the producing basic block won't have done anything to make that true.


~Craig


On Wed, Jan 23, 2019 at 11:53 AM Richard Smith <[hidden email]> wrote:
> That seems wrong to me from an ABI perspective; I would expect the burden to be on the caller to only pass a valid "half" value to a "half" parameter. But this leads back to Andy's point: we're inventing an ABI rule here.

Re: _Float16 support

David Blaikie via cfe-dev
Hey Andy,

On Tue, Jan 22, 2019 at 10:38 AM Kaylor, Andrew via cfe-dev
<[hidden email]> wrote:
> I'd like to start a discussion about how clang supports _Float16 for target architectures that don't have direct support for 16-bit floating point arithmetic.

Thanks for bringing this up;  we'd also like to get better support,
for sysv x86-64 specifically - AArch64 is mostly fine, and ARM is
usable with +fp16.

I'm not sure much of this discussion generalizes across platforms
though (beyond Craig's potential bug fix?).  I guess the
"target-independent" question is: should we allow this kind of
"legalization" in the vreg assignment code at all? (I think that's
where it all comes from: RegsForValue, TLI::get*Register*)
It's convenient for experimental frontends: you can use weird types
(half, i3, ...) without worrying too much about it, and you usually
get something self-consistent out of the backend.  But you eventually
need to worry about it and need to make the calling convention
explicit.  But I guess that's a discussion for the other thread ;)

> I’m not primarily worried about the extra conversions here. We can’t get rid of them because we can’t prove they aren’t rounding, but that’s a secondary issue. What I’m worried about is that we allowed/required the back end to improvise an ABI to satisfy the incoming IR, and the choice it made is questionable.

As Richard said, an ABI rule emerged from the implementation, and I
believe we should solidify it, so here's a simple strawman proposal:
pass scalars in the low 16 bits of SSE registers, don't change the
memory layout, and pack them in vectors of 16-bit elements.  That
matches the only ISA extension so far (ph<>ps conversions), and fits
well with that (as opposed to i16 coercion) as well as vectors (as
opposed to f32 promotion).  To my knowledge, there hasn't been any
alternative ABI proposal (but I haven't looked in 1 or 2 years).  It's
interesting because we technically have no way of accessing scalars
(so we have the same problems as i8/i16 vector elements, but without
the saving grace of having matching GPRs - x86, or direct copies -
aarch64), and there are not even any scalar operations.
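
As an illustration of the strawman, the following C sketch models an SSE register as eight 16-bit lanes, places a scalar half in lane 0, and decodes a lane back to float. Everything here is hypothetical naming for illustration (xmm_model, pass_scalar_half, lane_as_float); float is assumed to be IEEE binary32; and zeroing the upper lanes for a scalar is a choice made by this sketch, since the proposal above does not say what the upper bits hold.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit SSE register as eight 16-bit lanes. */
typedef struct { uint16_t lane[8]; } xmm_model;

/* Decode an IEEE binary16 bit pattern: 1 sign bit, 5 exponent bits
 * (bias 15), 10 fraction bits. */
float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t frac = h & 0x3FFu;
    uint32_t bits;
    float out;
    if (exp == 0) {
        /* Zero or subnormal: the value is frac * 2^-24, exact in float. */
        out = (float)frac * 0x1p-24f;
        return sign ? -out : out;
    }
    if (exp == 0x1Fu)
        bits = sign | 0x7F800000u | (frac << 13);          /* infinity or NaN */
    else
        bits = sign | ((exp + 112u) << 23) | (frac << 13); /* rebias 15 -> 127 */
    memcpy(&out, &bits, sizeof out);
    return out;
}

/* Scalar passing: the half occupies the low 16 bits (lane 0). */
xmm_model pass_scalar_half(uint16_t half_bits) {
    xmm_model r;
    memset(&r, 0, sizeof r);     /* upper lanes zeroed: sketch-only choice */
    r.lane[0] = half_bits;
    return r;
}

/* Vector access: each 16-bit lane holds one half element. */
float lane_as_float(xmm_model r, int i) {
    assert(i >= 0 && i < 8);
    return half_bits_to_float(r.lane[i]);
}
```

For example, passing the bit pattern 0x3C01 and reading lane 0 back yields 1.0009765625. The memory layout stays 2 bytes per element, so a vector of eight halves fills the 16-byte register exactly.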

Any thoughts?  We can suggest this to x86-psABI if folks think this is
a good idea. (I don't know about other ABIs or other architectures
though).

Concretely, this means no/little change in IRGen.  As for the SDAG
implementation, this is an unusual situation.  I've done some
experimentation a long time ago.  We can make the types legal, even
though no operations are.   It's relatively straightforward to promote
all operations (and we made sure that worked years ago for AArch64,
for the pre-v8.2 mode), but vectors are fun, because of build_vector
(where it helps to have the truncating behavior we have for integers,
but for fp), extract_vector_elt (where you need the matching extend),
and insert_vector_elt (which you have to lower using some movd and/or
pinsrw trickery, if you want to avoid the generic slow via-memory
fallback).
Alternatively, in the call lowering/register assignment logic (which
also covers the SDAG cross-BB vreg assignments Craig mentions), we can
immediately promote to f32 "via" i16.  I'm afraid I don't remember the
arguments one way or the other, but I can dust off my old patches and
put them up on Phabricator.


-Ahmed

Re: _Float16 support

David Blaikie via cfe-dev
It seems that there are several issues here:

1. Should the front end be concerned with whether or not the IR that it is emitting can be translated into a well-defined IR?
2. How should the selection DAG handle data types whose representation isn't defined by the ABI we're targeting?
3. What should the ABI do with half-precision floats?

Working backward...

The third question here is obviously target specific. I've talked to HJ Lu about this, and he's working on an update to the x86 psABI. I believe that his eventual proposal will follow along the lines of what you (Ahmed) suggested below, but I'm not completely proficient at comprehending ABI definitions so there may be some subtlety that I am misunderstanding in what he told me. I also talked to Craig about what would be involved in making the LLVM x86 backend handle 'half' values this way. That involves a good bit of work, but it can be done.

The second question above probably involves a mix of target-independent and target-specific code. Right now the selection DAG code is operating on the assumption that it needs to do *something* with any IR it is given. It tries to make a reasonable choice, and the choice is consistent and predictable but not necessarily what the user expects. It seems like we should at the very least be producing a diagnostic so the user knows what we did (or even just that we did something). Then there are the specific problems Craig has brought up with the way we're currently handling 'half' values. Would defining a legal f16 type take care of those problems?

The first question exposes my lack of understanding of the proper role of the front end. It isn't clear to me what responsibility the front end has for enforcing conformance to the ABI. As a user of the compiler, I would like the compiler to tell me when code I've written can't be represented using the ABI I am targeting. Whether the front end should detect this or the backend, I don't know. I suppose it's also an open question how strictly this should be enforced. Is it a warning that can be elevated to an error at the user's discretion? Is it something that should be blocked by default but enabled by a user-specified option? Should it always be rejected?

-Andy

-----Original Message-----
From: Ahmed Bougacha <[hidden email]>
Sent: Wednesday, January 23, 2019 3:30 PM
To: Kaylor, Andrew <[hidden email]>
Cc: [hidden email]; llvm-dev <[hidden email]>; Craig Topper <[hidden email]>; Richard Smith <[hidden email]>
Subject: Re: [cfe-dev] _Float16 support

Hey Andy,

On Tue, Jan 22, 2019 at 10:38 AM Kaylor, Andrew via cfe-dev <[hidden email]> wrote:
> I'd like to start a discussion about how clang supports _Float16 for target architectures that don't have direct support for 16-bit floating point arithmetic.

Thanks for bringing this up;  we'd also like to get better support, for sysv x86-64 specifically - AArch64 is mostly fine, and ARM is usable with +fp16.

I'm not sure much of this discussion generalizes across platforms though (beyond Craig's potential bug fix?).  I guess the "target-independent" question is: should we allow this kind of "legalization" in the vreg assignment code at all? (I think that's where it all comes from: RegsForValue, TLI::get*Register*) It's convenient for experimental frontends: you can use weird types (half, i3, ...) without worrying too much about it, and you usually get something self-consistent out of the backend.  But you eventually need to worry about it and need to make the calling convention explicit.  But I guess that's a discussion for the other thread ;)

> The current clang language extensions documentation says, "If half-precision instructions are unavailable, values will be promoted to single-precision, similar to the semantics of __fp16 except that the results will be stored in single-precision." This is somewhat vague (to me) as to what is meant by promotion of values, and the part about results being stored in single-precision isn't what actually happens.
>
> Consider this example:
>
> _Float16 x;
> _Float16 f(_Float16 y, _Float16 z) {
>   x = y * z;
>   return x;
> }
>
> When compiling with “-march=core-avx2” that results (after some trivial cleanup) in this IR:
>
> @x = global half 0xH0000, align 2
> define half @f(half, half) {
>   %3 = fmul half %0, %1
>   store half %3, half* @x
>   ret half %3
> }
>
> That’s not too unreasonable I suppose, except for the fact that it hasn’t taken the lack of target support for half-precision arithmetic into account yet. That will happen in the selection DAG. The assembly code generated looks like this (with my annotations):
>
> f:                                      # @f
> # %bb.0:
>         vcvtps2ph   xmm1, xmm1, 4       # Convert argument 1 from single to half
>         vcvtph2ps   xmm1, xmm1          # Convert argument 1 back to single
>         vcvtps2ph   xmm0, xmm0, 4       # Convert argument 0 from single to half
>         vcvtph2ps   xmm0, xmm0          # Convert argument 0 back to single
>         vmulss      xmm0, xmm0, xmm1    # xmm0 = xmm0*xmm1 (single precision)
>         vcvtps2ph   xmm1, xmm0, 4       # Convert the single precision result to half
>         vmovd       eax, xmm1           # Move the half precision result to eax
>         mov         word ptr [rip + x], ax  # Store the half precision result in the global, x
>         ret                             # Return the single precision result still in xmm0
> .Lfunc_end0:
>                                         # -- End function
>
> Something odd has happened here, and it may not be obvious what it is. This code begins by converting xmm0 and xmm1 from single to half and then back to single. The first conversion is happening because the back end decided that it needed to change the types of the parameters to single precision but the function body is expecting half precision values. However, since the target can’t perform the required computation with half precision values they must be converted back to single for the multiplication. The single precision result of the multiplication is converted to half precision to be stored in the global value, x, but the result is returned as single precision (via xmm0).
>
> I’m not primarily worried about the extra conversions here. We can’t get rid of them because we can’t prove they aren’t rounding, but that’s a secondary issue. What I’m worried about is that we allowed/required the back end to improvise an ABI to satisfy the incoming IR, and the choice it made is questionable.
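To make the "we can't prove they aren't rounding" point concrete, here is a minimal C sketch. It is not taken from the thread: round_through_half is a hypothetical helper that emulates a vcvtps2ph/vcvtph2ps pair in software, assuming round-to-nearest-even and finite inputs in the normal binary16 range. The two mul_* functions model a callee that re-rounds its f32 arguments and one that trusts them to already be valid half values.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not an LLVM API):
 * round a float to the nearest value representable in IEEE binary16,
 * ties to even, and return it as a float. Only meaningful for finite
 * inputs within binary16's normal range; overflow, subnormals and NaN
 * are deliberately not handled. */
float round_through_half(float value) {
    uint32_t bits;
    float rounded;

    memcpy(&bits, &value, sizeof bits);
    /* binary32 has 23 fraction bits, binary16 has 10: drop the low 13. */
    uint32_t keep_lsb = (bits >> 13) & 1u;  /* lowest bit that survives */
    bits += 0x0FFFu + keep_lsb;             /* round to nearest, ties to even */
    bits &= ~UINT32_C(0x1FFF);              /* clear the 13 dropped bits */
    memcpy(&rounded, &bits, sizeof rounded);
    return rounded;
}

/* Callee that re-rounds its f32 arguments, as the generated code does. */
float mul_rounding_args(float y, float z) {
    return round_through_half(round_through_half(y) * round_through_half(z));
}

/* Callee that trusts the caller to have passed valid half values. */
float mul_trusting_args(float y, float z) {
    return round_through_half(y * z);
}
```

With y = 1.0f + 1.0f/2048.0f - 1.0f/1048576.0f (a float that is not a valid half value) and z = 3.0f, the re-rounding callee returns 3.0 while the trusting callee returns 3.001953125. The two only agree when the caller has already rounded its arguments, which is exactly the guarantee an ABI would have to spell out.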

As Richard said, an ABI rule emerged from the implementation, and I believe we should solidify it, so here's a simple strawman proposal:
pass scalars in the low 16 bits of SSE registers, don't change the memory layout, and pack them in vectors of 16-bit elements.  That matches the only ISA extension so far (ph<>ps conversions), and fits well with that (as opposed to i16 coercion) as well as vectors (as opposed to f32 promotion).  To my knowledge, there hasn't been any alternative ABI proposal (but I haven't looked in 1 or 2 years).  It's interesting because we technically have no way of accessing scalars (so we have the same problems as i8/i16 vector elements, but without the saving grace of having matching GPRs - x86, or direct copies - aarch64), and there are not even any scalar operations.

Any thoughts?  We can suggest this to x86-psABI if folks think this is a good idea. (I don't know about other ABIs or other architectures though).
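For readers who want to picture the difference between the strawman and the current f32 promotion, here is a rough C model. Everything in it is an assumption made for illustration: xmm_model stands in for a 128-bit SSE register as eight 16-bit lanes (lane 0 being bits 0..15), and zeroing the upper lanes is a choice of this sketch, since the proposal does not say what the upper bits hold.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit SSE register as eight 16-bit lanes. */
typedef struct {
    uint16_t lane[8];
} xmm_model;

/* Strawman proposal: the binary16 bit pattern sits in the low 16 bits. */
xmm_model pass_as_half(uint16_t half_bits) {
    xmm_model reg;
    memset(&reg, 0, sizeof reg);  /* upper bits zeroed here by assumption */
    reg.lane[0] = half_bits;
    return reg;
}

/* Current behaviour described in the thread: the value is promoted to
 * f32, so the low 32 bits hold a binary32 bit pattern instead. */
xmm_model pass_as_promoted_float(float value) {
    uint32_t bits;
    xmm_model reg;
    memcpy(&bits, &value, sizeof bits);
    memset(&reg, 0, sizeof reg);
    reg.lane[0] = (uint16_t)(bits & 0xFFFFu);
    reg.lane[1] = (uint16_t)(bits >> 16);
    return reg;
}

/* Read one 16-bit lane back out of the modelled register. */
uint16_t read_lane(xmm_model reg, int index) {
    return reg.lane[index];
}
```

For the value 1.0, pass_as_half(0x3C00) leaves 0x3C00 in lane 0, whereas pass_as_promoted_float(1.0f) leaves 0x0000 in lane 0 and 0x3F80 in lane 1. A caller and callee that disagree on which convention is in use would therefore read entirely different values from the same register.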

Concretely, this means no/little change in IRGen.  As for the SDAG implementation, this is an unusual situation.  I did some experimentation a long time ago.  We can make the types legal, even though no operations are.  It's relatively straightforward to promote all operations (and we made sure that worked years ago for AArch64, for the pre-v8.2 mode), but vectors are fun, because of build_vector (where it helps to have the truncating behavior we have for integers, but for fp), extract_vector_elt (where you need the matching extend), and insert_vector_elt (which you have to lower using some movd and/or pinsrw trickery, if you want to avoid the generic slow via-memory fallback).
Alternatively, we can immediately, in call lowering/register assignment logic (this covers the SDAG cross-BB vreg assignments Craig mentions), promote to f32 "via" i16.  I'm afraid I don't remember the arguments one way or the other, but I can dust off my old patches and put them up on phabricator.


-Ahmed

>
> For a point of comparison, I looked at what gcc does. Currently, gcc only allows _Float16 in C, not C++, and if you try to use it with a target that doesn’t have native support for half-precision arithmetic, it tells you “’_Float16’ is not supported on this target.” That seems preferable to making up an ABI on the fly.
>
> I haven’t looked at what happens with clang when compiling for other targets that don’t have native support for half-precision arithmetic, but I would imagine that similar problems exist.
>
> Thoughts?
>
> Thanks,
> Andy
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [llvm-dev] _Float16 support

David Blaikie via cfe-dev
In reply to this post by David Blaikie via cfe-dev
On 23 Jan 2019, at 14:52, Richard Smith via cfe-dev wrote:

> On Wed, 23 Jan 2019, 11:27 Craig Topper via llvm-dev, <[hidden email]> wrote:
>
>> While looking at the codegen Andy showed, I notice that the initial
>> SelectionDAG looks like this for x86-64.
>>
>>   t0: ch = EntryToken
>>       t2: f32,ch = CopyFromReg t0, Register:f32 %0
>>     t6: f16 = fp_round t2, TargetConstant:i64<1>
>>       t4: f32,ch = CopyFromReg t0, Register:f32 %1
>>     t7: f16 = fp_round t4, TargetConstant:i64<1>
>>   t8: f16 = fmul t6, t7
>>   t10: i64 = Constant<0>
>>     t12: ch = store<(store 2 into @x)> t0, t8, GlobalAddress:i64<half* @x> 0, undef:i64
>>     t13: f32 = fp_extend t8
>>   t16: ch,glue = CopyToReg t12, Register:f32 $xmm0, t13
>>   t17: ch = X86ISD::RET_FLAG t16, TargetConstant:i32<0>, Register:f32 $xmm0, t16:1
>>
>> The FP_ROUNDs for the arguments each have the flag set that indicates
>> that the fp_round doesn't lose any information. This is the
>> TargetConstant:i64<1> as the second operand.
>>
>> As far as I can tell, any caller of this would have an FP_EXTEND from
>> f16 to f32 in their initial selection dag for calling this function.
>> When the FP_EXTENDs are type legalized by
>> DAGTypeLegalizer::PromoteFloatOp_FP_EXTEND, the FP_EXTEND will be
>> removed completely with no replacement operations. I believe this
>> means there is no guarantee that the f32 value passed in doesn't
>> contain precision beyond the range of f16. So the fp_round nodes
>> saying no information is lost in the callee are not accurate.
>>
>
> That seems wrong to me from an ABI perspective; I would expect the
> burden to be on the caller to only pass a valid "half" value to a
> "half" parameter. But this leads back to Andy's point: we're inventing
> an ABI rule here.

Right.  IR and SelectionDAG representational choices aside, it seems to
me that, like GCC, Clang should not be permitting _Float16 on any target
that doesn't specify an ABI for it, because otherwise we're just creating
future compatibility problems for that target.  I'm surprised and
disappointed that it wasn't implemented this way.

Unlike GCC, of course, we would implement it in all language modes on
the target, since there's zero reason to make it C-specific.

As for those internal representational choices: I'll leave SelectionDAG
up to the backend engineers, but I think that in IR, a half argument
should clearly correspond to a direct representation whenever hardware
support for half exists.  If the ABI calls for the type to be promoted
and passed as a float, that should be done in the frontend, just as is
done for small integer types.  It would then make sense to have an
attribute (fpext?) for optimization purposes that says that a parameter
is guaranteed to be a promotion of a smaller type; Clang could use this
whenever it's allowed by the psABI.

John.
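The small-integer comparison can be sketched in C. This is only an analogy under stated assumptions, not code from Clang or LLVM: a 16-bit value travels in a 32-bit "slot", the caller performs the sign extension (the role the frontend plays when it emits a signext parameter), and the callee may then use the slot without narrowing it again. All function names are hypothetical.

```c
#include <stdint.h>

/* Callee that relies on the contract: the slot is assumed to hold a
 * sign-extended 16-bit value, so it is used as-is. */
int32_t callee_trusting_slot(int32_t slot) {
    return slot * 2;
}

/* Defensive callee: re-narrows the slot to 16 bits before use. This is
 * the integer counterpart of the extra vcvtps2ph/vcvtph2ps pair. */
int32_t callee_renarrowing_slot(int32_t slot) {
    int32_t low = slot & 0xFFFF;   /* keep only the low 16 bits */
    if (low & 0x8000) {
        low -= 0x10000;            /* manual sign extension */
    }
    return low * 2;
}

/* Frontend-style call site: the promotion is done by the caller. */
int32_t call_with_promotion(int16_t value) {
    return callee_trusting_slot((int32_t)value);
}
```

When the caller honours the contract the two callees agree, for example call_with_promotion(-3) yields -6. If something hands the trusting callee a slot that is not a promoted 16-bit value, such as 0x12345, it returns 149130 where the re-narrowing callee returns 18058. A hypothetical fpext-style attribute would record the same kind of guarantee for a half value carried in a float.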

Re: _Float16 support

David Blaikie via cfe-dev
In reply to this post by David Blaikie via cfe-dev

Hello, 

I added _Float16 support to Clang and codegen support in the AArch64 and ARM backends, but have not looked into x86. Ahmed is right: AArch64 is fine, only a few ACLE intrinsics are missing. ARM has rough edges: scalar codegen should be mostly fine, vector codegen needs some more work.

Implementation for AArch64 was mostly straightforward (it only has hard float ABI, and has half register/type support), but for ARM it was a huge pain to plumb f16 support because of different ABIs (hard/soft), different architecture extensions of FP and FP16 support, and the existence of another half-precision type with different semantics. Sounds like you're doing a similar exercise, and yes, argument passing was one of the trickiest parts.


> IR and SelectionDAG representational choices aside, it seems to me that, like GCC, Clang should not be permitting _Float16 on any target that doesn't specify an ABI for it, because otherwise we're just creating future compatibility problems for that target.  I'm surprised and disappointed that it wasn't implemented this way.


Apologies, I missed that.

Sjoerd.



Re: _Float16 support

David Blaikie via cfe-dev


On 24 Jan 2019, at 4:46, Sjoerd Meijer wrote:

> Hello,
>
> I added _Float16 support to Clang and codegen support in the AArch64
> and ARM backends, but have not looked into x86. Ahmed is right:
> AArch64 is fine, only a few ACLE intrinsics are missing. ARM has rough
> edges: scalar codegen should be mostly fine, vector codegen needs some
> more work.
>
> Implementation for AArch64 was mostly straightforward (it only has
> hard float ABI, and has half register/type support), but for ARM it
> was a huge pain to plumb f16 support because of different ABIs
> (hard/soft), different architecture extensions of FP and FP16 support,
> and the existence of another half-precision type with different
> semantics. Sounds like you're doing a similar exercise, and yes,
> argument passing was one of the trickiest parts.
>
>
>> IR and SelectionDAG representational choices aside, it seems to me
>> that, like GCC, Clang should not be permitting _Float16 on any target
>> that doesn't specify an ABI for it, because otherwise we're just
>> creating future compatibility problems for that target.  I'm surprised
>> and disappointed that it wasn't implemented this way.
>
> Apologies, I missed that.

It's alright, oversights happen (in both patch-writing and review).  Can
we get a volunteer to do the work to restrict this now?  I'm a little
crushed.

John.

>
> Sjoerd.
>
> ________________________________
> From: llvm-dev <[hidden email]> on behalf of Kaylor,
> Andrew via llvm-dev <[hidden email]>
> Sent: 24 January 2019 00:23
> To: Ahmed Bougacha; Lu, Hongjiu
> Cc: llvm-dev; [hidden email]
> Subject: Re: [llvm-dev] [cfe-dev] _Float16 support
>
> It seems that there are several issues here:
>
> 1. Should the front end be concerned with whether or not the IR that
> it is emitting can be translated into a well-defined IR?
> 2. How should the selection DAG handle data types whose representation
> isn't defined by the ABI we're targeting?
> 3. What should the ABI do with half-precision floats?
>
> Working backward...
>
> The third question here is obviously target specific. I've talked to
> HJ Lu about this, and he's working on an update to the x86 psABI. I
> believe that his eventual proposal will follow the lines of what you
> (Ahmed) suggested below, but I'm not completely proficient at
> comprehending ABI definitions so there may be some subtlety that I am
> misunderstanding in what he told me. I also talked to Craig about
> would be involved in making the LLVM x86 backend handle 'half' values
> this way. That involves a good bit of work, but it can be done.
>
> The second question above probably involves a mix of
> target-independent and target-specific code. Right now the selection
> DAG code is operating on the assumption that it needs to do
> *something* with any IR it is given. It tries to make a reasonable
> choice, and the choice is consistent and predictable but not
> necessarily what the user expects. It seems like we should at the very
> least be producing a diagnostic so the user knows what we did (or even
> just that we did something). Then there are the specific problems
> Craig has brought up with the way we're currently handling 'half'
> values. Would defining a legal f16 type take care of those problems?
>
> The first question exposes my lack of understanding of the proper role
> of the front end. It isn't clear to me what responsibility the front
> end has for enforcing conformance to the ABI. As a user of the
> compiler, I would like the compiler to tell me when code I've written
> can't be represented using the ABI I am targeting. Whether the front
> end should detect this or the backend, I don't know. I suppose it's
> also an open question how strictly this should be enforced. Is it a
> warning that can be elevated to an error at the users' discretion? Is
> it something that should be blocked by default but enabled by a
> user-specified option? Should it always be rejected?
>
> -Andy
>
> -----Original Message-----
> From: Ahmed Bougacha <[hidden email]>
> Sent: Wednesday, January 23, 2019 3:30 PM
> To: Kaylor, Andrew <[hidden email]>
> Cc: [hidden email]; llvm-dev <[hidden email]>; Craig
> Topper <[hidden email]>; Richard Smith <[hidden email]>
> Subject: Re: [cfe-dev] _Float16 support
>
> Hey Andy,
>
> On Tue, Jan 22, 2019 at 10:38 AM Kaylor, Andrew via cfe-dev
> <[hidden email]> wrote:
>> I'd like to start a discussion about how clang supports _Float16 for
>> target architectures that don't have direct support for 16-bit
>> floating point arithmetic.
>
> Thanks for bringing this up;  we'd also like to get better support,
> for sysv x86-64 specifically - AArch64 is mostly fine, and ARM is
> usable with +fp16.
>
> I'm not sure much of this discussion generalizes across platforms
> though (beyond Craig's potential bug fix?).  I guess the
> "target-independent" question is: should we allow this kind of
> "legalization" in the vreg assignment code at all? (I think that's
> where it all comes from: RegsForValue, TLI::get*Register*) It's
> convenient for experimental frontends: you can use weird types (half,
> i3, ...) without worrying too much about it, and you usually get
> something self-consistent out of the backend.  But you eventually need
> to worry about it and need to make the calling convention explicit.  
> But I guess that's a discussion for the other thread ;)
>
>> The current clang language extensions documentation says, "If
>> half-precision instructions are unavailable, values will be promoted
>> to single-precision, similar to the semantics of __fp16 except that
>> the results will be stored in single-precision." This is somewhat
>> vague (to me) as to what is meant by promotion of values, and the
>> part about results being stored in single-precision isn't what
>> actually happens.
>>
>> Consider this example:
>>
>> _Float16 x;
>> _Float16 f(_Float16 y, _Float16 z) {
>>   x = y * z;
>>   return x;
>> }
>>
>> When compiling with “-march=core-avx2” that results (after some
>> trivial cleanup) in this IR:
>>
>> @x = global half 0xH0000, align 2
>> define half @f(half, half) {
>>   %3 = fmul half %0, %1
>>   store half %3, half* @x
>>   ret half %3
>> }
>>
>> That’s not too unreasonable I suppose, except for the fact that it
>> hasn’t taken the lack of target support for half-precision
>> arithmetic into account yet. That will happen in the selection DAG.
>> The assembly code generated looks like this (with my annotations):
>>
>> f:                                      # @f
>> # %bb.0:
>>        vcvtps2ph       xmm1, xmm1, 4             # Convert argument 1
>> from single to half
>>         vcvtph2ps       xmm1, xmm1                # Convert argument
>> 1 back to single
>>         vcvtps2ph       xmm0, xmm0, 4            # Convert argument 0
>> from single to half
>>         vcvtph2ps       xmm0, xmm0                # Convert argument
>> 0 back to single
>>         vmulss             xmm0, xmm0, xmm1   # xmm0 = xmm0*xmm1
>> (single precision)
>>         vcvtps2ph       xmm1, xmm0, 4            # Convert the single
>> precision result to half
>>         vmovd             eax, xmm1                      # Move the
>> half precision result to eax
>>         mov                 word ptr [rip + x], ax     # Store the
>> half precision result in the global, x
>>         ret                                                          
>>   # Return the single precision result still in xmm0
>> .Lfunc_end0:
>>                                         # -- End function
>>
>> Something odd has happened here, and it may not be obvious what it
>> is. This code begins by converting xmm0 and xmm1 from single to half
>> and then back to single. The first conversion is happening because
>> the back end decided that it needed to change the types of the
>> parameters to single precision but the function body is expecting
>> half precision values. However, since the target can’t perform the
>> required computation with half precision values they must be
>> converted back to single for the multiplication. The single precision
>> result of the multiplication is converted to half precision to be
>> stored in the global value, x, but the result is returned as single
>> precision (via xmm0).
>>
>> I’m not primarily worried about the extra conversions here. We
>> can’t get rid of them because we can’t prove they aren’t
>> rounding, but that’s a secondary issue. What I’m worried about is
>> that we allowed/required the back end to improvise an ABI to satisfy
>> the incoming IR, and the choice it made is questionable.
>
> As Richard said, an ABI rule emerged from the implementation, and I
> believe we should solidify it, so here's a simple strawman proposal:
> pass scalars in the low 16 bits of SSE registers, don't change the
> memory layout, and pack them in vectors of 16-bit elements.  That
> matches the only ISA extension so far (ph<>ps conversions), and fits
> well with that (as opposed to i16 coercion) as well as vectors (as
> opposed to f32 promotion).  To my knowledge, there hasn't been any
> alternative ABI proposal (but I haven't looked in 1 or 2 years).  It's
> interesting because we technically have no way of accessing scalars
> (so we have the same problems as i8/i16 vector elements, but without
> the saving grace of having matching GPRs - x86, or direct copies -
> aarch64), and there are not even any scalar operations.
>
> Any thoughts?  We can suggest this to x86-psABI if folks think this is
> a good idea. (I don't know about other ABIs or other architectures
> though).
>
> Concretely, this means no/little change in IRGen.  As for the SDAG
> implementation, this is an unusual situation.  I've done some
> experimentation a long time ago.  We can make the types legal, even
> though no operations are.   It's relatively straightforward to promote
> all operations (and we made sure that worked years ago for AArch64,
> for the pre-v8.2 mode), but vectors are fun, because of build_vector
> (where it helps to have the truncating behavior we have for integers,
> but for fp), extract_vector_elt (where you need the matching extend),
> and insert_vector_elt (which you have to lower using some movd and/or
> pinsrw trickery, if you want to avoid the generic slow via-memory
> fallback).
> Alternatively, we can immediately, in call lowering/register
> assignment logic (this covers the SDAG cross-BB vreg assignments Craig
> mentions) promote to f32 "via" i16.  I'm afraid I don't remember the
> arguments one way or the other, I can dust off my old patches and put
> them up on phabricator.
>
>
> -Ahmed
>
>>
>> For a point of comparison, I looked at what gcc does. Currently, gcc
>> only allows _Float16 in C, not C++, and if you try to use it with a
>> target that doesn’t have native support for half-precision
>> arithmetic, it tells you “’_Float16’ is not supported on this
>> target.” That seems preferable to making up an ABI on the fly.
>>
>> I haven’t looked at what happens with clang when compiling for
>> other targets that don’t have native support for half-precision
>> arithmetic, but I would imagine that similar problems exist.
>>
>> Thoughts?
>>
>> Thanks,
>> Andy

Re: [llvm-dev] _Float16 support

David Blaikie via cfe-dev
Since Andy is my architect, I'm probably responsible (aka, being Volun-told :)) for this. Do we have a good idea which targets should currently have support?  Is it just AArch64 and ARM?

-Erich

-----Original Message-----
From: llvm-dev [mailto:[hidden email]] On Behalf Of John McCall via llvm-dev
Sent: Thursday, January 24, 2019 10:58 AM
To: Sjoerd Meijer <[hidden email]>
Cc: [hidden email]; Lu, Hongjiu <[hidden email]>; [hidden email]
Subject: Re: [llvm-dev] [cfe-dev] _Float16 support



On 24 Jan 2019, at 4:46, Sjoerd Meijer wrote:

> Hello,
>
> I added _Float16 support to Clang and codegen support in the AArch64
> and ARM backends, but have not looked into x86. Ahmed is right:
> AArch64 is fine, only a few ACLE intrinsics are missing. ARM has rough
> edges: scalar codegen should be mostly fine, vector codegen needs some
> more work.
>
> Implementation for AArch64 was mostly straightforward (it only has
> hard float ABI, and has half register/type support), but for ARM it
> was a huge pain to plumb f16 support because of different ABIs
> (hard/soft), different architecture extensions of FP and FP16 support,
> and the existence of another half-precision type with different
> semantics. Sounds like you're doing a similar exercise, and yes,
> argument passing was one of the trickiest parts.
>
>
>> IR and SelectionDAG representational choices aside, it seems to me
>> that, like GCC, Clang should not be permitting _Float16 on any target
>> that doesn't specify an ABI for it, because otherwise we're just
>> creating future compatibility problems for that target.  I'm surprised
>> and disappointed that it wasn't implemented this way.
>
> Apologies, I missed that.

It's alright, oversights happen (in both patch-writing and review).  Can we get a volunteer to do the work to restrict this now?  I'm a little crushed.

John.

>
> Sjoerd.
>
> ________________________________
> From: llvm-dev <[hidden email]> on behalf of Kaylor,
> Andrew via llvm-dev <[hidden email]>
> Sent: 24 January 2019 00:23
> To: Ahmed Bougacha; Lu, Hongjiu
> Cc: llvm-dev; [hidden email]
> Subject: Re: [llvm-dev] [cfe-dev] _Float16 support
>
> It seems that there are several issues here:
>
> 1. Should the front end be concerned with whether or not the IR that
> it is emitting can be translated into a well-defined IR?
> 2. How should the selection DAG handle data types whose representation
> isn't defined by the ABI we're targeting?
> 3. What should the ABI do with half-precision floats?
>
> Working backward...
>
> The third question here is obviously target specific. I've talked to
> HJ Lu about this, and he's working on an update to the x86 psABI. I
> believe that his eventual proposal will follow the lines of what you
> (Ahmed) suggested below, but I'm not completely proficient at
> comprehending ABI definitions so there may be some subtlety that I am
> misunderstanding in what he told me. I also talked to Craig about
> what would be involved in making the LLVM x86 backend handle 'half'
> values this way. That involves a good bit of work, but it can be done.
>
> The second question above probably involves a mix of
> target-independent and target-specific code. Right now the selection
> DAG code is operating on the assumption that it needs to do
> *something* with any IR it is given. It tries to make a reasonable
> choice, and the choice is consistent and predictable but not
> necessarily what the user expects. It seems like we should at the very
> least be producing a diagnostic so the user knows what we did (or even
> just that we did something). Then there are the specific problems
> Craig has brought up with the way we're currently handling 'half'
> values. Would defining a legal f16 type take care of those problems?
>
> The first question exposes my lack of understanding of the proper role
> of the front end. It isn't clear to me what responsibility the front
> end has for enforcing conformance to the ABI. As a user of the
> compiler, I would like the compiler to tell me when code I've written
> can't be represented using the ABI I am targeting. Whether the front
> end should detect this or the backend, I don't know. I suppose it's
> also an open question how strictly this should be enforced. Is it a
> warning that can be elevated to an error at the users' discretion? Is
> it something that should be blocked by default but enabled by a
> user-specified option? Should it always be rejected?
>
> -Andy
>
> -----Original Message-----
> From: Ahmed Bougacha <[hidden email]>
> Sent: Wednesday, January 23, 2019 3:30 PM
> To: Kaylor, Andrew <[hidden email]>
> Cc: [hidden email]; llvm-dev <[hidden email]>; Craig
> Topper <[hidden email]>; Richard Smith <[hidden email]>
> Subject: Re: [cfe-dev] _Float16 support
>
> Hey Andy,
>
> On Tue, Jan 22, 2019 at 10:38 AM Kaylor, Andrew via cfe-dev
> <[hidden email]> wrote:
>> I'd like to start a discussion about how clang supports _Float16 for
>> target architectures that don't have direct support for 16-bit
>> floating point arithmetic.
>
> Thanks for bringing this up;  we'd also like to get better support,
> for sysv x86-64 specifically - AArch64 is mostly fine, and ARM is
> usable with +fp16.
>
> I'm not sure much of this discussion generalizes across platforms
> though (beyond Craig's potential bug fix?).  I guess the
> "target-independent" question is: should we allow this kind of
> "legalization" in the vreg assignment code at all? (I think that's
> where it all comes from: RegsForValue, TLI::get*Register*) It's
> convenient for experimental frontends: you can use weird types (half,
> i3, ...) without worrying too much about it, and you usually get
> something self-consistent out of the backend.  But you eventually need
> to worry about it and need to make the calling convention explicit.
> But I guess that's a discussion for the other thread ;)
>
>> The current clang language extensions documentation says, "If
>> half-precision instructions are unavailable, values will be promoted
>> to single-precision, similar to the semantics of __fp16 except that
>> the results will be stored in single-precision." This is somewhat
>> vague (to me) as to what is meant by promotion of values, and the
>> part about results being stored in single-precision isn't what
>> actually happens.
>>
>> Consider this example:
>>
>> _Float16 x;
>> _Float16 f(_Float16 y, _Float16 z) {
>>   x = y * z;
>>   return x;
>> }
>>
>> When compiling with “-march=core-avx2” that results (after some
>> trivial cleanup) in this IR:
>>
>> @x = global half 0xH0000, align 2
>> define half @f(half, half) {
>>   %3 = fmul half %0, %1
>>   store half %3, half* @x
>>   ret half %3
>> }
>>
>> That’s not too unreasonable I suppose, except for the fact that it
>> hasn’t taken the lack of target support for half-precision arithmetic
>> into account yet. That will happen in the selection DAG.
>> The assembly code generated looks like this (with my annotations):
>>
>> f:                                      # @f
>> # %bb.0:
>>         vcvtps2ph       xmm1, xmm1, 4        # Convert argument 1 from single to half
>>         vcvtph2ps       xmm1, xmm1           # Convert argument 1 back to single
>>         vcvtps2ph       xmm0, xmm0, 4        # Convert argument 0 from single to half
>>         vcvtph2ps       xmm0, xmm0           # Convert argument 0 back to single
>>         vmulss          xmm0, xmm0, xmm1     # xmm0 = xmm0*xmm1 (single precision)
>>         vcvtps2ph       xmm1, xmm0, 4        # Convert the single precision result to half
>>         vmovd           eax, xmm1            # Move the half precision result to eax
>>         mov             word ptr [rip + x], ax   # Store the half precision result in the global, x
>>         ret                                  # Return the single precision result still in xmm0
>> .Lfunc_end0:
>>                                         # -- End function
>>
>> Something odd has happened here, and it may not be obvious what it
>> is. This code begins by converting xmm0 and xmm1 from single to half
>> and then back to single. The first conversion is happening because
>> the back end decided that it needed to change the types of the
>> parameters to single precision but the function body is expecting
>> half precision values. However, since the target can’t perform the
>> required computation with half precision values they must be
>> converted back to single for the multiplication. The single precision
>> result of the multiplication is converted to half precision to be
>> stored in the global value, x, but the result is returned as single
>> precision (via xmm0).
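
[Editor's sketch, not part of the original message.] The behaviour described above can be modelled in portable C to make the mismatch visible. This is only a software model under stated assumptions: the helper names (float_to_half_bits, half_bits_to_float, round_to_half, f_as_compiled, x_storage) are made up for illustration, and the conversions assume the default round-to-nearest-even mode. It mirrors the annotated assembly: arguments arrive as single, are rounded through half, multiplied in single, stored as half, and returned as the unrounded single.

```c
#include <stdint.h>
#include <string.h>

/* Convert a float to IEEE binary16 bits, rounding to nearest-even. */
uint16_t float_to_half_bits(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t mag = x & 0x7fffffffu;
    uint32_t exp = mag >> 23;          /* biased single-precision exponent */
    uint32_t man = mag & 0x007fffffu;
    uint32_t bits, rem, half;

    if (mag > 0x7f800000u) return (uint16_t)(sign | 0x7e00u);  /* NaN */
    if (mag >= 0x47800000u) return (uint16_t)(sign | 0x7c00u); /* >= 2^16 or inf */
    if (exp < 102) return (uint16_t)sign;   /* below 2^-25: rounds to zero */

    if (exp >= 113) {                  /* result is a normal half */
        bits = ((exp - 112) << 10) | (man >> 13);
        rem = man & 0x1fffu;
        half = 0x1000u;
    } else {                           /* result is a subnormal half */
        uint32_t full = man | 0x00800000u;
        uint32_t shift = 126 - exp;    /* 14..24 */
        bits = full >> shift;
        rem = full & ((1u << shift) - 1u);
        half = 1u << (shift - 1);
    }
    /* Round to nearest, ties to even; a carry into the exponent is fine. */
    if (rem > half || (rem == half && (bits & 1u))) bits++;
    return (uint16_t)(sign | bits);
}

/* Convert IEEE binary16 bits to float (always exact). */
float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1fu;
    uint32_t man = h & 0x3ffu;
    uint32_t x;
    float f;

    if (exp == 0x1fu) {
        x = sign | 0x7f800000u | (man << 13);        /* inf or NaN */
    } else if (exp != 0) {
        x = sign | ((exp + 112) << 23) | (man << 13);
    } else if (man == 0) {
        x = sign;                                    /* signed zero */
    } else {
        exp = 113;                                   /* normalize a subnormal */
        while ((man & 0x400u) == 0) { man <<= 1; exp--; }
        x = sign | (exp << 23) | ((man & 0x3ffu) << 13);
    }
    memcpy(&f, &x, sizeof f);
    return f;
}

/* Models a vcvtps2ph followed by a vcvtph2ps. */
float round_to_half(float v) {
    return half_bits_to_float(float_to_half_bits(v));
}

uint16_t x_storage;  /* stands in for the global `x` (2 bytes in memory) */

/* Model of the generated code for f() as annotated in the message. */
float f_as_compiled(float y, float z) {
    y = round_to_half(y);
    z = round_to_half(z);
    float prod = y * z;                     /* vmulss, single precision */
    x_storage = float_to_half_bits(prod);   /* rounded to half for the store */
    return prod;                            /* returned without that rounding */
}
```

With y = z = 1.0009765625f (that is, 1 + 2^-10, exactly representable in half), the single-precision product is 1 + 2^-9 + 2^-20, the stored half is 1 + 2^-9, and the returned value is the unrounded product. So under this model, `return x;` hands back something different from what was stored in `x`.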
>>
>> I’m not primarily worried about the extra conversions here. We can’t
>> get rid of them because we can’t prove they aren’t rounding, but
>> that’s a secondary issue. What I’m worried about is that we
>> allowed/required the back end to improvise an ABI to satisfy the
>> incoming IR, and the choice it made is questionable.
>
> As Richard said, an ABI rule emerged from the implementation, and I
> believe we should solidify it, so here's a simple strawman proposal:
> pass scalars in the low 16 bits of SSE registers, don't change the
> memory layout, and pack them in vectors of 16-bit elements.  That
> matches the only ISA extension so far (ph<>ps conversions), and fits
> well with that (as opposed to i16 coercion) as well as vectors (as
> opposed to f32 promotion).  To my knowledge, there hasn't been any
> alternative ABI proposal (but I haven't looked in 1 or 2 years).  It's
> interesting because we technically have no way of accessing scalars
> (so we have the same problems as i8/i16 vector elements, but without
> the saving grace of having matching GPRs - x86, or direct copies -
> aarch64), and there are not even any scalar operations.
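
[Editor's sketch, not part of the original message.] The strawman can be pictured with a toy model of a 128-bit SSE register as eight 16-bit lanes. The type and function names here (xmm_model, pass_scalar, receive_scalar, pack_vector) are hypothetical, and zeroing the upper lanes for a scalar is an assumption of this sketch; the proposal as quoted only says the value lives in the low 16 bits.

```c
#include <stdint.h>
#include <string.h>

/* Toy model: a 128-bit SSE register viewed as eight 16-bit lanes. */
typedef struct { uint16_t lane[8]; } xmm_model;

/* Scalar passing: the half's bit pattern sits in the low 16 bits (lane 0).
   Upper lanes are zeroed here only for determinism (an assumption). */
xmm_model pass_scalar(uint16_t half_bits) {
    xmm_model r;
    memset(&r, 0, sizeof r);
    r.lane[0] = half_bits;
    return r;
}

/* The callee reads the low 16 bits back without any f32 round trip. */
uint16_t receive_scalar(xmm_model r) {
    return r.lane[0];
}

/* Vector passing: eight halves packed as 16-bit elements, in order. */
xmm_model pack_vector(const uint16_t halves[8]) {
    xmm_model r;
    memcpy(r.lane, halves, sizeof r.lane);
    return r;
}
```

The point of the model is that neither i16 coercion into a GPR nor promotion to f32 happens at the call boundary: the 16-bit pattern is what crosses it, and the in-memory size stays 2 bytes.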
>
> Any thoughts?  We can suggest this to x86-psABI if folks think this is
> a good idea. (I don't know about other ABIs or other architectures
> though).
>
> Concretely, this means no/little change in IRGen.  As for the SDAG
> implementation, this is an unusual situation.  I've done some
> experimentation a long time ago.  We can make the types legal, even
> though no operations are.   It's relatively straightforward to promote
> all operations (and we made sure that worked years ago for AArch64,
> for the pre-v8.2 mode), but vectors are fun, because of build_vector
> (where it helps to have the truncating behavior we have for integers,
> but for fp), extract_vector_elt (where you need the matching extend),
> and insert_vector_elt (which you have to lower using some movd and/or
> pinsrw trickery, if you want to avoid the generic slow via-memory
> fallback).
> Alternatively, we can immediately, in call lowering/register
> assignment logic (this covers the SDAG cross-BB vreg assignments Craig
> mentions), promote to f32 "via" i16.  I'm afraid I don't remember the
> arguments one way or the other, but I can dust off my old patches and
> put them up on Phabricator.
>
>
> -Ahmed
>
>>
>> For a point of comparison, I looked at what gcc does. Currently, gcc
>> only allows _Float16 in C, not C++, and if you try to use it with a
>> target that doesn’t have native support for half-precision
>> arithmetic, it tells you “’_Float16’ is not supported on this
>> target.” That seems preferable to making up an ABI on the fly.
>>
>> I haven’t looked at what happens with clang when compiling for other
>> targets that don’t have native support for half-precision arithmetic,
>> but I would imagine that similar problems exist.
>>
>> Thoughts?
>>
>> Thanks,
>> Andy

Re: [llvm-dev] _Float16 support

David Blaikie via cfe-dev
In reply to this post by David Blaikie via cfe-dev
Woops, dropped llvm-dev :/

-----Original Message-----
From: Keane, Erich
Sent: Thursday, January 24, 2019 11:17 AM
To: 'Clang Dev' <[hidden email]>
Subject: RE: [llvm-dev] [cfe-dev] _Float16 support

Since Andy is my architect, I'm probably responsible (aka, being Volun-told :)) for this. Do we have a good idea which targets should currently have support?  Is it just AArch64 and ARM?

-Erich


Re: [llvm-dev] _Float16 support

David Blaikie via cfe-dev
In reply to this post by David Blaikie via cfe-dev
The patch to disable this is up for review here: https://reviews.llvm.org/D57188

-----Original Message-----
From: llvm-dev [mailto:[hidden email]] On Behalf Of John McCall via llvm-dev
Sent: Thursday, January 24, 2019 10:58 AM
To: Sjoerd Meijer <[hidden email]>
Cc: [hidden email]; Lu, Hongjiu <[hidden email]>; [hidden email]
Subject: Re: [llvm-dev] [cfe-dev] _Float16 support



On 24 Jan 2019, at 4:46, Sjoerd Meijer wrote:

> Hello,
>
> I added _Float16 support to Clang and codegen support in the AArch64
> and ARM backends, but have not looked into x86. Ahmed is right:
> AArch64 is fine, only a few ACLE intrinsics are missing. ARM has rough
> edges: scalar codegen should be mostly fine, vector codegen needs some
> more work.
>
> Implementation for AArch64 was mostly straightforward (it only has
> hard float ABI, and has half register/type support), but for ARM it
> was a huge pain to plumb f16 support because of different ABIs
> (hard/soft), different architecture extensions of FP and FP16 support,
> and the existence of another half-precision type with different
> semantics. Sounds like you're doing a similar exercise, and yes,
> argument passing was one of the trickiest parts.
>
>
>> IR and SelectionDAG representational choices aside, it seems to me
>> that, like GCC, Clang should not be permitting _Float16 on any target
>> that doesn't specify an ABI for it, because otherwise we're just
>> creating future compatibility problems for that target.  I'm surprised
>> and disappointed that it wasn't implemented this way.
>
> Apologies, I missed that.

It's alright, oversights happen (in both patch-writing and review).  Can we get a volunteer to do the work to restrict this now?  I'm a little crushed.

John.

>
> Sjoerd.
>
> ________________________________
> From: llvm-dev <[hidden email]> on behalf of Kaylor,
> Andrew via llvm-dev <[hidden email]>
> Sent: 24 January 2019 00:23
> To: Ahmed Bougacha; Lu, Hongjiu
> Cc: llvm-dev; [hidden email]
> Subject: Re: [llvm-dev] [cfe-dev] _Float16 support
>
> It seems that there are several issues here:
>
> 1. Should the front end be concerned with whether or not the IR that
> it is emitting can be translated into a well-defined IR?
> 2. How should the selection DAG handle data types whose representation
> isn't defined by the ABI we're targeting?
> 3. What should the ABI do with half-precision floats?
>
> Working backward...
>
> The third question here is obviously target specific. I've talked to
> HJ Lu about this, and he's working on an update to the x86 psABI. I
> believe that his eventual proposal will follow the lines of what you
> (Ahmed) suggested below, but I'm not completely proficient at
> comprehending ABI definitions so there may be some subtlety that I am
> misunderstanding in what he told me. I also talked to Craig about
> what would be involved in making the LLVM x86 backend handle 'half'
> values this way. That involves a good bit of work, but it can be done.
>
> The second question above probably involves a mix of
> target-independent and target-specific code. Right now the selection
> DAG code is operating on the assumption that it needs to do
> *something* with any IR it is given. It tries to make a reasonable
> choice, and the choice is consistent and predictable but not
> necessarily what the user expects. It seems like we should at the very
> least be producing a diagnostic so the user knows what we did (or even
> just that we did something). Then there are the specific problems
> Craig has brought up with the way we're currently handling 'half'
> values. Would defining a legal f16 type take care of those problems?
>
> The first question exposes my lack of understanding of the proper role
> of the front end. It isn't clear to me what responsibility the front
> end has for enforcing conformance to the ABI. As a user of the
> compiler, I would like the compiler to tell me when code I've written
> can't be represented using the ABI I am targeting. Whether the front
> end should detect this or the backend, I don't know. I suppose it's
> also an open question how strictly this should be enforced. Is it a
> warning that can be elevated to an error at the users' discretion? Is
> it something that should be blocked by default but enabled by a
> user-specified option? Should it always be rejected?
>
> -Andy
>
> -----Original Message-----
> From: Ahmed Bougacha <[hidden email]>
> Sent: Wednesday, January 23, 2019 3:30 PM
> To: Kaylor, Andrew <[hidden email]>
> Cc: [hidden email]; llvm-dev <[hidden email]>; Craig
> Topper <[hidden email]>; Richard Smith <[hidden email]>
> Subject: Re: [cfe-dev] _Float16 support
>
> Hey Andy,
>
> On Tue, Jan 22, 2019 at 10:38 AM Kaylor, Andrew via cfe-dev
> <[hidden email]> wrote:
>> I'd like to start a discussion about how clang supports _Float16 for
>> target architectures that don't have direct support for 16-bit
>> floating point arithmetic.
>
> Thanks for bringing this up;  we'd also like to get better support,
> for sysv x86-64 specifically - AArch64 is mostly fine, and ARM is
> usable with +fp16.
>
> I'm not sure much of this discussion generalizes across platforms
> though (beyond Craig's potential bug fix?).  I guess the
> "target-independent" question is: should we allow this kind of
> "legalization" in the vreg assignment code at all? (I think that's
> where it all comes from: RegsForValue, TLI::get*Register*) It's
> convenient for experimental frontends: you can use weird types (half,
> i3, ...) without worrying too much about it, and you usually get
> something self-consistent out of the backend.  But you eventually need
> to worry about it and need to make the calling convention explicit.
> But I guess that's a discussion for the other thread ;)
>
>> The current clang language extensions documentation says, "If
>> half-precision instructions are unavailable, values will be promoted
>> to single-precision, similar to the semantics of __fp16 except that
>> the results will be stored in single-precision." This is somewhat
>> vague (to me) as to what is meant by promotion of values, and the
>> part about results being stored in single-precision isn't what
>> actually happens.
>>
>> Consider this example:
>>
>> _Float16 x;
>> _Float16 f(_Float16 y, _Float16 z) {
>>   x = y * z;
>>   return x;
>> }
>>
>> When compiling with “-march=core-avx2” that results (after some
>> trivial cleanup) in this IR:
>>
>> @x = global half 0xH0000, align 2
>> define half @f(half, half) {
>>   %3 = fmul half %0, %1
>>   store half %3, half* @x
>>   ret half %3
>> }
>>
>> That’s not too unreasonable I suppose, except for the fact that it
>> hasn’t taken the lack of target support for half-precision arithmetic
>> into account yet. That will happen in the selection DAG.
>> The assembly code generated looks like this (with my annotations):
>>
>> f:                                      # @f
>> # %bb.0:
>>         vcvtps2ph  xmm1, xmm1, 4           # Convert argument 1 from single to half
>>         vcvtph2ps  xmm1, xmm1              # Convert argument 1 back to single
>>         vcvtps2ph  xmm0, xmm0, 4           # Convert argument 0 from single to half
>>         vcvtph2ps  xmm0, xmm0              # Convert argument 0 back to single
>>         vmulss     xmm0, xmm0, xmm1        # xmm0 = xmm0*xmm1 (single precision)
>>         vcvtps2ph  xmm1, xmm0, 4           # Convert the single precision result to half
>>         vmovd      eax, xmm1               # Move the half precision result to eax
>>         mov        word ptr [rip + x], ax  # Store the half precision result in the global, x
>>         ret                                # Return the single precision result still in xmm0
>> .Lfunc_end0:
>>                                         # -- End function
>>
>> Something odd has happened here, and it may not be obvious what it
>> is. This code begins by converting xmm0 and xmm1 from single to half
>> and then back to single. The first conversion is happening because
>> the back end decided that it needed to change the types of the
>> parameters to single precision but the function body is expecting
>> half precision values. However, since the target can’t perform the
>> required computation with half precision values they must be
>> converted back to single for the multiplication. The single precision
>> result of the multiplication is converted to half precision to be
>> stored in the global value, x, but the result is returned as single
>> precision (via xmm0).
>>
>> I’m not primarily worried about the extra conversions here. We can’t
>> get rid of them because we can’t prove they aren’t rounding, but
>> that’s a secondary issue. What I’m worried about is that we
>> allowed/required the back end to improvise an ABI to satisfy the
>> incoming IR, and the choice it made is questionable.
>
> As Richard said, an ABI rule emerged from the implementation, and I
> believe we should solidify it, so here's a simple strawman proposal:
> pass scalars in the low 16 bits of SSE registers, don't change the
> memory layout, and pack them in vectors of 16-bit elements.  That
> matches the only ISA extension so far (ph<>ps conversions), and fits
> well with that (as opposed to i16 coercion) as well as vectors (as
> opposed to f32 promotion).  To my knowledge, there hasn't been any
> alternative ABI proposal (but I haven't looked in 1 or 2 years).  It's
> interesting because we technically have no way of accessing scalars
> (so we have the same problems as i8/i16 vector elements, but without
> the saving grace of having matching GPRs - x86, or direct copies -
> aarch64), and there are not even any scalar operations.
>
> Any thoughts?  We can suggest this to x86-psABI if folks think this is
> a good idea. (I don't know about other ABIs or other architectures
> though).
>
> Concretely, this means no/little change in IRGen.  As for the SDAG
> implementation, this is an unusual situation.  I've done some
> experimentation a long time ago.  We can make the types legal, even
> though no operations are.   It's relatively straightforward to promote
> all operations (and we made sure that worked years ago for AArch64,
> for the pre-v8.2 mode), but vectors are fun, because of build_vector
> (where it helps to have the truncating behavior we have for integers,
> but for fp), extract_vector_elt (where you need the matching extend),
> and insert_vector_elt (which you have to lower using some movd and/or
> pinsrw trickery, if you want to avoid the generic slow via-memory
> fallback).
> Alternatively, we can immediately, in call lowering/register
> assignment logic (this covers the SDAG cross-BB vreg assignments Craig
> mentions) promote to f32 "via" i16.  I'm afraid I don't remember the
> arguments one way or the other, but I can dust off my old patches and
> put them up on Phabricator.
>
>
> -Ahmed
>
>>
>> For a point of comparison, I looked at what gcc does. Currently, gcc
>> only allows _Float16 in C, not C++, and if you try to use it with a
>> target that doesn’t have native support for half-precision
>> arithmetic, it tells you “’_Float16’ is not supported on this
>> target.” That seems preferable to making up an ABI on the fly.
>>
>> I haven’t looked at what happens with clang when compiling for other
>> targets that don’t have native support for half-precision arithmetic,
>> but I would imagine that similar problems exist.
>>
>> Thoughts?
>>
>> Thanks,
>> Andy
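
The single -> half -> single round trips in the quoted assembly cannot simply be dropped, because narrowing to half rounds. The following is a minimal C sketch of that point, and of what "promote, operate in single precision, round back to half" computes. It models half values as raw IEEE binary16 bit patterns in uint16_t and assumes round-to-nearest-even. The helper names (half_to_float, float_to_half, round_trip_is_exact, half_mul_promoted) are hypothetical and are not taken from clang, LLVM, or any ABI document.

```c
#include <stdint.h>
#include <string.h>

/* Widen an IEEE binary16 bit pattern to float. This direction is exact. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                       /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {                    /* normal number: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {                    /* signed zero */
        bits = sign;
    } else {                                  /* subnormal half: renormalize */
        uint32_t e = 113u;
        while ((man & 0x400u) == 0) { man <<= 1; e--; }
        bits = sign | (e << 23) | ((man & 0x3FFu) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow a float to binary16 with round-to-nearest-even. This direction
   can lose information, which is why the conversions are observable. */
static uint16_t float_to_half(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t mag  = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) return (uint16_t)(sign | 0x7E00u); /* NaN */
    int e = (int)(mag >> 23) - 127;                /* unbiased exponent */
    if (e > 15)  return (uint16_t)(sign | 0x7C00u); /* infinity or overflow */
    if (e < -25) return sign;                       /* rounds to signed zero */

    uint32_t m     = (mag & 0x7FFFFFu) | 0x800000u; /* 24-bit significand */
    int      shift = (e < -14) ? (-1 - e) : 13;     /* bits to discard */
    uint32_t base  = (e < -14) ? 0u : (uint32_t)(e + 14) << 10;
    uint32_t q     = m >> shift;
    uint32_t rem   = m & ((1u << shift) - 1u);
    uint32_t half  = 1u << (shift - 1);
    if (rem > half || (rem == half && (q & 1u))) q++; /* ties to even */
    /* q still carries the implicit bit for normals, so adding it to base
       yields the final exponent, including a carry into the next binade. */
    return (uint16_t)(sign | (base + q));
}

/* Nonzero when narrowing to half and widening again returns f unchanged. */
static int round_trip_is_exact(float f) {
    return half_to_float(float_to_half(f)) == f;
}

/* Promote both operands, multiply in single precision, round once to half. */
static uint16_t half_mul_promoted(uint16_t a, uint16_t b) {
    return float_to_half(half_to_float(a) * half_to_float(b));
}
```

With these helpers, float_to_half(0.1f) gives 0x2E66, which widens back to 0.0999755859375f rather than 0.1f, so round_trip_is_exact(0.1f) is false. half_mul_promoted(0x3E00, 0x4000), that is 1.5 * 2.0, gives 0x4200 (3.0).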

Re: _Float16 support

David Blaikie via cfe-dev
In reply to this post by David Blaikie via cfe-dev
I can start looking Friday/Monday. But if it's more urgent, perhaps someone else might want to have a look.

Sjoerd.
From: [hidden email] <[hidden email]> on behalf of John McCall <[hidden email]>
Sent: 24 January 2019 18:57:54
To: Sjoerd Meijer
Cc: Ahmed Bougacha; Lu, Hongjiu; Kaylor, Andrew; [hidden email]; [hidden email]; [hidden email]
Subject: Re: [cfe-dev] _Float16 support
 


On 24 Jan 2019, at 4:46, Sjoerd Meijer wrote:

> Hello,
>
> I added _Float16 support to Clang and codegen support in the AArch64
> and ARM backends, but have not looked into x86. Ahmed is right:
> AArch64 is fine, only a few ACLE intrinsics are missing. ARM has rough
> edges: scalar codegen should be mostly fine, vector codegen needs some
> more work.
>
> Implementation for AArch64 was mostly straightforward (it only has
> hard float ABI, and has half register/type support), but for ARM it
> was a huge pain to plumb f16 support because of different ABIs
> (hard/soft), different architecture extensions of FP and FP16 support,
> and the existence of another half-precision type with different
> semantics. Sounds like you're doing a similar exercise, and yes,
> argument passing was one of the trickiest parts.
>
>
>> IR and SelectionDAG representational choices aside, it seems to me
>> that, like GCC, Clang should not be permitting _Float16 on any target
>> that doesn't specify an ABI for it, because otherwise we're just
>> creating future compatibility problems for that target.  I'm surprised
>> and disappointed that it wasn't implemented this way.
> Apologies, I missed that.

It's alright, oversights happen (in both patch-writing and review).  Can
we get a volunteer to do the work to restrict this now?  I'm a little
crushed.

John.

>
> Sjoerd.
>
> ________________________________
> From: llvm-dev <[hidden email]> on behalf of Kaylor,
> Andrew via llvm-dev <[hidden email]>
> Sent: 24 January 2019 00:23
> To: Ahmed Bougacha; Lu, Hongjiu
> Cc: llvm-dev; [hidden email]
> Subject: Re: [llvm-dev] [cfe-dev] _Float16 support
>
> It seems that there are several issues here:
>
> 1. Should the front end be concerned with whether or not the IR that
> it is emitting can be translated into a well-defined IR?
> 2. How should the selection DAG handle data types whose representation
> isn't defined by the ABI we're targeting?
> 3. What should the ABI do with half-precision floats?
>
> Working backward...
>
> The third question here is obviously target specific. I've talked to
> HJ Lu about this, and he's working on an update to the x86 psABI. I
> believe that his eventual proposal will follow the lines of what you
> (Ahmed) suggested below, but I'm not completely proficient at
> comprehending ABI definitions so there may be some subtlety that I am
> misunderstanding in what he told me. I also talked to Craig about
> what would be involved in making the LLVM x86 backend handle 'half' values
> this way. That involves a good bit of work, but it can be done.
>
> The second question above probably involves a mix of
> target-independent and target-specific code. Right now the selection
> DAG code is operating on the assumption that it needs to do
> *something* with any IR it is given. It tries to make a reasonable
> choice, and the choice is consistent and predictable but not
> necessarily what the user expects. It seems like we should at the very
> least be producing a diagnostic so the user knows what we did (or even
> just that we did something). Then there are the specific problems
> Craig has brought up with the way we're currently handling 'half'
> values. Would defining a legal f16 type take care of those problems?
>
> The first question exposes my lack of understanding of the proper role
> of the front end. It isn't clear to me what responsibility the front
> end has for enforcing conformance to the ABI. As a user of the
> compiler, I would like the compiler to tell me when code I've written
> can't be represented using the ABI I am targeting. Whether the front
> end should detect this or the backend, I don't know. I suppose it's
> also an open question how strictly this should be enforced. Is it a
> warning that can be elevated to an error at the user's discretion? Is
> it something that should be blocked by default but enabled by a
> user-specified option? Should it always be rejected?
>
> -Andy
>



Re: [llvm-dev] _Float16 support

David Blaikie via cfe-dev
In reply to this post by David Blaikie via cfe-dev
On 24 Jan 2019, at 16:44, Keane, Erich wrote:
> Disable up for review here: https://reviews.llvm.org/D57188

Thanks!  Reviewing.

John.
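
If _Float16 is rejected on targets that have no ABI for it, as requested earlier in the thread, portable user code needs a guard around the type. A minimal sketch follows. It assumes the compiler predefines __FLT16_MANT_DIG__ only when _Float16 is usable for the current target; that should be checked against the compiler version in use. The names half_storage_t, HALF_IS_NATIVE, and half_is_native are hypothetical.

```c
#include <stdint.h>

/* Assumption: __FLT16_MANT_DIG__ is predefined only when _Float16 is
   usable for the current target. Verify this for the compiler in use. */
#if defined(__FLT16_MANT_DIG__)
typedef _Float16 half_storage_t;
#define HALF_IS_NATIVE 1
#else
/* Fallback: keep the raw IEEE binary16 bits for storage only; arithmetic
   would have to go through explicit conversions to float. */
typedef uint16_t half_storage_t;
#define HALF_IS_NATIVE 0
#endif

/* Either way the in-memory size stays at 16 bits. */
_Static_assert(sizeof(half_storage_t) == 2, "half storage must be 16 bits");

/* Lets callers choose a code path at run time. */
static int half_is_native(void) { return HALF_IS_NATIVE; }
```

The fallback keeps the memory layout identical in both configurations, so data structures do not change size when the same source is built for a target without _Float16.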