[RFC] Loading Bitfields with Smallest Needed Types

Bill Wendling via cfe-dev
We're running into an interesting issue with the Linux kernel, and
wanted advice on how to proceed.

Example of what we're talking about: https://godbolt.org/z/ABGySq

The issue is that, when working with a bitfield, a load may happen
shortly after a store. For instance:

struct napi_gro_cb {
    void *frag;
    unsigned int frag_len;
    u16 flush;
    u16 flush_id;
    u16 count;
    u16 gro_remcsum_start;
    unsigned long age;
    u16 proto;
    u8 same_flow : 1;
    u8 encap_mark : 1;
    u8 csum_valid : 1;
    u8 csum_cnt : 3;
    u8 free : 2;
    u8 is_ipv6 : 1;
    u8 is_fou : 1;
    u8 is_atomic : 1;
    u8 recursive_counter : 4;
    __wsum csum;
    struct sk_buff *last;
};

void dev_gro_receive(struct sk_buff *skb)
{
...
        same_flow = NAPI_GRO_CB(skb)->same_flow;
...
}

Right before the "same_flow = ... ->same_flow;" statement is executed,
a store is made to the bitfield at the end of a called function:

    NAPI_GRO_CB(skb)->same_flow = 1;

The store is a byte:

    orb    $0x1,0x4a(%rbx)

while the read is a word:

    movzwl 0x4a(%r12),%r15d

The problem is that between the store and the load the value hasn't
been retired / placed in the cache. One would expect store-to-load
forwarding to kick in, but on x86 that doesn't happen because x86
requires the store to be of equal or greater size than the load. So
instead the load takes the slow path, causing unacceptable slowdowns.

GCC gets around this by using the smallest load for a bitfield. It
seems to use a byte for everything, at least in our examples. According
to the comments (which are never wrong), this is intentional: C++0x
doesn't allow one to touch bits outside of the bitfield. (I'm not a
language lawyer, but I take this to mean that GCC is trying to minimize
which bits it's accessing by using byte stores and loads whenever
possible.)

The question I have is what should we do to fix this issue? Once we
get to LLVM IR, the information saying that we're accessing a bitfield
is gone. We have a few options:

* We can glean this information from how the loaded value is used and
fix this during DAG combine, but it seems a bit brittle.

* We could correct the size of the load during the front-end's code
generation. This has the benefit that all of LLVM's passes then run
over the corrected accesses.

* We could perform the transformation in another place, possibly in MIR or MC.

What do people think?

-bw

Re: [RFC] Loading Bitfields with Smallest Needed Types

Eli Friedman via cfe-dev
By default, clang emits all bitfield load/store operations using the width of the entire sequence of bitfield members.  If you look at the LLVM IR for your testcase, all the bitfield operations are i16.  (For thread safety, the C/C++ standards treat a sequence of bitfield members as a single "field".)

If you look at the assembly, though, an "andb $-2, (%rdi)" slips in.  This is specific to the x86 backend: it's narrowing the store to save a couple bytes in the encoding, and a potential decoding stall due to a 2-byte immediate.  Maybe we shouldn't do that, or we should guard it with a better heuristic.
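
As a reduced sketch of that pattern (the run of bit-fields from the kernel
struct spans 15 bits, so clang packs it into a single 16-bit storage unit;
the codegen notes in the comments are illustrative):

struct cb {
    unsigned char same_flow : 1;
    unsigned char encap_mark : 1;
    unsigned char csum_valid : 1;
    unsigned char csum_cnt : 3;
    unsigned char free : 2;
    unsigned char is_ipv6 : 1;
    unsigned char is_fou : 1;
    unsigned char is_atomic : 1;
    unsigned char recursive_counter : 4;
};

void set_same_flow(struct cb *p)
{
    /* IR: i16 load, "or" with 1, i16 store of the whole storage unit.
     * The x86 backend may then narrow the store to a byte RMW such as
     * "orb $0x1, (%rdi)", while later reads of the unit stay 16 bits wide. */
    p->same_flow = 1;
}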

-Eli


Re: [RFC] Loading Bitfields with Smallest Needed Types

Craig Topper via cfe-dev
We also have the opposite problem from store shrinking: we'll try to widen a 4-byte-aligned i16/i8 extload to i32. I haven't worked out the alignment in this particular struct layout.

~Craig



Re: [RFC] Loading Bitfields with Smallest Needed Types

Bill Wendling via cfe-dev
When writing my initial email, I forgot another option which Eli
pointed out: don't shrink the store's size. That would be acceptable
for our purposes. If it's something that needs further consideration,
perhaps we could disable it via a flag (not an official "-m..." flag,
but "-mllvm -disable-store-shortening" or whatever)?

-bw


Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

John McCall via cfe-dev

Clang used to generate narrower loads and stores for bit-fields, but a
long time ago it was intentionally changed to generate wider loads
and stores, IIRC by Chandler.  There are some cases where I think the
“new” code goes overboard, but in this case I don’t particularly have
an issue with the wider loads and stores.  I guess we could make a
best-effort attempt to stick to the storage-unit size when the
bit-fields break evenly on a boundary.  But mostly I think the frontend’s
responsibility ends with it generating same-size accesses in both
places, and if inconsistent access sizes trigger poor performance,
the backend should be more careful about intentionally changing access
sizes.

John.

Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Craig Topper via cfe-dev
I spent some time looking at this back in March, when Bill mentioned it on IRC. I think I saw a write to one bit in one of the 8-bit pieces and then a read of that bit plus a bit from the adjacent byte, so we used a narrow store and then a wider load to cover the 2 bits.

~Craig



Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Bill Wendling via cfe-dev
On Tue, May 26, 2020 at 4:38 PM Craig Topper <[hidden email]> wrote:
>
> When I spent some time looking at this back in March when Bill mentioned it on IRC. I think I saw a write to one bit in one of the 8-bit pieces and then a read of that bit and a bit from the adjacent byte. So we used a narrow store and then a wider load due to the 2 bits.
>
I think you're referring to same_flow and free in the struct from my
original message. Those both have stores, as does most of the rest of
the bitfield (it's an initialization, which it seems could be done with
a few bitwise operations on the whole bitfield, but I digress). But
yeah, in the case where we have consecutive accesses of bitfields in
adjacent bytes, a bigger read & store are better.
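
A sketch of that kind of access pattern, with the struct cut down (names
and the exact codegen are illustrative):

struct flags {
    unsigned char same_flow : 1;
    unsigned char encap_mark : 1;
    unsigned char csum_valid : 1;
    unsigned char csum_cnt : 3;
    unsigned char free : 2;      /* fills out the first byte of the run */
    unsigned char is_ipv6 : 1;   /* spills into the second byte         */
};

int same_flow_on_ipv6(struct flags *p)
{
    p->same_flow = 1;                  /* a tempting candidate for a byte store */
    return p->same_flow && p->is_ipv6; /* needs bits from both bytes, so a
                                          16-bit load is the natural choice     */
}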

> On Tue, May 26, 2020 at 4:32 PM John McCall via cfe-dev <[hidden email]> wrote:
>>
>> Clang used to generate narrower loads and stores for bit-fields, but a
>> long time ago it was intentionally changed to generate wider loads
>> and stores, IIRC by Chandler.  There are some cases where I think the
>> “new” code goes overboard, but in this case I don’t particularly have
>> an issue with the wider loads and stores.  I guess we could make a
>> best-effort attempt to stick to the storage-unit size when the
>> bit-fields break evenly on a boundary.  But mostly I think the frontend’s
>> responsibility ends with it generating same-size accesses in both
>> places, and if inconsistent access sizes trigger poor performance,
>> the backend should be more careful about intentionally changing access
>> sizes.

Fair enough.

-bw

Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Arthur O'Dwyer via cfe-dev

FWIW, when I was at Green Hills, I recall the rule being "Always use the declared type of the bitfield to govern the size of the read or write."  (There was a similar rule for the meaning of `volatile`. I hope I'm not just getting confused between the two. Actually, since, of the compilers on Godbolt, only MSVC follows this rule <https://godbolt.org/z/Aq_APH>, I'm probably wrong.)  That is, if the bitfield is declared `int16_t`, then use 16-bit loads and stores for it; if it's declared `int32_t`, then use 32-bit loads and stores. This gives the programmer a reason to prefer one declared type over another. For example, in

template<class T>
struct A {
    T w : 5;
    T x : 3;
    T y : 4;
    T z : 4;
};

the only differences between A<char> and A<short> are
- whether the struct's alignment is 1 or 2, and
- whether you use 8-bit or 16-bit accesses to modify its fields.

"The backend should be more careful about intentionally changing access sizes" sounds like absolutely the correct diagnosis, to me. 

my $.02,
Arthur


Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

John McCall via cfe-dev


I’ve always liked MSVC’s bit-field rules as a coherent whole, but they are
quite different from the standard Unix rules. On Windows, T x : 3
literally allocates an entire T in the structure, and successive
bit-fields get packed into that T only if their base type is of the
same size (and they haven’t exhausted the original T). So of course
all accesses to that bit-field are basically of the full size of the T;
there’s no overlap to be concerned with. On Unix, bit-fields will typically
get packed together regardless of the base type; the base type does have
some influence, but it’s target-specific and somewhat odd.

I’d prefer if we degraded to a Windows-like access behavior as much
as we can, but it’s not always possible because of that packing.

John.



Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

James Y Knight via cfe-dev
At least in this test-case, the "bitfield" part of this seems to be a distraction. As Eli notes, Clang has lowered the function to LLVM IR containing consistent i16 operations. Despite that being a different choice from GCC, it should still be correct and consistent.

Of course that insight does mean it's quite easy to create a test-case with the exact same problematic store->load mismatch which doesn't use bit-fields at all. For example:
void g();
short f2(short *bfs) {
    *bfs &= ~0x1;
    g();
    return *bfs;
}

creates the same bad sequence:
        movq    %rdi, %rbx
        andb    $-2, (%rdi)
        callq   g()
        movzwl  (%rbx), %eax


Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Craig Topper via cfe-dev
Similar issue with aligned extload widening. The load and store are both i16 in IR.

struct s {
    int x; // force alignment
    short y;
};

void g();
short f2(s *a, short b) {
    a->y = b;
    g();
    return a->y - b;
}

produces this:

    movq  %rdi, %rbx
    movw  %bp, 4(%rdi)    // 2 byte store
    callq g()
    movl  4(%rbx), %eax   // 4 byte load
    subl  %ebp, %eax

~Craig



Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

David via cfe-dev
There was a related discussion about a load-widening (byte read --> 64-bit read) transformation 9 years ago: http://lists.llvm.org/pipermail/llvm-dev/2011-December/046322.html

David


Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Bill Wendling via cfe-dev
On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev
<[hidden email]> wrote:
>
> At least in this test-case, the "bitfield" part of this seems to be a distraction. As Eli notes, Clang has lowered the function to LLVM IR containing consistent i16 operations. Despite that being a different choice from GCC, it should still be correct and consistent.
>
I suspect that this is more prevalent with bitfields as they're more
likely to have the load / bitwise op / store operations done on them,
resulting in an access type that can be shortened. But yes, it's not
specific to just bitfields.

I'm more interested in consistency, to be honest. If the loads and
stores for the bitfields (or other such shorten-able objects) were the
same, then we wouldn't run into the store-to-load forwarding issue on
x86 (I don't know about other platforms, but suspect that consistency
wouldn't hurt). I liked Arthur's idea of accessing the object using
the type size the bitfield was defined with (i8, i16, i256). It would
help with improving the heuristic. The downside is that it could lead
to suboptimal code, but that's the situation we have now, so...

-bw


Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

John McCall via cfe-dev



Okay, but what concretely are you suggesting here?  Clang IRGen is
emitting accesses with consistent sizes, and LLVM is making them
inconsistent.  Are you just asking Clang to emit smaller accesses
in the hope that LLVM won’t mess them up?

John.

Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Richard Smith via cfe-dev

I don't think this has anything to do with bit-fields or Clang's lowering. This seems to "just" be an optimizer issue (one that happens to show up for bit-field access patterns, but also affects other cases). Much-reduced testcase:

unsigned short n;
void set() { n |= 1; }

For this testcase, -O2 generates a 1-byte 'or' instruction, which will often be a pessimization when there are also full-width accesses. I don't think the frontend can or should be working around this.
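
For illustration, here is how that plays out once a full-width read of the
same variable is added (the get() function and the exact instruction
spellings are just a sketch):

unsigned short n;

void set(void)           { n |= 1; }    /* -O2: narrowed to a byte RMW,
                                           e.g. "orb $1, n(%rip)"        */
unsigned short get(void) { return n; }  /* full-width 2-byte load,
                                           e.g. "movzwl n(%rip), %eax"   */

/* If get() runs shortly after set(), the 2-byte load can't be forwarded
 * from the 1-byte store, which is the stall described at the start of
 * this thread. */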


Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Bill Wendling via cfe-dev

We're a bit (heh) tied in what we can do. LLVM's IR doesn't convey
that we're working with a bitfield, but as many pointed out it's not
bitfield-specific (I didn't mean to imply that it was *only* bitfields
affected by this with my initial post, that was just where it showed
up). The front-end generates code that's perfectly valid for the
various accesses, and changing what it generates doesn't seem like it
would be worth the hassle (e.g., using i1, i3, etc. for the bitfield
members; note that I'm *not* suggesting we do this, it's just an example).

So what we're left with is trying to ensure that we don't generate
sub-optimal code via the DAG combiner and other passes. Some ideas off
the top of my head:

1. Avoid generating byte-sized bitwise operations, because as Richard
pointed out they can lead to pessimizations.

2. Come up with a better heuristic to determine if we're going to
generate inconsistent load/store accesses to the same object.

3. Transform the machine instructions before register allocation to
ensure that all accesses to the same address are the same size, or at
least that the stores are of equal or greater size than the loads.

Option (1) may be the simplest and could potentially solve our immediate issue.

Option (2) is difficult because the DAG combiner operates on a single
BB at a time and doesn't have visibility into past DAGs it modified.

Option (3) allows us to retain the current behavior and gives us
visibility into the whole function, not just an individual BB.

I kind of like option 3, but I'm open to other suggestions.
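For concreteness, option (3) might start from a MachineFunctionPass skeleton
along these lines (a rough sketch under my own assumptions: the pass name is
made up, registration boilerplate is omitted, and the grouping/rewriting
logic is only indicated by the TODO):

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

namespace {
// Hypothetical pass: walk every load/store in the function so that accesses
// to the same location can be grouped and given a consistent width.
struct MemAccessWidthUnifier : public MachineFunctionPass {
  static char ID;
  MemAccessWidthUnifier() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    bool Changed = false;
    for (MachineBasicBlock &MBB : MF) {
      for (MachineInstr &MI : MBB) {
        if (!MI.mayLoad() && !MI.mayStore())
          continue;
        // TODO: group accesses by base and offset (e.g. via the attached
        // MachineMemOperands), find stores that are narrower than later
        // loads of the same location, and rewrite them to a consistent
        // width (or at least make the stores no narrower than the loads).
      }
    }
    return Changed;
  }
};
} // end anonymous namespace

char MemAccessWidthUnifier::ID = 0;

Running before register allocation, as suggested, keeps the whole function
in view rather than a single block; the hard part is the rewrite itself,
since widening a store is only legal when the extra bytes are provably safe
to touch.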

-bw

Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Vassil Vassilev via cfe-dev
I thought the IR you showed in March had the store in one function and the load in another function. Option 3 would not handle that.
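To illustrate with a made-up example (hypothetical names; this is just the
shape of the problem, not the IR from March):

struct flags {
    unsigned short bits;    /* several logical flags packed into 16 bits */
};

/* In one function the store may be narrowed to a byte-sized RMW... */
void mark(struct flags *f) { f->bits |= 1; }

/* ...while in another function the load is the full 16 bits. */
unsigned short check(struct flags *f) { return f->bits; }

/* If check(f) runs right after mark(f), a fix-up that only sees one
   function at a time can't see both accesses, so the width mismatch
   survives. */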

~Craig


On Fri, May 29, 2020 at 4:29 PM Bill Wendling via cfe-dev <[hidden email]> wrote:
On Fri, May 29, 2020 at 4:00 PM Richard Smith <[hidden email]> wrote:
>
> On Fri, 29 May 2020 at 11:06, John McCall via llvm-dev <[hidden email]> wrote:
>>
>> On 28 May 2020, at 18:42, Bill Wendling wrote:
>>
>> > On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev
>> > <[hidden email]> wrote:
>> >>
>> >> At least in this test-case, the "bitfield" part of this seems to be a
>> >> distraction. As Eli notes, Clang has lowered the function to LLVM IR
>> >> containing consistent i16 operations. Despite that being a different
>> >> choice from GCC, it should still be correct and consistent.
>> >>
>> > I suspect that this is more prevalent with bitfields as they're more
>> > likely to have the load / bitwise op / store operations done on them,
>> > resulting in an access type that can be shortened. But yes, it's not
>> > specific to just bitfields.
>> >
>> > I'm more interested in consistency, to be honest. If the loads and
>> > stores for the bitfields (or other such shorten-able objects) were the
>> > same, then we wouldn't run into the store-to-load forwarding issue on
>> > x86 (I don't know about other platforms, but suspect that consistency
>> > wouldn't hurt). I liked Arthur's idea of accessing the object using
>> > the type size the bitfield was defined with (i8, i16, i256). It would
>> > help with improving the heuristic. The downside is that it could lead
>> > to un-optimal code, but that's the situation we have now, so...
>>
>> Okay, but what concretely are you suggesting here?  Clang IRGen is
>> emitting accesses with consistent sizes, and LLVM is making them
>> inconsistent.  Are you just asking Clang to emit smaller accesses
>> in the hope that LLVM won’t mess them up?
>
>
> I don't think this has anything to do with bit-fields or Clang's lowering. This seems to "just" be an optimizer issue (one that happens to show up for bit-field access patterns, but also affects other cases). Much-reduced testcase:
>
> unsigned short n;
> void set() { n |= 1; }
>
> For this testcase, -O2 generates a 1-byte 'or' instruction, which will often be a pessimization when there are also full-width accesses. I don't think the frontend can or should be working around this.

We're a bit (heh) tied in what we can do. LLVM's IR doesn't convey
that we're working with a bitfield, but as many pointed out it's not
bitfield-specific (I didn't mean to imply that it was *only* bitfields
affected by this with my initial post, that was just where it showed
up). The front-end generates code that's perfectly valid for the
various accesses, and changing what it generates doesn't seem like it
would be worth the hassle. (E.g. using i1, i3, etc. for the bitfields
(note I'm *not* suggesting we do this, I'm just using as an example).)

So what we're left with is trying to ensure that we don't generate
sub-optimal code via the DAG combiner and other passes. Some ideas off
the top of my head:

1. Avoid generating byte-sized bitwise operations, because as Richard
pointed out they can lead to pessimizations.

2. Come up with a better heuristic to determine if we're going to
generate inconsistent load/store accesses to the same object.

3. Transform the machine instructions before register allocation to
ensure that all accesses to the same address are the same size, or at
least that the stores are of equal or greater size than the loads.

Option (1) may be the simplest and could potentially solve our immediate issue.

Option (2) is difficult because the DAG combiner operates on a single
BB at a time and doesn't have visibility into past DAGs it modified.

Option (3) allows us to retain the current behavior and gives us
visibility into the whole function, not just an individual BB.

I kind of like option 3, but I'm open to other suggestions.

-bw

Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Vassil Vassilev via cfe-dev
On Fri, May 29, 2020 at 4:50 PM Craig Topper <[hidden email]> wrote:
>
> I thought the IR you showed in March had the store in one function and the load in another function. Option 3 would not handle that.
>
Yeah. That's an entirely different issue, though, as it would need IPA. I'm
a bit skeptical about the initial report, since I suspect that a store at
the end of a function would have retired, or at least made it to the
cache, by the time the function returns, the stack is readjusted,
registers are restored, etc. So I think some of the details in the
initial report may have been a bit off.

-bw

> On Fri, May 29, 2020 at 4:29 PM Bill Wendling via cfe-dev <[hidden email]> wrote:
>>
>> On Fri, May 29, 2020 at 4:00 PM Richard Smith <[hidden email]> wrote:
>> >
>> > On Fri, 29 May 2020 at 11:06, John McCall via llvm-dev <[hidden email]> wrote:
>> >>
>> >> On 28 May 2020, at 18:42, Bill Wendling wrote:
>> >>
>> >> > On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev
>> >> > <[hidden email]> wrote:
>> >> >>
>> >> >> At least in this test-case, the "bitfield" part of this seems to be a
>> >> >> distraction. As Eli notes, Clang has lowered the function to LLVM IR
>> >> >> containing consistent i16 operations. Despite that being a different
>> >> >> choice from GCC, it should still be correct and consistent.
>> >> >>
>> >> > I suspect that this is more prevalent with bitfields as they're more
>> >> > likely to have the load / bitwise op / store operations done on them,
>> >> > resulting in an access type that can be shortened. But yes, it's not
>> >> > specific to just bitfields.
>> >> >
>> >> > I'm more interested in consistency, to be honest. If the loads and
>> >> > stores for the bitfields (or other such shorten-able objects) were the
>> >> > same, then we wouldn't run into the store-to-load forwarding issue on
>> >> > x86 (I don't know about other platforms, but suspect that consistency
>> >> > wouldn't hurt). I liked Arthur's idea of accessing the object using
>> >> > the type size the bitfield was defined with (i8, i16, i256). It would
>> >> > help with improving the heuristic. The downside is that it could lead
>> >> > to un-optimal code, but that's the situation we have now, so...
>> >>
>> >> Okay, but what concretely are you suggesting here?  Clang IRGen is
>> >> emitting accesses with consistent sizes, and LLVM is making them
>> >> inconsistent.  Are you just asking Clang to emit smaller accesses
>> >> in the hope that LLVM won’t mess them up?
>> >
>> >
>> > I don't think this has anything to do with bit-fields or Clang's lowering. This seems to "just" be an optimizer issue (one that happens to show up for bit-field access patterns, but also affects other cases). Much-reduced testcase:
>> >
>> > unsigned short n;
>> > void set() { n |= 1; }
>> >
>> > For this testcase, -O2 generates a 1-byte 'or' instruction, which will often be a pessimization when there are also full-width accesses. I don't think the frontend can or should be working around this.
>>
>> We're a bit (heh) tied in what we can do. LLVM's IR doesn't convey
>> that we're working with a bitfield, but as many pointed out it's not
>> bitfield-specific (I didn't mean to imply that it was *only* bitfields
>> affected by this with my initial post, that was just where it showed
>> up). The front-end generates code that's perfectly valid for the
>> various accesses, and changing what it generates doesn't seem like it
>> would be worth the hassle. (E.g. using i1, i3, etc. for the bitfields
>> (note I'm *not* suggesting we do this, I'm just using as an example).)
>>
>> So what we're left with is trying to ensure that we don't generate
>> sub-optimal code via the DAG combiner and other passes. Some ideas off
>> the top of my head:
>>
>> 1. Avoid generating byte-sized bitwise operations, because as Richard
>> pointed out they can lead to pessimizations.
>>
>> 2. Come up with a better heuristic to determine if we're going to
>> generate inconsistent load/store accesses to the same object.
>>
>> 3. Transform the machine instructions before register allocation to
>> ensure that all accesses to the same address are the same size, or at
>> least that the stores are of equal or greater size than the loads.
>>
>> Option (1) may be the simplest and could potentially solve our immediate issue.
>>
>> Option (2) is difficult because the DAG combiner operates on a single
>> BB at a time and doesn't have visibility into past DAGs it modified.
>>
>> Option (3) allows us to retain the current behavior and gives us
>> visibility into the whole function, not just an individual BB.
>>
>> I kind of like option 3, but I'm open to other suggestions.
>>
>> -bw

Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Vassil Vassilev via cfe-dev
On 29 May 2020, at 19:29, Bill Wendling wrote:

> On Fri, May 29, 2020 at 4:00 PM Richard Smith <[hidden email]>
> wrote:
>> On Fri, 29 May 2020 at 11:06, John McCall via llvm-dev
>> <[hidden email]> wrote:
>>> On 28 May 2020, at 18:42, Bill Wendling wrote:
>>>
>>>> On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev
>>>> <[hidden email]> wrote:
>>>>>
>>>>> At least in this test-case, the "bitfield" part of this seems to
>>>>> be a
>>>>> distraction. As Eli notes, Clang has lowered the function to LLVM
>>>>> IR
>>>>> containing consistent i16 operations. Despite that being a
>>>>> different
>>>>> choice from GCC, it should still be correct and consistent.
>>>>>
>>>> I suspect that this is more prevalent with bitfields as they're
>>>> more
>>>> likely to have the load / bitwise op / store operations done on
>>>> them,
>>>> resulting in an access type that can be shortened. But yes, it's
>>>> not
>>>> specific to just bitfields.
>>>>
>>>> I'm more interested in consistency, to be honest. If the loads and
>>>> stores for the bitfields (or other such shorten-able objects) were
>>>> the
>>>> same, then we wouldn't run into the store-to-load forwarding issue
>>>> on
>>>> x86 (I don't know about other platforms, but suspect that
>>>> consistency
>>>> wouldn't hurt). I liked Arthur's idea of accessing the object using
>>>> the type size the bitfield was defined with (i8, i16, i256). It
>>>> would
>>>> help with improving the heuristic. The downside is that it could
>>>> lead
>>>> to un-optimal code, but that's the situation we have now, so...
>>>
>>> Okay, but what concretely are you suggesting here?  Clang IRGen is
>>> emitting accesses with consistent sizes, and LLVM is making them
>>> inconsistent.  Are you just asking Clang to emit smaller accesses
>>> in the hope that LLVM won’t mess them up?
>>
>>
>> I don't think this has anything to do with bit-fields or Clang's
>> lowering. This seems to "just" be an optimizer issue (one that
>> happens to show up for bit-field access patterns, but also affects
>> other cases). Much-reduced testcase:
>>
>> unsigned short n;
>> void set() { n |= 1; }

To be clear, I agree with you that this is not a frontend problem.
I asked because the question seemed to lead towards changing the width
that Clang uses for accesses.

> We're a bit (heh) tied in what we can do. LLVM's IR doesn't convey
> that we're working with a bitfield, but as many pointed out it's not
> bitfield-specific (I didn't mean to imply that it was *only* bitfields
> affected by this with my initial post, that was just where it showed
> up). The front-end generates code that's perfectly valid for the
> various accesses, and changing what it generates doesn't seem like it
> would be worth the hassle. (E.g. using i1, i3, etc. for the bitfields
> (note I'm *not* suggesting we do this, I'm just using as an example).)
>
> So what we're left with is trying to ensure that we don't generate
> sub-optimal code via the DAG combiner and other passes. Some ideas off
> the top of my head:
>
> 1. Avoid generating byte-sized bitwise operations, because as Richard
> pointed out they can lead to pessimizations.
>
> 2. Come up with a better heuristic to determine if we're going to
> generate inconsistent load/store accesses to the same object.
>
> 3. Transform the machine instructions before register allocation to
> ensure that all accesses to the same address are the same size, or at
> least that the stores are of equal or greater size than the loads.
>
> Option (1) may be the simplest and could potentially solve our
> immediate issue.
>
> Option (2) is difficult because the DAG combiner operates on a single
> BB at a time and doesn't have visibility into past DAGs it modified.
>
> Option (3) allows us to retain the current behavior and gives us
> visibility into the whole function, not just an individual BB.
>
> I kind of like option 3, but I'm open to other suggestions.

What’s the concrete suggestion for (3), that we analyze the uses
of a single address and try to re-widen the access to make it use
a consistent width?  Widening accesses is difficult in general,
unless the information that the access was wider before DAG combine is
still present.
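A small made-up example of why blindly re-widening an access is risky
without that information:

struct packed {
    unsigned char a;    /* this function is only entitled to touch 'a' */
    unsigned char b;    /* 'b' may be written concurrently by someone else */
};

void set_a(struct packed *p) {
    p->a = 1;   /* A 1-byte store. Widening it to a 16-bit store would also
                   write 'b', which isn't ours to touch, so the widening is
                   only legal if we can prove the access was 16 bits wide to
                   begin with (or that 'b' is dead). */
}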

At any rate, seems like a backend problem. :)

John.

Re: [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Vassil Vassilev via cfe-dev
On Fri, May 29, 2020 at 5:59 PM John McCall <[hidden email]> wrote:

> On 29 May 2020, at 19:29, Bill Wendling wrote:
> > On Fri, May 29, 2020 at 4:00 PM Richard Smith <[hidden email]>
> > wrote:
> >> On Fri, 29 May 2020 at 11:06, John McCall via llvm-dev
> >> <[hidden email]> wrote:
> >>> On 28 May 2020, at 18:42, Bill Wendling wrote:
> >>>
> >>>> On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev
> >>>> <[hidden email]> wrote:
> >>>>>
> >>>>> At least in this test-case, the "bitfield" part of this seems to
> >>>>> be a
> >>>>> distraction. As Eli notes, Clang has lowered the function to LLVM
> >>>>> IR
> >>>>> containing consistent i16 operations. Despite that being a
> >>>>> different
> >>>>> choice from GCC, it should still be correct and consistent.
> >>>>>
> >>>> I suspect that this is more prevalent with bitfields as they're
> >>>> more
> >>>> likely to have the load / bitwise op / store operations done on
> >>>> them,
> >>>> resulting in an access type that can be shortened. But yes, it's
> >>>> not
> >>>> specific to just bitfields.
> >>>>
> >>>> I'm more interested in consistency, to be honest. If the loads and
> >>>> stores for the bitfields (or other such shorten-able objects) were
> >>>> the
> >>>> same, then we wouldn't run into the store-to-load forwarding issue
> >>>> on
> >>>> x86 (I don't know about other platforms, but suspect that
> >>>> consistency
> >>>> wouldn't hurt). I liked Arthur's idea of accessing the object using
> >>>> the type size the bitfield was defined with (i8, i16, i256). It
> >>>> would
> >>>> help with improving the heuristic. The downside is that it could
> >>>> lead
> >>>> to un-optimal code, but that's the situation we have now, so...
> >>>
> >>> Okay, but what concretely are you suggesting here?  Clang IRGen is
> >>> emitting accesses with consistent sizes, and LLVM is making them
> >>> inconsistent.  Are you just asking Clang to emit smaller accesses
> >>> in the hope that LLVM won’t mess them up?
> >>
> >>
> >> I don't think this has anything to do with bit-fields or Clang's
> >> lowering. This seems to "just" be an optimizer issue (one that
> >> happens to show up for bit-field access patterns, but also affects
> >> other cases). Much-reduced testcase:
> >>
> >> unsigned short n;
> >> void set() { n |= 1; }
>
> To be clear, I agree with you that this is not a frontend problem.
> I asked because the question seemed to lead towards changing the width
> that Clang uses for accesses.
>
> > We're a bit (heh) tied in what we can do. LLVM's IR doesn't convey
> > that we're working with a bitfield, but as many pointed out it's not
> > bitfield-specific (I didn't mean to imply that it was *only* bitfields
> > affected by this with my initial post, that was just where it showed
> > up). The front-end generates code that's perfectly valid for the
> > various accesses, and changing what it generates doesn't seem like it
> > would be worth the hassle. (E.g. using i1, i3, etc. for the bitfields
> > (note I'm *not* suggesting we do this, I'm just using as an example).)
> >
> > So what we're left with is trying to ensure that we don't generate
> > sub-optimal code via the DAG combiner and other passes. Some ideas off
> > the top of my head:
> >
> > 1. Avoid generating byte-sized bitwise operations, because as Richard
> > pointed out they can lead to pessimizations.
> >
> > 2. Come up with a better heuristic to determine if we're going to
> > generate inconsistent load/store accesses to the same object.
> >
> > 3. Transform the machine instructions before register allocation to
> > ensure that all accesses to the same address are the same size, or at
> > least that the stores are of equal or greater size than the loads.
> >
> > Option (1) may be the simplest and could potentially solve our
> > immediate issue.
> >
> > Option (2) is difficult because the DAG combiner operates on a single
> > BB at a time and doesn't have visibility into past DAGs it modified.
> >
> > Option (3) allows us to retain the current behavior and gives us
> > visibility into the whole function, not just an individual BB.
> >
> > I kind of like option 3, but I'm open to other suggestions.
>
> What’s the concrete suggestion for (3), that we analyze the uses
> of a single address and try to re-widen the access to make it use
> a consistent width?  Widening accesses is difficult in general,
> unless the information is still present that it was originally
> wider before DAG combine.
>
Alternatively, we could move the shortening logic out of DAG combine
and into a separate pass. I'll come up with a more concrete proposal.
In the meantime, some flags were added that may get us past the
initial issue. (Crossing fingers that they don't pessimize other
code.)

> At any rate, seems like a backend problem. :)

Yes. :)

-bw