The intrinsics headers (especially avx512) are too big. What to do about it?

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
Hi,

on Windows, C++ system headers like e.g. <string> end up pulling in intrin.h. clang's intrinsic headers are very large.

If you take a cc file containing just `#include <string>` and run that through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the test.I file generated by clang-cl is 1.7MB while the one created by cl.exe is 0.7MB. This is solely due to clang's intrin.h expanding to way more stuff.

The biggest offenders are avx512vlintrin.h, avx512fintrin.h, avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only included avx headers if __AVX512F__ etc was defined. This is currently never the case in practice. Later (r243394 r243402 r243406 and more), the avx headers got much bigger.

Parsing all this code takes time -- removing the avx512 includes from immintrin.h locally makes compiling a file containing just the <string> header 0.25s faster (!), and building all of v8 gets 6% faster, just from not including the avx512 headers.

What can we do about this? Since avx512 is new, maybe they could be not part of immintrin.h? Or we could re-introduce

  #if !__has_feature(modules) && defined(__AVX512BW__)

include guards in immintrin.h. This would give us a speed win immediately without drawbacks as far as I can see, but in a few years when people start compiling with /arch:avx512 that'd go away again. (Then again, by then, modules are hopefully commonly available. cl.exe doesn't have an /arch:avx512 switch yet, so this is probably several years away from happening.)

Comments? Is it feasible to require that people who want to use avx512 include a new header instead of immintrin.h? Else, does anyone have a better idea other than reintroducing the #ifdefs, augmented with the module check?

Thanks,
Nico

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
I think our approach to the mmintrin headers doesn't scale. We're
creating the windows.h of intel intrinsics in immintrin.h.

When they were first created, a large percentage of the intrinsics
were mapping from hyper-specific instruction names to generic vector
math operations like this:

static __inline__ __m128 __DEFAULT_FN_ATTRS
_mm_add_ps(__m128 __a, __m128 __b) { return __a + __b; }

This made a lot of sense at the time, because we could just write come
C and not worry about teaching clang and LLVM about every Intel
intrinsic under the sun.

From looking at the avx512 headers, it seems this is no longer the
case. Now we are mostly mapping from _mm_* intrinsic to
__builtin_ia32_ function.

If this continues to be the case going forward, then I think we should
make the _mm* intrinsics into compiler builtins like the
__builtin_ia32 functions. It also avoids the need for those ugly
forwarding macros for intrinsics that take arguments that must be
constant.

The _mm_* builtins should only be available if the user includes
<immintrin.h>. We can replace the contents of that file with a pragma
that just says "enable all intel intrinsics".

On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
<[hidden email]> wrote:

> Hi,
>
> on Windows, C++ system headers like e.g. <string> end up pulling in
> intrin.h. clang's intrinsic headers are very large.
>
> If you take a cc file containing just `#include <string>` and run that
> through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the
> test.I file generated by clang-cl is 1.7MB while the one created by cl.exe
> is 0.7MB. This is solely due to clang's intrin.h expanding to way more
> stuff.
>
> The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
> avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only
> included avx headers if __AVX512F__ etc was defined. This is currently never
> the case in practice. Later (r243394 r243402 r243406 and more), the avx
> headers got much bigger.
>
> Parsing all this code takes time -- removing the avx512 includes from
> immintrin.h locally makes compiling a file containing just the <string>
> header 0.25s faster (!), and building all of v8 gets 6% faster, just from
> not including the avx512 headers.
>
> What can we do about this? Since avx512 is new, maybe they could be not part
> of immintrin.h? Or we could re-introduce
>
>   #if !__has_feature(modules) && defined(__AVX512BW__)
>
> include guards in immintrin.h. This would give us a speed win immediately
> without drawbacks as far as I can see, but in a few years when people start
> compiling with /arch:avx512 that'd go away again. (Then again, by then,
> modules are hopefully commonly available. cl.exe doesn't have an
> /arch:avx512 switch yet, so this is probably several years away from
> happening.)
>
> Comments? Is it feasible to require that people who want to use avx512
> include a new header instead of immintrin.h? Else, does anyone have a better
> idea other than reintroducing the #ifdefs, augmented with the module check?
>
> Thanks,
> Nico
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
----- Original Message -----

> From: "Reid Kleckner via cfe-dev" <[hidden email]>
> To: "Nico Weber" <[hidden email]>, "David Majnemer" <[hidden email]>
> Cc: "Elena Demikhovsky" <[hidden email]>, "cfe-dev" <[hidden email]>, "asaf badouh"
> <[hidden email]>, "Michael zuckerman" <[hidden email]>
> Sent: Thursday, May 12, 2016 6:10:06 PM
> Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?
>
> I think our approach to the mmintrin headers doesn't scale. We're
> creating the windows.h of intel intrinsics in immintrin.h.
>
> When they were first created, a large percentage of the intrinsics
> were mapping from hyper-specific instruction names to generic vector
> math operations like this:
>
> static __inline__ __m128 __DEFAULT_FN_ATTRS
> _mm_add_ps(__m128 __a, __m128 __b) { return __a + __b; }
>
> This made a lot of sense at the time, because we could just write
> come
> C and not worry about teaching clang and LLVM about every Intel
> intrinsic under the sun.
>
> From looking at the avx512 headers, it seems this is no longer the
> case. Now we are mostly mapping from _mm_* intrinsic to
> __builtin_ia32_ function.
>
> If this continues to be the case going forward,

Indeed. It is not clear to me, however, that this situation is desirable. We had a general policy that our intrinsics headers should generate generic IR whenever possible, and if we've strayed from that, we should discuss that first.

 -Hal

> then I think we
> should
> make the _mm* intrinsics into compiler builtins like the
> __builtin_ia32 functions. It also avoids the need for those ugly
> forwarding macros for intrinsics that take arguments that must be
> constant.
>
> The _mm_* builtins should only be available if the user includes
> <immintrin.h>. We can replace the contents of that file with a pragma
> that just says "enable all intel intrinsics".
>
> On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> <[hidden email]> wrote:
> > Hi,
> >
> > on Windows, C++ system headers like e.g. <string> end up pulling in
> > intrin.h. clang's intrinsic headers are very large.
> >
> > If you take a cc file containing just `#include <string>` and run
> > that
> > through the preprocessor with `cl /P test.cc` and `clang-cl /P
> > test.cc`, the
> > test.I file generated by clang-cl is 1.7MB while the one created by
> > cl.exe
> > is 0.7MB. This is solely due to clang's intrin.h expanding to way
> > more
> > stuff.
> >
> > The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
> > avx512vlbwintrin.h which add up to 657kB already. Before r239883,
> > we only
> > included avx headers if __AVX512F__ etc was defined. This is
> > currently never
> > the case in practice. Later (r243394 r243402 r243406 and more), the
> > avx
> > headers got much bigger.
> >
> > Parsing all this code takes time -- removing the avx512 includes
> > from
> > immintrin.h locally makes compiling a file containing just the
> > <string>
> > header 0.25s faster (!), and building all of v8 gets 6% faster,
> > just from
> > not including the avx512 headers.
> >
> > What can we do about this? Since avx512 is new, maybe they could be
> > not part
> > of immintrin.h? Or we could re-introduce
> >
> >   #if !__has_feature(modules) && defined(__AVX512BW__)
> >
> > include guards in immintrin.h. This would give us a speed win
> > immediately
> > without drawbacks as far as I can see, but in a few years when
> > people start
> > compiling with /arch:avx512 that'd go away again. (Then again, by
> > then,
> > modules are hopefully commonly available. cl.exe doesn't have an
> > /arch:avx512 switch yet, so this is probably several years away
> > from
> > happening.)
> >
> > Comments? Is it feasible to require that people who want to use
> > avx512
> > include a new header instead of immintrin.h? Else, does anyone have
> > a better
> > idea other than reintroducing the #ifdefs, augmented with the
> > module check?
> >
> > Thanks,
> > Nico
> >
> > _______________________________________________
> > cfe-dev mailing list
> > [hidden email]
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
> >
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev


> -----Original Message-----
> From: cfe-dev [mailto:[hidden email]] On Behalf Of Hal
> Finkel via cfe-dev
> Sent: Thursday, May 12, 2016 4:14 PM
> To: Reid Kleckner
> Cc: asaf badouh; David Majnemer; Michael zuckerman; cfe-dev; Elena
> Demikhovsky
> Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too
> big. What to do about it?
>
> ----- Original Message -----
> > From: "Reid Kleckner via cfe-dev" <[hidden email]>
> > To: "Nico Weber" <[hidden email]>, "David Majnemer"
> <[hidden email]>
> > Cc: "Elena Demikhovsky" <[hidden email]>, "cfe-dev" <cfe-
> [hidden email]>, "asaf badouh"
> > <[hidden email]>, "Michael zuckerman"
> <[hidden email]>
> > Sent: Thursday, May 12, 2016 6:10:06 PM
> > Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are
> too big. What to do about it?
> >
> > I think our approach to the mmintrin headers doesn't scale. We're
> > creating the windows.h of intel intrinsics in immintrin.h.
> >
> > When they were first created, a large percentage of the intrinsics
> > were mapping from hyper-specific instruction names to generic vector
> > math operations like this:
> >
> > static __inline__ __m128 __DEFAULT_FN_ATTRS
> > _mm_add_ps(__m128 __a, __m128 __b) { return __a + __b; }
> >
> > This made a lot of sense at the time, because we could just write
> > come
> > C and not worry about teaching clang and LLVM about every Intel
> > intrinsic under the sun.
> >
> > From looking at the avx512 headers, it seems this is no longer the
> > case. Now we are mostly mapping from _mm_* intrinsic to
> > __builtin_ia32_ function.
> >
> > If this continues to be the case going forward,
>
> Indeed. It is not clear to me, however, that this situation is desirable.
> We had a general policy that our intrinsics headers should generate
> generic IR whenever possible, and if we've strayed from that, we should
> discuss that first.

If you look at the history of some of the headers, they used to map the
intrinsic function names to builtins.  As codegen got smarter over time,
many of these were converted to generic C, and the builtins could go away.
A *lot* of the intrinsics didn't start out as generic C.

(I personally spent probably months of my life merging the evolution of
the intrinsics and builtins and tablegen instruction definitions for a
variety of X86 instruction subsets into our local tree.)

>
>  -Hal
>
> > then I think we
> > should
> > make the _mm* intrinsics into compiler builtins like the
> > __builtin_ia32 functions. It also avoids the need for those ugly
> > forwarding macros for intrinsics that take arguments that must be
> > constant.

If you do that, then there is less motivation to make codegen smarter.
--paulr

> >
> > The _mm_* builtins should only be available if the user includes
> > <immintrin.h>. We can replace the contents of that file with a pragma
> > that just says "enable all intel intrinsics".
> >
> > On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> > <[hidden email]> wrote:
> > > Hi,
> > >
> > > on Windows, C++ system headers like e.g. <string> end up pulling in
> > > intrin.h. clang's intrinsic headers are very large.
> > >
> > > If you take a cc file containing just `#include <string>` and run
> > > that
> > > through the preprocessor with `cl /P test.cc` and `clang-cl /P
> > > test.cc`, the
> > > test.I file generated by clang-cl is 1.7MB while the one created by
> > > cl.exe
> > > is 0.7MB. This is solely due to clang's intrin.h expanding to way
> > > more
> > > stuff.
> > >
> > > The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
> > > avx512vlbwintrin.h which add up to 657kB already. Before r239883,
> > > we only
> > > included avx headers if __AVX512F__ etc was defined. This is
> > > currently never
> > > the case in practice. Later (r243394 r243402 r243406 and more), the
> > > avx
> > > headers got much bigger.
> > >
> > > Parsing all this code takes time -- removing the avx512 includes
> > > from
> > > immintrin.h locally makes compiling a file containing just the
> > > <string>
> > > header 0.25s faster (!), and building all of v8 gets 6% faster,
> > > just from
> > > not including the avx512 headers.
> > >
> > > What can we do about this? Since avx512 is new, maybe they could be
> > > not part
> > > of immintrin.h? Or we could re-introduce
> > >
> > >   #if !__has_feature(modules) && defined(__AVX512BW__)
> > >
> > > include guards in immintrin.h. This would give us a speed win
> > > immediately
> > > without drawbacks as far as I can see, but in a few years when
> > > people start
> > > compiling with /arch:avx512 that'd go away again. (Then again, by
> > > then,
> > > modules are hopefully commonly available. cl.exe doesn't have an
> > > /arch:avx512 switch yet, so this is probably several years away
> > > from
> > > happening.)
> > >
> > > Comments? Is it feasible to require that people who want to use
> > > avx512
> > > include a new header instead of immintrin.h? Else, does anyone have
> > > a better
> > > idea other than reintroducing the #ifdefs, augmented with the
> > > module check?
> > >
> > > Thanks,
> > > Nico
> > >
> > > _______________________________________________
> > > cfe-dev mailing list
> > > [hidden email]
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
> > >
> > _______________________________________________
> > cfe-dev mailing list
> > [hidden email]
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
In reply to this post by Renato Golin via cfe-dev
Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang's? (e.g. maybe the windows ones don't cover all the ISA's that clang's do).

-- Sean Silva

On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev <[hidden email]> wrote:
Hi,

on Windows, C++ system headers like e.g. <string> end up pulling in intrin.h. clang's intrinsic headers are very large.

If you take a cc file containing just `#include <string>` and run that through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the test.I file generated by clang-cl is 1.7MB while the one created by cl.exe is 0.7MB. This is solely due to clang's intrin.h expanding to way more stuff.

The biggest offenders are avx512vlintrin.h, avx512fintrin.h, avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only included avx headers if __AVX512F__ etc was defined. This is currently never the case in practice. Later (r243394 r243402 r243406 and more), the avx headers got much bigger.

Parsing all this code takes time -- removing the avx512 includes from immintrin.h locally makes compiling a file containing just the <string> header 0.25s faster (!), and building all of v8 gets 6% faster, just from not including the avx512 headers.

What can we do about this? Since avx512 is new, maybe they could be not part of immintrin.h? Or we could re-introduce

  #if !__has_feature(modules) && defined(__AVX512BW__)

include guards in immintrin.h. This would give us a speed win immediately without drawbacks as far as I can see, but in a few years when people start compiling with /arch:avx512 that'd go away again. (Then again, by then, modules are hopefully commonly available. cl.exe doesn't have an /arch:avx512 switch yet, so this is probably several years away from happening.)

Comments? Is it feasible to require that people who want to use avx512 include a new header instead of immintrin.h? Else, does anyone have a better idea other than reintroducing the #ifdefs, augmented with the module check?

Thanks,
Nico

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev



_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
This old discussion may cover some of this as well? I also thought I
remember something more recent around this..
http://clang-developers.42468.n3.nabble.com/PROPOSAL-Reintroduce-guards-for-Intel-intrinsic-headers-td4046979.html

On Sat, May 14, 2016 at 8:59 AM, Sean Silva via cfe-dev
<[hidden email]> wrote:

> Sorry if this is a stupid question, but do the windows intrinsic headers
> actually contain the same contents as clang's? (e.g. maybe the windows ones
> don't cover all the ISA's that clang's do).
>
> -- Sean Silva
>
> On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> <[hidden email]> wrote:
>>
>> Hi,
>>
>> on Windows, C++ system headers like e.g. <string> end up pulling in
>> intrin.h. clang's intrinsic headers are very large.
>>
>> If you take a cc file containing just `#include <string>` and run that
>> through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the
>> test.I file generated by clang-cl is 1.7MB while the one created by cl.exe
>> is 0.7MB. This is solely due to clang's intrin.h expanding to way more
>> stuff.
>>
>> The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
>> avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only
>> included avx headers if __AVX512F__ etc was defined. This is currently never
>> the case in practice. Later (r243394 r243402 r243406 and more), the avx
>> headers got much bigger.
>>
>> Parsing all this code takes time -- removing the avx512 includes from
>> immintrin.h locally makes compiling a file containing just the <string>
>> header 0.25s faster (!), and building all of v8 gets 6% faster, just from
>> not including the avx512 headers.
>>
>> What can we do about this? Since avx512 is new, maybe they could be not
>> part of immintrin.h? Or we could re-introduce
>>
>>   #if !__has_feature(modules) && defined(__AVX512BW__)
>>
>> include guards in immintrin.h. This would give us a speed win immediately
>> without drawbacks as far as I can see, but in a few years when people start
>> compiling with /arch:avx512 that'd go away again. (Then again, by then,
>> modules are hopefully commonly available. cl.exe doesn't have an
>> /arch:avx512 switch yet, so this is probably several years away from
>> happening.)
>>
>> Comments? Is it feasible to require that people who want to use avx512
>> include a new header instead of immintrin.h? Else, does anyone have a better
>> idea other than reintroducing the #ifdefs, augmented with the module check?
>>
>> Thanks,
>> Nico
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
A couple of points:

1) Definitely agree with Hal that these intrinsics really shouldn't be mapping to builtins. This is something I'm pretty frustrated about the direction of AVX-512 support in Clang and LLVM. We really need generic vector IR to lower cleanly into these instructions.

2) Reid, you specifically advocated for not having the set of intrinsics available based on particular feature sets. ;] But I agree there seems to be a scalability problem here.

3) I think a lot of the scalability problem is that very basic, non-vector code patterns, require Intrin.h on Windows and pull in *ALL* the vector intrinsics. =/ It'd be really good to try to fix *that*.

4) AVX-512 has made this *incredibly* worse than any previous ISA extension. It used to be we had the product of (operation * operand-type) intrinsics. This is already pretty bad. Now we have (operation * operand-type * 4) because we have 4 masking variants. So it seems Intel has just made a really unfortunate API choice by forcing every permutation of these things to get a different name and thus a different intrinsic in a  header file. =/ And sadly that too is probably too late to walk back.


I wonder if we could at least initially address this by providing very limited "builtin" modules for truly builtin headers that don't touch any system headers, and actually *always* use the modules approach for these headers, right out of the box.

On Fri, May 13, 2016 at 7:32 PM C Bergström <[hidden email]> wrote:
This old discussion may cover some of this as well? I also thought I
remember something more recent around this..
http://clang-developers.42468.n3.nabble.com/PROPOSAL-Reintroduce-guards-for-Intel-intrinsic-headers-td4046979.html

On Sat, May 14, 2016 at 8:59 AM, Sean Silva via cfe-dev
<[hidden email]> wrote:
> Sorry if this is a stupid question, but do the windows intrinsic headers
> actually contain the same contents as clang's? (e.g. maybe the windows ones
> don't cover all the ISA's that clang's do).
>
> -- Sean Silva
>
> On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> <[hidden email]> wrote:
>>
>> Hi,
>>
>> on Windows, C++ system headers like e.g. <string> end up pulling in
>> intrin.h. clang's intrinsic headers are very large.
>>
>> If you take a cc file containing just `#include <string>` and run that
>> through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the
>> test.I file generated by clang-cl is 1.7MB while the one created by cl.exe
>> is 0.7MB. This is solely due to clang's intrin.h expanding to way more
>> stuff.
>>
>> The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
>> avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only
>> included avx headers if __AVX512F__ etc was defined. This is currently never
>> the case in practice. Later (r243394 r243402 r243406 and more), the avx
>> headers got much bigger.
>>
>> Parsing all this code takes time -- removing the avx512 includes from
>> immintrin.h locally makes compiling a file containing just the <string>
>> header 0.25s faster (!), and building all of v8 gets 6% faster, just from
>> not including the avx512 headers.
>>
>> What can we do about this? Since avx512 is new, maybe they could be not
>> part of immintrin.h? Or we could re-introduce
>>
>>   #if !__has_feature(modules) && defined(__AVX512BW__)
>>
>> include guards in immintrin.h. This would give us a speed win immediately
>> without drawbacks as far as I can see, but in a few years when people start
>> compiling with /arch:avx512 that'd go away again. (Then again, by then,
>> modules are hopefully commonly available. cl.exe doesn't have an
>> /arch:avx512 switch yet, so this is probably several years away from
>> happening.)
>>
>> Comments? Is it feasible to require that people who want to use avx512
>> include a new header instead of immintrin.h? Else, does anyone have a better
>> idea other than reintroducing the #ifdefs, augmented with the module check?
>>
>> Thanks,
>> Nico
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang's?

As far as I can tell (from looking at https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing to clang's headers), yes. MSVC doesn't have the avx512 intrinsics yet, but that's probably only because they're new.

I had hoped that I could not include all of x86intrin.h in intrin.h, but that page says "The intrin.h header includes both immintrin.h and ammintrin.h for simplicity."

> This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the includes back behind ifdefs. The thread doesn't really mention the reasons, and since clang doesn't implement full multiversioning yet I'm unable to guess at the reasons -- but it sounds like people don't want to re-add the arch ifdefs. Ok, I'll send a patch to add them back ifdef _MSC_VER only -- there should be no drawback to that, and it stops the bleeding in the case where it's worst (with Microsoft headers).

Going forward, we'll have to teach clang more about at least some intrinsics for `#pragma intrin` (PR19898), which might end up helping for this too.

I also reached out to STL at Microsoft, he said he'll try to look into including an "intrin0.h" header in the next major version of MSVC which would only declare a small set of intrinsics instead of all of them (no promises, of course).

People working on avx512, I'd be curious to hear your perspective on this, as well as your reply to Chandler's points.

Thanks,
Nico

On Sat, May 14, 2016 at 2:04 AM, Chandler Carruth via cfe-dev <[hidden email]> wrote:
A couple of points:

1) Definitely agree with Hal that these intrinsics really shouldn't be mapping to builtins. This is something I'm pretty frustrated about the direction of AVX-512 support in Clang and LLVM. We really need generic vector IR to lower cleanly into these instructions.

2) Reid, you specifically advocated for not having the set of intrinsics available based on particular feature sets. ;] But I agree there seems to be a scalability problem here.

3) I think a lot of the scalability problem is that very basic, non-vector code patterns, require Intrin.h on Windows and pull in *ALL* the vector intrinsics. =/ It'd be really good to try to fix *that*.

4) AVX-512 has made this *incredibly* worse than any previous ISA extension. It used to be we had the product of (operation * operand-type) intrinsics. This is already pretty bad. Now we have (operation * operand-type * 4) because we have 4 masking variants. So it seems Intel has just made a really unfortunate API choice by forcing every permutation of these things to get a different name and thus a different intrinsic in a  header file. =/ And sadly that too is probably too late to walk back.


I wonder if we could at least initially address this by providing very limited "builtin" modules for truly builtin headers that don't touch any system headers, and actually *always* use the modules approach for these headers, right out of the box.

On Fri, May 13, 2016 at 7:32 PM C Bergström <[hidden email]> wrote:
This old discussion may cover some of this as well? I also thought I
remember something more recent around this..
http://clang-developers.42468.n3.nabble.com/PROPOSAL-Reintroduce-guards-for-Intel-intrinsic-headers-td4046979.html

On Sat, May 14, 2016 at 8:59 AM, Sean Silva via cfe-dev
<[hidden email]> wrote:
> Sorry if this is a stupid question, but do the windows intrinsic headers
> actually contain the same contents as clang's? (e.g. maybe the windows ones
> don't cover all the ISA's that clang's do).
>
> -- Sean Silva
>
> On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> <[hidden email]> wrote:
>>
>> Hi,
>>
>> on Windows, C++ system headers like e.g. <string> end up pulling in
>> intrin.h. clang's intrinsic headers are very large.
>>
>> If you take a cc file containing just `#include <string>` and run that
>> through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the
>> test.I file generated by clang-cl is 1.7MB while the one created by cl.exe
>> is 0.7MB. This is solely due to clang's intrin.h expanding to way more
>> stuff.
>>
>> The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
>> avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only
>> included avx headers if __AVX512F__ etc was defined. This is currently never
>> the case in practice. Later (r243394 r243402 r243406 and more), the avx
>> headers got much bigger.
>>
>> Parsing all this code takes time -- removing the avx512 includes from
>> immintrin.h locally makes compiling a file containing just the <string>
>> header 0.25s faster (!), and building all of v8 gets 6% faster, just from
>> not including the avx512 headers.
>>
>> What can we do about this? Since avx512 is new, maybe they could be not
>> part of immintrin.h? Or we could re-introduce
>>
>>   #if !__has_feature(modules) && defined(__AVX512BW__)
>>
>> include guards in immintrin.h. This would give us a speed win immediately
>> without drawbacks as far as I can see, but in a few years when people start
>> compiling with /arch:avx512 that'd go away again. (Then again, by then,
>> modules are hopefully commonly available. cl.exe doesn't have an
>> /arch:avx512 switch yet, so this is probably several years away from
>> happening.)
>>
>> Comments? Is it feasible to require that people who want to use avx512
>> include a new header instead of immintrin.h? Else, does anyone have a better
>> idea other than reintroducing the #ifdefs, augmented with the module check?
>>
>> Thanks,
>> Nico
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev



_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev


On Mon, May 16, 2016 at 9:43 AM Nico Weber via cfe-dev <[hidden email]> wrote:
Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang's?

As far as I can tell (from looking at https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing to clang's headers), yes. MSVC doesn't have the avx512 intrinsics yet, but that's probably only because they're new.

I had hoped that I could not include all of x86intrin.h in intrin.h, but that page says "The intrin.h header includes both immintrin.h and ammintrin.h for simplicity."

> This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the includes back behind ifdefs. The thread doesn't really mention the reasons, and since clang doesn't implement full multiversioning yet I'm unable to guess at the reasons -- but it sounds like people don't want to re-add the arch ifdefs. Ok, I'll send a patch to add them back ifdef _MSC_VER only -- there should be no drawback to that, and it stops the bleeding in the case where it's worst (with Microsoft headers).


It implements enough multiversioning to make it worthwhile to have them, I don't know what you're confused about here. Why do we want to make this platform specific? If you wanted to match MSVC I guess you could just turn them off for windows as an alternate solution?

-eric
 
Going forward, we'll have to teach clang more about at least some intrinsics for `#pragma intrin` (PR19898), which might end up helping for this too.

I also reached out to STL at Microsoft, he said he'll try to look into including an "intrin0.h" header in the next major version of MSVC which would only declare a small set of intrinsics instead of all of them (no promises, of course).

People working on avx512, I'd be curious to hear your perspective on this, as well as your reply to Chandler's points.

Thanks,
Nico

On Sat, May 14, 2016 at 2:04 AM, Chandler Carruth via cfe-dev <[hidden email]> wrote:
A couple of points:

1) Definitely agree with Hal that these intrinsics really shouldn't be mapping to builtins. This is something I'm pretty frustrated about the direction of AVX-512 support in Clang and LLVM. We really need generic vector IR to lower cleanly into these instructions.

2) Reid, you specifically advocated for not having the set of intrinsics available based on particular feature sets. ;] But I agree there seems to be a scalability problem here.

3) I think a lot of the scalability problem is that very basic, non-vector code patterns, require Intrin.h on Windows and pull in *ALL* the vector intrinsics. =/ It'd be really good to try to fix *that*.

4) AVX-512 has made this *incredibly* worse than any previous ISA extension. It used to be we had the product of (operation * operand-type) intrinsics. This is already pretty bad. Now we have (operation * operand-type * 4) because we have 4 masking variants. So it seems Intel has just made a really unfortunate API choice by forcing every permutation of these things to get a different name and thus a different intrinsic in a  header file. =/ And sadly that too is probably too late to walk back.


I wonder if we could at least initially address this by providing very limited "builtin" modules for truly builtin headers that don't touch any system headers, and actually *always* use the modules approach for these headers, right out of the box.

On Fri, May 13, 2016 at 7:32 PM C Bergström <[hidden email]> wrote:
This old discussion may cover some of this as well? I also thought I
remember something more recent around this..
http://clang-developers.42468.n3.nabble.com/PROPOSAL-Reintroduce-guards-for-Intel-intrinsic-headers-td4046979.html

On Sat, May 14, 2016 at 8:59 AM, Sean Silva via cfe-dev
<[hidden email]> wrote:
> Sorry if this is a stupid question, but do the windows intrinsic headers
> actually contain the same contents as clang's? (e.g. maybe the windows ones
> don't cover all the ISA's that clang's do).
>
> -- Sean Silva
>
> On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> <[hidden email]> wrote:
>>
>> Hi,
>>
>> on Windows, C++ system headers like e.g. <string> end up pulling in
>> intrin.h. clang's intrinsic headers are very large.
>>
>> If you take a cc file containing just `#include <string>` and run that
>> through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the
>> test.I file generated by clang-cl is 1.7MB while the one created by cl.exe
>> is 0.7MB. This is solely due to clang's intrin.h expanding to way more
>> stuff.
>>
>> The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
>> avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only
>> included avx headers if __AVX512F__ etc was defined. This is currently never
>> the case in practice. Later (r243394 r243402 r243406 and more), the avx
>> headers got much bigger.
>>
>> Parsing all this code takes time -- removing the avx512 includes from
>> immintrin.h locally makes compiling a file containing just the <string>
>> header 0.25s faster (!), and building all of v8 gets 6% faster, just from
>> not including the avx512 headers.
>>
>> What can we do about this? Since avx512 is new, maybe they could be not
>> part of immintrin.h? Or we could re-introduce
>>
>>   #if !__has_feature(modules) && defined(__AVX512BW__)
>>
>> include guards in immintrin.h. This would give us a speed win immediately
>> without drawbacks as far as I can see, but in a few years when people start
>> compiling with /arch:avx512 that'd go away again. (Then again, by then,
>> modules are hopefully commonly available. cl.exe doesn't have an
>> /arch:avx512 switch yet, so this is probably several years away from
>> happening.)
>>
>> Comments? Is it feasible to require that people who want to use avx512
>> include a new header instead of immintrin.h? Else, does anyone have a better
>> idea other than reintroducing the #ifdefs, augmented with the module check?
>>
>> Thanks,
>> Nico
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
On Mon, May 16, 2016 at 12:48 PM, Eric Christopher <[hidden email]> wrote:


On Mon, May 16, 2016 at 9:43 AM Nico Weber via cfe-dev <[hidden email]> wrote:
Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang's?

As far as I can tell (from looking at https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing to clang's headers), yes. MSVC doesn't have the avx512 intrinsics yet, but that's probably only because they're new.

I had hoped that I could not include all of x86intrin.h in intrin.h, but that page says "The intrin.h header includes both immintrin.h and ammintrin.h for simplicity."

> This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the includes back behind ifdefs. The thread doesn't really mention the reasons, and since clang doesn't implement full multiversioning yet I'm unable to guess at the reasons -- but it sounds like people don't want to re-add the arch ifdefs. Ok, I'll send a patch to add them back ifdef _MSC_VER only -- there should be no drawback to that, and it stops the bleeding in the case where it's worst (with Microsoft headers).


It implements enough multiversioning to make it worthwhile to have them, I don't know what you're confused about here.

I didn't mean to question this point, I just don't understand it. Can you give an example where it's useful? I'm sure there is one, I just can't think of one.
 
Why do we want to make this platform specific? If you wanted to match MSVC I guess you could just turn them off for windows as an alternate solution?

Yes, that's what I meant with the _MSC_VER check.
 

-eric
 
Going forward, we'll have to teach clang more about at least some intrinsics for `#pragma intrin` (PR19898), which might end up helping for this too.

I also reached out to STL at Microsoft, he said he'll try to look into including an "intrin0.h" header in the next major version of MSVC which would only declare a small set of intrinsics instead of all of them (no promises, of course).

People working on avx512, I'd be curious to hear your perspective on this, as well as your reply to Chandler's points.

Thanks,
Nico

On Sat, May 14, 2016 at 2:04 AM, Chandler Carruth via cfe-dev <[hidden email]> wrote:
A couple of points:

1) Definitely agree with Hal that these intrinsics really shouldn't be mapping to builtins. This is something I'm pretty frustrated about the direction of AVX-512 support in Clang and LLVM. We really need generic vector IR to lower cleanly into these instructions.

2) Reid, you specifically advocated for not having the set of intrinsics available based on particular feature sets. ;] But I agree there seems to be a scalability problem here.

3) I think a lot of the scalability problem is that very basic, non-vector code patterns, require Intrin.h on Windows and pull in *ALL* the vector intrinsics. =/ It'd be really good to try to fix *that*.

4) AVX-512 has made this *incredibly* worse than any previous ISA extension. It used to be we had the product of (operation * operand-type) intrinsics. This is already pretty bad. Now we have (operation * operand-type * 4) because we have 4 masking variants. So it seems Intel has just made a really unfortunate API choice by forcing every permutation of these things to get a different name and thus a different intrinsic in a  header file. =/ And sadly that too is probably too late to walk back.


I wonder if we could at least initially address this by providing very limited "builtin" modules for truly builtin headers that don't touch any system headers, and actually *always* use the modules approach for these headers, right out of the box.

On Fri, May 13, 2016 at 7:32 PM C Bergström <[hidden email]> wrote:
This old discussion may cover some of this as well? I also thought I
remember something more recent around this..
http://clang-developers.42468.n3.nabble.com/PROPOSAL-Reintroduce-guards-for-Intel-intrinsic-headers-td4046979.html

On Sat, May 14, 2016 at 8:59 AM, Sean Silva via cfe-dev
<[hidden email]> wrote:
> Sorry if this is a stupid question, but do the windows intrinsic headers
> actually contain the same contents as clang's? (e.g. maybe the windows ones
> don't cover all the ISA's that clang's do).
>
> -- Sean Silva
>
> On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> <[hidden email]> wrote:
>>
>> Hi,
>>
>> on Windows, C++ system headers like e.g. <string> end up pulling in
>> intrin.h. clang's intrinsic headers are very large.
>>
>> If you take a cc file containing just `#include <string>` and run that
>> through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the
>> test.I file generated by clang-cl is 1.7MB while the one created by cl.exe
>> is 0.7MB. This is solely due to clang's intrin.h expanding to way more
>> stuff.
>>
>> The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
>> avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only
>> included avx headers if __AVX512F__ etc was defined. This is currently never
>> the case in practice. Later (r243394 r243402 r243406 and more), the avx
>> headers got much bigger.
>>
>> Parsing all this code takes time -- removing the avx512 includes from
>> immintrin.h locally makes compiling a file containing just the <string>
>> header 0.25s faster (!), and building all of v8 gets 6% faster, just from
>> not including the avx512 headers.
>>
>> What can we do about this? Since avx512 is new, maybe they could be not
>> part of immintrin.h? Or we could re-introduce
>>
>>   #if !__has_feature(modules) && defined(__AVX512BW__)
>>
>> include guards in immintrin.h. This would give us a speed win immediately
>> without drawbacks as far as I can see, but in a few years when people start
>> compiling with /arch:avx512 that'd go away again. (Then again, by then,
>> modules are hopefully commonly available. cl.exe doesn't have an
>> /arch:avx512 switch yet, so this is probably several years away from
>> happening.)
>>
>> Comments? Is it feasible to require that people who want to use avx512
>> include a new header instead of immintrin.h? Else, does anyone have a better
>> idea other than reintroducing the #ifdefs, augmented with the module check?
>>
>> Thanks,
>> Nico
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev


On Mon, May 16, 2016 at 10:02 AM Nico Weber <[hidden email]> wrote:
On Mon, May 16, 2016 at 12:48 PM, Eric Christopher <[hidden email]> wrote:


On Mon, May 16, 2016 at 9:43 AM Nico Weber via cfe-dev <[hidden email]> wrote:
Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang's?

As far as I can tell (from looking at https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing to clang's headers), yes. MSVC doesn't have the avx512 intrinsics yet, but that's probably only because they're new.

I had hoped that I could not include all of x86intrin.h in intrin.h, but that page says "The intrin.h header includes both immintrin.h and ammintrin.h for simplicity."

> This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the includes back behind ifdefs. The thread doesn't really mention the reasons, and since clang doesn't implement full multiversioning yet I'm unable to guess at the reasons -- but it sounds like people don't want to re-add the arch ifdefs. Ok, I'll send a patch to add them back ifdef _MSC_VER only -- there should be no drawback to that, and it stops the bleeding in the case where it's worst (with Microsoft headers).


It implements enough multiversioning to make it worthwhile to have them, I don't know what you're confused about here.

I didn't mean to question this point, I just don't understand it. Can you give an example where it's useful? I'm sure there is one, I just can't think of one.

Sure, the programmer has to write their own dispatch, but it'll allow you to include variously target optimized versions of the same function in the same file.

The equivalent linux side of things is:

void my_avx_function() __attribute__((__target__("avx")))
void my_nonavx_function()

...

if (__builtin_cpu_supports("avx"))
   my_avx_function()
else
   my_nonavx_function()

and you can keep both implementations in the same file and don't have to worry about things like command line options being different and causing all sorts of haywire.
 
 
Why do we want to make this platform specific? If you wanted to match MSVC I guess you could just turn them off for windows as an alternate solution?

Yes, that's what I meant with the _MSC_VER check.

I meant just turn off the avx512 headers, not sure if that's what you meant. We should probably look at the lexing and parsing code to see what can be sped up here.

-eric
 
 

-eric
 
Going forward, we'll have to teach clang more about at least some intrinsics for `#pragma intrin` (PR19898), which might end up helping for this too.

I also reached out to STL at Microsoft, he said he'll try to look into including an "intrin0.h" header in the next major version of MSVC which would only declare a small set of intrinsics instead of all of them (no promises, of course).

People working on avx512, I'd be curious to hear your perspective on this, as well as your reply to Chandler's points.

Thanks,
Nico

On Sat, May 14, 2016 at 2:04 AM, Chandler Carruth via cfe-dev <[hidden email]> wrote:
A couple of points:

1) Definitely agree with Hal that these intrinsics really shouldn't be mapping to builtins. This is something I'm pretty frustrated about the direction of AVX-512 support in Clang and LLVM. We really need generic vector IR to lower cleanly into these instructions.

2) Reid, you specifically advocated for not having the set of intrinsics available based on particular feature sets. ;] But I agree there seems to be a scalability problem here.

3) I think a lot of the scalability problem is that very basic, non-vector code patterns, require Intrin.h on Windows and pull in *ALL* the vector intrinsics. =/ It'd be really good to try to fix *that*.

4) AVX-512 has made this *incredibly* worse than any previous ISA extension. It used to be we had the product of (operation * operand-type) intrinsics. This is already pretty bad. Now we have (operation * operand-type * 4) because we have 4 masking variants. So it seems Intel has just made a really unfortunate API choice by forcing every permutation of these things to get a different name and thus a different intrinsic in a  header file. =/ And sadly that too is probably too late to walk back.


I wonder if we could at least initially address this by providing very limited "builtin" modules for truly builtin headers that don't touch any system headers, and actually *always* use the modules approach for these headers, right out of the box.

On Fri, May 13, 2016 at 7:32 PM C Bergström <[hidden email]> wrote:
This old discussion may cover some of this as well? I also thought I
remember something more recent around this..
http://clang-developers.42468.n3.nabble.com/PROPOSAL-Reintroduce-guards-for-Intel-intrinsic-headers-td4046979.html

On Sat, May 14, 2016 at 8:59 AM, Sean Silva via cfe-dev
<[hidden email]> wrote:
> Sorry if this is a stupid question, but do the windows intrinsic headers
> actually contain the same contents as clang's? (e.g. maybe the windows ones
> don't cover all the ISA's that clang's do).
>
> -- Sean Silva
>
> On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> <[hidden email]> wrote:
>>
>> Hi,
>>
>> on Windows, C++ system headers like e.g. <string> end up pulling in
>> intrin.h. clang's intrinsic headers are very large.
>>
>> If you take a cc file containing just `#include <string>` and run that
>> through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the
>> test.I file generated by clang-cl is 1.7MB while the one created by cl.exe
>> is 0.7MB. This is solely due to clang's intrin.h expanding to way more
>> stuff.
>>
>> The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
>> avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only
>> included avx headers if __AVX512F__ etc was defined. This is currently never
>> the case in practice. Later (r243394 r243402 r243406 and more), the avx
>> headers got much bigger.
>>
>> Parsing all this code takes time -- removing the avx512 includes from
>> immintrin.h locally makes compiling a file containing just the <string>
>> header 0.25s faster (!), and building all of v8 gets 6% faster, just from
>> not including the avx512 headers.
>>
>> What can we do about this? Since avx512 is new, maybe they could be not
>> part of immintrin.h? Or we could re-introduce
>>
>>   #if !__has_feature(modules) && defined(__AVX512BW__)
>>
>> include guards in immintrin.h. This would give us a speed win immediately
>> without drawbacks as far as I can see, but in a few years when people start
>> compiling with /arch:avx512 that'd go away again. (Then again, by then,
>> modules are hopefully commonly available. cl.exe doesn't have an
>> /arch:avx512 switch yet, so this is probably several years away from
>> happening.)
>>
>> Comments? Is it feasible to require that people who want to use avx512
>> include a new header instead of immintrin.h? Else, does anyone have a better
>> idea other than reintroducing the #ifdefs, augmented with the module check?
>>
>> Thanks,
>> Nico
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
On Mon, May 16, 2016 at 2:30 PM, Eric Christopher <[hidden email]> wrote:


On Mon, May 16, 2016 at 10:02 AM Nico Weber <[hidden email]> wrote:
On Mon, May 16, 2016 at 12:48 PM, Eric Christopher <[hidden email]> wrote:


On Mon, May 16, 2016 at 9:43 AM Nico Weber via cfe-dev <[hidden email]> wrote:
Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang's?

As far as I can tell (from looking at https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing to clang's headers), yes. MSVC doesn't have the avx512 intrinsics yet, but that's probably only because they're new.

I had hoped that I could not include all of x86intrin.h in intrin.h, but that page says "The intrin.h header includes both immintrin.h and ammintrin.h for simplicity."

> This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the includes back behind ifdefs. The thread doesn't really mention the reasons, and since clang doesn't implement full multiversioning yet I'm unable to guess at the reasons -- but it sounds like people don't want to re-add the arch ifdefs. Ok, I'll send a patch to add them back ifdef _MSC_VER only -- there should be no drawback to that, and it stops the bleeding in the case where it's worst (with Microsoft headers).


It implements enough multiversioning to make it worthwhile to have them, I don't know what you're confused about here.

I didn't mean to question this point, I just don't understand it. Can you give an example where it's useful? I'm sure there is one, I just can't think of one.

Sure, the programmer has to write their own dispatch, but it'll allow you to include variously target optimized versions of the same function in the same file.

The equivalent linux side of things is:

void my_avx_function() __attribute__((__target__("avx")))
void my_nonavx_function()

...

if (__builtin_cpu_supports("avx"))
   my_avx_function()
else
   my_nonavx_function()

and you can keep both implementations in the same file and don't have to worry about things like command line options being different and causing all sorts of haywire.

Ah, thanks, I didn't know that worked :-) (I played with it a bit and found PR27779)
 
 
 
Why do we want to make this platform specific? If you wanted to match MSVC I guess you could just turn them off for windows as an alternate solution?

Yes, that's what I meant with the _MSC_VER check.

I meant just turn off the avx512 headers, not sure if that's what you meant. We should probably look at the lexing and parsing code to see what can be sped up here.

I turned it off for all headers for now in r269675. I agree that we should look at lexing and parsing speed, and also consider things like tablegen'ing intinsics. Until then, making most compiles faster seems like a better tradeoff on Windows.

As said above, I'm curious to hear from the people working on avx512 :-)
 

-eric
 
 

-eric
 
Going forward, we'll have to teach clang more about at least some intrinsics for `#pragma intrin` (PR19898), which might end up helping for this too.

I also reached out to STL at Microsoft, he said he'll try to look into including an "intrin0.h" header in the next major version of MSVC which would only declare a small set of intrinsics instead of all of them (no promises, of course).

People working on avx512, I'd be curious to hear your perspective on this, as well as your reply to Chandler's points.

Thanks,
Nico

On Sat, May 14, 2016 at 2:04 AM, Chandler Carruth via cfe-dev <[hidden email]> wrote:
A couple of points:

1) Definitely agree with Hal that these intrinsics really shouldn't be mapping to builtins. This is something I'm pretty frustrated about the direction of AVX-512 support in Clang and LLVM. We really need generic vector IR to lower cleanly into these instructions.

2) Reid, you specifically advocated for not having the set of intrinsics available based on particular feature sets. ;] But I agree there seems to be a scalability problem here.

3) I think a lot of the scalability problem is that very basic, non-vector code patterns, require Intrin.h on Windows and pull in *ALL* the vector intrinsics. =/ It'd be really good to try to fix *that*.

4) AVX-512 has made this *incredibly* worse than any previous ISA extension. It used to be we had the product of (operation * operand-type) intrinsics. This is already pretty bad. Now we have (operation * operand-type * 4) because we have 4 masking variants. So it seems Intel has just made a really unfortunate API choice by forcing every permutation of these things to get a different name and thus a different intrinsic in a  header file. =/ And sadly that too is probably too late to walk back.


I wonder if we could at least initially address this by providing very limited "builtin" modules for truly builtin headers that don't touch any system headers, and actually *always* use the modules approach for these headers, right out of the box.

On Fri, May 13, 2016 at 7:32 PM C Bergström <[hidden email]> wrote:
This old discussion may cover some of this as well? I also thought I
remember something more recent around this..
http://clang-developers.42468.n3.nabble.com/PROPOSAL-Reintroduce-guards-for-Intel-intrinsic-headers-td4046979.html

On Sat, May 14, 2016 at 8:59 AM, Sean Silva via cfe-dev
<[hidden email]> wrote:
> Sorry if this is a stupid question, but do the windows intrinsic headers
> actually contain the same contents as clang's? (e.g. maybe the windows ones
> don't cover all the ISA's that clang's do).
>
> -- Sean Silva
>
> On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> <[hidden email]> wrote:
>>
>> Hi,
>>
>> on Windows, C++ system headers like e.g. <string> end up pulling in
>> intrin.h. clang's intrinsic headers are very large.
>>
>> If you take a cc file containing just `#include <string>` and run that
>> through the preprocessor with `cl /P test.cc` and `clang-cl /P test.cc`, the
>> test.I file generated by clang-cl is 1.7MB while the one created by cl.exe
>> is 0.7MB. This is solely due to clang's intrin.h expanding to way more
>> stuff.
>>
>> The biggest offenders are avx512vlintrin.h, avx512fintrin.h,
>> avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only
>> included avx headers if __AVX512F__ etc was defined. This is currently never
>> the case in practice. Later (r243394 r243402 r243406 and more), the avx
>> headers got much bigger.
>>
>> Parsing all this code takes time -- removing the avx512 includes from
>> immintrin.h locally makes compiling a file containing just the <string>
>> header 0.25s faster (!), and building all of v8 gets 6% faster, just from
>> not including the avx512 headers.
>>
>> What can we do about this? Since avx512 is new, maybe they could be not
>> part of immintrin.h? Or we could re-introduce
>>
>>   #if !__has_feature(modules) && defined(__AVX512BW__)
>>
>> include guards in immintrin.h. This would give us a speed win immediately
>> without drawbacks as far as I can see, but in a few years when people start
>> compiling with /arch:avx512 that'd go away again. (Then again, by then,
>> modules are hopefully commonly available. cl.exe doesn't have an
>> /arch:avx512 switch yet, so this is probably several years away from
>> happening.)
>>
>> Comments? Is it feasible to require that people who want to use avx512
>> include a new header instead of immintrin.h? Else, does anyone have a better
>> idea other than reintroducing the #ifdefs, augmented with the module check?
>>
>> Thanks,
>> Nico
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
In reply to this post by Renato Golin via cfe-dev
   >Indeed. It is not clear to me, however, that this situation is desirable. We
   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)

Thanks.

-  Elena

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
On Tue, May 17, 2016 at 3:49 PM, Demikhovsky, Elena <[hidden email]> wrote:
   >Indeed. It is not clear to me, however, that this situation is desirable. We
   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)

The bit I care most about is that adding `#include <intrin.h>` shouldn't add megabytes of stuff to my translation unit.

Hve you discussed making immintrin.h more modular? It looks like many more avx512 builtins keep landing, making this problem bigger and bigger. It'd be good if I only had to pay for this if I explicitly included an avx512.h, and even then it'd be nice if that wasn't one huge header, but several smaller ones, so I only have to pay compile time for the bits I need.


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev

We are still trying to find a suitable solution.

Keeping declarations only inside header files will save compile time.

In this case the implementation will be hidden inside clang.

Can somebody help me to estimate impact and complexity of this solution?

 

Thank you.

 

-           Elena

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Nico Weber
Sent: Tuesday, June 14, 2016 15:50
To: Demikhovsky, Elena <[hidden email]>
Cc: Hal Finkel <[hidden email]>; Reid Kleckner <[hidden email]>; cfe-dev <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>; David Majnemer <[hidden email]>; Chandler Carruth ([hidden email]) <[hidden email]>
Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

On Tue, May 17, 2016 at 3:49 PM, Demikhovsky, Elena <[hidden email]> wrote:

   >Indeed. It is not clear to me, however, that this situation is desirable. We
   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)


The bit I care most about is that adding `#include <intrin.h>` shouldn't add megabytes of stuff to my translation unit.

 

Hve you discussed making immintrin.h more modular? It looks like many more avx512 builtins keep landing, making this problem bigger and bigger. It'd be good if I only had to pay for this if I explicitly included an avx512.h, and even then it'd be nice if that wasn't one huge header, but several smaller ones, so I only have to pay compile time for the bits I need.

 

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
So have clang magically emit the generated code based on the intrinsic header? That'll be a lot of typing, but ultimately shouldn't be terrible. You'll effectively turn the _mm_ interface into __builtin as far as automatic recognition etc and I'm not sure we'd want to do that sort of thing.

-eric

On Tue, Jun 14, 2016 at 6:43 AM Demikhovsky, Elena via cfe-dev <[hidden email]> wrote:

We are still trying to find a suitable solution.

Keeping declarations only inside header files will save compile time.

In this case the implementation will be hidden inside clang.

Can somebody help me to estimate impact and complexity of this solution?

 

Thank you.

 

-           Elena

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Nico Weber
Sent: Tuesday, June 14, 2016 15:50
To: Demikhovsky, Elena <[hidden email]>
Cc: Hal Finkel <[hidden email]>; Reid Kleckner <[hidden email]>; cfe-dev <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>; David Majnemer <[hidden email]>; Chandler Carruth ([hidden email]) <[hidden email]>


Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

On Tue, May 17, 2016 at 3:49 PM, Demikhovsky, Elena <[hidden email]> wrote:

   >Indeed. It is not clear to me, however, that this situation is desirable. We
   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)


The bit I care most about is that adding `#include <intrin.h>` shouldn't add megabytes of stuff to my translation unit.

 

Hve you discussed making immintrin.h more modular? It looks like many more avx512 builtins keep landing, making this problem bigger and bigger. It'd be good if I only had to pay for this if I explicitly included an avx512.h, and even then it'd be nice if that wasn't one huge header, but several smaller ones, so I only have to pay compile time for the bits I need.

 

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev

> I'm not sure we'd want to do that sort of thing.

Do you have any other suggestion?

 

-           Elena

 

From: Eric Christopher [mailto:[hidden email]]
Sent: Tuesday, June 14, 2016 23:53
To: Demikhovsky, Elena <[hidden email]>; Nico Weber <[hidden email]>
Cc: Badouh, Asaf <[hidden email]>; David Majnemer <[hidden email]>; Zuckerman, Michael <[hidden email]>; cfe-dev <[hidden email]>
Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

So have clang magically emit the generated code based on the intrinsic header? That'll be a lot of typing, but ultimately shouldn't be terrible. You'll effectively turn the _mm_ interface into __builtin as far as automatic recognition etc and I'm not sure we'd want to do that sort of thing.

 

 

-eric

 

On Tue, Jun 14, 2016 at 6:43 AM Demikhovsky, Elena via cfe-dev <[hidden email]> wrote:

We are still trying to find a suitable solution.

Keeping declarations only inside header files will save compile time.

In this case the implementation will be hidden inside clang.

Can somebody help me to estimate impact and complexity of this solution?

 

Thank you.

 

-           Elena

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Nico Weber
Sent: Tuesday, June 14, 2016 15:50
To: Demikhovsky, Elena <[hidden email]>
Cc: Hal Finkel <[hidden email]>; Reid Kleckner <[hidden email]>; cfe-dev <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>; David Majnemer <[hidden email]>; Chandler Carruth ([hidden email]) <[hidden email]>


Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

On Tue, May 17, 2016 at 3:49 PM, Demikhovsky, Elena <[hidden email]> wrote:

   >Indeed. It is not clear to me, however, that this situation is desirable. We
   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)


The bit I care most about is that adding `#include <intrin.h>` shouldn't add megabytes of stuff to my translation unit.

 

Hve you discussed making immintrin.h more modular? It looks like many more avx512 builtins keep landing, making this problem bigger and bigger. It'd be good if I only had to pay for this if I explicitly included an avx512.h, and even then it'd be nice if that wasn't one huge header, but several smaller ones, so I only have to pay compile time for the bits I need.

 

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
In reply to this post by Renato Golin via cfe-dev

I agree, that it’s not desirable thing in Clang, however it seems to me the lesser evil.

 

The long compilation time will hurt many projects, I’m not sure if it’s reasonable even for the ones trying to use AVX intrinsics intentionally.

Another issue is, that the “_mm_” prefixed identifiers are not reserved for the compiler - the C99 spec says that identifiers that start with two underscores or an underscore and a capital letter are reserved. So, Clang should recognize the “_mm_” prefix, check if the right header was indeed included, and then try to identify the x86 intrinsic. If this fails, the identifier should be considered a standard identifier.

 

Of course, there is the case of x86 intrinsics that should be compiled to pure LLVM IR, rather than LLVM intrinsic calls. I see two possible solutions:

 

1.       Make CGBuiltin.cpp/EmitX86BuiltinExpr generate the IR using the IR builder. This approach might be less intuitive, and may become very long as we change more and more intrinsics to pure LLVM IR

2.       Leave the intrinsics implemented in C language in the header, rather than making these “_mm_” builtins. Then, again, as we move more and more intrinsics to C representation, the header might get big and heavy again.

 

I think in the short term I would prefer the 2nd solution as for its simplicity. Any other ideas to overcome this issue?

 

Thanks

     Guy Benyei

 

 

signature

 

From: cfe-dev [mailto:[hidden email]] On Behalf Of Eric Christopher via cfe-dev
Sent: Tuesday, June 14, 2016 23:53
To: Demikhovsky, Elena <[hidden email]>; Nico Weber <[hidden email]>
Cc: cfe-dev <[hidden email]>; David Majnemer <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>
Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

So have clang magically emit the generated code based on the intrinsic header? That'll be a lot of typing, but ultimately shouldn't be terrible. You'll effectively turn the _mm_ interface into __builtin as far as automatic recognition etc and I'm not sure we'd want to do that sort of thing.

 

-eric

 

On Tue, Jun 14, 2016 at 6:43 AM Demikhovsky, Elena via cfe-dev <[hidden email]> wrote:

We are still trying to find a suitable solution.

Keeping declarations only inside header files will save compile time.

In this case the implementation will be hidden inside clang.

Can somebody help me to estimate impact and complexity of this solution?

 

Thank you.

 

-           Elena

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Nico Weber
Sent: Tuesday, June 14, 2016 15:50
To: Demikhovsky, Elena <[hidden email]>
Cc: Hal Finkel <[hidden email]>; Reid Kleckner <[hidden email]>; cfe-dev <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>; David Majnemer <[hidden email]>; Chandler Carruth ([hidden email]) <[hidden email]>


Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

On Tue, May 17, 2016 at 3:49 PM, Demikhovsky, Elena <[hidden email]> wrote:

   >Indeed. It is not clear to me, however, that this situation is desirable. We
   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)


The bit I care most about is that adding `#include <intrin.h>` shouldn't add megabytes of stuff to my translation unit.

 

Hve you discussed making immintrin.h more modular? It looks like many more avx512 builtins keep landing, making this problem bigger and bigger. It'd be good if I only had to pay for this if I explicitly included an avx512.h, and even then it'd be nice if that wasn't one huge header, but several smaller ones, so I only have to pay compile time for the bits I need.

 

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev
As I said up the thread, I think the *right* way to solve this is with modules.

We have the infrastructure in clang to lazily load things like the intrinsics headers in a very efficient way. All we are missing is:
1) The ability to enable this by default exclusively for the intrinsic headers. (or more generally for any subset of the builtin headers where we would like this behavior...)
2) To build the actual module files for the builtin headers (much the way we generate some of them) as part of the build system

So far I've not seen any suggestions that really seem superior to this...

On Wed, Jun 15, 2016 at 1:23 PM Benyei, Guy via cfe-dev <[hidden email]> wrote:

I agree, that it’s not desirable thing in Clang, however it seems to me the lesser evil.

 

The long compilation time will hurt many projects, I’m not sure if it’s reasonable even for the ones trying to use AVX intrinsics intentionally.

Another issue is, that the “_mm_” prefixed identifiers are not reserved for the compiler - the C99 spec says that identifiers that start with two underscores or an underscore and a capital letter are reserved. So, Clang should recognize the “_mm_” prefix, check if the right header was indeed included, and then try to identify the x86 intrinsic. If this fails, the identifier should be considered a standard identifier.

 

Of course, there is the case of x86 intrinsics that should be compiled to pure LLVM IR, rather than LLVM intrinsic calls. I see two possible solutions:

 

1.       Make CGBuiltin.cpp/EmitX86BuiltinExpr generate the IR using the IR builder. This approach might be less intuitive, and may become very long as we change more and more intrinsics to pure LLVM IR

2.       Leave the intrinsics implemented in C language in the header, rather than making these “_mm_” builtins. Then, again, as we move more and more intrinsics to C representation, the header might get big and heavy again.

 

I think in the short term I would prefer the 2nd solution as for its simplicity. Any other ideas to overcome this issue?

 

Thanks

     Guy Benyei

 

 

 

From: cfe-dev [mailto:[hidden email]] On Behalf Of Eric Christopher via cfe-dev


Sent: Tuesday, June 14, 2016 23:53
To: Demikhovsky, Elena <[hidden email]>; Nico Weber <[hidden email]>

Cc: cfe-dev <[hidden email]>; David Majnemer <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>


Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

So have clang magically emit the generated code based on the intrinsic header? That'll be a lot of typing, but ultimately shouldn't be terrible. You'll effectively turn the _mm_ interface into __builtin as far as automatic recognition etc and I'm not sure we'd want to do that sort of thing.

 

-eric

 

On Tue, Jun 14, 2016 at 6:43 AM Demikhovsky, Elena via cfe-dev <[hidden email]> wrote:

We are still trying to find a suitable solution.

Keeping declarations only inside header files will save compile time.

In this case the implementation will be hidden inside clang.

Can somebody help me to estimate impact and complexity of this solution?

 

Thank you.

 

-           Elena

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Nico Weber
Sent: Tuesday, June 14, 2016 15:50
To: Demikhovsky, Elena <[hidden email]>
Cc: Hal Finkel <[hidden email]>; Reid Kleckner <[hidden email]>; cfe-dev <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>; David Majnemer <[hidden email]>; Chandler Carruth ([hidden email]) <[hidden email]>


Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

On Tue, May 17, 2016 at 3:49 PM, Demikhovsky, Elena <[hidden email]> wrote:

   >Indeed. It is not clear to me, however, that this situation is desirable. We
   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)


The bit I care most about is that adding `#include <intrin.h>` shouldn't add megabytes of stuff to my translation unit.

 

Hve you discussed making immintrin.h more modular? It looks like many more avx512 builtins keep landing, making this problem bigger and bigger. It'd be good if I only had to pay for this if I explicitly included an avx512.h, and even then it'd be nice if that wasn't one huge header, but several smaller ones, so I only have to pay compile time for the bits I need.

 

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

image001.png (23K) Download Attachment
image001.png (23K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: The intrinsics headers (especially avx512) are too big. What to do about it?

Renato Golin via cfe-dev

Modules could be a good solution for this issue, however I’m a bit concerned about some technical issues with keeping the modules in sync with the Clang compiler.

 

Building the actual module files as part of the build system doesn’t seem really hard, but still need to think about the location it should be installed (with the headers?), and need to make sure, that for any Clang we run we can find the suiting module files, or fall back to the headers approach.

 

We could add the module as some kind of resource to the executable. This way it couldn’t get out of sync. I would also add some way to disable this optimization entirely.

 

What do you think?

 

signature

 

From: Chandler Carruth [mailto:[hidden email]]
Sent: Thursday, June 16, 2016 00:20
To: Benyei, Guy <[hidden email]>; Eric Christopher <[hidden email]>; Demikhovsky, Elena <[hidden email]>; Nico Weber <[hidden email]>; [hidden email]
Cc: David Majnemer <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>
Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

As I said up the thread, I think the *right* way to solve this is with modules.

 

We have the infrastructure in clang to lazily load things like the intrinsics headers in a very efficient way. All we are missing is:

1) The ability to enable this by default exclusively for the intrinsic headers. (or more generally for any subset of the builtin headers where we would like this behavior...)

2) To build the actual module files for the builtin headers (much the way we generate some of them) as part of the build system

 

So far I've not seen any suggestions that really seem superior to this...

 

On Wed, Jun 15, 2016 at 1:23 PM Benyei, Guy via cfe-dev <[hidden email]> wrote:

I agree, that it’s not desirable thing in Clang, however it seems to me the lesser evil.

 

The long compilation time will hurt many projects, I’m not sure if it’s reasonable even for the ones trying to use AVX intrinsics intentionally.

Another issue is, that the “_mm_” prefixed identifiers are not reserved for the compiler - the C99 spec says that identifiers that start with two underscores or an underscore and a capital letter are reserved. So, Clang should recognize the “_mm_” prefix, check if the right header was indeed included, and then try to identify the x86 intrinsic. If this fails, the identifier should be considered a standard identifier.

 

Of course, there is the case of x86 intrinsics that should be compiled to pure LLVM IR, rather than LLVM intrinsic calls. I see two possible solutions:

 

1.       Make CGBuiltin.cpp/EmitX86BuiltinExpr generate the IR using the IR builder. This approach might be less intuitive, and may become very long as we change more and more intrinsics to pure LLVM IR

2.       Leave the intrinsics implemented in C language in the header, rather than making these “_mm_” builtins. Then, again, as we move more and more intrinsics to C representation, the header might get big and heavy again.

 

I think in the short term I would prefer the 2nd solution as for its simplicity. Any other ideas to overcome this issue?

 

Thanks

     Guy Benyei

 

 

 

From: cfe-dev [mailto:[hidden email]] On Behalf Of Eric Christopher via cfe-dev


Sent: Tuesday, June 14, 2016 23:53
To: Demikhovsky, Elena <[hidden email]>; Nico Weber <[hidden email]>

Cc: cfe-dev <[hidden email]>; David Majnemer <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>


Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

So have clang magically emit the generated code based on the intrinsic header? That'll be a lot of typing, but ultimately shouldn't be terrible. You'll effectively turn the _mm_ interface into __builtin as far as automatic recognition etc and I'm not sure we'd want to do that sort of thing.

 

-eric

 

On Tue, Jun 14, 2016 at 6:43 AM Demikhovsky, Elena via cfe-dev <[hidden email]> wrote:

We are still trying to find a suitable solution.

Keeping declarations only inside header files will save compile time.

In this case the implementation will be hidden inside clang.

Can somebody help me to estimate impact and complexity of this solution?

 

Thank you.

 

-           Elena

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Nico Weber
Sent: Tuesday, June 14, 2016 15:50
To: Demikhovsky, Elena <[hidden email]>
Cc: Hal Finkel <[hidden email]>; Reid Kleckner <[hidden email]>; cfe-dev <[hidden email]>; Badouh, Asaf <[hidden email]>; Zuckerman, Michael <[hidden email]>; David Majnemer <[hidden email]>; Chandler Carruth ([hidden email]) <[hidden email]>


Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

 

On Tue, May 17, 2016 at 3:49 PM, Demikhovsky, Elena <[hidden email]> wrote:

   >Indeed. It is not clear to me, however, that this situation is desirable. We
   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)


The bit I care most about is that adding `#include <intrin.h>` shouldn't add megabytes of stuff to my translation unit.

 

Hve you discussed making immintrin.h more modular? It looks like many more avx512 builtins keep landing, making this problem bigger and bigger. It'd be good if I only had to pay for this if I explicitly included an avx512.h, and even then it'd be nice if that wasn't one huge header, but several smaller ones, so I only have to pay compile time for the bits I need.

 

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
12