Re: [RFC] Identifying wasteful template function bodies

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Identifying wasteful template function bodies

Tom Stellard via cfe-dev
Oops, sorry, I meant to send this to cfe-dev! Looking forward to any
and all advice/opinions, though. Thanks! - Brian

On Mon, Dec 2, 2019 at 8:00 AM Brian Gesiak <[hidden email]> wrote:

>
> I work on a C++ project for which compilation time is a significant
> concern. One of my colleagues was able to significantly shorten the
> time Clang took to compile our project, by manually outlining
> independently-typed code from large template functions.
>
> This makes intuitive sense to me, because when instantiating a
> template function, Clang traverses the body of the function. The
> longer the function body, the more nodes in the AST Clang has to
> traverse, and the more time it takes. Programmers can read the
> function and see that some statements in the function body remain the
> same no matter what types the function is instantiated with. By
> extracting these statements into a separate, non-template function,
> programmers can reduce the amount of nodes Clang must traverse.
>
> I created a contrived example that demonstrates how splitting up a
> long template function can improve compile time. (Beware, the files
> are large, I needed something that would take Clang a hefty amount of
> time to process.)
> https://gist.github.com/modocache/77b8ac09280c08bd88f84b92ff43a28b
>
> In the example above, 'example.cpp' defines a template function
> 'foo<T, U, V, W, X, Y, Z>', whose body is ~46k LoC. It then
> instantiates 'foo' 10 times, with 10 different combinations of
> template type parameters. In total, 'clang -c -O1 example.cpp -Xclang
> -disable-llvm-passes -Xclang -emit-llvm' takes ~35 seconds in total to
> compile. Each additional instantiation of 'foo' adds an additional ~3
> seconds to the total compile time.
>
> Only the last statement in 'foo' is dependent upon the template type
> parameters to 'foo'. 'example-outlined.cpp' moves ~46k LoC of
> independently-typed statements out of 'foo' and into a function named
> 'foo_prologue_outlined', and has 'foo' call 'foo_prologue_outlined'.
> 'foo_prologue_outlined' is not a template function. The result is
> identical program behavior, but a total compile time of just ~5
> seconds (~85% faster). Additional instantiations of 'foo' in
> 'example-outlined.cpp' cost almost no additional compile time.
>
> Although the functions in our project are not as long, some of them
> take significantly longer than 35 seconds to compile. By outlining
> independently-typed statements, we've been able to reduce compile time
> of some functions, from 300s to 200s (1/3rd faster). So, my colleagues
> and I are looking for other functions we can manually outline in order
> to reduce the amount of time Clang takes to compile our project. To
> this end, it would be handy if Clang could tell us, for example, “hey,
> I just instantiated 'bar<int, float, double>', but X% of the
> statements in that function did not require transformation,” where
> 'X%' is some threshold that could be set in the compiler invocation.
> For now I'm thinking the option to set this warning threshold could be
> called '-Wwasteful-template-threshold=' -- but I'm aware that sounds
> awkward, and I'd love suggestions for a better name.
>
> I think implementing this feature is possible by adding some state to
> TreeTransform, or the Clang template instantiators that derive from
> that class. But before I send a patch to do so, I'm curious if anyone
> has attempted such a thing before, or if anyone has thoughts or
> comments on this feature. I'd prefer not to spend time implementing
> this diagnostic in Clang if it's predestined to be rejected in code
> review, so please let me know what you think!
>
> (I've cc'ed some contributors who I think have worked in this space
> like @rnk, or those who might have better naming suggestions like
> @rtrieu.)
>
> - Brian Gesiak
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Identifying wasteful template function bodies

Tom Stellard via cfe-dev
I like this idea. I'm not sure a warning is the best way to surface it. The first alternative that occurs to be would be the -ftime-trace JSON file, so you can dump out complete info about the most often instantiated templates, how many nodes they contained, and how many of them were dependent.

Which brings me to wonder, if clang already tracks whether an Expr is instantiation dependent, why can't it do this optimization itself? I assume there are good reasons. I recall there are some invariants about Decl or Expr pointer identity, but maybe those are handled by AlwaysRebuild. Another thing that occurs to me is that some semantic analysis or warnings may not fire if the DeclContext of an Expr is dependent. I wonder what the savings would be if we could enumerate those checks, store them in the template pattern, and run only the checks that matter without re-traversing the entire AST. The last reason I can think of is ADL. I think that would defer name lookup for almost every CallExpr in a template.

On Mon, Dec 2, 2019 at 8:01 AM Brian Gesiak <[hidden email]> wrote:
I work on a C++ project for which compilation time is a significant
concern. One of my colleagues was able to significantly shorten the
time Clang took to compile our project, by manually outlining
independently-typed code from large template functions.

This makes intuitive sense to me, because when instantiating a
template function, Clang traverses the body of the function. The
longer the function body, the more nodes in the AST Clang has to
traverse, and the more time it takes. Programmers can read the
function and see that some statements in the function body remain the
same no matter what types the function is instantiated with. By
extracting these statements into a separate, non-template function,
programmers can reduce the amount of nodes Clang must traverse.

I created a contrived example that demonstrates how splitting up a
long template function can improve compile time. (Beware, the files
are large, I needed something that would take Clang a hefty amount of
time to process.)
https://gist.github.com/modocache/77b8ac09280c08bd88f84b92ff43a28b

In the example above, 'example.cpp' defines a template function
'foo<T, U, V, W, X, Y, Z>', whose body is ~46k LoC. It then
instantiates 'foo' 10 times, with 10 different combinations of
template type parameters. In total, 'clang -c -O1 example.cpp -Xclang
-disable-llvm-passes -Xclang -emit-llvm' takes ~35 seconds in total to
compile. Each additional instantiation of 'foo' adds an additional ~3
seconds to the total compile time.

Only the last statement in 'foo' is dependent upon the template type
parameters to 'foo'. 'example-outlined.cpp' moves ~46k LoC of
independently-typed statements out of 'foo' and into a function named
'foo_prologue_outlined', and has 'foo' call 'foo_prologue_outlined'.
'foo_prologue_outlined' is not a template function. The result is
identical program behavior, but a total compile time of just ~5
seconds (~85% faster). Additional instantiations of 'foo' in
'example-outlined.cpp' cost almost no additional compile time.

Although the functions in our project are not as long, some of them
take significantly longer than 35 seconds to compile. By outlining
independently-typed statements, we've been able to reduce compile time
of some functions, from 300s to 200s (1/3rd faster). So, my colleagues
and I are looking for other functions we can manually outline in order
to reduce the amount of time Clang takes to compile our project. To
this end, it would be handy if Clang could tell us, for example, “hey,
I just instantiated 'bar<int, float, double>', but X% of the
statements in that function did not require transformation,” where
'X%' is some threshold that could be set in the compiler invocation.
For now I'm thinking the option to set this warning threshold could be
called '-Wwasteful-template-threshold=' -- but I'm aware that sounds
awkward, and I'd love suggestions for a better name.

I think implementing this feature is possible by adding some state to
TreeTransform, or the Clang template instantiators that derive from
that class. But before I send a patch to do so, I'm curious if anyone
has attempted such a thing before, or if anyone has thoughts or
comments on this feature. I'd prefer not to spend time implementing
this diagnostic in Clang if it's predestined to be rejected in code
review, so please let me know what you think!

(I've cc'ed some contributors who I think have worked in this space
like @rnk, or those who might have better naming suggestions like
@rtrieu.)

- Brian Gesiak

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Identifying wasteful template function bodies

Tom Stellard via cfe-dev


On Mon, Dec 2, 2019 at 2:42 PM Reid Kleckner via cfe-dev <[hidden email]> wrote:
I like this idea. I'm not sure a warning is the best way to surface it. The first alternative that occurs to be would be the -ftime-trace JSON file, so you can dump out complete info about the most often instantiated templates, how many nodes they contained, and how many of them were dependent.

Which brings me to wonder, if clang already tracks whether an Expr is instantiation dependent, why can't it do this optimization itself? I assume there are good reasons. I recall there are some invariants about Decl or Expr pointer identity, but maybe those are handled by AlwaysRebuild. Another thing that occurs to me is that some semantic analysis or warnings may not fire if the DeclContext of an Expr is dependent. I wonder what the savings would be if we could enumerate those checks, store them in the template pattern, and run only the checks that matter without re-traversing the entire AST. The last reason I can think of is ADL. I think that would defer name lookup for almost every CallExpr in a template.

I /think/ ADL still allows calls to be resolved up-front if the parameters aren't dependent. Yeah, here the non-dependent call is resolved and the dependent call is left with an UnresolvedLookupExpr: https://godbolt.org/z/kgmDHj
 

On Mon, Dec 2, 2019 at 8:01 AM Brian Gesiak <[hidden email]> wrote:
I work on a C++ project for which compilation time is a significant
concern. One of my colleagues was able to significantly shorten the
time Clang took to compile our project, by manually outlining
independently-typed code from large template functions.

This makes intuitive sense to me, because when instantiating a
template function, Clang traverses the body of the function. The
longer the function body, the more nodes in the AST Clang has to
traverse, and the more time it takes. Programmers can read the
function and see that some statements in the function body remain the
same no matter what types the function is instantiated with. By
extracting these statements into a separate, non-template function,
programmers can reduce the amount of nodes Clang must traverse.

I created a contrived example that demonstrates how splitting up a
long template function can improve compile time. (Beware, the files
are large, I needed something that would take Clang a hefty amount of
time to process.)
https://gist.github.com/modocache/77b8ac09280c08bd88f84b92ff43a28b

In the example above, 'example.cpp' defines a template function
'foo<T, U, V, W, X, Y, Z>', whose body is ~46k LoC. It then
instantiates 'foo' 10 times, with 10 different combinations of
template type parameters. In total, 'clang -c -O1 example.cpp -Xclang
-disable-llvm-passes -Xclang -emit-llvm' takes ~35 seconds in total to
compile. Each additional instantiation of 'foo' adds an additional ~3
seconds to the total compile time.

Only the last statement in 'foo' is dependent upon the template type
parameters to 'foo'. 'example-outlined.cpp' moves ~46k LoC of
independently-typed statements out of 'foo' and into a function named
'foo_prologue_outlined', and has 'foo' call 'foo_prologue_outlined'.
'foo_prologue_outlined' is not a template function. The result is
identical program behavior, but a total compile time of just ~5
seconds (~85% faster). Additional instantiations of 'foo' in
'example-outlined.cpp' cost almost no additional compile time.

Although the functions in our project are not as long, some of them
take significantly longer than 35 seconds to compile. By outlining
independently-typed statements, we've been able to reduce compile time
of some functions, from 300s to 200s (1/3rd faster). So, my colleagues
and I are looking for other functions we can manually outline in order
to reduce the amount of time Clang takes to compile our project. To
this end, it would be handy if Clang could tell us, for example, “hey,
I just instantiated 'bar<int, float, double>', but X% of the
statements in that function did not require transformation,” where
'X%' is some threshold that could be set in the compiler invocation.
For now I'm thinking the option to set this warning threshold could be
called '-Wwasteful-template-threshold=' -- but I'm aware that sounds
awkward, and I'd love suggestions for a better name.

I think implementing this feature is possible by adding some state to
TreeTransform, or the Clang template instantiators that derive from
that class. But before I send a patch to do so, I'm curious if anyone
has attempted such a thing before, or if anyone has thoughts or
comments on this feature. I'd prefer not to spend time implementing
this diagnostic in Clang if it's predestined to be rejected in code
review, so please let me know what you think!

(I've cc'ed some contributors who I think have worked in this space
like @rnk, or those who might have better naming suggestions like
@rtrieu.)

- Brian Gesiak
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Identifying wasteful template function bodies

Tom Stellard via cfe-dev
Awesome, thanks all!

> On Mon, Dec 2, 2019 at 2:42 PM Reid Kleckner via cfe-dev <[hidden email]> wrote:
>> I like this idea. I'm not sure a warning is the best way to surface it. The first alternative that occurs to be would be the -ftime-trace JSON file, so you can dump out complete info about the most often instantiated templates, how many nodes they contained, and how many of them were dependent.

This is a great idea. I'll try to get this working and send a patch.

The discussion on ADL was interesting as well. I'm glad I asked the
list, thanks!

- Brian
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Identifying wasteful template function bodies

Tom Stellard via cfe-dev
Note that https://github.com/aras-p/ClangBuildAnalyzer already has some information that helps in these kinds of diagnosing.  It consumes the traces from -ftime-trace.  It has good ways of saying which template instantiations took too long over your code base, and hacky ways of saying which templates took too long to instantiate (I wrote the hacky thing).  Here's a more recent view into the output that you can get: https://github.com/aras-p/ClangBuildAnalyzer/blob/master/tests/self-win-clang-cl-9.0rc2/_AnalysisOutputExpected.txt

I will say that one gap in this profiling tool is large quantities of tiny functions / templates, e.g. std::move.  -ftime-trace won't trace events that are sufficiently small, so it can miss things like std::move that involve a huge quantity of tiny instantiations.

> -----Original Message-----
> From: cfe-dev <[hidden email]> On Behalf Of Brian Gesiak
> via cfe-dev
> Sent: Wednesday, December 4, 2019 10:04 AM
> To: David Blaikie <[hidden email]>
> Cc: cfe-dev <[hidden email]>
> Subject: [EXTERNAL] Re: [cfe-dev] [RFC] Identifying wasteful template
> function bodies
>
> Awesome, thanks all!
>
> > On Mon, Dec 2, 2019 at 2:42 PM Reid Kleckner via cfe-dev <cfe-
> [hidden email]> wrote:
> >> I like this idea. I'm not sure a warning is the best way to surface it. The first
> alternative that occurs to be would be the -ftime-trace JSON file, so you can
> dump out complete info about the most often instantiated templates, how
> many nodes they contained, and how many of them were dependent.
>
> This is a great idea. I'll try to get this working and send a patch.
>
> The discussion on ADL was interesting as well. I'm glad I asked the list,
> thanks!
>
> - Brian
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://urldefense.com/v3/__https://lists.llvm.org/cgi-
> bin/mailman/listinfo/cfe-dev__;!!FbZ0ZwI3Qg!-
> pVFNcaaFVYdPdIuBJkdNWBPr_-Z-PjM8KlbAi71-2e56yNnxYZYarYnQ6WN$
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Identifying wasteful template function bodies

Tom Stellard via cfe-dev
ClangBuildAnalyzer is interesting. I used it to make one change to LLVM already:
You can see the analysis output that change was informed by here:
The next area of interest I identified was printAliasInstr, which produced this patch:

On Wed, Dec 4, 2019 at 8:51 AM Ben Craig via cfe-dev <[hidden email]> wrote:
Note that https://github.com/aras-p/ClangBuildAnalyzer already has some information that helps in these kinds of diagnosing.  It consumes the traces from -ftime-trace.  It has good ways of saying which template instantiations took too long over your code base, and hacky ways of saying which templates took too long to instantiate (I wrote the hacky thing).  Here's a more recent view into the output that you can get: https://github.com/aras-p/ClangBuildAnalyzer/blob/master/tests/self-win-clang-cl-9.0rc2/_AnalysisOutputExpected.txt

I will say that one gap in this profiling tool is large quantities of tiny functions / templates, e.g. std::move.  -ftime-trace won't trace events that are sufficiently small, so it can miss things like std::move that involve a huge quantity of tiny instantiations.

> -----Original Message-----
> From: cfe-dev <[hidden email]> On Behalf Of Brian Gesiak
> via cfe-dev
> Sent: Wednesday, December 4, 2019 10:04 AM
> To: David Blaikie <[hidden email]>
> Cc: cfe-dev <[hidden email]>
> Subject: [EXTERNAL] Re: [cfe-dev] [RFC] Identifying wasteful template
> function bodies
>
> Awesome, thanks all!
>
> > On Mon, Dec 2, 2019 at 2:42 PM Reid Kleckner via cfe-dev <cfe-
> [hidden email]> wrote:
> >> I like this idea. I'm not sure a warning is the best way to surface it. The first
> alternative that occurs to be would be the -ftime-trace JSON file, so you can
> dump out complete info about the most often instantiated templates, how
> many nodes they contained, and how many of them were dependent.
>
> This is a great idea. I'll try to get this working and send a patch.
>
> The discussion on ADL was interesting as well. I'm glad I asked the list,
> thanks!
>
> - Brian
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://urldefense.com/v3/__https://lists.llvm.org/cgi-
> bin/mailman/listinfo/cfe-dev__;!!FbZ0ZwI3Qg!-
> pVFNcaaFVYdPdIuBJkdNWBPr_-Z-PjM8KlbAi71-2e56yNnxYZYarYnQ6WN$
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev