JumboSupport: making unity builds easier in Clang

Re: JumboSupport: making unity builds easier in Clang

On Tue, Apr 10, 2018 at 9:13 PM, Richard Smith via cfe-dev <[hidden email]> wrote:
On 10 April 2018 at 10:05, Nico Weber via cfe-dev <[hidden email]> wrote:
On Tue, Apr 10, 2018 at 1:01 PM, David Blaikie <[hidden email]> wrote:


On Tue, Apr 10, 2018 at 9:58 AM Nico Weber <[hidden email]> wrote:
On Tue, Apr 10, 2018 at 11:56 AM, David Blaikie <[hidden email]> wrote:


On Tue, Apr 10, 2018 at 8:52 AM Mostyn Bramley-Moore <[hidden email]> wrote:
On Tue, Apr 10, 2018 at 4:27 PM, David Blaikie <[hidden email]> wrote:
I haven't looked at the patches in detail - but generally a jumbo build feels like a bit of a workaround & maybe there are better long-term solutions that might fit into the compiler. A few sort of background questions:

* Have you tried Clang header modules ( https://clang.llvm.org/docs/Modules.html )? Both explicit and implicit modules (granted, explicit modules might only be practical at the moment using Google's internal version of Bazel - but you /might/ get some comparison numbers from a Google Chrome developer).
  * The doc talks about maybe disabling jumbo builds for a single target for developer efficiency, with the risk that a header edit would maybe be worse for the developer than the jumbo build - this is where modules would help as well, since it doesn't have this tradeoff property of two different dimensions of "more work" you have to choose from.

There are ways to minimise this: an earlier proprietary jumbo build system used at Opera would detect which files you are modifying and rebuilding, and compile those in "normal" mode.  This gave fast full/clean build times as well as short modify+rebuild times.  We have not attempted to implement this in the Chromium jumbo build configuration.

Building that kind of infrastructure seems like a pretty big hammer compared to modularizing the codebase...

Modularizing the codebase doesn't give you the same build time impact, linearizes your build more,

Not sure I follow - it partially linearizes (as you say, due to the module dependency rather than header dependency issue), as does the jumbo build.

The jumbo build just needs to append a bunch of files, that's fast. Compiling a module isn't.

Well, compiling a module is just appending a bunch of headers and compiling them. It's just at a different layer of the graph.
 
and slows down incremental builds.

Compared to a traditional build? I wouldn't think so on average (I mean, yes, reading/writing modules has some overhead - but also some gains). I'd expect slower builds if you modify a header at the very base of the dependency graph (like the STL), but beyond that I would've thought the reading/writing modules overhead would be saved by reusing modules for infrequently modified files (like the STL).

Say you touch some header foo.h. Previously, you needed to rebuild all cc files including it. Now you need to instead rebuild the module, and since the module has changed you now need to rebuild all cc files using any header in the module, not just the users of foo.h. That's potentially way more cc files.

But say you touch some source file foo.cc. Previously, and with modules, you just need to rebuild that cc file. With a unity build, you now instead need to rebuild the concatenation of that .cc file and a bunch of others. That's also potentially way more cc files. :)

But measurements beat speculation here.

Here's one data point: from a non-ccache, non-distributed build on a fairly high-end machine (20 CPU cores, 40 threads), I built a subset of Chromium (content_shell) in both jumbo and non-jumbo mode.  Then I picked a single source file in a part of the tree that we have previously made jumbo-capable (content/public/renderer/browser_plugin_delegate.cc), touched it and timed how long the rebuilds would take in both jumbo and non-jumbo mode.  The target that this source file is part of has 16 source files in total, which is smaller than the default jumbo_file_merge_limit value of 50, so rebuilding this one source file in jumbo mode requires that we also rebuild the other 15 source files in this target, and they will not be built in parallel since they're all in a single jumbo compilation unit.  In other words, this is a moderately bad scenario for jumbo.

The non-jumbo rebuild + relink time on this machine was between 9 and 10 seconds, and the jumbo rebuild + relink time was 23-24 seconds: a little more than double, but still nowhere near "time to grab a coffee while I wait" territory.  This time is easily won back in jumbo mode if you need to rebase on master, or build another target or configuration.

If you find yourself in a modify/rebuild/retest loop in this code, you can try a workflow optimisation mentioned in Daniel Bratell's doc (and earlier in this thread): turn jumbo off for just this target but on for all others, and you only have a one-time overhead of regenerating ninja files (which is quick) plus rebuilding 15 source files once in parallel.  Then you only need to rebuild a single source file each time around the loop.  

I am currently running the same benchmark on a lower-specced machine, one which is more realistic for many developers: a 4-core / 8-thread workstation.  The test setup is excruciatingly slow to prepare, so I will have to report back tomorrow with the numbers.  I expect the rebuild times to be comparable, since this test cannot make use of multiple CPU cores simultaneously (other than maybe parallel linking).  But the clean-build time speedup for this configuration is known to be a big net win in terms of absolute time saved (jumbo builds are something like ~3x faster than non-jumbo builds, which take several hours).

Jumbo builds are not a solution that you should use blindly without confirming that they work for your codebase and workflow, but in some cases they clearly have enormous benefits.

-Mostyn.
 
(wonder what the combination would be like - modularizing headers, and also jumbo-ifying .cpp files together... - whether there's much to be saved in the reading modules part of the work, reading them in fewer times - that gets into some of the ideas of compiler as a service I guess)
 
Even if it wasn't a lot more work to get modules going, it's not completely clear to me that that would address the use case that the people working on the jumbo build have.
 
(maybe still less work - but a lot of work to workaround things & produce some rather quirky behavior (in terms of how the build functions based on looking at exactly how the source files have changed & changing the build action graph depending on that) - but enough that I'd be inclined to reconsider going in the modular direction again)
 
 
* I was going to ask about the lack of parallelism in a jumbo build - but reading the doc I see it's not a 'full' jumbo build, but chunkifying the build - so there's still some/enough parallelism. Cool :)

I have heard rumours of some codebases in the games industry using a single jumbo source file for the entire build, but this is generally considered to be taking things too far and not our intended use case.

Ah, my understanding was that jumbo builds were often/mainly used for optimized builds to get cross-module optimizations (LTO-esque) & so it'd be likely to be the whole program.
 
The size of Chromium's jumbo compilation units is tunable: you can simply #include fewer real source files per jumbo source file, and the bigger your build farm is, the smaller you want this number to be.  The optimal setup depends on things like the shape of the dependency graph and the relative costs of the original source files.  IIRC we currently only have a build-wide "jumbo_file_merge_limit" setting, though that might have changed since I last looked (V8 would benefit from a per-target limit, since its source files compile more slowly than most Chromium source files).
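
As a rough sketch of the chunking involved (this is not Chromium's actual generator; the file names and limit value are made up for illustration), a jumbo-source generator boils down to something like:

// make_jumbo.cc - toy jumbo source generator
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

int main() {
  const std::vector<std::string> sources = {
      "real_source_file_1.cc", "real_source_file_2.cc",
      "real_source_file_3.cc", "real_source_file_4.cc"};
  const std::size_t merge_limit = 2;  // plays the role of jumbo_file_merge_limit

  // Emit jumbo_source_1.cc, jumbo_source_2.cc, ... each #including at most
  // merge_limit real source files, so the chunks can still build in parallel.
  for (std::size_t chunk = 0; chunk * merge_limit < sources.size(); ++chunk) {
    std::ofstream out("jumbo_source_" + std::to_string(chunk + 1) + ".cc");
    out << "#pragma jumbo\n";
    for (std::size_t i = chunk * merge_limit;
         i < sources.size() && i < (chunk + 1) * merge_limit; ++i) {
      out << "#include \"" << sources[i] << "\"\n";
    }
  }
}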


-Mostyn.
 
On Tue, Apr 10, 2018 at 5:12 AM Mostyn Bramley-Moore via cfe-dev <[hidden email]> wrote:

Hi,


I am a member of a small group of Chromium developers who are working on adding a unity build[1] setup to Chromium[2], in order to reduce the project's long and ever-increasing compile times.  We're calling these "jumbo" builds, because this term is not as overloaded as "unity".


We're slowly making progress, but find that a lot of our time is spent renaming things in anonymous namespaces; it would be much simpler if it were possible to automatically treat these as if they were file-local.  Jens Widell has put together a proof-of-concept which appears to work reasonably well; it consists of a clang plugin and a small clang patch:

https://github.com/jensl/llvm-project-20170507/tree/wip/jumbo-support/v1

https://github.com/jensl/llvm-project-20170507/commit/a00d5ce3f20bf1c7a41145be8b7a3a478df9935f


After building clang and the plugin, you generate jumbo source files that look like:


jumbo_source_1.cc:


#pragma jumbo
#include "real_source_file_1.cc"
#include "real_source_file_2.cc"
#include "real_source_file_3.cc"


Then, you compile something like this:

clang++ -c jumbo_source_1.cc -Xclang -load -Xclang lib/JumboSupport.so -Xclang -add-plugin -Xclang jumbo-support


The plugin gives unique names[3] to the anonymous namespaces without otherwise changing their semantics, and also #undef's macros defined in each top-level source file before processing the next top-level source file.  That way header files can still define macros that are used in multiple source files in the jumbo translation unit. Collisions between macros defined in header files and names used in other headers and other source files are still possible, but less likely.
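
To illustrate the macro problem (the file and macro names here are invented), imagine two real source files that each define a local convenience macro:

// real_source_file_1.cc
#define BUFFER_SIZE 16
char buffer1[BUFFER_SIZE];

// real_source_file_2.cc
#define BUFFER_SIZE 64  // fine on its own, a redefinition inside a jumbo TU
char buffer2[BUFFER_SIZE];

Compiled separately both files are fine; #included into the same jumbo translation unit, the second #define is a macro redefinition (clang warns, and an #ifndef guard would silently pick up the stale value 16).  With the plugin #undef'ing BUFFER_SIZE after real_source_file_1.cc's contribution, each file again sees only its own definition, while macros coming from shared headers stay visible across the whole jumbo translation unit.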


To show how much these two changes help, here's a patch to make Chromium's network code build in jumbo mode:

https://chromium-review.googlesource.com/c/chromium/src/+/966523 (+352/-377 lines)


And here's the corresponding patch using the proof-of-concept JumboSupport plugin:

https://chromium-review.googlesource.com/c/chromium/src/+/962062 (+53/-52 lines)


It seems clear that the version using the JumboSupport plugin would require less effort to create, review and merge into the codebase.  We have a few other feature ideas, but these two changes seem to do most of the work for us.


So now we're trying to figure out the best way forward: would a feature like this be welcome in the Clang project?  And if so, how would you recommend that we go about it?  We would prefer to do this in a way that does not require a locally patched Clang, and could live with building a custom plugin, although implementing this entirely in Clang would be even better.


I've been thinking about ways to get the benefits of unity builds without the semantic changes. With the functionality we introduced for -fmodules-local-submodule-visibility, we have the ability to parse one file, then make it "invisible" and parse another file, skipping all the repeated parts from the two parses, which would give us some (maybe most) of the performance benefit of unity builds without the semantic changes. (This is not quite as good as a unity build: you'd still repeatedly lex and preprocess the files #included into both source files. We could implicitly treat header files with include guards as being "modular" to get the performance back, but then you also get back some of the semantic changes.)
 

Thanks,



-Mostyn.



[1] If you're not familiar with unity builds, the idea is to compile multiple source files per compiler invocation, reducing the overhead of processing header files (which can be surprisingly high).  We do this by taking a list of the source files in a target and generating "jumbo" source files that #include multiple "real" source files, and then we feed these jumbo files to the compiler one at a time.  This way, we don't prevent the usage of valuable build tools like ccache and icecc that only support a single source file on the command line.


[2] Daniel Bratell has a summary of our progress jumbo-ifying the Chromium codebase here:

https://docs.google.com/document/d/19jGsZxh7DX8jkAKbL1nYBa5rcByUL2EeidnYsoXfsYQ/edit#


[3] The JumboSupport plugin assigns names to the anonymous namespaces in a given file:  foo::(anonymous namespace)::bar is replaced with a symbol name of the form foo::__anonymous_<number>::bar where <number> is unique to the file within the jumbo translation unit.  Due to the internal linkage of these symbols, <number> does not need to be unique across multiple object files/jumbo source files.
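
As a made-up example of the rewrite this describes, consider two real source files that both use the same name in an anonymous namespace inside namespace foo:

// real_source_file_1.cc
namespace foo {
namespace {  // the plugin treats this as foo::__anonymous_1
const int kBufferSize = 16;
}  // namespace
int FirstBufferSize() { return kBufferSize; }
}  // namespace foo

// real_source_file_2.cc
namespace foo {
namespace {  // the plugin treats this as foo::__anonymous_2
const int kBufferSize = 64;  // redefinition error in a plain jumbo TU
}  // namespace
int SecondBufferSize() { return kBufferSize; }
}  // namespace foo

In a plain jumbo translation unit the two anonymous namespaces merge and the second kBufferSize is a redefinition error; with the per-file numbering the two definitions stay distinct, and because the symbols have internal linkage the numbers only need to be unique within that one jumbo translation unit.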


--
Mostyn Bramley-Moore
Vewd Software

Re: JumboSupport: making unity builds easier in Clang

Results on my 4c/8t reference machine:
non-jumbo rebuild + relink time: about 7 seconds
jumbo rebuild + relink time: about 18 seconds

So a slightly higher percentage increase than my larger machine, but lower absolute time increase.

The storage systems in these two machines are wildly different, but I suspect the main difference in this benchmark is the frequency of the cores (higher in the lower-specced machine).


-Mostyn.
 

Re: JumboSupport: making unity builds easier in Clang

On 10 Apr 2018, at 21:28, Daniel Bratell via cfe-dev <[hidden email]> wrote:
>
> I've heard (hearsay, I admit) from profiling that it seems the single largest time consumer in clang is template instantiation, something I assume can't easily be prepared in advance.
>
> One example is Chromium's chrome/browser/browser target which is 732 files that normally need 6220 CPU seconds to compile, average 8.5 seconds per file. All combined together gives a single translation unit that takes 400 seconds to compile, a mere 0.54 seconds on average per file. That indicates that about 8 seconds per compiled file is related to the processing of headers.

It sounds as if there are two things here:

1. The time taken to parse the headers
2. The time taken to repeatedly instantiate templates that the linker will then discard

Assuming a command line where all of the relevant source files are provided to the compiler invocation:

Solving the first one is relatively easy if the files have a common prefix (which can be determined by simple string comparison).  Find the common prefix in the source files, build the clang AST, and then do a clone for each compilation unit.  Hopefully, the clone is a lot cheaper than re-parsing (and can ideally share source locations).
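
A toy sketch of just the prefix-detection step (the AST reuse itself would have to live inside clang; this only shows the "simple string comparison" part, with invented file handling):

// common_prefix.cc - count how many leading lines a set of files share
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

static std::vector<std::string> ReadLines(const std::string& path) {
  std::ifstream in(path);
  std::vector<std::string> lines;
  std::string line;
  while (std::getline(in, line)) lines.push_back(line);
  return lines;
}

int main(int argc, char** argv) {
  if (argc < 3) return 1;
  std::vector<std::string> prefix = ReadLines(argv[1]);
  for (int i = 2; i < argc; ++i) {
    const std::vector<std::string> lines = ReadLines(argv[i]);
    std::size_t common = 0;
    while (common < prefix.size() && common < lines.size() &&
           prefix[common] == lines[common]) {
      ++common;
    }
    prefix.resize(common);  // keep only what is shared by all files so far
  }
  // The shared leading lines (typically the run of #includes) are what could
  // be parsed once and then cloned for each compilation unit.
  std::cout << "shared leading lines: " << prefix.size() << "\n";
  return 0;
}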

The second is slightly more difficult, because it relies on sharing parts of the AST across notional compilation units.

To make this work well with incremental builds, ideally you’d spit out all of the common template instantiations into a separate IR file, which could then be used with ThinLTO.  

Personally, I would prefer to have an interface where a build system can invoke clang with all of the files that need building and the degree of parallelism to use and let it share as much state as it wants across builds.  In an ideal world, clang would record which templates have been instantiated in a prior build (or a previous build step in the current build) and avoid any IRGen for them, at the very least.

Old C++ compilers, predating linker support for COMDATs, emitted templates lazily, simply emitting references to them, then parsing the linker errors and generating missing implementations until the linker errors went away.  Modern C++ compilers generate many instantiations of the same templates and then discard most of them.  It would be nice to find an intermediate point, which worked well with ThinLTO, where templates could be emitted once and be available for inlining everywhere.

David


Re: JumboSupport: making unity builds easier in Clang



> -----Original Message-----
> From: cfe-dev [mailto:[hidden email]] On Behalf Of David
> Chisnall via cfe-dev
> Sent: Wednesday, April 11, 2018 3:43 AM
> To: Daniel Bratell
> Cc: Jens Widell; [hidden email]; [hidden email]; Daniel
> Cheng; Bruce Dawson
> Subject: Re: [cfe-dev] JumboSupport: making unity builds easier in Clang
> Personally, I would prefer to have an interface where a build system can
> invoke clang with all of the files that need building and the degree of
> parallelism to use and let it share as much state as it wants across
> builds.  In an ideal world, clang would record which templates have been
> instantiated in a prior build (or a previous build step in the current
> build) and avoid any IRGen for them, at the very least.

Let me put in a plug for Paul Huggett's work, see his 2016 US dev talk:
https://llvm.org/devmtg/2016-11/#talk22
He's looking to do something like this with a program-fragment database.
It's obviously not anywhere near production ready but it looks like a pretty
good direction to me.
--paulr


Re: JumboSupport: making unity builds easier in Clang

This would have issues with distributed builds, though, right? Unless clang then took on the burden of doing the distribution too, which might be a bit much.


Re: JumboSupport: making unity builds easier in Clang


If you want to share ASTs (an ephemeral structure), clang would need to do the distributing.  If you want to share IR of instantiated templates, you can do a shared database where clang is much less involved in managing the distribution.  Say the database key is a hash of the token stream of the template definition, plus the template parameters.  Then you can pull precompiled IR out of the database (if you want to do optimizations) or make a reference to it (if you're doing LTO).
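
A toy sketch of such a key (the names are invented, and a real implementation would hash the compiler's token stream rather than raw text):

// instantiation_key.cc - toy cache key for a template instantiation
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Key = hash of the template definition's tokens combined with the
// template arguments it is being instantiated with.
std::size_t InstantiationKey(const std::string& definition_tokens,
                             const std::string& template_args) {
  const std::size_t h1 = std::hash<std::string>{}(definition_tokens);
  const std::size_t h2 = std::hash<std::string>{}(template_args);
  return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));  // simple hash combine
}

int main() {
  // Two translation units instantiating the same template with the same
  // arguments compute the same key, so they could share one copy of the IR.
  std::cout << InstantiationKey(
                   "template <class T> T Clamp(T v, T lo, T hi) { ... }", "<int>")
            << "\n";
}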

--paulr

 


Re: JumboSupport: making unity builds easier in Clang

See also: https://www.llvm.org/devmtg/2014-04/PDFs/Talks/Tenseconds.pdf

I started experimenting with a unity build of an LLVM/Clang-sized
proprietary project at my previous employer, and I found the basics
easy to get going. The hard part was massaging the code base to avoid
collisions, as indicated by the work by Mostyn & co.

I left the job before I had a chance to fully evaluate it, but
assuming I'd had something like `#pragma jumbo` to reduce the
friction, it might have been easier to get more data for less effort.

Mostyn/Daniel, do you have any gut feel/data on how much of the
problem a #pragma would solve? I suppose there are still constructs
that `#pragma jumbo` can't help with, that requires manual
intervention?

Also, Chromium is hardly a typical codebase; from the little I've looked
at it, it's *extremely* clean and consistent, so it might be interesting
to try this on something else. Maybe LLVM itself would be an
interesting candidate.

- Kim


Re: JumboSupport: making unity builds easier in Clang

On Wed, Apr 11, 2018 at 7:53 PM, Kim Gräsman via cfe-dev <[hidden email]> wrote:
See also: https://www.llvm.org/devmtg/2014-04/PDFs/Talks/Tenseconds.pdf

I CC'ed Andy in my initial post, but the email bounced.
 
Mostyn/Daniel, do you have any gut feel/data on how much of the
problem a #pragma would solve? I suppose there are still constructs
that `#pragma jumbo` can't help with, that requires manual
intervention?

The best side-by-side comparison that we have at the moment is between the two Chromium patch sets I mentioned; the numbers there match my gut feeling that something like the JumboSupport proof-of-concept could save us about 80% of the effort to jumbo-ify Chromium code.

There are a few other constructs that cause trouble less often, which could be investigated later (with diminishing returns).  Automatically popping clang diagnostic warning pragma states is one that came up the other day.  I think I have seen globally scoped typedefs in top-level source files cause trouble (but these are rare).
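
An invented example of the diagnostic-pragma case: a push that is never popped in one file changes the warning state for every file that follows it in the jumbo translation unit:

// real_source_file_1.cc
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wunused-variable"
static void DoWork() { int scratch = 0; }
// oops: no matching "#pragma clang diagnostic pop"

// real_source_file_2.cc
// On its own this warns about 'leftover' under -Wunused-variable; in a
// jumbo translation unit the leaked "ignored" state from the previous
// file silently hides the warning.
static void MoreWork() { int leftover = 0; }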

And there are of course some constructs that I don't think it is feasible to try to fix automatically, e.g. symbols and macros leaked by library headers (which are intentionally leaky); X11 and Windows headers are particularly bad.
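
The X11 case is a classic illustration: X.h defines None as a macro (0L), so a later file in the same jumbo translation unit that uses None as an ordinary identifier stops compiling (file names invented):

// real_source_file_1.cc
#include <X11/X.h>  // defines, among other things: #define None 0L
// ... code that genuinely needs the X11 definitions ...

// real_source_file_2.cc
// Fine as its own translation unit, but in a jumbo translation unit that
// already pulled in X.h, 'None' has been macro-replaced by 0L and this
// enum no longer parses.
enum class Result { None, Ok, Error };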
 
Also, Chromium is hardly a typical codebase, the little I've looked at
it, it's *extremely* clean and consistent, so it might be interesting
to try this on something else. Maybe LLVM itself would be an
interesting candidate.

I don't have much experience with CMake, but I see a few references to CMake unity build helpers on the web (if anyone has tips, feel free to ping me off-list).  If it would be useful I can try to put together a small experiment with a subset of LLVM or Clang.


-Mostyn.
 

Re: JumboSupport: making unity builds easier in Clang

The rates are different for initially adding jumbo support and for maintaining it afterwards.

When first preparing the code for jumbo there are several groups of
changes necessary. Some of them are just that the code initially did
something wrong that is suddenly detected in jumbo builds; others are
that the same constant/function name is used in many files:
kBufferSize, kIconSize, kSecondsPerMinute, GetThingWithNullCheck(),
those kinds of things.

In the initial cleanup I think name collisions, the kind of problem the
suggested clang support would help with, account for about 60-80% of the
changes, and the experiment with /net in Chromium supports that estimate.

After the initial cleanup, the new problems that appear seem to be of
the "duplicate symbol name" kind to a much higher degree, maybe 90%.

So if those rough estimates are correct, it would make it roughly 4 times
as easy to implement something like jumbo (removing ~75% of the work
leaves a quarter of it) and 10 times as easy to maintain (removing ~90%
leaves a tenth), and it would mean that developers can keep using the
short common names they have become accustomed to.

It would also hide some code problems that jumbo currently exposes, such
as copy/pasted code, but if we can live with those today, we can probably
survive with them a while longer and leave it to other tools to find such
problems.

/Daniel

(My notes from adding jumbo to a part of the code with 1000+ files; the
items marked with a * would probably have been unnecessary if clang had had
this support:
----
* 20.5 patches to rename something
* 11.5 patches to remove duplicate code
2 fixes to bad forward declarations
1 removal of "using namespace" (not allowed by the coding standard)
1 fix to ambiguity between ::prefs and ::metric::prefs
1 fix to clash with X11 headers
3 fixes to clashes with Windows headers
* 3 changes to inline trivial code/constants
1 case of bind.h finding Bind being called the wrong way thanks to access  
to more type information
1 removal of dead code
1 patch to add include guards
)

On Wed, 11 Apr 2018 19:53:58 +0200, Kim Gräsman <[hidden email]>  
wrote:

> See also: https://www.llvm.org/devmtg/2014-04/PDFs/Talks/Tenseconds.pdf
>
> I started experimenting with a unity build of an LLVM/Clang-sized
> proprietary project at my previous employer, and I found the basics
> easy to get going. The hard part was massaging the code base to avoid
> collisions, as indicated by the work by Mostyn & co.
>
> I left the job before I had a chance to fully evaluate it, but
> assuming I'd had something like `#pragma jumbo` to reduce the
> friction, it might have been easier to get more data for less effort.
>
> Mostyn/Daniel, do you have any gut feel/data on how much of the
> problem a #pragma would solve? I suppose there are still constructs
> that `#pragma jumbo` can't help with, that requires manual
> intervention?
>
> Also, Chromium is hardly a typical codebase, the little I've looked at
> it, it's *extremely* clean and consistent, so it might be interesting
> to try this on something else. Maybe LLVM itself would be an
> interesting candidate.
>
> - Kim
>
> On Wed, Apr 11, 2018 at 7:08 PM, via cfe-dev <[hidden email]>  
> wrote:
>> If you want to share ASTs (an ephemeral structure) clang would need to  
>> do
>> the distributing.  If you want to share IR of instantiated templates,  
>> you
>> can do a shared database where clang is much less involved in managing  
>> the
>> distribution.  Say the database key can be maybe a hash of the token  
>> stream
>> of the template definition would work?  plus the template parameters.  
>> Then
>> you can pull precompiled IR out of the database (if you want to do
>> optimizations) or make a reference to it (if you're doing LTO).
>>
>> --paulr
>>
>>
>>
>> From: cfe-dev [mailto:[hidden email]] On Behalf Of David
>> Blaikie via cfe-dev
>> Sent: Wednesday, April 11, 2018 11:09 AM
>> To: David Chisnall
>> Cc: Bruce Dawson; Daniel Cheng; [hidden email];
>> [hidden email]; Daniel Bratell; Jens Widell
>> Subject: Re: [cfe-dev] JumboSupport: making unity builds easier in Clang
>>
>>
>>
>> This would have issues with distributed builds, though, right? Unless  
>> clang
>> then took on the burden of doing the distribution too, which might be a  
>> bit
>> much.
>>
>> On Wed, Apr 11, 2018 at 12:43 AM David Chisnall via cfe-dev
>> <[hidden email]> wrote:
>>
>> On 10 Apr 2018, at 21:28, Daniel Bratell via cfe-dev
>> <[hidden email]> wrote:
>>>
>>> I've heard (hearsay, I admit) from profiling that it seems the single
>>> largest time consumer in clang is template instantiation, something I  
>>> assume
>>> can't easily be prepared in advance.
>>>
>>> One example is chromium's chrome/browser/browser target which is 732  
>>> files
>>> that normally need 6220 CPU seconds to compile, average 8,5 seconds per
>>> file. All combined together gives a single translation unit that takes  
>>> 400
>>> seconds to compile, a mere 0.54 seconds on average per file. That  
>>> indicates
>>> that about 8 seconds per compiled file is related to the processing of
>>> headers.
>>
>> It sounds as if there are two things here:
>>
>> 1. The time taken to parse the headers
>> 2. The time taken to repeatedly instantiate templates that the linker  
>> will
>> then discard
>>
>> Assuming a command line where all of the relevant source files are  
>> provided
>> to the compiler invocation:
>>
>> Solving the first one is relatively easy if the files have a common  
>> prefix
>> (which can be determined by simple string comparison).  Find the common
>> prefix in the source files, build the clang AST, and then do a clone for
>> each compilation unit.  Hopefully, the clone is a lot cheaper than
>> re-parsing (and can ideally share source locations).
>>
>> The second is slightly more difficult, because it relies on sharing  
>> parts of
>> the AST across notional compilation units.
>>
>> To make this work well with incremental builds, ideally you’d spit out  
>> all
>> of the common template instantiations into a separate IR file, which  
>> could
>> then be used with ThinLTO.
>>
>> Personally, I would prefer to have an interface where a build system can
>> invoke clang with all of the files that need building and the degree of
>> parallelism to use and let it share as much state as it wants across  
>> builds.
>> In an ideal world, clang would record which templates have been  
>> instantiated
>> in a prior build (or a previous build step in the current build) and  
>> avoid
>> any IRGen for them, at the very least.
>>
>> Old C++ compilers, predating linker support for COMDATs, emitted  
>> templates
>> lazily, simply emitting references to them, then parsing the linker  
>> errors
>> and generating missing implementations until the linker errors went  
>> away.
>> Modern C++ compilers generate many instantiations of the same templates  
>> and
>> then discard most of them.  It would be nice to find an intermediate  
>> point,
>> which worked well with ThinLTO, where templates could be emitted once  
>> and be
>> available for inlining everywhere.
>>
>> David
>>


--
/* Opera Software, Linköping, Sweden: CEST (UTC+2) */

Re: JumboSupport: making unity builds easier in Clang

Leonard Chan via cfe-dev
In reply to this post by Leonard Chan via cfe-dev
On Wed, Apr 11, 2018 at 8:52 PM, Mostyn Bramley-Moore <[hidden email]> wrote:
On Wed, Apr 11, 2018 at 7:53 PM, Kim Gräsman via cfe-dev <[hidden email]> wrote:
See also: https://www.llvm.org/devmtg/2014-04/PDFs/Talks/Tenseconds.pdf

I CC'ed Andy in my initial post, but the email bounced.
 
I started experimenting with a unity build of an LLVM/Clang-sized
proprietary project at my previous employer, and I found the basics
easy to get going. The hard part was massaging the code base to avoid
collisions, as indicated by the work by Mostyn & co.

I left the job before I had a chance to fully evaluate it, but
assuming I'd had something like `#pragma jumbo` to reduce the
friction, it might have been easier to get more data for less effort.

Mostyn/Daniel, do you have any gut feel/data on how much of the
problem a #pragma would solve? I suppose there are still constructs
that `#pragma jumbo` can't help with, that require manual
intervention?

The best side-by-side comparison that we have at the moment is the two Chromium patch sets I mentioned; the numbers there match my gut feeling that something like the JumboSupport proof-of-concept could save us about 80% of the effort to jumbo-ify Chromium code.

There are a few other constructs that cause trouble less often and could be investigated later, with diminishing returns.  Automatically popping clang diagnostic warning pragma states is one that came up the other day.  I think I have also seen globally scoped typedefs in top-level source files cause trouble (but these are rare).
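A hypothetical sketch of the pragma case (file names invented):

// noisy.cc
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wunused-parameter"
void HandleLegacyEvent(int unused_fd) {}
// note: the matching "#pragma clang diagnostic pop" is missing

// quiet.cc
// On its own this file still gets -Wunused-parameter diagnostics, but
// appended after noisy.cc in a jumbo unit it silently inherits the
// suppression unless something pops the pragma state at the file boundary.
void HandleOtherEvent(int also_unused) {}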

And there are of course some constructs that I don't think are feasible to fix automatically, e.g. symbols and macros leaked by library headers (which are intentionally leaky); X11 and Windows headers are particularly bad.
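For example, something along these lines (file names invented; the X11 macro is real):

// x11_event_source.cc
#include <X11/Xlib.h>  // pulls in X.h, which does: #define None 0L

// mouse_event.cc
// Fine as its own translation unit, but if it lands after
// x11_event_source.cc in a jumbo unit, "None" has already been
// macro-replaced with 0L and this enum no longer compiles.
enum class MouseButton { None, Left, Right };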
 
Also, Chromium is hardly a typical codebase, the little I've looked at
it, it's *extremely* clean and consistent, so it might be interesting
to try this on something else. Maybe LLVM itself would be an
interesting candidate.

I don't have much experience with CMake, but I see a few references to CMake unity build helpers on the web (if anyone has tips, feel free to ping me off-list).  If it would be useful I can try to put together a small experiment with a subset of LLVM or Clang.
 
I decided to take a look at the clangSema target to see what kind of difference the JumboSupport PoC would make.  Instead of digging into CMake, I just wrote some small shell scripts to build this target in the various modes.
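(For anyone who hasn't seen one: the jumbo unit itself is just a generated source file that #includes the member files, roughly like this; the exact file list here is only illustrative:)

// clangSema_jumbo.cpp -- generated; the build compiles this one file
// instead of the individual sources it lists.
#include "SemaDecl.cpp"
#include "SemaExpr.cpp"
#include "SemaStmt.cpp"
// ...and so on for the rest of the target, minus any excluded files.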

Without JumboSupport, I had to rename a couple of static functions (isGlobalVar and getDepthAndIndex in a couple of places), and rename a struct (PartialSpecMatchResult) that was inside an anonymous namespace.  Alternatively you could decide to refactor and share the same implementations.  I also excluded two source files from the jumbo compilation unit, due to clashes caused by a file being intentionally #include'd multiple times (alternatively you could sprinkle some #undef's around to make this work).

With JumboSupport, instead of renaming the static functions I just moved them into anonymous namespaces, and excluded the same two source files which #include some .def files multiple times, for the same reasons as above.  I did not need to do anything about the PartialSpecMatchResult structs since they were already inside anonymous namespaces (one of them was at least, I did not need to check the other).

Of these two patches, the JumboSupport version was easier to produce, and I believe would require less effort to review: there would be no debate about what to rename things to, or whether and how the code should be refactored.  I think that anonymous namespaces should generally be preferred over static functions, and JumboSupport makes anonymous namespaces even more useful: it makes them behave the way that many developers (incorrectly) assume they already work.
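A minimal sketch of what that means, assuming the PoC keeps each original file's anonymous namespace distinct (names invented):

// a.cc
namespace {
struct MatchResult { int depth; };   // meant to be private to a.cc
}

// b.cc
namespace {
struct MatchResult { bool exact; };  // meant to be private to b.cc
}

// In a plain jumbo unit the two blocks merge into one anonymous namespace
// and the second MatchResult is a redefinition error.  With the behaviour
// described above, each file keeps its own anonymous namespace, so both
// definitions can coexist, which is what many developers assume already
// happens.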

Note that we don't claim that jumbo builds make sense for all codebases, and I'm not sure if it would make sense for Clang/LLVM.  But JumboSupport did appear to help in this tiny experiment.

-Mostyn.
 



--
Mostyn Bramley-Moore
Vewd Software


