zapcc compiler


zapcc compiler

Chris Lattner
Does anyone know anything about this?
http://www.zapcc.com

-Chris
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

Re: zapcc compiler

Yaron Keren
Hi Chris,

I am the principal developer of zapcc and can add some more technical details. zapcc is a heavily modified clang (the diff is about 200K) with additional code outside the llvm/clang codebase. zapcc operates in a client/compilation-server mode. The compilation server (think of it as a clang -cc1) stays in memory and accepts compilation commands from the driver. The client runs up to cc1_main(), which communicates with the server rather than re-running another clang as usual.

zapcc makes a distinction between two classes of source files: the "system" ones, whose compilation state is all kept in memory, and the "user" ones, whose compilation state is removed once compiled. The programmer can select which are the "user" files by wildcards set in a configuration file. The default for user files is .c .cpp .cxx .CC, but it could easily be all files in /home/user/yaron or whatever. It is expected that the system files are non-changing (such a change will not be recognized anyhow until server restart) while the user files are the ones being modified. As an example, you could have llvm/lib/MC/MachObjectWriter.cpp as the "user" file, so every other file's compilation result would be kept in memory.
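The user/system split could be sketched roughly like this. This is a hypothetical illustration only, not zapcc's actual configuration code; the function names and the pattern syntax (suffix wildcards and directory prefixes, as in the examples above) are assumptions:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Match either a "*.ext" suffix wildcard or a plain directory-prefix pattern.
bool matchesPattern(const std::string &Path, const std::string &Pattern) {
  if (Pattern.size() > 1 && Pattern[0] == '*') {
    // "*.cpp" style: compare the path's tail against the pattern's tail.
    std::string::size_type Tail = Pattern.size() - 1;
    return Path.size() >= Tail &&
           Path.compare(Path.size() - Tail, Tail, Pattern, 1, Tail) == 0;
  }
  // Otherwise treat the pattern as a prefix, e.g. "/home/user/yaron".
  return Path.compare(0, Pattern.size(), Pattern) == 0;
}

// A file is a "user" file (state discarded after compilation) if it matches
// any configured pattern; everything else is "system" (state kept in memory).
bool isUserFile(const std::string &Path,
                const std::vector<std::string> &UserPatterns) {
  for (const auto &P : UserPatterns)
    if (matchesPattern(Path, P))
      return true;
  return false;
}
```

With the default patterns, a .cpp source is classified as "user" while a header falls through to "system".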

Not only is a header file parsed just once, but all of its template instantiations *and* generated code are kept in memory, ready for the next compilation. zapcc is very careful to undo anything related to the 'user' files in the clang/LLVM data structures. This is very, very complex, which is why zapcc is not yet ready for a public beta. We prefer to release a more reliable product rather than waste your time.

There are limitations to this approach: previously declared entities are still visible in subsequent compilations, a limitation we hope to address someday, though not in the near future. In a good-quality modern codebase such clashes are rare. In the LLVM/clang codebase there are just a few clashes, which can easily be fixed by renaming one of the clashing entities. Some of the renaming would follow the new code style anyhow... In such cases zapcc automatically resets the compilation cache and retries compilation before giving up. It also resets if compilation flags change, or in some situations where it finds it cannot undo the compilation.
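The reset-and-retry policy described above can be modeled with a toy sketch. This only simulates the policy (warm compile; on a clash with a previously cached declaration, drop the cache and retry cold); it is not zapcc code, and all names are made up:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct CompileResult {
  bool ok;            // compilation succeeded
  bool usedWarmCache; // true if the in-memory state survived
};

// Try to compile against the warm cache; on a name clash with an entity
// declared in an earlier compilation, reset the cache and retry from cold.
CompileResult compileWithRetry(std::set<std::string> &CachedDecls,
                               const std::vector<std::string> &NewDecls) {
  bool Clash = false;
  for (const auto &D : NewDecls)
    if (CachedDecls.count(D)) { Clash = true; break; }
  if (!Clash) {
    CachedDecls.insert(NewDecls.begin(), NewDecls.end());
    return {true, true};
  }
  // Clash: previously declared entities are still visible, so give up on
  // the warm state, clear it, and recompile cold (always succeeds here).
  CachedDecls.clear();
  CachedDecls.insert(NewDecls.begin(), NewDecls.end());
  return {true, false};
}
```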

Having everything ready in memory saves time, especially where the headers are much more complex than the source code. With a short C++ program using boost::numeric, boost::graph, etc. or Eigen, we see a 10-50x speedup. We had some code examples on the web site, which I asked to be removed for now until we can provide you with a beta release, so that the results can be independently replicated. These may be considered best-case examples, but they are actually useful for programmers modifying and rebuilding a smaller program based on heavily templated C++ infrastructure.

For a full LLVM build we don't yet have full results, as not all zapcc bugs are solved, but we do see about a 1.5x speedup up to the 55% point of the build or so. This timing includes some linking and tablegenning, which take just the same time under zapcc, so the compilation speedup is actually somewhat better.

We haven't compared with precompiled headers, as they are really not equivalent. Using precompiled headers is a non-trivial change to a project's build and will not always help build time, depending on include patterns. I'm not sure precompiled headers would benefit LLVM build time. OTOH, zapcc builds the project as-is without redesign, with the exception of renaming name clashes, a trivial refactoring.

Hoping to release a beta version soon,

Yaron





Re: zapcc compiler

Chris Lattner
On May 23, 2015, at 12:25 PM, Yaron Keren <[hidden email]> wrote:
> zapcc makes a distinction between two classes of source files: the "system" ones, whose compilation state is all kept in memory, and the "user" ones, whose compilation state is removed once compiled. The programmer can select which are the "user" files by wildcards set in a configuration file. The default for user files is .c .cpp .cxx .CC, but it could easily be all files in /home/user/yaron or whatever. It is expected that the system files are non-changing (such a change will not be recognized anyhow until server restart) while the user files are the ones being modified. As an example, you could have llvm/lib/MC/MachObjectWriter.cpp as the "user" file, so every other file's compilation result would be kept in memory.

This sounds like a very interesting approach, but also (as you say) very complex :-)

Have you looked at the modules work in clang?  It seems that building on that infrastructure could help simplify things.  In principle you could load “all the modules” and then treat any specific translation unit as a filter over the available decls.  This is also, uhm, nontrivial, but building on properly modular headers could simplify things a lot.

-Chris



Re: zapcc compiler

Yaron Keren
zapcc maintains as much as possible from previous compilations: AST, IR, MC and DebugInfo.  I'm not sure that module support goes that far. This would indeed be easier to implement if we knew the C++ code was properly modularized.

One example: if a compile unit instantiates StringMap&lt;bool&gt; and the next compile unit also requires it, StringMap&lt;bool&gt; should not be reinstantiated, codegenned, and optimized. This could mostly be achieved using extern + explicit template instantiations; however, this approach is quite rare. Maybe because extern template wasn't supported before C++11, because programmers are unfamiliar with the technique, or because it's cleaner and easier to #include the template header and let the compiler handle the housekeeping. Whatever the reason, zapcc handles this automatically.
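The extern + explicit instantiation pattern mentioned above can be sketched in a minimal standalone example. The Pair template here is a made-up stand-in for something like llvm::StringMap, with the usual multi-file layout shown in comments:

```cpp
#include <cassert>
#include <string>

// A small class template standing in for a real container template.
template <typename T> struct Pair {
  std::string Key;
  T Value;
  T get() const { return Value; }
};

// In a header included by many TUs one would write the declaration below,
// which suppresses implicit instantiation of Pair<bool> in each TU:
extern template struct Pair<bool>;

// ...and in exactly one .cpp file, the explicit instantiation definition,
// so the template is instantiated (and codegenned) only once per program:
template struct Pair<bool>;

bool lookup(const Pair<bool> &P) { return P.get(); }
```

Within a single file both lines are legal (declaration, then definition); across a real project the extern declaration lives in the header and the definition in one source file.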






Re: zapcc compiler

Diego Novillo


On 05/25/15 15:37, Yaron Keren wrote:
> zapcc maintains as much as possible from previous compilations: AST,
> IR, MC and DebugInfo.  I'm not sure that module support goes that far.
> This indeed would be easier to implement if we know that the C++ code
> is properly modularized.

Oh, neat. This reminds me of the incremental compiler server efforts
with GCC (https://gcc.gnu.org/wiki/IncrementalCompiler). We also briefly
played with this notion at Google a few years back.

The big blockers at the time were the tricky implementation details.
GCC's code base is extremely toxic to multi-threading and server approaches.


Diego.

Re: zapcc compiler

David Blaikie
In reply to this post by Yaron Keren


On Mon, May 25, 2015 at 12:37 PM, Yaron Keren <[hidden email]> wrote:
> zapcc maintains as much as possible from previous compilations: AST, IR, MC and DebugInfo.  I'm not sure that module support goes that far.

ASTs are preserved in modules; that's all they're for (parsing time tends to dominate, at least in our world/experiments/data as I understand it, so that's the first thing to fix). Duplicate IR/MC/DebugInfo is still present, though; it'd be the next thing to solve. We're talking about deduplicating some of the debug info, and Adrian Prantl is working on that at the moment - putting debug info for types into the module files themselves and referencing it directly as a split DWARF file.

Duplicate IR/MC comes from comdat/linkonce_odr functions - and at some point it'd be nice to put those in a module too, if there's a clear single ownership (oh, you have an inline function in your modular header - OK, we'll IRGen it, make an available_externally copy of it in the module to be linked into any users of the module, and a standard external definition will be codegen'd down to object code and put in the module to be passed to the linker). This wouldn't solve the problems with templates that have no 'home' to put their definition.

- David

> This would indeed be easier to implement if we knew the C++ code was properly modularized.

> One example: if a compile unit instantiates StringMap<bool> and the next compile unit also requires it, StringMap<bool> should not be reinstantiated, codegenned, and optimized. This could mostly be achieved using extern + explicit template instantiations; however, this approach is quite rare. Maybe because extern template wasn't supported before C++11,

Actually it was available in '98, so far as I know.

> because programmers are unfamiliar with the technique, or because it's cleaner and easier to #include the template header and let the compiler handle the housekeeping.

Yeah, the usual problem is that it's a maintenance burden to couple template definitions to the types they're instantiated with - and often impossible, because the template is in a library that doesn't know about the instantiated types at all (like std::vector: it can't know all the types in the world that it might be instantiated with).

> Whatever the reason, zapcc handles this automatically.









Re: zapcc compiler

Sean Silva
In reply to this post by Chris Lattner


On Sun, May 24, 2015 at 9:41 AM, Chris Lattner <[hidden email]> wrote:

Have you looked at the modules work in clang?  It seems that building on that infrastructure could help simplify things.  In principle you could load “all the modules” and then treat any specific translation unit as a filter over the available decls.

This is actually exactly what clang's current modules infrastructure already does. Submodules are simply a visibility filter on top of the loaded AST. This is e.g. what the `Hidden` bit on Decl is for: http://clang.llvm.org/doxygen/classclang_1_1Decl.html#ad58279c91e474c764e418e5a09d32073 (among other places inside clang touched by implementing it this way).
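Sean's point can be modeled with a toy sketch. These are not clang's real data structures; the names are illustrative only. The idea is that all module decls stay loaded in one AST, and "importing" a submodule just flips a visibility bit rather than re-parsing anything:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy decl: analogous in spirit to clang's Decl with its "Hidden" bit.
struct Decl {
  std::string Name;
  bool Hidden; // true = loaded but not yet visible to this TU
};

// "Importing" a submodule makes its exported decls visible; nothing is
// re-parsed, because everything was already in the loaded AST.
void importSubmodule(std::vector<Decl> &AST,
                     const std::vector<std::string> &Exported) {
  for (auto &D : AST)
    for (const auto &Name : Exported)
      if (D.Name == Name)
        D.Hidden = false;
}

int countVisible(const std::vector<Decl> &AST) {
  int N = 0;
  for (const auto &D : AST)
    if (!D.Hidden) ++N;
  return N;
}
```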

-- Sean Silva
 
  This is also, uhm, nontrivial, but building on properly modular headers could simplify things a lot.

-Chris





Re: zapcc compiler

James Widman-2
In reply to this post by David Blaikie
On Tue, May 26, 2015 at 1:38 PM, David Blaikie <[hidden email]> wrote:

> On Mon, May 25, 2015 at 12:37 PM, Yaron Keren <[hidden email]> wrote:
>>
>> zapcc maintains as much as possible from previous compilations: AST, IR,
>> MC and DebugInfo.  I'm not sure that module support goes that far.
>
>
> ASTs are preserved in modules, that's all they're for (parsing time tends to
> dominate, at least in our world/experiments/data as I understand it, so
> that's the first thing to fix). Duplicate IR/MC/DebugInfo is still present
> though it'd be the next thing to solve - we're talking about deduplicating
> some of the debug info and Adrian Prantl is working on that at the moment -
> putting debug info for types into the module files themselves and
> referencing it directly as a split DWARF file.
>
> Duplicate IR/MC comes from comdat/linkonce_odr functions - and at some point
> it'd be nice to put those in a module too, if there's a clear single
> ownership (oh, you have an inline function in your modular header - OK,
> we'll IRGen it, make an available_externally copy of it in the module to be
> linked into any users of the module, and a standard external definition will
> be codegen'd down to object code and put in the module to be passed to the
> linker). This wouldn't solve the problems with templates that have no 'home'
> to put their definition.

I guess it depends on the build setup:  if you spread the build across
multiple machines then... never mind.

But if the whole build is on one machine and it has enough memory, and
as long as something like zapcc is retaining the whole program's AST
anyway, it could be a win for it to complete that whole-program AST
before any IR is generated.  Presumably, the compiler could then
invent the 'home' and do each instantiation exactly once in the entire
build.

Or... it might still help the multi-machine setup.  In the worst case,
an instantiated function would get instantiated once per machine.

But in that case it might be nice to get a fix-it hint from the linker
to automatically extern-templateize all such instantiations. (:

--James

Re: zapcc compiler

James Widman-2
On Wed, May 27, 2015 at 4:11 AM, James Widman <[hidden email]> wrote:

> But in that case it might be nice to get a fix-it hint from the linker
> to automatically extern-templateize all such instantiations. (:


That reminds me: is there any public data that shows the percentage of
build time spent doing IRGen/opt/CodeGen for duplicates that end up
getting discarded?

--James

Re: zapcc compiler

Sean Silva


On Wed, May 27, 2015 at 1:57 PM, James Widman <[hidden email]> wrote:

> That reminds me: is there any public data that shows the percentage of
> build time spent doing IRGen/opt/CodeGen for duplicates that end up
> getting discarded?

I have information on a couple large (1-10MLOC) codebases indicating that time spent outside of parsing is typically ~20% of total CPU time at -O2/-O3. IIRC, with lower optimization levels, I saw 10-15%.

So that ~20% number is a rough upper bound for the time spent in the LLVM optimizers and code generation, and hence an upper bound on the time for duplicates.

The fact that clang does IRGen as it parses (hence it fell under "parsing time" in my measurements) makes it somewhat difficult to pinpoint how much time is spent on duplicates during IRGen. If you want to measure this, you could do it similarly to how I describe measuring per-file time in http://permalink.gmane.org/gmane.comp.compilers.clang.devel/42127 but with extra probes tracking calls into IRGen, and also add probes inside the middle end and back end to track per-function time.

By combining this information with information from the linker about which functions end up becoming "duplicates", you should have a decent empirical estimate for the data that you want. You might do this by placing probes in the linker so that you can easily measure any project by just building it with the instrumented toolchain and using DTrace to funnel out all the data, which can then be fed into a script.
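The per-phase probe idea could be sketched with a simple RAII timer. This is a hypothetical illustration: the real measurement would hook clang's IRGen and backend entry points (and the linker), while here a busy loop stands in for "IRGen of one function" and all names are made up:

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Accumulated seconds per phase ("irgen", "opt", "codegen", ...).
std::map<std::string, double> &phaseTimes() {
  static std::map<std::string, double> Times;
  return Times;
}

// RAII probe: times the enclosing scope and charges it to a named phase.
class ScopedProbe {
  std::string Phase;
  std::chrono::steady_clock::time_point Start;
public:
  explicit ScopedProbe(std::string P)
      : Phase(std::move(P)), Start(std::chrono::steady_clock::now()) {}
  ~ScopedProbe() {
    std::chrono::duration<double> D = std::chrono::steady_clock::now() - Start;
    phaseTimes()[Phase] += D.count();
  }
};

// Stand-in for "IRGen of one function": does some work under a probe.
double spinFor(int N) {
  ScopedProbe P("irgen");
  double S = 0;
  for (int I = 0; I < N; ++I)
    S += I * 0.5;
  return S;
}
```

Per-function records like these, joined with the linker's knowledge of which comdat copies were discarded, would give the empirical estimate Sean describes.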

-- Sean Silva
 

--James

