RFC: Replacing the default CRT allocator on Windows

RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev

Hello,

 

I was wondering how folks would feel about possibly replacing the default Windows CRT allocator in Clang, LLD and other LLVM tools.

 

The CRT heap allocator on Windows doesn’t scale well on large core count machines. Any multi-threaded workload in LLVM that allocates often is impacted by this. As a result, link times with ThinLTO are extremely slow on Windows. We’re observing performance inversely proportional to the number of cores: the more cores the machine has, the slower ThinLTO linking gets.

 

We’ve replaced the CRT heap allocator with modern lock-free thread-caching allocators such as rpmalloc (Unlicense), mimalloc (MIT license) or snmalloc (MIT license). The runtime performance is an order of magnitude faster.

 

Time to link clang.exe with LLD and -flto on a 36-core machine:

  Windows CRT heap allocator: 38 min 47 sec

  mimalloc: 2 min 22 sec

  rpmalloc: 2 min 15 sec

  snmalloc: 2 min 19 sec

 

We’ve been running a downstream fork of LLVM + rpmalloc in production for more than a year. However, when cross-compiling for some specific game platforms, we’re using other downstream forks of LLVM that we can’t change.

 

Two questions arise:

  1. Licensing: should we embed one of these allocators into the LLVM tree, or keep them separate, out of the tree?
  2. If the answer to the above is “yes”: given the tremendous performance speedup, should we link one of these allocators into Clang/LLD builds by default (on Windows only), considering that Windows doesn’t have an LD_PRELOAD mechanism? (See the sketch just below.)
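
Since Windows has no LD_PRELOAD, the replacement has to be linked in. As an illustrative sketch only (not the actual patch): one way is to statically link the allocator and forward the global C++ operators to it, here using rpmalloc’s public API. Replacing malloc/free themselves additionally requires overriding or patching the CRT symbols, which is the harder part.

  // Sketch only: route the global operator new/delete to rpmalloc.
  #include <rpmalloc.h>
  #include <cstddef>
  #include <new>

  void *operator new(std::size_t Size) {
    void *P = rpmalloc(Size); // lock-free, thread-cached allocation
    if (!P)
      throw std::bad_alloc();
    return P;
  }
  void operator delete(void *P) noexcept { rpfree(P); }
  // The array and nothrow variants would be forwarded the same way, and
  // rpmalloc_initialize() is expected to run before the first allocation.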

 

Please see demo patch here: https://reviews.llvm.org/D71786

 

Thank you in advance for the feedback!

Alex.

 



Re: RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
I'd appreciate the speed-up from an alternative malloc, and the ease of using one if its source were included in the LLVM repository. We already have multiple sources from external projects included in the repository (among them gtest, gmock, ConvertUTF, google benchmark, ISL, ...) and I don't see a reason why it should be different for a malloc implementation.

AFAIK replacing malloc is quite common for Windows projects. The
Windows default implementation (called "low fragmentation heap") has
different optimization goals.

I'd start by including it in the repository and providing an option to enable it, with the possibility of making it the default after some experience has been collected, maybe even with multiple malloc implementations.

Michael

Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
Not against this for the executables, but I just wanted to check: is it possible to override malloc/free for clang.exe but not for libclang.dll?

I think it's fine to make the built executables use a different allocator, but it'd be a bigger pain if we forced an allocator on users that link against the LLVM libraries or shared libraries by default.

Cheers,
-Neil.



--
Neil Henning
Senior Software Engineer Compiler


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
Hello David,

Please see below.

-----Original Message-----
From: llvm-dev <[hidden email]> On Behalf Of David Chisnall via llvm-dev
Sent: July 2, 2020 8:04 AM
To: [hidden email]
Subject: Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

> These numbers all seem very close (apart from the baseline).  How many runs did you do and what was the jitter?

Three runs, and I took the last value. Once the Windows cache is hot, the numbers are very stable. The ThinLTO cache is not enabled, and I used /opt:lldltojobs=all to extend the ThreadPool to all hardware threads.


> FWIW, I'm using snmalloc on FreeBSD instead of jemalloc and clang is around 2% faster, so it might be worth considering this as an option for all platforms.  It's likely to be a big win on anything where dlmalloc is the default allocator.

I didn't mention it, but the compile-time experience was also improved, in the range of 5-10% depending on the file being compiled. When using integrated compilation, i.e. all TUs compiled in the same process, the gain is in the range of 60%, but in that case other effects come into play.


> I am obviously biased towards snmalloc, since I'm one of the authors, and happy to help out anyone wanting to integrate it with LLVM.  Note that snmalloc requires C++17, so would need to be conditional on LLVM being built with a vaguely modern compiler.

snmalloc currently compiles as part of the LLVM codebase with a few C++17-related constexpr warnings. However, the contentious issue is the commit size, which could be a showstopper for certain users. A runtime flag such as -fno-integrated-crt-alloc could somewhat mitigate this issue. However, the problem is only exacerbated on high core count machines.

Peak commit when linking clang.exe with LLD and -flto on a 36-core machine:

  Windows CRT heap allocator: 14.9 GB
  mimalloc: 19.8 GB
  rpmalloc: 31.9 GB
  snmalloc: 42 GB



Re: RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
Have you tried Microsoft's new "segment heap" implementation? Only apps that opt in get it at the moment. Reportedly Edge and Chromium are getting large memory savings from switching, but I haven't seen performance comparisons.

If the performance is good, it seems like that might be the simplest choice.
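
For reference (from memory, so worth double-checking against the Microsoft docs): on recent Windows 10 builds the opt-in is done per executable through the application manifest, roughly:

  <asmv3:application xmlns:asmv3="urn:schemas-microsoft-com:asm.v3">
    <asmv3:windowsSettings xmlns="http://schemas.microsoft.com/SMI/2020/WindowsSettings">
      <heapType>SegmentHeap</heapType>
    </asmv3:windowsSettings>
  </asmv3:application>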




Re: RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev

Thanks for the suggestion James, it reduces peak commit by about 900 MB (14.9 GB -> 14 GB).

 

Unfortunately it does not solve the performance problem. The heap is global to the application and thread-safe, so every malloc/free locks it, which evidently doesn’t scale. We could manually create thread-local heaps, but I didn’t want to go there. Ultimately, allocated blocks need to share ownership between threads, and at that point it’s like writing a new allocator. I suppose most non-Windows platforms already have lock-free thread-local arenas, which probably explains why this issue has gone (mostly) unnoticed.
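
To illustrate the problem with per-thread heaps, a rough Win32 sketch (not LLVM code): a private heap can skip the lock, but HeapFree must be called with the heap that allocated the block, so cross-thread frees need block-to-heap tracking.

  #include <windows.h>

  // Sketch only: one private, unsynchronized heap per thread.
  void Example() {
    HANDLE Heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0); // no lock on this heap
    void *P = HeapAlloc(Heap, 0, 64);
    // If another thread is to release P, it must first recover 'Heap';
    // calling HeapFree with the wrong heap handle is undefined behavior.
    HeapFree(Heap, 0, P);
    HeapDestroy(Heap);
  }

Maintaining that block-to-heap ownership across threads is essentially writing a new allocator.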

 

 


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
Let's make sure we're all working from the same set of assumptions.  There are several layers of allocators in a Windows application, and I think it's worth being explicit, since I've already seen comments referring to multiple different "allocators."

The C++ allocators and new and delete operators are generally built on malloc from the C run-time library.  malloc implementations typically rely on process heaps (Win32 HeapAlloc) and/or go directly to virtual memory (Win32 VirtualAlloc).  HeapAlloc itself uses VirtualAlloc.

C++ --> malloc (CRT) --> HeapAlloc (Win32) --> VirtualAlloc (Win32)
           |                                       ^
           +---------------------------------------+

This proposal is talking about replacing the malloc layer.

The "low-fragmentation heap" and "segment heap" are features of process heaps (HeapAlloc).

Since those are in different layers, there's some orthogonality there.  If your malloc implementation uses HeapAlloc, then tweaking process heap features may affect performance assuming the bottleneck isn't in malloc itself.  If you replace the malloc implementation with one that completely bypasses the process heap by going directly to virtual memory, that's a horse of a different color.
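
A small sketch of that distinction, assuming plain shims rather than any particular allocator's code:

  #include <windows.h>

  // A malloc built on the process heap contends on the heap lock:
  void *MallocViaHeap(size_t N) {
    return HeapAlloc(GetProcessHeap(), 0, N);
  }

  // A malloc that bypasses the process heap takes pages directly and
  // carves them up itself; this is what the thread-caching allocators do:
  void *MallocViaVM(size_t N) {
    return VirtualAlloc(NULL, N, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
  }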

The only Windows app I worked on that switched malloc implementations was a cross-platform app.  On every platform, tcmalloc was a win, except for Windows.  We kept Windows on the default Microsoft implementation because it performed better.  (The app was a multithreaded real-time graphics simulation.  The number of threads was low, maybe 4 or 5.)


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev

Hi Adrian!

 

I completely agree with you, we should be clear on the wording. This proposal is about replacing both the MS CRT malloc layer *AND* the HeapAlloc layer.

 

C++ --> {mimalloc|rpmalloc|snmalloc} --> VirtualAlloc (Win32)

 

The bottom line is that libraries in LLVM allocate a lot, mostly small allocations. I think that should be solved on its own, perhaps by using BumpPtrAllocator a bit more. But this is fragile: any new system in LLVM could later add lots of allocations and induce heap locking further down the line.
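
For instance, with the existing llvm/Support/Allocator.h API, a minimal sketch:

  #include "llvm/Support/Allocator.h"

  void Example() {
    llvm::BumpPtrAllocator Alloc;
    // Many small allocations carved out of large slabs, with no
    // per-allocation lock and no individual frees:
    int *Buf = Alloc.Allocate<int>(16);
    (void)Buf;
  } // everything is released at once when Alloc is destroyed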

 

The MS CRT malloc layer is a very thin wrapper over HeapAlloc, but the issue we want to solve is contention at the HeapAlloc level. There’s only one heap by default for the entire process, and every allocating thread needs to acquire the heap lock. In the short term, I don’t see an easy way to solve this except by bypassing HeapAlloc completely.

 

Reid mentioned that a Google toolchain/platform team experimented with tcmalloc3 (the one published here: https://github.com/google/tcmalloc - not the one in gperftools) integrated into LLVM and got results similar to the ones in the review (D71786) above.

 

 


Re: RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf seems to be the paper that goes with the slides I linked before. It says that there's some sort of adaptive mechanism that allocates per-CPU "affinity slots" if it detects lots of lock contention, which seems like it ought to give good multithreaded behavior.

I see in your initial email that the sample backtrace is in "free", not allocate. Is that just an example, or is "free" where effectively all the contention is? If the latter, I wonder if we're hitting some pathological edge case, e.g. allocating on one thread and then freeing on different threads, or something along those lines.



Re: RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev

For release builds, I think this is fine. However, for debug builds, the Windows allocator provides a lot of built-in functionality for debugging memory issues that I would be very sad to lose. Therefore, I would request that:

 

  1. This be added as a configuration option to select either the new allocator or the Windows allocator
  2. The Windows allocator be used by default in debug builds

 

Ideally, since you’re doing this work, you’d implement it in such a way that it’s fairly easy for anybody to use whatever allocator they want when building LLVM (on any platform, not just Windows), and it’s not just hardcoded to the system allocator vs. whatever allocator ends up getting added. However, as long as I can use the Windows debug allocator I’m happy.
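
As a sketch of what such a knob might look like (the cache variable name here is invented for illustration, not taken from the review):

  # Hypothetical CMake option: empty keeps the system CRT allocator.
  set(LLVM_INTEGRATED_CRT_ALLOC "" CACHE PATH
      "Path to rpmalloc/mimalloc/snmalloc sources to link into tools")
  if(LLVM_INTEGRATED_CRT_ALLOC AND NOT uppercase_CMAKE_BUILD_TYPE STREQUAL "DEBUG")
    # Link the replacement allocator into tool targets here; Debug builds
    # keep the Windows debug heap, per the request above.
  endif()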

 

Thanks,

   Christopher Tetreault

 


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
Note that ASAN support is present on Windows now.  Does the Debug CRT provide any features that are not better served by ASAN?


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev

I couldn’t tell you offhand, and a quick Google search does not turn up a comparison. However, one obvious advantage of the debug heap vs. Address Sanitizer is that the debug heap Just Works out of the box with no configuration. I prefer to use the built-in tools as much as possible because I find that integrating a bunch of random external stuff tends to be brittle and to have little sharp edges here and there. Address Sanitizer is certainly better than nothing, but Windows has a built-in instrumented malloc that just works, and works well. Ideally it remains available.

 

Realistically, we used the system allocator and it was found wanting, so now somebody is going to do the work to use a custom allocator. Since we had to change the allocator once, we may have to do it again. One size does not fit all, so we probably want different allocators on different platforms, and just using the system allocator should remain a first-class option. Since supporting N custom allocators means N different build configurations, I’d like to see the Windows allocator remain the default in some configuration, to ensure that it doesn’t bitrot.

 

Thanks,

   Christopher Tetreault

 


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
Asan and the Debug CRT take different approaches, but the problems they cover largely overlap.

Both help with detection of errors like buffer overruns, double frees, use after free, etc.  Asan generally gives you more immediate feedback on those, but you pay a higher price in performance.  The Debug CRT lets you trade off between the performance hit and how soon it detects problems.

Asan's documentation says leak detection is experimental on Windows, while the Debug CRT's leak detection is mature and robust (and can be nearly automatic in debug builds).  By adding a couple of calls, you can do finer-grained leak detection than checking what remains when the program exits.

The Debug CRT lets you hook all of the malloc calls if you want, so you can extend it for your own types of tracking and bug detection.  But I don't think that feature is often used.
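
For concreteness, a sketch of both the "couple of calls" for scoped leak detection and an allocation hook; these are standard <crtdbg.h> APIs, available in debug builds:

  #include <crtdbg.h>

  // Hook every allocation for custom tracking (the rarely used feature):
  static int MyAllocHook(int AllocType, void *UserData, size_t Size,
                         int BlockType, long Request,
                         const unsigned char *File, int Line) {
    return 1; // nonzero lets the allocation proceed
  }

  void CheckForLeaks() {
    _CrtSetAllocHook(MyAllocHook);
    _CrtMemState S1, S2, Diff;
    _CrtMemCheckpoint(&S1);
    // ...code under test...
    _CrtMemCheckpoint(&S2);
    if (_CrtMemDifference(&Diff, &S1, &S2)) // anything leaked in between?
      _CrtMemDumpStatistics(&Diff);
  }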

Windows' Application Verifier (AppVerifier) is cool and powerful.  I can't remember for sure, but I think some of its features might depend on the Debug CRT.  One thing it can do is simulate allocation failures so you can test your program's recovery code, though most programs nowadays assume memory allocation never fails and will just crash if it ever does.


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
Bearing in mind that the ASan allocator isn't particularly suited to detecting memory corruption unless you compile LLVM/Clang with ASan instrumentation as well. I don't imagine anybody would be proposing making the debug build for Windows be ASan-ified by default.


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev

Regarding the licensing problem of a new allocator: have you considered using Scudo? The version in compiler-rt is the upstream (and thus fully licensed with LLVM), and it's what we use as the production allocator on Android. The docs are a little out of date (see the source code in //compiler-rt/lib/scudo/standalone for the bleeding edge), and it doesn't support Windows out of the box currently, but there have been some successful experiments to get it working. I don't imagine that getting full support would be more challenging than setting up some sort of frankenbuild. From Kostya (who maintains Scudo): "I don't think the port is going to be a lot of effort."


Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
I hadn't heard this before.  If I use clang with -fsanitize=address to build my program, and then run my program, what difference does it make for the execution of my program whether the compiler itself was instrumented or not?  Do you mean that the ASAN runtime itself should be instrumented, since your program loads that at runtime?

On Tue, Jul 7, 2020 at 2:04 PM Mitch Phillips <[hidden email]> wrote:
Bearing in mind that the ASan allocator isn't particularly suited to detecting memory corruption unless you compile LLVM/Clang with ASan instrumentation as well. I don't imagine anybody would be proposing making the debug build for Windows be ASan-ified by default.

On Tue, Jul 7, 2020 at 1:49 PM Adrian McCarthy via llvm-dev <[hidden email]> wrote:
Asan and the Debug CRT take different approaches, but the problems they cover largely overlap.

Both help with detection of errors like buffer overrun, double free, use after free, etc.  Asan generally gives you more immediate feedback on those, but you pay a higher price in performance.  Debug CRT lets you trade off the performance hit against how soon it detects problems.

Asan documentation says leak detection is experimental on Windows, while the Debug CRT leak detection is mature and robust (and can be nearly automatic in debug builds).  By adding a couple of calls, you can do finer-grained leak detection than checking what remains when the program exits.

Debug CRT lets you hook all of the malloc calls if you want, so you can extend it for your own types of tracking and bug detection.  But I don't think that feature is often used.

Windows's AppVerifier is cool and powerful.  I cannot remember for sure, but I think some of its features might depend on the Debug CRT.  One thing it can do is simulate allocation failures so you can test your program's recovery code, but most programs nowadays assume memory allocation never fails and will just crash if it ever does.
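As a minimal sketch of those Debug CRT calls (MSVC-specific, only active in a debug /MDd or /MTd build; the leak in DoWork is deliberate so there is something to report):

  #include <crtdbg.h>

  static void DoWork() {
    (void)new int(42); // deliberately leaked for the demo
  }

  int main() {
    // Report anything still allocated when the process exits.
    _CrtSetDbgFlag(_CrtSetDbgFlag(_CRTDBG_REPORT_FLAG) | _CRTDBG_LEAK_CHECK_DF);

    // Finer-grained leak detection: snapshot, run, diff.
    _CrtMemState Before, After, Diff;
    _CrtMemCheckpoint(&Before);
    DoWork();
    _CrtMemCheckpoint(&After);
    if (_CrtMemDifference(&Diff, &Before, &After))
      _CrtMemDumpStatistics(&Diff); // dumps the leaked int above

    // _CrtSetAllocHook(MyHook); // hook every allocation (MyHook is yours)
    return 0;
  }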

On Tue, Jul 7, 2020 at 10:25 AM Zachary Turner via llvm-dev <[hidden email]> wrote:
Note that ASAN support is present on Windows now.  Does the Debug CRT provide any features that are not better served by ASAN?


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
> If I use clang with -fsanitize=address to build my program, and then run my program, what difference does it make for the execution of my program whether the compiler itself was instrumented or not

Correct, it doesn't make a difference to your final executable whether the compiler was built with ASan or not.

> Do you mean that ASAN runtime itself should be instrumented, since your program loads that at runtime?

Sanitizer runtimes aren't instrumented with sanitizers :).

-------

To be clear, we're talking about replacing the runtime allocator for clang/LLD/etc., right? We're not talking about replacing the default allocator for -O0 executables?

In either instance, using the ASan allocator (for either clang or executables) is possible, but won't provide any of the bug detection capabilities you describe without also ensuring that clang/your executable is built with ASan instrumentation (-fsanitize=address implies both "replace my allocator" and "instrument my code").
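A tiny illustrative example of that split (file name and commands are just for demonstration):

  // overflow.cpp
  //   clang++ -fsanitize=address -g overflow.cpp
  //     -> ASan replaces the allocator AND instruments the store below, so
  //        the write into the redzone is reported as a heap-buffer-overflow.
  //   Linking only the ASan runtime (its allocator) without compiling this
  //   file with -fsanitize=address would NOT catch it: the store is never
  //   checked.
  int main() {
    int *A = new int[8];
    A[8] = 42; // one element past the end
    delete[] A;
    return 0;
  }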


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev

> To be clear, we're talking about replacing the runtime allocator for clang/LLD/etc., right

 

This is my understanding. I want to ensure that the CRT debug allocator remains optional and on by default for debug builds, so that I can use it to troubleshoot memory corruption issues in clang/LLVM/etc. itself. The alternative would be instrumenting debug builds of LLVM with ASan to provide similar benefits.

 

If I’m reading downthread correctly, it takes something like 40 minutes to link clang.exe with LLD using LTO if LLD is using the CRT allocator, and something like 3 minutes if LLD is using some other allocator. Assuming these numbers are correct, and nothing was wrong with the LLD built with the CRT allocator, this certainly seems like a compelling reason to switch allocators. However, I doubt anybody is trying to use an LLD built in debug mode on Windows to link clang.exe with LTO; I imagine it’d take an actual day to finish that build. The main use for a clang.exe built in debug mode on Windows is to build small test programs and lit tests and such with a debugger attached. For this use case, I believe that the CRT debug allocator is the correct choice.

 

As a side note, these numbers seem very fishy to me. While it’s tempting to say that “malloc is a black box. I ask for a pointer, I get a pointer. I shouldn’t have to know what it does internally”, and just replace the allocator, I feel like this merits investigation. Why are we allocating so much? Perhaps we should try to find ways to reduce the number of allocations. Are we doing something silly like creating a new std::vector in every iteration of an inner loop somewhere? If we have tons of unnecessary allocations, we could potentially speed up LLD on all platforms. 3 minutes is still a really long time; if we could get that down to 30 seconds, that would be amazing. I keep hearing that each new version of LLVM takes longer to compile than the last. Perhaps it is time for us to figure out why? Maybe it’s lots of unnecessary allocations.
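To show the kind of pattern I mean (purely illustrative, with made-up types; not a known spot in LLD):

  #include <vector>

  struct Symbol {};
  struct Chunk {};
  void process(const Symbol &, std::vector<Chunk *> &Worklist) {
    Worklist.push_back(nullptr); // stand-in for real work
  }

  void slow(const std::vector<Symbol> &Symbols) {
    for (const Symbol &S : Symbols) {
      std::vector<Chunk *> Worklist; // storage malloc'd and freed
      process(S, Worklist);          // on every single iteration
    }
  }

  void fast(const std::vector<Symbol> &Symbols) {
    std::vector<Chunk *> Worklist; // hoisted: storage allocated once
    for (const Symbol &S : Symbols) {
      Worklist.clear(); // keeps capacity; no allocator traffic
      process(S, Worklist);
    }
  }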

 

Thanks,

   Christopher Tetreault

 



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
In reply to this post by David Chisnall via cfe-dev

Yes, sorry, the call stack was only showing “free”. However, the time is split evenly between calls to HeapFree and HeapAlloc, so I don’t think this is a pathological case. We can clearly see in the profile traces threads stalling on the heap’s critical section, then being woken up later once another thread releases it.

 

As for the adaptive “affinity slot” behavior, I got word that it doesn’t scale above 4 threads. From what I hear, the segment heap behaves much like the older low-fragmentation heap in terms of multi-threaded performance/contention, although I’d like to hear other opinions if anyone has deeper practical knowledge of Windows’ segment heap.

 

 

From: James Y Knight <[hidden email]>
Sent: July 6, 2020 3:19 PM
To: Alexandre Ganea <[hidden email]>
Cc: Clang Dev <[hidden email]>; LLVM Dev <[hidden email]>
Subject: Re: [cfe-dev] RFC: Replacing the default CRT allocator on Windows

 

https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf seems to be the paper that goes with the slides I linked before. It says that there's some sort of adaptive mechanism that allocates a per-CPU "affinity slot" if it detects lots of lock contention. Which seems like it ought to have good multithreaded behavior.

 

I see in your initial email that the sample backtrace is in "free", not allocate. Is that just an example, or is "free" where effectively all the contention is? If the latter, I wonder if we're hitting some pathological edge-case...e.g. allocating on one thread, and then freeing on different threads, or something along those lines.

 

 

On Thu, Jul 2, 2020, 11:56 PM Alexandre Ganea <[hidden email]> wrote:

Thanks for the suggestion, James; it reduces the memory commit by about 900 MB (14.9 GB -> 14 GB).

 

Unfortunately it does not solve the performance problem. The heap is global to the application and thread-safe, so every malloc/free takes its lock, which evidently doesn’t scale. We could manually create thread-local heaps, but I didn’t want to go there: ultimately allocated blocks need to share ownership between threads, and at that point it’s like writing a new allocator. I suppose most non-Windows platforms already have lock-free thread-local arenas, which probably explains why this issue has gone (mostly) unnoticed.
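To make the ownership problem concrete, here is a sketch with raw Win32 heaps (hypothetical, not proposed code):

  #include <windows.h>

  // One private heap per thread. HEAP_NO_SERIALIZE skips the heap lock,
  // which is only safe while a single thread ever touches that heap.
  thread_local HANDLE TlsHeap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);

  void *ThreadLocalAlloc(SIZE_T Size) {
    return HeapAlloc(TlsHeap, 0, Size);
  }

  void ThreadLocalFree(void *Ptr) {
    // Wrong in general: if Ptr came from another thread's heap, this call
    // corrupts both heaps. HeapFree needs the owning heap handle, so blocks
    // that migrate between threads would have to carry it along, at which
    // point you are effectively writing a new allocator.
    HeapFree(TlsHeap, 0, Ptr);
  }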

 

 

From: James Y Knight <[hidden email]>
Sent: July 2, 2020 6:08 PM
To: Alexandre Ganea <[hidden email]>
Cc: Clang Dev <[hidden email]>; LLVM Dev <[hidden email]>
Subject: Re: [cfe-dev] RFC: Replacing the default CRT allocator on Windows

 

Have you tried Microsoft's new "segment heap" implementation? Only apps that opt in get it at the moment. Reportedly Edge and Chromium are getting large memory savings from switching, but I haven't seen performance comparisons.

 

If the performance is good, it seems like that might be the simplest choice.

 

 

 



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

David Chisnall via cfe-dev
In reply to this post by David Chisnall via cfe-dev

That sounds like an interesting idea. What does it take to complete/land the Windows port? Do you think the performance would be equivalent to that of the allocators mentioned in the review?

 

From: llvm-dev <[hidden email]> On Behalf Of Mitch Phillips via llvm-dev
Sent: July 7, 2020 5:15 PM
To: Adrian McCarthy <[hidden email]>; Kostya Kortchinsky <[hidden email]>
Cc: LLVM Dev <[hidden email]>; [hidden email]
Subject: Re: [llvm-dev] [cfe-dev] RFC: Replacing the default CRT allocator on Windows

 


 

w.r.t the licensing problem of a new allocator - have you considered using Scudo? The version in compiler-rt is the upstream (and thus fully licensed with LLVM), and it's what we use as the production allocator in Android. The docs are a little out of date (see the source code in //compiler-rt/lib/scudo/standalone for the bleeding edge), and it doesn't support Windows out of the box currently - but there have been some successful experiments to get it working. I don't imagine that getting full support would be more challenging than setting some sort of frankenbuild up. From Kostya (who maintains Scudo), "I don't think the port is going to be a lot of effort". 

 



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev