RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
- If you think the above patches should be split up for initial review / commit, how?

(Other feedback welcome too!)

Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

- OutputManager—a shared manager for creating outputs without knowing about the storage.
- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

The patch includes a few backends:

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).

Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!
Duncan

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
Definitely agree.
 
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.
Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
 
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.
If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.
 
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Relatedly, working-directory/relative-path handling should be considered.

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
 
- If you think the above patches should be split up for initial review / commit, how?
Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.
 

(Other feedback welcome too!)

Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

- OutputManager—a shared manager for creating outputs without knowing about the storage.
- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

The patch includes a few backends:

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!
Duncan
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
Thanks for the detailed response Sam!

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn't that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there's a refactoring that could fix it.

Here are the problems I hit that led to this design:

1. Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:
- ClientIntentOutputConfig::NeedsSeeking
- OnDiskOutputConfig::OpenFlagText

Maybe also something like:
- OnDiskOutputConfig::CreateDirectories
(and maybe others)

I couldn't think of a good way to solve this besides passing in configuration at output creation time.

2. Some "configurers" may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
- OnDiskOutputConfig::RemoveFileOnSignal
(others flags might benefit)

This is "documented" by the call chains of clang::CompilerInstance::createOutputFile.

I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any "contained" OnDiskOutputBackend. This seems pretty heavyweight though.

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

Firstly, `Output` has the user-facing abstraction. `OutputDestination` has low-level bits. Another reasonable name might be `OutputImpl` but `OutputDescription` seemed more descriptive.

Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:
- Avoid buffering content unless necessary.
- Avoid duplicating content buffers unless necessary.
- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious "interesting" case is in-memory + on-disk (in either order).
- Make sub-classes of `OutputDestination` as small as possible given the above.

Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as manage the intermediate content buffer.

Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).

(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn't need to exist.)

- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn't get around to posting it until this AM):

My intuition is:

Thread-safe:
- OutputBackend::createDestination
- Concurrently touching more than one Output/OutputDestination

Thread-unsafe:
- Using a single Output/OutputDestination concurrently

This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.

Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockManager to be skipped. This is not really about outputs — it's inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.

Relatedly, working-directory/relative-path handling should be considered.

Yeah, you're probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.

Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD concurrent access should be fine.

Two other thoughts related to paths:

1. I wonder if this abstraction treats the "path" as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)

I think it can. One approach is to use proxy backends:
- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)
- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)
- Outer layer passes the middle layer a proxy backend that "does the right thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)

=> If writing on-disk, "the right thing" is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.
=> If writing to (e.g.) Git's CAS, "the right thing" is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call "git-mktree" when the middle layer is "done" (maybe a callback in the backend-passed-to-the-middle-layer's destructor).

(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)

I think it's doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:
- Is `Output::getPath()` an abstraction leak?
- Should we have a `createOutput` that doesn't take a path?
- ...

Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
- The inner layer needs an output backend if it (sometimes) dumps "side" files (such as AST record layouts into ".ast-dump" or textual IR into ".ll"). This avoids needing to know the on-disk file path ("/path/to/output" => "/path/to/output.ll"), or to even know whether there's a disk.

2. How should we virtualize stdout / stderr?
- "'-' means stdout" is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?
- I'm not sure what to do with stderr. No one ever "closes" this stream.
- Are there other outputs that don't have path names?

3. Do we need to virtualize llvm::sys::fs::create_directories?
- If so, why?

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

No, I don't think that should be a requirement / expectation. It's a specific requirement for Clang's implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang's job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.

Note: implicit modules doesn't currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It's pretty awkward though.

On 2021 Jan  28, at 03:19, Sam McCall <[hidden email]> wrote:

Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
Definitely agree.
 
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.
Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
 
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.
If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.
 
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Relatedly, working-directory/relative-path handling should be considered.

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
 
- If you think the above patches should be split up for initial review / commit, how?
Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.
 

(Other feedback welcome too!)

Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

- OutputManager—a shared manager for creating outputs without knowing about the storage.
- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

The patch includes a few backends:

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!
Duncan
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
A few follow-ups:

On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.

This is what it looks like if the mirroring stuff is localized to MirroringOutputDestination:
It's cleaner / more clear / less error-prone. I'll update the patch when I get a moment.

Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).

I haven't had a chance to prototype this, but I imagine it's fairly straightforward. This is just moving OutputManager::createOutput over to OutputBackend, and changing createDestination to take a PartialOutputConfig (which can be ignored or not).

- Is `Output::getPath()` an abstraction leak?

Removing Output::getConfig is easy:
I'll remove it from the patch when I get a moment.

Removing Output::getPath is a bit harder, but doable:
I think this is the right thing to do, per the longer discussion in my previous email... but it's a bit more awkward.

Output also doesn't need to be in a unique_ptr, as long as it's move-only. Here's what it looks like without the extra level indirection:
Probably it should have a NoneType constructor / etc., since it's fairly Optional-like.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
In reply to this post by Dimitry Andric via cfe-dev
Update: I've incorporated much of Sam's feedback into the main patch (https://reviews.llvm.org/D95501).

- Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.
- Remove OutputManager, instead exposing OutputBackend directly.
- Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().
- Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().
- Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.

The main thing not settled is the threading guarantees. Restating the straw man I proposed:

- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory *is* an OutputBackend, it can be used concurrently as well.
- An OutputFile cannot be used concurrently, but two files from the same backend can be.

Interested in everyone's thoughts on that; if that sounds reasonable I can update the patch to make it so.

Two other points:

- Sam proposed dropping OutputConfig. I don't think we can, as the API client needs to communicate semantic information about specific outputs, not about the backends generally.
- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked through some of the details in my previous response; most of the complexity is there to make MirroringOutputBackend work efficiently and avoid duplication in subclasses. As a counterpoint, the only API a concrete subclass needs to override is storeContentBuffer().

Sam, let please take another look and let me know if you have more high-level comments.
Others, if you have feedback, please let me know!

Thanks!
Duncan

On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

Thanks for the detailed response Sam!

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn't that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there's a refactoring that could fix it.

Here are the problems I hit that led to this design:

1. Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:
- ClientIntentOutputConfig::NeedsSeeking
- OnDiskOutputConfig::OpenFlagText

Maybe also something like:
- OnDiskOutputConfig::CreateDirectories
(and maybe others)

I couldn't think of a good way to solve this besides passing in configuration at output creation time.

2. Some "configurers" may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
- OnDiskOutputConfig::RemoveFileOnSignal
(others flags might benefit)

This is "documented" by the call chains of clang::CompilerInstance::createOutputFile.

I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any "contained" OnDiskOutputBackend. This seems pretty heavyweight though.

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

Firstly, `Output` has the user-facing abstraction. `OutputDestination` has low-level bits. Another reasonable name might be `OutputImpl` but `OutputDescription` seemed more descriptive.

Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:
- Avoid buffering content unless necessary.
- Avoid duplicating content buffers unless necessary.
- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious "interesting" case is in-memory + on-disk (in either order).
- Make sub-classes of `OutputDestination` as small as possible given the above.

Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as manage the intermediate content buffer.

Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).

(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn't need to exist.)

- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn't get around to posting it until this AM):

My intuition is:

Thread-safe:
- OutputBackend::createDestination
- Concurrently touching more than one Output/OutputDestination

Thread-unsafe:
- Using a single Output/OutputDestination concurrently

This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.

Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockManager to be skipped. This is not really about outputs — it's inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.

Relatedly, working-directory/relative-path handling should be considered.

Yeah, you're probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.

Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD concurrent access should be fine.

Two other thoughts related to paths:

1. I wonder if this abstraction treats the "path" as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)

I think it can. One approach is to use proxy backends:
- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)
- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)
- Outer layer passes the middle layer a proxy backend that "does the right thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)

=> If writing on-disk, "the right thing" is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.
=> If writing to (e.g.) Git's CAS, "the right thing" is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call "git-mktree" when the middle layer is "done" (maybe a callback in the backend-passed-to-the-middle-layer's destructor).

(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)

I think it's doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:
- Is `Output::getPath()` an abstraction leak?
- Should we have a `createOutput` that doesn't take a path?
- ...

Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
- The inner layer needs an output backend if it (sometimes) dumps "side" files (such as AST record layouts into ".ast-dump" or textual IR into ".ll"). This avoids needing to know the on-disk file path ("/path/to/output" => "/path/to/output.ll"), or to even know whether there's a disk.

2. How should we virtualize stdout / stderr?
- "'-' means stdout" is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?
- I'm not sure what to do with stderr. No one ever "closes" this stream.
- Are there other outputs that don't have path names?

3. Do we need to virtualize llvm::sys::fs::create_directories?
- If so, why?

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

No, I don't think that should be a requirement / expectation. It's a specific requirement for Clang's implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang's job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.

Note: implicit modules doesn't currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It's pretty awkward though.

On 2021 Jan  28, at 03:19, Sam McCall <[hidden email]> wrote:

Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
Definitely agree.
 
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.
Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
 
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.
If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.
 
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Relatedly, working-directory/relative-path handling should be considered.

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
 
- If you think the above patches should be split up for initial review / commit, how?
Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.
 

(Other feedback welcome too!)

Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

- OutputManager—a shared manager for creating outputs without knowing about the storage.
- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

The patch includes a few backends:

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!
Duncan
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
Sam, any thoughts here?

On 2021 Feb  2, at 19:36, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

Update: I've incorporated much of Sam's feedback into the main patch (https://reviews.llvm.org/D95501).

- Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.
- Remove OutputManager, instead exposing OutputBackend directly.
- Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().
- Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().
- Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.

The main thing not settled is the threading guarantees. Restating the straw man I proposed:

- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory *is* an OutputBackend, it can be used concurrently as well.
- An OutputFile cannot be used concurrently, but two files from the same backend can be.

Interested in everyone's thoughts on that; if that sounds reasonable I can update the patch to make it so.

Two other points:

- Sam proposed dropping OutputConfig. I don't think we can, as the API client needs to communicate semantic information about specific outputs, not about the backends generally.
- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked through some of the details in my previous response; most of the complexity is there to make MirroringOutputBackend work efficiently and avoid duplication in subclasses. As a counterpoint, the only API a concrete subclass needs to override is storeContentBuffer().

Sam, let please take another look and let me know if you have more high-level comments.
Others, if you have feedback, please let me know!

Thanks!
Duncan

On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

Thanks for the detailed response Sam!

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn't that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there's a refactoring that could fix it.

Here are the problems I hit that led to this design:

1. Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:
- ClientIntentOutputConfig::NeedsSeeking
- OnDiskOutputConfig::OpenFlagText

Maybe also something like:
- OnDiskOutputConfig::CreateDirectories
(and maybe others)

I couldn't think of a good way to solve this besides passing in configuration at output creation time.

2. Some "configurers" may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
- OnDiskOutputConfig::RemoveFileOnSignal
(others flags might benefit)

This is "documented" by the call chains of clang::CompilerInstance::createOutputFile.

I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any "contained" OnDiskOutputBackend. This seems pretty heavyweight though.

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

Firstly, `Output` has the user-facing abstraction. `OutputDestination` has low-level bits. Another reasonable name might be `OutputImpl` but `OutputDescription` seemed more descriptive.

Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:
- Avoid buffering content unless necessary.
- Avoid duplicating content buffers unless necessary.
- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious "interesting" case is in-memory + on-disk (in either order).
- Make sub-classes of `OutputDestination` as small as possible given the above.

Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as manage the intermediate content buffer.

Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).

(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn't need to exist.)

- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn't get around to posting it until this AM):

My intuition is:

Thread-safe:
- OutputBackend::createDestination
- Concurrently touching more than one Output/OutputDestination

Thread-unsafe:
- Using a single Output/OutputDestination concurrently

This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.

Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockManager to be skipped. This is not really about outputs — it's inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.

Relatedly, working-directory/relative-path handling should be considered.

Yeah, you're probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.

Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD concurrent access should be fine.

Two other thoughts related to paths:

1. I wonder if this abstraction treats the "path" as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)

I think it can. One approach is to use proxy backends:
- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)
- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)
- Outer layer passes the middle layer a proxy backend that "does the right thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)

=> If writing on-disk, "the right thing" is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.
=> If writing to (e.g.) Git's CAS, "the right thing" is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call "git-mktree" when the middle layer is "done" (maybe a callback in the backend-passed-to-the-middle-layer's destructor).

(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)

I think it's doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:
- Is `Output::getPath()` an abstraction leak?
- Should we have a `createOutput` that doesn't take a path?
- ...

Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
- The inner layer needs an output backend if it (sometimes) dumps "side" files (such as AST record layouts into ".ast-dump" or textual IR into ".ll"). This avoids needing to know the on-disk file path ("/path/to/output" => "/path/to/output.ll"), or to even know whether there's a disk.

2. How should we virtualize stdout / stderr?
- "'-' means stdout" is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?
- I'm not sure what to do with stderr. No one ever "closes" this stream.
- Are there other outputs that don't have path names?

3. Do we need to virtualize llvm::sys::fs::create_directories?
- If so, why?

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

No, I don't think that should be a requirement / expectation. It's a specific requirement for Clang's implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang's job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.

Note: implicit modules doesn't currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It's pretty awkward though.

On 2021 Jan  28, at 03:19, Sam McCall <[hidden email]> wrote:

Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
Definitely agree.
 
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.
Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
 
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.
If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.
 
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Relatedly, working-directory/relative-path handling should be considered.

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
 
- If you think the above patches should be split up for initial review / commit, how?
Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.
 

(Other feedback welcome too!)

Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

- OutputManager—a shared manager for creating outputs without knowing about the storage.
- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

The patch includes a few backends:

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!
Duncan
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
Sorry about the long delay!

This looks much better to me, thank you for the simplifications!
I've made a bunch of notes on the patch itself which I'll send now too. High-level things inline: TL;DR: stream/buffer confusion in outputfile, scope.

On Wed, Feb 10, 2021 at 3:12 AM Duncan P. N. Exon Smith <[hidden email]> wrote:
Sam, any thoughts here?

On 2021 Feb  2, at 19:36, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

Update: I've incorporated much of Sam's feedback into the main patch (https://reviews.llvm.org/D95501).

- Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.
The remaining options look great, and agree that we can't expect everything to be configured globally.
- Remove OutputManager, instead exposing OutputBackend directly.
Hooray! 
- Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().
- Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().
- Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.
This is definitely needed, but there are a few design options and I can't say I precisely understand either the requirements or the solution in the patch.
I think it would be useful to review this separately. 

The main thing not settled is the threading guarantees. Restating the straw man I proposed:

- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory *is* an OutputBackend, it can be used concurrently as well.
- An OutputFile cannot be used concurrently, but two files from the same backend can be.
LGTM 

Interested in everyone's thoughts on that; if that sounds reasonable I can update the patch to make it so.

Two other points:

- Sam proposed dropping OutputConfig. I don't think we can, as the API client needs to communicate semantic information about specific outputs, not about the backends generally.
- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked through some of the details in my previous response; most of the complexity is there to make MirroringOutputBackend work efficiently and avoid duplication in subclasses. As a counterpoint, the only API a concrete subclass needs to override is storeContentBuffer().
Hmm, OutputFile still feels caught between two concepts: a buffer-based one and a stream based one.
I understand the goal to make this easy to implement and easy to use, but having a lot of conceptual distance between the two, and optional methods/implementation strategies is easy to misunderstand.

I may have missed it (apologies if so), but is there somewhere a breakdown of which performance-critical configurations this is aimed at?
(e.g. many small outputs that don't benefit from streaming, mirrored to two output streams, but one is null)
Performance is certainly going to cut against simplicity sometimes, but it's hard to evaluate without knowing what we're optimizing for.

 > MirroringOutputBackend work efficiently
One thing I don't understand about MirroringOutputBackend - why does it buffer the content? It seems possible to create a raw_pwrite_stream that broadcasts to raw_pwrite_streams of the underlying outputs.

Sam, let please take another look and let me know if you have more high-level comments.
Only other thing is the scope of the patch - it's useful that as an RFC it's comprehensive and shows things fitting together.
But the amount of detail is a bit overwhelming :-)
I think the parts regarding directories, unique files, mirroring/filtering backends, many of the output options would be easier to review in detail in isolated patches.
 
Others, if you have feedback, please let me know!

Thanks!
Duncan

On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

Thanks for the detailed response Sam!

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn't that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there's a refactoring that could fix it.

Here are the problems I hit that led to this design:

1. Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:
- ClientIntentOutputConfig::NeedsSeeking
- OnDiskOutputConfig::OpenFlagText

Maybe also something like:
- OnDiskOutputConfig::CreateDirectories
(and maybe others)

I couldn't think of a good way to solve this besides passing in configuration at output creation time.

2. Some "configurers" may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
- OnDiskOutputConfig::RemoveFileOnSignal
(others flags might benefit)

This is "documented" by the call chains of clang::CompilerInstance::createOutputFile.

I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any "contained" OnDiskOutputBackend. This seems pretty heavyweight though.

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

Firstly, `Output` has the user-facing abstraction. `OutputDestination` has low-level bits. Another reasonable name might be `OutputImpl` but `OutputDescription` seemed more descriptive.

Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:
- Avoid buffering content unless necessary.
- Avoid duplicating content buffers unless necessary.
- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious "interesting" case is in-memory + on-disk (in either order).
- Make sub-classes of `OutputDestination` as small as possible given the above.

Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as manage the intermediate content buffer.

Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).

(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn't need to exist.)

- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn't get around to posting it until this AM):

My intuition is:

Thread-safe:
- OutputBackend::createDestination
- Concurrently touching more than one Output/OutputDestination

Thread-unsafe:
- Using a single Output/OutputDestination concurrently

This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.

Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockManager to be skipped. This is not really about outputs — it's inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.

Relatedly, working-directory/relative-path handling should be considered.

Yeah, you're probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.

Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD concurrent access should be fine.

Two other thoughts related to paths:

1. I wonder if this abstraction treats the "path" as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)

I think it can. One approach is to use proxy backends:
- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)
- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)
- Outer layer passes the middle layer a proxy backend that "does the right thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)

=> If writing on-disk, "the right thing" is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.
=> If writing to (e.g.) Git's CAS, "the right thing" is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call "git-mktree" when the middle layer is "done" (maybe a callback in the backend-passed-to-the-middle-layer's destructor).

(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)

I think it's doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:
- Is `Output::getPath()` an abstraction leak?
- Should we have a `createOutput` that doesn't take a path?
- ...

Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
- The inner layer needs an output backend if it (sometimes) dumps "side" files (such as AST record layouts into ".ast-dump" or textual IR into ".ll"). This avoids needing to know the on-disk file path ("/path/to/output" => "/path/to/output.ll"), or to even know whether there's a disk.

2. How should we virtualize stdout / stderr?
- "'-' means stdout" is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?
- I'm not sure what to do with stderr. No one ever "closes" this stream.
- Are there other outputs that don't have path names?

3. Do we need to virtualize llvm::sys::fs::create_directories?
- If so, why?

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

No, I don't think that should be a requirement / expectation. It's a specific requirement for Clang's implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang's job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.

Note: implicit modules doesn't currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It's pretty awkward though.

On 2021 Jan  28, at 03:19, Sam McCall <[hidden email]> wrote:

Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
Definitely agree.
 
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.
Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
 
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.
If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.
 
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Relatedly, working-directory/relative-path handling should be considered.

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
 
- If you think the above patches should be split up for initial review / commit, how?
Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.
 

(Other feedback welcome too!)

Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

- OutputManager—a shared manager for creating outputs without knowing about the storage.
- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

The patch includes a few backends:

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!
Duncan
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev

Hi Duncan,

 

Thanks for this work. At AMD we were working on similar lines for a use case that requires entire process of compilations and linking to be done in-memory. The approach we were following was introducing write capability in LLVM VFS and use it with compiler outputs. This patch allows us to have in-memory compilations. I have two observations from the point of view of the use case above (I am not sure if it makes sense to have it here)

  1. The capability of deleting directories is not available. Applications use temp directories for compilation and it will be good to delete them.
  2. The generated unique names when going through StableUniqueEntityHelper follows a predictable series of names unlike current implementation in LLVM which generates random names. This could pose a challenge in a scenario where application is performing repeated compilations each using separate Output backends.

 

Are you planning to extend similar support for LLD in future?

 

Thanks,

Dinesh

 

From: llvm-dev <[hidden email]> On Behalf Of Sam McCall via llvm-dev
Sent: Monday, February 22, 2021 7:41 PM
To: Duncan P. N. Exon Smith <[hidden email]>
Cc: LLVM Dev <[hidden email]>; CFE Dev <[hidden email]>
Subject: Re: [llvm-dev] [cfe-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

 

[CAUTION: External Email]

Sorry about the long delay!

 

This looks much better to me, thank you for the simplifications!

I've made a bunch of notes on the patch itself which I'll send now too. High-level things inline: TL;DR: stream/buffer confusion in outputfile, scope.

 

On Wed, Feb 10, 2021 at 3:12 AM Duncan P. N. Exon Smith <[hidden email]> wrote:

Sam, any thoughts here?

 

On 2021 Feb  2, at 19:36, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

 

Update: I've incorporated much of Sam's feedback into the main patch (https://reviews.llvm.org/D95501).

 

- Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.

The remaining options look great, and agree that we can't expect everything to be configured globally.

- Remove OutputManager, instead exposing OutputBackend directly.

Hooray! 

- Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().

- Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().

- Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.

This is definitely needed, but there are a few design options and I can't say I precisely understand either the requirements or the solution in the patch.

I think it would be useful to review this separately. 

 

The main thing not settled is the threading guarantees. Restating the straw man I proposed:

 

- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory *is* an OutputBackend, it can be used concurrently as well.

- An OutputFile cannot be used concurrently, but two files from the same backend can be.

LGTM 

 

Interested in everyone's thoughts on that; if that sounds reasonable I can update the patch to make it so.

 

Two other points:

 

- Sam proposed dropping OutputConfig. I don't think we can, as the API client needs to communicate semantic information about specific outputs, not about the backends generally.

- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked through some of the details in my previous response; most of the complexity is there to make MirroringOutputBackend work efficiently and avoid duplication in subclasses. As a counterpoint, the only API a concrete subclass needs to override is storeContentBuffer().

Hmm, OutputFile still feels caught between two concepts: a buffer-based one and a stream based one.

I understand the goal to make this easy to implement and easy to use, but having a lot of conceptual distance between the two, and optional methods/implementation strategies is easy to misunderstand.

 

I may have missed it (apologies if so), but is there somewhere a breakdown of which performance-critical configurations this is aimed at?

(e.g. many small outputs that don't benefit from streaming, mirrored to two output streams, but one is null)

Performance is certainly going to cut against simplicity sometimes, but it's hard to evaluate without knowing what we're optimizing for.

 

 > MirroringOutputBackend work efficiently

One thing I don't understand about MirroringOutputBackend - why does it buffer the content? It seems possible to create a raw_pwrite_stream that broadcasts to raw_pwrite_streams of the underlying outputs.

 

Sam, let please take another look and let me know if you have more high-level comments.

Only other thing is the scope of the patch - it's useful that as an RFC it's comprehensive and shows things fitting together.

But the amount of detail is a bit overwhelming :-)

I think the parts regarding directories, unique files, mirroring/filtering backends, many of the output options would be easier to review in detail in isolated patches.

 

Others, if you have feedback, please let me know!

 

Thanks!

Duncan

 

On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

 

Thanks for the detailed response Sam!

 

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:

 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

 

Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn't that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there's a refactoring that could fix it.

 

Here are the problems I hit that led to this design:

 

1. Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:

- ClientIntentOutputConfig::NeedsSeeking

- OnDiskOutputConfig::OpenFlagText

 

Maybe also something like:

- OnDiskOutputConfig::CreateDirectories

(and maybe others)

 

I couldn't think of a good way to solve this besides passing in configuration at output creation time.

 

2. Some "configurers" may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:

- OnDiskOutputConfig::RemoveFileOnSignal

(others flags might benefit)

 

This is "documented" by the call chains of clang::CompilerInstance::createOutputFile.

 

I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any "contained" OnDiskOutputBackend. This seems pretty heavyweight though.

 

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

 

Firstly, `Output` has the user-facing abstraction. `OutputDestination` has low-level bits. Another reasonable name might be `OutputImpl` but `OutputDescription` seemed more descriptive.

 

Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:

- Avoid buffering content unless necessary.

- Avoid duplicating content buffers unless necessary.

- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious "interesting" case is in-memory + on-disk (in either order).

- Make sub-classes of `OutputDestination` as small as possible given the above.

 

Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.

 

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

 

In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as manage the intermediate content buffer.

 

Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).

 

(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn't need to exist.)

 

- Any other major concerns / suggestions?

Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.

It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

 

Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn't get around to posting it until this AM):

 

My intuition is:

 

Thread-safe:

- OutputBackend::createDestination

- Concurrently touching more than one Output/OutputDestination

 

Thread-unsafe:

- Using a single Output/OutputDestination concurrently

 

This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.

 

Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockManager to be skipped. This is not really about outputs — it's inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.

 

Relatedly, working-directory/relative-path handling should be considered.

 

Yeah, you're probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.

 

Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD concurrent access should be fine.

 

Two other thoughts related to paths:

 

1. I wonder if this abstraction treats the "path" as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)

 

I think it can. One approach is to use proxy backends:

- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)

- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)

- Outer layer passes the middle layer a proxy backend that "does the right thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)

 

=> If writing on-disk, "the right thing" is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.

=> If writing to (e.g.) Git's CAS, "the right thing" is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call "git-mktree" when the middle layer is "done" (maybe a callback in the backend-passed-to-the-middle-layer's destructor).

 

(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)

 

I think it's doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:

- Is `Output::getPath()` an abstraction leak?

- Should we have a `createOutput` that doesn't take a path?

- ...

 

Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?

- The inner layer needs an output backend if it (sometimes) dumps "side" files (such as AST record layouts into ".ast-dump" or textual IR into ".ll"). This avoids needing to know the on-disk file path ("/path/to/output" => "/path/to/output.ll"), or to even know whether there's a disk.

 

2. How should we virtualize stdout / stderr?

- "'-' means stdout" is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?

- I'm not sure what to do with stderr. No one ever "closes" this stream.

- Are there other outputs that don't have path names?

 

3. Do we need to virtualize llvm::sys::fs::create_directories?

- If so, why?

 

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

 

Why doesn't this inherit from llvm::vfs::FileSystem?

Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).

I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

 

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

 

No, I don't think that should be a requirement / expectation. It's a specific requirement for Clang's implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.

 

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

 

Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang's job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.

 

Note: implicit modules doesn't currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It's pretty awkward though.

 

On 2021 Jan  28, at 03:19, Sam McCall <[hidden email]> wrote:

 

Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...

 

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:

TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:

https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager

https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

 

Questions for the reader

- Should we virtualize compiler outputs in Clang? (Hint: yes.)

Definitely agree.

 

- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)

Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.

Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.

 

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:

 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

 

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

 

- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)

llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.

If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.

 

- Do you have a use case that this won't address well?

- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?

- Any other major concerns / suggestions?

Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.

It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

 

Relatedly, working-directory/relative-path handling should be considered.

 

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

 

- If you think the above patches should be split up for initial review / commit, how?

Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.

 

 

(Other feedback welcome too!)

 

Longer version

There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

 

- Tooling wants to capture outputs directly, without going through the filesystem.

- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.

- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.

- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

 

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

 

- OutputManager—a shared manager for creating outputs without knowing about the storage.

- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.

- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.

- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.

- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.

- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

 

The patch includes a few backends:

 

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).

I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

 

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

 

Other work in the area

See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!

Duncan

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

 

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
In reply to this post by Dimitry Andric via cfe-dev
Thanks for the replies Sam! I've just back after being out for a bit; I'll try to give a substantive reply later this week.

Only other thing is the scope of the patch - it's useful that as an RFC it's comprehensive and shows things fitting together.

Absolutely; agreed it needs to be split up for review.

On 2021 Feb  22, at 06:11, Sam McCall <[hidden email]> wrote:

Sorry about the long delay!

This looks much better to me, thank you for the simplifications!
I've made a bunch of notes on the patch itself which I'll send now too. High-level things inline: TL;DR: stream/buffer confusion in outputfile, scope.

On Wed, Feb 10, 2021 at 3:12 AM Duncan P. N. Exon Smith <[hidden email]> wrote:
Sam, any thoughts here?

On 2021 Feb  2, at 19:36, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

Update: I've incorporated much of Sam's feedback into the main patch (https://reviews.llvm.org/D95501).

- Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.
The remaining options look great, and agree that we can't expect everything to be configured globally.
- Remove OutputManager, instead exposing OutputBackend directly.
Hooray! 
- Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().
- Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().
- Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.
This is definitely needed, but there are a few design options and I can't say I precisely understand either the requirements or the solution in the patch.
I think it would be useful to review this separately. 

The main thing not settled is the threading guarantees. Restating the straw man I proposed:

- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory *is* an OutputBackend, it can be used concurrently as well.
- An OutputFile cannot be used concurrently, but two files from the same backend can be.
LGTM 

Interested in everyone's thoughts on that; if that sounds reasonable I can update the patch to make it so.

Two other points:

- Sam proposed dropping OutputConfig. I don't think we can, as the API client needs to communicate semantic information about specific outputs, not about the backends generally.
- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked through some of the details in my previous response; most of the complexity is there to make MirroringOutputBackend work efficiently and avoid duplication in subclasses. As a counterpoint, the only API a concrete subclass needs to override is storeContentBuffer().
Hmm, OutputFile still feels caught between two concepts: a buffer-based one and a stream based one.
I understand the goal to make this easy to implement and easy to use, but having a lot of conceptual distance between the two, and optional methods/implementation strategies is easy to misunderstand.

I may have missed it (apologies if so), but is there somewhere a breakdown of which performance-critical configurations this is aimed at?
(e.g. many small outputs that don't benefit from streaming, mirrored to two output streams, but one is null)
Performance is certainly going to cut against simplicity sometimes, but it's hard to evaluate without knowing what we're optimizing for.

 > MirroringOutputBackend work efficiently
One thing I don't understand about MirroringOutputBackend - why does it buffer the content? It seems possible to create a raw_pwrite_stream that broadcasts to raw_pwrite_streams of the underlying outputs.

Sam, let please take another look and let me know if you have more high-level comments.
Only other thing is the scope of the patch - it's useful that as an RFC it's comprehensive and shows things fitting together.
But the amount of detail is a bit overwhelming :-)
I think the parts regarding directories, unique files, mirroring/filtering backends, many of the output options would be easier to review in detail in isolated patches.
 
Others, if you have feedback, please let me know!

Thanks!
Duncan

On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

Thanks for the detailed response Sam!

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn't that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there's a refactoring that could fix it.

Here are the problems I hit that led to this design:

1. Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:
- ClientIntentOutputConfig::NeedsSeeking
- OnDiskOutputConfig::OpenFlagText

Maybe also something like:
- OnDiskOutputConfig::CreateDirectories
(and maybe others)

I couldn't think of a good way to solve this besides passing in configuration at output creation time.

2. Some "configurers" may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
- OnDiskOutputConfig::RemoveFileOnSignal
(others flags might benefit)

This is "documented" by the call chains of clang::CompilerInstance::createOutputFile.

I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any "contained" OnDiskOutputBackend. This seems pretty heavyweight though.

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

Firstly, `Output` has the user-facing abstraction. `OutputDestination` has low-level bits. Another reasonable name might be `OutputImpl` but `OutputDescription` seemed more descriptive.

Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:
- Avoid buffering content unless necessary.
- Avoid duplicating content buffers unless necessary.
- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious "interesting" case is in-memory + on-disk (in either order).
- Make sub-classes of `OutputDestination` as small as possible given the above.

Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as manage the intermediate content buffer.

Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).

(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn't need to exist.)

- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn't get around to posting it until this AM):

My intuition is:

Thread-safe:
- OutputBackend::createDestination
- Concurrently touching more than one Output/OutputDestination

Thread-unsafe:
- Using a single Output/OutputDestination concurrently

This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.

Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockManager to be skipped. This is not really about outputs — it's inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.

Relatedly, working-directory/relative-path handling should be considered.

Yeah, you're probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.

Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD concurrent access should be fine.

Two other thoughts related to paths:

1. I wonder if this abstraction treats the "path" as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)

I think it can. One approach is to use proxy backends:
- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)
- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)
- Outer layer passes the middle layer a proxy backend that "does the right thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)

=> If writing on-disk, "the right thing" is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.
=> If writing to (e.g.) Git's CAS, "the right thing" is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call "git-mktree" when the middle layer is "done" (maybe a callback in the backend-passed-to-the-middle-layer's destructor).

(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)

I think it's doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:
- Is `Output::getPath()` an abstraction leak?
- Should we have a `createOutput` that doesn't take a path?
- ...

Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
- The inner layer needs an output backend if it (sometimes) dumps "side" files (such as AST record layouts into ".ast-dump" or textual IR into ".ll"). This avoids needing to know the on-disk file path ("/path/to/output" => "/path/to/output.ll"), or to even know whether there's a disk.

2. How should we virtualize stdout / stderr?
- "'-' means stdout" is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?
- I'm not sure what to do with stderr. No one ever "closes" this stream.
- Are there other outputs that don't have path names?

3. Do we need to virtualize llvm::sys::fs::create_directories?
- If so, why?

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

No, I don't think that should be a requirement / expectation. It's a specific requirement for Clang's implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang's job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.

Note: implicit modules doesn't currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It's pretty awkward though.

On 2021 Jan  28, at 03:19, Sam McCall <[hidden email]> wrote:

Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
Definitely agree.
 
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.
Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
 
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.
If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.
 
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Relatedly, working-directory/relative-path handling should be considered.

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
 
- If you think the above patches should be split up for initial review / commit, how?
Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.
 

(Other feedback welcome too!)

Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

- OutputManager—a shared manager for creating outputs without knowing about the storage.
- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

The patch includes a few backends:

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!
Duncan
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev
In reply to this post by Dimitry Andric via cfe-dev
Hi Dinesh,

Thanks for the questions / thoughts! (Just back after being out-of-office for a bit...)

Are you planning to extend similar support for LLD in future?

I probably won't work on that myself since we're not using LLD in our toolchain, but once the generic support has landed in LLVM it should be just as easy for LLD to use it as for Clang.

  1. The capability of deleting directories is not available. Applications use temp directories for compilation and it will be good to delete them.

Hmm, interesting use case.

Sam and I are both hoping to support OutputBackends that are write-only (no reading), which conflicts with the temp-directory usage model (if the app is going to delete the temp directory, then I assume it also needs to read from it?).

There might be a clean way to support this (virtualizing the idea of a read/write temp directory), but I need to think a bit more. I wonder, can you be more specific about the usage model you have in mind?

The generated unique names when going through StableUniqueEntityHelper follows a predictable series of names unlike current implementation in LLVM which generates random names. This could pose a challenge in a scenario where application is performing repeated compilations each using separate Output backends.

Good point. For the OnDiskOutputBackend, we could default to non-stable / random names; for other backends, I prefer the assumption that the compilation is being done in isolation... but a backend that wants to coordinate better is free to do so. Does that make sense to you?

Duncan

On 2021 Mar  5, at 08:10, Bhaskaran, Dineshkumar (MLSE) <[hidden email]> wrote:

Hi Duncan, 
 
Thanks for this work. At AMD we were working on similar lines for a use case that requires entire process of compilations and linking to be done in-memory. The approach we were following was introducing write capability in LLVM VFS and use it with compiler outputs. This patch allows us to have in-memory compilations. I have two observations from the point of view of the use case above (I am not sure if it makes sense to have it here)
  1. The capability of deleting directories is not available. Applications use temp directories for compilation and it will be good to delete them.
  2. The generated unique names when going through StableUniqueEntityHelper follows a predictable series of names unlike current implementation in LLVM which generates random names. This could pose a challenge in a scenario where application is performing repeated compilations each using separate Output backends.
 
Are you planning to extend similar support for LLD in future?
 
Thanks,
Dinesh
 
From: llvm-dev <[hidden email]> On Behalf Of Sam McCall via llvm-dev
Sent: Monday, February 22, 2021 7:41 PM
To: Duncan P. N. Exon Smith <[hidden email]>
Cc: LLVM Dev <[hidden email]>; CFE Dev <[hidden email]>
Subject: Re: [llvm-dev] [cfe-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs
 
[CAUTION: External Email] 
Sorry about the long delay!
 
This looks much better to me, thank you for the simplifications!
I've made a bunch of notes on the patch itself which I'll send now too. High-level things inline: TL;DR: stream/buffer confusion in outputfile, scope.
 
On Wed, Feb 10, 2021 at 3:12 AM Duncan P. N. Exon Smith <[hidden email]> wrote:
Sam, any thoughts here?

 

On 2021 Feb  2, at 19:36, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:
 
Update: I've incorporated much of Sam's feedback into the main patch (https://reviews.llvm.org/D95501).
 
- Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.
The remaining options look great, and agree that we can't expect everything to be configured globally.
- Remove OutputManager, instead exposing OutputBackend directly.
Hooray! 
- Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().
- Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().
- Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.
This is definitely needed, but there are a few design options and I can't say I precisely understand either the requirements or the solution in the patch.
I think it would be useful to review this separately. 
 
The main thing not settled is the threading guarantees. Restating the straw man I proposed:
 
- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory *is* an OutputBackend, it can be used concurrently as well.
- An OutputFile cannot be used concurrently, but two files from the same backend can be.
LGTM 
 
Interested in everyone's thoughts on that; if that sounds reasonable I can update the patch to make it so.
 
Two other points:
 
- Sam proposed dropping OutputConfig. I don't think we can, as the API client needs to communicate semantic information about specific outputs, not about the backends generally.
- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked through some of the details in my previous response; most of the complexity is there to make MirroringOutputBackend work efficiently and avoid duplication in subclasses. As a counterpoint, the only API a concrete subclass needs to override is storeContentBuffer().
Hmm, OutputFile still feels caught between two concepts: a buffer-based one and a stream based one.
I understand the goal to make this easy to implement and easy to use, but having a lot of conceptual distance between the two, and optional methods/implementation strategies is easy to misunderstand.
 
I may have missed it (apologies if so), but is there somewhere a breakdown of which performance-critical configurations this is aimed at?
(e.g. many small outputs that don't benefit from streaming, mirrored to two output streams, but one is null)
Performance is certainly going to cut against simplicity sometimes, but it's hard to evaluate without knowing what we're optimizing for.
 
 > MirroringOutputBackend work efficiently
One thing I don't understand about MirroringOutputBackend - why does it buffer the content? It seems possible to create a raw_pwrite_stream that broadcasts to raw_pwrite_streams of the underlying outputs.
 
Sam, let please take another look and let me know if you have more high-level comments.
Only other thing is the scope of the patch - it's useful that as an RFC it's comprehensive and shows things fitting together.
But the amount of detail is a bit overwhelming :-)
I think the parts regarding directories, unique files, mirroring/filtering backends, many of the output options would be easier to review in detail in isolated patches.
 
Others, if you have feedback, please let me know!
 
Thanks!
Duncan
 
On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:
 
Thanks for the detailed response Sam!

 

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
 
Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn't that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there's a refactoring that could fix it.
 
Here are the problems I hit that led to this design:
 
1. Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:
- ClientIntentOutputConfig::NeedsSeeking
- OnDiskOutputConfig::OpenFlagText
 
Maybe also something like:
- OnDiskOutputConfig::CreateDirectories
(and maybe others)
 
I couldn't think of a good way to solve this besides passing in configuration at output creation time.
 
2. Some "configurers" may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
- OnDiskOutputConfig::RemoveFileOnSignal
(others flags might benefit)
 
This is "documented" by the call chains of clang::CompilerInstance::createOutputFile.
 
I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any "contained" OnDiskOutputBackend. This seems pretty heavyweight though.
 
 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why
 
Firstly, `Output` has the user-facing abstraction. `OutputDestination` has low-level bits. Another reasonable name might be `OutputImpl` but `OutputDescription` seemed more descriptive.
 
Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:
- Avoid buffering content unless necessary.
- Avoid duplicating content buffers unless necessary.
- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious "interesting" case is in-memory + on-disk (in either order).
- Make sub-classes of `OutputDestination` as small as possible given the above.
 
Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.
 
As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
 
In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as manage the intermediate content buffer.
 
Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).
 
(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn't need to exist.)

 

- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.
 
Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn't get around to posting it until this AM):
 
My intuition is:
 
Thread-safe:
- OutputBackend::createDestination
- Concurrently touching more than one Output/OutputDestination
 
Thread-unsafe:
- Using a single Output/OutputDestination concurrently
 
This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.
 
Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockManager to be skipped. This is not really about outputs — it's inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.
 
Relatedly, working-directory/relative-path handling should be considered.
 
Yeah, you're probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.
 
Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD concurrent access should be fine.
 
Two other thoughts related to paths:
 
1. I wonder if this abstraction treats the "path" as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)
 
I think it can. One approach is to use proxy backends:
- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)
- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)
- Outer layer passes the middle layer a proxy backend that "does the right thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)
 
=> If writing on-disk, "the right thing" is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.
=> If writing to (e.g.) Git's CAS, "the right thing" is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call "git-mktree" when the middle layer is "done" (maybe a callback in the backend-passed-to-the-middle-layer's destructor).
 
(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)
 
I think it's doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:
- Is `Output::getPath()` an abstraction leak?
- Should we have a `createOutput` that doesn't take a path?
- ...
 
Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
- The inner layer needs an output backend if it (sometimes) dumps "side" files (such as AST record layouts into ".ast-dump" or textual IR into ".ll"). This avoids needing to know the on-disk file path ("/path/to/output" => "/path/to/output.ll"), or to even know whether there's a disk.
 
2. How should we virtualize stdout / stderr?
- "'-' means stdout" is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?
- I'm not sure what to do with stderr. No one ever "closes" this stream.
- Are there other outputs that don't have path names?
 
3. Do we need to virtualize llvm::sys::fs::create_directories?
- If so, why?
 
(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

 

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 
 
But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.
 
If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
 
No, I don't think that should be a requirement / expectation. It's a specific requirement for Clang's implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.

 

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.
 
Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang's job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.
 
Note: implicit modules doesn't currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It's pretty awkward though.

 

On 2021 Jan  28, at 03:19, Sam McCall <[hidden email]> wrote:
 
Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...
 
On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.
 
Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
Definitely agree.
 
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.
Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.
 
This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why
 
As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.
 
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.
If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.
 
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.
 
Relatedly, working-directory/relative-path handling should be considered.
 
(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)
 
- If you think the above patches should be split up for initial review / commit, how?
Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.
 
 
(Other feedback welcome too!)
 
Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.
 
- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.
 
The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:
 
- OutputManager—a shared manager for creating outputs without knowing about the storage.
- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.
 
The patch includes a few backends:
 
- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 
 
But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.
 
If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.
 
Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading! 
Duncan
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
 
_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
 
_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Dimitry Andric via cfe-dev

Hi Duncan,

 

Thank you for getting back.

 

The use case is primarily JIT compilation for GPU specific code in OpenCL/HIP. We have a library that facilitates the JIT compilation. For a single compilation process it starts by creating temporary directories, write the kernel and header files to these directories. Thereafter invoke clang do the compilation and finally read the output into memory for loading onto the GPU. The cleanup process deletes the temporary directories.

 

Eventually we would like to bring the whole compilation process to be done in memory. One of the approaches I took while experimenting with OutputBackend patch is to have a persistent vfs::InmemoryFilesystem which can be used across compilations. An In-Memory OutputBackend using the above InmemoryFilesystem is created on the fly for each compilation. I chose not to use a persistent OutputBackend out of convenience as now I have only one object to pass around across the stack while InMemoryFileSystem is required to supply the input.

 

This is where I found that deletion/naming of temporary directories can pose a challenge. I could work around by either choosing to persist both OutputBackend and InmemoryFileSystem objects or none.

 

>>> Good point. For the OnDiskOutputBackend, we could default to non-stable / random names; for other backends, I prefer the assumption that the compilation is being done in isolation... but a backend that wants to coordinate better is free to do so. Does that make sense to you?

 

I understand.

 

Dinesh

 

From: Duncan P. N. Exon Smith <[hidden email]>
Sent: Tuesday, March 16, 2021 3:02 AM
To: Bhaskaran, Dineshkumar (MLSE) <[hidden email]>
Cc: Sam McCall <[hidden email]>; LLVM Dev <[hidden email]>; CFE Dev <[hidden email]>
Subject: Re: [llvm-dev] [cfe-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

 

[CAUTION: External Email]

Hi Dinesh,

 

Thanks for the questions / thoughts! (Just back after being out-of-office for a bit...)

 

Are you planning to extend similar support for LLD in future?

 

I probably won't work on that myself since we're not using LLD in our toolchain, but once the generic support has landed in LLVM it should be just as easy for LLD to use it as for Clang.

 

  1. The capability of deleting directories is not available. Applications use temp directories for compilation and it will be good to delete them.

 

Hmm, interesting use case.

 

Sam and I are both hoping to support OutputBackends that are write-only (no reading), which conflicts with the temp-directory usage model (if the app is going to delete the temp directory, then I assume it also needs to read from it?).

 

There might be a clean way to support this (virtualizing the idea of a read/write temp directory), but I need to think a bit more. I wonder, can you be more specific about the usage model you have in mind?

 

The generated unique names when going through StableUniqueEntityHelper follows a predictable series of names unlike current implementation in LLVM which generates random names. This could pose a challenge in a scenario where application is performing repeated compilations each using separate Output backends.

 

Good point. For the OnDiskOutputBackend, we could default to non-stable / random names; for other backends, I prefer the assumption that the compilation is being done in isolation... but a backend that wants to coordinate better is free to do so. Does that make sense to you?

 

Duncan



On 2021 Mar  5, at 08:10, Bhaskaran, Dineshkumar (MLSE) <[hidden email]> wrote:

 

Hi Duncan, 

 

Thanks for this work. At AMD we were working on similar lines for a use case that requires entire process of compilations and linking to be done in-memory. The approach we were following was introducing write capability in LLVM VFS and use it with compiler outputs. This patch allows us to have in-memory compilations. I have two observations from the point of view of the use case above (I am not sure if it makes sense to have it here)

  1. The capability of deleting directories is not available. Applications use temp directories for compilation and it will be good to delete them.
  2. The generated unique names when going through StableUniqueEntityHelper follows a predictable series of names unlike current implementation in LLVM which generates random names. This could pose a challenge in a scenario where application is performing repeated compilations each using separate Output backends.

 

Are you planning to extend similar support for LLD in future?

 

Thanks,

Dinesh

 

From: llvm-dev <[hidden email]> On Behalf Of Sam McCall via llvm-dev
Sent: Monday, February 22, 2021 7:41 PM
To: Duncan P. N. Exon Smith <[hidden email]>
Cc: LLVM Dev <[hidden email]>; CFE Dev <[hidden email]>
Subject: Re: [llvm-dev] [cfe-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

 

[CAUTION: External Email] 

Sorry about the long delay!

 

This looks much better to me, thank you for the simplifications!

I've made a bunch of notes on the patch itself which I'll send now too. High-level things inline: TL;DR: stream/buffer confusion in outputfile, scope.

 

On Wed, Feb 10, 2021 at 3:12 AM Duncan P. N. Exon Smith <[hidden email]> wrote:

Sam, any thoughts here?

 

On 2021 Feb  2, at 19:36, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

 

Update: I've incorporated much of Sam's feedback into the main patch (https://reviews.llvm.org/D95501).

 

- Simplify OutputConfig, restricting it to semantic information about a specific output. Sink all backend configuration to flags on the backends themselves.

The remaining options look great, and agree that we can't expect everything to be configured globally.

- Remove OutputManager, instead exposing OutputBackend directly.

Hooray! 

- Merge Output and OutputDestination into a single class called OutputFile, and rename the API for creating them to OutputBackend::createFile().

- Add support for working directories via OutputDirectory, and add OutputBackend::getDirectory() and OutputBackend::createDirectory().

- Add support for createUniqueFile() and createUniqueDirectory(), both heavily used in clang. Backends without read access can use StableUniqueEntityAdaptor to implement these.

This is definitely needed, but there are a few design options and I can't say I precisely understand either the requirements or the solution in the patch.

I think it would be useful to review this separately. 

 

The main thing not settled is the threading guarantees. Restating the straw man I proposed:

 

- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory *is* an OutputBackend, it can be used concurrently as well.

- An OutputFile cannot be used concurrently, but two files from the same backend can be.

LGTM 

 

Interested in everyone's thoughts on that; if that sounds reasonable I can update the patch to make it so.

 

Two other points:

 

- Sam proposed dropping OutputConfig. I don't think we can, as the API client needs to communicate semantic information about specific outputs, not about the backends generally.

- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked through some of the details in my previous response; most of the complexity is there to make MirroringOutputBackend work efficiently and avoid duplication in subclasses. As a counterpoint, the only API a concrete subclass needs to override is storeContentBuffer().

Hmm, OutputFile still feels caught between two concepts: a buffer-based one and a stream based one.

I understand the goal to make this easy to implement and easy to use, but having a lot of conceptual distance between the two, and optional methods/implementation strategies is easy to misunderstand.

 

I may have missed it (apologies if so), but is there somewhere a breakdown of which performance-critical configurations this is aimed at?

(e.g. many small outputs that don't benefit from streaming, mirrored to two output streams, but one is null)

Performance is certainly going to cut against simplicity sometimes, but it's hard to evaluate without knowing what we're optimizing for.

 

 > MirroringOutputBackend work efficiently

One thing I don't understand about MirroringOutputBackend - why does it buffer the content? It seems possible to create a raw_pwrite_stream that broadcasts to raw_pwrite_streams of the underlying outputs.

 

Sam, let please take another look and let me know if you have more high-level comments.

Only other thing is the scope of the patch - it's useful that as an RFC it's comprehensive and shows things fitting together.

But the amount of detail is a bit overwhelming :-)

I think the parts regarding directories, unique files, mirroring/filtering backends, many of the output options would be easier to review in detail in isolated patches.

 

Others, if you have feedback, please let me know!

 

Thanks!

Duncan

 

On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev <[hidden email]> wrote:

 

Thanks for the detailed response Sam!

 

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:

 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

 

Ideally the OnDiskOutputConfig would almost entirely be settings on OnDiskOutputBackend since no one else will care. It isn't that way in the initial patch because Clang decides most of this stuff on a per-output basis. Maybe there's a refactoring that could fix it.

 

Here are the problems I hit that led to this design:

 

1. Some properties need to be associated with specific outputs, not backends, because they relate to properties of the outputs themselves. Here are two:

- ClientIntentOutputConfig::NeedsSeeking

- OnDiskOutputConfig::OpenFlagText

 

Maybe also something like:

- OnDiskOutputConfig::CreateDirectories

(and maybe others)

 

I couldn't think of a good way to solve this besides passing in configuration at output creation time.

 

2. Some "configurers" may be handed a pre-constructed opaque OutputManager / OutputBackend and need to configure internal OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:

- OnDiskOutputConfig::RemoveFileOnSignal

(others flags might benefit)

 

This is "documented" by the call chains of clang::CompilerInstance::createOutputFile.

 

I reused the solution from #1 since it needed (?) to exist anyway. Another option would be to add an OutputBackend visitor, to find and reconfigure any "contained" OnDiskOutputBackend. This seems pretty heavyweight though.

 

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

 

Firstly, `Output` has the user-facing abstraction. `OutputDestination` has low-level bits. Another reasonable name might be `OutputImpl` but `OutputDescription` seemed more descriptive.

 

Secondly, most of the low-level bits avoid unnecessary copies of content buffers. Maybe there are simpler ideas for how to do this, but here are the design goals I was working around:

- Avoid buffering content unless necessary.

- Avoid duplicating content buffers unless necessary.

- Support multiple destinations (for mirrored backends) without breaking the above rules. The obvious "interesting" case is in-memory + on-disk (in either order).

- Make sub-classes of `OutputDestination` as small as possible given the above.

 

Thirdly, there's some boilerplate related to lifetimes of multiple destinations. Probably having an explicit `MirroringOutputDestination` would be better.

 

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

 

In the end, its main job is to wrap an OutputDestination (low-level abstraction) in an Output (high-level abstraction). Output does a bunch of work for OutputDestination, such as manage the intermediate content buffer.

 

Probably it's better to have the OutputBackend return an Output directly (and get rid of the OutputManager).

 

(At one point OutputManager was managing multiple backends directly and was involved in the store operation(s); but since I factored that logic out to MirroringOutputBackend and OutputDestination it probably doesn't need to exist.)

 

- Any other major concerns / suggestions?

Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.

It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

 

Right, I hit multi-threading limitations myself when prototyping a follow-up patch (didn't get around to posting it until this AM):

 

My intuition is:

 

Thread-safe:

- OutputBackend::createDestination

- Concurrently touching more than one Output/OutputDestination

 

Thread-unsafe:

- Using a single Output/OutputDestination concurrently

 

This all seems cheap and easy to maintain because of the limited interface. The problem I hit in the above patch is that for the InMemoryOutputBackend you also need any readers of the InMemoryFileSystem to be thread-safe.

 

Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows the llvm::LockManager to be skipped. This is not really about outputs — it's inter-process coordination to avoid doing redundant work in competing Clang command-line invocations (at one point it was needed for correctness, but the main benefit now is to avoid taxing the on-disk filesystem) — but it does touch on the idea of exclusive write access.

 

Relatedly, working-directory/relative-path handling should be considered.

 

Yeah, you're probably right. Any specific thoughts on what to do here? It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.) the discussion at the end of https://reviews.llvm.org/D70701.

 

Even for POSIX, working directories could complicate concurrency guarantees. A simple solution is to not have an API for changing working directories, and instead model that by creating a proxy backend (ChangeDirectoryOutputBackend) that prepends the (new) working directory to new outputs; since each backend has an immutable view of the CWD concurrent access should be fine.

 

Two other thoughts related to paths:

 

1. I wonder if this abstraction treats the "path" as too central a property of the output. Can this evolve to allow a compiler to build a directory structure bottom-up without having to know the destination a priori? (E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of those, and an outer layer decides where to put the tree.)

 

I think it can. One approach is to use proxy backends:

- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed Output/raw_pwrite_stream? See note below.)

- Middle layer passes the instances of the inner layer a proxy backend that maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)

- Outer layer passes the middle layer a proxy backend that "does the right thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)

 

=> If writing on-disk, "the right thing" is to prepend the correct output directory for the tree and pass each file to a regular OnDiskOutputBackend.

=> If writing to (e.g.) Git's CAS, "the right thing" is to call git-hash-object on Output::close and track the name-to-hash mapping as outputs come in, and then call "git-mktree" when the middle layer is "done" (maybe a callback in the backend-passed-to-the-middle-layer's destructor).

 

(IOW, a refactoring where instead of passing absolute paths / directories / filenames down through all the layers, proxy output backends build up the path/destination piece-by-piece.)

 

I think it's doable with the abstraction as-proposed. But let me know if anyone has concerns. For example:

- Is `Output::getPath()` an abstraction leak?

- Should we have a `createOutput` that doesn't take a path?

- ...

 

Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?

- The inner layer needs an output backend if it (sometimes) dumps "side" files (such as AST record layouts into ".ast-dump" or textual IR into ".ll"). This avoids needing to know the on-disk file path ("/path/to/output" => "/path/to/output.ll"), or to even know whether there's a disk.

 

2. How should we virtualize stdout / stderr?

- "'-' means stdout" is probably good enough since LLVM makes that assumption all over. Unless someone disagrees?

- I'm not sure what to do with stderr. No one ever "closes" this stream.

- Are there other outputs that don't have path names?

 

3. Do we need to virtualize llvm::sys::fs::create_directories?

- If so, why?

 

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

 

Why doesn't this inherit from llvm::vfs::FileSystem?

Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).

I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

 

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

 

No, I don't think that should be a requirement / expectation. It's a specific requirement for Clang's implicitly built modules, and I think Clang should be responsible for hooking them together when necessary.

 

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

 

Agreed, explicitly virtualizing module caching might be a good thing to do. Either way this is Clang's job to coordinate; I just think the output manager should efficiently support mirroring outputs to an additional/custom backend that Clang installs.

 

Note: implicit modules doesn't currently rely on reading the modules it has just built from disk. It uses InMemoryModuleCache to avoid having to read anything it has written to disk and to ensure consistency between CompilerInstances across an implicit build. It's pretty awkward though.

 

On 2021 Jan  28, at 03:19, Sam McCall <[hidden email]> wrote:

 

Really glad to see this work, virtualizing module cache is something we've wanted to experiment with for tooling, but never got to. I want to get into the patches in more detail, but some high-level thoughts...

 

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <[hidden email]> wrote:

TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:

https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager

https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

 

Questions for the reader

- Should we virtualize compiler outputs in Clang? (Hint: yes.)

Definitely agree.

 

- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)

Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.

Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it.

 

This roughly corresponds to OutputBackend + OutputDestination in the patch, except:

 - the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction

 - OutputDestination has a lot of stuff in it, I'll need to dig further into the patch to try to understand why

 

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

 

- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)

llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.

If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.

 

- Do you have a use case that this won't address well?

- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?

- Any other major concerns / suggestions?

Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain-point with vfs::FileSystem.

It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

 

Relatedly, working-directory/relative-path handling should be considered.

 

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom)

 

- If you think the above patches should be split up for initial review / commit, how?

Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.

 

 

(Other feedback welcome too!)

 

Longer version

There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

 

- Tooling wants to capture outputs directly, without going through the filesystem.

- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.

- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache—in-memory would be better.

- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

 

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

 

- OutputManager—a shared manager for creating outputs without knowing about the storage.

- OutputConfig—configuration set on the OutputManager that can be (partially) overridden for specific outputs.

- Output—opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.

- OutputBackend—abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.

- OutputDestination—abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.

- ContentBuffer—actual content to allow efficient use of data by multiple backends. For example, the installed backend is a mirror between an on-disk and in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.

 

The patch includes a few backends:

 

- NullOutputBackend, for ignoring all backends.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).

I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else. 

 

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

 

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object.

But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

 

Other work in the area

See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading! 

Duncan

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

 

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev