"devirtualizing" files in the VFS

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

"devirtualizing" files in the VFS

Oleg Smolsky via cfe-dev
I'd like to get some more perspectives on the role of the VirtualFileSystem abstraction in llvm/Support.
(The VFS layer has recently moved from Clang to LLVM, so crossposting to both lists)

https://reviews.llvm.org/D54277 proposed adding a function to VirtualFileSystem to get the underlying "real file" path from a VFS path. LLDB is starting to use VFS for some filesystem interactions, but wants/needs to keep using native IO (FILE*, file descriptors) for others. There's some more context/discussion in the review.

My perspective is coloured by work on clang tooling, clangd etc. There we rely on VFS to ensure code (typically clang library code) works in a variety of environments, e.g:
in an IDE the edited file is consistently used rather than the one on disk
clang-tidy checks work on a local codebase, but our code review tool also runs them as a service
This works because all IO goes through the VFS, so VFSes are substitutable. We tend to rely on the static type system to ensure this (most people write lit tests that use the real FS).

Adding facilities to use native IO together with VFS works against this, e.g. a likely interface is
  // Returns the OS-native path to the specified virtual file.
  // Returns None if Path doesn't describe a native file, or its path is unknown.
  Optional<string> FileSystem::getNativePath(string Path)
Most potential uses of such a function are going to produce code that doesn't work well with arbitrary VFSes.
Anecdotally, filesystems are confusing, and most features exposed by VFS end up getting misused if possible.

So those are my reasons for pushing back on this change, but I'm not sure how strong they are.
I think broadly the alternatives for LLDB are:
make a change like this to the VFS APIs
migrate to actually doing IO using VFS (likely a lot of work)
know which concrete VFSes they construct, and track the needed info externally
stop using VFS, and build separate abstractions for tracking remapping of native files etc
abandon the new features that depend on this file remapping

As a purist, 2 and 4 seem like the cleanest options, but that's easy to say when it's someone else's work.
What path should we take here?

Cheers, Sam

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: "devirtualizing" files in the VFS

Oleg Smolsky via cfe-dev
I am really not sure if adding real file system functionality strictly into the VFS is a good approach. This "ExternalFileSystem" thing sounds weird to me.

Does LLDB need to *write* the files through the VFS? I'm not sure perhaps a "WritableVFS" could be implemented, or the necessary casting/conversion options.

In case:
 - there is a real path behind the file --- you could spawn an llvm::RealFileSystem (the fqdn might not be this after the migration patch!) and use that to obtain the file's buffer.

How can you be sure the file actually exists on the FS? That's what the VFS should be all about, hiding this abstraction... if you *are* sure it exists, or want to make sure, you need to pull the appropriate realFS from the VFS Overlay (most tools have an overlay of a memoryFS above the realFS).

What I am not sure about is extending the general interface in a way that it caters to a particular (or half of a particular) use case.

For example, in CodeCompass, we used a custom VFS implementation that "hijacked" the overlay and included itself between the realFS and the memoryFS. It obtains files from the database!

See:

These files *do not necessarily* (in 99% of the cases, not at all) exist on the hard drive at the moment of the code wanting to pull the file, hence why we implemented this to give the file source buffer from DB. The ClangTool that needs this still gets the memoryFS for its own purposes, and for the clang libraries, the realFS is still under there.


Sam McCall via cfe-dev <[hidden email]> ezt írta (időpont: 2018. nov. 15., Cs, 12:02):
I'd like to get some more perspectives on the role of the VirtualFileSystem abstraction in llvm/Support.
(The VFS layer has recently moved from Clang to LLVM, so crossposting to both lists)

https://reviews.llvm.org/D54277 proposed adding a function to VirtualFileSystem to get the underlying "real file" path from a VFS path. LLDB is starting to use VFS for some filesystem interactions, but wants/needs to keep using native IO (FILE*, file descriptors) for others. There's some more context/discussion in the review.

My perspective is coloured by work on clang tooling, clangd etc. There we rely on VFS to ensure code (typically clang library code) works in a variety of environments, e.g:
in an IDE the edited file is consistently used rather than the one on disk
clang-tidy checks work on a local codebase, but our code review tool also runs them as a service
This works because all IO goes through the VFS, so VFSes are substitutable. We tend to rely on the static type system to ensure this (most people write lit tests that use the real FS).

Adding facilities to use native IO together with VFS works against this, e.g. a likely interface is
  // Returns the OS-native path to the specified virtual file.
  // Returns None if Path doesn't describe a native file, or its path is unknown.
  Optional<string> FileSystem::getNativePath(string Path)
Most potential uses of such a function are going to produce code that doesn't work well with arbitrary VFSes.
Anecdotally, filesystems are confusing, and most features exposed by VFS end up getting misused if possible.

So those are my reasons for pushing back on this change, but I'm not sure how strong they are.
I think broadly the alternatives for LLDB are:
make a change like this to the VFS APIs
migrate to actually doing IO using VFS (likely a lot of work)
know which concrete VFSes they construct, and track the needed info externally
stop using VFS, and build separate abstractions for tracking remapping of native files etc
abandon the new features that depend on this file remapping

As a purist, 2 and 4 seem like the cleanest options, but that's easy to say when it's someone else's work.
What path should we take here?

Cheers, Sam
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: "devirtualizing" files in the VFS

Oleg Smolsky via cfe-dev
In reply to this post by Oleg Smolsky via cfe-dev
HI Sam,

Thanks again for taking the time to discuss this. 

On Nov 15, 2018, at 3:02 AM, Sam McCall <[hidden email]> wrote:

I'd like to get some more perspectives on the role of the VirtualFileSystem abstraction in llvm/Support.
(The VFS layer has recently moved from Clang to LLVM, so crossposting to both lists)

https://reviews.llvm.org/D54277 proposed adding a function to VirtualFileSystem to get the underlying "real file" path from a VFS path. LLDB is starting to use VFS for some filesystem interactions, but wants/needs to keep using native IO (FILE*, file descriptors) for others. There's some more context/discussion in the review.

My perspective is coloured by work on clang tooling, clangd etc. There we rely on VFS to ensure code (typically clang library code) works in a variety of environments, e.g:
in an IDE the edited file is consistently used rather than the one on disk
clang-tidy checks work on a local codebase, but our code review tool also runs them as a service
This works because all IO goes through the VFS, so VFSes are substitutable. We tend to rely on the static type system to ensure this (most people write lit tests that use the real FS).

I want to emphasize that I don't have any intention of breaking any of those or other existing use cases. I opted for the virtual file system because it provides 95% of the functionality that's needed for reproducers: the real filesystem and the redirecting file system. It has the yaml mapping writer and reader, the abstraction level above the two, etc. It feels silly to implement everything again in LLDB (actually it would be more like copy/pasting everything) just because we miss that 5%, so I'm really motivated to find a solution that works for all of us :-) 

Adding facilities to use native IO together with VFS works against this, e.g. a likely interface is
  // Returns the OS-native path to the specified virtual file.
  // Returns None if Path doesn't describe a native file, or its path is unknown.
  Optional<string> FileSystem::getNativePath(string Path)
Most potential uses of such a function are going to produce code that doesn't work well with arbitrary VFSes.
Anecdotally, filesystems are confusing, and most features exposed by VFS end up getting misused if possible.

You're right and this is a problem/limitation for LLDB as well. This was the motivation for the `ExternalFileSystem` (please forgive me for the terrible name, just wanted to get the code up in phab) because it had "some" semantic meaning for both implementations. But I also understand your concerns there. 

So those are my reasons for pushing back on this change, but I'm not sure how strong they are.
I think broadly the alternatives for LLDB are:
make a change like this to the VFS APIs
migrate to actually doing IO using VFS (likely a lot of work)
know which concrete VFSes they construct, and track the needed info externally
stop using VFS, and build separate abstractions for tracking remapping of native files etc
abandon the new features that depend on this file remapping

Can you elaborate on what you have in mind for (3) and how it differs from (4)?

As a purist, 2 and 4 seem like the cleanest options, but that's easy to say when it's someone else's work.
What path should we take here?

I'll withhold from answering this as I'm one of the stakeholders ;-) 


Cheers, Sam


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: "devirtualizing" files in the VFS

Oleg Smolsky via cfe-dev
In reply to this post by Oleg Smolsky via cfe-dev


On Nov 15, 2018, at 3:34 AM, Whisperity <[hidden email]> wrote:

I am really not sure if adding real file system functionality strictly into the VFS is a good approach. This "ExternalFileSystem" thing sounds weird to me.

The `ExternalFileSystem` was an attempt to provide a more limited interface while exposing the "external" path in a way that made sense for the RedirectingFileSystem. Like Sam said in the review it's not great because it only does half of the work. 

Does LLDB need to *write* the files through the VFS? I'm not sure perhaps a "WritableVFS" could be implemented, or the necessary casting/conversion options.

Most likely yes because of LLDB's design that abstracts over flies without prior knowledge about whether they'd only get read or written. However wouldn't it suffer form the exact same problems? 

In case:
 - there is a real path behind the file --- you could spawn an llvm::RealFileSystem (the fqdn might not be this after the migration patch!) and use that to obtain the file's buffer.

I'm not sure I follow what you have in mind here. Can you give a little more detail?

How can you be sure the file actually exists on the FS? That's what the VFS should be all about, hiding this abstraction... if you *are* sure it exists, or want to make sure, you need to pull the appropriate realFS from the VFS Overlay (most tools have an overlay of a memoryFS above the realFS).

That makes sense, for LLDB's use case we would be happy having just a real or redirecting filesystem (with fall through). 

What I am not sure about is extending the general interface in a way that it caters to a particular (or half of a particular) use case.

I totally understand this sentiment but I don't think that's totally fair. Finding files in different locations is an important feature of the VFS, when it was introduced in 2014 this was the only use case. The "devirtualization" aspect is unfortunate because native IO. 

For example, in CodeCompass, we used a custom VFS implementation that "hijacked" the overlay and included itself between the realFS and the memoryFS. It obtains files from the database!

See:

These files *do not necessarily* (in 99% of the cases, not at all) exist on the hard drive at the moment of the code wanting to pull the file, hence why we implemented this to give the file source buffer from DB. The ClangTool that needs this still gets the memoryFS for its own purposes, and for the clang libraries, the realFS is still under there.


This sounds like an interesting idea. We already have the option to expose the external name here, would it be reasonable to also expose the external path here? (of course being an optional)


Sam McCall via cfe-dev <[hidden email]> ezt írta (időpont: 2018. nov. 15., Cs, 12:02):
I'd like to get some more perspectives on the role of the VirtualFileSystem abstraction in llvm/Support.
(The VFS layer has recently moved from Clang to LLVM, so crossposting to both lists)

https://reviews.llvm.org/D54277 proposed adding a function to VirtualFileSystem to get the underlying "real file" path from a VFS path. LLDB is starting to use VFS for some filesystem interactions, but wants/needs to keep using native IO (FILE*, file descriptors) for others. There's some more context/discussion in the review.

My perspective is coloured by work on clang tooling, clangd etc. There we rely on VFS to ensure code (typically clang library code) works in a variety of environments, e.g:
in an IDE the edited file is consistently used rather than the one on disk
clang-tidy checks work on a local codebase, but our code review tool also runs them as a service
This works because all IO goes through the VFS, so VFSes are substitutable. We tend to rely on the static type system to ensure this (most people write lit tests that use the real FS).

Adding facilities to use native IO together with VFS works against this, e.g. a likely interface is
  // Returns the OS-native path to the specified virtual file.
  // Returns None if Path doesn't describe a native file, or its path is unknown.
  Optional<string> FileSystem::getNativePath(string Path)
Most potential uses of such a function are going to produce code that doesn't work well with arbitrary VFSes.
Anecdotally, filesystems are confusing, and most features exposed by VFS end up getting misused if possible.

So those are my reasons for pushing back on this change, but I'm not sure how strong they are.
I think broadly the alternatives for LLDB are:
make a change like this to the VFS APIs
migrate to actually doing IO using VFS (likely a lot of work)
know which concrete VFSes they construct, and track the needed info externally
stop using VFS, and build separate abstractions for tracking remapping of native files etc
abandon the new features that depend on this file remapping

As a purist, 2 and 4 seem like the cleanest options, but that's easy to say when it's someone else's work.
What path should we take here?

Cheers, Sam
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: "devirtualizing" files in the VFS

Oleg Smolsky via cfe-dev
Hi Sam,

Does extending the status with a path sound reasonable? This would work similar to the current Name field, which is controlled by UseExternalName. 

Please let me know what you think.

Thanks,
Jonas 

On Nov 15, 2018, at 10:10 AM, Jonas Devlieghere via cfe-dev <[hidden email]> wrote:


On Nov 15, 2018, at 3:34 AM, Whisperity <[hidden email]> wrote:

I am really not sure if adding real file system functionality strictly into the VFS is a good approach. This "ExternalFileSystem" thing sounds weird to me.

The `ExternalFileSystem` was an attempt to provide a more limited interface while exposing the "external" path in a way that made sense for the RedirectingFileSystem. Like Sam said in the review it's not great because it only does half of the work. 

Does LLDB need to *write* the files through the VFS? I'm not sure perhaps a "WritableVFS" could be implemented, or the necessary casting/conversion options.

Most likely yes because of LLDB's design that abstracts over flies without prior knowledge about whether they'd only get read or written. However wouldn't it suffer form the exact same problems? 

In case:
 - there is a real path behind the file --- you could spawn an llvm::RealFileSystem (the fqdn might not be this after the migration patch!) and use that to obtain the file's buffer.

I'm not sure I follow what you have in mind here. Can you give a little more detail?

How can you be sure the file actually exists on the FS? That's what the VFS should be all about, hiding this abstraction... if you *are* sure it exists, or want to make sure, you need to pull the appropriate realFS from the VFS Overlay (most tools have an overlay of a memoryFS above the realFS).

That makes sense, for LLDB's use case we would be happy having just a real or redirecting filesystem (with fall through). 

What I am not sure about is extending the general interface in a way that it caters to a particular (or half of a particular) use case.

I totally understand this sentiment but I don't think that's totally fair. Finding files in different locations is an important feature of the VFS, when it was introduced in 2014 this was the only use case. The "devirtualization" aspect is unfortunate because native IO. 

For example, in CodeCompass, we used a custom VFS implementation that "hijacked" the overlay and included itself between the realFS and the memoryFS. It obtains files from the database!

See:

These files *do not necessarily* (in 99% of the cases, not at all) exist on the hard drive at the moment of the code wanting to pull the file, hence why we implemented this to give the file source buffer from DB. The ClangTool that needs this still gets the memoryFS for its own purposes, and for the clang libraries, the realFS is still under there.


This sounds like an interesting idea. We already have the option to expose the external name here, would it be reasonable to also expose the external path here? (of course being an optional)


Sam McCall via cfe-dev <[hidden email]> ezt írta (időpont: 2018. nov. 15., Cs, 12:02):
I'd like to get some more perspectives on the role of the VirtualFileSystem abstraction in llvm/Support.
(The VFS layer has recently moved from Clang to LLVM, so crossposting to both lists)

https://reviews.llvm.org/D54277 proposed adding a function to VirtualFileSystem to get the underlying "real file" path from a VFS path. LLDB is starting to use VFS for some filesystem interactions, but wants/needs to keep using native IO (FILE*, file descriptors) for others. There's some more context/discussion in the review.

My perspective is coloured by work on clang tooling, clangd etc. There we rely on VFS to ensure code (typically clang library code) works in a variety of environments, e.g:
in an IDE the edited file is consistently used rather than the one on disk
clang-tidy checks work on a local codebase, but our code review tool also runs them as a service
This works because all IO goes through the VFS, so VFSes are substitutable. We tend to rely on the static type system to ensure this (most people write lit tests that use the real FS).

Adding facilities to use native IO together with VFS works against this, e.g. a likely interface is
  // Returns the OS-native path to the specified virtual file.
  // Returns None if Path doesn't describe a native file, or its path is unknown.
  Optional<string> FileSystem::getNativePath(string Path)
Most potential uses of such a function are going to produce code that doesn't work well with arbitrary VFSes.
Anecdotally, filesystems are confusing, and most features exposed by VFS end up getting misused if possible.

So those are my reasons for pushing back on this change, but I'm not sure how strong they are.
I think broadly the alternatives for LLDB are:
make a change like this to the VFS APIs
migrate to actually doing IO using VFS (likely a lot of work)
know which concrete VFSes they construct, and track the needed info externally
stop using VFS, and build separate abstractions for tracking remapping of native files etc
abandon the new features that depend on this file remapping

As a purist, 2 and 4 seem like the cleanest options, but that's easy to say when it's someone else's work.
What path should we take here?

Cheers, Sam
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: "devirtualizing" files in the VFS

Oleg Smolsky via cfe-dev
On Tue, Nov 27, 2018 at 7:01 PM Jonas Devlieghere <[hidden email]> wrote:
Hi Sam,

Does extending the status with a path sound reasonable? This would work similar to the current Name field, which is controlled by UseExternalName. 

Please let me know what you think.

Thanks,
Jonas 

Design-wise, adding a second Name field to Status doesn't seem significantly different than adding a "getOtherName" field to VFS.
I don't think it's really reasonable to have this in the VFS interfaces, because using it in any way means you're coupled to particular implementations.

Implementation-wise, it seems the only way to do that would be to add an extra string to Status, and some applications probably keep a lot of Status objects. So it doesn't seem like an improvement vs the VFS method.

Overall I think if LLDB wants to wants to break abstraction and use the native filesystem sometimes, it should have concrete private implementations with wider interfaces, and bridge to common code by having them implement the VFS interfaces. The contract of VFS may have been "real filesystem with some redirected files", but I don't think it has been for several years, and people rely on this.

I'm not an owner here though. I don't really know how these decisions get made, and would like to hear more opinions.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: "devirtualizing" files in the VFS

Oleg Smolsky via cfe-dev
The contract of the VFS was a tree of "directories" containing "files" that you can access via providing a "path" and getting a "char buffer" with content. Noone ever said that any of these "files" have any sort of semblance on the "real drives". That's what ~~clang~~ llvm::fs::getRealFS() was about, giving you a VFS that bridged back to the real deal.

If some client code puts effort into properly mounting up Clang's standard library files into memory or their own VFS impl --- or only allows usage of code that does *NOT* use the standard library *AT ALL* --- then the frontend can go through without every touching, or even seeing, the realFS at all.


Sam McCall <[hidden email]> ezt írta (időpont: 2018. nov. 27., K, 19:52):
On Tue, Nov 27, 2018 at 7:01 PM Jonas Devlieghere <[hidden email]> wrote:
Hi Sam,

Does extending the status with a path sound reasonable? This would work similar to the current Name field, which is controlled by UseExternalName. 

Please let me know what you think.

Thanks,
Jonas 

Design-wise, adding a second Name field to Status doesn't seem significantly different than adding a "getOtherName" field to VFS.
I don't think it's really reasonable to have this in the VFS interfaces, because using it in any way means you're coupled to particular implementations.

Implementation-wise, it seems the only way to do that would be to add an extra string to Status, and some applications probably keep a lot of Status objects. So it doesn't seem like an improvement vs the VFS method.

Overall I think if LLDB wants to wants to break abstraction and use the native filesystem sometimes, it should have concrete private implementations with wider interfaces, and bridge to common code by having them implement the VFS interfaces. The contract of VFS may have been "real filesystem with some redirected files", but I don't think it has been for several years, and people rely on this.

I'm not an owner here though. I don't really know how these decisions get made, and would like to hear more opinions.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev