RFC: Improving Clang's FileManager

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

RFC: Improving Clang's FileManager

Nathan Ridge via cfe-dev
Hi,

As it stands right now, Clang's FileManager doesn't model file entries with multiple names correctly. Whenever the `getFile` method is called with a path it hasn't seen before, it will return a FileEntry whose name is guaranteed to be that path. However, this guarantee is lost whenever a symlink file that points to original path is opened using `getFile`, as the `name` of the FileEntry is overwritten with the symlink path. This happens because the FileManager uniques the FileEntries using inodes, which means that symlinks always get the same file entry as the original file.

I would like to fix this issue as it's blocking clang-scan-deps from reusing the FileManager across multiple invocations. Here's my idea:

- Return FileEntryRef that stores the FileEntry * and a Name.
- Don't store the Name in the FileEntry (keep the RealPathName) still.
- A comparison between FileEntryRef will compare the pointer values of FileEntries, so the code that compares the pointer values right now can still compare two FileEntryRefs with the equivalent result (important for modules).
- DenseMaps will still mostly use the FileEntry * values to preserve the existing behavior which expects one FileEntry per inode.

Additionally to fixing the issue described above, Harlan (CCed) and I would like to also perform the following minor cleanups and improvements:

- Start using error_code/ErrorOr for FileManager for improved API.
- Get rid of the MemorizeStatCalls stat cache, which isn't actually used right now by Clang (probably a bug)!
- Cache the VFS stat value in the FileEntryRef itself, so that the FileManager can actually have proper caching that works.

Please let me know if you have any feedback. I'm hoping to post a patch for FileEntryRef sometime in the next 1-2 weeks.

Thanks,
Alex

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Improving Clang's FileManager

Nathan Ridge via cfe-dev

Additionally to fixing the issue described above, Harlan (CCed) and I would like to also perform the following minor cleanups and improvements:

- Start using error_code/ErrorOr for FileManager for improved API.

As a first step of this refactor, I’ve posted a patch that changes FileManager::getFile and FileManager::getDirectory to use llvm::ErrorOr.


It’s a big patch, but the bulk of the changes are updating clients throughout clang, lldb, and clang-tools-extra.

— Harlan

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Improving Clang's FileManager

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev
I'm sympathetic to wanting to change this behaviour, as it's super subtle and weird.
However this is a pretty risky change because it's subtle and filename handling often doesn't have a clearly correct answer. It'd be nice if there was a way to land this patch in isolation without an API change, so it can bake for a while and be more easily reverted if need be.
One detail: IIUC, this
  >  it will return a FileEntry whose name is guaranteed to be that path
isn't quite accurate, it stat()s the the file, assigns that to InterndFileName, and the assigns that to UFE.Name. So getFile() performs some canonicalization, and callers may rely on that now (clangd does).

On Wed, Jul 31, 2019 at 9:30 PM Alex L via cfe-dev <[hidden email]> wrote:
Hi,

As it stands right now, Clang's FileManager doesn't model file entries with multiple names correctly. Whenever the `getFile` method is called with a path it hasn't seen before, it will return a FileEntry whose name is guaranteed to be that path. However, this guarantee is lost whenever a symlink file that points to original path is opened using `getFile`, as the `name` of the FileEntry is overwritten with the symlink path. This happens because the FileManager uniques the FileEntries using inodes, which means that symlinks always get the same file entry as the original file.

I would like to fix this issue as it's blocking clang-scan-deps from reusing the FileManager across multiple invocations. Here's my idea:

- Return FileEntryRef that stores the FileEntry * and a Name.
- Don't store the Name in the FileEntry (keep the RealPathName) still.
- A comparison between FileEntryRef will compare the pointer values of FileEntries, so the code that compares the pointer values right now can still compare two FileEntryRefs with the equivalent result (important for modules).
- DenseMaps will still mostly use the FileEntry * values to preserve the existing behavior which expects one FileEntry per inode.

Additionally to fixing the issue described above, Harlan (CCed) and I would like to also perform the following minor cleanups and improvements:

- Start using error_code/ErrorOr for FileManager for improved API.
- Get rid of the MemorizeStatCalls stat cache, which isn't actually used right now by Clang (probably a bug)!
- Cache the VFS stat value in the FileEntryRef itself, so that the FileManager can actually have proper caching that works.

Please let me know if you have any feedback. I'm hoping to post a patch for FileEntryRef sometime in the next 1-2 weeks.

Thanks,
Alex
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Improving Clang's FileManager

Nathan Ridge via cfe-dev
Hi Sam!

I was thinking about this as well, it would be highly risky to land a massive change like this at once. I don't think it would be wise for me to try that, especially given that this was attempted before by others.

Right now I'm thinking about introducing a parallel API to `getFile`, which would return the FileEntryRef, and starting the migration by migrating only those clients that are needed for the initial use-case (FileManager reuse for dependency scanning). I think it would be fine if the migration happened on per-need basis, as long as new clients use the new API from the beginning. Do you think that's a good approach?

On Fri, 2 Aug 2019 at 14:12, Sam McCall <[hidden email]> wrote:
I'm sympathetic to wanting to change this behaviour, as it's super subtle and weird.
However this is a pretty risky change because it's subtle and filename handling often doesn't have a clearly correct answer. It'd be nice if there was a way to land this patch in isolation without an API change, so it can bake for a while and be more easily reverted if need be.
One detail: IIUC, this
  >  it will return a FileEntry whose name is guaranteed to be that path
isn't quite accurate, it stat()s the the file, assigns that to InterndFileName, and the assigns that to UFE.Name. So getFile() performs some canonicalization, and callers may rely on that now (clangd does).

Right, that's true, thanks for clarifying. Right now I don't want to change this particular aspect of this behavior. Do you know some of the scenarios when the name is different? I don't think it canonicalized symlinks, right?
 

On Wed, Jul 31, 2019 at 9:30 PM Alex L via cfe-dev <[hidden email]> wrote:
Hi,

As it stands right now, Clang's FileManager doesn't model file entries with multiple names correctly. Whenever the `getFile` method is called with a path it hasn't seen before, it will return a FileEntry whose name is guaranteed to be that path. However, this guarantee is lost whenever a symlink file that points to original path is opened using `getFile`, as the `name` of the FileEntry is overwritten with the symlink path. This happens because the FileManager uniques the FileEntries using inodes, which means that symlinks always get the same file entry as the original file.

I would like to fix this issue as it's blocking clang-scan-deps from reusing the FileManager across multiple invocations. Here's my idea:

- Return FileEntryRef that stores the FileEntry * and a Name.
- Don't store the Name in the FileEntry (keep the RealPathName) still.
- A comparison between FileEntryRef will compare the pointer values of FileEntries, so the code that compares the pointer values right now can still compare two FileEntryRefs with the equivalent result (important for modules).
- DenseMaps will still mostly use the FileEntry * values to preserve the existing behavior which expects one FileEntry per inode.

Additionally to fixing the issue described above, Harlan (CCed) and I would like to also perform the following minor cleanups and improvements:

- Start using error_code/ErrorOr for FileManager for improved API.
- Get rid of the MemorizeStatCalls stat cache, which isn't actually used right now by Clang (probably a bug)!
- Cache the VFS stat value in the FileEntryRef itself, so that the FileManager can actually have proper caching that works.

Please let me know if you have any feedback. I'm hoping to post a patch for FileEntryRef sometime in the next 1-2 weeks.

Thanks,
Alex
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Improving Clang's FileManager

Nathan Ridge via cfe-dev
On Tue, Aug 6, 2019 at 3:19 AM Alex L <[hidden email]> wrote:
Hi Sam!

I was thinking about this as well, it would be highly risky to land a massive change like this at once. I don't think it would be wise for me to try that, especially given that this was attempted before by others.

Right now I'm thinking about introducing a parallel API to `getFile`, which would return the FileEntryRef, and starting the migration by migrating only those clients that are needed for the initial use-case (FileManager reuse for dependency scanning). I think it would be fine if the migration happened on per-need basis, as long as new clients use the new API from the beginning. Do you think that's a good approach?
This sounds good to me. There's two big caveats:
 - if there are reasons existing clients can't easily be moved, these may represent (weird) use-cases that the new API doesn't support, like the canonicalization. So the old API may continue to get new users.
 - in practice, migrating clients between subtly different APIs may not happen, for "ain't broke, don't fix" reasons
But it's probably still the best approach I think.

On Fri, 2 Aug 2019 at 14:12, Sam McCall <[hidden email]> wrote:
I'm sympathetic to wanting to change this behaviour, as it's super subtle and weird.
However this is a pretty risky change because it's subtle and filename handling often doesn't have a clearly correct answer. It'd be nice if there was a way to land this patch in isolation without an API change, so it can bake for a while and be more easily reverted if need be.
One detail: IIUC, this
  >  it will return a FileEntry whose name is guaranteed to be that path
isn't quite accurate, it stat()s the the file, assigns that to InterndFileName, and the assigns that to UFE.Name. So getFile() performs some canonicalization, and callers may rely on that now (clangd does).

Right, that's true, thanks for clarifying. Right now I don't want to change this particular aspect of this behavior. Do you know some of the scenarios when the name is different? I don't think it canonicalized symlinks, right?
Reading through the implementation again (it's been a while!), it delegates to ether VFS.openFileForRead(N).status().name(), or VFS.status(N).name().
For RealFileSystem these propagate N directly, but it's up to the VFS.
RedirectingFileSystem seems to optionally use the original name or the provided name. I know internally we (embarrasingly) have a build system integration that uses this facility to do some path mapping (in this case we need to *un*-resolve a symlink, sigh...) 
 
On Wed, Jul 31, 2019 at 9:30 PM Alex L via cfe-dev <[hidden email]> wrote:
Hi,

As it stands right now, Clang's FileManager doesn't model file entries with multiple names correctly. Whenever the `getFile` method is called with a path it hasn't seen before, it will return a FileEntry whose name is guaranteed to be that path. However, this guarantee is lost whenever a symlink file that points to original path is opened using `getFile`, as the `name` of the FileEntry is overwritten with the symlink path. This happens because the FileManager uniques the FileEntries using inodes, which means that symlinks always get the same file entry as the original file.

I would like to fix this issue as it's blocking clang-scan-deps from reusing the FileManager across multiple invocations. Here's my idea:

- Return FileEntryRef that stores the FileEntry * and a Name.
- Don't store the Name in the FileEntry (keep the RealPathName) still.
- A comparison between FileEntryRef will compare the pointer values of FileEntries, so the code that compares the pointer values right now can still compare two FileEntryRefs with the equivalent result (important for modules).
- DenseMaps will still mostly use the FileEntry * values to preserve the existing behavior which expects one FileEntry per inode.

Additionally to fixing the issue described above, Harlan (CCed) and I would like to also perform the following minor cleanups and improvements:

- Start using error_code/ErrorOr for FileManager for improved API.
- Get rid of the MemorizeStatCalls stat cache, which isn't actually used right now by Clang (probably a bug)!
- Cache the VFS stat value in the FileEntryRef itself, so that the FileManager can actually have proper caching that works.

Please let me know if you have any feedback. I'm hoping to post a patch for FileEntryRef sometime in the next 1-2 weeks.

Thanks,
Alex
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Improving Clang's FileManager

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev


On Wed, 31 Jul 2019 at 12:29, Alex L <[hidden email]> wrote:
Hi,

As it stands right now, Clang's FileManager doesn't model file entries with multiple names correctly. Whenever the `getFile` method is called with a path it hasn't seen before, it will return a FileEntry whose name is guaranteed to be that path. However, this guarantee is lost whenever a symlink file that points to original path is opened using `getFile`, as the `name` of the FileEntry is overwritten with the symlink path. This happens because the FileManager uniques the FileEntries using inodes, which means that symlinks always get the same file entry as the original file.

I would like to fix this issue as it's blocking clang-scan-deps from reusing the FileManager across multiple invocations. Here's my idea:

- Return FileEntryRef that stores the FileEntry * and a Name.
- Don't store the Name in the FileEntry (keep the RealPathName) still.
- A comparison between FileEntryRef will compare the pointer values of FileEntries, so the code that compares the pointer values right now can still compare two FileEntryRefs with the equivalent result (important for modules).
- DenseMaps will still mostly use the FileEntry * values to preserve the existing behavior which expects one FileEntry per inode.

I posted a patch that introduces FileEntryRef with a parallel FileManager API and migrates some clients that are used to deal with includes. You can find it here:

Cheers,
Alex
 

Additionally to fixing the issue described above, Harlan (CCed) and I would like to also perform the following minor cleanups and improvements:

- Start using error_code/ErrorOr for FileManager for improved API.
- Get rid of the MemorizeStatCalls stat cache, which isn't actually used right now by Clang (probably a bug)!
- Cache the VFS stat value in the FileEntryRef itself, so that the FileManager can actually have proper caching that works.

Please let me know if you have any feedback. I'm hoping to post a patch for FileEntryRef sometime in the next 1-2 weeks.

Thanks,
Alex

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev