Adding indexing support to Clangd

Adding indexing support to Clangd

Alex Denisov via cfe-dev
Hi,

I’ve been thinking about how to add features to Clangd requiring an index, i.e. features that need a database containing information of all source files (Go to definition, find references, etc). I’d like to share with you my thoughts on how things are and what approaches could be taken before getting too deep into implementing something.

My understanding of the current Clang indexing facilities is as follows:
  - It is part of libclang, so it is meant to have a stable API, which can be limiting because it does not expose the full Clang C/C++ API
  - It does not have persistence. I.e. the index cannot be reloaded from disk at a later time after it is built.
  - There is no header caching mechanism in order to allow faster reparsing when a source file changes but its included headers haven’t (a common occurrence during code editing).

-- Other indexing solutions --

I have done a very high level exploration of some other projects using Clang for indexing, you can find some notes here:
https://docs.google.com/document/d/1Z0pDZpUlhyRkw1yB9frVVeb_xgSb5PuXD0-aeUtKkpo/edit?usp=sharing
(Feel free to add your own notes if you’d like!)

From what I gathered:
  - Some projects are using libclang, others use the Clang C++ APIs (AST) directly because of libclang limitations
  - Some projects have custom index formats on disk, others use an RDBMS (PostgreSQL, SQLite) or other already available solutions (Elasticsearch, etc).
  - I didn’t notice any projects based on Clang doing header caching, although perhaps I missed it. Ilya Biryukov wrote that JetBrains CLion does header caching but it’s not clear how they are stored or if it is using Clang. On the Eclipse CDT side, Clang is not used but there is header caching by storing the semantic model in the index (not plain AST). Then the source files can be parsed reusing that cached information.

Possible approach for Clangd:
  - I suggest using Clang libraries directly and not using libclang in order to not have any limitations. I think that using a stable API is not as important since Clangd resides in the same tree and is built and tested in coordination with Clang itself. The downside is that it will not reuse some of the work already done in libclang such as finding references, etc.
  - I think introducing a big dependency such as PostgreSQL is not acceptable for Clangd (correct me if I’m wrong!). So a custom-tailored file format for the index makes more sense to me.
  - For header caching, I wonder if it is possible to reuse the precompiled header support in Clang. There would be some logic that would decide whether or not a precompiled header could be used depending on the preprocessing context (same macro definitions, etc).
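
The reuse decision in the last bullet could be sketched roughly as below. This is only an illustration and not how Clang validates precompiled headers: `MacroState`, `ObservedMacro`, `CachedHeader` and `canReuse` are hypothetical names, and the sketch assumes we recorded which macros a header consulted when it was first parsed.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical: macro name -> definition visible at the #include point.
using MacroState = std::map<std::string, std::string>;

// Hypothetical record of one macro the header consulted while being parsed.
struct ObservedMacro {
  bool wasDefined;        // was the macro defined at that time?
  std::string definition; // meaningful only when wasDefined is true
};

// Hypothetical cache entry for one header.
struct CachedHeader {
  std::string path;
  std::map<std::string, ObservedMacro> observedMacros;
};

// The cached result is reusable only if every macro the header depended on
// is in the same state (same definition, or still undefined) in the new
// preprocessing context. Macros the header never looked at are ignored.
bool canReuse(const CachedHeader &cached, const MacroState &current) {
  for (const auto &entry : cached.observedMacros) {
    auto it = current.find(entry.first);
    bool definedNow = it != current.end();
    if (definedNow != entry.second.wasDefined)
      return false;
    if (definedNow && it->second != entry.second.definition)
      return false;
  }
  return true;
}
```

A real implementation would also have to account for include guards, `#pragma once`, and declarations visible from earlier includes, which this sketch ignores.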

-- The Index model --

Here’s what the data model could look like. For sure it’s partial and I expect it will evolve quite a bit. But it should be enough to communicate the general idea.

Index: Represents the model of the code base as a whole
  - IndexFile []

IndexFile: Represents an indexed file
  - URI path
  - IndexFile includedBy [ ]
  - IndexName [ ]
  - Last modified timestamp, checksum, etc

IndexName: Represents a declaration, definition, macro, etc
  - Source Location
  - IndexNameReference [ ]

IndexNameReference: Reference to a name
  - Source Location
  - Access (read, write, read/write)

IndexTypeName extends IndexName: represents classes, structs, etc
  - IndexTypeName bases [ ]

IndexFunctionName extends IndexName: represents functions, methods, etc
  - IndexFunctionName callers [ ]

Note that a lot of information probably doesn’t need to be modeled because a lot of information only needs to be available with an opened file for which we can have access to the full AST.
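
As a rough illustration, the model above might map onto C++ types like the following. This is a sketch under assumptions, not a proposed API: `Location` stands in for "Source Location", the `name` field and the choice of owning/non-owning pointers are additions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for "Source Location".
struct Location {
  std::string fileUri;
  unsigned line;
  unsigned column;
};

enum class Access { Read, Write, ReadWrite };

// IndexNameReference: reference to a name.
struct IndexNameReference {
  Location location;
  Access access;
};

// IndexName: a declaration, definition, macro, etc.
struct IndexName {
  virtual ~IndexName() = default;
  std::string name; // assumption: not listed in the model, added for lookup
  Location location{};
  std::vector<IndexNameReference> references;
};

// IndexTypeName: classes, structs, etc.
struct IndexTypeName : IndexName {
  std::vector<const IndexTypeName *> bases; // non-owning
};

// IndexFunctionName: functions, methods, etc.
struct IndexFunctionName : IndexName {
  std::vector<const IndexFunctionName *> callers; // non-owning
};

// IndexFile: an indexed file, which owns the names declared in it.
struct IndexFile {
  std::string uri;
  std::vector<const IndexFile *> includedBy; // non-owning
  std::vector<std::unique_ptr<IndexName>> names;
  std::uint64_t lastModified = 0;
};

// Index: the model of the code base as a whole.
struct Index {
  std::vector<std::unique_ptr<IndexFile>> files;
};
```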

-- The persisted file format --

All elements in the model mentioned above could have a querying interface which could be implemented for an “in memory” database (simpler to debug and fast for small projects) and also for an on-disk database. From my experience in Eclipse CDT, the index on disk was stored in the form of a BTree, which worked quite well. The BTree is made out of chunks. Chunks can be cached in memory and fetched from disk as required. Every piece of information in the model is fetched from the database (from the cache if present, otherwise from disk). A similar approach could be used for Clangd if it’s deemed suitable.
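
To make the "one querying interface, two backends" idea concrete, here is a minimal sketch. `IndexStore`, `InMemoryIndexStore` and `Location` are hypothetical names chosen for the example. Only the in-memory backend is shown, with `std::map` as a stand-in for the tree; an on-disk BTree backend would implement the same interface and fetch chunks through a cache.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a source location.
struct Location {
  std::string fileUri;
  unsigned line;
  unsigned column;
};

// Hypothetical query interface shared by all index backends.
class IndexStore {
public:
  virtual ~IndexStore() = default;
  virtual void addDefinition(const std::string &name, const Location &loc) = 0;
  virtual void addReference(const std::string &name, const Location &loc) = 0;
  virtual std::vector<Location> findDefinitions(const std::string &name) const = 0;
  virtual std::vector<Location> findReferences(const std::string &name) const = 0;
};

// In-memory backend: simple to debug, adequate for small projects.
class InMemoryIndexStore : public IndexStore {
public:
  void addDefinition(const std::string &name, const Location &loc) override {
    definitions_[name].push_back(loc);
  }
  void addReference(const std::string &name, const Location &loc) override {
    references_[name].push_back(loc);
  }
  std::vector<Location> findDefinitions(const std::string &name) const override {
    return lookup(definitions_, name);
  }
  std::vector<Location> findReferences(const std::string &name) const override {
    return lookup(references_, name);
  }

private:
  using Table = std::map<std::string, std::vector<Location>>;

  // Returns a copy of the entries for `name`, or an empty list if unknown.
  static std::vector<Location> lookup(const Table &table, const std::string &name) {
    auto it = table.find(name);
    return it == table.end() ? std::vector<Location>() : it->second;
  }

  Table definitions_;
  Table references_;
};
```

Callers that only hold an `IndexStore` reference would not need to change if the backend were later swapped for a persistent one.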



In summary, I’m proposing for Clangd an index on disk stored in the form of a BTree that is populated using Clang’s C++ API (not libclang). Any concerns or input would be greatly appreciated. Just as a side note, I’m aware that this is just one line of thinking and others could be considered.

Best regards,
Marc-André Laperle
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Hi,

Thanks for making a summary of existing solutions!

On 17 May 2017 at 23:38, Marc-André Laperle via cfe-dev <[hidden email]> wrote:
>   - There is no header caching mechanism in order to allow faster reparsing when a source file changes but its included headers haven’t (a common occurrence during code editing).

Have you looked into the precompiled preamble? I believe it can be (and is) used when indexing.


>   - I didn’t notice any projects based on Clang doing header caching, although perhaps I missed it. Ilya Biryukov wrote that JetBrains CLion does header caching but it’s not clear how they are stored or if it is using Clang.

IIRC CLion uses a custom C++ parser instead of Clang.
 

> Possible approach for Clangd:
>   - I suggest using Clang libraries directly and not using libclang in order to not have any limitations. I think that using a stable API is not as important since Clangd resides in the same tree and is built and tested in coordination with Clang itself. The downside is that it will not reuse some of the work already done in libclang such as finding references, etc.

I agree, Clangd should not use libclang. Note that in general libclang's indexer API is intended to be a wrapper around the core implementation in lib/Index. I also don't think libclang doesn't expose any means to find references.

I would encourage Clangd to reuse the existing code in lib/Index. Even though it has bugs, we are currently fixing (and will keep fixing) a lot of issues in the library to ensure that our consumer records all of the possible declarations and references for both C++ and Obj-C.
 
> -- The persisted file format --
>
> All elements in the model mentioned above could have a querying interface which could be implemented for an “in memory” database (simpler to debug and fast for small projects) and also for an on-disk database. From my experience in Eclipse CDT, the index on disk was stored in the form of a BTree, which worked quite well. The BTree is made out of chunks. Chunks can be cached in memory and fetched from disk as required. Every piece of information in the model is fetched from the database (from the cache if present, otherwise from disk). A similar approach could be used for Clangd if it’s deemed suitable.

Have you looked into LLVM's bitcode as a possible format for the persistent index? Clang currently uses it for serialized diagnostics and modules.
 




Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
On 2017-05-17, 6:38 PM, "Marc-André Laperle"
<[hidden email]> wrote:
>  - For header caching, I wonder if it is possible to reuse the
>precompiled header support in Clang. There would be some logic that would
>decide whether or not a precompiled header could be used depending on the
>preprocessing context (same macro definitions, etc).

I'm not certain header caching is needed right away. We did it in the CDT
because our parsers were fairly slow and indexing a project took a very
long time. I have hope that Clang would be faster. At the very least, you
would want caching to be optional, so you need to be able to work without
it. But make sure you have the architecture to graft it in later.

In CDT we cheated a lot to gain performance at the cost of accuracy. The
results are still very good, so it's an interesting balancing act.

Doug Schaefer
Eclipse CDT Project Lead


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Hi Alex! Some replies in-lined.

On Thu, 2017-05-18 at 12:30 +0100, Alex L wrote:

> Have you looked into the precompiled preamble? I believe it can be (and is) used when indexing.

I haven't really looked into it yet but it looks very useful, especially this section: https://clang.llvm.org/docs/PCHInternals.html#chained-precompiled-headers




> I agree, Clangd should not use libclang. Note that in general libclang's indexer API is intended to be a wrapper around the core implementation in lib/Index. I also don't think libclang doesn't expose any means to find references.
>
> I would encourage Clangd to reuse the existing code in lib/Index. Even though it has bugs, we are currently fixing (and will keep fixing) a lot of issues in the library to ensure that our consumer records all of the possible declarations and references for both C++ and Obj-C.

Thanks, I was under the wrong impression that this was all part of libclang but I see that this is not the case. I'm all for reusing code and I can help fix issues if there are any. I'll give this a try!


> Have you looked into LLVM's bitcode as a possible format for the persistent index? Clang currently uses it for serialized diagnostics and modules.
 

I will have a look. It seems very well defined. It's not clear to me yet if this can be used across the board but I'll play around with it a bit.

Thank you so much for mentioning these things! It's easy to miss some of the useful parts when getting into a new code base.

Regards,
Marc-André





Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev

Yeah, it sounds like a good approach to tackle header caching a bit later. "Precompiled preamble" looks promising so we can keep this in mind as we go.


Cheers,

Marc-André



Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Hi everyone,

The problem with PCHs (chained or not) is that they only work for the source file,
i.e. you can only use one when you start parsing a new file from scratch, right? For header caching we
really want to reuse whatever information we have cached even if the header is included in a different
context (e.g. the order of includes is different in the other translation unit), which is not possible with PCHs.

My point is that it's not at all straightforward how (or if) PCHs can improve the performance of processing
the same header twice.

And as long as building an index is as fast as a recompile and we can reuse information from the previous
(slightly outdated) version of the index while the new version is building, we can probably get a good enough
UX without any compromises on correctness (and without introducing additional complexity, since I don't think there's
a way to do header caching without significant changes to Clang itself).
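
One way to picture "reuse the previous version while the new one is building" is a snapshot swap, sketched below. `SnapshotIndex` and `IndexSnapshot` are hypothetical names, and the snapshot content (symbol name to defining file) is deliberately trivial; the point is only that readers keep a consistent, possibly slightly outdated view until a complete new version is published.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical snapshot content: symbol name -> file that defines it.
using IndexSnapshot = std::map<std::string, std::string>;

class SnapshotIndex {
public:
  SnapshotIndex() : current_(std::make_shared<IndexSnapshot>()) {}

  // Readers get shared ownership of whatever snapshot is current. It stays
  // valid and unchanged even if a newer one is published while they use it.
  std::shared_ptr<const IndexSnapshot> snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return current_;
  }

  // Called by the (re)indexer once a complete new version has been built.
  void publish(IndexSnapshot next) {
    std::shared_ptr<const IndexSnapshot> fresh =
        std::make_shared<IndexSnapshot>(std::move(next));
    std::lock_guard<std::mutex> lock(mutex_);
    current_ = std::move(fresh);
  }

private:
  mutable std::mutex mutex_;
  std::shared_ptr<const IndexSnapshot> current_;
};
```

The lock is held only while copying or replacing the pointer, so queries are not blocked for the duration of a rebuild.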

CLion indeed has a custom parser and serialization format, it's not clang-based.






--
Regards,
Ilya Biryukov


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev

Those are good points. I think we'll have to see how fast the indexing is without caching and then go from there. I think the precompiled preamble sounds useful at least for open editors, for quick parsing, as there can be many reparsings of the same file without any changes in the inclusions (while typing, etc). I think reusing the information from the previous version is a good compromise, but we also have to make sure that building the index for the first time does not take too long, or at least make sure that there is sufficient functionality available for users to start working and communicate that some functionality is not yet available.


Cheers,

Marc-André



Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Hello everyone,

On 23.05.2017 07:02, Marc-André Laperle via cfe-dev wrote:

> Those are good points. I think we'll have to see how fast the indexing is without caching and then go from there.

We did such measurements when we evaluated Clang as a technology to be used in NetBeans C/C++. I don't remember the exact absolute numbers now, but the conclusion was:
to be on par with the existing NetBeans speed we would have to use different caching; otherwise it was about 10 times slower.

> I think the precompiled preamble sounds useful at least for open editors, for quick parsing, as there can be many reparsings of the same file without any changes in the inclusions (while typing, etc).

+1. Preambles need to be used to provide reasonable responsiveness. Otherwise, e.g., an included Boost header can consume an unexpected amount of time.
In fact, sometimes it's worth having preamble granularity per function (e.g. for files opened in the editor), because when developers modify code, most of the time they modify function bodies.

> I think reusing the information from the previous version is a good compromise, but we also have to make sure that building the index for the first time does not take too long, or at least make sure that there is sufficient functionality available for users to start working and communicate that some functionality is not yet available.

+1. Btw, maybe it is worth setting some expectations about what is available during and after the initial indexing phase.
E.g. during the initial phase you'd probably like to have navigation for the file opened in the editor and be able to work in function bodies.

As to initial indexing:
Using PTH (not PCH) gave a significant speedup.
Skipping bodies gave a significant speedup, but you miss the references and later have to reindex bodies on demand.
Using chained PCH gave the next visible speedup.

Of course we had to make some hacks for PCHs to be "reusable" more often (compared to the strict compiler rule) and keep multiple versions. On average two: one for the C and one for the C++ parse context.

Also, there is a difference between system headers and project headers, so system headers can be cached more aggressively.

Vladimir
NetBeans C/C++ Project Lead



Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Hi Vladimir,

Thanks for sharing your experience.

> We did such measurements when we evaluated Clang as a technology to be used in NetBeans C/C++. I don't remember the exact absolute numbers now, but the conclusion was:
> to be on par with the existing NetBeans speed we would have to use different caching; otherwise it was about 10 times slower.

That's a good reason to focus on that issue from the very start, then. It would be nice to have some exact measurements, though (e.g. on LLVM),
just to know exactly how slow it was.

> +1. Btw, maybe it is worth setting some expectations about what is available during and after the initial indexing phase.
> E.g. during the initial phase you'd probably like to have navigation for the file opened in the editor and be able to work in function bodies.

We definitely want diagnostics/completions for the currently open file to be available. Good point, we definitely want to explicitly name the available features in the docs/discussions.

As to initial indexing:
Using PTH (not PCH) gave significant speedup.
Skipping bodies gave significant speedup, but you miss the references and later have to reindex bodies on demand.
Using chainged PCH gave the next visible speedup.
Of course we had to made some hacks for PCHs to be more often "reusable" (comparing to strict compiler rule) and keep multiple versions. In average 2: one for C and one for C++ parse context.
Also there is a difference between system headers and projects headers, so systems' can be cached more aggressively.
Is this work open-source? The interesting part is how to "reuse" the PCH for a header that's included in a different order. 
I.e. is there a way to reuse some cached information(PCH, or anything else) for <map> and <vector> when parsing these two files:
```
// foo.cpp
#include <vector>
#include <map>
...

// bar.cpp
#include <map>
#include <vector>
....
```

--
Regards,
Ilya Biryukov


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Hi Ilya,

Sorry for the late reply.
Unfortunately, the hacks I mentioned were done a long time ago and I couldn't find the changes at first glance :-(

But you can think about reusable chained PCHs in the "module" way.
Each system header is a module.
There are special index_headers.c and index_headers.cpp files which include all standard headers.
These files are indexed first and create a "module" per #include.
A module is created once, or several times if the preprocessor contexts are very different, like C vs. C++98 vs. C++14.
Then it is reused.
Of course it could compromise accuracy, but for a proof of concept it was enough to see that the expected indexing speed can be achieved theoretically.
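
To make the "one module per header and preprocessor context" idea concrete, here is a minimal C++17 sketch. It is only an illustration under stated assumptions: `ParseContext`, `HeaderModuleCache` and `getOrBuild` are hypothetical names, not Clang or NetBeans APIs, and a plain string stands in for a serialized PCH/module.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical language contexts; a real tool would more likely hash all
// relevant preprocessor options.
enum class ParseContext { C, Cxx98, Cxx14 };

// Hypothetical cache: one prebuilt "module" per (header, context) pair.
class HeaderModuleCache {
public:
  using Builder =
      std::function<std::string(const std::string &, ParseContext)>;

  explicit HeaderModuleCache(Builder B) : Build(std::move(B)) {}

  // Returns the cached module, building it on the first request only.
  const std::string &getOrBuild(const std::string &Header, ParseContext Ctx) {
    auto Key = std::make_pair(Header, Ctx);
    auto It = Modules.find(Key);
    if (It == Modules.end()) {
      ++Builds;
      It = Modules.emplace(Key, Build(Header, Ctx)).first;
    }
    return It->second;
  }

  // Number of times the builder actually ran (cache misses).
  int buildCount() const { return Builds; }

private:
  Builder Build;
  std::map<std::pair<std::string, ParseContext>, std::string> Modules;
  int Builds = 0;
};
```

Note that the lookup key deliberately ignores what was included before the header. That is what makes the module reusable regardless of include order, and also where accuracy can be lost.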

Btw, another hint: implementing FileSystemStatCache gave the next visible speedup. Of course you need to carefully invalidate/update it when a file is modified in the IDE or externally.
So, in the end we got just a 2x slowdown, but with the accuracy of a "real" compiler. And then, as you know, we started Clank :-)
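
The stat-cache idea can be sketched roughly as follows (C++17). This is not Clang's `FileSystemStatCache` interface; `StatCache`, `FileStat`, `get` and `invalidate` are hypothetical names chosen for illustration. The sketch only shows the two points made above: repeated queries are answered from memory, and an entry has to be dropped explicitly when the file changes.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <mutex>
#include <string>
#include <system_error>
#include <unordered_map>

namespace fs = std::filesystem;

// Hypothetical subset of the information a stat call would return.
struct FileStat {
  bool Exists = false;
  std::uintmax_t Size = 0;
};

// Hypothetical cache that can be shared between compilation units.
class StatCache {
public:
  // Answers from the cache if possible, otherwise asks the file system once.
  FileStat get(const std::string &Path) {
    std::lock_guard<std::mutex> Lock(Mu);
    auto It = Entries.find(Path);
    if (It != Entries.end())
      return It->second;
    ++Misses;
    FileStat S;
    std::error_code EC;
    S.Exists = fs::is_regular_file(Path, EC);
    if (S.Exists) {
      S.Size = fs::file_size(Path, EC);
      if (EC)
        S.Size = 0;
    }
    Entries.emplace(Path, S);
    return S;
  }

  // Must be called when the file is modified in the editor or externally.
  void invalidate(const std::string &Path) {
    std::lock_guard<std::mutex> Lock(Mu);
    Entries.erase(Path);
  }

  // Number of queries that actually reached the file system.
  int missCount() const {
    std::lock_guard<std::mutex> Lock(Mu);
    return Misses;
  }

private:
  mutable std::mutex Mu;
  std::unordered_map<std::string, FileStat> Entries;
  int Misses = 0;
};
```

Negative results (missing files) are cached too, which tends to matter for include-path probing; the mutex is there only because such a cache would probably be shared between indexing threads.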

Hope it helps,
Vladimir.

On 29.05.2017 11:58, Ilya Biryukov wrote:
Hi Vladimir,

Thanks for sharing your experience.

We did such measurements when we evaluated Clang as a technology to be used in NetBeans C/C++. I don't remember the exact absolute numbers now, but the conclusion was:
to be on par with the existing NetBeans speed we have to use different caching; otherwise it was about 10 times slower.
That's a good reason to focus on that issue from the very start, then. It would be nice to have some exact measurements, though (e.g. on LLVM),
just to know exactly how slow it was.

+1. Btw, maybe it is worth setting some expectations about what is available during and after the initial indexing phase.
E.g. during the initial phase you'd probably like to have navigation for the file opened in the editor and be able to work in function bodies.
We definitely want diagnostics/completions for the currently open file to be available. Good point, we definitely want to explicitly name the available features in the docs/discussions.

As to initial indexing:
Using PTH (not PCH) gave a significant speedup.
Skipping bodies gave a significant speedup, but you miss the references and later have to reindex bodies on demand.
Using chained PCH gave the next visible speedup.
Of course we had to make some hacks for PCHs to be "reusable" more often (compared to the strict compiler rule) and keep multiple versions. On average 2: one for the C and one for the C++ parse context.
Also, there is a difference between system headers and project headers, so system headers can be cached more aggressively.
Is this work open-source? The interesting part is how to "reuse" the PCH for a header that's included in a different order.
I.e. is there a way to reuse some cached information (PCH, or anything else) for <map> and <vector> when parsing these two files:
```
// foo.cpp
#include <vector>
#include <map>
...

// bar.cpp
#include <map>
#include <vector>
....
```

--
Regards,
Ilya Biryukov



Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Thanks for the insights, I think I get the gist of the idea with the "module" PCH. 
One question is: what if the system headers are included after the user includes? Then we abandon the PCH cache and run the parsing from scratch, right?

FileSystemStatCache that is reused between compilation units? Sounds like a low-hanging fruit for indexing, thanks.

On Thu, Jun 1, 2017 at 11:52 AM, Vladimir Voskresensky <[hidden email]> wrote:
Hi Ilya,

Sorry for the late reply.
Unfortunately, the hacks I mentioned were done a long time ago and I couldn't find the changes at first glance :-(

But you can think about reusable chained PCHs in the "module" way.
Each system header is a module.
There are special index_headers.c and index_headers.cpp files which include all standard headers.
These files are indexed first and create a "module" per #include.
A module is created once, or several times if the preprocessor contexts are very different, like C vs. C++98 vs. C++14.
Then it is reused.
Of course it could compromise accuracy, but for a proof of concept it was enough to see that the expected indexing speed can be achieved theoretically.

Btw, another hint: implementing FileSystemStatCache gave the next visible speedup. Of course you need to carefully invalidate/update it when a file is modified in the IDE or externally.
So, in the end we got just a 2x slowdown, but with the accuracy of a "real" compiler. And then, as you know, we started Clank :-)

Hope it helps,
Vladimir.


--
Regards,
Ilya Biryukov


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
I thought I’d chip in and describe Eclipse CDT’s strategy with header caching. It’s actually a big cheat but the results have proven to be pretty good.

CDT’s hack actually starts in the preprocessor. If we see a header file has already been indexed, we skip including it. At the back end, we seamlessly use the index or the current symbol table when doing symbol lookup. Symbols that get missed because we skipped header files get picked up out of the index instead. We also do that in the preprocessor to look up missing macros out of the index when doing macro substitution.

The performance gains were about an order of magnitude, and it magically works most of the time. The main issue is header files that get included multiple times and are affected by different macro values, but the effects of that haven’t been major.
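
As a rough illustration of the strategy described above, the following C++17 sketch skips headers that are already indexed and resolves symbols from the current symbol table first, then from the index. It is not CDT code; `Index`, `TranslationUnit`, `shouldParseHeader` and `lookup` are hypothetical names, and symbols are reduced to a name mapped to the file that declares it.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <unordered_map>

// Hypothetical persistent index built from previously parsed headers.
struct Index {
  std::set<std::string> IndexedHeaders;
  std::unordered_map<std::string, std::string> Symbols; // name -> file
};

// Hypothetical state for the file currently being parsed.
class TranslationUnit {
public:
  explicit TranslationUnit(const Index &I) : Idx(I) {}

  // The preprocessor-level cheat: an already indexed header is not parsed.
  bool shouldParseHeader(const std::string &Header) const {
    return Idx.IndexedHeaders.count(Header) == 0;
  }

  // Records a symbol seen while parsing this translation unit.
  void declare(const std::string &Name, const std::string &File) {
    Local[Name] = File;
  }

  // Current symbol table first, then the index as a fallback.
  std::optional<std::string> lookup(const std::string &Name) const {
    if (auto It = Local.find(Name); It != Local.end())
      return It->second;
    if (auto It = Idx.Symbols.find(Name); It != Idx.Symbols.end())
      return It->second;
    return std::nullopt;
  }

private:
  const Index &Idx;
  std::unordered_map<std::string, std::string> Local;
};
```

The fallback returns whatever the index recorded the first time the header was parsed, which is exactly why a header included under different macro values can give slightly wrong answers.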

With clang being a real compiler, I had my doubts that you could even do something like this without adding hooks in places the front-end gang might not like. Love to be proven wrong. It really is very hard to keep up with the evolving C++ standard and we could sure use the help clangd could offer.

Hope that helps,
Doug.

From: cfe-dev <[hidden email]> on behalf of Ilya Biryukov via cfe-dev <[hidden email]>
Reply-To: Ilya Biryukov <[hidden email]>
Date: Thursday, June 1, 2017 at 10:52 AM
To: Vladimir Voskresensky <[hidden email]>
Cc: via cfe-dev <[hidden email]>
Subject: Re: [cfe-dev] Adding indexing support to Clangd

Thanks for the insights, I think I get the gist of the idea with the "module" PCH. 
One question is: what if the system headers are included after the user includes? Then we abandon the PCH cache and run the parsing from scratch, right?

FileSystemStatCache that is reused between compilation units? Sounds like a low-hanging fruit for indexing, thanks.

--
Regards,
Ilya Biryukov


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Not sure if this has already been discussed, but would it be practical/reasonable to use Clang's modules support for this? It might keep the implementation much simpler, and perhaps provide an extra incentive for users to modularize their build/code, which would help their actual build times (& heck, parsed modules could even potentially be reused between the indexer and the final build, making apparent build times /really/ fast).

On Thu, Jun 1, 2017 at 8:12 AM Doug Schaefer via cfe-dev <[hidden email]> wrote:
I thought I’d chip in and describe Eclipse CDT’s strategy with header caching. It’s actually a big cheat but the results have proven to be pretty good.

CDT’s hack actually starts in the preprocessor. If we see a header file has already been indexed, we skip including it. At the back end, we seamlessly use the index or the current symbol table when doing symbol lookup. Symbols that get missed because we skipped header files get picked up out of the index instead. We also do that in the preprocessor to look up missing macros out of the index when doing macro substitution.

The performance gains were about an order of magnitude, and it magically works most of the time. The main issue is header files that get included multiple times and are affected by different macro values, but the effects of that haven’t been major.

With clang being a real compiler, I had my doubts that you could even do something like this without adding hooks in places the front-end gang might not like. Love to be proven wrong. It really is very hard to keep up with the evolving C++ standard and we could sure use the help clangd could offer.

Hope that helps,
Doug.


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Other IDEs do that very similarly to CDT, AFAIK. Compromising correctness, but getting better performance.
Reusing modules would be nice, and I wonder if it could also be made transparent to the users of the tool (i.e. we could have an option 'pretend these headers are modules every time you encounter them')
I would expect that to break on most projects, though. Not sure if people would be willing to use something that spits tons of errors on them.
Interesting direction for prototyping...

On Thu, Jun 1, 2017 at 5:14 PM, David Blaikie <[hidden email]> wrote:
Not sure if this has already been discussed, but would it be practical/reasonable to use Clang's modules support for this? It might keep the implementation much simpler, and perhaps provide an extra incentive for users to modularize their build/code, which would help their actual build times (& heck, parsed modules could even potentially be reused between the indexer and the final build, making apparent build times /really/ fast).


--
Regards,
Ilya Biryukov


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev



On 06/ 1/17 06:26 PM, Ilya Biryukov via cfe-dev wrote:
Other IDEs do that very similarly to CDT, AFAIK. Compromising correctness, but getting better performance.
Reusing modules would be nice, and I wonder if it could also be made transparent to the users of the tool (i.e. we could have an option 'pretend these headers are modules every time you encounter them')
I would expect that to break on most projects, though. Not sure if people would be willing to use something that spits tons of errors on them.
Interesting direction for prototyping...
As Doug mentioned, surprisingly, the tricks with headers give pretty good results in the majority of projects :-)

In NetBeans we have a header caching approach similar to CDT's.

The only difference is that when we hit an #include the second time, we only check whether we can skip indexing.
But we always do "fair lightweight preprocessing" to keep a fair context of all possible inner #ifdef/#else/#define directives (because they might affect the current file).
For that we use a per-file APT (Abstract Preprocessor Tree), which is constant for the file and is created once, similar to Clang's PTH (Pre-Tokenized Headers).

Visiting a file's APT, we can produce different output based on the input preprocessor state.
It can be visited in "light" mode or "produce tokens" mode, but it always gives the correct result from the strict compiler's point of view.
We also do indexing in parallel, and the APT (being immutable) is easily shared by index visitors from all threads.
Btw, the stat cache is also reused by all indexing threads with appropriate synchronization.
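
A toy version of the APT idea, assuming only three node kinds, might look like this in C++17. The names `AptNode`, `MacroState` and `visit` are hypothetical and this is not NetBeans code; the point is only that one immutable tree per file can be walked with different input macro states, either in "light" mode (macro state only) or in "produce tokens" mode.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical node of an immutable per-file preprocessor tree.
struct AptNode {
  enum Kind { Tokens, Define, IfDef };
  Kind K;
  std::string Text;           // token text, or the macro name
  std::vector<AptNode> Then;  // taken when the macro is defined (IfDef only)
  std::vector<AptNode> Else;  // taken otherwise (IfDef only)
};

// The preprocessor state is reduced to the set of defined macro names.
using MacroState = std::set<std::string>;

// Walks the tree without modifying it. With Out == nullptr this is the
// "light" mode: only the macro state is updated, no tokens are produced.
void visit(const std::vector<AptNode> &Nodes, MacroState &Macros,
           std::vector<std::string> *Out) {
  for (const AptNode &N : Nodes) {
    switch (N.K) {
    case AptNode::Tokens:
      if (Out)
        Out->push_back(N.Text);
      break;
    case AptNode::Define:
      Macros.insert(N.Text);
      break;
    case AptNode::IfDef:
      visit(Macros.count(N.Text) ? N.Then : N.Else, Macros, Out);
      break;
    }
  }
}
```

Because `visit` takes the tree by const reference and keeps all mutable state in its arguments, several indexing threads could walk the same tree concurrently, each with its own `MacroState`.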

So in NetBeans we observe that with these tricks (which really look like multiple modules per header file) the majority of projects are indexed with very good accuracy, and I can also confirm that it gives a ~10x speedup.

Hope it helps,
Vladimir.


On Thu, Jun 1, 2017 at 5:14 PM, David Blaikie <[hidden email]> wrote:
Not sure this has already been discussed, but would it be practical/reasonable to use Clang's modules support for this? Might keep the implementation much simpler - and perhaps provide an extra incentive for users to modularize their build/code which would help their actual build tymes (& heck, parsed modules could even potentially be reused between indexer and final build - making apparent build times /really/ fast)

On Thu, Jun 1, 2017 at 8:12 AM Doug Schaefer via cfe-dev <[hidden email]> wrote:
I thought I’d chip in and describe Eclipse CDT’s strategy with header caching. It’s actually a big cheat but the results have proven to be pretty good.

CDT’s hack actually starts in the preprocessor. If we see a header file has already been indexed, we skip including it. At the back end, we seamlessly use the index or the current symbol table when doing symbol lookup. Symbols that get missed because we skipped header files get picked up out of the index instead. We also do that in the preprocessor to look up missing macros out of the index when doing macro substitution.

The performance gains were about an order of magnitude and it magically works most of the time with the main issue being header files that get included multiple times affected by different macro values but the effects of that haven’t been major.

With clang being a real compiler, I had my doubts that you could even do something like this without adding hooks in places the front-end gang might not like. Love to be proven wrong. It really is very hard to keep up with the evolving C++ standard and we could sure use the help clangd could offer.

Hope that helps,
Doug.

From: cfe-dev <[hidden email]> on behalf of Ilya Biryukov via cfe-dev <[hidden email]>
Reply-To: Ilya Biryukov <[hidden email]>
Date: Thursday, June 1, 2017 at 10:52 AM
To: Vladimir Voskresensky <[hidden email]>
Cc: via cfe-dev <[hidden email]>

Subject: Re: [cfe-dev] Adding indexing support to Clangd

Thanks for the insights, I think I get the gist of the idea with the "module" PCH. 
One question is: what if the system headers are included after the user includes? Then we abandon the PCH cache and run the parsing from scratch, right?

FileSystemStatCache that is reused between compilation units? Sounds like a low-hanging fruit for indexing, thanks.

On Thu, Jun 1, 2017 at 11:52 AM, Vladimir Voskresensky <[hidden email]> wrote:
Hi Ilia,

Sorry for the late reply.
Unfortunately mentioned hacks were done long time ago and I couldn't find the changes at the first glance :-(

But you can think about reusable chaned PCHs in the "module" way.
Each system header is a module.
There are special index_headers.c and index_headers.cpp files which includes all standard headers.
These files are indexed first and create "module" per #include.
Module is created once or several times if preprocessor contexts are very different like C vs. C++98 vs. C++14.
Then reused.
Of course it could compromise the accuracy, but for proof of concept was enough to see that expected indexing speed can be achieved theoretically.

Btw, another hint: implementing FileSystemStatCache gave the next visible speedup. Of course need to carefully invalidate/update it when file was modified in IDE or externally.
So, finally we got just 2x slowdown, but the accuracy of "real" compiler. And then as you know we have started Clank :-)

Hope it helps,
Vladimir.


On 29.05.2017 11:58, Ilya Biryukov wrote:
Hi Vladimir,

Thanks for sharing your experience.

> We did such measurements when we evaluated Clang as a technology to be used in NetBeans C/C++. I don't remember the exact absolute numbers now, but the conclusion was:
> to be on par with the existing NetBeans speed we have to use different caching, otherwise it was about 10 times slower.
That's a good reason to focus on that issue from the very start, then. It would be nice to have some exact measurements, though (e.g. on LLVM),
just to know exactly how slow it was.

> +1. Btw, maybe it is worth setting some expectations about what is available during and after the initial index phase.
> I.e. during the initial phase you'd probably like to have navigation for the file opened in the editor and be able to work in function bodies.
We definitely want diagnostics/completions for the currently open file to be available. Good point, we definitely want to explicitly name the available features in the docs/discussions.

> As to initial indexing:
> Using PTH (not PCH) gave a significant speedup.
> Skipping bodies gave a significant speedup, but you miss the references and later have to reindex bodies on demand.
> Using chained PCH gave the next visible speedup.
> Of course we had to make some hacks for PCHs to be "reusable" more often (compared to the strict compiler rule) and keep multiple versions. On average 2: one for the C and one for the C++ parse context.
> Also there is a difference between system headers and project headers, so system headers can be cached more aggressively.
Is this work open-source? The interesting part is how to "reuse" the PCH for a header that's included in a different order.
I.e. is there a way to reuse some cached information (PCH, or anything else) for <map> and <vector> when parsing these two files:
```
// foo.cpp
#include <vector>
#include <map>
...

// bar.cpp
#include <map>
#include <vector>
....
```

--
Regards,
Ilya Biryukov




_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev




Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev

Hi!


I want to give a little update on the indexing prototype for Clangd I've been working on. I wanted to share the actual code on GitHub before I went on vacation but I ran out of time! Sorry about that!


Here's a short summary of the components that exist now and what's in progress:


--ClangdIndexStorage--

A malloc-like interface that allocates/frees data blocks of variable sizes on disk. The blocks can contain ints, shorts, strings, pointers (i.e. file offsets), etc. The data is cached in 4K pieces so that local and repeated accesses are served quickly from memory.

Clangd mallocs and writes its index model objects using this.
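
A rough sketch of the malloc-like idea, not the prototype's actual code: `OffsetStorage` and all of its members are hypothetical names. It keeps the bytes in a `std::vector<char>` standing in for the file, and it leaves out the disk I/O and the 4K piece cache. What it does show is that a "pointer" is simply an offset and that freed blocks can be handed out again.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

class OffsetStorage {
public:
  using Offset = std::uint32_t; // a "pointer" is an offset into the file
  static constexpr Offset Null = 0;

  // The first bytes are reserved so that offset 0 can mean "null".
  OffsetStorage() : Data(HeaderSize, '\0') {}

  // Returns the offset of a payload that can hold at least Size bytes.
  Offset allocate(std::uint32_t Size) {
    for (std::size_t I = 0; I < FreeList.size(); ++I) {
      Offset Block = FreeList[I];
      if (capacity(Block) >= Size) { // first fit, no block splitting
        FreeList.erase(FreeList.begin() + I);
        return Block;
      }
    }
    // Nothing reusable: append a size header followed by the payload.
    Offset Header = static_cast<Offset>(Data.size());
    Data.resize(Data.size() + HeaderSize + Size, '\0');
    std::memcpy(&Data[Header], &Size, HeaderSize);
    return Header + HeaderSize;
  }

  // Makes the block available to later allocate() calls.
  void release(Offset Block) {
    assert(Block != Null);
    FreeList.push_back(Block);
  }

  void putInt32(Offset At, std::int32_t Value) {
    check(At, sizeof Value);
    std::memcpy(&Data[At], &Value, sizeof Value);
  }

  std::int32_t getInt32(Offset At) const {
    check(At, sizeof(std::int32_t));
    std::int32_t Value;
    std::memcpy(&Value, &Data[At], sizeof Value);
    return Value;
  }

  // Strings are stored NUL-terminated.
  void putString(Offset At, const std::string &S) {
    check(At, S.size() + 1);
    std::memcpy(&Data[At], S.c_str(), S.size() + 1);
  }

  std::string getString(Offset At) const {
    check(At, 1);
    return std::string(&Data[At]);
  }

private:
  static constexpr std::uint32_t HeaderSize = sizeof(std::uint32_t);

  // Reads the size header stored right before the payload.
  std::uint32_t capacity(Offset Block) const {
    std::uint32_t Size;
    std::memcpy(&Size, &Data[Block - HeaderSize], HeaderSize);
    return Size;
  }

  void check(Offset At, std::size_t N) const {
    assert(At != Null && At + N <= Data.size());
  }

  std::vector<char> Data;       // stands in for the file contents
  std::vector<Offset> FreeList; // payload offsets of released blocks
};
```

A real implementation would read and write `Data` through the piece cache and would also need to persist the free list.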


--BTree--

A pretty classic BTree implementation, used to speed up lookups (symbol names, file names). It allocates its nodes using ClangdIndexStorage, so it is stored on disk. Keys are actually records in ClangdIndexStorage, so you can think of the BTree as a collection of sorted pointers (sorted according to a provided comparator).
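
To illustrate the "collection of sorted pointers" view, here is a small sketch. It is not a BTree: a sorted `std::vector` stands in for the tree (roughly what a single node would hold), and `RecordStore`, `RecordPtr` and `SortedPointers` are hypothetical names. The point is only that the keys are pointers, and that ordering is decided by a caller-provided comparator applied to the records they point to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using RecordPtr = std::uint32_t; // stands in for a file offset

// Stand-in for the storage layer: a record is just a string here.
struct RecordStore {
  std::vector<std::string> Records;

  RecordPtr add(std::string S) {
    Records.push_back(std::move(S));
    return static_cast<RecordPtr>(Records.size() - 1);
  }

  const std::string &get(RecordPtr P) const { return Records.at(P); }
};

// Keeps record pointers ordered by the content of the records.
class SortedPointers {
public:
  using Less = std::function<bool(const std::string &, const std::string &)>;

  SortedPointers(const RecordStore &Store, Less Cmp)
      : Store(Store), Cmp(std::move(Cmp)) {
    assert(this->Cmp && "a comparator is required");
  }

  void insert(RecordPtr P) {
    Keys.insert(lowerBound(Keys.begin(), Keys.end(), Store.get(P)), P);
  }

  // Returns true and sets Out when a record equal to Key exists.
  bool find(const std::string &Key, RecordPtr &Out) const {
    auto It = lowerBound(Keys.begin(), Keys.end(), Key);
    if (It == Keys.end() || Cmp(Key, Store.get(*It)))
      return false;
    Out = *It;
    return true;
  }

private:
  // Binary search that dereferences each pointer before comparing.
  template <typename Iter>
  Iter lowerBound(Iter First, Iter Last, const std::string &Key) const {
    return std::lower_bound(First, Last, Key,
                            [this](RecordPtr L, const std::string &R) {
                              return Cmp(Store.get(L), R);
                            });
  }

  const RecordStore &Store;
  Less Cmp;
  std::vector<RecordPtr> Keys;
};
```

Swapping the comparator (for example, a case-insensitive one for file names) changes the order without touching the stored records.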




The index model is not very rich yet, but the idea is that the lower-level building blocks (ClangdIndexStorage, BTree) will be there so that we can start iterating.


--ClangdIndexFile--

Path + information about inclusion in order to be able to represent an include graph.

The include graph is used to know which files need reindexing, for example when a header file changes, and also which compilation database entry to use when opening a header in the editor. The first source file including the header file is used to look up the entry in the compilation database. This will also be used for the "Include Browser" feature in the future.
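
As an illustration of the reindexing use case, a minimal sketch assuming the graph is kept as a map from each file to its direct includers. `IncludeGraph` and its methods are hypothetical names, and the prototype stores this on disk rather than in `std::map`.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string>

class IncludeGraph {
public:
  // Records that Includer contains an #include of Included.
  void addInclusion(const std::string &Includer, const std::string &Included) {
    assert(!Includer.empty() && !Included.empty());
    IncludedBy[Included].insert(Includer);
  }

  // Returns Changed plus every file that includes it, directly or not.
  std::set<std::string> filesToReindex(const std::string &Changed) const {
    std::set<std::string> Result{Changed};
    std::queue<std::string> Work;
    Work.push(Changed);
    while (!Work.empty()) {
      std::string File = Work.front();
      Work.pop();
      auto It = IncludedBy.find(File);
      if (It == IncludedBy.end())
        continue; // nobody includes this file
      for (const std::string &Includer : It->second)
        if (Result.insert(Includer).second) // not seen yet
          Work.push(Includer);
    }
    return Result;
  }

private:
  // Included file -> files that directly include it.
  std::map<std::string, std::set<std::string>> IncludedBy;
};
```

The `Result` set doubles as the visited set, so include cycles do not cause an endless loop.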


--ClangdIndexSymbol--

USR + location (pointer to a ClangdIndexFile + start/end offsets)

This only represents definitions in source files for now. It is part of the index-based "Open Definition" feature.

This is likely to change further to represent declaration vs definition, references (Find references feature), call graph, etc.


--ClangdIndex--

Owns a BTree of ClangdIndexSymbol sorted by USR for fast lookup (used by Open Definition for now). getSymbol(USR) returns a ClangdIndexSymbol.

Also owns a BTree of ClangdIndexFile sorted by Path for fast lookup. As explained above, this is used to find the proper compilation database entry and to react to header changes. getFile(Path) returns a ClangdIndexFile.
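
A sketch of the shape of that interface, with loud assumptions: `std::map` stands in for the on-disk BTrees, `SymbolIndex`, `IndexFile`, `IndexSymbol`, `addFile` and `addSymbol` are hypothetical names, and only `getSymbol`/`getFile` mirror the names mentioned above. The USR string in the usage note is only an example value.

```cpp
#include <cassert>
#include <map>
#include <string>

struct IndexFile {
  std::string Path;
};

struct IndexSymbol {
  std::string USR;
  const IndexFile *File; // file containing the definition
  unsigned Start;        // start offset of the definition
  unsigned End;          // end offset of the definition
};

class SymbolIndex {
public:
  // Returns the entry for Path, creating it if needed.
  const IndexFile &addFile(const std::string &Path) {
    return Files.emplace(Path, IndexFile{Path}).first->second;
  }

  void addSymbol(const std::string &USR, const IndexFile &File,
                 unsigned Start, unsigned End) {
    assert(Start <= End);
    // std::map nodes are stable, so keeping &File is safe here.
    Symbols[USR] = IndexSymbol{USR, &File, Start, End};
  }

  // Returns nullptr when the USR is unknown.
  const IndexSymbol *getSymbol(const std::string &USR) const {
    auto It = Symbols.find(USR);
    return It == Symbols.end() ? nullptr : &It->second;
  }

  // Returns nullptr when the path has not been indexed.
  const IndexFile *getFile(const std::string &Path) const {
    auto It = Files.find(Path);
    return It == Files.end() ? nullptr : &It->second;
  }

private:
  std::map<std::string, IndexSymbol> Symbols; // sorted by USR
  std::map<std::string, IndexFile> Files;     // sorted by path
};
```

With this, an "Open Definition" request boils down to `getSymbol("c:@F@foo#")` followed by reading `File->Path`, `Start` and `End` from the result, if any.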


# Building the index


When Clangd gets the initialize request from the client, it is given a rootURI. It starts indexing each source file under this root, creating ClangdIndexFiles and ClangdIndexSymbols. This is done with the help of index::IndexDataConsumer.


At the moment, the main covered scenarios are building the index from scratch and adding files. Support for modifying files and removing them is not fully implemented yet (but this is a must of course!).


In case you are wondering, there is no fancy caching of headers or preamble used while indexing, so the process is quite slow. I have been focusing on the model and persistence of the index rather than on the input (parsing). This will have to be addressed too.


# Examples of using the index


When the user invokes the "Open Declaration" command, Clangd retrieves the ClangdIndexSymbol from the ClangdIndex using the USR of the symbol at the requested offset (sped up by the BTree). The location of the ClangdIndexSymbol (if found) is then returned to the editor.


When the user opens a header file, Clangd retrieves the ClangdIndexFile from the ClangdIndex using the path of the header (sped up by the BTree). Then it recursively finds which file includes it until there are no more includers; at that point chances are that this is a source file. That source file's path is then used to find a potential entry in the compilation database (so that we get proper compiler flags, etc).
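
The header case can be sketched as follows. This is an illustration under assumptions, not the prototype's code: `IncluderMap`, `CompilationDatabase`, `findRootIncluder` and `flagsForHeader` are hypothetical names, the compilation database is reduced to a map from source path to flags, and "the first includer" is taken to be the first one recorded.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Included file -> files that include it, in the order they were recorded.
using IncluderMap = std::map<std::string, std::vector<std::string>>;
// Hypothetical compilation database: source path -> compiler flags.
using CompilationDatabase = std::map<std::string, std::vector<std::string>>;

// Follows the first includer upwards until reaching a file nobody includes.
std::string findRootIncluder(const IncluderMap &IncludedBy,
                             const std::string &Header) {
  assert(!Header.empty());
  std::set<std::string> Seen; // guards against include cycles
  std::string Current = Header;
  while (Seen.insert(Current).second) {
    auto It = IncludedBy.find(Current);
    if (It == IncludedBy.end() || It->second.empty())
      break; // no includer left: most likely a source file
    Current = It->second.front();
  }
  return Current;
}

// Returns the flags to parse Header with, or nullptr if no entry is found.
const std::vector<std::string> *
flagsForHeader(const IncluderMap &IncludedBy,
               const CompilationDatabase &Database,
               const std::string &Header) {
  auto It = Database.find(findRootIncluder(IncludedBy, Header));
  return It == Database.end() ? nullptr : &It->second;
}
```

When the lookup returns nullptr (a header nobody includes yet, or a root file missing from the database), the caller would have to fall back to some default flags.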





This is just to give you a taste of what I have in mind and what kind of progress is being made. I'd like to have the "lower-level" parts ready for review soon after I come back from vacation (Aug 24th). I was thinking that ClangdIndexStorage and BTree can go in first, as they are quite isolated and unit-testable. The rest of the code will also be available on GitHub to show more concrete usage of them if necessary.


Regards,

Marc-André



Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
Awesome, indexing in clangd would be super cool. Do you have a plan on
how to combine this with the "indexing while building" stuff Apple
folks are going to contribute? That will be important when we want to
use clangd with larger projects.


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev

No concrete plan yet, but it's something we'd definitely like to have. I haven't seen in detail how the "indexing while building" is going to work; maybe I missed it? I assume there will be some kind of indexer-agnostic hook, in the same spirit as the refactoring work that was proposed. I'll keep an eye on this and make sure our efforts are compatible. In any case, I think there will be enough benefit to warrant major rewrites/refactorings if needed.


Marc-André



From: Benjamin Kramer <[hidden email]>
Sent: Wednesday, August 9, 2017 8:54:03 AM
To: Marc-André Laperle
Cc: via cfe-dev; [hidden email]; Doug Schaefer; Alex Lorenz; Duncan P. N. Exon Smith; Argyrios Kyrtzidis
Subject: Re: [cfe-dev] Adding indexing support to Clangd
 
Awesome, indexing in clangd would be super cool. Do you have a plan on
how to combine this with the "indexing while building" stuff Apple
folks are going to contribute? That will be important when we want to
use clangd with larger projects.

On Tue, Aug 8, 2017 at 7:52 PM, Marc-André Laperle via cfe-dev
<[hidden email]> wrote:
> Hi!
>
>
> I want to give a little update on the indexing prototype for Clangd I've
> been working on. I wanted to share the actual code on Github before I went
> on vacation but I ran out of time! Sorry about that!
>
>
> Here's a short summary of the several components there now and what's in
> progress:
>
>
> --ClangdIndexStorage--
>
> malloc-like interface that allocates/frees data blocks of variable sizes on
> disk. The blocks can contain ints, shorts, strings, pointers (i.e. file
> offsets), etc. The data is cached in 4K pieces so that local and repeated
> accesses are all done quickly, in memory.
>
> Clangd mallocs and writes its index model objects using this.
>
>
> --BTree--
>
> A pretty classic BTree implementation. Used to speed up lookup (symbol
> names, file names). It allocates its nodes using ClangdIndexStorage
> therefore it is stored on disk. Keys are actually records in
> ClangdIndexStorage so you can really think of the BTree as a collection of
> sorted pointers (sorted according to a provided comparator).
>
>
>
>
> The index model is not very rich yet but the idea is that lower level
> building blocks (ClangdIndexStorage, BTree) will be there so that we can
> start iterating.
>
>
> --ClangdIndexFile--
>
> Path + information about inclusion in order to be able to represent an
> include graph.
>
> The include graph is used to know which files need reindexing for example
> when a header file changes and also which compilation database entry to use
> when opening a header in the editor. The first source file including the
> header file is used to look up the entry in the compilation database. This
> will also be used for the "Include Browser" feature in the future.
>
>
> --ClangdIndexSymbol--
>
> USR + location (pointer to a ClangdIndexFile + start/end offsets)
>
> This only represents definitions in source files for now. This is part of
> the indexing-based feature to "Open Definition".
>
> This is likely to have more changes to represent declaration vs definition,
> references (Find references feature), call graph, etc.
>
>
> --ClangdIndex--
>
> Owns a BTree of ClangdIndexSymbol sorted by USR for fast lookup (used by
> Open Definition for now). getSymbol(USR) returns a ClangdIndexSymbol.
>
> Also owns a BTree of ClangdIndexFile sorted by Path for fast lookup. As
> explained above, used to find proper compilation database entry and to react
> to header changes. getFile(Path) returns a ClangdIndexFile.
>
>
> # Building the index
>
>
> When Clangd gets the initialize request from the client, it is given a
> rootURI. It starts indexing each source files under this root, creating
> ClangdIndexFiles and ClangdIndexSymbols. This is done with the help of
> index::IndexDataConsumer.
>
>
> At the moment, the main covered scenarios are building the index from
> scratch and adding files. Support for modifying files and removing them is
> not fully implemented yet (but this is a must of course!).
>
>
> In case you are wondering, there is no fancy caching of headers or preamble
> used while indexing, so the process is quite slow. I have been focusing on
> the model and persistence of the index versus the input (parsing). This will
> have to be addressed too.
>
>
> # Examples of using the index
>
>
> When the user does the "Open Declaration" command, it retrieves the
> ClangdIndexSymbol from the ClangdIndex using the USR at the requested offset
> (sped up by the BTree). The location of the ClangdIndexSymbol (if found) is
> then returned to the editor.
>
>
> When the user opens a header file, it retrieves the ClangdIndexFile from the
> ClangdIndex using the path of the header (sped up by the BTree). Then it
> recursively finds which file includes it until there is no more, at this
> point chances are that this is a source file. Use this source file path to
> find a potential entry in the compilation database (so that we gets proper
> compiler flags, etc).
>
>
>
>
>
> This is just to give you a taste of what I have in mind and what kind of
> progress is being made. I'd like to have the "lower-level" parts ready for
> review soon after I come back from vacation (Aug 24th). I was thinking that
> ClangdIndexStorage and BTree can go in first as they are quite isolated and
> unit-testable. The rest of the code will also be available on Github to show
> more concrete usage of them if necessary.
>
>
> Regards,
>
> Marc-André
>
> ________________________________
> From: cfe-dev <[hidden email]> on behalf of Vladimir
> Voskresensky via cfe-dev <[hidden email]>
> Sent: Thursday, June 1, 2017 3:10:55 PM
> To: [hidden email]
>
> Subject: Re: [cfe-dev] Adding indexing support to Clangd
>
>
>
>
> On 06/ 1/17 06:26 PM, Ilya Biryukov via cfe-dev wrote:
>
> Other IDEs do that very similarly to CDT, AFAIK. Compromising correctness,
> but getting better performance.
> Reusing modules would be nice, and I wonder if it could also be made
> transparent to the users of the tool (i.e. we could have an option 'pretend
> these headers are modules every time you encounter them')
> I would expect that to break on most projects, though. Not sure if people
> would be willing to use something that spits tons of errors on them.
> Interesting direction for prototyping...
>
> As Doug mentioned, surprisingly the tricks with headers in the majority of
> projects give pretty good results :-)
>
> In NetBeans we have a header caching approach similar to CDT's.
>
> The only difference is that when we hit an #include the second time we only
> check whether we can skip indexing.
> But we always do "fair lightweight preprocessing" to keep fair context of
> all possible inner #ifdef/#else/#define directives (because they might
> affect the current file).
> For that we use APT (Abstract Preprocessor Tree) per-file which is constant
> for the file and is created once - similar to clang's PTH (Pre-Tokenized
> headers).
>
> Visiting a file's APT, we can produce different output based on the input
> preprocessor state.
> It can be visited in "light" mode or "produce tokens" mode, but it always
> gives the correct result from the strict compiler point of view.
> We also do indexing in parallel and the APT (being immutable) is easily
> shared by index-visitors from all threads.
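As an illustration of the idea only (this is not NetBeans code, and every name in it is invented), an immutable per-file preprocessor tree that yields different output depending on the incoming macro state might be sketched like this. Passing a null output pointer plays the role of the "light" mode: macro state is still updated, but no tokens are produced.

```cpp
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct AptNode;
using AptNodePtr = std::shared_ptr<const AptNode>;
using MacroState = std::set<std::string>; // names of currently defined macros

// Immutable node: built once per file, then shared by all visitors.
struct AptNode {
  enum Kind { Tokens, Define, IfDef } K;
  std::string Text;                   // token text, or the macro name
  std::vector<AptNodePtr> Then, Else; // branches, used by IfDef only
};

AptNodePtr makeNode(AptNode::Kind K, std::string Text,
                    std::vector<AptNodePtr> Then = {},
                    std::vector<AptNodePtr> Else = {}) {
  return std::make_shared<AptNode>(
      AptNode{K, std::move(Text), std::move(Then), std::move(Else)});
}

// Walks the tree with the caller's macro state. If Out is null ("light"
// mode), only the macro state is updated; otherwise tokens are appended.
void visit(const std::vector<AptNodePtr> &Nodes, MacroState &Macros,
           std::vector<std::string> *Out) {
  for (const AptNodePtr &N : Nodes) {
    switch (N->K) {
    case AptNode::Tokens:
      if (Out)
        Out->push_back(N->Text);
      break;
    case AptNode::Define:
      Macros.insert(N->Text);
      break;
    case AptNode::IfDef:
      visit(Macros.count(N->Text) ? N->Then : N->Else, Macros, Out);
      break;
    }
  }
}
```

Because the tree is never mutated after construction, several indexing threads can visit it concurrently, each with its own MacroState.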
> Btw, the stat cache is also shared by all indexing threads with appropriate
> synchronization.
>
> So in NetBeans we observe that using these tricks (which really look like
> multiple modules per header file) the majority of projects are indexed with
> very good accuracy, and I can also confirm that it gives a ~10x speedup.
>
> Hope it helps,
> Vladimir.
>
>
> On Thu, Jun 1, 2017 at 5:14 PM, David Blaikie <[hidden email]> wrote:
>>
>> Not sure this has already been discussed, but would it be
>> practical/reasonable to use Clang's modules support for this? Might keep the
>> implementation much simpler - and perhaps provide an extra incentive for
>> users to modularize their build/code which would help their actual build
>> times (& heck, parsed modules could even potentially be reused between
>> indexer and final build - making apparent build times /really/ fast)
>>
>> On Thu, Jun 1, 2017 at 8:12 AM Doug Schaefer via cfe-dev
>> <[hidden email]> wrote:
>>>
>>> I thought I’d chip in and describe Eclipse CDT’s strategy with header
>>> caching. It’s actually a big cheat but the results have proven to be pretty
>>> good.
>>>
>>> CDT’s hack actually starts in the preprocessor. If we see a header file
>>> has already been indexed, we skip including it. At the back end, we
>>> seamlessly use the index or the current symbol table when doing symbol
>>> lookup. Symbols that get missed because we skipped header files get picked
>>> up out of the index instead. We also do that in the preprocessor to look up
>>> missing macros out of the index when doing macro substitution.
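A rough sketch of the two decisions described in that paragraph: skip an #include whose header is already indexed, and fall back to the index when the current symbol table misses. This is not CDT code; the type and member names are made up, and symbols are reduced to a name-to-file map for brevity.

```cpp
#include <map>
#include <set>
#include <string>

using SymbolTable = std::map<std::string, std::string>; // name -> declaring file

struct HeaderSkippingParser {
  const SymbolTable &Index;                    // persisted by earlier runs
  const std::set<std::string> &IndexedHeaders; // headers already in the index
  SymbolTable Current;                         // symbols seen in this parse

  // The preprocessor consults this before entering an #include.
  bool shouldSkipInclude(const std::string &Header) const {
    return IndexedHeaders.count(Header) != 0;
  }

  // Current symbol table first, then the index for anything that was missed
  // because its header was skipped. Returns nullptr if neither has the name.
  const std::string *lookup(const std::string &Name) const {
    auto It = Current.find(Name);
    if (It != Current.end())
      return &It->second;
    It = Index.find(Name);
    return It != Index.end() ? &It->second : nullptr;
  }
};
```

The known weak spot mentioned in the message shows up directly here: the index holds one version of each header, so a header whose contents depend on different macro values at different inclusion sites is represented by whichever version was indexed first.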
>>>
>>> The performance gains were about an order of magnitude and it magically
>>> works most of the time with the main issue being header files that get
>>> included multiple times affected by different macro values but the effects
>>> of that haven’t been major.
>>>
>>> With clang being a real compiler, I had my doubts that you could even do
>>> something like this without adding hooks in places the front-end gang might
>>> not like. Love to be proven wrong. It really is very hard to keep up with
>>> the evolving C++ standard and we could sure use the help clangd could offer.
>>>
>>> Hope that helps,
>>> Doug.
>>>
>>> From: cfe-dev <[hidden email]> on behalf of Ilya Biryukov
>>> via cfe-dev <[hidden email]>
>>> Reply-To: Ilya Biryukov <[hidden email]>
>>> Date: Thursday, June 1, 2017 at 10:52 AM
>>> To: Vladimir Voskresensky <[hidden email]>
>>> Cc: via cfe-dev <[hidden email]>
>>>
>>> Subject: Re: [cfe-dev] Adding indexing support to Clangd
>>>
>>> Thanks for the insights, I think I get the gist of the idea with the
>>> "module" PCH.
>>> One question is: what if the system headers are included after the user
>>> includes? Then we abandon the PCH cache and run the parsing from scratch,
>>> right?
>>>
>>> FileSystemStatCache that is reused between compilation units? Sounds like
>>> a low-hanging fruit for indexing, thanks.
>>>
>>> On Thu, Jun 1, 2017 at 11:52 AM, Vladimir Voskresensky
>>> <[hidden email]> wrote:
>>>>
>>>> Hi Ilya,
>>>>
>>>> Sorry for the late reply.
>>>> Unfortunately the mentioned hacks were done a long time ago and I couldn't
>>>> find the changes at first glance :-(
>>>>
>>>> But you can think about reusable chained PCHs in the "module" way.
>>>> Each system header is a module.
>>>> There are special index_headers.c and index_headers.cpp files which
>>>> include all standard headers.
>>>> These files are indexed first and create a "module" per #include.
>>>> A module is created once, or several times if preprocessor contexts are
>>>> very different, like C vs. C++98 vs. C++14.
>>>> Then it is reused.
>>>> Of course it could compromise the accuracy, but for a proof of concept it
>>>> was enough to see that the expected indexing speed can be achieved in theory.
>>>>
>>>> Btw, another hint: implementing FileSystemStatCache gave the next
>>>> visible speedup. Of course it needs to be carefully invalidated/updated
>>>> when a file is modified in the IDE or externally.
>>>> So, finally we got just a 2x slowdown, but with the accuracy of a "real"
>>>> compiler. And then as you know we have started Clank :-)
>>>>
>>>> Hope it helps,
>>>> Vladimir.
>>>>
>>>>
>>>> On 29.05.2017 11:58, Ilya Biryukov wrote:
>>>>
>>>> Hi Vladimir,
>>>>
>>>> Thanks for sharing your experience.
>>>>
>>>>> We did such measurements when we evaluated clang as a technology to be
>>>>> used in NetBeans C/C++, I don't remember the exact absolute numbers now, but
>>>>> the conclusion was:
>>>>>
>>>>> to be on par with the existing NetBeans speed we have to use different
>>>>> caching, otherwise it was like 10 times slower.
>>>>
>>>> It's a good reason to focus on that issue from the very start, then.
>>>> Would be nice to have some exact measurements, though (e.g. on LLVM),
>>>> just to know exactly how slow it was.
>>>>
>>>>> +1. Btw, maybe it is worth setting some expectations about what is
>>>>> available during and after the initial index phase.
>>>>> I.e. during the initial phase you'd probably like to have navigation for
>>>>> the file opened in the editor and be able to work in function bodies.
>>>>
>>>> We definitely want diagnostics/completions for the currently open file
>>>> to be available. Good point, we definitely want to explicitly name the
>>>> available features in the docs/discussions.
>>>>
>>>>> As to initial indexing:
>>>>> Using PTH (not PCH) gave significant speedup.
>>>>>
>>>>> Skipping bodies gave significant speedup, but you miss the references
>>>>> and later have to reindex bodies on demand.
>>>>> Using chained PCH gave the next visible speedup.
>>>>>
>>>>> Of course we had to make some hacks for PCHs to be more often
>>>>> "reusable" (compared to the strict compiler rule) and keep multiple versions.
>>>>> On average 2: one for the C and one for the C++ parse context.
>>>>> Also there is a difference between system headers and project headers,
>>>>> so system headers can be cached more aggressively.
>>>>
>>>> Is this work open-source? The interesting part is how to "reuse" the PCH
>>>> for a header that's included in a different order.
>>>> I.e. is there a way to reuse some cached information(PCH, or anything
>>>> else) for <map> and <vector> when parsing these two files:
>>>> ```
>>>> // foo.cpp
>>>> #include <vector>
>>>> #include <map>
>>>> ...
>>>>
>>>> // bar.cpp
>>>> #include <map>
>>>> #include <vector>
>>>> ....
>>>> ```
>>>>
>>>> --
>>>> Regards,
>>>> Ilya Biryukov
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ilya Biryukov
>
> --
> Regards,
> Ilya Biryukov

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
In reply to this post by Alex Denisov via cfe-dev
Hi, Marc-André,

+1 to Ben's comment.
Thanks for your work, having indexing in clangd would be awesome.

One thing that is really important for us at Google is to come up with an index interface (for both querying and building an index) that would allow an alternative implementation that could scale to larger codebases.
That should not require significant changes to your design; extracting a few interface classes and figuring out the APIs should be easy.
We could figure it all out during the review process, and I would highly encourage starting a review as early as possible.

We could add the index modification APIs during the review process as well.
The lack of header caching in the first implementation is fine. That's something we should iterate on later. That's a hard problem and it seems fine if we solve it separately.

Wish you a great vacation!

--
Regards,
Ilya Biryukov


Re: Adding indexing support to Clangd

Alex Denisov via cfe-dev
In reply to this post by Alex Denisov via cfe-dev
On 8 Aug 2017, at 18:52, Marc-André Laperle via cfe-dev <[hidden email]> wrote:
>
> --ClangdIndexStorage--
> malloc-like interface that allocates/frees data blocks of variable sizes on disk. The blocks can contain ints, shorts, strings, pointers (i.e. file offsets), etc. The data is cached in 4K pieces so that local and repeated accesses are all done quickly, in memory.
> Clangd mallocs and writes its index model objects using this.
>
> --BTree--
> A pretty classic BTree implementation. Used to speed up lookup (symbol names, file names). It allocates its nodes using ClangdIndexStorage therefore it is stored on disk. Keys are actually records in ClangdIndexStorage so you can really think of the BTree as a collection of sorted pointers (sorted according to a provided comparator).

This sounds very much like bdb. Is there a reason that we’re reimplementing a large chunk of bdb, rather than just using it (or using something like sqlite for the index storage)?
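
For readers comparing the options, here is a toy sketch of the "collection of sorted pointers" idea from the quoted description. It is an in-memory approximation under stated assumptions: ToyStorage and ToySortedIndex are invented names, a byte vector stands in for the on-disk file with its 4K cache, a sorted vector stands in for the BTree, and the comparator is fixed to plain string ordering rather than being provided by the caller.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// In-memory stand-in for the on-disk storage: "pointers" are byte offsets.
class ToyStorage {
  std::vector<char> Bytes;

public:
  using Offset = std::uint32_t;

  // Appends a NUL-terminated string record and returns its offset.
  Offset putString(const std::string &S) {
    Offset Off = static_cast<Offset>(Bytes.size());
    Bytes.insert(Bytes.end(), S.begin(), S.end());
    Bytes.push_back('\0');
    return Off;
  }

  // Reads back the record starting at Off (must come from putString).
  std::string getString(Offset Off) const { return std::string(&Bytes[Off]); }
};

// Keys are offsets into the storage, kept sorted by the records they point to.
class ToySortedIndex {
  const ToyStorage &Storage;
  std::vector<ToyStorage::Offset> Keys;

public:
  explicit ToySortedIndex(const ToyStorage &S) : Storage(S) {}

  void insert(ToyStorage::Offset Off) {
    auto Pos = std::lower_bound(
        Keys.begin(), Keys.end(), Off,
        [this](ToyStorage::Offset A, ToyStorage::Offset B) {
          return Storage.getString(A) < Storage.getString(B);
        });
    Keys.insert(Pos, Off);
  }

  // Returns true and sets Found if a record equal to Key exists.
  bool find(const std::string &Key, ToyStorage::Offset &Found) const {
    auto Pos = std::lower_bound(
        Keys.begin(), Keys.end(), Key,
        [this](ToyStorage::Offset A, const std::string &K) {
          return Storage.getString(A) < K;
        });
    if (Pos == Keys.end() || Storage.getString(*Pos) != Key)
      return false;
    Found = *Pos;
    return true;
  }
};
```

Every comparison dereferences an offset into the storage, which is why the quoted design caches the file in 4K pieces: repeated accesses to nearby records then stay in memory.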

David

