Adding indexing support to Clangd


Marc-André Laperle via cfe-dev
Hi,

I’ve been thinking about how to add features to Clangd that require an index, i.e. features that need a database containing information about all source files (go to definition, find references, etc.). I’d like to share my thoughts on how things are and what approaches could be taken before getting too deep into implementing something.

My understanding of the current Clang indexing facilities is as follows:
  - It is part of libclang, so it is meant to have a stable API, which can be limiting because it does not expose the full Clang C/C++ API.
  - It does not have persistence, i.e. the index cannot be reloaded from disk at a later time after it is built.
  - There is no header caching mechanism to allow faster reparsing when a source file changes but its included headers haven’t (a common occurrence during code editing).

-- Other indexing solutions --

I have done a very high-level exploration of some other projects using Clang for indexing; you can find some notes here:
https://docs.google.com/document/d/1Z0pDZpUlhyRkw1yB9frVVeb_xgSb5PuXD0-aeUtKkpo/edit?usp=sharing
(Feel free to add your own notes if you’d like!)

From what I gathered:
  - Some projects use libclang; others use the Clang C++ APIs (AST) directly because of libclang limitations.
  - Some projects have a custom index format on disk; others use an RDBMS (PostgreSQL, SQLite) or other already available solutions (Elasticsearch, etc.).
  - I didn’t notice any projects based on Clang doing header caching, although perhaps I missed it. Ilya Biryukov wrote that JetBrains CLion does header caching, but it’s not clear how the cached headers are stored or whether it uses Clang. On the Eclipse CDT side, Clang is not used, but there is header caching by storing the semantic model in the index (not the plain AST). Source files can then be parsed reusing that cached information.

Possible approach for Clangd:
  - I suggest using the Clang libraries directly rather than libclang, to avoid its limitations. I think that using a stable API is not as important since Clangd resides in the same tree and is built and tested in coordination with Clang itself. The downside is that it will not reuse some of the work already done in libclang, such as finding references.
  - I think introducing a big dependency such as PostgreSQL is not acceptable for Clangd (correct me if I’m wrong!), so a custom-tailored file format for the index makes more sense to me.
  - For header caching, I wonder if it is possible to reuse the precompiled header support in Clang. There would be some logic that decides whether or not a precompiled header can be used, depending on the preprocessing context (same macro definitions, etc.).
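
To make the "same preprocessing context" check in the last point a bit more concrete, here is a minimal sketch of a cache keyed by header path plus the macro definitions in effect at the point of inclusion. Everything in it (MacroState, CachedHeader, HeaderCache) is a hypothetical name made up for illustration, not an existing Clang API, and a real check would probably have to compare more than macros (include paths, language options, and so on).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Macro definitions visible when the header is included (name -> body).
// std::map keeps entries sorted, so two equal states compare equal
// regardless of the order in which the macros were defined.
using MacroState = std::map<std::string, std::string>;

// Hypothetical stand-in for whatever would be cached (e.g. a PCH-like blob).
struct CachedHeader {
  std::string SerializedAST;
};

class HeaderCache {
public:
  // Remember the cached form of Path as seen under the given macros.
  void store(const std::string &Path, const MacroState &Macros,
             CachedHeader Header) {
    Key K(Path, Macros);
    Entries[K] = std::move(Header);
  }

  // Returns the cached header only if it was built under exactly the same
  // macro state; otherwise nullptr, and the header has to be reparsed.
  const CachedHeader *lookup(const std::string &Path,
                             const MacroState &Macros) const {
    auto It = Entries.find(Key(Path, Macros));
    return It == Entries.end() ? nullptr : &It->second;
  }

private:
  using Key = std::pair<std::string, MacroState>;
  std::map<Key, CachedHeader> Entries;
};
```

Comparing the full macro state is deliberately conservative: a header is reused only on an exact match, which trades cache hits for correctness.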

-- The Index model --

Here’s what the data model could look like. For sure it’s partial and I expect it will evolve quite a bit. But it should be enough to communicate the general idea.

Index: Represents the model of the code base as a whole
  - IndexFile []

IndexFile: Represents an indexed file
  - URI path
  - IndexFile includedBy [ ]
  - IndexName [ ]
  - Last modified timestamp, checksum, etc

IndexName: Represents a declaration, definition, macro, etc
  - Source Location
  - IndexNameReference [ ]

IndexNameReference: Reference to a name
  - Source Location
  - Access (read, write, read/write)

IndexTypeName extends IndexName: represents classes, structs, etc
  - IndexTypeName bases [ ]

IndexFunctionName extends IndexName: represents functions, methods, etc
  - IndexFunctionName callers [ ]

Note that a lot of information probably doesn’t need to be modeled, because much of it only needs to be available for an open file, for which we have access to the full AST.
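
As a rough illustration only, the model above could be written down in C++ along these lines. The type names follow the list above; the Spelling field, the IndexLocation struct and the findReferences helper are assumptions added so the sketch can be queried, and none of this is existing Clang code.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

enum class Access { Read, Write, ReadWrite };

// Hypothetical location type (deliberately not clang::SourceLocation).
struct IndexLocation {
  std::string FileURI;
  unsigned Line = 0;
  unsigned Column = 0;
};

// Reference to a name.
struct IndexNameReference {
  IndexLocation Loc;
  Access Kind = Access::Read;
};

// A declaration, definition, macro, etc.
struct IndexName {
  virtual ~IndexName() = default;
  std::string Spelling; // assumption: not part of the model listed above
  IndexLocation Loc;
  std::vector<IndexNameReference> References;
};

// Classes, structs, etc.
struct IndexTypeName : IndexName {
  std::vector<const IndexTypeName *> Bases;
};

// Functions, methods, etc.
struct IndexFunctionName : IndexName {
  std::vector<const IndexFunctionName *> Callers;
};

// An indexed file.
struct IndexFile {
  std::string URI;
  std::vector<const IndexFile *> IncludedBy;
  std::vector<std::unique_ptr<IndexName>> Names;
  std::uint64_t LastModified = 0;
  std::uint64_t Checksum = 0;
};

// The model of the code base as a whole.
struct Index {
  std::vector<std::unique_ptr<IndexFile>> Files;
};

// Hypothetical query: collect every reference to names spelled Spelling.
inline std::vector<IndexNameReference>
findReferences(const Index &Idx, const std::string &Spelling) {
  std::vector<IndexNameReference> Result;
  for (const auto &File : Idx.Files)
    for (const auto &Name : File->Names)
      if (Name->Spelling == Spelling)
        Result.insert(Result.end(), Name->References.begin(),
                      Name->References.end());
  return Result;
}
```

The linear scan in findReferences is only there to show how the pieces connect; a real index would look names up through a sorted or hashed structure.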

-- The persisted file format --

All elements in the model mentioned above could have a querying interface, which could be implemented for an “in memory” database (simpler to debug and fast for small projects) and also for an on-disk database. From my experience in Eclipse CDT, the index on disk was stored in the form of a BTree, which worked quite well. The BTree is made out of chunks. Chunks can be cached in memory and fetched from disk as required. Every piece of information in the model is fetched from the database (from the cache if present, otherwise from disk). A similar approach could be used for Clangd if it’s deemed suitable.
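
The chunk caching described above can be sketched as a small LRU cache in front of a backing store. This is only an illustration under assumed names (ChunkStore, MemoryChunkStore, ChunkCache) and an assumed 4096-byte chunk size; it is not how CDT or Clang actually implement it, and the in-memory store here stands in for both the "in memory" database and, behind the same interface, a file on disk.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr std::size_t ChunkSize = 4096; // assumed chunk size
using Chunk = std::array<std::uint8_t, ChunkSize>;

// Backing storage for chunks; an on-disk implementation would read from a
// file at offset Id * ChunkSize.
class ChunkStore {
public:
  virtual ~ChunkStore() = default;
  virtual Chunk read(std::uint32_t Id) = 0;
};

// In-memory store, also handy for tests; counts reads to expose misses.
class MemoryChunkStore : public ChunkStore {
public:
  explicit MemoryChunkStore(std::size_t NumChunks) : Chunks(NumChunks) {}
  Chunk read(std::uint32_t Id) override {
    ++Reads;
    return Chunks.at(Id);
  }
  void write(std::uint32_t Id, const Chunk &C) { Chunks.at(Id) = C; }
  std::size_t Reads = 0;

private:
  std::vector<Chunk> Chunks;
};

// Keeps at most Capacity chunks in memory, evicting the least recently used.
class ChunkCache {
public:
  ChunkCache(ChunkStore &Store, std::size_t Capacity)
      : Store(Store), Capacity(Capacity) {
    assert(Capacity > 0 && "cache must hold at least one chunk");
  }

  // The returned reference stays valid until the chunk is evicted.
  const Chunk &get(std::uint32_t Id) {
    auto It = Lookup.find(Id);
    if (It != Lookup.end()) {
      // Hit: mark as most recently used.
      LRU.splice(LRU.begin(), LRU, It->second);
      return It->second->second;
    }
    // Miss: make room if needed, then fetch from the backing store.
    if (LRU.size() == Capacity) {
      Lookup.erase(LRU.back().first);
      LRU.pop_back();
    }
    LRU.emplace_front(Id, Store.read(Id));
    Lookup[Id] = LRU.begin();
    return LRU.front().second;
  }

private:
  using Entry = std::pair<std::uint32_t, Chunk>;
  ChunkStore &Store;
  std::size_t Capacity;
  std::list<Entry> LRU; // front = most recently used
  std::unordered_map<std::uint32_t, std::list<Entry>::iterator> Lookup;
};
```

BTree nodes would then be read through ChunkCache::get, so that frequently visited nodes (the root and upper levels) tend to stay in memory.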



In summary, I’m proposing for Clangd an index on disk stored in the form of a BTree that is populated using Clang’s C++ API (not libclang). Any concerns or input would be greatly appreciated. Just as a side note, I’m aware that this is just one line of thinking and others could be considered.

Best regards,
Marc-André Laperle
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Adding indexing support to Clangd

Alex L via cfe-dev
Hi,

Thanks for making a summary of existing solutions!

On 17 May 2017 at 23:38, Marc-André Laperle via cfe-dev <[hidden email]> wrote:
>   - There is no header caching mechanism in order to allow faster reparsing when a source file changes but its included headers haven’t (a common occurrence during code editing).

Have you looked into the precompiled preamble? I believe it can be (and is) used when indexing.


>   - I didn’t notice any projects based on Clang doing header caching, although perhaps I missed it. Ilya Biryukov wrote that JetBrains CLion does header caching but it’s not clear how they are stored or if it is using Clang.

IIRC CLion uses a custom C++ parser instead of Clang.
 
>   - I suggest using Clang libraries directly and not using libclang in order to not have any limitations. I think that using a stable API is not as important since Clangd resides in the same tree and is built and tested in coordination with Clang itself. The downside is that it will not reuse some of the work already done in libclang such as finding references, etc.

I agree, Clangd should not use libclang. Note that in general libclang's indexer API is intended to be a wrapper around the core implementation in lib/Index. I also don't think libclang doesn't expose any means to find references.

I would encourage Clangd to reuse the existing code in lib/Index. Even though it has bugs, we are currently fixing (and will keep fixing) a lot of issues in the library to ensure that our consumer records all of the possible declarations and references for both C++ and Obj-C.
 
> All elements in the model mentioned above could have a querying interface which could be implemented for an “in memory” database (simpler to debug and fast for small projects) and also for an on-disk database. From my experience in Eclipse CDT, the index on disk was stored in the form of a BTree which worked quite well. The BTree is made out of chunks. Chunks can be cached in memory and fetched from disk as required. Every information in the model is fetched from the database (from cache otherwise from disk). A similar approach could be used for Clangd if it’s deemed suitable.

Have you looked into LLVM's bitcode as a possible format for the persistent index? Clang currently uses it for serialized diagnostics and modules.
 




Re: Adding indexing support to Clangd

Doug Schaefer via cfe-dev
On 2017-05-17, 6:38 PM, "Marc-André Laperle"
<[hidden email]> wrote:
>  - For header caching, I wonder if it is possible to reuse the
>precompiled header support in Clang. There would be some logic that would
>decide whether or not a precompiled header could be used depending on the
>preprocessing context (same macro definitions, etc).

I'm not certain header caching is needed right away. We did it in the CDT
because our parsers were fairly slow and indexing a project took a very
long time. I have hope that Clang would be faster. At the very least, you
would want caching to be optional, so you need to be able to work without
it. But make sure you have the architecture to graft it in later.

In CDT we cheated a lot to gain performance at the cost of accuracy. The
results are still very good, so it's an interesting balancing act.

Doug Schaefer
Eclipse CDT Project Lead


Re: Adding indexing support to Clangd

Marc-André Laperle via cfe-dev
Hi Alex! Some replies in-lined.

On Thu, 2017-05-18 at 12:30 +0100, Alex L wrote:
> Have you looked into the precompiled preamble? I believe it can (and is) used when indexing.

I haven't really looked into it yet but it looks very useful, especially this section: https://clang.llvm.org/docs/PCHInternals.html#chained-precompiled-headers



> I agree, Clangd should not use libclang. Note that in general libclang's indexer API is intended to be a wrapper around the core implementation in lib/Index. I also don't think libclang doesn't expose any means to find references.
>
> I would encourage Clangd to reuse existing code in lib/Index. Even though it has bugs, we are (and will be) currently fixing a lot of issues in the library to ensure that our consumer records all of the possible declarations and references for both C++ and Obj-C.

Thanks, I was under the wrong impression that this was all part of libclang but I see that this is not the case. I'm all for reusing code and I can help fix issues if there are any. I'll give this a try!


> Have you looked into LLVM's bitcode as a possible format for the persistent index? Clang currently uses it for serialized diagnostics and modules.
 

I will have a look. It seems very well defined. It's not clear to me yet if this can be used across the board but I'll play around with it a bit.

Thank you so much for mentioning these things! It's easy to miss some of the useful parts when getting into a new code base.

Regards,
Marc-André





Re: Adding indexing support to Clangd

Marc-André Laperle via cfe-dev

Yeah, it sounds like a good approach to tackle header caching a bit later. "Precompiled preamble" looks promising so we can keep this in mind as we go.


Cheers,

Marc-André


From: Doug Schaefer <[hidden email]>
Sent: Thursday, May 18, 2017 10:44:18 AM
To: Marc-André Laperle; via cfe-dev
Cc: [hidden email]; [hidden email]; Dániel Krupp; Zoltan Porkoláb; Marton Csordas
Subject: Re: Adding indexing support to Clangd
 

Re: Adding indexing support to Clangd

Ilya Biryukov via cfe-dev
Hi everyone,

The problem with PCHs (chained or not) is that they only work for the source file,
i.e. you can only use one when you start parsing a new file from scratch, right? For header caching we
really want to reuse whatever information we have cached even if the header is included in a different
context (i.e. the order of includes is different in the other translation unit), which is not possible with PCHs.

My point is that it's not at all straightforward how (or if?) PCHs can improve the performance of processing
the same header twice.

And as long as building an index is as fast as a recompile and we can reuse information from the previous
(slightly outdated) version of the index while the new version is building, we can probably get a good enough
UX without any compromises on correctness (and without introducing additional complexity, since I don't think there's
a way to do header caching without significant changes to Clang itself).
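
Serving queries from the previous index while the new one builds could look roughly like the sketch below: readers grab a shared pointer to the current snapshot, and the builder publishes a finished replacement in one step. IndexSnapshot and SwappableIndex are hypothetical names for illustration, and the string map is just a stand-in for a real index.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for a fully built index: symbol name -> definition location.
struct IndexSnapshot {
  std::unordered_map<std::string, std::string> Definitions;
};

class SwappableIndex {
public:
  // Readers get whatever snapshot is current. The shared_ptr keeps that
  // snapshot alive for them even if a newer one is published meanwhile.
  std::shared_ptr<const IndexSnapshot> current() const {
    std::lock_guard<std::mutex> Lock(Mutex);
    return Current;
  }

  // Called by the indexer once a new version has been built completely.
  void publish(std::shared_ptr<const IndexSnapshot> Next) {
    std::lock_guard<std::mutex> Lock(Mutex);
    Current = std::move(Next);
  }

private:
  mutable std::mutex Mutex;
  // Start with an empty snapshot so current() never returns null.
  std::shared_ptr<const IndexSnapshot> Current =
      std::make_shared<IndexSnapshot>();
};
```

The lock is only held while copying or swapping a pointer, so queries against the old snapshot are never blocked by the (much longer) rebuild.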

CLion indeed has a custom parser and serialization format; it's not Clang-based.


On Thu, May 18, 2017 at 10:33 PM, Marc-André Laperle via cfe-dev <[hidden email]> wrote:

Yeah, it sounds like a good approach to tackle header caching a bit later. "Precompiled preamble" looks promising so we can keep this in mind as we go.


Cheers,

Marc-André






--
Regards,
Ilya Biryukov


Re: Adding indexing support to Clangd

Marc-André Laperle via cfe-dev

Those are good points. I think we'll have to see how fast the indexing is without caching and then go from there. I think the precompiled preamble sounds useful at least for open editors, for quick parsing, as there can be many reparsings of the same file without any changes to the inclusions (while typing, etc.). I think reusing the information from the previous version is a good compromise, but we also have to make sure that building the index for the first time does not take too long, or at least that there is sufficient functionality available for users to start working, and communicate that some functionality is not yet available.


Cheers,

Marc-André


From: Ilya Biryukov <[hidden email]>
Sent: Friday, May 19, 2017 8:27:48 AM
To: Marc-André Laperle
Cc: Doug Schaefer; via cfe-dev; [hidden email]; Zoltan Porkoláb; Marton Csordas
Subject: Re: [cfe-dev] Adding indexing support to Clangd
 