[RFC] Clang SourceLocation overflow

[RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

Original thread here: https://lists.llvm.org/pipermail/cfe-dev/2019-October/063459.html

 

I’m bringing this back up since we now have a reproduction of this in Boost.  We haven’t finished analyzing what Boost is doing, but simply including:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.
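
The assertion above combines two conditions: the new end offset must not wrap around as an unsigned value, and it must not run into the region bounded by CurrentLoadedOffset. A minimal standalone sketch of the same check follows; `canAllocate` and its parameter names are made up for illustration and are not Clang functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone version of the condition in the assertion above.
// nextLocalOffset grows upward; currentLoadedOffset is the lowest offset
// already handed out from the top of the space.
inline bool canAllocate(std::uint32_t nextLocalOffset,
                        std::uint32_t length,
                        std::uint32_t currentLoadedOffset) {
  // Unsigned arithmetic wraps modulo 2^32, so a wrapped result is smaller
  // than the starting offset.
  std::uint32_t end = nextLocalOffset + length + 1;
  return end > nextLocalOffset         // no unsigned wraparound
      && end <= currentLoadedOffset;   // no collision with the upper region
}
```

Under these assumptions, `canAllocate(0, 10, 1u << 31)` succeeds, while a request that starts five offsets below the limit and needs ten fails.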

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable because of how often locations are stored in the AST (multiple times per node), and giving up on location tracking isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t fully understand it yet, so I don’t have a good way of suggesting an alternative.

 

So, I’d like to restart this discussion: we need to find a way to make our compiler support more source locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erich

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general, my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week, but if needed I can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com> wrote:
Hi Richard.
 
Thanks for the clarification; I certainly had not realized that location tracking was needed for the correctness of clang. Decoupling this sounds like a painful process, and the only thing we would get is errors without any location information. That does not sound like a great trade-off. We’ll experiment a bit with the size impact of using 64 bits and will attempt to take the route of a cmake configuration option.
 
If there are people who have already done some experiments on the size impact of 64-bit locations, we’d welcome your insights. @Matt Asplund, I believe you said that you’ve done some prototyping; is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like the automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the meantime, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me; I had expected option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)

Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.

Not a nice user experience, but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location, or something that only gives the location of the error and not the include stack, under an option or using some kind of heuristic to detect that things go haywire).

As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
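
To make the build-time idea concrete, here is a rough sketch of how a single typedef could follow a configuration macro. The macro name HYPOTHETICAL_CLANG_64BIT_LOCS and the names `RawLocTy`, `kOffsetBits` and `kMaxOffset` are invented for this example; nothing in the thread says Clang has such an option or would implement it this way.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// HYPOTHETICAL_CLANG_64BIT_LOCS stands in for whatever a CMake option would
// define on the compiler command line. It is an invented name.
#ifdef HYPOTHETICAL_CLANG_64BIT_LOCS
using RawLocTy = std::uint64_t;
#else
using RawLocTy = std::uint32_t;
#endif

// One bit stays reserved as a flag, as in the 32-bit layout described in
// this thread; the rest is offset space.
constexpr unsigned kOffsetBits = sizeof(RawLocTy) * 8 - 1;
constexpr RawLocTy kMaxOffset = (RawLocTy{1} << kOffsetBits) - 1;

static_assert(std::is_unsigned<RawLocTy>::value,
              "raw location type must be unsigned");
```

In the default configuration this yields 31 offset bits. The hard part discussed in the thread is not this typedef but every place that assumes the 32-bit size, such as the static_asserts on class layouts mentioned in the original proposal.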
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com>
Cc: nd <nd at arm.com>; cfe-dev at lists.llvm.org
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
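
As an illustration of the layout just described, the sketch below packs a 31-bit offset and a one-bit flag into a 32-bit value. The names `ToyLoc` and `kFlagBit` are invented for this example and do not mirror Clang's actual class.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: the top bit is the flag, the low 31 bits are the offset.
constexpr std::uint32_t kFlagBit = 1u << 31;

struct ToyLoc {
  std::uint32_t raw;

  static ToyLoc make(std::uint32_t offset, bool flag) {
    assert(offset < kFlagBit && "offset must fit in 31 bits");
    return ToyLoc{offset | (flag ? kFlagBit : 0u)};
  }
  bool flag() const { return (raw & kFlagBit) != 0; }
  std::uint32_t offset() const { return raw & ~kFlagBit; }
};
```

The largest representable offset in this layout is 2^31 - 1, which is where the limit of 2^31 characters comes from.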
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
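
A back-of-the-envelope calculation shows how quickly this happens. The sketch assumes each inclusion reserves the file size plus one offset and that nothing else in the TU consumes location space; `maxInclusions` is a hypothetical helper, not part of Clang.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for illustration: 2^31 offsets in total, and each inclusion of
// a file costs fileSizeBytes + 1 of them.
constexpr std::uint64_t kLocSpace = std::uint64_t{1} << 31;

constexpr std::uint64_t maxInclusions(std::uint64_t fileSizeBytes) {
  return kLocSpace / (fileSizeBytes + 1);
}
```

Under these assumptions a 2 MiB header fits 1023 times and a 4 MiB header only 511 times, so "several MBs" included "several thousand times" exceeds the space by a wide margin.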
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
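
The two-mode idea can be sketched as follows. This is a toy layout chosen for the example (8 column bits, a threshold of 2^22 lines) and is not GCC's actual encoding; all names here are invented.

```cpp
#include <cassert>
#include <cstdint>

// Toy parameters, chosen only for this example.
constexpr std::uint32_t kColBits = 8;
constexpr std::uint32_t kMaxCol = (1u << kColBits) - 1;            // 255
constexpr std::uint32_t kMaxLineWithCols = 1u << 22;
constexpr std::uint32_t kThreshold = kMaxLineWithCols << kColBits; // 2^30

struct LineCol {
  std::uint32_t line;
  std::uint32_t col; // 1-based; 0 means "column not tracked"
};

inline std::uint32_t encodeLoc(std::uint32_t line, std::uint32_t col) {
  if (line < kMaxLineWithCols) {
    // Mode 1: (line, column) packed together; very wide columns are clamped.
    std::uint32_t c = col > kMaxCol ? kMaxCol : col;
    return (line << kColBits) | c;
  }
  // Mode 2: past the threshold only the line number is kept.
  std::uint32_t extra = line - kMaxLineWithCols;
  assert(extra <= UINT32_MAX - kThreshold && "line number out of range");
  return kThreshold + extra;
}

inline LineCol decodeLoc(std::uint32_t loc) {
  if (loc < kThreshold)
    return LineCol{loc >> kColBits, loc & kMaxCol};
  return LineCol{kMaxLineWithCols + (loc - kThreshold), 0};
}
```

The point of the sketch is that the cost per source line is constant regardless of line length, so a large file that is included repeatedly consumes far less of the 32-bit space than with per-character offsets, at the price of losing column information beyond the threshold.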
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev
A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field are not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.
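
A rough sketch of the IsBigLoc idea, to make the encoding concrete. `ToySourceManager`, `kBigBit` and the other names are invented for this example, and the sketch uses std::uint64_t where the proposal says uintptr_t; it does not reflect how SourceManager is actually structured.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kMacroBit = 1u << 31;  // existing flag, unused here
constexpr std::uint32_t kBigBit = 1u << 30;    // proposed "IsBigLoc" flag
constexpr std::uint32_t kPayloadMask = kBigBit - 1; // remaining 30 bits

class ToySourceManager {
  std::vector<std::uint64_t> bigOffsets; // side table of wide offsets

public:
  // Small offsets are stored inline; large ones go through the side table.
  std::uint32_t encode(std::uint64_t offset) {
    if (offset <= kPayloadMask)
      return static_cast<std::uint32_t>(offset);
    assert(bigOffsets.size() <= kPayloadMask && "side table is full");
    bigOffsets.push_back(offset);
    return kBigBit | static_cast<std::uint32_t>(bigOffsets.size() - 1);
  }

  // Decoding now needs the manager, as the proposal notes for getOffset.
  std::uint64_t getOffset(std::uint32_t loc) const {
    std::uint32_t payload = loc & kPayloadMask;
    return (loc & kBigBit) ? bigOffsets[payload] : payload;
  }
};
```

Two consequences are visible even in this toy version: each big location costs an extra 8 bytes in the side table, and raw 32-bit values of big locations can no longer be compared directly to decide which location comes first.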

On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 
Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?
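
For comparison, a sketch of the "vector for everything" variant: the 32-bit value becomes a plain index, and both the offset and the macro flag live in a table. `ToyLocTable` and `LocEntry` are invented names, and this is only one way the idea could be read.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct LocEntry {
  std::uint64_t offset;
  bool isMacro;
};

class ToyLocTable {
  std::vector<LocEntry> entries;

public:
  // Index 0 is reserved so that a zero value can still mean "invalid".
  ToyLocTable() { entries.push_back({0, false}); }

  std::uint32_t add(std::uint64_t offset, bool isMacro) {
    assert(entries.size() <= UINT32_MAX && "out of 32-bit indices");
    entries.push_back({offset, isMacro});
    return static_cast<std::uint32_t>(entries.size() - 1);
  }

  // Every query now goes through the table, so it needs the manager object.
  bool isMacroID(std::uint32_t loc) const { return entries[loc].isMacro; }
  bool isFileID(std::uint32_t loc) const { return !entries[loc].isMacro; }
  std::uint64_t getOffset(std::uint32_t loc) const {
    return entries[loc].offset;
  }
};
```

In this form all 32 bits are available as an index, but every distinct location needs its own table entry, so the memory saved in the AST may reappear in the table. Whether that trade is acceptable is the open question raised above.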

 

 

 

From: David Rector <[hidden email]>
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.



On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how much it is stored in the AST( Multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.

 

SO, I’d like to re-start the discussion into this, we need to find a way to make our compiler able to support more source-locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erihc 

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths I was hitting oveflow issues and tried to keep as much as possible using 32bit encodings. Is there an ETA for when you will start investigating, I am out of town for the next week but if needed can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful processes, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.
 
If there are people that have already done some experiments on the size impact of 64-bits location pointers, we’re welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and other.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev
An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.
 
That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?
 
 
 
From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field are not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.
 
Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).
 
This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.
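
For concreteness, a minimal sketch of the IsBigLoc idea follows. This is only an illustration of the encoding, not Clang code: BigLocTable, encodeLoc and decodeLoc are hypothetical names, the bit positions are assumed, and the existing macro flag is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 32-bit layout: bit 31 is left for the existing macro flag,
// bit 30 is IsBigLoc, bits 0..29 hold an offset or a side-table index.
constexpr std::uint32_t kBigBit = 1u << 30;
constexpr std::uint32_t kPayloadMask = kBigBit - 1;

// Stand-in for the vector that would live in SourceManager.
struct BigLocTable {
  std::vector<std::uint64_t> offsets;
};

// Encode an offset: inline if it fits in 30 bits, otherwise via the table.
inline std::uint32_t encodeLoc(std::uint64_t offset, BigLocTable &table) {
  if (offset <= kPayloadMask)
    return static_cast<std::uint32_t>(offset);
  assert(table.offsets.size() <= kPayloadMask && "side table is full");
  table.offsets.push_back(offset);
  return kBigBit | static_cast<std::uint32_t>(table.offsets.size() - 1);
}

// Decoding needs the table, which is why getOffset would gain a parameter.
inline std::uint64_t decodeLoc(std::uint32_t raw, const BigLocTable &table) {
  std::uint32_t payload = raw & kPayloadMask;
  return (raw & kBigBit) ? table.offsets[payload] : payload;
}
```

Note that every big offset consumes one table slot in this sketch, so the table itself can run out of 30-bit indices, which is the concern raised in the follow-up message.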


On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:
 
 
I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what Boost is doing, but simply doing an include of:
#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>
 
Now causes us to run out of source locations, hitting:
 
/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.
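
As I read it, the assertion checks two separate things: that the new end offset did not wrap around in unsigned arithmetic, and that it did not run into the offsets reserved for loaded entities. A small sketch of the same condition on plain 32-bit integers (hasRoom is a made-up helper, and the parameter names only mirror the ones in the assertion text):

```cpp
#include <cassert>
#include <cstdint>

// Same condition as the assertion, on plain unsigned 32-bit values.
constexpr bool hasRoom(std::uint32_t nextLocalOffset, std::uint32_t tokLength,
                       std::uint32_t currentLoadedOffset) {
  // Unsigned addition wraps modulo 2^32, so a wrapped result is smaller
  // than the starting offset.
  std::uint32_t end = nextLocalOffset + tokLength + 1;
  return end > nextLocalOffset && end <= currentLoadedOffset;
}

static_assert(hasRoom(100, 10, 1u << 31), "plenty of space left");
static_assert(!hasRoom(0xFFFFFFF0u, 0x20, 0xFFFFFFFFu), "addition wrapped");
static_assert(!hasRoom((1u << 31) - 5, 10, 1u << 31), "hit the loaded range");
```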
 
 
From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how often it is stored in the AST (multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.
 
So I’d like to restart this discussion: we need to find a way to make our compiler able to support more source locations, as I can’t imagine this is going to be the only time we run into this.
 
-Erich
 
>>>>>>>>>>>>>>>>>>>> 
Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week but if needed can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful process, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64 bits and will attempt to take the route of a cmake configuration option.
 
If there are people that have already done some experiments on the size impact of 64-bit location pointers, we’d welcome your insights. @Matt Asplund, I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the meantime, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me; I had expected option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not a nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 
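
A rough sketch of what such packing could look like, assuming the 40-bit width floated above. PackedRange is a hypothetical type, not an existing Clang class, and it stores the two values byte by byte so that the struct has no padding.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 40-bit location width.
constexpr unsigned kLocBits = 40;
constexpr std::uint64_t kLocMask = (std::uint64_t{1} << kLocBits) - 1;

// Two 40-bit locations stored in 10 bytes instead of 16.
class PackedRange {
  unsigned char bytes[10] = {};

  // Read five little-endian bytes into a 64-bit value.
  static std::uint64_t load(const unsigned char *p) {
    std::uint64_t v = 0;
    for (int i = 0; i < 5; ++i)
      v |= std::uint64_t{p[i]} << (8 * i);
    return v;
  }
  // Write the low 40 bits of v as five little-endian bytes.
  static void store(unsigned char *p, std::uint64_t v) {
    for (int i = 0; i < 5; ++i)
      p[i] = static_cast<unsigned char>(v >> (8 * i));
  }

public:
  PackedRange(std::uint64_t begin, std::uint64_t end) {
    assert(begin <= kLocMask && end <= kLocMask);
    store(bytes, begin);
    store(bytes + 5, end);
  }
  std::uint64_t begin() const { return load(bytes); }
  std::uint64_t end() const { return load(bytes + 5); }
};

static_assert(sizeof(PackedRange) == 10, "two 40-bit values fit in 80 bits");
```

The saving only materializes where the surrounding type can use the leftover bytes; a PackedRange placed next to an 8-byte-aligned member would typically be padded again.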

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).
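
To illustrate the offsets-from-the-original idea, here is a sketch comparing a node that stores three full 64-bit locations with one that stores a single base location plus two 32-bit deltas. FullNode and DeltaNode are invented for this example, and it assumes all three locations lie in the same buffer with non-negative distances, which need not hold across macro expansions.

```cpp
#include <cassert>
#include <cstdint>

// Three independent 64-bit locations per node.
struct FullNode {
  std::uint64_t name;
  std::uint64_t lparen;
  std::uint64_t rparen;
};

// One full location; the paren locations are recomputed on demand.
struct DeltaNode {
  std::uint64_t name;
  std::uint32_t lparenDelta;  // distance from name to '('
  std::uint32_t rparenDelta;  // distance from name to ')'

  constexpr std::uint64_t lparen() const { return name + lparenDelta; }
  constexpr std::uint64_t rparen() const { return name + rparenDelta; }
};

static_assert(sizeof(FullNode) == 24, "three 8-byte fields");
static_assert(sizeof(DeltaNode) == 16, "one 8-byte field plus two 4-byte deltas");
```

A bare "64-bit location + unsigned" pair would usually still be padded to 16 bytes, so the gain comes from sharing one base location among several nearby positions, as in DeltaNode.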

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 


Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

Sounds like we could learn a few lessons from UTF-8 and use the first several bits to say where to find the rest (i.e. spread the rest over multiple 'extra' containers in SourceManager)?
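
One way to read this suggestion, as a sketch only: a small tag in the top bits selects either an inline offset or one of several side containers. The 2-bit tag, the three tables, and the names ExtraTables, encode and decode are all assumptions for illustration, and the existing macro flag is ignored here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout: top 2 bits are a tag, low 30 bits are the payload.
// Tag 0 means the payload is the offset itself; tags 1..3 select one of
// three side tables and the payload is an index into it.
constexpr unsigned kPayloadBits = 30;
constexpr std::uint32_t kPayloadMask = (1u << kPayloadBits) - 1;

struct ExtraTables {
  std::vector<std::uint64_t> table[3];
};

inline std::uint32_t encode(std::uint64_t offset, ExtraTables &extra) {
  if (offset <= kPayloadMask)
    return static_cast<std::uint32_t>(offset);  // tag 0, stored inline
  for (std::uint32_t tag = 1; tag <= 3; ++tag) {
    std::vector<std::uint64_t> &t = extra.table[tag - 1];
    if (t.size() <= kPayloadMask) {  // this table still has a free index
      t.push_back(offset);
      return (tag << kPayloadBits) | static_cast<std::uint32_t>(t.size() - 1);
    }
  }
  assert(false && "all side tables are full");
  return 0;
}

inline std::uint64_t decode(std::uint32_t raw, const ExtraTables &extra) {
  std::uint32_t tag = raw >> kPayloadBits;
  std::uint32_t payload = raw & kPayloadMask;
  return tag == 0 ? payload : extra.table[tag - 1][payload];
}
```

Compared with a single flag bit, this triples the number of large offsets that can be indexed, at the cost of one more bit taken from the inline range.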

Thanks,

Stephen.


On 02/02/2021 18:40, Keane, Erich via cfe-dev wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently stores the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 

From: David Rector [hidden email]
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich [hidden email]
Cc: clang developer list [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

 

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

 

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a previate method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?

 

 

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).
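
A minimal sketch of how the IsBigLoc idea could look, to make the proposal concrete. Everything here is hypothetical: the `Loc` and `Manager` types, the `makeLoc` helper, and the choice of bit 30 for the flag are assumptions of this sketch, not Clang code. It uses `uint64_t` for the side table so the example behaves the same on 32-bit hosts.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 32-bit location. Bit 31 is left alone for the existing macro
// flag; bit 30 is the proposed IsBigLoc flag; the low 30 bits hold either an
// offset or an index into the side table.
constexpr uint32_t BigBit = 1u << 30;
constexpr uint32_t PayloadMask = BigBit - 1;

// Hypothetical stand-in for the vector that SourceManager would own.
struct Manager {
  std::vector<uint64_t> BigOffsets;
};

struct Loc {
  uint32_t ID = 0;
  bool isBigLoc() const { return (ID & BigBit) != 0; }
  // Mirrors the proposed getOffset(const SourceManager &): small offsets are
  // stored inline, large ones are looked up in the manager's vector.
  uint64_t getOffset(const Manager &M) const {
    uint32_t Payload = ID & PayloadMask;
    return isBigLoc() ? M.BigOffsets[Payload] : uint64_t{Payload};
  }
};

// Encodes Offset inline when it fits in 30 bits, otherwise spills it.
inline Loc makeLoc(Manager &M, uint64_t Offset) {
  Loc L;
  if (Offset <= PayloadMask) {
    L.ID = static_cast<uint32_t>(Offset);
  } else {
    assert(M.BigOffsets.size() <= PayloadMask && "side table is full too");
    L.ID = BigBit | static_cast<uint32_t>(M.BigOffsets.size());
    M.BigOffsets.push_back(Offset);
  }
  return L;
}
```

In this sketch every spilled location costs an extra 8 bytes in the side table, and the table index is itself limited to 2^30 entries.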

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.




On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what Boost is doing, but simply including:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.
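
The assertion above checks two conditions at once: that the new end offset did not wrap around the unsigned type, and that it did not run into `CurrentLoadedOffset`. A standalone sketch of the same check, returning `false` instead of asserting, might look like this. The `OffsetCounters` struct and the `tryReserve` function are hypothetical; only the three variable names come from the assertion text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two counters named in the assertion.
struct OffsetCounters {
  uint32_t NextLocalOffset;     // next free offset for local entries
  uint32_t CurrentLoadedOffset; // boundary the local entries must not cross
};

// Returns true and advances NextLocalOffset if TokLength + 1 offsets fit;
// returns false when the space is exhausted.
inline bool tryReserve(OffsetCounters &S, uint32_t TokLength) {
  uint32_t End = S.NextLocalOffset + TokLength + 1; // may wrap modulo 2^32
  bool NoWrap = End > S.NextLocalOffset;            // first half of the assertion
  bool NoCollision = End <= S.CurrentLoadedOffset;  // second half
  if (!(NoWrap && NoCollision))
    return false;
  S.NextLocalOffset = End;
  return true;
}
```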

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how often it is stored in the AST (multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.

 

So, I’d like to restart the discussion on this: we need to find a way to make our compiler able to support more source locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erich

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week, but if needed I can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful process, and the only thing we get is errors without any location information. That does not sound like a great trade-off. We’ll go experiment a bit with the size impact of using 64 bits and will attempt to take the route of a cmake configuration option.
 
If there are people who have already done some experiments on the size impact of 64-bit locations, we’d welcome your insights. @Matt Asplund I believe you said that you’ve done some prototyping; is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me; I had expected option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs; after that, make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com>
Cc: nd <nd at arm.com>; cfe-dev at lists.llvm.org
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
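
A rough back-of-the-envelope check of the claim above. The numbers are illustrative assumptions (a 2 MiB header included 2000 times), not measurements from a real AUTOSAR project, and the sketch assumes each inclusion reserves the file size plus one position for end-of-file.

```cpp
#include <cstdint>

// Each inclusion reserves the whole file again, so the space used grows
// linearly with the number of inclusions.
constexpr uint64_t spaceUsed(uint64_t FileSize, uint64_t Inclusions) {
  return (FileSize + 1) * Inclusions;
}

constexpr uint64_t AddressSpace = uint64_t{1} << 31; // 2^31 offsets

// Illustrative numbers only.
constexpr uint64_t HeaderSize = 2 * 1024 * 1024; // 2 MiB
constexpr uint64_t Inclusions = 2000;

static_assert(spaceUsed(HeaderSize, Inclusions) > AddressSpace,
              "2000 inclusions of a 2 MiB file exceed 2^31 offsets");

// Number of inclusions that would still fit under these assumptions.
constexpr uint64_t MaxInclusions = AddressSpace / (HeaderSize + 1);
```

Under these assumptions roughly a thousand inclusions already exhaust the space, before counting any other source in the translation unit.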
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
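
To illustrate the kind of encoding described above, here is a simplified sketch of a 32-bit value that holds either a (line, column) pair or a line number alone. This is not GCC's actual representation; the bit layout, the threshold, and the names `encode`, `decode`, and `LineCol` are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: bit 31 selects the mode.
//   mode 0: bits 30..8 = line (23 bits), bits 7..0 = column (0 = unknown)
//   mode 1: bits 30..0 = line (31 bits), column dropped
constexpr uint32_t LineOnlyBit = 1u << 31;
constexpr uint32_t ColumnBits = 8;
constexpr uint32_t MaxPackedLine = (1u << 23) - 1;
constexpr uint32_t MaxColumn = (1u << ColumnBits) - 1;

struct LineCol {
  uint32_t Line;
  uint32_t Column; // 0 when the column was not recorded
};

inline uint32_t encode(uint32_t Line, uint32_t Column) {
  assert(Line < LineOnlyBit && "line number does not fit in 31 bits");
  if (Line > MaxPackedLine)
    return LineOnlyBit | Line;          // past the threshold: line only
  if (Column > MaxColumn)
    Column = 0;                         // very long line: give up the column
  return (Line << ColumnBits) | Column; // common case: line and column
}

inline LineCol decode(uint32_t Packed) {
  if (Packed & LineOnlyBit)
    return {Packed & ~LineOnlyBit, 0};
  return {Packed >> ColumnBits, Packed & MaxColumn};
}
```

The point of such a scheme is that the encoded space is consumed per line rather than per byte, so re-including a large file uses far fewer values.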
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

I’m not sure what this bucketing would accomplish besides potentially reducing the key-space?  If I have 31 bits of keyspace (as I do now), and we take 1 bit away to decide between two ‘extra’ containers, we end up having 30 bits of keyspace ×2 (at most), which is 31 bits again, right?

 

I thought the purpose of doing all this in UTF-8 was to encode information in their keyspace as well as making them unique.
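
One way to count the keyspace under that reading of the bucketing idea (a fixed number of selector bits, with the remaining bits used as an index into the selected container). The `keyspace` helper is hypothetical and only does the arithmetic.

```cpp
#include <cstdint>

// Total distinct keys when SelectorBits of a TotalBits-wide key choose a
// container and the remaining bits index into it.
constexpr uint64_t keyspace(unsigned TotalBits, unsigned SelectorBits) {
  return (uint64_t{1} << SelectorBits) *
         (uint64_t{1} << (TotalBits - SelectorBits));
}

// Spending one of the 31 bits on a selector leaves 2 containers of 2^30
// entries each, which is 2^31 again: the same size as the flat keyspace.
static_assert(keyspace(31, 1) == (uint64_t{1} << 31), "no keys gained");
static_assert(keyspace(31, 0) == keyspace(31, 3), "true for any fixed split");
```

A fixed split of the same bits never adds keys; any gain would have to come from what each container entry stores, for example a value wider than 32 bits.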

 

From: Stephen Kelly <[hidden email]>
Sent: Tuesday, February 2, 2021 12:48 PM
To: Keane, Erich <[hidden email]>; David Rector <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

Sounds like we could learn a few lessons from UTF-8 and use the first several bits to say where to find the rest (i.e., spread the rest over multiple 'extra' containers in SourceManager)?

Thanks,

Stephen.

 


Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev
In reply to this post by shirley breuer via cfe-dev
On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 


I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.
 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).


That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).
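
The objection can be seen in a small sketch of a "begin plus unsigned length" range. The `CompactRange` type and the `tryMakeRange` and `endOf` helpers are hypothetical, and the sketch assumes a 64-bit begin with a 32-bit length. Rejecting an end that precedes the begin is an extra assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical compact range: a full-width begin plus a 32-bit distance.
struct CompactRange {
  uint64_t Begin;
  uint32_t Length;
};

// Returns false when End cannot be expressed as Begin + Length, which is
// the case when the two ends live in entries that are far apart.
inline bool tryMakeRange(uint64_t Begin, uint64_t End, CompactRange &Out) {
  if (End < Begin)
    return false; // end is in an earlier entry
  uint64_t Distance = End - Begin;
  if (Distance > UINT32_MAX)
    return false; // end is too far away for a 32-bit length
  Out = CompactRange{Begin, static_cast<uint32_t>(Distance)};
  return true;
}

inline uint64_t endOf(const CompactRange &R) { return R.Begin + R.Length; }
```

Any range that fails this encoding would need a fallback representation, which is where the simple scheme stops being simple.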
 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?


We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.
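
As a sketch of the layout described above: one flag bit, 31 offset bits, and a 50/50 split of the offset space. The free functions below are stand-ins named after the `isFileID` and `isMacroID` methods mentioned earlier in the thread; they operate on a raw 32-bit value and are not Clang's implementation. Which half is local and which is imported is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Top bit of the 32-bit value: the 'is macro location' flag.
constexpr uint32_t MacroIDBit = 1u << 31;
// The remaining 31 bits address 2 GiB of offsets.
constexpr uint32_t AddressableOffsets = 1u << 31;
// Assumed split: lower 1 GiB local, upper 1 GiB imported from AST files.
constexpr uint32_t LocalLimit = AddressableOffsets / 2;

inline bool isMacroID(uint32_t Raw) { return (Raw & MacroIDBit) != 0; }
inline bool isFileID(uint32_t Raw) { return (Raw & MacroIDBit) == 0; }
inline uint32_t offsetOf(uint32_t Raw) { return Raw & ~MacroIDBit; }

// Hypothetical helper: which half of the split an offset falls into.
inline bool isLocalOffset(uint32_t Offset) {
  assert(Offset < AddressableOffsets);
  return Offset < LocalLimit;
}
```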

One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.

If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.

There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
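
The effect of those two changes on local capacity can be counted directly. This is only arithmetic under the figures given above; the `localCapacity` helper is hypothetical, and the "dynamic split" line assumes the best case in which nothing is imported from AST files.

```cpp
#include <cstdint>

constexpr uint64_t GiB = uint64_t{1} << 30;

// Local capacity for a given number of offset bits and the share (Num/Den)
// of the offset space that local locations may use.
constexpr uint64_t localCapacity(unsigned OffsetBits, uint64_t Num,
                                 uint64_t Den) {
  return (uint64_t{1} << OffsetBits) / Den * Num;
}

constexpr uint64_t Today = localCapacity(31, 1, 2);        // flag bit, 50/50 split
constexpr uint64_t NoMacroBit = localCapacity(32, 1, 2);   // flag bit reclaimed
constexpr uint64_t DynamicSplit = localCapacity(32, 1, 1); // and nothing imported

static_assert(Today == 1 * GiB, "");
static_assert(NoMacroBit == 2 * GiB, "");
static_assert(DynamicSplit == 4 * GiB, "");
```

Both changes together give at most a constant factor of four, which is why they would not help if usage grows asymptotically.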

From: David Rector <[hidden email]>
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

 

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

 

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?

 

 

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field are not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.
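
A minimal sketch of the "IsBigLoc" idea described above, under stated assumptions: `SketchSourceLocation` and `SketchSourceManager` are hypothetical stand-ins, not Clang's classes, and the table uses `std::uint64_t` rather than `uintptr_t` so that the entries are 64 bits wide on every host. The top bit is left free for the existing macro flag.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the side table that would live in SourceManager.
class SketchSourceManager {
public:
  // Stores a 64-bit offset out of line and returns its index in the table.
  std::uint32_t addBigOffset(std::uint64_t Offset) {
    BigOffsets.push_back(Offset);
    return static_cast<std::uint32_t>(BigOffsets.size() - 1);
  }
  std::uint64_t bigOffset(std::uint32_t Index) const {
    return BigOffsets[Index];
  }

private:
  std::vector<std::uint64_t> BigOffsets;
};

// Hypothetical 32-bit location: bit 31 is left for the macro flag, bit 30 is
// IsBigLoc, and bits 0-29 hold either a direct offset or a table index.
class SketchSourceLocation {
  static constexpr std::uint32_t BigBit = 1u << 30;
  static constexpr std::uint32_t PayloadMask = BigBit - 1;
  std::uint32_t ID = 0;

public:
  static SketchSourceLocation make(std::uint64_t Offset,
                                   SketchSourceManager &SM) {
    SketchSourceLocation L;
    if (Offset <= PayloadMask)
      L.ID = static_cast<std::uint32_t>(Offset); // fits inline
    else
      L.ID = BigBit | SM.addBigOffset(Offset);   // spill to the side table
    return L;
  }
  bool isBigLoc() const { return (ID & BigBit) != 0; }
  // As proposed, the lookup now needs the SourceManager.
  std::uint64_t getOffset(const SketchSourceManager &SM) const {
    std::uint32_t Payload = ID & PayloadMask;
    return isBigLoc() ? SM.bigOffset(Payload) : Payload;
  }
};
```

The sketch does not check for the table itself growing past 2^30 entries, which would overflow the 30-bit index.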




On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.
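
The condition in that assertion can be read as two separate checks. The sketch below copies the expression from the message into a free function; the function name `hasRoomFor` and the use of `std::uint32_t` parameters are assumptions made for illustration, not Clang's actual code.

```cpp
#include <cstdint>

// Mirrors the expression in the assertion message above.
constexpr bool hasRoomFor(std::uint32_t NextLocalOffset,
                          std::uint32_t CurrentLoadedOffset,
                          std::uint32_t TokLength) {
  // First clause: the unsigned addition must not wrap around.
  // Second clause: the new end of the local region must not run into the
  // region used for locations loaded from AST files.
  return NextLocalOffset + TokLength + 1 > NextLocalOffset &&
         NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset;
}
```

The second clause suggests that local offsets grow toward the loaded region, so whichever side fills up first triggers the failure.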

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable because of how often it is stored in the AST (multiple times per node), and giving up on locations isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.

 

So, I’d like to restart this discussion: we need to find a way to make our compiler support more source locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erich

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general, my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week, but if needed I can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification; I certainly had not realized that location tracking was needed for the correctness of clang. Decoupling this sounds like a painful process, and the only thing we would get is errors without any location information. That does not sound like a great trade-off. We’ll go experiment a bit with the size impact of using 64 bits and will attempt to take the route of a cmake configuration option.
 
If there are people who have already done some experiments on the size impact of 64-bit location pointers, we’d welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping; is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like the automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the meantime, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me; I had expected option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs; after that, make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
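
As a back-of-the-envelope check of that claim, the sketch below divides the location space by the file size. The 2 MiB figure is an assumed stand-in for "several MBs", and `maxInclusions` is a made-up helper; it also ignores any per-inclusion bookkeeping and everything else in the TU.

```cpp
#include <cstdint>

// How many times a file of fileBytes can be included before the running
// total exceeds spaceBytes, assuming each #include reserves the whole file.
constexpr std::uint64_t maxInclusions(std::uint64_t spaceBytes,
                                      std::uint64_t fileBytes) {
  return spaceBytes / fileBytes;
}

constexpr std::uint64_t kSpace = std::uint64_t{1} << 31;    // 2^31 characters
constexpr std::uint64_t kMemMapSize = std::uint64_t{2} << 20; // assumed 2 MiB

// Roughly a thousand inclusions of a 2 MiB header exhaust the space.
static_assert(maxInclusions(kSpace, kMemMapSize) == 1024, "2^31 / 2^21");
```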
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
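
The general shape of such a scheme can be sketched as follows. This is purely illustrative: the tag bit, the 8-bit column width, and the function names are assumptions chosen for the example and do not reflect GCC's real encoding.

```cpp
#include <cstdint>

// Illustrative layout only; bit widths and the tag bit are assumptions.
constexpr std::uint32_t kLineOnlyTag = 1u << 31;
constexpr unsigned kColumnBits = 8;
constexpr std::uint32_t kMaxColumn = (1u << kColumnBits) - 1;
constexpr std::uint32_t kMaxPackedLine = (1u << (31 - kColumnBits)) - 1;

// Packs line and column together while both fit; otherwise keeps only the
// line number and drops the column. Lines of 2^31 or more are not handled.
constexpr std::uint32_t encode(std::uint32_t line, std::uint32_t column) {
  return (line <= kMaxPackedLine && column <= kMaxColumn)
             ? (line << kColumnBits) | column
             : kLineOnlyTag | line;
}

constexpr std::uint32_t lineOf(std::uint32_t loc) {
  return (loc & kLineOnlyTag) ? (loc & ~kLineOnlyTag) : (loc >> kColumnBits);
}

// Returns 0 when the column was dropped.
constexpr std::uint32_t columnOf(std::uint32_t loc) {
  return (loc & kLineOnlyTag) ? 0 : (loc & kMaxColumn);
}
```

The point of the design is that re-including a file costs location space proportional to its line count rather than its byte count, at the price of losing column information past the threshold.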
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what the root cause is.  It DOES include a couple of other files, so I’m not sure what all is entailed.

 

>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity.

This seems like a useful and cheap thing to do; if we find this is a legitimate issue, I’ll see if I can put a patch together to do this one.

 

>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

I don’t have a great idea on how to do this, or how we use this, but based on your description it seems worth-while.  At least in non-module TUs I’d assume that the “imported-from-ast-file” is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.

 

 

 

From: Richard Smith <[hidden email]>
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 

 

I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).

 

That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 

We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.

 

One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.

 

If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.

 

There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev
Not sure if this helps, but: at first I didn’t think about the Boost file you were trying to compile. Now that I see it involves the Boost preprocessor library, I bet the issue involves code somewhere that is written like

`BOOST_PP_IF(condition, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B)`

which should instead be written

`BOOST_PP_CAT(HUGEMACROEXPANSION, BOOST_PP_IF(condition, _A, _B))`

The former results in two huge expansions, one of which is thrown away, whereas the latter only results in one expansion.
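
To make the difference concrete, here is a minimal sketch of the two forms. It does not use Boost: `PP_CAT` and `PP_IF` are hypothetical stand-ins written for this example (they are not Boost's actual definitions), and the two "huge" macros are tiny placeholders. In the eager form both arguments are fully macro-expanded before one is selected; in the deferred form only the short suffix is selected, and the large macro is expanded once after pasting.

```cpp
#include <cassert>

// Hypothetical stand-ins for BOOST_PP_CAT / BOOST_PP_IF (illustration only).
#define PP_CAT(a, b) PP_CAT_I(a, b)
#define PP_CAT_I(a, b) a##b
#define PP_IF(cond, t, f) PP_IF_I(cond, t, f)
#define PP_IF_I(cond, t, f) PP_IF_##cond(t, f)
#define PP_IF_1(t, f) t
#define PP_IF_0(t, f) f

// Small placeholders for what would be very large expansions in real code.
#define HUGEMACROEXPANSION_A (1 + 2 + 3)
#define HUGEMACROEXPANSION_B (4 + 5 + 6)

// Eager form: both HUGEMACROEXPANSION_A and HUGEMACROEXPANSION_B are expanded
// as arguments of PP_IF, and one of the two results is then discarded.
int eager_pick() {
  return PP_IF(1, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B);
}

// Deferred form: PP_IF only chooses between the suffixes _A and _B; the
// pasted name HUGEMACROEXPANSION_A is expanded once, on rescan.
int deferred_pick() {
  return PP_CAT(HUGEMACROEXPANSION, PP_IF(1, _A, _B));
}
```

Both functions return the same value; only the amount of intermediate expansion the preprocessor has to track differs.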

I recall writing reasonably small but complex Boost.Preprocessor code that would take about 15 minutes just to preprocess. Now that I think about it, it may have hit the "ran out of source locations" assertion a few times before I finally figured out to always write those conditionals the second way.

Not sure if that’s the issue, but it’s a place to look.

Dave



Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev
In reply to this post by shirley breuer via cfe-dev
On Wed, 3 Feb 2021 at 12:10, Keane, Erich <[hidden email]> wrote:

Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.

 

>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity.

This seems like a useful and cheap thing to do. If we find this is a legitimate issue, I’ll see if I can put a patch together for it.

 

>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

I don’t have a great idea of how to do this, or of how we use this, but based on your description it seems worthwhile.  At least in non-module TUs I’d assume that the “imported-from-ast-file” case is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.


Not necessarily, no -- in some configurations, we don't find out which AST files we want to load until we see a #include, which could happen arbitrarily late through preprocessing.

I don't think we need to pick the split point up front. Currently, we essentially use positive source locations for local locs and negative ones for imported locs. So we could handle this by effectively allowing both/either to wrap around, making sure they don't pass each other, and checking which side a location is on by performing an (unsigned) comparison instead of checking the sign bit.
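
A minimal sketch of that idea, assuming a toy model rather than Clang's actual SourceManager: local offsets grow upward, loaded (imported) offsets grow downward from the top of the space, an allocation fails only when the two would cross, and the side a location is on is decided by an unsigned comparison against the current boundary. The class name `OffsetSpace` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a shared offset space with a dynamic split point.
class OffsetSpace {
  uint32_t NextLocal = 1;  // next free local offset; 0 is reserved as "invalid"
  uint32_t LowestLoaded;   // lowest offset handed out to loaded entries so far

public:
  explicit OffsetSpace(uint32_t Size) : LowestLoaded(Size) {}

  // Reserve Length offsets for a local entry. Returns the base offset,
  // or 0 if the local and loaded regions would collide.
  uint32_t allocateLocal(uint32_t Length) {
    if (Length > LowestLoaded - NextLocal)
      return 0;
    uint32_t Base = NextLocal;
    NextLocal += Length;
    return Base;
  }

  // Reserve Length offsets for a loaded entry, growing downward.
  // Returns the base offset, or 0 if the regions would collide.
  uint32_t allocateLoaded(uint32_t Length) {
    if (Length > LowestLoaded - NextLocal)
      return 0;
    LowestLoaded -= Length;
    return LowestLoaded;
  }

  // Unsigned comparison against the boundary instead of a sign-bit test.
  bool isLoaded(uint32_t Offset) const { return Offset >= LowestLoaded; }
};
```

With a 2^31 space, a local allocation of 1.5 GiB succeeds in this model, although it would not fit in a fixed 1 GiB half; a later request fails only once the two regions would actually meet.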
 

From: Richard Smith <[hidden email]>
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 

 

I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).

 

That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 

We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.
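
The layout described above (one flag bit, 2GB of addressable offsets, fixed 1GB/1GB split) can be sketched as follows. This is an illustration under stated assumptions, not Clang's code: the constant and function names are hypothetical, and the choice of which half is "local" is only a convention of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 32-bit encoding: top bit is the 'is macro location' flag,
// the low 31 bits are an offset into a 2 GiB space.
constexpr uint32_t MacroBit = 1u << 31;
constexpr uint32_t OffsetMask = MacroBit - 1;
// Fixed 50/50 split assumed here: offsets below 2^30 are local,
// offsets at or above 2^30 belong to entries loaded from AST files.
constexpr uint32_t LocalLimit = 1u << 30;

constexpr uint32_t encodeLoc(uint32_t Offset, bool IsMacro) {
  return (Offset & OffsetMask) | (IsMacro ? MacroBit : 0u);
}
constexpr bool isMacroLoc(uint32_t Raw) { return (Raw & MacroBit) != 0; }
constexpr uint32_t offsetOf(uint32_t Raw) { return Raw & OffsetMask; }
constexpr bool isLoadedLoc(uint32_t Raw) { return offsetOf(Raw) >= LocalLimit; }
```

In this model the local capacity is `LocalLimit`, that is 2^30 offsets (1 GiB), which is the implementation limit mentioned above. Dropping the flag bit would widen `OffsetMask` to all 32 bits and double the total space.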

 

One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.

 

If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.

 

There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

From: David Rector <[hidden email]>
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

 

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

 

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?

 

 

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.
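
A rough sketch of the "IsBigLoc" idea, assuming a standalone toy table rather than SourceManager itself: small offsets are stored inline in 30 bits, and larger ones are stored in a side vector with only the index kept inline. The names `BigOffsetTable`, `BigBit` and `PayloadMask` are hypothetical, the macro flag bit is ignored for simplicity, and `uint64_t` is used for the side table instead of `uintptr_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t BigBit = 1u << 30;        // hypothetical "IsBigLoc" flag
constexpr uint32_t PayloadMask = BigBit - 1; // remaining 30 bits

// Stands in for the vector that would live in SourceManager.
class BigOffsetTable {
  std::vector<uint64_t> Offsets;

public:
  // Store small offsets inline; spill large ones to the side table
  // and keep only their index in the 30-bit payload.
  uint32_t encode(uint64_t Offset) {
    if (Offset <= PayloadMask)
      return static_cast<uint32_t>(Offset);
    assert(Offsets.size() <= PayloadMask && "side table index overflow");
    Offsets.push_back(Offset);
    return BigBit | static_cast<uint32_t>(Offsets.size() - 1);
  }

  // Decoding needs the table, which is why getOffset would have to take
  // a SourceManager argument under this proposal.
  uint64_t decode(uint32_t Raw) const {
    uint32_t Payload = Raw & PayloadMask;
    return (Raw & BigBit) ? Offsets[Payload] : Payload;
  }
};
```

The index itself is limited to 30 bits here, so the table can hold at most 2^30 entries, which is the concern raised in the follow-up message about running out of indices.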



On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how much it is stored in the AST( Multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.

 

SO, I’d like to re-start the discussion into this, we need to find a way to make our compiler able to support more source-locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erihc 

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week, but if needed I can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful process, and the only thing we would get is errors without any location information. That doesn’t sound like a great trade-off. We’ll go experiment a bit with the size impact of using 64 bits and will attempt to take the route of a cmake configuration option.
 
If there are people who have already done some experiments on the size impact of 64-bit location pointers, we’d welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping; is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the meantime, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me; I had expected option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs; after that, make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not a nice user experience, but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location, or something that only gives the location of the error and not the include stack, under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values; one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
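
To make the arithmetic concrete, here is a rough sketch. The 2 MiB header size and the 2,000 inclusions are made-up figures chosen only to fall inside the "several MBs" and "several thousand" ranges mentioned above, and the sketch ignores any per-file bookkeeping overhead, so it is an illustration rather than Clang's exact accounting.

```cpp
#include <cstdint>

// Hypothetical figures, chosen only to fall inside the ranges quoted above.
constexpr std::uint64_t kHeaderBytes = 2ull * 1024 * 1024;  // "several MBs"
constexpr std::uint64_t kInclusions = 2000;                 // "several thousand"

// A 32-bit location with one flag bit leaves 2^31 addressable offsets.
constexpr std::uint64_t kLocationSpace = 1ull << 31;

// Every #include reserves the full file size again, so the cost is
// multiplied by the number of inclusions instead of being paid once.
constexpr std::uint64_t kReserved = kHeaderBytes * kInclusions;

// How many inclusions of this header fit before the space is exhausted.
constexpr std::uint64_t kMaxInclusions = kLocationSpace / kHeaderBytes;

static_assert(kReserved > kLocationSpace,
              "the repeated header alone exceeds the location space");
```

With these assumed numbers the header alone would need roughly 4.2 billion offsets, about twice what is available, and the space would run out after 1024 inclusions.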
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev
An important test to perform before committing to any action might be this: what is the proportion of unique SourceLocations constructed by the Lexer to the number of bytes in the input buffers, when you run out of SourceLocations?  

If that proportion is extremely low (many more bytes than SourceLocations), that suggests the problem may be large unused macro expansions cluttering up the buffer (if that is possible).  In that case, the "index solution" introduced earlier could solve the issue.

I don’t know quite how the Preprocessor works but the reason I suspect there may be large unused macro expansions is a) the use of BOOST_PP stuff in this problematic case, and b) the documentation of SourceLocation:
```
/// Technically, a source location is simply an offset into the manager's view
/// of the input source, which is all input buffers (including macro
/// expansions) concatenated in an effectively arbitrary order. 
```
Does "all input buffers (including macro expansions)" include even unused expansions, such as the first arg in
`BOOST_PP_IF(1, UNUSED_BUT_EXPANDED_MACRO(a,b,c), USED_EXPANDED_MACRO(a,b,c))`?

If so I suspect there will be many more bytes than SourceLocations, and thus the index solution may be viable, just in case it is easier to implement than other solutions (which is another matter).


On Feb 3, 2021, at 4:19 PM, Richard Smith <[hidden email]> wrote:

On Wed, 3 Feb 2021 at 12:10, Keane, Erich <[hidden email]> wrote:

Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.

 

>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity.

This seems like a useful and cheap thing to do. If we find this is a legitimate issue, I’ll see if I can put a patch together to do this one.

 

>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

I don’t have a great idea of how to do this, or how we use this, but based on your description it seems worthwhile.  At least in non-module TUs I’d assume that the “imported-from-ast-file” case is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.


Not necessarily, no -- in some configurations, we don't find out which AST files we want to load until we see a #include, which could happen arbitrarily late through preprocessing.

I don't think we need to pick the split point up front. Currently, we essentially use positive source locations for local locs and negative ones for imported locs. So we could handle this by effectively allowing both/either to wrap around, making sure they don't pass each other, and checking which side a location is on by performing an (unsigned) comparison instead of checking the sign bit.
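
A minimal sketch of that idea, under stated assumptions: the struct and member names below are hypothetical and are not Clang's, the full 32-bit space is assumed to be available for offsets, and 64-bit counters are used so the bookkeeping itself cannot overflow. Local offsets grow upward from 0, imported offsets grow downward from 2^32, and the side an offset is on is decided by an unsigned comparison against the moving boundary.

```cpp
#include <cstdint>

// Hypothetical two-ended allocator over a 32-bit offset space.
struct OffsetSpace {
  std::uint64_t nextLocal = 0;               // first unused local offset
  std::uint64_t currentLoaded = 1ull << 32;  // lowest offset used by imports

  // Reserve `size` offsets for local source; fails if the regions would meet.
  constexpr bool allocLocal(std::uint64_t size) {
    if (size > currentLoaded - nextLocal) return false;
    nextLocal += size;
    return true;
  }

  // Reserve `size` offsets for imported (loaded) source, growing downward.
  constexpr bool allocLoaded(std::uint64_t size) {
    if (size > currentLoaded - nextLocal) return false;
    currentLoaded -= size;
    return true;
  }

  // Unsigned comparison against the boundary replaces the sign-bit check.
  constexpr bool isLoaded(std::uint32_t offset) const {
    return offset >= currentLoaded;
  }
};

// Scenario: 3 GiB of local source next to 0.5 GiB of imported source,
// which a fixed 50/50 split could not accommodate.
constexpr bool dynamicSplitDemo() {
  OffsetSpace s{};
  bool ok = s.allocLocal(3ull << 30);
  ok = ok && s.allocLoaded(1ull << 29);
  ok = ok && !s.allocLocal(1ull << 30);  // would run into the loaded region
  ok = ok && !s.isLoaded(0xBFFFFFFFu);   // last local offset
  ok = ok && s.isLoaded(0xE0000000u);    // first loaded offset
  return ok;
}
```

Neither region has to be sized up front here; an allocation only fails once the two regions would actually pass each other.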
 

From: Richard Smith <[hidden email]>
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 

 

I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).

 

That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 

We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.

 

One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.

 

If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.

 

There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
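
The capacity figures above can be checked with a few lines of arithmetic. This sketch assumes exactly the layout described in this message (32 bits, one flag bit, a fixed 50/50 split between local and imported locations); the names are illustrative only.

```cpp
#include <cstdint>

constexpr std::uint64_t kGiB = 1ull << 30;

// Offsets addressable by a 32-bit location after reserving `flagBits` bits.
constexpr std::uint64_t addressable(unsigned flagBits) {
  return 1ull << (32 - flagBits);
}

// Per the description above: one flag bit, then a fixed 50/50 split.
constexpr std::uint64_t kLocalToday = addressable(1) / 2;

// Dropping the 'is macro location' bit while keeping the fixed split.
constexpr std::uint64_t kLocalNoFlag = addressable(0) / 2;

// Dropping the bit and letting local locations use whatever imported ones
// do not need (upper bound: a TU that imports nothing).
constexpr std::uint64_t kLocalDynamicMax = addressable(0);

static_assert(kLocalToday == 1 * kGiB, "1 GiB of local locations");
static_assert(kLocalNoFlag == 2 * kGiB, "removing the flag bit doubles it");
static_assert(kLocalDynamicMax == 4 * kGiB, "a dynamic split can double it again");
```

Both changes together buy at most a factor of four, which is why they only help when a compilation is slightly over the limit.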

 

From: David Rector <[hidden email]>
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

 

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

 

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?

 

 

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field are not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).
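
As a rough illustration of the encoding, here is a sketch with hypothetical names and bit positions (none of them are Clang's). Bit 30 plays the role of IsBigLoc, the low 30 bits hold either the offset itself or an index into a side table, and a small fixed-size array stands in for the vector in SourceManager so that the sketch can be checked at compile time.

```cpp
#include <cstdint>

constexpr std::uint32_t kBigBit = 1u << 30;          // assumed IsBigLoc flag
constexpr std::uint32_t kPayloadMask = kBigBit - 1;  // low 30 bits

// Hypothetical stand-in for the part of SourceManager that owns big offsets.
struct BigLocTable {
  static constexpr std::uint32_t kCapacity = 4;  // tiny, for illustration only
  std::uint64_t bigOffsets[kCapacity] = {};
  std::uint32_t count = 0;

  // Returns the 32-bit encoding for an offset that may need up to 64 bits.
  // The caller must not store more than kCapacity big offsets.
  constexpr std::uint32_t encode(std::uint64_t offset) {
    if (offset <= kPayloadMask) return static_cast<std::uint32_t>(offset);
    bigOffsets[count] = offset;
    return kBigBit | count++;
  }

  // Counterpart of getOffset(const SourceManager &): follow the index
  // when the IsBigLoc bit is set, otherwise return the payload directly.
  constexpr std::uint64_t getOffset(std::uint32_t loc) const {
    const std::uint32_t payload = loc & kPayloadMask;
    return (loc & kBigBit) ? bigOffsets[payload] : payload;
  }
};

constexpr bool bigLocRoundTrips() {
  BigLocTable t{};
  const std::uint32_t small = t.encode(12345);
  const std::uint32_t big = t.encode(5ull << 32);  // does not fit in 30 bits
  return (small & kBigBit) == 0 && (big & kBigBit) != 0 &&
         t.getOffset(small) == 12345 && t.getOffset(big) == (5ull << 32);
}
```

Every distinct big location costs one table entry, so the scheme only pays off when far fewer distinct locations are created than bytes are reserved, and the index itself is still limited to 30 bits.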

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.



On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

[...]

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev
In reply to this post by shirley breuer via cfe-dev

I don’t know if it is related to that, but I had the reporter run -dD and it crashed with these two at the end:

 

#define TUPLE_IS_VALID_ARRAY_E (2,(3,4))

#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP

 

So presumably it is related to the boost-preprocessor that you mentioned, however from a quick googling of the source code, that macro is:

#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP_NIL)

 

So there isn’t any huge expansion here.  That said, since this is just a header include FROM boost, I’d expect us to be able to compile it.

 

 

From: David Rector <[hidden email]>
Sent: Wednesday, February 3, 2021 12:33 PM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

Not sure if this helps but: I didn’t think about the boost file you were trying to compile at first, but now that I see it involves the boost preprocessor stuff, I bet the issue involves code somewhere that is written like 

 

`BOOST_PP_IF(condition, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B)`

 

which should instead be written

 

`BOOST_PP_CAT(HUGEMACROEXPANSION, BOOST_PP_IF(0, _A, _B))`

 

The former results in two huge expansions, one of which is thrown away, whereas the latter only results in one expansion.
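
The difference can be reproduced without Boost. In this sketch PP_CAT and PP_IF are simplified stand-ins written for illustration (they are not Boost's implementations), and EXPANSION_A / EXPANSION_B are tiny placeholders for the huge macros. It shows only the order of expansion; how much source-location space each form consumes in Clang is not measured here.

```cpp
// Simplified stand-ins for BOOST_PP_CAT and BOOST_PP_IF (illustrative only).
#define PP_CAT(a, b) PP_CAT_I(a, b)
#define PP_CAT_I(a, b) a##b
#define PP_IF(c, t, f) PP_IF_I(c, t, f)
#define PP_IF_I(c, t, f) PP_IF_##c(t, f)
#define PP_IF_1(t, f) t
#define PP_IF_0(t, f) f

// Placeholders for two "huge" expansions.
#define EXPANSION_A() (1 + 1 + 1)
#define EXPANSION_B() (2 + 2 + 2)

// Eager form: both EXPANSION_A() and EXPANSION_B() are fully expanded as
// macro arguments before PP_IF picks one and discards the other.
constexpr int kEager = PP_IF(0, EXPANSION_A(), EXPANSION_B());

// Lazy form: PP_IF only chooses between the short suffixes _A and _B; the
// chosen name is pasted together and just that one macro is then expanded.
constexpr int kLazy = PP_CAT(EXPANSION, PP_IF(0, _A, _B))();

static_assert(kEager == kLazy, "both forms select the same expansion");
```

Both forms produce the same tokens in the end; the lazy one simply never expands the branch that is thrown away.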

 

I recall writing reasonably small but complex boost pp code that would take like 15 minutes just to preprocess, and now that I think about it, it might have resulted in the "ran out of source locations" a few times, before I finally figured out to always write those conditionals the second way.  

 

Not sure if that’s the issue, but it’s a place to look.

 

Dave

 



On Feb 3, 2021, at 3:10 PM, Keane, Erich <[hidden email]> wrote:

 

Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.

 

>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. 

This seems like a useful and cheap thing to do. If we find this is a legitimate issue, I’ll see if I can put a patch together for this one.

 

>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

I don’t have a great idea on how to do this, or how we use this, but based on your description it seems worth-while.  At least in non-module TUs I’d assume that the “imported-from-ast-file” is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.

 

 

 

From: Richard Smith <[hidden email]> 
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 

 

I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).

 

That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 

We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.
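
A rough model of the layout described above, to make the numbers concrete. This is a sketch under stated assumptions: the names in namespace `sketch` are hypothetical and are not Clang's actual classes, and the fixed 50/50 split is taken from the description in this message.

```cpp
#include <cstdint>

// Hypothetical model of the 32-bit encoding described above; not Clang's real API.
namespace sketch {

constexpr std::uint32_t MacroBit = 1u << 31;            // 'is macro location' flag
constexpr std::uint32_t OffsetMask = MacroBit - 1;      // remaining 31 bits
constexpr std::uint64_t AddressableBytes = 1ull << 31;  // 2 GiB of offsets
constexpr std::uint64_t LocalBytes = AddressableBytes / 2;              // local region
constexpr std::uint64_t ImportedBytes = AddressableBytes - LocalBytes;  // from AST files

// Pack an offset and the macro flag into one 32-bit value.
constexpr std::uint32_t encode(std::uint32_t offset, bool isMacroLoc) {
  return (offset & OffsetMask) | (isMacroLoc ? MacroBit : 0u);
}
constexpr bool isMacro(std::uint32_t raw) { return (raw & MacroBit) != 0; }
constexpr std::uint32_t offsetOf(std::uint32_t raw) { return raw & OffsetMask; }

static_assert(LocalBytes == (1ull << 30), "1 GiB for locally lexed source");
static_assert(ImportedBytes == (1ull << 30), "1 GiB for imported locations");

}  // namespace sketch
```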

 

One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.

 

If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.

 

There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

 

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

 

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?

 

 

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).
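
A minimal sketch of the IsBigLoc idea, assuming a side table of 64-bit offsets. `BigOffsetTable` and `Loc` are hypothetical names standing in for SourceManager and SourceLocation, `std::uint64_t` is used in place of `uintptr_t` for simplicity, and the sketch ignores what happens when the table itself needs more than 30 bits of index.

```cpp
#include <cstdint>
#include <vector>

namespace sketch {

// Hypothetical side table; in the proposal this vector would live in SourceManager.
class BigOffsetTable {
  std::vector<std::uint64_t> Offsets;

public:
  // Stores a 64-bit offset and returns the index it was stored under.
  std::uint32_t add(std::uint64_t Offset) {
    Offsets.push_back(Offset);
    return static_cast<std::uint32_t>(Offsets.size() - 1);
  }
  std::uint64_t get(std::uint32_t Index) const { return Offsets.at(Index); }
};

// Hypothetical 32-bit location: 1 macro bit, 1 IsBigLoc bit, 30 payload bits.
class Loc {
  static constexpr std::uint32_t MacroBit = 1u << 31;
  static constexpr std::uint32_t BigBit = 1u << 30;
  static constexpr std::uint32_t PayloadMask = BigBit - 1;
  std::uint32_t ID = 0;

public:
  static Loc make(std::uint64_t Offset, bool IsMacro, BigOffsetTable &Table) {
    Loc L;
    if (Offset <= PayloadMask)
      L.ID = static_cast<std::uint32_t>(Offset);  // small: store the offset inline
    else
      L.ID = BigBit | Table.add(Offset);          // big: store an index into the table
    if (IsMacro)
      L.ID |= MacroBit;
    return L;
  }
  bool isBigLoc() const { return (ID & BigBit) != 0; }
  bool isMacroID() const { return (ID & MacroBit) != 0; }

  // Needs the table, which is why getOffset() would gain a SourceManager parameter.
  std::uint64_t getOffset(const BigOffsetTable &Table) const {
    std::uint32_t Payload = ID & PayloadMask;
    return isBigLoc() ? Table.get(Payload) : Payload;
  }
};

}  // namespace sketch
```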

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.




On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how much it is stored in the AST( Multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.

 

SO, I’d like to re-start the discussion into this, we need to find a way to make our compiler able to support more source-locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erich

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week, but if needed I can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification. I certainly had not realized that location tracking was needed for the correctness of clang. Decoupling this sounds like a painful process, and the only thing we would get is errors without any location information. That does not sound like a great trade-off. We’ll go experiment a bit with the size impact of using 64 bits and will attempt to take the route of a cmake configuration option.
 
If there are people who have already done some experiments on the size impact of 64-bit location pointers, we welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
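
As a rough illustration of the arithmetic, with assumed numbers (the message only says "several MBs" and "several thousand times"; the 2 MiB file size below is a made-up example, and the helper names are hypothetical):

```cpp
#include <cstdint>

namespace sketch {

// 2^31 characters of source location space, as stated above.
constexpr std::uint64_t LocationSpace = 1ull << 31;

// How many times a file of the given size can be included before the
// reserved space exceeds the location space (ignoring everything else in the TU).
constexpr std::uint64_t maxInclusions(std::uint64_t fileBytes) {
  return LocationSpace / fileBytes;
}

// True if including the file `count` times needs more space than is available.
constexpr bool overflows(std::uint64_t fileBytes, std::uint64_t count) {
  return fileBytes * count > LocationSpace;
}

constexpr std::uint64_t TwoMiB = 2ull << 20;
static_assert(maxInclusions(TwoMiB) == 1024, "a 2 MiB header fits only 1024 times");

}  // namespace sketch
```

Under these assumptions, a 2 MiB header exhausts the space after about a thousand inclusions, before counting any other source in the translation unit.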
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
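
Approach (2) could be sketched roughly as follows. This is a hypothetical allocator, not Clang's SourceManager: the names NextLocalOffset and CurrentLoadedOffset are borrowed from the assertion text quoted elsewhere in this thread, while the class itself, the use of 0 as the invalid value, and the "sticky" exhausted flag are assumptions made for illustration.

```cpp
#include <cstdint>

namespace sketch {

// Assumption: offset 0 plays the role of the <invalid location> special value.
constexpr std::uint32_t InvalidOffset = 0;

class OffsetAllocator {
  std::uint32_t NextLocalOffset = 1;             // offset 0 is reserved as invalid
  std::uint32_t CurrentLoadedOffset = 1u << 31;  // upper bound of the local space
  bool Exhausted = false;

public:
  // Reserves Length + 1 offsets and returns the start offset, or
  // InvalidOffset once the space has been exhausted.
  std::uint32_t allocate(std::uint32_t Length) {
    // Compute in 64 bits so the check itself cannot wrap around.
    std::uint64_t End = std::uint64_t(NextLocalOffset) + Length + 1;
    if (Exhausted || End > CurrentLoadedOffset) {
      Exhausted = true;  // all subsequent requests also get the invalid value
      return InvalidOffset;
    }
    std::uint32_t Start = NextLocalOffset;
    NextLocalOffset = static_cast<std::uint32_t>(End);
    return Start;
  }
};

}  // namespace sketch
```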
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev


On Feb 4, 2021, at 9:34 AM, Keane, Erich <[hidden email]> wrote:

I don’t know if it is related to that, but I had the reporter run -dD and it crashed with these two at the end:
 
#define TUPLE_IS_VALID_ARRAY_E (2,(3,4))
#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP
 
So presumably it is related to the boost-preprocessor that you mentioned, however from a quick googling of the source code, that macro is:
#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP_NIL)
 
So there isn’t any huge expansion here.  That said, since this is just a header include FROM boost, I’d expect us to be able to compile it.

Agreed, Boost probably isn’t doing unnecessary expansions.

However, it is possible it is doing a lot of nested expansions that are necessary for it to implement its logic, but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  

If these ultimately-unused expansions (e.g. expansions into arguments of other macros, which cannot be used in the AST) are added to the concatenation of buffers through which a SourceLocation must specify an offset, then there may be many more bytes in the buffers than SourceLocations in the AST when dealing with extremely heavy macro usage such as with the Boost VMD/PP stuff.

Hard to wade through the Preprocessor/Lexer details to know if this is the case, so an easier question is just: how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  If the number of AST-used SourceLocations exceeds 2^31, the only answer is 64-bit SourceLocations.  If it is significantly under that, an indexing solution would work, and might well give the best bang for the buck.
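
To make the "indexing solution" concrete, here is a minimal sketch of interning only the offsets that are actually used, so that a sparse set of 64-bit offsets maps to dense 32-bit indices. `LocationIndex` and its methods are hypothetical names, not anything in Clang.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

namespace sketch {

// Hypothetical interning table: each distinct 64-bit offset gets one dense index.
class LocationIndex {
  std::unordered_map<std::uint64_t, std::uint32_t> IndexOf;
  std::vector<std::uint64_t> Offsets;

public:
  // Returns the existing index for Offset, or assigns the next free one.
  std::uint32_t intern(std::uint64_t Offset) {
    auto It = IndexOf.find(Offset);
    if (It != IndexOf.end())
      return It->second;
    std::uint32_t Index = static_cast<std::uint32_t>(Offsets.size());
    Offsets.push_back(Offset);
    IndexOf.emplace(Offset, Index);
    return Index;
  }
  std::uint64_t offsetAt(std::uint32_t Index) const { return Offsets.at(Index); }
  std::size_t uniqueCount() const { return Offsets.size(); }

  // The question posed above: do the used locations fit in 2^31 indices?
  bool fitsIn31Bits() const { return Offsets.size() <= (1ull << 31); }
};

}  // namespace sketch
```

One caveat with this kind of scheme: indices are assigned in order of first use, not in source order, so any code that compares two locations to decide which came first would have to compare the looked-up offsets rather than the indices.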

The other cases mentioned in the thread you linked to are also worth considering.  In the current Boost VMD case, the problem results from macro expansions, somehow.  In the previous thread, the cause was extremely frequent use of large unguarded #includes (but which might involve macro expansions too, or somehow otherwise result in large chunks of bytes never referenced via SourceLocations in the AST).

However someone else in that thread encountered problems while working on modules, which seems not to involve macros/preprocessor buffers at all, and so might be the most troubling case, and might indicate that even if macros are the problem here, the only long term solution is to allow the user to select 64-bit SourceLocations — unless there are *also* large chunks of bytes in the imported-from-AST buffers that do not contain SourceLocations.  So again would be interesting to know the number of AST-used unique SourceLocations in those other cases, when we run out of them.


 
 
From: David Rector <[hidden email]> 
Sent: Wednesday, February 3, 2021 12:33 PM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Not sure if this helps but: I didn’t think about the boost file you were trying to compile at first, but now that I see it involves the boost preprocessor stuff, I bet the issue involves code somewhere that is written like 
 
`BOOST_PP_IF(condition, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B)`
 
which should instead be written
 
`BOOST_PP_CAT(HUGEMACROEXPANSION, BOOST_PP_IF(0, _A, _B))`
 
The former results in two huge expansions, one of which is thrown away, whereas the latter only results in one expansion.
 
I recall writing reasonably small but complex boost pp code that would take like 15 minutes just to preprocess, and now that I think about it, it might have resulted in the "ran out of source locations" a few times, before I finally figured out to always write those conditionals the second way.  
 
Not sure if that’s the issue, but it’s a place to look.
 
Dave
 


On Feb 3, 2021, at 3:10 PM, Keane, Erich <[hidden email]> wrote:
 
Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.
 
>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. 
This seems like a useful and cheap thing to do, if find this is a legitimate issue, I’ll see if I can put a patch together to do this one.
 
>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
 
I don’t have a great idea on how to do this, or how we use this, but based on your description it seems worth-while.  At least in non-module TUs I’d assume that the “imported-from-ast-file” is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.
 
 
 
From: Richard Smith <[hidden email]> 
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:
Presumably, yes, we could hit that limit, and that limit is closer than we’d think.
 
Based on my understanding though, we currently stores the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.
 
I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 
 
One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 
 
I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.
 
Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).
 
That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).
 
My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?
 
We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.
 
One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.
 
If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.
 
There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
 
From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.
 
On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:
 
Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a previate method.
 
That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?
 
 
 
From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.
 
Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).
 
This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.




On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:
 
 
I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:
#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>
 
Now causes us to run out of source locations, hitting:
 
/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.
 
 
From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how often it is stored in the AST (multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.
 
So, I’d like to restart this discussion: we need to find a way to make our compiler support more source locations, as I can’t imagine this is going to be the only time we run into this.
 
-Erich
 
>>>>>>>>>>>>>>>>>>>> 
Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general, my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week, but if needed I can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful process, and the only thing we would get is errors without any location information. That does not sound like a great trade-off. We’ll go experiment a bit with the size impact of using 64 bits and will attempt to take the route of a cmake configuration option.
 
If there are people that have already done some experiments on the size impact of 64-bit locations, we welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
> Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs; after that, make the lexer output the <invalid location> special value for all subsequent tokens.”)

Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.

> Not a nice user experience, but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location, or something that only gives the location of the error and not the include stack, under an option or using some kind of heuristic to detect that things go haywire).
>
> As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.

> Thanks,
> Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values; one bit is a flag, which leaves
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
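
As a rough back-of-the-envelope check of the claim above, the following sketch computes how many inclusions of a header fit into a 2^31 address space if every inclusion reserves the full file size. The 2 MiB header size is a hypothetical figure chosen for illustration, and the function name is made up for this example.

```cpp
#include <cstdint>

// Size of the source location space described above: 32 bits minus one flag.
constexpr std::uint64_t SpaceSize = std::uint64_t(1) << 31;

// Number of inclusions that fit if each one reserves FileBytes offsets.
constexpr std::uint64_t inclusionsUntilOverflow(std::uint64_t FileBytes) {
  return SpaceSize / FileBytes;
}

// Hypothetical 2 MiB MemMap.h: the space is exhausted after 1024 inclusions,
// before counting any other file in the translation unit.
constexpr std::uint64_t HeaderBytes = 2 * 1024 * 1024;
static_assert(inclusionsUntilOverflow(HeaderBytes) == 1024,
              "2^31 / 2^21 == 2^10");
```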
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
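
The following sketch illustrates the general idea of such a two-mode encoding. It is not GCC's actual layout: the bit widths, the flag bit, and the names `LineCol`, `encode` and `decode` are all assumptions made for illustration.

```cpp
#include <cstdint>

// Column 0 means "column unknown" in this sketch.
struct LineCol {
  std::uint32_t Line;
  std::uint32_t Column;
};

constexpr std::uint32_t LineOnlyFlag = 1u << 31; // hypothetical mode bit
constexpr std::uint32_t ColumnBits = 7;          // hypothetical column width
constexpr std::uint32_t MaxColumn = (1u << ColumnBits) - 1;
constexpr std::uint32_t MaxPackedLine = (1u << (31 - ColumnBits)) - 1;

inline std::uint32_t encode(LineCol P) {
  // Past the line threshold, give up the column and store only the line.
  // Lines that do not fit in 31 bits are truncated here; that is the case
  // where option (2) in the list below would take over.
  if (P.Line > MaxPackedLine)
    return LineOnlyFlag | (P.Line & ~LineOnlyFlag);
  // Columns that do not fit are recorded as unknown.
  std::uint32_t Col = P.Column <= MaxColumn ? P.Column : 0;
  return (P.Line << ColumnBits) | Col;
}

inline LineCol decode(std::uint32_t V) {
  if (V & LineOnlyFlag)
    return {V & ~LineOnlyFlag, 0};
  return {V >> ColumnBits, V & MaxColumn};
}
```

The point of the design is that precision degrades gradually (first the column is lost, then nothing more until the line counter itself runs out) instead of failing at a fixed byte count.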
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

> but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  

 

I think we still need to keep non-AST source locations around, since they are necessary to print the ‘expanded from macro’ notes.  So even the ones that aren’t in the AST get referenced anyway, and we would need some way of differentiating the two.  I wonder if we could create a 64-bit “large source location” that is converted, when added to the AST, into a “small source location” that uses the indexing feature.
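
One way to picture that conversion is an interning table: the 64-bit value is looked up (or appended) when a location enters the AST, and the AST stores only the 32-bit handle. The sketch below is purely illustrative; `LocInterner` and its methods are hypothetical names, not existing Clang API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical table mapping 64-bit "large" locations to 32-bit handles.
class LocInterner {
  std::vector<std::uint64_t> Large;                        // handle -> location
  std::unordered_map<std::uint64_t, std::uint32_t> Index;  // location -> handle

public:
  // Returns the existing handle for Loc, or creates a new one.
  std::uint32_t intern(std::uint64_t Loc) {
    auto It = Index.find(Loc);
    if (It != Index.end())
      return It->second;
    assert(Large.size() < UINT32_MAX && "out of 32-bit handles");
    std::uint32_t Handle = static_cast<std::uint32_t>(Large.size());
    Large.push_back(Loc);
    Index.emplace(Loc, Handle);
    return Handle;
  }

  // Converts a handle stored in the AST back into the large location.
  std::uint64_t resolve(std::uint32_t Handle) const { return Large.at(Handle); }

  // Number of distinct locations that actually reached the AST.
  std::size_t uniqueCount() const { return Large.size(); }
};
```

With a table like this, the capacity limit becomes the number of unique locations stored in the AST rather than the number of bytes pushed through the preprocessor, which is why the count asked about below matters.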

 

> how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  


Unfortunately, I do not have a good feel for this.

 

> the only long term solution is to allow the user to select 64-bit SourceLocations

 

I can’t think of any way to do this that doesn’t mean we have to have two versions of a number of AST nodes in the compiler.  It feels like that would be a pretty massive inflation of our executable size, so I’m not sure what we’re willing to allow for that.

 

Another side-note: If we start using some sort of ‘index into a vector of real source locations’ thing, we would likely need to change how modules are emitted/retrieved, though presumably we could just have them contain the ‘large source location’ listed above.

 

From: David Rector <[hidden email]>
Sent: Thursday, February 4, 2021 7:12 AM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

 



On Feb 4, 2021, at 9:34 AM, Keane, Erich <[hidden email]> wrote:

 

I don’t know if it is related to that, but I had the reporter run -dD and it crashed with these two at the end:

 

#define TUPLE_IS_VALID_ARRAY_E (2,(3,4))

#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP

 

So presumably it is related to the boost-preprocessor that you mentioned, however from a quick googling of the source code, that macro is:

#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP_NIL)

 

So there isn’t any huge expansion here.  That said, since this is just a header include FROM boost, I’d expect us to be able to compile it.

 

Agreed, Boost probably isn’t doing unnecessary expansions.

 

However, it is possible it is doing a lot of nested expansions that are necessary for it to implement its logic, but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  

 

If these ultimately-unused expansions (e.g. expansions into arguments of other macros, which cannot be used in the AST) are added to the concatenation of buffers through which a SourceLocation must specify an offset, then there may be many more bytes in the buffers than SourceLocations in the AST when dealing with extremely heavy macro usage such as with the Boost VMD/PP stuff.

 

It is hard to wade through the Preprocessor/Lexer details to know if this is the case, so an easier question is just: how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  If the number of AST-used SourceLocations exceeds 2^31, the only answer is 64-bit SourceLocations.  If it is significantly under that, an indexing solution would work, and might well give the best bang for the buck.

 

The other cases mentioned in the thread you linked to are also worth considering.  In the current Boost VMD case, the problem results from macro expansions, somehow.  In the previous thread, the cause was extremely frequent use of large unguarded #includes (but which might involve macro expansions too, or somehow otherwise result in large chunks of bytes never referenced via SourceLocations in the AST).

 

However, someone else in that thread encountered problems while working on modules, which seems not to involve macros/preprocessor buffers at all, and so might be the most troubling case. It might indicate that even if macros are the problem here, the only long term solution is to allow the user to select 64-bit SourceLocations, unless there are *also* large chunks of bytes in the imported-from-AST buffers that do not contain SourceLocations.  So again, it would be interesting to know the number of AST-used unique SourceLocations in those other cases, when we run out of them.

 



 

 

From: David Rector <[hidden email]> 
Sent: Wednesday, February 3, 2021 12:33 PM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

Not sure if this helps but: I didn’t think about the boost file you were trying to compile at first, but now that I see it involves the boost preprocessor stuff, I bet the issue involves code somewhere that is written like 

 

`BOOST_PP_IF(condition, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B)`

 

which should instead be written

 

`BOOST_PP_CAT(HUGEMACROEXPANSION, BOOST_PP_IF(condition, _A, _B))`

 

The former results in two huge expansions, one of which is thrown away, whereas the latter only results in one expansion.
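
The difference can be reproduced without Boost. The sketch below uses small hand-written stand-ins (`PP_CAT`, `PP_IF`, `BIG_A`, `BIG_B` are hypothetical names, and the "huge" expansions are tiny expressions) to show both styles producing the same result. In the first style both branches are macro-expanded while being passed as arguments; in the second only a short suffix is selected, and the chosen macro is expanded once.

```cpp
// Two-step concatenation so that arguments are macro-expanded before pasting.
#define PP_CAT_I(a, b) a##b
#define PP_CAT(a, b) PP_CAT_I(a, b)

// Minimal conditional: PP_IF(1, t, f) yields t, PP_IF(0, t, f) yields f.
#define PP_IF_0(t, f) f
#define PP_IF_1(t, f) t
#define PP_IF(c, t, f) PP_CAT(PP_IF_, c)(t, f)

// Stand-ins for the two huge expansions.
#define BIG_A (1 + 1 + 1)
#define BIG_B (2 + 2 + 2)

// First style: BIG_A and BIG_B are both expanded, then one is discarded.
#define EAGER(c) PP_IF(c, BIG_A, BIG_B)

// Second style: only the suffix A or B is selected, then BIG_A or BIG_B is
// formed by concatenation and expanded once.
#define LAZY(c) PP_CAT(BIG_, PP_IF(c, A, B))

static_assert(EAGER(1) == 3 && EAGER(0) == 6, "eager form selects correctly");
static_assert(LAZY(1) == 3 && LAZY(0) == 6, "lazy form selects correctly");
```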

 

I recall writing reasonably small but complex boost pp code that would take like 15 minutes just to preprocess, and now that I think about it, it might have resulted in the "ran out of source locations" a few times, before I finally figured out to always write those conditionals the second way.  

 

Not sure if that’s the issue, but it’s a place to look.

 

Dave

 




On Feb 3, 2021, at 3:10 PM, Keane, Erich <[hidden email]> wrote:

 

Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.

 

>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. 

This seems like a useful and cheap thing to do. If we find this is a legitimate issue, I’ll see if I can put a patch together to do this one.

 

>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

I don’t have a great idea on how to do this, or how we use this, but based on your description it seems worth-while.  At least in non-module TUs I’d assume that the “imported-from-ast-file” is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.

 

 

 

From: Richard Smith <[hidden email]> 
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 

 

I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).

 

That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 

We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.

 

One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.

 

If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.

 

There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
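
The capacity figures in the two paragraphs above can be written out as a short calculation. This is only arithmetic on the numbers stated in this message (one flag bit, then a 50/50 split); the constant and function names are made up for this sketch, and the "best case" assumes a translation unit that imports nothing.

```cpp
#include <cstdint>

constexpr std::uint64_t TotalSpace = std::uint64_t(1) << 32; // all 32 bits
constexpr std::uint64_t AfterMacroBit = TotalSpace / 2;      // 2 GB addressable
constexpr std::uint64_t LocalToday = AfterMacroBit / 2;      // 1 GB local region

// Dropping the 'is macro location' bit doubles the local region.
constexpr std::uint64_t LocalNoMacroBit = TotalSpace / 2;
// A dynamic split can at best hand the whole space to local locations.
constexpr std::uint64_t LocalBestCase = TotalSpace;

static_assert(LocalNoMacroBit == 2 * LocalToday, "first doubling");
static_assert(LocalBestCase == 4 * LocalToday, "second doubling, best case");

// Both cheap fixes together only help inputs within 4x of today's limit.
constexpr bool fitsAfterCheapFixes(std::uint64_t BytesNeeded) {
  return BytesNeeded <= LocalBestCase;
}
```

A compilation that needs, say, 3 GB of location space would fit after both changes, while one whose usage grows asymptotically would still overflow.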

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

 

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

 

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?

 

 

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.





On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how much it is stored in the AST( Multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.

 

SO, I’d like to re-start the discussion into this, we need to find a way to make our compiler able to support more source-locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erihc 

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths I was hitting oveflow issues and tried to keep as much as possible using 32bit encodings. Is there an ETA for when you will start investigating, I am out of town for the next week but if needed can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful processes, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.
 
If there are people that have already done some experiments on the size impact of 64-bits location pointers, we’re welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and other.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
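
To put rough numbers on this, here is a back-of-the-envelope sketch. The 4 MiB file size is an assumed figure chosen only for illustration (the text above just says "several MBs"), and `inclusionsUntilOverflow` is a hypothetical helper, not a Clang function. It assumes each inclusion reserves exactly the file size and that nothing else in the TU uses the space.

```cpp
#include <cstdint>

// Source location space as described above: 32 bits minus one flag bit.
constexpr std::uint64_t kLocationSpace = std::uint64_t{1} << 31;

// Hypothetical helper: how many inclusions of a file of `fileBytes` bytes fit
// into `space`, assuming every inclusion reserves the full file size.
constexpr std::uint64_t inclusionsUntilOverflow(std::uint64_t fileBytes,
                                                std::uint64_t space) {
  return space / fileBytes;
}

// Assumed example size: a 4 MiB MemMap.h.
constexpr std::uint64_t kAssumedMemMapBytes = std::uint64_t{4} * 1024 * 1024;
constexpr std::uint64_t kMaxInclusions =
    inclusionsUntilOverflow(kAssumedMemMapBytes, kLocationSpace);
```

Under those assumptions the space is gone after 512 inclusions, well short of "several thousand".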
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
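
As a rough illustration of the idea (not of GCC's real implementation), the toy encoding below packs a line and a column into 32 bits and falls back to a line-only value past a threshold. The bit widths, the flag bit and all names are assumptions made for this sketch.

```cpp
#include <cstdint>

// Toy encoding for illustration only. This is NOT GCC's real layout; the bit
// widths and the flag bit below are assumptions.
constexpr std::uint32_t kColumnBits = 8;
constexpr std::uint32_t kMaxColumn = (std::uint32_t{1} << kColumnBits) - 1;
constexpr std::uint32_t kLineThreshold = std::uint32_t{1} << (31 - kColumnBits);
constexpr std::uint32_t kLineOnlyFlag = std::uint32_t{1} << 31;

struct DecodedLoc {
  std::uint32_t line;
  std::uint32_t column;  // 0 when the column was not tracked
  bool hasColumn;
};

// Below the threshold, store (line, column); otherwise store the line only.
inline std::uint32_t encodeLoc(std::uint32_t line, std::uint32_t column) {
  if (line < kLineThreshold && column <= kMaxColumn)
    return (line << kColumnBits) | column;
  return kLineOnlyFlag | (line & ~kLineOnlyFlag);
}

inline DecodedLoc decodeLoc(std::uint32_t value) {
  if (value & kLineOnlyFlag)
    return DecodedLoc{value & ~kLineOnlyFlag, 0, false};
  return DecodedLoc{value >> kColumnBits, value & kMaxColumn, true};
}
```

The point of the comparison is that a scheme like this consumes location space per line rather than per byte, so re-including a multi-megabyte file costs far less of it.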
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev


On Feb 4, 2021, at 10:23 AM, Keane, Erich <[hidden email]> wrote:

> but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  
 
I think we still need to keep non-ast source locations around, since they are necessary to print the ‘expanded from macro’ notes.  So even ones that aren’t in the AST get referenced anyway.  So we would need some way of differentiating the two.  I wonder if we would be able to create a 64-bit “large source location” that converted to a “small source location” when added to the AST that used the indexing feature.
 
> how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  

This I do not have a good feel of unfortunately.

I think in this case, the macros probably create very few if any AST nodes at all.  Here’s a link to the problematic code, I think it is pure macro wizardry:

So if our solution allowed passing around 64-bit SourceLocations in the preprocessing stage, and only converting them to 32-bit (interpreted as an index to 64-bit data where necessary, or all the time) when constructing an AST node, I am pretty sure that would solve the problem, at least in this case.

 
>, the only long term solution is to allow the user to select 64-bit SourceLocations
 
I can’t think of any way to do this that doesn’t mean we have to have 2x of a number of AST nodes in the compiler.  It feels like that would be a pretty massive inflation of our executable size, so I’m not sure what we’re willing to allow for that.  

Agreed, but if the AST can get to be above that size (disregarding template instantiation nodes), that is the only answer I think.  But somehow I doubt that is happening, or that any case in which that is happening is reasonable code.

 
Another side-note: If we start using some sort of ‘index into a vector of real source locations’ thing, we would likely need to change how modules are emitted/retrieved, though presumably we could just have them contain the ‘large source location’ listed above.

I know precious little about the modules implementation, but I would imagine we could store the vector<uint64_t> in the module as well - or store the big locations, either way.

Also, I wonder whether, when "Ran out of SourceLocations" is encountered with modules, the problem might be that template instantiation data within the modules is occupying large chunks of bytes, such that the problem could again be resolved with an indexing solution that only ever refers to the SourceLocations of their patterns (i.e. reducing the number of unique indices to keep it under 2^31).

 
From: David Rector <[hidden email]> 
Sent: Thursday, February 4, 2021 7:12 AM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
 


On Feb 4, 2021, at 9:34 AM, Keane, Erich <[hidden email]> wrote:
 
I don’t know if it is related to that, but I had the reporter run -dD and it crashed with these two at the end:
 
#define TUPLE_IS_VALID_ARRAY_E (2,(3,4))
#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP
 
So presumably it is related to the boost-preprocessor that you mentioned, however from a quick googling of the source code, that macro is:
#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP_NIL)
 
So there isn’t any huge expansion here.  That said, since this is just a header include FROM boost, I’d expect us to be able to compile it.
 
Agreed, Boost probably isn’t doing unnecessary expansions.
 
However, it is possible it is doing a lot of nested expansions that are necessary for it to implement its logic, but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  
 
If these ultimately-unused expansions (e.g. expansions into arguments of other macros, which cannot be used in the AST) are added to the concatenation of buffers through which a SourceLocation must specify an offset, then there may be many more bytes in the buffers than SourceLocations in the AST when dealing with extremely heavy macro usage such as with the Boost VMD/PP stuff.
 
Hard to wade through the Preprocessor/Lexer details to know if this is the case, so an easier question is just: how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  If the number of AST-used SourceLocations exceeds 2^31, the only answer is 64-bit SourceLocations.  If it is significantly under that, an indexing solution would work, and might well give the best bang for the buck.
 
The other cases mentioned in the thread you linked to are also worth considering.  In the current Boost VMD case, the problem results from macro expansions, somehow.  In the previous thread, the cause was extremely frequent use of large unguarded #includes (but which might involve macro expansions too, or somehow otherwise result in large chunks of bytes never referenced via SourceLocations in the AST).
 
However, someone else in that thread encountered problems while working on modules, which seems not to involve macros/preprocessor buffers at all, and so might be the most troubling case, and might indicate that even if macros are the problem here, the only long term solution is to allow the user to select 64-bit SourceLocations, unless there are *also* large chunks of bytes in the imported-from-AST buffers that do not contain SourceLocations.  So again, it would be interesting to know the number of AST-used unique SourceLocations in those other cases, when we run out of them.
 


 
 
From: David Rector <[hidden email]> 
Sent: Wednesday, February 3, 2021 12:33 PM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Not sure if this helps but: I didn’t think about the boost file you were trying to compile at first, but now that I see it involves the boost preprocessor stuff, I bet the issue involves code somewhere that is written like 
 
`BOOST_PP_IF(condition, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B)`
 
which should instead be written
 
`BOOST_PP_CAT(HUGEMACROEXPANSION, BOOST_PP_IF(0, _A, _B))`
 
The former results in two huge expansions, one of which is thrown away, whereas the latter only results in one expansion.
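
A self-contained sketch of the difference, using small stand-in macros rather than Boost itself. All `SKETCH_` names are hypothetical, the real Boost.PP macros are implemented differently, and the "huge" payloads are shrunk to trivial expressions.

```cpp
// Minimal stand-ins for BOOST_PP_CAT / BOOST_PP_IF (hypothetical names; the
// real Boost macros are implemented differently).
#define SKETCH_CAT_I(a, b) a##b
#define SKETCH_CAT(a, b) SKETCH_CAT_I(a, b)
#define SKETCH_IF_0(t, f) f
#define SKETCH_IF_1(t, f) t
#define SKETCH_IF(c, t, f) SKETCH_CAT(SKETCH_IF_, c)(t, f)

// Stand-ins for the two "huge" expansions.
#define SKETCH_PAYLOAD_A 1 + 1 + 1
#define SKETCH_PAYLOAD_B 2 + 2 + 2

// First form: both payloads are macro-expanded as arguments before SKETCH_IF
// picks one, so the discarded expansion is still performed.
constexpr int firstForm = SKETCH_IF(1, SKETCH_PAYLOAD_A, SKETCH_PAYLOAD_B);

// Second form: only the suffix is selected; the payload name is pasted
// together afterwards, so only the chosen payload is ever expanded.
constexpr int secondForm = SKETCH_CAT(SKETCH_PAYLOAD_, SKETCH_IF(1, A, B));
```

Both forms produce the same tokens; the difference is only in how much intermediate expansion the preprocessor has to perform and track.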
 
I recall writing reasonably small but complex boost pp code that would take like 15 minutes just to preprocess, and now that I think about it, it might have resulted in the "ran out of source locations" a few times, before I finally figured out to always write those conditionals the second way.  
 
Not sure if that’s the issue, but it’s a place to look.
 
Dave
 



On Feb 3, 2021, at 3:10 PM, Keane, Erich <[hidden email]> wrote:
 
Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.
 
>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. 
This seems like a useful and cheap thing to do. If we find this is a legitimate issue, I’ll see if I can put a patch together to do this one.
 
>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
 
I don’t have a great idea on how to do this, or how we use this, but based on your description it seems worth-while.  At least in non-module TUs I’d assume that the “imported-from-ast-file” is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.
 
 
 
From: Richard Smith <[hidden email]> 
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:
Presumably, yes, we could hit that limit, and that limit is closer than we’d think.
 
Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.
 
I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 
 
One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 
 
I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.
 
Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).
 
That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).
 
My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?
 
We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.
 
One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.
 
If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.
 
There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
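
The capacity arithmetic behind these two options can be written out as follows. This is only a sketch of the numbers quoted above; the constant and function names are made up for illustration and do not correspond to anything in Clang.

```cpp
#include <cstdint>

// All values representable in a 32-bit SourceLocation.
constexpr std::uint64_t kTotal32 = std::uint64_t{1} << 32;

// Today, per the description above: one bit is the 'is macro location' flag,
// and the remaining 2GB is split 50/50 between local and imported locations.
constexpr std::uint64_t kAddressable = kTotal32 / 2;          // 2GB
constexpr std::uint64_t kLocalFixedSplit = kAddressable / 2;  // 1GB

// Option 1: drop the flag bit but keep the 50/50 split.
constexpr std::uint64_t kLocalNoMacroBit = kTotal32 / 2;      // 2GB

// Option 2 (hypothetical helper): with a dynamic split, the local region can
// use whatever the imported region does not need.
constexpr std::uint64_t localWithDynamicSplit(std::uint64_t total,
                                              std::uint64_t importedBytes) {
  return total - importedBytes;
}
```

Both options together give at most a 4x gain for a TU that imports nothing, which is a constant factor and so does not help with an asymptotic problem.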
 
From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.
 
On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:
 
Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.
 
That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?
 
 
 
From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.
 
Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).
 
This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.
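
A minimal sketch of how such an encoding might look. `SketchLoc` and `BigOffsetTable` are hypothetical names standing in for SourceLocation and the relevant part of SourceManager, the bit positions are assumptions, and `std::uint64_t` is used in place of `uintptr_t` for clarity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kMacroBit = std::uint32_t{1} << 31;  // existing flag
constexpr std::uint32_t kBigBit = std::uint32_t{1} << 30;    // proposed "IsBigLoc"
constexpr std::uint32_t kPayloadMask = kBigBit - 1;          // remaining 30 bits

// Hypothetical stand-in for the vector that would live in SourceManager.
class BigOffsetTable {
public:
  std::uint32_t add(std::uint64_t offset) {
    Offsets.push_back(offset);
    return static_cast<std::uint32_t>(Offsets.size() - 1);
  }
  std::uint64_t get(std::uint32_t index) const { return Offsets.at(index); }

private:
  std::vector<std::uint64_t> Offsets;
};

class SketchLoc {
public:
  static SketchLoc make(std::uint64_t offset, bool isMacro, BigOffsetTable &table) {
    SketchLoc loc;
    if (offset <= kPayloadMask) {
      loc.ID = static_cast<std::uint32_t>(offset);  // small: stored inline
    } else {
      std::uint32_t index = table.add(offset);      // big: stored in the table
      assert(index <= kPayloadMask && "ran out of big-location indices");
      loc.ID = kBigBit | index;
    }
    if (isMacro)
      loc.ID |= kMacroBit;
    return loc;
  }

  bool isMacroID() const { return (ID & kMacroBit) != 0; }
  bool isBigLoc() const { return (ID & kBigBit) != 0; }

  // Needs the table, just as getOffset would need the SourceManager.
  std::uint64_t getOffset(const BigOffsetTable &table) const {
    std::uint32_t payload = ID & kPayloadMask;
    return isBigLoc() ? table.get(payload) : payload;
  }

private:
  std::uint32_t ID = 0;
};
```

The location itself stays 32 bits wide; only offsets that do not fit in 30 bits pay for a table entry and an extra lookup.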





On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:
 
 
I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:
#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>
 
Now causes us to run out of source locations, hitting:
 
/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.
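
For readers unfamiliar with that assertion, here is one reading of its two clauses, inferred from the assertion text alone. `hasRoomForExpansion` is a hypothetical free function written for this sketch; the parameter names are taken from the message above.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the two conditions in the assertion text.
inline bool hasRoomForExpansion(std::uint32_t NextLocalOffset,
                                std::uint32_t TokLength,
                                std::uint32_t CurrentLoadedOffset) {
  // Unsigned arithmetic wraps around on overflow instead of trapping.
  std::uint32_t End = NextLocalOffset + TokLength + 1;
  // Clause 1: the sum did not wrap past the end of the 32-bit range.
  bool NoWrap = End > NextLocalOffset;
  // Clause 2: the new local range stays at or below CurrentLoadedOffset.
  bool BelowLoaded = End <= CurrentLoadedOffset;
  return NoWrap && BelowLoaded;
}
```

Either clause failing produces the same "Ran out of source locations!" message, so the message alone does not say which limit was hit.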
 
 
From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how much it is stored in the AST( Multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.
 
SO, I’d like to re-start the discussion into this, we need to find a way to make our compiler able to support more source-locations, as I can’t imagine this is going to be the only time we run into this.
 
-Erich
 
>>>>>>>>>>>>>>>>>>>> 
Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week but if needed can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful process, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.
 
If there are people that have already done some experiments on the size impact of 64-bit location pointers, we’d welcome your insights. @Matt Asplund I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs; after that, make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com>
Cc: nd <nd at arm.com>; cfe-dev at lists.llvm.org
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

> So if our solution allowed passing around 64-bit SourceLocations in the preprocessing stage, and only converting them to 32-bit (interpreted as an index to 64-bit data where necessary, or all the time) when constructing an AST node, I am pretty sure that would solve the problem, at least in this case.

 

Right… So what I envision at the moment is:

 

PPSourceLocation: Does everything the current one does, so it is represented as a 63 bit location + 1 bit for macro/file discriminator.

ASTSourceLocation: This is ONLY a 32 bit unsigned index and contains no info nor functionality itself (besides conversions).  However, ASTSourceLocation and a SourceManager (which contains the lookup vector) gets you the PPSourceLocation, where we do all the work. 

 

Lex/Parse deal with PPSourceLocation.  When it comes to store in the AST, we store the ASTSourceLocation and create an index in the SourceManager’s vector<uint64_t>.  When the AST needs to do something with the location besides copying/storing it, it does the conversion.  The conversion from PPSourceLocation->ASTSourceLocation might be expensive (since we would presumably want to de-dupe the vector), but ASTSourceLocation->PPSourceLocation is just a vector-lookup.
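
A minimal sketch of that split, with a de-duplicating table standing in for the vector in SourceManager. `LocationTable`, `intern` and `lookup` are hypothetical names, and the hash map used for de-duplication is just one possible choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// 64-bit location used by Lex/Parse: 63-bit offset plus a macro/file bit.
struct PPSourceLocation {
  std::uint64_t Raw;
};

// 32-bit handle stored in AST nodes: an index, nothing more.
struct ASTSourceLocation {
  std::uint32_t Index;
};

// Hypothetical stand-in for the lookup vector owned by SourceManager.
class LocationTable {
public:
  // PP -> AST: the potentially expensive direction, because of de-duplication.
  ASTSourceLocation intern(PPSourceLocation Loc) {
    auto It = IndexOf.find(Loc.Raw);
    if (It != IndexOf.end())
      return ASTSourceLocation{It->second};
    std::uint32_t Index = static_cast<std::uint32_t>(Locs.size());
    Locs.push_back(Loc.Raw);
    IndexOf.emplace(Loc.Raw, Index);
    return ASTSourceLocation{Index};
  }

  // AST -> PP: a plain vector lookup.
  PPSourceLocation lookup(ASTSourceLocation Loc) const {
    return PPSourceLocation{Locs.at(Loc.Index)};
  }

  std::size_t size() const { return Locs.size(); }

private:
  std::vector<std::uint64_t> Locs;
  std::unordered_map<std::uint64_t, std::uint32_t> IndexOf;
};
```

In this sketch the table grows only with the number of distinct locations that reach the AST, not with the number of bytes pushed through the preprocessor.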

 

> Agree, but if the AST  can get to be above that size (disregarding template instantiation nodes) that is the only answer I think.  But somehow I doubt that is happening, or that any case in which that is happening is reasonable code.

At the moment, we aren’t getting to the 2^31-1 ast-source-locations, since obviously we would run out of offsets way before that would happen (assuming the de-dupe step above).  But the above system buys us at least 2x*, but likely much more.  At the moment, this is likely enough to make me happy.

 

> I know precious little about the modules implementation, but I would imagine we could store the vector<uint64_t> in the module as well - or store the big locations, either way.

I hadn’t considered storing the array itself in the module header.  I know about as much as you do (besides the ASTReader/ASTWriter interface), but my initial thought was to just convert to big-locations first, then store those.  Hopefully Richard can comment.

 

> Also, I wonder whether, when "Ran out of SourceLocations" is encountered with modules, the problem might be that template instantiation data within the modules is occupying large chunks of bytes, such that the problem could again be resolved with an indexing solution that only ever refers to the SourceLocations of their patterns (i.e. reducing the number of unique indices to keep it under 2^31).

Agreed…  I think the above/below indexing solution buys us quite a bit of headroom.  It isn’t as good as true 64 bit source-locations, but the work to do this would make the eventual 64-bit switch quite a bit easier I would expect (since it would just be removing the vector and making the conversion be a simple copy).

 

So I guess the open questions are:

  1. Is this acceptable to **Richard**/etc?
  2. Does anyone want to help 😊

 

*My assumption, which I think holds, is that ASTSourceLocations would end up representing significantly less than half of the actual locations.  In reality, I’m guessing this system buys us closer to 5x than it does 2x, simply because I’m guessing the average identifier/token/etc is closer to 5 bytes than 2. Even a trivial example which is a ‘bad’ case (int foo(int i, int j, int k) { return i + j + k;}) has 50 characters, but I believe ~18 source locations in the AST.

 

From: David Rector <[hidden email]>
Sent: Thursday, February 4, 2021 7:57 AM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

 



On Feb 4, 2021, at 10:23 AM, Keane, Erich <[hidden email]> wrote:

 

> but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  

 

I think we still need to keep non-ast source locations around, since they are necessary to print the ‘expanded from macro’ notes.  So even ones that aren’t in the AST get referenced anyway.  So we would need some way of differentiating the two.  I wonder if we would be able to create a 64-bit “large source location” that converted to a “small source location” when added to the AST that used the indexing feature.

 

> how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  


This I do not have a good feel of unfortunately.

 

I think in this case, the macros probably create very few if any AST nodes at all.  Here’s a link to the problematic code, I think it is pure macro wizardry:

 

So if our solution allowed passing around 64-bit SourceLocations in the preprocessing stage, and only converted them to 32-bit (interpreted as an index to 64-bit data where necessary, or all the time) when constructing an AST node, I am pretty sure that would solve the problem, at least in this case.



 

>, the only long term solution is to allow the user to select 64-bit SourceLocations

 

I can’t think of any way to do this that doesn’t mean we have to have 2x of a number of AST nodes in the compiler.  It feels like that would be a pretty massive inflation of our executable size, so I’m not sure what we’re willing to allow for that.  

 

Agree, but if the AST  can get to be above that size (disregarding template instantiation nodes) that is the only answer I think.  But somehow I doubt that is happening, or that any case in which that is happening is reasonable code.

 

 

Another side-note: If we start using some sort of ‘index into a vector of real source locations’ thing, we would likely need to change how modules are emitted/retrieved, though presumably we could just have them contain the ‘large source location’ listed above.

 

I know precious little about the modules implementation, but I would imagine we could store the vector<uint64_t> in the module as well - or store the big locations, either way.

 

Also, I wonder if that, when "Ran out of SourceLocations" is encountered with modules, the problem might be that template instantiation data within the modules are occupying large chunks of bytes, such that the problem could again be resolved with an indexing solution  that only ever refers to the SourceLocations of their patterns (i.e. reducing the number of unique indices to keep it under 2^31).



 

From: David Rector <[hidden email]> 
Sent: Thursday, February 4, 2021 7:12 AM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

 




On Feb 4, 2021, at 9:34 AM, Keane, Erich <[hidden email]> wrote:

 

I don’t know if it is related to that, but I had the reporter run -dD and it crashed with these two at the end:

 

#define TUPLE_IS_VALID_ARRAY_E (2,(3,4))

#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP

 

So presumably it is related to the boost-preprocessor that you mentioned, however from a quick googling of the source code, that macro is:

#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP_NIL)

 

So there isn’t any huge expansion here.  That said, since this is just a header include FROM boost, I’d expect us to be able to compile it.

 

Agreed, Boost probably isn’t doing unnecessary expansions.

 

However, it is possible it is doing a lot of nested expansions that are necessary for it to implement its logic, but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  

 

If these ultimately-unused expansions (e.g. expansions into arguments of other macros, which cannot be used in the AST) are added to the concatenation of buffers through which a SourceLocation must specify an offset, then there may be many more bytes in the buffers than SourceLocations in the AST when dealing with extremely heavy macro usage such as with the Boost VMD/PP stuff.

 

Hard to wade through the Preprocessor/Lexer details to know if this is the case, so an easier question is just: how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  If the number of AST-used SourceLocations exceeds 2^31, the only answer is 64-bit SourceLocations.  If it is significantly under that, an indexing solution would work, and might well give the best bang for the buck.

 

The other cases mentioned in the thread you linked to are also worth considering.  In the current Boost VMD case, the problem results from macro expansions, somehow.  In the previous thread, the cause was extremely frequent use of large unguarded #includes (but which might involve macro expansions too, or somehow otherwise result in large chunks of bytes never referenced via SourceLocations in the AST).

 

However someone else in that thread encountered problems while working on modules, which seems not to involve macros/preprocessor buffers at all, and so might be the most troubling case, and might indicate that even if macros are the problem here, the only long term solution is to allow the user to select 64-bit SourceLocations — unless there are *also* large chunks of bytes in the imported-from-AST buffers that do not contain SourceLocations.  So again would be interesting to know the number of AST-used unique SourceLocations in those other cases, when we run out of them.

 




 

 

From: David Rector <[hidden email]> 
Sent: Wednesday, February 3, 2021 12:33 PM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

Not sure if this helps but: I didn’t think about the boost file you were trying to compile at first, but now that I see it involves the boost preprocessor stuff, I bet the issue involves code somewhere that is written like 

 

`BOOST_PP_IF(condition, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B)`

 

which should instead be written

 

`BOOST_PP_CAT(HUGEMACROEXPANSION, BOOST_PP_IF(0, _A, _B))`

 

The former results in two huge expansions, one of which is thrown away, whereas the latter only results in one expansion.

 

I recall writing reasonably small but complex boost pp code that would take like 15 minutes just to preprocess, and now that I think about it, it might have resulted in the "ran out of source locations" a few times, before I finally figured out to always write those conditionals the second way.  

 

Not sure if that’s the issue, but it’s a place to look.

 

Dave

 





On Feb 3, 2021, at 3:10 PM, Keane, Erich <[hidden email]> wrote:

 

Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.

 

>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. 

This seems like a useful and cheap thing to do. If we find this is a legitimate issue, I’ll see if I can put a patch together to do this one.

 

>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

I don’t have a great idea on how to do this, or how we use this, but based on your description it seems worth-while.  At least in non-module TUs I’d assume that the “imported-from-ast-file” is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.

 

 

 

From: Richard Smith <[hidden email]> 
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 

 

I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).

 

That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 

We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.

 

One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.

 

If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.

 

There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

 

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

 

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?

 

 

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.






On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how much it is stored in the AST( Multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.

 

SO, I’d like to re-start the discussion into this, we need to find a way to make our compiler able to support more source-locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erich

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week, but if needed I can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful process, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.
 
If there are people that have already done some experiments on the size impact of 64-bit location pointers, we welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev

So 1 more thing:

Richard suggested that one way around this was to switch from ‘negative numbers are loaded, positive are from files’ to something with a split.


I believe based on my reading that we already DO that for SourceLocation.  They are represented by an unsigned, and we have a “CurrentLoadedOffset” that is the split between the two.  SO, we are already using as much room as we can for that.

 

The negative/positive discriminator Richard was mentioning seems to be the FileID, which is an index into 1 of 2 vectors (which contain the SLocEntrys, which store a list of macro expansions AND file + offset).  I don’t believe we are running into that limit however.  SO, I don’t think that buys us much.

 

I’ve also been looking into replacing the bit for IsMacro vs IsFile.  I believe that wouldn’t be too horrible, other than the need to access SourceManager.  It WOULD be somewhat expensive (since we end up binary searching the files list to get the FileID), but would buy us 2x the data we have now.

 

From: Keane, Erich
Sent: Thursday, February 4, 2021 8:18 AM
To: David Rector <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: RE: [cfe-dev] [RFC] Clang SourceLocation overflow

 

> So if our solution allowed passing around 64-bit SourceLocations in the preprocessing stage, and only converted them to 32-bit (interpreted as an index to 64-bit data where necessary, or all the time) when constructing an AST node, I am pretty sure that would solve the problem, at least in this case.

 

Right… So what I envision at the moment is:

 

PPSourceLocation: Does everything the current one does, so it is represented as a 63 bit location + 1 bit for macro/file discriminator.

ASTSourceLocation: This is ONLY a 32 bit unsigned index and contains no info nor functionality itself (besides conversions).  However, ASTSourceLocation and a SourceManager (which contains the lookup vector) gets you the PPSourceLocation, where we do all the work. 

 

Lex/Parse deal with PPSourceLocation.  When it comes to store in the AST, we store the ASTSourceLocation and create an index in the SourceManager’s vector<uint64_t>.  When the AST needs to do something with the location besides copying/storing it, it does the conversion.  The conversion from PPSourceLocation->ASTSourceLocation might be expensive (since we would presumably want to de-dupe the vector), but ASTSourceLocation->PPSourceLocation is just a vector-lookup.

 

> Agree, but if the AST  can get to be above that size (disregarding template instantiation nodes) that is the only answer I think.  But somehow I doubt that is happening, or that any case in which that is happening is reasonable code.

At the moment, we aren’t getting to the 2^31-1 ast-source-locations, since obviously we would run out of offsets way before that would happen (assuming the de-dupe step above).  The above system buys us at least 2x*, and likely much more.  At the moment, this is likely enough to make me happy.

 

> I know precious little about the modules implementation, but I would imagine we could store the vector<uint64_t> in the module as well - or store the big locations, either way.

I hadn’t considered storing the array itself in the module header.  I know about as much as you do (besides the ASTReader/ASTWriter interface), but my initial thought was to just convert to big-locations first, then store those.  Hopefully Richard can comment.

 

> Also, I wonder if that, when "Ran out of SourceLocations" is encountered with modules, the problem might be that template instantiation data within the modules are occupying large chunks of bytes, such that the problem could again be resolved with an indexing solution  that only ever refers to the SourceLocations of their patterns (i.e. reducing the number of unique indices to keep it under 2^31).

Agreed…  I think the above/below indexing solution buys us quite a bit of headroom.  It isn’t as good as true 64 bit source-locations, but the work to do this would make the eventual 64-bit switch quite a bit easier I would expect (since it would just be removing the vector and making the conversion be a simple copy).

 

So I guess the open questions are:

  1. Is this acceptable to **Richard**/etc?
  2. Does anyone want to help 😊

 

*My assumption, which I think holds, is that ASTSourceLocations would end up representing significantly less than half of the actual locations.  In reality, I’m guessing this system buys us closer to 5x than it does 2x, simply because I’m guessing the average identifier/token/etc is closer to 5 bytes than 2. Even a trivial example which is a ‘bad’ case (int foo(int i, int j, int k) { return i + j + k;}) has 50 characters, but I believe ~18 source locations in the AST.

 

From: David Rector <[hidden email]>
Sent: Thursday, February 4, 2021 7:57 AM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

 

 

On Feb 4, 2021, at 10:23 AM, Keane, Erich <[hidden email]> wrote:

 

> but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  

 

I think we still need to keep non-ast source locations around, since they are necessary to print the ‘expanded from macro’ notes.  So even ones that aren’t in the AST get referenced anyway.  So we would need some way of differentiating the two.  I wonder if we would be able to create a 64-bit “large source location” that converted to a “small source location” when added to the AST that used the indexing feature.

 

> how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  


This I do not have a good feel of unfortunately.

 

I think in this case, the macros probably create very few if any AST nodes at all.  Here’s a link to the problematic code, I think it is pure macro wizardry:

 

So if our solution allowed passing 64-bit SourceLocations around in the preprocessing stage, and only converted them to 32-bit (interpreted as an index to 64-bit data where necessary, or all the time) when constructing an AST node, I am pretty sure that would solve the problem, at least in this case.

 

 

>, the only long term solution is to allow the user to select 64-bit SourceLocations

 

I can’t think of any way to do this that doesn’t mean we have to have 2x of a number of AST nodes in the compiler.  It feels like that would be a pretty massive inflation of our executable size, so I’m not sure what we’re willing to allow for that.  

 

Agree, but if the AST  can get to be above that size (disregarding template instantiation nodes) that is the only answer I think.  But somehow I doubt that is happening, or that any case in which that is happening is reasonable code.

 

 

Another side-note: If we start using some sort of ‘index into a vector of real source locations’ thing, we would likely need to change how modules are emitted/retrieved, though presumably we could just have them contain the ‘large source location’ listed above.

 

I know precious little about the modules implementation, but I would imagine we could store the vector<uint64_t> in the module as well - or store the big locations, either way.

 

Also, I wonder if that, when "Ran out of SourceLocations" is encountered with modules, the problem might be that template instantiation data within the modules are occupying large chunks of bytes, such that the problem could again be resolved with an indexing solution  that only ever refers to the SourceLocations of their patterns (i.e. reducing the number of unique indices to keep it under 2^31).

 

 

From: David Rector <[hidden email]> 
Sent: Thursday, February 4, 2021 7:12 AM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

 

 

On Feb 4, 2021, at 9:34 AM, Keane, Erich <[hidden email]> wrote:

 

I don’t know if it is related to that, but I had the reporter run -dD and it crashed with these two at the end:

 

#define TUPLE_IS_VALID_ARRAY_E (2,(3,4))

#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP

 

So presumably it is related to the boost-preprocessor that you mentioned, however from a quick googling of the source code, that macro is:

#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP_NIL)

 

So there isn’t any huge expansion here.  That said, since this is just a header include FROM boost, I’d expect us to be able to compile it.

 

Agreed, Boost probably isn’t doing unnecessary expansions.

 

However, it is possible it is doing a lot of nested expansions that are necessary for it to implement its logic, but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  

 

If these ultimately-unused expansions (e.g. expansions into arguments of other macros cannot be used in the AST) are added to the concatenation of buffers through which a SourceLocation must specify an offset, then there may be many more bytes in the buffers than SourceLocations in the AST when dealing with extremely heavy macro usage such as with the Boost VMD/PP stuff.

 

Hard to wade through the Preprocessor/Lexer details to know if this is the case, so an easier question is just: how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  If the number of AST-used SourceLocations exceeds 2^31, the only answer is 64-bit SourceLocations.  If it is significantly under that, an indexing solution would work, and might well give the best bang for the buck.

 

The other cases mentioned in the thread you linked to are also worth considering.  In the current Boost VMD case, the problem results from macro expansions, somehow.  In the previous thread, the cause was extremely frequent use of large unguarded #includes (but which might involve macro expansions too, or somehow otherwise result in large chunks of bytes never referenced via SourceLocations in the AST).

 

However someone else in that thread encountered problems while working on modules, which seems not to involve macros/preprocessor buffers at all, and so might be the most troubling case, and might indicate that even if macros are the problem here, the only long term solution is to allow the user to select 64-bit SourceLocations — unless there are *also* large chunks of bytes in the imported-from-AST buffers that do not contain SourceLocations.  So again would be interesting to know the number of AST-used unique SourceLocations in those other cases, when we run out of them.

 

 

 

 

From: David Rector <[hidden email]> 
Sent: Wednesday, February 3, 2021 12:33 PM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

Not sure if this helps but: I didn’t think about the boost file you were trying to compile at first, but now that I see it involves the boost preprocessor stuff, I bet the issue involves code somewhere that is written like 

 

`BOOST_PP_IF(condition, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B)`

 

which should instead be written

 

`BOOST_PP_CAT(HUGEMACROEXPANSION, BOOST_PP_IF(0, _A, _B))`

 

The former results in two huge expansions, one of which is thrown away, whereas the latter only results in one expansion.
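
The difference can be reproduced with a self-contained sketch. PP_IF and PP_CAT below are simplified stand-ins for the Boost.Preprocessor macros (an assumption; the real definitions differ), and BIG_A/BIG_B stand in for the huge expansions:

```cpp
// Simplified stand-ins, not the real Boost.Preprocessor definitions.
#define PP_CAT_I(a, b) a##b
#define PP_CAT(a, b) PP_CAT_I(a, b)
#define PP_IF_0(t, f) f
#define PP_IF_1(t, f) t
#define PP_IF(c, t, f) PP_CAT(PP_IF_, c)(t, f)

// Stand-ins for two very large macro expansions.
#define BIG_A 1 + 2 + 3
#define BIG_B 4 + 5 + 6

// Eager form: both BIG_A and BIG_B are fully expanded as macro arguments
// before PP_IF picks one, so the discarded branch is still expanded.
constexpr int eager = PP_IF(0, BIG_A, BIG_B);

// Lazy form: PP_IF only chooses between the short suffixes _A and _B;
// the chosen suffix is pasted onto BIG, and only BIG_B is ever expanded.
constexpr int lazy = PP_CAT(BIG, PP_IF(0, _A, _B));

static_assert(eager == lazy, "both forms select the same branch");
```

Both forms produce the same tokens; the only difference is how much intermediate expansion the preprocessor has to do on the way there.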

 

I recall writing reasonably small but complex boost pp code that would take like 15 minutes just to preprocess, and now that I think about it, it might have resulted in the "ran out of source locations" a few times, before I finally figured out to always write those conditionals the second way.  

 

Not sure if that’s the issue, but it’s a place to look.

 

Dave

 



On Feb 3, 2021, at 3:10 PM, Keane, Erich <[hidden email]> wrote:

 

Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.

 

>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. 

This seems like a useful and cheap thing to do; if we find this is a legitimate issue, I’ll see if I can put a patch together to do this one.

 

>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

I don’t have a great idea on how to do this, or how we use this, but based on your description it seems worth-while.  At least in non-module TUs I’d assume that the “imported-from-ast-file” is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.

 

 

 

From: Richard Smith <[hidden email]> 
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:

Presumably, yes, we could hit that limit, and that limit is closer than we’d think.

 

Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.

 

I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 

 

One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 

 

I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.

 

Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).

 

That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).

 

My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?

 

We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.

 

One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.

 

If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.

 

There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.

 

On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:

 

Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.

 

That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes  isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?

 

 

 

From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.

 

Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).
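
A minimal sketch of that encoding, assuming the bit layout described above. All names here (Loc, BigLocTable, the bit constants) are hypothetical and do not correspond to Clang's actual classes:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical bit layout: [macro bit][big bit][30-bit payload].
constexpr std::uint32_t kMacroBit = 1u << 31;
constexpr std::uint32_t kBigBit = 1u << 30;
constexpr std::uint32_t kPayloadMask = kBigBit - 1;

// Stands in for the vector of 64-bit offsets held by SourceManager.
class BigLocTable {
public:
  std::uint32_t add(std::uint64_t offset) {
    assert(offsets_.size() <= kPayloadMask && "index must fit in 30 bits");
    offsets_.push_back(offset);
    return static_cast<std::uint32_t>(offsets_.size() - 1);
  }
  std::uint64_t get(std::uint32_t index) const { return offsets_.at(index); }

private:
  std::vector<std::uint64_t> offsets_;
};

class Loc {
public:
  // Small offsets are stored inline; large ones go through the table.
  static Loc make(std::uint64_t offset, bool isMacro, BigLocTable &table) {
    std::uint32_t raw = isMacro ? kMacroBit : 0u;
    if (offset <= kPayloadMask)
      raw |= static_cast<std::uint32_t>(offset);
    else
      raw |= kBigBit | table.add(offset);
    return Loc(raw);
  }
  bool isMacro() const { return (raw_ & kMacroBit) != 0; }
  bool isBig() const { return (raw_ & kBigBit) != 0; }
  // Needs the table, mirroring getOffset(const SourceManager &) above.
  std::uint64_t getOffset(const BigLocTable &table) const {
    std::uint32_t payload = raw_ & kPayloadMask;
    return isBig() ? table.get(payload) : payload;
  }

private:
  explicit Loc(std::uint32_t raw) : raw_(raw) {}
  std::uint32_t raw_;
};

static_assert(sizeof(Loc) == 4, "the location itself stays 32 bits");
```

The sketch does not de-duplicate table entries, so every large offset costs one index; whether that is acceptable depends on how many distinct large locations actually occur.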

 

This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.




On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:

 

 

I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:

#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>

 

Now causes us to run out of source locations, hitting:

 

/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.

 

 

From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how much it is stored in the AST( Multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.

 

SO, I’d like to re-start the discussion into this, we need to find a way to make our compiler able to support more source-locations, as I can’t imagine this is going to be the only time we run into this.

 

-Erich

 

>>>>>>>>>>>>>>>>>>>> 

Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week but if needed can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful process, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.
 
If there are people that have already done some experiments on the size impact of 64-bit location pointers, we’d welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
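
To put rough numbers on this: the figures below (a 4 MiB header included 2000 times) are illustrative assumptions, since the text above only says "several MBs" and "several thousand times".

```cpp
#include <cstdint>

// Illustrative figures only, not measurements from a real code base.
constexpr std::uint64_t kHeaderBytes = 4ull * 1024 * 1024;
constexpr std::uint64_t kInclusions = 2000;

// One bit of the 32-bit location is a flag, leaving 2^31 offsets.
constexpr std::uint64_t kAddressSpace = 1ull << 31;

// Every #include reserves the full size of the file again.
constexpr std::uint64_t bytesReserved(std::uint64_t fileBytes,
                                      std::uint64_t count) {
  return fileBytes * count;
}

static_assert(bytesReserved(kHeaderBytes, kInclusions) > kAddressSpace,
              "repeated inclusion exhausts the offset space");
```

Under these assumptions the reserved space is roughly four times the available 2^31 offsets, before counting any other source in the TU.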
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)
 
Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).
 
Other suggestions are welcome.
 
I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.
 
Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 


Re: [RFC] Clang SourceLocation overflow

shirley breuer via cfe-dev


On Feb 4, 2021, at 1:24 PM, Keane, Erich <[hidden email]> wrote:

So 1 more thing:
Richard suggested one way about this was to switch from ‘negative numbers are loaded, positive are from files’ to something with a split.

I believe based on my reading that we already DO that for SourceLocation.  They are represented by an unsigned, and we have a “CurrentLoadedOffset” that is the split between the two.  SO, we are already using as much room as we can for that.
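
As a rough model of that layout (a sketch based on my reading of the assertion quoted earlier in the thread, not Clang's actual code; the struct and method names are hypothetical), local entries grow up from zero and loaded entries grow down from 2^31, so the split is wherever the two happen to meet:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the shared offset space.
struct OffsetSpace {
  std::uint32_t nextLocalOffset = 0;             // grows upward
  std::uint32_t currentLoadedOffset = 1u << 31;  // grows downward

  // Reserve `length` bytes plus one past-the-end position for a local
  // entry. Returns false where Clang would hit "Ran out of source
  // locations!".
  bool allocateLocal(std::uint32_t length, std::uint32_t &start) {
    std::uint64_t end = std::uint64_t{nextLocalOffset} + length + 1;
    if (end > currentLoadedOffset)
      return false;
    start = nextLocalOffset;
    nextLocalOffset = static_cast<std::uint32_t>(end);
    return true;
  }

  // Reserve `length` bytes for entries loaded from an AST file.
  bool allocateLoaded(std::uint32_t length, std::uint32_t &start) {
    assert(nextLocalOffset <= currentLoadedOffset);
    if (length > currentLoadedOffset - nextLocalOffset)
      return false;
    currentLoadedOffset -= length;
    start = currentLoadedOffset;
    return true;
  }
};
```

In this model neither side has a fixed quota: a TU with no imported AST files can use nearly all 2^31 offsets for local entries.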
 
The negative/positive discriminator Richard was mentioning seems to be the FileID, which is an index into 1 of 2 vectors (which contain the SLocEntrys, which store a list of macro expansions AND file + offset).  I don’t believe we are running into that limit however.  SO, I don’t think that buys us much.
 
I’ve also been looking into replacing the bit for IsMacro vs IsFile.  I believe that wouldn’t be too horrible, other than the need to access SourceManager.  It WOULD be somewhat expensive (since we end up binary searching the files list to get the FileID), but would buy us 2x the data we have now.
 
From: Keane, Erich 
Sent: Thursday, February 4, 2021 8:18 AM
To: David Rector <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: RE: [cfe-dev] [RFC] Clang SourceLocation overflow
 
> So if our solution allowed passing 64-bit SourceLocations around in the preprocessing stage, and only converted them to 32-bit (interpreted as an index to 64-bit data where necessary, or all the time) when constructing an AST node, I am pretty sure that would solve the problem, at least in this case.
 
Right… So what I envision at the moment is:
 
PPSourceLocation: Does everything the current one does, so it is represented as a 63 bit location + 1 bit for macro/file discriminator.
ASTSourceLocation: This is ONLY a 32 bit unsigned index and contains no info nor functionality itself (besides conversions).  However, ASTSourceLocation and a SourceManager (which contains the lookup vector) gets you the PPSourceLocation, where we do all the work.  
 
Lex/Parse deal with PPSourceLocation.  When it comes to store in the AST, we store the ASTSourceLocation and create an index in the SourceManager’s vector<uint64_t>.  When the AST needs to do something with the location besides copying/storing it, it does the conversion.  The conversion from PPSourceLocation->ASTSourceLocation might be expensive (since we would presumably want to de-dupe the vector), but ASTSourceLocation->PPSourceLocation is just a vector-lookup.
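
A minimal sketch of that conversion, assuming a de-duplicating table owned by the SourceManager. The type names follow the ones used above; LocationTable and its methods are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// 63-bit offset plus 1 macro/file bit, as described above.
struct PPSourceLocation { std::uint64_t raw; };
// Opaque 32-bit index with no meaning on its own.
struct ASTSourceLocation { std::uint32_t index; };

// Hypothetical table that would live in SourceManager.
class LocationTable {
public:
  // PP -> AST: the more expensive direction, since it de-duplicates.
  ASTSourceLocation intern(PPSourceLocation loc) {
    auto it = index_.find(loc.raw);
    if (it != index_.end())
      return ASTSourceLocation{it->second};
    assert(locs_.size() < UINT32_MAX && "ran out of AST source locations");
    std::uint32_t idx = static_cast<std::uint32_t>(locs_.size());
    locs_.push_back(loc.raw);
    index_.emplace(loc.raw, idx);
    return ASTSourceLocation{idx};
  }

  // AST -> PP: a plain vector lookup.
  PPSourceLocation lookup(ASTSourceLocation loc) const {
    return PPSourceLocation{locs_.at(loc.index)};
  }

  std::size_t size() const { return locs_.size(); }

private:
  std::vector<std::uint64_t> locs_;
  std::unordered_map<std::uint64_t, std::uint32_t> index_;
};
```

The hash map is one possible way to de-dupe and costs extra memory on top of the vector; a sorted vector with binary search would trade that memory for slower insertion.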

Agree.  Maybe leave "ASTSourceLocation" as "SourceLocation" since the change will otherwise be invisible to the vast majority of users, who only interact with SourceLocations via AST nodes (I think).

 
> Agree, but if the AST  can get to be above that size (disregarding template instantiation nodes) that is the only answer I think.  But somehow I doubt that is happening, or that any case in which that is happening is reasonable code.
At the moment, we aren’t getting to the 2^31-1 ast-source-locations, since obviously we would run out of offsets way before that would happen (assuming the de-dupe step above).  But the above system buys us at least 2x*, but likely much more.  At the moment, this is likely enough to make me happy.
 
> I know precious little about the modules implementation, but I would imagine we could store the vector<uint64_t> in the module as well - or store the big locations, either way.
I hadn’t considered storing the array itself in the module header.  I know about as much as you do (besides the ASTReader/ASTWriter interface), but my initial thought was to just convert to big-locations first, then store those.  Hopefully Richard can comment.
 
> Also, I wonder if that, when "Ran out of SourceLocations" is encountered with modules, the problem might be that template instantiation data within the modules are occupying large chunks of bytes, such that the problem could again be resolved with an indexing solution  that only ever refers to the SourceLocations of their patterns (i.e. reducing the number of unique indices to keep it under 2^31).
Agreed…  I think the above/below indexing solution buys us quite a bit of headroom.  It isn’t as good as true 64 bit source-locations, but the work to do this would make the eventual 64-bit switch quite a bit easier I would expect (since it would just be removing the vector and making the conversion be a simple copy).


If I understood the Boost VMD example correctly (i.e. almost no AST nodes actually generated in that file, notwithstanding the huge buffers generated by macro expansions), the indexing approach will free up plenty of bits in all but the most unusual cases, such that I would venture it will keep us from ever having to go 64 bit.

The only case that would not be solved by this, I think, is if a huge, non-templated AST were to be built, almost certainly by doing massive template-like stuff with the preprocessor, in which case the indexing would only buy us what you say below: a multiple equal to the average token length in bytes (though even more to the extent there are also unused-in-the-AST macro expansions in the concatenation of buffers).

If such cases really need to be supported, then at that point I would think it reasonable to add an option to build clang with 64-bit SourceLocations used in the AST — i.e. the user would have to use a different clang binary to compile that sort of stuff (is that the option that was proposed before?).  But it might be worthwhile to put that off until we see actual cases in which that really is necessary, where we need more than 2^32 indices to uniquely identify source locations for all the non-template-instantiated nodes in the AST.

 

So I guess the open questions are:
  1. Is this acceptable to **Richard**/etc?

Bump.

  2. Does anyone want to help 😊

While I’m not super set up for clang development at the moment, if, pending Richard’s thoughts, you wanted to start and when the task came into full focus it looked not-so-trivial, I would be happy to lend a hand, since I think we are on the same page about what it seems needs to be done.

*My assumption that I think holds is that ASTSourceLocations would end up representing significantly less than half of the actual locations.  In reality, I’m guessing this system buys us closer to 5x than it does 2x, simply because I’m guessing the average identifier/token/etc is closer to 5 bytes than 2. Even a trivial example which is a ‘bad’ case (int foo(int i, int j, int k) { return i + j + k;}) has 50 characters, but I believe only ~18 source locations in the AST.
 
From: David Rector <[hidden email]> 
Sent: Thursday, February 4, 2021 7:57 AM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
 

 

On Feb 4, 2021, at 10:23 AM, Keane, Erich <[hidden email]> wrote:
 
> but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  
 
I think we still need to keep non-ast source locations around, since they are necessary to print the ‘expanded from macro’ notes.  So even ones that aren’t in the AST get referenced anyway.  So we would need some way of differentiating the two.  I wonder if we would be able to create a 64-bit “large source location” that converted to a “small source location” when added to the AST that used the indexing feature.
 
> how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  

This I do not have a good feel of unfortunately.
 
I think in this case, the macros probably create very few if any AST nodes at all.  Here’s a link to the problematic code, I think it is pure macro wizardry:
 
So if our solution allowed passing around 64-bit SourceLocations in the preprocessing stage, and only converting them to 32-bit (interpreted as an index to 64-bit data where necessary, or all the time) when constructing an AST node, I am pretty sure that would solve the problem, at least in this case.
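To make the idea above concrete, here is a minimal sketch of a 64-bit location used during preprocessing that is converted to a 32-bit index only when it is stored in an AST node. Everything in it is an assumption for illustration: `LargeLoc`, `ASTLoc`, and `ASTLocTable` are hypothetical names, not existing Clang types, and a real design would also have to carry the macro/file distinction and module serialization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical 64-bit location used while lexing and preprocessing.
using LargeLoc = std::uint64_t;

// Hypothetical 32-bit handle, the only form that AST nodes would store.
struct ASTLoc {
  std::uint32_t Index;
};

// Hypothetical side table (it could live in the SourceManager). Only
// locations that actually reach the AST consume one of the 2^32 indices.
class ASTLocTable {
  std::vector<LargeLoc> Locs;
  std::unordered_map<LargeLoc, std::uint32_t> IndexOf;

public:
  // Return the existing index for L, or assign the next free one.
  ASTLoc intern(LargeLoc L) {
    auto It = IndexOf.find(L);
    if (It != IndexOf.end())
      return ASTLoc{It->second};
    assert(Locs.size() < UINT32_MAX && "ran out of AST location indices");
    std::uint32_t Index = static_cast<std::uint32_t>(Locs.size());
    Locs.push_back(L);
    IndexOf.emplace(L, Index);
    return ASTLoc{Index};
  }

  // Recover the full 64-bit location, e.g. when printing a diagnostic.
  LargeLoc resolve(ASTLoc L) const { return Locs.at(L.Index); }

  std::size_t size() const { return Locs.size(); }
};
```

In this sketch, bytes consumed by macro expansions that never produce an AST node cost nothing in the 32-bit index space; the price is an extra lookup whenever a location is resolved.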

 

 
>, the only long term solution is to allow the user to select 64-bit SourceLocations
 
I can’t think of any way to do this that doesn’t mean we have to have 2x of a number of AST nodes in the compiler.  It feels like that would be a pretty massive inflation of our executable size, so I’m not sure what we’re willing to allow for that.  
 
Agree, but if the AST can get to be above that size (disregarding template instantiation nodes) that is the only answer I think.  But somehow I doubt that is happening, or that any case in which that is happening is reasonable code.
 
 
Another side-note: If we start using some sort of ‘index into a vector of real source locations’ thing, we would likely need to change how modules are emitted/retrieved, though presumably we could just have them contain the ‘large source location’ listed above.
 
I know precious little about the modules implementation, but I would imagine we could store the vector<uint64_t> in the module as well - or store the big locations, either way.
 
Also, I wonder if that, when "Ran out of SourceLocations" is encountered with modules, the problem might be that template instantiation data within the modules are occupying large chunks of bytes, such that the problem could again be resolved with an indexing solution that only ever refers to the SourceLocations of their patterns (i.e. reducing the number of unique indices to keep it under 2^31).

 

 
From: David Rector <[hidden email]> 
Sent: Thursday, February 4, 2021 7:12 AM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
 

 

On Feb 4, 2021, at 9:34 AM, Keane, Erich <[hidden email]> wrote:
 
I don’t know if it is related to that, but I had the reporter run -dD and it crashed with these two at the end:
 
#define TUPLE_IS_VALID_ARRAY_E (2,(3,4))
#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP
 
So presumably it is related to the boost-preprocessor that you mentioned, however from a quick googling of the source code, that macro is:
#define TUPLE_IS_VALID_LIST_E (anydata,BOOST_PP_NIL)
 
So there isn’t any huge expansion here.  That said, since this is just a header include FROM boost, I’d expect us to be able to compile it.
 
Agreed, Boost probably isn’t doing unnecessary expansions.
 
However, it is possible it is doing a lot of nested expansions that are necessary for it to implement its logic, but which don’t actually contain any SourceLocations *which end up in the AST* (since those are the only SourceLocations that really need to be 32 bit).  
 
If these ultimately-unused expansions (e.g. expansions into arguments of other macros cannot be used in the AST) are added to the concatenation of buffers through which a SourceLocation must specify an offset, then there may be many more bytes in the buffers than SourceLocations in the AST when dealing with extremely heavy macro usage such as with the Boost VMD/PP stuff.
 
Hard to wade through the Preprocessor/Lexer details to know if this is the case, so an easier question is just: how many unique SourceLocations are in the AST, in each of the various described cases in which a user runs out of SourceLocations?  If the number of AST-used SourceLocations exceeds 2^31, the only answer is 64-bit SourceLocations.  If it is significantly under that, an indexing solution would work, and might well give the best bang for the buck.
 
The other cases mentioned in the thread you linked to are also worth considering.  In the current Boost VMD case, the problem results from macro expansions, somehow.  In the previous thread, the cause was extremely frequent use of large unguarded #includes (but which might involve macro expansions too, or somehow otherwise result in large chunks of bytes never referenced via SourceLocations in the AST).  
 
However someone else in that thread encountered problems while working on modules, which seems not to involve macros/preprocessor buffers at all, and so might be the most troubling case, and might indicate that even if macros are the problem here, the only long term solution is to allow the user to select 64-bit SourceLocations — unless there are *also* large chunks of bytes in the imported-from-AST buffers that do not contain SourceLocations.  So again would be interesting to know the number of AST-used unique SourceLocations in those other cases, when we run out of them.
 

 

 
 
From: David Rector <[hidden email]> 
Sent: Wednesday, February 3, 2021 12:33 PM
To: Keane, Erich <[hidden email]>
Cc: Richard Smith <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Not sure if this helps but: I didn’t think about the boost file you were trying to compile at first, but now that I see it involves the boost preprocessor stuff, I bet the issue involves code somewhere that is written like 
 
`BOOST_PP_IF(condition, HUGEMACROEXPANSION_A, HUGEMACROEXPANSION_B)`
 
which should instead be written
 
`BOOST_PP_CAT(HUGEMACROEXPANSION, BOOST_PP_IF(0, _A, _B))`
 
The former results in two huge expansions, one of which is thrown away, whereas the latter only results in one expansion.
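A self-contained sketch of the difference between the two forms follows. `MY_IF` and `MY_CAT` are hypothetical stand-ins for `BOOST_PP_IF` and `BOOST_PP_CAT` (they are not Boost's implementation), and the two tiny payload macros stand in for the huge expansions. Function-like macro arguments are fully macro-expanded before substitution, which is why the first form pays for both payloads.

```cpp
#include <cassert>

// Hypothetical stand-ins for BOOST_PP_CAT and BOOST_PP_IF (not Boost's code).
#define MY_CAT(a, b) MY_CAT_I(a, b)
#define MY_CAT_I(a, b) a##b
#define MY_IF(cond, t, f) MY_CAT(MY_IF_, cond)(t, f)
#define MY_IF_0(t, f) f
#define MY_IF_1(t, f) t

// Stand-ins for two huge expansions; imagine thousands of tokens each.
#define BIG_EXPANSION_ONE (1 + 1 + 1 + 1)
#define BIG_EXPANSION_TWO (2 + 2 + 2 + 2)

// Form 1: both payloads are macro-expanded while the arguments of MY_IF
// are collected, and only afterwards is one of them discarded.
constexpr int FromIf = MY_IF(0, BIG_EXPANSION_ONE, BIG_EXPANSION_TWO);

// Form 2: MY_IF only selects a short suffix; the single chosen payload
// macro is expanded after the token paste.
constexpr int FromCat = MY_CAT(BIG_EXPANSION_, MY_IF(0, ONE, TWO));

static_assert(FromIf == FromCat, "both forms select the same payload");
```

Both constants end up with the value of the second payload; the forms differ only in how many tokens the preprocessor has to push through, and therefore in how much source location space the expansions can consume.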
 
I recall writing reasonably small but complex boost pp code that would take like 15 minutes just to preprocess, and now that I think about it, it might have resulted in the "ran out of source locations" a few times, before I finally figured out to always write those conditionals the second way.  
 
Not sure if that’s the issue, but it’s a place to look.
 
Dave
 



On Feb 3, 2021, at 3:10 PM, Keane, Erich <[hidden email]> wrote:
 
Thanks for the response Richard!  I’m working with my QA team to get a better reproducer to see if we can figure out what is the root-cause.  It DOES include a couple of other files so I’m not sure all of what is entailed.
 
>There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. 
This seems like a useful and cheap thing to do; if we find this is a legitimate issue, I’ll see if I can put a patch together to do this one.
 
>And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
 
I don’t have a great idea on how to do this, or how we use this, but based on your description it seems worth-while.  At least in non-module TUs I’d assume that the “imported-from-ast-file” is relatively rare.  Also, I would think these imports happen ‘first’, right?  So we could figure out the split as soon as we’re done with imports and optimize our space.
 
 
 
From: Richard Smith <[hidden email]> 
Sent: Wednesday, February 3, 2021 11:54 AM
To: Keane, Erich <[hidden email]>
Cc: David Rector <[hidden email]>; clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 2 Feb 2021 at 10:41, Keane, Erich via cfe-dev <[hidden email]> wrote:
Presumably, yes, we could hit that limit, and that limit is closer than we’d think.
 
Based on my understanding though, we currently store the ‘offset’ into the buffer.  If we were to go with your plan, we’d reduce the ‘SourceLocation’ space significantly, since we’d only have 1 value taken up by “ALongFunctionName” instead of ~15?  Or, at least, only 1 instead of 3 in ‘int’.
 
I presume that anything short of expanding the size of SourceLocation is going to be a temporary measure at best. 
 
One consideration: What do we think about making SourceLocation have an underlying type of ‘bitfield’ that is a smaller size than 64 bits?  Then at least we could make SourceRange be a packed value?  So make SourceLocation be 40 bits (obviously would still be 64 if stored directly), but at least we could pack 2 of them into 80 bits for any type that does some sort of bit-packing? 
 
I doubt this would help in practice -- most of the storage in AST nodes tends to be SourceLocations and pointers.
 
Alternatively, what about making SourceRange (instead of 2 SourceLocation objects) be a SourceLocation + unsigned?  It would make it a little smaller?  Then we could do similar optimizations over time to the ASTNodes.  That is, all of the TypeLoc types could just store things like open/close parens as offsets-from-the-original rather than SourceLocations, then calculate them when needed?  I would assume anything that required re-calculation would be acceptable, since SourceLocations are seemingly quite rarely used (except for the few cases that Richard mentioned below).
 
That doesn't seem likely to work: we can't assume the beginning and end of a source range are close together in source location space (they could be in different SLocEntrys, which could be a long way apart from each other).
 
My preference is obviously to just make SourceLocation be uint64_t instead, but the impact on AST size is going to be significant.  So I guess I’m hoping that Richard Smith can comment and help us figure out how much pain we are willing to go through for this?
 
We could do that, and having a build-time selector for 32/64-bit SourceLocations seems like it might not impose a huge cost. But I think we should first try to gain some confidence that we're addressing the right problem -- running out of source locations seems likely to indicate that there's something more fundamental wrong with the compilation. Currently we reserve one bit for an 'is macro location' flag, and divide the remaining addressable 2GB into a 1GB local region and a 1GB imported-from-AST-file region. The resulting "1GB of preprocessed source (including all intermediate stages of macro expansion)" implementation limit does not seem especially restrictive to me, so the first thing I think we should find out is how we're actually hitting that limit in the boost example. We can currently handle huge compilations without hitting the limit, so if a single header file can hit it, that seems indicative of a bug that's not just caused by the limit being too low.
 
One possible cause would be that a large header is included a *lot* and doesn't have a proper include guard. If that's the case, that seems like a problem that we should get fixed in boost -- or in Clang if it's a bug in include guard detection -- because that will contribute to long compile times too. I expect there are other cases where source location address space gets wasted that we could drill down into.
 
If we find the problem is our location tracking for macro expansions (eg, pushing a large volume of tokens through deeply nested macro expansions), we could probably find a way to turn that off; there may even be ways we can intelligently turn it off selectively in cases where we think the intermediary information is unlikely to be useful. We could also look into throwing away source location address space for macro expansions that ended up producing no tokens. But we need to understand the nature of the problem first.
 
There are also some relatively cheap things we could do to expand our capacity: we could remove the 'is macro location' bit without much effort (though this would slow down the current places that check the flag), which would double our available source location capacity. And we could allow the division between local and imported locations to be determined dynamically rather than fixing a 50/50 split, which would in practice be likely to double the available capacity again. But those would only be useful if we're just a little over the limit; they wouldn't help if there's an asymptotic problem in our source location usage.
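As a rough illustration of the dynamically divided layout mentioned above, the sketch below models a 31-bit offset space in which local offsets grow upward from 0 and imported offsets grow downward from 2^31, so neither side is capped at 1GB. The member names echo the assertion quoted in this thread, but `OffsetSpace` and its methods are hypothetical and this is not Clang's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 31-bit offset space shared by two regions.
struct OffsetSpace {
  // One bit of the 32 is assumed to be the 'is macro location' flag.
  static constexpr std::uint64_t Limit = std::uint64_t{1} << 31;

  std::uint64_t NextLocalOffset = 0;         // grows upward
  std::uint64_t CurrentLoadedOffset = Limit; // grows downward

  // Reserve Length + 1 offsets for a local buffer or macro expansion.
  // Returns false when the request would collide with the loaded region,
  // the situation reported as "Ran out of source locations!".
  bool allocateLocal(std::uint64_t Length, std::uint64_t &Base) {
    assert(NextLocalOffset <= CurrentLoadedOffset);
    if (Length >= CurrentLoadedOffset - NextLocalOffset)
      return false;
    Base = NextLocalOffset;
    NextLocalOffset += Length + 1;
    return true;
  }

  // Reserve Length + 1 offsets for an entity imported from an AST file.
  bool allocateLoaded(std::uint64_t Length, std::uint64_t &Base) {
    assert(NextLocalOffset <= CurrentLoadedOffset);
    if (Length >= CurrentLoadedOffset - NextLocalOffset)
      return false;
    CurrentLoadedOffset -= Length + 1;
    Base = CurrentLoadedOffset;
    return true;
  }
};
```

With this shape, a translation unit that imports nothing can use almost all 2^31 offsets locally; dropping the flag bit would correspond to raising `Limit` to 2^32.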
 
From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 10:28 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
An additional consideration: if the files/buffers are so large that each SourceLocation cannot fit in 30 bits, does that imply that so many SourceLocations will be generated that the indices to them won’t be able to fit in 30 bits?  That may be the case now that I think of it, in which case the only solution really would be 64 bit SourceLocations, or something still more creative/costly.
 
On Feb 2, 2021, at 11:53 AM, Keane, Erich <[hidden email]> wrote:
 
Huh, that is an interesting idea!  The issue ends up being the callers of getOffset though, since it is a private method.
 
That said, I wonder if the extra bit-use is just putting off the inevitable, and we should just use the vector in SourceManager for everything. My initial thought is that we could do away with the ‘IsMacro’ bit as well, and just encode that in the vector too.  It makes isFileID and isMacroID require the SourceManager as well, but I wonder if that is OK?
 
 
 
From: David Rector <[hidden email]> 
Sent: Tuesday, February 2, 2021 8:46 AM
To: Keane, Erich <[hidden email]>
Cc: clang developer list <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
A possible solution not mentioned: reserve a bit in SourceLocation for "IsBigLoc".  When IsBigLoc = true, the remaining 30 bits of the "ID" field is not an offset but an index into a vector of uintptr_ts in SourceManager, each holding 64-bit offset data.
 
Only a few changes to private methods and isolated details would be needed I think: e.g. method `unsigned SourceLocation::getOffset()` would become `uintptr_t SourceLocation::getOffset(const SourceManager &SM)`, and would dereference and return the larger data when IsBigLoc.  And more generally `unsigned` types should be changed to `uintptr_t` everywhere 32-bit encodings are currently assumed (mostly in SourceManager).
 
This might be easier and less intrusive than allowing 64-bit SourceLocations depending on a build flag, if I understand that proposal correctly, since that would require templating virtually everything with the SourceLocation type.
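A minimal sketch of the "IsBigLoc" encoding described above, under stated assumptions: `Loc` and `BigOffsetTable` are hypothetical names rather than Clang's `SourceLocation` and `SourceManager`, the bit positions are chosen only for illustration, and the table does not deduplicate entries.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical side table that the SourceManager would own.
class BigOffsetTable {
  std::vector<std::uint64_t> Offsets;

public:
  std::uint32_t add(std::uint64_t Offset) {
    Offsets.push_back(Offset);
    return static_cast<std::uint32_t>(Offsets.size() - 1);
  }
  std::uint64_t get(std::uint32_t Index) const { return Offsets.at(Index); }
};

// Hypothetical 32-bit location: 1 macro bit, 1 "big" bit, 30 payload bits.
class Loc {
  static constexpr std::uint32_t MacroBit = 1u << 31;
  static constexpr std::uint32_t BigBit = 1u << 30;
  static constexpr std::uint32_t PayloadMask = BigBit - 1;
  std::uint32_t ID = 0;

public:
  static Loc make(std::uint64_t Offset, bool IsMacro, BigOffsetTable &Table) {
    Loc L;
    if (Offset <= PayloadMask) {
      // Small offsets are stored inline, as they are today.
      L.ID = static_cast<std::uint32_t>(Offset);
    } else {
      // Large offsets are stored out of line; the payload is an index.
      std::uint32_t Index = Table.add(Offset);
      assert(Index <= PayloadMask && "ran out of big-location indices");
      L.ID = BigBit | Index;
    }
    if (IsMacro)
      L.ID |= MacroBit;
    return L;
  }

  bool isMacroID() const { return (ID & MacroBit) != 0; }
  bool isBigLoc() const { return (ID & BigBit) != 0; }

  // Decoding a big location needs the table, hence the extra parameter.
  std::uint64_t getOffset(const BigOffsetTable &Table) const {
    std::uint32_t Payload = ID & PayloadMask;
    return isBigLoc() ? Table.get(Payload) : Payload;
  }
};
```

The location stays 4 bytes, so AST node sizes would not change, but every caller that decodes an offset has to be handed the table.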




On Feb 2, 2021, at 9:21 AM, Keane, Erich via cfe-dev <[hidden email]> wrote:
 
 
I’m bringing this back up since we have a reproduction of this in Boost now.  We haven’t finished analyzing what boost is doing, but simply doing an include of:
#include <libs/vmd/test/test_doc_modifiers_return_type.cxx>
 
Now causes us to run out of source locations, hitting:
 
/llvm/clang/lib/Basic/SourceManager.cpp:680: clang::SourceLocation clang::SourceManager::createExpansionLocImpl(const clang::SrcMgr::ExpansionInfo &, unsigned int, int, unsigned int): Assertion `NextLocalOffset + TokLength + 1 > NextLocalOffset && NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset && "Ran out of source locations!"' failed.
 
 
From the last discussion, it seems that increasing our source-location size isn’t really acceptable due to how much it is stored in the AST( Multiple times per node), and giving up on location isn’t viable either.  Additionally, the source location/source manager layout is quite complex and I don’t quite understand it yet, so I don’t have a good way of suggesting an alternative.
 
SO, I’d like to re-start the discussion into this, we need to find a way to make our compiler able to support more source-locations, as I can’t imagine this is going to be the only time we run into this.
 
-Erich
 
>>>>>>>>>>>>>>>>>>>> 
Hi Matt.
 
Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.
 
Thanks,
Christof
 
From: Matt Asplund <mwasplund at gmail.com>
Sent: 10 October 2019 14:09
To: Christof Douma <Christof.Douma at arm.com>
Cc: Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <nd at arm.com>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
Sure, I can share my changes. In general my changes are very localized to the code paths where I was hitting overflow issues, and I tried to keep as much as possible using 32-bit encodings. Is there an ETA for when you will start investigating? I am out of town for the next week, but if needed I can get access to my desktop.
 
-Matt
 
On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>> wrote:
Hi Richard.
 
Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful process, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64 bits and will attempt to take the route of a cmake configuration option.
 
If there are people that have already done some experiments on the size impact of 64-bit location pointers, we welcome your insights. @Matt Asplund<mailto:mwasplund at gmail.com> I believe you said that you’ve done some prototyping, is there something you can share?
 
Thanks,
Christof
 
From: Richard Smith <richard at metafoo.co.uk<mailto:richard at metafoo.co.uk>>
Sent: 08 October 2019 19:53
To: Christof Douma <Christof.Douma at arm.com<mailto:Christof.Douma at arm.com>>
Cc: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>; Clang Dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>>; nd <nd at arm.com<mailto:nd at arm.com>>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Richard, Paul and others.
 
Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.
 
We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.
 
Hope that gave some more background of where this question comes from.
 
Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)
Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.
Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).
 
As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.
If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.
Thanks,
Christof
 
From: cfe-dev <cfe-dev-bounces at lists.llvm.org<mailto:cfe-dev-bounces at lists.llvm.org>> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <Mikhail.Maltsev at arm.com<mailto:Mikhail.Maltsev at arm.com>>
Cc: nd <nd at arm.com<mailto:nd at arm.com>>; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow
 
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
 
We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.
 
When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.
 
The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:
 
Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:
 
#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""
 
Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"
 
In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.
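
The arithmetic behind this can be checked with a few lines of C++. The concrete numbers below (a 2 MiB header included 2000 times) are hypothetical values chosen to match "several MBs" and "several thousand"; the helper names are also made up for illustration, and the model ignores every other consumer of the space.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit locations with one flag bit leave 2^31 addressable characters.
constexpr std::uint64_t OffsetSpaceSize = std::uint64_t{1} << 31;

// Each inclusion reserves the whole file size again, so the cost is
// simply size times count.
constexpr bool exhaustsOffsetSpace(std::uint64_t FileBytes,
                                   std::uint64_t Inclusions) {
  return FileBytes * Inclusions > OffsetSpaceSize;
}

// Upper bound on how often a file of FileBytes (> 0) can be included.
constexpr std::uint64_t maxInclusions(std::uint64_t FileBytes) {
  return OffsetSpaceSize / FileBytes;
}

constexpr std::uint64_t TwoMiB = std::uint64_t{2} << 20;

static_assert(exhaustsOffsetSpace(TwoMiB, 2000),
              "2 MiB x 2000 inclusions exceeds 2^31");
static_assert(maxInclusions(TwoMiB) == 1024,
              "a 2 MiB file fits at most 1024 times");
```
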
 
The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.
 
We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs; after that, make the lexer output
   the <invalid location> special value for all subsequent tokens.