[RFC] Clang SourceLocation overflow

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

[RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

[1]. https://www.autosar.org
[2].
https://www.autosar.org/fileadmin/user_upload/standards/classic/3-0/AUTOSAR_SWS_MemoryMapping.pdf

--
Regards,
  Mikhail Maltsev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
I don't want to distract from your question, but wanted to add that I have been seeing source location overflow issues for many months when using clangs implementation of c++20 modules. I have a personal branch where I have made a partial conversion over to 64 bit source locations for test purposes.

-Matt

On Wed, Oct 2, 2019, 9:26 AM Mikhail Maltsev via cfe-dev <[hidden email]> wrote:
Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

[1]. https://www.autosar.org
[2].
https://www.autosar.org/fileadmin/user_upload/standards/classic/3-0/AUTOSAR_SWS_MemoryMapping.pdf

--
Regards,
  Mikhail Maltsev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
I think that we can rule out option 2, given that you've found a pattern used in real-world code which hits this limit. We'd like to avoid giving <invalid location> to any real users.

I'm not familiar with how GCC does source locations, but this sounds like it's just delaying the problem by a factor of the average line length? If so, adding two different thresholds at which line number accuracy is degraded doesn't seem worth it.

I also don't think option 4 is viable, as it would increase complexity a lot, and would only work for this exact pattern. I've also seen bug reports from customers trying to pre-process (mechanically generated) assembly files >2GB, which this solution wouldn't help with.

That leaves option 1, which looks like the best solution to me. The obvious concern with this one is the increase in memory consumption, since there will be a large number of these objects (one per token/AST node?). @Matt: do you have any memory consumption numbers from your prototype whoch could help here?

Oliver

On Thu, 3 Oct 2019 at 17:23, Matt Asplund via cfe-dev <[hidden email]> wrote:
I don't want to distract from your question, but wanted to add that I have been seeing source location overflow issues for many months when using clangs implementation of c++20 modules. I have a personal branch where I have made a partial conversion over to 64 bit source locations for test purposes.

-Matt

On Wed, Oct 2, 2019, 9:26 AM Mikhail Maltsev via cfe-dev <[hidden email]> wrote:
Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

[1]. https://www.autosar.org
[2].
https://www.autosar.org/fileadmin/user_upload/standards/classic/3-0/AUTOSAR_SWS_MemoryMapping.pdf

--
Regards,
  Mikhail Maltsev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
The increase in memory usage would be a darn shame. Most of the complexity of the source location machinery exists just to save these four bytes. If we used 64-bit values, source locations could easily be pointers into mapped files or something simple.

Can we detect files with pathologically many source locations, and collapse them to one source location, or something like that?
For a 2+GB file, it's not unreasonable to say "error, somewhere in this file, you figure it out, it's your own fault for generating source code this large".
I'm sure this would break some invariants, but it's worth exploring before increasing memory usage.

Also, what is the current failure mode when source locations are exhausted? Does clang print a useful error message instructing the user to reduce the size of their pre-processed input? That seems like a good place to start.

On Mon, Oct 7, 2019 at 2:54 AM Oliver Stannard via cfe-dev <[hidden email]> wrote:
I think that we can rule out option 2, given that you've found a pattern used in real-world code which hits this limit. We'd like to avoid giving <invalid location> to any real users.

I'm not familiar with how GCC does source locations, but this sounds like it's just delaying the problem by a factor of the average line length? If so, adding two different thresholds at which line number accuracy is degraded doesn't seem worth it.

I also don't think option 4 is viable, as it would increase complexity a lot, and would only work for this exact pattern. I've also seen bug reports from customers trying to pre-process (mechanically generated) assembly files >2GB, which this solution wouldn't help with.

That leaves option 1, which looks like the best solution to me. The obvious concern with this one is the increase in memory consumption, since there will be a large number of these objects (one per token/AST node?). @Matt: do you have any memory consumption numbers from your prototype whoch could help here?

Oliver

On Thu, 3 Oct 2019 at 17:23, Matt Asplund via cfe-dev <[hidden email]> wrote:
I don't want to distract from your question, but wanted to add that I have been seeing source location overflow issues for many months when using clangs implementation of c++20 modules. I have a personal branch where I have made a partial conversion over to 64 bit source locations for test purposes.

-Matt

On Wed, Oct 2, 2019, 9:26 AM Mikhail Maltsev via cfe-dev <[hidden email]> wrote:
Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

[1]. https://www.autosar.org
[2].
https://www.autosar.org/fileadmin/user_upload/standards/classic/3-0/AUTOSAR_SWS_MemoryMapping.pdf

--
Regards,
  Mikhail Maltsev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev
On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:
Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev

Richard Smith wrote:

Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

 

In my experience, these sorts of things are doing two kinds of dispatch: by function, and by compiler.  Reading the AUTOSAR spec, it looks like there are many different compilers supported, each of which has its own set of pragmas and so forth to use for each bit of functionality.  You could have the top-level MemMap.h conditionally #include compiler-specific header files.  Or, you could break up the file functionally, for example EEP_{START,STOP}_SEC_VAR_16BIT could be implemented in one header file, other features in other header files, and have MemMap.h conditionally #include the appropriate function-specific header.  The spec says that MemMap.h “shall provide a mechanism” to do this and that; it doesn’t appear to require that the entire mechanism for all compilers be embodied in a single header file.  Either way, breaking up the file would improve compilation time (fewer lines to process) as well as addressing the source-location size problem.

--paulr

 

From: cfe-dev <[hidden email]> On Behalf Of Richard Smith via cfe-dev
Sent: Monday, October 07, 2019 3:36 PM
To: Mikhail Maltsev <[hidden email]>
Cc: nd <[hidden email]>; [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:

Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

 

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

 

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev

Hi Richard, Paul and other.

 

Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.

 

We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.

 

Hope that gave some more background of where this question comes from.

 

Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”) Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).

 

As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

 

Thanks,

Christof

 

From: cfe-dev <[hidden email]> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <[hidden email]>
Cc: nd <[hidden email]>; [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:

Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

 

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

 

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev
On Mon, Oct 7, 2019 at 2:45 PM Reid Kleckner via cfe-dev
<[hidden email]> wrote:
>
> The increase in memory usage would be a darn shame. Most of the complexity of the source location machinery exists just to save these four bytes. If we used 64-bit values, source locations could easily be pointers into mapped files or something simple.
>
> Can we detect files with pathologically many source locations, and collapse them to one source location, or something like that?
> For a 2+GB file, it's not unreasonable to say "error, somewhere in this file, you figure it out, it's your own fault for generating source code this large".
> I'm sure this would break some invariants, but it's worth exploring before increasing memory usage.

I'm not keen on this approach because source locations aren't just for
humans. For instance, source location information is output as part of
the static analyzer and may be dumped to SARIF or a plist file to be
consumed by other tools (sometimes through automation). I doubt those
tools would expect this behavior (especially if there's a spec, like
for SARIF), and it would be unfortunate to expect them to work around
Clang's bug.

That said, I am also super wary of increasing the memory usage --
source locations are *everywhere*.

~Aaron

> Also, what is the current failure mode when source locations are exhausted? Does clang print a useful error message instructing the user to reduce the size of their pre-processed input? That seems like a good place to start.
>
> On Mon, Oct 7, 2019 at 2:54 AM Oliver Stannard via cfe-dev <[hidden email]> wrote:
>>
>> I think that we can rule out option 2, given that you've found a pattern used in real-world code which hits this limit. We'd like to avoid giving <invalid location> to any real users.
>>
>> I'm not familiar with how GCC does source locations, but this sounds like it's just delaying the problem by a factor of the average line length? If so, adding two different thresholds at which line number accuracy is degraded doesn't seem worth it.
>>
>> I also don't think option 4 is viable, as it would increase complexity a lot, and would only work for this exact pattern. I've also seen bug reports from customers trying to pre-process (mechanically generated) assembly files >2GB, which this solution wouldn't help with.
>>
>> That leaves option 1, which looks like the best solution to me. The obvious concern with this one is the increase in memory consumption, since there will be a large number of these objects (one per token/AST node?). @Matt: do you have any memory consumption numbers from your prototype whoch could help here?
>>
>> Oliver
>>
>> On Thu, 3 Oct 2019 at 17:23, Matt Asplund via cfe-dev <[hidden email]> wrote:
>>>
>>> I don't want to distract from your question, but wanted to add that I have been seeing source location overflow issues for many months when using clangs implementation of c++20 modules. I have a personal branch where I have made a partial conversion over to 64 bit source locations for test purposes.
>>>
>>> -Matt
>>>
>>> On Wed, Oct 2, 2019, 9:26 AM Mikhail Maltsev via cfe-dev <[hidden email]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We are experiencing a problem with Clang SourceLocation overflow.
>>>> Currently source locations are 32-bit values, one bit is a flag, which gives
>>>> a source location space of 2^31 characters.
>>>>
>>>> When the Clang lexer processes an #include directive it reserves the total size
>>>> of the file being included in the source location space. An overflow can occur
>>>> if a large file (which does not have include guards by design) is included many
>>>> times into a single TU.
>>>>
>>>> The pattern of including a file multiple times is for example required by
>>>> the AUTOSAR standard [1], which is widely used in the automotive industry.
>>>> Specifically the pattern is described in the Specification of Memory Mapping [2]:
>>>>
>>>> Section 8.2.1, MEMMAP003:
>>>> "The start and stop symbols for section control are configured with section
>>>> identifiers defined in MemMap.h [...] For instance:
>>>>
>>>> #define EEP_START_SEC_VAR_16BIT
>>>> #include "MemMap.h"
>>>> static uint16 EepTimer;
>>>> static uint16 EepRemainingBytes;
>>>> #define EEP_STOP_SEC_VAR_16BIT
>>>> #include "MemMap.h""
>>>>
>>>> Section 8.2.2, MEMMAP005:
>>>> "The file MemMap.h shall provide a mechanism to select different code, variable
>>>> or constant sections by checking the definition of the module specific memory
>>>> allocation key words for starting a section [...]"
>>>>
>>>> In practice MemMap.h can reach several MBs and can be included several thousand
>>>> times causing an overflow in the source location space.
>>>>
>>>> The problem does not occur with GCC because it tracks line numbers rather than
>>>> file offsets. Column numbers are tracked separately and are optional. I.e., in
>>>> GCC a source location can be either a (line+column) tuple packed into 32 bits or
>>>> (when the line number exceeds a certain threshold) a 32-bit line number.
>>>>
>>>> We are looking for an acceptable way of resolving the problem and propose the
>>>> following approaches for discussion:
>>>> 1. Use 64 bits for source location tracking.
>>>> 2. Track until an overflow occurs after that make the lexer output
>>>>    the <invalid location> special value for all subsequent tokens.
>>>> 3. Implement an approach similar to the one used by GCC and start tracking line
>>>>    numbers instead of file offsets after a certain threshold. Resort to (2)
>>>>    when even line numbers overflow.
>>>> 4. (?) Detect the multiple inclusion pattern and track it differently (for now
>>>>    we don't have specific ideas on how to implement this)
>>>>
>>>> Is any of these approaches viable? What caveats should we expect? (we already
>>>> know about static_asserts guarding the sizes of certain class fields which start
>>>> failing in the first approach).
>>>>
>>>> Other suggestions are welcome.
>>>>
>>>> [1]. https://www.autosar.org
>>>> [2].
>>>> https://www.autosar.org/fileadmin/user_upload/standards/classic/3-0/AUTOSAR_SWS_MemoryMapping.pdf
>>>>
>>>> --
>>>> Regards,
>>>>   Mikhail Maltsev
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> [hidden email]
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> [hidden email]
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
I realize it makes calculating them way harder, but it is a shame we give away so much of our address space to locations that will never be used.  For example:

void foo();

We'll never create a SourceLocation to 'oi' in void, or the middle 'o' in foo, yet we take up space anyway.  I realize it makes going from SourceLocation to actual line/column, but it is a shame that SourceLocation isn't just a 'Token Index' into the file.  2^31 Tokens seems a lot less limiting than 2^31 characters.

Unfortunately, this ends up being a sizable change I suspect.

-----Original Message-----
From: cfe-dev <[hidden email]> On Behalf Of Aaron Ballman via cfe-dev
Sent: Tuesday, October 8, 2019 10:50 AM
To: Reid Kleckner <[hidden email]>
Cc: cfe-dev <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

On Mon, Oct 7, 2019 at 2:45 PM Reid Kleckner via cfe-dev <[hidden email]> wrote:
>
> The increase in memory usage would be a darn shame. Most of the complexity of the source location machinery exists just to save these four bytes. If we used 64-bit values, source locations could easily be pointers into mapped files or something simple.
>
> Can we detect files with pathologically many source locations, and collapse them to one source location, or something like that?
> For a 2+GB file, it's not unreasonable to say "error, somewhere in this file, you figure it out, it's your own fault for generating source code this large".
> I'm sure this would break some invariants, but it's worth exploring before increasing memory usage.

I'm not keen on this approach because source locations aren't just for humans. For instance, source location information is output as part of the static analyzer and may be dumped to SARIF or a plist file to be consumed by other tools (sometimes through automation). I doubt those tools would expect this behavior (especially if there's a spec, like for SARIF), and it would be unfortunate to expect them to work around Clang's bug.

That said, I am also super wary of increasing the memory usage -- source locations are *everywhere*.

~Aaron

> Also, what is the current failure mode when source locations are exhausted? Does clang print a useful error message instructing the user to reduce the size of their pre-processed input? That seems like a good place to start.
>
> On Mon, Oct 7, 2019 at 2:54 AM Oliver Stannard via cfe-dev <[hidden email]> wrote:
>>
>> I think that we can rule out option 2, given that you've found a pattern used in real-world code which hits this limit. We'd like to avoid giving <invalid location> to any real users.
>>
>> I'm not familiar with how GCC does source locations, but this sounds like it's just delaying the problem by a factor of the average line length? If so, adding two different thresholds at which line number accuracy is degraded doesn't seem worth it.
>>
>> I also don't think option 4 is viable, as it would increase complexity a lot, and would only work for this exact pattern. I've also seen bug reports from customers trying to pre-process (mechanically generated) assembly files >2GB, which this solution wouldn't help with.
>>
>> That leaves option 1, which looks like the best solution to me. The obvious concern with this one is the increase in memory consumption, since there will be a large number of these objects (one per token/AST node?). @Matt: do you have any memory consumption numbers from your prototype whoch could help here?
>>
>> Oliver
>>
>> On Thu, 3 Oct 2019 at 17:23, Matt Asplund via cfe-dev <[hidden email]> wrote:
>>>
>>> I don't want to distract from your question, but wanted to add that I have been seeing source location overflow issues for many months when using clangs implementation of c++20 modules. I have a personal branch where I have made a partial conversion over to 64 bit source locations for test purposes.
>>>
>>> -Matt
>>>
>>> On Wed, Oct 2, 2019, 9:26 AM Mikhail Maltsev via cfe-dev <[hidden email]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We are experiencing a problem with Clang SourceLocation overflow.
>>>> Currently source locations are 32-bit values, one bit is a flag,
>>>> which gives a source location space of 2^31 characters.
>>>>
>>>> When the Clang lexer processes an #include directive it reserves
>>>> the total size of the file being included in the source location
>>>> space. An overflow can occur if a large file (which does not have
>>>> include guards by design) is included many times into a single TU.
>>>>
>>>> The pattern of including a file multiple times is for example
>>>> required by the AUTOSAR standard [1], which is widely used in the automotive industry.
>>>> Specifically the pattern is described in the Specification of Memory Mapping [2]:
>>>>
>>>> Section 8.2.1, MEMMAP003:
>>>> "The start and stop symbols for section control are configured with
>>>> section identifiers defined in MemMap.h [...] For instance:
>>>>
>>>> #define EEP_START_SEC_VAR_16BIT
>>>> #include "MemMap.h"
>>>> static uint16 EepTimer;
>>>> static uint16 EepRemainingBytes;
>>>> #define EEP_STOP_SEC_VAR_16BIT
>>>> #include "MemMap.h""
>>>>
>>>> Section 8.2.2, MEMMAP005:
>>>> "The file MemMap.h shall provide a mechanism to select different
>>>> code, variable or constant sections by checking the definition of
>>>> the module specific memory allocation key words for starting a section [...]"
>>>>
>>>> In practice MemMap.h can reach several MBs and can be included
>>>> several thousand times causing an overflow in the source location space.
>>>>
>>>> The problem does not occur with GCC because it tracks line numbers
>>>> rather than file offsets. Column numbers are tracked separately and
>>>> are optional. I.e., in GCC a source location can be either a
>>>> (line+column) tuple packed into 32 bits or (when the line number exceeds a certain threshold) a 32-bit line number.
>>>>
>>>> We are looking for an acceptable way of resolving the problem and
>>>> propose the following approaches for discussion:
>>>> 1. Use 64 bits for source location tracking.
>>>> 2. Track until an overflow occurs after that make the lexer output
>>>>    the <invalid location> special value for all subsequent tokens.
>>>> 3. Implement an approach similar to the one used by GCC and start tracking line
>>>>    numbers instead of file offsets after a certain threshold. Resort to (2)
>>>>    when even line numbers overflow.
>>>> 4. (?) Detect the multiple inclusion pattern and track it differently (for now
>>>>    we don't have specific ideas on how to implement this)
>>>>
>>>> Is any of these approaches viable? What caveats should we expect?
>>>> (we already know about static_asserts guarding the sizes of certain
>>>> class fields which start failing in the first approach).
>>>>
>>>> Other suggestions are welcome.
>>>>
>>>> [1]. https://www.autosar.org
>>>> [2].
>>>> https://www.autosar.org/fileadmin/user_upload/standards/classic/3-0
>>>> /AUTOSAR_SWS_MemoryMapping.pdf
>>>>
>>>> --
>>>> Regards,
>>>>   Mikhail Maltsev
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> [hidden email]
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> [hidden email]
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev
On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <[hidden email]> wrote:

Hi Richard, Paul and other.

 

Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.

 

We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.

 

Hope that gave some more background of where this question comes from.

 

Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)

Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.

Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).

 

As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.

Thanks,

Christof

 

From: cfe-dev <[hidden email]> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <[hidden email]>
Cc: nd <[hidden email]>; [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:

Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

 

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

 

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev

Hi Richard.

 

Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful processes, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.

 

If there are people that have already done some experiments on the size impact of 64-bits location pointers, we’re welcome your insights. [hidden email] I believe you said that you’ve done some prototyping, is there something you can share?

 

Thanks,

Christof

 

From: Richard Smith <[hidden email]>
Sent: 08 October 2019 19:53
To: Christof Douma <[hidden email]>
Cc: Mikhail Maltsev <[hidden email]>; Clang Dev <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <[hidden email]> wrote:

Hi Richard, Paul and other.

 

Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.

 

We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.

 

Hope that gave some more background of where this question comes from.

 

Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)

Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.

Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).

 

As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.

Thanks,

Christof

 

From: cfe-dev <[hidden email]> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <
[hidden email]>
Cc: nd <
[hidden email]>; [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:

Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

 

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

 

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
Sure, I can share my changes. In general my changes are very localized to the code paths I was hitting oveflow issues and tried to keep as much as possible using 32bit encodings. Is there an ETA for when you will start investigating, I am out of town for the next week but if needed can get access to my desktop.

-Matt

On Thu, Oct 10, 2019, 2:13 AM Christof Douma <[hidden email]> wrote:

Hi Richard.

 

Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful processes, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.

 

If there are people that have already done some experiments on the size impact of 64-bits location pointers, we’re welcome your insights. [hidden email] I believe you said that you’ve done some prototyping, is there something you can share?

 

Thanks,

Christof

 

From: Richard Smith <[hidden email]>
Sent: 08 October 2019 19:53
To: Christof Douma <[hidden email]>
Cc: Mikhail Maltsev <[hidden email]>; Clang Dev <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <[hidden email]> wrote:

Hi Richard, Paul and other.

 

Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.

 

We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.

 

Hope that gave some more background of where this question comes from.

 

Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)

Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.

Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).

 

As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.

Thanks,

Christof

 

From: cfe-dev <[hidden email]> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <
[hidden email]>
Cc: nd <
[hidden email]>; [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:

Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

 

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

 

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev

Hi Matt.

 

Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.

 

Thanks,

Christof

 

From: Matt Asplund <[hidden email]>
Sent: 10 October 2019 14:09
To: Christof Douma <[hidden email]>
Cc: Richard Smith <[hidden email]>; Mikhail Maltsev <[hidden email]>; Clang Dev <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

Sure, I can share my changes. In general my changes are very localized to the code paths I was hitting oveflow issues and tried to keep as much as possible using 32bit encodings. Is there an ETA for when you will start investigating, I am out of town for the next week but if needed can get access to my desktop.

 

-Matt

 

On Thu, Oct 10, 2019, 2:13 AM Christof Douma <[hidden email]> wrote:

Hi Richard.

 

Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful processes, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.

 

If there are people that have already done some experiments on the size impact of 64-bits location pointers, we’re welcome your insights. [hidden email] I believe you said that you’ve done some prototyping, is there something you can share?

 

Thanks,

Christof

 

From: Richard Smith <[hidden email]>
Sent: 08 October 2019 19:53
To: Christof Douma <
[hidden email]>
Cc: Mikhail Maltsev <
[hidden email]>; Clang Dev <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <[hidden email]> wrote:

Hi Richard, Paul and other.

 

Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.

 

We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.

 

Hope that gave some more background of where this question comes from.

 

Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)

Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.

Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).

 

As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.

Thanks,

Christof

 

From: cfe-dev <[hidden email]> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <
[hidden email]>
Cc: nd <
[hidden email]>; [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:

Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

 

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

 

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
I have attached a patch for the changes to update SourceLocation to internally use an unsigned long long (I tried to put it on phabricator, but it was not having my changes). The change is as isolated as I could make it to allow a majority of clang to continue to use the old 32 bit ids whenever possible and only convert locations that I was personally hitting overflows (when trying to convert down) during large pcm lazy loading. Note: There may be a few random C++20 Modules fixes and/or hacks sprinkled in, this was entirely to unblock my own investigations into a C++ Modules build system.

-Matt

On Thu, Oct 10, 2019 at 7:59 AM Christof Douma <[hidden email]> wrote:

Hi Matt.

 

Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.

 

Thanks,

Christof

 

From: Matt Asplund <[hidden email]>
Sent: 10 October 2019 14:09
To: Christof Douma <[hidden email]>
Cc: Richard Smith <[hidden email]>; Mikhail Maltsev <[hidden email]>; Clang Dev <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

Sure, I can share my changes. In general my changes are very localized to the code paths I was hitting oveflow issues and tried to keep as much as possible using 32bit encodings. Is there an ETA for when you will start investigating, I am out of town for the next week but if needed can get access to my desktop.

 

-Matt

 

On Thu, Oct 10, 2019, 2:13 AM Christof Douma <[hidden email]> wrote:

Hi Richard.

 

Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful processes, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.

 

If there are people that have already done some experiments on the size impact of 64-bits location pointers, we’re welcome your insights. [hidden email] I believe you said that you’ve done some prototyping, is there something you can share?

 

Thanks,

Christof

 

From: Richard Smith <[hidden email]>
Sent: 08 October 2019 19:53
To: Christof Douma <
[hidden email]>
Cc: Mikhail Maltsev <
[hidden email]>; Clang Dev <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <[hidden email]> wrote:

Hi Richard, Paul and other.

 

Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.

 

We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.

 

Hope that gave some more background of where this question comes from.

 

Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)

Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.

Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).

 

As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.

Thanks,

Christof

 

From: cfe-dev <[hidden email]> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <
[hidden email]>
Cc: nd <
[hidden email]>; [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:

Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

 

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

 

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

new-feature.patch (107K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
On 11/2/19 5:00 PM, Matt Asplund wrote:

> I have attached a patch for the changes to update SourceLocation to internally
> use an unsigned long long (I tried to put it on phabricator, but it was not
> having my changes). The change is as isolated as I could make it to allow a
> majority of clang to continue to use the old 32 bit ids whenever possible and
> only convert locations that I was personally hitting overflows (when trying to
> convert down) during large pcm lazy loading. Note: There may be a few random
> C++20 Modules fixes and/or hacks sprinkled in, this was entirely to unblock my
> own investigations into a C++ Modules build system.
>
> -Matt

Thanks for sharing. We will look into your patch.


--
Regards,
  Mikhail Maltsev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Clang SourceLocation overflow

Kristof Beyls via cfe-dev
In reply to this post by Kristof Beyls via cfe-dev
(Looks like the patch converts a few files to dos-style line endings, which makes it larger than it needs to be.)

On Sun, Nov 3, 2019 at 3:51 PM Matt Asplund via cfe-dev <[hidden email]> wrote:
I have attached a patch for the changes to update SourceLocation to internally use an unsigned long long (I tried to put it on phabricator, but it was not having my changes). The change is as isolated as I could make it to allow a majority of clang to continue to use the old 32 bit ids whenever possible and only convert locations that I was personally hitting overflows (when trying to convert down) during large pcm lazy loading. Note: There may be a few random C++20 Modules fixes and/or hacks sprinkled in, this was entirely to unblock my own investigations into a C++ Modules build system.

-Matt

On Thu, Oct 10, 2019 at 7:59 AM Christof Douma <[hidden email]> wrote:

Hi Matt.

 

Thanks for the offer. Whenever you’re back and have a moment is fine by me, I don’t think we’re in a massive hurry.

 

Thanks,

Christof

 

From: Matt Asplund <[hidden email]>
Sent: 10 October 2019 14:09
To: Christof Douma <[hidden email]>
Cc: Richard Smith <[hidden email]>; Mikhail Maltsev <[hidden email]>; Clang Dev <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

Sure, I can share my changes. In general my changes are very localized to the code paths I was hitting oveflow issues and tried to keep as much as possible using 32bit encodings. Is there an ETA for when you will start investigating, I am out of town for the next week but if needed can get access to my desktop.

 

-Matt

 

On Thu, Oct 10, 2019, 2:13 AM Christof Douma <[hidden email]> wrote:

Hi Richard.

 

Thanks for the clarification, I certainly had not realized that location tracking was needed for correctness of clang. Decoupling this sounds like a painful processes, and the only thing we get is errors without any location information. Sounds like not a great trade-off. We’ll go experiment a bit with the size impact of using 64-bits and will attempt to take the route of a cmake configuration option.

 

If there are people that have already done some experiments on the size impact of 64-bits location pointers, we’re welcome your insights. [hidden email] I believe you said that you’ve done some prototyping, is there something you can share?

 

Thanks,

Christof

 

From: Richard Smith <[hidden email]>
Sent: 08 October 2019 19:53
To: Christof Douma <
[hidden email]>
Cc: Mikhail Maltsev <
[hidden email]>; Clang Dev <[hidden email]>; nd <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <[hidden email]> wrote:

Hi Richard, Paul and other.

 

Thanks for the input so far. I wanted to point out that it’s not our code-base. Rather, we’re seeing more use of the LLVM technology in the automotive market and as usual we’re faced with existing code bases that are tried and tested with other toolchains (gcc or others) and when LLVM comes along things don’t always work directly.

 

We’ve suggested better ways of structuring their code and your suggestions are certainly good input. However, legacy code is especially sticky in any market that has to handle ‘safety’ concerns, like automotive, aerospace and medical markets. Code changes are pretty expensive in those fields. So while I hope that over time we see more sensible coding structures, I don’t expect that to happen any time soon. In the mean time, we’re searching for a solution for this coding pattern that doesn’t play well with clang.

 

Hope that gave some more background of where this question comes from.

 

Do all options that were suggested by Mikhail really require fundamental restructuring of major parts of clang? This surprised me, I had expected that the option 2 to be possible without a complete overhaul. (2 is “Track until an overflow occurs after that make the lexer output the <invalid location> special value for all subsequent tokens.”)

Clang uses source locations as part of the semantic representation of the AST in some cases (simple example: some forms of initialization might use parens or not, with different semantics, and we distinguish between them based on whether we have paren locations; there are also some checks that look at which of two source locations came first when determining which warnings or errors to produce, and so on). Maybe we could go through and fix all of those, but that's still a very large task and it'd be hard to check we got them all.

Not nice user experience but maybe doable? I was hoping there was something slightly better that still works without a major restructuring (maybe something that at least gives a rough location or something that only gives the location of the error and not the include stack under an option or using some kind of heuristic to detect that things go haywire).

 

As an alternative, I was curious if it would be possible and acceptable to make the switch between 32-bit and 64-bit location tracking a build-time/cmake decision? I’ve not done any estimation on the memory size growth, so maybe this is a dead end.

If someone is prepared to do the work to add and maintain this build mode, I think this might be a feasible option.

Thanks,

Christof

 

From: cfe-dev <[hidden email]> On Behalf Of Richard Smith via cfe-dev
Sent: 07 October 2019 20:36
To: Mikhail Maltsev <
[hidden email]>
Cc: nd <
[hidden email]>; [hidden email]
Subject: Re: [cfe-dev] [RFC] Clang SourceLocation overflow

 

On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <[hidden email]> wrote:

Hi all,

We are experiencing a problem with Clang SourceLocation overflow.
Currently source locations are 32-bit values, one bit is a flag, which gives
a source location space of 2^31 characters.

When the Clang lexer processes an #include directive it reserves the total size
of the file being included in the source location space. An overflow can occur
if a large file (which does not have include guards by design) is included many
times into a single TU.

The pattern of including a file multiple times is for example required by
the AUTOSAR standard [1], which is widely used in the automotive industry.
Specifically the pattern is described in the Specification of Memory Mapping [2]:

Section 8.2.1, MEMMAP003:
"The start and stop symbols for section control are configured with section
identifiers defined in MemMap.h [...] For instance:

#define EEP_START_SEC_VAR_16BIT
#include "MemMap.h"
static uint16 EepTimer;
static uint16 EepRemainingBytes;
#define EEP_STOP_SEC_VAR_16BIT
#include "MemMap.h""

Section 8.2.2, MEMMAP005:
"The file MemMap.h shall provide a mechanism to select different code, variable
or constant sections by checking the definition of the module specific memory
allocation key words for starting a section [...]"

In practice MemMap.h can reach several MBs and can be included several thousand
times causing an overflow in the source location space.

The problem does not occur with GCC because it tracks line numbers rather than
file offsets. Column numbers are tracked separately and are optional. I.e., in
GCC a source location can be either a (line+column) tuple packed into 32 bits or
(when the line number exceeds a certain threshold) a 32-bit line number.

We are looking for an acceptable way of resolving the problem and propose the
following approaches for discussion:
1. Use 64 bits for source location tracking.
2. Track until an overflow occurs after that make the lexer output
   the <invalid location> special value for all subsequent tokens.
3. Implement an approach similar to the one used by GCC and start tracking line
   numbers instead of file offsets after a certain threshold. Resort to (2)
   when even line numbers overflow.
4. (?) Detect the multiple inclusion pattern and track it differently (for now
   we don't have specific ideas on how to implement this)

Is any of these approaches viable? What caveats should we expect? (we already
know about static_asserts guarding the sizes of certain class fields which start
failing in the first approach).

Other suggestions are welcome.

 

I don't think any of the above approaches are reasonable; they would all require fundamental restructuring of major parts of Clang, an efficiency or memory size hit for all other users of Clang, or some combination of those.

 

Your code pattern seems unreasonable; including a multi-megabyte file thousands of times is not a good idea. Can you split out parts of MemMap.h into a separate header that is only included once, and keep only the parts that actually change on repeated inclusion in MemMap.h itself?

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev