Extend Stmt with proper end location?

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Extend Stmt with proper end location?

suyash singh via cfe-dev
Hello,

currently Stmt getEndLoc returns slightly different results depending on Stmt type.
Specifically DoWhile, GotoStmt, ContinueStmt, BreakStmt, ReturnStmt, AsmStmt and SEHLeaveStmt do not track the location of the mandatory semicolon at the end.
(Expr is out of scope of this mail thread)

This is not really a high priority problem, but it makes some replacements in clang-tidy unnecessarily difficult.
Currently one has to differentiate by statement type and then parse past it's end skipping comments until a tok::semicolon within checkers.
Of course based on the last Stmt in case of children like an IfStmt without parenthesis.

However I feel this is a kind of an ugly workaround and Stmt.getEndLoc() should just return the proper end location for all statements incl all mandatory tokens.
To accomplish this the beforementioned statements require a new SourceLocation member.
My assumption is that this has little impact on memory & cache-locality, since those are not really high-occurrence statements - but I'm no expert.
Proof of concept is available here: https://reviews.llvm.org/D76108 (it has many many flaws, don't take it as ready for any kind of review).

Does it make sense to continue that way?

Regards,
Alexander Lanin


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Extend Stmt with proper end location?

suyash singh via cfe-dev

On 13 Mar 2020, at 5:08, via cfe-dev wrote:

currently Stmt getEndLoc returns slightly different results depending on Stmt type.
Specifically DoWhile, GotoStmt, ContinueStmt, BreakStmt, ReturnStmt, AsmStmt and SEHLeaveStmt do not track the location of the mandatory semicolon at the end.
(Expr is out of scope of this mail thread)

To be precise, the only statement that does track the location
of the semicolon is NullStmt, and that’s only because it would
otherwise have no associated source locations and so could not
meaningfully implement getSourceRange().

The Clang AST serves several different masters. It’s understood
that some of the information we track for the use of source tools
isn’t very useful for compilation, and maybe the reverse as well.
We do accept that this adds overhead to compilation. But we’ve
generally drawn the line at storing delimiters because:

  • vanishingly few clients of the AST need exact delimiter locations,
  • the clients that do care need them for vanishingly few AST nodes,
  • there are an awful lot of delimiters in a typical program, and storing them all would eat a lot of memory, and
  • in most cases, it should be easy and cheap to retroactively find those delimiters from the information we do track.

Furthermore, I think expressions are important to consider,
because the practical limitations on finding the semicolon after
an expression are exactly the same as finding it after break.
Fortunately, most of these restrictions coincide with things that
are mostly uninteresting to source tools: it’s possible that the
semicolon could be inside a macro, but you probably can’t safely
rewrite code around macro expansions anyway.

Now, of course there should be one method that implements this
scan instead of forcing it to be re-implemented again and again.
That method can have a special case for NullStmt.

John.


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Extend Stmt with proper end location?

suyash singh via cfe-dev

I’m not sure of the exact list.

At least DeclStmt includes the semicolon:

https://github.com/llvm/llvm-project/blob/master/clang/lib/Parse/ParseDecl.cpp#L1801

 

But this does not matter that much. If tracking those Semicolons is too much, then it probably does

not matter much whether all Stmt behave the same or not as long as there is a way to get a location

with semicolon for those that need it.

I would try to add a comment to Stmt.getEndLoc to describe this as it’s at least surprising otherwise.

 

Alexander

 

 

Von: [hidden email] <[hidden email]>
Gesendet: Dienstag, 17. März 2020 06:11
An: [hidden email]
Cc: Clang Dev <[hidden email]>
Betreff: Re: [cfe-dev] Extend Stmt with proper end location?

 

On 13 Mar 2020, at 5:08, via cfe-dev wrote:

currently Stmt getEndLoc returns slightly different results depending on Stmt type.
Specifically DoWhile, GotoStmt, ContinueStmt, BreakStmt, ReturnStmt, AsmStmt and SEHLeaveStmt do not track the location of the mandatory semicolon at the end.
(Expr is out of scope of this mail thread)

To be precise, the only statement that does track the location
of the semicolon is
NullStmt, and that’s only because it would
otherwise have no associated source locations and so could not
meaningfully implement
getSourceRange().

The Clang AST serves several different masters. It’s understood
that some of the information we track for the use of source tools
isn’t very useful for compilation, and maybe the reverse as well.
We do accept that this adds overhead to compilation. But we’ve
generally drawn the line at storing delimiters because:

  • vanishingly few clients of the AST need exact delimiter locations,
  • the clients that do care need them for vanishingly few AST nodes,
  • there are an awful lot of delimiters in a typical program, and storing them all would eat a lot of memory, and
  • in most cases, it should be easy and cheap to retroactively find those delimiters from the information we do track.

Furthermore, I think expressions are important to consider,
because the practical limitations on finding the semicolon after
an expression are exactly the same as finding it after
break.
Fortunately, most of these restrictions coincide with things that
are mostly uninteresting to source tools: it’s possible that the
semicolon could be inside a macro, but you probably can’t safely
rewrite code around macro expansions anyway.

Now, of course there should be one method that implements this
scan instead of forcing it to be re-implemented again and again.
That method can have a special case for
NullStmt.

John.


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Extend Stmt with proper end location?

suyash singh via cfe-dev
In reply to this post by suyash singh via cfe-dev


On Tue, Mar 17, 2020 at 6:11 AM John McCall via cfe-dev <[hidden email]> wrote:

Furthermore, I think expressions are important to consider,

because the practical limitations on finding the semicolon after
an expression are exactly the same as finding it after break.

To spell this out a little more: formally in `foo();` there's an expression-statement which consists of a the call expression and the semicolon. But clang just uses the CallExpr node to represent both, and CallExpr obviously(?) shouldn't include the semicolon in its source range.

Alex: I think the Syntax library might be more suitable for tasks that need this precise info such as refactoring (and it doesn't suffer in the same way from the multiple masters problem). Unfortunately it's not complete.

The clang::syntax::TokenBuffer class allows you to capture the expanded token stream (bounds and kind of every token) as the parse runs (using TokenCollector). Effectively this lets you opt into making clang record more token-level info at the cost of memory. You then have to poke at this token stream yourself to find the semicolons you're after.

The rest of the Syntax library ("syntax trees") uses a clang AST to build up a true syntactic (grammar-based) tree out of these tokens. "TEST_F(SyntaxTreeTest, While)" in TreeTest.cpp shows how this includes the semicolons of the (grammatical) BreakStatement.
The plan is/was to make it easy to then map between semantic and syntactic nodes, e.g. AST BreakStmt to the corresponding syntax BreakStatement. This hasn't been implemented yet I think.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Extend Stmt with proper end location?

suyash singh via cfe-dev
In reply to this post by suyash singh via cfe-dev
While it most likely won't solve all cases, there's Lexer::getLocForEndOfToken. It can be used to create a range that grabs the semi at the end - assuming there is one... having to be context-sensitive on when we try grabbing it is still a requirement.

I do vaguely remember seeing it used in Tidy in a few places.

But other tools seems to have an issue with this too, looking at the docs for getLocForEndOfToken, it shows arcmt has a method named findSemiAfterLocation.


off {
  I'm in awe that extending Stmt classes worked. I tried something similar recently and it just exploded in my face.
}

On Fri, 13 Mar 2020, 10:08 via cfe-dev, <[hidden email]> wrote:
Hello,

currently Stmt getEndLoc returns slightly different results depending on Stmt type.
Specifically DoWhile, GotoStmt, ContinueStmt, BreakStmt, ReturnStmt, AsmStmt and SEHLeaveStmt do not track the location of the mandatory semicolon at the end.
(Expr is out of scope of this mail thread)

This is not really a high priority problem, but it makes some replacements in clang-tidy unnecessarily difficult.
Currently one has to differentiate by statement type and then parse past it's end skipping comments until a tok::semicolon within checkers.
Of course based on the last Stmt in case of children like an IfStmt without parenthesis.

However I feel this is a kind of an ugly workaround and Stmt.getEndLoc() should just return the proper end location for all statements incl all mandatory tokens.
To accomplish this the beforementioned statements require a new SourceLocation member.
My assumption is that this has little impact on memory & cache-locality, since those are not really high-occurrence statements - but I'm no expert.
Proof of concept is available here: https://reviews.llvm.org/D76108 (it has many many flaws, don't take it as ready for any kind of review).

Does it make sense to continue that way?

Regards,
Alexander Lanin


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Extend Stmt with proper end location?

suyash singh via cfe-dev

Hi,

 

by now I’ve extended LexerUtils to handle exactly what you are mentioning.

See ‘getUnifiedEndLoc’ in https://reviews.llvm.org/D75813#change-Bc7Hs8CXH7Z9 (and feel free to review ;-) )

 

findSemiAfterLocation is indeed very similar, but not quite the same.

Maybe those two approaches could be merged.

Is there anyone willing to touch findSemiAfterLocation as the last change is ~6 years ago?

 

 

@off:

As I was following up on that proof of concept and making all places handle the new members it did somewhat explode in my face. Co_return didn’t make life easier.

I gave up on improving the proof of concept – also as general feedback here seems to be clear. No change of Stmt.

 

Alex

 

 

 

Von: Whisperity <[hidden email]>
Gesendet: Donnerstag, 26. März 2020 10:17
An: [hidden email]
Cc: Clang Dev <[hidden email]>
Betreff: Re: [cfe-dev] Extend Stmt with proper end location?

 

While it most likely won't solve all cases, there's Lexer::getLocForEndOfToken. It can be used to create a range that grabs the semi at the end - assuming there is one... having to be context-sensitive on when we try grabbing it is still a requirement.

 

I do vaguely remember seeing it used in Tidy in a few places.

 

But other tools seems to have an issue with this too, looking at the docs for getLocForEndOfToken, it shows arcmt has a method named findSemiAfterLocation.

 

 

off {

  I'm in awe that extending Stmt classes worked. I tried something similar recently and it just exploded in my face.

}

 

On Fri, 13 Mar 2020, 10:08 via cfe-dev, <[hidden email]> wrote:

Hello,

currently Stmt getEndLoc returns slightly different results depending on Stmt type.
Specifically DoWhile, GotoStmt, ContinueStmt, BreakStmt, ReturnStmt, AsmStmt and SEHLeaveStmt do not track the location of the mandatory semicolon at the end.
(Expr is out of scope of this mail thread)

This is not really a high priority problem, but it makes some replacements in clang-tidy unnecessarily difficult.
Currently one has to differentiate by statement type and then parse past it's end skipping comments until a tok::semicolon within checkers.
Of course based on the last Stmt in case of children like an IfStmt without parenthesis.

However I feel this is a kind of an ugly workaround and Stmt.getEndLoc() should just return the proper end location for all statements incl all mandatory tokens.
To accomplish this the beforementioned statements require a new SourceLocation member.
My assumption is that this has little impact on memory & cache-locality, since those are not really high-occurrence statements - but I'm no expert.
Proof of concept is available here: https://reviews.llvm.org/D76108 (it has many many flaws, don't take it as ready for any kind of review).

Does it make sense to continue that way?

Regards,
Alexander Lanin


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Extend Stmt with proper end location?

suyash singh via cfe-dev
In reply to this post by suyash singh via cfe-dev

Hi,

 

If I understand the concept correctly, currently many checkers rely on parsing a few relevant tokens themselves.

Potentially with the help of some utils/helpers that simplify the checker like the one I’m trying to introduce (see other mail).

 

With TokenBuffer/SyntaxTree, parsing is no longer needed for some/most checkers.

TokenBuffer/SyntaxTree would abstract “one step further” then e.g. LexerUtils and provide an enhanced/specialized AST.

 

However it’s not used that widely throughout llvm.

Is it new and the way to go? Do you expect clang-tidy checkers to rely on the SyntaxTree in the future?

It does indeed provide e.g. a BreakStatement which does exactly what I did by introducing a method into LexerUtils.

 

Alex

 

 

Von: Sam McCall <[hidden email]>
Gesendet: Donnerstag, 26. März 2020 01:08
An: [hidden email]
Cc: Clang Dev <[hidden email]>; John McCall <[hidden email]>
Betreff: Re: [cfe-dev] Extend Stmt with proper end location?

 

 

 

On Tue, Mar 17, 2020 at 6:11 AM John McCall via cfe-dev <[hidden email]> wrote:

Furthermore, I think expressions are important to consider,

because the practical limitations on finding the semicolon after
an expression are exactly the same as finding it after
break.

To spell this out a little more: formally in `foo();` there's an expression-statement which consists of a the call expression and the semicolon. But clang just uses the CallExpr node to represent both, and CallExpr obviously(?) shouldn't include the semicolon in its source range.

 

Alex: I think the Syntax library might be more suitable for tasks that need this precise info such as refactoring (and it doesn't suffer in the same way from the multiple masters problem). Unfortunately it's not complete.

 

The clang::syntax::TokenBuffer class allows you to capture the expanded token stream (bounds and kind of every token) as the parse runs (using TokenCollector). Effectively this lets you opt into making clang record more token-level info at the cost of memory. You then have to poke at this token stream yourself to find the semicolons you're after.

 

The rest of the Syntax library ("syntax trees") uses a clang AST to build up a true syntactic (grammar-based) tree out of these tokens. "TEST_F(SyntaxTreeTest, While)" in TreeTest.cpp shows how this includes the semicolons of the (grammatical) BreakStatement.

The plan is/was to make it easy to then map between semantic and syntactic nodes, e.g. AST BreakStmt to the corresponding syntax BreakStatement. This hasn't been implemented yet I think.


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Extend Stmt with proper end location?

suyash singh via cfe-dev
On Thu, Mar 26, 2020 at 7:02 PM <[hidden email]> wrote:

Hi,

 

If I understand the concept correctly, currently many checkers rely on parsing a few relevant tokens themselves.

Potentially with the help of some utils/helpers that simplify the checker like the one I’m trying to introduce (see other mail).

 

With TokenBuffer/SyntaxTree, parsing is no longer needed for some/most checkers.

TokenBuffer/SyntaxTree would abstract “one step further” then e.g. LexerUtils and provide an enhanced/specialized AST.

 

However it’s not used that widely throughout llvm.

Is it new and the way to go?

TokenBuffer is pretty robust, we're using it heavily in clangd. It may have some incomplete parts. This is one step up from LexerUtils.
Syntax tree is new and ... not finished yet :-) This is a second step up.
 

Do you expect clang-tidy checkers to rely on the SyntaxTree in the future?

I don't know, if it gets nicely finished I think it'd be very useful for rewriting code (i.e. checker fixes). But I don't know anyone actively working on it.
 

It does indeed provide e.g. a BreakStatement which does exactly what I did by introducing a method into LexerUtils.

 

Alex

 

 

Von: Sam McCall <[hidden email]>
Gesendet: Donnerstag, 26. März 2020 01:08
An: [hidden email]
Cc: Clang Dev <[hidden email]>; John McCall <[hidden email]>
Betreff: Re: [cfe-dev] Extend Stmt with proper end location?

 

 

 

On Tue, Mar 17, 2020 at 6:11 AM John McCall via cfe-dev <[hidden email]> wrote:

Furthermore, I think expressions are important to consider,

because the practical limitations on finding the semicolon after
an expression are exactly the same as finding it after
break.

To spell this out a little more: formally in `foo();` there's an expression-statement which consists of a the call expression and the semicolon. But clang just uses the CallExpr node to represent both, and CallExpr obviously(?) shouldn't include the semicolon in its source range.

 

Alex: I think the Syntax library might be more suitable for tasks that need this precise info such as refactoring (and it doesn't suffer in the same way from the multiple masters problem). Unfortunately it's not complete.

 

The clang::syntax::TokenBuffer class allows you to capture the expanded token stream (bounds and kind of every token) as the parse runs (using TokenCollector). Effectively this lets you opt into making clang record more token-level info at the cost of memory. You then have to poke at this token stream yourself to find the semicolons you're after.

 

The rest of the Syntax library ("syntax trees") uses a clang AST to build up a true syntactic (grammar-based) tree out of these tokens. "TEST_F(SyntaxTreeTest, While)" in TreeTest.cpp shows how this includes the semicolons of the (grammatical) BreakStatement.

The plan is/was to make it easy to then map between semantic and syntactic nodes, e.g. AST BreakStmt to the corresponding syntax BreakStatement. This hasn't been implemented yet I think.


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Extend Stmt with proper end location?

suyash singh via cfe-dev
In reply to this post by suyash singh via cfe-dev
On Wed, Mar 25, 2020 at 5:08 PM Sam McCall via cfe-dev <[hidden email]> wrote:
On Tue, Mar 17, 2020 at 6:11 AM John McCall via cfe-dev <[hidden email]> wrote:

Furthermore, I think expressions are important to consider,

because the practical limitations on finding the semicolon after
an expression are exactly the same as finding it after break.

To spell this out a little more: formally in `foo();` there's an expression-statement which consists of a the call expression and the semicolon. But clang just uses the CallExpr node to represent both, and CallExpr obviously(?) shouldn't include the semicolon in its source range.

Disclaimer: This is idle speculation, and I don't have any experience writing refactoring tools.

I wonder if we could store the delimiter locations in a memory efficient way by adding a second TrailingObject array to CompoundStmt. The overhead would be 4 bytes for every semicolon in the TU, which is potentially a lot when considering unused inline functions, but is not much when compared to the overhead of the Stmt node itself and the 8 byte Stmt* pointer already held in CompoundStmt.

There is of course the case of semicolon locations for non-compound statements (`if (cond) foo();`), but one could invent a new AST node for that case without wasting much memory.

It sounds like there are a lot of clang tools that use the lexer to search for semicolons. Would it help to store them this way instead?

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev