Stmt.getEndLoc() vs semicolon

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Stmt.getEndLoc() vs semicolon

suyash singh via cfe-dev

Hello,

 

I’m having trouble with the locations returned by Stmt getEndLoc()/getSourceRange() and am wondering whether this is a bug/flaw:

the returned location sometimes includes and sometimes excludes the semicolon in the end.

This gets even more interesting when there are (multiple) comments before the semicolon, simply because the difference between the returned values gets even bigger.

Here is an example showing how DeclStmt includes the semicolon, while CallExpr and BinaryOperator exclude the semicolon http://ce.steveire.com/z/TW3IAG.

 

So Stmt behaves “completely differently” depending on it’s type, which is no way suggested by it’s interface.

Wouldn’t it be better for Stmt-Users if it would always be the same?

(Not sure whether it should always be included or excluded. It’s not even always a semicolon as in that 3rd foo() in the example)

https://en.wikipedia.org/wiki/Liskov_substitution_principle

 

Here is one of the effects of this complexity for users of Stmt as they struggle to find that semicolon:

https://bugs.llvm.org/show_bug.cgi?id=25970 / https://reviews.llvm.org/D16267

Of course it’s fixable there, but that would imply working around the issue in multiple places.

 

Regards,

Alexander Lanin


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Stmt.getEndLoc() vs semicolon

suyash singh via cfe-dev
Alexander,

I agree that this behavior is rather confusing, especially to first-time users. The cause of the issue is that Expr derives from Stmt so that Exprs can appear directly in, for example, compound statements, rather than being wrapped in an explicit node to represent statements that consist only of an expression.  As far as I understand, this choice to avoid an explicit node had some (positive) performance implications for the AST when it was originally designed.  While I'd love to see this choice revisited, I think it would be a significant effort and don't expect it to happen.

Instead, we tend to work around the issue.  For getting proper ranges of statements, you might find clang::tooling::getExtendedText/getExtendedRange meet your needs:

For declarations, the (just added!) getAssociatedRange might be even better: https://github.com/llvm/llvm-project/blob/master/clang/include/clang/Tooling/Transformer/SourceCode.h#L39

Cheers,
Yitzhak

On Tue, Feb 25, 2020 at 5:15 PM Alexander Lanin via cfe-dev <[hidden email]> wrote:

Hello,

 

I’m having trouble with the locations returned by Stmt getEndLoc()/getSourceRange() and am wondering whether this is a bug/flaw:

the returned location sometimes includes and sometimes excludes the semicolon in the end.

This gets even more interesting when there are (multiple) comments before the semicolon, simply because the difference between the returned values gets even bigger.

Here is an example showing how DeclStmt includes the semicolon, while CallExpr and BinaryOperator exclude the semicolon http://ce.steveire.com/z/TW3IAG.

 

So Stmt behaves “completely differently” depending on it’s type, which is no way suggested by it’s interface.

Wouldn’t it be better for Stmt-Users if it would always be the same?

(Not sure whether it should always be included or excluded. It’s not even always a semicolon as in that 3rd foo() in the example)

https://en.wikipedia.org/wiki/Liskov_substitution_principle

 

Here is one of the effects of this complexity for users of Stmt as they struggle to find that semicolon:

https://bugs.llvm.org/show_bug.cgi?id=25970 / https://reviews.llvm.org/D16267

Of course it’s fixable there, but that would imply working around the issue in multiple places.

 

Regards,

Alexander Lanin

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

smime.p7s (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Stmt.getEndLoc() vs semicolon

suyash singh via cfe-dev

Hi,

 

thanks, that’s interesting, but I’m not sure how this relates to trailing comments + semicolon?

Why does it matter whether Expr is wrapped or not? It can contain comments in general. But doesn’t contain those at the end.

At least for such examples I find it strange: http://ce.steveire.com/z/KdXBDg

 

 

Is Expr the only Stmt that doesn’t include „it’s end“?

Would it make sense to have some unifyStmtRange() in SourceCode.h with special handling for Expr which would extend it with (n x comment) + (optional comma or semi) [+ all the special cases I’m not thinking about right now]?

 

Currently the check mentioned below always subtracts one char and then checks whether it’s a semicolon. Might be nicer to distinguish by isa<Expr>?

 

Regards,

Alex

 

 

 

Von: [hidden email]
Gesendet: Mittwoch, 26. Februar 2020 16:40
An: [hidden email]
Cc: [hidden email]
Betreff: Re: [cfe-dev] Stmt.getEndLoc() vs semicolon

 

Alexander,

 

I agree that this behavior is rather confusing, especially to first-time users. The cause of the issue is that Expr derives from Stmt so that Exprs can appear directly in, for example, compound statements, rather than being wrapped in an explicit node to represent statements that consist only of an expression.  As far as I understand, this choice to avoid an explicit node had some (positive) performance implications for the AST when it was originally designed.  While I'd love to see this choice revisited, I think it would be a significant effort and don't expect it to happen.

 

Instead, we tend to work around the issue.  For getting proper ranges of statements, you might find clang::tooling::getExtendedText/getExtendedRange meet your needs:

 

For declarations, the (just added!) getAssociatedRange might be even better: https://github.com/llvm/llvm-project/blob/master/clang/include/clang/Tooling/Transformer/SourceCode.h#L39

 

Cheers,

Yitzhak

 

On Tue, Feb 25, 2020 at 5:15 PM Alexander Lanin via cfe-dev <[hidden email]> wrote:

Hello,

 

I’m having trouble with the locations returned by Stmt getEndLoc()/getSourceRange() and am wondering whether this is a bug/flaw:

the returned location sometimes includes and sometimes excludes the semicolon in the end.

This gets even more interesting when there are (multiple) comments before the semicolon, simply because the difference between the returned values gets even bigger.

Here is an example showing how DeclStmt includes the semicolon, while CallExpr and BinaryOperator exclude the semicolon http://ce.steveire.com/z/TW3IAG.

 

So Stmt behaves “completely differently” depending on it’s type, which is no way suggested by it’s interface.

Wouldn’t it be better for Stmt-Users if it would always be the same?

(Not sure whether it should always be included or excluded. It’s not even always a semicolon as in that 3rd foo() in the example)

https://en.wikipedia.org/wiki/Liskov_substitution_principle

 

Here is one of the effects of this complexity for users of Stmt as they struggle to find that semicolon:

https://bugs.llvm.org/show_bug.cgi?id=25970 / https://reviews.llvm.org/D16267

Of course it’s fixable there, but that would imply working around the issue in multiple places.

 

Regards,

Alexander Lanin

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Stmt.getEndLoc() vs semicolon

suyash singh via cfe-dev
Sorry, I overlooked your point about the trailing comments. Yes, there are two issues: trailing comments and inclusion of semicolon -- my response was with regards to the semicolon.  Since there are a lot issues here, I've tried to break them down below.

## Including semicolons (aka Expr vs Stmt)
When an expression is used in a context expecting a statement and is trailed by a semicolon, the range will not include the semicolon.  If there are any comments between the expression and the semicolon those, too, will be left out.  If we had an ExprStmt, then one could imagine the expression range covering just the expression, but the statement range extending to include the comment and semicolon.

## Including trailing comments
The trailing comments issue seems to be consistent across syntactic forms (at least, the ones you've presented).  Add a trailing comment after a decl statement, eg.
  int x; // foo
and the range still ends at the semicolon. Similarly,
  if (true) {} // foo
the range ends at '}'.

In general, associating comments is hard and it seems reasonable to me either way -- including or excluding.  I'll admit that it feels arbitrary that leading comments are included why trailing are excluded, but I'd venture a guess that this is caused by the parser. Morever, since it seems to be consistent, I think its ok, but should be documented clearly somewhere.

# re: Is Expr the only Stmt that doesn’t include „it’s end“?
I'd say no, with respect to trailing comments, but yes with respect to semicolons.  However, you could argue that it infects other statements as well, e.g.
  if (...) foo();
The range of the _if statement_ will not include the semicolon. So, it too does not include "it's end".

## unifyStmtRange()
I'm all for this kind of thing and, in fact, think it's really the only good solution (that is, adding a layer atop the AST to compute things like this). I'm happy to review patches. Or, if you wait long enough, I'll probably get to this myself. :)

Cheers,
Yitzhak


On Wed, Feb 26, 2020 at 4:53 PM Alexander Lanin <[hidden email]> wrote:

Hi,

 

thanks, that’s interesting, but I’m not sure how this relates to trailing comments + semicolon?

Why does it matter whether Expr is wrapped or not? It can contain comments in general. But doesn’t contain those at the end.

At least for such examples I find it strange: http://ce.steveire.com/z/KdXBDg

 

 

Is Expr the only Stmt that doesn’t include „it’s end“?

Would it make sense to have some unifyStmtRange() in SourceCode.h with special handling for Expr which would extend it with (n x comment) + (optional comma or semi) [+ all the special cases I’m not thinking about right now]?

 

Currently the check mentioned below always subtracts one char and then checks whether it’s a semicolon. Might be nicer to distinguish by isa<Expr>?

 

Regards,

Alex

 

 

 

Von: [hidden email]
Gesendet: Mittwoch, 26. Februar 2020 16:40
An: [hidden email]
Cc: [hidden email]
Betreff: Re: [cfe-dev] Stmt.getEndLoc() vs semicolon

 

Alexander,

 

I agree that this behavior is rather confusing, especially to first-time users. The cause of the issue is that Expr derives from Stmt so that Exprs can appear directly in, for example, compound statements, rather than being wrapped in an explicit node to represent statements that consist only of an expression.  As far as I understand, this choice to avoid an explicit node had some (positive) performance implications for the AST when it was originally designed.  While I'd love to see this choice revisited, I think it would be a significant effort and don't expect it to happen.

 

Instead, we tend to work around the issue.  For getting proper ranges of statements, you might find clang::tooling::getExtendedText/getExtendedRange meet your needs:

 

For declarations, the (just added!) getAssociatedRange might be even better: https://github.com/llvm/llvm-project/blob/master/clang/include/clang/Tooling/Transformer/SourceCode.h#L39

 

Cheers,

Yitzhak

 

On Tue, Feb 25, 2020 at 5:15 PM Alexander Lanin via cfe-dev <[hidden email]> wrote:

Hello,

 

I’m having trouble with the locations returned by Stmt getEndLoc()/getSourceRange() and am wondering whether this is a bug/flaw:

the returned location sometimes includes and sometimes excludes the semicolon in the end.

This gets even more interesting when there are (multiple) comments before the semicolon, simply because the difference between the returned values gets even bigger.

Here is an example showing how DeclStmt includes the semicolon, while CallExpr and BinaryOperator exclude the semicolon http://ce.steveire.com/z/TW3IAG.

 

So Stmt behaves “completely differently” depending on it’s type, which is no way suggested by it’s interface.

Wouldn’t it be better for Stmt-Users if it would always be the same?

(Not sure whether it should always be included or excluded. It’s not even always a semicolon as in that 3rd foo() in the example)

https://en.wikipedia.org/wiki/Liskov_substitution_principle

 

Here is one of the effects of this complexity for users of Stmt as they struggle to find that semicolon:

https://bugs.llvm.org/show_bug.cgi?id=25970 / https://reviews.llvm.org/D16267

Of course it’s fixable there, but that would imply working around the issue in multiple places.

 

Regards,

Alexander Lanin

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

smime.p7s (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Stmt.getEndLoc() vs semicolon

suyash singh via cfe-dev

Hi,

 

>> When an expression is used in a context expecting a statement and is trailed by a semicolon, the range will not include the semicolon

Assuming there is a reason for doing that, this explains the effects. It certainly sounds easier to extend Expr range instead of wrapping it in ExprStmt...

 

>> Expr vs Stmt

Turns out DoStmt is also fun as the conditions is an Expr.

do {} while(false)

/* Comment is not part of DoStmt,

although it's before the semicolon*/;

 

>> In general, associating comments is hard and it seems reasonable to me either way -- including or excluding.

Sure, as long as it’s consistent. The semicolon (and preceding comments) are not.

 

>> it infects other statements as well

Yeah. I realized that too late…

I’ve added a makeUnifiedFileCharRange function to LexerUtils which basically simply calls findTokenSkippingComments + some special handling for NullStmt.

I’ve tried token by token, but I cannot get this to work without a “full blown” parser as there is a very close match to consider:

- In the first case “end while” obviously belongs to the while statement and there is no semicolon to search for. I have to take tok::r_brace as an end. It’s also impossible to go for isa<WhileStmt> since it is nested.

- In the second case “assign” obviously belongs to the statement. Here I have to look beyond tok::r_brace and not stop there.

 

Any advice?

 

 

// RUN: %check_clang_tidy %s readability-braces-around-statements %t

 

int foo() { return 42; }

 

void test() {

  if (true)

    if (true)

      while ("ok") {

        foo();

      } // end while

 

  int s;

  if (true)

    if (true)

      s = int{

        foo()

      }; // assign

}

 

 

Best regards,

Alexander

 

 

Von: Yitzhak Mandelbaum <[hidden email]>
Gesendet: Donnerstag, 27. Februar 2020 16:08
An: Alexander Lanin <[hidden email]>
Cc: [hidden email]
Betreff: Re: [cfe-dev] Stmt.getEndLoc() vs semicolon

 

Sorry, I overlooked your point about the trailing comments. Yes, there are two issues: trailing comments and inclusion of semicolon -- my response was with regards to the semicolon.  Since there are a lot issues here, I've tried to break them down below.

 

## Including semicolons (aka Expr vs Stmt)

When an expression is used in a context expecting a statement and is trailed by a semicolon, the range will not include the semicolon.  If there are any comments between the expression and the semicolon those, too, will be left out.  If we had an ExprStmt, then one could imagine the expression range covering just the expression, but the statement range extending to include the comment and semicolon.

 

## Including trailing comments

The trailing comments issue seems to be consistent across syntactic forms (at least, the ones you've presented).  Add a trailing comment after a decl statement, eg.

  int x; // foo

and the range still ends at the semicolon. Similarly,

  if (true) {} // foo

the range ends at '}'.

 

In general, associating comments is hard and it seems reasonable to me either way -- including or excluding.  I'll admit that it feels arbitrary that leading comments are included why trailing are excluded, but I'd venture a guess that this is caused by the parser. Morever, since it seems to be consistent, I think its ok, but should be documented clearly somewhere.

 

# re: Is Expr the only Stmt that doesn’t include „it’s end“?

I'd say no, with respect to trailing comments, but yes with respect to semicolons.  However, you could argue that it infects other statements as well, e.g.

  if (...) foo();

The range of the _if statement_ will not include the semicolon. So, it too does not include "it's end".

 

## unifyStmtRange()

I'm all for this kind of thing and, in fact, think it's really the only good solution (that is, adding a layer atop the AST to compute things like this). I'm happy to review patches. Or, if you wait long enough, I'll probably get to this myself. :)

 

Cheers,

Yitzhak

 

 

On Wed, Feb 26, 2020 at 4:53 PM Alexander Lanin <[hidden email]> wrote:

Hi,

 

thanks, that’s interesting, but I’m not sure how this relates to trailing comments + semicolon?

Why does it matter whether Expr is wrapped or not? It can contain comments in general. But doesn’t contain those at the end.

At least for such examples I find it strange: http://ce.steveire.com/z/KdXBDg

 

 

Is Expr the only Stmt that doesn’t include „it’s end“?

Would it make sense to have some unifyStmtRange() in SourceCode.h with special handling for Expr which would extend it with (n x comment) + (optional comma or semi) [+ all the special cases I’m not thinking about right now]?

 

Currently the check mentioned below always subtracts one char and then checks whether it’s a semicolon. Might be nicer to distinguish by isa<Expr>?

 

Regards,

Alex

 

 

 

Von: [hidden email]
Gesendet: Mittwoch, 26. Februar 2020 16:40
An:
[hidden email]
Cc:
[hidden email]
Betreff: Re: [cfe-dev] Stmt.getEndLoc() vs semicolon

 

Alexander,

 

I agree that this behavior is rather confusing, especially to first-time users. The cause of the issue is that Expr derives from Stmt so that Exprs can appear directly in, for example, compound statements, rather than being wrapped in an explicit node to represent statements that consist only of an expression.  As far as I understand, this choice to avoid an explicit node had some (positive) performance implications for the AST when it was originally designed.  While I'd love to see this choice revisited, I think it would be a significant effort and don't expect it to happen.

 

Instead, we tend to work around the issue.  For getting proper ranges of statements, you might find clang::tooling::getExtendedText/getExtendedRange meet your needs:

 

For declarations, the (just added!) getAssociatedRange might be even better: https://github.com/llvm/llvm-project/blob/master/clang/include/clang/Tooling/Transformer/SourceCode.h#L39

 

Cheers,

Yitzhak

 

On Tue, Feb 25, 2020 at 5:15 PM Alexander Lanin via cfe-dev <[hidden email]> wrote:

Hello,

 

I’m having trouble with the locations returned by Stmt getEndLoc()/getSourceRange() and am wondering whether this is a bug/flaw:

the returned location sometimes includes and sometimes excludes the semicolon in the end.

This gets even more interesting when there are (multiple) comments before the semicolon, simply because the difference between the returned values gets even bigger.

Here is an example showing how DeclStmt includes the semicolon, while CallExpr and BinaryOperator exclude the semicolon http://ce.steveire.com/z/TW3IAG.

 

So Stmt behaves “completely differently” depending on it’s type, which is no way suggested by it’s interface.

Wouldn’t it be better for Stmt-Users if it would always be the same?

(Not sure whether it should always be included or excluded. It’s not even always a semicolon as in that 3rd foo() in the example)

https://en.wikipedia.org/wiki/Liskov_substitution_principle

 

Here is one of the effects of this complexity for users of Stmt as they struggle to find that semicolon:

https://bugs.llvm.org/show_bug.cgi?id=25970 / https://reviews.llvm.org/D16267

Of course it’s fixable there, but that would imply working around the issue in multiple places.

 

Regards,

Alexander Lanin

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Stmt.getEndLoc() vs semicolon

suyash singh via cfe-dev
responses inline...

On Sat, Feb 29, 2020 at 3:42 PM <[hidden email]> wrote:

Hi,

 

>> When an expression is used in a context expecting a statement and is trailed by a semicolon, the range will not include the semicolon

Assuming there is a reason for doing that, this explains the effects. It certainly sounds easier to extend Expr range instead of wrapping it in ExprStmt...


The problem is context. Consider expression `f(e)` in two cases:

1. `f(e);`
2. `return f(e) + 3;`

What is the range of `f(e)`? Is it wider in the first case than the second? if so, will this surprise/break code that might assume it invariant?  But, even worse, in case 1, how do you know which range the user wants? If I'm replacing statements, then I'll want through the semicolon, but if I'm updating call expressions, I'll only want the call itself.

So, both context and intent matter. An ExprStmt would clearly separate the two.

 

>> Expr vs Stmt

Turns out DoStmt is also fun as the conditions is an Expr.

do {} while(false)

/* Comment is not part of DoStmt,

although it's before the semicolon*/;

Good catch! But, I don't think that's the reason, because the condition is clearly delineated by the parentheses. It just seems to be an inconsistency. I can't see any reason that the comment and semi are excluded here while included in the DeclStmt. I think this should be changed.

>> In general, associating comments is hard and it seems reasonable to me either way -- including or excluding.

Sure, as long as it’s consistent. The semicolon (and preceding comments) are not.

 

>> it infects other statements as well

Yeah. I realized that too late…

I’ve added a makeUnifiedFileCharRange function to LexerUtils which basically simply calls findTokenSkippingComments + some special handling for NullStmt.

I’ve tried token by token, but I cannot get this to work without a “full blown” parser as there is a very close match to consider:

- In the first case “end while” obviously belongs to the while statement and there is no semicolon to search for. I have to take tok::r_brace as an end. It’s also impossible to go for isa<WhileStmt> since it is nested.

- In the second case “assign” obviously belongs to the statement. Here I have to look beyond tok::r_brace and not stop there.

 

Any advice?

Have you looked at the code I pointed to earlier (getAssociatedRange)? Properly handling those kinds of cases is quite difficult, but I don't think you need to parse the code. It's more about newlines and such.


 

 

 

// RUN: %check_clang_tidy %s readability-braces-around-statements %t

 

int foo() { return 42; }

 

void test() {

  if (true)

    if (true)

      while ("ok") {

        foo();

      } // end while

 

  int s;

  if (true)

    if (true)

      s = int{

        foo()

      }; // assign

}

 

 

Best regards,

Alexander

 

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

smime.p7s (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Stmt.getEndLoc() vs semicolon

suyash singh via cfe-dev

Ok got it!

I thought all Expressions look like that missing ExprStmt.

There are Statements and Expressions and I initially assumed Expressions look like Statements, but that’s just one sub/supertype of Expr.

 

 

Looking into the DoWhile topic I found a corresponding commit:

https://github.com/llvm/llvm-project/commit/503fadc90f397a773ac17f40f72ca3e81cba2c49

 

Interestingly the statements touched in that commit have some high overlap with my “strange behavior” list in https://reviews.llvm.org/D75813#change-Bc7Hs8CXH7Z9

I came up with my list by comparing results of an AST based approach and a token based approach - over llvm codebase.

Apparently there are no gotos in the part which I have looked at ;-)

 

I think I could change ParseStmt.cpp for DoWhile, GotoStmt, ContinueStmt, BreakStmt, ReturnStmt, AsmStmt and SEHLeaveStmt.

Looks rather trivial (so it probably isn’t).

But I would have to change Stmt.h as well and add an SemiLoc or something to all of them.

For example BreakStmt doesn’t have any secondary SourceLocation, others already store RParenLoc – but that doesn’t help.

 

Does that make sense? I would still need to keep my getUnifiedEndLoc, but it could be simplified to getStmtLikeEndLocForStmtLikeExpr.

 

Alex

 

 

 

Von: Yitzhak Mandelbaum <[hidden email]>
Gesendet: Mittwoch, 11. März 2020 18:47
An: Alexander Lanin <[hidden email]>
Cc: Clang Dev <[hidden email]>
Betreff: Re: [cfe-dev] Stmt.getEndLoc() vs semicolon

 

responses inline...

 

On Sat, Feb 29, 2020 at 3:42 PM <[hidden email]> wrote:

Hi,

 

>> When an expression is used in a context expecting a statement and is trailed by a semicolon, the range will not include the semicolon

Assuming there is a reason for doing that, this explains the effects. It certainly sounds easier to extend Expr range instead of wrapping it in ExprStmt...

 

The problem is context. Consider expression `f(e)` in two cases:

 

1. `f(e);`

2. `return f(e) + 3;`

 

What is the range of `f(e)`? Is it wider in the first case than the second? if so, will this surprise/break code that might assume it invariant?  But, even worse, in case 1, how do you know which range the user wants? If I'm replacing statements, then I'll want through the semicolon, but if I'm updating call expressions, I'll only want the call itself.

 

So, both context and intent matter. An ExprStmt would clearly separate the two.

 

>> Expr vs Stmt

Turns out DoStmt is also fun as the conditions is an Expr.

do {} while(false)

/* Comment is not part of DoStmt,

although it's before the semicolon*/;

Good catch! But, I don't think that's the reason, because the condition is clearly delineated by the parentheses. It just seems to be an inconsistency. I can't see any reason that the comment and semi are excluded here while included in the DeclStmt. I think this should be changed.

>> In general, associating comments is hard and it seems reasonable to me either way -- including or excluding.

Sure, as long as it’s consistent. The semicolon (and preceding comments) are not.

 

>> it infects other statements as well

Yeah. I realized that too late…

I’ve added a makeUnifiedFileCharRange function to LexerUtils which basically simply calls findTokenSkippingComments + some special handling for NullStmt.

I’ve tried token by token, but I cannot get this to work without a “full blown” parser as there is a very close match to consider:

- In the first case “end while” obviously belongs to the while statement and there is no semicolon to search for. I have to take tok::r_brace as an end. It’s also impossible to go for isa<WhileStmt> since it is nested.

- In the second case “assign” obviously belongs to the statement. Here I have to look beyond tok::r_brace and not stop there.

 

Any advice?

Have you looked at the code I pointed to earlier (getAssociatedRange)? Properly handling those kinds of cases is quite difficult, but I don't think you need to parse the code. It's more about newlines and such.

 

 

 

 

 

// RUN: %check_clang_tidy %s readability-braces-around-statements %t

 

int foo() { return 42; }

 

void test() {

  if (true)

    if (true)

      while ("ok") {

        foo();

      } // end while

 

  int s;

  if (true)

    if (true)

      s = int{

        foo()

      }; // assign

}

 

 

Best regards,

Alexander

 

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Stmt.getEndLoc() vs semicolon

suyash singh via cfe-dev
I don't have any say in the AST code.  That would have to involve one of the owners (and I don't know who those are) and it's not a simple matter, because so much of the compiler is connected to the AST. If you're thinking of such a change I would start a new thread on this list explicitly raising the issue.

Regarding that change -- I'm not sure how relevant it is, given that it's from back in 2006.  But, it is interesting to see and probably indicates there's a good reason for the discrepancy (or, at least, there was at some point).

On Wed, Mar 11, 2020 at 5:35 PM <[hidden email]> wrote:

Ok got it!

I thought all Expressions look like that missing ExprStmt.

There are Statements and Expressions and I initially assumed Expressions look like Statements, but that’s just one sub/supertype of Expr.

 

 

Looking into the DoWhile topic I found a corresponding commit:

https://github.com/llvm/llvm-project/commit/503fadc90f397a773ac17f40f72ca3e81cba2c49

 

Interestingly the statements touched in that commit have some high overlap with my “strange behavior” list in https://reviews.llvm.org/D75813#change-Bc7Hs8CXH7Z9

I came up with my list by comparing results of an AST based approach and a token based approach - over llvm codebase.

Apparently there are no gotos in the part which I have looked at ;-)

 

I think I could change ParseStmt.cpp for DoWhile, GotoStmt, ContinueStmt, BreakStmt, ReturnStmt, AsmStmt and SEHLeaveStmt.

Looks rather trivial (so it probably isn’t).

But I would have to change Stmt.h as well and add an SemiLoc or something to all of them.

For example BreakStmt doesn’t have any secondary SourceLocation, others already store RParenLoc – but that doesn’t help.

 

Does that make sense? I would still need to keep my getUnifiedEndLoc, but it could be simplified to getStmtLikeEndLocForStmtLikeExpr.

 

Alex

 

 

 

Von: Yitzhak Mandelbaum <[hidden email]>
Gesendet: Mittwoch, 11. März 2020 18:47
An: Alexander Lanin <[hidden email]>
Cc: Clang Dev <[hidden email]>
Betreff: Re: [cfe-dev] Stmt.getEndLoc() vs semicolon

 

responses inline...

 

On Sat, Feb 29, 2020 at 3:42 PM <[hidden email]> wrote:

Hi,

 

>> When an expression is used in a context expecting a statement and is trailed by a semicolon, the range will not include the semicolon

Assuming there is a reason for doing that, this explains the effects. It certainly sounds easier to extend Expr range instead of wrapping it in ExprStmt...

 

The problem is context. Consider expression `f(e)` in two cases:

 

1. `f(e);`

2. `return f(e) + 3;`

 

What is the range of `f(e)`? Is it wider in the first case than the second? if so, will this surprise/break code that might assume it invariant?  But, even worse, in case 1, how do you know which range the user wants? If I'm replacing statements, then I'll want through the semicolon, but if I'm updating call expressions, I'll only want the call itself.

 

So, both context and intent matter. An ExprStmt would clearly separate the two.

 

>> Expr vs Stmt

Turns out DoStmt is also fun as the conditions is an Expr.

do {} while(false)

/* Comment is not part of DoStmt,

although it's before the semicolon*/;

Good catch! But, I don't think that's the reason, because the condition is clearly delineated by the parentheses. It just seems to be an inconsistency. I can't see any reason that the comment and semi are excluded here while included in the DeclStmt. I think this should be changed.

>> In general, associating comments is hard and it seems reasonable to me either way -- including or excluding.

Sure, as long as it’s consistent. The semicolon (and preceding comments) are not.

 

>> it infects other statements as well

Yeah. I realized that too late…

I’ve added a makeUnifiedFileCharRange function to LexerUtils which basically simply calls findTokenSkippingComments + some special handling for NullStmt.

I’ve tried token by token, but I cannot get this to work without a “full blown” parser as there is a very close match to consider:

- In the first case “end while” obviously belongs to the while statement and there is no semicolon to search for. I have to take tok::r_brace as an end. It’s also impossible to go for isa<WhileStmt> since it is nested.

- In the second case “assign” obviously belongs to the statement. Here I have to look beyond tok::r_brace and not stop there.

 

Any advice?

Have you looked at the code I pointed to earlier (getAssociatedRange)? Properly handling those kinds of cases is quite difficult, but I don't think you need to parse the code. It's more about newlines and such.

 

 

 

 

 

// RUN: %check_clang_tidy %s readability-braces-around-statements %t

 

int foo() { return 42; }

 

void test() {

  if (true)

    if (true)

      while ("ok") {

        foo();

      } // end while

 

  int s;

  if (true)

    if (true)

      s = int{

        foo()

      }; // assign

}

 

 

Best regards,

Alexander

 

 


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

smime.p7s (5K) Download Attachment