Identifier text in raw lexing

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Identifier text in raw lexing

Abramo Bagnara-2

Currently, during lexing in LexingRawMode only the literal text is
collected (setLiteralData/getLiteralData), while the identifier text is
lost.

We'd like to save the start of identifier text in raw lexing mode using
the same shared pointer to permit to raw lexing client to avoid to do a
lot of extra work to get the identifier text.

To implement that:

- we'd insert a raw_identifer token kind
- we'd add Token::isIdentifier method
- we'd add Token::get/setRawIdentiferData methods
- we'd change Token::getIdentifierInfo to return 0 on raw_identifier
- we'd replace occurrences of check for tok::identifier with
Token::isIdentifer()

The performance impact should be definitely zero, but we are willing to
do the benchmark you ask to proof that.

A patch following the above guidelines have some chances to be accepted
upstream?
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Identifier text in raw lexing

Chris Lattner
On Dec 4, 2010, at 2:36 AM, Abramo Bagnara wrote:
> Currently, during lexing in LexingRawMode only the literal text is
> collected (setLiteralData/getLiteralData), while the identifier text is
> lost.

Hi Abramo,

I'm not sure what you're trying to accomplish here: it sounds like you want to get the "identifier text" from the token itself without getting the spelling.  Is this correct? If so, are you looking for a performance win, a functionality improvement, or something else?  What problem are you trying to solve?

-Chris

>
> We'd like to save the start of identifier text in raw lexing mode using
> the same shared pointer to permit to raw lexing client to avoid to do a
> lot of extra work to get the identifier text.
>
> To implement that:
>
> - we'd insert a raw_identifer token kind
> - we'd add Token::isIdentifier method
> - we'd add Token::get/setRawIdentiferData methods
> - we'd change Token::getIdentifierInfo to return 0 on raw_identifier
> - we'd replace occurrences of check for tok::identifier with
> Token::isIdentifer()
>
> The performance impact should be definitely zero, but we are willing to
> do the benchmark you ask to proof that.
>
> A patch following the above guidelines have some chances to be accepted
> upstream?
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Identifier text in raw lexing

Abramo Bagnara-2
Il 05/12/2010 00:26, Chris Lattner ha scritto:
> On Dec 4, 2010, at 2:36 AM, Abramo Bagnara wrote:
>> Currently, during lexing in LexingRawMode only the literal text is
>> collected (setLiteralData/getLiteralData), while the identifier text is
>> lost.
>
> Hi Abramo,
>
> I'm not sure what you're trying to accomplish here: it sounds like you want to get the "identifier text" from the token itself without getting the spelling.  Is this correct? If so, are you looking for a performance win, a functionality improvement, or something else?  What problem are you trying to solve?

Yes, the idea is to have access to identifier text just like we have now
access to LiteralData.

We are looking for a performance win (and secondarily for a simmetric
interface wrt literals) for our raw lexer clients (currently they need
to use something like the above to get the text for each identifier):

  token_begin = sm.getSpellingLoc(token_begin);

  const char* current_char = sm.getCharacterData(token_begin);
  const char* end = current_char + length;

  while (current_char < end) {
    char c = clang::Lexer::getCharAndSizeNoWarn(current_char, length, lo);
    output += c;
    current_char += length;
  }
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

[Request for approval] Identifier text in raw lexing

Abramo Bagnara-2
Il 05/12/2010 08:42, Abramo Bagnara ha scritto:

> Il 05/12/2010 00:26, Chris Lattner ha scritto:
>> On Dec 4, 2010, at 2:36 AM, Abramo Bagnara wrote:
>>> Currently, during lexing in LexingRawMode only the literal text is
>>> collected (setLiteralData/getLiteralData), while the identifier text is
>>> lost.
>>
>> Hi Abramo,
>>
>> I'm not sure what you're trying to accomplish here: it sounds like you want to get the "identifier text" from the token itself without getting the spelling.  Is this correct? If so, are you looking for a performance win, a functionality improvement, or something else?  What problem are you trying to solve?
>
> Yes, the idea is to have access to identifier text just like we have now
> access to LiteralData.
>
> We are looking for a performance win (and secondarily for a simmetric
> interface wrt literals) for our raw lexer clients (currently they need
> to use something like the above to get the text for each identifier):
>
>   token_begin = sm.getSpellingLoc(token_begin);
>
>   const char* current_char = sm.getCharacterData(token_begin);
>   const char* end = current_char + length;
>
>   while (current_char < end) {
>     char c = clang::Lexer::getCharAndSizeNoWarn(current_char, length, lo);
>     output += c;
>     current_char += length;
>   }
I've attached the working patch for approval.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

raw-identifier.patch (52K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [Request for approval] Identifier text in raw lexing

Chris Lattner

On Dec 7, 2010, at 1:32 AM, Abramo Bagnara wrote:

>> We are looking for a performance win (and secondarily for a simmetric
>> interface wrt literals) for our raw lexer clients (currently they need
>> to use something like the above to get the text for each identifier):
>>
>>  token_begin = sm.getSpellingLoc(token_begin);
>>
>>  const char* current_char = sm.getCharacterData(token_begin);
>>  const char* end = current_char + length;
>>
>>  while (current_char < end) {
>>    char c = clang::Lexer::getCharAndSizeNoWarn(current_char, length, lo);
>>    output += c;
>>    current_char += length;
>>  }
>
> I've attached the working patch for approval.

Hi Abramo,

This approach looks ok in principle, but I have some nit-pics about your patch :)

+++ lib/Parse/ParseExprCXX.cpp (working copy)
@@ -130,7 +130,7 @@
       SourceLocation TemplateKWLoc = ConsumeToken();
       
       UnqualifiedId TemplateName;
-      if (Tok.is(tok::identifier)) {
+      if (Tok.isIdentifier()) {
         // Consume the identifier.

You seem to have search and replaced all uses of tok::identifier with isIdentifier.  However, raw_identifiers can't get into the parser and can't get into most places in the preprocessor.  Please keep these as checks of just tok::identifier.  Only code that can actually get both should use "isIdentifier".  Also, for the same reason, please rename isIdentifier to isAnyIdentifier.


+++ lib/Lex/Preprocessor.cpp (working copy)
@@ -369,11 +369,11 @@
 // Lexer Event Handling.
 //===----------------------------------------------------------------------===//
 
-/// LookUpIdentifierInfo - Given a tok::identifier token, look up the
+/// LookUpIdentifierInfo - Given a (raw) identifier token, look up the
 /// identifier information for the token and install it into the token.
 IdentifierInfo *Preprocessor::LookUpIdentifierInfo(Token &Identifier,
                                                    const char *BufPtr) const {
-  assert(Identifier.is(tok::identifier) && "Not an identifier!");
+  assert(Identifier.isIdentifier() && "Not an identifier!");
   assert(Identifier.getIdentifierInfo() == 0 && "Identinfo already exists!");
 

This gets passed in a raw_identifier if I understand things correctly.  Please change the assert to verify that.  Given your changes, there is no need to pass in BufPtr either.  Please eliminate this argument, getting the pointer from the token instead.

+++ lib/Lex/Lexer.cpp (working copy)
@@ -473,7 +473,7 @@
       // we don't have an identifier table available. Instead, just look at
       // the raw identifier to recognize and categorize preprocessor directives.
       TheLexer.LexFromRawLexer(TheTok);
-      if (TheTok.getKind() == tok::identifier && !TheTok.needsCleaning()) {
+      if (TheTok.isIdentifier() && !TheTok.needsCleaning()) {
         const char *IdStart = Buffer->getBufferStart()
                             + TheTok.getLocation().getRawEncoding() - 1;
         llvm::StringRef Keyword(IdStart, TheTok.getLength());

Likewise, this seems like it should only check for a raw_identifier.  The code can be dramatically simplified by using the "const char*" in the token.  Please do.


Similarly, the code in Preprocessor::SkipExcludedConditionalBlock shouldn't have to use:

    const char *RawCharData = SourceMgr.getCharacterData(Tok.getLocation(),
                                                         &Invalid);

it can now use the const char* in the token.  While I was initially skeptical of the approach, I think it will lead to a nice cleanup of existing code (and is the right thing to do), but it is important not to scatter "isIdentifiers" everywhere, and important to make use of the new capabilities!

Thanks for working on this,

-Chris
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [Request for approval] Identifier text in raw lexing

Enea Zaffanella
Il 15/12/2010 08:52, Chris Lattner ha scritto:
>
> On Dec 7, 2010, at 1:32 AM, Abramo Bagnara wrote:
[...]
>> I've attached the working patch for approval.
>
> Hi Abramo,
>
> This approach looks ok in principle, but I have some nit-pics about
> your patch :)
[...]
> You seem to have search and replaced all uses of tok::identifier with
> isIdentifier.  However, raw_identifiers can't get into the parser and
> can't get into most places in the preprocessor.  Please keep these as
> checks of just tok::identifier.  Only code that can actually get both
> should use "isIdentifier".  Also, for the same reason, please rename
> isIdentifier to isAnyIdentifier.
[...]
> While I was initially
> skeptical of the approach, I think it will lead to a nice cleanup of
> existing code (and is the right thing to do), but it is important not
> to scatter "isIdentifiers" everywhere, and important to make use of
> the new capabilities!
>
> Thanks for working on this,
>
> -Chris

Here is the revised patch. It passes all clang tests, but we would
really appreciate to have it reviewed again before commit.

Cheers,
Enea.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

revised-raw-identifier.patch (20K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [Request for approval] Identifier text in raw lexing

Abramo Bagnara-2
Il 17/12/2010 09:19, Enea Zaffanella ha scritto:

> Il 15/12/2010 08:52, Chris Lattner ha scritto:
>>
>> Hi Abramo,
>>
>> This approach looks ok in principle, but I have some nit-pics about
>> your patch :)
> [...]
>> You seem to have search and replaced all uses of tok::identifier with
>> isIdentifier.  However, raw_identifiers can't get into the parser and
>> can't get into most places in the preprocessor.  Please keep these as
>> checks of just tok::identifier.  Only code that can actually get both
>> should use "isIdentifier".  Also, for the same reason, please rename
>> isIdentifier to isAnyIdentifier.
> [...]
>> While I was initially
>> skeptical of the approach, I think it will lead to a nice cleanup of
>> existing code (and is the right thing to do), but it is important not
>> to scatter "isIdentifiers" everywhere, and important to make use of
>> the new capabilities!
>>
>> Thanks for working on this,
>>
>> -Chris
>
> Here is the revised patch. It passes all clang tests, but we would
> really appreciate to have it reviewed again before commit.

Ping.

Can I proceed with commit?
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [Request for approval] Identifier text in raw lexing

Chris Lattner
In reply to this post by Enea Zaffanella

On Dec 17, 2010, at 12:19 AM, Enea Zaffanella wrote:

> Il 15/12/2010 08:52, Chris Lattner ha scritto:
>>
>> On Dec 7, 2010, at 1:32 AM, Abramo Bagnara wrote:
> [...]
>>> I've attached the working patch for approval.
>>
>> Hi Abramo,
>>
>> This approach looks ok in principle, but I have some nit-pics about
>> your patch :)
> [...]
>> You seem to have search and replaced all uses of tok::identifier with
>> isIdentifier.  However, raw_identifiers can't get into the parser and
>> can't get into most places in the preprocessor.  Please keep these as
>> checks of just tok::identifier.  Only code that can actually get both
>> should use "isIdentifier".  Also, for the same reason, please rename
>> isIdentifier to isAnyIdentifier.
> [...]
>> While I was initially
>> skeptical of the approach, I think it will lead to a nice cleanup of
>> existing code (and is the right thing to do), but it is important not
>> to scatter "isIdentifiers" everywhere, and important to make use of
>> the new capabilities!
>>
>> Thanks for working on this,
>>
>> -Chris
>
> Here is the revised patch. It passes all clang tests, but we would
> really appreciate to have it reviewed again before commit.

Sorry for the delay, this patch looks really great to me.  Please commit!

-Chris
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev