Re: [lldb-dev] C++ method declaration parsing


Sumner, Brian via cfe-dev
If there is any way to re-use clang parser for this, it would be wonderful.  Even if it means adding support to clang for whatever you need in order to make it possible.  You mention performance, are you certain that clang's parser would be unacceptably slow?

+cfe-dev as they may have some more input on what it would take to extend clang to make this possible.

On Wed, Mar 15, 2017 at 4:48 PM Eugene Zemtsov via lldb-dev <[hidden email]> wrote:
Hi, Everyone.

The current implementation of CPlusPlusLanguage::MethodName::Parse() doesn't cover the full extent of possible function declarations,
or even the declarations returned by abi::__cxa_demangle.

Consider this code:
--------------------------------------------------
#include <stdio.h>
#include <functional>
#include <vector>

void func() {
  printf("func() was called\n");
}

struct Class
{
  Class() {
    printf("ctor was called\n");
  }

  Class(const Class& c) {
    printf("copy ctor was called\n");
  }

  ~Class() {
    printf("dtor was called\n");
  }
};


int main() {
  std::function<void()> f = func;
  f();

  Class c;
  std::vector<Class> v;
  v.push_back(c);

  return 0;
}
--------------------------------------------------

When compiled, it has at least two symbols that currently cannot be correctly parsed by MethodName::Parse():
void std::vector<Class, std::allocator<Class> >::_M_emplace_back_aux<Class const&>(Class const&)
void (* const&std::_Any_data::_M_access<void (*)()>() const)() - a template function that returns a reference to a function pointer.
This causes incorrect behavior in avoid-stepping and sometimes messes up the printing of thread backtraces.

I would like to solve this issue, but the current implementation of method name parsing doesn't seem sustainable.
Clever substrings and regexes are fine for trivial cases, but they become a nightmare once we consider more complex cases.
That's why I'd like to have code that follows some kind of grammar describing function declarations.
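
To illustrate why substring tricks break down, here is a minimal sketch. It is not LLDB's actual code; NaivePrefix is a hypothetical helper that treats everything before the first '(' as "return type + name". That works for a plain function but cuts the second symbol above right after "void", because the first parenthesis there belongs to the function-pointer return type, not to the argument list.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper for illustration only (not part of LLDB):
// returns the text that precedes the first '(' in a declaration.
std::string NaivePrefix(const std::string &decl) {
  // find() returns npos when there is no '(', and substr(0, npos)
  // then yields the whole string.
  return decl.substr(0, decl.find('('));
}
```

For "void func()" this returns "void func", but for the _M_access symbol it returns only "void " and the base name is lost.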

As I see it, the choices for a new implementation of MethodName::Parse() are:
1. Reuse clang parsing code.
2. Parser generated by bison.
3. Handwritten recursive descent parser.

I looked at option #1, and it appears to be impossible to reuse the clang parser for this kind of zero-context parsing,
especially given that we care about the performance of this code. The clang C++ lexer, on the other hand, can be reused.

Option #2: using bison is tempting, but it would require introducing a new compile-time dependency.
That might be especially inconvenient on Windows.

That's why I think option #3 is the way to go: a recursive descent parser that reuses the C++ lexer from clang.

LLDB doesn't need to parse everything (e.g. we don't care about details of function arguments), but it needs to be able to handle tricky return types and base names.
Eventually the new implementation should be able to parse the signature of every method generated by the STL.

Before starting the implementation, I'd love to get some feedback. It might be that I'm overlooking something important.

-- 
Thanks,
Eugene Zemtsov.
_______________________________________________
lldb-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Sumner, Brian via cfe-dev
Yes, it's a good idea to add cfe-dev.
It is totally possible that I overlooked something and clang can help with this kind of superficial parsing. 

As far as I can see, even clang-format does its own parsing (UnwrappedLineParser.cpp), and clang-format has a very similar need to roughly understand code without knowing any context.

> are you certain that clang's parser would be unacceptably slow?

I don't have any perf numbers to back it up, but it does look like a lot of clang infrastructure needs to be set up before actual parsing begins (see lldb_private::ClangExpressionParser). It's not important, though, since at this stage I don't see how we can reuse clang at all.



On Wed, Mar 15, 2017 at 5:03 PM, Zachary Turner <[hidden email]> wrote:
If there is any way to re-use clang parser for this, it would be wonderful.  Even if it means adding support to clang for whatever you need in order to make it possible.  You mention performance, are you certain that clang's parser would be unacceptably slow?

+cfe-dev as they may have some more input on what it would take to extend clang to make this possible.


--
Thanks,
Eugene Zemtsov.


Sumner, Brian via cfe-dev
A random idea: instead of parsing demangled C++ method names, what do people think about writing or reusing a demangler that can give back both the demangled name and the parsed name in some form?

My guess is that it would be both more efficient (we already have most of the information during demangling) and possibly easier to implement, as I expect fewer edge cases. Additionally, I think it would be a nice library to have as part of the LLVM project.
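
For context, a minimal sketch of what the standard demangling entry point gives back today. abi::__cxa_demangle (declared in <cxxabi.h> on Itanium C++ ABI platforms, e.g. with GCC or Clang on Linux) returns only a flat, malloc-allocated string, so any structure the demangler knew about internally has to be recovered by parsing that string again. The Demangle wrapper below is a hypothetical helper, not an existing LLDB or LLVM function.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical convenience wrapper around abi::__cxa_demangle.
// Returns the demangled text, or the input unchanged on failure.
std::string Demangle(const char *mangled) {
  int status = 0;
  // With a null output buffer the function allocates the result
  // with malloc; the caller must release it with free.
  char *buf = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  std::string result = (status == 0 && buf != nullptr) ? buf : mangled;
  std::free(buf); // free(nullptr) is a no-op
  return result;
}
```

Demangle("_Z4funcv") yields the single string "func()"; the boundary between base name and argument list is not reported separately.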

Tamas

On Thu, Mar 16, 2017 at 2:43 AM Eugene Zemtsov via lldb-dev <[hidden email]> wrote:
Yes, it's a good idea to add cfe-dev.
It is totally possible that I overlooked something and clang can help with this kind of superficial parsing. 

As far as I can see, even clang-format does its own parsing (UnwrappedLineParser.cpp), and clang-format has a very similar need to roughly understand code without knowing any context.

> are you certain that clang's parser would be unacceptably slow?

I don't have any perf numbers to back it up, but it does look like a lot of clang infrastructure needs to be set up before actual parsing begins (see lldb_private::ClangExpressionParser). It's not important, though, since at this stage I don't see how we can reuse clang at all.




Sumner, Brian via cfe-dev
I think clang-format's parser would be a better candidate for code
reuse, but even that might be too much, as we don't need that level of
detail (basically we just need to split the name into function name,
return type and argument list), and we can make a lot of simplifying
assumptions here (e.g. template arguments are fully resolved so '>>'
is either two closing template parens, or a part of "operator>>", no
function parameter names or default values, ...).

When you say "recursive descent", you make it sound scary, but is it
really so? AFAICT, the only cause of recursion is the function
pointer return types. Wouldn't that boil down to a single function
that splits a string into <qualifiers>(<rest>)(<arguments>) and
then recurses on <rest>?
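
A rough sketch of that single splitting function, under stated assumptions: parentheses are balanced, every '>' closes one template level, and no operator names other than operator() appear. ParsedName, Split and the helpers are hypothetical names for this illustration, not LLDB code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct ParsedName {
  std::string return_type, name, arguments, qualifiers;
};

static std::string Trim(const std::string &s) {
  size_t b = s.find_first_not_of(" \t");
  if (b == std::string::npos) return "";
  return s.substr(b, s.find_last_not_of(" \t") - b + 1);
}

// Index of the '(' matching the ')' at `close`, or npos if unbalanced.
static size_t MatchingOpen(const std::string &s, size_t close) {
  int depth = 0;
  for (size_t i = close + 1; i-- > 0;) {
    if (s[i] == ')') ++depth;
    else if (s[i] == '(' && --depth == 0) return i;
  }
  return std::string::npos;
}

static bool EndsWith(const std::string &s, const std::string &suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

bool Split(const std::string &decl, ParsedName &out) {
  std::string s = Trim(decl);
  size_t close = s.rfind(')');
  if (close == std::string::npos) return false;
  size_t open = MatchingOpen(s, close);
  if (open == std::string::npos) return false;
  std::string args = s.substr(open, close - open + 1);
  std::string quals = Trim(s.substr(close + 1));
  std::string head = Trim(s.substr(0, open));

  // Function-pointer return type: the head itself ends in a parenthesized
  // group, so `args` belongs to the returned type. Recurse on the group.
  if (!head.empty() && head.back() == ')' && !EndsWith(head, "operator()")) {
    size_t inner_open = MatchingOpen(head, head.size() - 1);
    if (inner_open == std::string::npos) return false;
    std::string inner =
        head.substr(inner_open + 1, head.size() - inner_open - 2);
    if (!Split(inner, out)) return false;
    out.return_type = Trim(head.substr(0, inner_open)) + " (" +
                      out.return_type + ")" + args;
    return true;
  }

  // Plain case: scan right to left for the first space, '*' or '&' that is
  // outside template brackets; the name starts just after it.
  int depth = 0;
  size_t i = head.size();
  for (; i > 0; --i) {
    char c = head[i - 1];
    if (c == '>') ++depth;
    else if (c == '<') --depth;
    else if (depth == 0 && (c == ' ' || c == '*' || c == '&')) break;
  }
  out.name = head.substr(i);
  out.return_type = Trim(head.substr(0, i));
  out.arguments = args;
  out.qualifiers = quals;
  return !out.name.empty();
}
```

On the _M_access symbol from the original post this produces the name std::_Any_data::_M_access<void (*)()>, the arguments (), the qualifier const and the return type void (* const&)(). Operator names such as operator>> or operator< would need extra handling that this sketch leaves out.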




> A random idea: instead of parsing demangled C++ method names, what do people think about writing or reusing a demangler that can give back both the demangled name and the parsed name in some form?
> My guess is that it would be both more efficient (we already have most of the information during demangling) and possibly easier to implement, as I expect fewer edge cases. Additionally, I think it would be a nice library to have as part of the LLVM project.

I was originally against that, as it does not solve the full problem,
as I recall sometimes we get names that don't come from a demangler
(e.g. gcc does not bother emitting mangled names in the dwarf for
static functions -- most of the time the mangled name will still be in
the symbol table, but not if the function is inlined, ...). In this
case we need to piece the name together from the dwarf context.
However, now that I think about it, it does not make sense to be
parsing the names that we ourselves have constructed -- we could just
construct the pieces we need from the original source. I think I am
starting to like this idea. It will be a more complicated project than
just fixing the existing parser, though...

