[GSoC] "Improving Clang's AST source-fidelity" proposal

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[GSoC] "Improving Clang's AST source-fidelity" proposal

Nicola Gigante
Hello everybody. This is my other proposal for the Google Summer of Code.
Since the student application deadline is very near, I've already submitted it
(and the other one about lambdas) to melange.
Comments and suggestions are really appreciated.

--
Title: Improving Clang's AST source-fidelity

1. Introduction:
Among the many strengths of clang, a most noticeable one is the excellent
support for client applications different from traditional compilers.
When compared to other systems, clang behaves spectacularly
better as far as source-code fidelity is concerned and is going to become
the reference choice for source-code based applications such as refactoring
tools, pretty printers and beautifiers, coding-style checkers, etc.

Most of this is a direct result of the accuracy with which the syntactic
structure of the parsed sources is represented in the clang AST:
different syntactic construct, even though semantically equivalent,
are provided with a precise AST representation that almost allows for
a faithful reconstruction of the original sources. Besides this, the AST
nodes encode accurate source location information to be used, e.g.,
when issuing diagnostic messages.

Even though clang does an excellent job, its AST representation
is not yet as accurate as it should be (i.e., not perfect).
Here is a list of current issues that can affect the precision of
source-code based applications using clang as front-end. This list of
issues has been obtained in strict cooperation with Roberto Bagnara,
Abramo Bagnara and Enea Zaffanella, who actively work with clang on
source-based applications and would be very happy to mentor this idea.

2. Issues:
*) Coherent encoding of expressions occurring in initializers.

The AST representation adopted by clang, besides encoding the syntactic
constructs explicitly occurring in the source code, also represents
those semantic constructs that are only implicitly occurring.
The classical examples are those of implicit type casts, default argument
expressions in C++ function calls, implicit declarations of special
member functions in C++ classes, etc. That is, the syntactic structures
get appropriately "dressed" by semantic info so that, for instance, type
information is always accurate.

For the special case of initialization lists, a somewhat different
approach is taken: clang makes a strong distinction between the syntactic
and the semantic forms of the initializer list. Even though the syntactic
form appropriately represents those initializer expressions that were
present in the source code, these may or may not be "dressed" with the
semantic info. On the other hand, the initializers in the semantic form
are always appropriately dressed, but here we typically see initializers
that were not actually present in the original sources.
The issue is discussed in the following thread:

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-January/013037.html

We propose two changes to the handling of initializer lists.

The first change (which is of interest for source-code based applications)
is to modify the syntactic form of init list by requiring that all
expressions occurring in it are properly "dressed" by semantic info:
this will enhance the coherence of the AST representation for the targeted
applications.

The second change is an optimization of the representation of the
semantic form of initializer lists so as to increase node sharing and
hence greatly improve the memory footprint in special cases (again, see
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-January/013037.html).
Besides improving AST node sharing, the idea is to also allow for
range designators inside the semantic form of init lists.
An example of code that currently behaves badly is the following:

int a[1000000] = { [0] = 1, [999999] = 1 };

---
*) Completing work on attributed types.

Currently, clang is sub-optimal when handling the following kinds
of declarations:

typedef __attribute__((mode(byte))) int small; // attribute disappears
small s1;
int __attribute__((mode(byte))) s2; // attribute is ignored

In the first case, the attribute is "consumed" during parsing, producing
a typedef referring to the semantic type 'signed char', with no indication
that, on the syntactic side, an attribute was used.
In the second case, the attribute is just ignored.

Recently, John McCall added new AST node AttributedType in the Type
hierarchy, in order to correctly model situations such as those above.

http://llvm.org/viewvc/llvm-project?view=rev&revision=122942

This new AST node is not yet integrated into the Parse/Sema routines.

---
*) AST representation for explicit instantiation of templates.

clang does not differentiate between the (syntactic) AST nodes declaring
an explicit instantiation of a template and the (semantic) instantiation
of the template. This can be observed by dumping the AST for the following
example:

template <typename T> struct s { };
extern template class s<double>;
extern template class s<int>;
extern template class s<int>;

obtaining something like:

template <typename T> struct s {};
struct s {};
struct s {};
class s;

Apart from minor pretty printing issues (such as the lack of the
"extern template" keywords and the lack of template argument lists),
we can see that the first instantiations for s<double> and s<int>
are provided with a class body, because they reuse the semantic
instantiation node. This also has the side effect, for instance,
of printing "struct" instead of "class" as the tag keyword.
Compare this with the last declaration, which is a re-declaration of s<int>
and is therefore provided with a new, syntactic node (in this case, the AST
structure is fine, we only need to improve the pretty printer).

---
*) Friend declarations

One of the problems with friend declarations, maybe also related to the
issue above, occurs when a template function that was declared to be a
friend of a class gets later instantiated, as in the following code:

template <class T>
int foo() { return 0; }

class S {
friend int foo<S>();
};

void f() {
foo<S>();
}

The instantiation of foo inside f() causes the instantiation of the very
same function declaration that is referred by the friend declaration
in class S. Hence, the friend declaration becomes a definition,
causing troubles to source-based code tools.
Rather, the instantiation should result in the production of a new,
non-syntactic function declaration (i.e., a re-declaration of the
function declared as friend).

Another issue with friends has to do with so-called FriendTemplateDecl
AST nodes. According to current documentation, this AST node is never
produced by Parse/Sema:

01929 /// template <typename T> class A {
01930 /// friend class MyVector<T>; // not a friend template
01931 /// template <typename U> friend class B; // not a friend template
01932 /// template <typename U> friend class Foo<T>::Nested; // friend template
01933 /// };
01934 /// NOTE: This class is not currently in use. All of the above
01935 /// will yield a FriendDecl, not a FriendTemplateDecl.

---
*) Missing TypeSourceInfo for exception specifiers in FunctionProtoTypeLoc.

As an example, the AST representation for the exception specifier
in the following function declaration

void foo() throw(int (*)[2+1]);

is mapped into the semantic type where the array size is computed:

void foo() throw(int (*)[3]);

---
*) Missing flag for the presence of (optional) 'template' keywords
in nested name specifiers (NestedNameSpecifierLoc).

When parsing the following chunk of code

namespace BAR {

template <class U> static int foo();

template <typename T>
void f() {
int (*g)();
g = &BAR::template foo<int>;
g = &BAR::template foo<T>;
}

} // namespace BAR

we obtain an AST representation with no info about the presence/absence
of the 'template' keyword.

---------
3. Why I'm interested in this project:
My main academic interest area is in the development of compilers, and for this reason
I've followed and used the LLVM and Clang project for the last two years.
Since I plan to graduate this year, the GSoC timeline fits quite well in my plans and
I would like this project to be my undergraduate thesis work.
This specific project has been proposed to me by Prof. Roberto Bagnara, from
University of Parma, which is very interested in this project and would be my
thesis co-relator.

4. How the project will be useful for LLVM:
Clang's AST code-fidelity is crucial for applications based on source code manipulation.
Having the most coherent and precise AST as possible is very important to reach Clang's
design goals.

5. Prior knowledge in compilers and LLVM:
As I said, in my academic life I've been interested primarily in compilers
implementation and design. For this reason, I'm following the Compilers course
at the University as part of my undergraduate degree in CS, and I've been interested
in LLVM and Clang projects for the last two years.
I've used the LLVM backend for a small personal project last year, and I've sent a couple
of trivial patches to Clang in 2009.

6. Academic, Industry and other experiences:
I'm an undergraduate student of the Computer Science course at the University of Udine, Italy
(http://www.uniud.it)

I've worked a lot with C++ in the last five years, including the development of two complex custom
C++ software systems for a quite big local industry, regarding industrial production and product
testing automation.

In 2009, I've successfully applied to the Google Summer of Code for the KDE project, implementing
KAuth, the authentication and authorization framework used by KDE desktops applications.

6. Contact information
IRC nickname: gigabytes at FreeNode and OFTC
personal blog: http://www.gigabytes.it (In italian, sorry)

--

Bye,
Nicola


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [GSoC] "Improving Clang's AST source-fidelity" proposal

Enea Zaffanella
Il 06/04/2011 16:11, Nicola Gigante ha scritto:
> Hello everybody. This is my other proposal for the Google Summer of Code.
[...]
> Title: Improving Clang's AST source-fidelity
[...]
> This list of
> issues has been obtained in strict cooperation with Roberto Bagnara,
> Abramo Bagnara and Enea Zaffanella, who actively work with clang on
> source-based applications and would be very happy to mentor this idea.

As said by Nicola, we are really interested in seeing this project
completed and hence will be ready to serve as mentors to help him
achieving the stated goals.

Abramo and Enea.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [GSoC] "Improving Clang's AST source-fidelity" proposal

Ted Kremenek
In reply to this post by Nicola Gigante
Hi Nicola,

I think this is a good proposal.  My main concern about this project is the potential impact it will have on compile-time performance and memory footprint of the compiler.  I'd like to see in the proposal a plan on measuring regressions on these fronts (which are bound to happen by retaining more information in the ASTs) as well as what is deemed an acceptable level for performance to slide.  For example, if compile time regresses by 10%, I would strongly object to the changes.  Even 5% would probably be too high.

The other thing to keep in mind is that any AST changes will also require changes to the PCH format.  It's critical that PCH remain efficient, and that whatever changes we make to the AST does not adversely impact how much data is pulled from the PCH file (thus slowing down compile times).  For most of the topics you outline I don't think this is an issue, but this is definitely something to keep in mind.

I think the proposal covers a bunch of areas, and it is entirely possible that you won't have time to do them all.  If you were to tackle this project, my preference is that you work on one area to completion, and then proceed to the next one.

Cheers,
Ted

On Apr 6, 2011, at 7:11 AM, Nicola Gigante wrote:

> Hello everybody. This is my other proposal for the Google Summer of Code.
> Since the student application deadline is very near, I've already submitted it
> (and the other one about lambdas) to melange.
> Comments and suggestions are really appreciated.
>
> --
> Title: Improving Clang's AST source-fidelity
>
> 1. Introduction:
> Among the many strengths of clang, a most noticeable one is the excellent
> support for client applications different from traditional compilers.
> When compared to other systems, clang behaves spectacularly
> better as far as source-code fidelity is concerned and is going to become
> the reference choice for source-code based applications such as refactoring
> tools, pretty printers and beautifiers, coding-style checkers, etc.
>
> Most of this is a direct result of the accuracy with which the syntactic
> structure of the parsed sources is represented in the clang AST:
> different syntactic construct, even though semantically equivalent,
> are provided with a precise AST representation that almost allows for
> a faithful reconstruction of the original sources. Besides this, the AST
> nodes encode accurate source location information to be used, e.g.,
> when issuing diagnostic messages.
>
> Even though clang does an excellent job, its AST representation
> is not yet as accurate as it should be (i.e., not perfect).
> Here is a list of current issues that can affect the precision of
> source-code based applications using clang as front-end. This list of
> issues has been obtained in strict cooperation with Roberto Bagnara,
> Abramo Bagnara and Enea Zaffanella, who actively work with clang on
> source-based applications and would be very happy to mentor this idea.
>
> 2. Issues:
> *) Coherent encoding of expressions occurring in initializers.
>
> The AST representation adopted by clang, besides encoding the syntactic
> constructs explicitly occurring in the source code, also represents
> those semantic constructs that are only implicitly occurring.
> The classical examples are those of implicit type casts, default argument
> expressions in C++ function calls, implicit declarations of special
> member functions in C++ classes, etc. That is, the syntactic structures
> get appropriately "dressed" by semantic info so that, for instance, type
> information is always accurate.
>
> For the special case of initialization lists, a somewhat different
> approach is taken: clang makes a strong distinction between the syntactic
> and the semantic forms of the initializer list. Even though the syntactic
> form appropriately represents those initializer expressions that were
> present in the source code, these may or may not be "dressed" with the
> semantic info. On the other hand, the initializers in the semantic form
> are always appropriately dressed, but here we typically see initializers
> that were not actually present in the original sources.
> The issue is discussed in the following thread:
>
> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-January/013037.html
>
> We propose two changes to the handling of initializer lists.
>
> The first change (which is of interest for source-code based applications)
> is to modify the syntactic form of init list by requiring that all
> expressions occurring in it are properly "dressed" by semantic info:
> this will enhance the coherence of the AST representation for the targeted
> applications.
>
> The second change is an optimization of the representation of the
> semantic form of initializer lists so as to increase node sharing and
> hence greatly improve the memory footprint in special cases (again, see
> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-January/013037.html).
> Besides improving AST node sharing, the idea is to also allow for
> range designators inside the semantic form of init lists.
> An example of code that currently behaves badly is the following:
>
> int a[1000000] = { [0] = 1, [999999] = 1 };
>
> ---
> *) Completing work on attributed types.
>
> Currently, clang is sub-optimal when handling the following kinds
> of declarations:
>
> typedef __attribute__((mode(byte))) int small; // attribute disappears
> small s1;
> int __attribute__((mode(byte))) s2; // attribute is ignored
>
> In the first case, the attribute is "consumed" during parsing, producing
> a typedef referring to the semantic type 'signed char', with no indication
> that, on the syntactic side, an attribute was used.
> In the second case, the attribute is just ignored.
>
> Recently, John McCall added new AST node AttributedType in the Type
> hierarchy, in order to correctly model situations such as those above.
>
> http://llvm.org/viewvc/llvm-project?view=rev&revision=122942
>
> This new AST node is not yet integrated into the Parse/Sema routines.
>
> ---
> *) AST representation for explicit instantiation of templates.
>
> clang does not differentiate between the (syntactic) AST nodes declaring
> an explicit instantiation of a template and the (semantic) instantiation
> of the template. This can be observed by dumping the AST for the following
> example:
>
> template <typename T> struct s { };
> extern template class s<double>;
> extern template class s<int>;
> extern template class s<int>;
>
> obtaining something like:
>
> template <typename T> struct s {};
> struct s {};
> struct s {};
> class s;
>
> Apart from minor pretty printing issues (such as the lack of the
> "extern template" keywords and the lack of template argument lists),
> we can see that the first instantiations for s<double> and s<int>
> are provided with a class body, because they reuse the semantic
> instantiation node. This also has the side effect, for instance,
> of printing "struct" instead of "class" as the tag keyword.
> Compare this with the last declaration, which is a re-declaration of s<int>
> and is therefore provided with a new, syntactic node (in this case, the AST
> structure is fine, we only need to improve the pretty printer).
>
> ---
> *) Friend declarations
>
> One of the problems with friend declarations, maybe also related to the
> issue above, occurs when a template function that was declared to be a
> friend of a class gets later instantiated, as in the following code:
>
> template <class T>
> int foo() { return 0; }
>
> class S {
> friend int foo<S>();
> };
>
> void f() {
> foo<S>();
> }
>
> The instantiation of foo inside f() causes the instantiation of the very
> same function declaration that is referred by the friend declaration
> in class S. Hence, the friend declaration becomes a definition,
> causing troubles to source-based code tools.
> Rather, the instantiation should result in the production of a new,
> non-syntactic function declaration (i.e., a re-declaration of the
> function declared as friend).
>
> Another issue with friends has to do with so-called FriendTemplateDecl
> AST nodes. According to current documentation, this AST node is never
> produced by Parse/Sema:
>
> 01929 /// template <typename T> class A {
> 01930 /// friend class MyVector<T>; // not a friend template
> 01931 /// template <typename U> friend class B; // not a friend template
> 01932 /// template <typename U> friend class Foo<T>::Nested; // friend template
> 01933 /// };
> 01934 /// NOTE: This class is not currently in use. All of the above
> 01935 /// will yield a FriendDecl, not a FriendTemplateDecl.
>
> ---
> *) Missing TypeSourceInfo for exception specifiers in FunctionProtoTypeLoc.
>
> As an example, the AST representation for the exception specifier
> in the following function declaration
>
> void foo() throw(int (*)[2+1]);
>
> is mapped into the semantic type where the array size is computed:
>
> void foo() throw(int (*)[3]);
>
> ---
> *) Missing flag for the presence of (optional) 'template' keywords
> in nested name specifiers (NestedNameSpecifierLoc).
>
> When parsing the following chunk of code
>
> namespace BAR {
>
> template <class U> static int foo();
>
> template <typename T>
> void f() {
> int (*g)();
> g = &BAR::template foo<int>;
> g = &BAR::template foo<T>;
> }
>
> } // namespace BAR
>
> we obtain an AST representation with no info about the presence/absence
> of the 'template' keyword.
>
> ---------
> 3. Why I'm interested in this project:
> My main academic interest area is in the development of compilers, and for this reason
> I've followed and used the LLVM and Clang project for the last two years.
> Since I plan to graduate this year, the GSoC timeline fits quite well in my plans and
> I would like this project to be my undergraduate thesis work.
> This specific project has been proposed to me by Prof. Roberto Bagnara, from
> University of Parma, which is very interested in this project and would be my
> thesis co-relator.
>
> 4. How the project will be useful for LLVM:
> Clang's AST code-fidelity is crucial for applications based on source code manipulation.
> Having the most coherent and precise AST as possible is very important to reach Clang's
> design goals.
>
> 5. Prior knowledge in compilers and LLVM:
> As I said, in my academic life I've been interested primarily in compilers
> implementation and design. For this reason, I'm following the Compilers course
> at the University as part of my undergraduate degree in CS, and I've been interested
> in LLVM and Clang projects for the last two years.
> I've used the LLVM backend for a small personal project last year, and I've sent a couple
> of trivial patches to Clang in 2009.
>
> 6. Academic, Industry and other experiences:
> I'm an undergraduate student of the Computer Science course at the University of Udine, Italy
> (http://www.uniud.it)
>
> I've worked a lot with C++ in the last five years, including the development of two complex custom
> C++ software systems for a quite big local industry, regarding industrial production and product
> testing automation.
>
> In 2009, I've successfully applied to the Google Summer of Code for the KDE project, implementing
> KAuth, the authentication and authorization framework used by KDE desktops applications.
>
> 6. Contact information
> e-mail: [hidden email]
> IRC nickname: gigabytes at FreeNode and OFTC
> personal blog: http://www.gigabytes.it (In italian, sorry)
>
> --
>
> Bye,
> Nicola
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [GSoC] "Improving Clang's AST source-fidelity" proposal

Nicola Gigante
On 07 Apr, 2011,at 06:34 PM, Ted Kremenek <[hidden email]> wrote:

Hi Nicola,
Hello Ted.
 
I think this is a good proposal. My main concern about this project is the potential impact it will have on compile-time performance and memory footprint of the compiler. I'd like to see in the proposal a plan on measuring regressions on these fronts (which are bound to happen by retaining more information in the ASTs) as well as what is deemed an acceptable level for performance to slide. For example, if compile time regresses by 10%, I would strongly object to the changes. Even 5% would probably be too high.

The other thing to keep in mind is that any AST changes will also require changes to the PCH format. It's critical that PCH remain efficient, and that whatever changes we make to the AST does not adversely impact how much data is pulled from the PCH file (thus slowing down compile times). For most of the topics you outline I don't think this is an issue, but this is definitely something to keep in mind.
 
I agree with you. I've updated the proposal to reflect your concern. The new text is on melange at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/gigabytes/1001

Since this proposal is made up of a number of almost independent issues, performance measurement can
clearly be done on a per-issue basis. What 'acceptable loss' means in this context will be discussed here.


I think the proposal covers a bunch of areas, and it is entirely possible that you won't have time to do them all. If you were to tackle this project, my preference is that you work on one area to completion, and then proceed to the next one.
 
Your suggestion is entirely right. Every issue will be addressed completely before proceeding to the next. It's better to have
n totally correct patches than 2*n approximative ones.

Cheers,
Ted
 
Bye,
Nicola

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev