Adding more info to builtin source types.

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Adding more info to builtin source types.

Enea Zaffanella
Hello.

Currently, class BuiltinTypeLoc provides a single source location,
accessible via member getNameLoc(), pointing to the name of the type
specifier type (e.g., for 'unsigned int' it would return the location of
'int'). It is quite common for coders to declare builtin integral types
using sign or width type specifiers only (e.g., 'unsigned', 'long',
'unsigned short', etc.). In these cases the getNameLoc() method would
return an invalid source location.

The attached patch augments BuiltinTypeLoc class by adding two source
locations, one for the sign specifier and the other for the width
specifier. We have added methods to get/set the three source locations.
We also modified method getNameLoc() so as to return what is _likely_ to
be the source location of the first type specifier token in the source
code, assuming that the three specifiers are provided in the "canonical"
order:
   <sign> <width> <type>

Please let us know if the patch is OK.

Cheers,
Enea Zaffanella.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

BuiltinTypeLoc.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Adding more info to builtin source types.

Douglas Gregor
Hello Enea,

On Jan 11, 2010, at 1:06 PM, Enea Zaffanella wrote:
> Currently, class BuiltinTypeLoc provides a single source location,  
> accessible via member getNameLoc(), pointing to the name of the type  
> specifier type (e.g., for 'unsigned int' it would return the  
> location of 'int'). It is quite common for coders to declare builtin  
> integral types using sign or width type specifiers only (e.g.,  
> 'unsigned', 'long', 'unsigned short', etc.). In these cases the  
> getNameLoc() method would return an invalid source location.

What specific problem do you need to address? Do you just want  
getNameLoc() to have a valid location (which points to the sign or  
width if there is no actual type named), or do you want fine-grained  
information about where the width and sign keywords were actually  
located? The former could be done with a simple change where we build  
BuiltinTypeLocs (without increasing memory usage), while the latter  
requires more work (as you've done in this patch).

> The attached patch augments BuiltinTypeLoc class by adding two  
> source locations, one for the sign specifier and the other for the  
> width specifier. We have added methods to get/set the three source  
> locations.
> We also modified method getNameLoc() so as to return what is  
> _likely_ to be the source location of the first type specifier token  
> in the source code, assuming that the three specifiers are provided  
> in the "canonical" order:
>  <sign> <width> <type>
>
> Please let us know if the patch is OK.

I have a few comments on this patch.

My biggest concern is that we're increasing the amount of storage  
needed for very common cases, and I'd like us to both try to minimize  
the amount of storage needed and to measure the actual cost of this  
change for some non-trivial header, to figure out how much extra  
memory its using in practice.

To minimize the amount of storage needed, we don't always want  
BuiltinTypeLoc to store 3 locations, because only the integral types  
actually need to store sign and width information, and that's only  
written in the source code occasionally. So, BuiltinTypeLoc should  
have a variable-length encoding dependent on the actual builtin type:  
types that always only have one source location (void, bool, float,  
double, nullptr, id, Class, SEL) should only store that location. For  
types that may have more than one location (e.g., the integral types),  
the BuiltinTypeLoc could encode which specifiers exist (in some prefix  
bytes) and then only have source locations for those keywords that  
actually show up in the source.

+  SourceRange getSourceRange() const {
+    SourceLocation loc = getNameLoc();
+    return SourceRange(loc, loc);
+  }

If we're going to keep the source locations for the type/width/sign,  
it'd be nice to the source range to cover the entire type. Storing the  
specifiers in order (with some tag saying which source location refers  
to which kind of keyword) would make this possible.


Of course, depending on how you answered the question at the top of my  
e-mail, all of this might be moot: if you just want  
BuiltinTypeLoc::getNameLoc() to point to some valid location for a  
integral type described only by its width or sign, that's easy to fix  
without any performance impact.

        - Doug
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Adding more info to builtin source types.

Enea Zaffanella
Douglas Gregor wrote:

> Hello Enea,
>
> On Jan 11, 2010, at 1:06 PM, Enea Zaffanella wrote:
>> Currently, class BuiltinTypeLoc provides a single source location,
>> accessible via member getNameLoc(), pointing to the name of the type
>> specifier type (e.g., for 'unsigned int' it would return the location
>> of 'int'). It is quite common for coders to declare builtin integral
>> types using sign or width type specifiers only (e.g., 'unsigned',
>> 'long', 'unsigned short', etc.). In these cases the getNameLoc()
>> method would return an invalid source location.
>
> What specific problem do you need to address? Do you just want
> getNameLoc() to have a valid location (which points to the sign or width
> if there is no actual type named), or do you want fine-grained
> information about where the width and sign keywords were actually
> located? The former could be done with a simple change where we build
> BuiltinTypeLocs (without increasing memory usage), while the latter
> requires more work (as you've done in this patch).

In our specific case, the answer lies somehow in the middle.
Besides the need for a valid source location, we do also want to know
whether or not the type, sign and width specifiers were written
(no matter if directly or indirectly via macro expansions).
So, once the source location is valid, three additional bits would be
enough for our specific purposes.


>> The attached patch augments BuiltinTypeLoc class by adding two source
>> locations, one for the sign specifier and the other for the width
>> specifier. We have added methods to get/set the three source locations.
>> We also modified method getNameLoc() so as to return what is _likely_
>> to be the source location of the first type specifier token in the
>> source code, assuming that the three specifiers are provided in the
>> "canonical" order:
>>  <sign> <width> <type>
>>
>> Please let us know if the patch is OK.
>
> I have a few comments on this patch.
>
> My biggest concern is that we're increasing the amount of storage needed
> for very common cases, and I'd like us to both try to minimize the
> amount of storage needed and to measure the actual cost of this change
> for some non-trivial header, to figure out how much extra memory its
> using in practice.
>
> To minimize the amount of storage needed, we don't always want
> BuiltinTypeLoc to store 3 locations, because only the integral types
> actually need to store sign and width information, and that's only
> written in the source code occasionally. So, BuiltinTypeLoc should have
> a variable-length encoding dependent on the actual builtin type: types
> that always only have one source location (void, bool, float, double,
> nullptr, id, Class, SEL) should only store that location. For types that
> may have more than one location (e.g., the integral types), the
> BuiltinTypeLoc could encode which specifiers exist (in some prefix
> bytes) and then only have source locations for those keywords that
> actually show up in the source.

Ok, we have made some measurements on gcc.c as an example.
This is a 20MB source that takes up as much as 284MB of memory
(as read from column RES of top(1)) when compiled using
clang -cc1 -fsyntax-only.

The memory size overhead due to the (current) patch is ~530000 bytes.

Based on what we observed, an alternative (minimal) solution using
3 bits for just the integer builtin types would reduce this overhead
to ~133000 bytes.
[NOTE: we are assuming a 4 bytes field to store these 3 bits, because:
   a) as far as we know, there are no free bits in a SourceLocation;
   b) we do not want to disrupt memory alignment by using a char.]

We can of course improve our current patch as you suggest (storing the
source locations, but only when needed): this would result in an
overhead of ~176000 bytes.


> +  SourceRange getSourceRange() const {
> +    SourceLocation loc = getNameLoc();
> +    return SourceRange(loc, loc);
> +  }
>
> If we're going to keep the source locations for the type/width/sign,
> it'd be nice to the source range to cover the entire type. Storing the
> specifiers in order (with some tag saying which source location refers
> to which kind of keyword) would make this possible.

This would be desirable, but it is (to our eyes) a bit complicated.

We extract location info from DeclSpec: even though here we have a
source range, this would probably include other stuff such as the
storage class specifier or type qualifiers, which is not really
relevant for builtin types. Anyway, if you think that this is OK,
adding the source range to the BuiltinTypeLoc will be easy:
it could be combined with any of the three solutions mentioned
above (NOTE: having the source range, we could even drop all
the source locations used above, but we would still need at
least the three additional bits).

As for the ordering of the specifiers: there seems to be no info
about it in class DeclSpec.


> Of course, depending on how you answered the question at the top of my
> e-mail, all of this might be moot: if you just want
> BuiltinTypeLoc::getNameLoc() to point to some valid location for a
> integral type described only by its width or sign, that's easy to fix
> without any performance impact.
>
>     - Doug
>
>

All considered, it is mainly a matter of deciding what to do.
To our eyes, the overhead seems not to be a real issue.

Cheers,
Enea Zaffanella.


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Adding more info to builtin source types.

Douglas Gregor

On Jan 12, 2010, at 3:13 AM, Enea Zaffanella wrote:

> Douglas Gregor wrote:
>> Hello Enea,
>> On Jan 11, 2010, at 1:06 PM, Enea Zaffanella wrote:
>>> Currently, class BuiltinTypeLoc provides a single source location,  
>>> accessible via member getNameLoc(), pointing to the name of the  
>>> type specifier type (e.g., for 'unsigned int' it would return the  
>>> location of 'int'). It is quite common for coders to declare  
>>> builtin integral types using sign or width type specifiers only  
>>> (e.g., 'unsigned', 'long', 'unsigned short', etc.). In these cases  
>>> the getNameLoc() method would return an invalid source location.
>> What specific problem do you need to address? Do you just want  
>> getNameLoc() to have a valid location (which points to the sign or  
>> width if there is no actual type named), or do you want fine-
>> grained information about where the width and sign keywords were  
>> actually located? The former could be done with a simple change  
>> where we build BuiltinTypeLocs (without increasing memory usage),  
>> while the latter requires more work (as you've done in this patch).
>
> In our specific case, the answer lies somehow in the middle.
> Besides the need for a valid source location, we do also want to know
> whether or not the type, sign and width specifiers were written
> (no matter if directly or indirectly via macro expansions).
> So, once the source location is valid, three additional bits would be
> enough for our specific purposes.
>
>
>>> The attached patch augments BuiltinTypeLoc class by adding two  
>>> source locations, one for the sign specifier and the other for the  
>>> width specifier. We have added methods to get/set the three source  
>>> locations.
>>> We also modified method getNameLoc() so as to return what is  
>>> _likely_ to be the source location of the first type specifier  
>>> token in the source code, assuming that the three specifiers are  
>>> provided in the "canonical" order:
>>> <sign> <width> <type>
>>>
>>> Please let us know if the patch is OK.
>> I have a few comments on this patch.
>> My biggest concern is that we're increasing the amount of storage  
>> needed for very common cases, and I'd like us to both try to  
>> minimize the amount of storage needed and to measure the actual  
>> cost of this change for some non-trivial header, to figure out how  
>> much extra memory its using in practice.
>> To minimize the amount of storage needed, we don't always want  
>> BuiltinTypeLoc to store 3 locations, because only the integral  
>> types actually need to store sign and width information, and that's  
>> only written in the source code occasionally. So, BuiltinTypeLoc  
>> should have a variable-length encoding dependent on the actual  
>> builtin type: types that always only have one source location  
>> (void, bool, float, double, nullptr, id, Class, SEL) should only  
>> store that location. For types that may have more than one location  
>> (e.g., the integral types), the BuiltinTypeLoc could encode which  
>> specifiers exist (in some prefix bytes) and then only have source  
>> locations for those keywords that actually show up in the source.
>
> Ok, we have made some measurements on gcc.c as an example.
> This is a 20MB source that takes up as much as 284MB of memory
> (as read from column RES of top(1)) when compiled using
> clang -cc1 -fsyntax-only.

284MB sounds really high for 20MB of source :(

> The memory size overhead due to the (current) patch is ~530000 bytes.
>
> Based on what we observed, an alternative (minimal) solution using
> 3 bits for just the integer builtin types would reduce this overhead
> to ~133000 bytes.
> [NOTE: we are assuming a 4 bytes field to store these 3 bits, because:
>  a) as far as we know, there are no free bits in a SourceLocation;
>  b) we do not want to disrupt memory alignment by using a char.]

Right, that makes sense.

> We can of course improve our current patch as you suggest (storing the
> source locations, but only when needed): this would result in an
> overhead of ~176000 bytes.

I actually like the idea of using 3 bits to say which of type/
signedness/width were specified in the type, then just making sure  
that the SourceLocation points to something that makes sense... the  
type specifier location if it's there, otherwise the signedness or  
width location. That gives us a lot more information in the AST  
without costing much at all.

>
>> +  SourceRange getSourceRange() const {
>> +    SourceLocation loc = getNameLoc();
>> +    return SourceRange(loc, loc);
>> +  }
>> If we're going to keep the source locations for the type/width/
>> sign, it'd be nice to the source range to cover the entire type.  
>> Storing the specifiers in order (with some tag saying which source  
>> location refers to which kind of keyword) would make this possible.
>
> This would be desirable, but it is (to our eyes) a bit complicated.
>
> We extract location info from DeclSpec: even though here we have a
> source range, this would probably include other stuff such as the
> storage class specifier or type qualifiers, which is not really
> relevant for builtin types. Anyway, if you think that this is OK,
> adding the source range to the BuiltinTypeLoc will be easy:
> it could be combined with any of the three solutions mentioned
> above (NOTE: having the source range, we could even drop all
> the source locations used above, but we would still need at
> least the three additional bits).
>
> As for the ordering of the specifiers: there seems to be no info
> about it in class DeclSpec.

Ah, okay. Let's not worry about the source range; a single *useful*  
location is good enough.

>> Of course, depending on how you answered the question at the top of  
>> my e-mail, all of this might be moot: if you just want  
>> BuiltinTypeLoc::getNameLoc() to point to some valid location for a  
>> integral type described only by its width or sign, that's easy to  
>> fix without any performance impact.
>>    - Doug
>
> All considered, it is mainly a matter of deciding what to do.
> To our eyes, the overhead seems not to be a real issue.


I'm a bit more paranoid about memory overhead and PCH size.

        - Doug
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Adding more info to builtin source types.

Enea Zaffanella
Douglas Gregor wrote:

[...]

> I actually like the idea of using 3 bits to say which of
> type/signedness/width were specified in the type, then just making sure
> that the SourceLocation points to something that makes sense... the type
> specifier location if it's there, otherwise the signedness or width
> location. That gives us a lot more information in the AST without
> costing much at all.

[...]

> Ah, okay. Let's not worry about the source range; a single *useful*
> location is good enough.

Here is a revised patch.

We store a single location for all builtin types where you cannot write
a width/sign specifier (void, char16, etc.).

For the others builtins, we have additional 4 bytes that are used to
store a total of 10 bits (instead of the 3 mentioned above): we found
that out initial analysis was incomplete in that we were not
considering, e.g., the possibility of having a 'mode' attribute
specifying the width of the builtin. Hence now we store:
  - the TST (5 bits), TSW (2 bits) and TSS (2 bits) fields
    **before** these are modified by the semantic analysis done
    in DeclSpec::Finish();
  - in the 10th bit, whether or not the ModeAttr was given
    **before** this is taken away by TakeAttributes().

Cheers,
Enea Zaffanella.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

BuiltinTypeLoc.patch (11K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Adding more info to builtin source types.

Douglas Gregor
Hello Enea,

On Jan 16, 2010, at 6:05 AM, Enea Zaffanella wrote:

> Douglas Gregor wrote:
>
> [...]
>
>> I actually like the idea of using 3 bits to say which of type/
>> signedness/width were specified in the type, then just making sure  
>> that the SourceLocation points to something that makes sense... the  
>> type specifier location if it's there, otherwise the signedness or  
>> width location. That gives us a lot more information in the AST  
>> without costing much at all.
>
> [...]
>
>> Ah, okay. Let's not worry about the source range; a single *useful*  
>> location is good enough.
>
> Here is a revised patch.
>
> We store a single location for all builtin types where you cannot  
> write a width/sign specifier (void, char16, etc.).
>
> For the others builtins, we have additional 4 bytes that are used to  
> store a total of 10 bits (instead of the 3 mentioned above): we  
> found that out initial analysis was incomplete in that we were not  
> considering, e.g., the possibility of having a 'mode' attribute  
> specifying the width of the builtin. Hence now we store:
> - the TST (5 bits), TSW (2 bits) and TSS (2 bits) fields
>   **before** these are modified by the semantic analysis done
>   in DeclSpec::Finish();
> - in the 10th bit, whether or not the ModeAttr was given
>   **before** this is taken away by TakeAttributes().

Okay, great. I committed your patch here:

        http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20100118/026437.html

with a couple of tweaks:

Index: include/clang/AST/TypeLoc.h
===================================================================
--- include/clang/AST/TypeLoc.h (revision 93512)
+++ include/clang/AST/TypeLoc.h (working copy)
@@ -16,6 +16,7 @@

  #include "clang/AST/Type.h"
  #include "clang/AST/TemplateBase.h"
+#include "clang/Parse/DeclSpec.h"

This is a layering violation, because the AST library is not supposed  
to depend on the Parse library. Since type specifiers are a basic  
property of the C languages, I've moved the various TSW/TSS/TST types  
off into a new header, clang/Basic/Specifiers.h to break the layering  
violation.

+  bool needsExtraLocalData() const {
+    BuiltinType::Kind bk = getTypePtr()->getKind();
+    return (bk >= BuiltinType::UShort && bk <= BuiltinType::UInt128)
+      || (bk >= BuiltinType::Short && bk <= BuiltinType::LongDouble)
+      || bk == BuiltinType::UChar
+      || bk == BuiltinType::SChar;
+  }

"float", "double", and "long double" don't need any extra data, since  
we'll always have a location for the type specifier (float or double)  
and the type is never implied by either a sign or a width.

+    default:
+      return DeclSpec::TST_unspecified;

We often try to avoid using "default", and prefer to write out the  
cases. That way, when someone adds a new builtin type, they can work  
through the "missing case" warnings to make sure that type is handled  
properly in various places within the compiler.

+        else if (TL.getWrittenWidthSpec() != DeclSpec::TSS_unspecified)
+          // Width spec loc overrides type spec loc (e.g., 'short  
int').
+          TL.setBuiltinLoc(DS.getTypeSpecWidthLoc());
+      }

Typo here; that TSS_unspecified should be TSW_unspecified.

+DeclSpec::TST BuiltinTypeLoc::getWrittenTypeSpec() const {
+  if (needsExtraLocalData())
+    return static_cast<TST>(getWrittenBuiltinSpecs().Type);
+  else {
+    switch (getTypePtr()->getKind()) {
+    case BuiltinType::Void:
+      return DeclSpec::TST_void;
+    case BuiltinType::Bool:
+      return DeclSpec::TST_bool;
+
+    case BuiltinType::Char_U: // Falling through.
+    case BuiltinType::Char16:
+    case BuiltinType::Char32:
+    case BuiltinType::Char_S:
+    case BuiltinType::WChar:
+      return DeclSpec::TST_char;

This doesn't seem right... wchar_t, char16_t, and char32_t are actual  
type specifiers in C++. Why would we return TST_char for them? Please  
look at the commit to see if you disagree with my changes here.

        - Doug

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Adding more info to builtin source types.

Enea Zaffanella
Douglas Gregor wrote:
> Hello Enea,

[...]

> Okay, great. I committed your patch here:
>
>     http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20100118/026437.html 
>
>
> with a couple of tweaks:

[...]

> +  bool needsExtraLocalData() const {
> +    BuiltinType::Kind bk = getTypePtr()->getKind();
> +    return (bk >= BuiltinType::UShort && bk <= BuiltinType::UInt128)
> +      || (bk >= BuiltinType::Short && bk <= BuiltinType::LongDouble)
> +      || bk == BuiltinType::UChar
> +      || bk == BuiltinType::SChar;
> +  }
>
> "float", "double", and "long double" don't need any extra data, since
> we'll always have a location for the type specifier (float or double)
> and the type is never implied by either a sign or a width.

Actually, the use of 'mode' attributes can change the width of floating
point types, hence the need to store, even in this case, the written
type specifier, the written width specifier and a Boolean flag for the
presence of any 'mode' attribute.

As an example, 'float' is translated by 'XF' to 'long double':

$ cat a.c
float f __attribute__((mode(XF)));

$ clang -cc1 -ast-dump a.c
typedef __int128_t __int128_t;
typedef __uint128_t __uint128_t;
struct __va_list_tag {
     unsigned int gp_offset;
     unsigned int fp_offset;
     void *overflow_arg_area;
     void *reg_save_area;
};
typedef struct __va_list_tag __va_list_tag;
typedef __va_list_tag __builtin_va_list[1];
long double f;



You are right that, in this case, a sign specifier is not really needed
... but this anyway incurs no additional space overhead.

All the other changes to my patch are plainly right.

Cheers,
Enea Zaffanella.

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Adding more info to builtin source types.

Douglas Gregor

On Jan 18, 2010, at 11:39 AM, Enea Zaffanella wrote:

> Douglas Gregor wrote:
>> Hello Enea,
>
> [...]
>
>> Okay, great. I committed your patch here:
>>    http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20100118/026437.html 
>>  with a couple of tweaks:
>
> [...]
>
>> +  bool needsExtraLocalData() const {
>> +    BuiltinType::Kind bk = getTypePtr()->getKind();
>> +    return (bk >= BuiltinType::UShort && bk <= BuiltinType::UInt128)
>> +      || (bk >= BuiltinType::Short && bk <= BuiltinType::LongDouble)
>> +      || bk == BuiltinType::UChar
>> +      || bk == BuiltinType::SChar;
>> +  }
>> "float", "double", and "long double" don't need any extra data,  
>> since we'll always have a location for the type specifier (float or  
>> double) and the type is never implied by either a sign or a width.
>
> Actually, the use of 'mode' attributes can change the width of  
> floating point types, hence the need to store, even in this case,  
> the written type specifier, the written width specifier and a  
> Boolean flag for the presence of any 'mode' attribute.

Oh, I understand now, thanks. Fixed here:

        http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20100118/026449.html

        - Doug
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev