Question about hashing AST nodes

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Question about hashing AST nodes

David Blaikie via cfe-dev
Dear Clang Community:

A colleague and I need to use AST nodes as keys in a map. For example, having entered a few variable declaration nodes as keys, linking to associated meta-data, and now looking at a member function application node, with the variables as arguments, we'd like to be able to query the map to find the associated meta-data for those variable objects. 

To this end, we need to implement hash and == for AST nodes. That's what's needed, e.g., by the C++ unordered map class. 

There's little available information as far as we can tell on this topic. Would you be so kind as to let us know the best way to do this, in your view? Is ODRHash the key to unlocking a solution? Are there other approaches you'd recommend?

Kevin Sullivan
University of Virginia

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Question about hashing AST nodes

David Blaikie via cfe-dev
First, it depends on whether you want this storage to be persistent.
In case you only do these in one pass in memory, the memory address of
the AST node seems a key enough. To quote, "The abstract syntax tree
is not abstract, not syntax-only and not a tree" - for every usage of
a variable, e.g., you'll be able (assuming of course that it is found
in the same TU, etc...) to query the effective VarDecl (the definition
of the variable) the DeclRefExpr (e.g. a variable's usage) points to.

I think ODRHash was invented (apart from many other "node
semi-equivalence" checks) only to allow for diagnostics, not to act as
a true key. One thing that comes to mind is Modules, where I think
ODRHash is used to diagnose certain things, but last time I read about
them it was said that even ODRHash is lacking at parts.

The Clang Static Analyzer has a hash utility for bug reports to
distinguish between bugs that are in the same "file-line-col" triad.
But maybe that's also really tailored for this "bugpath" the SA emits.

For a persistent hash that can go into perhaps a database, in
CodeCompass we came up with our own, basically: using FNVHash on
manglednames and identifiers, locations, etc.

In case you want just for the current compilation / memory image,
you'll get no better answer than pointer identity and what Sema
creates for you in the AST.
If you want to store, anyone can invent their hash function. The big
question is, how will you make sure that the hash persists between
invocations.

For ==, there is the "structural equivalence" cheks which are used by
Sema and ASTImporter / CrossTU.

Kevin Sullivan via cfe-dev <[hidden email]> ezt írta (időpont:
2019. jan. 30., Sze, 8:27):

>
> Dear Clang Community:
>
> A colleague and I need to use AST nodes as keys in a map. For example, having entered a few variable declaration nodes as keys, linking to associated meta-data, and now looking at a member function application node, with the variables as arguments, we'd like to be able to query the map to find the associated meta-data for those variable objects.
>
> To this end, we need to implement hash and == for AST nodes. That's what's needed, e.g., by the C++ unordered map class.
>
> There's little available information as far as we can tell on this topic. Would you be so kind as to let us know the best way to do this, in your view? Is ODRHash the key to unlocking a solution? Are there other approaches you'd recommend?
>
> Kevin Sullivan
> University of Virginia
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Question about hashing AST nodes

David Blaikie via cfe-dev
In reply to this post by David Blaikie via cfe-dev
First of all, why do you want to hash them? Like, are you serializing
these hashes to disk or sending them over network? Are these hash values
allowed to change between runs as long as equal nodes still get equal
hashes within a single run?

Just in case you missed it, if the hashes you're looking for are never
going to outlive a single Clang invocation, then you can simply compare
AST nodes by pointers. For example, if a function call 'foo(x)' refers
to a variable 'x', then it's going to have a DeclRefExpr that points to
a VarDecl for 'x', and the pointer to VarDecl that you're going to
retrieve from that DeclRefExpr is going to be exactly the same pointer
that you'll retrieve by looking at the actual declaration of the
variable somewhere above the call. Every AST node is stored exactly once
in memory and is never moved to a different place.

Well, there are also forward declarations that are different from the
definition (the latter may not even exist in the current translation
unit), and sometimes some statements may refer to a forward declaration,
but it's easy to obtain a "canonical" declaration in all cases.

If you want something that's more stable across runs (pointers would
obviously change every time), you can have a look at the getID() method
in various AST nodes, which produces an integer that doesn't change
across runs but does change across host machines. These stable IDs are
mostly for debugging, but if your work is not supposed to interact with
other computers, these may do.

If you want a comparison up to a certain property (eg., you don't want
to discriminate between "x + 1" and "1 + x" but you do want to
discriminate both of them from "x + y"), then you might want to have a
look at the CloneDetection framework that conducts a series of passes to
find similar statements in different places of the program. One of the
existing passes involves some actual hashing. Also the infrastructure
for making these passes is re-usable for making different sets of passes
in order to look for different similarities.

On 1/29/19 1:49 PM, Kevin Sullivan via cfe-dev wrote:

> Dear Clang Community:
>
> A colleague and I need to use AST nodes as keys in a map. For example,
> having entered a few variable declaration nodes as keys, linking to
> associated meta-data, and now looking at a member function application
> node, with the variables as arguments, we'd like to be able to query
> the map to find the associated meta-data for those variable objects.
>
> To this end, we need to implement hash and == for AST nodes. That's
> what's needed, e.g., by the C++ unordered map class.
>
> There's little available information as far as we can tell on this
> topic. Would you be so kind as to let us know the best way to do this,
> in your view? Is ODRHash the key to unlocking a solution? Are there
> other approaches you'd recommend?
>
> Kevin Sullivan
> University of Virginia
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev