[RFC] cHash: Integration of AST hashing

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] cHash: Integration of AST hashing

Alex Denisov via cfe-dev


On Tue, Dec 12, 2017 at 12:24 AM, Christian Dietrich <[hidden email]> wrote:
Richard Trieu <[hidden email]> writes:

> On Mon, Dec 11, 2017 at 5:14 AM, Christian Dietrich <
> [hidden email]> wrote:
>
>> Richard Trieu <[hidden email]> writes:
>>
>> > I don't understand your need for pseudrandom identifiers.  If a specific
>> > type of AST node (for instance, a Decl) is expected, then adding to the
>> > stream the node kind (like from Decl::getKind()) is enough to avoid
>> > collisions.  If any node can be expected, then first pass in an enum to
>> > differentiate the different types (Decl, Stmt, Type, Attr, etc), then add
>> > the node kind, and then continue on.  Stmt::Profile and ODRHash use this
>> > method to avoid collisions.  I don't see any gain from using new
>> > identifiers here.
>>
>> Ok, so let's explain a little bit my idea here: Of course, including the
>> Decl::getKind() also distinguishes node types from each other. However,
>> these values come from enums and are, therefore, small integers. Small
>> integers (SI) occur very often in an AST. So, if a sequence of addData()
>> operations produces the same sequence of SIs because node types are not
>> distinguishable from SI that occur in the AST, we get a collision[1].
>>
> How does the result of Decl::getKind() being a small integer allows any
> chance of collision?  I'll admit that arbitrary sub-sequences of different
> hash streams can be equal, but we shouldn't be hashing sub-sequences.

Sure, we shouldn't take an arbitrary subsequence of the stream and hash
it. That would be a bad idea.

> Decl
> <Decl::getKind()>
>
> Decl with two fields
> <Decl::getKind()> <Field 1> < Field 2>
>
> Decl with one sub-Decl
> <Decl::getKind()> <Stream of sub-decl>
>
> Decl with arbitrary number of sub-Decl's
> <Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream
> of sub-decl N>
>
> The result of Decl::getKind() will basically specify the format of the hash
> stream.  I think better controlling of the input stream is a better way of
> avoiding collisions then using magic numbers as identifiers.
>
> We could see the same sequence for different ASTs, if an AST node has a

Sure, this would definitly be a better option. A quick look into the
RecursiveASTVisitor tells me that there are at least 76 for (...) loops.
So we have to add them.

Perhaps we can combine both, length-indicators and and
non-small-integers, by using llvm::hash on the <Decl::getKind()> values.
As the documentation says, we have to spent less than 20 cycles per
32-bit value on this. By this, we avoid the magic numbers. I will start
on doing the changes to incorporate the length fields.

Furthermore, I'm not sure how the clang community handles the "how to
find a reviewer for my patch". Are people adding themselves to be a
reviewer? Or do I just add people how I think could be interessted in
reviewing the change?

 People will add themselves as reviewers, so you can CC people that you think may have an interest in it.  If neither happens, adding a code owner and they will help find reviewers.

Personally, I've only looked through small portions of your patch partially because I have other projects to work on and partially because I was waiting on your answers to Richard Smith's queries.  Basically, your plan for integrating with the existing hash functions and how the other hash functions, especially ODRHash, doesn't fill your need.  From what I gather, you think the ODRHash is too strict on the AST structure.  I would like to understand what differences in the AST that you are fine ignoring.
 
chris
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    <a href="tel:%2B49%20511%20762-19737" value="+4951176219737" target="_blank">+49 511 762-19737
Fax:    <a href="tel:%2B49%20511%20762-19733" value="+4951176219733" target="_blank">+49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] cHash: Integration of AST hashing

Alex Denisov via cfe-dev
Richard Trieu <[hidden email]> writes:


>  People will add themselves as reviewers, so you can CC people that you
> think may have an interest in it.  If neither happens, adding a code owner
> and they will help find reviewers.
>
> Personally, I've only looked through small portions of your patch partially
> because I have other projects to work on and partially because I was
> waiting on your answers to Richard Smith's queries.  Basically, your plan
> for integrating with the existing hash functions and how the other hash
> functions, especially ODRHash, doesn't fill your need.  From what I gather,
> you think the ODRHash is too strict on the AST structure.  I would like to
> understand what differences in the AST that you are fine ignoring.

As, we have discussed in this thread, there are different requirements
for hashing an AST.

(Syntactic <-> Semantic) Hash: One would be to to a syntactic hash,
where only the syntactical children of a node are included into the
hash. Our cHash approach uses a semantical hash, which means that the
hash changes, when the meaning of a element (i.e. function) changes.
Such a change can, for example, happen if a type that is used in the
current element changes. So the cHash Visitor follows the cross-tree
references. As far as I understand it, the hash used by Johannes' clone
detection does not follow these links.

(Compiler Abort on cHash) When given a whole TranslationUnitDecl, the
CHashVisitor-hash-value remains the same, if the resulting binary code
will also be the same. We ignore changes to declarations and type, if
they are not used in any definition. But not all users of a AST Hashing
will want this, therefore, a AST Hashing mechanism should be adaptable.

(DataCollector) Furthermore, the AST hash implementation should be
adaptable to use different data collectors (our Clang Plugin uses a
MurMur3 Hash for performance reasons).

With ODRHash/Stmt::Profile I see the the problem that it is very
monolitic. It implements its own version of an AST traversal and I
really do not understand why. Using the RecursiveASTVisitor is so much
more convenient and maintainable. So I think that basing
ODRHash/Stmt::Profile on this TableGen-based mechanism would increase
its maintainability. As I already mentioned, we have also implemented
our own version of an AST Visitor in the past. And switching to the
RecursiveASTVisitor saved around 1200 lines of boilerplate code.

I would suggest, that the differences between the current state of the
.td files is improved by looking through ODRHash/Stmt::Profile and we
base the later one on the RecursiveASTVisitor and the
*DataCollectors.td. Furthermore, I would like to integrate large parts
of our test-suite into the AST unit tests to give it a rigerous testing.

chris
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
12