[RFC] cHash: Integration of AST hashing

classic Classic list List threaded Threaded
26 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] cHash: Integration of AST hashing

Anastasia Stulova via cfe-dev


On Tue, Dec 12, 2017 at 12:24 AM, Christian Dietrich <[hidden email]> wrote:
Richard Trieu <[hidden email]> writes:

> On Mon, Dec 11, 2017 at 5:14 AM, Christian Dietrich <
> [hidden email]> wrote:
>
>> Richard Trieu <[hidden email]> writes:
>>
>> > I don't understand your need for pseudrandom identifiers.  If a specific
>> > type of AST node (for instance, a Decl) is expected, then adding to the
>> > stream the node kind (like from Decl::getKind()) is enough to avoid
>> > collisions.  If any node can be expected, then first pass in an enum to
>> > differentiate the different types (Decl, Stmt, Type, Attr, etc), then add
>> > the node kind, and then continue on.  Stmt::Profile and ODRHash use this
>> > method to avoid collisions.  I don't see any gain from using new
>> > identifiers here.
>>
>> Ok, so let's explain a little bit my idea here: Of course, including the
>> Decl::getKind() also distinguishes node types from each other. However,
>> these values come from enums and are, therefore, small integers. Small
>> integers (SI) occur very often in an AST. So, if a sequence of addData()
>> operations produces the same sequence of SIs because node types are not
>> distinguishable from SI that occur in the AST, we get a collision[1].
>>
> How does the result of Decl::getKind() being a small integer allows any
> chance of collision?  I'll admit that arbitrary sub-sequences of different
> hash streams can be equal, but we shouldn't be hashing sub-sequences.

Sure, we shouldn't take an arbitrary subsequence of the stream and hash
it. That would be a bad idea.

> Decl
> <Decl::getKind()>
>
> Decl with two fields
> <Decl::getKind()> <Field 1> < Field 2>
>
> Decl with one sub-Decl
> <Decl::getKind()> <Stream of sub-decl>
>
> Decl with arbitrary number of sub-Decl's
> <Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream
> of sub-decl N>
>
> The result of Decl::getKind() will basically specify the format of the hash
> stream.  I think better controlling of the input stream is a better way of
> avoiding collisions then using magic numbers as identifiers.
>
> We could see the same sequence for different ASTs, if an AST node has a

Sure, this would definitly be a better option. A quick look into the
RecursiveASTVisitor tells me that there are at least 76 for (...) loops.
So we have to add them.

Perhaps we can combine both, length-indicators and and
non-small-integers, by using llvm::hash on the <Decl::getKind()> values.
As the documentation says, we have to spent less than 20 cycles per
32-bit value on this. By this, we avoid the magic numbers. I will start
on doing the changes to incorporate the length fields.

Furthermore, I'm not sure how the clang community handles the "how to
find a reviewer for my patch". Are people adding themselves to be a
reviewer? Or do I just add people how I think could be interessted in
reviewing the change?

 People will add themselves as reviewers, so you can CC people that you think may have an interest in it.  If neither happens, adding a code owner and they will help find reviewers.

Personally, I've only looked through small portions of your patch partially because I have other projects to work on and partially because I was waiting on your answers to Richard Smith's queries.  Basically, your plan for integrating with the existing hash functions and how the other hash functions, especially ODRHash, doesn't fill your need.  From what I gather, you think the ODRHash is too strict on the AST structure.  I would like to understand what differences in the AST that you are fine ignoring.
 
chris
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    <a href="tel:%2B49%20511%20762-19737" value="+4951176219737" target="_blank">+49 511 762-19737
Fax:    <a href="tel:%2B49%20511%20762-19733" value="+4951176219733" target="_blank">+49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] cHash: Integration of AST hashing

Anastasia Stulova via cfe-dev
Richard Trieu <[hidden email]> writes:


>  People will add themselves as reviewers, so you can CC people that you
> think may have an interest in it.  If neither happens, adding a code owner
> and they will help find reviewers.
>
> Personally, I've only looked through small portions of your patch partially
> because I have other projects to work on and partially because I was
> waiting on your answers to Richard Smith's queries.  Basically, your plan
> for integrating with the existing hash functions and how the other hash
> functions, especially ODRHash, doesn't fill your need.  From what I gather,
> you think the ODRHash is too strict on the AST structure.  I would like to
> understand what differences in the AST that you are fine ignoring.

As, we have discussed in this thread, there are different requirements
for hashing an AST.

(Syntactic <-> Semantic) Hash: One would be to to a syntactic hash,
where only the syntactical children of a node are included into the
hash. Our cHash approach uses a semantical hash, which means that the
hash changes, when the meaning of a element (i.e. function) changes.
Such a change can, for example, happen if a type that is used in the
current element changes. So the cHash Visitor follows the cross-tree
references. As far as I understand it, the hash used by Johannes' clone
detection does not follow these links.

(Compiler Abort on cHash) When given a whole TranslationUnitDecl, the
CHashVisitor-hash-value remains the same, if the resulting binary code
will also be the same. We ignore changes to declarations and type, if
they are not used in any definition. But not all users of a AST Hashing
will want this, therefore, a AST Hashing mechanism should be adaptable.

(DataCollector) Furthermore, the AST hash implementation should be
adaptable to use different data collectors (our Clang Plugin uses a
MurMur3 Hash for performance reasons).

With ODRHash/Stmt::Profile I see the the problem that it is very
monolitic. It implements its own version of an AST traversal and I
really do not understand why. Using the RecursiveASTVisitor is so much
more convenient and maintainable. So I think that basing
ODRHash/Stmt::Profile on this TableGen-based mechanism would increase
its maintainability. As I already mentioned, we have also implemented
our own version of an AST Visitor in the past. And switching to the
RecursiveASTVisitor saved around 1200 lines of boilerplate code.

I would suggest, that the differences between the current state of the
.td files is improved by looking through ODRHash/Stmt::Profile and we
base the later one on the RecursiveASTVisitor and the
*DataCollectors.td. Furthermore, I would like to integrate large parts
of our test-suite into the AST unit tests to give it a rigerous testing.

chris
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] cHash: Integration of AST hashing

Anastasia Stulova via cfe-dev
Hey!

I wanted to ping on this patch[1] as it still lacks some reviewers. I
rebased the patch against the current HEAD, checked compilation, run our
testsuite and the ASTTest unit test.

If there are further questions or open points to discuss, I would love
to participate in that discussion. Thank you all for your efforts on
this patch.

chris


[1] https://reviews.llvm.org/D40731
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] cHash: Integration of AST hashing

Anastasia Stulova via cfe-dev
I posted some questions to the phab site about the design of the hasher.  I suggest you try your hasher on clang/test/Modules/odr_hash.h with the flag -DFIRST.  I put many of the issues I encountered into there.

I'm still iffy on how basing all the hashers on *Collectors.td would work.  Is there enough flexibility for changes each hasher needs that it wouldn't result in a lot of custom code anyways?

On Wed, Feb 21, 2018 at 4:04 AM, Christian Dietrich <[hidden email]> wrote:
Hey!

I wanted to ping on this patch[1] as it still lacks some reviewers. I
rebased the patch against the current HEAD, checked compilation, run our
testsuite and the ASTTest unit test.

If there are further questions or open points to discuss, I would love
to participate in that discussion. Thank you all for your efforts on
this patch.

chris


[1] https://reviews.llvm.org/D40731
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    <a href="tel:%2B49%20511%20762-19737" value="+4951176219737">+49 511 762-19737
Fax:    <a href="tel:%2B49%20511%20762-19733" value="+4951176219733">+49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] cHash: Integration of AST hashing

Anastasia Stulova via cfe-dev
So first, I want to make some points clear:

- The patch extends the already existing DataCollector interface to
  support more more AST node types. This extension tries to accomplish
  the principle to include only data that is local to that node[1]. So
  here, I do not introduce a new hashing infrastructure, but merely
  extend an exisiting infrastructure that I found in LLVM.

- The CHashVisitor only is a user of the provided infrastructure. It
  provides a semantic[2] hash for C.  At the moment the
  CHashVisitor does not cover all C++ AST nodes. C++ is the next step,
  after merging this change.

@Regarding Tests: There is an extensive test suite for CHash[3] in our external
 repository. I only included a very rudimentary unittests into this
 patch. After we resolved the discussion about the best principle how to
 handle AST Hashing in clang, I will integrate and consolidate our tests
 into CHashTest.cpp.

@New Visit Functions in CHashVisitor: The CHashVisitor overrides some
  Visit* functions from the DataCollector infrastructure to include
  information form cross-tree references to AST nodes (like types) to
  accomplish its goal of an semantic hash. By overriding these in the
  user of the DataCollector, the central DataCollector infrastructure
  can remain "node local" while CHashVisitor includes links to other
  nodes that can influence the semantic of the AST node.

  By this separation of concers, a family of AST hashers should be
  impementable without reinventing the wheel over and over again.

Richard Trieu via cfe-dev <[hidden email]> writes:

> I posted some questions to the phab site about the design of the hasher.  I
> suggest you try your hasher on clang/test/Modules/odr_hash.h with the flag
> -DFIRST.  I put many of the issues I encountered into there.

Thank you for that pointer. The file actually revealed some problem
(read stackoverflows) that the current CHashVisitor included for CXX
code. This included that CXXThisExpr references to the enclosing
CXXRecord and that NamedDecl::getName() cannot be called on all CXX
nodes. However, now CHashVisitor can process the odr_hash.h.


> I'm still iffy on how basing all the hashers on *Collectors.td would work.
> Is there enough flexibility for changes each hasher needs that it wouldn't
> result in a lot of custom code anyways?

But wouldn't that be an argument to include more hashers with totally
different implementations? Howver, even if there are custom changes for
different hashers, basing them on TableGen and RecursiveASTVisitor is
probably the best route to handle the issue. Especially, if we consider
including something like D40781.

Regarding ODR Hash: Can you describe the semantic of the ODR hashing in
a few sentences? Under what circumstance does the hash change? What is
the high-level principle? Under what circumstances is a hash collision
possible.

Vassil Vassilev on Phabricator wrote:

> I am not sure if this is the right time but maybe testing and comparing
> the behavior of this patch against something such as the ODRHash will
> help you, me (and probably others) to understand the current patch
> better.

I think that ODRHash.cpp and StmtProfile.cpp include very valuable
knowledge what should should be included into an AST hash and how to
handle edge cases. Especially when it comes to fully handle C++ nodes.

However, CHashVisitor does not try to replace ODRHash/StmtProfiler, it
has its own purpose. The goal of this patch is to start consolidating
these mechanisms. For this consolidation, I took the time to rewrite
CHash to be based on RecursiveASTVisitor and DataCollector. As a result,
the CHashVisitor dropped down to 320 lines of custom code to establish
cross-tree links.

> In https://reviews.llvm.org/D41416, [...], essentially I am comparing
> the behavior of the ODRHash and the Stmt::Profile logic for some
> cases.
>
> Perhaps you can plugin your algorithm to it, create a pch and start
> comparing your logic with the when it comes to template arguments
> hashing and alike.
>
> Of course this is not something that will give you perfect coverage
> and so on but it can stress test the implementation and compare it to
> the 2 major existing ones in clang.

For C code, we already stress tested CHash with 2300 commits from musl
and the CHash AST hash always changed when the object file changed (no
false negative). Furthermore, we compiled 500 commits from lua, mbedTLS,
musl, bash, CPython and PostgreSQL. So at least for C code I'm very
confident that CHashVisitor handles a lot of code in the wild.

chris

[1] This principle breaks at the "// CrossRef" Sections. Since there I
   call addData() on range<> and ArrayRef<> types to give the user of
   the DataCollector the chance to include the number of children into
   its hash.

[2] I refer you to the previous messages from this discussion and to the
    Usenix paper for the term "semantic hash".

[3] https://github.com/luhsra/chash/tree/master/test
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] cHash: Integration of AST hashing

Anastasia Stulova via cfe-dev
The original StmtDataCollector was made for StmtVisitor, not RecursiveASTVisitor.  There's a difference in how the function work that needs to be noted.

I don't like using the RecursiveASTVisitor in this form.  It was designed so that users can define custom Visit* functions to occur when an AST node is traversed.  The most evident part of this is that cHash needed to include sizes of various containers far from the point where they are iterated over.  The size being added is decouped from where it's used, and can be inaccurate if nodes are skipped.  I believe you picked this way because of how much code you saved, but it also hides away how each part of the AST is treated.

You are also using printing on some nodes.  Although it works, printing should be avoided.  

Another point I'd like to bring up is about the hashing itself.  The method you are using is to push individual values to the hash.  Other places use a vector to store values and only pushing it to hash as a large chunk when finished.  I'm not an expert in hashing, but I've thought that hashing would work faster on a large chunk than spread out over smaller values.

On Mon, Mar 12, 2018 at 8:39 AM, Christian Dietrich <[hidden email]> wrote:
So first, I want to make some points clear:

- The patch extends the already existing DataCollector interface to
  support more more AST node types. This extension tries to accomplish
  the principle to include only data that is local to that node[1]. So
  here, I do not introduce a new hashing infrastructure, but merely
  extend an exisiting infrastructure that I found in LLVM.

The DataCollector implementation raises a lot of concerns.  I know you are just working off an existing interface, but if it gains more users, fixing it now is better than fixing it later.

addData functions
I've used FoldingNodeSetID as the collection point for data.  Doc: http://llvm.org/doxygen/classllvm_1_1FoldingSetNodeID.html
It has an overloaded AddInteger function, plus AddBoolean and AddString function.  Having one addData for integers, strings, and QualTypes is confusing, plus the proposed addData for range sizes.

llvm::hash_code/llvm::hash_value
llvm::hash_value functions return a llvm::hash_code.  The comment in the definition of hash_code say "this value should not be trusted to be stable or predictable across processes or executions."  That means it is not suitable anywhere a stable hash is required.

Printing to strings
Don't use any method for printing to strings then pushing the strings to addData.  Printing to string takes extra work and is based on information can be gathered from the node anyway.  Just add that information and skip the printing.


- The CHashVisitor only is a user of the provided infrastructure. It
  provides a semantic[2] hash for C.  At the moment the
  CHashVisitor does not cover all C++ AST nodes. C++ is the next step,
  after merging this change.

@Regarding Tests: There is an extensive test suite for CHash[3] in our external
 repository. I only included a very rudimentary unittests into this
 patch. After we resolved the discussion about the best principle how to
 handle AST Hashing in clang, I will integrate and consolidate our tests
 into CHashTest.cpp.

@New Visit Functions in CHashVisitor: The CHashVisitor overrides some
  Visit* functions from the DataCollector infrastructure to include
  information form cross-tree references to AST nodes (like types) to
  accomplish its goal of an semantic hash. By overriding these in the
  user of the DataCollector, the central DataCollector infrastructure
  can remain "node local" while CHashVisitor includes links to other
  nodes that can influence the semantic of the AST node.
  By this separation of concers, a family of AST hashers should be
  impementable without reinventing the wheel over and over again.

I still have questions about local versus cross-ref.  It looks like you allow all the functions to call Type's, but don't allow them to call Decl's and Expr's.  Is that the distinction you are going for?  Also, a bunch of stuff label cross-ref is just sizes of a collection without processing the collection itself.

Coming from the ODRHash, Type's, Decl's, and Expr's were free to visit each other.  I did create two levels of Decl visiting to help avoid the recursion this caused.  And the only time a collection sized was used was when the elements of the collection were processed next.
 
Richard Trieu via cfe-dev <[hidden email]> writes:

> I posted some questions to the phab site about the design of the hasher.  I
> suggest you try your hasher on clang/test/Modules/odr_hash.h with the flag
> -DFIRST.  I put many of the issues I encountered into there.

Thank you for that pointer. The file actually revealed some problem
(read stackoverflows) that the current CHashVisitor included for CXX
code. This included that CXXThisExpr references to the enclosing
CXXRecord and that NamedDecl::getName() cannot be called on all CXX
nodes. However, now CHashVisitor can process the odr_hash.h.


> I'm still iffy on how basing all the hashers on *Collectors.td would work.
> Is there enough flexibility for changes each hasher needs that it wouldn't
> result in a lot of custom code anyways?

But wouldn't that be an argument to include more hashers with totally
different implementations? Howver, even if there are custom changes for
different hashers, basing them on TableGen and RecursiveASTVisitor is
probably the best route to handle the issue. Especially, if we consider
including something like D40781.

Regarding ODR Hash: Can you describe the semantic of the ODR hashing in
a few sentences? Under what circumstance does the hash change? What is
the high-level principle? Under what circumstances is a hash collision
possible.

Ideally, hash collisions will only occur on token-wise identical code.  If two pieces of code have identical behavior, but was created with different tokens, their hashes would differ.  This is so that every hash mismatch would be the result of an ODR violation.  The ODRHash is a work in progress, so any unsupported AST nodes show up the same as an empty nodes, since having false hash collision is better than having a false hash match.

Vassil Vassilev on Phabricator wrote:

> I am not sure if this is the right time but maybe testing and comparing
> the behavior of this patch against something such as the ODRHash will
> help you, me (and probably others) to understand the current patch
> better.

I think that ODRHash.cpp and StmtProfile.cpp include very valuable
knowledge what should should be included into an AST hash and how to
handle edge cases. Especially when it comes to fully handle C++ nodes.

However, CHashVisitor does not try to replace ODRHash/StmtProfiler, it
has its own purpose. The goal of this patch is to start consolidating
these mechanisms. For this consolidation, I took the time to rewrite
CHash to be based on RecursiveASTVisitor and DataCollector. As a result,
the CHashVisitor dropped down to 320 lines of custom code to establish
cross-tree links.

> In https://reviews.llvm.org/D41416, [...], essentially I am comparing
> the behavior of the ODRHash and the Stmt::Profile logic for some
> cases.
>
> Perhaps you can plugin your algorithm to it, create a pch and start
> comparing your logic with the when it comes to template arguments
> hashing and alike.
>
> Of course this is not something that will give you perfect coverage
> and so on but it can stress test the implementation and compare it to
> the 2 major existing ones in clang.

For C code, we already stress tested CHash with 2300 commits from musl
and the CHash AST hash always changed when the object file changed (no
false negative). Furthermore, we compiled 500 commits from lua, mbedTLS,
musl, bash, CPython and PostgreSQL. So at least for C code I'm very
confident that CHashVisitor handles a lot of code in the wild.

chris

[1] This principle breaks at the "// CrossRef" Sections. Since there I
   call addData() on range<> and ArrayRef<> types to give the user of
   the DataCollector the chance to include the number of children into
   its hash.

[2] I refer you to the previous messages from this discussion and to the
    Usenix paper for the term "semantic hash".

[3] https://github.com/luhsra/chash/tree/master/test
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    <a href="tel:%2B49%20511%20762-19737" value="+4951176219737" target="_blank">+49 511 762-19737
Fax:    <a href="tel:%2B49%20511%20762-19733" value="+4951176219733" target="_blank">+49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
12