[RFC] cHash: Integration of AST hashing

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
Hello!

First, I'd like to give you a little bit of context, before I get into
the details what all of this has to do with the cfe-dev list.

TL;DR: cHash calculates an hash over the AST and, therefore, allows
       the detection of redundant builds.

In the last months, we have been working on a mechanism to avoid
redundant recompilations. In C/C++ projects, we observe that a change
to a central header file can cause a lot of compiler invocations.
However, not all compiler runs will lead to an object file that
differs from the last invocation. For example, if I add a typedef
that is never used in any .c file, none of the object files will
change.

Our approach to this problem is that we hook into the compiler front
end, after the parsing and the semantic analysis and recursively
calculate an AST hash. We start with all definitions (variables and
functions) and include all relevant AST-node properties and the hashes
of all children and referenced types. Thereby, we get a hash value
that changes if a changed translation unit would result in a different
object file. As a byproduct, we get a hash for every function, every
referenced type, and all global variables.

So far, we have only implemented cHash for C, not yet for C++. We have
already published our current version of the clang plugin[1] and are
also working on a GCC version.

Furthermore, we presented our approach at USENIX ATC (and got a best
paper award, yay!)[2]. Now, we consider proposing some parts of cHash
for mainline.

Our impression is that the AST hashing mechanism[3], which calculates
a semantic fingerprint for a given AST node is not only useful to
detect redundant builds. AST hashes are (computationally) fast to
calculate, easy to store, easy to calculate, and they capture all of
the semantic of an AST node. These properties should ease other techniques:

- partial recompilation: we now know at an early state that a function
  body has changed directly or indirectly.
- change impact analysis: we can now quantify which parts of the
  program is actually touched, even if we only modfiy a type.
- project-wide analysis: have all types with the same name also
  the name hash, or is there an inconsistency?

Besides the AST hashing mechanism, cHash also includes the caching and
compiler abortion logic. However, in my opinion this is better located
in an external tool, or a plugin. For example, a more close
integration of CLang with CCache, could give the compilation speedups
that we have observed in our prototype.

Before I start on preparing parts of the chash plugin to go upstream
(there is definitely some cleanup required), I would love to hear your
opinions on the following topics:

- Would you like to see the AST hashing in clang mainline?
- Would you like to have a standalone tool that calculates AST hashes?
- Should the detection of redundant compilations be part of CLang mainline?
- Should we integrate the test-suite[4] into the repo? If yes, how?

chris

[1] https://github.com/luhsra/chash
[2] https://www.usenix.org/conference/atc17/technical-sessions/presentation/dietrich
[3] https://github.com/luhsra/chash/blob/master/clang-plugin/hash-visitor.cc
[4] https://github.com/luhsra/chash/tree/master/test
    https://github.com/luhsra/chash/blob/master/test/control_flow/if/IfStmt_10.c
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
Hey,

This looks like an awesome project! It would be great to see something like this in Clang itself.
What impact does the hashing itself have on the build time? I.e. If I were to fully rebuild a project from scratch, how much longer would it take?

Alex

> On 3 Aug 2017, at 11:41, Christian Dietrich via cfe-dev <[hidden email]> wrote:
>
> Hello!
>
> First, I'd like to give you a little bit of context, before I get into
> the details what all of this has to do with the cfe-dev list.
>
> TL;DR: cHash calculates an hash over the AST and, therefore, allows
>       the detection of redundant builds.
>
> In the last months, we have been working on a mechanism to avoid
> redundant recompilations. In C/C++ projects, we observe that a change
> to a central header file can cause a lot of compiler invocations.
> However, not all compiler runs will lead to an object file that
> differs from the last invocation. For example, if I add a typedef
> that is never used in any .c file, none of the object files will
> change.
>
> Our approach to this problem is that we hook into the compiler front
> end, after the parsing and the semantic analysis and recursively
> calculate an AST hash. We start with all definitions (variables and
> functions) and include all relevant AST-node properties and the hashes
> of all children and referenced types. Thereby, we get a hash value
> that changes if a changed translation unit would result in a different
> object file. As a byproduct, we get a hash for every function, every
> referenced type, and all global variables.
>
> So far, we have only implemented cHash for C, not yet for C++. We have
> already published our current version of the clang plugin[1] and are
> also working on a GCC version.
>
> Furthermore, we presented our approach at USENIX ATC (and got a best
> paper award, yay!)[2]. Now, we consider proposing some parts of cHash
> for mainline.
>
> Our impression is that the AST hashing mechanism[3], which calculates
> a semantic fingerprint for a given AST node is not only useful to
> detect redundant builds. AST hashes are (computationally) fast to
> calculate, easy to store, easy to calculate, and they capture all of
> the semantic of an AST node. These properties should ease other techniques:
>
> - partial recompilation: we now know at an early state that a function
>  body has changed directly or indirectly.

Do you currently hash the full AST at once, or do you also compute/store hashes for individual functions?

> - change impact analysis: we can now quantify which parts of the
>  program is actually touched, even if we only modfiy a type.
> - project-wide analysis: have all types with the same name also
>  the name hash, or is there an inconsistency?
>
> Besides the AST hashing mechanism, cHash also includes the caching and
> compiler abortion logic. However, in my opinion this is better located
> in an external tool, or a plugin. For example, a more close
> integration of CLang with CCache, could give the compilation speedups
> that we have observed in our prototype.
>
> Before I start on preparing parts of the chash plugin to go upstream
> (there is definitely some cleanup required), I would love to hear your
> opinions on the following topics:
>
> - Would you like to see the AST hashing in clang mainline?
> - Would you like to have a standalone tool that calculates AST hashes?
> - Should the detection of redundant compilations be part of CLang mainline?
> - Should we integrate the test-suite[4] into the repo? If yes, how?
>
> chris
>
> [1] https://github.com/luhsra/chash
> [2] https://www.usenix.org/conference/atc17/technical-sessions/presentation/dietrich
> [3] https://github.com/luhsra/chash/blob/master/clang-plugin/hash-visitor.cc
> [4] https://github.com/luhsra/chash/tree/master/test
>    https://github.com/luhsra/chash/blob/master/test/control_flow/if/IfStmt_10.c
> --
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel:    +49 511 762-19737
> Fax:    +49 511 762-19733
> eMail:  [hidden email]
> WWW:    https://www.sra.uni-hannover.de
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
Alex Lorenz via cfe-dev <[hidden email]> writes:

> Hey,
>
> What impact does the hashing itself have on the build time? I.e. If I
> were to fully rebuild a project from scratch, how much longer would it
> take?

We have build over 2300 commits from the musl libc library. In total
that are 5.6 million compiler invocations (all commits are built from
scratch). On average, the AST hashing took 9ms. In comparision, the
parsing is 187ms and the -O2 optimization and assembling 1082ms. So the
hashing time alone is almost negliable. Furthermore, the hashing time
does not depend on the syntactic elements in the AST, but only on the
used parts. So even if you have a thousands of type declarations, a file
with a single 'void foo() {}' takes no time.

We use a non-cryptographical hash (MurMur2) to speed things up a little
bit. And since we do not have to defend against evil attackers, we
believe a non-cryptographic hash is enough. However, for our current
implementation, we still have a few optimizations that could cut down a
little bit on the hashing[1].

chris

[1] We hash currently every type annotation for every AST note. However,
    only the type fields for variable declarations and function
    signatures are relevant, since all other type fields are derived
    from them in a determinsitic fashion.
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
I see, thanks. That sounds promising.

Is the hash stored in a central database or do you use a separate file for each translation unit (or maybe something else)?

> On 3 Aug 2017, at 12:39, Christian Dietrich <[hidden email]> wrote:
>
> Alex Lorenz via cfe-dev <[hidden email]> writes:
>
>> Hey,
>>
>> What impact does the hashing itself have on the build time? I.e. If I
>> were to fully rebuild a project from scratch, how much longer would it
>> take?
>
> We have build over 2300 commits from the musl libc library. In total
> that are 5.6 million compiler invocations (all commits are built from
> scratch). On average, the AST hashing took 9ms. In comparision, the
> parsing is 187ms and the -O2 optimization and assembling 1082ms. So the
> hashing time alone is almost negliable. Furthermore, the hashing time
> does not depend on the syntactic elements in the AST, but only on the
> used parts. So even if you have a thousands of type declarations, a file
> with a single 'void foo() {}' takes no time.
>
> We use a non-cryptographical hash (MurMur2) to speed things up a little
> bit. And since we do not have to defend against evil attackers, we
> believe a non-cryptographic hash is enough. However, for our current
> implementation, we still have a few optimizations that could cut down a
> little bit on the hashing[1].
>
> chris
>
> [1] We hash currently every type annotation for every AST note. However,
>    only the type fields for variable declarations and function
>    signatures are relevant, since all other type fields are derived
>    from them in a determinsitic fashion.
> --
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel:    +49 511 762-19737
> Fax:    +49 511 762-19733
> eMail:  [hidden email]
> WWW:    https://www.sra.uni-hannover.de

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
Alex Lorenz <[hidden email]> writes:

> I see, thanks. That sounds promising.
>
> Is the hash stored in a central database or do you use a separate file
> for each translation unit (or maybe something else)?

In the current implementation, we store the hash in a separate file
besides the object file. And since CMake removes the .o file, before the
compiler is invoked, we hardlink also the object file to another name.
So, for a single object file we then have:

foo.o
foo.o.hash.copy <- Hardlink to foo.o
foo.o.hash      <- Hexdigest of the hash

For the redundant-compilation-avoidance, we could also store the
hexdigest in a seperate ELF (or mach-o) section and parse the header
partially in the cHash pass.

chris
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
In reply to this post by Sumner, Brian via cfe-dev
Christian Dietrich via cfe-dev <[hidden email]> writes:

> In C/C++ projects, we observe that a change to a central header file can
> cause a lot of compiler invocations.  However, not all compiler runs will
> lead to an object file that differs from the last invocation. For example,
> if I add a typedef that is never used in any .c file, none of the object
> files will change.

We are doing something similar in build2[1] though at the token stream
level (after preprocessing). One thing that we found to limit the
applicability of this approach is debug info: if you add a typedef
(presumably as a new line), then function definitions (if any) after
this line and in the same translation unit will have different debug
information (line position in the source code).

In build2, when calculating the hash, we include the position information
for each token. Do you do something similar at the AST level? If so, did
you find anything to mitigate such "line shift" changes?

[1] https://build2.org

Boris
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
In reply to this post by Sumner, Brian via cfe-dev
That sounds like a really nice tool!

One problem I see is that we get yet another AST hashing
implementation in clang. I'm aware of at least four of them that I
personally work with:

* The normal Stmt::Profile for Stmts
* The ODR hashing for AST nodes in modules.
* The ASTDiff also has one
* The CloneDetector with its StmtDataCollector which can be
specialized for hashing

They all contain similar code and have similar goals. And they all
contain different kind of bugs. And that's just the implementations
that I'm aware of.

I would very much prefer if we don't add more redundant code here but
instead try to figure out if we can get all this code in a more
central place.

Most of the use cases are just about

> I have this node, please forward all interesting data into this stream so I can [store|hash|...] it and I [do|don't] care about having cross-TU stable data and maybe I want to treat special nodes in a different way

so it's not necessarily a difficult problem to solve. If we find a
common baseline for this, we could make one visitor that people can
specialize their specific hashing needs on top.

I already started doing this for the Stmt hashing when I refactored
the StmtDataCollector. I set a common baseline that I could turn my
two hashing implementations back then into a single one and I'm aiming
at also replacing the Stmt::Profile implementations with it.

Not sure if anyone started working on something similar for generic
AST nodes (which is probably a bit more complicated than the Stmts
which are relatively simple), but maybe we should see if we can get
some people together to work on that. Otherwise we will have a dozen
hashing implementations in a few years that no one can/will maintain.

- Raphael






2017-08-03 12:41 GMT+02:00 Christian Dietrich via cfe-dev
<[hidden email]>:

> Hello!
>
> First, I'd like to give you a little bit of context, before I get into
> the details what all of this has to do with the cfe-dev list.
>
> TL;DR: cHash calculates an hash over the AST and, therefore, allows
>        the detection of redundant builds.
>
> In the last months, we have been working on a mechanism to avoid
> redundant recompilations. In C/C++ projects, we observe that a change
> to a central header file can cause a lot of compiler invocations.
> However, not all compiler runs will lead to an object file that
> differs from the last invocation. For example, if I add a typedef
> that is never used in any .c file, none of the object files will
> change.
>
> Our approach to this problem is that we hook into the compiler front
> end, after the parsing and the semantic analysis and recursively
> calculate an AST hash. We start with all definitions (variables and
> functions) and include all relevant AST-node properties and the hashes
> of all children and referenced types. Thereby, we get a hash value
> that changes if a changed translation unit would result in a different
> object file. As a byproduct, we get a hash for every function, every
> referenced type, and all global variables.
>
> So far, we have only implemented cHash for C, not yet for C++. We have
> already published our current version of the clang plugin[1] and are
> also working on a GCC version.
>
> Furthermore, we presented our approach at USENIX ATC (and got a best
> paper award, yay!)[2]. Now, we consider proposing some parts of cHash
> for mainline.
>
> Our impression is that the AST hashing mechanism[3], which calculates
> a semantic fingerprint for a given AST node is not only useful to
> detect redundant builds. AST hashes are (computationally) fast to
> calculate, easy to store, easy to calculate, and they capture all of
> the semantic of an AST node. These properties should ease other techniques:
>
> - partial recompilation: we now know at an early state that a function
>   body has changed directly or indirectly.
> - change impact analysis: we can now quantify which parts of the
>   program is actually touched, even if we only modfiy a type.
> - project-wide analysis: have all types with the same name also
>   the name hash, or is there an inconsistency?
>
> Besides the AST hashing mechanism, cHash also includes the caching and
> compiler abortion logic. However, in my opinion this is better located
> in an external tool, or a plugin. For example, a more close
> integration of CLang with CCache, could give the compilation speedups
> that we have observed in our prototype.
>
> Before I start on preparing parts of the chash plugin to go upstream
> (there is definitely some cleanup required), I would love to hear your
> opinions on the following topics:
>
> - Would you like to see the AST hashing in clang mainline?
> - Would you like to have a standalone tool that calculates AST hashes?
> - Should the detection of redundant compilations be part of CLang mainline?
> - Should we integrate the test-suite[4] into the repo? If yes, how?
>
> chris
>
> [1] https://github.com/luhsra/chash
> [2] https://www.usenix.org/conference/atc17/technical-sessions/presentation/dietrich
> [3] https://github.com/luhsra/chash/blob/master/clang-plugin/hash-visitor.cc
> [4] https://github.com/luhsra/chash/tree/master/test
>     https://github.com/luhsra/chash/blob/master/test/control_flow/if/IfStmt_10.c
> --
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel:    +49 511 762-19737
> Fax:    +49 511 762-19733
> eMail:  [hidden email]
> WWW:    https://www.sra.uni-hannover.de
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
In reply to this post by Sumner, Brian via cfe-dev
Boris Kolpackov <[hidden email]> writes:

> We are doing something similar in build2[1] though at the token stream
> level (after preprocessing). One thing that we found to limit the
> applicability of this approach is debug info: if you add a typedef
> (presumably as a new line), then function definitions (if any) after
> this line and in the same translation unit will have different debug
> information (line position in the source code).

The benefit of doing it on the token stream is that you can avoid the
expensive parsing. But I wonder how hashing the token stream can be any
better than doing a textual hash on the preprocessed code before lexing
(as ccache is doing it)?

In the current prototype, we do not include debugging information.
Therefore, the hash does not change if you introduce a newline into the
source. In the future, line numbers/filnames should be included, if
debugging information is requested.

> In build2, when calculating the hash, we include the position information
> for each token. Do you do something similar at the AST level? If so, did
> you find anything to mitigate such "line shift" changes?

However, the AST hash should expose less recompilations compared to
hashing the token stream. On the token stream you have to include all
the tokens and, therefore, all the line information/line shifts. The AST
hash includes only the referenced types, enums, and other declarations.
So, if there is a line shift in a header, but none of the declarations that
comes after that shift is used in this compilation unit, the hash will
not change. However, if any of the declarations after the shift is
referenced, the debug information will change and we cannot call it a
redundant recompilation anymore.

chris
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
In reply to this post by Sumner, Brian via cfe-dev
Raphael Isemann <[hidden email]> writes:

> That sounds like a really nice tool!

I also really like the general principle of avoiding work that is
redundant in the end. Let's save some polar bears! And some developers
from serious injuries stemming from sword fights.

> One problem I see is that we get yet another AST hashing
> implementation in clang. I'm aware of at least four of them that I
> personally work with:
>
> * The normal Stmt::Profile for Stmts

This does indeed look structurally similar to our implementation for
statements. However, types are only included as means of their pointers
and, therefore, the hash is not stable over compilations. I wonder for
which applications this is actually sufficient. And, does it really
speed up things to badly?

> * The ODR hashing for AST nodes in modules.

This is also based on StmtProfilerWithoutPointers as far as I can
tell[1]. And it is stable over invocations as it actually handles types.
However, changing the order of declarations[1] will alter the hash. Also
one thing we did in cHash is to use pseudrandom identifiers for
different ast node types[3] to avoid some of the possible collisions. If
you only add small integer values (getKind(), etc.) It is much easier to
have a collision. Therefore, we add something like

<UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>

to the hashing stream most of the time. Ideally, the text that goes into
the hash should be usable to reconstruct an AST tree with the same
structure; not neccessarily with the same information.

> * The ASTDiff also has one

Ok, then I was unable to find it. Can you give me a pointer

> * The CloneDetector with its StmtDataCollector which can be
> specialized for hashing
>
> They all contain similar code and have similar goals. And they all
> contain different kind of bugs. And that's just the implementations
> that I'm aware of.

So, when they all contain some kinds of quirks, we really should build
a good test harness around a unified implementation. Since nobody wants
to break the "give me a unique identifier" silently.

> I would very much prefer if we don't add more redundant code here but
> instead try to figure out if we can get all this code in a more
> central place.

I like that idea and I do not necessarily stick to our implementation.

> Most of the use cases are just about
>
>> I have this node, please forward all interesting data into this
>> stream so I can [store|hash|...] it and I [do|don't] care about
>> having cross-TU stable data and maybe I want to treat special nodes
>> in a different way
>
> so it's not necessarily a difficult problem to solve. If we find a
> common baseline for this, we could make one visitor that people can
> specialize their specific hashing needs on top.

Your are right, it is not a difficult problem in the end. But when doing
It for the whole AST there are some weird corner cases that should be
handled correctly (recursive type definitions come to my mind)

> I already started doing this for the Stmt hashing when I refactored
> the StmtDataCollector. I set a common baseline that I could turn my
> two hashing implementations back then into a single one and I'm aiming
> at also replacing the Stmt::Profile implementations with it.
>
> Not sure if anyone started working on something similar for generic
> AST nodes (which is probably a bit more complicated than the Stmts
> which are relatively simple), but maybe we should see if we can get
> some people together to work on that. Otherwise we will have a dozen
> hashing implementations in a few years that no one can/will maintain.

Bringing cHash to mainline is based on the wish that I don't want to
maintain an AST hashing mechanism in parallel to the clang development.
Therefore, I would love to help with that. We should collect some
requirements that the different users of unique identifiers have. And
perhaps we can make them all happy by introducing a few switches to a
unified visitor.

Some switches and requirements come directly to my mind:

- Syntactical vs. Semantic: There should be a switch to include only
  nodes that are actually syntacical children of the given node. Not
  everybody wants to include the transitive hull over all referenced AST
  nodes.

- Multiple Hashes in one pass: With one pass over the AST I want to have
  not only one top-level hash, but also hashes of some interesting nodes
  (top-level definitions, type declarations, global variables). cHash
  already caches hashes for some nodes in a map to speed things up.

- Debug information: I want to turn on/off the inclusion of debugging
  information (line numbers, file names, local identifier names).

- Stable?: Should the hash be stable over compiler invocations? Or can
  we take some shortcuts for types/declarations/things that are not
  syntactially enclosed into the node.

chris
--
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  [hidden email]
WWW:    https://www.sra.uni-hannover.de
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
In reply to this post by Sumner, Brian via cfe-dev
Christian Dietrich <[hidden email]> writes:
 
> The benefit of doing it on the token stream is that you can avoid the
> expensive parsing.

And it works with any compiler that supports the -E option.


> But I wonder how hashing the token stream can be any better than doing
> a textual hash on the preprocessed code before lexing (as ccache is
> doing it)?

Probably not by much (can ignore some redundant #line directives, etc).
The reason we do it this way in build2 is to be ready for the day when
we no longer need preprocessing, at least for some translation units
(with C++ modules being the first step in that direction). Such
translation units will still contain comments, line continuations,
etc., which we would want to ignore.


> In the current prototype, we do not include debugging information.
> Therefore, the hash does not change if you introduce a newline into the
> source. In the future, line numbers/filnames should be included, if
> debugging information is requested.

In our experience, this feature is most useful during development
when debug info is almost always required.

Boris
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
In reply to this post by Sumner, Brian via cfe-dev
Raphael Isemann via cfe-dev <[hidden email]> writes:

> That sounds like a really nice tool!
>
> One problem I see is that we get yet another AST hashing
> implementation in clang. I'm aware of at least four of them that I
> personally work with:
>
> * The normal Stmt::Profile for Stmts
> * The ODR hashing for AST nodes in modules.
> * The ASTDiff also has one
> * The CloneDetector with its StmtDataCollector which can be
> specialized for hashing
>
> They all contain similar code and have similar goals. And they all
> contain different kind of bugs. And that's just the implementations
> that I'm aware of.

I have done some work on making the StmtDataCollector approach
extensible [1] (it was not possible to subclass it due to the recurring
template pattern style inheritance of ConstStmtVisitor).

I think the DataCollector approach is the best one. Note that it only
collects data from a single node, so it makes sense to combine it with a
RecursiveASTVisitor in order to hash entire subtrees.

It should be possible to rewrite ODR hashing and the others with this!

(Currently the hashing for IntegerLiteral is not stable because it uses
hash_value for APInt but I imagine this can be fixed easily)

>
> I would very much prefer if we don't add more redundant code here but
> instead try to figure out if we can get all this code in a more
> central place.
>
> Most of the use cases are just about
>
>> I have this node, please forward all interesting data into this stream so I can [store|hash|...] it and I [do|don't] care about having cross-TU stable data and maybe I want to treat special nodes in a different way
>
> so it's not necessarily a difficult problem to solve. If we find a
> common baseline for this, we could make one visitor that people can
> specialize their specific hashing needs on top.
>
> I already started doing this for the Stmt hashing when I refactored
> the StmtDataCollector. I set a common baseline that I could turn my
> two hashing implementations back then into a single one and I'm aiming
> at also replacing the Stmt::Profile implementations with it.
>
> Not sure if anyone started working on something similar for generic
> AST nodes

I am also working on a DeclDataCollector. I still have to come up with
a good way to test that two nodes indeed have a different hash value.

> (which is probably a bit more complicated than the Stmts
> which are relatively simple), but maybe we should see if we can get
> some people together to work on that. Otherwise we will have a dozen
> hashing implementations in a few years that no one can/will maintain.
>
> - Raphael
>
>

Johannes

[1] https://reviews.llvm.org/D36664

>
>
>
>
> 2017-08-03 12:41 GMT+02:00 Christian Dietrich via cfe-dev
> <[hidden email]>:
>> Hello!
>>
>> First, I'd like to give you a little bit of context, before I get into
>> the details what all of this has to do with the cfe-dev list.
>>
>> TL;DR: cHash calculates an hash over the AST and, therefore, allows
>>        the detection of redundant builds.
>>
>> In the last months, we have been working on a mechanism to avoid
>> redundant recompilations. In C/C++ projects, we observe that a change
>> to a central header file can cause a lot of compiler invocations.
>> However, not all compiler runs will lead to an object file that
>> differs from the last invocation. For example, if I add a typedef
>> that is never used in any .c file, none of the object files will
>> change.
>>
>> Our approach to this problem is that we hook into the compiler front
>> end, after the parsing and the semantic analysis and recursively
>> calculate an AST hash. We start with all definitions (variables and
>> functions) and include all relevant AST-node properties and the hashes
>> of all children and referenced types. Thereby, we get a hash value
>> that changes if a changed translation unit would result in a different
>> object file. As a byproduct, we get a hash for every function, every
>> referenced type, and all global variables.
>>
>> So far, we have only implemented cHash for C, not yet for C++. We have
>> already published our current version of the clang plugin[1] and are
>> also working on a GCC version.
>>
>> Furthermore, we presented our approach at USENIX ATC (and got a best
>> paper award, yay!)[2]. Now, we consider proposing some parts of cHash
>> for mainline.
>>
>> Our impression is that the AST hashing mechanism[3], which calculates
>> a semantic fingerprint for a given AST node is not only useful to
>> detect redundant builds. AST hashes are (computationally) fast to
>> calculate, easy to store, easy to calculate, and they capture all of
>> the semantic of an AST node. These properties should ease other techniques:
>>
>> - partial recompilation: we now know at an early state that a function
>>   body has changed directly or indirectly.
>> - change impact analysis: we can now quantify which parts of the
>>   program is actually touched, even if we only modfiy a type.
>> - project-wide analysis: have all types with the same name also
>>   the name hash, or is there an inconsistency?
>>
>> Besides the AST hashing mechanism, cHash also includes the caching and
>> compiler abortion logic. However, in my opinion this is better located
>> in an external tool, or a plugin. For example, a more close
>> integration of CLang with CCache, could give the compilation speedups
>> that we have observed in our prototype.
>>
>> Before I start on preparing parts of the chash plugin to go upstream
>> (there is definitely some cleanup required), I would love to hear your
>> opinions on the following topics:
>>
>> - Would you like to see the AST hashing in clang mainline?
>> - Would you like to have a standalone tool that calculates AST hashes?
>> - Should the detection of redundant compilations be part of CLang mainline?
>> - Should we integrate the test-suite[4] into the repo? If yes, how?
>>
>> chris
>>
>> [1] https://github.com/luhsra/chash
>> [2] https://www.usenix.org/conference/atc17/technical-sessions/presentation/dietrich
>> [3] https://github.com/luhsra/chash/blob/master/clang-plugin/hash-visitor.cc
>> [4] https://github.com/luhsra/chash/tree/master/test
>>     https://github.com/luhsra/chash/blob/master/test/control_flow/if/IfStmt_10.c
>> --
>> Christian Dietrich, M.Sc. (Scientific Staff)
>> Institute for Systems Engineering (Systems and Computerarchitecture)
>> Leibniz Universität Hannover
>> Appelstraße 4
>> 30167 Hannover, Germany
>>
>> Tel:    +49 511 762-19737
>> Fax:    +49 511 762-19733
>> eMail:  [hidden email]
>> WWW:    https://www.sra.uni-hannover.de
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [RFC] cHash: Integration of AST hashing

Sumner, Brian via cfe-dev
In reply to this post by Sumner, Brian via cfe-dev
Christian Dietrich via cfe-dev <[hidden email]> writes:

> Raphael Isemann <[hidden email]> writes:
>
>> That sounds like a really nice tool!
>
> I also really like the general principle of avoiding work that is
> redundant in the end. Let's save some polar bears! And some developers
> from serious injuries stemming from sword fights.
>
>> One problem I see is that we get yet another AST hashing
>> implementation in clang. I'm aware of at least four of them that I
>> personally work with:
>>
>> * The normal Stmt::Profile for Stmts
>
> This does indeed look structurally similar to our implementation for
> statements. However, types are only included as means of their pointers
> and, therefore, the hash is not stable over compilations. I wonder for
> which applications this is actually sufficient. And, does it really
> speed up things to badly?
>
>> * The ODR hashing for AST nodes in modules.
>
> This is also based on StmtProfilerWithoutPointers as far as I can
> tell[1]. And it is stable over invocations as it actually handles types.
> However, changing the order of declarations[1] will alter the hash. Also
> one thing we did in cHash is to use pseudrandom identifiers for
> different ast node types[3] to avoid some of the possible collisions. If
> you only add small integer values (getKind(), etc.) It is much easier to
> have a collision. Therefore, we add something like
>
> <UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>

We should probably do the same, maybe also for other enum members.

In addition, it may be necessary to add some padding or separator
between when hashing multiple nodes as they have different numbers of
fields, I am not an expert though.

>
> to the hashing stream most of the time. Ideally, the text that goes into
> the hash should be usable to reconstruct an AST tree with the same
> structure; not neccessarily with the same information.
>
>> * The ASTDiff also has one
>
> Ok, then I was unable to find it. Can you give me a pointer
>
>> * The CloneDetector with its StmtDataCollector which can be
>> specialized for hashing
>>
>> They all contain similar code and have similar goals. And they all
>> contain different kind of bugs. And that's just the implementations
>> that I'm aware of.
>
> So, when they all contain some kinds of quirks, we really should build
> a good test harness around a unified implementation. Since nobody wants
> to break the "give me a unique identifier" silently.
>
>> I would very much prefer if we don't add more redundant code here but
>> instead try to figure out if we can get all this code in a more
>> central place.
>
> I like that idea and I do not necessarily stick to our implementation.
>
>> Most of the use cases are just about
>>
>>> I have this node, please forward all interesting data into this
>>> stream so I can [store|hash|...] it and I [do|don't] care about
>>> having cross-TU stable data and maybe I want to treat special nodes
>>> in a different way
>>
>> so it's not necessarily a difficult problem to solve. If we find a
>> common baseline for this, we could make one visitor that people can
>> specialize their specific hashing needs on top.
>
> Your are right, it is not a difficult problem in the end. But when doing
> It for the whole AST there are some weird corner cases that should be
> handled correctly (recursive type definitions come to my mind)
>
>> I already started doing this for the Stmt hashing when I refactored
>> the StmtDataCollector. I set a common baseline that I could turn my
>> two hashing implementations back then into a single one and I'm aiming
>> at also replacing the Stmt::Profile implementations with it.
>>
>> Not sure if anyone started working on something similar for generic
>> AST nodes (which is probably a bit more complicated than the Stmts
>> which are relatively simple), but maybe we should see if we can get
>> some people together to work on that. Otherwise we will have a dozen
>> hashing implementations in a few years that no one can/will maintain.
>
> Bringing cHash to mainline is based on the wish that I don't want to
> maintain an AST hashing mechanism in parallel to the clang development.
> Therefore, I would love to help with that. We should collect some
> requirements that the different users of unique identifiers have. And
> perhaps we can make them all happy by introducing a few switches to a
> unified visitor.

Yes, it would be good to identify the different kinds of uniqueness that
people need.

>
> Some switches and requirements come directly to my mind:
>
> - Syntactical vs. Semantic: There should be a switch to include only
>   nodes that are actually syntacical children of the given node. Not
>   everybody wants to include the transitive hull over all referenced AST
>   nodes.

Sounds good, so the syntactical one discards any references to nodes
save for its children. I can use this in ASTDiff for checking whether
two matched nodes differ.

>
> - Multiple Hashes in one pass: With one pass over the AST I want to have
>   not only one top-level hash, but also hashes of some interesting nodes
>   (top-level definitions, type declarations, global variables). cHash
>   already caches hashes for some nodes in a map to speed things up.
>
> - Debug information: I want to turn on/off the inclusion of debugging
>   information (line numbers, file names, local identifier names).
>

These two should be easy to do, I think we definitely want a system that can
have user-provided functions for each node class.

> - Stable?: Should the hash be stable over compiler invocations? Or can
>   we take some shortcuts for types/declarations/things that are not
>   syntactially enclosed into the node.

Hopefully it is not too difficult to make it stable.

If there is a node that is referred to by its name, then it suffices to hash the
qualified name, provided that the referenced node is also included in
the hash.

So there are references to other nodes by name.
All other references between nodes I can think of are based on the tree
structure (correct me if I am wrong), so we should be fine if we hash
subtrees.

Johannes

>
> chris
> --
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel:    +49 511 762-19737
> Fax:    +49 511 762-19733
> eMail:  [hidden email]
> WWW:    https://www.sra.uni-hannover.de
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Loading...