A survey of dataflow analyses in Clang

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

A survey of dataflow analyses in Clang

Manas via cfe-dev
Hi!

In an earlier email I wrote about how LivenessAnalysis works in Clang [1], and hinted at how the static analyzer takes advantage of it. I incorrectly advertised Liveness and UninitializedVariables to be the only two dataflow analyses written in Clang, whereas consumed annotation checking [2] and thread safety analysis [3] are also utiliting dataflow. Lifetime analysis hasn't landed yet [24-28], but is also great to study.

The point of this mail is to take an objective look at these algorithms in terms of their "dataflowness", identify their similarities and differences, and what techniques they implement that could benefit all of them. To be clear, dataflow describes a wide ranges of analyses -- we're looking at flow sensitive, intraprocedural ones.

I invite you to point out any and all the incomplet and inkorrek items in this survey.

---=== The many struggles of C++ data flow analysis in Clang ===---

Not all data flow struggles are C/C++, or Clang specific, but some are, and universal difficulties (aliasing, for instance) might be obfuscated by them more than usual. The following is a non-comprehensive list of such issues all data flow algorithms in Clang face.

---  Reads/writes ---

A common theme among data flow algorithms is looking for gens (generations, or in other words "reads" or "loads") or kills (in other words "writes" or "stores") of either variables or values. When we say GEN or KILL set, we mean a set of variables generated, or killed at a given statement. GEN and KILL sets can be defined for CFGBlocks as well from the sets of the contained statements. Many data flow analyses are defined with a fixpoint algorithm on GEN/KILL sets. In literature, we assume that these sets are precalculated, but that is rarely a case for Clang.

In literature, most data flow algorithms are defined and showcased on either some simple pseudo code, or an instruction set in a three address form:

x <- y + z

This is a very significant simplification, as the appearance of a variable's in Clang's AST (DeclRefExprs) is not very telling of its context. [10, 11, 13] all mention this a very significant difficulty in Clang: only the surrounding context, and even that with great difficulties, can tell whether a variable is written or read. LLVM IR is largely void of this problem, but in [14] Chris Lattner provides a lengthy reasoning why it'd be next to impossible to lower to LLVM IR, conduct analysis, but still produce diagnostics in the original source code. LLVM IR is still tempting enough though that we have ongoing projects to get some use out of it [15].

This poses another problem: in literature, transfer functions are responsible for the flow of dataflow information through a statement. For liveness, a transfer function for "x = a;" would tell that "a" should be added to liveness set, and "x" should be removed. In clang however, where transfer functions are implemented with statement visitors, the fact that we need substantial contextual information for any given statement is sometimes a problem.

There is a push for an MLIR dialect for C++. You can find a discussion in [19], and there will also be talk on the next dev meeting, but that will happen after this mail [17, 18].

--- Classic data flow issues: Aliasing, function calls, unfeasible execution paths ---

Aliasing is an often discussed, unavoidable obstacle for C/C++, and intraprocedural of most dataflow algorithms is also problematic.

int a = 5;
int *p = &a;
foo(a, p);
// foo can not know that using the second parameter affects the first.
// Also: will foo write these arguments? Will it read them?


This naturally allows a third category next to gen/kill: unknown. On such a statement, the variable may be read or written, maybe neither, maybe both. Literature solves this by:
* Overestimating: If we intend to emit diagnostics for values computed but never read (dead values), we say that it is not read only if it is killed without ever being read, or occuring is such an unknown statement (we overestimate the liveness sets),
* Underestimating: For uninitialized variable misuse, we only look for variables that were read, but never killed up to that point or appeared in an unknown context (we underestimate the uninitialized variable sets).
However, in practice, its worth keeping track of such unknown events. Some analyses go further, and assign a sort of confidence level: "we're not absolutely sure whether this is kill, but its reasonable to believe that it isn't".

Unfeasible execution paths are also an inherent problem for dataflow.

if (a)
  x = 0;
if (!a)
  10 / x;

Sure the flow of control will never go through both of these then branches (in a single threaded environment), but that is an undecidable problem in dataflow. The solution here is the same as above, under or over approximation. The foundation of this lies in lattice theory, which is explained in [30, 31].

--- Records, fields, methods ---

Suppose you need to conduct data flow analysis on the following code in clang, and need to identify variables in the code:

void A::foo(int u) {
  int i;
  this->a = 5;
  S s;
  s.a->get().b = 6;
  (*s.a).c = 9;
  // ...
}


The parameter "u" and "i" are easy enough, they are both VarDecls, and so is the implicit this parameter. "s" is also a VarDecl, but its fields aren't, they are FieldDelcs. An option would be to keep record of a chain of fields -- for s.a.b.x.z, a pair like this would indeed identify which memory region we're describing: (s, [a, b, x, z]). However, there are a lot of ways to obfuscate this specific memory region, for example, if we ever support some aliasing the variety of ways dereferencing can be expressed will pose a challenge. Simply put, its difficult to compare expressions [16].

---=== 1. LivenessAnalysis ===---

Liveness answers the question whether value computed at one point will ever be used before being overwritten:

int a = 5;
a = 6;
if (rand())
  process(a);
//...


The value 5 from 'a' will never be read, meaning that 'a' is not live from the point of its initialization until it is assigned 6. After that point it is live until the next time it's assigned, because there exists a path from the assignment to a read of 'a' without any interleaving assignments.

The product of the analysis is liveness sets (sets of variables or sets of expressions). They are calculated for each statement, and transitively, each CFGBlock. If a variable/expression is found in this set, it means that they are live (will be read from later) at that point. The algorithm is described and implemented as a
* Backward analysis: whether a value will be read is a property of the future
* May analysis: we're interested whether it *could* be read from, not whether its guaranteed. Coupled with the fact that infeasible execution paths are also analyzed, this will overapproxiate the actual liveness sets.

It was mentioned above that transfer functions require considerable context to know whether a statement writes/reads a variable. Transfer functions for DeclRefExpr that look for variables usages have a problem, namely that by the time a statement visitor reaches a DeclRefExpr the read/write information is already lost. So, how does Liveness solve this issue? Does it implement a pass that categorizes DeclRefExprs before the main analysis runs? It turns out, the fact that Liveness is a backward analysis allows it to gather this information on the go [34]:

For the following code snippet:

void f() {
  int i;

  i = 5;
}


This is the generated CFG:

void f()
 [B2 (ENTRY)]
   Succs (1): B1

 [B1]
   1: int i;
   2: 5
   3: i
   4: [B1.3] = [B1.2]
   Preds (1): B2
   Succs (1): B0

 [B0 (EXIT)]
   Preds (1): B1


In B1, the 4th element, which is the assignment, is visited before the 3rd, where the written variable is, meaning that by the time the transfer function for DeclRefExpr is ran, Liveness could've already noted it as a write.

The analysis, as described in [4], which Clang follows closely, ignores the worst of the read/write problem. Only variable declarations and assignments are regarded as writes. If a DeclRefExpr pops up in any other context, it is seen as a read, even if its passed variables to function call as a non-const reference, etc. Also, the algorithm doesn't respect aliasing, and doesn't regard the fields of a record as a distinct, writable memory regions. However, there is still value in the analysis interpreted this way. The Static Analyzer conducts liveness analysis with some relaxations [1], and its results are used, with a few other techniques, to know how long a variable's memory region needs to be kept in the Environment. This simplicity also allows the algorithm to have a very simple implementation.

The way the Static Analyzer utilizes this analysis requires clang to keep the liveness sets for a while, making it the only dataflow analysis that doesn't immediately discard its results. This might explain the choice of data structures, which is, contrary to what many literature suggests (bitvectors), are ImmutableSets. Since these sets are accessed far more regularly then the result of other analyses, its a non-trivial claim that either solution is faster than the other.

[1] is purely dedicated to Clang's Liveness, and is a great source for further reading.

--- Strengths / Deficiencies ---

+ Straightforward implementation without many cornercase handling or struggle with the difficult read/write problem.
+ The result of the analysis is cached through clang::ManagedAnalysis, and is easily to query.

- Has some crude hacks, the code is a bit challenging to follow and is rather poorly documented.
- Non-relaxed liveness analysis has tests, but isn't actually used anywhere in the codebase.
- The choice of data structures is probably inefficient.

---=== 2. UninitializedVariables ===---

The point of this analysis is self explanatory -- catch the usage of a variable that has been initialized. There are a variety of severities we can define: a variable is uninitialized on all paths leading to the usage, or only on some (which can be divided into further subcategories).

Some literature, like [4], talks about uninitialized variable analysis as a side product of liveness analysis, while others, like [30], describe it as a distinct analysis. [4] isn't necessarily wrong, if a variable is known to be uninitialized but is live, then an uninitialized value is read on some path of execution. However, liveness is an over-approximating algorithm, meaning this path of execution might be infeasible. For diagnostics, uninitialized variables should be under-approximating to avoid false positives. On the implementation side, the categorization of statements as variable reading/writing requires far more sophistication then what liveness offers. For this reason, this is a
* Forward analysis: Initializedness is a property of the past.
* Must analysis: We want to be sure that the variable is uninitialized. However, in practice, we're a bit more lenient on this, as we shall see.

Uninitialized variables doesn't enjoy the advantage Liveness does regarding the categorization of statements as variable reading/writing, so it implements multiple passes:

--- First pass: ---

Clang doesn't have any precalculated KILL/GEN sets, though UninitializedVariables stops just short of doing that. As a distinct pass, it scans the code associated with the CFG (for the most part, that is a FunctionDecl), and classifies each DeclRefExpr whether its an initialization/write, read, read as a const reference, self initialization (int x = x;), or should be ignored by the analysis.

void use(int x); // note that this is a pass-by-value parameter

int x;
if (rand())
  x = 5; // Write of x
use(x); // Read of x


--- Second pass: ---

This is where uninitialized variable usages are actually calculated. Indeed, this is reminiscent of liveness analysis: we assign an uninitialized variables set to each statement, initially with the information gathered from the first pass, then propagating it with a fixpoint algorithm. We found an uninitialized variable use, when the fixpoint algorithm halts, and a set associated with a statement notes a variable as uninitialized, yet the same statement also reads this variable.

During the fixpoint calculation, there are a few extra steps to be taken that are absent in LivenessAnalysis:

* Variable declarations (which are not DeclRefExprs, and as such won't be classified in the first pass), and their lack of initialization are noted here.
* Handling of special cases: inline asm, ObjectiveCMessageExpr, OMPExecutableDirective
* If a variable read is found where the variable is sometimes uninitialized (not on all paths leading to the read), we'll look for block edges that unconditionally lead to an uninitialized variable. This is used to say stronger than 'may be uninitialized', namely that 'either it's used uninitialized or you have dead code'. There is a great block of comments that details this in the actual code [5].

--- Third pass ---

Though technically this is an implementation detail, a third pass is used to emit diagnostics. Block visitations can be accompanied with a callback object, which is a caching object in the second pass, and a report emission object in the third.

--- Strengths / Deficiencies ---

+ Has a far more sophisticated understanding of when a variable is read from / written to.
+ Has a great categorization system. Variables that are maybe uninitialized can be analyzed further to judge how likely the lack of initialization is.

- Lacks support for record fields.
- The third pass looks like a hack, and could probably be factored out.
- The analysis results are non-reusable. Is not a clang::ManagedAnalysis.

---=== 3. Thread Safety analysis ===---

Clang's oldest dataflow analysis, Liveness, was committed on the 6th of September, 2007 [6]. Thread safety analysis isn't much younger though, the earliest proposals and discussions popped up in July, 2008 on the GCC mailing lists [7-8]. Its eventual port to clang came a few years later in June 2011, which was proposed by Caitlin Sadowski [9], and was presented at the 2011 LLVM Developers' Meeting [10].

The analysis is a subject of a 2014 article [11], notably after the analysis matured some. It is an amazing piece of literature, very easily digestible, but doesn't sacrifice anything for readability. If you want to learn more about many of the challenges data flow presents in Clang, or thread safety checking in general, I highly recommend you read it. You can assume that anything I write about this analysis is cited from either this paper or the similarly amazing dev talk.

The domain of the analysis is different this time: Liveness looks for expressions, and both Liveness and UninitializedValue look for local variables. The domain of Thread Safety analysis is on _annotated_ member or non-member functions in addition to _annotated_ variables.

Annotations are a very important component of this analysis. The user annotates which variable or function is guarded by which mutex. Functions that lock or unlock a mutex are annotated likewise. Annotations on functions have an interesting effect; though the analysis is still intraprocedural, it will be able to argue across function calls and achieve some path-sensitivity [25 at 3:27]: If a function's precondition is to have a set of mutexes locked, that must be annotated (otherwise Clang will assume said mutexes are unlocked), and this precondition can be checked on the caller side. If the precondition and the postcondition have different set of lucked mutexes, that must be annotated as well (otherwise Clang will emit a warning), so the callee can check whether the post condition holds.

A set of locked mutexes are computed for each statement in a block. The set for the last statement in the block becomes the set associated with the block.

// Some CFGBlock, suppose that no mutexes are locked before reaching it
{
  /*{}      */ std::mutex mu, um;
  /*{}      */ int a GUARDED_BY(mu);
  /*{mu}    */ mu.lock();
  /*{mu, um}*/ um.lock();
  /*{mu, um}*/ foo();
  /*{um}    */ mu.unlock();
}
// {um} is the lock set associated with this block


Suppose you have the following CFG:
   
             -> [B3] ->
           /            \
[B1] -> [B2] -> [B4] -> [B5] -> [B6]


Suppose the blocks lock the following mutexes before propagating this information with the use of dataflow:

             -> {mu} ->
           /            \
{  } -> {  } -> {mu} -> {  } -> {(unlocks mu)}


How do we propagate this information throughout the graph? Following the classic fixpoint algorithms most dataflow analyses implement, this is simple enough; since all ascendant blocks to B5 locks "mu", it must be locked in B5 as well (there is no need to think about under/over approximation). Since B6 unlocks mu, it won't be present in its lock set.

             -> {mu} ->
           /            \
{  } -> {  } -> {mu} -> {mu} -> {  }


Suppose B4 doesn't lock "mu" (for instance, B2 branches on mu.try_lock()). In that case, the joint point at B5 will have differing incoming lock sets:

             -> {mu} ->
           /            \
{  } -> {  } -> {  } -> {  } -> {(unlocks mu)}


What should the lock set of B5 be after the analysis? What if we introduce a loop:

             -> {mu} ->
           /            \
{  } -> {  } -> {  } -> {  } -> {(unlocks mu)}
          \              /
           <-------------


We could decide that for each cases where we need to merge differing lock sets:

* We merge the sets (a may analysis, the lock sets will be overestimated)
* We intersect them (a must analysis, the lock sets will be underestimated)
* Note that the lock may be either locked/unlocked, similarly to uninitialized variables analysis.

Thread safety analysis cuts the Gordian knot [12] by asserting that code that allows differing lock sets on merge points is error prone, and emits a warning. This is a very significant simplification, since there is no longer a need for a fixpoint algorithm, the sets can be calculated and propagated in O(N) time for a CFG with N blocks. It could be argued that this is too restrictive, but [11 p5.] is confident that that allowing such constructs doesn't have a real practical benefit, and is a sign of code smell.

--- Thread Safety's own intermediate language ---

Thread Safety analysis allows the annotation of methods and pointees. This poses a very big problem: the comparison of expressions [16].

 class A { Mutex mu; int dat GUARDED_BY(this->mu); }
 class B { A a; }

 void foo(B* b) {
   (*b).a.mu.lock();     // locks (*b).a.mu
   b->a.dat = 0;         // substitute &b->a for 'this';
                         // requires lock on (&b->a)->mu
   (b->a.mu).unlock();   // unlocks (b->a.mu)
 }


There is only a single mutex that is locked/unlocked, but the AST makes this difficult to understand. Just as the code states, knowing that b->a.dat is guarded by b->a.mu is a non-trivial step as well. For this reason, stunningly, Thread Safety Analysis lowers Clang expressions, implementing its very own IR, a Typed Intermediate Language, or TIL.

Quoting from [20] (this has changed a lot since then, but captures the essence of it):

"Unlike a clang Expr, a SExpr does not capture surface syntax, and it does not distinguish between C++ concepts, like pointers and references, that have no real semantic differences. This simplicity allows SExprs to be meaningfully compared, e.g.
        (x)          =  x
        (*this).foo  =  this->foo
        *&a          =  a"

The implementing code is well documented, but I failed to find discussion previous to the first commit, or some high level description of it. [40] offers some, but is very in medias res. I'm honestly not confident enough in my knowledge of IRs to talk about this much. If you have any pointers, any and all would be appreciated!

--- Some dataflow aspects ---

Thread safety is a forward analysis. To detect an improper write on a variable that should be guarded by a mutex implies that the read/write problem needs to be solved. However, unlike UninitializedVariables analysis which is also a forward analysis, this isn't done in a distinct pass. The reason behind this is that thread safety doesn't have a transfer function for DeclRefExpr. Instead, it has a transfer function for each context that might access a variable: unary/binary operators, lvalue-to-rvalue casts, call expressions, etc, translates its argument to TIL, and checks whether it was a variable of interest.

Whenever a function that locks/unlocks a mutex is found, the corresponding mutex expression (for "a.b.Lock()", this is "a.b", for a function annotated with "ACQUIRE(mu)", this is "mu") is translated to TIL. These TIL expressions form the actual lock sets. When a variable is found to have been read/written, the analyzer checks whether the guarding mutex is in the lock set.

This analysis doesn't care about the aliasing issue. This can cause false positives, but the rationale is that these kinds of reports are an acceptable sacrifice for greater coverage.

That is all I was interested -- I can't recommend [11] and [10] enough if you wanna dive deeper, or peek at the code, it practically reads itself aloud.

--- Strengths / Deficiencies ---

+ Exceptional documentation. Articles, design discussions are readily (and without charge!) available on  the web.
+ Has strong C++ support, it is able to translate Exprs into its own intermediate language. Seriously, that is nuts.
+ Actively maintained.
+ O(n) analysis time for a CFG with n blocks.
+ Solves the read/write issue without a distinct pass.

- The analysis results are non-reusable. Is not a clang::ManagedAnalysis.
- TIL seems to be severely undertested. You can litter a lot of llvm_unreachables in the code, and no tests will fail.
- It seems like you can't annotate subfields, but that might as well be a feature:

struct A { int i; };
struct B {
  A a; // how to annotate B::a.i?
};

---=== 4. Consumed analysis ===---

Before its implementation, consumed analysis was called uniqueness analysis [21]. In his proposal, Christian Wailes explains the motivations in great depths in [22]. In terms of development, this analysis laid dormant for about 5 years, but has received a few improvements last year.

Consumed analysis is very similar, but not as high-tech as Thread Safety is; it is also a single-pass, annotation based algorithm, that tracks a set of consumed/unconsumed variables, instead of a set of locked mutexes. While I didn't devote too much time to this analysis (party because it doesn't have as much available discussion as Thread Safety, and party because it seemed really similar to it), I personally think this is the least mature of the bunch.

---=== 5. Lifetime analysis ===---

There is a fifth major data flow analysis in clang I'm aware of, though it isn't a part of upstream clang: Lifetime analysis. This was based largely on Herb Sutter's paper [29], and has been the subject of numerous talks [24-28].

[29] is an amazing, step-by-step description on the lifetime algorithm. [25] is a great talk about how this was implemented in Clang specifically, which was preceded by design discussion on the mailing list [32, 33]. Herb's paper and the dev talk alone are so thorough, I can't really do it any justice in this survey. As such, this mail will focus on the "dataflowness" of the algorithm, largely ignoring non-flow sensitive lifetime analyses and the kinds of ownership categories that were proposed with annotations, etc. This simplification quickly boils the discussion down to what is essentially a points-to analysis.

Points-to analysis collects a set of memory regions for each pointer it might point to at different points of the program. If this set contains null or invalid, that is a sign of a programmer error. This implies a

* Forward analysis: what a pointer will point to is a property of the future
* May analysis: we are interested in the set of memory regions the pointer might point to across all paths in the CFG, meaning that we're overapproximating the real points-to set.

The analysis thoroughly defines the semantics of the points-to sets, or psets, as literature calls it. Similarly to thread safety analysis with try_lock(), these sets may change at a condition point as well, should the a pointer in the pmap (which maps pointers to their psets) participate in them:

if (p == q) {
  // pset(p) == pset(q)
}
if (p) {
  // pset(p) doesn't contain null
}

They are affected by, well, lifetime:

int *p; // pset(p) == {invalid}
{
  int x;
  p = &x // pset(p) == {&x}
} // pset(p) == {invalid}
*p = 5; // warning

Asserts pose a serious problem for lifetime analysis in particular. Suppose you have the following assert:

assert(p && q);
...

(bool)(p && q) ? 0 : abort(); // assert macro expanded
...

And the CFG would look like this:

    ------------>           ----> [abort()]
  /              \        /
[p] -> [q] -> [(bool)(p && q)] -> [...]

The path on which pset(p) == {null} merges with the path where pset(p) doesn't contain null, before the nonreturn abort() block is reached. Lifetime analysis solves this by an amazing technique; when it encounters such a pattern, both psets for the false and true branches are kept, doing a sort of local pathsensivity. Later, on the next branch, the the appropriate set is propagated to the true/false paths.

Thread safety neatly avoids with a distinct annotation (ASSERT_EXCLUSIVE_LOCK) that can be used instead of the standard assert. Liveness, UninitializedVariables sets don't change on conditions, so they aren't affected.

A may analysis in this context is risky. The set may have included null or invalid from an infeasible path of execution. To help filter out the reports, lifetime analysis employs an array of great tricks to find reports that are far more likely to be true positives. Such is the case when an assignment of null to a pointer and a dereference of that pointer is in a dominance or post-dominance relationship, similar to what was written about UninitializedValues. Its hard to tell whether a statement that adds null/invalid to a pset is a reaching definition to the dereference point, but there is no reaching definitions in clang yet, so dominance sets must suffice for the time being.

Lifetime analysis was conceptualized with strong C++ support from the get-go. Records object that have an Owner data member are Owners themselves, non-Owner record objects that have a Pointer data member are Pointers, and non-Owner, non-Pointer record objects are Aggregates (I left a lot of rules out, see [29 p8-10] for details). Owners and Pointers are modeled as a single variable like any other non-record object, and aggregates are exploded, each field is regarded as a variable:

struct A { int x, y; };
struct B { A a; int z; };

B b; // variables: b.a.x, b.a.y, b.a, b.z, b

The Variable [25] class accurately captures the domain of the analysis. It gracefully handles CXXThis, temporaries and local variables.

As the psets are calculated, the read/write problem pops up naturally. [29 p12-24] thoroughly explains what the solution is; lets see an example before the explanation:

a = &b.a;

Relevant part of the CFGBlock:

4: b
5: [B1.4].a
6: &[B1.5]
7: aptr
8: [B1.7] = [B1.6]


The created psets:

Set RefersTo[DeclRefExpr] = (b)
Set RefersTo[MemberExpr] = (b.a)
Set PSetsOfExpr[UnaryOperator] = (b.a)
Set RefersTo[DeclRefExpr] = (aptr)
PMap[(aptr)] = (b.a)
Set RefersTo[BinaryOperator] = (aptr)


At CFGElement 4, "b" is assigned the pset {b}, at E5 "b.a" is assigned {b.a}, at E6 "&b.a" is assigned {b.a}, and at last, at E7 "aptr" is assigned {aptr}. The transfer function for the assignment sees that the pset on the LHS, {aptr}, resolves to the Variable "aptr", and assigns it the pset on the RHS, {b.a}.

In essence, the transfer function for each expression calculates their respective psets, while the transfer function for some expressions propagate these sets. This implies that at the point of a DeclRefExpr, the analysis doesn't need to know whether the pset is in the context of an read or write.

The domain of this analysis is entirely different than liveness or uninitialized variables, so what was previously mentioned as GEN/KILL sets don't make sense here. Also, [29] calls a variable killed when it goes out of scope, not when it is written.

--- Strengths / Deficiencies ---

+ Unlike other analyses, the relevant literature is centered around C++, so the challenges that the language poses doesn't present an immediate engineering problem.
+ Exceptionally well documented in literature/talks.
+ The same algorithm is already implemented in MSVC, so it is battle-tested.
+ Excellent support for C++ via the Variable class.
+ Solves the read/write problem in a single pass as a forward analysis.

- The analysis results are non-reusable. Is not a clang::ManagedAnalysis.
- Still WIP, it uses inefficient data structures, and has a lot of TODOs/FIXMEs.
- Has to compensate for the fact that psets are overestimated, but lacks the toolset to do it particularly well.

---=== 6. Reaching definitions? ===---

If you wanna see how would somebody implement a dataflow algorithm without knowing what on earth dataflow even is, or have done any meaningful research, I recommend looking through my earlier works. The idea to implement this was first fetched in a GSoC meeting in June 11th, 2019, and the first mail on the list was [36]. Conversations about using it as a part of my GSoC to help on the static analyzer's report generation followed in [37, 41]. Since then, it popped up as an idea here and there, the above mentioned links will guide you to them, if you care.

At this point, I was confident in my ability to implement this algorithm, because I read the relevant wikipedia article, and [4]'s relevant chapter. I never understood what transfer functions are, let alone what lattice theory is, and never bothered to look at other dataflow analyses. [38] is the latest version I uploaded, and the questions raised made me realize how far behind I was truly lagging. Hence this survey!

---=== Conclusion ===---

[31 chapter 9.3] describes a dataflow analysis framework as follows: it is a (D, V, ^, F) quadruple, where
* D is the direction of the dataflow,
* V is the lattice, which includes the domain of the analysis,
* ^ is the meet operator,
* F: V -> V, a family of transfer functions.

Out of these 4, we have D factored out [39], and F, if we regard statement visitors as such. As for the other two, an agreement on the most efficient data structure with a merge/intersect operation would tie things together. All dataflow algorithms above need to manage sets for each statement in the CFG, and these sets are usually created from one another. Uninitialized variables and thread safety, and literature in general agree that bitvectors are a great choice, yet both of these analyses implement their own solution.

No analyses but liveness is reusable.

Support for the fields of a record are only supported by thread safety and lifetime. Thread safety's TIL is amazing, but it is also TIL's only user, and it only uses a tiny fraction of it. Also, as a compiler novice myself, I found that it drastically increased the barrier of entry to its codebase, though I'm yet to study IRs in depth. Lifetime's approach is much simpler, and I admit to have not understood what really justifies TIL.

For GEN/KILL analyses, such as liveness, uninitialized variables and reaching definitions, it should be possible to factor a lot of knowledge out about what statements read/write a variable. However, these analyses don't always agree on what is a GEN or KILL, so any attempt at refactoring must be configurable to some extent.

---=== Further reading ===---

I'm hardly a scholar on dataflow analyses, and I haven't taken all of the available compiler studies related courses in my uni. Though I'm trying to make a decent dent in all the great books and scholarly articles I have access to, I've got more to go; here are some sources I used and can recommend if you are looking for more dataflow knowledge:

* Engineering a Compiler [4] is an excellent book on compiler design. It starts scanners, parsers, talks about intermediate representations, various forms of dataflow analyses, optimization all the way to code generation. As one can tell, this isn't laser focused on static analysis for diagnostic purposes, and definitely aims to use the fruits of static analysis mainly for optimization. Nevertheless, static analyses engineered for optimization are still an invaluable tool for spotting programming errors. Chapters on intermediate representations and dataflow analyses are a great read, though in my view they don't trivially translate to how one can implement them in Clang.

* Static Program Analysis [30] is an amazing and completely free book on static analysis. It takes a large step towards formality compared to [4], and offers a numerous exercises for the reader, though isn't quite as thorough as [31]. It presents lattice theory, a foundation of the presented analyses, and shows that building on this theory gracefully handles the problem of over/under approximation. This book gave me a far better understanding of dataflow, and stands a lot closer to what I've observed in Clang. Up to this point, this has definitely been my favourite book to learn from regarding this subject.

* Compilers: Principles, Techniques and Tools [31], AKA the second edition of the infamous dragon book. As I understand it, its a bigger and better version of [4], but doesn't have same target audience. [4] is a lot friendlier to beginners, and doesn't delve into the theoretical background of a topic unless its necessary to understand the chapter. It is more formal, laying out the mathematical foundations for most items. Chapter 9. "Machine independent optimizations" is a dataflow treasure chest, practically a must read, but as a novice, I definitely appreciated that I already read up on dataflow before.

* For practical applications of dataflow analyses, all the references below are great; they are mostly a sample of design discussions/patches of the past 12 years in Clang.

---=== Closing words ===---

Special thanks to Gábor Horváth and Gábor Márton for helping me collect some relevant literature and discussions! I definitely plan to spin parts of this mail into a paper of some sort in the future, please don't beat me to it :)

Cheers!
Kristóf Umann

---=== References ===---

[1] [cfe-dev] [analyzer] An in-depth look at Liveness Analysis in the Static Analyzer, http://lists.llvm.org/pipermail/cfe-dev/2020-July/066330.html
[2] https://clang.llvm.org/docs/AttributeReference.html#consumed-annotation-checking, clang/test/SemaCXX/warn-consumed-analysis.cpp
[3] https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
[4] Cooper, Keith, and Linda Torczon. Engineering a compiler, second edition. Elsevier, 2011.
[5] https://github.com/llvm/llvm-project/blob/054704082b461418d3dac3a379792cdaf52d40b3/clang/lib/Analysis/UninitializedValues.cpp#L511
[6] Added an early implementation of Live-Variables analysis built on source-level CFGs. This code may change significantly in the near future as we explore different means to implement dataflow analyses. https://reviews.llvm.org/rGb56a9909556aee1670e88adcf8b0a56dc25c5c9e
[7] Thread safety annotations and analysis in GCC, https://gcc.gnu.org/pipermail/gcc/2008-June/177461.html, continuing in https://gcc.gnu.org/pipermail/gcc/2008-July/178359.html
[8] C/C++ Thread Safety Annotations proposal, https://docs.google.com/document/d/1_d9MvYX3LpjTk3nlubM5LE4dFmU91SDabVdWp9-VDxc/edit
[9] [cfe-dev] Proposal for thread safety attributes for Clang, http://lists.llvm.org/pipermail/cfe-dev/2011-June/015899.html, continuing in http://lists.llvm.org/pipermail/cfe-dev/2011-July/015906.html
[10] 2011 LLVM Developers’ Meeting: D. Hutchins “Thread Safety Annotations in Clang”, https://www.youtube.com/watch?v=1em66mRozm0&ab_channel=LLVM
[11] Hutchins, DeLesley, Aaron Ballman, and Dean Sutherland. "C/c++ thread safety analysis." 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation. IEEE, 2014.,
[12] https://en.wikipedia.org/wiki/Gordian_Knot
[13] 2019 EuroLLVM Developers’ Meeting: T. Shpeisman & C. Lattner “MLIR: Multi-Level Intermediate Representation for Compiler Infrastructure, https://www.youtube.com/watch?v=qzljG6DKgic
[14] [cfe-dev] LLVM Dev meeting: Slides & Minutes from the Static Analyzer BoF (2015. November) http://lists.llvm.org/pipermail/cfe-dev/2015-November/045872.html
[15] [cfe-dev] [analyzer][RFC] Get info from the LLVM IR for precision, http://lists.llvm.org/pipermail/cfe-dev/2020-August/066537.html
[16] https://github.com/llvm/llvm-project/blob/master/clang/include/clang/Analysis/Analyses/ThreadSafetyTIL.h#L27
[17] [cfe-dev] [analyzer][RFC] Get info from the LLVM IR for precision, mentioning the talk, http://lists.llvm.org/pipermail/cfe-dev/2020-August/066611.html
[18] CIL : Common MLIR Dialect for C/C++ and Fortran, 2020 Virtual LLVM Developers' Meeting.
[19] [llvm-dev] MLIR for clang http://lists.llvm.org/pipermail/llvm-dev/2020-February/139192.html
[20] Thread safety analysis: refactor to support more sophisticated handling. https://reviews.llvm.org/rC161690
[21] [patch] Adding Consumed Analysis to Clang,  https://reviews.llvm.org/D1233
[22] [cfe-dev] Proposal for Uniqueness Analysis, http://lists.llvm.org/pipermail/cfe-dev/2013-July/030635.html, continuing in http://lists.llvm.org/pipermail/cfe-dev/2013-August/031331.html
[23] https://github.com/mgehre/llvm-project/tree/lifetime
[24] CppCon 2018: Herb Sutter “Thoughts on a more powerful and simpler C++ (5 of N)” https://youtu.be/80BZxujhY38?t=1095
[25] 2019 EuroLLVM Developers’ Meeting: G. Horvath & M. Gehre: Implementing the C++ Core Guidelines' Lifetime Safety Profile in Clang https://www.youtube.com/watch?v=VynWyOIb6Bk&ab_channel=LLVM
[26] CppCon 2018: “Implementing the C++ Core Guidelines’ Lifetime Safety Profile in Clang” https://www.youtube.com/watch?v=sjnp3P9x5jA&ab_channel=CppCon
[27] CppCon 2019: Gábor Horváth, Matthias Gehre “Lifetime analysis for everyone”, https://www.youtube.com/watch?v=d67kfSnhbpA&ab_channel=CppCon
[28] Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spring 2019, https://www.youtube.com/watch?v=EeEjgT4OJ3E&ab_channel=corehard
[29] Herb Sutter: Lifetime profile v1.0 posted. https://herbsutter.com/2018/09/20/lifetime-profile-v1-0-posted/
[30] Møller, Anders, and Michael I. Schwartzbach. "Static program analysis." Notes. Feb (2012). https://users-cs.au.dk/amoeller/spa/spa.pdf
[31] A. V. Aho, Monica S. Lam, R. Sethi, Jeffrey D. Ullman. "Compilers: Principles, Techniques and Tools", Pearson New International Edition, 2nd Edition
[32] [cfe-dev] [RFC] Adding lifetime analysis to clang, http://clang-developers.42468.n3.nabble.com/RFC-Adding-lifetime-analysis-to-clang-tt4063068.html (the thread is very hard to follow on cfe-dev, so this site is a better bet)
[33] [cfe-dev] [RFC] Upstreaming Lifetime Function Annotations http://lists.llvm.org/pipermail/cfe-dev/2019-December/064067.html
[34][analyzer][Liveness][NFC] Remove an unneeded pass to collect variables that appear in an assignment, https://reviews.llvm.org/D87518
[36] [cfe-dev] [analyzer] Speaking about reaching definitions... http://lists.llvm.org/pipermail/cfe-dev/2019-June/062733.html
[37] [cfe-dev] [analyzer] Using reaching definitions analysis for bug report construction, http://lists.llvm.org/pipermail/cfe-dev/2019-July/062885.html
[38] [analysis][analyzer] Introduce the skeleton of a reaching definitions calculator https://reviews.llvm.org/D76287
[39] [DataFlow] Factor two worklist implementations out, https://reviews.llvm.org/D72380
[40] [cfe-dev] Next step (discussion on TIL) http://lists.llvm.org/pipermail/cfe-dev/2014-May/036983.html
[41] GSoC 2019: Enhancing bug reports in the Clang Static Analyzer, https://szelethus.github.io/gsoc2019/

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: A survey of dataflow analyses in Clang

Manas via cfe-dev
Hi!

Thanks for this survey! I am just adding some random notes inline that might or might not be useful. Some of them include questions and musings.

On Sun, 4 Oct 2020 at 21:43, Kristóf Umann <[hidden email]> wrote:



---  Reads/writes ---

A common theme among data flow algorithms is looking for gens (generations, or in other words "reads" or "loads") or kills (in other words "writes" or "stores") of either variables or values.

One of the advantages of GEN/KILL based analyses is that they are distributive. This is a class of analyses that are well studied and there is a rich literature on how to efficiently evaluate them (like IFDS).  Also, the sets are often represented using bit-vectors where each position corresponds to a variable/register. Rust also uses this approach for liveness and some other analyses.
 

In literature, most data flow algorithms are defined and showcased on either some simple pseudo code, or an instruction set in a three address form:

x <- y + z

Clang's AST and CFG is a very heavy weight for most dataflow analyses that deal with low-level concepts like liveness, or definition/use. I'd love to see a lightweight overlay on top of the Clang AST that can be used to categorize all nodes into some basic operations. For example, there is a large class of dataflow analyses that only care about function calls, reads, and writes. For those analyses, it is just noise to distinguish an overloaded operator call from a constructor call and so on. Currently, every dataflow analysis is doing an ad-hoc abstraction over the AST/CFG. The question is, is it viable to come up with something that fits most analyses, so people don't have to reinvent those abstractions?

Would MLIR make it easier to create such an overlay?
 

In clang however, where transfer functions are implemented with statement visitors, the fact that we need substantial contextual information for any given statement is sometimes a problem.

Is this need for context coming from the use of linearized CFGs? (I.e. every element of a basic block is a subexpression, not a whole one).
 


--- Classic data flow issues: Aliasing, function calls, unfeasible execution paths ---

Aliasing is an often discussed, unavoidable obstacle for C/C++, and intraprocedural of most dataflow algorithms is also problematic.


LLVM IR passes can query alias analyses. As dataflow algorithms are ad-hoc in Clang, we do not have an infrastructure for these analyses to depend on each other. In an ideal world, we could create an alias analysis that always returns MayAlias and write other dataflow analyses against its interface and once that analysis becomes smarter, all the other dataflow analyses start to improve in precision. While I do agree, we cannot map back analysis results from LLVM
IR to the source code in general (especially, when we want to generate meaningful diagnostics), but I wonder whether it is actually possible for some restricted cases like alias analysis.
 


--- Records, fields, methods ---


There are some well-known abstractions for field sensitive analyses. Those include access paths and store-based models. Due to the heap and loops, these models are potentially unbounded and there are several methods to summarize those cases including k-limiting. Here is a survey of some methods: https://arxiv.org/abs/1403.4910

 

---=== 1. LivenessAnalysis ===---


C++ has many peculiarities with liveness analysis. For example, if we have a local variable with a dtor, it is live for its whole lexical scope as the dtor might read its fields. So it looks like we have both lexical and non-lexical liveness, and this can be very confusing.



If a DeclRefExpr pops up in any other context, it is seen as a read, even if its passed variables to function call as a non-const reference, etc.

This is a serious problem in the C++ language. Parameter passing describes the HOW (e.g. by reference) instead of the WHAT (e.g. to read).  So seeing a non-const reference we can never be sure what the intention of the developer was. Is this a read and the user forgot the const? Is it a read followed by a write? Is this a write-only?
 
Also, the algorithm doesn't respect aliasing, and doesn't regard the fields of a record as a distinct, writable memory regions.

Depending on the analysis client's need, this might be OK. E.g. do we want under- or overapproximation? What is the consequence of imprecision at the client? Making an analysis more precise only makes sense if the client can benefit more from the additional precision compared to what we lose to achieve this precision in the first place.
 

---=== 2. UninitializedVariables ===---

--- Third pass ---

Though technically this is an implementation detail, a third pass is used to emit diagnostics.

I believe that for every monotone analysis we can emit the warnings during the analysis, so we can spare the last pass (at the cost of making sure to not emit duplicated warnings). I am not sure, however, which method would be more performant or cleaner in this particular case.
 

---=== 3. Thread Safety analysis ===---

--- Thread Safety's own intermediate language ---

Thread Safety Analysis lowers Clang expressions, implementing its very own IR, a Typed Intermediate Language, or TIL.

The existence of TIL just amplifies what I wrote earlier, the AST/CFG is too verbose for dataflow analysis. A lot of the information available there is just noise and having some abstractions over them could make things significantly easier. Having a more principled way to organize those abstractions would be great. Currently every analysis has its own ad-hoc methods.
 


---=== 5. Lifetime analysis ===---

Asserts pose a serious problem for lifetime analysis in particular. Suppose you have the following assert:


I imagine that this assert problem could be applicable for many other analyses. I'd love to see a CFG option to specifically build a CFG that does not expose this problem to the clients. Unfortunatly, CFG already has a really large number of options, to the point that I have no confidence that every reasonable combinations of the configurations are tested. Getting rid of the unused config options (probably there are quite a few) would be a huge contribution. But unfortunately, we could never know about potential downstream users.
 

Lifetime analysis was conceptualized with strong C++ support from the get-go.

Lifetime analysis also suffers from the design problems of C++ language I mentioned earlier. One of the bigger ones, parameter passing describes the HOW rather thant he intention.
 

The domain of this analysis is entirely different than liveness or uninitialized variables, so what was previously mentioned as GEN/KILL sets don't make sense here.

Right, bit-vector analyses are distributive. Lifetime analysis is not.
 

Cheers,
Gabor

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev