[RFC] Upstreaming Lifetime Function Annotations

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[RFC] Upstreaming Lifetime Function Annotations

David Zarzycki via cfe-dev

Hi!


1. Introduction


As you might now, there is an ongoing effort to implement Herb Sutter’s lifetime analysis [1] in Clang [2]. Not too long ago, we upstreamed the Pointer and Owner annotations including a set of statement local checks and some hard coded STL function annotations. Those new warnings are on by default in ToT and they proved to be very useful finding bugs in a variety of open source projects including LLVM [3] while maintaining virtually zero false positives (modulo potential bugs). As a next step, we would like to propose function annotations. A prototype is available in a fork [4].


2. Motivation


Consider the following function and its call site:


  const char *find(const string &haystack, const string &needle);

  auto match = find(“Hi cfe-dev!”, someLValue);


We might easily see the dangling problem in the code above but it is much harder for the compiler to diagnose it. The main problem is that the compiler does not know what the relationship is between the arguments and the return value. If the compiler knew that the return value is either null or it will point into the first argument, it would be able to diagnose that the returned pointer dangles immediately. We propose an annotation language to describe such relationships. This language could be useful both for statement-local, on-by-default, no-false-positive warnings and other bug finding tools such as Clang Tidy or the Clang Static Analyzer. Information like this is often hard coded into checks in the compiler, or other tools. Using this annotation language we could have a uniform way to encode this knowledge that is understood by all the tools under the Clang umbrella. This could also improve the user experience and ease the introduction of one of the clang tools to a project that is already using other clang tools.


3. Design


There is no separate design document for the annotation language, but it is part of the flow-sensitive lifetime analysis by Herb [6]. Despite this fact, we believe that these annotations are universally useful for other analyses as well.


To summarize the paper:


The annotations would be after the function declaration where all the symbols are available. We would like to follow the contracts proposal in almost every detail. The plan is, once contracts are implemented, we turn those annotations into pre- and postconditions.


The annotation for the motivational example would look like this:

  const char *find(const string &haystack, const string &needle)
      [[gsl::post(lifetime(find, {haystack, null}))]];


Whenever the annotation is about the return value, we use the function name. We can read this annotation the following way: the value returned by find is either null or its lifetime is tied to the lifetime of haystack. In the current design, the target of the lifetime attribute must be a parameter or the return value whose type is a raw pointer, reference or a (possibly implicitly) gsl::Pointer annotated user defined type.

We also support output arguments and fields. Let’s look at a slightly different variant of a find API.

  struct Match { const char *pos; /* ... */ };
  bool find(const string &hs, const string &n, Match *m)
    [[gsl::post(lifetime(deref(M).pos, {haystack}))]];


We can also have preconditions, e. g. to express to iterators must be obtained from the same container:

  template<class It, class val>
  It find(It begin, It end, T val)

      [[gsl::pre(lifetime(begin, {end}))]];


To potentially diagnose calls like:
  auto result = find(a.begin(), b.end(), val); 


We plan to include a small library to support this. The identifiers used within the gsl::pre/post annotations would be part of the gsl namespace. The examples assume the presence of the corresponding using statements. Users will need to specialize a template so their type can be dereferenced in the annotations. The reason is that, not all types are using operator* for this purpose, for example there might be smart pointers in code bases where operator overloading is forbidden. 


3a. Relation to lifetimebound

Clang already has a similar attribute, lifetimebound [5], which allows to express a subset of the proposed function annotations.

Example:

  const V& findOrDefault(const map<K, V>& m, const K& key,

                         const V& defvalue [[clang::lifetimebound]]);

is equivalent to

  const V& findOrDefault(const map<K, V>& m, const K& key,

                         const V& defvalue)
    [[gsl::post(lifetime(findOrDefault, {defvaluel}))]];


  template<...> class basic_string_view {

    …

    basic_string_view(

      const basic_string& str [[clang::lifetimebound]]) noexcept;

  };

is equivalent to

    template<...> class basic_string_view {

    …

    basic_string_view(const basic_string& str) noexcept

      [[gsl::post(lifetime(*this, str))]];

    };


While the lifetimebound annotation has similar purpose, it has less expressivity. It is not compatible with output arguments, field selection and it cannot express nullability.


4. Bikeshedding


We are open to discuss any details of the syntax. We also need to discuss if we are willing to accept that using these annotations might need a small library support and how to handle this.


5. Upstreaming


We plan to contribute incremental changes based on the prototype we have. The first part would only include the representation of the annotations without any of the parsing and syntax. This could help us replace some of the hard coded ad-hoc information using this new format and gives us (the community) more time to think about bikeshedding. After the representation is upstreamed, we plan to start contributing to the actual parsing logic.


6. Conclusion


Adding an annotation language might be scary, but this is only intended for library authors. The annotations could greatly improve the precision of existing static analysis tools and increase the uniformity of the clang ecosystem. We could also introduce new useful efficient warnings that have no false positives.


What do you think? Is the community open to see this being upstreamed?


Cheers,

Gabor


[1]: https://herbsutter.com/2018/09/20/lifetime-profile-v1-0-posted/

[2]: http://lists.llvm.org/pipermail/cfe-dev/2018-November/060355.html

[3]: https://www.youtube.com/watch?v=d67kfSnhbpA

[4]: https://github.com/mgehre/llvm-project

[5]: https://clang.llvm.org/docs/AttributeReference.html#lifetimebound

[6]: https://github.com/isocpp/CppCoreGuidelines/blob/master/docs/Lifetime.pdf



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Upstreaming Lifetime Function Annotations

David Zarzycki via cfe-dev
Hi Gábor,

I'm very excited about lifetime annotations! I have two comments/requests.

The first one is about the implementation. There are quite a few warnings, ClangTidy checkers and other analyses in Clang that are dataflow-based or at least sort of dataflow-based. They all have to implement the interpretation of Clang's AST semantics, which is unfortunate, because it is very complicated logic that is refined and polished over time as we get to use those warnings and checkers, and yet, we can't reuse this logic for any new checker.

Therefore, my request is to try to structure the implementation in such a way that it is at least plausible to factor out the "dataflow engine" parts of the static analysis in future, and keep the abstract domain and lifetime specifics more or less separate.

On Thu, Dec 5, 2019 at 12:02 AM Gábor Horváth via cfe-dev <[hidden email]> wrote:


  const char *find(const string &haystack, const string &needle)
      [[gsl::post(lifetime(find, {haystack, null}))]];


I have a concern about the bulkiness of the syntax. I understand why it ended up this way (use standard attribute syntax, use the contracts syntax, ensure that names are referenced syntactically after they are declared, and we get the proposed syntax) -- that helps with rationalization, but that does not help me justify it.

  struct Match { const char *pos; /* ... */ };
  bool find(const string &hs, const string &n, Match *m)
    [[gsl::post(lifetime(deref(M).pos, {haystack}))]];


I understand why the lifetime specification has to go at the end of the declaration in the general case -- to handle cases like this, where we want to specify a lifetime for some part of the data structure, but I'm not convinced that users should always use the most general syntax. I feel like it is going to be an adoption barrier and a readability issue.

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <[hidden email]>*/

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Upstreaming Lifetime Function Annotations

David Zarzycki via cfe-dev
Hi Dmitri,

On Thu, Dec 12, 2019 at 5:44 AM Dmitri Gribenko <[hidden email]> wrote:
Hi Gábor,

I'm very excited about lifetime annotations! I have two comments/requests.

The first one is about the implementation. There are quite a few warnings, ClangTidy checkers and other analyses in Clang that are dataflow-based or at least sort of dataflow-based. They all have to implement the interpretation of Clang's AST semantics, which is unfortunate, because it is very complicated logic that is refined and polished over time as we get to use those warnings and checkers, and yet, we can't reuse this logic for any new checker.

Therefore, my request is to try to structure the implementation in such a way that it is at least plausible to factor out the "dataflow engine" parts of the static analysis in future, and keep the abstract domain and lifetime specifics more or less separate.

Thanks, I totally agree. I think it should be relatively easy to separate the CFG traversal/fixed point iteration part. Anything bigger is likely to be more involved, but we definitely will strive for some reusability. 
 

On Thu, Dec 5, 2019 at 12:02 AM Gábor Horváth via cfe-dev <[hidden email]> wrote:


  const char *find(const string &haystack, const string &needle)
      [[gsl::post(lifetime(find, {haystack, null}))]];


I have a concern about the bulkiness of the syntax. I understand why it ended up this way (use standard attribute syntax, use the contracts syntax, ensure that names are referenced syntactically after they are declared, and we get the proposed syntax) -- that helps with rationalization, but that does not help me justify it.

  struct Match { const char *pos; /* ... */ };
  bool find(const string &hs, const string &n, Match *m)
    [[gsl::post(lifetime(deref(M).pos, {haystack}))]];


I understand why the lifetime specification has to go at the end of the declaration in the general case -- to handle cases like this, where we want to specify a lifetime for some part of the data structure, but I'm not convinced that users should always use the most general syntax. I feel like it is going to be an adoption barrier and a readability issue.

I would love to see a simpler syntax. I think implementing this the most general way possible and adding some syntactic sugar later on after having some data about the most common patterns might make sense. Is it problematic to evolve the syntax upstream? I know this would be bad for early adopters but we could make it clear what they are opting into. 

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <[hidden email]>*/

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Upstreaming Lifetime Function Annotations

David Zarzycki via cfe-dev

Gábor wrote:

> I would love to see a simpler syntax. I think implementing this the most general

> way possible and adding some syntactic sugar later on after having some data

> about the most common patterns might make sense.

 

Yes. In particular, the nicest “nice syntax” of all is intended by design to be whitespace for the large majority of functions, via defaults that are equivalent to default annotations… the Lifetime design paper proposes such defaults (in sections 2.5.3 and 2.5.5), and the aim is that very few functions ever want explicit annotation.

 

So far in our experience that has been working very well, including that the large majority of std:: standard library functions just work unchanged without annotation:

  • To see basic examples, please look at https://godbolt.org/z/1C4t8m (which involves std::min) and https://godbolt.org/z/4G-8H- (which diagnoses a StackOverflow question involving unique_ptr). No annotation of the standard library was needed in either example, they use the stock unmodified std:: implementation.*
  • For an (IMO pretty slick) example of zero-annotation of more complex code, see https://godbolt.org/z/eqCRLx – the code it diagnoses is actually quite complex (vector push_back but also ranges with filtering views) and all unannotated, and it gives nice and accurate diagnostics (pretty much “hey, your ranges::view::filter is dangling here on line B, because your vector<int>.push_back() invalidated it here on line A”).
  • See also section 2.6.2 in the design paper which shows that the string_view dangling problem examples given in WG21 paper P0936 are (I think all) diagnosed without any explicit annotation, because the proposed defaults do the right thing.

 

Those are examples, many of them drawn from real-world code, that needed no annotation at all with the proposed default annotations.

 

So I would propose this:

  1. Implement the general form. We know that occasionally we’ll need that, and then we can express anything in #2 and #3 as equivalents/sugars for something expressible in #1.
  2. Implement the proposed default rules (including tweak them as we gain experience in larger codebases) so that whitespace zero-annotation Just Works for (we hope) a very large number of functions.
  3. Then see if we actually need any in-between syntactic sugars at all. If #2 covers a sufficiently high % of functions, we may not even be interested in other sugars. And if we discover there are patterns that #2 doesn’t cover well, then we can add sugars for those patterns.

 

How does that sound?

 

Herb

 

* As a temporary implementation detail, the prototype does currently hardwire knowledge that unique_ptr and vector are owners, but there also proposed default zero-annotation rules for automatically recognizing which types are Owners (see section 2.1) which recognize those types without annotation (e.g., it recognizes containers and smart pointers as implicitly Owners).

 

 

 

From: Gábor Horváth <[hidden email]>
Sent: Thursday, December 12, 2019 8:43 AM
To: Dmitri Gribenko <[hidden email]>
Cc: cfe-dev <[hidden email]>; [hidden email]; Dmitri Gribenko <[hidden email]>; Herb Sutter <[hidden email]>; Kyle Reed <[hidden email]>; Aaron Ballman <[hidden email]>; Artem Dergachev <[hidden email]>; xazax.hun <[hidden email]>; Petr Hosek <[hidden email]>; Haowei Wu <[hidden email]>; [hidden email]; Richard Smith <[hidden email]>
Subject: Re: [cfe-dev] [RFC] Upstreaming Lifetime Function Annotations

 

Hi Dmitri,

 

On Thu, Dec 12, 2019 at 5:44 AM Dmitri Gribenko <[hidden email]> wrote:

Hi Gábor,

 

I'm very excited about lifetime annotations! I have two comments/requests.

 

The first one is about the implementation. There are quite a few warnings, ClangTidy checkers and other analyses in Clang that are dataflow-based or at least sort of dataflow-based. They all have to implement the interpretation of Clang's AST semantics, which is unfortunate, because it is very complicated logic that is refined and polished over time as we get to use those warnings and checkers, and yet, we can't reuse this logic for any new checker.

 

Therefore, my request is to try to structure the implementation in such a way that it is at least plausible to factor out the "dataflow engine" parts of the static analysis in future, and keep the abstract domain and lifetime specifics more or less separate.

 

Thanks, I totally agree. I think it should be relatively easy to separate the CFG traversal/fixed point iteration part. Anything bigger is likely to be more involved, but we definitely will strive for some reusability. 

 

 

On Thu, Dec 5, 2019 at 12:02 AM Gábor Horváth via cfe-dev <[hidden email]> wrote:

 

  const char *find(const string &haystack, const string &needle)
      [[gsl::post(lifetime(find, {haystack, null})
)]];

 

I have a concern about the bulkiness of the syntax. I understand why it ended up this way (use standard attribute syntax, use the contracts syntax, ensure that names are referenced syntactically after they are declared, and we get the proposed syntax) -- that helps with rationalization, but that does not help me justify it.

 

  struct Match { const char *pos; /* ... */ };
  bool find(const string &hs, const string &n, Match *m)
    [[gsl::post(lifetime(deref(M).pos, {haystack}))]];

 

I understand why the lifetime specification has to go at the end of the declaration in the general case -- to handle cases like this, where we want to specify a lifetime for some part of the data structure, but I'm not convinced that users should always use the most general syntax. I feel like it is going to be an adoption barrier and a readability issue.

 

I would love to see a simpler syntax. I think implementing this the most general way possible and adding some syntactic sugar later on after having some data about the most common patterns might make sense. Is it problematic to evolve the syntax upstream? I know this would be bad for early adopters but we could make it clear what they are opting into. 

 

Dmitri

 

--

main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <[hidden email]>*/


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] Upstreaming Lifetime Function Annotations

David Zarzycki via cfe-dev
Hi Herb,

On Thu, Dec 12, 2019 at 7:34 PM Herb Sutter <[hidden email]> wrote:

Gábor wrote:

> I would love to see a simpler syntax. I think implementing this the most general

> way possible and adding some syntactic sugar later on after having some data

> about the most common patterns might make sense.

 

Yes. In particular, the nicest “nice syntax” of all is intended by design to be whitespace for the large majority of functions, via defaults that are equivalent to default annotations… the Lifetime design paper proposes such defaults (in sections 2.5.3 and 2.5.5), and the aim is that very few functions ever want explicit annotation.

 

So far in our experience that has been working very well, including that the large majority of std:: standard library functions just work unchanged without annotation:

  • To see basic examples, please look at https://godbolt.org/z/1C4t8m (which involves std::min) and https://godbolt.org/z/4G-8H- (which diagnoses a StackOverflow question involving unique_ptr). No annotation of the standard library was needed in either example, they use the stock unmodified std:: implementation.*
  • For an (IMO pretty slick) example of zero-annotation of more complex code, see https://godbolt.org/z/eqCRLx – the code it diagnoses is actually quite complex (vector push_back but also ranges with filtering views) and all unannotated, and it gives nice and accurate diagnostics (pretty much “hey, your ranges::view::filter is dangling here on line B, because your vector<int>.push_back() invalidated it here on line A”).
  • See also section 2.6.2 in the design paper which shows that the string_view dangling problem examples given in WG21 paper P0936 are (I think all) diagnosed without any explicit annotation, because the proposed defaults do the right thing.

 

Those are examples, many of them drawn from real-world code, that needed no annotation at all with the proposed default annotations.

 

So I would propose this:

  1. Implement the general form. We know that occasionally we’ll need that, and then we can express anything in #2 and #3 as equivalents/sugars for something expressible in #1.
  2. Implement the proposed default rules (including tweak them as we gain experience in larger codebases) so that whitespace zero-annotation Just Works for (we hope) a very large number of functions.
  3. Then see if we actually need any in-between syntactic sugars at all. If #2 covers a sufficiently high % of functions, we may not even be interested in other sugars. And if we discover there are patterns that #2 doesn’t cover well, then we can add sugars for those patterns.

 

How does that sound?


I have studied your paper in detail quite some time ago, so I'm familiar with the proposed inference rules -- but thanks for reminding me. I'm still skeptical that the explicit annotations would be rare enough that readability of the general syntax is irrelevant. Anyway, the only way to find out is to try to annotate real-world code, and for that we need tooling support first (e.g., this warning) to find where we need annotations. While it is good that the standard library required few annotations, I don't think the standard library is representative of real-world business logic that engineers write today, or of legacy data structures and APIs that companies have accumulated over the decades.

In other words, I agree with your plan, but I think the chances are very low that we won't need to do anything in #3.

* As a temporary implementation detail, the prototype does currently hardwire knowledge that unique_ptr and vector are owners, but there also proposed default zero-annotation rules for automatically recognizing which types are Owners (see section 2.1) which recognize those types without annotation (e.g., it recognizes containers and smart pointers as implicitly Owners).


I actually think that annotating owner types is a good idea. The cost of an inference failure for an owner type is too high, and debugging it would be non-trivial even for experts (what did *exactly* prevent this type from being inferred to be an owner?) Annotating owners is also good as a matter of documentation. I am not worried about annotation cost for owner types, since they are going to be quite rare.

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <[hidden email]>*/

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev