Re: [Bug 7772] analyzer reports leaks when data is added to hash table or linked list

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [Bug 7772] analyzer reports leaks when data is added to hash table or linked list

Michael Blumenkrantz
On Thu, 12 Aug 2010 13:17:07 -0700
Ted Kremenek <[hidden email]> wrote:

>
> On Aug 2, 2010, at 2:50 PM, Michael Blumenkrantz wrote:
>
> > Hi,
> >
> > For some reason I can't reply on bugzilla so I'll just mail you
> > directly.
> >
> > In case 1 from above, the allocated data is set with
> > evas_object_smart_data_set to part of obj's struct since it's an opaque
> > struct.  The data will later be freed manually with
> > evas_object_smart_data_get to retrieve it.
> >
> > In case 2, the data is prepended to a list which is returned and
> > inserted into an rbtree that is passed by reference as function param
> > #2.
> >
> > In case 3, the data is appended to the list that is returned.
> >
> > In cases 2+3, it is up to the caller of the function to free the data
> > at some point, or the caller of that caller, etc.
>
> Hi Mike,
>
> (Bugzilla is still not working for me on that PR)
>
> I think in all of these cases the malloc/free's checker "local" checking
> rules of how one uses malloc/free aren't being followed by your code (since
> it has its own conventions).  This is a general problem with C code, which is
> why the malloc/free checker is still an experimental check.
>
> One solution we have looked at is providing "ownership annotations" that one
> can use to annotate the arguments of functions.  This allows one to say
> "passing this object to foo() means it will later be released" and so on.
> I'm not convinced this is a full solution; not all users will want to use
> such annotations, so a combination of smarter analysis and the use of these
> annotations may be necessary.  The advantage of the annotations (which I
> would like to hash out more with real users like yourself) is that they clear
> document the memory management rules at a "contractual" level with the
> declaration of functions.  This is really powerful, and will enable the
> analyzer not only to emit fewer false positives but also find more real bugs
> when those rules are violated.  That said, it isn't for everyone.  If you
> have suggestions of what you'd like to see, I'd be very interested in
> discussing it in the open on cfe-dev
>
> Note that not running with experimental checks disabled will cause these
> warnings to go away.
>
> Ted
Hi,

I am replying to this on the cfe-dev mailing list as requested to continue our
discussion.

With regard to the annotations that you have mentioned, I do not believe they
will be a full or workable solution to anyone with a reasonable sized api who
wishes to have accurate results from the scanner.  In a case where there are
hundreds of function calls such as the EFL, it would be extremely frustrating,
not to mention annoying, to create and maintain such annotations with any
degree of accuracy  That said, I think there would definitely be a place for it
in smaller applications/libraries, or specialized cases with unusual behavior.

I think the best option here, though more difficult to implement, would be if
there was a functionality which would do something like this:

1) if an object is passed to a function and is not a primitive C type, it is
tracked throughout the function. When allocated data is added to the object,
the scope of the newly allocated data is set to equal to the object until it is
directly referenced again, at which point the scope would split as the pointer
is copied. This process would repeat until either the memory is freed or the
pointer is lost.

2) if memory is allocated in a function and then the allocated object is
returned, the scope of the memory should be passed up the stack to the caller
of the function and then tracked from there. As with #1, this would also repeat
until free or leak.

This type of functionality would eliminate almost all of the false positives
for leaks which I have seen, and would also allow for effective evaluation of
NULL pointer dereferencing.

Another unrelated feature that I would definitely enjoy seeing added to the
analyzer (which I think is on the todo?) is the ability to see "report diffs",
or just the bugs that have been fixed or created since the last report was
run.  I imagine this would be done with something like scan-diff [file1]
[file2] > outfile.

Hopefully though, we can at least find a solution to the bug I have reported :)

--
Mike Blumenkrantz
Zentific: Our boolean values are huge.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [Bug 7772] analyzer reports leaks when data is added to hash table or linked list

bruce.r.stephens
Michael Blumenkrantz <mike-n7nqhmouzSVWk0Htik3J/[hidden email]>
writes:

[...]

> With regard to the annotations that you have mentioned, I do not
> believe they will be a full or workable solution to anyone with a
> reasonable sized api who wishes to have accurate results from the
> scanner.  In a case where there are hundreds of function calls such as
> the EFL, it would be extremely frustrating, not to mention annoying,
> to create and maintain such annotations with any degree of accuracy
> That said, I think there would definitely be a place for it in smaller
> applications/libraries, or specialized cases with unusual behavior.

I'm unconvinced.  I'm not familiar with EFL, and I'm not sure what might
be a commonly understood example.

Hmm, what about FILE?  There's a bunch of functions that can take a
FILE* but only a handful that need annotating: fopen() (and similar
functions like fdopen()), fclose(), freopen().  Isn't EFL similar, in
that almost all functions won't need annotation because they'll just use
things passed in to them (without making references, freeing them,
etc.)?

OpenSSL seems to me to be similar to FILE, with just a few wrinkles,
specifically functions like d2i_X509(): that returns an X509* and takes
a parameter X509**px.  If px or *px is NULL then it returns a newly
allocated X509.  In that case if px isn't NULL then *px is set to the
newly allocated X509 (and d2i_X509() returns *px).  If px and *px are
not NULL then it's assumed that *px points to a valid X509 and that's
reused (so there's no new allocation of an X509).  (There's also an
X509_dup(), though I guess malloc() has things like strdup().)

(PKCS#11 has a different pattern it uses for returning vectors of
unknown length which might be a useful test of annotation
expressiveness.  The functions take something like a char* and a
size_t*.

If the char* is non-NULL and the size_t* points to a value that's large
enough for the returned value then the function returns its value into
the char*, setting the size_t* to the length of what's returned.  If the
size_t* is too small then a buffer-too-small error is returned.

If the size_t* points to a value that's zero then the function simply
returns the size of its result.  So a typical invocation when the size
isn't known is to call the function twice: once to get the size, then
you allocate, then you call the function again to get the value.)

[...]

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [Bug 7772] analyzer reports leaks when data is added to hash table or linked list

bruce.r.stephens
[hidden email] writes:

[...]

> (PKCS#11 has a different pattern it uses for returning vectors of
> unknown length which might be a useful test of annotation
> expressiveness.  The functions take something like a char* and a
> size_t*.

Now I think about it that's a lousy example since the API itself doesn't
do any allocation or deallocation (that's the whole point).  So it's
probably a test of some kind of annotation expressiveness, but not of
the proposed malloc-related annotations (IIUC).

[...]

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [Bug 7772] analyzer reports leaks when data is added to hash table or linked list

Andrew McGregor-3
In reply to this post by Michael Blumenkrantz

On 13/08/2010, at 9:59 AM, Michael Blumenkrantz wrote:

> On Thu, 12 Aug 2010 13:17:07 -0700
> Ted Kremenek <[hidden email]> wrote:
>
>>
>> On Aug 2, 2010, at 2:50 PM, Michael Blumenkrantz wrote:
>>
>>> Hi,
>>>
>>> For some reason I can't reply on bugzilla so I'll just mail you
>>> directly.
>>>
>>> In case 1 from above, the allocated data is set with
>>> evas_object_smart_data_set to part of obj's struct since it's an opaque
>>> struct.  The data will later be freed manually with
>>> evas_object_smart_data_get to retrieve it.
>>>
>>> In case 2, the data is prepended to a list which is returned and
>>> inserted into an rbtree that is passed by reference as function param
>>> #2.
>>>
>>> In case 3, the data is appended to the list that is returned.
>>>
>>> In cases 2+3, it is up to the caller of the function to free the data
>>> at some point, or the caller of that caller, etc.
>>
>> Hi Mike,
>>
>> (Bugzilla is still not working for me on that PR)
>>
>> I think in all of these cases the malloc/free's checker "local" checking
>> rules of how one uses malloc/free aren't being followed by your code (since
>> it has its own conventions).  This is a general problem with C code, which is
>> why the malloc/free checker is still an experimental check.
>>
>> One solution we have looked at is providing "ownership annotations" that one
>> can use to annotate the arguments of functions.  This allows one to say
>> "passing this object to foo() means it will later be released" and so on.
>> I'm not convinced this is a full solution; not all users will want to use
>> such annotations, so a combination of smarter analysis and the use of these
>> annotations may be necessary.  The advantage of the annotations (which I
>> would like to hash out more with real users like yourself) is that they clear
>> document the memory management rules at a "contractual" level with the
>> declaration of functions.  This is really powerful, and will enable the
>> analyzer not only to emit fewer false positives but also find more real bugs
>> when those rules are violated.  That said, it isn't for everyone.  If you
>> have suggestions of what you'd like to see, I'd be very interested in
>> discussing it in the open on cfe-dev
>>
>> Note that not running with experimental checks disabled will cause these
>> warnings to go away.
>>
>> Ted
> Hi,
>
> I am replying to this on the cfe-dev mailing list as requested to continue our
> discussion.
>
> With regard to the annotations that you have mentioned, I do not believe they
> will be a full or workable solution to anyone with a reasonable sized api who
> wishes to have accurate results from the scanner.  In a case where there are
> hundreds of function calls such as the EFL, it would be extremely frustrating,
> not to mention annoying, to create and maintain such annotations with any
> degree of accuracy  That said, I think there would definitely be a place for it
> in smaller applications/libraries, or specialized cases with unusual behavior.

Having actually implemented this for the entire user space of an embedded Linux distribution, I can report that it does seem to be practical.  Annotating only two header files (glist.h and its counterpart in one large proprietary package) accounted for more than 80% of the false positives, and showed up significant number of real bugs that were not reported without the annotations.  Annotating another few headers (more glib container types) reduced the false positives from roughly 9000 to around 100 (compared to a real issue count of around 1200, this isn't bad).  The patch touches 82 lines of code, out of nearly 65 million (I'm not counting the kernel), so I think the cost is well worth it given the benefit. Incidentally, this exercise also shows the very high quality of most open-source code; fully 80% of the issues are in proprietary code, not all of it ours, that forms only 15% of the codebase.

That said, I'm working on the ability to deduce function annotations from the implementation. This has been shown feasible in a research compiler, and although it runs up against decidability problems in theory, real code should very seldom suffer from that issue in practice. It will require a two pass analysis build, however, as annotations for all functions will need to be available before any ownership checking will be reliable. For the time being I'm limiting this work to plain C, as I'm not enough of a C++ language lawyer to be at all confident of correctness.

> I think the best option here, though more difficult to implement, would be if
> there was a functionality which would do something like this:
>
> 1) if an object is passed to a function and is not a primitive C type, it is
> tracked throughout the function. When allocated data is added to the object,
> the scope of the newly allocated data is set to equal to the object until it is
> directly referenced again, at which point the scope would split as the pointer
> is copied. This process would repeat until either the memory is freed or the
> pointer is lost.

This more or less describes what the ownership checker currently does, except that it currently only works for pointers allocated by malloc and annotated functions.

> 2) if memory is allocated in a function and then the allocated object is
> returned, the scope of the memory should be passed up the stack to the caller
> of the function and then tracked from there. As with #1, this would also repeat
> until free or leak.

Again, the ownership checker can do this too; however, at present functions need annotated that they do in fact allocate memory. There may be some possible API variants that can not at present be annotated, this is somewhat work in progress.

> This type of functionality would eliminate almost all of the false positives
> for leaks which I have seen, and would also allow for effective evaluation of
> NULL pointer dereferencing.

It certainly does help the NULL checker out; the two interact a great deal.

> Another unrelated feature that I would definitely enjoy seeing added to the
> analyzer (which I think is on the todo?) is the ability to see "report diffs",
> or just the bugs that have been fixed or created since the last report was
> run.  I imagine this would be done with something like scan-diff [file1]
> [file2] > outfile.

+1.  I may eventually write this; I certainly plan to contribute back a less site-specific version of my build analysis harness in due course, and this is functionality I may be adding to that.

> Hopefully though, we can at least find a solution to the bug I have reported :)
>
> --
> Mike Blumenkrantz
> Zentific: Our boolean values are huge.
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [Bug 7772] analyzer reports leaks when data is added to hash table or linked list

Michael Blumenkrantz
On Fri, 13 Aug 2010 11:33:04 +1200
Andrew McGregor <[hidden email]> wrote:

>
> On 13/08/2010, at 9:59 AM, Michael Blumenkrantz wrote:
>
> > On Thu, 12 Aug 2010 13:17:07 -0700
> > Ted Kremenek <[hidden email]> wrote:
> >
> >>
> >> On Aug 2, 2010, at 2:50 PM, Michael Blumenkrantz wrote:
> >>
> >>> Hi,
> >>>
> >>> For some reason I can't reply on bugzilla so I'll just mail you
> >>> directly.
> >>>
> >>> In case 1 from above, the allocated data is set with
> >>> evas_object_smart_data_set to part of obj's struct since it's an opaque
> >>> struct.  The data will later be freed manually with
> >>> evas_object_smart_data_get to retrieve it.
> >>>
> >>> In case 2, the data is prepended to a list which is returned and
> >>> inserted into an rbtree that is passed by reference as function param
> >>> #2.
> >>>
> >>> In case 3, the data is appended to the list that is returned.
> >>>
> >>> In cases 2+3, it is up to the caller of the function to free the data
> >>> at some point, or the caller of that caller, etc.
> >>
> >> Hi Mike,
> >>
> >> (Bugzilla is still not working for me on that PR)
> >>
> >> I think in all of these cases the malloc/free's checker "local" checking
> >> rules of how one uses malloc/free aren't being followed by your code (since
> >> it has its own conventions).  This is a general problem with C code, which
> >> is why the malloc/free checker is still an experimental check.
> >>
> >> One solution we have looked at is providing "ownership annotations" that
> >> one can use to annotate the arguments of functions.  This allows one to say
> >> "passing this object to foo() means it will later be released" and so on.
> >> I'm not convinced this is a full solution; not all users will want to use
> >> such annotations, so a combination of smarter analysis and the use of these
> >> annotations may be necessary.  The advantage of the annotations (which I
> >> would like to hash out more with real users like yourself) is that they
> >> clear document the memory management rules at a "contractual" level with
> >> the declaration of functions.  This is really powerful, and will enable the
> >> analyzer not only to emit fewer false positives but also find more real
> >> bugs when those rules are violated.  That said, it isn't for everyone.  If
> >> you have suggestions of what you'd like to see, I'd be very interested in
> >> discussing it in the open on cfe-dev
> >>
> >> Note that not running with experimental checks disabled will cause these
> >> warnings to go away.
> >>
> >> Ted
> > Hi,
> >
> > I am replying to this on the cfe-dev mailing list as requested to continue
> > our discussion.
> >
> > With regard to the annotations that you have mentioned, I do not believe
> > they will be a full or workable solution to anyone with a reasonable sized
> > api who wishes to have accurate results from the scanner.  In a case where
> > there are hundreds of function calls such as the EFL, it would be extremely
> > frustrating, not to mention annoying, to create and maintain such
> > annotations with any degree of accuracy  That said, I think there would
> > definitely be a place for it in smaller applications/libraries, or
> > specialized cases with unusual behavior.
>
> Having actually implemented this for the entire user space of an embedded
> Linux distribution, I can report that it does seem to be practical.
> Annotating only two header files (glist.h and its counterpart in one large
> proprietary package) accounted for more than 80% of the false positives, and
> showed up significant number of real bugs that were not reported without the
> annotations.  Annotating another few headers (more glib container types)
> reduced the false positives from roughly 9000 to around 100 (compared to a
> real issue count of around 1200, this isn't bad).  The patch touches 82 lines
> of code, out of nearly 65 million (I'm not counting the kernel), so I think
> the cost is well worth it given the benefit. Incidentally, this exercise also
> shows the very high quality of most open-source code; fully 80% of the issues
> are in proprietary code, not all of it ours, that forms only 15% of the
> codebase.
>
> That said, I'm working on the ability to deduce function annotations from the
> implementation. This has been shown feasible in a research compiler, and
> although it runs up against decidability problems in theory, real code should
> very seldom suffer from that issue in practice. It will require a two pass
> analysis build, however, as annotations for all functions will need to be
> available before any ownership checking will be reliable. For the time being
> I'm limiting this work to plain C, as I'm not enough of a C++ language lawyer
> to be at all confident of correctness.
>
> > I think the best option here, though more difficult to implement, would be
> > if there was a functionality which would do something like this:
> >
> > 1) if an object is passed to a function and is not a primitive C type, it is
> > tracked throughout the function. When allocated data is added to the object,
> > the scope of the newly allocated data is set to equal to the object until
> > it is directly referenced again, at which point the scope would split as
> > the pointer is copied. This process would repeat until either the memory is
> > freed or the pointer is lost.
>
> This more or less describes what the ownership checker currently does, except
> that it currently only works for pointers allocated by malloc and annotated
> functions.
>
> > 2) if memory is allocated in a function and then the allocated object is
> > returned, the scope of the memory should be passed up the stack to the
> > caller of the function and then tracked from there. As with #1, this would
> > also repeat until free or leak.
>
> Again, the ownership checker can do this too; however, at present functions
> need annotated that they do in fact allocate memory. There may be some
> possible API variants that can not at present be annotated, this is somewhat
> work in progress.
>
> > This type of functionality would eliminate almost all of the false positives
> > for leaks which I have seen, and would also allow for effective evaluation
> > of NULL pointer dereferencing.
>
> It certainly does help the NULL checker out; the two interact a great deal.
>
> > Another unrelated feature that I would definitely enjoy seeing added to the
> > analyzer (which I think is on the todo?) is the ability to see "report
> > diffs", or just the bugs that have been fixed or created since the last
> > report was run.  I imagine this would be done with something like scan-diff
> > [file1] [file2] > outfile.
>
> +1.  I may eventually write this; I certainly plan to contribute back a less
> site-specific version of my build analysis harness in due course, and this is
> functionality I may be adding to that.
>
> > Hopefully though, we can at least find a solution to the bug I have
> > reported :)
> >
> > --
> > Mike Blumenkrantz
> > Zentific: Our boolean values are huge.
> > _______________________________________________
> > cfe-dev mailing list
> > [hidden email]
> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
>
Hm, well I'd definitely be interested in seeing some testing in the analyzer
with annotations.  If it really is the case that only a few headers need to be
annotated to see this kind of change, then that would most likely be sufficient
for my needs.

--
Mike Blumenkrantz
Zentific: Our boolean values are huge.
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev