[analyzer] Constrain the size of unknown memory regions

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

[analyzer] Constrain the size of unknown memory regions

suyash singh via cfe-dev
Hi, checker devs

TLDR:
How to constrain the size of unknown memory regions, eg. pointed by 'raw' char pointers?

longer version:
Working on taint analysis I'm facing with the following problem:

void strncpy_bounded_tainted_buffer(char *src, char *dst) {
  // assert(strlen(src) >= 10 && "src must have at leas 10 elements");
  int n;
  scanf("%d", &n);
  if (0 < n && n < 10) {
    strncpy(dst, src, n); // Should we warn or not?
  }
}


In this example we analyze a function taking two raw pointers, so we don't know how big those arrays are.
We will check the `strncpy` call, whether it will access (read and write) only valid memory.
We will check the pointers (src and dst) separately by checking if `&src[0]` and `&src[n-1]` would be in bound of the memory region pointed by the pointer. Since the analyzer don't know (both states are non-null), we should check if the `length` parameter is tainted, and if so, we should still issue a warning telling that "String copy function might overflow the given buffer since untrusted data is used to specify the buffer size."
Obviously, if the `length` parameter is not tainted, we will assume (conservatively) that the access would be valid.


How should tell the analyzer that the array which is pointed by the pointer holds at least/most N elements?
For example in the code above, express something similar via an assertion, like saying that `src` points to a c-string, which has at least 10 + 1 element underlying storage.
Although this assertion using `strlen` seems like a solution, unfortunately not applicable for example to the `dst` buffer, which is most likely uninitialized - so not a c-string, in other words calling `strlen` would be undefined behavior.

The only (hacky) option which came in my mind was to abuse the standard regarding pointer arithmetic.
assert(&src[n] - &src[-1]);
The standard is clear about that pointer arithmetic is only applicable for pointers pointing to elements of the same array OR to a hypothetical ONE past/before element of that array.
http://eel.is/c++draft/expr.add#4.2

This assertion would be undefined behavior if the size of the array pointed by `src` would be smaller than `n`.

IMO this looks really ugly.
I think that no 'annotations' should introduce UB even if that assumption expressed via an annotation is turned out to be invalid.


What would be the right approach to guide (to constrain the size of a memory region) the analyzer?
How can the analyzer inference such constraint?

Thanks Balazs.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Constrain the size of unknown memory regions

suyash singh via cfe-dev
How To Ask The Analyzer The Smart Way

I hope the following writing could help checker developers to make sure they understand the problem by asking the Analyzer what could be the problem.

-------------

> How to constrain the size of unknown memory regions, eg. pointed by 'raw' char pointers?

Well, the secret is: you do not. The Analyzer tries to give you the best approximation of your program.

-------------

> How should tell the analyzer that the array which is pointed by the pointer holds at least/most N elements?

The Analyzer denote a pointer with a symbolic value and during the analysis it applies constraints on such symbols on their own will using the core facilities and existing API modeling checkers. If your checker models a function call/nullability, so that you need to explicitly create assumptions you can `assume()` values using the ProgramState, like: `std::tie(StateTrue, StateFalse) = State->assume(Value);`. With that you can force out a state split and if your new assumption of `Value` does not contradict with previous assumptions `StateTrue` holds the assumption, and `StateFalse` holds the assumption negated. If such contradiction happens the return value is a `nullptr`.

-------------

> What would be the right approach to guide (to constrain the size of a memory region) the analyzer?

Use [1] checkers and know them well, they guide the Analyzer. Otherwise create your own API modeling checkers.

If you can't solve a problem, then there is an easier problem you can't solve: find it.
- George Polya. 

First, let me simplify your problem/example and make sure we can work with it:
```
// RUN: %clang_analyze_cc1 \
// RUN:  -analyzer-checker=core,unix,debug.ExprInspection \
// RUN:  -verify %s

typedef __SIZE_TYPE__ size_t;
size_t strlen(const char *);

void clang_analyzer_printState();

void test(const char *src) {
  if (strlen(src) < 10)
    return;

  clang_analyzer_printState();
  (void)src;
}

```

It is a runnable test using `llvm-lit` which dumps out the state of the program:
- The `core` package models the analysis.
- The `unix` package models memory and string manipulation.
- The [2] `debug.ExprInspection` checker / `debug` package defines function calls which could be used to make human-readable what the Analyzer thinks of your program at the moment of the function call.
- `(void)src` makes sure when we print out the state the Analyzer does not throw away the information of `src`. Without that extra call we would print out the state where dead symbols, such as `src` would be purged out from the analysis.

It dumps out the following:
```
"program_state": {
  ...
  "constraints": [
    { "symbol": "reg_$0<const char * src>", "range": "{ [1, 18446744073709551615] }" },
    { "symbol": "meta_$1{SymRegion{reg_$0<const char * src>},unsigned long}", "range": "{ [10, 4611686018427387903] }" }
  ],
  ...
}

```

To continue dig into your example, we need to predefine function signatures, which you can find easily using Microsoft Docs [3]:
```
// RUN: %clang_analyze_cc1 \
// RUN:  -analyzer-checker=core,unix,debug.ExprInspection \
// RUN:  -verify %s

typedef __SIZE_TYPE__ size_t;
size_t strlen(const char *);
char *strncpy(char *dst, const char *src, size_t size);

void clang_analyzer_printState();

void test(char *dst, const char *src, size_t size) {
  if (strlen(src) < 10)
    return;

  strncpy(dst, src, size);
  clang_analyzer_printState();

  (void)dst; (void)src; (void)size;
}

```

With that you could see if the Analyzer cannot model something enough precisely for your needs. With `grep` and GitHub's blame system usually you can find out how such API modeling introduced by an LLVM Phabricator review's link. If you want to learn more about symbolic values please visit section 5 from Artem Dergachev's [4] booklet.

-------------

> How can the analyzer inference such constraint?

Like magic and unicorns
and we name it ConstraintManager.

-------------

assert(&src[n] - &src[-1]);

I do not get what you are trying to do with this expression. Back in the days I have learnt [5] how to ask questions the smart way, which sounds rude for the first time, but this is the best reading on the web: http://www.catb.org/esr/faqs/smart-questions.html

On Thu, Mar 12, 2020 at 1:20 PM Balázs Benics via cfe-dev <[hidden email]> wrote:
Hi, checker devs

TLDR:
How to constrain the size of unknown memory regions, eg. pointed by 'raw' char pointers?

longer version:
Working on taint analysis I'm facing with the following problem:

void strncpy_bounded_tainted_buffer(char *src, char *dst) {
  // assert(strlen(src) >= 10 && "src must have at leas 10 elements");
  int n;
  scanf("%d", &n);
  if (0 < n && n < 10) {
    strncpy(dst, src, n); // Should we warn or not?
  }
}


In this example we analyze a function taking two raw pointers, so we don't know how big those arrays are.
We will check the `strncpy` call, whether it will access (read and write) only valid memory.
We will check the pointers (src and dst) separately by checking if `&src[0]` and `&src[n-1]` would be in bound of the memory region pointed by the pointer. Since the analyzer don't know (both states are non-null), we should check if the `length` parameter is tainted, and if so, we should still issue a warning telling that "String copy function might overflow the given buffer since untrusted data is used to specify the buffer size."
Obviously, if the `length` parameter is not tainted, we will assume (conservatively) that the access would be valid.


How should tell the analyzer that the array which is pointed by the pointer holds at least/most N elements?
For example in the code above, express something similar via an assertion, like saying that `src` points to a c-string, which has at least 10 + 1 element underlying storage.
Although this assertion using `strlen` seems like a solution, unfortunately not applicable for example to the `dst` buffer, which is most likely uninitialized - so not a c-string, in other words calling `strlen` would be undefined behavior.

The only (hacky) option which came in my mind was to abuse the standard regarding pointer arithmetic.
assert(&src[n] - &src[-1]);
The standard is clear about that pointer arithmetic is only applicable for pointers pointing to elements of the same array OR to a hypothetical ONE past/before element of that array.
http://eel.is/c++draft/expr.add#4.2

This assertion would be undefined behavior if the size of the array pointed by `src` would be smaller than `n`.

IMO this looks really ugly.
I think that no 'annotations' should introduce UB even if that assumption expressed via an annotation is turned out to be invalid.


What would be the right approach to guide (to constrain the size of a memory region) the analyzer?
How can the analyzer inference such constraint?

Thanks Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Constrain the size of unknown memory regions

suyash singh via cfe-dev
In reply to this post by suyash singh via cfe-dev
I think you're looking for SymbolExtent. It is *the* symbol that denotes
the [otherwise completely] unknown size of a region. The helper function
for obtaining either a known extent or a SymbolExtent is currently known
as getDynamicSize() (but i'd rather rename it back to "extent" - see
also https://reviews.llvm.org/D69726).

ArrayBoundChecker already has the code that you're looking for.

On 3/12/20 3:20 PM, Balázs Benics via cfe-dev wrote:

> Hi, checker devs
>
> TLDR:
> How to constrain the size of unknown memory regions, eg. pointed by
> 'raw' char pointers?
>
> longer version:
> Working on taint analysis I'm facing with the following problem:
>
>     void strncpy_bounded_tainted_buffer(char *src, char *dst) {
>       // assert(strlen(src) >= 10 && "src must have at leas 10 elements");
>       int n;
>       scanf("%d", &n);
>       if (0 < n && n < 10) {
>         strncpy(dst, src, n); // Should we warn or not?
>       }
>     }
>
>
>
> In this example we analyze a function taking two raw pointers, so we
> don't know how big those arrays are.
> We will check the `strncpy` call, whether it will access /(read and
> write)/ only valid memory.
> We will check the pointers (src and dst) separately by checking if
> /`/&src[0]` and `&src[n-1]` would be in bound of the memory region
> pointed by the pointer. Since the analyzer don't know (both states are
> non-null), we should check if the `length` parameter is tainted, and
> if so, we should still issue a warning telling that "String copy
> function might overflow the given buffer since untrusted data is used
> to specify the buffer size."
> Obviously, if the `length` parameter is not tainted, we will assume
> (conservatively) that the access would be valid.
>
>
> How should tell the analyzer that the array which is pointed by the
> pointer holds at least/most N elements?
> For example in the code above, express something similar via an
> assertion, like saying that `src` points to a c-string, which has at
> least 10 + 1 element underlying storage.
> Although this assertion using `strlen` seems like a solution,
> unfortunately not applicable for example to the `dst` buffer, which is
> most likely uninitialized - so not a c-string, in other words calling
> `strlen` would be undefined behavior.
>
> The only (hacky) option which came in my mind was to abuse the
> standard regarding pointer arithmetic.
>
>     assert(&src[n] - &src[-1]);
>
> The standard is clear about that pointer arithmetic is only applicable
> for pointers pointing to elements of the same array OR to a
> hypothetical ONE past/before element of that array.
> http://eel.is/c++draft/expr.add#4.2
>
> This assertion would be undefined behavior if the size of the array
> pointed by `src` would be smaller than `n`.
>
> IMO this looks really ugly.
> I think that no '/annotations/' should introduce UB even if that
> assumption expressed via an annotation is turned out to be _invalid_.
>
>
> What would be the right approach to guide (to constrain the size of a
> memory region) the analyzer?
> How can the analyzer inference such constraint?
>
> Thanks Balazs.
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Constrain the size of unknown memory regions

suyash singh via cfe-dev
Hi Artem and Csaba,
Thank you for your response.

Sorry for the long description of the problem in my original email, but I wanted to provide as much context as I could.
Provided that we have to analyze code like in that example, the analyzer (correctly) assumes that the pointer points to an unknown memory region.
However, the user knows that the function will be called with a valid buffer, which would be capable of holding at least/most n elements. For now, we can not tell this assumption to the analyzer. There is no way in standard C++ to express such notion precisely. Asserts would not be powerful enough, as I earlier stated.
So we have to tell this kind of assumption to the analyzer in a different way. But in what way?

Some sort of annotation or special analyzer assert? I'm not sure, but none of these look promising, really.

I think we should find a way to properly analyze the following two cases:
  • `int f_user_function(int *points_to_at_least_5_elements);`
    The user knows that this function must be called with a pointer pointing to an array capable of holding at least 5 elements.
    We should be able to tell this assumption to the analyzer, to analyze its body according to this assumption.
  • `int g_user_function(int *buff, int size);`
    There is a connection between the `buff` and `size`, which denotes similar properties that were described in the previous bullet point, but the analyzer would not know.

Regards Balazs.

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. márc. 16., H, 3:08):
I think you're looking for SymbolExtent. It is *the* symbol that denotes
the [otherwise completely] unknown size of a region. The helper function
for obtaining either a known extent or a SymbolExtent is currently known
as getDynamicSize() (but i'd rather rename it back to "extent" - see
also https://reviews.llvm.org/D69726).

ArrayBoundChecker already has the code that you're looking for.

On 3/12/20 3:20 PM, Balázs Benics via cfe-dev wrote:
> Hi, checker devs
>
> TLDR:
> How to constrain the size of unknown memory regions, eg. pointed by
> 'raw' char pointers?
>
> longer version:
> Working on taint analysis I'm facing with the following problem:
>
>     void strncpy_bounded_tainted_buffer(char *src, char *dst) {
>       // assert(strlen(src) >= 10 && "src must have at leas 10 elements");
>       int n;
>       scanf("%d", &n);
>       if (0 < n && n < 10) {
>         strncpy(dst, src, n); // Should we warn or not?
>       }
>     }
>
>
>
> In this example we analyze a function taking two raw pointers, so we
> don't know how big those arrays are.
> We will check the `strncpy` call, whether it will access /(read and
> write)/ only valid memory.
> We will check the pointers (src and dst) separately by checking if
> /`/&src[0]` and `&src[n-1]` would be in bound of the memory region
> pointed by the pointer. Since the analyzer don't know (both states are
> non-null), we should check if the `length` parameter is tainted, and
> if so, we should still issue a warning telling that "String copy
> function might overflow the given buffer since untrusted data is used
> to specify the buffer size."
> Obviously, if the `length` parameter is not tainted, we will assume
> (conservatively) that the access would be valid.
>
>
> How should tell the analyzer that the array which is pointed by the
> pointer holds at least/most N elements?
> For example in the code above, express something similar via an
> assertion, like saying that `src` points to a c-string, which has at
> least 10 + 1 element underlying storage.
> Although this assertion using `strlen` seems like a solution,
> unfortunately not applicable for example to the `dst` buffer, which is
> most likely uninitialized - so not a c-string, in other words calling
> `strlen` would be undefined behavior.
>
> The only (hacky) option which came in my mind was to abuse the
> standard regarding pointer arithmetic.
>
>     assert(&src[n] - &src[-1]);
>
> The standard is clear about that pointer arithmetic is only applicable
> for pointers pointing to elements of the same array OR to a
> hypothetical ONE past/before element of that array.
> http://eel.is/c++draft/expr.add#4.2
>
> This assertion would be undefined behavior if the size of the array
> pointed by `src` would be smaller than `n`.
>
> IMO this looks really ugly.
> I think that no '/annotations/' should introduce UB even if that
> assumption expressed via an annotation is turned out to be _invalid_.
>
>
> What would be the right approach to guide (to constrain the size of a
> memory region) the analyzer?
> How can the analyzer inference such constraint?
>
> Thanks Balazs.
>
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Constrain the size of unknown memory regions

suyash singh via cfe-dev
Wait, so you're not trying to do this in a checker, but trying to
introduce, like, a mechanism for the users to use or something? Ok, got
it. Interesting, yeah.

Well, you could always add a magic function like
__clang_analyzer_getExtent() that would work exactly like
clang_analyzer_getExtent() from the debug.ExprInspection checker but
will be on by default and double-underscored so that to avoid
conflicting with user code (note that defining double-underscored
declarations outside of the standard library is UB). Then give them a
header:

   // analyzer_helper.h
   #ifdef __clang_analyzer__
     size_t __clang_analyzer_getExtent(void *x);
     #define ANALYZER_ASSERT assert
     #define ANALYZER_EXTENT(x) __clang_analyzer_getExtent(x)
   #else
     #define ANALYZER_ASSERT
     #define ANALYZER_EXTENT(x)
   #endif

And then tell them to write code like this:

   ANALYZER_ASSERT(ANALYZER_EXTENT(buff) == size);

I don't know whether it's a good idea. Not only we have too little
positive experience with array bound checking for now, but also there
are other powerful tools to deal with buffer overflows, like ASan/MSan
and, well, C++ containers. It's unclear to me how often would it be
impossible

Another possible approach is to introduce annotations and teach the
analyzer to understand them, eg.:

   int g_user_function(
     __attribute__((buffer_ptr("buff"))) int *buff,
     __attribute__((buffer_size("buff"))) int size);

Or like this:

   struct BoundedBuff {
     __attribute__((buffer_ptr("buff"))) int *buff;
     __attribute__((buffer_size("buff"))) int size);
   };

That's slightly less flexible but also less ugly (when you hide those
behind macros as well).

 > I think that no '/annotations/' should introduce UB even if that
 > assumption expressed via an annotation is turned out to be _invalid_.

Well, that's a fairly common thing to happen when annotations are
introduced for optimizations rather than for hardening. Say, if you
violate __builtin_assume it's a UB. If you violate a `restrict`
qualifier contract it's a UB. Even if you violate a `const` qualifier
contract it's a UB. Interestingly, iirc in-language assertions in C++23
contracts are currently expected to introduce UB when violated. That
said, i definitely agree with you in your case.


On 3/16/20 5:06 PM, Balázs Benics wrote:

> Hi Artem and Csaba,
> Thank you for your response.
>
> Sorry for the long description of the problem in my original email,
> but I wanted to provide as much context as I could.
> Provided that we have to analyze code like in that example, the
> analyzer (correctly) assumes that the pointer points to an unknown
> memory region.
> However, the user knows that the function will be called with a valid
> buffer, which would be capable of holding at least/most n elements.
> For now, we can not tell this assumption to the analyzer. There is no
> way in standard C++ to express such notion precisely. Asserts would
> not be powerful enough, as I earlier stated.
> So we have to tell this kind of assumption to the analyzer in a
> different way. But in what way?
>
> Some sort of annotation or special analyzer assert? I'm not sure, but
> none of these look promising, really.
>
> I think we should find a way to properly analyze the following two cases:
>
>   * `int f_user_function(int *points_to_at_least_5_elements);`
>     The user knows that this function must be called with a pointer
>     pointing to an array capable of holding at least 5 elements.
>     We should be able to tell this assumption to the analyzer, to
>     analyze its body according to this assumption.
>
>   * `int g_user_function(int *buff, int size);`
>     There is a connection between the `buff`and `size`, which denotes
>     similar properties that were described in the previous bullet
>     point, but the analyzer would not know.
>
> Regards Balazs.
>
> Artem Dergachev <[hidden email] <mailto:[hidden email]>> ezt
> írta (időpont: 2020. márc. 16., H, 3:08):
>
>     I think you're looking for SymbolExtent. It is *the* symbol that
>     denotes
>     the [otherwise completely] unknown size of a region. The helper
>     function
>     for obtaining either a known extent or a SymbolExtent is currently
>     known
>     as getDynamicSize() (but i'd rather rename it back to "extent" - see
>     also https://reviews.llvm.org/D69726).
>
>     ArrayBoundChecker already has the code that you're looking for.
>
>     On 3/12/20 3:20 PM, Balázs Benics via cfe-dev wrote:
>     > Hi, checker devs
>     >
>     > TLDR:
>     > How to constrain the size of unknown memory regions, eg. pointed by
>     > 'raw' char pointers?
>     >
>     > longer version:
>     > Working on taint analysis I'm facing with the following problem:
>     >
>     >     void strncpy_bounded_tainted_buffer(char *src, char *dst) {
>     >       // assert(strlen(src) >= 10 && "src must have at leas 10
>     elements");
>     >       int n;
>     >       scanf("%d", &n);
>     >       if (0 < n && n < 10) {
>     >         strncpy(dst, src, n); // Should we warn or not?
>     >       }
>     >     }
>     >
>     >
>     >
>     > In this example we analyze a function taking two raw pointers,
>     so we
>     > don't know how big those arrays are.
>     > We will check the `strncpy` call, whether it will access /(read and
>     > write)/ only valid memory.
>     > We will check the pointers (src and dst) separately by checking if
>     > /`/&src[0]` and `&src[n-1]` would be in bound of the memory region
>     > pointed by the pointer. Since the analyzer don't know (both
>     states are
>     > non-null), we should check if the `length` parameter is tainted,
>     and
>     > if so, we should still issue a warning telling that "String copy
>     > function might overflow the given buffer since untrusted data is
>     used
>     > to specify the buffer size."
>     > Obviously, if the `length` parameter is not tainted, we will assume
>     > (conservatively) that the access would be valid.
>     >
>     >
>     > How should tell the analyzer that the array which is pointed by the
>     > pointer holds at least/most N elements?
>     > For example in the code above, express something similar via an
>     > assertion, like saying that `src` points to a c-string, which
>     has at
>     > least 10 + 1 element underlying storage.
>     > Although this assertion using `strlen` seems like a solution,
>     > unfortunately not applicable for example to the `dst` buffer,
>     which is
>     > most likely uninitialized - so not a c-string, in other words
>     calling
>     > `strlen` would be undefined behavior.
>     >
>     > The only (hacky) option which came in my mind was to abuse the
>     > standard regarding pointer arithmetic.
>     >
>     >     assert(&src[n] - &src[-1]);
>     >
>     > The standard is clear about that pointer arithmetic is only
>     applicable
>     > for pointers pointing to elements of the same array OR to a
>     > hypothetical ONE past/before element of that array.
>     > http://eel.is/c++draft/expr.add#4.2
>     >
>     > This assertion would be undefined behavior if the size of the array
>     > pointed by `src` would be smaller than `n`.
>     >
>     > IMO this looks really ugly.
>     > I think that no '/annotations/' should introduce UB even if that
>     > assumption expressed via an annotation is turned out to be
>     _invalid_.
>     >
>     >
>     > What would be the right approach to guide (to constrain the size
>     of a
>     > memory region) the analyzer?
>     > How can the analyzer inference such constraint?
>     >
>     > Thanks Balazs.
>     >
>     > _______________________________________________
>     > cfe-dev mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Constrain the size of unknown memory regions

suyash singh via cfe-dev
I'm impressed.

ANALYZER_ASSERT(ANALYZER_EXTENT(buff) == size);
 That looks like something I've dreamed about, thank you.

I don't know whether it's a good idea. Not only we have too little
positive experience with array bound checking for now, but also there
are other powerful tools to deal with buffer overflows, like ASan/MSan
and, well, C++ containers. It's unclear to me how often would it be
impossible
I totally agree, and that is why I was looking into eg.: `ArrayBoundCheckerV2` and it's buggy behavior.

This problem raised by implementing taint warnings for the `CStringChecker`, which meant that it would warn for cases like I mentioned in my first email.
Which would result in false positives without any mechanism to silence them, since we can not tell the analyzer about the implicit assumptions which are not present in the code.

How should we ship this `analyzer_assert.h` with the analyzer?

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. márc. 16., H, 15:41):
Wait, so you're not trying to do this in a checker, but trying to
introduce, like, a mechanism for the users to use or something? Ok, got
it. Interesting, yeah.

Well, you could always add a magic function like
__clang_analyzer_getExtent() that would work exactly like
clang_analyzer_getExtent() from the debug.ExprInspection checker but
will be on by default and double-underscored so that to avoid
conflicting with user code (note that defining double-underscored
declarations outside of the standard library is UB). Then give them a
header:

   // analyzer_helper.h
   #ifdef __clang_analyzer__
     size_t __clang_analyzer_getExtent(void *x);
     #define ANALYZER_ASSERT assert
     #define ANALYZER_EXTENT(x) __clang_analyzer_getExtent(x)
   #else
     #define ANALYZER_ASSERT
     #define ANALYZER_EXTENT(x)
   #endif

And then tell them to write code like this:

   ANALYZER_ASSERT(ANALYZER_EXTENT(buff) == size);

I don't know whether it's a good idea. Not only we have too little
positive experience with array bound checking for now, but also there
are other powerful tools to deal with buffer overflows, like ASan/MSan
and, well, C++ containers. It's unclear to me how often would it be
impossible

Another possible approach is to introduce annotations and teach the
analyzer to understand them, eg.:

   int g_user_function(
     __attribute__((buffer_ptr("buff"))) int *buff,
     __attribute__((buffer_size("buff"))) int size);

Or like this:

   struct BoundedBuff {
     __attribute__((buffer_ptr("buff"))) int *buff;
     __attribute__((buffer_size("buff"))) int size);
   };

That's slightly less flexible but also less ugly (when you hide those
behind macros as well).

 > I think that no '/annotations/' should introduce UB even if that
 > assumption expressed via an annotation is turned out to be _invalid_.

Well, that's a fairly common thing to happen when annotations are
introduced for optimizations rather than for hardening. Say, if you
violate __builtin_assume it's a UB. If you violate a `restrict`
qualifier contract it's a UB. Even if you violate a `const` qualifier
contract it's a UB. Interestingly, iirc in-language assertions in C++23
contracts are currently expected to introduce UB when violated. That
said, i definitely agree with you in your case.


On 3/16/20 5:06 PM, Balázs Benics wrote:
> Hi Artem and Csaba,
> Thank you for your response.
>
> Sorry for the long description of the problem in my original email,
> but I wanted to provide as much context as I could.
> Provided that we have to analyze code like in that example, the
> analyzer (correctly) assumes that the pointer points to an unknown
> memory region.
> However, the user knows that the function will be called with a valid
> buffer, which would be capable of holding at least/most n elements.
> For now, we can not tell this assumption to the analyzer. There is no
> way in standard C++ to express such notion precisely. Asserts would
> not be powerful enough, as I earlier stated.
> So we have to tell this kind of assumption to the analyzer in a
> different way. But in what way?
>
> Some sort of annotation or special analyzer assert? I'm not sure, but
> none of these look promising, really.
>
> I think we should find a way to properly analyze the following two cases:
>
>   * `int f_user_function(int *points_to_at_least_5_elements);`
>     The user knows that this function must be called with a pointer
>     pointing to an array capable of holding at least 5 elements.
>     We should be able to tell this assumption to the analyzer, to
>     analyze its body according to this assumption.
>
>   * `int g_user_function(int *buff, int size);`
>     There is a connection between the `buff`and `size`, which denotes
>     similar properties that were described in the previous bullet
>     point, but the analyzer would not know.
>
> Regards Balazs.
>
> Artem Dergachev <[hidden email] <mailto:[hidden email]>> ezt
> írta (időpont: 2020. márc. 16., H, 3:08):
>
>     I think you're looking for SymbolExtent. It is *the* symbol that
>     denotes
>     the [otherwise completely] unknown size of a region. The helper
>     function
>     for obtaining either a known extent or a SymbolExtent is currently
>     known
>     as getDynamicSize() (but i'd rather rename it back to "extent" - see
>     also https://reviews.llvm.org/D69726).
>
>     ArrayBoundChecker already has the code that you're looking for.
>
>     On 3/12/20 3:20 PM, Balázs Benics via cfe-dev wrote:
>     > Hi, checker devs
>     >
>     > TLDR:
>     > How to constrain the size of unknown memory regions, eg. pointed by
>     > 'raw' char pointers?
>     >
>     > longer version:
>     > Working on taint analysis I'm facing with the following problem:
>     >
>     >     void strncpy_bounded_tainted_buffer(char *src, char *dst) {
>     >       // assert(strlen(src) >= 10 && "src must have at leas 10
>     elements");
>     >       int n;
>     >       scanf("%d", &n);
>     >       if (0 < n && n < 10) {
>     >         strncpy(dst, src, n); // Should we warn or not?
>     >       }
>     >     }
>     >
>     >
>     >
>     > In this example we analyze a function taking two raw pointers,
>     so we
>     > don't know how big those arrays are.
>     > We will check the `strncpy` call, whether it will access /(read and
>     > write)/ only valid memory.
>     > We will check the pointers (src and dst) separately by checking if
>     > /`/&src[0]` and `&src[n-1]` would be in bound of the memory region
>     > pointed by the pointer. Since the analyzer don't know (both
>     states are
>     > non-null), we should check if the `length` parameter is tainted,
>     and
>     > if so, we should still issue a warning telling that "String copy
>     > function might overflow the given buffer since untrusted data is
>     used
>     > to specify the buffer size."
>     > Obviously, if the `length` parameter is not tainted, we will assume
>     > (conservatively) that the access would be valid.
>     >
>     >
>     > How should tell the analyzer that the array which is pointed by the
>     > pointer holds at least/most N elements?
>     > For example in the code above, express something similar via an
>     > assertion, like saying that `src` points to a c-string, which
>     has at
>     > least 10 + 1 element underlying storage.
>     > Although this assertion using `strlen` seems like a solution,
>     > unfortunately not applicable for example to the `dst` buffer,
>     which is
>     > most likely uninitialized - so not a c-string, in other words
>     calling
>     > `strlen` would be undefined behavior.
>     >
>     > The only (hacky) option which came in my mind was to abuse the
>     > standard regarding pointer arithmetic.
>     >
>     >     assert(&src[n] - &src[-1]);
>     >
>     > The standard is clear about that pointer arithmetic is only
>     applicable
>     > for pointers pointing to elements of the same array OR to a
>     > hypothetical ONE past/before element of that array.
>     > http://eel.is/c++draft/expr.add#4.2
>     >
>     > This assertion would be undefined behavior if the size of the array
>     > pointed by `src` would be smaller than `n`.
>     >
>     > IMO this looks really ugly.
>     > I think that no '/annotations/' should introduce UB even if that
>     > assumption expressed via an annotation is turned out to be
>     _invalid_.
>     >
>     >
>     > What would be the right approach to guide (to constrain the size
>     of a
>     > memory region) the analyzer?
>     > How can the analyzer inference such constraint?
>     >
>     > Thanks Balazs.
>     >
>     > _______________________________________________
>     > cfe-dev mailing list
>     > [hidden email] <mailto:[hidden email]>
>     > https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Constrain the size of unknown memory regions

suyash singh via cfe-dev


On 3/16/20 6:17 PM, Balázs Benics wrote:

> I'm impressed.
>
>     ANALYZER_ASSERT(ANALYZER_EXTENT(buff) == size);
>
>  That looks like something I've dreamed about, thank you.
>
>     I don't know whether it's a good idea. Not only we have too little
>     positive experience with array bound checking for now, but also there
>     are other powerful tools to deal with buffer overflows, like
>     ASan/MSan
>     and, well, C++ containers. It's unclear to me how often would it be
>     impossible
>

Whoops, i didn't finish the sentence. What i was trying to say is "how
often would it be possible to use the program's variables to add the
assertion anyway, without the help of the current annotation".

> I totally agree, and that is why I was looking into eg.:
> `ArrayBoundCheckerV2` and it's buggy behavior.
>
> This problem raised by implementing taint warnings for the
> `CStringChecker`, which meant that it would warn for cases like I
> mentioned in my first email.
> Which would result in false positives without any mechanism to silence
> them, since we can not tell the analyzer about the implicit
> assumptions which are not present in the code.

If you get false positives, you're doing something wrong. In the
original example there clearly should not be a warning, because there's
no indication anywhere in the code that the array consists of less than
10 elements.

> How should we ship this `analyzer_assert.h` with the analyzer?

We most likely can't. Usually when new annotations are introduced, each
project that uses them comes up with their own macros around them. Eg.,
how LLVM comes up with LLVM_UNREACHABLE and LLVM_NODISCARD.

I guess we could make a separate tiny header-only library for static
analysis helpers. Not sure i'm willing to maintain it but it might make
sense long-term.

>
> Artem Dergachev <[hidden email] <mailto:[hidden email]>> ezt
> írta (időpont: 2020. márc. 16., H, 15:41):
>
>     Wait, so you're not trying to do this in a checker, but trying to
>     introduce, like, a mechanism for the users to use or something?
>     Ok, got
>     it. Interesting, yeah.
>
>     Well, you could always add a magic function like
>     __clang_analyzer_getExtent() that would work exactly like
>     clang_analyzer_getExtent() from the debug.ExprInspection checker but
>     will be on by default and double-underscored so that to avoid
>     conflicting with user code (note that defining double-underscored
>     declarations outside of the standard library is UB). Then give them a
>     header:
>
>        // analyzer_helper.h
>        #ifdef __clang_analyzer__
>          size_t __clang_analyzer_getExtent(void *x);
>          #define ANALYZER_ASSERT assert
>          #define ANALYZER_EXTENT(x) __clang_analyzer_getExtent(x)
>        #else
>          #define ANALYZER_ASSERT
>          #define ANALYZER_EXTENT(x)
>        #endif
>
>     And then tell them to write code like this:
>
>        ANALYZER_ASSERT(ANALYZER_EXTENT(buff) == size);
>
>     I don't know whether it's a good idea. Not only we have too little
>     positive experience with array bound checking for now, but also there
>     are other powerful tools to deal with buffer overflows, like
>     ASan/MSan
>     and, well, C++ containers. It's unclear to me how often would it be
>     impossible
>
>     Another possible approach is to introduce annotations and teach the
>     analyzer to understand them, eg.:
>
>        int g_user_function(
>          __attribute__((buffer_ptr("buff"))) int *buff,
>          __attribute__((buffer_size("buff"))) int size);
>
>     Or like this:
>
>        struct BoundedBuff {
>          __attribute__((buffer_ptr("buff"))) int *buff;
>          __attribute__((buffer_size("buff"))) int size);
>        };
>
>     That's slightly less flexible but also less ugly (when you hide those
>     behind macros as well).
>
>      > I think that no '/annotations/' should introduce UB even if that
>      > assumption expressed via an annotation is turned out to be
>     _invalid_.
>
>     Well, that's a fairly common thing to happen when annotations are
>     introduced for optimizations rather than for hardening. Say, if you
>     violate __builtin_assume it's a UB. If you violate a `restrict`
>     qualifier contract it's a UB. Even if you violate a `const` qualifier
>     contract it's a UB. Interestingly, iirc in-language assertions in
>     C++23
>     contracts are currently expected to introduce UB when violated. That
>     said, i definitely agree with you in your case.
>
>
>     On 3/16/20 5:06 PM, Balázs Benics wrote:
>     > Hi Artem and Csaba,
>     > Thank you for your response.
>     >
>     > Sorry for the long description of the problem in my original email,
>     > but I wanted to provide as much context as I could.
>     > Provided that we have to analyze code like in that example, the
>     > analyzer (correctly) assumes that the pointer points to an unknown
>     > memory region.
>     > However, the user knows that the function will be called with a
>     valid
>     > buffer, which would be capable of holding at least/most n elements.
>     > For now, we can not tell this assumption to the analyzer. There
>     is no
>     > way in standard C++ to express such notion precisely. Asserts would
>     > not be powerful enough, as I earlier stated.
>     > So we have to tell this kind of assumption to the analyzer in a
>     > different way. But in what way?
>     >
>     > Some sort of annotation or special analyzer assert? I'm not
>     sure, but
>     > none of these look promising, really.
>     >
>     > I think we should find a way to properly analyze the following
>     two cases:
>     >
>     >   * `int f_user_function(int *points_to_at_least_5_elements);`
>     >     The user knows that this function must be called with a pointer
>     >     pointing to an array capable of holding at least 5 elements.
>     >     We should be able to tell this assumption to the analyzer, to
>     >     analyze its body according to this assumption.
>     >
>     >   * `int g_user_function(int *buff, int size);`
>     >     There is a connection between the `buff`and `size`, which
>     denotes
>     >     similar properties that were described in the previous bullet
>     >     point, but the analyzer would not know.
>     >
>     > Regards Balazs.
>     >
>     > Artem Dergachev <[hidden email]
>     <mailto:[hidden email]> <mailto:[hidden email]
>     <mailto:[hidden email]>>> ezt
>     > írta (időpont: 2020. márc. 16., H, 3:08):
>     >
>     >     I think you're looking for SymbolExtent. It is *the* symbol that
>     >     denotes
>     >     the [otherwise completely] unknown size of a region. The helper
>     >     function
>     >     for obtaining either a known extent or a SymbolExtent is
>     currently
>     >     known
>     >     as getDynamicSize() (but i'd rather rename it back to
>     "extent" - see
>     >     also https://reviews.llvm.org/D69726).
>     >
>     >     ArrayBoundChecker already has the code that you're looking for.
>     >
>     >     On 3/12/20 3:20 PM, Balázs Benics via cfe-dev wrote:
>     >     > Hi, checker devs
>     >     >
>     >     > TLDR:
>     >     > How to constrain the size of unknown memory regions, eg.
>     pointed by
>     >     > 'raw' char pointers?
>     >     >
>     >     > longer version:
>     >     > Working on taint analysis I'm facing with the following
>     problem:
>     >     >
>     >     >     void strncpy_bounded_tainted_buffer(char *src, char
>     *dst) {
>     >     >       // assert(strlen(src) >= 10 && "src must have at leas 10
>     >     elements");
>     >     >       int n;
>     >     >       scanf("%d", &n);
>     >     >       if (0 < n && n < 10) {
>     >     >         strncpy(dst, src, n); // Should we warn or not?
>     >     >       }
>     >     >     }
>     >     >
>     >     >
>     >     >
>     >     > In this example we analyze a function taking two raw pointers,
>     >     so we
>     >     > don't know how big those arrays are.
>     >     > We will check the `strncpy` call, whether it will access
>     /(read and
>     >     > write)/ only valid memory.
>     >     > We will check the pointers (src and dst) separately by
>     checking if
>     >     > /`/&src[0]` and `&src[n-1]` would be in bound of the
>     memory region
>     >     > pointed by the pointer. Since the analyzer don't know (both
>     >     states are
>     >     > non-null), we should check if the `length` parameter is
>     tainted,
>     >     and
>     >     > if so, we should still issue a warning telling that
>     "String copy
>     >     > function might overflow the given buffer since untrusted
>     data is
>     >     used
>     >     > to specify the buffer size."
>     >     > Obviously, if the `length` parameter is not tainted, we
>     will assume
>     >     > (conservatively) that the access would be valid.
>     >     >
>     >     >
>     >     > How should tell the analyzer that the array which is
>     pointed by the
>     >     > pointer holds at least/most N elements?
>     >     > For example in the code above, express something similar
>     via an
>     >     > assertion, like saying that `src` points to a c-string, which
>     >     has at
>     >     > least 10 + 1 element underlying storage.
>     >     > Although this assertion using `strlen` seems like a solution,
>     >     > unfortunately not applicable for example to the `dst` buffer,
>     >     which is
>     >     > most likely uninitialized - so not a c-string, in other words
>     >     calling
>     >     > `strlen` would be undefined behavior.
>     >     >
>     >     > The only (hacky) option which came in my mind was to abuse the
>     >     > standard regarding pointer arithmetic.
>     >     >
>     >     >     assert(&src[n] - &src[-1]);
>     >     >
>     >     > The standard is clear about that pointer arithmetic is only
>     >     applicable
>     >     > for pointers pointing to elements of the same array OR to a
>     >     > hypothetical ONE past/before element of that array.
>     >     > http://eel.is/c++draft/expr.add#4.2
>     >     >
>     >     > This assertion would be undefined behavior if the size of
>     the array
>     >     > pointed by `src` would be smaller than `n`.
>     >     >
>     >     > IMO this looks really ugly.
>     >     > I think that no '/annotations/' should introduce UB even
>     if that
>     >     > assumption expressed via an annotation is turned out to be
>     >     _invalid_.
>     >     >
>     >     >
>     >     > What would be the right approach to guide (to constrain
>     the size
>     >     of a
>     >     > memory region) the analyzer?
>     >     > How can the analyzer inference such constraint?
>     >     >
>     >     > Thanks Balazs.
>     >     >
>     >     > _______________________________________________
>     >     > cfe-dev mailing list
>     >     > [hidden email] <mailto:[hidden email]>
>     <mailto:[hidden email] <mailto:[hidden email]>>
>     >     > https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>     >
>

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev