[analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

[analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
if (isa<SymbolMetadata>(sym))
MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
if (isa<SymbolMetadata>(sym))
MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

run-output.txt (5K) Download Attachment
test.c (454 bytes) Download Attachment
add-debug-prints.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
if (isa<SymbolMetadata>(sym))
MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
Gábor
> How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?
I'm not sure, since I'm not really experienced in this liveness stuff. I will try it.

Artem
> It sounds to me as if putting metadata symbols into the live set directly would have worked just fine.
Ehm, I' not sure about this.
If you look my example closely, you can note that the CStringChecker maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned long} symbol.
So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in the Live set.
However later, when you query the SymbolReaper if the mentioned meta_$2 symbol is dead or not, it will answer you that it is dead.

I'm quoting the related trace log:
[...]
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

Why does the ExprEngine conjure a return symbol, if an evalCall already evaluated the call?

Artem
> See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`
I managed to implement a metadata collector visitor using the new SymExprVisitor infrastructure, in just a couple of lines of code. I was amazed by that :)

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 6., Cs, 19:04):
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev


On 8/6/20 1:38 PM, Balázs Benics wrote:
Gábor
> How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?
I'm not sure, since I'm not really experienced in this liveness stuff. I will try it.

Artem
> It sounds to me as if putting metadata symbols into the live set directly would have worked just fine.
Ehm, I' not sure about this.
If you look my example closely, you can note that the CStringChecker maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned long} symbol.
So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in the Live set.
However later, when you query the SymbolReaper if the mentioned meta_$2 symbol is dead or not, it will answer you that it is dead.

That's not how that works.

markInUse() doesn't put anything into the live set, it puts things into an auxiliary "metadata-in-use" set: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396

On the other hand, isDead()/isLive() has to return true if the symbol is present in the live set. In fact, that's the first thing it checks for: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437


I'm quoting the related trace log:
[...]
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

Why does the ExprEngine conjure a return symbol, if an evalCall already evaluated the call?

Artem
> See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`
I managed to implement a metadata collector visitor using the new SymExprVisitor infrastructure, in just a couple of lines of code. I was amazed by that :)

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 6., Cs, 19:04):
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
It turns out that if the region of `src` is alive, then the required metadata symbol will be kept alive as well.
Here is the modified example:
void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
  (void)*src; // Now we keep the `src` alive, thus any metadata symbols to that region will be alive as well at the eval call.
}
```

It seems slightly confusing to me that depending on that `src` is used later or not, the `clang_analyzer_eval(strlen(dst) >= 4)` will show either `TRUE` or `UNKNOWN`.
I think it should always give `TRUE` as an answer.

Note that a metadata symbol is alive only if marked in use AND its region is also alive.
Without the `(void)*src;` the region of `src` is dead, thus the symbol ($meta{src} + 4) representing the cstring length of the region `dst` will be dead too.

Since this problem was caused by the handling of metadata symbols, shouldn't we use conjured ones instead?
In that way, we would decouple the liveness of the cstring length of a region and the region itself.

What is the use-case for using a metadata symbol instead of a conjured one?

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 7., P, 0:05):


On 8/6/20 1:38 PM, Balázs Benics wrote:
Gábor
> How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?
I'm not sure, since I'm not really experienced in this liveness stuff. I will try it.

Artem
> It sounds to me as if putting metadata symbols into the live set directly would have worked just fine.
Ehm, I' not sure about this.
If you look my example closely, you can note that the CStringChecker maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned long} symbol.
So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in the Live set.
However later, when you query the SymbolReaper if the mentioned meta_$2 symbol is dead or not, it will answer you that it is dead.

That's not how that works.

markInUse() doesn't put anything into the live set, it puts things into an auxiliary "metadata-in-use" set: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396

On the other hand, isDead()/isLive() has to return true if the symbol is present in the live set. In fact, that's the first thing it checks for: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437


I'm quoting the related trace log:
[...]
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

Why does the ExprEngine conjure a return symbol, if an evalCall already evaluated the call?

Artem
> See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`
I managed to implement a metadata collector visitor using the new SymExprVisitor infrastructure, in just a couple of lines of code. I was amazed by that :)

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 6., Cs, 19:04):
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
Hi!

Since the metadata symbol describes a property of the region I think in this example it might make sense to duplicate the metadata symbol for dst. But I can also imagine code using the length of a dead string which would be nice to support. 

I also wonder what is the advantage of not using a conjured symbol. 

Regards,
Gabor


On Wed, Aug 19, 2020, 3:08 PM Balázs Benics <[hidden email]> wrote:
It turns out that if the region of `src` is alive, then the required metadata symbol will be kept alive as well.
Here is the modified example:
void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
  (void)*src; // Now we keep the `src` alive, thus any metadata symbols to that region will be alive as well at the eval call.
}
```

It seems slightly confusing to me that depending on that `src` is used later or not, the `clang_analyzer_eval(strlen(dst) >= 4)` will show either `TRUE` or `UNKNOWN`.
I think it should always give `TRUE` as an answer.

Note that a metadata symbol is alive only if marked in use AND its region is also alive.
Without the `(void)*src;` the region of `src` is dead, thus the symbol ($meta{src} + 4) representing the cstring length of the region `dst` will be dead too.

Since this problem was caused by the handling of metadata symbols, shouldn't we use conjured ones instead?
In that way, we would decouple the liveness of the cstring length of a region and the region itself.

What is the use-case for using a metadata symbol instead of a conjured one?

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 7., P, 0:05):


On 8/6/20 1:38 PM, Balázs Benics wrote:
Gábor
> How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?
I'm not sure, since I'm not really experienced in this liveness stuff. I will try it.

Artem
> It sounds to me as if putting metadata symbols into the live set directly would have worked just fine.
Ehm, I' not sure about this.
If you look my example closely, you can note that the CStringChecker maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned long} symbol.
So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in the Live set.
However later, when you query the SymbolReaper if the mentioned meta_$2 symbol is dead or not, it will answer you that it is dead.

That's not how that works.

markInUse() doesn't put anything into the live set, it puts things into an auxiliary "metadata-in-use" set: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396

On the other hand, isDead()/isLive() has to return true if the symbol is present in the live set. In fact, that's the first thing it checks for: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437


I'm quoting the related trace log:
[...]
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

Why does the ExprEngine conjure a return symbol, if an evalCall already evaluated the call?

Artem
> See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`
I managed to implement a metadata collector visitor using the new SymExprVisitor infrastructure, in just a couple of lines of code. I was amazed by that :)

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 6., Cs, 19:04):
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
Gabor, we've talked about this two days ago >_> +Nithin because we're facing the same decision with the smart pointer checker to which raw pointer value would make a lot of sense as SymbolMetadata.

The way SymbolMetadata used in CStringChecker always bothered me. It *is* used *as if* it's SymbolConjured: it represents an unknown value of an expression of a specific type. As such it doesn't do much more than preserve its identity for debugging purposes. We indeed don't need another SymbolConjured.

I believe we should remove SymbolConjured's identity elements from SymbolMetadata and instead make it work more like SymbolRegionValue works in RegionStore. Namely:

1. Remove statement, location context, block count from SymbolMetadata's identity. Let it solely depend on the region.

2. Get rid of the metadata-in-use set. From now on SymbolMetadata, like SymbolRegionValue, is live whenever its region is live.

3. When strlen(R) is used for the first time on a region R, produce $meta<size_t R> as the answer, but *do not store* it in the string length map. We don't need to change the state because the state didn't in fact change. On subsequent strlen(R) calls we'll simply return the same symbol (*because* the map will remain empty), until the string is changed.

4. If the string is mutated, record its length as usual. But if the string is invalidated (or the new length is completely unknown), *do not remove* the length from the state; instead, set it to SymbolConjured. Only remove it from the map when the region dies.

5. Keep string length symbols alive as long as they're in the map.

This model is obviously correct because it mimics RegionStore which is obviously correct (i mean, it's obviously broken beyond repair, but *this* part seems to be correct). In particular, this model doesn't have any obvious flaws that you point out in this thread. It also makes sure that the identity of the state correctly reflects mutations in the state. I think it's very much superior to the existing model. This also more or less matches my belief of "everything is a Store" that i wanted to push on a few years ago (though that one was more radical; back then i wanted to make a "shadow" region for a string and use the actual SymbolRegionValue of that shadow region instead of the SymbolMetadata-as-suggested-in-this-mail).

Balasz, are you willing to implement something like that?


On 8/19/20 7:59 AM, Gábor Horváth wrote:
Hi!

Since the metadata symbol describes a property of the region I think in this example it might make sense to duplicate the metadata symbol for dst. But I can also imagine code using the length of a dead string which would be nice to support. 

I also wonder what is the advantage of not using a conjured symbol. 

Regards,
Gabor


On Wed, Aug 19, 2020, 3:08 PM Balázs Benics <[hidden email]> wrote:
It turns out that if the region of `src` is alive, then the required metadata symbol will be kept alive as well.
Here is the modified example:
void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
  (void)*src; // Now we keep the `src` alive, thus any metadata symbols to that region will be alive as well at the eval call.
}
```

It seems slightly confusing to me that depending on that `src` is used later or not, the `clang_analyzer_eval(strlen(dst) >= 4)` will show either `TRUE` or `UNKNOWN`.
I think it should always give `TRUE` as an answer.

Note that a metadata symbol is alive only if marked in use AND its region is also alive.
Without the `(void)*src;` the region of `src` is dead, thus the symbol ($meta{src} + 4) representing the cstring length of the region `dst` will be dead too.

Since this problem was caused by the handling of metadata symbols, shouldn't we use conjured ones instead?
In that way, we would decouple the liveness of the cstring length of a region and the region itself.

What is the use-case for using a metadata symbol instead of a conjured one?

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 7., P, 0:05):


On 8/6/20 1:38 PM, Balázs Benics wrote:
Gábor
> How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?
I'm not sure, since I'm not really experienced in this liveness stuff. I will try it.

Artem
> It sounds to me as if putting metadata symbols into the live set directly would have worked just fine.
Ehm, I' not sure about this.
If you look my example closely, you can note that the CStringChecker maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned long} symbol.
So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in the Live set.
However later, when you query the SymbolReaper if the mentioned meta_$2 symbol is dead or not, it will answer you that it is dead.

That's not how that works.

markInUse() doesn't put anything into the live set, it puts things into an auxiliary "metadata-in-use" set: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396

On the other hand, isDead()/isLive() has to return true if the symbol is present in the live set. In fact, that's the first thing it checks for: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437


I'm quoting the related trace log:
[...]
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

Why does the ExprEngine conjure a return symbol, if an evalCall already evaluated the call?

Artem
> See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`
I managed to implement a metadata collector visitor using the new SymExprVisitor infrastructure, in just a couple of lines of code. I was amazed by that :)

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 6., Cs, 19:04):
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev




_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev


On 8/19/20 2:18 PM, Artem Dergachev wrote:
Gabor, we've talked about this two days ago >_> +Nithin because we're facing the same decision with the smart pointer checker to which raw pointer value would make a lot of sense as SymbolMetadata.

The way SymbolMetadata used in CStringChecker always bothered me. It *is* used *as if* it's SymbolConjured: it represents an unknown value of an expression of a specific type. As such it doesn't do much more than preserve its identity for debugging purposes. We indeed don't need another SymbolConjured.

I believe we should remove SymbolConjured's identity elements from SymbolMetadata and instead make it work more like SymbolRegionValue works in RegionStore. Namely:

1. Remove statement, location context, block count from SymbolMetadata's identity. Let it solely depend on the region.

2. Get rid of the metadata-in-use set. From now on SymbolMetadata, like SymbolRegionValue, is live whenever its region is live.

3. When strlen(R) is used for the first time on a region R, produce $meta<size_t R> as the answer, but *do not store* it in the string length map. We don't need to change the state because the state didn't in fact change. On subsequent strlen(R) calls we'll simply return the same symbol (*because* the map will remain empty), until the string is changed.

4. If the string is mutated, record its length as usual. But if the string is invalidated (or the new length is completely unknown), *do not remove* the length from the state; instead, set it to SymbolConjured. Only remove it from the map when the region dies.

5. Keep string length symbols alive as long as they're in the map.

This model is obviously correct because it mimics RegionStore which is obviously correct (i mean, it's obviously broken beyond repair, but *this* part seems to be correct). In particular, this model doesn't have any obvious flaws that you point out in this thread. It also makes sure that the identity of the state correctly reflects mutations in the state. I think it's very much superior to the existing model. This also more or less matches my belief of "everything is a Store" that i wanted to push on a few years ago (though that one was more radical; back then i wanted to make a "shadow" region for a string and use the actual SymbolRegionValue of that shadow region instead of the SymbolMetadata-as-suggested-in-this-mail).

Balasz, are you willing to implement something like that?

*Balázs.
Sry!!



On 8/19/20 7:59 AM, Gábor Horváth wrote:
Hi!

Since the metadata symbol describes a property of the region I think in this example it might make sense to duplicate the metadata symbol for dst. But I can also imagine code using the length of a dead string which would be nice to support. 

I also wonder what is the advantage of not using a conjured symbol. 

Regards,
Gabor


On Wed, Aug 19, 2020, 3:08 PM Balázs Benics <[hidden email]> wrote:
It turns out that if the region of `src` is alive, then the required metadata symbol will be kept alive as well.
Here is the modified example:
void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
  (void)*src; // Now we keep the `src` alive, thus any metadata symbols to that region will be alive as well at the eval call.
}
```

It seems slightly confusing to me that depending on that `src` is used later or not, the `clang_analyzer_eval(strlen(dst) >= 4)` will show either `TRUE` or `UNKNOWN`.
I think it should always give `TRUE` as an answer.

Note that a metadata symbol is alive only if marked in use AND its region is also alive.
Without the `(void)*src;` the region of `src` is dead, thus the symbol ($meta{src} + 4) representing the cstring length of the region `dst` will be dead too.

Since this problem was caused by the handling of metadata symbols, shouldn't we use conjured ones instead?
In that way, we would decouple the liveness of the cstring length of a region and the region itself.

What is the use-case for using a metadata symbol instead of a conjured one?

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 7., P, 0:05):


On 8/6/20 1:38 PM, Balázs Benics wrote:
Gábor
> How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?
I'm not sure, since I'm not really experienced in this liveness stuff. I will try it.

Artem
> It sounds to me as if putting metadata symbols into the live set directly would have worked just fine.
Ehm, I' not sure about this.
If you look my example closely, you can note that the CStringChecker maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned long} symbol.
So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in the Live set.
However later, when you query the SymbolReaper if the mentioned meta_$2 symbol is dead or not, it will answer you that it is dead.

That's not how that works.

markInUse() doesn't put anything into the live set, it puts things into an auxiliary "metadata-in-use" set: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396

On the other hand, isDead()/isLive() has to return true if the symbol is present in the live set. In fact, that's the first thing it checks for: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437


I'm quoting the related trace log:
[...]
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

Why does the ExprEngine conjure a return symbol, if an evalCall already evaluated the call?

Artem
> See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`
I managed to implement a metadata collector visitor using the new SymExprVisitor infrastructure, in just a couple of lines of code. I was amazed by that :)

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 6., Cs, 19:04):
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev





_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
Thanks for the advice.
It sounds like a plan, Artem. 
I will make some experiments :)

We will see where it goes. 

On Thu, Aug 20, 2020, 00:02 Artem Dergachev <[hidden email]> wrote:


On 8/19/20 2:18 PM, Artem Dergachev wrote:
Gabor, we've talked about this two days ago >_> +Nithin because we're facing the same decision with the smart pointer checker to which raw pointer value would make a lot of sense as SymbolMetadata.

The way SymbolMetadata used in CStringChecker always bothered me. It *is* used *as if* it's SymbolConjured: it represents an unknown value of an expression of a specific type. As such it doesn't do much more than preserve its identity for debugging purposes. We indeed don't need another SymbolConjured.

I believe we should remove SymbolConjured's identity elements from SymbolMetadata and instead make it work more like SymbolRegionValue works in RegionStore. Namely:

1. Remove statement, location context, block count from SymbolMetadata's identity. Let it solely depend on the region.

2. Get rid of the metadata-in-use set. From now on SymbolMetadata, like SymbolRegionValue, is live whenever its region is live.

3. When strlen(R) is used for the first time on a region R, produce $meta<size_t R> as the answer, but *do not store* it in the string length map. We don't need to change the state because the state didn't in fact change. On subsequent strlen(R) calls we'll simply return the same symbol (*because* the map will remain empty), until the string is changed.

4. If the string is mutated, record its length as usual. But if the string is invalidated (or the new length is completely unknown), *do not remove* the length from the state; instead, set it to SymbolConjured. Only remove it from the map when the region dies.

5. Keep string length symbols alive as long as they're in the map.

This model is obviously correct because it mimics RegionStore which is obviously correct (i mean, it's obviously broken beyond repair, but *this* part seems to be correct). In particular, this model doesn't have any obvious flaws that you point out in this thread. It also makes sure that the identity of the state correctly reflects mutations in the state. I think it's very much superior to the existing model. This also more or less matches my belief of "everything is a Store" that i wanted to push on a few years ago (though that one was more radical; back then i wanted to make a "shadow" region for a string and use the actual SymbolRegionValue of that shadow region instead of the SymbolMetadata-as-suggested-in-this-mail).

Balasz, are you willing to implement something like that?

*Balázs.
Sry!!



On 8/19/20 7:59 AM, Gábor Horváth wrote:
Hi!

Since the metadata symbol describes a property of the region I think in this example it might make sense to duplicate the metadata symbol for dst. But I can also imagine code using the length of a dead string which would be nice to support. 

I also wonder what is the advantage of not using a conjured symbol. 

Regards,
Gabor


On Wed, Aug 19, 2020, 3:08 PM Balázs Benics <[hidden email]> wrote:
It turns out that if the region of `src` is alive, then the required metadata symbol will be kept alive as well.
Here is the modified example:
void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
  (void)*src; // Now we keep the `src` alive, thus any metadata symbols to that region will be alive as well at the eval call.
}
```

It seems slightly confusing to me that depending on that `src` is used later or not, the `clang_analyzer_eval(strlen(dst) >= 4)` will show either `TRUE` or `UNKNOWN`.
I think it should always give `TRUE` as an answer.

Note that a metadata symbol is alive only if marked in use AND its region is also alive.
Without the `(void)*src;` the region of `src` is dead, thus the symbol ($meta{src} + 4) representing the cstring length of the region `dst` will be dead too.

Since this problem was caused by the handling of metadata symbols, shouldn't we use conjured ones instead?
In that way, we would decouple the liveness of the cstring length of a region and the region itself.

What is the use-case for using a metadata symbol instead of a conjured one?

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 7., P, 0:05):


On 8/6/20 1:38 PM, Balázs Benics wrote:
Gábor
> How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?
I'm not sure, since I'm not really experienced in this liveness stuff. I will try it.

Artem
> It sounds to me as if putting metadata symbols into the live set directly would have worked just fine.
Ehm, I' not sure about this.
If you look my example closely, you can note that the CStringChecker maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned long} symbol.
So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in the Live set.
However later, when you query the SymbolReaper if the mentioned meta_$2 symbol is dead or not, it will answer you that it is dead.

That's not how that works.

markInUse() doesn't put anything into the live set, it puts things into an auxiliary "metadata-in-use" set: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396

On the other hand, isDead()/isLive() has to return true if the symbol is present in the live set. In fact, that's the first thing it checks for: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437


I'm quoting the related trace log:
[...]
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

Why does the ExprEngine conjure a return symbol, if an evalCall already evaluated the call?

Artem
> See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`
I managed to implement a metadata collector visitor using the new SymExprVisitor infrastructure, in just a couple of lines of code. I was amazed by that :)

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 6., Cs, 19:04):
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev





_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
In reply to this post by Hubert Tong via cfe-dev


On Wed, 19 Aug 2020 at 23:18, Artem Dergachev <[hidden email]> wrote:
Gabor, we've talked about this two days ago >_>
 
This email is much easier to follow than a verbal discussion :D Now I have a much better understanding of what you meant.
 

I believe we should remove SymbolConjured's identity elements from SymbolMetadata and instead make it work more like SymbolRegionValue works in RegionStore. Namely:

If SymbolMetadata works like SymbolRegionValue, do we actually need both? I mean I do understand that we use them for different purposes, but we could also have a type alias or something like that if that actually helps understanding how the analyzer works.
 

2. Get rid of the metadata-in-use set. From now on SymbolMetadata, like SymbolRegionValue, is live whenever its region is live.

Will this work for code examples where the actual string is already dead but we still want to compute with its length?
 

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev


On 8/20/20 9:41 AM, Gábor Horváth wrote:

On Wed, 19 Aug 2020 at 23:18, Artem Dergachev <[hidden email]> wrote:
Gabor, we've talked about this two days ago >_>
 
This email is much easier to follow than a verbal discussion :D Now I have a much better understanding of what you meant.
 

I believe we should remove SymbolConjured's identity elements from SymbolMetadata and instead make it work more like SymbolRegionValue works in RegionStore. Namely:

If SymbolMetadata works like SymbolRegionValue, do we actually need both? I mean I do understand that we use them for different purposes, but we could also have a type alias or something like that if that actually helps understanding how the analyzer works.

Yes, i believe they can be technically backed by the same re-used implementation.

Btw i just unforgot about the deep connection between SymbolRegionValue and SymbolDerived. We'll also probably need to introduce derived metadata. Say, if a smart pointer is part of a bigger object and that bigger object is invalidated, it doesn't mean that we immediately conjure a bunch of symbols for all raw values in all smart pointers that we aren't tracking that are sub-regions of that object; instead, we could simply start producing derived metadata instead of normal metadata when asked.


2. Get rid of the metadata-in-use set. From now on SymbolMetadata, like SymbolRegionValue, is live whenever its region is live.

Will this work for code examples where the actual string is already dead but we still want to compute with its length?

When i say "is live whenever its region is live", i mean that the opposite is not necessarily true. If the symbol is stored anywhere in memory then it's definitely live.

Like, generally, all objects can be "internally live" (example: "SymbolRegionValue<R> is internally live if R is live [externally or internally]") or "externally live" (which can be only determined by exploring the current exploded node in its entirety and implemented via populating live and dead sets, example: "if symbol $x is written into region R and R is live [externally or internally] then $x is externally live").

I was talking about internal liveness rules; external liveness rules are untouched.

Such mental distinction is pretty important because only finite amount of symbols can be externally live (as the amount of info in the state is finite, and so is the live set); the amount of internally live symbols is potentially infinite.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Hubert Tong via cfe-dev
In reply to this post by Hubert Tong via cfe-dev
1. Remove statement, location context, block count from SymbolMetadata's identity. Let it solely depend on the region.
2. Get rid of the metadata-in-use set. From now on SymbolMetadata, like SymbolRegionValue, is live whenever its region is live.
3. When strlen(R) is used for the first time on a region R, produce $meta<size_t R> as the answer, but *do not store* it in the string length map. We don't need to change the state because the state didn't in fact change. On subsequent strlen(R) calls we'll simply return the same symbol (*because* the map will remain empty), until the string is changed.
4. If the string is mutated, record its length as usual. But if the string is invalidated (or the new length is completely unknown), *do not remove* the length from the state; instead, set it to SymbolConjured. Only remove it from the map when the region dies.
5. Keep string length symbols alive as long as they're in the map.

Thank you all for the precious time you put into this.
I really appreciate it.

; Balazs


Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 19., Sze, 21:18):
Gabor, we've talked about this two days ago >_> +Nithin because we're facing the same decision with the smart pointer checker to which raw pointer value would make a lot of sense as SymbolMetadata.

The way SymbolMetadata used in CStringChecker always bothered me. It *is* used *as if* it's SymbolConjured: it represents an unknown value of an expression of a specific type. As such it doesn't do much more than preserve its identity for debugging purposes. We indeed don't need another SymbolConjured.

I believe we should remove SymbolConjured's identity elements from SymbolMetadata and instead make it work more like SymbolRegionValue works in RegionStore. Namely:

1. Remove statement, location context, block count from SymbolMetadata's identity. Let it solely depend on the region.

2. Get rid of the metadata-in-use set. From now on SymbolMetadata, like SymbolRegionValue, is live whenever its region is live.

3. When strlen(R) is used for the first time on a region R, produce $meta<size_t R> as the answer, but *do not store* it in the string length map. We don't need to change the state because the state didn't in fact change. On subsequent strlen(R) calls we'll simply return the same symbol (*because* the map will remain empty), until the string is changed.

4. If the string is mutated, record its length as usual. But if the string is invalidated (or the new length is completely unknown), *do not remove* the length from the state; instead, set it to SymbolConjured. Only remove it from the map when the region dies.

5. Keep string length symbols alive as long as they're in the map.

This model is obviously correct because it mimics RegionStore which is obviously correct (i mean, it's obviously broken beyond repair, but *this* part seems to be correct). In particular, this model doesn't have any obvious flaws that you point out in this thread. It also makes sure that the identity of the state correctly reflects mutations in the state. I think it's very much superior to the existing model. This also more or less matches my belief of "everything is a Store" that i wanted to push on a few years ago (though that one was more radical; back then i wanted to make a "shadow" region for a string and use the actual SymbolRegionValue of that shadow region instead of the SymbolMetadata-as-suggested-in-this-mail).

Balasz, are you willing to implement something like that?


On 8/19/20 7:59 AM, Gábor Horváth wrote:
Hi!

Since the metadata symbol describes a property of the region I think in this example it might make sense to duplicate the metadata symbol for dst. But I can also imagine code using the length of a dead string which would be nice to support. 

I also wonder what is the advantage of not using a conjured symbol. 

Regards,
Gabor


On Wed, Aug 19, 2020, 3:08 PM Balázs Benics <[hidden email]> wrote:
It turns out that if the region of `src` is alive, then the required metadata symbol will be kept alive as well.
Here is the modified example:
void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
  (void)*src; // Now we keep the `src` alive, thus any metadata symbols to that region will be alive as well at the eval call.
}
```

It seems slightly confusing to me that depending on that `src` is used later or not, the `clang_analyzer_eval(strlen(dst) >= 4)` will show either `TRUE` or `UNKNOWN`.
I think it should always give `TRUE` as an answer.

Note that a metadata symbol is alive only if marked in use AND its region is also alive.
Without the `(void)*src;` the region of `src` is dead, thus the symbol ($meta{src} + 4) representing the cstring length of the region `dst` will be dead too.

Since this problem was caused by the handling of metadata symbols, shouldn't we use conjured ones instead?
In that way, we would decouple the liveness of the cstring length of a region and the region itself.

What is the use-case for using a metadata symbol instead of a conjured one?

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 7., P, 0:05):


On 8/6/20 1:38 PM, Balázs Benics wrote:
Gábor
> How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?
I'm not sure, since I'm not really experienced in this liveness stuff. I will try it.

Artem
> It sounds to me as if putting metadata symbols into the live set directly would have worked just fine.
Ehm, I' not sure about this.
If you look my example closely, you can note that the CStringChecker maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned long} symbol.
So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in the Live set.
However later, when you query the SymbolReaper if the mentioned meta_$2 symbol is dead or not, it will answer you that it is dead.

That's not how that works.

markInUse() doesn't put anything into the live set, it puts things into an auxiliary "metadata-in-use" set: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396

On the other hand, isDead()/isLive() has to return true if the symbol is present in the live set. In fact, that's the first thing it checks for: https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437


I'm quoting the related trace log:
[...]
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

Why does the ExprEngine conjure a return symbol, if an evalCall already evaluated the call?

Artem
> See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`
I managed to implement a metadata collector visitor using the new SymExprVisitor infrastructure, in just a couple of lines of code. I was amazed by that :)

Artem Dergachev <[hidden email]> ezt írta (időpont: 2020. aug. 6., Cs, 19:04):
I- i- i was about to reply to that!

I don't know why metadata-in-use is a thing at all. It sounds to me as if putting metadata symbols into the live set directly would have worked just fine. If you find any interesting counterexamples please let me know.

Apart from that, indeed, the correct way to implement checkLiveSymbols when you're tracking arbitrary symbols is to iterate over these arbitrary symbols and mark all sub-symbols as live. See how RegionStore does that within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following example works correctly and i expect CStringChecker to work similarly:

```
int conjure();

int foo() {
  int x = conjure();
  clang_analyzer_warnOnDeadSymbol(x);
  return x + 1;
}

void bar() {
  int y = foo(); // At this point `conj_$2` is no longer directly present in the state; only `conj_$2 + 1` is.
  (void)y;
} // But despite that, `conj_$2` only dies here.
```


On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
+Artem

It would be great if the analyzer could reason about code like that. I think Artem is the most competent in these liveness related problems.
Aside from performance, I do not see any downside for keeping the whole symbolic expression alive after markInUse was called on it (hopefully Artem corrects me if I'm wrong).
But mainly due to constraint solver limitations it might not make sense to keep arbitrarily complex expressions alive.
How much work would it be to prototype keeping these expressions alive and measuring performance and memory implications?


On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <[hidden email]> wrote:
If you are really curious, here is some context.

Imagine the following code (test.c):
```
typedef typeof(sizeof(int)) size_t;

void clang_analyzer_eval(int);
char  *strcat(char *restrict s1, const char *restrict s2);
size_t strlen(const char *s);

void strcat_symbolic_src_length(char *src) {
  char dst[8] = "1234";
  strcat(dst, src);
  clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
  (void)*dst;
}

```

One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns UNKOWN.

After doing a little bit of investigation - and debug prints - I came up with the following trace.

---

# In the CStringChecker::evalStrcat creates a metadata symbol representing the cstring length of the region pointed by 'src'.
created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'

# After the evalStrcat evaluated the call, the state contains the necessary mapping that the 'dst' points to a cstring which is 4 + meta$4 long.
# Note that meta$4 represents the cstring length of the region pointed by 'src'.
# So far so good. We know that resulting cstring length precisely.
strcpy common END State.dump: "program_state": {
  [...]
  "checker_messages": [
    { "checker": "unix.cstring.CStringModeling", "messages": [
      "CString lengths:",
      "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U",
      "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
      ""
    ]}
  ]
}

# We mark all symbols as live in the cstring length map.
# At least we think so...
CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use
CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use

# Even we marked the given symbols in use, we still removes them for some reason...
CStringChecker::checkDeadSymbols finds the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so removes the mapping from 'dst'
CStringChecker::checkDeadSymbols finds the 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes the mapping from 'SymRegion{reg_$0<char * src>}'

# Now that state does not contain the cstring length of the region pointed by 'dst'.


---

Further investigation showed that even if I would visit all the sub SymExprs looking for SymbolMetadata would not help.
SymbolReaper::isDead would still show that the '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is dead.

---

How can we preserve such metadata information?

---

You can also reproduce this following these steps:

Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo: s/epomymous/eponymous/ NFC".
Analyze the 'test.c' file using this command:
./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection -analyzer-config eagerly-assume=false test.c

Balázs Benics <[hidden email]> ezt írta (időpont: 2020. aug. 5., Sze, 10:29):
Why does the SymbolReaper::markInUse only work on SymbolMetadata symbols?
void SymbolReaper::markInUse(SymbolRef sym) {
  if (isa<SymbolMetadata>(sym))
    MetadataInUse.insert(sym);
}
I think it is flawed if the Symbol is a SymIntExpr holding an expression tree referring to SymbolMetadata symbols. In such case, those symbols would not be kept alive - causing some confusion on the checker developers' side and potentially losing some information about the analysis.

Should we walk the expression tree instead of the current implementation?
What performance impact should we expect by doing so?

Any ideas?

Balazs.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev




_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev