Memory modeling, RegionStore Questions

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Memory modeling, RegionStore Questions

Fangrui Song via cfe-dev

Hello, I’ve been studying the RegionStore implementation of the static analyzer in an attempt to debug some false positives I’ve come across. I found the paper “Memory Model for the Static Analysis of C Programs” and Artem’s workbook (“Clang Static Analyzer – A Checker Developer’s Guide”) , and read through many past articles. These are excellent references to get started, now I'd like to enhance my comprehension by jumping in and fixing a few issues that may need to be fixed. I apologize in advance if I’m missing some obvious bridges to comprehension, but I’m still early in the process of learning this. I’d like to contribute to improving the static analyzer and learn best through debugging and implementing. Which brings me to the discussion questions and asks for a few tips on next steps. Thanks in advance for tips and suggestions for how to contribute.

Question 1:

Constructing a case to understanding how data and pointers to data are mapped, I came across some unexpected behavior. In the example below, I’d expect the UNKNOWN values to evaluate to true. Is this a reasonable expectation for the clang static analyzer?

Digging into this problem a bit more with the debugger, it seems at least a small portion of RegionStore is not fully implemented. I debugged this down to the code snippet shown in the data for Q1. ‘a’ comes back as 0x11223344 into V (line 1723), V->getAsSymbol() comes back as NULL (line 1724), V is known, so UnknownVal() is returned – matching the test case results. There’s also a comment in the code at that point – “// Other cases: give up.  We are indexing into a larger object that has some value, but we don't know how to handle that yet.”.

Seems it’s possible to construct concrete values for these cases, if I’m not missing anything – but I’m not sure how to do this. I see the return type, and the offset into the base type, so it seems as “easy” as bit twiddling to return the expected byte or short values. Is there a review that demonstrates how to accomplish this, perhaps some tips?

Question 2:

Looking through CStringChecker.cpp, I see that memset is implemented only for the case of an initialization value of 0, and the sizeof arg and buffer extents are equal (see example in data below, line 1076, also a FIXME in the comments). This seems reasonable for most use cases, but is also interesting since an implementation that allows for arbitrary initialization values could open up possibilities for other improvements (see Question 3). ProgramState.cpp has a method named bindDefaultInitialValue that seems to be perfect for this application, but asserts in RegionStore’s BindDefaultInitial -> “Double initialization!” (approx line 452). Would repurposing this method (bindDefaultInitial) be a reasonable approach to solve this problem, or is there another way that would be a better approach?

Question 3:

Looking through CStringChecker.cpp, I see that memcpy does not perfectly model the memory copy from source to destination memory. Memcpy looks to be implemented as a blank invalidation of the source and destination buffers. I had first looked at this problem, and was quite puzzled since I had assumed memory copies should be modeled (at least for common cases). The suggested approach in the comment to address this issue is using LCVs. Is there an example of such an approach, or perhaps an outline to start with?

 

Case data for Question 1:

void clang_analyzer_eval(int);

void clang_analyzer_dump(int);

void clang_analyzer_printState();

 

int foo(void) {

  unsigned int a = 0x11223344;

  unsigned char *p = (unsigned char*)&a;

  unsigned short *pu = (unsigned short *)&a;

 

  clang_analyzer_printState();

  clang_analyzer_eval(p[0] == 0x44); // expected-warning{{TRUE}} -- OK

  clang_analyzer_eval(p[1] == 0x33); // expected-warning{{TRUE}} -- UNKNOWN

  clang_analyzer_eval(p[2] == 0x22); // expected-warning{{TRUE}} -- UNKNOWN

  clang_analyzer_eval(p[3] == 0x11); // expected-warning{{TRUE}} -- UNKNOWN

 

  clang_analyzer_eval(pu[0] == 0x3344); // expected-warning{{TRUE}} -- OK

  clang_analyzer_eval(pu[1] == 0x1122); // expected-warning{{TRUE}} -- UNKNOWN

 

  clang_analyzer_dump(&p[0]); -- &Element{a,0 S64b,unsigned char} [as 32 bit integer]

  clang_analyzer_dump(&p[1]); -- &Element{a,1 S64b,unsigned char} [as 32 bit integer]

 

  clang_analyzer_dump(&pu[0]); -- &Element{a,0 S64b,unsigned short} [as 32 bit integer]

  clang_analyzer_dump(&pu[1]); -- &Element{a,0 S64b,unsigned short} [as 32 bit integer]

 

}

 

The program state from the decls in the sample are shown here.

 

"program_state": {

  "store": { "pointer": "0x76787f8", "items": [

    { "cluster": "a", "pointer": "0x7677fc0", "items": [

      { "kind": "Direct", "offset": 0, "value": "287454020 U32b" }

    ]},

    { "cluster": "p", "pointer": "0x7678728", "items": [

      { "kind": "Direct", "offset": 0, "value": "&Element{a,0 S64b,unsigned char}" }

    ]},

    { "cluster": "pu", "pointer": "0x767d9a8", "items": [

      { "kind": "Direct", "offset": 0, "value": "&Element{a,0 S64b,unsigned short}" }

    ]}

  ]},

  "environment": { "pointer": "0x7607d50", "items": [

    { "lctx_id": 1, "location_context": "#0 Call", "calling": "foo", "location": null, "items": [

      { "stmt_id": 805, "pretty": "clang_analyzer_printState", "value": "&code{clang_analyzer_printState}" }

    ]}

  ]},

  "constraints": null,

  "dynamic_types": null,

  "constructing_objects": null,

  "checker_messages": null

}"program_state": {

  "store": null,

  "environment": { "pointer": "0x7607d50", "items": [

    { "lctx_id": 1, "location_context": "#0 Call", "calling": "foo", "location": null, "items": [

      { "stmt_id": 1201, "pretty": "clang_analyzer_printState", "value": "&code{clang_analyzer_printState}" }

    ]}

  ]},

  "constraints": null,

  "dynamic_types": null,

  "constructing_objects": null,

  "checker_messages": null

 

Code snippet from RegionStore.cpp, Function getBindingForElement, line #s approximate.

 

1716   if (const TypedValueRegion *baseR =

1717         dyn_cast_or_null<TypedValueRegion>(O.getRegion())) {

1718     QualType baseT = baseR->getValueType();

1719     if (baseT->isScalarType()) {

1720       QualType elemT = R->getElementType();

1721       if (elemT->isScalarType()) {

1722         if (Ctx.getTypeSizeInChars(baseT) >= Ctx.getTypeSizeInChars(elemT)) {

1723           if (const Optional<SVal> &V = B.getDirectBinding(superR)) {

 

(gdb) p V->dump()

287454020 U32b$2 = void

(gdb) p R->dump()

Element{a,1 S64b,unsigned char}$3 = void

 

1724             if (SymbolRef parentSym = V->getAsSymbol())

1725               return svalBuilder.getDerivedRegionValueSymbolVal(parentSym, R);

1726              

1727             if (V->isUnknownOrUndef())

1728               return *V;

1729             // Other cases: give up.  We are indexing into a larger object

1730             // that has some value, but we don't know how to handle that yet.

1731             return UnknownVal();

1732           }

1733         }

1734       }

 

Question 2 – CStringChecker.cpp, method memsetAux

 

1071     if (StateWholeReg && !StateNotWholeReg && StateNullChar &&

1072         !StateNonNullChar) {

1073       // If the 'memset()' acts on the whole region of destination buffer and

1074       // the value of the second argument of 'memset()' is zero, bind the second

1075       // argument's value to the destination buffer with 'default binding'.

1076       // FIXME: Since there is no perfect way to bind the non-zero character, we

1077       // can only deal with zero value here. In the future, we need to deal with

1078       // the binding of non-zero value in the case of whole region.

1079       State = State->bindDefaultZero(svalBuilder.makeLoc(BR),

1080                                      C.getLocationContext());

 

 

Question 3 – CStringChecker.cpp, method evalCopyCommon

 

1201     // Invalidate the destination (regular invalidation without pointer-escaping

1202     // the address of the top-level region).

1203     // FIXME: Even if we can't perfectly model the copy, we should see if we

1204     // can use LazyCompoundVals to copy the source values into the destination.

1205     // This would probably remove any existing bindings past the end of the

1206     // copied region, but that's still an improvement over blank invalidation.

1207     state =

1208         InvalidateBuffer(C, state, Dest.Expression, C.getSVal(Dest.Expression),

1209                          /*IsSourceBuffer*/ false, Size.Expression);

1210

1211     // Invalidate the source (const-invalidation without const-pointer-escaping

1212     // the address of the top-level region).

1213     state = InvalidateBuffer(C, state, Source.Expression,

1214                              C.getSVal(Source.Expression),

1215                              /*IsSourceBuffer*/ true, nullptr);


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev