[analyzer] Problem tracking taint applied to regions

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

[analyzer] Problem tracking taint applied to regions

Manuel Klimek via cfe-dev
Hello cfe-dev,

I am currently trying to write a custom checker for the static analyzer.

The idea is to perform taint analysis on C++ code with sources, sinks and filters being provided by configuration. A checker which does something similar exist already,e.g. [1]. However, in our use-case, we want to mark data/values (rather than functions) as sources and sinks. Consider for example a piece of security-relevant configuration data which must not be writable by an potential attacker. For now, I use the following example for testing:

| class Foo {
| public: int x; // Foo::x marked as tainted (e.g. source)
| };
| static int y; // Marked as sink
|
| int main() {
|   Foo f;
|   y = f.x;
|   return y;
| }

Now, clang already provides some infrastructure for taint analysis, e.g. `ProgramState::addTaint()` and `ProgramState::isTainted()`.The internal taint map used in the store tracks symbolic expressions. I decided to use this existing infrastructure for obvious reasons, e.g. benefitting from existing and future taint propagation infrastructure. However, it bit me and now I'm stuck.

The taint is introduced in a `checkLocation()` implementation. The first thing I do is unwrapping a `MemberExpr` (e.g. removing casts) and decide whether to taint it or not based on the referred declaration. This far, it works as intended. However, it appears that the `SVal loc` provided as a parameter to this handler does not refer to a `SymExpr`, neither does `loc.getAsRegion()` return a `SymbolicRegion`. In order to taint the piece of data, I create my own symbolic expression as well as an associated `SVal` and bind it to the member expression `S`:
| // The taint map can only be used for tracking symbolic expressions.
| auto symExpr = loc.getAsSymExpr();
|
| if (!symExpr) {
|   // The current `loc` is not suitable for carrying taint. Construct a new one.
|   if (const auto region = dyn_cast_or_null<TypedValueRegion>(loc.getAsRegion())) {
|     if (symExpr = C.getSymbolManager().getRegionValueSymbol(region)) {
|       loc = C.getSValBuilder().makeLoc(symExpr);
|       state = state->BindExpr(S, C.getLocationContext(), loc);
|     }
|   }
| }
|
| // Add taint if possible
| if (symExpr) {
|   state = state->addTaint(symExpr);
|   C.addTransition(state);
| }

Invoking `dumpTaint()` on ` state` shows that the expression did indeed make it into the taint map, `state->isTainted(S, C.getLocationContext())` does return `true` and `state->dump()` reveals that `f.x` is now bound to a symbolic expression looking just like I'd expect (`f.x : &SymRegion{reg_$1<int f->x>}SVal `).

Btw: I did not use a `check::PreStmt` specialized to a `MemberExpr` because I do not (yet) understand the exact semantics of the ` SVal `s and `Loc`s which I can get from the state (partly because of the utter lack of higher-level documentation on these things). They appear to be different: the code above does fail in a `checkPreStmt()`, one appears to refer to the location where the value is originating while some other appears to refer to something completely different although the method's name suggests its just the same.

Using the information in subsequent post-statement checks is where I'm stuck. First, I noted that the taint is not propagated anywhere. I assumed that simple assignments were, for whatever reason, not yet considered (I also did not find anything in the generic taint checker), so I started implementing a `check::PostStmt<BinaryOperator>` checker. Since `isTainted()` on the statement behaved as expected on the last handler, I assumed the taint detection to be somewhat easier. Sure, I have to strip both the LHS and RHS of casts and paranthesis, but detecting taint should be easy, I thought. Turns out `isTainted()` never returns true.

Now, dumping both the taint and the overall state using `ProgramState::dump()` and `ProgramState::dumpTaint()` reveals that the symbolic expression which was previously added to the taint map is still tainted, but the binding of this expression to `f.x` is gnone (`f.x: Undefined`).

Does anybody on this list have an idea or an explanation why the binding vanishes? Or can naybody point me to move exhaustive resources on the excact semantics of the different kinds of `SVal`s turning up in different contexts? Help would be much appreciated.

Btw: I'm stuck with LLVM/clang 4.0 provided by the package manager at the moment. Self-compiled clang 6.0 refuses to load modules (cannot resolve the `CheckerBase` symbol for some reason) although I ran `cmake` with `-DCLANG_PLUGIN_SUPPORT=ON`, `-DLLVM_BUILD_LLVM_DYLIB=ON` and `-DLLVM_ENABLE_MODULES=ON`.

Greetings and thanks in advance,
Julian

[1] https://github.com/franchiotta/taintchecker

PS: Sorry for the missing line-wrapping. I did not yet manage to make Outlook behave like a decent eMail-Client

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Problem tracking taint applied to regions

Manuel Klimek via cfe-dev
Hello cfe-dev,

> Using the information in subsequent post-statement checks is where I'm stuck. First, I noted that the taint is not propagated anywhere. I assumed that simple assignments were, for whatever reason, not yet considered (I also did not find anything in the generic taint checker), so I started implementing a `check::PostStmt<BinaryOperator>` checker. Since `isTainted()` on the statement behaved as expected on the last handler, I assumed the taint detection to be somewhat easier. Sure, I have to strip both the LHS and RHS of casts and paranthesis, but detecting taint should be easy, I thought. Turns out `isTainted()` never returns true.
>
> Now, dumping both the taint and the overall state using `ProgramState::dump()` and `ProgramState::dumpTaint()` reveals that the symbolic expression which was previously added to the taint map is still tainted, but the binding of this expression to `f.x` is gnone (`f.x: Undefined`).
It looks like the value is collected by the `SymbolReaper`, which I find strange since the statement containing the member expression is clearly still processed.

By now I suspect that I should have used a `SymbolMetadata` and somehow made sure it lives long enough (e.g. using `SymbolReaper::markInUse()`).

Greetings,
Julian
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [analyzer] Problem tracking taint applied to regions

Manuel Klimek via cfe-dev
In reply to this post by Manuel Klimek via cfe-dev
Whoops +cfe-dev.

On 3/20/18 3:01 PM, Artem Dergachev wrote:

> Hello Julian,
>
> You seem to have made your way very far on your own, but it doesn't
> look to me like you're on the right track.
>
> On 3/20/18 2:34 AM, Julian Ganz via cfe-dev wrote:
>> Hello cfe-dev,
>>
>> I am currently trying to write a custom checker for the static analyzer.
>>
>> The idea is to perform taint analysis on C++ code with sources, sinks
>> and filters being provided by configuration. A checker which does
>> something similar exist already,e.g. [1]. However, in our use-case,
>> we want to mark data/values (rather than functions) as sources and
>> sinks.
>> Consider for example a piece of security-relevant configuration data
>> which must not be writable by an potential attacker. For now, I use
>> the following example for testing:
>>
>> | class Foo {
>> | public: int x; // Foo::x marked as tainted (e.g. source)
>> | };
>> | static int y; // Marked as sink
>> |
>> | int main() {
>> |   Foo f;
>> |   y = f.x;
>> |   return y;
>> | }
>>
>> Now, clang already provides some infrastructure for taint analysis,
>> e.g. `ProgramState::addTaint()` and `ProgramState::isTainted()`.The
>> internal taint map used in the store tracks symbolic expressions. I
>> decided to use this existing infrastructure for obvious reasons, e.g.
>> benefitting from existing and future taint propagation
>> infrastructure. However, it bit me and now I'm stuck.
>>
>> The taint is introduced in a `checkLocation()` implementation. The
>> first thing I do is unwrapping a `MemberExpr` (e.g. removing casts)
>> and decide whether to taint it or not based on the referred
>> declaration. This far, it works as intended. However, it appears that
>> the `SVal loc` provided as a parameter to this handler does not refer
>> to a `SymExpr`, neither does `loc.getAsRegion()` return a
>> `SymbolicRegion`. In order to taint the piece of data, I create my
>> own symbolic expression as well as an associated `SVal` and bind it
>> to the member expression `S`:
>> | // The taint map can only be used for tracking symbolic expressions.
>> | auto symExpr = loc.getAsSymExpr();
>> |
>> | if (!symExpr) {
>> |   // The current `loc` is not suitable for carrying taint.
>> Construct a new one.
>> |   if (const auto region =
>> dyn_cast_or_null<TypedValueRegion>(loc.getAsRegion())) {
>> |     if (symExpr = C.getSymbolManager().getRegionValueSymbol(region)) {
>
> You should not create your own symbols in the checker. At least, not
> until you 100% know what you are doing.
>
> It's like the analyzer says "i know that there are two sheep" and you
> later say "let $x denote the number of sheep" - you should just take
> the value that's already there instead. Just worse because you use a
> specific notation (SymbolRegionValue) rather than a nameless "x",
> which matters (unlike school algebra).
>
> There is already a correct SVal (which is not necessarily a symbol) in
> the program state that denotes the value of the region; you should
> retrieve this value by using the State->getSVal(Region, QualType) API.
> If there isn't such value yet, it will be denoted with a symbol
> automatically and presented to you.
>
>> |       loc = C.getSValBuilder().makeLoc(symExpr);
>
> The symbol you've created is an integer. It's not a loc-type symbol,
> i.e. not a pointer or a reference or an lvalue thing. This operation
> should ideally fail.
>
>> |       state = state->BindExpr(S, C.getLocationContext(), loc);
>
> You are changing the value of the syntactic expression (in the
> "Environment"), but not changing the value in the symbolic memory
> model (the "Store"). Well, ideally, the checker should do neither.
>
> Replacing values of expressions in the Environment is a catastrophic
> thing to do - they're essentially immutable by design (though
> unfortunately we have a few places in the analyzer where we still do
> that). Environment is "spacial", not "temporal": it captures a single
> moment in time and explores values of expressions in that moment of
> time, and these values are not allowed to change because any change
> assumes some sort of flow of time.
>
> Replacing the value stored in a region in the Store is an equivalent
> of saying "the program has modified contents of memory at this point",
> which is also not what you're looking for, and it's risky to do unless
> you take full responsibility of modeling the statement (for now
> checkers can only take such responsibility for call expressions and
> for branch conditions); otherwise checkers that subscribed on the
> callback before you would see the old value, and checkers that go
> later would see the new value.
>
>> |     }
>> |   }
>> | }
>> |
>> | // Add taint if possible
>> | if (symExpr) {
>> |   state = state->addTaint(symExpr);
>> |   C.addTransition(state);
>> | }
>>
>> Invoking `dumpTaint()` on ` state` shows that the expression did
>> indeed make it into the taint map, `state->isTainted(S,
>> C.getLocationContext())` does return `true` and `state->dump()`
>> reveals that `f.x` is now bound to a symbolic expression looking just
>> like I'd expect (`f.x : &SymRegion{reg_$1<int f->x>}SVal `).
>
> "SymRegion{$x}" means "an unidentified segment of memory that starts
> at address $x". That's definitely not what you want because
> "reg_$1<int f->x>" is an unknown value of type "int" that can't denote
> any reasonable memory address.
>
> I added an assertion recently that would have informed you of this
> problem if you used a more recent clang.
>
> Note that symbols, much like expressions, always have a type which is
> a valid QualType from the AST of the program.
>
>> Btw: I did not use a `check::PreStmt` specialized to a `MemberExpr`
>> because I do not (yet) understand the exact semantics of the ` SVal
>> `s and `Loc`s which I can get from the state (partly because of the
>> utter lack of higher-level documentation on these things). They
>> appear to be different: the code above does fail in a
>> `checkPreStmt()`, one appears to refer to the location where the
>> value is originating while some other appears to refer to something
>> completely different although the method's name suggests its just the
>> same.
>>
>> Using the information in subsequent post-statement checks is where
>> I'm stuck. First, I noted that the taint is not propagated anywhere.
>> I assumed that simple assignments were, for whatever reason, not yet
>> considered (I also did not find anything in the generic taint
>> checker), so I started implementing a
>> `check::PostStmt<BinaryOperator>` checker. Since `isTainted()` on the
>> statement behaved as expected on the last handler, I assumed the
>> taint detection to be somewhat easier. Sure, I have to strip both the
>> LHS and RHS of casts and paranthesis, but detecting taint should be
>> easy, I thought. Turns out `isTainted()` never returns true.
>>
>> Now, dumping both the taint and the overall state using
>> `ProgramState::dump()` and `ProgramState::dumpTaint()` reveals that
>> the symbolic expression which was previously added to the taint map
>> is still tainted, but the binding of this expression to `f.x` is
>> gnone (`f.x: Undefined`).
>
> "Undefined" is not "gone"; it's an indication that something is indeed
> undefined, as in, the analyzer has identified some undefined behavior
> that definitely occurs in the program under analysis on the current
> execution path which leads to something (in our case, a value of an
> expression) being potentially equal to any arbitrary value.
>
> The lack of specific value would have been represented as "Unknown",
> not "Undefined". Or, in case of Environment, the binding would have
> disappear completely.
>
> In your example f.x is indeed undefined because the class has no
> constructor and therefore the local variable will have uninitialized
> contents. Undefined values are used by the analyzer for finding
> related bugs - for instance you would have seen the bug in your
> example if you ran the vanilla (i.e. unmodified) analyzer on it:
>
> $ clang --analyze test.cpp
> test.cpp:8:5: warning: Assigned value is garbage or undefined
>   y = f.x;
>     ^ ~~~
> 1 warning generated.
>
> So the UndefinedVal here is the "correct value" i've been talking
> about in the beginning. You'd have obtained it by calling getSVal().
> The analyzer can prove to you that the field has undefined contents,
> just by looking at your code. Taint analysis is not necessary in this
> case: you already know that the value is completely undefined. Which
> is why it's not even a symbol: it is pointless to denote an undefined
> value with a symbol because any code that deals with such value would
> be buggy and we need to terminate the analysis and emit a warning as
> soon as the value is used at all.
>
> So you'd need a better test case to work with.
>
> Generally the whole idea of "certain fields are the source of
> taint"doesn't sound good to me . In C++ fields are not properties; if
> you put a value into them and don't touch the object otherwise, you're
> guaranteed to read back the same value. It means that fields are not a
> good source of taint: they merely store whatever you put into them.
> You should not report taint if someone writes 0 to the field and then
> reads it.
>
> I'd strongly suggest considering identifying sources of the values
> stored in these fields that cause these fields to contain tainted
> values - instead of blaming fields themselves. For example:
>
> - A function returned an object of type Foo. Upon return, immediately
> mark the symbolic value of its field x as symbolic. Don't wait until
> the read because it may be preceded with a write.
>
> - A method of Foo sets the object's field x to a tainted value. Upon
> method call, immediately mark the symbolic value this object's field x
> as tainted.
>
> - A function is known to touch a specific global object of type Foo
> and set its field x to a tainted value. Upon call of that function,
> immediately mark the symbolic value of this object's field x as tainted.
>
> This is the intended/supported workflow, and i strongly suspect that
> the field-based approach wouldn't work well (i.e. will have
> unavoidable false positives).
>
>
>> Does anybody on this list have an idea or an explanation why the
>> binding vanishes? Or can naybody point me to move exhaustive
>> resources on the excact semantics of the different kinds of `SVal`s
>> turning up in different contexts? Help would be much appreciated.
>>
>> Btw: I'm stuck with LLVM/clang 4.0 provided by the package manager at
>> the moment. Self-compiled clang 6.0 refuses to load modules (cannot
>> resolve the `CheckerBase` symbol for some reason) although I ran
>> `cmake` with `-DCLANG_PLUGIN_SUPPORT=ON`,
>> `-DLLVM_BUILD_LLVM_DYLIB=ON` and `-DLLVM_ENABLE_MODULES=ON`.
>>
>> Greetings and thanks in advance,
>> Julian
>>
>> [1] https://github.com/franchiotta/taintchecker
>>
>> PS: Sorry for the missing line-wrapping. I did not yet manage to make
>> Outlook behave like a decent eMail-Client
>>
>> _______________________________________________
>> cfe-dev mailing list
>> [hidden email]
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev