Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Don Hinton via cfe-dev
On Wed, Oct 7, 2015 at 11:58 PM, Ismail Pazarbasi
<[hidden email]> wrote:

> Hi,
>
> I've submitted some of the small patches to get TSan running on Mac OS
> X. Time for the real challenge is approaching, and that needs some
> discussion.
>
> The biggest problem with TSan on Darwin is the way thread local
> variables are created and destroyed. TSan uses a thread-local variable
> called "cur_thread_placeholder" to store each thread's state. Thread
> locals are initialized lazily on Darwin.
>
> Darwin dynamic linker uses pthread keys to maintain lifetime and
> storage for thread local variables. pthread keys are destroyed in the
> order of their construction, not the opposite direction; from lowest
> index to highest index. Per POSIX standard, this order is unspecified,
> but dyld code (and comments) explain the order. We want TSan's
> thread-locals to live until the end of execution, because we'll be
> accessing "cur_thread_placeholder" when thread exits. However, since
> TSan is initialized too early, it's taken down earlier than other
> thread locals. This is problematic, because destructor of these
> late-destroyed thread-locals may still call functions in TSan RTL
> (interposed functions or functions called from instrumented code, e.g.
> __tsan_func_entry). That either immediately crashes, because TSan is
> in invalid state or tries to "resurrect" TLV, which generally creates
> an infinite recursion, and end up with stack overflow.
>
> As far as I understood, dynamic linker creates a pthread key acting
> like a class "destructor" (quoting to distinguish it from pthread's
> destructor). When dyld loads image, it creates another key, whose
> destructor function frees the storage. At thread exit, it will call
> destructor functions from first to last. Therefore, the first
> "destructor" destructor function will be called before "finalize"
> destructor. After finalize, storage is freed, but can be resurrected
> if address of TLV is taken.
>
> I have found a workaround (or a solution -- you name it) that aims to
> keep early initialization behavior of TSan TLV keys, but postpone
> their destruction as much as possible. It seems like the only way to
> do this is to assign highest pthread key indices to dyld's TLV code
> when it allocates thread-local variables for TSan, and preserve their
> order. Because dyld will create keys in the right order; we just want
> to move those keys to the end of allocatable range. We have N pthread
> keys available in the range, and we need to allocate three keys at the
> highest possible indices, and order them. Two of these keys are used
> by TLV code (in dyld), and one is used internally in TSan. In this
> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
> Destructor functions for these keys will be called in ascending order,
> and TSan will work as expected in this case.
>
> TSan wasn't interposing pthread_key_create, but I had to highjack it
> on Darwin to achieve the behavior I explained. First, it identifies
> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
> call "pthread_key_create_high" to allocate highest possible key
> indices. It's basically a 'while' loop until real pthread_key_create
> returns EAGAIN. It then frees all other keys it has previously
> allocated. This places TSan's TLV keys to the end of destructor list.
> There is one more thing this function does; it stores pointer to
> original destructor function (passed in to pthread_key_create), and
> replaces it with its own destructor function, which will call
> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
> times to postpone destruction of TSan TLV even further (IIRC, this was
> necessary). The replacement destructor function then calls the actual
> destructor function.
>
> As a side note, dyld uses libc++, and its loading the instrumented
> version. This didn't seem to be creating any problem so far. I didn't
> test this on a sufficiently large code base just yet. TSan tests
> generally pass, but they didn't have this "late-destroyed thread
> locals" problem I mentioned.
>
> Does my approach sound wrong, flawed or too brittle? Is there any
> other way to order these events in a sensible way? Can you help me
> fixing some of the problems I mentioned below, especially the
> 'identify_caller' function?
>
> The code is in a very dirty state now, so I refrain submitting it.
> But, no worries; there's always pseudo code! I'd like to share the key
> creation part, the rest of TSan patches are much simpler than this
> part.
>
> Thanks,
> Ismail
>
> #if SANITIZER_MAC
> enum { kPthHighIndexTlvInitializer = 0,
> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
> kPthHighIndexTsanFinalize,
> kPthHighIndexCount };
>
> // FIXME: This function is problematic. Can we avoid dladdr?
> static int identify_caller(uptr pc) {
>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>     return kPthHighIndexTsanFinalize;
>   // There are 2 "hidden" callers
>   Dl_info dli;
>   if (!dladdr((const void*)pc, &dli))
>     return -1;
>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>     return kPthHighIndexTlvInitializer;
>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>     return kPthHighIndexTlvLoadNotification;
>   return -1;
> }
>
> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>                  void (*dtor)(void*)) {
>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>     return REAL(pthread_key_create)(key, dtor);
>   }
>   const uptr caller_pc = GET_CALLER_PC();
>   const int caller_index = identify_caller(caller_pc);
>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>                             : pthread_key_create_high(key, dtor, caller_index);
> }
>
> // 'index' represents caller index so that we allocate them
> // in an order. We want to get the last N-index keys
> // in a sensible order.
> int __tsan::pthread_key_create_high(unsigned long *key,
>                                     void (*dtor)(void *), int index) {
>   int res = 0;
>   original_dtors[index] = dtor;
>   // Can be a function-local ulong[512]?
>   unsigned long lowest_key, highest_key;
>
>   // The following 2 loops may terminate early or later
>   // depending on 'index'.
>   do
>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>   while (res != EAGAIN);
>   for (; lowest_key < highest_key; ++lowest_key) {
>     pthread_key_delete(lowest_key);
>   }
>   tlv_keys[index] = highest_key;
>   return res;
> }
>
> // This will be called from _pthread_tsd_cleanup
> static void replacement_dtor(void *p) {
>   uptr cur_key;
>   unsigned index = kPthHighIndexCount;
>   // Can this make it through code review?
>   __asm__ __volatile__ ("" : "=b" (cur_key));
>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>     if (cur_key == tlv_keys[c]) {
>       index = c;
>       break;
>     }
>   }
>   if (--tls_dtor_counters[index] == 1) {
>     if (original_dtors[index])
>       original_dtors[index](p);
>     return;
>   }
>   pthread_setspecific(tlv_keys[index], p);
> }
> #endif



Hi Ismail,

I have a counter-proposal that I wanted to implement for some time,
but never actually get to it because it was low-priority.
The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
postponing. There is other smart code that does the same and that also
breaks the assumption that tsan thread finalization runs last.
Another problem was with asan on mac. Essentially what you see, but in
asan it is easier to work around. We just ignore all memory accesses
that come after our thread destructor.

The idea that I had is relatively clean, reasonably
platform-independent and should solve _all_ problems of such kind once
and for all.
Namely: we execute asan/tsan/msan thread destructor when the thread
_join_ returns. At that point there is definitely no thread code
running. The high-level implementation plan: for joinable threads we
hook into join (we already intercept it); for detached threads we
start a helper background thread when thread function returns and join
the user thread from that helper thread. I suspect that actual
implementation will be messier than that: for example, a thread can
call pthread_detach in a pthread_specific destructor after we've
decided to _not_ start a helper thread; or a thread can call
pthread_exit and does not return from thread function.

What do you think about this idea?
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Don Hinton via cfe-dev
On Thu, Oct 8, 2015 at 10:20 AM, Dmitry Vyukov <[hidden email]> wrote:

> On Wed, Oct 7, 2015 at 11:58 PM, Ismail Pazarbasi
> <[hidden email]> wrote:
>> Hi,
>>
>> I've submitted some of the small patches to get TSan running on Mac OS
>> X. Time for the real challenge is approaching, and that needs some
>> discussion.
>>
>> The biggest problem with TSan on Darwin is the way thread local
>> variables are created and destroyed. TSan uses a thread-local variable
>> called "cur_thread_placeholder" to store each thread's state. Thread
>> locals are initialized lazily on Darwin.
>>
>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>> storage for thread local variables. pthread keys are destroyed in the
>> order of their construction, not the opposite direction; from lowest
>> index to highest index. Per POSIX standard, this order is unspecified,
>> but dyld code (and comments) explain the order. We want TSan's
>> thread-locals to live until the end of execution, because we'll be
>> accessing "cur_thread_placeholder" when thread exits. However, since
>> TSan is initialized too early, it's taken down earlier than other
>> thread locals. This is problematic, because destructor of these
>> late-destroyed thread-locals may still call functions in TSan RTL
>> (interposed functions or functions called from instrumented code, e.g.
>> __tsan_func_entry). That either immediately crashes, because TSan is
>> in invalid state or tries to "resurrect" TLV, which generally creates
>> an infinite recursion, and end up with stack overflow.
>>
>> As far as I understood, dynamic linker creates a pthread key acting
>> like a class "destructor" (quoting to distinguish it from pthread's
>> destructor). When dyld loads image, it creates another key, whose
>> destructor function frees the storage. At thread exit, it will call
>> destructor functions from first to last. Therefore, the first
>> "destructor" destructor function will be called before "finalize"
>> destructor. After finalize, storage is freed, but can be resurrected
>> if address of TLV is taken.
>>
>> I have found a workaround (or a solution -- you name it) that aims to
>> keep early initialization behavior of TSan TLV keys, but postpone
>> their destruction as much as possible. It seems like the only way to
>> do this is to assign highest pthread key indices to dyld's TLV code
>> when it allocates thread-local variables for TSan, and preserve their
>> order. Because dyld will create keys in the right order; we just want
>> to move those keys to the end of allocatable range. We have N pthread
>> keys available in the range, and we need to allocate three keys at the
>> highest possible indices, and order them. Two of these keys are used
>> by TLV code (in dyld), and one is used internally in TSan. In this
>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>> Destructor functions for these keys will be called in ascending order,
>> and TSan will work as expected in this case.
>>
>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>> on Darwin to achieve the behavior I explained. First, it identifies
>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>> call "pthread_key_create_high" to allocate highest possible key
>> indices. It's basically a 'while' loop until real pthread_key_create
>> returns EAGAIN. It then frees all other keys it has previously
>> allocated. This places TSan's TLV keys to the end of destructor list.
>> There is one more thing this function does; it stores pointer to
>> original destructor function (passed in to pthread_key_create), and
>> replaces it with its own destructor function, which will call
>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>> times to postpone destruction of TSan TLV even further (IIRC, this was
>> necessary). The replacement destructor function then calls the actual
>> destructor function.
>>
>> As a side note, dyld uses libc++, and its loading the instrumented
>> version. This didn't seem to be creating any problem so far. I didn't
>> test this on a sufficiently large code base just yet. TSan tests
>> generally pass, but they didn't have this "late-destroyed thread
>> locals" problem I mentioned.
>>
>> Does my approach sound wrong, flawed or too brittle? Is there any
>> other way to order these events in a sensible way? Can you help me
>> fixing some of the problems I mentioned below, especially the
>> 'identify_caller' function?
>>
>> The code is in a very dirty state now, so I refrain submitting it.
>> But, no worries; there's always pseudo code! I'd like to share the key
>> creation part, the rest of TSan patches are much simpler than this
>> part.
>>
>> Thanks,
>> Ismail
>>
>> #if SANITIZER_MAC
>> enum { kPthHighIndexTlvInitializer = 0,
>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>> kPthHighIndexTsanFinalize,
>> kPthHighIndexCount };
>>
>> // FIXME: This function is problematic. Can we avoid dladdr?
>> static int identify_caller(uptr pc) {
>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>     return kPthHighIndexTsanFinalize;
>>   // There are 2 "hidden" callers
>>   Dl_info dli;
>>   if (!dladdr((const void*)pc, &dli))
>>     return -1;
>>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>     return kPthHighIndexTlvInitializer;
>>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>     return kPthHighIndexTlvLoadNotification;
>>   return -1;
>> }
>>
>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>                  void (*dtor)(void*)) {
>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>     return REAL(pthread_key_create)(key, dtor);
>>   }
>>   const uptr caller_pc = GET_CALLER_PC();
>>   const int caller_index = identify_caller(caller_pc);
>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>                             : pthread_key_create_high(key, dtor, caller_index);
>> }
>>
>> // 'index' represents caller index so that we allocate them
>> // in an order. We want to get the last N-index keys
>> // in a sensible order.
>> int __tsan::pthread_key_create_high(unsigned long *key,
>>                                     void (*dtor)(void *), int index) {
>>   int res = 0;
>>   original_dtors[index] = dtor;
>>   // Can be a function-local ulong[512]?
>>   unsigned long lowest_key, highest_key;
>>
>>   // The following 2 loops may terminate early or later
>>   // depending on 'index'.
>>   do
>>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>   while (res != EAGAIN);
>>   for (; lowest_key < highest_key; ++lowest_key) {
>>     pthread_key_delete(lowest_key);
>>   }
>>   tlv_keys[index] = highest_key;
>>   return res;
>> }
>>
>> // This will be called from _pthread_tsd_cleanup
>> static void replacement_dtor(void *p) {
>>   uptr cur_key;
>>   unsigned index = kPthHighIndexCount;
>>   // Can this make it through code review?
>>   __asm__ __volatile__ ("" : "=b" (cur_key));
>>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>     if (cur_key == tlv_keys[c]) {
>>       index = c;
>>       break;
>>     }
>>   }
>>   if (--tls_dtor_counters[index] == 1) {
>>     if (original_dtors[index])
>>       original_dtors[index](p);
>>     return;
>>   }
>>   pthread_setspecific(tlv_keys[index], p);
>> }
>> #endif
>
>
>
> Hi Ismail,
>
> I have a counter-proposal that I wanted to implement for some time,
> but never actually get to it because it was low-priority.
> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
> postponing. There is other smart code that does the same and that also
> breaks the assumption that tsan thread finalization runs last.
> Another problem was with asan on mac. Essentially what you see, but in
> asan it is easier to work around. We just ignore all memory accesses
> that come after our thread destructor.
>
> The idea that I had is relatively clean, reasonably
> platform-independent and should solve _all_ problems of such kind once
> and for all.
> Namely: we execute asan/tsan/msan thread destructor when the thread
> _join_ returns. At that point there is definitely no thread code
> running. The high-level implementation plan: for joinable threads we
> hook into join (we already intercept it); for detached threads we
> start a helper background thread when thread function returns and join
> the user thread from that helper thread. I suspect that actual
> implementation will be messier than that: for example, a thread can
> call pthread_detach in a pthread_specific destructor after we've
> decided to _not_ start a helper thread; or a thread can call
> pthread_exit and does not return from thread function.
>
> What do you think about this idea?

Another implementation plan can be to always join all threads
ourselves. Then in pthread_detach merely remember that the thread is
detached, but don't call REAL(pthread_detach). Then when we join a
thread check whether it is detached or not; if it is detached then we
can discard all thread state; if it is non-detached then we need to
remember that the thread has finished and its return value, wait for
user pthread_join call and satisfy it from our state.

I don't know which plan will lead to simpler and more reliable implementation.

To make it clear, what I don't like in your proposal: it is somewhat
platform-dependent, should be enabled only on mac (thus less tested)
and does not solve all potential problems of such kind (e.g. on some
other OS there can be some other magical way to run user code after
tsan thread destructor).
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Don Hinton via cfe-dev
In reply to this post by Don Hinton via cfe-dev
On Thu, Oct 8, 2015 at 10:20 AM, Dmitry Vyukov <[hidden email]> wrote:

> On Wed, Oct 7, 2015 at 11:58 PM, Ismail Pazarbasi
> <[hidden email]> wrote:
>> Hi,
>>
>> I've submitted some of the small patches to get TSan running on Mac OS
>> X. Time for the real challenge is approaching, and that needs some
>> discussion.
>>
>> The biggest problem with TSan on Darwin is the way thread local
>> variables are created and destroyed. TSan uses a thread-local variable
>> called "cur_thread_placeholder" to store each thread's state. Thread
>> locals are initialized lazily on Darwin.
>>
>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>> storage for thread local variables. pthread keys are destroyed in the
>> order of their construction, not the opposite direction; from lowest
>> index to highest index. Per POSIX standard, this order is unspecified,
>> but dyld code (and comments) explain the order. We want TSan's
>> thread-locals to live until the end of execution, because we'll be
>> accessing "cur_thread_placeholder" when thread exits. However, since
>> TSan is initialized too early, it's taken down earlier than other
>> thread locals. This is problematic, because destructor of these
>> late-destroyed thread-locals may still call functions in TSan RTL
>> (interposed functions or functions called from instrumented code, e.g.
>> __tsan_func_entry). That either immediately crashes, because TSan is
>> in invalid state or tries to "resurrect" TLV, which generally creates
>> an infinite recursion, and end up with stack overflow.
>>
>> As far as I understood, dynamic linker creates a pthread key acting
>> like a class "destructor" (quoting to distinguish it from pthread's
>> destructor). When dyld loads image, it creates another key, whose
>> destructor function frees the storage. At thread exit, it will call
>> destructor functions from first to last. Therefore, the first
>> "destructor" destructor function will be called before "finalize"
>> destructor. After finalize, storage is freed, but can be resurrected
>> if address of TLV is taken.
>>
>> I have found a workaround (or a solution -- you name it) that aims to
>> keep early initialization behavior of TSan TLV keys, but postpone
>> their destruction as much as possible. It seems like the only way to
>> do this is to assign highest pthread key indices to dyld's TLV code
>> when it allocates thread-local variables for TSan, and preserve their
>> order. Because dyld will create keys in the right order; we just want
>> to move those keys to the end of allocatable range. We have N pthread
>> keys available in the range, and we need to allocate three keys at the
>> highest possible indices, and order them. Two of these keys are used
>> by TLV code (in dyld), and one is used internally in TSan. In this
>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>> Destructor functions for these keys will be called in ascending order,
>> and TSan will work as expected in this case.
>>
>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>> on Darwin to achieve the behavior I explained. First, it identifies
>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>> call "pthread_key_create_high" to allocate highest possible key
>> indices. It's basically a 'while' loop until real pthread_key_create
>> returns EAGAIN. It then frees all other keys it has previously
>> allocated. This places TSan's TLV keys to the end of destructor list.
>> There is one more thing this function does; it stores pointer to
>> original destructor function (passed in to pthread_key_create), and
>> replaces it with its own destructor function, which will call
>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>> times to postpone destruction of TSan TLV even further (IIRC, this was
>> necessary). The replacement destructor function then calls the actual
>> destructor function.
>>
>> As a side note, dyld uses libc++, and its loading the instrumented
>> version. This didn't seem to be creating any problem so far. I didn't
>> test this on a sufficiently large code base just yet. TSan tests
>> generally pass, but they didn't have this "late-destroyed thread
>> locals" problem I mentioned.
>>
>> Does my approach sound wrong, flawed or too brittle? Is there any
>> other way to order these events in a sensible way? Can you help me
>> fixing some of the problems I mentioned below, especially the
>> 'identify_caller' function?
>>
>> The code is in a very dirty state now, so I refrain submitting it.
>> But, no worries; there's always pseudo code! I'd like to share the key
>> creation part, the rest of TSan patches are much simpler than this
>> part.
>>
>> Thanks,
>> Ismail
>>
>> #if SANITIZER_MAC
>> enum { kPthHighIndexTlvInitializer = 0,
>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>> kPthHighIndexTsanFinalize,
>> kPthHighIndexCount };
>>
>> // FIXME: This function is problematic. Can we avoid dladdr?
>> static int identify_caller(uptr pc) {
>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>     return kPthHighIndexTsanFinalize;
>>   // There are 2 "hidden" callers
>>   Dl_info dli;
>>   if (!dladdr((const void*)pc, &dli))
>>     return -1;
>>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>     return kPthHighIndexTlvInitializer;
>>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>     return kPthHighIndexTlvLoadNotification;
>>   return -1;
>> }
>>
>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>                  void (*dtor)(void*)) {
>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>     return REAL(pthread_key_create)(key, dtor);
>>   }
>>   const uptr caller_pc = GET_CALLER_PC();
>>   const int caller_index = identify_caller(caller_pc);
>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>                             : pthread_key_create_high(key, dtor, caller_index);
>> }
>>
>> // 'index' represents caller index so that we allocate them
>> // in an order. We want to get the last N-index keys
>> // in a sensible order.
>> int __tsan::pthread_key_create_high(unsigned long *key,
>>                                     void (*dtor)(void *), int index) {
>>   int res = 0;
>>   original_dtors[index] = dtor;
>>   // Can be a function-local ulong[512]?
>>   unsigned long lowest_key, highest_key;
>>
>>   // The following 2 loops may terminate early or later
>>   // depending on 'index'.
>>   do
>>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>   while (res != EAGAIN);
>>   for (; lowest_key < highest_key; ++lowest_key) {
>>     pthread_key_delete(lowest_key);
>>   }
>>   tlv_keys[index] = highest_key;
>>   return res;
>> }
>>
>> // This will be called from _pthread_tsd_cleanup
>> static void replacement_dtor(void *p) {
>>   uptr cur_key;
>>   unsigned index = kPthHighIndexCount;
>>   // Can this make it through code review?
>>   __asm__ __volatile__ ("" : "=b" (cur_key));
>>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>     if (cur_key == tlv_keys[c]) {
>>       index = c;
>>       break;
>>     }
>>   }
>>   if (--tls_dtor_counters[index] == 1) {
>>     if (original_dtors[index])
>>       original_dtors[index](p);
>>     return;
>>   }
>>   pthread_setspecific(tlv_keys[index], p);
>> }
>> #endif
>
>
>
> Hi Ismail,
Hi Dmitry!

sorry for the late reply. It was a hectic period. I am looking forward
to be back to my normal schedule.

>
> I have a counter-proposal that I wanted to implement for some time,
> but never actually get to it because it was low-priority.
> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
> postponing. There is other smart code that does the same and that also
> breaks the assumption that tsan thread finalization runs last.
In our case, that other smart code will run after TSan's keys, because
our (TLV) keys are at the end of the list. The loop will call
destructors for the smart code before TSan's destructor.

> Another problem was with asan on mac. Essentially what you see, but in
> asan it is easier to work around. We just ignore all memory accesses
> that come after our thread destructor.
I remember I tried it, but something went wrong. It was around April,
I don't have a complete recollection of what happened. But I remember
one thing; clock synchronization was called from somewhere (either
something was acquiring pthread_mutex_lock, which is interposed, or
FuncEntry was called), and that synchronization relied on cur_thread()
to return a valid pointer to ThreadState.

>
> The idea that I had is relatively clean, reasonably
> platform-independent and should solve _all_ problems of such kind once
> and for all.
> Namely: we execute asan/tsan/msan thread destructor when the thread
> _join_ returns. At that point there is definitely no thread code
> running. The high-level implementation plan: for joinable threads we
> hook into join (we already intercept it); for detached threads we
> start a helper background thread when thread function returns and join
> the user thread from that helper thread. I suspect that actual
> implementation will be messier than that: for example, a thread can
> call pthread_detach in a pthread_specific destructor after we've
> decided to _not_ start a helper thread; or a thread can call
> pthread_exit and does not return from thread function.
>
> What do you think about this idea?
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Don Hinton via cfe-dev
In reply to this post by Don Hinton via cfe-dev
On Thu, Oct 8, 2015 at 10:41 AM, Dmitry Vyukov <[hidden email]> wrote:

> On Thu, Oct 8, 2015 at 10:20 AM, Dmitry Vyukov <[hidden email]> wrote:
>> On Wed, Oct 7, 2015 at 11:58 PM, Ismail Pazarbasi
>> <[hidden email]> wrote:
>>> Hi,
>>>
>>> I've submitted some of the small patches to get TSan running on Mac OS
>>> X. Time for the real challenge is approaching, and that needs some
>>> discussion.
>>>
>>> The biggest problem with TSan on Darwin is the way thread local
>>> variables are created and destroyed. TSan uses a thread-local variable
>>> called "cur_thread_placeholder" to store each thread's state. Thread
>>> locals are initialized lazily on Darwin.
>>>
>>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>>> storage for thread local variables. pthread keys are destroyed in the
>>> order of their construction, not the opposite direction; from lowest
>>> index to highest index. Per POSIX standard, this order is unspecified,
>>> but dyld code (and comments) explain the order. We want TSan's
>>> thread-locals to live until the end of execution, because we'll be
>>> accessing "cur_thread_placeholder" when thread exits. However, since
>>> TSan is initialized too early, it's taken down earlier than other
>>> thread locals. This is problematic, because destructor of these
>>> late-destroyed thread-locals may still call functions in TSan RTL
>>> (interposed functions or functions called from instrumented code, e.g.
>>> __tsan_func_entry). That either immediately crashes, because TSan is
>>> in invalid state or tries to "resurrect" TLV, which generally creates
>>> an infinite recursion, and end up with stack overflow.
>>>
>>> As far as I understood, dynamic linker creates a pthread key acting
>>> like a class "destructor" (quoting to distinguish it from pthread's
>>> destructor). When dyld loads image, it creates another key, whose
>>> destructor function frees the storage. At thread exit, it will call
>>> destructor functions from first to last. Therefore, the first
>>> "destructor" destructor function will be called before "finalize"
>>> destructor. After finalize, storage is freed, but can be resurrected
>>> if address of TLV is taken.
>>>
>>> I have found a workaround (or a solution -- you name it) that aims to
>>> keep early initialization behavior of TSan TLV keys, but postpone
>>> their destruction as much as possible. It seems like the only way to
>>> do this is to assign highest pthread key indices to dyld's TLV code
>>> when it allocates thread-local variables for TSan, and preserve their
>>> order. Because dyld will create keys in the right order; we just want
>>> to move those keys to the end of allocatable range. We have N pthread
>>> keys available in the range, and we need to allocate three keys at the
>>> highest possible indices, and order them. Two of these keys are used
>>> by TLV code (in dyld), and one is used internally in TSan. In this
>>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>>> Destructor functions for these keys will be called in ascending order,
>>> and TSan will work as expected in this case.
>>>
>>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>>> on Darwin to achieve the behavior I explained. First, it identifies
>>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>>> call "pthread_key_create_high" to allocate highest possible key
>>> indices. It's basically a 'while' loop until real pthread_key_create
>>> returns EAGAIN. It then frees all other keys it has previously
>>> allocated. This places TSan's TLV keys to the end of destructor list.
>>> There is one more thing this function does; it stores pointer to
>>> original destructor function (passed in to pthread_key_create), and
>>> replaces it with its own destructor function, which will call
>>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>>> times to postpone destruction of TSan TLV even further (IIRC, this was
>>> necessary). The replacement destructor function then calls the actual
>>> destructor function.
>>>
>>> As a side note, dyld uses libc++, and its loading the instrumented
>>> version. This didn't seem to be creating any problem so far. I didn't
>>> test this on a sufficiently large code base just yet. TSan tests
>>> generally pass, but they didn't have this "late-destroyed thread
>>> locals" problem I mentioned.
>>>
>>> Does my approach sound wrong, flawed or too brittle? Is there any
>>> other way to order these events in a sensible way? Can you help me
>>> fixing some of the problems I mentioned below, especially the
>>> 'identify_caller' function?
>>>
>>> The code is in a very dirty state now, so I refrain submitting it.
>>> But, no worries; there's always pseudo code! I'd like to share the key
>>> creation part, the rest of TSan patches are much simpler than this
>>> part.
>>>
>>> Thanks,
>>> Ismail
>>>
>>> #if SANITIZER_MAC
>>> enum { kPthHighIndexTlvInitializer = 0,
>>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>>> kPthHighIndexTsanFinalize,
>>> kPthHighIndexCount };
>>>
>>> // FIXME: This function is problematic. Can we avoid dladdr?
>>> static int identify_caller(uptr pc) {
>>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>>     return kPthHighIndexTsanFinalize;
>>>   // There are 2 "hidden" callers
>>>   Dl_info dli;
>>>   if (!dladdr((const void*)pc, &dli))
>>>     return -1;
>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>>     return kPthHighIndexTlvInitializer;
>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>>     return kPthHighIndexTlvLoadNotification;
>>>   return -1;
>>> }
>>>
>>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>>                  void (*dtor)(void*)) {
>>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>>     return REAL(pthread_key_create)(key, dtor);
>>>   }
>>>   const uptr caller_pc = GET_CALLER_PC();
>>>   const int caller_index = identify_caller(caller_pc);
>>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>>                             : pthread_key_create_high(key, dtor, caller_index);
>>> }
>>>
>>> // 'index' represents caller index so that we allocate them
>>> // in an order. We want to get the last N-index keys
>>> // in a sensible order.
>>> int __tsan::pthread_key_create_high(unsigned long *key,
>>>                                     void (*dtor)(void *), int index) {
>>>   int res = 0;
>>>   original_dtors[index] = dtor;
>>>   // Can be a function-local ulong[512]?
>>>   unsigned long lowest_key, highest_key;
>>>
>>>   // The following 2 loops may terminate early or later
>>>   // depending on 'index'.
>>>   do
>>>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>>   while (res != EAGAIN);
>>>   for (; lowest_key < highest_key; ++lowest_key) {
>>>     pthread_key_delete(lowest_key);
>>>   }
>>>   tlv_keys[index] = highest_key;
>>>   return res;
>>> }
>>>
>>> // This will be called from _pthread_tsd_cleanup
>>> static void replacement_dtor(void *p) {
>>>   uptr cur_key;
>>>   unsigned index = kPthHighIndexCount;
>>>   // Can this make it through code review?
>>>   __asm__ __volatile__ ("" : "=b" (cur_key));
>>>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>>     if (cur_key == tlv_keys[c]) {
>>>       index = c;
>>>       break;
>>>     }
>>>   }
>>>   if (--tls_dtor_counters[index] == 1) {
>>>     if (original_dtors[index])
>>>       original_dtors[index](p);
>>>     return;
>>>   }
>>>   pthread_setspecific(tlv_keys[index], p);
>>> }
>>> #endif
>>
>>
>>
>> Hi Ismail,
>>
>> I have a counter-proposal that I wanted to implement for some time,
>> but never actually get to it because it was low-priority.
>> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
>> postponing. There is other smart code that does the same and that also
>> breaks the assumption that tsan thread finalization runs last.
>> Another problem was with asan on mac. Essentially what you see, but in
>> asan it is easier to work around. We just ignore all memory accesses
>> that come after our thread destructor.
>>
>> The idea that I had is relatively clean, reasonably
>> platform-independent and should solve _all_ problems of such kind once
>> and for all.
>> Namely: we execute asan/tsan/msan thread destructor when the thread
>> _join_ returns. At that point there is definitely no thread code
>> running. The high-level implementation plan: for joinable threads we
>> hook into join (we already intercept it); for detached threads we
>> start a helper background thread when thread function returns and join
>> the user thread from that helper thread. I suspect that actual
>> implementation will be messier than that: for example, a thread can
>> call pthread_detach in a pthread_specific destructor after we've
>> decided to _not_ start a helper thread; or a thread can call
>> pthread_exit and does not return from thread function.
>>
>> What do you think about this idea?
>
> Another implementation plan can be to always join all threads
> ourselves. Then in pthread_detach merely remember that the thread is
> detached, but don't call REAL(pthread_detach). Then when we join a
> thread check whether it is detached or not; if it is detached then we
> can discard all thread state; if it is non-detached then we need to
> remember that the thread has finished and its return value, wait for
> user pthread_join call and satisfy it from our state.
>
> I don't know which plan will lead to simpler and more reliable implementation.
>
I didn't think thoroughly about this case. This sounds easier to
implement, but I am not sure if it'll work correctly (i.e. will it
miss problems that happen after TSan TLV dies, or will it report
false-positives, because TSan TLV dies too early). Still, we need to
change a few things in TSan so that it tolerates missing
cur_thread_state. I forgot many details of TSan, I haven't been
hacking it for a long time.

> To make it clear, what I don't like in your proposal: it is somewhat
> platform-dependent, should be enabled only on mac (thus less tested)
> and does not solve all potential problems of such kind (e.g. on some
> other OS there can be some other magical way to run user code after
> tsan thread destructor).
I agree; it's best to avoid platform-specific workarounds.

Ismail
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Don Hinton via cfe-dev
On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi
<[hidden email]> wrote:

>>>> Hi,
>>>>
>>>> I've submitted some of the small patches to get TSan running on Mac OS
>>>> X. Time for the real challenge is approaching, and that needs some
>>>> discussion.
>>>>
>>>> The biggest problem with TSan on Darwin is the way thread local
>>>> variables are created and destroyed. TSan uses a thread-local variable
>>>> called "cur_thread_placeholder" to store each thread's state. Thread
>>>> locals are initialized lazily on Darwin.
>>>>
>>>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>>>> storage for thread local variables. pthread keys are destroyed in the
>>>> order of their construction, not the opposite direction; from lowest
>>>> index to highest index. Per POSIX standard, this order is unspecified,
>>>> but dyld code (and comments) explain the order. We want TSan's
>>>> thread-locals to live until the end of execution, because we'll be
>>>> accessing "cur_thread_placeholder" when thread exits. However, since
>>>> TSan is initialized too early, it's taken down earlier than other
>>>> thread locals. This is problematic, because destructor of these
>>>> late-destroyed thread-locals may still call functions in TSan RTL
>>>> (interposed functions or functions called from instrumented code, e.g.
>>>> __tsan_func_entry). That either immediately crashes, because TSan is
>>>> in invalid state or tries to "resurrect" TLV, which generally creates
>>>> an infinite recursion, and end up with stack overflow.
>>>>
>>>> As far as I understood, dynamic linker creates a pthread key acting
>>>> like a class "destructor" (quoting to distinguish it from pthread's
>>>> destructor). When dyld loads image, it creates another key, whose
>>>> destructor function frees the storage. At thread exit, it will call
>>>> destructor functions from first to last. Therefore, the first
>>>> "destructor" destructor function will be called before "finalize"
>>>> destructor. After finalize, storage is freed, but can be resurrected
>>>> if address of TLV is taken.
>>>>
>>>> I have found a workaround (or a solution -- you name it) that aims to
>>>> keep early initialization behavior of TSan TLV keys, but postpone
>>>> their destruction as much as possible. It seems like the only way to
>>>> do this is to assign highest pthread key indices to dyld's TLV code
>>>> when it allocates thread-local variables for TSan, and preserve their
>>>> order. Because dyld will create keys in the right order; we just want
>>>> to move those keys to the end of allocatable range. We have N pthread
>>>> keys available in the range, and we need to allocate three keys at the
>>>> highest possible indices, and order them. Two of these keys are used
>>>> by TLV code (in dyld), and one is used internally in TSan. In this
>>>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>>>> Destructor functions for these keys will be called in ascending order,
>>>> and TSan will work as expected in this case.
>>>>
>>>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>>>> on Darwin to achieve the behavior I explained. First, it identifies
>>>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>>>> call "pthread_key_create_high" to allocate highest possible key
>>>> indices. It's basically a 'while' loop until real pthread_key_create
>>>> returns EAGAIN. It then frees all other keys it has previously
>>>> allocated. This places TSan's TLV keys to the end of destructor list.
>>>> There is one more thing this function does; it stores pointer to
>>>> original destructor function (passed in to pthread_key_create), and
>>>> replaces it with its own destructor function, which will call
>>>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>>>> times to postpone destruction of TSan TLV even further (IIRC, this was
>>>> necessary). The replacement destructor function then calls the actual
>>>> destructor function.
>>>>
>>>> As a side note, dyld uses libc++, and its loading the instrumented
>>>> version. This didn't seem to be creating any problem so far. I didn't
>>>> test this on a sufficiently large code base just yet. TSan tests
>>>> generally pass, but they didn't have this "late-destroyed thread
>>>> locals" problem I mentioned.
>>>>
>>>> Does my approach sound wrong, flawed or too brittle? Is there any
>>>> other way to order these events in a sensible way? Can you help me
>>>> fixing some of the problems I mentioned below, especially the
>>>> 'identify_caller' function?
>>>>
>>>> The code is in a very dirty state now, so I refrain submitting it.
>>>> But, no worries; there's always pseudo code! I'd like to share the key
>>>> creation part, the rest of TSan patches are much simpler than this
>>>> part.
>>>>
>>>> Thanks,
>>>> Ismail
>>>>
>>>> #if SANITIZER_MAC
>>>> enum { kPthHighIndexTlvInitializer = 0,
>>>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>>>> kPthHighIndexTsanFinalize,
>>>> kPthHighIndexCount };
>>>>
>>>> // FIXME: This function is problematic. Can we avoid dladdr?
>>>> static int identify_caller(uptr pc) {
>>>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>>>     return kPthHighIndexTsanFinalize;
>>>>   // There are 2 "hidden" callers
>>>>   Dl_info dli;
>>>>   if (!dladdr((const void*)pc, &dli))
>>>>     return -1;
>>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>>>     return kPthHighIndexTlvInitializer;
>>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>>>     return kPthHighIndexTlvLoadNotification;
>>>>   return -1;
>>>> }
>>>>
>>>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>>>                  void (*dtor)(void*)) {
>>>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>>>     return REAL(pthread_key_create)(key, dtor);
>>>>   }
>>>>   const uptr caller_pc = GET_CALLER_PC();
>>>>   const int caller_index = identify_caller(caller_pc);
>>>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>>>                             : pthread_key_create_high(key, dtor, caller_index);
>>>> }
>>>>
>>>> // 'index' represents caller index so that we allocate them
>>>> // in an order. We want to get the last N-index keys
>>>> // in a sensible order.
>>>> int __tsan::pthread_key_create_high(unsigned long *key,
>>>>                                     void (*dtor)(void *), int index) {
>>>>   int res = 0;
>>>>   original_dtors[index] = dtor;
>>>>   // Can be a function-local ulong[512]?
>>>>   unsigned long lowest_key, highest_key;
>>>>
>>>>   // The following 2 loops may terminate early or later
>>>>   // depending on 'index'.
>>>>   do
>>>>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>>>   while (res != EAGAIN);
>>>>   for (; lowest_key < highest_key; ++lowest_key) {
>>>>     pthread_key_delete(lowest_key);
>>>>   }
>>>>   tlv_keys[index] = highest_key;
>>>>   return res;
>>>> }
>>>>
>>>> // This will be called from _pthread_tsd_cleanup
>>>> static void replacement_dtor(void *p) {
>>>>   uptr cur_key;
>>>>   unsigned index = kPthHighIndexCount;
>>>>   // Can this make it through code review?
>>>>   __asm__ __volatile__ ("" : "=b" (cur_key));
>>>>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>>>     if (cur_key == tlv_keys[c]) {
>>>>       index = c;
>>>>       break;
>>>>     }
>>>>   }
>>>>   if (--tls_dtor_counters[index] == 1) {
>>>>     if (original_dtors[index])
>>>>       original_dtors[index](p);
>>>>     return;
>>>>   }
>>>>   pthread_setspecific(tlv_keys[index], p);
>>>> }
>>>> #endif
>>>
>>>
>>>
>>> Hi Ismail,
>>>
>>> I have a counter-proposal that I wanted to implement for some time,
>>> but never actually get to it because it was low-priority.
>>> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
>>> postponing. There is other smart code that does the same and that also
>>> breaks the assumption that tsan thread finalization runs last.
>>> Another problem was with asan on mac. Essentially what you see, but in
>>> asan it is easier to work around. We just ignore all memory accesses
>>> that come after our thread destructor.
>>>
>>> The idea that I had is relatively clean, reasonably
>>> platform-independent and should solve _all_ problems of such kind once
>>> and for all.
>>> Namely: we execute asan/tsan/msan thread destructor when the thread
>>> _join_ returns. At that point there is definitely no thread code
>>> running. The high-level implementation plan: for joinable threads we
>>> hook into join (we already intercept it); for detached threads we
>>> start a helper background thread when thread function returns and join
>>> the user thread from that helper thread. I suspect that actual
>>> implementation will be messier than that: for example, a thread can
>>> call pthread_detach in a pthread_specific destructor after we've
>>> decided to _not_ start a helper thread; or a thread can call
>>> pthread_exit and does not return from thread function.
>>>
>>> What do you think about this idea?
>>
>> Another implementation plan can be to always join all threads
>> ourselves. Then in pthread_detach merely remember that the thread is
>> detached, but don't call REAL(pthread_detach). Then when we join a
>> thread check whether it is detached or not; if it is detached then we
>> can discard all thread state; if it is non-detached then we need to
>> remember that the thread has finished and its return value, wait for
>> user pthread_join call and satisfy it from our state.
>>
>> I don't know which plan will lead to simpler and more reliable implementation.
>>
> I didn't think thoroughly about this case. This sounds easier to
> implement, but I am not sure if it'll work correctly (i.e. will it
> miss problems that happen after TSan TLV dies, or will it report
> false-positives, because TSan TLV dies too early). Still, we need to


Can we somehow get access to ThreadState from cur_thread() until we
join the thread? That would be ideal and prevent false positives and
negatives. The simplest way I can think of is to consult
ThreadRegistriy using pthread_self() as the key to find current
ThreadState if pthread_getspecific() returns NULL.


> change a few things in TSan so that it tolerates missing
> cur_thread_state. I forgot many details of TSan, I haven't been
> hacking it for a long time.
>
>> To make it clear, what I don't like in your proposal: it is somewhat
>> platform-dependent, should be enabled only on mac (thus less tested)
>> and does not solve all potential problems of such kind (e.g. on some
>> other OS there can be some other magical way to run user code after
>> tsan thread destructor).
> I agree; it's best to avoid platform-specific workarounds.
>
> Ismail
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Don Hinton via cfe-dev
Hi Ismail and Dmitry,

I’ve been following this (and related) conversations.  We’ve had a recent discussion with Kostya about the issues with TLVs and interceptors on OS X, as I am also experimenting with porting TSan to work on OS X.

While I understand that this patch is about TLV destruction, I’m seeing a more imminent issue, which is the initialization of TLV during process startup and during new thread creation.  The issue is very similar, because TSan’s interceptors get called *before* TLVs are initialized by dyld, and by accessing cur_thread_placeholder the program either crashes or infinitely recurses.  Do you already have some workaround for this as well?

On which version of OS X are you running your tests?

Thanks,
Kuba


On 26 Oct 2015, at 17:45, Dmitry Vyukov <[hidden email]> wrote:

On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi
<[hidden email]> wrote:
Hi,

I've submitted some of the small patches to get TSan running on Mac OS
X. Time for the real challenge is approaching, and that needs some
discussion.

The biggest problem with TSan on Darwin is the way thread local
variables are created and destroyed. TSan uses a thread-local variable
called "cur_thread_placeholder" to store each thread's state. Thread
locals are initialized lazily on Darwin.

Darwin dynamic linker uses pthread keys to maintain lifetime and
storage for thread local variables. pthread keys are destroyed in the
order of their construction, not the opposite direction; from lowest
index to highest index. Per POSIX standard, this order is unspecified,
but dyld code (and comments) explain the order. We want TSan's
thread-locals to live until the end of execution, because we'll be
accessing "cur_thread_placeholder" when thread exits. However, since
TSan is initialized too early, it's taken down earlier than other
thread locals. This is problematic, because destructor of these
late-destroyed thread-locals may still call functions in TSan RTL
(interposed functions or functions called from instrumented code, e.g.
__tsan_func_entry). That either immediately crashes, because TSan is
in invalid state or tries to "resurrect" TLV, which generally creates
an infinite recursion, and end up with stack overflow.

As far as I understood, dynamic linker creates a pthread key acting
like a class "destructor" (quoting to distinguish it from pthread's
destructor). When dyld loads image, it creates another key, whose
destructor function frees the storage. At thread exit, it will call
destructor functions from first to last. Therefore, the first
"destructor" destructor function will be called before "finalize"
destructor. After finalize, storage is freed, but can be resurrected
if address of TLV is taken.

I have found a workaround (or a solution -- you name it) that aims to
keep early initialization behavior of TSan TLV keys, but postpone
their destruction as much as possible. It seems like the only way to
do this is to assign highest pthread key indices to dyld's TLV code
when it allocates thread-local variables for TSan, and preserve their
order. Because dyld will create keys in the right order; we just want
to move those keys to the end of allocatable range. We have N pthread
keys available in the range, and we need to allocate three keys at the
highest possible indices, and order them. Two of these keys are used
by TLV code (in dyld), and one is used internally in TSan. In this
case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
Destructor functions for these keys will be called in ascending order,
and TSan will work as expected in this case.

TSan wasn't interposing pthread_key_create, but I had to highjack it
on Darwin to achieve the behavior I explained. First, it identifies
caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
call "pthread_key_create_high" to allocate highest possible key
indices. It's basically a 'while' loop until real pthread_key_create
returns EAGAIN. It then frees all other keys it has previously
allocated. This places TSan's TLV keys to the end of destructor list.
There is one more thing this function does; it stores pointer to
original destructor function (passed in to pthread_key_create), and
replaces it with its own destructor function, which will call
'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
times to postpone destruction of TSan TLV even further (IIRC, this was
necessary). The replacement destructor function then calls the actual
destructor function.

As a side note, dyld uses libc++, and its loading the instrumented
version. This didn't seem to be creating any problem so far. I didn't
test this on a sufficiently large code base just yet. TSan tests
generally pass, but they didn't have this "late-destroyed thread
locals" problem I mentioned.

Does my approach sound wrong, flawed or too brittle? Is there any
other way to order these events in a sensible way? Can you help me
fixing some of the problems I mentioned below, especially the
'identify_caller' function?

The code is in a very dirty state now, so I refrain submitting it.
But, no worries; there's always pseudo code! I'd like to share the key
creation part, the rest of TSan patches are much simpler than this
part.

Thanks,
Ismail

#if SANITIZER_MAC
enum { kPthHighIndexTlvInitializer = 0,
kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
kPthHighIndexTsanFinalize,
kPthHighIndexCount };

// FIXME: This function is problematic. Can we avoid dladdr?
static int identify_caller(uptr pc) {
 if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
   return kPthHighIndexTsanFinalize;
 // There are 2 "hidden" callers
 Dl_info dli;
 if (!dladdr((const void*)pc, &dli))
   return -1;
 else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
   return kPthHighIndexTlvInitializer;
 else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
   return kPthHighIndexTlvLoadNotification;
 return -1;
}

TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
                void (*dtor)(void*)) {
 if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
   SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
   return REAL(pthread_key_create)(key, dtor);
 }
 const uptr caller_pc = GET_CALLER_PC();
 const int caller_index = identify_caller(caller_pc);
 return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
                           : pthread_key_create_high(key, dtor, caller_index);
}

// 'index' represents caller index so that we allocate them
// in an order. We want to get the last N-index keys
// in a sensible order.
int __tsan::pthread_key_create_high(unsigned long *key,
                                   void (*dtor)(void *), int index) {
 int res = 0;
 original_dtors[index] = dtor;
 // Can be a function-local ulong[512]?
 unsigned long lowest_key, highest_key;

 // The following 2 loops may terminate early or later
 // depending on 'index'.
 do
   res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
 while (res != EAGAIN);
 for (; lowest_key < highest_key; ++lowest_key) {
   pthread_key_delete(lowest_key);
 }
 tlv_keys[index] = highest_key;
 return res;
}

// This will be called from _pthread_tsd_cleanup
static void replacement_dtor(void *p) {
 uptr cur_key;
 unsigned index = kPthHighIndexCount;
 // Can this make it through code review?
 __asm__ __volatile__ ("" : "=b" (cur_key));
 for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
   if (cur_key == tlv_keys[c]) {
     index = c;
     break;
   }
 }
 if (--tls_dtor_counters[index] == 1) {
   if (original_dtors[index])
     original_dtors[index](p);
   return;
 }
 pthread_setspecific(tlv_keys[index], p);
}
#endif



Hi Ismail,

I have a counter-proposal that I wanted to implement for some time,
but never actually get to it because it was low-priority.
The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
postponing. There is other smart code that does the same and that also
breaks the assumption that tsan thread finalization runs last.
Another problem was with asan on mac. Essentially what you see, but in
asan it is easier to work around. We just ignore all memory accesses
that come after our thread destructor.

The idea that I had is relatively clean, reasonably
platform-independent and should solve _all_ problems of such kind once
and for all.
Namely: we execute asan/tsan/msan thread destructor when the thread
_join_ returns. At that point there is definitely no thread code
running. The high-level implementation plan: for joinable threads we
hook into join (we already intercept it); for detached threads we
start a helper background thread when thread function returns and join
the user thread from that helper thread. I suspect that actual
implementation will be messier than that: for example, a thread can
call pthread_detach in a pthread_specific destructor after we've
decided to _not_ start a helper thread; or a thread can call
pthread_exit and does not return from thread function.

What do you think about this idea?

Another implementation plan can be to always join all threads
ourselves. Then in pthread_detach merely remember that the thread is
detached, but don't call REAL(pthread_detach). Then when we join a
thread check whether it is detached or not; if it is detached then we
can discard all thread state; if it is non-detached then we need to
remember that the thread has finished and its return value, wait for
user pthread_join call and satisfy it from our state.

I don't know which plan will lead to simpler and more reliable implementation.

I didn't think thoroughly about this case. This sounds easier to
implement, but I am not sure if it'll work correctly (i.e. will it
miss problems that happen after TSan TLV dies, or will it report
false-positives, because TSan TLV dies too early). Still, we need to


Can we somehow get access to ThreadState from cur_thread() until we
join the thread? That would be ideal and prevent false positives and
negatives. The simplest way I can think of is to consult
ThreadRegistriy using pthread_self() as the key to find current
ThreadState if pthread_getspecific() returns NULL.


change a few things in TSan so that it tolerates missing
cur_thread_state. I forgot many details of TSan, I haven't been
hacking it for a long time.

To make it clear, what I don't like in your proposal: it is somewhat
platform-dependent, should be enabled only on mac (thus less tested)
and does not solve all potential problems of such kind (e.g. on some
other OS there can be some other magical way to run user code after
tsan thread destructor).
I agree; it's best to avoid platform-specific workarounds.

Ismail


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Don Hinton via cfe-dev
Hi Kuba,

Given enough hacks, there's always a workaround! I'll try to cleanup
the messy parts and put my branch on github so that we can share the
code. I haven't posted these parts for review, so I haven't spent
serious effort to make it pretty. It's ugly, but it'll give you the
idea.

I've started on 10.9. It took a while for Cisco IT to give the green
light for 10.10 upgrade, then it took a while for me to upgrade,
because I didn't want any glitch before my sanitizers presentation in
April 2014. I haven't tested on 10.11 yet, because Cisco IT gave green
light for the upgrade just a few days ago.

TL;DR; set a flag when dyld tells you that TLV has been created, and
use that flag in interceptor macros. Likewise, reset that flag when
dyld tells you that it's been deallocated. Note that there were some
interceptors (pthread_mutex_lock, I think, and some C++ ABI functions)
that needed explicit checks for Mac.

1. For Mac, I've made a function named InitCurThreadDarwin, and
__tsan_init calls __tsan::InitCurThreadDarwin()
void __tsan_init() {
#if SANITIZER_MAC
  __tsan::InitCurThreadDarwin();
#endif
  Initialize(cur_thread());
}

2. We'll create a pthread key that indicates that TSan's TLV has been
allocated, and TSan is ready to work. Interceptors will call
TsanIsTlsReady function (on Mac). This prevents interceptors to
intercept anything before TSan's TLV is initialized. As you can see,
it relies on pthread_key_create_high, which I explained in the first
email.

static unsigned long tls_init_key;
static pthread_once_t once_control = PTHREAD_ONCE_INIT;

void InitCurThreadDarwin() {
  pthread_once(&once_control, []{
    __tsan::pthread_key_create_high(&tls_init_key, nullptr, 2);
  });
  // Register handler first; will be called once TLS is allocated
  // FIXME: Must be done only once per thread!
  dyld_register_tlv_state_change_handler(
      dyld_tlv_state_allocated,
      ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
        tls_state_notify(state, info);
      });
  dyld_register_tlv_state_change_handler(
      dyld_tlv_state_deallocated,
      ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
        tls_state_notify(state, info);
      });

  // Taking address will force it to to be initialized.
  volatile void *tmp = &cur_thread_placeholder;
  (void)tmp;
  // enumerate TLS slots, if we are not initialized yet.
  if (!TsanTlsReady()) {
    dyld_enumerate_tlv_storage(
        ^(enum dyld_tlv_states state, const dyld_tlv_info *info) {
          tls_state_notify(state, info);
        });
  }
}

3. TsanIsTlsReady function:
bool TsanIsTlsReady() {
  return !tls_init_key ?
    false : pthread_getspecific(tls_init_key) != nullptr;
}

4. Some macros!
#define ENSURE_TSAN_INITED() do { \
CHECK(!tsan_init_is_running); \
if (UNLIKELY(!tsan_inited)) { \
  TsanInitFromRtl(); \
} \
} while (0)

#define COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED \
  (!tsan_inited || ! TsanIsTlsReady() || !cur_thread()->is_inited)

5. dyld will call this function when TSan's TLV gets initialized.
Watch your eyes!
static void tls_state_notify(dyld_tlv_states state, const dyld_tlv_info *info) {
  const bool is_allocated = state == dyld_tlv_state_allocated;
  if (!is_allocated) {
    // If this key is freed before thread_finalize, ensure we stop it now.
    uptr AlignedSize = RoundUpToAlignment(sizeof(ThreadState), 64);
    // FIXME: Is this working by chance?
    uptr A = (uptr)((char *)info->tlv_addr) + info->info_size;
    A = RoundUpToAlignment(A, 64);
    // assert( that this is the pointer we're looking for -- but how? )
    // FIXME: Can pthread_self() be used as a key to lookup original address
    // of &cur_thread_placeholder so that we can compare it with 'A'?
    ThreadState *thr = reinterpret_cast<ThreadState *>(A);
    TsanForceFinalize(thr);
    pthread_setspecific(tls_init_key, nullptr);
    return;
  }
  const bool V = is_placeholder_addr(info);
  if (!pthread_getspecific(tls_init_key) && V)
    pthread_setspecific(tls_init_key, (void *)is_allocated);
}
// FIXME: Ensure that this function is not called after TLV is freed!
// Taking the address will resurrect it, and infinitely recurse.
ALWAYS_INLINE static bool is_placeholder_addr(const dyld_tlv_info *info) {
  return (info->tlv_addr <= &cur_thread_placeholder) &&
         (char *)&cur_thread_placeholder <
             ((char *)info->tlv_addr + info->tlv_size);
}

I hope that this takes a much better shape, and lands to trunk soon! I
would be very happy, if you could subscribe me to your related
patches.

Ismail

On Tue, Oct 27, 2015 at 5:13 PM, Kuba Brecka <[hidden email]> wrote:

> Hi Ismail and Dmitry,
>
> I’ve been following this (and related) conversations.  We’ve had a recent
> discussion with Kostya about the issues with TLVs and interceptors on OS X,
> as I am also experimenting with porting TSan to work on OS X.
>
> While I understand that this patch is about TLV destruction, I’m seeing a
> more imminent issue, which is the initialization of TLV during process
> startup and during new thread creation.  The issue is very similar, because
> TSan’s interceptors get called *before* TLVs are initialized by dyld, and by
> accessing cur_thread_placeholder the program either crashes or infinitely
> recurses.  Do you already have some workaround for this as well?
>
> On which version of OS X are you running your tests?
>
> Thanks,
> Kuba
>
>
> On 26 Oct 2015, at 17:45, Dmitry Vyukov <[hidden email]> wrote:
>
> On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi
> <[hidden email]> wrote:
>
> Hi,
>
> I've submitted some of the small patches to get TSan running on Mac OS
> X. Time for the real challenge is approaching, and that needs some
> discussion.
>
> The biggest problem with TSan on Darwin is the way thread local
> variables are created and destroyed. TSan uses a thread-local variable
> called "cur_thread_placeholder" to store each thread's state. Thread
> locals are initialized lazily on Darwin.
>
> Darwin dynamic linker uses pthread keys to maintain lifetime and
> storage for thread local variables. pthread keys are destroyed in the
> order of their construction, not the opposite direction; from lowest
> index to highest index. Per POSIX standard, this order is unspecified,
> but dyld code (and comments) explain the order. We want TSan's
> thread-locals to live until the end of execution, because we'll be
> accessing "cur_thread_placeholder" when thread exits. However, since
> TSan is initialized too early, it's taken down earlier than other
> thread locals. This is problematic, because destructor of these
> late-destroyed thread-locals may still call functions in TSan RTL
> (interposed functions or functions called from instrumented code, e.g.
> __tsan_func_entry). That either immediately crashes, because TSan is
> in invalid state or tries to "resurrect" TLV, which generally creates
> an infinite recursion, and end up with stack overflow.
>
> As far as I understood, dynamic linker creates a pthread key acting
> like a class "destructor" (quoting to distinguish it from pthread's
> destructor). When dyld loads image, it creates another key, whose
> destructor function frees the storage. At thread exit, it will call
> destructor functions from first to last. Therefore, the first
> "destructor" destructor function will be called before "finalize"
> destructor. After finalize, storage is freed, but can be resurrected
> if address of TLV is taken.
>
> I have found a workaround (or a solution -- you name it) that aims to
> keep early initialization behavior of TSan TLV keys, but postpone
> their destruction as much as possible. It seems like the only way to
> do this is to assign highest pthread key indices to dyld's TLV code
> when it allocates thread-local variables for TSan, and preserve their
> order. Because dyld will create keys in the right order; we just want
> to move those keys to the end of allocatable range. We have N pthread
> keys available in the range, and we need to allocate three keys at the
> highest possible indices, and order them. Two of these keys are used
> by TLV code (in dyld), and one is used internally in TSan. In this
> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
> Destructor functions for these keys will be called in ascending order,
> and TSan will work as expected in this case.
>
> TSan wasn't interposing pthread_key_create, but I had to highjack it
> on Darwin to achieve the behavior I explained. First, it identifies
> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
> call "pthread_key_create_high" to allocate highest possible key
> indices. It's basically a 'while' loop until real pthread_key_create
> returns EAGAIN. It then frees all other keys it has previously
> allocated. This places TSan's TLV keys to the end of destructor list.
> There is one more thing this function does; it stores pointer to
> original destructor function (passed in to pthread_key_create), and
> replaces it with its own destructor function, which will call
> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
> times to postpone destruction of TSan TLV even further (IIRC, this was
> necessary). The replacement destructor function then calls the actual
> destructor function.
>
> As a side note, dyld uses libc++, and its loading the instrumented
> version. This didn't seem to be creating any problem so far. I didn't
> test this on a sufficiently large code base just yet. TSan tests
> generally pass, but they didn't have this "late-destroyed thread
> locals" problem I mentioned.
>
> Does my approach sound wrong, flawed or too brittle? Is there any
> other way to order these events in a sensible way? Can you help me
> fixing some of the problems I mentioned below, especially the
> 'identify_caller' function?
>
> The code is in a very dirty state now, so I refrain submitting it.
> But, no worries; there's always pseudo code! I'd like to share the key
> creation part, the rest of TSan patches are much simpler than this
> part.
>
> Thanks,
> Ismail
>
> #if SANITIZER_MAC
> enum { kPthHighIndexTlvInitializer = 0,
> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
> kPthHighIndexTsanFinalize,
> kPthHighIndexCount };
>
> // FIXME: This function is problematic. Can we avoid dladdr?
> static int identify_caller(uptr pc) {
>  if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>    return kPthHighIndexTsanFinalize;
>  // There are 2 "hidden" callers
>  Dl_info dli;
>  if (!dladdr((const void*)pc, &dli))
>    return -1;
>  else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>    return kPthHighIndexTlvInitializer;
>  else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>    return kPthHighIndexTlvLoadNotification;
>  return -1;
> }
>
> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>                 void (*dtor)(void*)) {
>  if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>    SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>    return REAL(pthread_key_create)(key, dtor);
>  }
>  const uptr caller_pc = GET_CALLER_PC();
>  const int caller_index = identify_caller(caller_pc);
>  return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>                            : pthread_key_create_high(key, dtor,
> caller_index);
> }
>
> // 'index' represents caller index so that we allocate them
> // in an order. We want to get the last N-index keys
> // in a sensible order.
> int __tsan::pthread_key_create_high(unsigned long *key,
>                                    void (*dtor)(void *), int index) {
>  int res = 0;
>  original_dtors[index] = dtor;
>  // Can be a function-local ulong[512]?
>  unsigned long lowest_key, highest_key;
>
>  // The following 2 loops may terminate early or later
>  // depending on 'index'.
>  do
>    res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>  while (res != EAGAIN);
>  for (; lowest_key < highest_key; ++lowest_key) {
>    pthread_key_delete(lowest_key);
>  }
>  tlv_keys[index] = highest_key;
>  return res;
> }
>
> // This will be called from _pthread_tsd_cleanup
> static void replacement_dtor(void *p) {
>  uptr cur_key;
>  unsigned index = kPthHighIndexCount;
>  // Can this make it through code review?
>  __asm__ __volatile__ ("" : "=b" (cur_key));
>  for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++)
> {
>    if (cur_key == tlv_keys[c]) {
>      index = c;
>      break;
>    }
>  }
>  if (--tls_dtor_counters[index] == 1) {
>    if (original_dtors[index])
>      original_dtors[index](p);
>    return;
>  }
>  pthread_setspecific(tlv_keys[index], p);
> }
> #endif
>
>
>
>
> Hi Ismail,
>
> I have a counter-proposal that I wanted to implement for some time,
> but never actually get to it because it was low-priority.
> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
> postponing. There is other smart code that does the same and that also
> breaks the assumption that tsan thread finalization runs last.
> Another problem was with asan on mac. Essentially what you see, but in
> asan it is easier to work around. We just ignore all memory accesses
> that come after our thread destructor.
>
> The idea that I had is relatively clean, reasonably
> platform-independent and should solve _all_ problems of such kind once
> and for all.
> Namely: we execute asan/tsan/msan thread destructor when the thread
> _join_ returns. At that point there is definitely no thread code
> running. The high-level implementation plan: for joinable threads we
> hook into join (we already intercept it); for detached threads we
> start a helper background thread when thread function returns and join
> the user thread from that helper thread. I suspect that actual
> implementation will be messier than that: for example, a thread can
> call pthread_detach in a pthread_specific destructor after we've
> decided to _not_ start a helper thread; or a thread can call
> pthread_exit and does not return from thread function.
>
> What do you think about this idea?
>
>
> Another implementation plan can be to always join all threads
> ourselves. Then in pthread_detach merely remember that the thread is
> detached, but don't call REAL(pthread_detach). Then when we join a
> thread check whether it is detached or not; if it is detached then we
> can discard all thread state; if it is non-detached then we need to
> remember that the thread has finished and its return value, wait for
> user pthread_join call and satisfy it from our state.
>
> I don't know which plan will lead to simpler and more reliable
> implementation.
>
> I didn't think thoroughly about this case. This sounds easier to
> implement, but I am not sure if it'll work correctly (i.e. will it
> miss problems that happen after TSan TLV dies, or will it report
> false-positives, because TSan TLV dies too early). Still, we need to
>
>
>
> Can we somehow get access to ThreadState from cur_thread() until we
> join the thread? That would be ideal and prevent false positives and
> negatives. The simplest way I can think of is to consult
> ThreadRegistriy using pthread_self() as the key to find current
> ThreadState if pthread_getspecific() returns NULL.
>
>
> change a few things in TSan so that it tolerates missing
> cur_thread_state. I forgot many details of TSan, I haven't been
> hacking it for a long time.
>
> To make it clear, what I don't like in your proposal: it is somewhat
> platform-dependent, should be enabled only on mac (thus less tested)
> and does not solve all potential problems of such kind (e.g. on some
> other OS there can be some other magical way to run user code after
> tsan thread destructor).
>
> I agree; it's best to avoid platform-specific workarounds.
>
> Ismail
>
>
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: TSan on Mac OS X -- workaround for delaying destructor of TSan TLV

Don Hinton via cfe-dev
In reply to this post by Don Hinton via cfe-dev
We can access cur_thread until _pthread_tsd_cleanup is called, which
is called after tsan finishes with the thread entry point, and returns
control to system. After that point, calling cur_thread will
"resurrect" that storage (re-allocated and filled with 0 -- like,
resurrecting in another body?). Accessing the storage through another
pointer aliasing now-freed cur_thread is undefined behavior (because
dyld frees the storage). Accessing cur_thread from within pthread_join
will work, because thread is still alive.

I've tried following so far, but not spent significant time on them.
1. Use malloc, not TLV. This wasn't an impressive way to resolve the
problem, and was also hacky. That required other machinery to make it
thread-safe and thread-specific. System's malloc is initialized a
little too late, IIRC. I also recall a problem with the instrumented
libc++ loaded early on by dyld. Finally, there's was a more serious
problem with freeing the storage; when and which function should free
it?
2. Keep using TLV as-is. Find "the right moment" and move cur_thread
to a dynamically allocated area that lives long enough. Use
pthread_self as key, because it's still valid. Unfortunately, moving
ThreadState didn't work, because there were non-moveable types along
the chain. Again, freeing malloc'ed storage will become a problem.

I gave up dynamic memory allocation for cur_thread, because I couldn't
resolve the problem with lifetime management of that storage. dyld
already manages that memory, so I thought working around its
limitations is a better solution.

Ismail

On Tue, Oct 27, 2015 at 1:45 AM, Dmitry Vyukov <[hidden email]> wrote:

> On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi
> <[hidden email]> wrote:
>>>>> Hi,
>>>>>
>>>>> I've submitted some of the small patches to get TSan running on Mac OS
>>>>> X. Time for the real challenge is approaching, and that needs some
>>>>> discussion.
>>>>>
>>>>> The biggest problem with TSan on Darwin is the way thread local
>>>>> variables are created and destroyed. TSan uses a thread-local variable
>>>>> called "cur_thread_placeholder" to store each thread's state. Thread
>>>>> locals are initialized lazily on Darwin.
>>>>>
>>>>> Darwin dynamic linker uses pthread keys to maintain lifetime and
>>>>> storage for thread local variables. pthread keys are destroyed in the
>>>>> order of their construction, not the opposite direction; from lowest
>>>>> index to highest index. Per POSIX standard, this order is unspecified,
>>>>> but dyld code (and comments) explain the order. We want TSan's
>>>>> thread-locals to live until the end of execution, because we'll be
>>>>> accessing "cur_thread_placeholder" when thread exits. However, since
>>>>> TSan is initialized too early, it's taken down earlier than other
>>>>> thread locals. This is problematic, because destructor of these
>>>>> late-destroyed thread-locals may still call functions in TSan RTL
>>>>> (interposed functions or functions called from instrumented code, e.g.
>>>>> __tsan_func_entry). That either immediately crashes, because TSan is
>>>>> in invalid state or tries to "resurrect" TLV, which generally creates
>>>>> an infinite recursion, and end up with stack overflow.
>>>>>
>>>>> As far as I understood, dynamic linker creates a pthread key acting
>>>>> like a class "destructor" (quoting to distinguish it from pthread's
>>>>> destructor). When dyld loads image, it creates another key, whose
>>>>> destructor function frees the storage. At thread exit, it will call
>>>>> destructor functions from first to last. Therefore, the first
>>>>> "destructor" destructor function will be called before "finalize"
>>>>> destructor. After finalize, storage is freed, but can be resurrected
>>>>> if address of TLV is taken.
>>>>>
>>>>> I have found a workaround (or a solution -- you name it) that aims to
>>>>> keep early initialization behavior of TSan TLV keys, but postpone
>>>>> their destruction as much as possible. It seems like the only way to
>>>>> do this is to assign highest pthread key indices to dyld's TLV code
>>>>> when it allocates thread-local variables for TSan, and preserve their
>>>>> order. Because dyld will create keys in the right order; we just want
>>>>> to move those keys to the end of allocatable range. We have N pthread
>>>>> keys available in the range, and we need to allocate three keys at the
>>>>> highest possible indices, and order them. Two of these keys are used
>>>>> by TLV code (in dyld), and one is used internally in TSan. In this
>>>>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use.
>>>>> Destructor functions for these keys will be called in ascending order,
>>>>> and TSan will work as expected in this case.
>>>>>
>>>>> TSan wasn't interposing pthread_key_create, but I had to highjack it
>>>>> on Darwin to achieve the behavior I explained. First, it identifies
>>>>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I
>>>>> call "pthread_key_create_high" to allocate highest possible key
>>>>> indices. It's basically a 'while' loop until real pthread_key_create
>>>>> returns EAGAIN. It then frees all other keys it has previously
>>>>> allocated. This places TSan's TLV keys to the end of destructor list.
>>>>> There is one more thing this function does; it stores pointer to
>>>>> original destructor function (passed in to pthread_key_create), and
>>>>> replaces it with its own destructor function, which will call
>>>>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1
>>>>> times to postpone destruction of TSan TLV even further (IIRC, this was
>>>>> necessary). The replacement destructor function then calls the actual
>>>>> destructor function.
>>>>>
>>>>> As a side note, dyld uses libc++, and its loading the instrumented
>>>>> version. This didn't seem to be creating any problem so far. I didn't
>>>>> test this on a sufficiently large code base just yet. TSan tests
>>>>> generally pass, but they didn't have this "late-destroyed thread
>>>>> locals" problem I mentioned.
>>>>>
>>>>> Does my approach sound wrong, flawed or too brittle? Is there any
>>>>> other way to order these events in a sensible way? Can you help me
>>>>> fixing some of the problems I mentioned below, especially the
>>>>> 'identify_caller' function?
>>>>>
>>>>> The code is in a very dirty state now, so I refrain submitting it.
>>>>> But, no worries; there's always pseudo code! I'd like to share the key
>>>>> creation part, the rest of TSan patches are much simpler than this
>>>>> part.
>>>>>
>>>>> Thanks,
>>>>> Ismail
>>>>>
>>>>> #if SANITIZER_MAC
>>>>> enum { kPthHighIndexTlvInitializer = 0,
>>>>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor,
>>>>> kPthHighIndexTsanFinalize,
>>>>> kPthHighIndexCount };
>>>>>
>>>>> // FIXME: This function is problematic. Can we avoid dladdr?
>>>>> static int identify_caller(uptr pc) {
>>>>>   if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey)
>>>>>     return kPthHighIndexTsanFinalize;
>>>>>   // There are 2 "hidden" callers
>>>>>   Dl_info dli;
>>>>>   if (!dladdr((const void*)pc, &dli))
>>>>>     return -1;
>>>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15))
>>>>>     return kPthHighIndexTlvInitializer;
>>>>>   else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21))
>>>>>     return kPthHighIndexTlvLoadNotification;
>>>>>   return -1;
>>>>> }
>>>>>
>>>>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key,
>>>>>                  void (*dtor)(void*)) {
>>>>>   if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) {
>>>>>     SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor);
>>>>>     return REAL(pthread_key_create)(key, dtor);
>>>>>   }
>>>>>   const uptr caller_pc = GET_CALLER_PC();
>>>>>   const int caller_index = identify_caller(caller_pc);
>>>>>   return -1 == caller_index ? REAL(pthread_key_create)(key, dtor)
>>>>>                             : pthread_key_create_high(key, dtor, caller_index);
>>>>> }
>>>>>
>>>>> // 'index' represents caller index so that we allocate them
>>>>> // in an order. We want to get the last N-index keys
>>>>> // in a sensible order.
>>>>> int __tsan::pthread_key_create_high(unsigned long *key,
>>>>>                                     void (*dtor)(void *), int index) {
>>>>>   int res = 0;
>>>>>   original_dtors[index] = dtor;
>>>>>   // Can be a function-local ulong[512]?
>>>>>   unsigned long lowest_key, highest_key;
>>>>>
>>>>>   // The following 2 loops may terminate early or later
>>>>>   // depending on 'index'.
>>>>>   do
>>>>>     res = REAL(pthread_key_create)(&highest_key, replacement_dtor);
>>>>>   while (res != EAGAIN);
>>>>>   for (; lowest_key < highest_key; ++lowest_key) {
>>>>>     pthread_key_delete(lowest_key);
>>>>>   }
>>>>>   tlv_keys[index] = highest_key;
>>>>>   return res;
>>>>> }
>>>>>
>>>>> // This will be called from _pthread_tsd_cleanup
>>>>> static void replacement_dtor(void *p) {
>>>>>   uptr cur_key;
>>>>>   unsigned index = kPthHighIndexCount;
>>>>>   // Can this make it through code review?
>>>>>   __asm__ __volatile__ ("" : "=b" (cur_key));
>>>>>   for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; c++) {
>>>>>     if (cur_key == tlv_keys[c]) {
>>>>>       index = c;
>>>>>       break;
>>>>>     }
>>>>>   }
>>>>>   if (--tls_dtor_counters[index] == 1) {
>>>>>     if (original_dtors[index])
>>>>>       original_dtors[index](p);
>>>>>     return;
>>>>>   }
>>>>>   pthread_setspecific(tlv_keys[index], p);
>>>>> }
>>>>> #endif
>>>>
>>>>
>>>>
>>>> Hi Ismail,
>>>>
>>>> I have a counter-proposal that I wanted to implement for some time,
>>>> but never actually get to it because it was low-priority.
>>>> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1
>>>> postponing. There is other smart code that does the same and that also
>>>> breaks the assumption that tsan thread finalization runs last.
>>>> Another problem was with asan on mac. Essentially what you see, but in
>>>> asan it is easier to work around. We just ignore all memory accesses
>>>> that come after our thread destructor.
>>>>
>>>> The idea that I had is relatively clean, reasonably
>>>> platform-independent and should solve _all_ problems of such kind once
>>>> and for all.
>>>> Namely: we execute asan/tsan/msan thread destructor when the thread
>>>> _join_ returns. At that point there is definitely no thread code
>>>> running. The high-level implementation plan: for joinable threads we
>>>> hook into join (we already intercept it); for detached threads we
>>>> start a helper background thread when thread function returns and join
>>>> the user thread from that helper thread. I suspect that actual
>>>> implementation will be messier than that: for example, a thread can
>>>> call pthread_detach in a pthread_specific destructor after we've
>>>> decided to _not_ start a helper thread; or a thread can call
>>>> pthread_exit and does not return from thread function.
>>>>
>>>> What do you think about this idea?
>>>
>>> Another implementation plan can be to always join all threads
>>> ourselves. Then in pthread_detach merely remember that the thread is
>>> detached, but don't call REAL(pthread_detach). Then when we join a
>>> thread check whether it is detached or not; if it is detached then we
>>> can discard all thread state; if it is non-detached then we need to
>>> remember that the thread has finished and its return value, wait for
>>> user pthread_join call and satisfy it from our state.
>>>
>>> I don't know which plan will lead to simpler and more reliable implementation.
>>>
>> I didn't think thoroughly about this case. This sounds easier to
>> implement, but I am not sure if it'll work correctly (i.e. will it
>> miss problems that happen after TSan TLV dies, or will it report
>> false-positives, because TSan TLV dies too early). Still, we need to
>
>
> Can we somehow get access to ThreadState from cur_thread() until we
> join the thread? That would be ideal and prevent false positives and
> negatives. The simplest way I can think of is to consult
> ThreadRegistriy using pthread_self() as the key to find current
> ThreadState if pthread_getspecific() returns NULL.
>
>
>> change a few things in TSan so that it tolerates missing
>> cur_thread_state. I forgot many details of TSan, I haven't been
>> hacking it for a long time.
>>
>>> To make it clear, what I don't like in your proposal: it is somewhat
>>> platform-dependent, should be enabled only on mac (thus less tested)
>>> and does not solve all potential problems of such kind (e.g. on some
>>> other OS there can be some other magical way to run user code after
>>> tsan thread destructor).
>> I agree; it's best to avoid platform-specific workarounds.
>>
>> Ismail
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev