Issue 8416 - race requiring use of pthread_self in sal/osl/unx/thread.c
Summary: race requiring use of pthread_self in sal/osl/unx/thread.c
Status: CLOSED FIXED
Alias: None
Product: porting
Classification: Code
Component: code (show other issues)
Version: OOo 1.0.1
Hardware: All Other OS
: P3 Trivial (vote)
Target Milestone: ---
Assignee: matthias.huetsch
QA Contact: issues@porting
URL:
Keywords: oooqa
Depends on: 8861
Blocks:
  Show dependency tree
 
Reported: 2002-10-16 19:53 UTC by khendricks
Modified: 2002-11-13 21:30 UTC (History)
1 user (show)

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description khendricks 2002-10-16 19:53:25 UTC
Hi,

Here is the information regarding the race issue as taken from the 
dev@porting.openoffice.org mailing list.


Hi,

Yes, I believe from eyeballing the code in sal/osl/inc/thread.hxx and looking at the the 
run() method uses that extends oslThread in the "store" module in stordmon.cxx, I think 
we need to worry about the same problem in two other places:

The first is in osl_setThreadPriority where the newly created thread tries to set itself to the 
lowest priority it can (not used on all platforms) and in the osl_scheduleThread.

Lukcily there is an assert in osl_scheduleThread which should have fired had this race 
been the culprit,


      OSL_ASSERT(pthread_equal(pThreadImpl->m_hThread, pthread_self()));

It is hard to say which of the osl/unx/thread.c functions might be invoked by a thread on 
itself but the following routines seem to reference m_hThread in some way:

oslCleanupFunction
osl_destroyThread
osl_suspendThread
osl_setPriority
osl_getPriority
osl_joinWithThread
osl_scheduleThread


I will copy all of this mail to a new issuezilla and assign it to you.

Thanks,

Kevin

On Wednesday, October 16, 2002, at 02:36 PM, Matthias Hütsch wrote:

Hi Kevin,

Yes, now that you've pointed my nose to it, that's indeed a race condition. And yes, we 
should take care of that.

Re-reading the 'idlc segfault - the final word' thread, I agree that this possible race is most 
probably not guilty for that segfault. But to be sure someone would need to test that 
again, this time using the proposed fix of 'pthread_detach(pthread_self())' instead of have 
the call commented out.

But regardless of what might come out of that test, could you please file an issue against 
the porting project about the race condition? You could assign initial ownership to me 
(mhu@openoffice.org).

Thanks,
Matthias

Kevin.Hendricks wrote:
Hi,
I have asked Apple to help track down the pthread_detach problem we are seeing when 
using MacOSX 10.2.X
Matt Watson of Apple has been trying to help and he sent me the following report of a 
race from in which a newly created thread should not be using pThreadImpl->
m_hThread because its value need not be set properly until the call to pthread_create 
actually returns (and the scheduler may in fact schedule the newly created thread first).
We should instead be setting that value in the first routine invoked by the new thread 
(oslWorkerWrapperFunction) via a call to pthread_self()  or instead using pthread_self() 
in any call done by the newly created thread instead of a value that may not have been 
successfully set yet.
We do exactly this in the oslCleanupXXXXX code that is invoked just before the thread 
itself exits when the newly created thread calls pthread_detach on itself.
Will someone from Hamburg please look into this and let me know how you want to 
handle this potential race (use pthread_self() in all calls done by the newly created 
thread or use pthread_self to set it once in the oslWorkerWrapperFunction
I do not think this race is causing the segfault we saw using MacOSX 10.2 since both 
threads were active at the same time but it is a problem we should take care of I think.
Thanks,
Kevin
Begin forwarded message:
Date: Wed Oct 16, 2002  11:36:02 AM America/Montreal
To: Kevin Hendricks <khendricks@ivey.uwo.ca>
Subject: Re: [Fwd: Apple gcc 3.1 and OpenOffice.org]

(Note trimmed recipient list)

This is the race:

        if ( (  nRet=pthread_create((pthread_t*)&pThreadImpl->m_hThread, /* receives  
thread data */
                               PTHREAD_ATTR_DEFAULT,
                               oslWorkerWrapperFunction,
                               (void*)pThreadImpl) ) != 0 )
        {


pThreadImpl->m_hThread is a shared resource that is not guaranteed to  be valid until 
pthread_create() returns in the *creating* thread.

Here's a comp.programming.threads article from *the* authority on  pthreads that may 
help explain this:

http://groups.google.com/ groups?selm=35B4A903.94E079AC%40zko.dec.com&output=
gplain

Consider the following compliant, hypothetical implementation of  pthread_create():

int pthread_create(pthread_t *t, pthread_attr_t *attr, void  *(*func)(void *), void *arg)
{

    struct __pthread newthread;
    newthread.arg = arg;
    newthread.func = func;
     kernel_thread_start(newthread);
    /* Give new thread time to start, run and exit */
    sleep(10);
    *t = newthread;
    return 0;
}

In fact, it may never have any reliable value in the created thread at  all if the kernel 
happens to schedule the new thread to start, run and  finish before the calling thread is 
scheduled again. This could easily  happen on MP, for instance. One way to use a 
cached pthread_t in the  created thread is to have a private copy that gets filled in at the  
beginning of the wrapper function:

    static void* oslWorkerWrapperFunction(void* pData)
    {
        osl_TThreadImpl* pThreadImpl= (osl_TThreadImpl*)pData;

        pThreadImpl->m_hthread_self = pthread_self();

Another way is to just use pthread_self() in the calling thread:

        if ((attached = pThreadImpl->m_Flags & THREADIMPL_FLAGS_ATTACHED) > 0)
        {
            pThreadImpl->m_Flags &= ~THREADIMPL_FLAGS_ATTACHED;

            pthread_detach(pthread_self());
        }

(This would need to be done for all uses of pThreadImpl->m_hthread in  the created 
thread...)

matt.
Comment 1 matthias.huetsch 2002-10-16 20:01:21 UTC
Accepted...
Comment 2 khendricks 2002-10-21 23:44:05 UTC
Hi, 
 
As instructed, here is the summary about the pthread_join / 
pthread_detach issue. 
 
Thanks!   
 
Kevin 
 
  
Again based on comments from Matt Watson of Apple (Thank you Matt!), 
we 
seem to be doing a no-no in that we are using pthread_join to wait 
for a 
worker thread that has called pthread_detach on itself. 
 
From the pthread_join man page: 
 
       The  joined  thread  th  must be in the joinable state: it 
       must not have been detached using pthread_detach(3) or the 
       PTHREAD_CREATE_DETACHED attribute to pthread_create(3). 
 
 
If you look in store/source/stodmon.cxx you can see the 
That the OStorePageDaemon_Impl destructor (the OStorePageDaemon_Impl 
extends the osl::Thread class and defines its own run() method 
(similar to 
how the Java Thread class and run() method are typically used.) 
 
/* 
 * ~OStorePageDaemon_Impl. 
 */ 
OStorePageDaemon_Impl::~OStorePageDaemon_Impl (void) 
{ 
        if (isRunning()) 
        { 
                // Terminate. 
                terminate(); 
                m_aCond.set(); 
 
                // Join. 
                if (getCurrentIdentifier() != getIdentifier()) 
                        join(); 
        } 
} 
 
The destructor invokes terminate on other thread in "run" which is 
looping 
here: 
 
 /* 
 * run. 
 */ 
void SAL_CALL OStorePageDaemon_Impl::run (void) 
{ 
        setPriority (osl_Thread_PriorityBelowNormal); 
        while (schedule()) 
        { 
                TimeValue tmo; 
                tmo.Seconds = 10; 
                tmo.Nanosec = 0; 
 
                m_aCond.wait(&tmo); 
                flush(); 
                m_aCond.reset(); 
        } 
} 
 
 
osl_terminateThread then properly sets the flag for the "run" thread 
and 
 it exits the run() method and returns to the 
oslWorkerWrapperFunction 
 (see sal/osl/unx/thread.c) which in turn does a 
pthread_cleanup_pop(1) 
 which results in the oslCleanupFunction() being invoked which in 
turn 
 does a pthread_detach  on itself here: 
 
static void oslCleanupFunction(void* pData) 
{ 
        sal_Bool attached; 
 
        osl_TThreadImpl* pThreadImpl= (osl_TThreadImpl*)pData; 
 
        pthread_mutex_lock(&pThreadImpl->m_HandleLock); 
        pThreadImpl->m_Flags &= ~THREADIMPL_FLAGS_ACTIVE; 
 
        if ((attached = pThreadImpl->m_Flags & 
THREADIMPL_FLAGS_ATTACHED) 
 > 0) 
        { 
                pThreadImpl->m_Flags &= ~THREADIMPL_FLAGS_ATTACHED; 
 
                pthread_detach(pThreadImpl->m_hThread); 
        } 
... 
 
But the main thread (not the thread created to execute in the run() 
method 
is still in the destructor after terminate and is in a race to 
against the 
thread from "run()" exiting.  If the "run" thread exists first all 
is 
fine. 
 
If the thread in the OStorePageDaemon_Impl destructor gets to its 
join() 
 
              // Join. 
                if (getCurrentIdentifier() != getIdentifier()) 
                        join(); 
 
which invokes osl_joinWithThread()  first we end up doing a 
pthread_join 
 to wait for a thread that has run pthread_detach on itself. 
 
In fact my trace posted back on September 3rd shows exactly this: 
 
http://porting.openoffice.org/servlets/ReadMsg?msgId=396374&listName=dev 
 
Here is a snippet that shows what is going on. 
 
in OStorePageDaemon release - about to delete 0x4c2f00 
in ~OStorePageDaemon_Impl 
in ~OStorePageDaemon_Impl before join 
in ~OStorePageDaemon_Impl doing join 
OStorePageDaemon_Impl run returning 
OStorePageDaemon_Impl in on_Terminated 
right before pthread_cleanup_pop 
in oslCleanupFunction - about to detach 
in removeThreadID with initial pEntry 0x4c2df0 
in removeThreadID about to free 0x4c2df0 
this is the end my friends 
mach_port_deallocate(kernel_thread) failed: (os/kern) invalid name 
call th pthread_join returned 0 
 
Since we really want to do a "join" in the OStorePageDaemon_Impl 
 destructor we really should somehow not be allowing the thread to 
do the 
pthread_cleanup_pop(1) which invokes oslCleanupFunction and 
eventually 
pthread_detach. 
 
This is why commenting out pthread_detach in the oslCleanupFunction 
seems 
to work around the problem. 
 
I am not sure exactly how to tell sal that we want to "join" with a 
worker 
thread and not to do the pthread_detach which you defintely want to 
do if 
you don't care about the exit value or if the thread that would want 
to do 
the join (to wait) has somehow exited or terminated itself. 
 
I am not sure this causing the problem we are seeing under MacOSX 
10.2 but 
it is definitely a sequence of events that should not be happening 
given 
the pthread_join main page and my pthreads manual. 
 
Please let me know if you want me to file an issuezilla with this 
e-mail 
attached or let me know how you want to indicate that you want to 
wait on 
the worker thread and I can try to create a patch for sal that 
allows 
that. 
 
Thanks, 
 
Kevin 
 
 
Comment 3 matthias.huetsch 2002-10-31 17:53:16 UTC
With the exception that the originally suspected race condition turned
out to not exist (the 'child' thread signals it's startup to the
'parent' thread waiting on a condition; so access to the handle is
synchronized), careful code review (thanks Tino for helping) revealed
several other issues with this piece of code...

----------------------------
revision 1.23
date: 2002/10/30 21:05:08;  author: mhu;  state: Exp;  lines: +672 -682
#i8416# Fixed 'pthread_join()' / 'pthread_detach()' race condition,
several other race conditions, deadlock in 'osl_scheduleThread()' going
into suspend vs. 'osl_resumeThread()' using a redundant 2nd mutex,
static initialization not being thread safe, wrong 'spurious wakeup
protection' loops around 'pthread_cond_wait()' not checking a flag.
Added cleanup handlers to automatic cancellation points 'pthread_join()'
and 'pthread_cond_wait()'. Partial source cleanup.
----------------------------

As there was one race left, revision 1.24 contains the complete fix.
Comment 4 matthias.huetsch 2002-11-05 15:59:30 UTC
Fix confirmed on MacOSX, LinuxIntel and SolarisSparc, and none of the
other porting project complained -> resolved as fixed.

File 'porting/sal/osl/unx/thread.c' revision 1.24 needs to be tagged
for OOO_STABLE_1, OOO_STABLE_1_PORTS and OO643C.
Comment 5 matthias.huetsch 2002-11-09 18:39:28 UTC
Tagging confirmed (8861), closing...