Date:      Wed, 11 Jan 2012 08:47:12 -0800
From:      Garrett Cooper <yanegomi@gmail.com>
To:        Ivan Voras <ivoras@freebsd.org>
Cc:        freebsd-hackers@freebsd.org, Xin LI <delphij@delphij.net>, davidxu@freebsd.org
Subject:   Re: sem(4) lockup in python?
Message-ID:  <CAGH67wRsek2-WY_ETW6QEER1r5dDXLXfDjbzpHMjtv059Y8cJw@mail.gmail.com>
In-Reply-To: <CAF-QHFWFvYTPeM68Mk+OYVX--MNhKOJ2o1GF9ZOsBmtiC5fYFQ@mail.gmail.com>
References:  <jejrbe$or8$1@dough.gmane.org> <201201110806.30620.jhb@freebsd.org> <CAF-QHFWFvYTPeM68Mk+OYVX--MNhKOJ2o1GF9ZOsBmtiC5fYFQ@mail.gmail.com>

On Wed, Jan 11, 2012 at 6:33 AM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 11 January 2012 14:06, John Baldwin <jhb@freebsd.org> wrote:
>> On Wednesday, January 11, 2012 6:21:18 am Ivan Voras wrote:
>>> The lang/python27 port can optionally be built with the support for
>>> POSIX semaphores - i.e. sem(4). This option is labeled as experimental
>>> so it may be that the code is simply incorrect. I've tried it and get
>>> frequent hangs with the python process in the "usem" state. The kernel
>>> stack is as follows and looks reasonable:
>>>
>>> # procstat -kk 19008
>>>   PID    TID COMM             TDNAME             KSTACK
>>>
>>> 19008 101605 python           -                  mi_switch+0x174
>>> sleepq_catch_signals+0x2f4 sleepq_wait_sig+0x16 _sleep+0x269
>>> do_sem_wait+0xa19 __umtx_op_sem_wait+0x51 amd64_syscall+0x450
>>> Xfast_syscall+0xf7
>>>
>>> The process doesn't react to SIGINT or SIGTERM but fortunately reacts to
>>> SIGKILL.
>>>
>>> This could be an error in Python code but OTOH this code is not
>>> FreeBSD-specific so it's unlikely.
>>
>> This is using the new umtx-based semaphore code that David Xu wrote.  He is
>> probably the best person to ask (cc'd).
>>
>
> Ok, I've encountered the problem repeatedly while building databases/tdb:
> it uses Python in the build process (but maybe it needs something else in
> parallel to provoke the problem).

Glad to see that iXsystems isn't the only one hitting this ([1] -- please
add a "me too" to the PR). We do FreeNAS nightlies and they frequently get
stuck building tdb (10%~20% of the time), and it sticks when doing
interactive builds as well. The issue appears to be exacerbated when we
have more builds running in parallel on the same machine. I've also run
into the same issue compiling talloc, because it uses the same waf
infrastructure as tdb, which was designed to "speed things up by forcing
builds to be parallelized" (it builds kern.smp.ncpus jobs instead of -j 1).
Furthermore, it seems to occur regardless of whether WITH_SEM is enabled in
python (build.ix's copy of python doesn't have it enabled, but
streetfighter.ix, my system bayonetta, etc. do).
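
For reference, the load pattern I'm describing is roughly the following
(just a sketch, not waf's actual code: the child command and the job
multiplier are made up, and the pool should only end up on the sem(4)/usem
path on a WITH_SEM python):

#!/usr/bin/env python2.7
# Rough illustration of a waf-style parallel build: one worker per CPU, each
# shelling out to a short-lived child, with multiprocessing coordinating the
# workers (multiprocessing's locks go through POSIX semaphores when python
# is built with sem support).
import multiprocessing
import subprocess

def job(i):
    # Stand-in for a compile step; any short external command will do.
    subprocess.call(["true"])
    return i

if __name__ == "__main__":
    ncpus = multiprocessing.cpu_count()     # roughly the CPU count waf uses
    pool = multiprocessing.Pool(ncpus)
    print pool.map(job, range(ncpus * 4))   # a few jobs per CPU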

I haven't actually enabled WITNESS or the deadlock resolver and checked for
LORs / deadlocks, but that might be an alternate avenue to pursue in
debugging the issue. My gut feeling is that the problem lives in the
subprocess handling and/or GIL code in the python interpreter, and that the
race window between a command actually finishing and python noticing is
relatively small, so in most cases python's code wins and continues on as
usual. It could also be some non-threadsafe code running in parallel and
touching things it shouldn't in the python interpreter. It would also be
interesting to see what python3k brings to the table, but using it would
introduce some extra unknowns into the equation.
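
If someone wants to poke at that theory, a stress loop along these lines
(pure speculation, not a confirmed reproducer; the thread and iteration
counts are arbitrary) keeps the subprocess-reaping path and a
multiprocessing semaphore busy at the same time:

#!/usr/bin/env python2.7
# Speculative stress test: several threads repeatedly acquire/release a
# multiprocessing.Semaphore (backed by the umtx-based POSIX semaphores being
# discussed, on a WITH_SEM python) while spawning and reaping children.
import threading
import subprocess
import multiprocessing

sem = multiprocessing.Semaphore(1)

def hammer(iterations):
    for _ in range(iterations):
        with sem:                      # sem_wait()/sem_post() under the hood
            subprocess.call(["true"])  # fork/exec + wait while holding it

threads = [threading.Thread(target=hammer, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "no hang this run"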

It can be reproduced by running continuous builds of talloc or tdb.
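
A dumb driver loop along these lines should do it (the port paths, make
targets and iteration count are just placeholders; adjust for the local
tree):

#!/usr/bin/env python2.7
# Rebuild the two waf-based ports over and over; if an iteration wedges,
# `procstat -kk` on the stuck python should show the do_sem_wait /
# __umtx_op_sem_wait stack quoted above.
import subprocess

PORTS = ["/usr/ports/databases/tdb", "/usr/ports/devel/talloc"]

for i in range(100):
    for port in PORTS:
        # check_call raises if the build itself fails, which keeps genuine
        # build breakage distinct from the hang we're chasing.
        subprocess.check_call(["make", "-C", port, "clean", "build"])
    print "iteration", i, "done"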

Thanks!
-Garrett

1. http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/163489


