From owner-freebsd-hackers@FreeBSD.ORG Wed Jan 11 16:47:13 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6ED49106564A; Wed, 11 Jan 2012 16:47:13 +0000 (UTC) (envelope-from yanegomi@gmail.com) Received: from mail-tul01m020-f182.google.com (mail-tul01m020-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 1A8178FC15; Wed, 11 Jan 2012 16:47:12 +0000 (UTC) Received: by obbwd18 with SMTP id wd18so1739781obb.13 for ; Wed, 11 Jan 2012 08:47:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=ZlZxdgcZCYqFJbR6cNmxTndqEAOaBTmcTddIrRSPXrg=; b=F1sO1ZCMjJ3ayB2TJ2r1TOXQxVUOpT9PXwG1VGQe0cKjTgzHdSb9JQCF7VM/8UZkd7 qNdEQWLug97CCRS5WuQEY3fw/gy5dzymTwP1l0LJ6ZCRstAg1tAF0WnIEhsYJxXxSCwX U6tq76EpRu9Br//wMyws8B6NfX6GjQzjqHqIc= MIME-Version: 1.0 Received: by 10.182.49.106 with SMTP id t10mr22959001obn.49.1326300432484; Wed, 11 Jan 2012 08:47:12 -0800 (PST) Received: by 10.182.152.6 with HTTP; Wed, 11 Jan 2012 08:47:12 -0800 (PST) In-Reply-To: References: <201201110806.30620.jhb@freebsd.org> Date: Wed, 11 Jan 2012 08:47:12 -0800 Message-ID: From: Garrett Cooper To: Ivan Voras Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org, Xin LI , davidxu@freebsd.org Subject: Re: sem(4) lockup in python? X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 11 Jan 2012 16:47:13 -0000 On Wed, Jan 11, 2012 at 6:33 AM, Ivan Voras wrote: > On 11 January 2012 14:06, John Baldwin wrote: >> On Wednesday, January 11, 2012 6:21:18 am Ivan Voras wrote: >>> The lang/python27 port can optionally be built with the support for >>> POSIX semaphores - i.e. sem(4). This option is labeled as experimental >>> so it may be that the code is simply incorrect. I've tried it and get >>> frequent hangs with the python process in the "usem" state. The kernel >>> stack is as follows and looks reasonable: >>> >>> # procstat -kk 19008 >>> =A0 =A0PID =A0 =A0TID COMM =A0 =A0 =A0 =A0 =A0 =A0 TDNAME =A0 =A0 =A0 = =A0 =A0 KSTACK >>> >>> 19008 101605 python =A0 =A0 =A0 =A0 =A0 - =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0mi_switch+0x174 >>> sleepq_catch_signals+0x2f4 sleepq_wait_sig+0x16 _sleep+0x269 >>> do_sem_wait+0xa19 __umtx_op_sem_wait+0x51 amd64_syscall+0x450 >>> Xfast_syscall+0xf7 >>> >>> The process doesn't react to SIGINT or SIGTERM but fortunately reacts t= o >>> SIGKILL. >>> >>> This could be an error in Python code but OTOH this code is not >>> FreeBSD-specific so it's unlikely. >> >> This is using the new umtx-based semaphore code that David Xu wrote. =A0= He is >> probably the best person to ask (cc'd). >> > > Ok, I've encountered the problem repeatedly while building databases/tdb: > =A0it uses Python in the build process (but maybe it needs something else= in > parallel to provoke the problem). Glad to see that iXsystems isn't the only one ([1] -- please add a "me too" to the PR). The problem is that we do FreeNAS nightlies and they frequently get stuck building tdb (10%~20% of the time) and it sticks when doing interactive builds as well. The issue appears to be exacerbated when we have more builds running in parallel on the same machine. I've also run into the same issue compiling talloc because it uses the same waf infrastructure as tdb, which was designed to "speed things up by forcing builds to be parallelized" (It builds kern.smp.ncpus jobs instead of -j 1). Furthermore, it seems to occur regardless of whether or not we have the WITH_SEM enabled in python or not (build.ix's copy of python doesn't have it enabled, but streetfighter.ix, my system bayonetta, etc do). I haven't actually enabled WITNESS or the deadlock resolver and checked for LORs / deadlocks, but that might be an alternate avenue to pursue in debugging the issue; my gut is that the issue exists within the code that handles the subprocessing stuff and/or the GIL stuff in the python interpreter and that the race condition between a command actually finishing and not is relatively small (in most cases) and in most cases python's code wins and continues on as usual. It could also be some non-threadsafe code trying to run in parallel touching things that it shouldn't in the python interpreter. It would also be interesting to see what python3k brings to the table, but using that would be introducing some extra unknowns into the equation. It can be reproduced by running continuous builds of talloc or tdb. Thanks! -Garrett 1. http://www.freebsd.org/cgi/query-pr.cgi?pr=3Dports/163489