From owner-freebsd-hackers@FreeBSD.ORG Wed Jan 11 16:51:17 2012 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 449601065679 for ; Wed, 11 Jan 2012 16:51:17 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-iy0-f182.google.com (mail-iy0-f182.google.com [209.85.210.182]) by mx1.freebsd.org (Postfix) with ESMTP id F14D28FC08 for ; Wed, 11 Jan 2012 16:51:16 +0000 (UTC) Received: by iazz13 with SMTP id z13so2029097iaz.13 for ; Wed, 11 Jan 2012 08:51:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=la+MZWQ6NQ9GRlNje7F2FM90pegV6TCE0x6WUlGj4KM=; b=AXi8my+IG77MjXXPrE+/7CIDLLfPwhVHTps5CmxdxWmAZlZMmMFvyF1Bgj0qrX2SCu vNuk6/dHHkcQr/69jxs1JLBekXDtITGQJPwpKFKXoXMh5qNFKyssFxgj6wGjkJkpL4za ULr4XRH8uGnxwiRE1e+Ijq60nshH0i3BxcCOE= MIME-Version: 1.0 Received: by 10.42.246.71 with SMTP id lx7mr6163927icb.54.1326300676292; Wed, 11 Jan 2012 08:51:16 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.42.243.65 with HTTP; Wed, 11 Jan 2012 08:51:16 -0800 (PST) In-Reply-To: References: <201201110806.30620.jhb@freebsd.org> Date: Wed, 11 Jan 2012 08:51:16 -0800 X-Google-Sender-Auth: wN40Ni6dDyRnxagqftfL9cUxbHI Message-ID: From: Adrian Chadd To: Garrett Cooper Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-hackers@freebsd.org, Xin LI , Ivan Voras , davidxu@freebsd.org Subject: Re: sem(4) lockup in python? X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 11 Jan 2012 16:51:17 -0000 ... yes, enable WITNESS already and see if you can find LORs. :) (Sheesh, that's what it's for! :) Adrian On 11 January 2012 08:47, Garrett Cooper wrote: > On Wed, Jan 11, 2012 at 6:33 AM, Ivan Voras wrote: >> On 11 January 2012 14:06, John Baldwin wrote: >>> On Wednesday, January 11, 2012 6:21:18 am Ivan Voras wrote: >>>> The lang/python27 port can optionally be built with the support for >>>> POSIX semaphores - i.e. sem(4). This option is labeled as experimental >>>> so it may be that the code is simply incorrect. I've tried it and get >>>> frequent hangs with the python process in the "usem" state. The kernel >>>> stack is as follows and looks reasonable: >>>> >>>> # procstat -kk 19008 >>>> =A0 =A0PID =A0 =A0TID COMM =A0 =A0 =A0 =A0 =A0 =A0 TDNAME =A0 =A0 =A0 = =A0 =A0 KSTACK >>>> >>>> 19008 101605 python =A0 =A0 =A0 =A0 =A0 - =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0mi_switch+0x174 >>>> sleepq_catch_signals+0x2f4 sleepq_wait_sig+0x16 _sleep+0x269 >>>> do_sem_wait+0xa19 __umtx_op_sem_wait+0x51 amd64_syscall+0x450 >>>> Xfast_syscall+0xf7 >>>> >>>> The process doesn't react to SIGINT or SIGTERM but fortunately reacts = to >>>> SIGKILL. >>>> >>>> This could be an error in Python code but OTOH this code is not >>>> FreeBSD-specific so it's unlikely. >>> >>> This is using the new umtx-based semaphore code that David Xu wrote. = =A0He is >>> probably the best person to ask (cc'd). >>> >> >> Ok, I've encountered the problem repeatedly while building databases/tdb= : >> =A0it uses Python in the build process (but maybe it needs something els= e in >> parallel to provoke the problem). > > Glad to see that iXsystems isn't the only one ([1] -- please add a "me > too" to the PR). The problem is that we do FreeNAS nightlies and they > frequently get stuck building tdb (10%~20% of the time) and it sticks > when doing interactive builds as well. The issue appears to be > exacerbated when we have more builds running in parallel on the same > machine. I've also run into the same issue compiling talloc because it > uses the same waf infrastructure as tdb, which was designed to "speed > things up by forcing builds to be parallelized" (It builds > kern.smp.ncpus jobs instead of -j 1). Furthermore, it seems to occur > regardless of whether or not we have the WITH_SEM enabled in python or > not (build.ix's copy of python doesn't have it enabled, but > streetfighter.ix, my system bayonetta, etc do). > > I haven't actually enabled WITNESS or the deadlock resolver and > checked for LORs / deadlocks, but that might be an alternate avenue to > pursue in debugging the issue; my gut is that the issue exists within > the code that handles the subprocessing stuff and/or the GIL stuff in > the python interpreter and that the race condition between a command > actually finishing and not is relatively small (in most cases) and in > most cases python's code wins and continues on as usual. It could also > be some non-threadsafe code trying to run in parallel touching things > that it shouldn't in the python interpreter. It would also be > interesting to see what python3k brings to the table, but using that > would be introducing some extra unknowns into the equation. > > It can be reproduced by running continuous builds of talloc or tdb. > > Thanks! > -Garrett > > 1. http://www.freebsd.org/cgi/query-pr.cgi?pr=3Dports/163489 > _______________________________________________ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org= "