From owner-freebsd-arch@FreeBSD.ORG Sun Jun 21 11:52:05 2009
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D440C106564A
	for ; Sun, 21 Jun 2009 11:52:05 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (skuns.zoral.com.ua [91.193.166.194])
	by mx1.freebsd.org (Postfix) with ESMTP id B52F28FC19
	for ; Sun, 21 Jun 2009 11:52:04 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id n5LBq04k082541
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 21 Jun 2009 14:52:00 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.3/8.14.3) with ESMTP id n5LBq0d9031850;
	Sun, 21 Jun 2009 14:52:00 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.3/8.14.3/Submit) id n5LBpvjI031849;
	Sun, 21 Jun 2009 14:51:57 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f
Date: Sun, 21 Jun 2009 14:51:57 +0300
From: Kostik Belousov
To: Jilles Tjoelker
Message-ID: <20090621115157.GJ2884@deviant.kiev.zoral.com.ua>
References: <20090619162328.GA79975@stack.nl> <20090620161540.GF2884@deviant.kiev.zoral.com.ua> <20090620203300.GA21763@stack.nl>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="sT9gWZPUZYhvPS56"
Content-Disposition: inline
In-Reply-To: <20090620203300.GA21763@stack.nl>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.1 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua
Cc: freebsd-arch@freebsd.org
Subject: Re: deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 21 Jun 2009 11:52:06 -0000

--sT9gWZPUZYhvPS56
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Jun 20, 2009 at 10:33:00PM +0200, Jilles Tjoelker wrote:
> On Sat, Jun 20, 2009 at 07:15:40PM +0300, Kostik Belousov wrote:
> > On Fri, Jun 19, 2009 at 06:23:28PM +0200, Jilles Tjoelker wrote:
> > > I have been having trouble with deadlocks with NFS mounts for a while,
> > > and I have found at least one way it can deadlock. It seems an issue
> > > with the sleep/lock system.
>
> > > NFS sleeps while holding a lockmgr lock, waiting for a reply from the
> > > server. When the mount is set intr, this is an interruptible sleep, so
> > > that interrupting signals can abort the sleep. However, this also means
> > > that SIGSTOP etc will suspend the thread without waking it up first, so
> > > it will be suspended with a lock held.
>
> > > If it holds the wrong locks, it is possible that the shell will not be
> > > able to run, and the process cannot be continued in the normal manner.
>
> > > Due to some other things I do not understand, it is then possible that
> > > the process cannot be continued at all (SIGCONT seems ignored), but in
> > > simple cases SIGCONT works, and things go back to normal.
>
> > > In any case, this situation is undesirable, as even 'umount -f' doesn't
> > > work while the thread is suspended.
>
> > > Of course, this reasoning applies to any code that goes to sleep
> > > interruptibly while holding a lock (sx or lockmgr). Is this supposed to
> > > be possible (likely useful)? If so, a third type of sleep would be
> > > needed that is interrupted by signals but not suspended? If not,
> > > something should check that it doesn't happen and NFS intr mounts may
> > > need to check for signals periodically or find a way to avoid sleeping
> > > with locks held.
>
> > > The td_locks field is only accessible for the current thread, so it
> > > cannot be used to check if suspending is safe.
>
> > > Also, making SIGSTOP and the like interrupt/restart syscalls is not
> > > acceptable unless you find some way to do it such that userland won't
> > > notice. For example, a read of 10 megabytes from a regular file with
> > > that much available must not return less than 10 megabytes.
>
> > Note that NFS does check for the signals during i/o, so you may get
> > short reads anyway.
>
> > I do think that the right solution both there and with SINGLE_NO_EXIT
> > case for thread_single is to stop at the usermode boundary instead of
> > suspending a thread in the interruptible sleep state.
>
> > I set error code returned from interrupted msleep() to ERESTART,
> > that seems to be the right thing, at least to restart the i/o that
> > transferred no data upon receiving SIGSTOP.
>
> Any such short read on a regular file is wrong. That that badness
> already occurs in some cases is not an excuse to make it occur more
> often. Particularly because process suspension is expected not to affect
> the process and interrupting syscalls would change the behaviour of the
> debugged program significantly, while the current interruptions only
> occur with signals that likely terminate the process anyway (note that
> intr mounts only check for SIGINT, SIGTERM, SIGHUP, SIGKILL, SIGSTOP and
> SIGQUIT and appear to mask all others; I don't know why SIGTSTP gets
> through -- possibly a thread/process difference).
>
> No matter the SIGSTOP issue, a warning about the interruptions in the
> mount_nfs(8) man page may be in order; the current language makes the
> impression that intr is only a good thing, which is not the case. This
> applies to all NFS versions. A better way to deal with nonresponsive NFS
> servers that will not come back would be forced unmount (does it always
> work, apart from the case mentioned above? same for the experimental
> client?). SIGKILL (but not any other signal, not even SIGSTOP) could
> also be allowed on processes blocked on nointr mounts.
>
> Another point (mostly for socket operations and the like) is that the
> current causes of interrupted system calls are under control of the
> application: if you do not catch any signals, you will only get short
> read/writes for reasons related to the underlying object; hence, it is
> often not necessary to add (ugly) code to handle it: any unexpected
> short read or write is a problem with the underlying object.
>
> Another example which currently works and would be a shame to break:
>
> % /usr/bin/time sleep 10
> ^Z
> zsh: suspended /usr/bin/time sleep 10
> % fg
> [1] + continued /usr/bin/time sleep 10
> 10.00 real 0.00 user 0.00 sys
> %
>
> What's more, the fact that this works is thanks to the kernel. sleep(1)
> just calls nanosleep(2), and because it doesn't catch any signals, that
> suffices.
>
> I do notice this is already broken for debuggers.
> Attaching gdb or truss
> to a running sleep process immediately aborts the nanosleep with EINTR.

The point is valid. I updated the patch by adding a special msleep flag
that indicates that a stop is allowed only on the usermode boundary.
Sleeps in the NFS client where resources may be locked are marked with
the flag.

diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c
index 5c1d553..5312ffa 100644
--- a/sys/kern/kern_sig.c
+++ b/sys/kern/kern_sig.c
@@ -2310,18 +2310,28 @@ static void
 sig_suspend_threads(struct thread *td, struct proc *p, int sending)
 {
 	struct thread *td2;
+	int wakeup_swapper;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
 
+	wakeup_swapper = 0;
 	FOREACH_THREAD_IN_PROC(p, td2) {
 		thread_lock(td2);
 		td2->td_flags |= TDF_ASTPENDING | TDF_NEEDSUSPCHK;
 		if ((TD_IS_SLEEPING(td2) || TD_IS_SWAPPED(td2)) &&
-		    (td2->td_flags & TDF_SINTR) &&
-		    !TD_IS_SUSPENDED(td2)) {
-			thread_suspend_one(td2);
-		} else {
+		    (td2->td_flags & TDF_SINTR)) {
+			if (td2->td_flags & TDF_SBDRY) {
+				if (TD_IS_SUSPENDED(td2))
+					wakeup_swapper |=
+					    thread_unsuspend_one(td2);
+				if (TD_ON_SLEEPQ(td2))
+					wakeup_swapper |=
+					    sleepq_abort(td2, ERESTART);
+			} else if (!TD_IS_SUSPENDED(td2)) {
+				thread_suspend_one(td2);
+			}
+		} else if (!TD_IS_SUSPENDED(td2)) {
 			if (sending || td != td2)
 				td2->td_flags |= TDF_ASTPENDING;
 #ifdef SMP
@@ -2331,6 +2341,8 @@ sig_suspend_threads(struct thread *td, struct proc *p, int sending)
 		}
 		thread_unlock(td2);
 	}
+	if (wakeup_swapper)
+		kick_proc0();
 }
 
 int
diff --git a/sys/kern/kern_synch.c b/sys/kern/kern_synch.c
index b91c1a5..58488ac 100644
--- a/sys/kern/kern_synch.c
+++ b/sys/kern/kern_synch.c
@@ -188,6 +188,8 @@ _sleep(void *ident, struct lock_object *lock, int priority,
 	flags = SLEEPQ_SLEEP;
 	if (catch)
 		flags |= SLEEPQ_INTERRUPTIBLE;
+	if (priority & PBDRY)
+		flags |= SLEEPQ_STOP_ON_BDRY;
 
 	sleepq_lock(ident);
 	CTR5(KTR_PROC, "sleep: thread %ld (pid %ld, %s) on %s (%p)",
@@ -344,11 +346,16 @@ wakeup(void *ident)
 {
 	int wakeup_swapper;
 
+repeat:
 	sleepq_lock(ident);
 	wakeup_swapper = sleepq_broadcast(ident, SLEEPQ_SLEEP, 0, 0);
 	sleepq_release(ident);
-	if (wakeup_swapper)
-		kick_proc0();
+	if (wakeup_swapper) {
+		if (ident == &proc0)
+			goto repeat;
+		else
+			kick_proc0();
+	}
 }
 
 /*
@@ -361,11 +368,16 @@ wakeup_one(void *ident)
 {
 	int wakeup_swapper;
 
+repeat:
 	sleepq_lock(ident);
 	wakeup_swapper = sleepq_signal(ident, SLEEPQ_SLEEP, 0, 0);
 	sleepq_release(ident);
-	if (wakeup_swapper)
-		kick_proc0();
+	if (wakeup_swapper) {
+		if (ident == &proc0)
+			goto repeat;
+		else
+			kick_proc0();
+	}
 }
 
 static void
diff --git a/sys/kern/kern_thread.c b/sys/kern/kern_thread.c
index bb8779b..800a1d1 100644
--- a/sys/kern/kern_thread.c
+++ b/sys/kern/kern_thread.c
@@ -504,6 +504,22 @@ thread_unlink(struct thread *td)
 	/* Must NOT clear links to proc! */
 }
 
+static int
+recalc_remaining(struct proc *p, int mode)
+{
+	int remaining;
+
+	if (mode == SINGLE_EXIT)
+		remaining = p->p_numthreads;
+	else if (mode == SINGLE_BOUNDARY)
+		remaining = p->p_numthreads - p->p_boundary_count;
+	else if (mode == SINGLE_NO_EXIT)
+		remaining = p->p_numthreads - p->p_suspcount;
+	else
+		panic("recalc_remaining: wrong mode %d", mode);
+	return (remaining);
+}
+
 /*
  * Enforce single-threading.
  *
@@ -551,12 +567,7 @@ thread_single(int mode)
 	p->p_flag |= P_STOPPED_SINGLE;
 	PROC_SLOCK(p);
 	p->p_singlethread = td;
-	if (mode == SINGLE_EXIT)
-		remaining = p->p_numthreads;
-	else if (mode == SINGLE_BOUNDARY)
-		remaining = p->p_numthreads - p->p_boundary_count;
-	else
-		remaining = p->p_numthreads - p->p_suspcount;
+	remaining = recalc_remaining(p, mode);
 	while (remaining != 1) {
 		if (P_SHOULDSTOP(p) != P_STOPPED_SINGLE)
 			goto stopme;
@@ -587,18 +598,17 @@ thread_single(int mode)
 						wakeup_swapper |=
 						    sleepq_abort(td2, ERESTART);
 					break;
+				case SINGLE_NO_EXIT:
+					if (TD_IS_SUSPENDED(td2) &&
+					    !(td2->td_flags & TDF_BOUNDARY))
+						wakeup_swapper |=
+						    thread_unsuspend_one(td2);
+					if (TD_ON_SLEEPQ(td2) &&
+					    (td2->td_flags & TDF_SINTR))
+						wakeup_swapper |=
+						    sleepq_abort(td2, ERESTART);
+					break;
 				default:
-					if (TD_IS_SUSPENDED(td2)) {
-						thread_unlock(td2);
-						continue;
-					}
-					/*
-					 * maybe other inhibited states too?
-					 */
-					if ((td2->td_flags & TDF_SINTR) &&
-					    (td2->td_inhibitors &
-					    (TDI_SLEEPING | TDI_SWAPPED)))
-						thread_suspend_one(td2);
 					break;
 				}
 			}
@@ -611,12 +621,7 @@ thread_single(int mode)
 		}
 		if (wakeup_swapper)
 			kick_proc0();
-		if (mode == SINGLE_EXIT)
-			remaining = p->p_numthreads;
-		else if (mode == SINGLE_BOUNDARY)
-			remaining = p->p_numthreads - p->p_boundary_count;
-		else
-			remaining = p->p_numthreads - p->p_suspcount;
+		remaining = recalc_remaining(p, mode);
 
 		/*
 		 * Maybe we suspended some threads.. was it enough?
@@ -630,12 +635,7 @@ stopme:
 		 * In the mean time we suspend as well.
 		 */
 		thread_suspend_switch(td);
-		if (mode == SINGLE_EXIT)
-			remaining = p->p_numthreads;
-		else if (mode == SINGLE_BOUNDARY)
-			remaining = p->p_numthreads - p->p_boundary_count;
-		else
-			remaining = p->p_numthreads - p->p_suspcount;
+		remaining = recalc_remaining(p, mode);
 	}
 	if (mode == SINGLE_EXIT) {
 		/*
diff --git a/sys/kern/subr_sleepqueue.c b/sys/kern/subr_sleepqueue.c
index 01fcc73..781c186 100644
--- a/sys/kern/subr_sleepqueue.c
+++ b/sys/kern/subr_sleepqueue.c
@@ -341,6 +341,8 @@ sleepq_add(void *wchan, struct lock_object *lock, const char *wmesg, int flags,
 	if (flags & SLEEPQ_INTERRUPTIBLE) {
 		td->td_flags |= TDF_SINTR;
 		td->td_flags &= ~TDF_SLEEPABORT;
+		if (flags & SLEEPQ_STOP_ON_BDRY)
+			td->td_flags |= TDF_SBDRY;
 	}
 	thread_unlock(td);
 }
diff --git a/sys/nfsclient/nfs_bio.c b/sys/nfsclient/nfs_bio.c
index 22e2a79..d5d426e 100644
--- a/sys/nfsclient/nfs_bio.c
+++ b/sys/nfsclient/nfs_bio.c
@@ -1255,7 +1255,7 @@ nfs_getcacheblk(struct vnode *vp, daddr_t bn, int size, struct thread *td)
 		sigset_t oldset;
 
 		nfs_set_sigmask(td, &oldset);
-		bp = getblk(vp, bn, size, PCATCH, 0, 0);
+		bp = getblk(vp, bn, size, NFS_PCATCH, 0, 0);
 		nfs_restore_sigmask(td, &oldset);
 		while (bp == NULL) {
 			if (nfs_sigintr(nmp, NULL, td))
@@ -1292,7 +1292,7 @@ nfs_vinvalbuf(struct vnode *vp, int flags, struct thread *td, int intrflg)
 	if ((nmp->nm_flag & NFSMNT_INT) == 0)
 		intrflg = 0;
 	if (intrflg) {
-		slpflag = PCATCH;
+		slpflag = NFS_PCATCH;
 		slptimeo = 2 * hz;
 	} else {
 		slpflag = 0;
@@ -1371,7 +1371,7 @@ nfs_asyncio(struct nfsmount *nmp, struct buf *bp, struct ucred *cred, struct thr
 	}
 again:
 	if (nmp->nm_flag & NFSMNT_INT)
-		slpflag = PCATCH;
+		slpflag = NFS_PCATCH;
 	gotiod = FALSE;
 
 	/*
@@ -1440,7 +1440,7 @@ again:
 				mtx_unlock(&nfs_iod_mtx);
 				return (error2);
 			}
-			if (slpflag == PCATCH) {
+			if (slpflag == NFS_PCATCH) {
 				slpflag = 0;
 				slptimeo = 2 * hz;
 			}
diff --git a/sys/nfsclient/nfs_socket.c b/sys/nfsclient/nfs_socket.c
index 1ae31a5..2398695 100644
--- a/sys/nfsclient/nfs_socket.c
+++ b/sys/nfsclient/nfs_socket.c
@@ -516,7 +516,7 @@ nfs_reconnect(struct nfsreq *rep)
 
 	KASSERT(mtx_owned(&nmp->nm_mtx), ("NFS mnt lock not owned !"));
 	if (nmp->nm_flag & NFSMNT_INT)
-		slpflag = PCATCH;
+		slpflag = NFS_PCATCH;
 	/*
 	 * Wait for any pending writes to this socket to drain (or timeout).
 	 */
@@ -768,7 +768,7 @@ tryagain:
 	slpflag = 0;
 	mtx_lock(&nmp->nm_mtx);
 	if (nmp->nm_flag & NFSMNT_INT)
-		slpflag = PCATCH;
+		slpflag = NFS_PCATCH;
 	mtx_unlock(&nmp->nm_mtx);
 	mtx_lock(&rep->r_mtx);
 	while ((rep->r_mrep == NULL) && (error == 0) &&
@@ -1791,7 +1791,7 @@ nfs_connect_lock(struct nfsreq *rep)
 
 	td = rep->r_td;
 	if (rep->r_nmp->nm_flag & NFSMNT_INT)
-		slpflag = PCATCH;
+		slpflag = NFS_PCATCH;
 	while (*statep & NFSSTA_SNDLOCK) {
 		error = nfs_sigintr(rep->r_nmp, rep, td);
 		if (error) {
@@ -1800,7 +1800,7 @@ nfs_connect_lock(struct nfsreq *rep)
 		*statep |= NFSSTA_WANTSND;
 		(void) msleep(statep, &rep->r_nmp->nm_mtx,
 			      slpflag | (PZERO - 1), "nfsndlck", slptimeo);
-		if (slpflag == PCATCH) {
+		if (slpflag & PCATCH) {
 			slpflag = 0;
 			slptimeo = 2 * hz;
 		}
diff --git a/sys/nfsclient/nfs_vnops.c b/sys/nfsclient/nfs_vnops.c
index 3623fab..a8d098b 100644
--- a/sys/nfsclient/nfs_vnops.c
+++ b/sys/nfsclient/nfs_vnops.c
@@ -2931,7 +2931,7 @@ nfs_flush(struct vnode *vp, int waitfor, int commit)
 	int bvecsize = 0, bveccount;
 
 	if (nmp->nm_flag & NFSMNT_INT)
-		slpflag = PCATCH;
+		slpflag = NFS_PCATCH;
 	if (!commit)
 		passone = 0;
 	bo = &vp->v_bufobj;
@@ -3129,7 +3129,7 @@ loop:
 					error = EINTR;
 					goto done;
 				}
-				if (slpflag == PCATCH) {
+				if (slpflag & PCATCH) {
 					slpflag = 0;
 					slptimeo = 2 * hz;
 				}
@@ -3167,7 +3167,7 @@ loop:
 			error = nfs_sigintr(nmp, NULL, td);
 			if (error)
 				goto done;
-			if (slpflag == PCATCH) {
+			if (slpflag & PCATCH) {
 				slpflag = 0;
 				slptimeo = 2 * hz;
 			}
diff --git a/sys/nfsclient/nfsmount.h b/sys/nfsclient/nfsmount.h
index 85f8501..c98a172 100644
--- a/sys/nfsclient/nfsmount.h
+++ b/sys/nfsclient/nfsmount.h
@@ -147,6 +147,8 @@ struct nfsmount {
 #define NFS_TPRINTF_DELAY 30
 #endif
 
+#define NFS_PCATCH (PCATCH | PBDRY)
+
 #endif
 
 #endif
diff --git a/sys/sys/param.h b/sys/sys/param.h
index 06745f8..5ee9c16 100644
--- a/sys/sys/param.h
+++ b/sys/sys/param.h
@@ -186,6 +186,7 @@
 #define PRIMASK 0x0ff
 #define PCATCH 0x100 /* OR'd with pri for tsleep to check signals */
 #define PDROP 0x200 /* OR'd with pri to stop re-entry of interlock mutex */
+#define PBDRY 0x400 /* for PCATCH stop is done on the user boundary */
 
 #define NZERO 0 /* default "nice" */
 
diff --git a/sys/sys/proc.h b/sys/sys/proc.h
index 0a4b79c..b65db62 100644
--- a/sys/sys/proc.h
+++ b/sys/sys/proc.h
@@ -320,7 +320,7 @@ do { \
 #define TDF_BOUNDARY 0x00000400 /* Thread suspended at user boundary */
 #define TDF_ASTPENDING 0x00000800 /* Thread has some asynchronous events. */
 #define TDF_TIMOFAIL 0x00001000 /* Timeout from sleep after we were awake. */
-#define TDF_UNUSED2000 0x00002000 /* --available-- */
+#define TDF_SBDRY 0x00002000 /* Stop only on usermode boundary. */
 #define TDF_UPIBLOCKED 0x00004000 /* Thread blocked on user PI mutex. */
 #define TDF_NEEDSUSPCHK 0x00008000 /* Thread may need to suspend. */
 #define TDF_NEEDRESCHED 0x00010000 /* Thread needs to yield. */
diff --git a/sys/sys/sleepqueue.h b/sys/sys/sleepqueue.h
index 0d1f361..362945a 100644
--- a/sys/sys/sleepqueue.h
+++ b/sys/sys/sleepqueue.h
@@ -93,6 +93,8 @@ struct thread;
 #define SLEEPQ_SX 0x03 /* Used by an sx lock. */
 #define SLEEPQ_LK 0x04 /* Used by a lockmgr. */
 #define SLEEPQ_INTERRUPTIBLE 0x100 /* Sleep is interruptible. */
+#define SLEEPQ_STOP_ON_BDRY 0x200 /* Stop sleeping thread on
+				      user mode boundary */
 
 void init_sleepqueues(void);
 int sleepq_abort(struct thread *td, int intrval);

--sT9gWZPUZYhvPS56
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (FreeBSD)

iEYEARECAAYFAko+Ht0ACgkQC3+MBN1Mb4i8UACdF/vS3aAq6zWOi4PO438RpVA5
lVQAoLWpwuS57yLZjH8mMLPYEoE1xIGs
=ClRu
-----END PGP SIGNATURE-----

--sT9gWZPUZYhvPS56--
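
One detail of the patch, the way the new PBDRY bit composes with PCATCH,
can be tried outside the kernel. What follows is a minimal userland
sketch, not kernel code and not part of the patch. The PRIMASK, PCATCH
and PBDRY values are copied from the sys/sys/param.h hunk and NFS_PCATCH
from the nfsmount.h hunk; the functions nfs_slpflag(),
sleep_is_interruptible(), stop_only_on_boundary() and
check_flag_layout() are hypothetical helpers invented for illustration.
The sketch suggests why the old "slpflag == PCATCH" tests had to change:
once interruptible NFS sleeps carry the extra bit, an equality test
against PCATCH alone no longer matches, while a bit test still does.

```c
#include <assert.h>

/* Values copied from the sys/sys/param.h hunk of the patch. */
#define PRIMASK 0x0ff
#define PCATCH  0x100	/* check signals while sleeping */
#define PBDRY   0x400	/* with PCATCH: stop only on the user boundary */

/* From the sys/nfsclient/nfsmount.h hunk. */
#define NFS_PCATCH (PCATCH | PBDRY)

/*
 * Hypothetical helper mirroring the pattern
 *	if (nmp->nm_flag & NFSMNT_INT) slpflag = NFS_PCATCH;
 * The intr_mount argument stands in for the NFSMNT_INT test.
 */
int
nfs_slpflag(int intr_mount)
{
	return (intr_mount ? NFS_PCATCH : 0);
}

/* Hypothetical helper: the bit test used in the "slpflag & PCATCH" hunks. */
int
sleep_is_interruptible(int slpflag)
{
	return ((slpflag & PCATCH) != 0);
}

/* Hypothetical helper: the test _sleep() gains in the kern_synch.c hunk. */
int
stop_only_on_boundary(int priority)
{
	return ((priority & PBDRY) != 0);
}

/* Hypothetical self-check of the flag layout. */
void
check_flag_layout(void)
{
	/* The flag bits must not overlap the priority field. */
	assert((NFS_PCATCH & PRIMASK) == 0);
	/* An equality test against PCATCH no longer matches NFS sleeps. */
	assert(NFS_PCATCH != PCATCH);
}
```

For example, nfs_slpflag(1) | 21 (21 being an arbitrary priority value
chosen for the example) still satisfies both sleep_is_interruptible()
and stop_only_on_boundary(), and masking it with PRIMASK gives back 21.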