Date: Sat, 30 Jan 2010 20:51:27 +0200
From: Alexander Motin <mav@FreeBSD.org>
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Cc: freebsd-hackers@freebsd.org, FreeBSD-Current <freebsd-current@freebsd.org>, kib@FreeBSD.org, freebsd-geom@freebsd.org
Subject: Re: Deadlock between GEOM and devfs device destroy and process exit.
Message-ID: <4B647FAF.4090409@FreeBSD.org>
In-Reply-To: <20100130114451.GB1660@garage.freebsd.pl>
References: <4B636812.8060403@FreeBSD.org> <20100130112749.GA1660@garage.freebsd.pl> <20100130114451.GB1660@garage.freebsd.pl>

Pawel Jakub Dawidek wrote:
> On Sat, Jan 30, 2010 at 12:27:49PM +0100, Pawel Jakub Dawidek wrote:
>> On Sat, Jan 30, 2010 at 12:58:26AM +0200, Alexander Motin wrote:
>>> Experimenting with SATA hot-plug I've found quite repeatable deadlock
>>> case. Problem observed when several SATA devices, opened via devfs,
>>> disappear at exactly same time. In my case, at time of unplugging SATA
>>> Port Multiplier with several disks beyond it. All I have to do is to run
>>> several `dd if=/dev/adaX of=/dev/null bs=1m &` commands and unplug
>>> multiplier. That causes predictable I/O errors and devices destruction.
>>> But with high probability several dd processes getting stuck in kernel.
>> [...]
>>
>> I observed the same thing yesterday while stress-testing HAST:
>>
>> 3659 2504 3659 0 DE+ GEOM top 0x8079a348 dd
>> 3658 2102 2102 0 DE+ GEOM top 0x8079a348 hastd
>> 2 0 0 0 DL devdrn 0x85b1bc68 [g_event]
>>
>> Both dd(1) and hastd(8) wait for the GEOM topology lock in the exit path,
>> which is already held by the g_event thread.
>
> Maybe I'll add how I understand what's going on:
>
> GEOM calls destroy_dev() while holding the topology lock.
>
> Destroy_dev() wants to destroy device, but can't because there are
> threads that still have it open.
>
> The threads can't close it, because to close it they need the topology
> lock.
>
> The deadlock is quite obvious, IMHO.

You are right, but since it does not happen every time, I was interested
in why. After a closer look I found two different scenarios.

In the first case, the application receives an I/O error and closes the
device. On device close CAM calls disk_destroy(), which schedules device
destruction. By the time destroy_dev() is called, the device is already
free and there is no problem, as these events are always asynchronous.

In the second case, the application also receives an I/O error, but
before it is able to react, GEOM starts handling disk_gone(), called by
CAM. As a result, destroy_dev() is called with the device still open, and
it can never be closed because the topology lock is held.

I've played a bit with destroy_dev_sched(), but the locking indeed does
not look easy. Is there some known good practice? destroy_dev_sched_cb()
looks a bit more promising.

-- 
Alexander Motin