Date:      Sat, 30 Jan 2010 01:11:10 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Alexander Motin <mav@freebsd.org>
Cc:        freebsd-hackers@freebsd.org, FreeBSD-Current <freebsd-current@freebsd.org>, freebsd-geom@freebsd.org
Subject:   Re: Deadlock between GEOM and devfs device destroy and process exit.
Message-ID:  <20100129231110.GS3877@deviant.kiev.zoral.com.ua>
In-Reply-To: <4B636812.8060403@FreeBSD.org>
References:  <4B636812.8060403@FreeBSD.org>


On Sat, Jan 30, 2010 at 12:58:26AM +0200, Alexander Motin wrote:
> Hi.
>
> Experimenting with SATA hot-plug, I've found a quite repeatable deadlock
> case. The problem is observed when several SATA devices, opened via devfs,
> disappear at exactly the same time. In my case, that happens when unplugging
> a SATA Port Multiplier with several disks behind it. All I have to do is run
> several `dd if=/dev/adaX of=/dev/null bs=1m &` commands and unplug the
> multiplier. That causes predictable I/O errors and device destruction.
> But with high probability, several dd processes get stuck in the kernel.
>
> I've identified these pieces of the problem:
> - CAM receives the disconnect event and starts device destruction. But as
> the device is still open, it can't do so immediately.
> - dd receives an I/O error and exits.
> - The exit1() call closes all descriptors, including the adaX device. That
> triggers the final device destruction by sending an event to geom_dev.
>
> adaclose(4571fa00,4,40c16576,76,0,...) at 0x4049c521
> g_disk_access(457e2200,ffffffff,0,0,0,...) at 0x4080b9a4
> g_access(45643d80,ffffffff,0,0,2000,...) at 0x40810ccb
> g_dev_close(45766500,1,2000,4569fd80,4569fd80,...) at 0x4080a425
> devfs_close(7b604aa8,80000,457f8000,80000,7b604acc,...) at 0x407f2762
> VOP_CLOSE_APV(40d03180,7b604aa8,40c2e681,128,0,...) at 0x40b6da55
> vn_close(457f8000,1,45624300,4569fd80,451271e0,...) at 0x40912750
> vn_closefile(4566da48,4569fd80,4566da48,0,7b604b58,...) at 0x40912854
> devfs_close_f(4566da48,4569fd80,3,0,4566da48,...) at 0x407f235b
> _fdrop(4566da48,4569fd80,7b604b8c,408b5cec,0,4569fe24,40eb23a8,40d10460,40c1a8bb,4560672c,721,40c1a8b2,7b604bb4,40878220,4560672c,8,40c1a8b2,721) at 0x40836da3
> closef(4566da48,4569fd80,721,71e,4569fe24,...) at 0x40838ad0
> fdfree(4569fd80,0,40c1b1a9,107,7b604c80,...) at 0x408394da
> exit1(4569fd80,100,7b604d2c,40b565c0,4569fd80,...) at 0x40844423
> sys_exit(4569fd80,7b604cf8,40c59d34,40c26be4,4569d2a8,...) at 0x408450fd
> syscall(7b604d38) at 0x40b565c0
>
> - The GEOM event thread tries to destroy the /dev/adaX device (which should
> already be free at this moment), but for some reason freezes, waiting for
> the device to be freed:
>
>     0     2     0   0  -8  0      0      8 devdrn DL    ??    0:02.89 [g_event]
>
> - As the GEOM event is still not handled, exit1() waits for it:
>
> kdb_backtrace(40c16bc4,0,40c16ab1,56,4540e640,...) at 0x408a2909
> g_waitidle(4569fd80,0,40c1b1a9,107,7b604c80,...) at 0x4080cd1f
> exit1(4569fd80,100,7b604d2c,40b565c0,4569fd80,...) at 0x40844431
> sys_exit(4569fd80,7b604cf8,40c59d34,40c26be4,4569d2a8,...) at 0x408450fd
> syscall(7b604d38) at 0x40b565c0
>
> - The system is stationary and GEOM is frozen. There is no way out of this
> except pushing reset.
>
>     0  1065  1055   0  44  0   5344   3040 g_wait DE     0    0:00.43 dd if=/dev/ada1 of=/dev/null bs=1m
>     0  1066  1055   0  44  0   5344   3040 GEOM t DE     0    0:00.07 dd if=/dev/ada2 of=/dev/null bs=1m
>
> So, does anybody have a good idea why destroy_dev() can't complete?

The devdrn state means that the thread performing the device destruction,
i.e. the thread that called destroy_dev(), is waiting for threads to leave
the cdevsw d_* methods. The thread that notified the destruction thread
did so from the d_close() method. This resulted in the deadlock.

I introduced the destroy_dev_sched(9) KPI to handle this and similar issues.
Note that race-free use of destroy_dev_sched(9) is quite hard.
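
To illustrate the difference between the two approaches, here is a small
userland model. It is only a sketch and not kernel code: struct model_dev and
all model_* functions are hypothetical names invented for this example, and
the two-thread situation from the report is collapsed into a single caller.
Where the real destroy_dev() would sleep in the devdrn state, the model
returns false, so the behaviour can be checked without hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userland model only: none of this is the real kernel definition. */
struct model_dev {
	int  threads_in_methods; /* threads currently inside a d_* method */
	bool destroy_pending;    /* a deferred destruction has been queued */
	bool destroyed;          /* destruction has completed */
};

/* A thread enters a d_* method of the device. */
void model_enter(struct model_dev *d)
{
	d->threads_in_methods++;
}

/* A thread leaves a d_* method of the device. */
void model_leave(struct model_dev *d)
{
	assert(d->threads_in_methods > 0);
	d->threads_in_methods--;
}

/*
 * Model of a synchronous destroy. The real call would sleep until no
 * thread is inside a d_* method; the model reports "would block" by
 * returning false instead.
 */
bool model_destroy_sync(struct model_dev *d)
{
	if (d->threads_in_methods > 0)
		return false;
	d->destroyed = true;
	return true;
}

/* Model of a deferred destroy: record the request and return at once. */
void model_destroy_sched(struct model_dev *d)
{
	d->destroy_pending = true;
}

/* Model of the worker that performs queued destructions later. */
bool model_run_deferred(struct model_dev *d)
{
	if (!d->destroy_pending || !model_destroy_sync(d))
		return false;
	d->destroy_pending = false;
	return true;
}

/*
 * A d_close-like method that asks for the device to be destroyed.
 * Returns true if the destruction completed while still inside it.
 */
bool model_close(struct model_dev *d, bool deferred)
{
	bool done = false;

	model_enter(d);
	if (deferred)
		model_destroy_sched(d);
	else
		done = model_destroy_sync(d);
	model_leave(d);
	return done;
}
```

In this model, a synchronous destroy requested from inside model_close() can
never complete, because the caller itself still counts as a thread inside a
method. The deferred variant only queues the request, and the destruction
succeeds once model_run_deferred() runs after the method has returned. The
model says nothing about the races mentioned above, which are the hard part
of using the real KPI.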



