Date: Sun, 31 Jul 2011 00:17:02 +0300 From: Andriy Gapon <avg@FreeBSD.org> To: Alexander Motin <mav@FreeBSD.org> Cc: Kostik Belousov <kostikbel@gmail.com>, svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org Subject: Re: svn commit: r224496 - head/sys/cam Message-ID: <4E3474CE.5090301@FreeBSD.org> In-Reply-To: <4E346DD9.5030902@FreeBSD.org> References: <201107292030.p6TKUSaf064895@svn.freebsd.org> <20110730163723.GZ17489@deviant.kiev.zoral.com.ua> <4E346DD9.5030902@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
on 30/07/2011 23:47 Alexander Motin said the following: > Kostik Belousov wrote: >> On Fri, Jul 29, 2011 at 08:30:28PM +0000, Alexander Motin wrote: >>> Author: mav >>> Date: Fri Jul 29 20:30:28 2011 >>> New Revision: 224496 >>> URL: http://svn.freebsd.org/changeset/base/224496 >>> >>> Log: >>> In some cases failed SATA disks may report their presence, but don't >>> respond to any commands. I've found that because of multiple command >>> retries, each of which cause 30s timeout, bus reset and another retry or >>> requeue for many commands, it may take ages to eventually drop the >>> failed device. The odd thing is that those retries continue even after >>> XPT considered device as dead and invalidated it. >>> >>> This patch makes cam_periph_error() to block any command retries after >>> periph was marked as invalid. With that patch all activity completes in >>> 1-2 minutes, just after several timeouts, required to consider device >>> death. This should make ZFS, gmirror, graid, etc. operation more robust. >>> >>> Reviewed by: mjacob@ on scsi@ >>> >>> Approved by: re (kib) >>> >>> Modified: >>> head/sys/cam/cam_periph.c >> Amusingly, this commit makes my test machine to not boot. >> This is Ibex Peak PCH, with two SATA disks on the channels 0 and 1. >> >> It seems that geom thread 100012 owns GEOM topology lock, while sleeping >> in adaclose->cam_periph_getccb() : >> >> db> bt 100012 >> Tracing pid 12 tid 100012 td 0xfffffe00028a2000 >> sched_switch() at 0xffffffff8034a0c7 = sched_switch+0x157 >> mi_switch() at 0xffffffff803291fb = mi_switch+0x2eb >> sleepq_switch() at 0xffffffff803631f3 = sleepq_switch+0x123 >> sleepq_wait() at 0xffffffff80363eed = sleepq_wait+0x4d >> _sleep() at 0xffffffff80329b59 = _sleep+0x3b9 >> cam_periph_getccb() at 0xffffffff817ffc50 = cam_periph_getccb+0xa0 >> adaclose() at 0xffffffff8182c484 = adaclose+0xc4 >> g_disk_access() at 0xffffffff802bea74 = g_disk_access+0x1e4 >> g_access() at 0xffffffff802c519a = g_access+0x1ba >> g_dev_attrchanged() at 0xffffffff802bd1f6 = g_dev_attrchanged+0x96 >> g_dev_taste() at 0xffffffff802bd574 = g_dev_taste+0x284 >> g_new_provider_event() at 0xffffffff802c4ecd = g_new_provider_event+0xad >> g_run_events() at 0xffffffff802c0750 = g_run_events+0x250 >> fork_exit() at 0xffffffff802f0d99 = fork_exit+0x189 >> fork_trampoline() at 0xffffffff804ee3be = fork_trampoline+0xe >> --- trap 0, rip = 0, rsp = 0xffffff800025fd00, rbp = 0 --- >> >> (gdb) list *cam_periph_getccb+0xa0 >> 0x1c50 is in cam_periph_getccb (/usr/home/kostik/work/build/bsd/DEV/src/sys/modules/cam/../../cam/cam_periph.c:883). >> 882 >> 883 while (SLIST_FIRST(&periph->ccb_list) == NULL) { >> 884 if (periph->immediate_priority > priority) >> >> Reverting the rev. or not loading ahci.ko allows machine to boot. > > After many experiments I believe that problem is not related to this > change. I've managed to reproduce it depending on GEOM modules > registration order. After I disabled all GEOM modules and only geom_dev > left, problem became persistent. Specifics of the geom_dev is that it > opens device and closes it back without doing any I/O. That caused race > condition between CCB allocation for FLUSHCACHE execution in adaclose() > and higher-priority commands of device initialization sequence. Any I/O > scheduled before adaclose() closed that race, making problem rare. The > problem is specific to the ada, as for no other driver initialization > and payload requests may intersect in time. Alexander, somewhat contradicting your conclusions... can the following issue be also caused by similar mechanics? http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020336.html > Attached patch solved the problem for me. Please try it and approve > commit if it works. > -- Andriy Gapon
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4E3474CE.5090301>