From owner-svn-src-all@FreeBSD.ORG Sat Jul 30 21:09:35 2011 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1FCD9106566B; Sat, 30 Jul 2011 21:09:35 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-fx0-f54.google.com (mail-fx0-f54.google.com [209.85.161.54]) by mx1.freebsd.org (Postfix) with ESMTP id 535D68FC12; Sat, 30 Jul 2011 21:09:33 +0000 (UTC) Received: by fxe4 with SMTP id 4so4641497fxe.13 for ; Sat, 30 Jul 2011 14:09:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:x-enigmail-version:content-type; bh=frUVb/KGMPGpqdzWFpDZy75+v0ECobOfk0Kukc0pa5Y=; b=xBvdYl28nZQZIe0+m+Tt2BK4gUGZKt3KaWEw8dG/liGwmcUAjykVtDWy8WRwA3PwFZ sTHcALFiz7agjzcumHEtuqVxNwPj85vYs5cwym8SO0D7XEofv+lsITgGx1PZa6zy5u7G Vz+mPPGK9Kwd2YtxEXEcsONjeClHgCbBuB0eg= Received: by 10.223.76.202 with SMTP id d10mr3868751fak.106.1312058870838; Sat, 30 Jul 2011 13:47:50 -0700 (PDT) Received: from mavbook2.mavhome.dp.ua (pc.mavhome.dp.ua [212.86.226.226]) by mx.google.com with ESMTPS id x2sm1696538fal.44.2011.07.30.13.47.46 (version=SSLv3 cipher=OTHER); Sat, 30 Jul 2011 13:47:48 -0700 (PDT) Sender: Alexander Motin Message-ID: <4E346DD9.5030902@FreeBSD.org> Date: Sat, 30 Jul 2011 23:47:21 +0300 From: Alexander Motin User-Agent: Thunderbird 2.0.0.23 (X11/20091212) MIME-Version: 1.0 To: Kostik Belousov References: <201107292030.p6TKUSaf064895@svn.freebsd.org> <20110730163723.GZ17489@deviant.kiev.zoral.com.ua> In-Reply-To: <20110730163723.GZ17489@deviant.kiev.zoral.com.ua> X-Enigmail-Version: 0.96.0 Content-Type: multipart/mixed; boundary="------------020700080908040403090808" Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r224496 - head/sys/cam X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 30 Jul 2011 21:09:35 -0000 This is a multi-part message in MIME format. --------------020700080908040403090808 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Kostik Belousov wrote: > On Fri, Jul 29, 2011 at 08:30:28PM +0000, Alexander Motin wrote: >> Author: mav >> Date: Fri Jul 29 20:30:28 2011 >> New Revision: 224496 >> URL: http://svn.freebsd.org/changeset/base/224496 >> >> Log: >> In some cases failed SATA disks may report their presence, but don't >> respond to any commands. I've found that because of multiple command >> retries, each of which cause 30s timeout, bus reset and another retry or >> requeue for many commands, it may take ages to eventually drop the >> failed device. The odd thing is that those retries continue even after >> XPT considered device as dead and invalidated it. >> >> This patch makes cam_periph_error() to block any command retries after >> periph was marked as invalid. With that patch all activity completes in >> 1-2 minutes, just after several timeouts, required to consider device >> death. This should make ZFS, gmirror, graid, etc. operation more robust. >> >> Reviewed by: mjacob@ on scsi@ >> >> Approved by: re (kib) >> >> Modified: >> head/sys/cam/cam_periph.c > Amusingly, this commit makes my test machine to not boot. > This is Ibex Peak PCH, with two SATA disks on the channels 0 and 1. > > It seems that geom thread 100012 owns GEOM topology lock, while sleeping > in adaclose->cam_periph_getccb() : > > db> bt 100012 > Tracing pid 12 tid 100012 td 0xfffffe00028a2000 > sched_switch() at 0xffffffff8034a0c7 = sched_switch+0x157 > mi_switch() at 0xffffffff803291fb = mi_switch+0x2eb > sleepq_switch() at 0xffffffff803631f3 = sleepq_switch+0x123 > sleepq_wait() at 0xffffffff80363eed = sleepq_wait+0x4d > _sleep() at 0xffffffff80329b59 = _sleep+0x3b9 > cam_periph_getccb() at 0xffffffff817ffc50 = cam_periph_getccb+0xa0 > adaclose() at 0xffffffff8182c484 = adaclose+0xc4 > g_disk_access() at 0xffffffff802bea74 = g_disk_access+0x1e4 > g_access() at 0xffffffff802c519a = g_access+0x1ba > g_dev_attrchanged() at 0xffffffff802bd1f6 = g_dev_attrchanged+0x96 > g_dev_taste() at 0xffffffff802bd574 = g_dev_taste+0x284 > g_new_provider_event() at 0xffffffff802c4ecd = g_new_provider_event+0xad > g_run_events() at 0xffffffff802c0750 = g_run_events+0x250 > fork_exit() at 0xffffffff802f0d99 = fork_exit+0x189 > fork_trampoline() at 0xffffffff804ee3be = fork_trampoline+0xe > --- trap 0, rip = 0, rsp = 0xffffff800025fd00, rbp = 0 --- > > (gdb) list *cam_periph_getccb+0xa0 > 0x1c50 is in cam_periph_getccb (/usr/home/kostik/work/build/bsd/DEV/src/sys/modules/cam/../../cam/cam_periph.c:883). > 882 > 883 while (SLIST_FIRST(&periph->ccb_list) == NULL) { > 884 if (periph->immediate_priority > priority) > > Reverting the rev. or not loading ahci.ko allows machine to boot. After many experiments I believe that problem is not related to this change. I've managed to reproduce it depending on GEOM modules registration order. After I disabled all GEOM modules and only geom_dev left, problem became persistent. Specifics of the geom_dev is that it opens device and closes it back without doing any I/O. That caused race condition between CCB allocation for FLUSHCACHE execution in adaclose() and higher-priority commands of device initialization sequence. Any I/O scheduled before adaclose() closed that race, making problem rare. The problem is specific to the ada, as for no other driver initialization and payload requests may intersect in time. Attached patch solved the problem for me. Please try it and approve commit if it works. -- Alexander Motin --------------020700080908040403090808 Content-Type: text/plain; name="adaschedule.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="adaschedule.patch" --- cam/ata/ata_da.c.prev 2011-07-30 22:22:21.000000000 +0300 +++ cam/ata/ata_da.c 2011-07-30 23:12:22.000000000 +0300 @@ -488,12 +488,20 @@ static void adaschedule(struct cam_periph *periph) { struct ada_softc *softc = (struct ada_softc *)periph->softc; + uint32_t prio; + /* Check if cam_periph_getccb() was called. */ + prio = periph->immediate_priority; + + /* Check if we have more work to do. */ if (bioq_first(&softc->bio_queue) || (!softc->trim_running && bioq_first(&softc->trim_queue))) { - /* Have more work to do, so ensure we stay scheduled */ - xpt_schedule(periph, CAM_PRIORITY_NORMAL); + prio = CAM_PRIORITY_NORMAL; } + + /* Schedule CCB if any of above is true. */ + if (prio != CAM_PRIORITY_NONE) + xpt_schedule(periph, prio); } /* --------------020700080908040403090808--