From: Kostik Belousov
To: Joerg Wunsch
Cc: freebsd-scsi@freebsd.org
Date: Sun, 8 May 2011 12:45:09 +0300
Subject: Re: Panic when removing a SCSI device entry
Message-ID: <20110508094509.GT48734@deviant.kiev.zoral.com.ua>
In-Reply-To: <20110508085314.GA5364@uriah.heep.sax.de>

On Sun, May 08, 2011 at 10:53:14AM +0200, Joerg Wunsch wrote:
> I've got a setup where a tape library is attached with a
> computer-controllable power switch, so it is only turned on during the
> time when backups (or restores) are done.  This is mainly to reduce
> the noise level, but also to reduce the overall energy consumption
> while that library is not needed.
>
> Every now and then, the kernel panics with a page fault during the
> (unattended; it happens at night) power cycling and surrounding
> actions.  The current process when the page fault happens is always
> mt(1), which is used inside the powerup/down script to ensure the
> drive is properly rewound.  The page fault happens in
> destroy_devl(), at this location:
>
>         /* If we are a child, remove us from the parents list */
>         if (dev->si_flags & SI_CHILD) {
> here --->>>     LIST_REMOVE(dev, si_siblings);
>                 dev->si_flags &= ~SI_CHILD;
>         }
>
> The preprocessed code of that looks like:
>
>  if (dev->si_flags & 0x0010) {
>   if ((((dev))->si_siblings.le_next) != ((void *)0))
>    (((dev))->si_siblings.le_next)->si_siblings.le_prev =
>    (dev)->si_siblings.le_prev;
>   *(dev)->si_siblings.le_prev = (((dev))->si_siblings.le_next);
>   dev->si_flags &= ~0x0010;
>  }
>
> and it's the indirection of *(dev)->si_siblings.le_prev that hits a
> NULL pointer.  Obviously, LIST_REMOVE doesn't anticipate that

Is it a NULL pointer dereference? See below.
> dev->si_siblings.le_prev might be a NULL pointer, so this is a usage
> error, somehow.  Could it be that destroy_devl() is called twice for
> the same device?
>
> This used to happen on an earlier system (some version of 7.x-stable),
> and I eventually managed to tweak the powerup/down scripts of the
> library so as to avoid the critical sequence of actions triggering this
> situation.  Now that I have finally upgraded the machine to 8.2-STABLE,
> it is triggered very frequently again, though.
>
> Any ideas how to fix it, or at least apply a workaround, other than
> turning
>
>         *(elm)->field.le_prev = LIST_NEXT((elm), field);                \
>
> in the LIST_REMOVE macro into
>
>         if ((elm)->field.le_prev != NULL)                               \
>                 *(elm)->field.le_prev = LIST_NEXT((elm), field);        \
>
> which affects the entire system, not just the SCSI subsystem part?

Please provide the full printout from the panic. It would also be useful
to get a crash dump and do "p *dev" from the destroy_devl() frame.
I might need further information after the requested data is provided.

One thing you may try in the meantime is the following patch.
diff --git a/sys/kern/kern_conf.c b/sys/kern/kern_conf.c
index b2be5cc..59b876c 100644
--- a/sys/kern/kern_conf.c
+++ b/sys/kern/kern_conf.c
@@ -981,6 +981,8 @@ destroy_devl(struct cdev *dev)
 	/* Remove name marking */
 	dev->si_flags &= ~SI_NAMED;
 
+	dev->si_refcount++;	/* Avoid race with dev_rel() */
+
 	/* If we are a child, remove us from the parents list */
 	if (dev->si_flags & SI_CHILD) {
 		LIST_REMOVE(dev, si_siblings);
@@ -997,7 +999,6 @@ destroy_devl(struct cdev *dev)
 		dev->si_flags &= ~SI_CLONELIST;
 	}
 
-	dev->si_refcount++;	/* Avoid race with dev_rel() */
 	csw = dev->si_devsw;
 	dev->si_devsw = NULL;	/* already NULL for SI_ALIAS */
 	while (csw != NULL && csw->d_purge != NULL && dev->si_threadcount) {
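The faulting LIST_REMOVE writes through le_prev, which points at the previous element's le_next field (or at the list head's lh_first for the first element). The userspace sketch below, assuming the BSD <sys/queue.h> LIST macros (also shipped with glibc), illustrates that pointer layout and the flag-guarded removal pattern that destroy_devl() applies with SI_CHILD. struct node, NODE_CHILD, attach_child(), detach_child() and count_children() are hypothetical stand-ins, not kernel interfaces, and the sketch does not reproduce the panic or the dev_rel() race that the patch addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical flag playing the role of SI_CHILD in this sketch. */
#define NODE_CHILD 0x0010

/* Hypothetical stand-in for struct cdev: only what the sketch needs. */
struct node {
	int id;
	int flags;
	LIST_ENTRY(node) siblings;	/* plays the role of si_siblings */
};

/* Declares struct node_list { struct node *lh_first; }. */
LIST_HEAD(node_list, node);

/* Link a child into its parent's list and mark it as linked. */
static void
attach_child(struct node_list *parent, struct node *child)
{
	LIST_INSERT_HEAD(parent, child, siblings);
	child->flags |= NODE_CHILD;
}

/*
 * Unlink a child only while the flag says it is on a list, then clear
 * the flag so that a second call does not run LIST_REMOVE again.
 * LIST_REMOVE stores through le_prev, so le_prev must point at valid
 * storage (a sibling's le_next or the head's lh_first) while the flag
 * is set.
 */
static void
detach_child(struct node *child)
{
	if (child->flags & NODE_CHILD) {
		assert(child->siblings.le_prev != NULL);
		LIST_REMOVE(child, siblings);
		child->flags &= ~NODE_CHILD;
	}
}

/* Count the entries currently linked on the parent's list. */
static int
count_children(struct node_list *parent)
{
	struct node *n;
	int count = 0;

	LIST_FOREACH(n, parent, siblings)
		count++;
	return (count);
}
```

With children a, b and c inserted at the head (list order c, b, a), b's le_prev points at c's le_next; detaching b rewrites c's le_next to a and a's le_prev to &c.siblings.le_next, and a second detach_child(&b) is a no-op because the flag is already clear. In this pattern, a NULL le_prev while the flag is still set can only mean the entry was never linked or its memory was overwritten afterwards.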