From owner-freebsd-current@FreeBSD.ORG Mon Jan 21 16:35:50 2013
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
    by hub.freebsd.org (Postfix) with ESMTP id 8348BAA0
    for ; Mon, 21 Jan 2013 16:35:50 +0000 (UTC)
    (envelope-from c47g@gmx.at)
Received: from mout.gmx.net (mout.gmx.net [212.227.15.19])
    by mx1.freebsd.org (Postfix) with ESMTP id 32996DE0
    for ; Mon, 21 Jan 2013 16:35:49 +0000 (UTC)
Received: from mailout-de.gmx.net ([10.1.76.28])
    by mrigmx.server.lan (mrigmx002) with ESMTP (Nemesis)
    id 0M2rdY-1T62jr0tWg-00se6q
    for ; Mon, 21 Jan 2013 17:35:49 +0100
Received: (qmail invoked by alias); 21 Jan 2013 16:35:49 -0000
Received: from cm56-168-232.liwest.at (EHLO bones.gusis.at) [86.56.168.232]
    by mail.gmx.net (mp028) with SMTP; 21 Jan 2013 17:35:49 +0100
X-Authenticated: #9978462
X-Provags-ID: V01U2FsdGVkX1+WhogqSGGWbji/QPn9uri6OOYM2egaGhKsU4WzjT
    C4kEbaFwuCAV08
From: Christian Gusenbauer
To: freebsd-current@freebsd.org
Subject: Re: disk "flipped" - a known problem?
Date: Mon, 21 Jan 2013 17:37:18 +0100
User-Agent: KMail/1.13.7 (FreeBSD/9.1-STABLE; KDE/4.8.4; amd64; ; )
References: <50FC3EBF.6070803@FreeBSD.org>
In-Reply-To: <50FC3EBF.6070803@FreeBSD.org>
MIME-Version: 1.0
Content-Type: Text/Plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Message-Id: <201301211737.19235.c47g@gmx.at>
X-Y-GMX-Trusted: 0
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Mon, 21 Jan 2013 16:35:50 -0000

Hi!

On Sunday 20 January 2013 20:00:15 Andriy Gapon wrote:
> Today something unusual happened on one of my machines:
> kernel: (ada0:ahcich0:0:0:0): lost device
> kernel: (aprobe1:ahcich0:0:15:0): NOP.
> ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
> kernel: adaasync: Unable to attach to new device due to status 0x6
>
> It looks like the disk disappeared from the bus and then re-appeared on the
> bus, but not to the OS.
>
> One of the partitions that the disk hosted was a swap partition and it
> seems to be the cause of some of the following consequences.
>
> The consequences:
>
> * ZFS properly noticed disappearance of the disk, but its diagnostic was a
> little bit misleading:
>
>   pool: pond
>  state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
> config:
>
>         NAME                                            STATE     READ WRITE CKSUM
>         pond                                            DEGRADED     0     0     0
>           mirror-0                                      DEGRADED     0     0     0
>             12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
>             gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0
>
> Yes, I agree that the disk got removed/lost, but disagree that "the
> administrator" did it.
>
> * geom_event thread started consuming 100% of CPU in g_wither_washer()
>
> * /dev/ada0 disappeared but camcontrol devlist still reported ada0:
> at scbus0 target 0 lun 0 (pass0,ada0)
>
> * As seen in the system messages, CAM layer refused to re-attach the disk
>
> * gpart command would just crash
>
>
> So, I can explain the behavior of the geom_event thread - apparently
> swapgeom_orphan doesn't do anything that is really meaningful to GEOM and
> so g_wither_washer is stuck waiting until the swap consumer goes away
> (drops its access bits).
>
> (Another sad thing about this state is that I couldn't swapoff the device,
> because there was no device entry.)
>
> I am not sure if the "attempt to re-allocate valid device" failure was
> caused by this, but it could be, if something in CAM layer was waiting for
> GEOM layer to be done with the disk.
>
> It would be nice if the swap code properly supported disappearance of the
> underlying disks. Especially in this case where the swap was actually
> never used / touched at all (few hours after reboot and completely idle
> system).

I don't know if it's related, but my new 2 TB WD Green hard disk has also
vanished three times during the last couple of weeks. Some guys over at
hackers@ told me that this might be due to bad blocks on the disk, but
unfortunately (or luckily?) none of the SMART tests found any errors :-(.
So I wonder whether it's a hardware or a software problem.

It happened on 9.1-STABLE while I was copying data from/to that hard disk
(UFS).

Ciao,
Christian.