From owner-freebsd-current@FreeBSD.ORG  Mon Jan 21 16:35:50 2013
From: Christian Gusenbauer <c47g@gmx.at>
To: freebsd-current@freebsd.org
Subject: Re: disk "flipped" - a known problem?
Date: Mon, 21 Jan 2013 17:37:18 +0100
User-Agent: KMail/1.13.7 (FreeBSD/9.1-STABLE; KDE/4.8.4; amd64; ; )
References: <50FC3EBF.6070803@FreeBSD.org>
In-Reply-To: <50FC3EBF.6070803@FreeBSD.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Message-Id: <201301211737.19235.c47g@gmx.at>

Hi!

On Sunday 20 January 2013 20:00:15 Andriy Gapon wrote:
> Today something unusual happened on one of my machines:
> kernel: (ada0:ahcich0:0:0:0): lost device
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
> kernel: adaasync: Unable to attach to new device due to status 0x6
> 
> It looks like the disk disappeared from the bus and then re-appeared on the
> bus, but not to the OS.
> 
> One of the partitions that the disk hosted was a swap partition and it
> seems to be the cause of some of the following consequences.
> 
> The consequences:
> 
> * ZFS properly noticed disappearance of the disk, but its diagnostic was a
> little bit misleading:
> 
>   pool: pond
>  state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
> config:
> 
>         NAME                                            STATE     READ WRITE CKSUM
>         pond                                            DEGRADED     0     0     0
>           mirror-0                                      DEGRADED     0     0     0
>             12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
>             gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0
> 
> Yes, I agree that the disk got removed/lost, but disagree that "the
> administrator" did it.
> 
> * geom_event thread started consuming 100% of CPU in g_wither_washer()
> 
> * /dev/ada0 disappeared but camcontrol devlist still reported ada0:
> <ST3500410AS CC34>                 at scbus0 target 0 lun 0 (pass0,ada0)
> 
> * As seen in the system messages, CAM layer refused to re-attach the disk
> 
> * gpart command would just crash
> 
> 
> So, I can explain the behavior of the geom_event thread - apparently
> swapgeom_orphan doesn't do anything that is really meaningful to GEOM and
> so g_wither_washer is stuck waiting until the swap consumer goes away
> (drops its access bits).
> 
> (Another sad thing about this state is that I couldn't swapoff the device,
> because there was no device entry.)
> 
> I am not sure if the "attempt to re-allocate valid device" failure was
> caused by this, but it could be, if something in CAM layer was waiting for
> GEOM layer to be done with the disk.
> 
> It would be nice if the swap code properly supported disappearance of the
> underlying disks.  Especially in this case where the swap was actually
> never used / touched at all (few hours after reboot and completely idle
> system).

I don't know if it's related, but my new 2 TB WD Green hard disk has also
vanished three times during the last couple of weeks. Some people over at
hackers@ told me that this might be due to bad blocks on the disk, but
unfortunately (or luckily?) none of the SMART tests found any errors :-(. So
I wonder whether it's a hardware or a software problem. It happened on
9.1-STABLE while I was copying data from/to that hard disk (UFS).
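
To compare how often each disk drops off the bus, a rough sketch like the
following might help. It only counts "lost device" lines of the form quoted
above; the regular expression and the function name lost_devices are my own
assumptions, not anything provided by CAM or by the base system, and other
message formats are ignored.

```python
import re
from collections import Counter

# Assumed line shape, taken from the quoted log:
#   kernel: (ada0:ahcich0:0:0:0): lost device
LOST_RE = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): lost device")


def lost_devices(lines):
    """Count 'lost device' events per (device, channel) pair."""
    counts = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Small demonstration on lines copied from the report above.
sample = [
    "kernel: (ada0:ahcich0:0:0:0): lost device",
    "kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout",
    "kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted",
]
for (device, channel), n in sorted(lost_devices(sample).items()):
    print(f"{device} on {channel}: lost {n} time(s)")
```

Feeding it the lines of the system log (wherever syslog writes kernel
messages on the machine in question) instead of the sample list should give
one summary line per affected disk.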

Ciao,
Christian.