Message-ID: <50FC3EBF.6070803@FreeBSD.org>
Date: Sun, 20 Jan 2013 21:00:15 +0200
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/17.0 Thunderbird/17.0
MIME-Version: 1.0
To: freebsd-current@FreeBSD.org, freebsd-fs <freebsd-fs@FreeBSD.org>,
 freebsd-geom@FreeBSD.org
Subject: disk "flipped" - a known problem?


Today something unusual happened on one of my machines:
kernel: (ada0:ahcich0:0:0:0): lost device
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
kernel: adaasync: Unable to attach to new device due to status 0x6

It looks like the disk disappeared from the bus and then re-appeared on the bus,
but not to the OS.

One of the partitions that the disk hosted was a swap partition and it seems to
be the cause of some of the following consequences.

The consequences:

* ZFS properly noticed the disappearance of the disk, but its diagnostic was a
little misleading:

  pool: pond
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        pond                                            DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
            gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0

Yes, I agree that the disk got removed/lost, but I disagree that "the
administrator" did it.

* geom_event thread started consuming 100% of CPU in g_wither_washer()

* /dev/ada0 disappeared but camcontrol devlist still reported ada0:
<ST3500410AS CC34>                 at scbus0 target 0 lun 0 (pass0,ada0)

* As seen in the system messages, the CAM layer refused to re-attach the disk

* The gpart command would just crash


So, I can explain the behavior of the geom_event thread: apparently
swapgeom_orphan doesn't do anything that is really meaningful to GEOM, and so
g_wither_washer is stuck waiting until the swap consumer goes away (drops its
access bits).
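
The stall described above can be illustrated with a toy model in plain C.  This
is only a sketch, not the actual GEOM code: the struct layouts and every toy_*
name are hypothetical.  The one thing taken from the description is that a
withering provider cannot be torn down while a consumer still holds access
counts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a consumer: read/write/exclusive access counts. */
struct toy_consumer {
    int acr, acw, ace;
};

/* Hypothetical stand-in for a provider whose disk has gone away. */
struct toy_provider {
    bool withering;
    bool destroyed;
    struct toy_consumer *consumer;
};

/* Orphan handler that does nothing meaningful: access counts stay held. */
void toy_orphan_noop(struct toy_consumer *cp)
{
    (void)cp;
}

/* Orphan handler that drops all of the consumer's access counts. */
void toy_orphan_release(struct toy_consumer *cp)
{
    cp->acr = cp->acw = cp->ace = 0;
}

/*
 * One cleanup pass.  The provider is destroyed only if no access is held.
 * Returns true if work is still pending and the pass must be repeated.
 */
bool toy_wither_pass(struct toy_provider *pp)
{
    assert(pp->withering);
    if (pp->consumer->acr || pp->consumer->acw || pp->consumer->ace)
        return true;
    pp->destroyed = true;
    return false;
}

/*
 * Orphan the consumer, then retry the cleanup pass up to max_passes times.
 * Returns the number of passes needed, or -1 if the provider never went away.
 */
int toy_wither_run(struct toy_provider *pp,
    void (*orphan)(struct toy_consumer *), int max_passes)
{
    pp->withering = true;
    orphan(pp->consumer);
    for (int i = 1; i <= max_passes; i++) {
        if (!toy_wither_pass(pp))
            return i;
    }
    return -1;
}
```

With toy_orphan_noop the loop uses up every pass it is given and the provider
is never destroyed, which is the busy-loop shape seen in the geom_event thread.
With toy_orphan_release a single pass is enough.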

(Another sad thing about this state is that I couldn't swapoff the device,
because there was no device entry.)

I am not sure whether the "attempt to re-allocate valid device" failure was
caused by this, but it could be, if something in the CAM layer was waiting for
the GEOM layer to be done with the disk.
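
To make that hypothesis concrete, here is a minimal sketch in plain C of a
registry that refuses to allocate a device name while an older instance of the
same name still holds a reference.  It is not the CAM implementation, and the
toy_* names, the table size, and the return values are all assumptions made for
illustration.  The only grounding is the log line above, which reports the old
ada0 as still valid with "refcount 1".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_PERIPHS 8

/* Hypothetical record for one attached peripheral. */
struct toy_periph {
    char name[16];
    int  refcount;
    bool in_use;
};

struct toy_registry {
    struct toy_periph slots[TOY_MAX_PERIPHS];
};

/*
 * Register a new device.  Returns 0 on success, or -1 if an instance with
 * the same name is still registered or the table is full.
 */
int toy_periph_alloc(struct toy_registry *r, const char *name)
{
    int free_slot = -1;

    assert(name != NULL);
    for (int i = 0; i < TOY_MAX_PERIPHS; i++) {
        if (r->slots[i].in_use) {
            if (strcmp(r->slots[i].name, name) == 0)
                return -1;      /* old instance still valid: reject */
        } else if (free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot < 0)
        return -1;

    struct toy_periph *p = &r->slots[free_slot];
    strncpy(p->name, name, sizeof(p->name) - 1);
    p->name[sizeof(p->name) - 1] = '\0';
    p->refcount = 1;
    p->in_use = true;
    return 0;
}

/* Drop one reference; the slot is freed when the last reference goes. */
void toy_periph_release(struct toy_registry *r, const char *name)
{
    for (int i = 0; i < TOY_MAX_PERIPHS; i++) {
        struct toy_periph *p = &r->slots[i];
        if (p->in_use && strcmp(p->name, name) == 0) {
            if (--p->refcount == 0)
                p->in_use = false;
            return;
        }
    }
}
```

In this model a second toy_periph_alloc() for "ada0" keeps failing until
whoever holds the last reference calls toy_periph_release().  If that holder
were itself waiting on a stuck cleanup, the re-attach would never succeed.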

It would be nice if the swap code properly supported the disappearance of the
underlying disks, especially in a case like this one, where the swap was never
actually used or touched (a few hours after reboot, on a completely idle
system).

-- 
Andriy Gapon