From owner-freebsd-fs@FreeBSD.ORG Mon Jun 22 01:19:19 2015
Date: Sun, 21 Jun 2015 19:34:57 -0400
From: Tom Curry <thomasrcurry@gmail.com>
To: Willem Jan Withagen
Cc: freebsd-fs@freebsd.org
Subject: Re: This diskfailure should not panic a system, but just disconnect disk from ZFS

I asked because I recently had similar trouble. Lots of kernel panics;
sometimes they were just like yours, sometimes they were general protection
faults. But they would always occur when my nightly backups took place, where
VMs on iSCSI zvol LUNs were read and then written over SMB to another pool on
the same machine over 10GbE. I nearly went out of my mind trying to figure out
what was going on. I'll spare you the gory details, but I stumbled across this
PR:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

and as I read through it little light bulbs started coming on. Luckily it was
easy for me to reproduce the problem, so I kicked off the backups and watched
the system memory. Wired would grow, the ARC would shrink, and then the system
would start swapping. If I stopped the I/O right then it would recover after a
while, but if I let it go it would always panic, and half the time it was the
same message as yours. So I applied the patch from that PR, rebooted, and
kicked off the backup. No more panic.
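
A rough sketch of one way to watch that while the load is running (these are
the stock FreeBSD sysctls; the 10-second interval and the MB conversion are
only for readability):

    # Watch wired memory vs. ARC size during the heavy I/O.
    while :; do
        wired=$(sysctl -n vm.stats.vm.v_wire_count)    # wired memory, in pages
        pgsz=$(sysctl -n hw.pagesize)
        arc=$(sysctl -n kstat.zfs.misc.arcstats.size)  # ARC size, in bytes
        printf '%s wired=%sMB arc=%sMB\n' "$(date +%T)" \
            $((wired * pgsz / 1048576)) $((arc / 1048576))
        sleep 10
    done

If wired keeps climbing while the ARC shrinks and the machine then starts
digging into swap, that is the same pattern.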

Recently I rebuilt a vanilla kernel from stable/10, but explicitly set
vfs.zfs.arc_max to 24G (I have 32G), ran my torture tests, and it is stable.
So I don't want to send you on a wild goose chase, but it's entirely possible
that the problem you are having is not hardware related at all, but rather a
memory starvation issue related to the ARC under periods of heavy activity.
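
In case it helps, the cap itself is just a loader tunable; a minimal sketch of
what I mean (24G is only the figure I happened to pick for a 32G box, and the
value below is that figure in bytes):

    # /boot/loader.conf -- cap the ZFS ARC so heavy I/O cannot wire down
    # nearly all of RAM
    vfs.zfs.arc_max="25769803968"   # 24 GiB

After a reboot, 'sysctl vfs.zfs.arc_max' should report the new ceiling.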

On Sun, Jun 21, 2015 at 6:43 PM, Willem Jan Withagen wrote:

> On 21/06/2015 21:50, Tom Curry wrote:
> > Was there by chance a lot of disk activity going on when this occurred?
>
> Define 'a lot'??
> But very likely, since the system is also a backup location for several
> external services which back up through rsync, and they can generate
> quite some traffic. Next to the fact that it also serves an NVR with a
> ZVOL through iSCSI...
>
> --WjW
>
> > On Sun, Jun 21, 2015 at 10:00 AM, Willem Jan Withagen wrote:
> >
> > On 20/06/2015 18:11, Daryl Richards wrote:
> > > Check the failmode setting on your pool. From man zpool:
> > >
> > >   failmode=wait | continue | panic
> > >
> > >     Controls the system behavior in the event of catastrophic pool
> > >     failure. This condition is typically a result of a loss of
> > >     connectivity to the underlying storage device(s) or a failure of
> > >     all devices within the pool. The behavior of such an event is
> > >     determined as follows:
> > >
> > >     wait      Blocks all I/O access until the device connectivity is
> > >               recovered and the errors are cleared. This is the
> > >               default behavior.
> > >
> > >     continue  Returns EIO to any new write I/O requests but allows
> > >               reads to any of the remaining healthy devices. Any
> > >               write requests that have yet to be committed to disk
> > >               would be blocked.
> > >
> > >     panic     Prints out a message to the console and generates a
> > >               system crash dump.
> >
> > 'mmm
> >
> > Did not know about this setting. Nice one, but alas my current setting is:
> >   zfsboot  failmode  wait  default
> >   zfsraid  failmode  wait  default
> >
> > So either the setting is not working, or something else is up?
> > Is waiting only meant to wait a limited time, and then panic anyway?
> >
> > But then I still wonder why, even in the 'continue' case, the ZFS system
> > ends up in a state where the filesystem is not able to continue its
> > standard functioning (read and write) and disconnects the disk???
> >
> > All failmode settings result in a seriously handicapped system...
> > On a raidz2 system I would perhaps have expected this to occur when the
> > second disk goes into thin space??
> >
> > The other question is: the man page talks about
> > 'Controls the system behavior in the event of catastrophic pool failure'.
> > And is a hung disk a 'catastrophic pool failure'?
> >
> > Still very puzzled?
> >
> > --WjW
> >
> > > On 2015-06-20 10:19 AM, Willem Jan Withagen wrote:
> > >> Hi,
> > >>
> > >> Found my system rebooted this morning:
> > >>
> > >> Jun 20 05:28:33 zfs kernel: sonewconn: pcb 0xfffff8011b6da498: Listen
> > >> queue overflow: 8 already in queue awaiting acceptance (48 occurrences)
> > >> Jun 20 05:28:33 zfs kernel: panic: I/O to pool 'zfsraid' appears to be
> > >> hung on vdev guid 18180224580327100979 at '/dev/da0'.
> > >> Jun 20 05:28:33 zfs kernel: cpuid = 0
> > >> Jun 20 05:28:33 zfs kernel: Uptime: 8d9h7m9s
> > >> Jun 20 05:28:33 zfs kernel: Dumping 6445 out of 8174
> > >> MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
> > >>
> > >> Which leads me to believe that /dev/da0 went out on vacation, leaving
> > >> ZFS in trouble.... But the array is:
> > >> ----
> > >> NAME              SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH
> > >> zfsraid          32.5T  13.3T  19.2T         -     7%    41%  1.00x  ONLINE
> > >>   raidz2         16.2T  6.67T  9.58T         -     8%    41%
> > >>     da0              -      -      -         -      -      -
> > >>     da1              -      -      -         -      -      -
> > >>     da2              -      -      -         -      -      -
> > >>     da3              -      -      -         -      -      -
> > >>     da4              -      -      -         -      -      -
> > >>     da5              -      -      -         -      -      -
> > >>   raidz2         16.2T  6.67T  9.58T         -     7%    41%
> > >>     da6              -      -      -         -      -      -
> > >>     da7              -      -      -         -      -      -
> > >>     ada4             -      -      -         -      -      -
> > >>     ada5             -      -      -         -      -      -
> > >>     ada6             -      -      -         -      -      -
> > >>     ada7             -      -      -         -      -      -
> > >>   mirror           504M  1.73M   502M        -    39%     0%
> > >>     gpt/log0         -      -      -         -      -      -
> > >>     gpt/log1         -      -      -         -      -      -
> > >> cache                -      -      -         -      -      -
> > >>   gpt/raidcache0   109G  1.34G   107G        -     0%     1%
> > >>   gpt/raidcache1   109G   787M   108G        -     0%     0%
> > >> ----
> > >>
> > >> And thus I would have expected that ZFS would disconnect /dev/da0,
> > >> switch to DEGRADED state and continue, letting the operator fix the
> > >> broken disk. Instead it chooses to panic, which is not a nice thing
> > >> to do. :)
> > >>
> > >> Or do I have too high hopes of ZFS?
> > >>
> > >> Next question to answer is why this WD RED on:
> > >>
> > >> arcmsr0@pci0:7:14:0: class=0x010400 card=0x112017d3 chip=0x112017d3
> > >>                      rev=0x00 hdr=0x00
> > >>   vendor   = 'Areca Technology Corp.'
> > >>   device   = 'ARC-1120 8-Port PCI-X to SATA RAID Controller'
> > >>   class    = mass storage
> > >>   subclass = RAID
> > >>
> > >> got hung, and nothing for this shows in SMART....
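
For completeness, since failmode came up above: it is an ordinary pool
property, so checking it and trying another value is just a couple of
commands (pool names taken from the quoted output; whether 'continue' really
behaves any better with a single hung vdev is exactly the open question
above):

    zpool get failmode zfsraid zfsboot
    zpool set failmode=continue zfsraid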