From: Andrea Venturoli
Subject: Nightly disk-related panic since upgrade to 10.3
To: "freebsd-stable@freebsd.org"
Date: Wed, 19 Oct 2016 08:52:24 +0200

Hello.

Last week I upgraded a 9.3/amd64 box to 10.3: since then, it has crashed and rebooted at least once every night.
The only exception was on Friday, when it locked up without rebooting: it still answered ping requests, and logins through HTTP would half work. I'm under the impression that the disk subsystem was hung, so ICMP kept working since it does no disk I/O, and HTTP also worked as long as no disk access was required.

Today I was able to get a couple of (almost identical) dumps:

> cpuid = 1
> KDB: stack backtrace:
> #0 0xffffffff804ee170 at kdb_backtrace+0x60
> #1 0xffffffff804b4576 at vpanic+0x126
> #2 0xffffffff804b4443 at panic+0x43
> #3 0xffffffff8068fd2a at softdep_deallocate_dependencies+0x6a
> #4 0xffffffff805394b5 at brelse+0x145
> #5 0xffffffff8053793c at bufwrite+0x3c
> #6 0xffffffff806ae20f at ffs_write+0x3df
> #7 0xffffffff8076d519 at VOP_WRITE_APV+0x149
> #8 0xffffffff806ec7c9 at vnode_pager_generic_putpages+0x2a9
> #9 0xffffffff8076f3b7 at VOP_PUTPAGES_APV+0xa7
> #10 0xffffffff806ea6f5 at vnode_pager_putpages+0xc5
> #11 0xffffffff806e17f8 at vm_pageout_flush+0xc8
> #12 0xffffffff806db432 at vm_object_page_collect_flush+0x182
> #13 0xffffffff806db1cd at vm_object_page_clean+0x13d
> #14 0xffffffff806dadbe at vm_object_terminate+0x8e
> #15 0xffffffff806eac60 at vnode_destroy_vobject+0x90
> #16 0xffffffff806b4232 at ufs_reclaim+0x22
> #17 0xffffffff8076e5c7 at VOP_RECLAIM_APV+0xa7

Does anyone have any better insight into what might be going on?

The disks are all connected to a SAS RAID adapter running on the mfi driver. I don't think it is a hardware issue, since it worked perfectly for years until I did the upgrade; also, mfiutil says everything is ok and nothing mfi-related is in the logs.

Some ideas come to mind, on which I could use a second opinion:
_ soft-updates are broken: that would really surprise me, since I've been using them for years on this and several other boxes (10.3 too);
_ snapshot creation/deletion is causing this: again, I'm using that almost everywhere, so I don't think this alone might be the cause; besides, I've been able to do some dumps without trouble and I don't think anything was messing with snapshots at the time of the last two panics;
_ the mfi driver is broken on 10.3: this seems more plausible to me, since this is the only machine I have it on and it's the only case where I get these panics.

I found https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=183618, but I get no "g_vfs_done()..." messages.

Any other hint?

I'd really like to find out what's going on; I'll appreciate any help and I'm willing to provide any useful info.
On the other hand, this is a production server, so I have to solve this really soon.

Some ideas come to mind, like disabling soft-updates (knowing which file system was having trouble would help here; is there any way to know?), trying to enable journaling, upgrading to 10-STABLE, or building a kernel with INVARIANTS/WITNESS/etc., but I'd appreciate a second opinion before I start shooting in the dark.

bye & Thanks
av.