From: Walter Hop
Subject: Serious FS hangs and panics on 10.1
Message-Id: <553B39FA-7DBC-4536-9FD4-11A98E0D4740@spam.lifeforms.nl>
Date: Fri, 12 Dec 2014 18:16:17 +0100
To: freebsd-fs@freebsd.org

Hi all,

As some may have read on -stable, various users are having system hangs since 10.1-RC when unmounting the root filesystem on 10.1 with UFS+softupdates.
I'll recap: hangs occur for instance when /sbin/init has been meddled with, so people generally experience it after running freebsd-update. With the 10.1-p1 update, the bug report and mailing list threads got additional activity, so it's a recurring theme. I verified the problem still exists in CURRENT, and found lock order reversals which may or may not be related. (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195458)

Now the above problem has a simple mitigation: just disable softupdates before doing freebsd-update, and you won't hang. Okay, a little startling, but I'm still sleeping okay.

Now today, the 10.1 story seems to look a lot worse, with a 10.1 box getting back-to-back kernel panics in VFS functions. This is a box serving SVN repositories, and SVN is known to exercise a filesystem pretty thoroughly (even uncovering NTFS bugs in pre-SP1 Windows 7). We updated this box from 10.0 to 10.1 a week ago. The four panics that we saw (trace below) had the exact same instruction pointer and stack trace, so I'm pretty positive we're not looking at a random hardware fluke.

The last panics were spaced only minutes apart, which was pretty scary. I was fearing persistent disk corruption, but the panics stopped when... I disabled softupdates! This was my first shot, as this also solved my other stability problem on 10.1. Anyway, the machine has been stable so far.

Maybe these two problems are unrelated; it might be too early to tell. In any case, I am getting the strong vibe that something was changed in UFS/VFS/softupdates between 10.0 and 10.1 that's possibly very problematic and has a risk of causing data loss in the future.

Our experience with 10.0 has been remarkably good (same for earlier releases for that matter... in fact I don't think I can remember the last kernel panic in production at all... maybe on 5.2-STABLE?)
So, that's why we were very happy to see 10.1; but it feels really troublesome in the filesystem department, which is very uncharacteristic for FreeBSD.

That said, I'd prefer spending some more energy on getting 10.1 working well, rather than downgrading or jumping to other systems... But I think it really needs some love.

Any ideas on what we could do?

Thanks!
WH

-- 
Walter Hop | PGP key: https://lifeforms.nl/pgp

Panic:

kernel: Fatal trap 12: page fault while in kernel mode
kernel: cpuid = 0; apic id = 00
kernel: fault virtual address   = 0x30058
kernel: fault code              = supervisor write data, page not present
kernel: instruction pointer     = 0x20:0xffffffff8090e46a
kernel: stack pointer           = 0x28:0xfffffe000024d780
kernel: frame pointer           = 0x28:0xfffffe000024d850
kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
kernel: current process         = 27466 (httpd)
kernel: trap number             = 12
kernel: panic: page fault
kernel: cpuid = 0
kernel: KDB: stack backtrace:
kernel: #0 0xffffffff80963000 at kdb_backtrace+0x60
kernel: #1 0xffffffff80928125 at panic+0x155
kernel: #2 0xffffffff80d24f1f at trap_fatal+0x38f
kernel: #3 0xffffffff80d25238 at trap_pfault+0x308
kernel: #4 0xffffffff80d2489a at trap+0x47a
kernel: #5 0xffffffff80d0a782 at calltrap+0x8
kernel: #6 0xffffffff8090ec35 at lf_advlock+0x45
kernel: #7 0xffffffff809b8e69 at vop_stdadvlock+0xa9
kernel: #8 0xffffffff80e44247 at VOP_ADVLOCK_APV+0xa7
kernel: #9 0xffffffff808e4919 at kern_fcntl+0xb39
kernel: #10 0xffffffff808e3d5c at kern_fcntl_freebsd+0xac
kernel: #11 0xffffffff80d25851 at amd64_syscall+0x351
kernel: #12 0xffffffff80d0aa6b at Xfast_syscall+0xfb
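
For readers less familiar with this code path: frames #6 through #10 show the fault being reached from an fcntl(2) call that went into the advisory locking code (kern_fcntl, VOP_ADVLOCK, lf_advlock). The sketch below shows the general kind of userland call that enters the kernel this way, presumably a byte-range lock request. It is not a reproducer for the panic. The helper name lock_roundtrip is made up for illustration, and that httpd/SVN issues exactly this sort of request is an assumption on my part, not something the trace proves.

```python
import errno
import fcntl
import os
import tempfile


def lock_roundtrip(path):
    """Take and release a non-blocking exclusive POSIX lock on path.

    Hypothetical helper for illustration only. Returns True if the lock
    was acquired and released, False if a conflicting lock is held by
    another process.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # fcntl.lockf() is implemented on top of fcntl(2) with
            # F_SETLK / F_SETLKW, i.e. the fcntl advisory-lock interface.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # A held conflicting lock is reported as EACCES or EAGAIN.
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return False
            raise
        # Release the lock again before closing the descriptor.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print(lock_roundtrip(tmp.name))
```

Something like this, run in a loop against files on the affected filesystem, might be a starting point for narrowing things down, but I would treat that as a guess until someone who knows the locking code has looked at the trace.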