Date: Wed, 20 Sep 2017 11:35:26 +0100
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: freebsd-xen@freebsd.org
Subject: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID: <62BC29D8E1F6EA5C09759861@[10.12.30.106]>

Hi All,

We recently experienced an unplanned storage failover on our XenServer pool. The pool is 7.1 based (on certified HP kit), and runs a mix of FreeBSD VMs (all 10.3 based, except for a legacy 9.x VM) and a few Windows VMs. Storage is provided by two Citrix-certified Synology storage boxes.

During the failover, Xen sees the storage paths go down and come up again (re-attaching when they are available again). Timing this, it takes around a minute, worst case.

The process killed 99% of our FreeBSD VMs :(

The earlier 9.x FreeBSD box survived, and all the Windows VMs survived.

Is there some tunable we can set to make the 10.3 boxes more tolerant of the I/O delays that occur during a storage failover?

I've enclosed some of the errors we observed below. I realise a full storage failover is a 'stressful time' for VMs, but the Windows VMs and the earlier FreeBSD version survived without issue. All the 10.3 boxes logged I/O errors, and then panicked / rebooted.

We've set up a test lab with the same kit, and can now replicate this at will (every time, most to all of the FreeBSD 10.x boxes panic and reboot, but Windows prevails), so we can test any potential fixes.

So if anyone can suggest anything we can tweak to minimize the chances of this happening (i.e. make I/O more timeout tolerant, or set larger timeouts?) that'd be great.

Thanks,

-Karl

Errors we observed:

ada0: disk error cmd=write 11339752-11339767 status: ffffffff
ada0: disk error cmd=write g_vfs_done():11340544-11340607gpt/root[WRITE(offset=4731097088, length=8192)] status: ffffffff error = 5

(repeated a couple of times with different values)

Machine then goes on to panic:

g_vfs_done():panic: softdep_setup_freeblocks: inode busy
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff8098e810 at kdb_backtrace+0x60
#1 0xffffffff809514e6 at vpanic+0x126
#2 0xffffffff809513b3 at panic+0x43
#3 0xffffffff80b9c685 at softdep_setup_freeblocks+0xaf5
#4 0xffffffff80b86bae at ffs_truncate+0x44e
#5 0xffffffff80bbec49 at ufs_setattr+0x769
#6 0xffffffff80e81891 at VOP_SETATTR_APV+0xa1
#7 0xffffffff80a053c5 at vn_truncate+0x165
#8 0xffffffff809ff236 at kern_openat+0x326
#9 0xffffffff80d56e6f at amd64_syscall+0x40f
#10 0xffffffff80d3c0cb at Xfast_syscall+0xfb

Another box also logged:

ada0: disk error cmd=read 9970080-9970082 status: ffffffff
g_vfs_done():gpt/root[READ(offset=4029825024, length=1536)]error = 5
vnode_pager_getpages: I/O read error
vm_fault: pager read error, pid 24219 (make)

And again, it went on to panic shortly thereafter.

I had to hand transcribe the above from screenshots / video, so apologies if any errors crept in.

I'm hoping there's just a magic sysctl / kernel option we can set to up the timeouts? (if it is as simple as timeouts killing things)
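
For anyone trying to reproduce this in a lab, a probe along these lines could be run inside a guest while the failover is triggered, to record how long each write stalls and whether it comes back with an error (such as the error = 5 / EIO seen above). This is only a minimal sketch: probe_io and the file path are hypothetical names, not part of any FreeBSD or XenServer tool, and it only measures stalls; it does not change any timeout.

```python
import os
import time


def probe_io(path, rounds=5, interval=0.0, payload=b"x" * 4096):
    """Write and fsync `payload` to `path` `rounds` times.

    Returns a list of (latency_seconds, error) tuples, where error is
    None on success or the OSError raised by the write/fsync.
    """
    results = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(rounds):
            start = time.monotonic()
            err = None
            try:
                # Rewrite the same block each round so the file stays small.
                os.lseek(fd, 0, os.SEEK_SET)
                os.write(fd, payload)
                # fsync forces the write down to the (virtual) disk, so a
                # storage stall shows up as latency or as an OSError here.
                os.fsync(fd)
            except OSError as exc:
                err = exc
            results.append((time.monotonic() - start, err))
            time.sleep(interval)
    finally:
        os.close(fd)
    return results
```

Calling something like probe_io("/var/tmp/probe.dat", rounds=120, interval=1.0) during a failover and printing each tuple would show the length of the stall and the point at which errors start being returned to the guest.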