Date: Wed, 20 Sep 2017 15:54:20 +0100
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: Roger Pau Monné
Cc: freebsd-xen@freebsd.org
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
In-Reply-To: <20170920114418.pq6fhnexol2mvkxv@dhcp-3-128.uk.xensource.com>

--On 20 September 2017 at 12:44:18 +0100 Roger Pau Monné wrote:

>> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant
>> of the I/O delays that occur during a storage failover?
>
> Do you know whether the VMs saw the disks disconnecting and then
> connecting again?

I can't see any evidence the drives actually get 'disconnected' from the
VMs' point of view. Plenty of I/O errors - but no "device destroyed" type
messages.

I have seen that kind of error logged on our test kit when we deliberately
failed non-HA storage, but I don't see it this time.

> Hm, I have the feeling that part of the problem is that in-flight
> requests are basically lost when a disconnect/reconnect happens.

So if a disconnect doesn't happen (and it appears one doesn't) - is there
any tunable to set the I/O timeout?

'sysctl -a | grep timeout' finds things like:

  kern.cam.ada.default_timeout=30

I might see if that has any effect (from memory - as I'm out of the office
now - it did seem to be about 30 seconds before the VMs started logging
I/O-related errors to the console).

As it's a pure test setup, I can try adjusting this without fear of
breaking anything :) Though I'm open to other suggestions...

FWIW - whose responsibility is it to re-send lost "in flight" data? E.g.
if a write is 'in flight' when an I/O error occurs in the lower layers of
XenServer, is it XenServer's responsibility to retry it before giving up,
or does it just push the error straight back to the VM, expecting the VM
to retry it? [Or a bit of both?] Just curious.

-Karl
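
A minimal sketch of the experiment described above, assuming the guest's
disks really do attach through the CAM 'ada' driver so the kern.cam.ada.*
sysctls apply (under the Xen PV drivers the disks typically show up as
'xbd' devices instead, where this tunable would not take effect); the
120-second value below is purely illustrative:

  # Check the current per-command timeout, in seconds (default is 30):
  sysctl kern.cam.ada.default_timeout

  # Raise it at runtime, so outstanding I/O can survive a storage
  # failover longer than 30 seconds before an error is pushed up:
  sysctl kern.cam.ada.default_timeout=120

  # Persist the setting across reboots:
  echo 'kern.cam.ada.default_timeout=120' >> /etc/sysctl.conf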