Date: Wed, 20 Sep 2017 15:54:20 +0100
From: Karl Pielorz <kpielorz_lst@tdx.co.uk>
To: Roger Pau Monné <roger.pau@citrix.com>
Cc: freebsd-xen@freebsd.org
Subject: Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID: <E20E34C4D5A766D4854317A5@Mac-mini.local>
In-Reply-To: <20170920114418.pq6fhnexol2mvkxv@dhcp-3-128.uk.xensource.com>
References: <62BC29D8E1F6EA5C09759861@[10.12.30.106]> <20170920114418.pq6fhnexol2mvkxv@dhcp-3-128.uk.xensource.com>

--On 20 September 2017 at 12:44:18 +0100 Roger Pau Monné
<roger.pau@citrix.com> wrote:

>> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant
>> of the I/O delays that occur during a storage fail over?
>
> Do you know whether the VMs saw the disks disconnecting and then
> connecting again?

I can't see any evidence the drives actually get 'disconnected' from the
VM's point of view. Plenty of I/O errors - but no "device destroyed" type
stuff.

I have seen that kind of error logged on our test kit - when we
deliberately failed non-HA storage - but I don't see it this time.

> Hm, I have the feeling that part of the problem is that in-flight
> requests are basically lost when a disconnect/reconnect happens.

So if a disconnect doesn't happen (and it appears it doesn't) - is there
any tunable to set the I/O timeout?

'sysctl -a | grep timeout' finds things like:

  kern.cam.ada.default_timeout=30

I might see if that has any effect (from memory - as I'm out of the office
now - it did seem to be about 30 seconds before the VMs started logging
I/O related errors to the console).

As it's a pure test setup - I can try adjusting this without fear of
breaking anything :)

Though I'm open to other suggestions...

fwiw - Whose responsibility is it to re-send lost "in flight" data? E.g. if
a write is 'in flight' when an I/O error occurs in the lower layers of
XenServer, is it XenServer's responsibility to retry that - before giving
up - or does it just push the error straight back to the VM - expecting the
VM to retry it? [or a bit of both?] - just curious.

-Karl