Date:      Wed, 20 Sep 2017 15:54:20 +0100
From:      Karl Pielorz <kpielorz_lst@tdx.co.uk>
To:        Roger Pau Monné <roger.pau@citrix.com>
Cc:        freebsd-xen@freebsd.org
Subject:   Re: Storage 'failover' largely kills FreeBSD 10.x under XenServer?
Message-ID:  <E20E34C4D5A766D4854317A5@Mac-mini.local>
In-Reply-To: <20170920114418.pq6fhnexol2mvkxv@dhcp-3-128.uk.xensource.com>
References:  <62BC29D8E1F6EA5C09759861@[10.12.30.106]> <20170920114418.pq6fhnexol2mvkxv@dhcp-3-128.uk.xensource.com>


--On 20 September 2017 at 12:44:18 +0100 Roger Pau Monné
<roger.pau@citrix.com> wrote:

>> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant
>> of the I/O delays that occur during a storage fail over?
>
> Do you know whether the VMs saw the disks disconnecting and then
> connecting again?

I can't see any evidence the drives actually get 'disconnected' from the
VM's point of view. Plenty of I/O errors - but no "device destroyed" type
stuff.

I have seen that kind of error logged on our test kit - when we deliberately
failed non-HA storage - but I don't see it this time.

> Hm, I have the feeling that part of the problem is that in-flight
> requests are basically lost when a disconnect/reconnect happens.

So if a disconnect doesn't happen (and it appears it doesn't) - is there any
tunable to set the I/O timeout?

'sysctl -a | grep timeout' finds things like:

  kern.cam.ada.default_timeout=30

I might see if that has any effect (from memory - as I'm out of the office
now - it did seem to be about 30 seconds before the VMs started logging
I/O related errors to the console).

As it's a pure test setup - I can try adjusting this without fear of
breaking anything :)
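
As a rough sketch of that experiment: the commands below read the current
value and raise it at runtime. The new value of 120 seconds is an arbitrary
assumption for illustration, and nothing in this thread establishes that
this CAM setting governs the Xen virtual disks at all.

```shell
# Show the current default timeout (in seconds) for ada disks.
sysctl kern.cam.ada.default_timeout

# Raise it at runtime; needs root. 120 is an illustrative value only.
sysctl kern.cam.ada.default_timeout=120

# Confirm the change took effect.
sysctl -n kern.cam.ada.default_timeout
```

A runtime change like this does not survive a reboot, which suits a
throwaway test; a line in /etc/sysctl.conf would be needed to keep it.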

Though I'm open to other suggestions...

fwiw - Whose responsibility is it to re-send lost "in flight" data? E.g. if
a write is 'in flight' when an I/O error occurs in the lower layers of
XenServer, is it XenServer's responsibility to retry that before giving up,
or does it just push the error straight back to the VM - expecting the VM
to retry it? [or a bit of both?] - just curious.

-Karl
