Date:      Sat, 25 Nov 2017 11:37:10 +0000
From:      "Poul-Henning Kamp" <phk@phk.freebsd.dk>
To:        Scott Long <scottl@samsco.org>
Cc:        Andriy Gapon <avg@FreeBSD.org>, FreeBSD FS <freebsd-fs@freebsd.org>, Warner Losh <imp@bsdimp.com>, freebsd-geom@freebsd.org
Subject:   Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
Message-ID:  <30379.1511609830@critter.freebsd.dk>
In-Reply-To: <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>
References:  <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org> <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>

--------
In message <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>, Scott Long writes:

> Why is overloading EIO so bad?  brelse() will call bdirty() when a BIO_WRITE
> command has failed with EIO.  Calling bdirty() has the effect of retrying the I/O.
> This disregards the fact that disk drivers only return EIO when they've decided
> that the I/O cannot be retried.  It has no termination condition for the
> retries, and will endlessly retry I/O in vain; I've seen this quite frequently.
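
For context, the retry path being described looks roughly like this.  This
is a paraphrase from memory, not the actual brelse() source, and the exact
field and flag names may differ:

/*
 * Rough paraphrase of the behaviour described above -- not the
 * actual brelse() source.
 */
void
brelse_sketch(struct buf *bp)
{
	if (bp->b_iocmd == BIO_WRITE &&
	    (bp->b_ioflags & BIO_ERROR) &&
	    bp->b_error == EIO) {
		/*
		 * The driver already decided the write cannot be
		 * retried, but the buffer is simply marked dirty
		 * again, so it will be written again -- and again --
		 * with no retry limit.
		 */
		bp->b_ioflags &= ~BIO_ERROR;
		bdirty(bp);
		return;
	}
	/* ... rest of the release path ... */
}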

The really annoying thing about this particular class of errors
is that if we propagated them up to the filesystems, very often
things could be relocated to different blocks and we would avoid
unnecessary filesystem corruption.

The real fundamental deficiency is that we do not have a way to say "give up
if this bio cannot be completed in X time", which is what people actually want.
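
To make that wish concrete, it would amount to a deadline carried on the
request itself.  This is purely a hypothetical sketch; struct bio has no
such member and no such interface exists today:

/* Hypothetical only -- struct bio has no deadline member today. */
struct bio_with_deadline {
	struct bio	bio;		/* the ordinary request */
	struct timespec	deadline;	/* absolute time after which the
					   request should fail, e.g. with
					   ETIMEDOUT */
};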

That is surprisingly hard to provide; there are far too many
corner cases for me to enumerate them all, but let me give just one
example:

Imagine you issue a deadlined write to a RAID5 thing.  Three component
writes happen smoothly, but the last two fail the deadline, with
no way to predict how long it will take before they complete
or fail.

* Does the bio write transaction fail ?

* Does the bio write transaction time out ?

* Do you attempt to complete the write to the RAID5 ?

* Where do you store a copy of the data if you do ?

* What happens next time a read happens on this bio's extent ?

Then for an encore, imagine it was a read bio: Three DMAs go smoothly,
two are outstanding and you don't know if/when they will complete/fail.

* If you fail or time out the bio, how do you "taint" the space
  being read into while the two remaining DMAs are still outstanding?

* What if that space is mapped into userland ?

* What if that space is being executed ?

* What if one of the two outstanding DMAs later returns garbage ?

My conclusion back when I did GEOM was that the only way to
do something like this sanely is to have a special GEOM do it
for you, one which always allocates a temp-space:

	allocate temp buffer
	if (write)
		copy write data to temp buffer
	issue bio downwards on temp buffer
	if timeout
		park temp buffer until biodone
		return(timeout)
	if (read)
		copy temp buffer to read space
	return (ok/error)
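
The same idea, sketched in userland C.  Nothing here is GEOM API:
deadlined_write(), slow_write() and struct parked_io are invented for the
sketch.  The caller hands a private copy to a worker thread, waits with an
absolute deadline, and on timeout simply walks away while the copy stays
parked until the worker finally completes:

#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

struct parked_io {
	pthread_mutex_t	mtx;
	pthread_cond_t	done;
	int		completed;	/* set by the worker when the I/O finishes */
	int		abandoned;	/* set by the caller when it times out */
	void		*tmp;		/* the private copy of the data */
	size_t		len;
};

/* Stand-in for the bio issued downwards; may take arbitrarily long. */
static void
slow_write(const void *buf, size_t len)
{
	(void)buf;
	(void)len;
	sleep(5);
}

static void *
worker(void *arg)
{
	struct parked_io *p = arg;

	slow_write(p->tmp, p->len);
	pthread_mutex_lock(&p->mtx);
	p->completed = 1;
	if (p->abandoned) {
		/* The caller gave up long ago; clean up the parked copy. */
		pthread_mutex_unlock(&p->mtx);
		pthread_cond_destroy(&p->done);
		pthread_mutex_destroy(&p->mtx);
		free(p->tmp);
		free(p);
		return (NULL);
	}
	pthread_cond_signal(&p->done);
	pthread_mutex_unlock(&p->mtx);
	return (NULL);
}

/* Returns 0 on success, ETIMEDOUT if the deadline passed first. */
int
deadlined_write(const void *buf, size_t len, const struct timespec *deadline)
{
	struct parked_io *p;
	pthread_t tid;
	int error = 0;

	p = calloc(1, sizeof(*p));
	p->tmp = malloc(len);
	memcpy(p->tmp, buf, len);		/* copy write data to temp buffer */
	p->len = len;
	pthread_mutex_init(&p->mtx, NULL);
	pthread_cond_init(&p->done, NULL);

	pthread_create(&tid, NULL, worker, p);	/* issue "bio" downwards */
	pthread_detach(tid);

	pthread_mutex_lock(&p->mtx);
	while (!p->completed && error == 0)
		error = pthread_cond_timedwait(&p->done, &p->mtx, deadline);
	if (!p->completed) {
		p->abandoned = 1;		/* park temp buffer until it completes */
		pthread_mutex_unlock(&p->mtx);
		return (ETIMEDOUT);
	}
	pthread_mutex_unlock(&p->mtx);
	pthread_cond_destroy(&p->done);
	pthread_mutex_destroy(&p->mtx);
	free(p->tmp);
	free(p);
	return (0);
}

The read case is symmetric: copy out of the temp buffer only if the worker
finished before the deadline.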


-- 

Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe

Never attribute to malice what can adequately be explained by incompetence.


