From owner-freebsd-arch@FreeBSD.ORG  Thu Aug  2 05:54:11 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 93AF3106564A;
	Thu,  2 Aug 2012 05:54:11 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
	[211.29.132.184])
	by mx1.freebsd.org (Postfix) with ESMTP id 291438FC14;
	Thu,  2 Aug 2012 05:54:10 +0000 (UTC)
Received: from c122-106-171-246.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-246.carlnfd1.nsw.optusnet.com.au [122.106.171.246])
	by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q725ruHI017740
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 2 Aug 2012 15:54:00 +1000
Date: Thu, 2 Aug 2012 15:53:56 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: davidxu@freebsd.org
In-Reply-To: <5019A77C.4030503@gmail.com>
Message-ID: <20120802150009.G870@besplex.bde.org>
References: <5018992C.8000207@freebsd.org>
	<20120801071934.GJ2676@deviant.kiev.zoral.com.ua>
	<5018E1FC.4080609@gmail.com>
	<D7DC1F82-6CAA-4359-847C-EE89357D8538@bsdimp.com>
	<5019A77C.4030503@gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org
Subject: Re: short read/write and error code
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 02 Aug 2012 05:54:11 -0000

On Thu, 2 Aug 2012, David Xu wrote:

> On 2012/8/1 22:12, Warner Losh wrote:
>> On Aug 1, 2012, at 1:59 AM, David Xu wrote:

Please trim quotes!

>[>>>>...] trimmed

>> You do know that with disk drives it is an all or nothing sort of thing at 
>> the sector level.  Either you get the whole thing, or you get none of it. 
>> There's no partial sector reads, and there's no way to get the data 
>> generally.  Some drives sometimes allow you to access raw tracks, but those 
>> interfaces are never connected to read, but usually an ioctl that issues 
>> the special command and returns the results.  And even then, it returns 
>> everything (perhaps including the ECC bytes)
> Sorry, my example is not precise, see below.

> Unfortunately, dofileread and dofilewrite are a very high-level API; it
> does not deal with devices.  It is a file-oriented API, not a
> device-oriented one.  Let me explain what's wrong in their code.
> dofileread asks fo_read to read data back into a user-space buffer, and
> that buffer can be very large.  fo_read is an intermediate layer.  Assume
> it supports large buffer sizes: for example, it is the file system's
> interface for reading data, or some other intermediate code that also
> supports buffer sizes up to the maximum value of ssize_t.  At the lowest
> level, they all ask the device driver to read the data.  Assume the
> device driver only supports a 16K buffer.  If the user gives dofileread a
> 100M buffer, the intermediate layer splits the request into 16K chunks
> and repeatedly reads 16K bytes until, at the final block, it encounters
> a problem and the device driver returns EIO.  The fo_read operation has
> read 100M-16K into the buffer, and it returns EIO too.  Then what happens
> in dofileread?  It simply returns EIO and throws the 100M-16K of data
> away.

This is a perfect example.  All (?) block i/o goes through physio.
(This is well obfuscated by #define'ing physread and physwrite as
physio and never using physio directly.)  Most block devices are
disk or tape ones, and most disks go through the additional geom
layer(s).  All (?) layers follow the FreeBSD API and correctly pass
back both the i/o count and the error code for the block that
failed, if any.  Splitting into blocks of size dev->si_iosize_max
occurs in the physio layer.  This size defaults to DFLTPHYS (64K),
but geom bogusly advertises that it is always MAXPHYS (128K).  Then
if the actual device's si_iosize_max is less than MAXPHYS, geom does
an additional layer of splitting to get the block size down to
whatever the device supports.  Broken device drivers might do
additional splitting.
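
To make the splitting concrete, here is a minimal user-space sketch in C.
It is not the physio code; struct xfer_result, chunk_fn, split_xfer() and
fail_at() are hypothetical names invented for illustration.  It only shows
a layer that splits a request into chunks of at most iosize_max bytes and
passes back both the count and the error for the chunk that failed:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical result type: the byte count and the error both survive. */
struct xfer_result {
    size_t done;   /* bytes transferred before the failure, if any */
    int    error;  /* 0, or the errno for the chunk that failed */
};

/* Hypothetical per-chunk routine standing in for the driver. */
typedef int chunk_fn(size_t offset, size_t len, void *arg);

static struct xfer_result
split_xfer(size_t total, size_t iosize_max, chunk_fn *do_chunk, void *arg)
{
    struct xfer_result r = { 0, 0 };

    assert(iosize_max > 0);
    while (r.done < total) {
        size_t len = total - r.done;

        if (len > iosize_max)
            len = iosize_max;
        r.error = do_chunk(r.done, len, arg);
        if (r.error != 0)
            break;          /* keep r.done: earlier chunks succeeded */
        r.done += len;
    }
    return (r);
}

/* Example chunk routine: EIO for any chunk containing the bad offset. */
static int
fail_at(size_t offset, size_t len, void *arg)
{
    size_t bad = *(size_t *)arg;

    return (bad >= offset && bad < offset + len ? EIO : 0);
}
```

With 16K chunks and a bad byte at offset 100K, split_xfer() reports 96K
done together with EIO, so the caller can still decide what to do with
the part that worked.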

An error can easily be generated by writing to the end of a disk.
This error should be ENOSPC.  It should always happen when copying
one disk to a smaller one using primitive methods like cp or dd.
The geom level is or should be smart about this.  Modulo breakage,
it writes a partial block if a write begins just before the end of
a disk, and does not return an error for this case.  The partial
block of course must have a size that is a multiple of the disk's
block size.  If the write is exactly at EOF, then the error should
be ENOSPC.  If the write is after EOF, then the error should be
either ENOSPC or EINVAL.  Trimming the final i/o gives a short write
with no error, like some claim all device drivers should do if they
don't want to return the error.  But this only works if there is no
layering!
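
The end-of-disk rules above can be written down as a small sketch.  This
is not geom code; trim_write() and its parameters are hypothetical, and
the choice of EINVAL (rather than ENOSPC) for a write past EOF is just
one of the two options:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Decide how much of a write of len bytes at offset can go to a disk of
 * mediasize bytes.  On success, *trimmed is the (possibly shortened)
 * length to write.
 */
static int
trim_write(uint64_t offset, uint64_t len, uint64_t mediasize,
    uint64_t sectorsize, uint64_t *trimmed)
{
    assert(sectorsize > 0 && mediasize % sectorsize == 0);
    *trimmed = 0;
    if (offset % sectorsize != 0 || len % sectorsize != 0)
        return (EINVAL);    /* not sector-aligned */
    if (offset > mediasize)
        return (EINVAL);    /* after EOF; ENOSPC is also defensible */
    if (offset == mediasize)
        return (len == 0 ? 0 : ENOSPC);     /* exactly at EOF */
    /* Begins before EOF: short write, no error. */
    *trimmed = len < mediasize - offset ? len : mediasize - offset;
    return (0);
}
```

Since offset, len and mediasize are all multiples of the sector size, the
trimmed length is too.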

You can now copy a small 4.5GB disk using primitive methods in a single
i/o if you have enough RAM (dd if=disk1 of=disk2 bs=8g; 8g gives a
safety margin).  There might be no real i/o errors.  There should be
an ENOSPC when EOF is hit on the target.  The count of 4.5GB together
with the error code ENOSPC should be returned to dofilewrite().
dofilewrite() is broken and returns only ENOSPC, discarding the count.
It is actually safe to ignore this particular error, and programs like
GNU tar always did so.  It is the standard way of saying that the whole
i/o succeeded, except that it tried to overrun EOF.  An EIO in the middle
couldn't be ignored, and the i/o of many GB would have to be retried,
perhaps very slowly with 512-byte blocks to locate the failing block(s)
(actually retry with a sane block size before that).
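
One possible reading of what the top layer should do with a (count,
error) pair is sketched below.  This is not the dofilewrite() source;
finish_io() is a hypothetical helper, and "any progress wins over the
error" is an assumption about the policy being argued for here:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * requested: bytes the caller asked for
 * resid:     bytes still not transferred when the lower layer returned
 * error:     errno from the lower layer, or 0
 * Returns the error to report; *retval is the count to report.
 */
static int
finish_io(size_t requested, size_t resid, int error, size_t *retval)
{
    size_t done;

    assert(resid <= requested);
    done = requested - resid;
    if (error != 0 && done > 0)
        error = 0;      /* report the short count, not the error */
    *retval = done;
    return (error);
}
```

Under this policy the caller sees a short write first, and the error
shows up on the next call, which makes no progress at all.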

> Now because I know how the insane dofileread works, I split the 100M read
> request into small chunks from user space.  I request data in 16K chunks,
> and I happily get 100M-16K bytes until, at the final block, I encounter
> a problem.  Only 16K bytes cannot be read.
>
> If the lowest layer is a byte stream, I would read 100M-1 bytes back and
> lose only 1 byte.  I have rescued my data.
>
> Isn't the difference very large?

Of course, you can try using the large block size first and fall back to
a small block size only on error.  This works for most block devices but
not for pipes or for devices with external connections.
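
A minimal user-space sketch of that fallback, assuming a seekable
descriptor so that pread(2) can re-read the same range.
read_with_fallback() is a hypothetical name, and retrying only on EIO is
an assumption made for the example:

```c
#include <sys/types.h>
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Try one large read; if it fails with EIO, re-read the same range in
 * pieces of at most small bytes and salvage whatever precedes the
 * failing piece.
 */
static ssize_t
read_with_fallback(int fd, char *buf, size_t big, size_t small, off_t off)
{
    size_t done = 0;
    ssize_t n;

    assert(small > 0);
    n = pread(fd, buf, big, off);
    if (n >= 0 || errno != EIO)
        return (n);     /* success, or an error that retrying won't fix */

    while (done < big) {
        size_t len = big - done < small ? big - done : small;

        n = pread(fd, buf + done, len, off + (off_t)done);
        if (n <= 0)
            break;      /* failing piece located (or EOF) */
        done += (size_t)n;
    }
    /* Report the salvaged bytes if any; else the last result as-is. */
    return (done > 0 ? (ssize_t)done : n);
}
```

On a pipe the first pread() fails with ESPIPE and nothing is retried,
which is the "not for pipes" case: the data cannot be asked for twice.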

> The same problem applies to dofilewrite.  I pass 100M of data to
> dofilewrite, it writes out 100M-16K bytes, and then it tells me that it
> did not write anything.  If it is a byte stream, it writes 100M-1 bytes,
> and only 1 byte encounters a problem.
>
> If my buffer size is 500M, isn't the problem more serious?

Now falling back doesn't work for write-once devices, and is difficult for
tapes.

> I think there could be a sysctl to control how many bytes of I/O are
> important.  For me, I would set it to 1; for somebody else, the value
> could be DON'T CARE.  dofileread or dofilewrite would return the number
> of bytes of I/O done if the size of the completed I/O is larger than
> the value; otherwise, it returns the error code.

No, it should just work.

Bruce