From owner-freebsd-arch@FreeBSD.ORG  Wed Aug  1 22:02:41 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 97500106566B;
	Wed,  1 Aug 2012 22:02:41 +0000 (UTC)
	(envelope-from listlog2011@gmail.com)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 74C538FC17;
	Wed,  1 Aug 2012 22:02:41 +0000 (UTC)
Received: from [127.0.0.1] (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q71M2dfU009407;
	Wed, 1 Aug 2012 22:02:40 GMT (envelope-from listlog2011@gmail.com)
Message-ID: <5019A77C.4030503@gmail.com>
Date: Thu, 02 Aug 2012 06:02:36 +0800
From: David Xu <listlog2011@gmail.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1;
	rv:14.0) Gecko/20120713 Thunderbird/14.0
MIME-Version: 1.0
To: Warner Losh <imp@bsdimp.com>
References: <5018992C.8000207@freebsd.org>
	<20120801071934.GJ2676@deviant.kiev.zoral.com.ua>
	<5018E1FC.4080609@gmail.com>
	<D7DC1F82-6CAA-4359-847C-EE89357D8538@bsdimp.com>
In-Reply-To: <D7DC1F82-6CAA-4359-847C-EE89357D8538@bsdimp.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org,
	davidxu@freebsd.org
Subject: Re: short read/write and error code
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: davidxu@freebsd.org
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 01 Aug 2012 22:02:41 -0000

On 2012/8/1 22:12, Warner Losh wrote:
> On Aug 1, 2012, at 1:59 AM, David Xu wrote:
>
>> On 2012/8/1 15:19, Konstantin Belousov wrote:
>>> On Wed, Aug 01, 2012 at 10:49:16AM +0800, David Xu wrote:
>>>> POSIX requires write() to return actually bytes written, same rule is
>>>> applied to read().
>>>>
>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
>>>>> RETURN VALUE
>>>>>
>>>>> Upon successful completion, write() [XSI]   and pwrite()  shall
>>>>> return the number of bytes actually written to the file associated
>>>>> with fildes. This number shall never be greater than nbyte.
>>>>> Otherwise, -1 shall be returned and errno set to indicate the error.
>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
>>>>> RETURN VALUE
>>>>>
>>>>> Upon successful completion, read() [XSI]   and pread()  shall return
>>>>> a non-negative integer indicating the number of bytes actually read.
>>>>> Otherwise, the functions shall return -1 and set errno to indicate
>>>>> the error.
>>> Note that the wording is only about successful return, not for the case
>>> when error occurred. I do think that if fo_read() returned an error, and
>>> error is not of the kind 'interruption', then the error shall be returned
>>> as is.
>> I do think data is more important than error code.  Do you think if a 512 bytes block is bad,
>> all bytes in the block should be thrown away while you could really get some bytes from it,
>> this might be very important to someone, such as a password or a bank account,  this
>> is just an example, whether filesystem works in this way is irrelevant.
> You do know that with disk drives it is an all or nothing sort of thing at the sector level.  Either you get the whole thing, or you get none of it.  There's no partial sector reads, and there's no way to get the data generally.  Some drives sometimes allow you to access raw tracks, but those interfaces are never connected to read, but usually an ioctl that issues the special command and returns the results.  And even then, it returns everything (perhaps including the ECC bytes)
Sorry, my example was not precise; see below.

>> While program continues to execute,  next read()/write() should return -1 and errno will be
>> set, I think both socket and pipe already work in this way, it is dofileread/dofilewrite have
>> made it not happen.
> Usually it is up to the driver to make this decision.  Most drivers already return 0 when they've put any data into the buffer.  The case where there's an error returned from the driver and also data indicated by resid would be vanishingly small.
Okay, if drivers work this way, I have no complaint.


>>>> I have following patch to fix our code to be compatible with POSIX:
>>> ...
>>>
>>>> -current only resets error code to zero for short write when code is
>>>> ERESTART, EINTR or EWOULDBLOCK.
>>>> But this is incorrect, at least for pipe, when EPIPE is returned,
>>>> some bytes may have already been written. For a named pipe, I may don't
>>>> care a reader is disappeared or not, because for named pipe, a new
>>>> reader can come in and talk with writer again,  so I need to know
>>>> how many bytes have been written, same is applied to reader, I don't
>>>> care writer is gone, it can come in again and talk with reader. So I
>>>> suggest to remove surplus code in -current's dofilewrite() and
>>>> dofileread().
>>> Then fix the pipe code, and not introduce the behaviour change for all
>>> file types ?
>> see above, I think data is more important than error code,  and next read/write will
>> get the error.
>>
>>>> For EPIPE, We still deliver SIGPIPE to current thread, but returns
>>>> actually bytes written.
>>> And this sounds wrong. I think that fixing the code for pipes would also
>>> semi-magically makes this correct.
> Yes.  Pipes are too magical and don't match devices very well.
Unfortunately, dofileread and dofilewrite are very high-level APIs. They
know nothing about devices; they are file-oriented, not device-oriented.
Let me explain what is wrong in their code.

dofileread asks fo_read to read data into a user-space buffer, and that
buffer can be very large. fo_read is an intermediate layer. Assume it
supports large buffers: for example, it is a file system's read
interface, or some other intermediate code that accepts sizes up to the
maximum value of ssize_t. At the lowest level, all of these ask a device
driver to read the data. Assume the device driver only supports a 16K
buffer. If the user gives dofileread a 100M buffer, the intermediate
layer splits the request into 16K chunks and reads 16K at a time until,
at the final block, it hits a problem and the device driver returns EIO.
The fo_read operation has read 100M-16K into the buffer, and it returns
EIO too. Then what happens in dofileread? It simply returns EIO and
throws the 100M-16K of data away.
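
The scenario above can be modelled in user space. This is only a sketch
under stated assumptions, not the kernel code: fake_dev, layer_read,
read_discard_on_error and read_keep_partial are hypothetical names, and
the 16K limit is the assumed driver limit from the example. The first
policy drops the byte count whenever an error is reported; the second
returns the byte count whenever any data was transferred.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define CHUNK	(16 * 1024)	/* assumed per-request driver limit */

/* Hypothetical device: any request reaching past fail_at gets EIO. */
struct fake_dev {
	size_t fail_at;
};

static int
fake_dev_read(const struct fake_dev *d, size_t off, char *buf, size_t len)
{
	if (off + len > d->fail_at)
		return (EIO);
	memset(buf, 'x', len);
	return (0);
}

/*
 * Hypothetical intermediate layer: split the request into CHUNK-sized
 * pieces, stop at the first error, and report what is left in *resid.
 */
static int
layer_read(const struct fake_dev *d, char *buf, size_t len, size_t *resid)
{
	size_t done = 0;
	int error = 0;

	while (done < len) {
		size_t n = len - done < CHUNK ? len - done : CHUNK;

		error = fake_dev_read(d, done, buf + done, n);
		if (error != 0)
			break;
		done += n;
	}
	*resid = len - done;
	return (error);
}

/* Policy A: any error hides the bytes that were already transferred. */
static ssize_t
read_discard_on_error(const struct fake_dev *d, char *buf, size_t len)
{
	size_t resid;
	int error = layer_read(d, buf, len, &resid);

	if (error != 0) {
		errno = error;
		return (-1);
	}
	return ((ssize_t)(len - resid));
}

/* Policy B: report the error only if nothing at all was transferred. */
static ssize_t
read_keep_partial(const struct fake_dev *d, char *buf, size_t len)
{
	size_t resid;
	int error = layer_read(d, buf, len, &resid);

	if (error != 0 && resid == len) {
		errno = error;
		return (-1);
	}
	return ((ssize_t)(len - resid));
}
```

With a four-chunk buffer and a device that fails in the last chunk,
policy A returns -1 with errno set to EIO, while policy B returns the
three chunks that were read.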

Now, because I know how the insane dofileread works, I split the 100M
read request into small chunks in user space. I request 16K at a time
and happily get 100M-16K bytes, until at the final block I hit the
problem. Only 16K bytes cannot be read.

If the lowest layer is a byte stream, I would read 100M-1 bytes back and
lose only 1 byte. I have rescued my data.
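
The user-space workaround can be sketched like this. read_chunked is a
hypothetical helper name, and the chunk size is whatever the caller
assumes the lowest layer handles in one request. It uses only read(2):
it keeps every byte that arrived before the failure and hands the error
back separately through errp.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes from fd in pieces of at most chunk bytes.
 * Returns the number of bytes actually stored in buf. *errp is 0 on
 * end of file or a full read, otherwise the errno of the failing read.
 */
static size_t
read_chunked(int fd, char *buf, size_t len, size_t chunk, int *errp)
{
	size_t done = 0;

	*errp = 0;
	while (done < len) {
		size_t want = len - done < chunk ? len - done : chunk;
		ssize_t n = read(fd, buf + done, want);

		if (n > 0) {
			done += (size_t)n;
		} else if (n == 0) {
			break;			/* end of file */
		} else if (errno != EINTR) {
			*errp = errno;		/* keep the data, report the error */
			break;
		}
		/* EINTR with nothing read: retry the same chunk. */
	}
	return (done);
}
```

A caller that gets a nonzero *errp still owns the first `done` bytes of
the buffer, which is the whole point of splitting the request.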

Isn't the difference very large?

The same problem applies to dofilewrite. I pass 100M of data to
dofilewrite, it writes out 100M-16K bytes, and then it tells me that it
did not write anything. If it is a byte stream, it wrote 100M-1 bytes
and only 1 byte hit a problem.

If my buffer size is 500M, isn't the problem more serious?

I think there could be a sysctl that controls how many bytes of
completed I/O count as important. I would set it to 1; for somebody else
the value could be DON'T CARE. dofileread or dofilewrite would return
the number of bytes transferred if the completed I/O is larger than that
value; otherwise it returns the error code.
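
A rough sketch of the decision such a knob would drive. No such sysctl
exists; partial_io_result and PARTIAL_IO_DONT_CARE are hypothetical
names, and treating the threshold as "at least this many bytes" (so
that a setting of 1 keeps even a single byte) is an assumption about
the intended comparison.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical setting: never report partial I/O when an error occurred. */
#define PARTIAL_IO_DONT_CARE	((ssize_t)-1)

/*
 * done:        bytes transferred before the error (or in total)
 * error:       0 on success, otherwise an errno value
 * min_partial: assumed threshold taken from the hypothetical sysctl
 */
static ssize_t
partial_io_result(ssize_t done, int error, ssize_t min_partial)
{
	if (error == 0)
		return (done);
	if (min_partial != PARTIAL_IO_DONT_CARE && done > 0 &&
	    done >= min_partial)
		return (done);	/* enough data moved: report the short I/O */
	errno = error;
	return (-1);
}
```

With the threshold at 1, a 100-byte transfer that ends in EIO returns
100; with DON'T CARE, the same transfer returns -1 and sets errno.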


> Warner