Date:      Thu, 2 Aug 2012 04:04:53 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Warner Losh <imp@bsdimp.com>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org, davidxu@freebsd.org
Subject:   Re: short read/write and error code
Message-ID:  <20120802032158.S2978@besplex.bde.org>
In-Reply-To: <D7DC1F82-6CAA-4359-847C-EE89357D8538@bsdimp.com>
References:  <5018992C.8000207@freebsd.org> <20120801071934.GJ2676@deviant.kiev.zoral.com.ua> <5018E1FC.4080609@gmail.com> <D7DC1F82-6CAA-4359-847C-EE89357D8538@bsdimp.com>

On Wed, 1 Aug 2012, Warner Losh wrote:

> On Aug 1, 2012, at 1:59 AM, David Xu wrote:
>
>> On 2012/8/1 15:19, Konstantin Belousov wrote:
>>> On Wed, Aug 01, 2012 at 10:49:16AM +0800, David Xu wrote:

Please trim quotes.

>>>> ...[some trimmed]

>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
>>>>> RETURN VALUE
>>>>>
>>>>> Upon successful completion, read() [XSI]   and pread()  shall return
>>>>> a non-negative integer indicating the number of bytes actually read.
>>>>> Otherwise, the functions shall return -1 and set errno to indicate
>>>>> the error.
>>> Note that the wording is only about successful return, not for the case
>>> when error occurred. I do think that if fo_read() returned an error, and
>>> error is not of the kind 'interruption', then the error shall be returned
>>> as is.
>> I do think data is more important than error code.  Do you think if a 512 bytes block is bad,
>> all bytes in the block should be thrown away while you could really get some bytes from it,
>> this might be very important to someone, such as a password or a bank account,  this
>> is just an example, whether filesystem works in this way is irrelevant.
>
> You do know that with disk drives it is an all or nothing sort of thing at the sector level.  Either you get the whole thing, or you get none of it.  There's no partial sector reads, and there's no way to get the data generally.  Some drives sometimes allow you to access raw tracks, but those interfaces are never connected to read, but usually an ioctl that issues the special command and returns the results.  And even then, it returns everything (perhaps including the ECC bytes)

Please use the Unix newline character.

This (the above this, not the Unix newline character) makes the upper-level
error handling a non-problem for partial blocks.  Disk drives are also
directly addressable, so they can sometimes back out of writes (unwrite
all successfully written data), like for regular files.

>> While program continues to execute,  next read()/write() should return -1 and errno will be
>> set, I think both socket and pipe already work in this way, it is dofileread/dofilewrite have
>> made it not happen.
>
> Usually it is up to the driver to make this decision.  Most drivers already return 0 when they've put any data into the buffer.  The case where there's an error returned from the driver and also data indicated by resid would be vanishingly small.

I hope most drivers aren't that broken.  Backing out of a transfer is
hard.  Most drivers don't even know that they should.  OTOH, they also
don't know how much they wrote, so depending on them returning the
correct count of what they wrote is dangerous.  Normal practice seems
to be to uiomove() to a buffer and then start sending the buffer to
the hardware.  If the latter fails in the middle, then many drivers
don't know exactly where it failed, and most don't uiounmove() the
data.  If the driver write returned with buffered data still not sent
to the hardware, write() must return the count of the amount buffered
and another mechanism must be used to report errors.  If the driver
write blocks waiting for buffer space or the hardware, then failure
is more likely to be detected before write() returns.

uiounmove() of course doesn't exist.  Drivers in sys/dev do a fairly
large number of direct accesses to uio_resid, but not many write
accesses to it (half of the write accesses to it are in cxgb).
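
To make the failure mode above concrete, here is a toy userland model of
the "copy to a buffer, then send the buffer to the hardware" pattern.
It is only a sketch: struct fake_hw, hw_send() and drv_write() are
hypothetical names invented for illustration, memcpy() stands in for
uiomove(), and the model assumes the hardware reports exactly how many
bytes it accepted, which is the very thing many real drivers cannot do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define DRV_BUFSZ 64	/* size of the driver's staging buffer (assumed) */

/* Hypothetical hardware: accepts hw_capacity bytes in total, then fails. */
struct fake_hw {
	size_t hw_capacity;
	size_t hw_accepted;
};

/* Send up to len bytes; store the accepted count, return 0 or an errno. */
static int
hw_send(struct fake_hw *hw, const char *buf, size_t len, size_t *sent)
{
	size_t room = hw->hw_capacity - hw->hw_accepted;
	size_t n = len < room ? len : room;

	(void)buf;		/* the model discards the payload */
	hw->hw_accepted += n;
	*sent = n;
	return (n < len ? EIO : 0);
}

/*
 * Model of a driver write: stage each chunk (memcpy() stands in for
 * uiomove()), then push it to the hardware.  If anything was transferred
 * before the failure, return the count and leave the error in *errorp;
 * only a transfer that moved nothing returns -1 with errno set.
 */
static ssize_t
drv_write(struct fake_hw *hw, const char *ubuf, size_t len, int *errorp)
{
	char staging[DRV_BUFSZ];
	size_t done = 0;
	int error = 0;

	assert(hw != NULL && ubuf != NULL && errorp != NULL);
	while (done < len && error == 0) {
		size_t chunk = len - done;
		size_t sent;

		if (chunk > sizeof(staging))
			chunk = sizeof(staging);
		memcpy(staging, ubuf + done, chunk);
		error = hw_send(hw, staging, chunk, &sent);
		done += sent;	/* only trustworthy if the hardware reports it */
	}
	*errorp = error;
	if (done > 0)
		return ((ssize_t)done);
	if (error != 0) {
		errno = error;
		return (-1);
	}
	return (0);
}
```

With a 100-byte hw_capacity and a 200-byte request, the first call
returns 100 with EIO left in *errorp, and the next call returns -1 with
errno set to EIO.  A driver that cannot fill in `sent' reliably has no
correct count to return, and one that staged data it never sent cannot
give it back.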

>>>> I have following patch to fix our code to be compatible with POSIX:
>>> ...
>>> Then fix the pipe code, and not introduce the behaviour change for all
>>> file types ?
>> see above, I think data is more important than error code,  and next read/write will
>> get the error.
>>
>>>> For EPIPE, We still deliver SIGPIPE to current thread, but returns
>>>> actually bytes written.
>>> And this sounds wrong. I think that fixing the code for pipes would also
>>> semi-magically makes this correct.
>
> Yes.  Pipes are too magical and don't match devices very well.

No.  They match devices that can't unread or unwrite data fairly well.
They are just a bit simpler because the consumer of the data is local
and in software, so you can more easily see whether the written data went
out.
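
A small userland sketch of the pipe case, using only POSIX calls: data
written to a pipe cannot be unwritten once the reader goes away, and the
error only shows up on the next write().  The helper name pipe_demo() is
made up for illustration, and SIGPIPE is ignored temporarily so that the
second write() returns instead of killing the process.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write 3 bytes into a fresh pipe, close the read end, then write 3 more.
 * Stores both write() results and the errno left by the second one.
 * Returns 0 if the pipe could be created, -1 otherwise.
 */
static int
pipe_demo(ssize_t *first, ssize_t *second, int *second_errno)
{
	void (*old_handler)(int);
	int fds[2];

	assert(first != NULL && second != NULL && second_errno != NULL);
	if (pipe(fds) == -1)
		return (-1);

	/* Ignore SIGPIPE so the failing write() returns -1 with EPIPE. */
	old_handler = signal(SIGPIPE, SIG_IGN);

	*first = write(fds[1], "abc", 3);	/* fits in the pipe buffer */
	close(fds[0]);				/* the consumer goes away */

	errno = 0;
	*second = write(fds[1], "def", 3);	/* no reader is left */
	*second_errno = errno;

	close(fds[1]);
	signal(SIGPIPE, old_handler);		/* restore the old disposition */
	return (0);
}
```

The first write() reports 3 bytes written even though nothing will ever
read them; the second one returns -1 with errno set to EPIPE.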

Bruce
