Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 2 Aug 2012 04:58:49 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        arch@freebsd.org, David Xu <davidxu@freebsd.org>
Subject:   Re: short read/write and error code
Message-ID:  <20120802040542.G2978@besplex.bde.org>
In-Reply-To: <20120801162836.GO2676@deviant.kiev.zoral.com.ua>
References:  <5018992C.8000207@freebsd.org> <20120801071934.GJ2676@deviant.kiev.zoral.com.ua> <20120801183240.K1291@besplex.bde.org> <20120801162836.GO2676@deviant.kiev.zoral.com.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 1 Aug 2012, Konstantin Belousov wrote:

> On Wed, Aug 01, 2012 at 07:23:09PM +1000, Bruce Evans wrote:
>> On Wed, 1 Aug 2012, Konstantin Belousov wrote:
>>
>>> On Wed, Aug 01, 2012 at 10:49:16AM +0800, David Xu wrote:
> ...
>> The usability is specified for signals.  From an old POSIX draft:
>>
>> % 51235              If write( ) is interrupted by a signal before it
>> writes any data, it shall return -1 with errno set to
>> % 51236              [EINTR].
>> % 51237              If write( ) is interrupted by a signal after it
>> successfully writes some data, it shall return the
>> % 51238              number of bytes written.
> This is exactly what existing code does.

Yes, this is one of the few cases that is completely specified (when
complications for SA_RESTART are added).

>> POSIX formally defines "Successfully Transferred", mainly for aio.  I
>> couldn't find any formal definition of "successfully writes", but clearly
>> it is nonsense for a write to be unsuccessful if a reader on the local
>> system or on an external system has successfully read some of the data
>> written by the write.
>>
>> FreeBSD does try to convert EINTR to 0 after some data has been written,
>> to conform to the above.  SIGPIPE should return EINTR to be returned to
>> dofilewrite(), so there should be no problem for SIGPIPE.  But we were
>> reminded of this old FreeBSD bug by probelms with SIGPIPE.
> Sorry, I do not understand this, esp. second sentence.

Oops, the second sentence has bad grammar.  I meant "generation of
SIGPIPE should cause EINTR to be returned to dofilewrite(), so...".
What actually happens is that lower-level code returns EPIPE, without
generating EINTR, to dofilewrite() (except socket code generates
SIGPIPE directly unless this is disabled by a socket option).  Then
dofilewrite() first fails to see the EINTR because the return code
is EPIPE (this happens even for the socket case).  Then dofilewrite()
generates SIGPIPE, again without changing the error code to EINTR or
looping back to pick up the special EINTR handling.  This is actually
correct -- the error is specified to be EPIPE, not EINTR.  (This is
similar to what happens for the internally-generated SIGXFSZ -- there
is a not-so-special error code for that.)

So the above is not completely specified after all.  EINTR is not
intended to apply when the signal is internally generated (and you
could argue that the write is not interrupted by a signal in that
case; instead, the write fails and generates the signal, which is how
this is implemented).  Then since it wasn't interrupted, the clause
about returning the amount successfully written doesn't apply either.
And since we are failing with error code EPIPE, we aren't returning
the amount successfully written.  So it is apparently up to the
lower-level pipe code to not generate EPIPE if it has written
something sucessfully.  The wording for most of the specication of
EPIPE suggests that this error is for when a there is no reader at the
time the write() starts.

> As I said, patch behaviour in regard of SIGPIPE is just wrong.

It is too simple.

>>> Then fix the pipe code, and not introduce the behaviour change for all
>>> file types ?
>>
>> Because returning the error to userland breaks all file types that
>> want to return a short i/o (mainly special files whose i/o cannot be
>> backed out of).  They are just detecting and returning an error as a
>> courtesy to upper layers, and to simplify the implementation.  The
>> syscall API doesn't permit returning both the error code (the reason
>> for the short i/o) and the short count, so the error code must be
>> cleared to allow the short count to be returned.
> No, there is the only sane behaviour for the fo_read and fo_write, to
> return either no error (or interruption error) and adjust resid, or
> return error. Returning both error and adjusting resid is just wrong.

It is the FreeBSD API.

> Proposed patch makes generic i/o layer much less flexible, and probably
> preventing implementation of things like transactional writes.

Lower layers still have the option of clobbering either the error or the
count, so that uppor layers can't decide which one to use.  But this is
bad.  The syscall layer is not the only one.  Other layers may be able
to use both.

> We should fix sys_pipe.c and not require filesystems to roll back uio
> into inconsistent state to report errors (since rolling back into
> consistent state is typically impossible but is required after the patch).

Rolling back is often possible, but very file-type dependent.  ffs rolls
back to the original EOF if the write started at the original EOF.  This
probably applies to all regular files (except irregular regular files
in procfs), using a truncate call from vfs.  But this is probably still
simpler in individual file systems, since the original EOF position is
not really known to vfs.

It is precisely because backing out is impossible that write() must return
what has been written and not an error.  Similarly for read().

About reporting counts of buffered data as having been succesfully written:
this is little different from write() returning success for data that has
only been written to a buffer.  But when an error is detected for the
current write(), data that is only buffered and belongs to the current
write() should be uiounmove()d if possible.  Otherwise, treat this data
as done by a previous write and don't flush it.  For some file types, e.g.,
terminals, it is specified that data which was reported as written by
write() must be written later and not flushed unless the application asks
for this.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120802040542.G2978>