Message-ID: <5019A77C.4030503@gmail.com>
Date: Thu, 02 Aug 2012 06:02:36 +0800
From: David Xu
To: Warner Losh
Cc: Konstantin Belousov, arch@freebsd.org, davidxu@freebsd.org
Reply-To: davidxu@freebsd.org
References: <5018992C.8000207@freebsd.org> <20120801071934.GJ2676@deviant.kiev.zoral.com.ua> <5018E1FC.4080609@gmail.com>
Subject: Re: short read/write and error code
List-Id: Discussion related to FreeBSD architecture

On 2012/8/1 22:12, Warner Losh wrote:
> On Aug 1, 2012, at 1:59 AM, David Xu wrote:
>
>> On 2012/8/1 15:19, Konstantin Belousov wrote:
>>> On Wed, Aug 01, 2012 at 10:49:16AM +0800, David Xu wrote:
>>>> POSIX requires write() to return actually bytes written, same rule is
>>>> applied to read().
>>>>
>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
>>>>> RETURN VALUE
>>>>>
>>>>> Upon successful completion, write() [XSI] and pwrite() shall
>>>>> return the number of bytes actually written to the file associated
>>>>> with fildes. This number shall never be greater than nbyte.
>>>>> Otherwise, -1 shall be returned and errno set to indicate the error.
>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
>>>>> RETURN VALUE
>>>>>
>>>>> Upon successful completion, read() [XSI] and pread() shall return
>>>>> a non-negative integer indicating the number of bytes actually read.
>>>>> Otherwise, the functions shall return -1 and set errno to indicate
>>>>> the error.
>>> Note that the wording is only about successful return, not for the case
>>> when error occured. I do think that if fo_read() returned an error, and
>>> error is not of the kind 'interruption', then the error shall be returned
>>> as is.
>> I do think data is more important than error code. Do you think if a 512 bytes block is bad,
>> all bytes in the block should be thrown away while you could really get some bytes from it,
>> this might be very important to someone, such as a password or a bank account, this
>> is just an example, whether filesystem works in this way is irrelevant.
> You do know that with disk drives it is an all or nothing sort of thing at the sector level. Either you get the whole thing, or you get none of it. There's no partial sector reads, and there's no way to get the data generally. Some drives sometimes allow you to access raw tracks, but those interfaces are never connected to read, but usually an ioctl that issues the special command and returns the results. And even then, it returns everything (perhaps including the ECC bytes)

Sorry, my example was not precise; see below.
>> While program continues to execute, next read()/write() should return -1 and errno will be
>> set, I think both socket and pipe already work in this way, it is dofileread/dofilewrite have
>> made it not happen.
> Usually it is up to the driver to make this decision. Most drivers already return 0 when they've put any data into the buffer. The case where there's an error returned from the driver and also data indicated by resid would be vanishingly small.

Okay, if the driver works this way, I don't complain.

>>>> I have following patch to fix our code to be compatible with POSIX:
>>> ...
>>>
>>>> -current only resets error code to zero for short write when code is
>>>> ERESTART, EINTR or EWOULDBLOCK.
>>>> But this is incorrect, at least for pipe, when EPIPE is returned,
>>>> some bytes may have already been written. For a named pipe, I may don't
>>>> care a reader is disappeared or not, because for named pipe, a new
>>>> reader can come in and talk with writer again, so I need to know
>>>> how many bytes have been written, same is applied to reader, I don't
>>>> care writer is gone, it can come in again and talk with reader. So I
>>>> suggest to remove surplus code in -current's dofilewrite() and
>>>> dofileread().
>>> Then fix the pipe code, and not introduce the behaviour change for all
>>> file types ?
>> see above, I think data is more important than error code, and next read/write will
>> get the error.
>>
>>>> For EPIPE, We still deliver SIGPIPE to current thread, but returns
>>>> actually bytes written.
>>> And this sounds wrong. I think that fixing the code for pipes would also
>>> semi-magically makes this correct.
> Yes. Pipes are too magical and don't match devices very well.

Unfortunately, dofileread and dofilewrite are very high-level APIs. They
are file oriented, not device oriented, and they know nothing about
devices. Let me explain what is wrong in their code.
dofileread asks fo_read to read data into a user-space buffer, and that
buffer can be very large. fo_read is an intermediate layer. Assume it
supports large buffers: for example, it is the file system's read
interface, or some other intermediate code that accepts sizes up to the
maximum value of ssize_t. At the lowest level, all of these ask a device
driver to read the data. Assume the device driver only supports a 16K
buffer. If the user gives dofileread a 100M buffer, the intermediate
layer splits the request into 16K chunks and reads 16K at a time until,
at the final block, it hits a problem and the device driver returns EIO.
The fo_read operation has read 100M-16K into the buffer, and it returns
EIO too. Then what happens in dofileread? It simply returns EIO and
throws the 100M-16K of data away.

Now, because I know how the insane dofileread works, I split the 100M
read request into small chunks from user space and request 16K each
time. I happily get 100M-16K bytes until, at the final block, I hit the
problem. Only 16K bytes cannot be read. If the lowest layer is a byte
stream, I would read 100M-1 bytes back and lose only 1 byte. I have
rescued my data. Isn't the difference very large?

The same problem applies to dofilewrite. I pass 100M of data to
dofilewrite, it writes out 100M-16K bytes, and then it tells me that it
did not write anything. If it is a byte stream, it wrote 100M-1 bytes
and only 1 byte hit a problem. If my buffer size is 500M, isn't the
problem more serious?

I think there could be a sysctl to control how many bytes of completed
I/O count as important. For me, I would set it to 1; for somebody else,
the value could be DON'T CARE. dofileread or dofilewrite would return
the number of bytes transferred if the completed I/O is larger than the
value; otherwise it returns the error code.

> Warner
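To make the 100M/16K arithmetic above concrete, here is a small user-space model in Python. It is only a sketch under stated assumptions, not the kernel code: fo_read_sim, current_policy, proposed_policy and the min_bytes parameter are hypothetical names invented for this illustration, the threshold is read as "at least min_bytes" so that a value of 1 keeps any partial transfer, and ERESTART is left out because it is kernel-internal.

```python
import errno

MIB = 1024 * 1024
CHUNK = 16 * 1024  # the 16K driver limit assumed in the example above

# Errors after which a partial transfer is already reported as a short
# read/write. The kernel also treats ERESTART this way; it is omitted here
# because it is kernel-internal and not defined by errno on every platform.
INTERRUPTIONS = (errno.EINTR, errno.EWOULDBLOCK)


def fo_read_sim(total, bad_offset, chunk=CHUNK):
    """Hypothetical middle layer: transfer `total` bytes in `chunk`-sized
    driver requests, failing with EIO on the chunk that holds `bad_offset`.

    Returns (bytes_transferred, error), with error == 0 on success.
    """
    if bad_offset is None or bad_offset >= total:
        return total, 0
    # Every full chunk before the bad one reached the caller's buffer.
    return (bad_offset // chunk) * chunk, errno.EIO


def current_policy(done, error):
    """Caller's view as described for -current: the byte count survives
    only when the error is an interruption and some bytes were moved."""
    if error and done > 0 and error in INTERRUPTIONS:
        error = 0
    return (-1, error) if error else (done, 0)


def proposed_policy(done, error, min_bytes=1):
    """Caller's view with the suggested threshold (an assumed reading).

    min_bytes=None stands for DON'T CARE and falls back to current_policy.
    """
    if min_bytes is None:
        return current_policy(done, error)
    if error and done >= min_bytes:
        error = 0
    return (-1, error) if error else (done, 0)


if __name__ == "__main__":
    total = 100 * MIB
    done, err = fo_read_sim(total, bad_offset=total - 1)
    print("current :", current_policy(done, err))
    print("proposed:", proposed_policy(done, err))
```

With the bad byte in the last 16K chunk, the model hands 100M-16K bytes to the proposed policy and (-1, EIO) to the current one. Passing chunk=1 models the byte-stream case, where only the final byte is lost. The error itself is not dropped in this scheme: as argued earlier in the thread, the next read or write would report it.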