From owner-freebsd-hackers Sat Nov 18 12:48:46 1995
Return-Path: owner-hackers
Received: (from root@localhost) by (8.6.12/8.6.6) id MAA17077 for hackers-outgoing; Sat, 18 Nov 1995 12:48:46 -0800
Received: from (phaeton.Artisoft.COM []) by (8.6.12/8.6.6) with ESMTP id MAA17070 for ; Sat, 18 Nov 1995 12:48:38 -0800
Received: (from terry@localhost) by (8.6.11/8.6.9) id NAA09519; Sat, 18 Nov 1995 13:45:11 -0700
From: Terry Lambert
Message-Id: <>
Subject: Re: NFS client caching in UNIX
To: (Serge A. Babkin)
Date: Sat, 18 Nov 1995 13:45:11 -0700 (MST)
Cc:,
In-Reply-To: <> from "Serge A. Babkin" at Nov 18, 95 02:33:21 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 9527
Sender:
Precedence: bulk

Sorry about the length of this reply. I hope this will put at least
some of the issues involved to rest, so the length may be worth it.

> You are describing the read cache here. I'm speaking about the write cache.
>
> Consider the logic of read and write. The reading needs to spend a round
> trip time per each request if we can't predict the request sequence.
> The writing does not need a round trip time because after the request is
> transferred to the network (in the case of an absolutely reliable network) we
> can forget about it and let the program work and generate the next write
> request. Or in the case of an unreliable network we can use a windowed
> protocol for writes so that while one write request travels through the
> network, is executed, and its reply travels back through the network, the
> next write request(s) can be produced. So, obviously, the writes must be
> more effective than reads. But what do we see with NFS ? Reads are
> about 5 times more effective than writes. Why?
> Because the network is
> unreliable and we can get an error (in the case of soft mount) that
> should be reported immediately to the application and because the
> application can use some order of writes (possibly in different files)
> to implicitly synchronize their "transactions".
>
> But if the application uses explicit synchronization and is not very
> sensitive, in the case of failure, to precisely which write() returns
> the failure (presence of at least one failure during "transaction" means
> that the whole "transaction" fails) we can "delay" reporting the
> failure until any other write request before the end of "transaction"
> or the end of "transaction" itself. The "transaction" can be committed
> by the calls close(), fsync(), unlock() and possibly lock().
> So we can have windowed writes between the "transaction" delimiters.
>
> Yes, not all applications will work well under these assumptions, but most
> will do. So, we can add such write cache as an option. In most cases
> we will get significant write performance increase, in all other cases
> we can simply disable this option for the mounts that will need synchronous
> writes.

From "UNIX Internals: The New Frontiers", Vahalia, 10.7, "NFS Performance"
(italics marked as _italics_):

============================================================================

10.7.2 Client-Side Caching

If every operation on a remote file required network access, NFS
performance would be intolerably slow. Hence most NFS clients resort
to caching both file blocks and file attributes. They cache file
blocks in the buffer cache and attributes in the rnodes. This caching
is dangerous, since the client has no way of knowing if the contents
of the cache are still valid, short of querying the server each time
they must be used.

Clients take certain precautions to reduce the dangers of using stale
data. The kernel maintains an expiry time in the rnode, which monitors
how long the attributes have been cached.
Typically, the client caches the attributes for 60 seconds or less
after fetching them from the server. If they are accessed after the
quantum expires, the client fetches them from the server again.
Likewise, for file data blocks, the client checks the cache
consistency by verifying that the file's _modify time_ has not changed
since the cached data was read from the server. The client may use
the cached value of the timestamp or issue a GETATTR if it has expired.

Client side caching is essential for acceptable performance. The
precautions described here reduce, but do not eliminate, the
consistency problems. In fact, they introduce some new race
conditions, as described in [Mack91] and [Jusz89].

10.7.3 Deferral of Writes

The NFS requirement of synchronous writes applies only to the server.
The client is free to defer writes, since if data is lost due to a
client crash, the users know about it. The client policy, therefore,
is to use asynchronous writes for full blocks (issue the WRITE request
but do not wait for the reply) and delayed writes for partial blocks
(issue the WRITE sometime later). Most UNIX implementations flush
delayed writes to the server when the file is closed and also every
30 seconds. The _biod_ daemons on the client handle these writes.

< ... discussion of NVRAM based disk commit avoidance ... >

[Jusz94] shows a technique called write-gathering that reduces the
synchronous write bottleneck without using special hardware. It
relies on the fact that typical NFS clients use a number of _biod_
daemons to handle write operations. When a client process opens a
file and writes to it, the kernel simply caches the changes and marks
them for delayed write. When the client closes the file, the kernel
flushes its blocks to the server. If there are sufficient _biod_s
available on the client, they can issue all writes in parallel. As a
result, servers often receive a number of write requests for the same
file bunched together.
Using write-gathering, the server does not process the WRITE requests
immediately. Rather, it delays them for a little while, in the hope
that it receives other WRITEs for the same file in the meantime. It
then gathers all the requests for the same file and processes them
together. After completing them all, the server replies to each of
them. This technique is effective when clients use _biod_s and is
optimal when they use a large number of _biod_s. Although it appears
to increase the latency of individual requests, it improves
performance tremendously by reducing the total number of disk
operations on the server.

<...>

=======================================================================

Now, we can safely use async writes in the NFSv3 protocol. But DOS
clients will not generate them because the DOS system call interface
is not async. VWIN32 allows for async completion on reads, writes,
and, though it is not explicitly documented in MS's online
documentation, device ioctl's. But this does nothing to help the
typical DOS client.

So what are the overall conclusions (apart from the obvious one of
"DOS sucks rocks") in light of a desire to get better WRITE numbers?

1) DOS behaviour is pathological for some types of optimizations
   which are otherwise typically "good bets".

2) DOS architecture can't handle most caching optimizations without
   a serious rework of the system architecture or explicit use of
   non-standard API's by application programs.

3) DOS machines don't use the multiple BIOD's for the highest
   payback caching optimization.

4) Even if they did use the required multiple BIOD's and you hacked
   the DOS by using direct redirection of interrupts instead of
   INT 2C (NetBIOS), the BSD NFS would require a retrofit to fully
   support write-gathering.

5) The NFSv3 async write would be a win for UNIX clients that can
   take advantage of it, but DOS clients couldn't use it.
   Instead, you'd have to go to a VWIN32 API to get async I/O,
   which means different applications and Windows95 or WindowsNT
   instead of Windows3.x or DOS.

6) Uresh Vahalia is a better writer than Terry. 8-).

7) Everyone should buy Uresh's book. 8-).

It's quite possible to get huge performance gains out of NFS if you
control the client and the server software sufficiently closely.
Even if all you control is the client, as long as the implementation
is typical of a UNIX kernel environment, you can expect to be able
to get fairly large wins.

But unless you plan to rewrite a DOS client from scratch, and trap
many calls that are not typically associated with simple network
redirectors and basically write much of the code from scratch (there
is no PD DOS NFS client source code), DOS would be a serious bear.

Don't let me discourage you from attempting it, but realize the road
you are on before you hit it.

Meanwhile, write gathering looks like a reasonable goal for the BSD
NFS implementation, though it looks to require both client and server
work to get the write numbers up.

I guess the next question is what are the v2 vs. v3 numbers on the
same hardware, and would it be a sufficient win for the amount of
time involved.

					Regards,
					Terry Lambert
---
Any opinions in this posting are my own and not those of my present
or previous employers.
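
For readers who want to see the server-side write-gathering idea from
the Vahalia excerpt in concrete form, here is a toy Python model. It
is only a sketch under stated assumptions, not the BSD NFS code or the
[Jusz94] implementation: the function names, the (handle, offset,
length) request tuples and the cost model (one disk operation per
contiguous byte run within a file) are all hypothetical, chosen for
illustration.

```python
from collections import defaultdict


def ops_without_gathering(requests):
    """Baseline: the server commits each WRITE on its own, one disk op each."""
    return len(requests)


def ops_with_gathering(requests):
    """Toy model of one gathering window on the server.

    requests: list of (file_handle, offset, length) tuples that arrived
    while the server was holding WRITEs back.
    Returns (disk_ops, replies). Assumption of this sketch only: each
    contiguous run of bytes within one file costs a single disk operation.
    """
    by_file = defaultdict(list)
    for handle, offset, length in requests:
        by_file[handle].append((offset, length))

    disk_ops = 0
    for extents in by_file.values():
        run_end = None
        for offset, length in sorted(extents):
            if run_end is None or offset > run_end:
                disk_ops += 1            # a new contiguous run begins
                run_end = offset + length
            else:
                run_end = max(run_end, offset + length)

    # The server still replies to every request, after the batch is done.
    replies = [(handle, offset, "OK") for handle, offset, _ in requests]
    return disk_ops, replies


if __name__ == "__main__":
    block = 8192  # illustrative block size, not taken from the post
    window = [("fileA", i * block, block) for i in range(4)]
    window.append(("fileB", 0, block))
    ops, replies = ops_with_gathering(window)
    print(ops_without_gathering(window), ops, len(replies))
```

In this model, four back-to-back full-block WRITEs to one file plus a
single WRITE to a second file cost five disk operations when handled
one at a time and two when gathered, while the client still gets five
replies. Real savings depend on the server's filesystem and on how
many _biod_s the client runs, neither of which this sketch models.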