Date: Sat, 18 Nov 1995 13:45:11 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: babkin@hq.icb.chel.su (Serge A. Babkin)
Cc: terry@lambert.org, hackers@freebsd.org
Subject: Re: NFS client caching in UNIX
Message-ID: <199511182045.NAA09519@phaeton.artisoft.com>
In-Reply-To: <199511180933.OAA01222@hq.icb.chel.su> from "Serge A. Babkin" at Nov 18, 95 02:33:21 pm
Sorry about the length of this reply.  I hope this will put at least
some of the issues involved to rest, so the length may be worth it.

> You are describing the read cache here. I'm speaking about the write
> cache.
>
> Consider the logic of read and write. Reading needs to spend a round
> trip time per request if we can't predict the request sequence.
> Writing does not need a round trip time, because after the request
> is transferred to the network (in the case of an absolutely reliable
> network) we can forget about it and let the program work and
> generate the next write request. Or, in the case of an unreliable
> network, we can use a windowed protocol for writes, so that while
> one write request travels through the network, is executed, and its
> reply travels back, the next write request(s) can be produced. So,
> obviously, writes should be more efficient than reads. But what do
> we see with NFS? Reads are about 5 times more efficient than writes.
> Why? Because the network is unreliable and we can get an error (in
> the case of a soft mount) that should be reported immediately to the
> application, and because the application can use some ordering of
> writes (possibly in different files) to implicitly synchronize its
> "transactions".
>
> But if the application uses explicit synchronization, and in the
> case of failure is not very sensitive to precisely which write()
> returns the failure (the presence of at least one failure during a
> "transaction" means that the whole "transaction" fails), we can
> "delay" reporting the failure until any later write request before
> the end of the "transaction", or until the end of the "transaction"
> itself. The "transaction" can be committed by the calls close(),
> fsync(), unlock() and possibly lock(). So we can have windowed
> writes between the "transaction" delimiters.
>
> Yes, not all applications will work well under these assumptions,
> but most will. So we can add such a write cache as an option. In
> most cases we will get a significant write performance increase; in
> all other cases we can simply disable this option for the mounts
> that need synchronous writes.

From "UNIX Internals: The New Frontiers", Vahalia, 10.7, "NFS
Performance".  <my comments>, _italics_:

============================================================================
10.7.2 Client-Side Caching

If every operation on a remote file required network access, NFS
performance would be intolerably slow.  Hence most NFS clients resort
to caching both file blocks and file attributes.  They cache file
blocks in the buffer cache and attributes in the rnodes.

This caching is dangerous, since the client has no way of knowing if
the contents of the cache are still valid, short of querying the
server each time they must be used.  Clients take certain precautions
to reduce the dangers of using stale data.  The kernel maintains an
expiry time in the rnode, which monitors how long the attributes have
been cached.  Typically, the client caches the attributes for 60
seconds or less after fetching them from the server.  If they are
accessed after the quantum expires, the client fetches them from the
server again.  Likewise, for file data blocks, the client checks the
cache consistency by verifying that the file's _modify time_ has not
changed since the cached data was read from the server.  The client
may use the cached value of the timestamp or issue a GETATTR if it
has expired.

Client-side caching is essential for acceptable performance.
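<To make the check described above concrete, here is a minimal sketch
of the validation logic.  The names here (the rnode fields,
nfs_getattr_otw(), ATTR_TIMEO) are mine, invented for illustration,
and not from any particular implementation:

	/*
	 * Sketch of client-side attribute/data cache validation.
	 * All names are illustrative only.
	 */
	#include <time.h>

	#define ATTR_TIMEO	60	/* cache attrs for <= 60 seconds */

	struct rnode {
		time_t	r_attrtime;	/* when attributes were fetched */
		time_t	r_mtime;	/* server's modify time, cached */
		/* ... cached attributes, file handle, etc. ... */
	};

	/* Over-the-wire GETATTR; assumed to exist elsewhere. */
	extern int nfs_getattr_otw(struct rnode *rp, time_t *mtimep);

	/*
	 * Return nonzero if the cached data blocks for this rnode may
	 * still be used, zero if they must be invalidated and re-read.
	 */
	int
	nfs_cache_valid(struct rnode *rp)
	{
		time_t mtime;

		if (time(NULL) - rp->r_attrtime < ATTR_TIMEO)
			return (1);	/* trust the cached attributes */

		/* Attributes expired: ask the server, compare times. */
		if (nfs_getattr_otw(rp, &mtime) != 0)
			return (0);	/* can't tell; play it safe */

		rp->r_attrtime = time(NULL);
		if (mtime != rp->r_mtime) {
			rp->r_mtime = mtime;
			return (0);	/* file changed under us */
		}
		return (1);
	}

The point is that a full GETATTR round trip is only forced when the
60 second quantum has run out; most reads are satisfied without
touching the wire at all.>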
The precautions described here reduce, but do not eliminate, the
consistency problems.  In fact, they introduce some new race
conditions, as described in [Mack91] and [Jusz89].

<It is my opinion that the race conditions introduced are atypical on
most UNIX clients, but in fact become pathological on most DOS
clients because of the nature of DOS and Windows task models.  DOS
handles "tasking" in interrupt service routines, whereas Windows task
switching is in terms of voluntary context switching.  Thus what is
"essential for acceptable performance" differs for UNIX and DOS
clients.>

10.7.3 Deferral of Writes

The NFS requirement of synchronous writes applies only to the server.
The client is free to defer writes, since if data is lost due to a
client crash, the users know about it.  The client policy, therefore,
is to use asynchronous writes for full blocks (issue the WRITE
request but do not wait for the reply) and delayed writes for partial
blocks (issue the WRITE sometime later).  Most UNIX implementations
flush delayed writes to the server when the file is closed and also
every 30 seconds.  The _biod_ daemons on the client handle these
writes.

< ... discussion of NVRAM based disk commit avoidance ... >

[Jusz94] shows a technique called write-gathering that reduces the
synchronous write bottleneck without using special hardware.  It
relies on the fact that typical NFS clients use a number of _biod_
daemons to handle write operations.  When a client process opens a
file and writes to it, the kernel <on the client> simply caches the
changes and marks them for delayed write.  When the client closes the
file, the kernel flushes its blocks to the server.  If there are
sufficient _biod_s available on the client, they can issue all writes
in parallel.  As a result, servers often receive a number of write
requests for the same file bunched together.

<Typically, DOS machines are so memory restricted that there is a
single "biod" consisting of a transmission queue serviced by a packet
receive ISR that queues the next transmission.  Anything else would
result in "too much" memory being used by the NFS client.  It is
possible (but difficult) to overcome the DOS memory issues and
resolve the problem.  A Windows95 or Windows NT IFS module could
probably do it easily, using the undocumented VWIN32 call
"CreateRing0Thread" to build *real* biod's for the NFS client code.>

Using write-gathering, the server does not process the WRITE requests
immediately.  Rather, it delays them for a little while, in the hope
that it receives other WRITEs for the same file in the meantime.  It
then gathers all the requests for the same file and processes them
together.  After completing them all, the server replies to each of
them.  This technique is effective when clients use _biod_s and is
optimal when they use a large number of _biod_s.  Although it appears
to increase the latency of individual requests, it improves
performance tremendously by reducing the total number of disk
operations on the server.

<...>

<It should be noted at this point that: (1) typical DOS clients do
not use a BIOD implementation, let alone a large number of BIODs, and
(2) the BSD NFS server code does not implement write-gathering (a
sketch of what that would involve follows the excerpt).  So the
benefit of client caching using delayed writes is negligible.>
=======================================================================

Now, we can safely use async writes in the NFSv3 protocol.  But DOS
clients will not generate them, because the DOS system call interface
is not async.
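For concreteness, here is roughly what write-gathering would look
like on the server side.  This is a sketch only: all of the names and
the delay value are invented, and, as noted above, the BSD server
currently implements none of it.

	/*
	 * Sketch of server-side write-gathering per [Jusz94].
	 * All names, types, and timings are illustrative only.
	 */
	#include <stddef.h>

	struct gathered_write {
		struct gathered_write *gw_next;
		/* ... file handle, offset, length, data, reply info ... */
	};

	struct gather_queue {		/* one per actively written file */
		struct gathered_write *gq_head;
		int	gq_timer_armed;
	};

	/* Assumed primitives: a one-shot timer, one ordered disk  */
	/* commit for a whole batch, and the RPC reply path.       */
	extern void arm_timer(int msec,
			      void (*fn)(struct gather_queue *),
			      struct gather_queue *gq);
	extern void commit_writes(struct gathered_write *list);
	extern void send_reply(struct gathered_write *gw);

	void gather_flush(struct gather_queue *gq);

	/*
	 * Entry point for each incoming WRITE: queue it and wait a
	 * little, instead of doing one synchronous disk write per
	 * request.
	 */
	void
	nfssrv_write(struct gather_queue *gq, struct gathered_write *gw)
	{
		gw->gw_next = gq->gq_head;
		gq->gq_head = gw;
		if (!gq->gq_timer_armed) {
			gq->gq_timer_armed = 1;
			arm_timer(5, gather_flush, gq); /* ~5ms window */
		}
	}

	/*
	 * Timer expiry: write the whole batch with one disk commit,
	 * and only then reply to each request, preserving the
	 * "stable before reply" guarantee the protocol requires.
	 */
	void
	gather_flush(struct gather_queue *gq)
	{
		struct gathered_write *gw;

		commit_writes(gq->gq_head);
		for (gw = gq->gq_head; gw != NULL; gw = gw->gw_next)
			send_reply(gw);
		gq->gq_head = NULL;
		gq->gq_timer_armed = 0;
	}

The win is that N queued WRITEs cost one ordered disk commit instead
of N synchronous disk writes, while each client still gets its reply
only after its data is actually stable.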
Back on the DOS side: VWIN32 allows for async completion on reads,
writes, and (though it is not explicitly documented in MS's online
documentation) device ioctls.  But this does nothing to help the
typical DOS client.

So what are the overall conclusions (apart from the obvious one of
"DOS sucks rocks") in light of a desire to get better WRITE numbers?

1)	DOS behaviour is pathological for some types of optimizations
	which are otherwise typically "good bets".

2)	DOS architecture can't handle most caching optimizations
	without a serious rework of the system architecture or
	explicit use of non-standard API's by application programs.

3)	DOS machines don't use the multiple BIOD's needed for the
	highest-payback caching optimization.

4)	Even if they did use the required multiple BIOD's, and you
	hacked the DOS by using direct redirection of interrupts
	instead of INT 2C (NetBIOS), the BSD NFS would require a
	retrofit to fully support write-gathering.

5)	The NFSv3 async write would be a win for UNIX clients that
	can take advantage of it, but DOS clients couldn't use it.
	Instead, you'd have to go to a VWIN32 API to get async I/O,
	which means different applications and Windows95 or
	Windows NT instead of Windows 3.x or DOS.

6)	Uresh Vahalia is a better writer than Terry.  8-).

7)	Everyone should buy Uresh's book.  8-).

It's quite possible to get huge performance gains out of NFS if you
control the client and the server software sufficiently closely.
Even if all you control is the client, as long as the implementation
is typical of a UNIX kernel environment, you can expect to be able to
get fairly large wins.

But unless you plan to write a DOS client from scratch, trapping many
calls that are not typically associated with simple network
redirectors (there is no PD DOS NFS client source code), DOS would be
a serious bear.  Don't let me discourage you from attempting it, but
realize the road you are on before you hit it.

Meanwhile, write-gathering looks like a reasonable goal for the BSD
NFS implementation, though it will require both client and server
work to get the write numbers up.  I guess the next question is what
the v2 vs. v3 numbers are on the same hardware, and whether it would
be a sufficient win for the amount of time involved.

					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.