From owner-freebsd-questions Wed Oct 11 17:44:14 1995
Return-Path: owner-questions
Received: (from root@localhost) by freefall.freebsd.org (8.6.12/8.6.6)
	id RAA24749 for questions-outgoing; Wed, 11 Oct 1995 17:44:14 -0700
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
	by freefall.freebsd.org (8.6.12/8.6.6) with ESMTP id RAA24740
	for ; Wed, 11 Oct 1995 17:44:09 -0700
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id RAA14102; Wed, 11 Oct 1995 17:40:37 -0700
From: Terry Lambert
Message-Id: <199510120040.RAA14102@phaeton.artisoft.com>
Subject: Re: NFS performance
To: straka@indirect.com (Richard S. Straka)
Date: Wed, 11 Oct 1995 17:40:37 -0700 (MST)
Cc: questions@FreeBSD.org
In-Reply-To: from "Richard S. Straka" at Oct 11, 95 05:00:38 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 6852
Sender: owner-questions@FreeBSD.org
Precedence: bulk

> I have set up my 486DX-100 as an NFS server for a network that
> currently has PCNFS running on a Pentium-100 and a 386DX-25 (all
> machines are using Intel EtherExpress boards).  I am using a
> FreeBSD-stable kernel which I downloaded from wcarchive.cdrom.com on
> approx 1 Oct 95.  The server filesystem is mounted on the Pentium
> with rsize=8192 and wsize=8192.

Try dropping the rsize/wsize to 1024.  This may seem counter-intuitive,
but if the problem is timeouts expiring and causing retries, it should
actually speed up the client writes.

You don't say what kind of ethercard you have; an NE2000 or clone will
have significantly reduced performance in one direction or the other
because it cannot double-buffer, a consequence of the memory size
limits on that type of card.  They are bad cards for servers.

> While reading from the NFS server, the Pentium can achieve
> 600-700KB/sec, nearly the speed of the ether.

This is expected behaviour.

> While writing to the server, however, the speed of the transfers
> seems to be limited to about 100KB/sec, with a lot of disk thrashing
> occurring on the server.  While running SYSTAT on the server, I
> noticed that the processor idle time is still greater than 50%, but
> the disk transfer rate is around 400KB/sec with 40-50 seeks/sec.
> Why is the disk transfer rate 4 times the file transfer rate between
> the client and the server, and why so many seeks?  I have used
> NETSTAT to verify that I am not dropping any UDP packets.

Because the file system metadata is being synchronously updated to
make the file system more robust in the face of a system crash.  The
seeks are because of the synchronous transfers.
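To make the cost concrete, here is a minimal userland sketch.  It is
not the kernel's NFS write path -- the file names and counts are
invented for illustration -- but it contrasts buffered writes with
writes that must reach the platter before the caller proceeds, which
is the position the server is in for every synchronous metadata
update:

/*
 * Minimal userland sketch -- NOT the kernel NFS write path.  It shows
 * why waiting on the disk for every update (as the server does for
 * metadata) is so much slower than letting the buffer cache absorb
 * the writes.  Compile with "cc demo.c" and time the two runs.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NWRITES 128
#define BUFSIZE 1024

static void
do_writes(const char *path, int dosync)
{
        char buf[BUFSIZE];
        int fd, i;

        memset(buf, 'x', sizeof(buf));
        if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
                perror("open");
                exit(1);
        }
        for (i = 0; i < NWRITES; i++) {
                if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                        perror("write");
                        exit(1);
                }
                /*
                 * The synchronous case: block until the data is on
                 * the platter, one disk wait per update.
                 */
                if (dosync && fsync(fd) < 0) {
                        perror("fsync");
                        exit(1);
                }
        }
        close(fd);
}

int
main(void)
{
        do_writes("async.dat", 0);      /* returns at cache speed */
        do_writes("sync.dat", 1);       /* gated by the disk each time */
        return (0);
}

Timing the two runs should show the buffered case running at cache
speed and the fsync() case gated by one disk wait per update; your
100KB/sec write ceiling is the same effect seen through NFS.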
The actual problem the file system is attempting to fix by doing
synchronous I/O is ensuring write ordering.  This could be achieved
just as well by prioritizing multiple async queues, or by otherwise
ordering the writes (in fact, Novell/USL had a patent pending on
"delayed ordered writes" for this exact problem in UFS in UnixWare 2.0
last year when I left them).  The multiple queue solution has the
benefit of not being susceptible to the patent (if it was even
granted -- you can show serious prior art by a number of people in
the disk drive industry if need be).

> In sys/nfs/nfs_serv.c in the kernel code there is a compiler
> directive NFS_ASYNC.  When I compile the kernel with this directive
> set, the file transfer speed while writing to the server increases
> to about 400KB/sec (still not the 600-700KB/sec realized while
> reading from the server) and the apparent disk thrashing is gone
> (400KB/sec disk transfers with 10-20 seeks/sec).

The NFS_ASYNC option almost inevitably assumes the use of a UPS on
your server.  It gains its speed by causing what would normally be
synchronous metadata updates to be asynchronous, so the write is
acknowledged by the server before it has actually occurred.  This is
very dangerous in the sense of increased fragility in the event of
system crashes (luckily, FreeBSD is quite stable in almost every case
except actual power and hardware failures).

In reality, there are two types of updates occurring synchronously,
and both are forced async by the change.  In practice, this is
probably a lot more fragile an implementation than it needs to be.
Specifically, if directory entry data were still updated
synchronously, but file system metadata (in particular, time stamps)
were updated async, and file system data were then written async, the
damage to file system structure would be drastically reduced.  With
async data writes you are still open to corrupt file contents, but at
least the file system structure would be secure.  This would require
separating the treatment of metadata into two classes, with time
stamps being the inferior class (the POSIX guarantees only apply to
*marking* for update, not to actually updating, with regard to things
like time stamps).

The speed you achieved reading from the server is probably the result
of getting data cache hits.  You are unlikely to ever be able to get
this data rate on writes, period.  Writes to frags, or to pages that
are not in core, have to fault in a page and partially update it
before writing it back out; in effect, all write operations, unless a
page in size and page aligned, will take the same time as a read plus
the actual write.
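To see the mechanics, here is a small sketch of that read-modify-write
cycle done by hand in userland.  The 4096-byte page size, the file
name, and the unaligned_write() helper are all invented for
illustration, and the write is assumed to fit inside a single page:

/*
 * Illustrative sketch of the read-modify-write penalty on unaligned
 * writes.  This mimics in userland what the file system must do when
 * a write does not cover a whole page: fault the page in (a read),
 * patch the new bytes into it, and write the whole page back out.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGESIZE 4096   /* assumed page size, for illustration */

/*
 * Write 'len' bytes at 'off' the slow way, by reading the enclosing
 * page first.  Assumes the write fits within one page.
 */
static int
unaligned_write(int fd, const char *data, off_t off, size_t len)
{
        char page[PAGESIZE];
        off_t pgbase = off - (off % PAGESIZE);
        ssize_t n;

        /* The "fault in": read the whole page the write touches. */
        if (lseek(fd, pgbase, SEEK_SET) < 0)
                return (-1);
        if ((n = read(fd, page, PAGESIZE)) < 0)
                return (-1);
        if (n < PAGESIZE)               /* short file: zero-fill */
                memset(page + n, 0, PAGESIZE - n);

        /* Patch the new bytes into the in-core copy. */
        memcpy(page + (off - pgbase), data, len);

        /* Write the whole page back: a read plus a write, where an
         * aligned full-page write would have been one write alone. */
        if (lseek(fd, pgbase, SEEK_SET) < 0)
                return (-1);
        if (write(fd, page, PAGESIZE) != PAGESIZE)
                return (-1);
        return (0);
}

int
main(void)
{
        const char *msg = "some partial-page data";
        int fd = open("rmw.dat", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return (1);
        }
        /* A small write landing mid-page forces the full cycle. */
        if (unaligned_write(fd, msg, 4096 + 100, strlen(msg)) < 0)
                perror("unaligned_write");
        close(fd);
        return (0);
}

An aligned, page-sized write could skip the read entirely; everything
else pays for a read plus the write, which is why client writes will
not approach the read rate.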
Some of this could be (but has not been) alleviated by the NFS
implementation itself doing page alignment caching of data writes.
There has been relatively little effort put into work on server
caching.

Ideally, one could "know" that the disk is capable of track write
caching and force the write async after it goes to the controller,
instead of waiting for completion.  This is a general speedup in any
case.  The same could be done for controllers with NVRAM for write
caching.  Typically this would not affect you, since you probably do
not have the hardware for it.

The "seek overhead", unless it is being retrieved from the disk
itself, is a fictional number and does not reflect performance.  This
is because it is unlikely that you are running on a drive without
translated geometry, Zone Boundary Recording, etc.  I would ignore
this number completely, since the actual number of seeks is probably
unrelated.

> This change, however, has made the server file system very
> vulnerable to system crashes.  Two power outages (even after the
> system had been idle for several minutes - no nfs transfers) have
> resulted in broken filesystems (more than 100 bad INODES).

This is expected behavior.  Part of the fault lies in the dirty page
flush policies, which under BSD are not as aggressive as they should
be in terms of a window guarantee until the data is on disk.  Part is
also the fault of the page marking mechanism invalidating a
potentially correct copy on disk (except for time stamps, etc.), as
if the data contained on disk were completely invalid.

This policy can be (and probably will be) changed.  At the very
least, this also wants the semantic and integrity update guarantees
to be separate as well.  At the most, file system idle time should be
watched, and file system updates (even to the point of marking the FS
clean!) should be done if the file system has been idle for, say, 30
seconds.

> In both instances, the file system was unrecoverable (FSCK core
> dumped) and a new filesystem had to be created on the drive.  Are
> there any suggestions for improving the speed of the clients writing
> to the server while still maintaining a reasonable tolerance to
> system crashes?

I would certainly consider a UPS.  Or you can become a file system or
disk driver hacker.  8-).  Either one would allow safer use of the
flag NFS_ASYNC.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.