From owner-freebsd-questions Wed Oct 11 17:44:14 1995
Return-Path: owner-questions
Received: (from root@localhost) by freefall.freebsd.org (8.6.12/8.6.6)
	id RAA24749 for questions-outgoing; Wed, 11 Oct 1995 17:44:14 -0700
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
	by freefall.freebsd.org (8.6.12/8.6.6) with ESMTP id RAA24740
	for ; Wed, 11 Oct 1995 17:44:09 -0700
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id RAA14102; Wed, 11 Oct 1995 17:40:37 -0700
From: Terry Lambert
Message-Id: <199510120040.RAA14102@phaeton.artisoft.com>
Subject: Re: NFS performance
To: straka@indirect.com (Richard S. Straka)
Date: Wed, 11 Oct 1995 17:40:37 -0700 (MST)
Cc: questions@FreeBSD.org
In-Reply-To: from "Richard S. Straka" at Oct 11, 95 05:00:38 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 6852
Sender: owner-questions@FreeBSD.org
Precedence: bulk

> I have set up my 486DX-100 as an NFS server for a network that
> currently has PCNFS running on a Pentium-100 and a 386DX-25 (all
> machines are using Intel EtherExpress boards).  I am using a
> FreeBSD-stable kernel which I downloaded from wcarchive.cdrom.com on
> approx 1 Oct 95.  The server filesystem is mounted on the Pentium
> with rsize=8192 and wsize=8192.

Try dropping the rsize/wsize to 1024.  This may seem counter-intuitive,
but if the problem is timeouts expiring and causing retries, it should
actually speed up the client writes.

You don't say what kind of ethercard you have; an NE2000 or clone will
have significantly reduced performance in one direction or the other
because it cannot double-buffer, a consequence of the memory size
limits on that type of card.  They are bad cards for servers.

> While reading from the NFS server, the Pentium can achieve
> 600-700KB/sec, nearly the speed of the ether.

This is expected behaviour.

> While writing to the server, however, the speed of the transfers
> seems to be limited to about 100KB/sec, with a lot of disk thrashing
> occurring on the server.  While running SYSTAT on the server, I
> noticed that the processor idle time is still greater than 50%, but
> the disk transfer rate is around 400KB/sec with 40-50 seeks/sec.
> Why is the disk transfer rate 4 times the file transfer rate between
> the client and the server, and why so many seeks?  I have used
> NETSTAT to verify that I am not dropping any UDP packets.

Because the file system metadata is being synchronously updated to
make the file system more robust in the face of a system crash.  The
seeks are because of the synchronous transfers.
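To make the cost concrete, here is a minimal userland sketch.  It is
not the kernel's NFS write path -- the file names and counts are
invented for illustration -- but it contrasts buffered writes with
writes that must reach the platter before the caller proceeds, which
is the position the server is in for every synchronous metadata
update:

/*
 * Minimal userland sketch -- NOT the kernel NFS write path.  It shows
 * why waiting on the disk for every update (as the server does for
 * metadata) is so much slower than letting the buffer cache absorb
 * the writes.  Compile with "cc demo.c" and time the two runs.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NWRITES 128
#define BUFSIZE 1024

static void
do_writes(const char *path, int dosync)
{
        char buf[BUFSIZE];
        int fd, i;

        memset(buf, 'x', sizeof(buf));
        if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
                perror("open");
                exit(1);
        }
        for (i = 0; i < NWRITES; i++) {
                if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                        perror("write");
                        exit(1);
                }
                /*
                 * The synchronous case: block until the data is on
                 * the platter, one disk wait per update.
                 */
                if (dosync && fsync(fd) < 0) {
                        perror("fsync");
                        exit(1);
                }
        }
        close(fd);
}

int
main(void)
{
        do_writes("async.dat", 0);      /* returns at cache speed */
        do_writes("sync.dat", 1);       /* gated by the disk each time */
        return (0);
}

Timing the two runs should show the buffered case running at cache
speed and the fsync() case gated by one disk wait per update; your
100KB/sec write ceiling is the same effect seen through NFS.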
The actual problem the file system is attempting to fix by doing
synchronous I/O is ensuring write ordering.  This could be achieved
just as well by prioritizing multiple async queues, or by otherwise
ordering the writes (in fact, Novell/USL had a patent pending on
"delayed ordered writes" for this exact problem in UFS in UnixWare 2.0
last year when I left them).  The multiple queue solution has the
benefit of not being susceptible to the patent (if it was even
granted -- you can show serious prior art by a number of people in
the disk drive industry if need be).

> In sys/nfs/nfs_serv.c in the kernel code there is a compiler
> directive NFS_ASYNC.  When I compile the kernel with this directive
> set, the file transfer speed while writing to the server increases
> to about 400KB/sec (still not the 600-700KB/sec realized while
> reading from the server) and the apparent disk thrashing is gone
> (400KB/sec disk transfers with 10-20 seeks/sec).

The NFS_ASYNC option almost inevitably assumes the use of a UPS on
your server.  It gains its speed by causing what would normally be
synchronous metadata updates to be asynchronous, so the write is
acknowledged by the server before it has actually occurred.  This is
very dangerous in the sense of increased fragility in the event of
system crashes (luckily, FreeBSD is quite stable in almost every case
except actual power and hardware failures).

In reality, there are two types of updates occurring synchronously,
and both are forced async by the change.  In practice, this is
probably a lot more fragile an implementation than it needs to be.
Specifically, if directory entry data were still updated
synchronously, but file system metadata (in particular, time stamps)
were updated async, and file system data were then written async, the
damage to file system structure would be drastically reduced.  With
async data writes you are still open to corrupt file contents, but at
least the file system structure would be secure.  This would require
separating the treatment of metadata into two classes, with time
stamps being the inferior class (the POSIX guarantees only apply to
*marking* for update, not to actually updating, with regard to things
like time stamps).

The speed you achieved reading from the server is probably the result
of getting data cache hits.  You are unlikely to ever be able to get
this data rate on writes, period.  Writes to frags, or to pages that
are not in core, have to fault in a page and partially update it
before writing it back out; in effect, all write operations, unless a
page in size and page aligned, will take the same time as a read plus
the actual write.
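To see the mechanics, here is a small sketch of that read-modify-write
cycle done by hand in userland.  The 4096-byte page size, the file
name, and the unaligned_write() helper are all invented for
illustration, and the write is assumed to fit inside a single page:

/*
 * Illustrative sketch of the read-modify-write penalty on unaligned
 * writes.  This mimics in userland what the file system must do when
 * a write does not cover a whole page: fault the page in (a read),
 * patch the new bytes into it, and write the whole page back out.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGESIZE 4096   /* assumed page size, for illustration */

/*
 * Write 'len' bytes at 'off' the slow way, by reading the enclosing
 * page first.  Assumes the write fits within one page.
 */
static int
unaligned_write(int fd, const char *data, off_t off, size_t len)
{
        char page[PAGESIZE];
        off_t pgbase = off - (off % PAGESIZE);
        ssize_t n;

        /* The "fault in": read the whole page the write touches. */
        if (lseek(fd, pgbase, SEEK_SET) < 0)
                return (-1);
        if ((n = read(fd, page, PAGESIZE)) < 0)
                return (-1);
        if (n < PAGESIZE)               /* short file: zero-fill */
                memset(page + n, 0, PAGESIZE - n);

        /* Patch the new bytes into the in-core copy. */
        memcpy(page + (off - pgbase), data, len);

        /* Write the whole page back: a read plus a write, where an
         * aligned full-page write would have been one write alone. */
        if (lseek(fd, pgbase, SEEK_SET) < 0)
                return (-1);
        if (write(fd, page, PAGESIZE) != PAGESIZE)
                return (-1);
        return (0);
}

int
main(void)
{
        const char *msg = "some partial-page data";
        int fd = open("rmw.dat", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return (1);
        }
        /* A small write landing mid-page forces the full cycle. */
        if (unaligned_write(fd, msg, 4096 + 100, strlen(msg)) < 0)
                perror("unaligned_write");
        close(fd);
        return (0);
}

An aligned, page-sized write could skip the read entirely; everything
else pays for a read plus the write, which is why client writes will
not approach the read rate.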
Some of this could be (but has not been) alleviated by the NFS
implementation itself doing page alignment caching of data writes.
There has been relatively little effort put into work on server
caching.

Ideally, one could "know" that the disk is capable of track write
caching and force the write async after it goes to the controller,
instead of waiting for completion.  This is a general speedup in any
case.  The same could be done for controllers with NVRAM for write
caching.  Typically this would not affect you, since you probably do
not have the hardware for it.

The "seek overhead", unless it is being retrieved from the disk
itself, is a fictional number and does not reflect performance.  This
is because it is unlikely that you are running on a drive without
translated geometry, Zone Boundary Recording, etc.  I would ignore
this number completely, since the actual number of seeks is probably
unrelated.

> This change, however, has made the server file system very
> vulnerable to system crashes.  Two power outages (even after the
> system had been idle for several minutes - no nfs transfers) have
> resulted in broken filesystems (more than 100 bad INODES).

This is expected behavior.  Part of the fault lies in the dirty page
flush policies, which under BSD are not as aggressive as they should
be in terms of a window guarantee until the data is on disk.  Part is
also the fault of the page marking mechanism invalidating a
potentially correct copy on disk (except for time stamps, etc.), as
if the data contained on disk were completely invalid.

This policy can be (and probably will be) changed.  At the very
least, this also wants the semantic and integrity update guarantees
to be separate as well.  At the most, file system idle time should be
watched, and file system updates (even to the point of marking the FS
clean!) should be done if the file system has been idle for, say, 30
seconds.

> In both instances, the file system was unrecoverable (FSCK core
> dumped) and a new filesystem had to be created on the drive.  Are
> there any suggestions for improving the speed of the clients writing
> to the server while still maintaining a reasonable tolerance to
> system crashes?

I would certainly consider a UPS.  Or you can become a file system or
disk driver hacker.  8-).  Either one would allow safer use of the
flag NFS_ASYNC.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.