Date: Mon, 03 Jul 2006 15:40:01 -0700
From: Michael Collette <Michael.Collette@TestEquity.com>
To: User Freebsd <freebsd@hub.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: NFS Locking Issue
Message-ID: <44A99CC1.7070501@TestEquity.com>
In-Reply-To: <20060702162942.D1103@ganymede.hub.org>
References: <20060629230309.GA12773@lpthe.jussieu.fr> <20060630041733.GA4941@zibbi.meraka.csir.co.za> <cone.1151802806.162227.42680.1000@zoraida.natserv.net> <20060702162942.D1103@ganymede.hub.org>

User Freebsd wrote:
> On Sat, 1 Jul 2006, Francisco Reyes wrote:
>
>> John Hay writes:
>>
>>> I only started to see the lockd problems when upgrading the server side
>>> to FreeBSD 6.x and later. I had various FreeBSD clients, between 4.x
>>> and 7-current and the lockd problem only showed up when upgrading the
>>> server from 5.x to 6.x.
>>
>> It confirms the same we are experiencing.. constant freezing/locking
>> issues.
>> I guess no more 6.X for us.. for the foreseeable future..
>
> Since there are several of us experiencing what looks to be the same
> sort of deadlock issue, I beseech you not to give up

Honestly trying not to.  To tell ya the truth, I've been giving a real
hard look at Ubuntu for my serving needs.  This NFS thing has got me
seriously questioning FreeBSD right at the moment.

>... right now, all
> we've been able to get to the developers is virtually useless
> information (vmstat and such shows the problem, but it doesn't allow
> developers to identify the problem) ...
>
> Is this a problem that you can easily recreate, even on a non-production
> machine?

Oh yeah.  I've got a couple of ways I'm able to get this to fail.

Method #1:
---------------------------------------------------------------------
Let's start with the simplest.  The scenario here involves 2 machines,
mach01 and mach02.  Both are running 6-STABLE, and both are running
rpcbind, rpc.statd, and rpc.lockd.  mach01 has exported /documents and
mach02 is mounting that export under /mnt.  Simple enough?

The /documents directory has multiple subdirectories and files of
various sizes.  The actual amount of data doesn't really matter to
produce a failure.

All you need to do at this point is to try to copy files from that mount
point to somewhere else on the hard drive.

    cp -Rp /mnt/* /tmp/documents/

You may, or may not, see that a couple of subdirectories were created,
but no files actually moved over.  The cp command is now locked up, and
no traffic moves.
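
For anyone wanting to script the Method #1 reproduction rather than watch a terminal, here is a minimal sketch, assuming Python 3 is available on the client.  The function name `copy_with_timeout`, the 30-second default, and the use of `/mnt/.` in place of the shell glob `/mnt/*` are all illustrative assumptions, not part of the original report.  It only detects that the copy did not finish in time; it says nothing about why.

```python
import subprocess
import sys


def copy_with_timeout(src, dst, timeout_s=30.0):
    """Run `cp -Rp src dst` and report how it ended.

    Returns True if cp exited with status 0, False if it exited with a
    non-zero status, and None if it was still running after timeout_s.
    """
    proc = subprocess.Popen(["cp", "-Rp", src, dst])
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked inside the kernel on NFS
        # may ignore the signal, so we do not wait for it to exit.
        proc.kill()
        return None


if __name__ == "__main__":
    # Hypothetical defaults mirroring the paths used in the message.
    src = sys.argv[1] if len(sys.argv) > 1 else "/mnt/."
    dst = sys.argv[2] if len(sys.argv) > 2 else "/tmp/documents/"
    result = copy_with_timeout(src, dst)
    if result is None:
        print("copy did not finish in time (possible hang)")
    elif result:
        print("copy finished")
    else:
        print("cp reported an error")
```

Running it once with rpc.lockd up and once with it stopped on either side would give a scripted comparison of the two cases described here.  The destination directory is assumed to exist already.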
The hang usually shows up within a second or two.  I can repeat this
with multiple 6-STABLE boxes.  Turn off rpc.lockd on either the server
or client before the cp command, and things work.

Method #2:
---------------------------------------------------------------------
Booting to a diskless workstation.  The server (mach01) has exported
/usr, /usr/local, /usr/X11R6 and enough other stuff to get a diskless
workstation up and running.  Not going to get into all the details here
other than to say that I have a fully functioning setup like this on 5.4
boxes now.

I've knocked the boot up of the diskless client (mach02) down to console
only.  Once at the console I startx with a regular user, taking me into
twm.  From there I try to launch a KDE application, which in my test
case is kwrite.  The same situation is true with launching a GTK app,
such as Gimp.

X and twm start up.  I've got all the rest of the system reasonably
functional.  When I try to run kwrite, none of the KDE subsystems start
up.  kwrite just sits there in a lockd state.  Same is true of Gimp.

If I shut down rpc.lockd on either machine I'm able to bring up a full
KDE desktop, with all applications able to run.

Other Testing:
---------------------------------------------------------------------
At one point we had in our test network a 6.1 NFS server providing files
to 5.4 diskless clients without any problems.  We first got to noticing
the bulk of the glitches when I moved the diskless setup to use a 6.1
kernel.

As I said, I've been looking at Linux alternatives.  Especially after
reading about Michel Talon's experiences with Fedora.  I initially tried
CentOS, but wasn't able to get NFS working properly on that thing.  I
had an Ubuntu CD handy, so I installed it on a test box.  Wow, does that
NFS server boogie!

Using Ubuntu as the server I connected a FreeBSD 5.4 and 6-stable box as
clients on a 100Mb/s network.  The time trial used a dummy 100Meg file
transferred from the server to the client.
We measured 90Mb/s transfer, which was FAR faster than I had ever been
able to get 2 FreeBSD boxes to perform doing similar tests.

I then used Ubuntu to connect to a 5.4 server we have in production.  I
don't recall the exact stats, but it was close to 10x slower.  No
lockups here though.

After the 4th of July I intend to test Ubuntu as a client to a FreeBSD
6-STABLE server on a gigabit LAN to run similar time trials.  I'm
looking to confirm what I can only suspect at this point, which is that
the NFS server on FreeBSD is mucked up, but the client is okay.

As time allows I hope to run similar tests between two Ubuntu boxes,
then run it all again with Fedora.  Seriously debating whether to move
some or all of our infrastructure to Linux after all this.  A 3-4 month
old known bug like this gives me a great deal of concern about FreeBSD.
That, and Ubuntu's NFS server speed just about knocked me over!

> In my case, I have one machine fully configured for debugging,
> but, of course, since re-configuring it, it hasn't exhibited the problem
> ... if most of us get our machines configured properly to give useful
> information to the developers to debug this, the faster it will get
> fixed ...
>
> My experience with most of the developers is that if you can get into
> DDB and give them 'internal traces' of the code, bugs tend to get fixed
> very quickly ... vmstat/ps give "external views", more summaries than
> anything ... it's the details "under the hood" that they need ... it's not
> much different than your auto-mechanic ... try telling him there is a
> 'knocking under the hood, please tell me how to fix it, but you can't
> have my car', and he'll brush you off ... give him 30 minutes under the
> hood, and not only will he have identified it, but he'll probably fix it
> too ...

Marc, the car is starting but won't move at all.  I don't know if this
is the transmission, the steering wheel, or the radio.
I am feeling pretty certain that this car should never have left the lot
in this condition though.

Again, these are problems that have been around for a while...
http://www.freebsd.org/cgi/query-pr.cgi?pr=84953
http://www.freebsd.org/cgi/query-pr.cgi?pr=80389

Later on,
-- 
Michael Collette
IT Manager
TestEquity Inc
Michael.Collette@TestEquity.com