From: Rick Macklem <rmacklem@uoguelph.ca>
To: freebsd-fs <freebsd-fs@freebsd.org>
Date: Wed, 28 Aug 2013 20:02:59 -0400 (EDT)
Subject: rpc.lockd kernel RPC over UDP patch for testing/review
Message-ID: <1332572251.15040105.1377734579493.JavaMail.root@uoguelph.ca>

Hi,

Doug White posted this to me via email some time ago (I hope he doesn't
mind me reposting it here):

> First, we have an installed client system doing heavy NFS lock traffic that
> occasionally experiences lockd lockups that require a system reboot to clear.
> Diagnosis of the most recent hang identified corruption of one of the tracking
> variables (cu->cu_send specifically) in the congestion control in
> clnt_dg_call() as the culprit. Since lockd only uses one thread, no congestion
> control is really necessary. We are going to make a local patch to avoid the
> if() that leads to the msleep() if cu->threads = 1 so we don't run into that
> again, though the corruption of cu_send is still a bit troubling. The
> corruption might stem from repeated retries allowing cu_send to grow without
> bound, or some other bizarre code path that causes underflow.

After inspecting the code, I found two places where cu_sent (Doug called it
cu_send just to try and confuse me. It worked for a while ;-) wasn't
incremented when a request was re-inserted in the send queue. Since it is
always decremented when a request is dequeued, I think this could have
resulted in a bogus cu_sent value.

The simple patch at:
  http://people.freebsd.org/~rmacklem/rpcudp.patch
adds increments for cu_sent for these two places.

If anyone is using rpc.lockd and can test/review this patch, it would be
appreciated.

Thanks, rick
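
To illustrate the accounting imbalance described above, here is a minimal userland sketch in C. It is not the code from clnt_dg_call() or from the patch: struct dg_state, the helper names and simulate_retries() are all hypothetical, and the counter is stepped by 1 purely for illustration. It only shows why a decrement on every dequeue, without a matching increment on every re-insert, leaves the counter with a bogus value.

```c
#include <stdio.h>

/* Hypothetical stand-in for the client state; NOT the real kernel struct. */
struct dg_state {
    int cu_sent;    /* requests currently charged against the window */
};

/* Initial transmit: charge the request against the window. */
static void send_request(struct dg_state *cu)
{
    cu->cu_sent += 1;
}

/* Dequeue (timeout or reply): the counter is always decremented. */
static void dequeue_request(struct dg_state *cu)
{
    cu->cu_sent -= 1;
}

/*
 * Re-insert a request for retransmission. With charge == 0 this models
 * the behaviour described in the message (no increment on re-insert);
 * with charge != 0 it models the kind of fix the patch is said to add.
 */
static void requeue_request(struct dg_state *cu, int charge)
{
    if (charge)
        cu->cu_sent += 1;
}

/* Run one request through `retries` retransmit cycles; return final cu_sent. */
static int simulate_retries(int retries, int charge_on_requeue)
{
    struct dg_state cu = { 0 };
    int i;

    send_request(&cu);
    for (i = 0; i < retries; i++) {
        dequeue_request(&cu);                     /* timed out, pulled off queue */
        requeue_request(&cu, charge_on_requeue);  /* put back for retransmit */
    }
    dequeue_request(&cu);                         /* reply finally arrives */
    return cu.cu_sent;
}
```

Under these assumptions, simulate_retries(3, 0) ends with the counter at -3, while simulate_retries(3, 1) ends at 0, which is the balanced value expected once the only request has completed.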