From: "Rodney W. Grimes"
Message-Id: <201706130045.v5D0jC4a053879@pdx.rh.CN85.dnsmgr.net>
Subject: Re: post ino64: lockd no runs?
To: Xin LI
Date: Mon, 12 Jun 2017 17:45:12 -0700 (PDT)
CC: John Baldwin, FreeBSD Current, stable@freebsd.org

> On Mon, Jun 12, 2017 at 10:14 AM, John Baldwin wrote:
> > On Sunday, June 11, 2017 11:12:25 AM David Wolfskill wrote:
> >> On Sun, Jun 04, 2017 at 08:57:44AM -0400, Michael Butler wrote:
> >> > It seems that {rpc.}lockd no longer runs after the ino64 changes on any
> >> > of my systems after a full rebuild of src and ports. No log entries
> >> > offer any insight as to why :-(
> >> >
> >> > imb
> >>
> >> I don't tend to use NFS on my systems that are running head, so I
> >> haven't had occasion to test this as stated.
> >>
> >> However, I just completed my weekly update of the "production" systems
> >> here at home, running stable/11. And I find that lockd seems to be ...
> >> claiming that all is well, but declining to run (for long).
> >>
> >> To the best of my knowledge, that was not the case until this last
> >> update, which was from:
> >>
> >> FreeBSD albert.catwhisker.org 11.1-PRERELEASE FreeBSD 11.1-PRERELEASE #316 r319566M/319569:1100514: Sun Jun 4 03:54:41 PDT 2017 root@freebeast.catwhisker.org:/common/S1/obj/usr/src/sys/ALBERT amd64
> >>
> >> to
> >>
> >> FreeBSD albert.catwhisker.org 11.1-BETA1 FreeBSD 11.1-BETA1 #322 r319823M/319823:1100514: Sun Jun 11 03:56:10 PDT 2017 root@freebeast.catwhisker.org:/common/S1/obj/usr/src/sys/ALBERT amd64
> >>
> >> The "glaringly obvious" symptom in my case is that I am now unable
> >> to (directly) save an email message from within mutt(1) by appending
> >> it to an NFS-resident file.
> >> (Saving it to a local file, then using
> >> cat(1) to append that to the NFS-resident file & removing the local
> >> copy works....)
> >>
> >> After a few variations on a theme of:
> >>
> >> albert(11.1)[5] sudo service lockd restart
> >> lockd not running?
> >> Starting lockd.
> >> albert(11.1)[6] echo $?
> >> 0
> >> albert(11.1)[7] service lockd status
> >> lockd is not running.
> >>
> >> I finally(!) thought to ask ktrace what's going on (as tailing
> >> /var/log/messages was completely unproductive, even after enabling
> >> rc_debug).
> >>
> >> So I tried: "sudo ktrace -di service lockd restart"; upon examination of
> >> the output of kdump(1), I see that the trace ends with:
> >>
> >> ...
> >> 2811 rpc.lockd NAMI "/var/run/logpriv"
> >> 2786 sh CALL read(0xa,0x627fc0,0x400)
> >> 2786 sh GIO fd 10 read 0 bytes
> >> ""
> >> 2811 rpc.lockd RET connect 0
> >> 2786 sh RET read 0
> >> 2811 rpc.lockd CALL sendto(0x3,0x7fffffffe2c0,0x27,0,0,0)
> >> 2786 sh CALL exit(0)
> >> 2811 rpc.lockd GIO fd 3 wrote 39 bytes
> >> "<30>Jun 11 15:43:10 rpc.lockd: Starting"
> >> 2811 rpc.lockd RET sendto 39/0x27
> >> 2811 rpc.lockd CALL sigaction(SIGALRM,0x7fffffffec20,0)
> >> 2811 rpc.lockd RET sigaction 0
> >> 2811 rpc.lockd CALL nlm_syscall(0,0x1e,0x4,0x801015040)
> >> 2811 rpc.lockd RET nlm_syscall -1 errno 14 Bad address
> >
> > This is a really good clue. nlm_syscall is dying with EFAULT. The last
> > argument is a pointer to an array of char * pointers, and the only way
> > I can see it dying is if it fails to copyin() one of the strings pointed
> > to by those pointers. You could try running rpc.lockd under gdb from
> > ports, setting a breakpoint on 'nlm_syscall', and then printing out
> > 'addr_count' and 'p *addrs@(addr_count * 2)'.
>
> Yes, I found that the kernel was trying to copyin() from NULL, and
> then found that corresponds to 'uaddr'.
> After some tracing I found that the tightened condition for
> taddr2uaddr now (correctly) enforces the buffer length passed by the
> caller. That length has not been set correctly for ~9 years (since
> r177633, which set the size to sizeof(pointer)), but it was never
> noticed because nothing checked it. So the solution seems to be to
> set the length values to the allocated size, and that fixed the
> issue for me.
>
> The code could use some cleanup, and I plan to do that at a later time.
>
> > Unfortunately I'm not able to reproduce the failure on a test machine
> > I have running head post-ino64.
>
> This should be fixed by r319852 in -HEAD
> (https://svnweb.freebsd.org/base?view=revision&revision=319852), and
> I'll MFC the change after the 3-day settle period, assuming there are
> no objections, as this is a regression.

(RE hat on)
The next 11.1 release builds start on the 16th; please try to make
your RFA to RE and complete the merge before that date. I would
really hate to have 11.1 go out without this fixed.

-- 
Rod Grimes rgrimes@freebsd.org