Date: Mon, 12 Jun 2017 17:45:12 -0700 (PDT)
From: "Rodney W. Grimes" <freebsd-rwg@pdx.rh.CN85.dnsmgr.net>
To: Xin LI <delphij@gmail.com>
Cc: John Baldwin <jhb@freebsd.org>, FreeBSD Current <freebsd-current@freebsd.org>, stable@freebsd.org
Subject: Re: post ino64: lockd no runs?
Message-ID: <201706130045.v5D0jC4a053879@pdx.rh.CN85.dnsmgr.net>
In-Reply-To: <CAGMYy3tVZ6CNOfKsTv0uXKV%2BFyUGC0Fw-mN_CHsMEf1FRM-VAg@mail.gmail.com>

> On Mon, Jun 12, 2017 at 10:14 AM, John Baldwin <jhb@freebsd.org> wrote:
> > On Sunday, June 11, 2017 11:12:25 AM David Wolfskill wrote:
> >> On Sun, Jun 04, 2017 at 08:57:44AM -0400, Michael Butler wrote:
> >> > It seems that {rpc.}lockd no longer runs after the ino64 changes on any
> >> > of my systems after a full rebuild of src and ports. No log entries
> >> > offer any insight as to why :-(
> >> >
> >> > imb
> >>
> >> I don't tend to use NFS on my systems that are running head, so I
> >> haven't had occasion to test this as stated.
> >>
> >> However, I just completed my weekly update of the "prooduction" systems
> >> here at home, running stable/11. And I find that lockd seems to be ...
> >> claiming that all is well, but declining to run (for long).
> >>
> >> To the best of my knowledge, that was not the case until this last
> >> update, which was from:
> >>
> >> FreeBSD albert.catwhisker.org 11.1-PRERELEASE FreeBSD 11.1-PRERELEASE #316 r319566M/319569:1100514: Sun Jun 4 03:54:41 PDT 2017 root@freebeast.catwhisker.org:/common/S1/obj/usr/src/sys/ALBERT amd64
> >>
> >> to
> >>
> >> FreeBSD albert.catwhisker.org 11.1-BETA1 FreeBSD 11.1-BETA1 #322 r319823M/319823:1100514: Sun Jun 11 03:56:10 PDT 2017 root@freebeast.catwhisker.org:/common/S1/obj/usr/src/sys/ALBERT amd64
> >>
> >> The "glaringly obvious" symptom in my case is that I am now unable
> >> to (directly) save an email message from within mutt(1) by appending
> >> it to an NFS-resident file. (Saving it to a local file, then using
> >> cat(1) to append that to the NFS- resident file & removing the local
> >> copy works....)
> >>
> >> After a few variations on a theme of:
> >>
> >> albert(11.1)[5] sudo service lockd restart
> >> lockd not running?
> >> Starting lockd.
> >> albert(11.1)[6] echo $?
> >> 0
> >> albert(11.1)[7] service lockd status
> >> lockd is not running.
> >>
> >> I finally(!) thought to ask ktrace what's going on (as tailing
> >> /var/log/messages was completely unproductive, even after enabling
> >> rc_debug).
> >>
> >> So I tried: "sudo ktrace -di service lockd restart"; upon exanimation of
> >> the output of kdump(1), I see that the trace ends with:
> >>
> >> ...
> >>   2811 rpc.lockd NAMI  "/var/run/logpriv"
> >>   2786 sh       CALL  read(0xa,0x627fc0,0x400)
> >>   2786 sh       GIO   fd 10 read 0 bytes
> >>        ""
> >>   2811 rpc.lockd RET   connect 0
> >>   2786 sh       RET   read 0
> >>   2811 rpc.lockd CALL  sendto(0x3,0x7fffffffe2c0,0x27,0,0,0)
> >>   2786 sh       CALL  exit(0)
> >>   2811 rpc.lockd GIO   fd 3 wrote 39 bytes
> >>        "<30>Jun 11 15:43:10 rpc.lockd: Starting"
> >>   2811 rpc.lockd RET   sendto 39/0x27
> >>   2811 rpc.lockd CALL  sigaction(SIGALRM,0x7fffffffec20,0)
> >>   2811 rpc.lockd RET   sigaction 0
> >>   2811 rpc.lockd CALL  nlm_syscall(0,0x1e,0x4,0x801015040)
> >>   2811 rpc.lockd RET   nlm_syscall -1 errno 14 Bad address
> >
> > This is a really good clue. nlm_syscall is dying with EFAULT. The last
> > argument is a pointer to an array of char * pointers, and the only way
> > I can see it dying is if it fails to copyin() one of the strings pointed
> > to by those pointers. You could try running rpc.lockd under gdb from
> > ports and setting a breakpoint on 'nlm_syscall' and then printing out
> > 'addr_count' and 'p addrs@(addr_count * 2)'.
>
> Yes, I found that the kernel was trying to copyin() from NULL, and
> then found that corresponds to 'uaddr'. After some tracing I found
> that the tightened condition for taddr2uaddr have enforced (correctly)
> buffer length passed from caller, which was not set correctly since ~9
> years ago (r177633, which sets the size to sizeof(pointer)) but never
> gets noticed because there is no check on that, so the solution seems
> to be to correctly set the length values to (allocated size), and that
> have fixed the issue for me.
>
> The code could use some cleanups and I plan to do it at some later time.
>
> > Unfortunately I'm not able to reproduce the failure on a test machine
> > I have running head post-ino64.
>
> This should have been fixed by r319852 in -HEAD (
> https://svnweb.freebsd.org/base?view=revision&revision=319852 ), and
> I'll MFC the change after 3 days' settle assuming there is no
> objections, as this is a regression.

(RE hat on) The next 11.1 release builds start on the 16th,
please try to make your RFa to RE and complete the merge
before that date, I would really hate to have 11.1 go out
without this fixed.

-- 
Rod Grimes                                                 rgrimes@freebsd.org