From owner-freebsd-stable@FreeBSD.ORG Wed Jun 8 05:06:59 2005 Return-Path: X-Original-To: freebsd-stable@freebsd.org Delivered-To: freebsd-stable@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 1622816A41C for ; Wed, 8 Jun 2005 05:06:59 +0000 (GMT) (envelope-from terry@tmk.com) Received: from server.tmk.com (server.tmk.com [204.141.35.63]) by mx1.FreeBSD.org (Postfix) with ESMTP id B796D43D48 for ; Wed, 8 Jun 2005 05:06:58 +0000 (GMT) (envelope-from terry@tmk.com) Received: from tmk.com by tmk.com (PMDF V6.2-X27 #37010) id <01LP76MP3YI8000CP8@tmk.com> for freebsd-stable@freebsd.org; Wed, 08 Jun 2005 01:06:56 -0500 (EST) Date: Wed, 08 Jun 2005 00:37:59 -0500 (EST) From: Terry Kennedy To: freebsd-stable@freebsd.org Message-id: <01LP77OMTW08000CP8@tmk.com> MIME-version: 1.0 Content-type: TEXT/PLAIN; CHARSET=us-ascii Subject: Strange TCP-related hang on startup w/ recent CVSUP X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 08 Jun 2005 05:06:59 -0000 I have a number of boxes which have been running well w/ 5.4-stable. The last kernel and userland builds on those boxes were on May 26th, from a CVSUP done that same day. Last night I did another CVSUP and the usual buildworld/buildkernel/in- stallkernel/installworld/mergemaster, and the boxes wouldn't come back up, hanging in sbwait state at the point where they would normally do their first net access. In one box, this is a batch of NFS mounts; on the other it is the initial ntpdate query. At this point the boxes don't respond to pings and sit for- ever (at least an hour) not doing anything. Typing ^C on the console aborts the hanging process, and then the startup proceeds. With a ping going from another system, I see that the net doesn't come up on the problem boxes until several seconds into the execution of the next network-related command - it is almost as if the previous ifconfig (done from rc.conf) didn't have any effect (despite the "link up" message on the console). The other oddity is that I get a "rpc.lockd: 100024 RPC: Port mapper failure" right after lockd starts, even though other net commands have completed successfully. Booting from kernel.old (the May 26th one) boots normally, so it is a ker- nel issue, not something in userland. Hardware is a Tyan S2721-533 (dual Xeon) board w/ onboard Intel "em" Gig- abit Ethernet. To keep this message short, I'm omitting the dmesg output, but it can be found at http://www.tmk.com/raidzilla. Here's the relevant parts of the console output from the two systems: first system: /dev/da1s3h: clean, 199137891 free (819 frags, 24892134 blocks, 0.0% fragmentation) Setting hostname: rz1.tmk.com. em0: flags=8843 mtu 1500 options=b inet 204.141.35.40 netmask 0xffffff00 broadcast 204.141.35.255 inet6 fe80::2e0:81ff:fe28:94d6%em0 prefixlen 64 tentative scopeid 0x1 ether 00:e0:81:28:94:d6 media: Ethernet 1000baseTX (autoselect) status: no carrier lo0: flags=8049 mtu 16384 inet 127.0.0.1 netmask 0xff000000 inet6 ::1 prefixlen 128 inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4 add net default: gateway 204.141.35.1 Additional routing options:. Starting devd. Mounting NFS file systems:em0: Link is up 1000 Mbps Full Duplex load: 1.10 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k load: 1.10 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k load: 1.10 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k load: 1.10 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k load: 1.01 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k load: 1.01 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k load: 0.56 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k load: 0.56 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k load: 0.07 cmd: mount_nfs 395 [sbwait] 0.00u 0.00s 0% 820k ^CScript /etc/rc.d/mountcritremote interrupted Starting syslogd. Checking for core dump on /dev/da0s1b... savecore: no dumps found Setting date via ntp. Looking for host gate.tmk.com and service ntp host found : gate.tmk.com Looking for host 204.141.35.61 and service ntp host found : gate.tmk.com Looking for host 199.224.0.146 and service ntp host found : ns1.ispnetinc.net Looking for host 199.224.0.154 and service ntp host found : ns2.ispnetinc.net Looking for host 204.141.40.135 and service ntp host found : ns3.ispnetinc.net 8 Jun 00:30:34 ntpdate[416]: step time server 199.224.0.146 offset 1.526419 sec Starting rpcbind. NFS access cache time=2 ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/X11R6/lib /usr/local/lib a.out ldconfig path: /usr/lib/aout /usr/lib/compat/aout /usr/X11R6/lib/aout Starting mountd. Starting nfsd. Starting statd. Starting lockd. Jun 8 00:30:35 rz1 rpc.lockd: 100024 RPC: Port mapper failure Starting usbd. Starting local daemons:. Updating motd. Starting ntpd. Starting rwhod. Configuring syscons: blanktime screensaver. second system: /dev/da1s3h: clean, 199127439 free (1095 frags, 24890793 blocks, 0.0% fragmentation) Setting hostname: rz2.tmk.com. em0: flags=8843 mtu 1500 options=b inet 204.141.35.41 netmask 0xffffff00 broadcast 204.141.35.255 inet6 fe80::2e0:81ff:fe2e:67b0%em0 prefixlen 64 tentative scopeid 0x1 ether 00:e0:81:2e:67:b0 media: Ethernet autoselect status: no carrier lo0: flags=8049 mtu 16384 inet 127.0.0.1 netmask 0xff000000 inet6 ::1 prefixlen 128 inet6 fe80::1%lo0 prefixlen 64 scopeid 0x4 add net default: gateway 204.141.35.1 Additional routing options:. Starting devd. Mounting NFS file systems:. Starting syslogd. Checking for core dump on /dev/da0s1b... savecore: no dumps found Setting date via ntp. Looking for host gate.tmk.com and service ntp em0: Link is up 1000 Mbps Full Duplex load: 0.75 cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k load: 0.75 cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k load: 0.75 cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k load: 0.75 cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k load: 0.69 cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k load: 0.64 cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k load: 0.64 cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k load: 0.02 cmd: ntpdate 420 [sbwait] 0.00u 0.00s 0% 844k ^CScript /etc/rc.d/ntpdate interrupted Starting rpcbind. NFS access cache time=2 ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/X11R6/lib /usr/local/lib a.out ldconfig path: /usr/lib/aout /usr/lib/compat/aout /usr/X11R6/lib/aout Starting mountd. Starting nfsd. Starting statd. Starting lockd. Jun 8 00:58:41 rz2 rpc.lockd: 100024 RPC: Port mapper failure Starting usbd. Starting local daemons:. Updating motd. Starting ntpd. Starting rwhod. Configuring syscons: blanktime screensaver. Here are the list of changed kernel files since the previous (working) kernel: find . -newer /tmp/date ./alpha/alpha/busdma_machdep.c ./amd64/amd64/identcpu.c ./amd64/amd64/machdep.c ./conf/NOTES ./conf/files ./conf/options ./dev/ata/ata-chipset.c ./dev/ata/ata-pci.h ./dev/bktr/bktr_reg.h ./dev/sound/pci/ich.c ./kern/uipc_socket.c ./kern/kern_event.c ./kern/sysv_shm.c ./kern/subr_unit.c ./modules/netgraph/device ./modules/netgraph/device/Makefile ./modules/netgraph/Makefile.inc ./modules/udbp/Makefile ./netgraph/bluetooth/drivers/ubt ./netgraph/bluetooth/drivers/ubt/ng_ubt.c ./netgraph/ng_eiface.c ./netgraph/netgraph.h ./netgraph/ng_base.c ./netgraph/ng_iface.c ./netgraph/ng_device.c ./netgraph/ng_ksocket.c ./netinet/tcp_subr.c ./netinet/ip_divert.c ./netinet/ip_icmp.c ./netinet/ip_icmp.h ./netinet6/ipsec.c ./sys/systm.h Any ideas? Also, if there is additional information needed, let me know what to do and I'll report back. Terry Kennedy http://www.tmk.com terry@tmk.com New York, NY USA