From: Martin Minkus
To: freebsd-questions
Date: Wed, 23 Jun 2010 16:50:41 +1200
Subject: RE: sshd / tcp packet corruption ?

So it is definitely some kind of packet corruption.

Using netcat to send a single megabyte of binary data to a box with no known issues (from kinetic -> steel):

    kinetic:/tmp$ dd if=/dev/urandom of=random.testfile bs=1k count=1k
    1024+0 records in
    1024+0 records out
    1048576 bytes transferred in 0.018347 secs (57152372 bytes/sec)

    kinetic:/tmp$ md5 random.testfile
    MD5 (random.testfile) = 9be700336ef81e8f89c60422fc795877

    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
    Connection to steel 1234 port [tcp/*] succeeded!
    kinetic:/tmp$

Whilst on steel (a stable Linux box that kinetic is MEANT to be replacing), each of the 16 transfers arrives with a different checksum:

    ff8a336e2be0c5c645e9f8a2dea67eea  random.testfile
    fae5da747c7857d1d87870c05db1f152  random.testfile
    a36c7166631ca10c460e323e39071094  random.testfile
    50a8f005a772f9321243215d1ea1adb6  random.testfile
    5da41b6f475f4655572df8c9bd81e181  random.testfile
    3104dd30179bf870e8ec6ef91c34d78f  random.testfile
    274a16890cf39c3089d8f0eda253f5fd  random.testfile
    e8d0bae998340252c6c67529d520feb4  random.testfile
    6d5377ca4545f98a55c017f518567092  random.testfile
    6b464f810fe1c2902694a7817f881906  random.testfile
    8912007161ececdb3e23a0018af36c36  random.testfile
    3f4e17d5a939cd8dfd0941c898c5ac5f  random.testfile
    9db926ba5f5f39dddcc0607983ed96f0  random.testfile
    835de68b981bf6cb871ebb2ce81404e1  random.testfile
    a211a3260d9c8ae595782d254798cacf  random.testfile
    030e08f1d3d0fb761046f66c888fdea2  random.testfile

If I reboot kinetic and try one last time:

    9be700336ef81e8f89c60422fc795877  random.testfile

Notice that this is now the CORRECT checksum on steel.

Kinetic’s samba, sshd, etc. will play nice for a day or so before returning to corrupting packets.

So, any ideas? Why would my packets start getting corrupted after a couple of days' use?

This box just runs isc-dhcpd, openldap-server, samba34, and ZFS (the real reason it's replacing the Linux box).

Thanks,
Martin.

From: Martin Minkus
Sent: Wednesday, 23 June 2010 16:01
To: freebsd-questions@freebsd.org
Subject: sshd / tcp packet corruption ?

It seems the issue I reported below may actually be related to some kind of TCP packet corruption?

Still the same box. I’ve noticed my SSH connections into the box will die randomly, with errors.

sshd logs the following on the box itself:

    Jun 18 11:15:32 kinetic sshd[1406]: Received disconnect from 10.64.10.251: 2: Invalid packet header. This probably indicates a problem with key exchange or encryption.
    Jun 18 11:15:41 kinetic sshd[15746]: Accepted publickey for martinm from 10.64.10.251 port 56469 ssh2
    Jun 18 11:15:58 kinetic su: nss_ldap: could not get LDAP result - Can't contact LDAP server
    Jun 18 11:15:58 kinetic su: martinm to root on /dev/pts/0
    Jun 18 11:16:06 kinetic su: martinm to root on /dev/pts/1
    Jun 18 11:16:29 kinetic sshd[15748]: Received disconnect from 10.64.10.251: 2: Invalid packet header. This probably indicates a problem with key exchange or encryption.
    Jun 18 11:16:30 kinetic sshd[15746]: syslogin_perform_logout: logout() returned an error
    Jun 18 11:16:34 kinetic sshd[16511]: Accepted publickey for martinm from 10.64.10.251 port 56470 ssh2
    Jun 18 11:16:41 kinetic sshd[16513]: Received disconnect from 10.64.10.251: 2: Invalid packet header. This probably indicates a problem with key exchange or encryption.
    Jun 18 11:16:41 kinetic sshd[16511]: syslogin_perform_logout: logout() returned an error

    Jun 23 15:52:59 kinetic sshd[56974]: Received disconnect from 10.64.10.209: 5: Message Authentication Code did not verify (packet #75658). Data integrity has been compromised.
    Jun 23 15:53:12 kinetic sshd[57109]: Accepted publickey for martinm from 10.64.10.209 port 9494 ssh2
    Jun 23 15:53:38 kinetic su: martinm to root on /dev/pts/3
    Jun 23 15:56:36 kinetic sshd[57111]: Received disconnect from 10.64.10.209: 2: Invalid packet header. This probably indicates a problem with key exchange or encryption.
    Jun 23 15:56:44 kinetic sshd[57151]: Accepted publickey for martinm from 10.64.10.209 port 9534 ssh2

My googlefu has failed me on this.

Any ideas what on earth this could be?

Ethernet card?

    em0: port 0xcc00-0xcc3f mem 0xfdfe0000-0xfdffffff,0xfdfc0000-0xfdfdffff irq 17 at device 7.0 on pci1
    em0: [FILTER]
    em0: Ethernet address: 00:0e:0c:6b:d6:d3

    em0: flags=8843 metric 0 mtu 1500
            options=209b
            ether 00:0e:0c:6b:d6:d3
            inet 10.64.10.10 netmask 0xffffff00 broadcast 10.64.10.255
            media: Ethernet autoselect (1000baseT )
            status: active

Thanks,
Martin.

From: Martin Minkus
Sent: Monday, 14 June 2010 11:21
To: freebsd-questions@freebsd.org
Subject: FreeBSD+ZFS+Samba: open_socket_in: Protocol not supported - after a few days?

Samba 3.4 on the FreeBSD 8-STABLE branch. After a few days I start getting weird errors: Windows PCs can't access the samba share or have trouble accessing files, and samba becomes totally unusable. Restarting samba doesn't fix it; only a reboot does.

Accessing files on the ZFS pool locally is fine. Other services on the box (like dhcpd and the openldap server) continue to work fine. Only samba dies, and by "dies" I mean it can no longer service clients and Windows brings up bizarre errors. Windows can access our other samba servers (on Linux, etc.) just fine.

Kernel:

    FreeBSD kinetic.pulse.local 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #4: Wed May 26 18:09:14 NZST 2010 martinm@kinetic.pulse.local:/usr/obj/usr/src/sys/PULSE amd64

Zpool status:

    kinetic:~$ zpool status
      pool: pulse
     state: ONLINE
     scrub: none requested
    config:

            NAME                                          STATE     READ WRITE CKSUM
            pulse                                         ONLINE       0     0     0
              raidz1                                      ONLINE       0     0     0
                gptid/3baa4ef3-3ef8-0ac0-f110-f61ea23352  ONLINE       0     0     0
                gptid/0eaa8131-828e-6449-b9ba-89ac63729d  ONLINE       0     0     0
                gptid/77a8da7c-8e3c-184c-9893-e0b12b2c60  ONLINE       0     0     0
                gptid/dddb2b48-a498-c1cd-82f2-a2d2feea01  ONLINE       0     0     0

    errors: No known data errors
    kinetic:~$

log.smb:

    [2010/06/10 17:22:39, 0] lib/util_sock.c:902(open_socket_in)
      open_socket_in(): socket() call failed: Protocol not supported
    [2010/06/10 17:22:39, 0] smbd/server.c:457(smbd_open_one_socket)
      smbd_open_once_socket: open_socket_in: Protocol not supported
    [2010/06/10 17:22:39, 2] smbd/server.c:676(smbd_parent_loop)
      waiting for connections

log.ANYPC:

    [2010/06/08 19:55:55, 0] lib/util_sock.c:1491(get_peer_addr_internal)
      getpeername failed. Error was Socket is not connected
      read_fd_with_timeout: client 0.0.0.0 read error = Socket is not connected.

The code in lib/util_sock.c, around line 902:

    /****************************************************************************
     Open a socket of the specified type, port, and address for incoming data.
    ****************************************************************************/

    int open_socket_in(int type,
                    uint16_t port,
                    int dlevel,
                    const struct sockaddr_storage *psock,
                    bool rebind)
    {
            struct sockaddr_storage sock;
            int res;
            socklen_t slen = sizeof(struct sockaddr_in);

            sock = *psock;

    #if defined(HAVE_IPV6)
            if (sock.ss_family == AF_INET6) {
                    ((struct sockaddr_in6 *)&sock)->sin6_port = htons(port);
                    slen = sizeof(struct sockaddr_in6);
            }
    #endif
            if (sock.ss_family == AF_INET) {
                    ((struct sockaddr_in *)&sock)->sin_port = htons(port);
            }

            res = socket(sock.ss_family, type, 0 );
            if( res == -1 ) {
                    if( DEBUGLVL(0) ) {
                            dbgtext( "open_socket_in(): socket() call failed: " );
                            dbgtext( "%s\n", strerror( errno ) );
                    }

In other words, the socket() call itself is failing, so it looks like something in the kernel is exhausted (what?). I don’t know if tuning is required, or if this is some kind of bug.

/boot/loader.conf:

    mvs_load="YES"
    zfs_load="YES"
    vm.kmem_size="20G"

    #vfs.zfs.arc_min="512M"
    #vfs.zfs.arc_max="1536M"

    vfs.zfs.arc_min="512M"
    vfs.zfs.arc_max="3072M"

I’ve played with a few sysctl settings (I found these recommendations online, but they make no difference).

/etc/sysctl.conf:

    kern.ipc.maxsockbuf=2097152
    net.inet.tcp.sendspace=262144
    net.inet.tcp.recvspace=262144
    net.inet.tcp.mssdflt=1452
    net.inet.udp.recvspace=65535
    net.inet.udp.maxdgram=65535
    net.local.stream.recvspace=65535
    net.local.stream.sendspace=65535

Any ideas on what could possibly be going wrong?

Any help would be greatly appreciated!

Thanks,
Martin