Date: Tue, 7 Sep 2010 14:08:13 -0700
From: "Mahlon E. Smith" <mahlon@martini.nu>
To: freebsd-stable@freebsd.org
Subject: Network memory allocation failures
Message-ID: <20100907210813.GI49065@martini.nu>
Hi, all.

I picked up a couple of Dell R810 monsters a couple of months ago. 96G
of RAM, 24 core. With the aid of this list, got 8.1-RELEASE on there,
and they are trucking along merrily as VirtualBox hosts.

I'm seeing memory allocation errors when sending data over the network.
It is random at best, however I can reproduce it pretty reliably.

Sending 100M to a remote machine. Note the 2nd scp attempt worked.
Most small files can make it through unmolested.

obb# dd if=/dev/random of=100M-test bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes transferred in 2.881689 secs (36387551 bytes/sec)

obb# rsync -av 100M-test skin:/tmp/
sending incremental file list
100M-test
Write failed: Cannot allocate memory
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: connection unexpectedly closed (28 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.7]

obb# scp 100M-test skin:/tmp/
100M-test                     52%   52MB  52.1MB/s   00:00 ETAWrite failed: Cannot allocate memory
lost connection

obb# scp 100M-test skin:/tmp/
100M-test                    100%  100MB  50.0MB/s   00:02

obb# scp 100M-test skin:/tmp/
100M-test                      0%    0     0.0KB/s   --:-- ETAWrite failed: Cannot allocate memory
lost connection

Fetching a file, however, works.

obb# scp skin:/usr/local/tmp/100M-test .
100M-test                    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test                    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test                    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test                    100%  100MB  20.0MB/s   00:05
...

I've ruled out bad hardware (mainly due to the behavior being
*identical* on the sister machine, in a completely different data
center.) It's a broadcom (bce) NIC.

mbufs look fine to me.

obb# netstat -m
511/6659/7170 mbufs in use (current/cache/total)
510/3678/4188/25600 mbuf clusters in use (current/cache/total/max)
510/3202 mbuf+clusters out of packet secondary zone in use (current/cache)
0/984/984/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
1147K/12956K/14104K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Plenty of available mem (not surprising):

obb# vmstat -hc 5 -w 5
 procs      memory      page                    disks     faults      cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr mf0 mf1    in    sy    cs us sy id
 0 0 0    722M    92G   115   0   1   0  1067   0   0   0   429 32637  6520  0  1 99
 0 0 0    722M    92G     1   0   0   0     0   0   0   0     9 31830  3279  0  0 100
 0 0 0    722M    92G     0   0   0   0     3   0   0   0     8 33171  3223  0  0 100
 0 0 0    761M    92G  2593   0   0   0  1712   0   5   4   121 35384  3907  0  0 99
 1 0 0    761M    92G     0   0   0   0     0   0   0   0    10 30237  3156  0  0 100

Last bit of info, and here's where it gets really weird. Remember how I
said this was a VirtualBox host? Guest machines running on it (mostly
centos) don't exhibit the problem, which is also why it took me so long
to notice it in the host. They can merrily copy data around at will,
even though they are going out through the same host interface.
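One thing I haven't done yet is watch the allocation failure counters
live while a transfer actually dies, rather than eyeballing netstat -m
after the fact. Roughly this in another terminal while re-running the
failing scp above -- an untested sketch, and the grep patterns are just
a guess at which counters/zones are interesting:

    # Poll the "denied" counters and the mbuf-related UMA zone stats
    # once a second while reproducing the failing scp elsewhere.
    while :; do
        date
        netstat -m | grep -i denied
        vmstat -z | grep -i mbuf
        sleep 1
    done

If those counters stay at zero while the scp still dies with "Cannot
allocate memory", that would at least suggest the failure is coming
from somewhere other than the mbuf zones.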
I'm not sure what to check for or toggle at this point. There are all
sorts of tunables I've been mucking around with to no avail, and so
I've reverted them to defaults. Mostly concentrating on these:

    hw.intr_storm_threshold
    net.inet.tcp.rfc1323
    kern.ipc.nmbclusters
    kern.ipc.nmbjumbop
    net.inet.tcp.sendspace
    net.inet.tcp.recvspace
    kern.ipc.somaxconn
    kern.ipc.maxsockbuf

It was suggested to me to try limiting the RAM in loader.conf to under
32G and see what happens. When doing this, it does appear to be "okay".
Not sure if that's coincidence, or directly related -- something with
the large amount of RAM that is confusing a data structure somewhere?
Or potentially a problem with the bce driver, specifically?

I've kind of reached a limit here in what to dig for / try next. What
else can I do to try and determine the root problem that would be
helpful? Has anyone ever had to deal with or seen something like this
recently? (Or hell, not recently?)

Ideas appreciated!

-- 
Mahlon E. Smith
http://www.martini.nu/contact.html
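P.S. For anyone wanting to repeat the under-32G experiment: the knob I
know of for capping physical memory at boot is the hw.physmem loader
tunable. A minimal sketch of what that looks like in /boot/loader.conf
-- the value here is only an example, picked to land under the 32G
mark:

    # /boot/loader.conf
    # Cap the physical memory the kernel will use (example value only).
    hw.physmem="31G"

After a reboot, sysctl hw.physmem (or the "avail memory" line in dmesg)
should reflect the cap, which is a quick way to confirm the tunable
actually took effect.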