Date: Sat, 1 Feb 2014 20:31:12 +0100
From: Matthew Rezny <matthew@reztek.cz>
To: freebsd-stable@freebsd.org
Subject: Processes hang in state "kmem a", system hang follows
Message-ID: <20140201203112.0000210c@unknown>
I'm seeing rather strange behavior from 10.0 on i386 thus far. This is another long message, so if you want the summary without the back-story, skip to the end. Sometimes it's hard to include relevant details without feeling like I'm rambling.

I started with FreeBSD not long before the 4.0 release and ran 4.x releases on i386 and Alpha for a long time. I tried the 5.x releases and had nothing but trouble, so I stuck with 4.x through that period. The Alpha never did move off 4.x before it was retired, but some of my i386 boxes made it onto 6.x and then sat there until they were taken out of active use. For years, FreeBSD 4.x and 6.x were the reliable OS I used for everything but my desktop (which ran OS X). More recently I started using FreeBSD 8 on amd64 with ZFS and quickly moved on to 9 as soon as 9.0 was released. At the same time, i386 hardware retired from desktop roles but still suitable for network services got 8.x installed on UFS. I had a rather good experience with 9-STABLE on amd64 running ZFS. For the most part it's solid, and ZFS support is much better than the sorry state Apple left it in before abandoning it on OS X, though I did get a few kernel panics when simply connecting disks that contained zpools from OS X. Due to both the difference in compilation speed and the fact that older hardware tends to be in more entrenched roles, I left my i386 systems out of the ZFS and 9.x experiments. I did also try 9.x on my one ppc64 box at various times to see if that might be a good way to utilize hardware Apple had dropped support for years prior. The state on ppc64 varied from panic on boot to being able to buildworld, but an idle system left for a few days would randomly go zombie: the console freezes, yet clearly there is some system activity, and it responds to ping but might not take an ssh connection. I chalked that up to the experimental state of the port.
I did see console freezes on i386 boxes booted from a 9.1 mfsbsd image, but I never investigated, because I was just using it to image and erase disks on old machines where I considered the hardware suspect. In the last couple of months I've been moving my amd64 systems to 10, starting during the RCs and keeping up such that they are now all 10-STABLE. The transition was fairly smooth and they are running quite well. Even one box with an older chipset and BIOS, which was panicking with an early 10-BETA, is now running 10.0-RELEASE with KMS. All very impressive. So, I figured it was time to start migrating some i386 boxes. I had recently moved a number of them to 9.2 and figured I should just go ahead and move everything up to 10.0 at close to the same time if possible. I had seen no problems with 9.2 or 9-STABLE on the i386 boxes I was preparing to upgrade, and I had already sorted out one Clang bug that affected a few of them (though it was less severe than a similar GCC bug that remains unfixed) after switching compilers when going to 9.

Since I started moving i386 boxes to 10.0, I've had nothing but strange problems. Last night I wrote a message about kern.maxswzone, something I started getting warnings about on one particular box when I put 9.2 on it, but which I didn't try to do anything about until now. I wrote that message with this one in mind, mentioning that I would follow up about processes hanging. That one came first because it has at least some hard numbers and not so much subjective feeling about performance and reliability. Between then and now, the pattern struck me: all my early successes with 10 were amd64, and now all the i386 boxes I've upgraded are barely functional. I have 4 i386 boxes that I tried to put 10.0 on in the past week, with various degrees of failure. There are 2 sets within the four; two are the low-end C3 boxes with 256MB and 384MB RAM described in my prior message to the list.
The other two are Pentium 4 systems, one with 2GB RAM and the other with 3GB, with substantially bigger disks, decent GPUs, etc. In other words, two are ancient and two are merely a little dated but still very usable. This faster pair I will mention first, then I will return to the slow pair. All these boxes are things I use around the house for network services or as essentially terminals in other rooms (kitchen PC to look up stuff, bedroom PC to watch movies, etc). The i386 boxes that run important services (externally facing network services, routing/firewall, etc) are being left for a second round once all issues are sorted out on these lower-importance boxes first.

The P4s had 9-STABLE installed on UFS volumes. I did the switch from csup to svnup to pull the 10.0 sources, did the buildworld/kernel and install on both, and all looked good. Before I went on to reinstall packages or anything else, I decided now might be a good time to try switching from UFS to ZFS; everything in /home was already backed up. So far I had only tried ZFS on amd64, due to early reports of flakiness on i386 related to exhausting kernel memory. In the couple of years since initial support, the ZFS code has gotten better integrated, more people have tried it, some tuning guides have been written, and I've seen reports of it being used on boxes with 512MB RAM. Most of my i386 boxes in server roles have 2GB, and it would be nice to migrate those to ZFS if possible. Best to test on these boxes first and try tuning if needed. I booted both P4 boxes from the mfsbsd CD, mounted the existing UFS volumes, tarred the whole mess, and dropped the uncompressed tar on my file server. On the server, I fired off xz to compress the tar file to speed the restore (or so I thought) while I prepared the machines. I set up the zpools in the normal way I'd done on all my amd64 boxes. One P4 box has a single disk, the other has two, so one is a single-vdev pool and the other is multiple, which adds a little variety for testing.
Aside from the vdevs, the pool properties, filesystems, and their properties are all identical to how I've been setting up my other ZFS boxes: LZ4 on most filesystems, gzip or none on a few, sha256 hashes throughout, no dedup, pretty normal. With the pools configured and mounted on /zroot, I scp'd the tar.xz file for each box into /tmp (which is tmpfs) and tried tar xJpvf in /zroot. After initial good progress, both boxes seemed to hang at about the same time. Disk activity stops, tar is sitting there as if it's going to do something, but no further progress on either when left for an hour. I started top on both boxes and noticed that the tar process on each is in the state "kmem a" and the resident memory allocation on each is exactly the same (around 750MB). My first thought was that I had used too much RAM with the 500MB tar.xz file in tmpfs. One box says 800MB free and the other says 1800MB free, but maybe there is a shortage of kernel memory. I couldn't kill tar, so I just rebooted each, cleared the zpools to try from a fresh state again, mounted the swap before filling /tmp this time, then attempted another extract. No joy; it stops the same way, with the exact same memory allocation, and each box stopped on the exact same file as on the first attempt. The free memory reports are the same as before, and no swap is being used; whatever is running out must be non-pageable. The next thing I tried was decoupling the stages. The tar process grows so large because it has to decompress LZMA, which requires a huge dictionary. I figured maybe the heavy disk I/O was causing buffers/cache to contend with the process in some way. Reboot again for a fresh start, scp the .tar.xz to /zroot/tmp, xz -d so it's just a plain tar, then tar xpvf in /zroot, and both complete without error. Set the mountpoint to / for each zroot and reboot into the running system. That was strange but solvable.
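For reference, the setup and the restore sequence that finally completed were along these lines. This is a sketch, not a transcript: the device name, pool name, dataset layout, file names, and the fileserver path are all placeholders, not the exact ones I used.

```shell
# Pool creation on the single-disk P4 box (ada0p3 is a placeholder device);
# properties match my usual layout: LZ4 compression, sha256 checksums, no dedup.
zpool create -o altroot=/zroot -O compress=lz4 -O checksum=sha256 zroot /dev/ada0p3
zfs create zroot/usr
zfs create -o compress=gzip zroot/usr/src

# The one-step restore that hung in "kmem a" (tar decompressing xz inline):
#   tar xJpvf /tmp/backup.tar.xz -C /zroot

# The workaround that completed: decouple decompression from extraction,
# and keep the tarball on the pool instead of tmpfs.
scp fileserver:/backups/backup.tar.xz /zroot/tmp/
xz -d /zroot/tmp/backup.tar.xz
tar xpvf /zroot/tmp/backup.tar -C /zroot

# Finally, make the new root mount as / on the next boot.
zfs set mountpoint=/ zroot
```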
I don't know what the "kmem a" state is, but I can guess it's probably short for something like "kmem alloc", which would suggest the process is waiting on a kernel allocation. So I figured I had some tuning to do, and a hung process isn't as bad as the kernel panics others had reported on i386 under heavy I/O load (e.g. rsync) with default settings. After all, the boot messages include two warnings about tuning ZFS memory on i386. In order to do the tuning, I needed some reproducible load, and buildworld is good for that. So, the first thing was to switch from svnup to the svnlite that is now in base and use that to get 10-STABLE sources. I did the rm -r on /usr/src and /usr/ports and then fired off the svnlite co for each. I found that the slowness of svn checkout is due to network latency, and running the two in parallel doesn't create I/O contention on either disk or network. While the P4s were fetching their sources, I went to deal with the pair of Via C3 boxes that I had taken to 10-PRERELEASE just a week prior and was ready to upgrade to 10-STABLE. Since that upgrade, they had sat unused waiting for an impending MFC so I could do away with a local patch. As mentioned in my other message, I made a mistake here on my first attempt: I forgot to clear the existing /usr/src and /usr/ports before starting the svnlite checkout. After realizing my mistake, I did the now larger (as it includes a .svn dir) rm -r of those dirs to start fresh. That's when I hit the problem with rm hanging on one box. Without repeating all the details, I had to boot mfsbsd to do the rm on the one box with only 256MB RAM, but what difference that made is simply inexplicable. Once I had gotten that straightened out, I started the svnlite checkout fresh. On the box with 384MB, the checkout completed with only one restart for a network dropout (common, since it takes 2-3 hours per checkout).
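The checkout procedure itself is nothing special; roughly what I ran on each box was the following (the repository URLs shown are the main FreeBSD svn mirror as an example; a regional mirror works the same way):

```shell
# Start fresh: the old trees (now including .svn dirs) have to go first.
rm -rf /usr/src /usr/ports

# Check out 10-STABLE src and the ports head with the svnlite in base.
svnlite checkout https://svn.freebsd.org/base/stable/10 /usr/src
svnlite checkout https://svn.freebsd.org/ports/head /usr/ports

# After an interrupted or wedged checkout, recover in place rather than
# re-fetching everything:
svnlite cleanup /usr/src
svnlite update /usr/src
```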
On the box with 256MB (which had previously fully checked out and gotten to the point where it wanted to prompt me about the conflict on every file in the tree), svnlite could only do a hundred files or so before it seemed to hang in the same way as rm. Running just one instance on /usr/src, without the parallel checkout on /usr/ports, made no difference. When rm was hanging, I might be able to kill it (after several minutes' wait) and reboot, or the console might lock. When svnlite hung, I could not log in, but I might be able to run a command on another VT. I was able to catch that svnlite was getting stuck in the state "kmem a". Hmmm... the same state that tar was getting stuck in on the other boxes. How were those doing now? I looked back at the P4s, which should have been done, as a few hours had been spent on the C3 boxes. They were sitting there in the middle of checkout, not making any visible progress. Ctrl-C doesn't work, I can't switch VTs, and even Ctrl-Alt-Del seems to not work. The consoles seem to be hung in a way eerily similar to what I'd seen from 9.x on non-amd64 platforms (both ppc64 and i386). I attempted to initiate an ssh connection into each of the P4s and then walked off for a minute for refreshment. When I came back, expecting to find a login prompt or a timeout, I found the ssh attempts had timed out and the two boxes had rebooted. I don't know if the Ctrl-Alt-Del finally registered or if the incoming ssh connection pushed them over the edge; I wasn't there to see, and the logs on both stop sometime before the hang. With both rebooted, I did a svnlite cleanup in /usr/src and /usr/ports on both, then fired off the svnlite co for each directory on both boxes. While those were running, I started digging into the kern.maxswzone tunable on the C3 box with less RAM. The box with more RAM was able to do the rm and the svn checkout of both src and ports in parallel, and showed no obvious sign of trouble, though I hadn't started a buildworld yet.
The box with less RAM was failing all over the place, and the only obvious difference was the warning about that tunable. After I wasted hours figuring out that the value is already sufficient but is apparently reduced after it's set, so it can't be effectively turned up, only down, I wrote my previous message to this list on that topic specifically and then went to bed. This morning I got up already thinking about the correlation: 10 is a disaster on all my i386 boxes thus far. The first thing I checked was the P4 boxes. Both had completed the svn checkout of both src and ports, a good sign. However, the box with 3GB RAM had the message "vm_thread_new: kstack allocation failed" repeated about a dozen times, a bad sign. The first thing I did was try to run top to see the size of the ARC, free RAM, etc. "No more processes." Uh oh, that's no good at all; I can't even run top. Curiously, the box with less RAM, only 2GB, had no such messages, so I tried to start top on it to see what its state was. Nothing happens when I push return; the cursor just sits there after top. On another VT, reboot gets the same response: none, the cursor just sits. I can't type, but I can switch VTs and scroll, until I do Ctrl-Alt-Del; after that, every key press is a beep. Back on the one that said there were no processes left for top, reboot gets the same non-response. Ctrl-Alt-Del doesn't beep; it just spits out the ^[[3~ typical of a dead console. Ugh, and not even a reset button to punch on these P4 boxes. So, svnlite checkout is a real strain that can bring a system to its knees. I'm not sure if this should be regarded as horrible inefficiency or as a means of checking the box before launching into a buildworld (as if that weren't enough strain to uncover most problems). While 10.0 is good on amd64, it seems a disaster on i386. Processes hang in this "kmem a" state, and it doesn't take much more to get the box to livelock.
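While a box is still responsive, a quick snapshot of kernel memory and ARC usage at least shows whether kernel memory is what's being pinned. Something like the following; the sysctl and kstat names are the standard FreeBSD 10 ones, the ARC lines only apply on the ZFS boxes, and "tar" here stands in for whichever process is stuck:

```shell
# How big is the kernel memory map, and what is it limited to?
sysctl vm.kmem_size vm.kmem_size_max

# ARC limit and current size (ZFS boxes only).
sysctl vfs.zfs.arc_max kstat.zfs.misc.arcstats.size

# Per-zone kernel allocator usage, to spot the zone that's exhausted.
vmstat -z

# Kernel stack of the stuck process, to see what allocation it's waiting in.
procstat -kk $(pgrep tar)
```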
I've only seen the "kmem a" state a few times, as most other times I can't inspect anything before the box is locked too hard to do anything. In some cases I'm not sure there's even a way to get the box shut down cleanly, as the most trivial of things lock it up hard. It's not even required to do anything. When I was experimenting with kern.maxswzone last night I rebooted one box a few dozen times, so if I didn't need to look at sysctl output I just hit Ctrl-Alt-Del at the login prompt. Once, the console died right then; it had just booted, Ctrl-Alt-Del was met with a beep, and then it hung and I had to punch reset. I'm guessing the console dies as a result of total wedging of the I/O systems following heavy disk I/O. The cause is not just ZFS, because the C3 boxes are UFS. The problem is not just the excess swap on the smallest box, because I see the same sort of trouble on the box with the most RAM. Some kernel resource seems to be exhausted regardless of how much RAM or swap is present. I'm going to try buildworld on 3 of these to see what happens. For the fourth, I still need to get sources onto the disk before I can even attempt that. I'm not sure what to expect. It might be instant miserable failure, or it might actually run a long time, since the I/O load comes in bursts with lots of recovery time between. It'll take a few hours to see if the P4s succeed. It'll take two days to see a C3 succeed. Maybe by that time, someone will have gotten through all I've written and have some useful suggestion for debugging. To me, it's rather hard to debug, since I have little hint where to start, any logging stops when the problem manifests, and the box ends up in a state where it is essentially unobservable without a JTAG to jump in and directly inspect the state of its world.
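For what it's worth, the tuning the boot-time warnings point at is the usual loader.conf recipe for ZFS on i386. The values below are only an illustration of what I intend to try on the 2GB box, not something I've verified yet:

```
# /boot/loader.conf -- ZFS on i386 (values illustrative for a 2GB box).
# The boot messages recommend a minimum kmem_size of 512MB on i386.
vm.kmem_size="512M"
vm.kmem_size_max="512M"
# Cap the ARC well below kmem so other kernel allocations have room.
vfs.zfs.arc_max="256M"
# Prefetch is already disabled by default on i386; keep it that way.
vfs.zfs.prefetch_disable="1"
```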