Date: Sat, 9 Oct 2010 15:37:04 +0200
From: Kai Gallasch <gallasch@free.de>
To: freebsd-fs@freebsd.org
Subject: Re: Locked up processes after upgrade to ZFS v15
Message-ID: <CF901B53-657E-49FC-A43B-27BC7D49F7A7@free.de>
In-Reply-To: <20101009111241.GA58948@icarus.home.lan>
References: <39F05641-4E46-4BE0-81CA-4DEB175A5FBE@free.de> <20101009111241.GA58948@icarus.home.lan>

On 09.10.2010, at 13:12, Jeremy Chadwick wrote:

> On Wed, Oct 06, 2010 at 02:28:31PM +0200, Kai Gallasch wrote:
>> Two days ago I upgraded my server to 8.1-STABLE (amd64) and upgraded ZFS from v14 to v15.
>> After zpool & zfs upgrade the server was running stable for about half a day, but then apache processes running inside jails would lock up and could not be terminated any more.

> On RELENG_7, the system used ZFS v14, had the same tunings, and had an
> uptime of 221 days w/out issue.

8.0 and 8.1-STABLE + ZFS v14 also ran very solid on my servers - dang!

> With RELENG_8, the system lasted approximately 12 hours (about half a
> day) before getting into a state that looks almost identical to Kai's
> system: existing processes were stuck (unkillable, even with -9). New
> processes could be spawned (including ones which used the ZFS
> filesystems), and commands executed successfully.

Same here. I can provoke this locked-process problem by starting one of my webserver jails: the first httpd process will lock up within 30 minutes at most.

The problem is that after a number of httpd forks apache cannot fork any more child processes, and the stuck (unkillable) httpd processes all keep a socket open, bound to the webserver's IP address. So a restart of apache is not possible, because $IP:80 is already occupied.

The jail also cannot be stopped or started in this state. The only choices are to restart the whole jail-host server (some processes would not die - "ps -axl advised" - plus unclean unmounts of UFS partitions), or to delete the IP address from the network interface and migrate the jail to another server (zfs send/receive). No fun at all. BTW: zfs destroy also does not work here.

> init complained about wedged processes when the system was rebooted:

I use 'procstat -k -k -a | grep faul' to look for this condition. It finds every process in the table whose kernel stack contains 'trap_pfault'.

> Oct 9 02:00:56 init: some processes would not die; ps axl advised
>
> No indication of any hardware issues on the console.

Here too.

> The administrator who was handling the issue did not use "ps -l", "top",
> nor "procstat -k", so we don't have any indication of what the process
> state was in, nor what the kernel calling stack looked like that lead up
> to the wedging. All he stated was that the processes were in D/I
> states, which doesn't help since that's what they're in normally anyway.
> If I was around I would have forced DDB and done "call doadump" to
> investigate things post-mortem.

Another sign is an increased process count in 'top'.

> Monitoring graphs of the system during this time don't indicate any
> signs of memory thrashing (though bsnmp-ucd doesn't provide as much
> granularity as top does); the system looks normal except for a slightly
> decreased load average (probably as a result of the deadlocked
> processes).

My server currently has 28 GB RAM with < 60% usage and no special ZFS tuning in loader.conf - although I did try setting vm.pmap.pg_ps_enabled="0" to find out whether the locked processes had anything to do with it. Setting it did not prevent the problem from reoccurring.

> Aside from the top/procstat/kernel dump aspect, what other information
> would kernel folks be interested in? Is "call doadump" sufficient for
> post-mortem investigation? I need to know since if/when this happens
> again (likely), I want to get folks as much information as possible.

I'm also willing to help, but need explicit instructions.
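In case it helps, something along these lines is what I would try to capture next time before rebooting (only a sketch on my part, assuming DDB is compiled into the kernel and a dump device is configured, e.g. dumpdev="AUTO" in /etc/rc.conf):

    ps -axl > /var/tmp/lockup-ps.txt                  # process states of everything
    procstat -k -k -a > /var/tmp/lockup-kstacks.txt   # kernel stacks of all threads
    sysctl debug.kdb.enter=1                          # drop into DDB from a shell

    # then at the db> prompt: "call doadump", followed by "reset";
    # after the reboot, savecore picks up the crash dump for kgdb.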
I could provoke such a lockup on one of my servers, but I cannot leave the server in that state for very long, so there is only a small time frame in which to collect the wanted debug data.

> Also, a question for Kai: what did you end up doing to resolve this
> problem? Did you roll back to an older FreeBSD, or...?

This bug struck me really hard, because the affected server is not part of a cluster and hosts about 50 jails (mail, web, databases). The problem is that sockets held open by locked processes cannot be closed, so a restart of a jammed service is not possible.

Theoretically I had the option to boot into the old world/kernel, but I'm sure that mounting ZFS v15 filesystems with the old zfs.ko would not be possible, and AFAIK there is no zfs downgrade command or utility. Of course, a bare-metal recovery of the whole server from tape was also a last option. But really??

My 'solution' (a rough command sequence is sketched in the P.S. below):

- move the most unstable jails to other servers and restore them onto UFS partitions
- move everything else in the zpool temporarily to other servers running ZFS (zfs send/receive)
- zfs destroy -r
- zpool destroy
- gpart create
- gpart add -t freebsd-ufs ...
- restore all jails from ZFS to UFS

So the server is now reverted to UFS - just for my peace of mind, although I waste around 50% of the RAID capacity on reserved filesystem allocation and take all the other disadvantages compared to a volume manager. I will still use ZFS on several machines, but for some time not for critical data. ZFS is a nifty thing, but I really depend on a stable FS. (Of course, for other people ZFS v15 may be running smoothly.)

I must repeat: I offer my help if someone wants to dig into the locking problem.

Regards,
Kai.
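P.S. For completeness, the list above boils down to roughly the following command sequence. This is only a sketch: 'tank', 'tank/jails', 'otherhost' and 'da0' are placeholder names, and the gpart layout depends on the actual disks.

    # send the jail datasets to another server that still runs ZFS
    zfs snapshot -r tank/jails@migrate
    zfs send -R tank/jails@migrate | ssh otherhost zfs receive -d tank

    # once everything is off the pool, tear it down
    zfs destroy -r tank/jails
    zpool destroy tank

    # repartition the freed disk for UFS and restore the jails there
    gpart create -s GPT da0
    gpart add -t freebsd-ufs -l jails da0
    newfs -U /dev/gpt/jails
    mount /dev/gpt/jails /jails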