Date: Wed, 20 Jan 2016 21:37:01 +0000
From: bugzilla-noreply@freebsd.org
To: freebsd-amd64@FreeBSD.org
Subject: [Bug 206448] ZFS hang/stall when drives in ATA mode
Message-ID: <bug-206448-6@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206448

            Bug ID: 206448
           Summary: ZFS hang/stall when drives in ATA mode
           Product: Base System
           Version: 10.2-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs@FreeBSD.org
          Reporter: danmcgrath.ca@gmail.com
                CC: freebsd-amd64@FreeBSD.org

Created attachment 165888
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=165888&action=edit
Screenshot of ata console error

I had a Dell PowerEdge R210 amd64 system that was exhibiting some odd
behaviour. A year or two ago, one of the system's two 1TB SATA drives dropped
out of the RAID; surprisingly, I simply added it back and it has been fine ever
since. Then this week I installed py27-salt on the servers.

After installing salt, everything seemed fine for the first day. After the
daily mails for the machine came in, however, I noticed that the daily periodic
run had got stuck running some smartd checks for the log. I tried to kill the
process but was unable to, which prompted a reboot. After the reboot, some jails
refused to start and I suddenly found myself unable to do any writes to the
drive, with only the message "ata2: already connected!" showing up on the
console.

After some digging (thanks to auditd, salt, and the system logs), I was able to
narrow the trigger down to some camcontrol inquiry and identify commands that
would reliably trigger the problem. After some more digging, I noticed that only
this server (out of several identical or near-identical ones) was showing the
problem, and that for some strange reason /dev/gpt/swap0 (and swap1) existed
only on this system. Also odd: when I tried some tests with stopping swap
(`gmirror stop swap`), the second I tried to stop the swap mirror, it
redetected the swap mirror but under different device names (see the screenshot
of the console in the attachments).
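
For anyone attempting to reproduce the trigger described above, the following is a minimal sketch only. The report does not record the exact commands or arguments that were issued, so the device names ada0 and ada1 and the helper functions `probe_cmds` and `run_probes` are assumptions made for illustration. By default the script only prints what it would run; set RUN=1 to actually issue the probes, ideally on a test machine.

```sh
#!/bin/sh
# Sketch only: device names (ada0, ada1) and both helper functions are
# assumptions; the report just says "camcontrol inquiry and identify".

# Print the camcontrol probe commands for each device given as an argument.
probe_cmds() {
    for dev in "$@"; do
        echo "camcontrol inquiry $dev"
        echo "camcontrol identify $dev"
    done
}

# Dry run by default; execute the probes only when RUN=1 is set.
run_probes() {
    probe_cmds "$@" | while read -r cmd; do
        if [ "${RUN:-0}" = "1" ]; then
            $cmd
        else
            echo "would run: $cmd"
        fi
    done
}

run_probes ada0 ada1
```

Watching the console and dmesg while the probes run should show whether the "ata2: already connected!" message appears on an affected machine.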

I also noticed that dmesg on this system alone was showing some odd "unmapped"
messages:

GEOM_MIRROR: cancelling unmapped because of ada0p2
GEOM_MIRROR: cancelling unmapped because of ada1p2
GEOM_MIRROR: Device mirror/swap launched (2/2).

As for the ZFS symptoms: when the console showed the "already attached!" error,
ZFS (this was a ZFS install with the mirrored swap option enabled) would no
longer allow writes (or would allow them only very slowly, in the area of
1 IOPS), and reads would eventually fail (when doing a test with `find /`),
which I assume happens once they run out of cache entries.

In the end I stumbled on the BIOS setting that had the drives set to ATA mode
instead of AHCI or RAID, and correcting this setting seems to have solved the
problem. While I can't know for sure whether this is a "bug" or just a known
limitation of ATA, it almost seems as if camcontrol was somehow briefly
disconnecting the drives when they were issued commands, which in turn caused
the swap device to switch from ada0p2 to gpt/swap0 and vice versa, possibly
triggering some sort of bug in ZFS.

Anyway, this is the report; hopefully it helps fix a possible bug lurking in
the system that could cause problems for other users.

Cheers o/

-- 
You are receiving this mail because:
You are on the CC list for the bug.