Date: Tue, 31 May 2011 17:08:59 +0300
From: Mikolaj Golub <trociny@freebsd.org>
To: Daniel Kalchev <daniel@digsys.bg>
Cc: freebsd-stable@freebsd.org
Subject: Re: HAST instability
Message-ID: <86zkm3t11g.fsf@in138.ua3>
In-Reply-To: <4DE4E43B.7030302@digsys.bg> (Daniel Kalchev's message of "Tue, 31 May 2011 15:51:07 +0300")
References: <4DE21C64.8060107@digsys.bg> <4DE3ACF8.4070809@digsys.bg> <86d3j02fox.fsf@kopusha.home.net> <4DE4E43B.7030302@digsys.bg>
On Tue, 31 May 2011 15:51:07 +0300 Daniel Kalchev wrote:

DK> On 30.05.11 21:42, Mikolaj Golub wrote:
>> DK> One strange thing is that there is never established TCP connection
>> DK> between both nodes:
>>
>> DK> tcp4       0      0 10.2.101.11.48939      10.2.101.12.8457       FIN_WAIT_2
>> DK> tcp4       0   1288 10.2.101.11.57008      10.2.101.12.8457       CLOSE_WAIT
>> DK> tcp4       0      0 10.2.101.11.46346      10.2.101.12.8457       FIN_WAIT_2
>> DK> tcp4       0  90648 10.2.101.11.13916      10.2.101.12.8457       CLOSE_WAIT
>> DK> tcp4       0      0 10.2.101.11.8457       *.*                    LISTEN
>>
>> It is normal. hastd uses the connections only in one direction so it calls
>> shutdown to close unused directions.

DK> So the TCP connections are all too short-lived that I can never see a
DK> single one in ESTABLISHED state? 10Gbit Ethernet is indeed fast, so
DK> this might well be possible...

No, the connections are persistent; only the one unused direction of
communication is closed. See shutdown(2) for further info.

>> I would like to look at full logs for some rather large period, with several
>> cases, from both primary and secondary (and be sure about synchronized time).

DK> I have made sure clocks are synchronized and am currently running on a
DK> freshly rebooted nodes (with two additional SATA drives at each node) --
DK> so far some interesting findings, like I get hash errors and
DK> disconnects much more frequent now. Will post when an bonnie++ run on
DK> the ZFS filesystem on top of the HAST resources finishes.

As I wrote privately, it would be nice to see both netstat output and HAST
logs (from both nodes) covering the same, rather long, period in which several
cases occurred. It would be good to put them somewhere on the web so others
can access them too, as I will be offline for 7-10 days and will not be able
to help you until I am back.

DK> One additional note: while playing with this setup, I tried to
DK> simulate local disk going away in the hope HAST will switch to using
DK> the remote disk.
DK> Instead of asking someone at the site to pull out the
DK> drive, I just issued on the primary

DK> hastctl role init data0

DK> which resulted in kernel panic. Unfortunately, there was no sufficient
DK> dump space for 48GB. I will re-run this again with more drives for the
DK> crash dump. Anything you want me to look for in particular? (kernels
DK> have no KDB compiled in yet)

Well, removing a physical disk (the device /dev/gpt/data0 consumed by hastd
disappears) and switching a resource to the init role (the device
/dev/hast/data0 consumed by the FS disappears) are two different things. Of
course, you should not normally change the resource role (which destroys the
HAST device) before unmounting (exporting) the FS.

--
Mikolaj Golub
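The half-close described in this thread (a persistent connection on which only the unused direction is shut down) can be illustrated with a minimal C sketch. This is an illustration only, not hastd's actual code: it assumes a local socketpair(2) in place of a TCP connection between the two nodes, and the helper name half_close_demo is hypothetical. One end calls shutdown(2) with SHUT_WR on the direction it never sends on; the peer then reads EOF, yet data still flows the other way over the same connection.

```c
#include <sys/socket.h>
#include <unistd.h>
#include <string.h>

/*
 * Hypothetical helper illustrating a half-closed stream connection.
 * Uses a local socketpair instead of a real TCP link between two nodes.
 * Returns 1 if, after fds[0] shuts down its sending direction, fds[1]
 * reads EOF while data still flows from fds[1] to fds[0]; 0 otherwise.
 */
int half_close_demo(void)
{
    int fds[2];
    char buf[16];
    ssize_t n;
    int ok = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == -1)
        return 0;

    /* fds[0] only ever receives, so it closes its unused sending direction. */
    if (shutdown(fds[0], SHUT_WR) == -1)
        goto out;

    /* The peer now reads EOF: nothing more will arrive in this direction. */
    n = read(fds[1], buf, sizeof(buf));
    if (n != 0)
        goto out;

    /* The other direction is still open: fds[1] sends, fds[0] receives. */
    if (write(fds[1], "data", 4) != 4)
        goto out;
    n = read(fds[0], buf, sizeof(buf));
    if (n == 4 && memcmp(buf, "data", 4) == 0)
        ok = 1;

out:
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

On a real TCP connection, the side that called shutdown(SHUT_WR) typically sits in FIN_WAIT_2 and its peer in CLOSE_WAIT until the sockets are finally closed, which is consistent with the FIN_WAIT_2/CLOSE_WAIT entries in the netstat output quoted above rather than with short-lived connections.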