Date: Mon, 30 May 2011 21:42:22 +0300
From: Mikolaj Golub <trociny@freebsd.org>
To: Daniel Kalchev <daniel@digsys.bg>
Cc: freebsd-stable@freebsd.org
Subject: Re: HAST instability
Message-ID: <86d3j02fox.fsf@kopusha.home.net>
In-Reply-To: <4DE3ACF8.4070809@digsys.bg> (Daniel Kalchev's message of "Mon, 30 May 2011 17:43:04 +0300")
References: <4DE21C64.8060107@digsys.bg> <4DE3ACF8.4070809@digsys.bg>

On Mon, 30 May 2011 17:43:04 +0300 Daniel Kalchev wrote:

 DK> Some further investigation:

 DK> The HAST nodes do not disconnect when checksum is enabled (either
 DK> crc32 or sha256).

 DK> One strange thing is that there is never established TCP connection
 DK> between both nodes:

 DK> tcp4       0      0 10.2.101.11.48939      10.2.101.12.8457       FIN_WAIT_2
 DK> tcp4       0   1288 10.2.101.11.57008      10.2.101.12.8457       CLOSE_WAIT
 DK> tcp4       0      0 10.2.101.11.46346      10.2.101.12.8457       FIN_WAIT_2
 DK> tcp4       0  90648 10.2.101.11.13916      10.2.101.12.8457       CLOSE_WAIT
 DK> tcp4       0      0 10.2.101.11.8457       *.*                    LISTEN

This is normal. hastd uses each connection in one direction only, so it
calls shutdown to close the unused direction.

 DK> When using sha256 one CPU core is 100% utilized by each hastd process,
 DK> while 70-80MB/sec per HAST resource is being transferred (total of up
 DK> to 140 MB/sec traffic for both);

 DK> When using crc32 each CPU core is at 22% utilization;

 DK> When using none as checksum, CPU usage is under 10%

I suppose that when checksum is enabled the bottleneck is the CPU, the
traffic rate is lower, and the problem is not triggered.

 DK> Eventually after many hours, got corrupted communication:

 DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.

The "Hash mismatch" message suggests that you actually were using
checksum then, weren't you?

 DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive
 DK> request data: No such file or directory.
 DK> May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process
 DK> exited ungracefully (pid=9827, exitcode=75).

 DK> and

 DK> May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive
 DK> reply header: Operation timed out.
 DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from
 DK> 10.2.101.12.
 DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send
 DK> request (Broken pipe): WRITE(99128470016, 131072).

This looks a little different from what was in your first message. Are
the clocks in sync on both nodes?

I would like to look at full logs covering a fairly long period, with
several occurrences of the problem, from both the primary and the
secondary (and to be sure the time is synchronized between them).

Also, it might be worth checking that there is no network packet
corruption (look for anything strange in netstat -di and netstat -s, and
maybe copy large files over the network and compare checksums).

-- 
Mikolaj Golub
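
For the last suggestion (copying large files over the network and comparing checksums), something along these lines could be used. It is only a minimal sketch, not part of HAST or of the original advice: the function name file_sha256 and the chunk size are assumptions made for illustration, and sha256(1) from the base system would do the same job.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks of `chunk_size` bytes (an arbitrary
    choice here) so that large test files need not fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on the source file on one node and on the copy on the other node. Differing digests would point at corruption somewhere on the path between the nodes rather than at hastd itself.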