From: Daniel Kalchev
Date: Mon, 30 May 2011 17:43:04 +0300
To: freebsd-stable@freebsd.org
Subject: Re: HAST instability
Message-ID: <4DE3ACF8.4070809@digsys.bg>
In-Reply-To: <4DE21C64.8060107@digsys.bg>

Some further investigation:

The HAST nodes do not disconnect when checksum is enabled (either crc32 or sha256).
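
For readers following along, enabling the checksum means roughly the following in hast.conf. This is only a sketch, not the actual configuration from these machines: the resource name, host names and addresses are taken from the logs in this message, while the provider path /dev/da0 is a made-up placeholder.

```
# Sketch only. /dev/da0 is a hypothetical placeholder provider.
resource data0 {
	# one of: none, crc32, sha256
	checksum crc32

	on b1a {
		local /dev/da0
		remote 10.2.101.12
	}
	on b1b {
		local /dev/da0
		remote 10.2.101.11
	}
}
```

Switching between the three cases described in this message would then only mean changing the value on the checksum line.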

One strange thing is that there is never a TCP connection in the ESTABLISHED state between the two nodes:

    tcp4       0      0 10.2.101.11.48939      10.2.101.12.8457       FIN_WAIT_2
    tcp4       0   1288 10.2.101.11.57008      10.2.101.12.8457       CLOSE_WAIT
    tcp4       0      0 10.2.101.11.46346      10.2.101.12.8457       FIN_WAIT_2
    tcp4       0  90648 10.2.101.11.13916      10.2.101.12.8457       CLOSE_WAIT
    tcp4       0      0 10.2.101.11.8457       *.*                    LISTEN

When using sha256, one CPU core is 100% utilized by each hastd process, while 70-80 MB/sec per HAST resource is being transferred (up to 140 MB/sec of traffic in total for both resources).

When using crc32, each CPU core is at 22% utilization.

When using none as the checksum, CPU usage is under 10%.

Eventually, after many hours, the communication got corrupted:

    May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.
    May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive request data: No such file or directory.
    May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process exited ungracefully (pid=9827, exitcode=75).

and

    May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive reply header: Operation timed out.
    May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from 10.2.101.12.
    May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send request (Broken pipe): WRITE(99128470016, 131072).

Daniel