From: Mikolaj Golub
To: Pawel Jakub Dawidek
Organization: TOA Ukraine
Date: Thu, 29 Apr 2010 14:22:59 +0300
Cc: freebsd-fs
Subject: Re: HAST: primary might get stuck when there are connectivity problems with secondary

On Thu, 29 Apr 2010 10:12:00 +0200 Pawel Jakub Dawidek wrote:

 PJD> On Thu, Apr 29, 2010 at 11:03:33AM +0300, Mikolaj Golub wrote:
 >>
 >> On Wed, 28 Apr 2010 23:46:36 +0200 Pawel Jakub Dawidek wrote:
 >>
 >> PJD> Could you see if the following patch fixes the problem for you:
 >>
 >> PJD> http://people.freebsd.org/~pjd/patches/hastd_timeout.patch
 >>
 >> PJD> The patch sets timeout on both incoming and outgoing sockets on primary
 >> PJD> and on outgoing socket on secondary. Incoming socket on secondary is
 >> PJD> left with no timeout to avoid problem you described above.
 >>
 >> The patch works for me.
 >>
 >> After disabling the network connection between the primary and the secondary
 >> FS operations on the primary do not get stuck and the following messages are
 >> observed:
 >>
 >> Apr 29 10:37:41 hasta hastd: [storage] (primary) Unable to receive reply header: Resource temporarily unavailable.
 >> Apr 29 10:37:57 hasta hastd: [tank] (primary) Unable to receive reply header: Resource temporarily unavailable.
 >> Apr 29 10:37:57 hasta hastd: [tank] (primary) Unable to send request (Resource temporarily unavailable): WRITE(972292096, 14336).
 >> Apr 29 10:38:56 hasta hastd: [storage] (primary) Unable to connect to 172.20.66.202: Operation timed out.
 >> Apr 29 10:39:12 hasta hastd: [tank] (primary) Unable to connect to 172.20.66.202: Operation timed out.
 >>
 >> After restoring the network connection the primary reconnects to the secondary
 >> and the status changes back from "degraded" to "complete".

 PJD> Good. And I assume you don't observe problems on secondary? Eg. recv(2)
 PJD> on secondary doesn't timeout?

No problems on the secondary. When emulating a network outage, the worker is
restarted after connectivity is restored, when a new connection comes in from
the primary:

Apr 29 14:12:39 hastb hastd: Accepting connection to tcp4://0.0.0.0:8457.
Apr 29 14:12:39 hastb hastd: Connection from tcp4://172.20.66.202:8457 to tcp4://172.20.66.201:44508.
Apr 29 14:12:39 hastb hastd: tcp4://172.20.66.201:44508: resource=tank
Apr 29 14:12:39 hastb hastd: [tank] (secondary) Initial connection from tcp4://172.20.66.201:44508.
Apr 29 14:12:39 hastb hastd: [tank] (secondary) Worker process exists (pid=1729), stopping it.
Apr 29 14:12:39 hastb hastd: [tank] (secondary) Worker process (pid=1729) exited gracefully.
Apr 29 14:12:39 hastb hastd: [tank] (secondary) Incoming connection from tcp4://172.20.66.201:44508 configured.

If the FS is idle (there is no I/O), the secondary keeps waiting in receive: it
does not time out and does not stop workers (as it did with my timeout patch).

-- 
Mikolaj Golub
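
For readers wondering where the "Resource temporarily unavailable" messages
quoted above come from, here is a minimal sketch of a receive timeout on a
socket. It assumes the patch relies on socket-level timeouts of the
SO_RCVTIMEO kind; the patch itself is not reproduced here, and the helper
name recv_timeout_errno() is hypothetical, not part of hastd. A local socket
pair stands in for the primary/secondary link.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from hastd): set a receive timeout on one end of
 * a connected socket pair, wait for data that never arrives, and return the
 * errno reported by recv(2). Returns 0 if recv(2) did not fail and -1 if the
 * setup itself failed.
 */
int
recv_timeout_errno(int timeout_ms)
{
	struct timeval tv;
	char buf[16];
	int sv[2], error;
	ssize_t n;

	assert(timeout_ms > 0);

	/* A connected pair stands in for the primary<->secondary link. */
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);

	/* Ask the kernel to give up on a blocked receive after timeout_ms. */
	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;
	if (setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == -1) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}

	/* The peer stays silent, as it would during a network outage. */
	n = recv(sv[0], buf, sizeof(buf), 0);
	error = (n == -1) ? errno : 0;

	close(sv[0]);
	close(sv[1]);
	return (error);
}
```

Calling recv_timeout_errno(200) blocks for roughly 200 ms and then returns
EAGAIN, whose strerror(3) text on FreeBSD is "Resource temporarily
unavailable". A socket without such a timeout would simply keep blocking in
recv(2), which matches the idle-secondary behaviour described above.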