From: Mikolaj Golub
To: Pawel Jakub Dawidek
Cc: freebsd-fs
Subject: Re: HAST: primary might get stuck when there are connectivity problems with secondary
Date: Sat, 24 Apr 2010 14:33:53 +0300
Message-ID: <868w8dgk4e.fsf@kopusha.onet>
In-Reply-To: <20100424073031.GD3067@garage.freebsd.pl>
References: <86r5m9dvqf.fsf@zhuzha.ua1> <20100423062950.GD1670@garage.freebsd.pl>
 <86k4rye33e.fsf@zhuzha.ua1> <20100424073031.GD3067@garage.freebsd.pl>

On Sat, 24 Apr 2010 09:30:31 +0200 Pawel Jakub Dawidek wrote:

> If secondary is not going to reply, hast_proto_recv_hdr() should
> eventually timeout. On timeout, connection should be closed and this
> requests (and all the others) should be moved to done queue.
>
> It doesn't timeout at all or maybe the timeout is too long?

After the "outage" we have:

on the primary:

tcp4 0 0 172.20.66.201.57596 172.20.66.202.8457 ESTABLISHED
tcp4 0 0 172.20.66.201.41841 172.20.66.202.8457 CLOSED

on the secondary:

tcp4 0 0 172.20.66.202.8457 172.20.66.201.57596 ESTABLISHED
tcp4 0 0 172.20.66.202.8457 172.20.66.201.41841 ESTABLISHED

So one of the connections (used by primary/remote_send_thread()) is
broken, although the secondary is not aware of this: it is sitting in
recv() at that time. The second connection (used by
primary/remote_recv_thread()) is alive.

It does time out, but only after net.inet.tcp.keepidle (which is 2 hours
by default), when the secondary starts to send keepalive packets. The
secondary receives an RST in response to its keepalive packet, recv()
returns with an error, and the worker is restarted.

As I wrote in my first letter, the workaround is to set
net.inet.tcp.keepidle to some small value on the secondary, so that it
notices a broken connection much earlier.

From the code I don't see how hast_proto_recv_hdr() may time out if the
connection is alive. Have I missed something?

-- 
Mikolaj Golub
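
P.S. To illustrate the question above: a minimal sketch (in Python, not
HAST's C code) of what a receive timeout on an alive but silent
connection looks like. It assumes a plain stream socket pair standing in
for the primary/secondary link; recv_with_timeout() and demo() are
hypothetical names made up for this example and are not part of HAST.
Without such a per-socket timeout, recv() on an established connection
simply blocks until data or a TCP-level error (such as the RST after a
keepalive probe) arrives.

```python
import socket


def recv_with_timeout(sock, nbytes, timeout_s):
    """Return received bytes, or None if nothing arrives within timeout_s.

    Hypothetical helper: the timeout is enforced per call on the socket,
    independently of the system-wide TCP keepalive settings.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The connection is still established, the peer just sent nothing.
        return None


def demo():
    # A connected pair of stream sockets stands in for the two hosts.
    a, b = socket.socketpair()
    try:
        # The peer is alive but silent: recv() gives up after the timeout.
        silent = recv_with_timeout(a, 16, 0.2)
        # Once the peer sends something, the same call returns the data.
        b.sendall(b"hdr")
        got = recv_with_timeout(a, 16, 0.2)
        return silent, got
    finally:
        a.close()
        b.close()
```

In this sketch the caller decides what a timeout means (close the
connection, requeue requests, and so on); the helper only reports that
nothing was received in time.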