From owner-freebsd-stable@FreeBSD.ORG  Mon Mar 12 23:22:32 2012
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id B2CDB106566B
	for <freebsd-stable@freebsd.org>; Mon, 12 Mar 2012 23:22:32 +0000 (UTC)
	(envelope-from regnauld@x0.dk)
Received: from moof.catpipe.net (moof.catpipe.net [194.28.252.64])
	by mx1.freebsd.org (Postfix) with ESMTP id 5B7AE8FC08
	for <freebsd-stable@freebsd.org>; Mon, 12 Mar 2012 23:22:31 +0000 (UTC)
Received: from localhost (moof.catpipe.net [194.28.252.64])
	by localhost.catpipe.net (Postfix) with ESMTP id B74874CEDDA;
	Tue, 13 Mar 2012 00:22:24 +0100 (CET)
Received: from moof.catpipe.net ([194.28.252.64])
	by localhost (moof.catpipe.net [194.28.252.64]) (amavisd-new,
	port 10024)
	with ESMTP id gaQ5ZNJsyO8E; Tue, 13 Mar 2012 00:22:24 +0100 (CET)
Received: from macbook.bluepipe.net (x0.dk [194.19.205.214])
	(Authenticated sender: relayuser)
	by moof.catpipe.net (Postfix) with ESMTPA id 1F3E64CEDB7;
	Tue, 13 Mar 2012 00:22:24 +0100 (CET)
Received: by macbook.bluepipe.net (Postfix, from userid 1001)
	id B94CF830F9E; Tue, 13 Mar 2012 00:22:23 +0100 (CET)
Date: Tue, 13 Mar 2012 00:22:23 +0100
From: Phil Regnauld <regnauld@x0.dk>
To: Mikolaj Golub <to.my.trociny@gmail.com>
Message-ID: <20120312232223.GG12975@macbook.bluepipe.net>
References: <20120311185457.GB1684@macbook.bluepipe.net>
	<861uoyvpzh.fsf@kopusha.home.net>
	<20120311220911.GD1684@macbook.bluepipe.net>
	<20120312143127.GM12975@macbook.bluepipe.net>
	<86k42pu0tb.fsf@kopusha.home.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <86k42pu0tb.fsf@kopusha.home.net>
X-Operating-System: Darwin 11.3.0 x86_64
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-stable@freebsd.org
Subject: Re: Issue with hast replication
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Mar 2012 23:22:32 -0000

Mikolaj Golub (to.my.trociny) writes:
> 
> It looks like in the case of hastd this was send(2) who returned ENOMEM, but
> it would be good to check. Could you please start synchronization again,
> ktrace primary worker process when ENOMEM errors are observed and show output
> here?

    OK, this took a little while, as running ktrace on hastd slows it down
    significantly, and the error normally occurs at 30-90 second intervals.
    Here is the trace around one failure:

       0x0f90 b2f3 3ad5 e657 7f0f 3e50 698f 5deb 12af  |..:..W..>Pi.]...|
       0x0fa0 740d c343 6e80 75f3 e1a7 bfdf a4c1 f6a6  |t..Cn.u.........|
       0x0fb0 ea85 655d e423 bd5e 42f7 7e9a 05d2 363a  |..e].#.^B.~...6:|
       0x0fc0 025e a7b5 0956 417c f31c a6eb 2cd9 d073  |.^...VA|....,..s|
       0x0fd0 2589 e8c0 d76a 889f 8345 eeaf f2a0 c2d6  |%....j...E......|
       0x0fe0 b89e aaef fee2 6593 e515 7271 88aa cf66  |......e...rq...f|
       0x0ff0 d272 411a 7289 d6c9 6643 bdbe 3c8c 8ae8  |.rA.r...fC..<...|
 50959 hastd    RET   sendto 32768/0x8000
 50959 hastd    CALL  sendto(0x6,0x8024bf000,0x8000,0x20000<MSG_NOSIGNAL>,0,0)
 50959 hastd    RET   sendto -1 errno 12 Cannot allocate memory
 50959 hastd    CALL  clock_gettime(0xd,0x7fffff3f86f0)
 50959 hastd    RET   clock_gettime 0
 50959 hastd    CALL  getpid
 50959 hastd    RET   getpid 50959/0xc70f
 50959 hastd    CALL  sendto(0x3,0x7fffff3f8780,0x84,0,0,0)
 50959 hastd    GIO   fd 3 wrote 132 bytes
       "<27>Mar 12 23:42:43 hastd[50959]: [hvol] (primary) Unable to sen\
        d request (Cannot allocate memory): WRITE(8626634752, 131072)."  
 50959 hastd    RET   sendto 132/0x84
 50959 hastd    CALL  close(0x7)
 50959 hastd    RET   close 0
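
    For picking such failures out of a long trace, here is a minimal sketch
    in Python (my choice of language, nothing to do with hastd itself). It
    scans kdump text output for syscalls that return -1 and pairs each with
    the most recent CALL line for the same pid and syscall. The function
    name is hypothetical, and the regexes assume only the line layout
    visible in the excerpt above.

```python
import re

# Matches kdump lines such as:
#   50959 hastd    RET   sendto -1 errno 12 Cannot allocate memory
RET_ERR = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+RET\s+(\S+)\s+-1\s+errno\s+(\d+)\s+(.*)$"
)
# Matches kdump lines such as:
#   50959 hastd    CALL  sendto(0x6,0x8024bf000,0x8000,...)
CALL = re.compile(r"^\s*(\d+)\s+(\S+)\s+CALL\s+(\w+)(\(.*\))?\s*$")


def failed_syscalls(lines, errno=None):
    """Yield (pid, syscall, errno, message, call_line) for failed syscalls.

    If errno is given (e.g. 12 for ENOMEM), only that error is reported.
    Pairing is by pid and syscall name, so it is only approximate when
    several threads of one process issue the same syscall concurrently.
    """
    last_call = {}  # (pid, syscall) -> most recent CALL line seen
    for line in lines:
        m = CALL.match(line)
        if m:
            last_call[(m.group(1), m.group(3))] = line.strip()
            continue
        m = RET_ERR.match(line)
        if m and (errno is None or int(m.group(4)) == errno):
            pid, _, name, num, msg = m.groups()
            yield pid, name, int(num), msg.strip(), last_call.get((pid, name))
```

    Feeding it the lines of a saved kdump text file with errno=12 would
    list only the ENOMEM failures, each with the arguments of the call
    that failed.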

> If it is send(2) who fails then monitoring netstat and network driver
> statistics might be helpful. Something like
> 
> netstat -nax
> netstat -naT
> netstat -m
> netstat -nid

    I could run this in a loop, but that would be a lot of data, and might
    not be appropriate to paste here.
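
    A minimal sketch of such a loop, assuming Python is available on the
    primary: it runs the commands quoted above at a fixed interval and
    appends their output, with a timestamp, to a log file, so that only the
    samples around an ENOMEM error need to be looked at. The function
    names, the 5-second interval and the log path are my assumptions.

```python
import subprocess
import time

# Commands taken from the suggestion quoted above.
COMMANDS = [
    ["netstat", "-nax"],
    ["netstat", "-naT"],
    ["netstat", "-m"],
    ["netstat", "-nid"],
]


def sample(commands, out, now=time.time):
    """Run each command once and write its timestamped output to `out`."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now()))
    for argv in commands:
        out.write(f"=== {stamp} {' '.join(argv)}\n")
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=30
            )
            out.write(result.stdout)
            if result.returncode != 0:
                out.write(f"(exit {result.returncode}) {result.stderr}")
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Command missing or hung: note it and keep sampling.
            out.write(f"(failed: {exc})\n")
    out.flush()


def run_loop(commands, path, interval=5.0, rounds=None):
    """Sample every `interval` seconds; rounds=None runs until interrupted."""
    done = 0
    with open(path, "a") as out:
        while rounds is None or done < rounds:
            sample(commands, out)
            done += 1
            if rounds is None or done < rounds:
                time.sleep(interval)

# Example (hypothetical log path):
#   run_loop(COMMANDS, "/var/tmp/hast-netstat.log")
```

    The timestamps in the log can then be matched against the time of the
    "Unable to send request" message in syslog.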

    I didn't see any obvious errors, but I'm not sure what I'm looking for.
    netstat -m didn't show anything close to running out of buffers or
    clusters...

> sysctl -a dev.<nic>
>
> And may be
> 
> vmstat -m
> vmstat -z

    No obvious errors there either, but again, what should I look out for?

    In the meantime, I've also experimented with a few different scenarios, and
    I'm quite puzzled.

    For instance, I configured one of the other gigabit cards on each host to
    provide a dedicated replication network. The main difference is that up
    until now replication has been running over tagged VLANs. To be on the safe
    side, I decided to use an untagged interface (the second gigabit adapter in
    each machine).
    
    Here's what I observed, and it is very odd:
    
    - doing a dd ... | ssh dd fails in the same fashion as before

    - I created a second zvol + hast resource of just 1 GB, and it replicated
      without any problems, peaking at 75 MB/sec (!). Maybe 1 GB is too small
      to trigger the problem?
    
      (side note: hastd doesn't pick up configuration changes even with SIGHUP,
       which makes it hard to provision new resources on the fly) 

    - I restarted replication on the 100 G hast resource, and it's currently
      replicating without any problems over the second ethernet, but it's
      dragging along at 9-10 MB/sec, peaking at 29 MB/sec occasionally.

      Earlier, I was observing peaks at 65-70 MB/sec in between failures...
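
    To put those rates in perspective, a rough back-of-the-envelope
    calculation in Python, assuming the whole 100 G has to be copied at a
    steady rate and taking 1 GB as 1024 MB (both assumptions on my part;
    the helper name is made up): roughly three hours at 9-10 MB/sec versus
    under half an hour at 65-70 MB/sec.

```python
def sync_hours(size_gb, rate_mb_per_s):
    """Hours to copy size_gb (1 GB taken as 1024 MB) at a steady rate."""
    return size_gb * 1024 / rate_mb_per_s / 3600


# Rates mentioned above, in MB/sec, for the 100 G resource.
for rate in (9, 10, 29, 65, 70):
    print(f"{rate:>3} MB/sec -> {sync_hours(100, rate):.1f} h")
```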

    So I don't really know what to conclude :-|