Date:      Sun, 11 Mar 2012 19:54:57 +0100
From:      Phil Regnauld <regnauld@x0.dk>
To:        freebsd-stable@freebsd.org
Subject:   Issue with hast replication
Message-ID:  <20120311185457.GB1684@macbook.bluepipe.net>

Hi,

I've got a fairly simple setup: two hosts running 9.0-R (will upgrade to stable
if told to, but want to check here first) with ZFS and HAST. HAST runs on top
of a zvol created on each host, as illustrated:

      FS                          FS
   +------+                    +------+ 
   | hvol | <---- hastd -----> | hvol | 
   +------+                    +------+ 
   | zvol |                    | zvol | 
   +------+                    +------+ 
   | zfs  |                    | zfs  | 
   +------+                    +------+ 
      h1                          h2

Connection is gigabit to the same switch. No issues with large TCP
transfers such as SCP/FTP.

Config is vanilla:

# zfs create -V 10G zfs/hvol

hast.conf:

resource hvol {
        on h1 {
                local /dev/zvol/zfs/hvol
                remote tcp4://192.168.1.100
        }
        on h2 {
                local /dev/zvol/zfs/hvol
                remote tcp4://192.168.1.200
        }
}


h1 is behaving fine as primary, either with h2 turned off or in init -
but as soon as I set the role to secondary for h2, the receiver
repeatedly crashes and restarts - see the traces below.

I've seen these two threads:

http://lists.freebsd.org/pipermail/freebsd-current/2011-May/024871.html
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2012-01/msg00510.html

... but in the first case the fix is in 9 since last year, and the second
is referring to async replication - I'm using the default (fullsync).

hastctl status on the primary shows the dirty size diminishing slowly,
but obviously this isn't optimal: it freezes I/O to the primary hvol,
which causes all kinds of issues for the consumers of the hvol.

Any ideas? Am I doing something wrong?


Primary:

Mar 11 02:02:30 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
Mar 11 02:02:30 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
Mar 11 02:02:41 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31642091520, 131072).
Mar 11 02:02:41 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
Mar 11 02:02:41 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
Mar 11 02:02:48 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31649693696, 131072).
Mar 11 02:02:48 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
Mar 11 02:02:48 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
Mar 11 02:02:59 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31691243520, 131072).
Mar 11 02:02:59 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
Mar 11 02:02:59 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
Mar 11 02:03:13 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31783256064, 131072).
Mar 11 02:03:13 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
Mar 11 02:03:13 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
Mar 11 02:03:18 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31782731776, 131072).
Mar 11 02:03:18 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
Mar 11 02:03:18 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
Mar 11 02:03:28 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31803441152, 131072).
Mar 11 02:03:28 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
Mar 11 02:03:28 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
Mar 11 02:03:42 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31881953280, 131072).
Mar 11 02:03:42 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
Mar 11 02:03:42 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.


Secondary:

Mar 11 01:01:30 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2874, exitcode=75).
Mar 11 01:01:38 h2 hastd[2875]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:01:44 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2875, exitcode=75).
Mar 11 01:01:45 h2 hastd[2876]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:01:50 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2876, exitcode=75).
Mar 11 01:01:56 h2 hastd[2877]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:02:01 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2877, exitcode=75).
Mar 11 01:02:05 h2 hastd[2878]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:02:11 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2878, exitcode=75).
Mar 11 01:02:15 h2 hastd[2879]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:02:20 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2879, exitcode=75).
Mar 11 01:02:30 h2 hastd[2880]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:02:34 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2880, exitcode=75).
Mar 11 01:02:41 h2 hastd[2881]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:02:47 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2881, exitcode=75).
Mar 11 01:02:48 h2 hastd[2882]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:02:54 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2882, exitcode=75).
Mar 11 01:02:59 h2 hastd[2883]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:03:04 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2883, exitcode=75).
Mar 11 01:03:13 h2 hastd[2884]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:03:17 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2884, exitcode=75).
Mar 11 01:03:18 h2 hastd[2885]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:03:23 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2885, exitcode=75).
Mar 11 01:03:28 h2 hastd[2886]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:03:33 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2886, exitcode=75).
Mar 11 01:03:42 h2 hastd[2887]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
Mar 11 01:03:48 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2887, exitcode=75).
Mar 11 01:03:48 h2 hastd[2888]: [hvol] (secondary) Unable to receive request header: Socket is not connected.
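
To put a number on the restart cadence in the secondary's trace, here is a
rough Python sketch. It assumes syslog lines in exactly the format shown
above; the function name worker_exit_intervals, the year default, and the
four-line sample (copied from the trace) are illustrative only and not part
of HAST. Exit code 75 is EX_TEMPFAIL in sysexits.h.

```python
import re
from datetime import datetime

# Matches the "Worker process exited ungracefully" lines from the trace.
# The timestamp is captured as group 1; syslog omits the year.
EXIT_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ hastd\[\d+\]: \[\S+\] \(\w+\) "
    r"Worker process exited ungracefully \(pid=\d+, exitcode=\d+\)"
)


def worker_exit_intervals(lines, year=2012):
    """Return the seconds between consecutive ungraceful worker exits.

    The year is an assumption supplied by the caller, since syslog
    timestamps do not carry one.
    """
    stamps = []
    for line in lines:
        m = EXIT_RE.match(line)
        if m:
            stamps.append(
                datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# First few lines of the secondary's log, copied from the trace above.
sample = [
    "Mar 11 01:01:30 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2874, exitcode=75).",
    "Mar 11 01:01:38 h2 hastd[2875]: [hvol] (secondary) Unable to receive request header: Socket is not connected.",
    "Mar 11 01:01:44 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2875, exitcode=75).",
    "Mar 11 01:01:50 h2 hastd[2506]: [hvol] (secondary) Worker process exited ungracefully (pid=2876, exitcode=75).",
]

print(worker_exit_intervals(sample))  # [14, 6]
```

Fed the full secondary log, this gives the gap in seconds between each
worker restart, which can then be compared with the timestamps of the
"Cannot allocate memory" errors on the primary.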