Date:      Tue, 31 May 2011 18:08:59 +0300
From:      Daniel Kalchev <daniel@digsys.bg>
To:        Mikolaj Golub <trociny@freebsd.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: HAST instability
Message-ID:  <4DE5048B.3080206@digsys.bg>
In-Reply-To: <86zkm3t11g.fsf@in138.ua3>
References:  <4DE21C64.8060107@digsys.bg> <4DE3ACF8.4070809@digsys.bg>	<86d3j02fox.fsf@kopusha.home.net> <4DE4E43B.7030302@digsys.bg> <86zkm3t11g.fsf@in138.ua3>

On 31.05.11 17:08, Mikolaj Golub wrote:
> As I wrote privately, it would be nice to see both netstat and hast logs (from both nodes) for the same rather long period, when several cases occurred. It would be good to place them somewhere on the web so other guys could access them too, as I will be offline for 7-10 days and will not be able to help you until I am back.

The test finished after running for almost three hours, so here is the 
collected data:

(for the duration of the test, on the secondary node)
systat -if
                     /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
      Load Average

       Interface           Traffic               Peak                Total
             lo0  in      0.000 KB/s          0.000 KB/s            1.126 KB
                  out     0.000 KB/s          0.000 KB/s            1.126 KB

             ix1  in      0.003 KB/s        230.590 MB/s          614.688 GB
                  out     0.054 KB/s          7.425 MB/s           19.910 GB

            igb0  in      0.025 KB/s          3.636 KB/s          566.897 KB
                  out     0.072 KB/s          4.296 KB/s            1.091 MB


The primary node is b1a, the secondary node is b1b.
kernel (built just after csup update):

FreeBSD b1a 8.2-STABLE FreeBSD 8.2-STABLE #1: Mon May 30 14:17:50 EEST 
2011     root@b1a:/usr/obj/usr/src/sys/GENERIC  amd64

from primary
messages: http://news.digsys.bg/~admin/hast/test31may/b1a-messages
netstat -in: http://news.digsys.bg/~admin/hast/test31may/b1a-netstat -in
netstat-s: http://news.digsys.bg/~admin/hast/test31may/b1a-netstat-s

from secondary
messages: http://news.digsys.bg/~admin/hast/test31may/b1b-messages
netstat -in: http://news.digsys.bg/~admin/hast/test31may/b1b-netstat -in
netstat-s: http://news.digsys.bg/~admin/hast/test31may/b1b-netstat-s

>   DK>  One additional note: while playing with this setup, I tried to
>   DK>  simulate local disk going away in the hope HAST will switch to using
>   DK>  the remote disk. Instead of asking someone at the site to pull out the
>   DK>  drive, I just issued on the primary
>
>   DK>  hastctl role init data0
>
>   DK>  which resulted in a kernel panic. Unfortunately, there was not sufficient
>   DK>  dump space for 48GB. I will re-run this again with more drives for the
>   DK>  crash dump. Anything you want me to look for in particular? (kernels
>   DK>  have no KDB compiled in yet)
>
> Well, removing a physical disk (device /dev/gpt/data0 consumed by hastd
> disappears) and switching a resource to init role (device /dev/hast/data0
> consumed by FS disappears) are two different things. Sure you should not
> normally change the resource role (destroy hast device) before unmounting
> (exporting) FS.
Then how do I proceed with a failed drive? Or a flaky drive that is 
still visible to the OS, that I want to remove from HAST and replace 
with a different one? How do I ask HAST to switch I/O to the secondary? 
Is there another way to get a drive out of HAST? In any case, even if this 
is not an allowed operation, it should not panic.

I am now going to reboot and run the same tests without checksums.

Daniel



