From: Mikolaj Golub
To: Timothy Smith
Cc: Pawel Jakub Dawidek, freebsd-stable@freebsd.org
Subject: Re: HAST + ZFS: no action on drive failure
Date: Sat, 02 Jul 2011 18:49:05 +0300
In-Reply-To: (Timothy Smith's message of "Thu, 30 Jun 2011 20:02:19 -0700")
Message-ID: <8639ioadji.fsf@kopusha.home.net>

On Thu, 30 Jun 2011 20:02:19 -0700 Timothy Smith wrote:

 TS> First posting here, hopefully I'm doing it right =)

 TS> I also posted this to the FreeBSD forum, but I know some hast folks monitor
 TS> this list regularly and not so much there, so...

 TS> Basically, I'm testing failure scenarios with HAST/ZFS. I got two nodes,
 TS> scripted up a bunch of checks and failover actions between the nodes.
 TS> Looking good so far, though more complex than I expected. It would be cool
 TS> to post it somewhere to get some pointers/critiques, but that's another
 TS> thing.

 TS> Anyway, now I'm just seeing what happens when a drive fails on primary node.
 TS> Oddly/sadly, NOTHING!

 TS> Hast just keeps on a ticking, and doesn't change the state of the failed
 TS> drive, so the zpool has no clue the drive is offline. The
 TS> /dev/hast/ remains. The hastd does log some errors to the system
 TS> log like this, but nothing more.

 TS> messages.0:Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) Unable to
 TS> flush activemap to disk: Device not configured.
 TS> messages.0:Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) Local request
 TS> failed (Device not configured): WRITE(4736512, 512).

Although the request to the local drive failed, it succeeded on the remote
node, so no data was lost. The request was therefore considered successful and
no error was returned to ZFS.

 TS> So, I guess the question is, "Do I have to script a cronjob to check for
 TS> these kinds of errors and then change the hast resource to 'init' or
 TS> something to handle this?" Or is there some kind of hastd config setting
 TS> that I need to set? What's the SOP for this?

Currently the only way to know is to monitor the logs. It would not be
difficult to add an event hook for these errors to the HAST code (as is done
for connect/disconnect, syncstart/syncdone, etc.), so that one could script
what to do when an error occurs, but I am not sure it is a good idea: the
errors may be generated at a high rate.
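
To illustrate the log-monitoring approach, here is a minimal sketch in Python.
It is not part of HAST. The regular expression is an assumption derived only
from the two log lines quoted above, and the names count_local_errors and
resources_to_alert are hypothetical. Because the errors may arrive at a high
rate, it collapses them into at most one entry per resource instead of
reacting to every line.

```python
import re
from collections import Counter

# Assumed format, taken from the two hastd lines quoted above:
#   ... hastd[PID]: [RESOURCE] (ROLE) MESSAGE
HASTD_LINE = re.compile(
    r"hastd\[\d+\]: \[(?P<resource>[^\]]+)\] \((?P<role>\w+)\) (?P<message>.*)"
)

# Message prefixes seen in the quoted log; other local errors may look different.
LOCAL_ERROR_PREFIXES = (
    "Local request failed",
    "Unable to flush activemap to disk",
)


def count_local_errors(lines):
    """Map each HAST resource name to its number of local I/O error lines."""
    counts = Counter()
    for line in lines:
        match = HASTD_LINE.search(line)
        if match and match.group("message").startswith(LOCAL_ERROR_PREFIXES):
            counts[match.group("resource")] += 1
    return counts


def resources_to_alert(lines, threshold=1):
    """Return resources with at least `threshold` errors, each listed once."""
    counts = count_local_errors(lines)
    return sorted(name for name, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    sample = [
        "Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) "
        "Unable to flush activemap to disk: Device not configured.",
        "Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) "
        "Local request failed (Device not configured): WRITE(4736512, 512).",
    ]
    print(resources_to_alert(sample))
```

A cron job could feed the recent part of the system log to
resources_to_alert() and act once per listed resource. What action to take
(for example, switching the resource role) is left to the administrator.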

 TS> As something related too, when the zpool in FreeBSD does finally notice that
 TS> the drive is missing because I have manually changed the hast resource to
 TS> INIT (so the /dev/hast/ is gone), my zpool (raidz2) hot spare doesn't
 TS> engage, even with "autoreplace=on". The zpool status of the degraded pool
 TS> seems to indicate that I should manually replace the failed drive. If that's
 TS> the case, it's not really a "hot spare". Does this mean the "FMA Agent"
 TS> referred to in the ZFS manual is not implemented in FreeBSD?

 TS> thanks!

-- 
Mikolaj Golub