Date: Fri, 22 Nov 2013 11:18:29 +0000
From: Pete French <petefrench@ingresso.co.uk>
To: petefrench@ingresso.co.uk, trociny@FreeBSD.org
Cc: freebsd-stable@freebsd.org
Subject: Re: Hast locking up under 9.2
Message-ID: <E1Vjokn-000OuU-1Y@dilbert.ingresso.co.uk>
In-Reply-To: <20131121203711.GA3736@gmail.com>

> I remember already asking you about replication mode you was using and
> don't remember you answered. One of the significant changes is memsync
> mode, which is default in 9.2 (it was fullsync in earlier versions).
> So if you are using default settings you can try switching to fullsync
> as a workaround.

Yes, I am using the default settings, so that is something I can try.
After three days of downtime last week I will not try it in the immediate
future though, for fear of my colleagues wanting to strangle me :-) Will
enable on the test system however, and try on live in a couple of weeks
if I can.

> signal=6 means that hastd crashed due to some assertion failed.
> Usually "Assertion failed ..." message precedes this line in the
> logs. Don't you see such a message? It might be very helpful.

Yes, I do actually!

"Assertion failed: (!hio->hio_done), function write_complete, file /usr/src/sbin/hastd/primary.c, line 1130."

> Do you always see this error when it gets stuck?

That I do not know, I am afraid - I was too busy getting the systems back
online to have time to try and reconcile the downtimes with what is in
the logfiles. It was only yesterday that I started trying to trace what
might have happened.

> Unfortunately the crash did not generated core (due to capsicum). When
> I want to get a coredump I rebuild hastd with CFLAGS+=-DHAVE_CAPSICUM
> removed in Makefile (and with debugging symbols). There might be an
> easier method but I don't know.
>
> If you don't find the assertion message and the crashes are
> reproducible, it would be helpful to rebuild hastd with symbols and
> capsicum disabled to make it coredump and provide the backtrace.
>
> Also, when you have hastd got stuck you can generate a core of the
> live process with gcore(1).

I didn't know about gcore - that's a very useful feature!

The crash is reproducible, but not on any machine that I could actually
crash without causing extensive downtime to the rest of the business
unfortunately.
I can't deliberately crash our master database and it doesn't crash on the
test setup we have. But what I can do is to run it up live again with your
suggested change to the config, and if it gets stuck try and generate some
more useful debugging then.

> What revision are you using? Recently there was a fix for crashes
> triggered by this failed assertion:
>
> Assertion failed: (amp->am_memtab[ext] > 0), function
> activemap_write_complete, file activemap.c, line 351.

I'm using r257795 - I did an upgrade to get the fix for the above
assertion, and in general I keep an eye on the commits, and anything
involving hast or zfs I take as soon as I can to try and improve
stability.

Thanks for the help - if I get any more info I will let you know, or if
the above assertion helps you track something down then I may be able
to try some patches.

cheers,

-pete.
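
For reference, a minimal sketch of what the suggested fullsync workaround might look like in hast.conf(5), assuming a single resource replicated between two nodes. The resource name, host names, device path and addresses are placeholders, not values from the setup discussed here; the only line that matters is the replication one.

```
# /etc/hast.conf (illustrative only; names and addresses are hypothetical)
resource shared {
        # 9.2 defaults to memsync; request fullsync explicitly
        replication fullsync
        local /dev/da1
        on hosta {
                remote tcp4://10.0.0.2
        }
        on hostb {
                remote tcp4://10.0.0.1
        }
}
```

The replication keyword can also be given at the top of the file, outside any resource block, in which case it should apply to every resource that does not override it.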