Date: Wed, 19 Jun 2013 06:35:38 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Adam Strohl <adams-freebsd@ateamsystems.com>
Subject: Re: shutdown -r / shutdown -h / reboot all hang and don't cleanly
 dismount
Message-ID: <20130619133538.GA71689@icarus.home.lan>
References: <51C1979D.3010305@ateamsystems.com>
 <20130619122143.GA70813@icarus.home.lan>
 <51C1A9BF.8030304@ateamsystems.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <51C1A9BF.8030304@ateamsystems.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-stable@freebsd.org

On Wed, Jun 19, 2013 at 07:53:19PM +0700, Adam Strohl wrote:
> On 6/19/2013 19:21, Jeremy Chadwick wrote:
> >On Wed, Jun 19, 2013 at 06:35:57PM +0700, Adam Strohl wrote:
> >>Hello -STABLE@,
> >>
> >>So I've seen this situation seemingly randomly on a number of both
> >>physical 9.1 boxes as well as VMs for I would say 6-9 months at
> >>least.  I finally have a physical box here that reproduces it
> >>consistently that I can reboot easily (ie; not a production/client
> >>server).
> >>
> >>No matter what I do:
> >>
> >>reboot
> >>shutdown -p
> >>shutdown -r
> >>
> >>This specific server will stop at "All buffers synced" and not
> >>actually power down or reboot.  KB input seems to be ignored.  This
> >>server is a ZFS NAS (with GMIRROR for boot blocks) but the other
> >>boxes which show this are using GMIRRORs for root/swap/boot (no
> >>ZFS).
> >>
> >>Here is what happens on the console: http://i.imgur.com/1H8JMyB.jpg
> >>
> >>When I reset the server it appears that disks were not dismounted
> >>cleanly ... on this ZFS box it comes back quick because ZFS is good
> >>like that but on the other servers with GMIRROR roots rebuilding the
> >>GMIRROR and fscking at the same time is murder on the
> >>disk/performance until it finishes.
> >
> >1. You mention "as well as VMs".  Anything under a "virtual machine" or
> >under a hypervisor is going to be very, very, **VERY** different than
> >bare metal.  So I hope the issues you're talking about above are on bare
> >metal -- I will assume so.
> 
> Nope, I see basically the same thing sometimes under ESXi 5.0
> Hypervisor (and yes, the implications of something so broad worry
> me).  Those units I just haven't been able to isolate on a
> server which isn't critical.  Let's focus on this server for now
> though per your suggestion below.

I'm sorry but I don't understand your first sentence -- the first part
of your sentence says "nope" (I have to assume in reply to my "on bare
metal" part), but then says "I see basically the same thing sometimes
under ESXi" which implies an alternate environment in comparison (i.e.
we *are* talking about bare metal).  Consider me confused.  :-)

> >2. We need to know what version of "9.1" you're using, i.e. 9.1-RELEASE.
> >If you use stable/9 (RELENG_9) we need to see uname -a output (you can
> >hide the machine name if you want).
> 
> Sorry, this ZFS box is 9.1-R P4 (kernel built today):
> 
> FreeBSD ilos.dsn 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #6: Wed Jun
> 19 15:31:12 ICT 2013
> root@hostname:/usr/obj/usr/src/sys/ATEAMSYSTEMS  amd64

I suggest trying stable/9 (and staying with it, for that matter).

> >3. Can we please have dmesg from this machine?  The controller and some
> >other hardware details matter.
> 
> Sure take a look at the full log here: http://pastebin.com/k55gVVuU
> 
> This includes a boot, then a reboot as I describe (you can see it
> logs the All Buffers Synced, etc) then powering back on.

Thanks.  I was mainly interested in the storage controller being used
(in this case ahci(4)) and the disks being used (notorious ST3000DM001,
known for excessively parking heads).  AFAIK this isn't one of the
controllers that was known for weird "quirky issues" pertaining to
flushing data to disk on shutdown.

I have to ask: is this FreeBSD box running under a HV?

If it *is not* running under a HV, could we please get the exact
motherboard model and version (including BIOS version)?  Sometimes (not
always) you can get this from "kenv | grep smbios".
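
A minimal sketch of that lookup, assuming the board and BIOS details
show up under the smbios.planar, smbios.bios and smbios.system prefixes
in kenv(1) output.  The function name smbios_fields is made up for
illustration; it reads "name=value" lines on stdin so it also works on
saved output.

```shell
#!/bin/sh
# Keep only the SMBIOS board/BIOS/system entries from kenv-style output.
# smbios_fields is a hypothetical helper name, not a FreeBSD tool.
smbios_fields() {
    grep -E '^smbios\.(planar|bios|system)\.'
}

# Only query the live system where kenv(1) exists (i.e. on FreeBSD).
if command -v kenv >/dev/null 2>&1; then
    kenv | smbios_fields || echo "no smbios entries found" >&2
fi
```

Not every BIOS fills these fields in, so empty or placeholder values
are possible.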

I can also see you're running your own kernel.  We'll get to that in a
moment.

> >4. Does "sysctl hw.usb.no_shutdown_wait=1" help you?
> 
> Weirdly this allowed it to reboot on the first try (without needing
> to be reset), but not the second.

I'm not surprised.  Please retry with stable/9; Hans has been constantly
working on the USB stack and fixing major bugs.

> The "Starting background file
> system checks in 60 seconds" message appeared ... that only happens
> when something is dirty, right?

No, it does not.  That message is always printed when you use background
fsck, which is the default.

I do not advocate using background fsck, because it has been known to
miss some filesystem problems (it may still do so; I do not care to find
out, I do not have time for unreliable filesystem nonsense).  Meaning:
people using background fsck have later booted into single-user, run
"fsck" manually, and found issues.

Place background_fsck="no" in /etc/rc.conf.  If the machine does not
have a clean filesystem on boot-up, you'll know because the system will
immediately begin fsck (in the foreground actively).  You'll recognise
that output if it happens, trust me.
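
One way to make that change without duplicating the setting, as a
sketch only: the helper name disable_background_fsck is made up, and it
takes the file path as an argument so it can be tried on a scratch copy
before touching /etc/rc.conf.

```shell
#!/bin/sh
# Append background_fsck="no" to an rc.conf-style file, unless the file
# already has a background_fsck line (in which case edit it by hand).
# disable_background_fsck is a hypothetical helper, not a system tool.
disable_background_fsck() {
    conf="$1"
    if grep -q '^background_fsck=' "$conf" 2>/dev/null; then
        echo "background_fsck already set in $conf; edit it by hand" >&2
        return 1
    fi
    echo 'background_fsck="no"' >> "$conf"
}
```

Usage, as root: disable_background_fsck /etc/rc.conf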

> So the second try with just this I could ctrl alt del it and it
> responded .. kind of:
> http://i.imgur.com/POAIaNg.jpg
> 
> Still had to reset it though.

This looks like a chicken-and-egg problem -- you're probably fighting
with background fsck, as the message there indicates "some processes
would not die".  I'm just taking a guess though.

I am now going to ask you for more information:

1. "gpart show -p xxx" where xxx is each disk you have in the system
2. gmirror list
3. Any/all details of your gmirror setup or other things you can
   think of when you set it up
4. Contents of /etc/fstab
5. Contents of /boot/loader.conf
6. Contents of /etc/rc.conf
7. Contents of /etc/sysctl.conf
8. Contents of /sys/amd64/conf/ATEAMSYSTEMS
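
If it saves time, items 1, 2 and 4 through 8 could be gathered in one
go with something like the sketch below.  The helper names are made up,
and enumerating disks via "sysctl -n kern.disks" is an assumption on my
part; substitute your own disk list if that is not right for your
system.  Item 3 still needs a written description.

```shell
#!/bin/sh
# Print a labelled section with a file's contents, or a note if missing.
show_file() {
    echo "=== $1 ==="
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "(not readable or not present)"
    fi
    echo
}

# Run a command under a labelled section; skip it if the tool is absent.
show_cmd() {
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command exited non-zero)"
    else
        echo "($1 not found)"
    fi
    echo
}

# Gather the requested output in the order of the list above.
collect_report() {
    # Assumption: kern.disks names every disk device in the system.
    for disk in $(sysctl -n kern.disks 2>/dev/null); do
        show_cmd gpart show -p "$disk"
    done
    show_cmd gmirror list
    for f in /etc/fstab /boot/loader.conf /etc/rc.conf \
             /etc/sysctl.conf /sys/amd64/conf/ATEAMSYSTEMS; do
        show_file "$f"
    done
}
```

Run it as "collect_report > report.txt" and read the result over before
posting; rc.conf in particular can contain things you would rather not
send to a public list.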

> >5. Does "sysctl hw.acpi.handle_reboot=1" help you?
> 
> No change, still responded to a ctrl alt del like above, but like
> that still needs to be reset and comes back dirty.
> 
> >
> >6. Does "sysctl hw.acpi.disable_on_reboot=1" help you?
> 
> No change.  Same as above, ctrl alt del responds but needs a hard
> reset still.

Okay, thank you.

> >7. If none of the above helps, can you please boot verbose mode and then
> >when the system "locks up" on "shutdown -r now" take a picture of the
> >VGA console?
> 
> Lots of debug on boot obviously but not much different on shutdown/hang:
> http://i.imgur.com/SgzSsoP.jpg

It looks to me like the ACPI layer is still actively working at the time
"all buffers are synced", meaning the actual reboot phase itself never
happens.  This to me starts to smell of an ACPI problem, but I do not
have the skill set to debug this, and I'm also grasping at straws.
There are many things that happen during that phase of operation,
particularly the "USB shutdown" phase.

But it all depends on your kernel config, which I've now asked for.

> >8. Does the machine run moused(8) (check the process list please, do not
> >rely on rc.conf) ?
> 
> ps -auxww | grep moused reveals nothing running (which is how I have
> things set).

Okay, thank you.
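
For anyone repeating this check: "ps | grep" can match the grep process
itself, so a sketch using pgrep(1) with -x (exact process name match)
may be less ambiguous.  The has_process helper name is made up for
illustration.

```shell
#!/bin/sh
# Succeed if a process with exactly this name is running.
# pgrep -x matches the whole process name, so looking for "moused"
# does not pick up the command doing the searching.
has_process() {
    pgrep -x "$1" >/dev/null 2>&1
}

if has_process moused; then
    echo "moused is running"
else
    echo "moused is not running"
fi
```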

> >>Another interesting thing is that this particular server runs slapd
> >>(OpenLDAP) which, when it comes back up, has a "corrupted" DB
> >>(easily fixed with db_recover, but still).  This might be because FS
> >>commits aren't happening at the end.   I can even manually stop
> >>slapd (service slapd stop) then run sync(8) (I assume this does
> >>something for ZFS too) and it still comes back as hosed if I reboot
> >>shortly after.  If I start/stop slapd it's fine.  So I feel like
> >>there is an FS/dismount thing going on here.
> >
> >sync(8) does not do what you think it does.  Please read (not skim) this
> >entire thread starting here:
> >
> >http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/thread.html#16982
> >http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/016982.html
> 
> Grokking this now ..
> 
> >
> >Your problem is related to unclean shutdown; fix that and your issues go
> >away.
> 
> Yeah that is my feeling as well.
> 
> >
> >>Additional information: I also have some boxes which will reboot
> >>(ie; they don't freeze like some do at the end) but they don't
> >>dismount cleanly either and have to rebuild both GMIRROR and fsck.
> >>This might be a different issue, too.
> >
> >Every issue needs to be handled/treated separately.
> 
> Sure, I just had run across some threads about that but will focus
> on this ZFS box (and see if anything that fixes here does anything
> with that once I can reliably reproduce it out of production).

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |