Date: Fri, 10 Oct 2008 01:50:42 -0700
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Laszlo Nagy <gandalf@shopzeus.com>
Cc: freebsd-questions@freebsd.org
Subject: Re: 7.1 hangs, shutdown terminated
Message-ID: <20081010085042.GA27290@icarus.home.lan>
In-Reply-To: <48EF14E1.9080808@shopzeus.com>
References: <48EF14E1.9080808@shopzeus.com>

On Fri, Oct 10, 2008 at 10:40:01AM +0200, Laszlo Nagy wrote:
> Hi,
>
> A computer hangs every day in the morning at a specific time, between 8
> AM and 9 AM. We can ping it. Apparently the console works, also gdm
> works on it, but we are not able to login at all. ssh accepts
> connections, but the authentication does not continue (e.g. ssh client
> waits for the server forever...)
>
> I even cannot login on the console as "root" because it accepts the user
> name, but does not ask for the password!
>
> Pressing Ctrl+Alt+Del on the console waits for about one or two minutes,
> then I see this on the screen:
>
> http://www.imghype.com/viewer.php?imgdata=9d95ee9d1fstrange_shutdown.jpg
>
> Here is /var/log/messages just before the crash:
>
> Oct 10 01:52:47 shopzeus postgres[81114]: [5-1] WARNING: nonstandard
> use of escape in a string literal at character 193
> Oct 10 01:52:47 shopzeus postgres[81114]: [5-2] HINT: Use the escape
> string syntax for escapes, e.g., E'\r\n'.
> Oct 10 01:57:11 shopzeus postgres[84132]: [5-1] WARNING: nonstandard
> use of escape in a string literal at character 188
> Oct 10 01:57:11 shopzeus postgres[84132]: [5-2] HINT: Use the escape
> string syntax for escapes, e.g., E'\r\n'.
> Oct 10 02:00:01 shopzeus postfix/postfix-script[86167]: fatal: the
> Postfix mail system is already running
> Oct 10 02:30:00 shopzeus postfix/postfix-script[7240]: fatal: the
> Postfix mail system is already running
> Oct 10 03:00:00 shopzeus postfix/postfix-script[27437]: fatal: the
> Postfix mail system is already running
> Oct 10 04:07:54 shopzeus rc.shutdown: 30 second watchdog timeout
> expired. Shutdown terminated.
> Oct 10 04:09:16 shopzeus postgres[30455]: [5-1] FATAL: terminating
> connection due to administrator command
> Oct 10 04:09:17 shopzeus syslogd: exiting on signal 15
> Oct 10 04:11:31 shopzeus syslogd: kernel boot file is /boot/kernel/kernel
> Oct 10 04:11:31 shopzeus kernel: Copyright (c) 1992-2008 The FreeBSD
> Project.
> Oct 10 04:11:31 shopzeus kernel: Copyright (c) 1979, 1980, 1983, 1986,
> 1988, 1989, 1991, 1992, 1993, 1994
>
> After rebooting the machine, nothing happens until the next day. Here
> are some possible problems I can think of:
>
> #1. We are using gjournal. It might be that the journal size is too
> small. Although I do not think this is the case, because we have 40GB
> journal space for each journaled partition below (except for /home, it
> has 10GB only, but /home is rarely used)
>
> Filesystem          1G-blocks Used Avail Capacity  Mounted on
> /dev/da0s1a                 9    1     7    14%    /
> devfs                       0    0     0   100%    /dev
> /dev/da0s1f.journal       140   12   117     9%    /home
> /dev/da0s2d.journal       106    8    89     8%    /pgdata0
> /dev/da0s1d                29    0    26     0%    /tmp
> /dev/da0s2e.journal       585   74   464    14%    /usr
> /dev/da0s1e.journal       145   17   116    13%    /var
> /dev/da1s1d.journal       416    0   383     0%    /data
>
> Is it possible that gjournal is hanging up the machine?
>
> #2. Yesterday when I logged in in the morning, I saw a process running
> under root, it was something like " find / -sx ..." and then something.
> I don't remember but it was scanning the whole filesystem. It was using
> 100% cpu and 100% disk I/O. I wonder if that might be freezing the
> computer. I do not know how to disable this maintenance process but I
> should. After killing this process, the system worked fine. (We have
> zillions of files on the disks, running "find / ..." is a bad idea.)

This could be a periodic job (since you said this happens daily) which
runs early in the morning (2-3am?) and for some reason isn't finishing
in a timely manner. You haven't provided any actual ps -auxwwwwwww
data, so we can't easily discern if it's a periodic job or something
amiss on your system (for all we know the system could be compromised).

I'm also curious what controller your SCSI disks are attached to. Can
you provide that information? dmesg would be useful. I remember hearing
some reports about 3Ware controllers locking up due to firmware problems
which were later fixed via a f/w upgrade.

> #3. In the screenshot above, you can see that the IMAP server "dovecot"
> was terminated on signal 11. Can it be the problem? I can't believe that
> dovecot could freeze the whole system.
>
> #4. Hardware error. I don't think this is the case since the computer
> freezes at the same time, every day, so it is more likely a software
> problem.

My vote is on a hardware problem. The watchdog timeout you see
indicates a portion of the system is locking up hard. The sig 11 would
indicate a sudden segfault, which if unexpected, often indicates bad
memory or motherboard.

I would recommend you start down the hardware path. Replace the RAM and
the mainboard, and see what happens.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
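
For anyone wanting to line up the /var/log/messages excerpt quoted above
against the time of the hang, here is a minimal Python sketch. It is not
part of the original exchange. It assumes the classic BSD syslog layout
seen in the excerpt ("Oct 10 04:07:54 host tag[pid]: message") with each
entry on one line, and because syslog timestamps carry no year, the
`year` argument is an assumption. The names `parse_line` and
`events_before` are hypothetical helpers, not part of any FreeBSD tool.
It lists what was logged in the hours before the first "watchdog
timeout" entry.

```python
import re
from datetime import datetime, timedelta

# Classic BSD syslog layout: "Oct 10 04:07:54 host tag[pid]: message"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+([^:\s]+):\s*(.*)$"
)


def parse_line(line, year=2008):
    """Return (timestamp, host, tag, message), or None if the line does not match.

    Syslog timestamps have no year, so the caller supplies one (assumption).
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, host, tag, message = m.groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, tag, message


def events_before(lines, needle="watchdog timeout", window=timedelta(hours=3)):
    """Return entries logged within `window` before the first line containing `needle`."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    trigger = next((p for p in parsed if needle in p[3]), None)
    if trigger is None:
        return []
    start = trigger[0] - window
    return [p for p in parsed if start <= p[0] < trigger[0]]


# Entries taken from the excerpt in the thread, with wrapped lines rejoined.
SAMPLE = [
    "Oct 10 02:00:01 shopzeus postfix/postfix-script[86167]: fatal: the Postfix mail system is already running",
    "Oct 10 03:00:00 shopzeus postfix/postfix-script[27437]: fatal: the Postfix mail system is already running",
    "Oct 10 04:07:54 shopzeus rc.shutdown: 30 second watchdog timeout expired. Shutdown terminated.",
    "Oct 10 04:09:17 shopzeus syslogd: exiting on signal 15",
]

if __name__ == "__main__":
    for when, host, tag, message in events_before(SAMPLE):
        print(when.strftime("%H:%M:%S"), tag, message)
```

Pointing the same function at the lines of the real log file, and widening
or narrowing `window`, gives a quick view of what was active before the
shutdown watchdog fired. It is no substitute for the ps and dmesg output
requested above.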