Date: Fri, 6 May 2005 11:08:32 +0200 (CEST)
From: Erik Norgaard <norgaard@math.ku.dk>
To: questions@freebsd.org
Subject: Spontaneous reboots
Message-ID: <Pine.LNX.4.40.0505050921350.22295-100000@shannon.math.ku.dk>

Hi,

I am having tremendous problems keeping my FBSD 5 up and happy: I keep experiencing spontaneous reboots and crashes. This is a looong story, I have been trying to figure out what's causing the problem for two weeks now. I really appreciate your patience and response if you make it all the way to the end :-)

The setup:

  FBSD---DSL---Internet

The DSL is a Thomson 510 ADSL router doing 1-1 NAT, no firewall. The FBSD is configured with IPFilter firewall and running named, postfix, cyrus-imap22 with virtual domains and apache with virtual hosts. To serve the local net (behind the DSL) it also runs dhcpd, ntpd and mysql. Postfix, Cyrus-Imap and Apache are all configured with TLS support and I have generated certificates using OpenSSL.

This system was installed in November and upgraded at the beginning of January. I have had no problems for months.

Then - from the beginning: On April 15, running FreeBSD 5.3-p5, I had two more or less simultaneous events:

1) A huge number of incoming mail delivery attempts to addresses of the type randomchars@mydomain.com
2) Kernel panic, fatal trap 12

I had done no prior system tuning or changes. Since then, uptime has been anywhere between 0 and >3 days - the last obtained by stopping all services and disconnecting the machine from the network.

1) By huge, I mean enough to suck up a 512kbps DSL connection, but this should be far from enough to make FBSD cough or even panic. Also, system load is always close to 0.00. I have postfix handling mail and use cyrus-imap with virtual domains as backend. Since postfix did not know which addresses were hosted, cyrus rejected the mail. I created a list of existing addresses so mail could be rejected faster. The illicit mail delivery attempts persist.

2) I followed the kernel panic FAQ in the handbook to investigate the panic:

  Fatal trap 12: Page fault while in kernel mode
  Fault virtual address = 0xc
  Fault code            = supervisor read, page not present
  instruction pointer   = 0x8:0xc053d638
  stack pointer         = 0x10:0xcb4ddaec
  frame pointer         = 0x10:0xcb4ddaf8
  code segment          = base 0x0, limit 0xffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
  processor eflags      = interrupt enabled, resume, IOPL=0
  current process       = 28 (swi1:net)
  trap number           = 12
  panic: page fault

  # nm -n /boot/kernel/kernel | grep c053d6
  c053d610 T m_copydata
  c053d670 T m_dup

I no longer get this panic, however my system does not deserve the predicate -STABLE. Somehow, I preferred the panic, at least it gave some info for debugging. But now it reboots without a blip.

Disk errors:

The crashes _always_ cause disk errors that cannot be recovered by the background fsck, particularly on /var where mail resides. This may result in new reboots. To solve this I have tried mounting drives read-only, unless write permission was necessary. It turns out that postfix requires write access to /, /usr and /var - the first two appear to be related to tls(?). Also, I have set fsck_y_enable="yes" in rc.conf, so the disk is thoroughly checked on boot after a crash.

I had dumpon set in my rc.conf but this just made the partition full, making things even worse. I have removed all kernel dumps and also unnecessary data, as I understood disk performance may drop when free disk space is below 15%.

The kernel:

The first kernel was a 5.3-p5 custom kernel. To make it easier to debug I updated to -p8, GENERIC. No change. Following suggestions by Kris K. I upgraded to 5.4-RC2. This solved the panic - but the system still crashes, also after updating to RC3 and RC4.

The system:

When upgrading to 5.4-RC2, I also built world. I then realized that some ports may have been built against the old base, causing new problems. I have now deinstalled all ports.
The system has been completely updated, kernel and base, to 5.4-RC4. I have reinstalled the minimal set of ports needed to serve my needs, with versions current as of May 3. I still experience crashes.

Postfix:

I tried to limit the number of simultaneous deliveries handled. No change. When a connection is made, postfix sends a lot of DNS queries to verify that the sender hostname resolves to the IP, that the sender domain exists, and that it is not in a block list.

IPFilter:

I have restricted access to port 25, now only a handful of servers are permitted by the firewall. This has helped, uptime is now hours rather than minutes, but I still have crashes. I have reduced all timeouts to prevent the state table from saturating, but no change.

If I open up for incoming mail for any single /8 segment, the number of connections explodes. Due to the limitation of simultaneous postfix threads, many time out. No change. I am working on a black list based on the maillog, but this is another project.

DNS:

Since mail to mydomain.com is currently useless I have decided to set the MX record to 127.0.0.1. This has stopped the illicit mail, but also all legitimate mail to that domain - mostly this gives me peace and bandwidth.

Hardware: (dmesg below)

I have tried to change the disk cable; I have a 2.5" disk with a converter cable to standard IDE. Also, I have tried the disk in my laptop and it appears stable, but the testing period was limited. I have tried both IDE connectors on the MB and both NICs. No change.

Summary:

Despite all my attempts to solve the problem, my system is far from STABLE. I still experience spontaneous crashes, although less often. It is my personal belief that there may be a hardware problem, or persistent disk errors. The reason is that although the traffic load saturates the connection, it should not be enough to crash even limited hardware. I have no more ideas on how to debug this.

Questions:

* Is there a disk tool for analysing the disk, marking sectors bad etc?
* How do I find the file if I know the inode number (as reported by fsck)?
* Can malformed packets cause FBSD to crash? Could the Thomson 510 be accountable for such packets?
* Did I miss the obvious?
* Any ideas where to go now?

All help is highly appreciated.

Thanks, Erik

Disk space: df

  Filesystem   1K-blocks     Used    Avail Capacity  Mounted on
  /dev/ad0s1a     507630    76966   390054    16%    /
  devfs                1        1        0   100%    /dev
  /dev/ad0s1g   30859916 14228272 14162852    50%    /home
  /dev/ad0s1f     507630       42   466978     0%    /tmp
  /dev/ad0s1d   12186190  2134420  9076876    19%    /usr
  /dev/ad0s1e   12186190  7689462  3521834    69%    /var
  devfs                1        1        0   100%    /var/named/dev

last (24h):

  norgaard  ttyp0  x.x.x.x  Fri 6 May 10:09   still logged in
  norgaard  ttyp0  x.x.x.x  Fri 6 May 09:22 - 09:25  (00:03)
  norgaard  ttyp0  charm    Fri 6 May 08:28 - 08:42  (00:13)
  norgaard  ttyp0  charm    Fri 6 May 07:48 - 08:00  (00:11)
  reboot    ~               Fri 6 May 04:16
  norgaard  ttyp1  charm    Thu 5 May 22:45 - 23:18  (00:32)
  norgaard  ttyp0  charm    Thu 5 May 22:09 - crash  (06:07)
  reboot    ~               Thu 5 May 22:05
  norgaard  ttyp0  charm    Thu 5 May 21:45 - crash  (00:20)
  reboot    ~               Thu 5 May 21:20
  norgaard  ttyp1  charm    Thu 5 May 21:11 - crash  (00:09)
  norgaard  ttyp0  charm    Thu 5 May 20:45 - crash  (00:35)
  reboot    ~               Thu 5 May 18:57
  norgaard  ttyp0  x.x.x.x  Thu 5 May 18:23 - 18:23  (00:00)
  reboot    ~               Thu 5 May 18:22
  norgaard  ttyp0  x.x.x.x  Thu 5 May 16:44 - crash  (01:37)
  norgaard  ttyp0  x.x.x.x  Thu 5 May 15:44 - 16:13  (00:28)
  norgaard  ttyp0  x.x.x.x  Thu 5 May 13:57 - 13:58  (00:00)
  norgaard  ttyp0  x.x.x.x  Thu 5 May 13:38 - 13:51  (00:12)
  norgaard  ttyp0  x.x.x.x  Thu 5 May 13:06 - 13:27  (00:21)
  norgaard  ttyp0  x.x.x.x  Thu 5 May 10:53 - 11:00  (00:06)
  reboot    ~               Thu 5 May 10:43
  norgaard  ttyp0  x.x.x.x  Thu 5 May 10:37 - crash  (00:06)
  norgaard  ttyp0  x.x.x.x  Thu 5 May 10:14 - 10:22  (00:08)
  reboot    ~               Thu 5 May 10:06
  norgaard  ttyp0  charm    Thu 5 May 08:38 - crash  (01:27)
  reboot    ~               Thu 5 May 08:38
  norgaard  ttyp0  charm    Thu 5 May 07:53 - 07:54  (00:00)
  norgaard  ttyp0  charm    Thu 5 May 07:52 - 07:52  (00:00)
  reboot    ~               Thu 5 May 07:17
  reboot    ~               Thu 5 May 04:59
  norgaard  ttyp0  charm    Thu 5 May 04:17 - crash  (00:41)
  reboot    ~               Thu 5 May 04:16
  shutdown  ~               Thu 5 May 04:14
  norgaard  ttyp0  charm    Thu 5 May 03:45 - shutdown  (00:28)
  reboot    ~               Thu 5 May 03:42
  reboot    ~               Thu 5 May 03:40
  norgaard  ttyp0  charm    Thu 5 May 03:40 - crash  (00:00)
  reboot    ~               Thu 5 May 03:31
  reboot    ~               Thu 5 May 03:27
  reboot    ~               Thu 5 May 03:13
  reboot    ~               Thu 5 May 03:03
  reboot    ~               Thu 5 May 02:58
  reboot    ~               Thu 5 May 02:51
  reboot    ~               Thu 5 May 02:47
  reboot    ~               Thu 5 May 02:41
  reboot    ~               Thu 5 May 02:35
  reboot    ~               Thu 5 May 02:29
  reboot    ~               Thu 5 May 02:25
  reboot    ~               Thu 5 May 02:20
  reboot    ~               Thu 5 May 02:09
  reboot    ~               Thu 5 May 01:58
  reboot    ~               Thu 5 May 01:53
  reboot    ~               Thu 5 May 01:50
  reboot    ~               Thu 5 May 01:46
  reboot    ~               Thu 5 May 01:42
  reboot    ~               Thu 5 May 01:33
  reboot    ~               Thu 5 May 01:30
  reboot    ~               Thu 5 May 01:27
  reboot    ~               Thu 5 May 01:13
  reboot    ~               Thu 5 May 01:08
  reboot    ~               Thu 5 May 01:05
  reboot    ~               Thu 5 May 00:58
  reboot    ~               Thu 5 May 00:53
  reboot    ~               Thu 5 May 00:44
  reboot    ~               Thu 5 May 00:34
  reboot    ~               Thu 5 May 00:24
  reboot    ~               Thu 5 May 00:20
  reboot    ~               Thu 5 May 00:13
  reboot    ~               Wed 4 May 23:58
  reboot    ~               Wed 4 May 23:43
  reboot    ~               Wed 4 May 23:40
  reboot    ~               Wed 4 May 23:36
  norgaard  ttyp0  charm    Wed 4 May 20:57 - 23:29  (02:31)

Note: the reboots from Wed 4, 23:36 to Thu 5, 07:52 appeared to be caused by postfix throttling due to a read-only mounted /usr.

dmesg.today:

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RC4 #0: Tue May 3 14:07:30 CEST 2005
    root@top.daemonsecurity.com:/usr/obj/usr/src/sys/GENERIC
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: VIA C3 Nehemiah+RNG (1002.28-MHz 686-class CPU)
  Origin = "CentaurHauls" Id = 0x694 Stepping = 4
  Features=0x380b03d<FPU,DE,PSE,TSC,MSR,MTRR,PGE,CMOV,MMX,FXSR,SSE>
real memory = 251592704 (239 MB)
avail memory = 236548096 (225 MB)
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <VT9174 AWRDACPI> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
cpu0: <ACPI CPU (3 Cx states)> on acpi0
acpi_throttle0: <ACPI CPU Throttling> on cpu0
acpi_button0: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
agp0: <VIA 862x (CLE266) host to PCI bridge> mem 0xd0000000-0xd7ffffff at device 0.0 on pci0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pci1: <display, VGA> at device 0.0 (no driver attached)
vr0: <VIA VT6105 Rhine III 10/100BaseTX> port 0xd000-0xd0ff mem 0xde000000-0xde0000ff irq 12 at device 15.0 on pci0
miibus0: <MII bus> on vr0
ukphy0: <Generic IEEE 802.3u media interface> on miibus0
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
vr0: Ethernet address: 00:40:63:d4:89:72
uhci0: <VIA 83C572 USB controller> port 0xd400-0xd41f irq 11 at device 16.0 on pci0
usb0: <VIA 83C572 USB controller> on uhci0
usb0: USB revision 1.0
uhub0: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <VIA 83C572 USB controller> port 0xd800-0xd81f irq 11 at device 16.1 on pci0
usb1: <VIA 83C572 USB controller> on uhci1
usb1: USB revision 1.0
uhub1: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <VIA 83C572 USB controller> port 0xdc00-0xdc1f irq 9 at device 16.2 on pci0
usb2: <VIA 83C572 USB controller> on uhci2
usb2: USB revision 1.0
uhub2: VIA UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
pci0: <serial bus, USB> at device 16.3 (no driver attached)
isab0: <PCI-ISA bridge> at device 17.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <VIA 8235 UDMA133 controller> port 0xe000-0xe00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 17.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
pci0: <multimedia, audio> at device 17.5 (no driver attached)
vr1: <VIA VT6102 Rhine II 10/100BaseTX> port 0xe800-0xe8ff mem 0xde002000-0xde0020ff irq 11 at device 18.0 on pci0
miibus1: <MII bus> on vr1
ukphy1: <Generic IEEE 802.3u media interface> on miibus1
ukphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
vr1: Ethernet address: 00:40:63:d4:89:71
fdc0: <floppy drive controller> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
ppc0: <Standard parallel printer port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (EPP/NIBBLE) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
sio2: <16550A-compatible COM port> port 0x3e8-0x3ef irq 5 on acpi0
sio2: type 16550A
sio3: <16550A-compatible COM port> port 0x2e8-0x2ef irq 10 on acpi0
sio3: type 16550A
orm0: <ISA Option ROM> at iomem 0xc0000-0xcdfff on isa0
pmtimer0 on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x64,0x60 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounter "TSC" frequency 1002278507 Hz quality 800
Timecounters tick every 10.000 msec
ad0: 57231MB <IC25N060ATMR04-0/MO3OAD4A> [116280/16/63] at ata0-master UDMA100
Mounting root from ufs:/dev/ad0s1a
WARNING: /home was not properly dismounted
WARNING: /tmp was not properly dismounted
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted
IP Filter: v3.4.35 initialized. Default = pass all, Logging = enabled
Accounting enabled

GnuPG: http://www.locolomo.org/home/norgaard/norgaard.gpg.asc
pub 1024D/11D11F9E 2003-08-15 Erik Norgaard <norgaard@locolomo.org>
Key fingerprint = C394 81C4 D137 EEE5 39BE 82D5 3E6B FB3E 11D1 1F9E
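
P.S. on the panic lookup above: grepping the `nm -n` output for an address prefix only works when a symbol happens to share that prefix with the instruction pointer. Below is a minimal Python sketch of the same lookup, which picks the last symbol at or below the faulting address. The function name nearest_symbol is made up for illustration, and the sample input is just the two symbols quoted in this mail, not a full kernel symbol table.

```python
def nearest_symbol(nm_lines, address):
    """Return (symbol, offset) for the last symbol at or below address.

    nm_lines: lines in `nm -n` format, i.e. "<hex value> <type> <name>".
    address:  the faulting instruction pointer as an integer.
    Returns None if no symbol starts at or below the address.
    """
    best = None
    for line in nm_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. undefined symbols, which carry no address
        value, _type, name = parts
        try:
            start = int(value, 16)
        except ValueError:
            continue  # not a hex address, skip the line
        if start <= address and (best is None or start >= best[0]):
            best = (start, name)
    if best is None:
        return None
    return best[1], address - best[0]


# Sample input: the two symbols from the nm output quoted in this mail.
sample = [
    "c053d610 T m_copydata",
    "c053d670 T m_dup",
]

# Instruction pointer from the trap message (0x8:0xc053d638).
print(nearest_symbol(sample, 0xc053d638))  # ('m_copydata', 40)
```

For the trap quoted in this mail that gives m_copydata+0x28. With a real kernel one would feed in the complete output of `nm -n /boot/kernel/kernel` instead of the two-line sample.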