Date: Mon, 27 Apr 2009 09:51:03 -0700
From: Jack Vogel <jfvogel@gmail.com>
To: Greg Byshenk <freebsd@byshenk.net>
Cc: freebsd-stable@freebsd.org
Subject: Re: em0 watchdog timeout (and 3ware problems) 7-stable
Message-ID: <2a41acea0904270951i20a7d65fja677e3e7865802b@mail.gmail.com>
In-Reply-To: <20090426125008.GK1550@core.byshenk.net>
References: <20090426125008.GK1550@core.byshenk.net>
Greg,

I have another report of this problem, and I have a patch for you to try
out; I will be sending it out a bit later today.

Jack

On Sun, Apr 26, 2009 at 5:50 AM, Greg Byshenk <freebsd@byshenk.net> wrote:

> I have one machine that is seeing watchdog timeouts on em0, running
> 7-STABLE amd64 as of 2009.04.19, and also some other more perverse errors.
>
> Twice now in the last 48 hours, this machine has become unreachable via the
> network, and connecting to the console shows an endless string of
>
> [...]
> em0: watchdog timeout -- resetting
> em0: watchdog timeout -- resetting
> em0: watchdog timeout -- resetting
>
> messages. The machine is almost locked up. That is, I can get a login
> prompt, but can go no further than typing in a username; after the
> username, no password prompt, and nothing further. The only option is
> to hard reset the machine or to drop to debugger and reboot.
>
> Now the "perverse" part. After restarting, the system partition is no
> more.
>
> Background detail: the machine is a fileserver, with a 3Ware 9650SE-16ML
> SATA controller, connected to 16 1TB SATA drives, this configured as
> a 14-drive RAID10 array (+ 2 hot spares), with a 50GB system partition
> and 6.5TB data partition. The system partition is configured as da1,
> with one slice and more or less standard partitions for / /var /tmp, etc.
> (the data partition of the array is sliced with gpt).
>
> The issue here is that, upon restart, all partition information on da0
> seems to have disappeared, and restarting results in a "no operating
> system found" message, and a failure to boot (obviously).
>
> But all of the data is still present. If I boot into rescue mode,
> recreate da0s1, mark it bootable, and restore the bsdlabel, then
> everything works again. I can restart the machine, and it comes back
> up normally (it requires an fsck of everything on da0, but after that
> everything is back to normal).
>
> I don't know if this is two unrelated problems, or one problem with
> two symptoms, or something else. I think that I can safely say that
> it is not a problem with the 3Ware controller itself, as I replaced
> the controller with a spare (identical model), and the problem
> recurred. Additionally, I have an almost-identical configuration on
> four other machines, none of which are experiencing any problems.
> One thing that is different is that the other machines use
> Intel PRO/1000 PF (pci-e) NICs.
>
> Is there some known problem with the Intel 2572 fibre NIC? Or some
> potential interaction of it with the 3ware RAID controller?
>
> For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
> threads on 7.2/bge), and am building a new kernel/world from sources
> csup'd one hour ago, but I'd really like to hear any ideas about this
> -- particularly the wiping of the label.
>
> Some information about the system:
>
>
> # /dev/da0s1:
> 8 partitions:
> #        size    offset    fstype   [fsize bsize bps/cpg]
>   a:   2097152         0    4.2BSD        0     0     0
>   b:   8388608   2097152      swap
>   c: 104856192         0    unused        0     0         # "raw" part, don't edit
>   d:   8388608  10485760    4.2BSD        0     0     0
>   e:   2097152  18874368    4.2BSD        0     0     0
>   f:  41943040  20971520    4.2BSD        0     0     0
>   g:  41941632  62914560    4.2BSD        0     0     0
>
>
> em0@pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '2572 10/100/1000 Ethernet Controller (Fiber)'
>     class      = network
>     subclass   = ethernet
>     bar   [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
>     bar   [14] = type Memory, range 32, base 0xda020000, size 65536, enabled
>
> twa0@pci0:9:0:0: class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
>     device     = '9650SE Series PCI-Express SATA2 Raid Controller'
>     class      = mass storage
>     subclass   = RAID
>     bar   [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
>     bar   [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
>     bar   [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
>     cap 01[40] = powerspec 2  supports D0 D1 D2 D3  current D0
>     cap 05[50] = MSI supports 32 messages, 64 bit
>     cap 10[70] = PCI-Express 1 legacy endpoint
>
> --
> greg byshenk  -  gbyshenk@byshenk.net  -  Leiden, NL
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
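
Since the recovery described in the quoted message depends on restoring the bsdlabel by hand, it may be worth sanity-checking the numbers in the label quoted above before writing them back. The following is only a minimal Python sketch of such a check: the sizes and offsets are copied from the quoted output, while the 512-byte sector size and the helper names `check_label` and `size_gib` are assumptions made for illustration, not something stated in the thread.

```python
# Assumption: 512-byte sectors. The message does not state the sector size.
SECTOR_BYTES = 512

# (partition, size, offset) in sectors, copied from the quoted bsdlabel output.
LABEL = [
    ("a", 2097152, 0),
    ("b", 8388608, 2097152),
    ("c", 104856192, 0),  # "raw" partition covering the whole slice
    ("d", 8388608, 10485760),
    ("e", 2097152, 18874368),
    ("f", 41943040, 20971520),
    ("g", 41941632, 62914560),
]


def check_label(label):
    """Return a list of layout problems; an empty list means the
    partitions are contiguous and exactly fill the raw 'c' partition."""
    whole = next(size for name, size, _ in label if name == "c")
    parts = sorted((p for p in label if p[0] != "c"), key=lambda p: p[2])
    problems = []
    expected = 0  # offset at which the next partition should start
    for name, size, offset in parts:
        if offset != expected:
            problems.append(f"{name}: starts at {offset}, expected {expected}")
        expected = offset + size
    if expected != whole:
        problems.append(f"partitions end at {expected}, slice has {whole} sectors")
    return problems


def size_gib(sectors):
    """Convert a sector count to GiB under the assumed sector size."""
    return sectors * SECTOR_BYTES / 2**30


if __name__ == "__main__":
    for name, size, _ in LABEL:
        print(f"{name}: {size_gib(size):.2f} GiB")
    print(check_label(LABEL) or "layout is contiguous and fills the slice")
```

Run against the quoted values, the check reports no gaps or overlaps, which is consistent with the label having been restored correctly; it says nothing about why the label was lost in the first place.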