Date: Wed, 10 Dec 2008 13:18:00 +0100
From: Arnaud Houdelette <arnaud.houdelette@tzim.net>
To: Victor Balada Diaz <victor@bsdes.net>
Cc: freebsd-stable@freebsd.org, freebsd-amd64@freebsd.org
Subject: Re: [ATA] and re(4) stability issues
Message-ID: <493FB378.5030106@tzim.net>
In-Reply-To: <20081209185236.GA1320@alf.bsdes.net>
References: <20081209185236.GA1320@alf.bsdes.net>

Victor Balada Diaz wrote:
> Hello,
>
> I have several machines[1] at hetzner.de and I've been having problems
> with interrupts on FreeBSD 7.0 and now FreeBSD 7.1-BETA2 on amd64. I've
> been trying to narrow the problem down so someone more knowledgeable
> than me is able to fix it. This mail is another attempt to ask a
> question about the ATA code, to see if this time I've got something.
>
> For those who don't know what happened:
>
> With FreeBSD 7.0-RELEASE for amd64 and the default kernel,
> the system shared the re0 interrupt with OHCI, and this caused
> re(4) to corrupt packets and create interrupt storms. I tried
> updating to 7.1-BETA2 and still had some problems with it.
>
> I opened PR kern/128287[2] and Remko quickly answered
> with a workaround: removing USB support from my kernel.
> I did it, and re(4) wasn't sharing interrupts any longer,
> and the interrupt storms were gone. Now, some time later, the interface
> goes up and down from time to time, but less often. Also, sometimes
> the machine loses the network interface but continues to work.
>
> I know it continues to work because some days later I can see that
> it tried to deliver the status reports but was unable to resolve the
> aliases hostnames. I can't ping the machine, and I know the network
> is OK. If I reboot the machine, everything works again.
>
> When I switched from 7.0 to 7.1-BETA2 I also found that, under load,
> after some hours the machine created interrupt storms on the ATA disks.
>
> Digging through the Linux source code, I found that they do some
> special things for this chipset that I've been unable to find in our
> code. This is the Linux code for my chipset:
>
>     AHCI_HFLAGS (AHCI_HFLAG_IGN_SERR_INTERNAL |
>     AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
>     AHCI_HFLAG_SECT255),
>
> The file and the rest of the code are here[3].
>
> As I saw AHCI_HFLAG_NO_MSI, I tried the easiest thing I could
> think of: switching MSI and MSI-X off for the whole system. So
> I added these tunables to /boot/loader.conf:
>
> hw.pci.enable_msix="0"
> hw.pci.enable_msi="0"
>
> and then rebooted the machine. After several hours of doing almost
> nothing, I found that the machine answered ping but was unable to
> answer any other request (e.g. ssh, nagios nrpe, etc.). The machine
> recovered by itself after some minutes, and when I was able to ssh in
> I saw the following in dmesg:
>
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1463123158
>
> and a lot more errors like that. I didn't get these errors with MSI
> enabled. I see WRITE_DMA48, and in the Linux code I saw
> AHCI_HFLAG_32BIT_ONLY, which is later used for DMA-related things.
> Could someone who is more knowledgeable check whether we're doing the
> right thing?
>
> I've attached the verbose dmesg of a machine like this one, with
> 7.1-BETA2, MSI enabled and a GENERIC kernel minus USB and firewire.
>
> Also, please, could someone give me a hand with how I could continue
> debugging these interrupt issues? I'm a bit lost, and digging through
> code and posting each time I think I've found something is not going
> to get anywhere.
>
> I would also like to say that I've seen reports of this kind of
> problem on amd64 machines on the lists for several years, so I don't
> think this is just a problem with this BIOS/motherboard
> (MSI K9AG Neo2 Digital).
>
> Thanks in advance for any help.
> Regards.
>
> [1]: http://www.hetzner.de/hosting/produkte_rootserver/ds7000/
> [2]: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/128287
> [3]: http://fxr.watson.org/fxr/source/drivers/ata/ahci.c?v=linux-2.6#L369
>

Sorry, I didn't take the time to read the whole thread, but I had a
similar problem with the same IXP600 chipset. Only it wasn't with a
Realtek NIC (re) but with a Ralink wireless one.

The symptoms were similar: interrupt 22 was shared between the SATA
controller and the wireless card, and I got interrupt storms at random
times when using the wireless network.

No problems since I removed the ral(4) NIC (I've got a real access
point now).

You might not want to point the finger at the re(4) driver too fast.

Arnaud Houdelette
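
For anyone wanting to check for the kind of sharing described above
(IRQ 22 used by both the SATA controller and the wireless card), here
is a rough sketch that groups device names by IRQ from text shaped like
`vmstat -i` output. The sample text, device names and counters are
invented for illustration, the helper name shared_irqs is hypothetical,
and the column layout (and the trailing '+' on truncated names) is an
assumption that may differ between FreeBSD releases.

```python
import re

# Hypothetical sample in the general shape of `vmstat -i` output; the
# device names and counters are invented for illustration only.
SAMPLE = """\
interrupt                          total       rate
irq1: atkbd0                           8          0
irq16: re0                        120331         33
irq22: atapci1 ral0              9481220       2633
cpu0: timer                      7200412       2000
Total                           16801971       4666
"""

# Assumed layout: "irqN: <one or more handler names> <total> <rate>"
LINE = re.compile(r"^irq(\d+):\s+(.*?)\s+(\d+)\s+(\d+)\s*$")


def shared_irqs(text):
    """Return {irq: [devices]} for IRQ lines that look shared."""
    shared = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # header, per-CPU timers, totals, etc.
        irq = int(m.group(1))
        devices = m.group(2).split()
        # Assumption: a trailing '+' means the handler list was truncated,
        # so more than one handler is attached to this IRQ.
        if len(devices) > 1 or any(d.endswith("+") for d in devices):
            shared[irq] = devices
    return shared


if __name__ == "__main__":
    for irq, devs in sorted(shared_irqs(SAMPLE).items()):
        print(f"irq{irq} is shared by: {', '.join(devs)}")
```

On a real machine the text would come from running `vmstat -i` rather
than from the sample string. A shared line is only a hint about where
to look, not proof that either driver is at fault.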