Date: Sat, 24 Jan 2004 11:06:20 -0500 (EST)
From: Robert Watson <rwatson@freebsd.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: hackers@freebsd.org
Subject: Re: XL driver checksum producing corrupted but checksum-correct packets
Message-ID: <Pine.NEB.3.96L.1040124104814.297K-100000@fledge.watson.org>
In-Reply-To: <200401240655.i0O6t8lp030917@apollo.backplane.com>
On Fri, 23 Jan 2004, Matthew Dillon wrote:

> I tracked down an occasional buildworld failure on DragonFly to my
> XL driver, which is synchronized to 4.x's XL driver.

It would be very helpful if you could do the following:

(1) See if you can reproduce this using something other than NFS --
    perhaps netperf using UDP_STREAM or the like, between that machine
    and another machine. This would give us a more reproducible
    workload than "builds", and hopefully one that is less sensitive to
    things like context switching, etc.

(2) See if you can reproduce this with a stock 4.9-RELEASE kernel (or
    4-STABLE). While the drivers are similar between 4.x and DFBSD,
    there are actually quite a few structural changes in the DFBSD
    version. Maybe it would make sense to try backing out the local
    DFBSD changes to the base FreeBSD version, even if not trying a
    completely FreeBSD system, to see if they are the cause. It's
    difficult to diff the two because of reorganization and style
    changes.

> xl0@pci1:6:0: class=0x020000 card=0x764610b7 chip=0x764610b7 rev=0x30 hdr=0x00

Does this card have a product name, or is it one of those chips embedded
in a motherboard without a separate name? I took a look through the xl
cards/chips on my various machines, and was unable to find anything that
had remotely the same card or chip ID. I did some high-volume packet
flows between them with hardware checksumming disabled and didn't see
any corrupted UDP packets, but the workloads I'm using sound pretty
different. Knowing whether it can be reproduced using a simpler workload
(and the specifics) would be good.

FYI, I checked the Linux driver for these cards, and didn't see mention
of any quirks for the particular chips/card you're using. The only thing
of note in the Linux driver was the following:

        /* Check the PCI latency value.  On the 3c590 series the latency timer
           must be set to the maximum value to avoid data corruption that occurs
           when the timer expires during a transfer.  This bug exists the Vortex
           chip only. */
        if (pdev) {
                u8 pci_latency;
                u8 new_latency = (drv_flags & IS_VORTEX) ? 248 : 32;

                pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &pci_latency);
                if (pci_latency < new_latency) {
                        printk(KERN_INFO "%s: Overriding PCI latency"
                                " timer (CFLT) setting of %d, new value is %d.\n",
                                dev->name, pci_latency, new_latency);
                        pci_write_config_byte(pdev, PCI_LATENCY_TIMER, new_latency);
                }
        }

The rate at which you have failures sounds like it could be a similar
issue, however -- an occasional collision between a timer and DMA. NFS
is often a mix of small RPCs handling lookups and attributes, and larger
RPCs carrying data. Using netperf or a related tool might help you
identify whether one of those is more likely to cause the failure.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research
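
P.S. For anyone who wants to reason about the Linux latency check quoted above without a kernel tree at hand, here is a minimal userspace sketch of just its decision logic. The function and constant names are hypothetical and appear in neither driver; only the values 248 and 32 and the "raise, never lower" rule are taken from the quoted code. It does not touch PCI config space at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimum latency timer values used by the quoted Linux check.
 * The enum names are made up for this sketch. */
enum {
    VORTEX_MIN_LATENCY  = 248, /* Vortex (3c590 series) chips */
    DEFAULT_MIN_LATENCY = 32   /* every other chip */
};

/*
 * Model of the quoted check: given the latency timer value currently
 * programmed and whether the chip is a Vortex, return the value that
 * would be in place afterwards. The timer is only ever raised to the
 * minimum; a value already at or above the minimum is left alone.
 */
static uint8_t xl_pick_latency(uint8_t current, bool is_vortex)
{
    uint8_t wanted = is_vortex ? VORTEX_MIN_LATENCY : DEFAULT_MIN_LATENCY;

    return (current < wanted) ? wanted : current;
}
```

For example, xl_pick_latency(16, false) yields 32 and xl_pick_latency(64, true) yields 248, while a timer already set to 64 on a non-Vortex chip stays at 64. Whether the chip in question needs any such adjustment is exactly what is not known yet.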