Date: Tue, 13 May 2003 20:25:45 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: peter@wemm.org
Cc: current@FreeBSD.org
Subject: Re: 5.1-RELEASE TODO
Message-ID: <200305140325.h4E3PjM7051815@gw.catspoiler.org>
In-Reply-To: <20030513221637.3B7422A8AC@canning.wemm.org>
On 13 May, Peter Wemm wrote:
> Don Lewis wrote:
>> Both my AMD system running -current and PII system running -stable are
>> afflicted with these data corruption problems. The limited amount of
>> information that I've seen about these problems leads me to believe that
>> in order to use the 4 MB page feature without danger to system integrity
>> is to relocate the kernel. If this is the case, then it would seem to
>> make sense to disable the use of 4 MB pages by adding the DISABLE_PSE
>> option until the system is patched.
>
> The thing is, we only use 4MB pages for two things.
> 1) The first 4MB of KVM is mapped as a 4MB page.
> 2) Large device mappings, eg: the Xserver mmaping /dev/mem for the frame
> buffer. The thing is though, these 4MB pages are not mapped with PG_G.
>
> Are you running X? Are you using the broadcom ethernet driver?

Yes on my -stable machine, no on my -current machine. No.

> Also of note: I recently saw a brand new P4 system with a genuine intel
> motherboard, for a RELENG_4 system. It had shocking data corruption
> problems. The memory was swapped - no change. The motherboard and CPU were
> swapped (same motherboard model, much newer P4 cpu stepping) - no change.
> It was simply unreliable. Backporting DISABLE_PG_G to RELENG_4 and turning
> on it and DISABLE_PSE greatly reduced the problem, but it still happened.
> In the end, the Intel motherboard was replaced with a P4 Xeon system
> motherboard and the problem instantly went away. The trouble appeared
> to be a generic problem the Intel 845 chipset motherboard.
>
> Remember, this was RELENG_4 as of a few months ago. It isn't a 5.x-only
> problem.

DISABLE_PSE fixed my -stable machine.

> The bge driver has been firmly implicated in one of the cases of data
> corruption. Paul's recent if_bge fixes completely solved one person's
> long-standing problems. There are DMA bugs in the earlier chipsets that
> we didn't have the prescribed workarounds for. And even though the compiles
> were happening on local disks, all it took was running the build in an Xterm
> so that the make output was going over the network, or doing a tail -f etc.
>
>> PG_G is probably different. A better case can be made that using this
>> option is only masking software bugs that should be fixable. The
>> problem is that these bugs are only rarely triggered, look a lot like
>> flakey hardware, and it's just about impossible for most FreeBSD users
>> to track the problem to its root cause.
>
> For what its worth, we have #ifdef'ed code in i386/pmap.c:
> #ifdef I686_CPU_not /* Problem seems to have gone away */
>         /* Deal with un-resolved Pentium4 issues */
>         if (cpu_class == CPUCLASS_686 &&
>             strcmp(cpu_vendor, "GenuineIntel") == 0 &&
>             (cpu_id & 0xf00) == 0xf00) {
>                 printf("Warning: Pentium 4 cpu: PG_G disabled (global flag)\n");
>                 pgeflag = 0;
>         }
> #endif
>
> I really do not want DISABLE_PSE and DISABLE_PG_G turned on for what appears
> to have a hardware component. I'd much rather the above ifdef's turned on.

I haven't experimented with tuning them separately. It took the better
part of a day of crunching for any sign of a problem.

> For the folks having problems, here's what I'd like to know:
>
> - Are you running X? (standard XFree86 or do you have the agp and drm drivers
> enabled?)
> - What ethernet card? (particularly if bge)
> - Is there any network traffic at the time? ie: if you remove the network
> card entirely and do the compile tests on a /dev/ttyv0 console, does it still
> happen?
> - What hardware do you have? (cpuid line shoing the Id = 0xNNN number,
> memory size/type and whether it has ECC or not, motherboard chipset, etc)
> - Have you replaced any hardware? If so, which parts?

On my -current machine:

X is installed, but the server is not running. I'm being a bad
developer and not exercising all the features.

The ethernet card is an fxp.

There is network traffic, since I'm accessing the machine via ssh, and
my home directory is NFS mounted.

CPU: AMD Athlon(tm) XP 1900+ (1608.22-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x662  Stepping = 2
  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
  AMD Features=0xc0480000<MP,AMIE,DSP,3DNow!>
real memory  = 1073676288 (1023 MB)
avail memory = 1035616256 (987 MB)
Pentium Pro MTRR support enabled

The RAM is PC2100 w/ECC and passes an overnight run of memtest86, but
only after I fixed the CAS Latency timing in the BIOS, which set the
timing incorrectly in automatic mode.

The chipset is an AMD 761 + VIA 82C686B.

On my -stable machine:

X is running.

The ethernet is a motherboard fxp interface.

There is network traffic.

CPU: Pentium II/Pentium II Xeon/Celeron (400.91-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x652  Stepping = 2
  Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
real memory  = 402640896 (393204K bytes)

The RAM is PC100 with/ECC.

The chipset is the Intel 440BX.

> Oh, and two more things:
> - Do DISABLE_PG_G and/or DISABLE_PSE actually affect the stability?
> - Are you seeing application faults (segfault etc) or kernel stability
> (fatal trap, panic etc).

In both cases the problem manifests itself as data corruption. It
generally shows up as a corrupted file in /usr/src when running "make
buildworld" several times in a row. No -j option is needed. The actual
file is clean, which can be observed by rebooting the machine. As I
recall, the corruption affects only a small part of the file, about 16
bytes, I think. It looks like some sort of binary trash. I've also had
openoffice builds die from the same problem, and since they take more
than 24 hours on my -stable box, I'm not overly eager to experiment.

I don't see any sign of data corruption with the DISABLE option(s)
present in the kernel configuration.
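
For reference, the (cpu_id & 0xf00) == 0xf00 test in the quoted pmap.c
fragment looks only at the family field of the CPUID signature that the
kernel prints as "Id = 0x...". The following is a minimal sketch of that
decoding, assuming the standard base family/model/stepping layout and
ignoring the extended family and model bits. The names cpu_sig,
decode_cpu_id and is_family_f are hypothetical and do not come from the
FreeBSD source, and is_family_f reproduces only the mask comparison, not
the cpu_class or vendor checks.

```c
#include <stdio.h>

/*
 * Base fields of the CPUID signature ("Id = 0x..." in dmesg).
 * Illustrative only: extended family/model bits are not decoded.
 */
struct cpu_sig {
    unsigned family;   /* bits 11..8 */
    unsigned model;    /* bits 7..4  */
    unsigned stepping; /* bits 3..0  */
};

/* Split a signature value into its base family, model and stepping. */
struct cpu_sig decode_cpu_id(unsigned cpu_id)
{
    struct cpu_sig s;

    s.family = (cpu_id >> 8) & 0xf;
    s.model = (cpu_id >> 4) & 0xf;
    s.stepping = cpu_id & 0xf;
    return s;
}

/* Same mask comparison as the quoted fragment: true when family is 0xf. */
int is_family_f(unsigned cpu_id)
{
    return (cpu_id & 0xf00) == 0xf00;
}

/* Print one decoded signature in a dmesg-like form. */
void print_cpu_sig(unsigned cpu_id)
{
    struct cpu_sig s = decode_cpu_id(cpu_id);

    printf("Id = 0x%x  family %u  model %u  stepping %u  family-0xf: %s\n",
        cpu_id, s.family, s.model, s.stepping,
        is_family_f(cpu_id) ? "yes" : "no");
}
```

Applied to the two Id values above, decode_cpu_id(0x662) gives family 6,
model 6, stepping 2 and decode_cpu_id(0x652) gives family 6, model 5,
stepping 2, so neither machine would match the family-0xf comparison even
before the vendor check is considered.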