From: youshi10@u.washington.edu
To: "N. Harrington"
Cc: questions@freebsd.org
Date: Tue, 5 Jun 2007 15:38:51 -0700 (PDT)
Subject: Re: How to solve mysterious system lockups?
In-Reply-To: <362995.35822.qm@web34505.mail.mud.yahoo.com>

On Tue, 5 Jun 2007, N. Harrington wrote:

>
> --- Garrett Cooper wrote:
>
>> N. Harrington wrote:
>>> --- Garrett Cooper wrote:
>>>
>>>> N. Harrington wrote:
>>>>
>>>>> Hello
>>>>> I have several systems that are used as squid
>>>>> caching servers. I have some systems that use SCSI
>>>>> disks and some that use SATA disks. They are
>>>>> identical in every way except for the SATA vs SCSI
>>>>> drives.
>>>>>
>>>>> At random times, the SATA-based systems seem to be
>>>>> freezing. You can ping them and they respond, but you
>>>>> cannot log in. Nor are any logs processed during that
>>>>> time.
>>>>>
>>>>> I figure it must be something to do with the disks,
>>>>> but I am not sure how to solve it. There seems to be
>>>>> little rhyme or reason. It does not necessarily happen
>>>>> during busy times. It can happen in the middle of the
>>>>> night.
>>>>>
>>>>> Any pointers on how to track down the cause would be
>>>>> much appreciated.
>>>>>
>>>>> Tyan S2881 Motherboard - 4gigs mem
>>>>> Using 4 SATA (or SCSI) drives
>>>>> FreeBSD amd64 6.2-STABLE.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Nicole
>>>>
>>>> Nicole,
>>>> What's the driver in use for the SATA and the
>>>> SCSI drives?
>>>> -Garrett
>>>
>>> Hi Garrett
>>> Here is the driver info.
>>>
>>> -- SATA
>>>
>>> atapci0: port
>>> 0xbc00-0xbc07,0xb400-0xb403,0xb000-0xb007,0xac00-0xac03,0xa800-0xa80f
>>> mem 0xfeafec00-0xfeafefff irq 17 at device 5.0 on pci3
>>> ata2: on atapci0
>>> ata3: on atapci0
>>> ata4: on atapci0
>>> ata5: on atapci0
>>> pci3: at device 6.0 (no driver attached)
>>> isab0: at device 7.0 on pci0
>>> isa0: on isab0
>>> atapci1: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf
>>> at device 7.1 on pci0
>>> ata0: on atapci1
>>> ata1: on atapci1
>>> pci0: at device 7.2 (no driver attached)
>>> pci0: at device 7.3 (no driver attached)
>>> pcib2: at device 10.0 on pci0
>>> pci2: on pcib2
>>>
>>> -- SCSI
>>>
>>> ahd0: port 0x8000-0x80ff,0x7800-0x78ff
>>> mem 0xfc89c000-0xfc89dfff irq 24 at device 10.0 on pci2
>>> ahd0: [GIANT-LOCKED]
>>> aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
>>> ahd1: port 0x8800-0x88ff,0x8400-0x84ff
>>> mem 0xfc89e000-0xfc89ffff irq 25 at device 10.1 on pci2
>>> ahd1: [GIANT-LOCKED]
>>> aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
>>> pci0: at device 10.1 (no driver attached)
>>> pcib3: at device 11.0 on pci0
>>> pci1: on pcib3
>>> pci0: at device 11.1 (no driver attached)
>>>
>>> Thanks!
>>>
>>> Nicole
>>
>> Ok, so it's an AMD 8111 northbridge versus an Adaptec onboard SCSI
>> controller.
>>
>> 1. What release / version of FreeBSD are you using? You should upgrade
>> to 6.2 STABLE because there have been a variety of issues worked out
>> in previous releases.
>
> I have a range of versions from 6.1-Pre to 6.2-STABLE
> as of a few months ago.
>
>> 2. Do you have any logs for activity during the hours when it locks up
>> (in particular anything interesting / fishy popping up)?
>
> Nope. That would make it too easy :)
> They commit suicide without a note.
>
>> 3. What scheduler are you using? 4BSD, ULE?
>
> 4BSD
>
>> 4. Does your machine (using the SATA controllers) lock up under heavy
>> load? If so, you may have a northbridge cooling issue that you need to
>> put a fan on. For instance, on the motherboard that I was using for a
>> while (ASUS P5N-E SLI), the northbridge was really close to my CPU
>> heatsink, and there was a lot of heat transfer between my northbridge
>> and CPU heatsink, which was raising the onboard temperatures 5~10
>> degrees C. The new motherboard (ASUS P5B DLX) doesn't do that though.
>
> The lockups seem rather random. I have healthd
> running and they never seem to show very warm. The
> room is cold and the servers have great fans. Although
> healthd can seem wonky, as the CPU temp has actually
> gone below the minimum. Also the -2Volt line seems
> very low. But some servers run forever that way.
>
> At least with SCSI, since it seems to manage itself
> as another layer away from the system, you get some
> error messages. Sort of like Windows 3.1 dropping to
> DOS. Versus SATA issues, where it's just a blue screen of
> death but without even some debugging code.
>
> I am going to try the patch Chuck Swiger sent me and
> see how that affects things. Also try a few
> replacement SATA cards. Although that is always fun,
> especially in 1U servers. As well as seeing if using
> SAS drives may help if I can find some cheap enough.
> Do you think that using the ULE scheduler could
> really help?

Don't try it in 6.x. It's not stable by any means. 7-CURRENT's getting a lot closer though, especially as of late (past week).

-Garrett
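
Since the machines leave no logs when they freeze, one way to at least pin down when each lockup starts is a small heartbeat writer. The following is only a minimal sketch, not something taken from this thread: the function names, the log path, and the 10-second interval are all assumptions for illustration. It appends a timestamp to a file and forces it to disk with fsync, so after a reboot the last stamp before a gap shows roughly when disk writes stopped completing.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one UNIX timestamp to `path` and force it out to disk."""
    stamp = time.time() if now is None else now
    with open(path, "a") as fh:
        fh.write(f"{stamp:.0f}\n")
        fh.flush()
        # fsync blocks until the data reaches the disk; if the disk
        # subsystem hangs, no further stamps will appear in the file.
        os.fsync(fh.fileno())
    return stamp


def read_stamps(path):
    """Read back all timestamps written by write_heartbeat."""
    with open(path) as fh:
        return [float(line) for line in fh if line.strip()]


def find_gaps(stamps, interval, slack=2.0):
    """Return (last_seen, next_seen) pairs that are more than
    interval * slack seconds apart, i.e. likely freeze windows."""
    limit = interval * slack
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > limit]


def run(path="/var/tmp/heartbeat.log", interval=10):
    """Write a heartbeat every `interval` seconds until killed.
    The path and interval are hypothetical defaults."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

Calling run() in the background on an affected box and later passing read_stamps(path) to find_gaps with the same interval lists the windows in which no heartbeat reached the disk. Those windows can then be compared against the squid load and the healthd readings for the same period.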