From: "Valeri Galtsev"
Reply-To: galtsev@kicp.uchicago.edu
To: "shahzaib shahzaib"
Cc: freebsd-questions@freebsd.org
Date: Wed, 9 Mar 2016 11:03:02 -0600 (CST)
Subject: Re: FreeBSD Crashes Intermittently !!
Message-ID: <31151.128.135.52.6.1457542982.squirrel@cosmo.uchicago.edu>

On Wed, March 9, 2016 6:24 am, shahzaib shahzaib wrote:
> Hi,
>
> I am new to this mailing list so please pardon me for any mistakes. We've
> started using FreeBSD from past 4-5 months and facing auto-reboot crash
> issue since the beginning.
> Following are the servers specs :
>
> Supermicro X5690 (12 cores, 24 threads - 2u)
> 96GB RAM
> 12x3TB mirror+stripping (HBA-LSI9211)
> X8DT3 Board
>
> We've total of 5 supermicro servers built upon same hardware and all of
> them intermittently goes down and sometimes they crash and boot up
> automatically (within 6min) and sometimes they gets freeze and we've to
> manually boot them via IPMI interface. All the time we get 'MCA Internal
> Timer Error' in crash logs. Here is the recent one :
>
> http://pastebin.com/042SJ11c
>
> Once we reported this issue to our hardware vendor he said that its due to
> FreeBSD incompatibility with hardware and suggested us to try installing
> Linux on one of them and so did we proceeded with Debian on one of them
> them but all in vain and server was still crashing. Once we reported him
> about his failed proposal he then said that it could be related to
> application which is causing this crash.

Not correct. Normally, on neither FreeBSD nor Linux can an application
crash the system. The worst that can happen is that the application's
process (or processes) will die or get killed. You quite likely have a
hardware problem. It doesn't seem your hardware vendor did a burn-in test
of your boxes.

>
> Now if he really is right then RAM should first swapped out to its full in
> order to make OS crash

Not correct. Normally, if you run the system out of memory (including
swap), one or a few processes can get killed, but the system will not
crash. It may appear locked up (unresponsive) for some time if you have a
large swap: on every process switch it has to swap memory pages in, and
swap something else out, to get to the next process. That is why I prefer
not to have swap on huge-memory boxes, or at least never to have a large
swap.

> but never did that happened, we've never been out of
> Memory as 96GB RAM is pretty high. We've also took some precaution to debug
> this issue :
>
> - Replacing Power-Supply.
> - Reducing CMOS in BIOS.
> - Disabling Intel Powersaving features.
> - Upgrade Bios
>
>
> Now we do not know how and what to debug. If you need more details, please
> visit following thread which we created 2 months back :
>
> https://forums.freebsd.org/threads/54412/
>

To simplify your life, update to the latest (and yes, stay with RELEASE -
which I see you have in that forum thread).

> Now i am confused if application really can crash server without swapping
> it out ? Could there be any php function which could make a crash :-| . Is
> FreeBSD is the cause of crash ? Things are pretty blurred right now :(.
> Here is the Kernel tuning values :

Again, no: an "application" cannot crash the kernel. Apart from hardware,
only what runs in kernel context can, e.g. hardware drivers.

With your machine I would first make sure your hardware is sane. Here is
what I would do if it were my box:

1. Go into the BIOS and make sure the temperature thresholds are not too
   low (even though this doesn't seem to be your case), and disable BIOS
   hardware memory hole re-mapping.

2. Inspect what is installed inside the box and how (e.g. some cards may
   not be installed well, not fully engaged in their connectors - which
   doesn't seem to be your case either).

3. Check that all memory is of the same brand and the same type (which is
   likely to be your problem).

4. Check that the RAM and CPUs are on the motherboard manufacturer's list
   of supported parts for this particular motherboard model.

5. If not all memory slots are filled, check the motherboard manual for
   how to partially fill the memory slots. Basically, if the memory bus
   lines are not terminated, one should first fill the slots farthest from
   the CPU or memory controller (thus avoiding reflections from the end of
   an unterminated transmission line).

6. Leave a minimal amount of hardware in the motherboard, and see whether
   the box still crashes.
   This means: remove all added cards that the machine can run without
   (for testing purposes), remove all CPUs except CPU #0, which the
   machine boots off, and put in a minimal amount of RAM (these days
   memory controllers are on the CPU substrate, so make sure the RAM is
   plugged into slots connected to CPU #0). Make sure you remove all
   components first, then re-install the minimal set. Run a memory test
   (memtest86; you can find bootable CDs which include memtest86).

Observe anti-static precautions! I've seen memory chips that were slightly
fried by static discharge (other electronic components may be like that
too). They worked as if they were good, but at some point later they
started failing. They may be a bit out of spec after a static discharge,
which may cause random errors.

What hardware components can cause problems like yours (apart from using
CPUs or RAM that are not supported by the motherboard)? The motherboard
itself (e.g., micro cracks in some PCB traces), RAM (most likely), CPU,
and poorly installed PCI-X (or PCI, PCI-E) cards.

Good luck troubleshooting!

Incidentally, I do have a bunch of Supermicro system board based boxes
running FreeBSD 9.3 and FreeBSD 10.2, and none of them ever crashes.

Valeri

>
> http://pastebin.com/nEnxkV6y
>
> Please help us further !!
>
> Regards.
> Shahzaib
> _______________________________________________
> freebsd-questions@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to
> "freebsd-questions-unsubscribe@freebsd.org"
>

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++