From owner-freebsd-current@freebsd.org Mon Jan 4 10:41:54 2016
Date: Mon, 4 Jan 2016 03:34:09 -0700 (MST)
From: shahzaibcb
To: freebsd-current@freebsd.org
Message-ID: <1451903649383-6064691.post@n5.nabble.com>
Subject: FreeBsd MCA Panic Crash !!
List-Id: Discussions about the use of FreeBSD-current

Hi,

We recently switched to FreeBSD to accommodate large video storage, as we run a video streaming website. The FreeBSD server's job is to transcode uploaded videos with ffmpeg and serve them to users via the nginx web server, but so far our experience with it has not been very good. It crashes every 2-3 days and we have been unable to track down the problem. The server specs are pretty high:

Supermicro X5690 (12 cores, 24 threads - 2u)
96GB RAM
12x3TB RAID-10 (HBA-LSI9211)

Here is a screenshot of the most recent crash:

http://prntscr.com/9er3pk

One thing worth mentioning: before it goes down there is no load on the server, and free RAM is usually around 12GB.

We've tried the following solutions so far:

- Updated the FreeBSD OS
- Replaced the 800W PS with a 900W one
- Reduced CMOS from MAX (26x) to 18x, as suggested in this post:
  http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic

The solution we have not tried so far is:

- Disabling MCA (hw.mca.enabled: 0), as we're getting MCA panics.

Here is the crash dump:

[root@cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44

-----------------------------------------------------------------------------------

I showed those hardware errors to the vendor from whom we purchased the Supermicro servers. This is what he had to say:

-----------------------------------
Why do you not made one test environment with CentOS or one other Linux that you know to use, and see if you have same errors ??? if not than you know that the errors come from OS not from hardware. ( CentOS, RedHead…. work diferend like FreeBSD – work direct on hardware if you don’t have the right kernel settings can the server crashed. CentOS , RedHead…. don’t work direct on hardware and distribute the resource load better and you have better control and you can better debug one situation)
-----------------------------------

Now we're in a black hole, unable to determine whether the issue lies with FreeBSD or with the hardware. We're thinking of disabling MCA in loader.conf, but people advise against it.

If you guys can help us, it'd be very kind.
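
For anyone reading the dump, the STATUS value can be checked by hand against what mcelog printed. The following is only a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel SDM (flag bits 63 down to 57, model-specific error code in bits 31:16, MCA error code in bits 15:0). The names MCI_STATUS_FLAGS and decode_mci_status are made up for illustration and are not part of mcelog or FreeBSD.

```python
# Hypothetical helper: decode the flag bits and error codes of an
# IA32_MCi_STATUS value. Bit positions are assumed from the architectural
# layout in the Intel SDM; nothing here comes from mcelog itself.
MCI_STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_mci_status(status):
    """Return the set flags and the two 16-bit error code fields."""
    flags = [
        name
        for bit, name in sorted(MCI_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }


# STATUS value taken from the dump above.
decoded = decode_mci_status(0xBE00000000800400)
for flag in decoded["flags"]:
    print(flag)
print("model-specific code: 0x%04x" % decoded["model_specific_code"])
print("MCA error code:      0x%04x" % decoded["mca_error_code"])
```

The set flags line up with the "MCi status" lines in the dump (uncorrected, enabled, MISC and ADDR valid, processor context corrupt), and the low 16 bits come out as 0x0400, which is the value mcelog reports as "Internal Timer error".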