From: "Valeri Galtsev"
To: freebsd-questions@freebsd.org
Reply-To: galtsev@kicp.uchicago.edu
Date: Wed, 21 Dec 2016 12:13:08 -0600 (CST)
Subject: Re: Where can I get help for debugging system crash ?

On Wed, December 21, 2016 11:49 am, Matthew Seaman wrote:
> On 21/12/2016 14:00, Manish Jain wrote:
>> I am running a FreeBSD 11 amd64 box. The box generally works well, but
>> once in a while (about once a month), the system produces a crash, with
>> a large core file at /var/crash. I had a crash yesterday. The info.0 for
>> the last core reads as :
>>
>> Dump header from device: /dev/ada0p3
>> Architecture: amd64
>> Architecture Version: 2
>> Dump Length: 1012834304
>> Blocksize: 512
>> Dumptime: Tue Dec 20 19:05:28 2016
>> Hostname: bourne.1dent1ty
>> Magic: FreeBSD Kernel Dump
>> Version String: FreeBSD 11.0-RELEASE-p1 #0 r306420: Thu Sep 29
>> 01:43:23 UTC 2016
>> root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC
>> Panic String: page fault
>> Dump Parity: 4025560426
>> Bounds: 0
>> Dump Status: good
>>
>> /dev/ada0p3 corresponds to my swap partition. My box has 2 solid state
>> disks, which provide ada0p1 (efi), ada0p2 (ufs), ada0p3 (swap), ada0p4
>> (ufs) and ada1s1 (ufs).
>>
>> I need help to determine exactly what is producing the crash - Is it
>> some hardware problem or some issue with the FreeBSD code ? If anyone
>> can help me get through to the right channel, I will be grateful indeed.
>
> Hi, Manish,
>
> The best thing to do here is to open a PR with what details of the crash
> you can extract from the core dump. You have a full system core, so you
> should be able to follow the instructions here, and extract a backtrace
> from the kernel:
>
> https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-gdb.html
>
> Or generating a textdump will automatically process your saved core and
> produce a textual report with lots of debugging information. This does
> require a modified kernel configuration though. See
> textdump(4) and http://www.etinc.com/122/Using-FreeBSD-Text-Dumps for
> details.
>
> If you can pin the problem down to a particular subsystem or device,
> then that should indicate which mailing list would be a good choice to
> discuss the problem. If it doesn't appear to be in any device or
> sub-system specific part of the kernel, then try asking on
> freebsd-stable@...

Thanks Matthew, this is very instructive.
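Since a PR will want the details from the dump header, here is a minimal sketch of pulling the "Key: value" fields out of an info.N file with Python. The function name parse_dump_header and the SAMPLE text (copied from the header quoted above) are only illustrative, and the sketch assumes that any line without ": " is a continuation of the previous field, as with the wrapped Version String.

```python
def parse_dump_header(text):
    """Parse 'Key: value' lines of a dump header into a dict.

    Assumption: a line without ': ' continues the previous field
    (for example the wrapped 'Version String').
    """
    fields = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(": ")
        if sep:
            # New field: remember its key so continuations can be appended.
            last_key = key
            fields[key] = value.strip()
        elif last_key is not None:
            # Continuation line: append to the previous field's value.
            fields[last_key] += " " + line
    return fields


# Sample text taken from the dump header quoted in this thread.
SAMPLE = """\
Dump header from device: /dev/ada0p3
  Architecture: amd64
  Dumptime: Tue Dec 20 19:05:28 2016
  Version String: FreeBSD 11.0-RELEASE-p1 #0 r306420: Thu Sep 29 01:43:23 UTC 2016
    root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC
  Panic String: page fault
  Dump Status: good
"""

info = parse_dump_header(SAMPLE)
for name in ("Dumptime", "Panic String", "Dump Status"):
    print(name + ":", info[name])
```

On a real system one would read the text from /var/crash/info.0 instead of SAMPLE. Collecting these fields over several crashes makes it easier to see whether the panic string is always the same, which is useful to mention in the PR.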
Manish, before opening a PR, though, I would first make sure there is
nothing fishy with _your_ hardware. Just go over the same old routine
first: re-seat all cards. Check that all fans are spinning (especially
the CPU ones). Re-seat all memory modules (and CPUs). Check that all
memory is from the same batch. I've seen memory with the same specs but
from mixed brands cause crashes (very rarely, about once a year for any
given machine, but that was a 32-node cluster, so in any given month
one of the machines in the cluster almost certainly crashed).

Try to run with a single CPU (the system always boots off the CPU in
socket number 0), minimum memory, and without any additional cards in
the expansion slots (unless you can pinpoint a particular card via a
panic inside a particular driver).

The worst one can have is a micro crack in the system board
("motherboard" has been the jargon for it for over a couple of
decades). If you have other hardware with the same model of system
board, try to move everything into that box and see if that box crashes
as well under that system.

Good luck!

Valeri

>
> Cheers,
>
> Matthew
>
>
>

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++