Message-ID: <21232.4489.544435.898780@khavrinen.csail.mit.edu>
Date: Mon, 3 Feb 2014 17:00:41 -0500
From: Garrett Wollman
To: "Kenneth D. Merry"
Subject: Re: Heap overflow in mps(4) (was: Re: stable/9 mps(4) rev 254938 == BOOM!)
In-Reply-To: <20140131003342.GA11755@nargothrond.kdm.org>
Cc: freebsd-scsi@freebsd.org, scottl@freebsd.org, freebsd-stable@freebsd.org

Kenneth D. Merry said:

> The attached patch should fix the leaked allocations. I'm CCing Steve and
> Kashyap at LSI so that they can verify that this is the right place to do
> the mapping shutdown.

It does fix the leak.

> I don't know yet why that particular change is causing problems. Perhaps
> it just moved things around and exposed an existing problem.

> The fact that the redzone code doesn't expose any problems makes it more
> likely that it is a problem other than a heap overflow.

> Since it is consistent, is there any chance you could hook up remote gdb to
> the box and poke around when it crashes? Perhaps you'll see something
> interesting that will point to the problem.

No way to do a remote GDB, unfortunately. However, I tried a few
other things:

- It makes no difference whether mps.ko is preloaded or loaded in
  single-user mode.
- If I boot a kernel/modules without redzone, loading mps.ko
  instapanics, in a very different place (apologies for the poor
  transcription; I can either be up in the machine room to plug in USB
  sticks or use the serial console, not both):

--- trap 0xc, rip = 0xffff....f807e934a, rsp = 0xff...94da4c48f0, rbp = 0xff...94da4c4950 ---
bzero() at bzero+0xa/frame 0xff...94da4c4af0
mpssas_add_device() at mpssas_add_device+0x78/frame 0xff..94da4c4af0
mpssas_firmware_event_work() at mpssas_firmware_event_work+0x437/frame 0xff....94da4c4b78
taskqueue_run_locked() at taskqueue_run_locked+0x74/frame 0xff..94da4c4bc0
taskqueue_thread_loop() at taskqueue_thread_loop+0x46/frame 0xff..94da4c4be0

Inspection of the code does not reveal any arc from
mpssas_add_device to bzero. The return address in the frame is the
location of the first function call (to mpssas_startup_increment()) in
mpssas_add_device(). So I think it's fair to say that something is
scribbling over memory in quite a bad way.

Two things that may be relevant: on boot, this server's MPT2 BIOS
always complains "adapter configuration may have changed", and I
haven't discovered anything in the configuration utility that changes
this. Also, on boot, I always get the following messages:

failure at /usr/src-9-stable/sys/dev/mps/mps_sas_lsi.c:667/mpssas_add_device()! Could not get ID for device with handle 0x0010
mpssas_fw_work: failed to add device with handle 0x10

This has been true across mps(4) revisions, on all three copies of
this hardware that I have in service.

-GAWollman