From owner-freebsd-scsi@freebsd.org Thu Aug 27 00:09:36 2015 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9135A9C35D8 for ; Thu, 27 Aug 2015 00:09:36 +0000 (UTC) (envelope-from allan@physics.umn.edu) Received: from mail.physics.umn.edu (smtp.spa.umn.edu [128.101.220.4]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 62F52B61 for ; Thu, 27 Aug 2015 00:09:35 +0000 (UTC) (envelope-from allan@physics.umn.edu) Received: from peevish.spa.umn.edu ([128.101.220.230]) by mail.physics.umn.edu with esmtpsa (TLSv1:AES256-SHA:256) (Exim 4.77 (FreeBSD)) (envelope-from ) id 1ZUkPd-0005CQ-1U for freebsd-scsi@freebsd.org; Wed, 26 Aug 2015 18:47:25 -0500 To: freebsd-scsi@freebsd.org From: Graham Allan Subject: kernel panic on 9.3 with mps reinit Organization: Physics, University of Minnesota Message-ID: <55DE5008.7000504@physics.umn.edu> Date: Wed, 26 Aug 2015 18:47:20 -0500 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Aug 2015 00:09:36 -0000 I have a biggish zfs file server running 9.3-RELEASE-p19 which has a habit of panicking during drive add/remove/scan operations. We made the mistake of using some "not great" desktop drives for our first pool, so there is some turnover of drives through failure. What I find is we can generally replace 2-3 drives over a period of some months, then at some point new drives are no longer recognised, either upon insertion or with "camcontrol rescan". Under 9.1, I found that running "sas2ircu 0 DISPLAY" would (after waiting for a timeout) reinitialize the mps controller, the drives would be found, and we could continue. But on 9.3, the same attempt results in a panic. Attached I have some remotely-received syslog output from a spontaneous panic when a drive suffered hardware failure - looks like the same pattern, with panic shortly after mps reinit. This might be somewhat useless as I have never got a crash dump out of this system; possibly swap is too much smaller than RAM to get a minidump. But if you have any suggestions to debug further I'd welcome them. The system has four Supermicro SC847-JBOD drive chassis attached to a single 9207-8e controller (this used to be one controller per chassis but we had other issues with that back on FreeBSD 9.1). It can operate under heavy load for months on end so I don't believe cabling problems are an issue, it seems mostly related to state after drive loss. Thanks in advance for any suggestions, Graham > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): READ(10). CDB: 28 00 26 0b df 90 00 00 e8 00 length 118784 SMID 772 terminated ioc 804b scsi 0 state 0 xfer 0 > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): READ(6). CDB: 08 00 02 10 10 00 length 8192 SMID 477 terminated ioc 804b scsi 0 state 0 xfer 0 > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): READ(16). CDB: 88 00 00 00 00 01 5d 50 9e 10 00 00 00 10 00 00 length 8192 SMID 212 terminated ioc 804b scsi 0 state 0 xfer 0 > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): READ(16). CDB: 88 00 00 00 00 01 5d 50 a0 10 00 00 00 10 00 00 length 8192 SMID 174 terminated ioc 804b scsi 0 state 0 xfer 0 > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): READ(10). CDB: 28 00 26 0b de a8 00 00 e8 00 > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information) > Aug 22 15:16:46 hostname kernel: (da58:mps0:0:87:0): Retrying command (per sense data) > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): READ(10). CDB: 28 00 26 0b de a8 00 00 e8 00 length 118784 SMID 921 terminated ioc 804b scsi 0 state 0 xfer 0 > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): READ(10). CDB: 28 00 26 0b df 90 00 00 e8 00 length 118784 SMID 877 terminated ioc 804b scsi 0 state 0 xfer 0 > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): READ(6). CDB: 08 00 02 10 10 00 length 8192 SMID 330 terminated ioc 804b scsi 0 state 0 xfer 0 > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): READ(16). CDB: 88 00 00 00 00 01 5d 50 a0 10 00 00 00 10 00 00 length 8192 SMID 461 terminated ioc 804b scsi 0 state 0 xfer 0 > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): READ(16). CDB: 88 00 00 00 00 01 5d 50 9e 10 00 00 00 10 00 00 > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information) > Aug 22 15:16:55 hostname kernel: (da58:mps0:0:87:0): Retrying command (per sense data) > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information) > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): Retrying command (per sense data) > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure) > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): Error 5, Retries exhausted > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): READ(10). CDB: 28 00 32 5a d1 c0 00 00 20 00 > Aug 22 18:56:19 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > ... snip ... > Aug 22 18:56:25 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:25 hostname kernel: (da58:mps0:0:87:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure) > Aug 22 18:56:25 hostname kernel: (da58:mps0:0:87:0): Error 5, Retries exhausted > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): Retrying command (per sense data) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): Error 5, Retries exhausted > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): Retrying command (per sense data) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): Error 5, Retries exhausted > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): Retrying command (per sense data) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): CAM status: SCSI Status Error > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI status: Check Condition > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure) > Aug 22 18:56:26 hostname kernel: (da58:mps0:0:87:0): Error 5, Retries exhausted > Aug 22 18:56:49 hostname kernel: da58 at mps0 bus 0 scbus0 target 87 lun 0 > Aug 22 18:56:49 hostname kernel: da58: s/n WD-WCC1T0276990 detached > Aug 22 18:56:49 hostname kernel: (da58:mps0:0:87:0): Periph destroyed > Aug 22 18:56:52 hostname kernel: mpssas_get_sata_identify: error reading SATA PASSTHRU; iocstatus = 0x804b > Aug 22 18:56:53 hostname kernel: mps0: _mapping_add_new_device: failed to add the device with handle 0x00a8 to persistent table because there is no free space available > Aug 22 18:57:53 hostname kernel: mps0: Calling Reinit from mps_wait_command > Aug 22 18:57:53 hostname kernel: mps0: Reinitializing controller, > Aug 22 18:57:54 hostname kernel: mps0: Firmware: 16.00.00.00, Driver: 16.00.00.00-fbsd > Aug 22 18:57:54 hostname kernel: mps0: IOCCapabilities: 1a85c > Aug 22 18:57:54 hostname kernel: mps0: mps_reinit finished sc 0xffffff8001877000 post 4 free 3 > Aug 22 18:57:54 hostname kernel: mps0: Reinit success > Aug 22 18:57:54 hostname kernel: mpssas_get_sata_identify: request for page completed with error 60failure at /usr/src/sys/dev/mps/mps_sas_lsi.c:667/mpssas_add_device()! Could not get ID for device with handle 0x00a8 > Aug 22 18:57:54 hostname kernel: mpssas_fw_work: failed to add device with handle 0xa8 > Aug 22 18:57:54 hostname kernel: _mapping_update_missing_count: device is already removed from mapping table > Aug 22 18:57:56 hostname kernel: > Aug 22 18:57:56 hostname kernel: > Aug 22 18:57:56 hostname kernel: Fatal trap 12: page fault while in kernel mode > Aug 22 18:57:56 hostname kernel: cpuid = 2; apic id = 02 > Aug 22 18:57:56 hostname kernel: fault virtual address#011= 0x0 > Aug 22 18:57:56 hostname kernel: fault code#011#011= supervisor write data, page not present > Aug 22 18:57:56 hostname kernel: instruction pointer#011= 0x20:0xffffffff805bbc19 > Aug 22 18:57:56 hostname kernel: stack pointer#011 = 0x28:0xffffff905c4dc640 > Aug 22 18:57:56 hostname kernel: frame pointer#011 = 0x28:0xffffff905c4dc650 > Aug 22 18:57:56 hostname kernel: code segment#011#011= base 0x0, limit 0xfffff, type 0x1b > Aug 22 18:57:56 hostname kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 > Aug 22 18:57:56 hostname kernel: processor eflags#011= interrupt enabled, resume, IOPL = 0 > Aug 22 18:57:56 hostname kernel: current process#011#011= 0 (mps0 taskq) > Aug 22 18:57:56 hostname kernel: trap number#011#011= 12 --