Subject: Re: ZFS hang status update
From: Dennis Glatting <freebsd@pki2.com>
To: Andriy Gapon
Cc: freebsd-fs@freebsd.org
Date: Fri, 19 Oct 2012 22:38:29 -0700
Message-ID: <1350711509.86715.59.camel@btw.pki2.com>
In-Reply-To: <1350698905.86715.33.camel@btw.pki2.com>
References: <1350698905.86715.33.camel@btw.pki2.com>

On Fri, 2012-10-19 at 19:08 -0700, Dennis Glatting wrote:
> I applied your debugging patch and that system has been running under
> load for 43 hours. I have no idea why.
>
> That said, some of my prior batch jobs have run for over a month. There
> was a time when ZFS was fairly stable but took a dive some months ago.
> Boom.

Roughly 49 hours this time, after adding an SFTP transfer (60GB off the
disk-1 pool) and an ls (of a directory in the disk-1 pool) in a while
loop.
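For reference, the extra ls load described above amounts to something like the following sketch (the path is a placeholder, and the original loop ran unbounded rather than for a fixed count):

```shell
#!/bin/sh
# Hypothetical re-creation of the extra load: repeatedly list a
# directory on the disk-1 pool while a large SFTP transfer runs in
# another session.  DIR is a placeholder path, not from the original.
DIR=${1:-/disk-1}
i=0
while [ "$i" -lt 5 ]; do        # the original loop ran unbounded
    ls -l "$DIR" > /dev/null
    i=$((i + 1))
done
echo "completed $i iterations"
```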
My pools:

mc# zpool status
  pool: disk-1
 state: ONLINE
  scan: scrub repaired 0 in 0h38m with 0 errors on Tue Oct 16 16:47:51 2012
config:

        NAME        STATE     READ WRITE CKSUM
        disk-1      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
        cache
          da0       ONLINE       0     0     0

errors: No known data errors

  pool: disk-2
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Tue Oct 16 17:05:43 2012
config:

        NAME        STATE     READ WRITE CKSUM
        disk-2      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0

errors: No known data errors

camcontrol output (statically linked and stored on an md):

mc# /mnt/camcontrol tags da0 -v
(no output; the session hung.)

mc# /mnt/camcontrol tags da1 -v    (** swap disk **)
(pass1:mps0:0:5:0): dev_openings 255
(pass1:mps0:0:5:0): dev_active 0
(pass1:mps0:0:5:0): devq_openings 255
(pass1:mps0:0:5:0): devq_queued 0
(pass1:mps0:0:5:0): held 0
(pass1:mps0:0:5:0): mintags 2
(pass1:mps0:0:5:0): maxtags 255

mc# /mnt/camcontrol tags da2 -v
(pass2:mps0:0:6:0): dev_openings 245
(pass2:mps0:0:6:0): dev_active 10
(pass2:mps0:0:6:0): devq_openings 245
(pass2:mps0:0:6:0): devq_queued 0
(pass2:mps0:0:6:0): held 0
(pass2:mps0:0:6:0): mintags 2
(pass2:mps0:0:6:0): maxtags 255

mc# /mnt/camcontrol tags da3 -v
(pass3:mps0:0:7:0): dev_openings 245
(pass3:mps0:0:7:0): dev_active 10
(pass3:mps0:0:7:0): devq_openings 245
(pass3:mps0:0:7:0): devq_queued 0
(pass3:mps0:0:7:0): held 0
(pass3:mps0:0:7:0): mintags 2
(pass3:mps0:0:7:0): maxtags 255

mc# /mnt/camcontrol tags da4 -v
(pass4:mps0:0:8:0): dev_openings 245
(pass4:mps0:0:8:0): dev_active 10
(pass4:mps0:0:8:0): devq_openings 245
(pass4:mps0:0:8:0): devq_queued 0
(pass4:mps0:0:8:0): held 0
(pass4:mps0:0:8:0): mintags 2
(pass4:mps0:0:8:0): maxtags 255

mc# /mnt/camcontrol tags da5 -v
(pass5:mps0:0:9:0): dev_openings 245
(pass5:mps0:0:9:0): dev_active 10
(pass5:mps0:0:9:0): devq_openings 245
(pass5:mps0:0:9:0): devq_queued 0
(pass5:mps0:0:9:0): held 0
(pass5:mps0:0:9:0): mintags 2
(pass5:mps0:0:9:0): maxtags 255

mc# /mnt/camcontrol tags da6 -v
(pass6:mps0:0:10:0): dev_openings 245
(pass6:mps0:0:10:0): dev_active 10
(pass6:mps0:0:10:0): devq_openings 245
(pass6:mps0:0:10:0): devq_queued 0
(pass6:mps0:0:10:0): held 0
(pass6:mps0:0:10:0): mintags 2
(pass6:mps0:0:10:0): maxtags 255

mc# /mnt/camcontrol tags da7 -v
(pass7:mps0:0:11:0): dev_openings 245
(pass7:mps0:0:11:0): dev_active 10
(pass7:mps0:0:11:0): devq_openings 245
(pass7:mps0:0:11:0): devq_queued 0
(pass7:mps0:0:11:0): held 0
(pass7:mps0:0:11:0): mintags 2
(pass7:mps0:0:11:0): maxtags 255

mc# /mnt/camcontrol tags da8 -v    (** OS hdw RAID1 **)
(pass8:mps1:0:0:0): dev_openings 245
(pass8:mps1:0:0:0): dev_active 10
(pass8:mps1:0:0:0): devq_openings 245
(pass8:mps1:0:0:0): devq_queued 0
(pass8:mps1:0:0:0): held 0
(pass8:mps1:0:0:0): mintags 2
(pass8:mps1:0:0:0): maxtags 255

mc# /mnt/camcontrol tags da9 -v
(pass9:mps1:0:9:0): dev_openings 251
(pass9:mps1:0:9:0): dev_active 4
(pass9:mps1:0:9:0): devq_openings 251
(pass9:mps1:0:9:0): devq_queued 0
(pass9:mps1:0:9:0): held 0
(pass9:mps1:0:9:0): mintags 2
(pass9:mps1:0:9:0): maxtags 255

mc# /mnt/camcontrol tags da10 -v
(pass10:mps1:0:11:0): dev_openings 251
(pass10:mps1:0:11:0): dev_active 4
(pass10:mps1:0:11:0): devq_openings 251
(pass10:mps1:0:11:0): devq_queued 0
(pass10:mps1:0:11:0): held 0
(pass10:mps1:0:11:0): mintags 2
(pass10:mps1:0:11:0): maxtags 255

I did not run procstat before the reboot; I wasn't sure whether that
would be redundant with the information in my prior email.

This is da0 (the cache SSD), the device on which camcontrol hung. It is
on the same controller:

da0 at mps0 bus 0 scbus0 target 3 lun 0
da0: Fixed Direct Access SCSI-6 device
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 244198MB (500118192 512 byte sectors: 255H 63S/T 31130C)
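When per-device output like the above has been saved to files, the stuck queues stand out quickly in the dev_active counts (10 outstanding commands on every disk-1 member, versus 0 on the idle swap disk). A minimal triage sketch, assuming awk is available and `tags-da2.txt` is a hypothetical file holding saved `camcontrol tags da2 -v` output:

```shell
# Print the dev_active count from saved "camcontrol tags <dev> -v"
# output.  A dev_active that stays nonzero while the pool makes no
# progress suggests commands are stuck at the HBA/driver level rather
# than queued inside ZFS.  tags-da2.txt is a hypothetical capture file.
awk '/dev_active/ { print $NF }' tags-da2.txt
```

Run against each capture in turn; in the listings above, da2 through da8 would all report 10 while da1 reports 0 and the disk-2 mirror members report 4.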