From owner-freebsd-scsi@freebsd.org Mon Jul 3 07:42:20 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C9FA49E0351; Mon, 3 Jul 2017 07:42:20 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x22d.google.com (mail-wm0-x22d.google.com [IPv6:2a00:1450:400c:c09::22d]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 67F3113A1; Mon, 3 Jul 2017 07:42:20 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x22d.google.com with SMTP id 62so159907784wmw.1; Mon, 03 Jul 2017 00:42:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=fXwnBcvuGOil6HHb8sjqt8q8Sd0CRsYc8QKn66qIsqc=; b=bycwTX8I25IcAuh86TgBS4ofe/XvtneU1PppCxT4V7HiIyd46hORaHWG2EVsByqemu ApTh5cFi1f8n1KDG2tyzKE4m+AUcZtiz3XnnPVFN2oYH28gfgNdjSZG3ey/YlJoTvpjf 2H+GNrMMBE2WMWIUah92k7toD9tX/DvRGIegae8GmeQGdtJ8NeAuFK8BsIy6kCN5UOQJ EO+Q6vqyH26LbHixpB68pulgxrNdE5izJxe7K3DPCfOMDvTW8cBrDB9Rcy/OtNCxUUBc GlZhl04txdaGrK69mPOCubSyPTuEtZEZ7NIWOLaAuBo/rAFx6AaMdZyIxj+cPr1fRzZQ oi4Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=fXwnBcvuGOil6HHb8sjqt8q8Sd0CRsYc8QKn66qIsqc=; b=i8dfsMltn5asQZE3jBnYYzaJ3fBNkoKNgT8GTOghVNk03ngntxP3fZu5YdeSM6wQqd Tp36RSbdm4iddcKQbNIYbPo5CN8En0OctV5gGhEWl+5O5ziKQHYrlzM6LjDoVUzBnAgZ JMEXeYNDOkAuTDHHdNc3xxwnKPJj5zZz4Mrv1ph05tpAQArP1rYatYzJ0MWy9ogrysR/ 0QR6YLN/jdN9KSKdV2VtPCL+GgSdkcawyRatyOWNPOyKgln0GA4xXpdyJi/FUUfKbz1z sXYDj6wVuA5WzvotvZNsFqrIHm8X3ZhLOGo+0LcWEF3Gp+BZe42Vrlf8h0l3sg5Dz8T7 xRHA== X-Gm-Message-State: AKS2vOz8OHFZBdU8QFLS7AzM4TZ/UIJZpO/h5hpxUCS4+k8EVexK6F1H DJmKzTFbHEBOE8u0cPk= X-Received: by 10.28.236.19 with SMTP id k19mr22260562wmh.30.1499067737636; Mon, 03 Jul 2017 00:42:17 -0700 (PDT) Received: from ben.home (LFbn-1-11339-180.w2-15.abo.wanadoo.fr. [2.15.165.180]) by smtp.gmail.com with ESMTPSA id q17sm1248103wmd.4.2017.07.03.00.42.16 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 03 Jul 2017 00:42:16 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: I/O to pool appears to be hung, panic ! From: Ben RUBSON X-Priority: 3 In-Reply-To: Date: Mon, 3 Jul 2017 09:42:15 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: References: To: freebsd-scsi@freebsd.org, Freebsd fs X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 07:42:20 -0000 > On 29 Jun 2017, at 20:40, Karli Sj=C3=B6berg = wrote: >=20 >> Den 29 juni 2017 3:37 em skrev Ben RUBSON : >>=20 >> I also tried to look for some LSI SAS2008 error counters (on target = side), >> but did not found anything interesting. >> (sysctl -a | grep -i mps) >=20 > Here's a well kept secret I got from LSI once: >=20 > /boot/loader.conf: > dev.mps.0.debug_level=3D"0x1F" Thank you Karli for this tip ! I just noticed this is explained in MPS(4), DEBUGGING section. 0x1F is then : 0x0001 Enable informational prints (set by default). 0x0002 Enable prints for driver faults (set by default). 0x0004 Enable prints for controller events. 0x0008 Enable prints for controller logging. 0x0010 Enable prints for tracing recovery operations. Controller events are however extremely verbose. Perhaps 0x1B is then slightly better. Here is a useful link to "decode" logged messages : http://blog.disksurvey.org/blog/2014/08/10/decoding-lsi-loginfo-codes/ Ben= From owner-freebsd-scsi@freebsd.org Mon Jul 3 09:07:11 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B98F29E1B47; Mon, 3 Jul 2017 09:07:11 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x22f.google.com (mail-wm0-x22f.google.com [IPv6:2a00:1450:400c:c09::22f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4C23A642A0; Mon, 3 Jul 2017 09:07:11 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x22f.google.com with SMTP id 62so162105234wmw.1; Mon, 03 Jul 2017 02:07:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=pHIxfnfO1kGjFbqmb23JJjYccYBWsUl9Y7g0WX6v6/s=; b=kvUMP3L93038P6ZUGTVt63XhSmAQZ111RDl2JkveNZTRIvifpW1ldTIKok9N+PRzmf 6UtW5IeI+x8ezGlJCiNcsuhimJJnm5YyI0JluisZ/gx7OleRSPKY2PIUtpdi/Ib3t3Vp TYVxRNkP+hucTxedrLDup32TGAYg74chnRW0FB1yxk3VCqQEbs8upM2PZAvFK2L3sju/ exJwR20L/pgVD+yUl1ZZs8OHAT+pjnZEp+I/Z+LGsaBBnnzNd7eM2drhP/rTIWohp2bX pqITY/jjs3rRpXDzOdmlOFHqwIw7aEjNc4Gn1HiXfFlD6cmIuefsHQPys4lWu93a7rL7 5bYQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=pHIxfnfO1kGjFbqmb23JJjYccYBWsUl9Y7g0WX6v6/s=; b=TSiNIVb5ukyyuBsYLPZu0+0E/Z8lFUkhpaGU9NakFP4Au0RnHPErUln7stlAzZsLMY AlHbON4Vfd+mRWXA1BOaPjYeYc7NcWnV1qVfZ5NPXZ5oAZYvCY82VHuio9VdXiqzV/65 Rw5A9rAetRQ2Q0XbfVTiGjyZfFh0c4vm+p2GhP1nNlKA0W9G1pLQWLEC4sXObTc9SVl5 79CJV+OiiRJqby5EU2ketEzbHv2v/u8KMp7H56UcPBlFBk7G5l+Re4MpAhsxOwjT4lFj m8bfggTsAHYzEBQuWftzO0MmSu6IqdXtUHro5vkVsZBx9AkjhzqNfhj5JsiIlkPt07oz ropg== X-Gm-Message-State: AKS2vOyZi6hFyEOMMlNxZ5AoSXQalWsyBsavVddnZwrAg0A95BTUU5rJ qhvH+hop9uWmn1Snljk= X-Received: by 10.28.54.217 with SMTP id y86mr15164248wmh.81.1499072829298; Mon, 03 Jul 2017 02:07:09 -0700 (PDT) Received: from ben.home (LFbn-1-11339-180.w2-15.abo.wanadoo.fr. [2.15.165.180]) by smtp.gmail.com with ESMTPSA id c27sm13078689wrb.44.2017.07.03.02.07.08 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 03 Jul 2017 02:07:08 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: I/O to pool appears to be hung, panic ! From: Ben RUBSON In-Reply-To: Date: Mon, 3 Jul 2017 11:07:06 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: References: <20170629144334.1e283570@fabiankeil.de> To: Freebsd fs , freebsd-scsi@freebsd.org X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 09:07:11 -0000 > On 29 Jun 2017, at 15:36, Ben RUBSON wrote: >=20 >> On 29 Jun 2017, at 14:43, Fabian Keil = wrote: >=20 > Thank you for your feedback Fabian. >=20 >> Ben RUBSON wrote: >>=20 >>> One of my servers did a kernel panic last night, giving the = following message : >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at = '/dev/label/G23iscsi'. >> [...]=20 >>> Here are some numbers regarding this disk, taken from the server = hosting the pool : >>> (unfortunately not from the iscsi target server) >>> https://s23.postimg.org/zd8jy9xaj/busydisk.png >>>=20 >>> We clearly see that suddendly, disk became 100% busy, meanwhile CPU = was almost idle. We also clearly see that 5 minutes later (02:09) disk seems to be back = but became 100% busy again, and that 16 minutes later (default vfs.zfs.deadman_synctime_ms), panic = occurred. >>> No error message at all on both servers. >> [...] >>> The only log I have is the following stacktrace taken from the = server console : >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at = '/dev/label/G23iscsi'. >>> cpuid =3D 0 >>> KDB: stack backtrace: >>> #0 0xffffffff80b240f7 at kdb_backtrace+0x67 >>> #1 0xffffffff80ad9462 at vpanic+0x182 >>> #2 0xffffffff80ad92d3 at panic+0x43 >>> #3 0xffffffff82238fa7 at vdev_deadman+0x127 >>> #4 0xffffffff82238ec0 at vdev_deadman+0x40 >>> #5 0xffffffff82238ec0 at vdev_deadman+0x40 >>> #6 0xffffffff8222d0a6 at spa_deadman+0x86 >>> #7 0xffffffff80af32da at softclock_call_cc+0x18a >>> #8 0xffffffff80af3854 at softclock+0x94 >>> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f >>> #10 0xffffffff80a936f6 at ithread_loop+0xc6 >>> #11 0xffffffff80a900d5 at fork_exit+0x85 >>> #12 0xffffffff80f846fe at fork_trampoline+0xe >>> Uptime: 92d2h47m6s >>>=20 >>> I would have been pleased to make a dump available. >>> However, despite my (correct ?) configuration, server did not dump : >>> (nevertheless, "sysctl debug.kdb.panic=3D1" make it to dump) >>> # grep ^dump /boot/loader.conf /etc/rc.conf >>> /boot/loader.conf:dumpdev=3D"/dev/mirror/swap" >>> /etc/rc.conf:dumpdev=3D"AUTO" >>=20 >> You may want to look at the NOTES section in gmirror(8). >=20 > Yes, I should already be OK (prefer algorithm set). >=20 >>> I use default kernel, with a rebuilt zfs module : >>> # uname -v >>> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 = root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC=20 >>>=20 >>> I use the following iSCSI configuration, which disconnects the disks = "as soon as" they are unavailable : >>> kern.iscsi.ping_timeout=3D5 >>> kern.iscsi.fail_on_disconnection=3D1 >>> kern.iscsi.iscsid_timeout=3D5 >>>=20 >>> I then think disk was at least correctly reachable during these 20 = busy minutes. >>>=20 >>> So, any idea why I could have faced this issue ? >>=20 >> Is it possible that the system was under memory pressure? >=20 > No I don't think it was : > https://s1.postimg.org/uvsebpyyn/busydisk2.png > More than 2GB of available memory. > Swap not used (624kB). > ARC behaviour seems correct (anon increases because ZFS can't actually = write I think). > Regarding the pool itself, it was receiving data at 6MB/s, sending = around 30kB blocks to disks. > When disk went busy, throughput fell to some kB, with 128kB blocks. >=20 >> geli's use of malloc() is known to cause deadlocks under memory = pressure: >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209759 >>=20 >> Given that gmirror uses malloc() as well it probably has the same = issue. >=20 > I don't use geli so I should not face this issue. >=20 >>> I would have thought ZFS would have taken the busy device offline, = instead of raising a panic. >>> Perhaps it is already possible to make ZFS behave like this ? >>=20 >> There's a tunable for this: vfs.zfs.deadman_enabled. >> If the panic is just a symptom of the deadlock it's unlikely >> to help though. >=20 > I think this tunable should have prevented the server from having = raised a panic : > # sysctl -d vfs.zfs.deadman_enabled > vfs.zfs.deadman_enabled: Kernel panic on stalled ZFS I/O > # sysctl vfs.zfs.deadman_enabled > vfs.zfs.deadman_enabled: 1 >=20 > But not sure how it would have behaved then... > (busy disk miraculously back to normal status, memory pressure due to = anon increasing...) I then think it would be nice, once vfs.zfs.deadman_synctime_ms has = expired, to be able to take the busy device offline instead of raising a panic. Currently, disabling deadman will avoid the panic but will let the = device slowing down the pool. I still did not found the root cause of this issue, not sure I will, quite difficult actually with a stacktrace and some performance graphs = only :/ Thank you again, Ben From owner-freebsd-scsi@freebsd.org Mon Jul 3 09:55:13 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D5D0B9E2B57 for ; Mon, 3 Jul 2017 09:55:13 +0000 (UTC) (envelope-from school9net@vh19.sweb.ru) Received: from vh19.sweb.ru (vh19.sweb.ru [IPv6:2a02:408:7722:1:77:222:61:14]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 94A2E65842 for ; Mon, 3 Jul 2017 09:55:12 +0000 (UTC) (envelope-from school9net@vh19.sweb.ru) Received: from school9net by vh19.sweb.ru with local (Exim 4.85_2) (envelope-from ) id 1dRy4T-000yRw-PI for freebsd-scsi@freebsd.org; Mon, 03 Jul 2017 12:55:09 +0300 To: freebsd-scsi@freebsd.org Subject: Notification status of your delivery (UPS 3481758) X-PHP-Originating-Script: 11215:post.php(3) : regexp code(1) : eval()'d code(17) : eval()'d code Date: Mon, 3 Jul 2017 12:55:09 +0300 Message-Id: From: school9net@vh19.sweb.ru X-Sender-Uid: 11215 Content-Type: text/plain; charset=us-ascii X-Content-Filtered-By: Mailman/MimeDel 2.1.23 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 09:55:13 -0000 Dear Customer, UPS courier was unable to contact you for your parcel delivery. Please check the attachment for details! Many thanks, , UPS Parcels Delivery Agent. From owner-freebsd-scsi@freebsd.org Mon Jul 3 11:10:43 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E182D9E4304; Mon, 3 Jul 2017 11:10:43 +0000 (UTC) (envelope-from etnapierala@gmail.com) Received: from mail-ua0-x22a.google.com (mail-ua0-x22a.google.com [IPv6:2607:f8b0:400c:c08::22a]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 9132D67DB4; Mon, 3 Jul 2017 11:10:43 +0000 (UTC) (envelope-from etnapierala@gmail.com) Received: by mail-ua0-x22a.google.com with SMTP id j53so107316280uaa.2; Mon, 03 Jul 2017 04:10:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=B+0z7ykioZF3GZx8Q1X7MWPjoBaI3/p6BB9cNfE0cmc=; b=pUF/CqD/YNI8eHHwsGlZ173/YQpXrQx2ajROT+1By4vJT969ggqBjetOdRJ/h4sqML AqmUcXwVJ6t/SkhXXtQb1wBI+k/orPbzOdJ00/xh8394f+h9bW4xHSTHj/h0JpfWyO7+ M5oxGnQ8w0eYKupTdk9mORfOt93pDzwYXzOx24+DNfjza3wn8/WtAdCKGcktYKYqZdlc 3mHnxaJIfMV1TcyGgup6rgwV0K7OkHFm02PS2XQKwQ90Smx3VM+QspBHSROnvnuXHEt1 6k9Mcqjn2kzZgrQWQbnXaT2WCgEBaP5wEiLrSJ6zTQ6EJ+B2zVg+KL/EF0Th3P0KIUK9 LCFA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=B+0z7ykioZF3GZx8Q1X7MWPjoBaI3/p6BB9cNfE0cmc=; b=fat9Hew1n5LpCUdHK2U1LFiQ1l/49M/r+/NXaC6lawbcgMiJ8h23Zylda6GsmBH7L7 mjFxdDudxRZUfFr2c8xYPzvtvaS50bRfC03TiN9Nu5seoSko7WjX2OQ7FUels5/rYSHZ MstGe/pwiyyrouMh8wHW7PImkc5ZBuMXNV5ramPcHbJ/nQlxaZwzgAK/silS/t+SsYu0 kGuf1oKxnu4clNh+B9ieufhpDBqqHglf1rqPmekTkrqKgkLuwFTcSf21SYHs81FMRIUV tRNiQAPGIyL/uDsH3Qfy2bSA39UdZ2r3eXeLUgP2bPD12bEeim+FTb/qKxKusZAi4g3u m4mA== X-Gm-Message-State: AKS2vOzIAqm6urmM07y6F8Yhz6MDwuKwikpOyfkCv8yMjjC4/mPCN3lV 7Sm1K8OUd4G16zsO2H55W3KpgLY+lQ== X-Received: by 10.159.32.133 with SMTP id 5mr18355140uaa.123.1499080242238; Mon, 03 Jul 2017 04:10:42 -0700 (PDT) MIME-Version: 1.0 Sender: etnapierala@gmail.com Received: by 10.176.83.198 with HTTP; Mon, 3 Jul 2017 04:10:41 -0700 (PDT) In-Reply-To: References: <20170629144334.1e283570@fabiankeil.de> From: Edward Napierala Date: Mon, 3 Jul 2017 12:10:41 +0100 X-Google-Sender-Auth: D6LVuMcjFaYveirDGoKPI2y9elM Message-ID: Subject: Re: I/O to pool appears to be hung, panic ! To: Ben RUBSON Cc: Freebsd fs , freebsd-scsi Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.23 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 11:10:44 -0000 2017-07-03 10:07 GMT+01:00 Ben RUBSON : > > On 29 Jun 2017, at 15:36, Ben RUBSON wrote: > > > >> On 29 Jun 2017, at 14:43, Fabian Keil > wrote: > > > > Thank you for your feedback Fabian. > > > >> Ben RUBSON wrote: > >> > >>> One of my servers did a kernel panic last night, giving the following > message : > >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at > '/dev/label/G23iscsi'. > >> [...] > >>> Here are some numbers regarding this disk, taken from the server > hosting the pool : > >>> (unfortunately not from the iscsi target server) > >>> https://s23.postimg.org/zd8jy9xaj/busydisk.png > >>> > >>> We clearly see that suddendly, disk became 100% busy, meanwhile CPU > was almost idle. > > We also clearly see that 5 minutes later (02:09) disk seems to be back but > became 100% busy again, > and that 16 minutes later (default vfs.zfs.deadman_synctime_ms), panic > occurred. > > >>> No error message at all on both servers. > >> [...] > >>> The only log I have is the following stacktrace taken from the server > console : > >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at > '/dev/label/G23iscsi'. > >>> cpuid = 0 > >>> KDB: stack backtrace: > >>> #0 0xffffffff80b240f7 at kdb_backtrace+0x67 > >>> #1 0xffffffff80ad9462 at vpanic+0x182 > >>> #2 0xffffffff80ad92d3 at panic+0x43 > >>> #3 0xffffffff82238fa7 at vdev_deadman+0x127 > >>> #4 0xffffffff82238ec0 at vdev_deadman+0x40 > >>> #5 0xffffffff82238ec0 at vdev_deadman+0x40 > >>> #6 0xffffffff8222d0a6 at spa_deadman+0x86 > >>> #7 0xffffffff80af32da at softclock_call_cc+0x18a > >>> #8 0xffffffff80af3854 at softclock+0x94 > >>> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f > >>> #10 0xffffffff80a936f6 at ithread_loop+0xc6 > >>> #11 0xffffffff80a900d5 at fork_exit+0x85 > >>> #12 0xffffffff80f846fe at fork_trampoline+0xe > >>> Uptime: 92d2h47m6s > >>> > >>> I would have been pleased to make a dump available. > >>> However, despite my (correct ?) configuration, server did not dump : > >>> (nevertheless, "sysctl debug.kdb.panic=1" make it to dump) > >>> # grep ^dump /boot/loader.conf /etc/rc.conf > >>> /boot/loader.conf:dumpdev="/dev/mirror/swap" > >>> /etc/rc.conf:dumpdev="AUTO" > >> > >> You may want to look at the NOTES section in gmirror(8). > > > > Yes, I should already be OK (prefer algorithm set). > > > >>> I use default kernel, with a rebuilt zfs module : > >>> # uname -v > >>> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 > root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC > >>> > >>> I use the following iSCSI configuration, which disconnects the disks > "as soon as" they are unavailable : > >>> kern.iscsi.ping_timeout=5 > >>> kern.iscsi.fail_on_disconnection=1 > >>> kern.iscsi.iscsid_timeout=5 > >>> > >>> I then think disk was at least correctly reachable during these 20 > busy minutes. > >>> > >>> So, any idea why I could have faced this issue ? > >> > >> Is it possible that the system was under memory pressure? > > > > No I don't think it was : > > https://s1.postimg.org/uvsebpyyn/busydisk2.png > > More than 2GB of available memory. > > Swap not used (624kB). > > ARC behaviour seems correct (anon increases because ZFS can't actually > write I think). > > Regarding the pool itself, it was receiving data at 6MB/s, sending > around 30kB blocks to disks. > > When disk went busy, throughput fell to some kB, with 128kB blocks. > > > >> geli's use of malloc() is known to cause deadlocks under memory > pressure: > >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209759 > >> > >> Given that gmirror uses malloc() as well it probably has the same issue. > > > > I don't use geli so I should not face this issue. > > > >>> I would have thought ZFS would have taken the busy device offline, > instead of raising a panic. > >>> Perhaps it is already possible to make ZFS behave like this ? > >> > >> There's a tunable for this: vfs.zfs.deadman_enabled. > >> If the panic is just a symptom of the deadlock it's unlikely > >> to help though. > > > > I think this tunable should have prevented the server from having raised > a panic : > > # sysctl -d vfs.zfs.deadman_enabled > > vfs.zfs.deadman_enabled: Kernel panic on stalled ZFS I/O > > # sysctl vfs.zfs.deadman_enabled > > vfs.zfs.deadman_enabled: 1 > > > > But not sure how it would have behaved then... > > (busy disk miraculously back to normal status, memory pressure due to > anon increasing...) > > I then think it would be nice, once vfs.zfs.deadman_synctime_ms has > expired, > to be able to take the busy device offline instead of raising a panic. > Currently, disabling deadman will avoid the panic but will let the device > slowing down the pool. > > I still did not found the root cause of this issue, not sure I will, > quite difficult actually with a stacktrace and some performance graphs > only :/ > What exactly is the disk doing when that happens? What does "gstat" say? If the iSCSI target is also FreeBSD, what does ctlstat say? If everything else fails, you might want to do a packet trace using "tcpdump -w" and take a look at it using Wireshark; it decodes iSCSI quite nicely. From owner-freebsd-scsi@freebsd.org Mon Jul 3 13:36:45 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 12D7F9E6FBB; Mon, 3 Jul 2017 13:36:45 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x232.google.com (mail-wm0-x232.google.com [IPv6:2a00:1450:400c:c09::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 8E44D7093A; Mon, 3 Jul 2017 13:36:44 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x232.google.com with SMTP id z75so38180794wmc.0; Mon, 03 Jul 2017 06:36:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:message-id:mime-version:subject:date:references:to:in-reply-to; bh=2PaOtJ9R5xG6KsuRbyJNwSqCDonjEu/CttKecN2VrjU=; b=kfPk5l5yRHVOXpocQafMYOxs9tnJckT6c6SstP7pVYjYjaE5Ud6MS64FoYwT/OQO/c ttoRfbhn6yLEWR0Ihnwp6aJSmRBtZ6Q+Ii6v859rzpChc+Gp8rp1Ur3OYygh5HURBMtn ekBFMKO5lAtzL6eE61FONMev2MS4dr8D202MM3C2KIYhbASzY+hZdLlcyS5aZnS6mL4v bA62mX9jHOPSnXka126FXrDMwTsSpYfesuIZ/CvALBF3TSVjvSuNxOR9mLQnA5jtFA2f OnThdTwcr39zWBNlKLTflguD87jDGaRLy+Wnlxphd0fcAvaywU7zxCnU+sZlqgRlL5lB W3Jg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :references:to:in-reply-to; bh=2PaOtJ9R5xG6KsuRbyJNwSqCDonjEu/CttKecN2VrjU=; b=oC7n1+tTI3rhHvKMaV5l3LJJ/2YjRvquU1YBfru/4mOiSglMRi05ysQgu3uZwNi2f8 Fe5OyOjokOMDALZzTt9kc0l99YnvEJOOR6kLlJWRBIDuO+pZ21py8JKJjQS3kHUJ7byz 4gRB6Ii5nu0iXopFj8cITRZ/o7+ciP4rxzsltVr2dmQcm595vj48IOU8t30IFMutGAH3 +d9xLwp0Z5EM5Xb3kYdMzjY7XGNsuuFHR/uI2fT1GZjM2NSBZ47G/F7ltM5eL/zry2A2 26s0x6fZvd4AbBYau3m7D0gYt1SgNuI5poQV+7VFtSEQh2HbouZlOXhRXS2RVTPB636Z /nmA== X-Gm-Message-State: AKS2vOx04UjB2hU8KkN8G3d4WHoAed8WfXeg0P0IrVDuffyaoc68wa97 P3Q/qvZbEv3y9D94vNc= X-Received: by 10.28.10.194 with SMTP id 185mr25393286wmk.119.1499089002503; Mon, 03 Jul 2017 06:36:42 -0700 (PDT) Received: from ben.home (LFbn-1-11339-180.w2-15.abo.wanadoo.fr. [2.15.165.180]) by smtp.gmail.com with ESMTPSA id 13sm13619182wrl.57.2017.07.03.06.36.41 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 03 Jul 2017 06:36:41 -0700 (PDT) From: Ben RUBSON Message-Id: <1F414ECE-1856-4EA3-A141-88B64703D4D6@gmail.com> Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: I/O to pool appears to be hung, panic ! Date: Mon, 3 Jul 2017 15:36:40 +0200 References: <20170629144334.1e283570@fabiankeil.de> To: Freebsd fs , freebsd-scsi In-Reply-To: X-Mailer: Apple Mail (2.3124) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.23 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 13:36:45 -0000 > On 03 Jul 2017, at 13:10, Edward Napierala wrote: >=20 > 2017-07-03 10:07 GMT+01:00 Ben RUBSON >: >=20 > > On 29 Jun 2017, at 15:36, Ben RUBSON > wrote: > > > >> On 29 Jun 2017, at 14:43, Fabian Keil > wrote: > > > > Thank you for your feedback Fabian. > > > >> Ben RUBSON > = wrote: > >> > >>> One of my servers did a kernel panic last night, giving the = following message : > >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... = at '/dev/label/G23iscsi'. > >> [...] > >>> Here are some numbers regarding this disk, taken from the server = hosting the pool : > >>> (unfortunately not from the iscsi target server) > >>> https://s23.postimg.org/zd8jy9xaj/busydisk.png = > >>> > >>> We clearly see that suddendly, disk became 100% busy, meanwhile = CPU was almost idle. >=20 > We also clearly see that 5 minutes later (02:09) disk seems to be back = but became 100% busy again, > and that 16 minutes later (default vfs.zfs.deadman_synctime_ms), panic = occurred. >=20 > >>> No error message at all on both servers. > >> [...] > >>> The only log I have is the following stacktrace taken from the = server console : > >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... = at '/dev/label/G23iscsi'. > >>> cpuid =3D 0 > >>> KDB: stack backtrace: > >>> #0 0xffffffff80b240f7 at kdb_backtrace+0x67 > >>> #1 0xffffffff80ad9462 at vpanic+0x182 > >>> #2 0xffffffff80ad92d3 at panic+0x43 > >>> #3 0xffffffff82238fa7 at vdev_deadman+0x127 > >>> #4 0xffffffff82238ec0 at vdev_deadman+0x40 > >>> #5 0xffffffff82238ec0 at vdev_deadman+0x40 > >>> #6 0xffffffff8222d0a6 at spa_deadman+0x86 > >>> #7 0xffffffff80af32da at softclock_call_cc+0x18a > >>> #8 0xffffffff80af3854 at softclock+0x94 > >>> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f > >>> #10 0xffffffff80a936f6 at ithread_loop+0xc6 > >>> #11 0xffffffff80a900d5 at fork_exit+0x85 > >>> #12 0xffffffff80f846fe at fork_trampoline+0xe > >>> Uptime: 92d2h47m6s > >>> > >>> I would have been pleased to make a dump available. > >>> However, despite my (correct ?) configuration, server did not dump = : > >>> (nevertheless, "sysctl debug.kdb.panic=3D1" make it to dump) > >>> # grep ^dump /boot/loader.conf /etc/rc.conf > >>> /boot/loader.conf:dumpdev=3D"/dev/mirror/swap" > >>> /etc/rc.conf:dumpdev=3D"AUTO" > >> > >> You may want to look at the NOTES section in gmirror(8). > > > > Yes, I should already be OK (prefer algorithm set). > > > >>> I use default kernel, with a rebuilt zfs module : > >>> # uname -v > >>> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 = root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC > >>> > >>> I use the following iSCSI configuration, which disconnects the = disks "as soon as" they are unavailable : > >>> kern.iscsi.ping_timeout=3D5 > >>> kern.iscsi.fail_on_disconnection=3D1 > >>> kern.iscsi.iscsid_timeout=3D5 > >>> > >>> I then think disk was at least correctly reachable during these 20 = busy minutes. > >>> > >>> So, any idea why I could have faced this issue ? > >> > >> Is it possible that the system was under memory pressure? > > > > No I don't think it was : > > https://s1.postimg.org/uvsebpyyn/busydisk2.png = > > More than 2GB of available memory. > > Swap not used (624kB). > > ARC behaviour seems correct (anon increases because ZFS can't = actually write I think). > > Regarding the pool itself, it was receiving data at 6MB/s, sending = around 30kB blocks to disks. > > When disk went busy, throughput fell to some kB, with 128kB blocks. > > > >> geli's use of malloc() is known to cause deadlocks under memory = pressure: > >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209759 = > >> > >> Given that gmirror uses malloc() as well it probably has the same = issue. > > > > I don't use geli so I should not face this issue. > > > >>> I would have thought ZFS would have taken the busy device offline, = instead of raising a panic. > >>> Perhaps it is already possible to make ZFS behave like this ? > >> > >> There's a tunable for this: vfs.zfs.deadman_enabled. > >> If the panic is just a symptom of the deadlock it's unlikely > >> to help though. > > > > I think this tunable should have prevented the server from having = raised a panic : > > # sysctl -d vfs.zfs.deadman_enabled > > vfs.zfs.deadman_enabled: Kernel panic on stalled ZFS I/O > > # sysctl vfs.zfs.deadman_enabled > > vfs.zfs.deadman_enabled: 1 > > > > But not sure how it would have behaved then... > > (busy disk miraculously back to normal status, memory pressure due = to anon increasing...) >=20 > I then think it would be nice, once vfs.zfs.deadman_synctime_ms has = expired, > to be able to take the busy device offline instead of raising a panic. > Currently, disabling deadman will avoid the panic but will let the = device slowing down the pool. >=20 > I still did not found the root cause of this issue, not sure I will, > quite difficult actually with a stacktrace and some performance graphs = only :/ >=20 > What exactly is the disk doing when that happens? What does "gstat" = say? If the iSCSI > target is also FreeBSD, what does ctlstat say? As shown on this graph made with gstat numbers from initiator : https://s23.postimg.org/zd8jy9xaj/busydisk.png The disk is continuously writing 3 MBps before the issue happens. When it occurs, response time increases to around 30 seconds (100% = busy), and consequently disk throughput drops down to some kBps. CPU stays at an almost fully idle level. As shown here, no memory pressure : https://s1.postimg.org/uvsebpyyn/busydisk2.png = At the end of graphs' lines, panic is raised. iSCSI target is also FreeBSD, unfortunately ctlstat was not running = during the issue occurred. So numbers will be average since system startup (102 days ago). I also do not have gstat numbers from this disk on target side (to help finding if it's a hardware issue, a iSCSI issue or something = else). I will think about collecting these numbers if ever issue occurs again. > If everything else fails, you might want to do a packet trace using = "tcpdump -w" and take > a look at it using Wireshark; it decodes iSCSI quite nicely. From owner-freebsd-scsi@freebsd.org Mon Jul 3 15:27:25 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A54FC9E8DA6; Mon, 3 Jul 2017 15:27:25 +0000 (UTC) (envelope-from etnapierala@gmail.com) Received: from mail-vk0-x232.google.com (mail-vk0-x232.google.com [IPv6:2607:f8b0:400c:c05::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 58EEA7435D; Mon, 3 Jul 2017 15:27:25 +0000 (UTC) (envelope-from etnapierala@gmail.com) Received: by mail-vk0-x232.google.com with SMTP id y70so97090644vky.3; Mon, 03 Jul 2017 08:27:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=W/0DKT8sHhZ3yNVWjKkTJ772GQVXevDvpoKXY+oigYk=; b=mgGq/hvQwILMsIomtgAL/VM6ON3YadRQNxKhk39W9dPGIGWhI6a+Re93+c5UXRy1Y9 KN0x8xh4fLOGV2avdc+TEGsNhZKwJ40Atd5mZ82PVvroDKMcPPtb1Z/1eE6nENeIemsY b7OVRYCcgiAtGlEdOeWNJ9sRoAGCdR8jMAGi6EVQAMQPtguf8y/48gKBFG3KUPMWcVB6 UbxhSbGeMlcTHuggsm+bN5Q5fDQ5B8t3H0G0J0lx+aHSAAlbliIfZu0Sb9KIXQxeuboJ 1Cmmc0YOFsimoAXajhlaclgufcieRQHKqSIpUh8BE7wza1zR7DbmRzEvke2vGzUqBgoi BhvQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=W/0DKT8sHhZ3yNVWjKkTJ772GQVXevDvpoKXY+oigYk=; b=FrWqWEKqSnjHQNyE/rYvKyGLvy+P66XwWlo3RQoLOBAdaqWvBnr3HNFTcpZz6fv16k U80gmKjUZHigB0xEfKs45Bdgw4HR8n50+8oYyt10wcgliIzots8/FfdgiuUq2vS01ygC ca8hQDqyUmQ9pa0JqQQlOscV5HlXYS2ssX4ErF8v8qyU/An0jEgi2+R1rsTvaTAgm1zr 2RlM3cmR9tpYYvg+Fc04J+3CQ0uruSYNLONuaCN0CJMfjOTy/JEdwdDpAoStC4664xsv QXHaNCWnLY2GisjTjMyvpp0FqVn9/hC6QEwhmrKOmk3eaWNroN+Vf6m2Ew6U9dO4iW+Y y4fg== X-Gm-Message-State: AKS2vOzmePGCnJhJd7tbQZiZ7IcbfpTYed4g968rVnZ+J87ND6pH2aeq qvsGykdz3oh6SNNgOh+CMciHar5TFw== X-Received: by 10.31.172.88 with SMTP id v85mr19659845vke.57.1499095644405; Mon, 03 Jul 2017 08:27:24 -0700 (PDT) MIME-Version: 1.0 Sender: etnapierala@gmail.com Received: by 10.176.83.198 with HTTP; Mon, 3 Jul 2017 08:27:23 -0700 (PDT) In-Reply-To: <1F414ECE-1856-4EA3-A141-88B64703D4D6@gmail.com> References: <20170629144334.1e283570@fabiankeil.de> <1F414ECE-1856-4EA3-A141-88B64703D4D6@gmail.com> From: Edward Napierala Date: Mon, 3 Jul 2017 16:27:23 +0100 X-Google-Sender-Auth: MfH634NSHWAqw0dMIrXRjnSuuYA Message-ID: Subject: Re: I/O to pool appears to be hung, panic ! To: Ben RUBSON Cc: Freebsd fs , freebsd-scsi Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.23 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 15:27:25 -0000 2017-07-03 14:36 GMT+01:00 Ben RUBSON : > > On 03 Jul 2017, at 13:10, Edward Napierala wrote: > > > > 2017-07-03 10:07 GMT+01:00 Ben RUBSON ben.rubson@gmail.com>>: > > > > > On 29 Jun 2017, at 15:36, Ben RUBSON ben.rubson@gmail.com>> wrote: > > > > > >> On 29 Jun 2017, at 14:43, Fabian Keil > wrote: > > > > > > Thank you for your feedback Fabian. > > > > > >> Ben RUBSON > > wrote: > > >> > > >>> One of my servers did a kernel panic last night, giving the > following message : > > >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at > '/dev/label/G23iscsi'. > > >> [...] > > >>> Here are some numbers regarding this disk, taken from the server > hosting the pool : > > >>> (unfortunately not from the iscsi target server) > > >>> https://s23.postimg.org/zd8jy9xaj/busydisk.png < > https://s23.postimg.org/zd8jy9xaj/busydisk.png> > > >>> > > >>> We clearly see that suddendly, disk became 100% busy, meanwhile CPU > was almost idle. > > > > We also clearly see that 5 minutes later (02:09) disk seems to be back > but became 100% busy again, > > and that 16 minutes later (default vfs.zfs.deadman_synctime_ms), panic > occurred. > > > > >>> No error message at all on both servers. > > >> [...] > > >>> The only log I have is the following stacktrace taken from the > server console : > > >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at > '/dev/label/G23iscsi'. > > >>> cpuid = 0 > > >>> KDB: stack backtrace: > > >>> #0 0xffffffff80b240f7 at kdb_backtrace+0x67 > > >>> #1 0xffffffff80ad9462 at vpanic+0x182 > > >>> #2 0xffffffff80ad92d3 at panic+0x43 > > >>> #3 0xffffffff82238fa7 at vdev_deadman+0x127 > > >>> #4 0xffffffff82238ec0 at vdev_deadman+0x40 > > >>> #5 0xffffffff82238ec0 at vdev_deadman+0x40 > > >>> #6 0xffffffff8222d0a6 at spa_deadman+0x86 > > >>> #7 0xffffffff80af32da at softclock_call_cc+0x18a > > >>> #8 0xffffffff80af3854 at softclock+0x94 > > >>> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f > > >>> #10 0xffffffff80a936f6 at ithread_loop+0xc6 > > >>> #11 0xffffffff80a900d5 at fork_exit+0x85 > > >>> #12 0xffffffff80f846fe at fork_trampoline+0xe > > >>> Uptime: 92d2h47m6s > > >>> > > >>> I would have been pleased to make a dump available. > > >>> However, despite my (correct ?) configuration, server did not dump : > > >>> (nevertheless, "sysctl debug.kdb.panic=1" make it to dump) > > >>> # grep ^dump /boot/loader.conf /etc/rc.conf > > >>> /boot/loader.conf:dumpdev="/dev/mirror/swap" > > >>> /etc/rc.conf:dumpdev="AUTO" > > >> > > >> You may want to look at the NOTES section in gmirror(8). > > > > > > Yes, I should already be OK (prefer algorithm set). > > > > > >>> I use default kernel, with a rebuilt zfs module : > > >>> # uname -v > > >>> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 > root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC > > >>> > > >>> I use the following iSCSI configuration, which disconnects the disks > "as soon as" they are unavailable : > > >>> kern.iscsi.ping_timeout=5 > > >>> kern.iscsi.fail_on_disconnection=1 > > >>> kern.iscsi.iscsid_timeout=5 > > >>> > > >>> I then think disk was at least correctly reachable during these 20 > busy minutes. > > >>> > > >>> So, any idea why I could have faced this issue ? > > >> > > >> Is it possible that the system was under memory pressure? > > > > > > No I don't think it was : > > > https://s1.postimg.org/uvsebpyyn/busydisk2.png < > https://s1.postimg.org/uvsebpyyn/busydisk2.png> > > > More than 2GB of available memory. > > > Swap not used (624kB). > > > ARC behaviour seems correct (anon increases because ZFS can't actually > write I think). > > > Regarding the pool itself, it was receiving data at 6MB/s, sending > around 30kB blocks to disks. > > > When disk went busy, throughput fell to some kB, with 128kB blocks. > > > > > >> geli's use of malloc() is known to cause deadlocks under memory > pressure: > > >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209759 < > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209759> > > >> > > >> Given that gmirror uses malloc() as well it probably has the same > issue. > > > > > > I don't use geli so I should not face this issue. > > > > > >>> I would have thought ZFS would have taken the busy device offline, > instead of raising a panic. > > >>> Perhaps it is already possible to make ZFS behave like this ? > > >> > > >> There's a tunable for this: vfs.zfs.deadman_enabled. > > >> If the panic is just a symptom of the deadlock it's unlikely > > >> to help though. > > > > > > I think this tunable should have prevented the server from having > raised a panic : > > > # sysctl -d vfs.zfs.deadman_enabled > > > vfs.zfs.deadman_enabled: Kernel panic on stalled ZFS I/O > > > # sysctl vfs.zfs.deadman_enabled > > > vfs.zfs.deadman_enabled: 1 > > > > > > But not sure how it would have behaved then... > > > (busy disk miraculously back to normal status, memory pressure due to > anon increasing...) > > > > I then think it would be nice, once vfs.zfs.deadman_synctime_ms has > expired, > > to be able to take the busy device offline instead of raising a panic. > > Currently, disabling deadman will avoid the panic but will let the > device slowing down the pool. > > > > I still did not found the root cause of this issue, not sure I will, > > quite difficult actually with a stacktrace and some performance graphs > only :/ > > > > What exactly is the disk doing when that happens? What does "gstat" > say? If the iSCSI > > target is also FreeBSD, what does ctlstat say? > > As shown on this graph made with gstat numbers from initiator : > https://s23.postimg.org/zd8jy9xaj/busydisk.png > The disk is continuously writing 3 MBps before the issue happens. > When it occurs, response time increases to around 30 seconds (100% busy), > and consequently disk throughput drops down to some kBps. > CPU stays at an almost fully idle level. > > As shown here, no memory pressure : > https://s1.postimg.org/uvsebpyyn/busydisk2.png uvsebpyyn/busydisk2.png> > > At the end of graphs' lines, panic is raised. > > iSCSI target is also FreeBSD, unfortunately ctlstat was not running during > the issue occurred. > So numbers will be average since system startup (102 days ago). > I also do not have gstat numbers from this disk on target side > (to help finding if it's a hardware issue, a iSCSI issue or something > else). > I will think about collecting these numbers if ever issue occurs again. > It's kind of hard to say something definitive at this point, but I suspect it's a problem at the target side. I got a report about something quite similar some two years ago, and it turned out to be a problem with a disk controller on the target. From owner-freebsd-scsi@freebsd.org Mon Jul 3 15:35:15 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id CBC719E90AB for ; Mon, 3 Jul 2017 15:35:15 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B351A749CA for ; Mon, 3 Jul 2017 15:35:15 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v63FZFwh020569 for ; Mon, 3 Jul 2017 15:35:15 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-scsi@FreeBSD.org Subject: [Bug 219701] crash in camperiphfree() Date: Mon, 03 Jul 2017 15:35:15 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: commit-hook@freebsd.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: ken@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 15:35:15 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D219701 --- Comment #5 from commit-hook@freebsd.org --- A commit references this bug: Author: ken Date: Mon Jul 3 15:34:20 UTC 2017 New revision: 320601 URL: https://svnweb.freebsd.org/changeset/base/320601 Log: MFC r320421: ------------------------------------------------------------------------ r320421 | ken | 2017-06-27 13:26:02 -0600 (Tue, 27 Jun 2017) | 37 lines Fix a panic in camperiphfree(). If a peripheral driver (e.g. da, sa, cd) is added or removed from the peripheral driver list while an unrelated peripheral driver instance (e= .g. da0, sa5, cd2) is going away and is inside camperiphfree(), we could dereference an invalid pointer. When peripheral drivers are added or removed (see periphdriver_register= () and periphdriver_unregister()), the peripheral driver array is resized and existing entries are moved. Although we hold the topology lock while we traverse the peripheral dri= ver list, we retain a pointer to the location of the peripheral driver poin= ter and then drop the topology lock. So we are still vulnerable to the list getting moved around while the lock is dropped. To solve the problem, cache a copy of the peripheral driver pointer. If its storage location in the list changes while we have the lock dropped= , it won't have any effect. This doesn't solve the issue that peripheral drivers ("da", "cd", as opposed to individual instances like "da0", "cd0") are not generally part of a reference counting scheme to guard against deregistering them while the= re are instances active. The caller (generally the person unloading a mod= ule) has to be aware of active drivers and not unload something that is in u= se. sys/cam/cam_periph.c: In camperiphfree(), cache a pointer to the peripheral driver instance to avoid holding a pointer to an invalid memory location in the event that the peripheral driver list changes while we have the topology lock dropped. PR: kern/219701 Submitted by: avg Sponsored by: Spectra Logic ------------------------------------------------------------------------ PR: kern/219701 Sponsored by: Spectra Logic Changes: _U stable/10/ stable/10/sys/cam/cam_periph.c --=20 You are receiving this mail because: You are on the CC list for the bug.= From owner-freebsd-scsi@freebsd.org Mon Jul 3 15:35:21 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 82D5A9E90CA for ; Mon, 3 Jul 2017 15:35:21 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6979B749FF for ; Mon, 3 Jul 2017 15:35:21 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v63FZLip020732 for ; Mon, 3 Jul 2017 15:35:21 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-scsi@FreeBSD.org Subject: [Bug 219701] crash in camperiphfree() Date: Mon, 03 Jul 2017 15:35:21 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: commit-hook@freebsd.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: ken@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 15:35:21 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D219701 --- Comment #8 from commit-hook@freebsd.org --- A commit references this bug: Author: ken Date: Mon Jul 3 15:34:21 UTC 2017 New revision: 320602 URL: https://svnweb.freebsd.org/changeset/base/320602 Log: MFC r320421: ------------------------------------------------------------------------ r320421 | ken | 2017-06-27 13:26:02 -0600 (Tue, 27 Jun 2017) | 37 lines Fix a panic in camperiphfree(). If a peripheral driver (e.g. da, sa, cd) is added or removed from the peripheral driver list while an unrelated peripheral driver instance (e= .g. da0, sa5, cd2) is going away and is inside camperiphfree(), we could dereference an invalid pointer. When peripheral drivers are added or removed (see periphdriver_register= () and periphdriver_unregister()), the peripheral driver array is resized and existing entries are moved. Although we hold the topology lock while we traverse the peripheral dri= ver list, we retain a pointer to the location of the peripheral driver poin= ter and then drop the topology lock. So we are still vulnerable to the list getting moved around while the lock is dropped. To solve the problem, cache a copy of the peripheral driver pointer. If its storage location in the list changes while we have the lock dropped= , it won't have any effect. This doesn't solve the issue that peripheral drivers ("da", "cd", as opposed to individual instances like "da0", "cd0") are not generally part of a reference counting scheme to guard against deregistering them while the= re are instances active. The caller (generally the person unloading a mod= ule) has to be aware of active drivers and not unload something that is in u= se. sys/cam/cam_periph.c: In camperiphfree(), cache a pointer to the peripheral driver instance to avoid holding a pointer to an invalid memory location in the event that the peripheral driver list changes while we have the topology lock dropped. PR: kern/219701 Submitted by: avg Sponsored by: Spectra Logic ------------------------------------------------------------------------ PR: kern/219701 Sponsored by: Spectra Logic Changes: _U stable/11/ stable/11/sys/cam/cam_periph.c --=20 You are receiving this mail because: You are on the CC list for the bug.= From owner-freebsd-scsi@freebsd.org Mon Jul 3 15:35:19 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7AC3A9E90C0 for ; Mon, 3 Jul 2017 15:35:19 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 61079749DF for ; Mon, 3 Jul 2017 15:35:19 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v63FZJUU020691 for ; Mon, 3 Jul 2017 15:35:19 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-scsi@FreeBSD.org Subject: [Bug 219701] crash in camperiphfree() Date: Mon, 03 Jul 2017 15:35:19 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: commit-hook@freebsd.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: ken@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 15:35:19 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D219701 --- Comment #7 from commit-hook@freebsd.org --- A commit references this bug: Author: ken Date: Mon Jul 3 15:34:21 UTC 2017 New revision: 320602 URL: https://svnweb.freebsd.org/changeset/base/320602 Log: MFC r320421: ------------------------------------------------------------------------ r320421 | ken | 2017-06-27 13:26:02 -0600 (Tue, 27 Jun 2017) | 37 lines Fix a panic in camperiphfree(). If a peripheral driver (e.g. da, sa, cd) is added or removed from the peripheral driver list while an unrelated peripheral driver instance (e= .g. da0, sa5, cd2) is going away and is inside camperiphfree(), we could dereference an invalid pointer. When peripheral drivers are added or removed (see periphdriver_register= () and periphdriver_unregister()), the peripheral driver array is resized and existing entries are moved. Although we hold the topology lock while we traverse the peripheral dri= ver list, we retain a pointer to the location of the peripheral driver poin= ter and then drop the topology lock. So we are still vulnerable to the list getting moved around while the lock is dropped. To solve the problem, cache a copy of the peripheral driver pointer. If its storage location in the list changes while we have the lock dropped= , it won't have any effect. This doesn't solve the issue that peripheral drivers ("da", "cd", as opposed to individual instances like "da0", "cd0") are not generally part of a reference counting scheme to guard against deregistering them while the= re are instances active. The caller (generally the person unloading a mod= ule) has to be aware of active drivers and not unload something that is in u= se. sys/cam/cam_periph.c: In camperiphfree(), cache a pointer to the peripheral driver instance to avoid holding a pointer to an invalid memory location in the event that the peripheral driver list changes while we have the topology lock dropped. PR: kern/219701 Submitted by: avg Sponsored by: Spectra Logic ------------------------------------------------------------------------ PR: kern/219701 Sponsored by: Spectra Logic Changes: _U stable/11/ stable/11/sys/cam/cam_periph.c --=20 You are receiving this mail because: You are on the CC list for the bug.= From owner-freebsd-scsi@freebsd.org Mon Jul 3 15:35:17 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B045A9E90B0 for ; Mon, 3 Jul 2017 15:35:17 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 96DCE749D3 for ; Mon, 3 Jul 2017 15:35:17 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v63FZHBu020584 for ; Mon, 3 Jul 2017 15:35:17 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-scsi@FreeBSD.org Subject: [Bug 219701] crash in camperiphfree() Date: Mon, 03 Jul 2017 15:35:17 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: commit-hook@freebsd.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: ken@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 15:35:17 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D219701 --- Comment #6 from commit-hook@freebsd.org --- A commit references this bug: Author: ken Date: Mon Jul 3 15:34:20 UTC 2017 New revision: 320601 URL: https://svnweb.freebsd.org/changeset/base/320601 Log: MFC r320421: ------------------------------------------------------------------------ r320421 | ken | 2017-06-27 13:26:02 -0600 (Tue, 27 Jun 2017) | 37 lines Fix a panic in camperiphfree(). If a peripheral driver (e.g. da, sa, cd) is added or removed from the peripheral driver list while an unrelated peripheral driver instance (e= .g. da0, sa5, cd2) is going away and is inside camperiphfree(), we could dereference an invalid pointer. When peripheral drivers are added or removed (see periphdriver_register= () and periphdriver_unregister()), the peripheral driver array is resized and existing entries are moved. Although we hold the topology lock while we traverse the peripheral dri= ver list, we retain a pointer to the location of the peripheral driver poin= ter and then drop the topology lock. So we are still vulnerable to the list getting moved around while the lock is dropped. To solve the problem, cache a copy of the peripheral driver pointer. If its storage location in the list changes while we have the lock dropped= , it won't have any effect. This doesn't solve the issue that peripheral drivers ("da", "cd", as opposed to individual instances like "da0", "cd0") are not generally part of a reference counting scheme to guard against deregistering them while the= re are instances active. The caller (generally the person unloading a mod= ule) has to be aware of active drivers and not unload something that is in u= se. sys/cam/cam_periph.c: In camperiphfree(), cache a pointer to the peripheral driver instance to avoid holding a pointer to an invalid memory location in the event that the peripheral driver list changes while we have the topology lock dropped. PR: kern/219701 Submitted by: avg Sponsored by: Spectra Logic ------------------------------------------------------------------------ PR: kern/219701 Sponsored by: Spectra Logic Changes: _U stable/10/ stable/10/sys/cam/cam_periph.c --=20 You are receiving this mail because: You are on the CC list for the bug.= From owner-freebsd-scsi@freebsd.org Mon Jul 3 15:40:50 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DAD5C9E92B0; Mon, 3 Jul 2017 15:40:50 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x229.google.com (mail-wm0-x229.google.com [IPv6:2a00:1450:400c:c09::229]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 5CB4D74D26; Mon, 3 Jul 2017 15:40:50 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x229.google.com with SMTP id f67so59056430wmh.1; Mon, 03 Jul 2017 08:40:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:message-id:mime-version:subject:date:references:to:in-reply-to; bh=ovyVCvyhloMQ/jD0OLGq195wBHMNIKxKL28wsOoe6w0=; b=APPCfGF4zioKvH5abBgg8/dOUCpooS6J4C/jbBbnU9wk0xTzZriBf985lfPrdLH5GB R1eBbxa61k5No5/jaAA3b8Tt5KZ0NbXssmitLSikDCgqJ8KmQAuE81YYs2bp/4lLfJxq 2Tb/M3FbHPBZm5+BpTmofd6Hw5qD6F4pDqWMGTRriTUQ2+1yPFxTjRzv9u9P9slYivjz MPIMhqkNwRiSTBqUKfxTnIcqPQ4VF7rlLoYFlVKO2RhTppiXvR/rRxP6QjYoYBjapq4C wXKw7+lAbWRZbPnzRa/E8Ahp55j5BTQE3QUCCj9JPeELMmoPCwaKG8KK1HrRV7WQ5RXs Sy3g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :references:to:in-reply-to; bh=ovyVCvyhloMQ/jD0OLGq195wBHMNIKxKL28wsOoe6w0=; b=mchFoACaKc7oSenWNngsEVH7rMPzDg77ZLQ7w2v00IpOyQwcQs4iAEdszbmrs5MdHs aSk+kwFX4fzLmSxWJDvLxPb+LuTkUPwrMCFgH3WbNd2LRtifhZvalZTDyKHzoA8UhFPk gvzcAa0qv8SrqlIPFBlIZp05vwoC/2/dqFC8f5rd6l8s5zvIyatnzocQoOi1HWJ6SFEF 3y6pn1Vx3ZZ2kz88p8bi8nldzJ0VV7bayAiX5JIqBwD/ZOz4gTRhB/hAyQXJK5nJjZbf dyrPTO/RYiRxPkqkxvt5Eqor6YX1hSGvObjwMMeBhkMajc/D0e26A1VwVLyZ5q4w2Y5s V6Sw== X-Gm-Message-State: AIVw113by322m85Njz0ur3EFHHtlRyX/EHzJBB2SsXerP5m0Me68PaCt t+htYiU76gfj9oiimxs= X-Received: by 10.28.184.87 with SMTP id i84mr7233612wmf.22.1499096448410; Mon, 03 Jul 2017 08:40:48 -0700 (PDT) Received: from ben.home (LFbn-1-11339-180.w2-15.abo.wanadoo.fr. [2.15.165.180]) by smtp.gmail.com with ESMTPSA id n71sm18817841wrb.62.2017.07.03.08.40.47 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 03 Jul 2017 08:40:47 -0700 (PDT) From: Ben RUBSON Message-Id: Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: I/O to pool appears to be hung, panic ! Date: Mon, 3 Jul 2017 17:40:46 +0200 References: <20170629144334.1e283570@fabiankeil.de> <1F414ECE-1856-4EA3-A141-88B64703D4D6@gmail.com> To: Freebsd fs , freebsd-scsi In-Reply-To: X-Mailer: Apple Mail (2.3124) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.23 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 15:40:51 -0000 > On 03 Jul 2017, at 17:27, Edward Napierala wrote: >=20 > 2017-07-03 14:36 GMT+01:00 Ben RUBSON >: > > On 03 Jul 2017, at 13:10, Edward Napierala > wrote: > > > > 2017-07-03 10:07 GMT+01:00 Ben RUBSON >>: > > > > > On 29 Jun 2017, at 15:36, Ben RUBSON >> wrote: > > > > > >> On 29 Jun 2017, at 14:43, Fabian Keil = = >> wrote: > > > > > > Thank you for your feedback Fabian. > > > > > >> Ben RUBSON = >> wrote: > > >> > > >>> One of my servers did a kernel panic last night, giving the = following message : > > >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... = at '/dev/label/G23iscsi'. > > >> [...] > > >>> Here are some numbers regarding this disk, taken from the server = hosting the pool : > > >>> (unfortunately not from the iscsi target server) > > >>> https://s23.postimg.org/zd8jy9xaj/busydisk.png = = > > > >>> > > >>> We clearly see that suddendly, disk became 100% busy, meanwhile = CPU was almost idle. > > > > We also clearly see that 5 minutes later (02:09) disk seems to be = back but became 100% busy again, > > and that 16 minutes later (default vfs.zfs.deadman_synctime_ms), = panic occurred. > > > > >>> No error message at all on both servers. > > >> [...] > > >>> The only log I have is the following stacktrace taken from the = server console : > > >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... = at '/dev/label/G23iscsi'. > > >>> cpuid =3D 0 > > >>> KDB: stack backtrace: > > >>> #0 0xffffffff80b240f7 at kdb_backtrace+0x67 > > >>> #1 0xffffffff80ad9462 at vpanic+0x182 > > >>> #2 0xffffffff80ad92d3 at panic+0x43 > > >>> #3 0xffffffff82238fa7 at vdev_deadman+0x127 > > >>> #4 0xffffffff82238ec0 at vdev_deadman+0x40 > > >>> #5 0xffffffff82238ec0 at vdev_deadman+0x40 > > >>> #6 0xffffffff8222d0a6 at spa_deadman+0x86 > > >>> #7 0xffffffff80af32da at softclock_call_cc+0x18a > > >>> #8 0xffffffff80af3854 at softclock+0x94 > > >>> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f > > >>> #10 0xffffffff80a936f6 at ithread_loop+0xc6 > > >>> #11 0xffffffff80a900d5 at fork_exit+0x85 > > >>> #12 0xffffffff80f846fe at fork_trampoline+0xe > > >>> Uptime: 92d2h47m6s > > >>> > > >>> I would have been pleased to make a dump available. > > >>> However, despite my (correct ?) configuration, server did not = dump : > > >>> (nevertheless, "sysctl debug.kdb.panic=3D1" make it to dump) > > >>> # grep ^dump /boot/loader.conf /etc/rc.conf > > >>> /boot/loader.conf:dumpdev=3D"/dev/mirror/swap" > > >>> /etc/rc.conf:dumpdev=3D"AUTO" > > >> > > >> You may want to look at the NOTES section in gmirror(8). > > > > > > Yes, I should already be OK (prefer algorithm set). > > > > > >>> I use default kernel, with a rebuilt zfs module : > > >>> # uname -v > > >>> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017 = root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC > > >>> > > >>> I use the following iSCSI configuration, which disconnects the = disks "as soon as" they are unavailable : > > >>> kern.iscsi.ping_timeout=3D5 > > >>> kern.iscsi.fail_on_disconnection=3D1 > > >>> kern.iscsi.iscsid_timeout=3D5 > > >>> > > >>> I then think disk was at least correctly reachable during these = 20 busy minutes. > > >>> > > >>> So, any idea why I could have faced this issue ? > > >> > > >> Is it possible that the system was under memory pressure? > > > > > > No I don't think it was : > > > https://s1.postimg.org/uvsebpyyn/busydisk2.png = = > > > > More than 2GB of available memory. > > > Swap not used (624kB). > > > ARC behaviour seems correct (anon increases because ZFS can't = actually write I think). > > > Regarding the pool itself, it was receiving data at 6MB/s, sending = around 30kB blocks to disks. > > > When disk went busy, throughput fell to some kB, with 128kB = blocks. > > > > > >> geli's use of malloc() is known to cause deadlocks under memory = pressure: > > >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209759 = = > > > >> > > >> Given that gmirror uses malloc() as well it probably has the same = issue. > > > > > > I don't use geli so I should not face this issue. > > > > > >>> I would have thought ZFS would have taken the busy device = offline, instead of raising a panic. > > >>> Perhaps it is already possible to make ZFS behave like this ? > > >> > > >> There's a tunable for this: vfs.zfs.deadman_enabled. > > >> If the panic is just a symptom of the deadlock it's unlikely > > >> to help though. > > > > > > I think this tunable should have prevented the server from having = raised a panic : > > > # sysctl -d vfs.zfs.deadman_enabled > > > vfs.zfs.deadman_enabled: Kernel panic on stalled ZFS I/O > > > # sysctl vfs.zfs.deadman_enabled > > > vfs.zfs.deadman_enabled: 1 > > > > > > But not sure how it would have behaved then... > > > (busy disk miraculously back to normal status, memory pressure due = to anon increasing...) > > > > I then think it would be nice, once vfs.zfs.deadman_synctime_ms has = expired, > > to be able to take the busy device offline instead of raising a = panic. > > Currently, disabling deadman will avoid the panic but will let the = device slowing down the pool. > > > > I still did not found the root cause of this issue, not sure I will, > > quite difficult actually with a stacktrace and some performance = graphs only :/ > > > > What exactly is the disk doing when that happens? What does "gstat" = say? If the iSCSI > > target is also FreeBSD, what does ctlstat say? >=20 > As shown on this graph made with gstat numbers from initiator : > https://s23.postimg.org/zd8jy9xaj/busydisk.png = > The disk is continuously writing 3 MBps before the issue happens. > When it occurs, response time increases to around 30 seconds (100% = busy), > and consequently disk throughput drops down to some kBps. > CPU stays at an almost fully idle level. >=20 > As shown here, no memory pressure : > https://s1.postimg.org/uvsebpyyn/busydisk2.png = = > >=20 > At the end of graphs' lines, panic is raised. >=20 > iSCSI target is also FreeBSD, unfortunately ctlstat was not running = during the issue occurred. > So numbers will be average since system startup (102 days ago). > I also do not have gstat numbers from this disk on target side > (to help finding if it's a hardware issue, a iSCSI issue or something = else). > I will think about collecting these numbers if ever issue occurs = again. >=20 > It's kind of hard to say something definitive at this point, but I = suspect it's a problem > at the target side. I got a report about something quite similar some = two years ago, > and it turned out to be a problem with a disk controller on the = target. Thank you for your feedback. I then : - enabled gstat collection on target, to also have numbers on target, = not only on initiator ; - enabled controller logging (dev.mps.0.debug_level=3D0x1B) ; - disabled deadman. We should be able to investigate further in case issue occurs again. Of course feel free to notify me in case you have other ideas ! Thank you again, Ben From owner-freebsd-scsi@freebsd.org Mon Jul 3 18:21:39 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C76FB9EC9CD for ; Mon, 3 Jul 2017 18:21:39 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B57F57ADE3 for ; Mon, 3 Jul 2017 18:21:39 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v63ILd8l089276 for ; Mon, 3 Jul 2017 18:21:39 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-scsi@FreeBSD.org Subject: [Bug 219701] crash in camperiphfree() Date: Mon, 03 Jul 2017 18:21:39 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: commit-hook@freebsd.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: ken@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 18:21:39 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D219701 --- Comment #9 from commit-hook@freebsd.org --- A commit references this bug: Author: ken Date: Mon Jul 3 18:20:45 UTC 2017 New revision: 320608 URL: https://svnweb.freebsd.org/changeset/base/320608 Log: Merge r320602 from stable/11 into releng/11.1: ------------------------------------------------------------------------ r320602 | ken | 2017-07-03 09:34:21 -0600 (Mon, 03 Jul 2017) | 45 lines MFC r320421: ---------------------------------------------------------------------= --- r320421 | ken | 2017-06-27 13:26:02 -0600 (Tue, 27 Jun 2017) | 37 lin= es Fix a panic in camperiphfree(). If a peripheral driver (e.g. da, sa, cd) is added or removed from the peripheral driver list while an unrelated peripheral driver instance (e.g. da0, sa5, cd2) is going away and is inside camperiphfree(), we could dereference an invalid pointer. When peripheral drivers are added or removed (see periphdriver_regist= er() and periphdriver_unregister()), the peripheral driver array is resized and existing entries are moved. Although we hold the topology lock while we traverse the peripheral driver list, we retain a pointer to the location of the peripheral driver pointer and then drop the topology lock. So we are still vulnerable to the l= ist getting moved around while the lock is dropped. To solve the problem, cache a copy of the peripheral driver pointer. = If its storage location in the list changes while we have the lock dropp= ed, it won't have any effect. This doesn't solve the issue that peripheral drivers ("da", "cd", as opposed to individual instances like "da0", "cd0") are not generally part of a reference counting scheme to guard against deregistering them while t= here are instances active. The caller (generally the person unloading a module) has to be aware of active drivers and not unload something that is in use. sys/cam/cam_periph.c: In camperiphfree(), cache a pointer to the peripheral driver instance to avoid holding a pointer to an invalid memory location in the event that the peripheral driver list changes while we have the topology lock dropped. PR: kern/219701 Submitted by: avg Sponsored by: Spectra Logic ---------------------------------------------------------------------= --- ------------------------------------------------------------------------ Approved by: re (gjb) Changes: _U releng/11.1/ releng/11.1/sys/cam/cam_periph.c --=20 You are receiving this mail because: You are on the CC list for the bug.= From owner-freebsd-scsi@freebsd.org Mon Jul 3 18:24:11 2017 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id CB18D9ECC39 for ; Mon, 3 Jul 2017 18:24:11 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B9DFF7B235 for ; Mon, 3 Jul 2017 18:24:11 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v63IOB4o094999 for ; Mon, 3 Jul 2017 18:24:11 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-scsi@FreeBSD.org Subject: [Bug 219701] crash in camperiphfree() Date: Mon, 03 Jul 2017 18:24:11 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: ken@FreeBSD.org X-Bugzilla-Status: Closed X-Bugzilla-Resolution: FIXED X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: ken@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_status resolution Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 03 Jul 2017 18:24:11 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D219701 Kenneth D. Merry changed: What |Removed |Added ---------------------------------------------------------------------------- Status|New |Closed Resolution|--- |FIXED --- Comment #10 from Kenneth D. Merry --- Fixed in head, releng/11.1, stable/11, and stable/10. Thanks for the bug report! --=20 You are receiving this mail because: You are on the CC list for the bug.=