From owner-freebsd-stable@freebsd.org Fri Aug 3 19:11:50 2018 Return-Path: Delivered-To: freebsd-stable@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 946FE104D2E6 for ; Fri, 3 Aug 2018 19:11:50 +0000 (UTC) (envelope-from Mark.Martinec+freebsd@ijs.si) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 31305712F8 for ; Fri, 3 Aug 2018 19:11:50 +0000 (UTC) (envelope-from Mark.Martinec+freebsd@ijs.si) Received: by mailman.ysv.freebsd.org (Postfix) id E6764104D2E2; Fri, 3 Aug 2018 19:11:49 +0000 (UTC) Delivered-To: stable@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C4AA2104D2E1 for ; Fri, 3 Aug 2018 19:11:49 +0000 (UTC) (envelope-from Mark.Martinec+freebsd@ijs.si) Received: from mail.ijs.si (mail.ijs.si [IPv6:2001:1470:ff80::25]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 53BBE712F5; Fri, 3 Aug 2018 19:11:49 +0000 (UTC) (envelope-from Mark.Martinec+freebsd@ijs.si) Received: from amavis-ori.ijs.si (localhost [IPv6:::1]) by mail.ijs.si (Postfix) with ESMTP id 41hxVW0XKzz43l; Fri, 3 Aug 2018 21:11:47 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=ijs.si; h= user-agent:message-id:references:in-reply-to:organization :subject:subject:from:from:date:date:content-transfer-encoding :content-type:content-type:mime-version:received:received :received:received; s=jakla4; t=1533323503; x=1535915504; bh=Wnt F1h5SER5Nk2yc1xUfWQiaI+G57bR84KYUq+1YnV8=; b=WAN8AH+wlL1J4h3HG/y FILoSZu0AEuEPrOxvhoN8I8gudvMZRPJSxjKqbfDYK7ZkBvOWQLGE6VyGQTTJWyV jiFHyU5lYbfDpmHszjga5pqspoPMAwk04aUjkxvHFUWDfR0wzU4F6ymGnkilVYmO 7T55Lz0272JXl2oweJx7cVRU= X-Virus-Scanned: amavisd-new at ijs.si Received: from mail.ijs.si ([IPv6:::1]) by amavis-ori.ijs.si (mail.ijs.si [IPv6:::1]) (amavisd-new, port 10026) with LMTP id Rvg2e11kgnce; Fri, 3 Aug 2018 21:11:43 +0200 (CEST) Received: from mildred.ijs.si (mailbox.ijs.si [IPv6:2001:1470:ff80::143:1]) by mail.ijs.si (Postfix) with ESMTP id 41hxVR3jmTz43k; Fri, 3 Aug 2018 21:11:43 +0200 (CEST) Received: from nabiralnik.ijs.si (nabiralnik.ijs.si [IPv6:2001:1470:ff80::80:16]) by mildred.ijs.si (Postfix) with ESMTP id 41hxVR0swhzCL; Fri, 3 Aug 2018 21:11:43 +0200 (CEST) Received: from neli.ijs.si (2001:1470:ff80:88:21c:c0ff:feb1:8c91) by nabiralnik.ijs.si with HTTP (HTTP/1.1 POST); Fri, 03 Aug 2018 21:11:42 +0200 MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Date: Fri, 03 Aug 2018 21:11:42 +0200 From: Mark Martinec To: stable@freebsd.org Cc: Mark Johnston Subject: Re: All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64 Organization: Jozef Stefan Institute In-Reply-To: <2ec91ebeaba54fda5e9437f868d4d590@ijs.si> References: <1a039af7758679ba1085934b4fb81b57@ijs.si> <3e56e4de076111c04c2595068ba71eec@ijs.si> <20180731220948.GA97237@raichu> <2ec91ebeaba54fda5e9437f868d4d590@ijs.si> Message-ID: X-Sender: Mark.Martinec+freebsd@ijs.si User-Agent: Roundcube Webmail/1.3.1 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.27 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 03 Aug 2018 19:11:50 -0000 More attempts at tracking this down. The suggested dtrace command does usually abort with: Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3330. but with some luck soon after each machine reboot I can leave the dtrace running for about 10 or 20 seconds (max) before terminating it with a ^C, and succeed in collecting the report. If I miss the opportunity to leave dtrace running just long enough to collect useful info, but not long enough for it to hit the assertion check, then any further attempt to run the dtrace script hits the assertion fault immediately. Btw, (just in case) I have recompiled kernel from source (base/release/11.2.0) with debugging symbols, although the behaviour has not changed: FreeBSD floki.ijs.si 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r337238: Fri Aug 3 17:29:42 CEST 2018 mark@xxx.ijs.si:/usr/obj/usr/src/sys/FLOKI amd64 Anyway, after several attempts I was able to collect a useful dtrace output from the suggested dtrace stript: # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count()}' while running "zpool list" repeatedly in another terminal screen: # (while true; do zpool list -Hp >/dev/null; vmstat -m | fgrep solaris; \ sleep 0.2; done) | awk '{print $2-a; a=$2}' 454303 570 570 570 570 570 570 570 570 570 570 570 570 570 570 Two samples of the collected dtrace output (after about 15 seconds) are at: https://www.ijs.si/usr/mark/tmp/dtrace1.out.bz2 https://www.ijs.si/usr/mark/tmp/dtrace2.out.bz2 (the dtrace2.out is probably cleaner, I made sure no other service was running except my sshd and syslog) Not really sure what I'm looking at, but a couple of large entries stand out: $ awk '/^ .*[0-9]+ .*[0-9]$/' dtrace2.out | sort -k1n | tail -5 114688 138 114688 138 114688 138 114688 138 114688 138 Thanks in advance for looking into it, Mark 2018-08-01 09:12, myself wrote: >> On Tue, Jul 31, 2018 at 11:54:29PM +0200, Mark Martinec wrote: >>> I have now upgraded this host from 11.1-RELEASE-p11 to 11.2-RELEASE >>> and the situation has not improved. Also turned off all services. >>> ZFS is still leaking memory about 30 MB per hour, until the host >>> runs out of memory and swap space and crashes, unless I reboot it >>> first every four days. >>> >>> Any advise before I try to get rid of that faulted disk with a pool >>> (or downgrade to 10.3, which was stable) ? > > 2018-08-01 00:09, Mark Johnston wrote: >> If you're able to use dtrace, it would be useful to try tracking >> allocations with the solaris tag: >> >> # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = >> count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = >> count();}' >> >> Try letting that run for one minute, then kill it and paste the >> output. >> Ideally the host will be as close to idle as possible while still >> demonstrating the leak. > > Good and bad news: > > The suggested dtrace command bails out: > > # dtrace -n 'dtmalloc::solaris:malloc {@allocs[stack(), args[3]] = > count()} dtmalloc::solaris:free {@frees[stack(), args[3]] = count();}' > dtrace: description 'dtmalloc::solaris:malloc ' matched 2 probes > Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file > /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, > line 3330. > Abort trap > > But I did get one step further, localizing the culprit. > > I realized that the "solaris" malloc count goes up in sync with > the 'telegraf' monitoring service polls, which also has a ZFS plugin > which monitors the zfs pool and ARC. This plugin runs 'zpool list -Hp' > periodically. > > So after stopping telegraf (and other remaining services), > the 'vmstat -m' shows that InUse count for "solaris" goes up by 552 > every time that I run "zpool list -Hp" : > > # (while true; do zpool list -Hp >/dev/null; vmstat -m | \ > fgrep solaris; sleep 1; done) | awk '{print $2-a; a=$2}' > 6664427 > 541 > 552 > 552 > 552 > 552 > 552 > 552 > 552 > 552