From owner-freebsd-stable@FreeBSD.ORG Mon Mar 16 23:24:10 2015
Return-Path:
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by hub.freebsd.org (Postfix) with ESMTPS id D760BD17
	for ; Mon, 16 Mar 2015 23:24:10 +0000 (UTC)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by mx1.freebsd.org (Postfix) with ESMTPS id 785A88A2
	for ; Mon, 16 Mar 2015 23:24:10 +0000 (UTC)
Received: from tom.home (kostik@localhost [127.0.0.1])
	by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id t2GNO5Lb067344
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 17 Mar 2015 01:24:05 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t2GNO5Lb067344
Received: (from kostik@localhost)
	by tom.home (8.14.9/8.14.9/Submit) id t2GNO5op067343;
	Tue, 17 Mar 2015 01:24:05 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f
Date: Tue, 17 Mar 2015 01:24:04 +0200
From: Konstantin Belousov
To: J David
Subject: Re: Significant memory leak in 9.3p10?
Message-ID: <20150316232404.GM2379@kib.kiev.ua>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
	DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
	autolearn_force=no version=3.4.0
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home
Cc: freebsd-stable
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 Mar 2015 23:24:10 -0000

On Mon, Mar 16, 2015 at 06:59:33PM -0400, J David wrote:
> Recently we have seen a large-scale memory leak on amd64 machines
> running FreeBSD 9.3-RELEASE-p10.
>
> This was first observed on 9.3p2 but has since shown up all the way through p10.
>
> Here's what the header of top shows:
>
> last pid: 32329; load averages: 0.00, 0.01, 0.21 up 3+15:37:29 22:34:04
> 25 processes: 2 running, 22 sleeping, 1 waiting
> CPU: % user, % nice, % system, % interrupt, % idle
> Mem: 4072M Active, 895M Inact, 1284M Wired, 125M Cache, 826M Buf, 1521M Free
> Swap: 1024M Total, 874M Used, 149M Free, 85% Inuse
>
> About 4G actively being used, another 895M inactive, and another 874M
> in swap. So it seems like this is a user-space leak, rather than a
> kernel-space leak.
>
> At the time of measurement, this machine was not doing anything and
> every possible process had been killed trying to find a culprit. The
> entire output of "ps axlww" is:
>
>   UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN   STAT TT       TIME COMMAND
>     0     0     0   0 -52  0     0  224 -        DLs  ??    0:00.82 [kernel]
>     0     1     0   0  20  0  6280  556 wait     SLs  ??    0:00.57 /sbin/init --
>     0     2     0   0 -16  0     0   16 pftm     DL   ??    0:00.85 [pfpurge]
>     0     3     0   0 -16  0     0   16 waiting_ DL   ??    0:00.00 [sctp_iterator]
>     0     4     0   0 -16  0     0   16 -        DL   ??    0:00.00 [xpt_thrd]
>     0     5     0   0 -16  0     0   16 psleep   DL   ??    0:28.85 [pagedaemon]
>     0     6     0   0 -16  0     0   16 psleep   DL   ??    0:45.03 [vmdaemon]
>     0     7     0   0 -16  0     0   16 pollid   DL   ??    0:00.23 [idlepoll]
>     0     8     0   0 155  0     0   16 pgzero   DL   ??    0:00.00 [pagezero]
>     0     9     0   0 -16  0     0   16 psleep   DL   ??    0:00.83 [bufdaemon]
>     0    10     0   0 -16  0     0   16 audit_wo DL   ??    0:00.00 [audit]
>     0    11     0   0 155  0     0   32 -        RL   ?? 8317:13.37 [idle]
>     0    12     0   0 -76  0     0  240 -        WL   ??  301:43.54 [intr]
>     0    13     0   0  -8  0     0   48 -        DL   ??    0:09.89 [geom]
>     0    14     0   0 -16  0     0   16 -        DL   ??    2:58.88 [yarrow]
>     0    15     0   0 -68  0     0   64 -        DL   ??    0:02.32 [usb]
>     0    16     0   0 -16  0     0   16 vlruwt   DL   ??    0:06.35 [vnlru]
>     0    17     0   0  16  0     0   16 syncer   DL   ??    5:28.89 [syncer]
>     0    18     0   0 -16  0     0   16 sdflush  DL   ??    0:10.27 [softdepflush]
>     0    19     0   0 -16  0     0   16 -        DL   ??    0:55.09 [racctd]
>     0   830     1   0  20  0 45348 2396 wait     Is   u0    0:00.07 login [pam] (login)
>   500 32269   830   0  20  0 14556 2428 wait     S    u0    0:00.09 -sh (sh)
>   500 32340 32269   0  20  0 16296 1908 -        R+   u0    0:00.00 ps axlww
>
> Since the issue doesn't seem related to kernel memory usage, vmstat -m
> and -z have been skipped, but nothing jumps out as using gigs of RAM;
> they do appear consistent with 1284M of wired memory, which is not
> unreasonable for the affected machines' tuning and workload.
>
> The only user-space processes running are login, sh, and ps. So where
> did 5.5G of userspace RAM go?
>
> The only other potentially useful information is that when this
> happens, shutting down the system will hang for about ten minutes.
>
> $ sudo halt -p
> Waiting (max 60 seconds) for system process `vnlru' to stop...done
> Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
> Waiting (max 60 seconds) for system process `syncer' to stop...
> Syncing disks, vnodes remaining...0 0 0 0 0 0 0 0 0 done
> All buffers synced.  <----- 10 MINUTE HANG AFTER PRINTING THIS
> Uptime: 3d15h56m32s
> usbus0: Controller shutdown
> uhub0: at usbus0, port 1, addr 1 (disconnected)
> usbus0: controller did not stop
> usbus0: Controller shutdown complete
> acpi0: Powering system off
> Connection closed by foreign host.
>
> So it seems like somewhere after "All buffers synced" and printing the
> uptime, it's very slowly unwinding whatever is using up all that RAM
> and swap.
>
> Does anyone have any idea what might be causing this or how to fix/prevent it?

There are many ways to create persistent anonymous shared memory
objects. A non-exhaustive list: tmpfs mounts, swap-backed md disks,
SysV shared memory, and possibly POSIX shared memory (I do not remember
which implementation is used in stable/9). I have quite possibly missed
some object types.

Also note that the active/inactive numbers can be explained by cached
file pages; only the swap usage suggests that it might be something
persistent from the list above.
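
A rough sketch of how one might check for the object types listed above,
in case it helps. This is not from the original report. The helper name
run_checks and the CHECKS table are made up for illustration. The
commands themselves (mount -t tmpfs, mdconfig -l -v, ipcs -m,
swapinfo -h) are standard FreeBSD tools; any that are not installed are
skipped. POSIX shared memory is not covered, since I am not aware of a
stock listing tool for it on stable/9.

```python
import shutil
import subprocess

# Hypothetical table: one listing command per object type named above.
# "md disks" shows all md devices; swap-backed ones report type "swap".
CHECKS = {
    "tmpfs mounts": ["mount", "-t", "tmpfs"],
    "md disks": ["mdconfig", "-l", "-v"],
    "SysV shared memory": ["ipcs", "-m"],
    "swap usage": ["swapinfo", "-h"],
}


def run_checks(checks=CHECKS):
    """Run each command and return {label: stdout text, or None if not installed}."""
    results = {}
    for label, argv in checks.items():
        if shutil.which(argv[0]) is None:
            # Tool is not present on this system; record that and move on.
            results[label] = None
            continue
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[label] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    for label, output in run_checks().items():
        print("== " + label + " ==")
        if output is None:
            print("(command not available)")
        elif output == "":
            print("(nothing listed)")
        else:
            print(output)
```

Anything large that shows up under tmpfs, a swap-type md device, or a
SysV segment with no attached processes would be a candidate for the
memory that no running process accounts for.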