Date: Thu, 21 Apr 2022 21:18:37 -0700
From: Mark Millard <marklmi@yahoo.com>
To: pete@nomadlogic.org, freebsd-current <freebsd-current@freebsd.org>
Subject: Re: Chasing OOM Issues - good sysctl metrics to use?
Message-ID: <83A713B9-A973-4C97-ACD6-830DF6A50B76@yahoo.com>
References: <83A713B9-A973-4C97-ACD6-830DF6A50B76.ref@yahoo.com>
Pete Wright <pete_at_nomadlogic.org> wrote on
Date: Thu, 21 Apr 2022 19:16:42 -0700 :

> on my workstation running CURRENT (amd64/32g of ram) i've been running
> into a scenario where after 4 or 5 days of daily use I get an OOM event
> and both chromium and firefox are killed. then in the next day or so
> the system will become very unresponsive in the morning when i unlock my
> screensaver in the morning forcing a manual power cycle.
>
> one thing i've noticed is growing swap usage but plenty of free and
> inactive memory as well as a GB or so of memory in the Laundry state
> according to top. my understanding is that seeing swap usage grow over
> time is expected and doesn't necessarily indicate a problem. but what
> concerns me is the system locking up while seeing quite a bit of disk
> i/o (maybe from paging back in?).
>
> in order to help chase this down i've setup the
> prometheus_sysctl_exporter(8) to send data to a local prometheus
> instance. the goal is to examine memory utilization over time to help
> detect any issues. so my question is this:
>
> what OID's would be useful to help see to help diagnose weird memory
> issues like this?
>
> i'm currently looking at:
> sysctl_vm_domain_0_stats_laundry
> sysctl_vm_domain_0_stats_active
> sysctl_vm_domain_0_stats_free_count
> sysctl_vm_domain_0_stats_inactive_pps
>
>
> thanks in advance - and i'd be happy to share my data if anyone is
> interested :)

Messages in the console output would be appropriate to report.
Messages might also be available via the following at
appropriate times:

# dmesg -a
. . .

or:

# more /var/log/messages
. . .

Generally, messages from after the boot is complete are more
relevant.

Messages like the following are some examples that would be
of interest:

pid . . .(c++), jid . . ., uid . . ., was killed: failed to reclaim memory
pid . . .(c++), jid . . ., uid . . ., was killed: a thread waited too long to allocate a page
pid . . .(c++), jid . . ., uid . . ., was killed: out of swap space

(That last is somewhat of a misnomer for the internal issue
that leads to it.)

I'm hoping you got message(s) of one or more of the above
kinds. But others are also relevant:

. . . kernel: swap_pager: out of swap space
. . . kernel: swp_pager_getswapspace(7): failed
. . . kernel: swap_pager: indefinite wait buffer: bufobj: . . ., blkno: . . ., size: . . .

(Those messages do not announce a process kill but
give some evidence about context.)

Some of the messages whose text partly matches actually
identify somewhat different contexts, so each message
type is relevant.

There may be other types of messages that are relevant.

The sequencing of the messages could be relevant.

Do you have any swap partitions set up and in use? The
details could be relevant. Do you have swap set up
some other way than via swap partition use? No swap?

If 1+ swap partitions are in use, things that suggest
the speed/latency characteristics of the I/O to the
drive could be relevant.

ZFS (so with ARC)? UFS? Both?

The first block of lines from a top display could be
relevant, particularly when it is clearly progressing
towards having the problem. (After the problem has
occurred is too late.) (I just picked top as a way to
get a bunch of the information all together
automatically.)

These sorts of things might help folks help you.

===
Mark Millard
marklmi at yahoo.com
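As an illustration only, the message types listed above could be picked out of saved log text mechanically, which also preserves their sequencing. The following is a minimal Python sketch, not an official tool: the labels and the function names `classify` and `scan` are hypothetical, and the patterns are just the fixed parts of the example messages quoted in this message, so other relevant message types would not be caught.

```python
import re

# Fixed text fragments taken from the example messages above.
# The labels on the left are made up for this sketch.
PATTERNS = [
    ("kill: failed to reclaim memory",
     re.compile(r"was killed: failed to reclaim memory")),
    ("kill: thread waited too long for a page",
     re.compile(r"was killed: a thread waited too long to allocate a page")),
    ("kill: out of swap space",
     re.compile(r"was killed: out of swap space")),
    ("context: swap_pager out of swap space",
     re.compile(r"swap_pager: out of swap space")),
    ("context: swp_pager_getswapspace failed",
     re.compile(r"swp_pager_getswapspace\(\d+\): failed")),
    ("context: indefinite wait buffer",
     re.compile(r"swap_pager: indefinite wait buffer")),
]


def classify(line):
    """Return the label of the first matching pattern, or None."""
    for label, rx in PATTERNS:
        if rx.search(line):
            return label
    return None


def scan(lines):
    """Return (line_number, label, text) for each matching line, in order."""
    hits = []
    for number, line in enumerate(lines, 1):
        label = classify(line)
        if label is not None:
            hits.append((number, label, line.rstrip("\n")))
    return hits
```

One possible use is to pass an open log file to it, e.g. `scan(open("/var/log/messages", errors="replace"))`, and report the resulting lines in order.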
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?83A713B9-A973-4C97-ACD6-830DF6A50B76>