Date: Sun, 20 Jul 2014 16:35:57 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: Adrian Chadd <adrian@freebsd.org>, FreeBSD Stable Mailing List <freebsd-stable@freebsd.org>
Subject: Re: Consistently "high" CPU load on 10.0-STABLE
Message-ID: <20140720233557.GA88887@icarus.home.lan>
In-Reply-To: <3E5D732C440140B9AEE204B91E5B120E@multiplay.co.uk>
References: <20140720062413.GA56318@icarus.home.lan> <97EA8E571E634DBBAA70F7AA7F0DE97C@multiplay.co.uk> <20140720173524.GA67065@icarus.home.lan> <ED826825202341E58B71A3F718B60562@multiplay.co.uk> <20140720201655.GA70545@icarus.home.lan> <CAJ-Vmo=O-OH-Ljk1u-SGF1L=jj=Lnwd_aQs0Bqc5Mi03jZvkuw@mail.gmail.com> <20140720225845.GA81033@icarus.home.lan> <3E5D732C440140B9AEE204B91E5B120E@multiplay.co.uk>
On Mon, Jul 21, 2014 at 12:16:48AM +0100, Steven Hartland wrote:
> 
> ----- Original Message ----- 
> From: "Jeremy Chadwick" <jdc@koitsu.org>
> To: "Adrian Chadd" <adrian@freebsd.org>
> Cc: "Steven Hartland" <killing@multiplay.co.uk>; "FreeBSD Stable Mailing List" <freebsd-stable@freebsd.org>
> Sent: Sunday, July 20, 2014 11:58 PM
> Subject: Re: Consistently "high" CPU load on 10.0-STABLE
> 
> 
> > On Sun, Jul 20, 2014 at 03:09:55PM -0700, Adrian Chadd wrote:
> >> hi,
> >>
> >> it looks like a whole lot of things are waking up at the same time:
> >>
> >> * dhcpd
> >> * em
> >> * usb devices
> >>
> >> So, do you have some shared interrupts going on here? That seems to be
> >> what's causing things to all wake up all at once.
> >
> > I forget how to get an interrupt mapping from the I/O APIC, but dmesg
> > indicates the following.  Sorted by IRQ order so that you can tell
> > what's associated with what, and also RELENG_9 vs. RELENG_10 (because I
> > do have an old dmesg.today from this box running RELENG_9).  All the
> > IRQs match up:
> >
> > dev      RELENG_10     RELENG_9
> > -------- ------------- -------------
> > ioapic0  IRQs 0 to 23  IRQs 0 to 23  (same)
> > ioapic1  IRQs 24 to 47 IRQs 24 to 47 (same)
> > attimer0 IRQ 0         IRQ 0         (same)
> > atkbdc0  IRQ 1         IRQ 1         (same)
> > atkbd0   IRQ 1         IRQ 1         (same)
> > uart1    IRQ 3         IRQ 3         (same)
> > uart0    IRQ 4         IRQ 4         (same)
> > atrtc0   IRQ 8         IRQ 8         (same)
> > em0      IRQ 16        IRQ 16        (same)
> > pcib1    IRQ 16        IRQ 16        (same)
> > pcib3    IRQ 16        IRQ 16        (same)
> > pcib4    IRQ 16        IRQ 16        (same)
> > uhci0    IRQ 16        IRQ 16        (same)
> > ahci0    IRQ 17        IRQ 17        (same)
> > em1      IRQ 17        IRQ 17        (same)
> > ichsmb0  IRQ 17        IRQ 17        (same)
> > pcib5    IRQ 17        IRQ 17        (same)
> > uhci1    IRQ 17        IRQ 17        (same)
> > ehci0    IRQ 18        IRQ 18        (same)
> > uhci2    IRQ 18        IRQ 18        (same)
> > uhci5    IRQ 18        IRQ 18        (same)
> > siis0    IRQ 21        IRQ 21        (same)
> > uhci4    IRQ 22        IRQ 22        (same)
> > ehci1    IRQ 23        IRQ 23        (same)
> > uhci3    IRQ 23        IRQ 23        (same)
> >
> > And the higher-numbered IRQs per vmstat -i.  I only have this for
> > RELENG_10, however:
> >
> > irq256: em0         1848856   26
> > irq259: ahci0:ch0    273086    3
> > irq260: ahci0:ch1      9990    0
> > irq261: ahci0:ch2     48514    0
> > irq262: ahci0:ch3     48046    0
> > irq263: ahci0:ch4     48258    0
> > irq264: ahci0:ch5     48052    0
> >
> > vmstat -i for this is kinda painful (I discussed this with jhb@ in the
> > past, re: the kernel just appending "+" to the string to indicate "many
> > things using this IRQ").
> >
> > I have absolutely no USB devices attached to the system (meaning there
> > are USB controllers and ports, yeah, but nothing attached to any of
> > them).  The keyboard is PS/2.  All disks are on ahci0 (no disks
> > currently attached to siis0).
> >
> > As for dhcpd: I don't know how that'd be responsible.  If I stop the
> > process entirely, I still see the problem.
> >
> > I can provide some more ktrdumps, along with turning off as many
> > daemons + cron jobs as I can, if you feel that'd be helpful.
> >
> > Likewise, I can provide an ACPI DSDT dump if that would be useful
> > (maybe to someone else).
> >
> > I haven't tried booting the box in single-user and letting it sit
> > there to see if anything shows up.
> >
> > In the interim I wrote the perl script I mentioned in my mail to Steve.
> > When the load shoots up, there is literally no field in "vmstat -s"
> > that shows a humongous increase (or decrease) consistently.  Meaning
> > I'd say 95% of the time, when there's a sudden load jump, I can't
> > correlate it with any of those statistics.
> > It's a pretty "meh" script, but it does the job of showing deltas
> > between vmstat -s runs and indicating visually when there's a jump in
> > the load average (1m avg).  It requires a VERY wide terminal (about
> > 301 characters):
> >
> > http://jdc.koitsu.org/freebsd/releng10_perf_issue/load_vmstat.pl
> >
> > Some example output is here (obviously you can't see the red+bold
> > highlighting of the line):
> >
> > http://jdc.koitsu.org/freebsd/releng10_perf_issue/example_data.txt
> >
> > Load jumps at the following time indexes:
> >
> > 124.0 (from 0.02 to 0.10, load delta: 0.08)
> > 153.0 (from 0.06 to 0.14, load delta: 0.08, time delta: 29.0 sec)
> > 178.5 (from 0.10 to 0.17, load delta: 0.07, time delta: 25.5 sec)
> > 217.0 (from 0.09 to 0.17, load delta: 0.08, time delta: 38.5 sec)
> > 236.0 (from 0.12 to 0.19, load delta: 0.07, time delta: 19.0 sec)
> > 244.0 (from 0.17 to 0.24, load delta: 0.07, time delta: 8.0 sec)
> > 259.0 (from 0.20 to 0.27, load delta: 0.07, time delta: 15.0 sec)
> > 284.5 (from 0.19 to 0.25, load delta: 0.06, time delta: 25.5 sec)
> > 310.0 (from 0.18 to 0.25, load delta: 0.07, time delta: 25.5 sec)
> > 341.5 (from 0.27 to 0.33, load delta: 0.06, time delta: 31.5 sec)
> >
> > Some of these could be due to cron jobs I run (though they really
> > aren't that intensive on disk, CPU, or memory), but there's a pretty
> > consistent pattern going on there load-wise.  The reason I noted time
> > deltas was to watch for "periodic tasks", e.g. the ZFS txg flush, but
> > this seems to have a little bit more variance.
> >
> > It's just that none of the vmstat -s statistics change rapidly
> > alongside the load.  But I'm sure there are VM bits that aren't
> > tracked in vmstat.
> 
> Not sure if it's in stable/10, but there was some talk about making
> ZFS use lz4 for some things by default; wonder if that might have
> something to do with it?

I hope not -- the ZFS pools and filesystems were both created on
RELENG_9 circa January 2014 (according to "zpool history").  Well,
actually, the backups pool was created a few days ago (on RELENG_9) due
to a disk failure.  ZFS details:

http://jdc.koitsu.org/freebsd/releng10_perf_issue/zfs_get_all.txt
http://jdc.koitsu.org/freebsd/releng10_perf_issue/zpool_status.txt
http://jdc.koitsu.org/freebsd/releng10_perf_issue/zdb_ashift.txt

I do use lz4 compression on two specific filesystems (data/www and
data/cvs); content on data/www is only fed by Apache, and data/cvs isn't
used by anything unless I'm using "cvs" (I do run cvspserver).  I don't
use dedup or snapshots.  The backing disks are 4K-sector (they lack 4K
kernel quirks in FreeBSD), but ashift is 12 on all of them.

If we think it's related to lz4, I can absolutely recreate those
filesystems sans compression with little effort -- they're very small
and contain very little data:

$ zfs list data/www data/cvs
NAME       USED  AVAIL  REFER  MOUNTPOINT
data/cvs   520K  1.27T   520K  /cvs
data/www  25.9M  1.27T  25.9M  /www

I don't dare run "zpool upgrade" until this can be figured out, or at
least determined for sure to be "specific to my environment" (I'll just
go back to RELENG_9 in that case, no worries).

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
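P.S. For anyone who wants the gist of load_vmstat.pl without fetching
it: the approach described above is just "sample vmstat -s and the
1-minute load average on an interval, diff the counters, and flag
samples where the load jumps."  A minimal Python sketch of that idea
follows -- it is NOT the actual Perl script; the 5-second interval and
the 0.05 jump threshold are illustrative guesses, and it prints deltas
one per line instead of needing a 301-column terminal:

    #!/usr/bin/env python3
    # Rough sketch of a "vmstat -s delta watcher" in the spirit of the
    # load_vmstat.pl mentioned above.  NOT the actual script; the
    # sampling interval and jump threshold are illustrative guesses.
    import os
    import subprocess
    import time

    INTERVAL = 5.0         # seconds between samples (assumption)
    JUMP_THRESHOLD = 0.05  # flag 1m load increases >= this (assumption)

    def vmstat_counters():
        """Parse `vmstat -s` lines ("<value> <description>") into a dict."""
        out = subprocess.run(["vmstat", "-s"], capture_output=True,
                             text=True, check=True).stdout
        counters = {}
        for line in out.splitlines():
            parts = line.strip().split(None, 1)
            if len(parts) == 2 and parts[0].isdigit():
                counters[parts[1]] = int(parts[0])
        return counters

    def main():
        prev = vmstat_counters()
        prev_load = os.getloadavg()[0]
        while True:
            time.sleep(INTERVAL)
            cur = vmstat_counters()
            load = os.getloadavg()[0]
            jumped = (load - prev_load) >= JUMP_THRESHOLD
            print("load %.2f -> %.2f%s" % (prev_load, load,
                                           "   <== jump" if jumped else ""))
            if jumped:
                # On a jump, show which counters moved since the last
                # sample, biggest movers first, to eyeball for a culprit.
                deltas = {k: cur[k] - prev.get(k, 0) for k in cur
                          if cur[k] != prev.get(k, 0)}
                for name, delta in sorted(deltas.items(),
                                          key=lambda kv: -abs(kv[1])):
                    print("  %+12d  %s" % (delta, name))
            prev, prev_load = cur, load

    if __name__ == "__main__":
        main()

The same caveat noted above applies: anything the VM system does that
isn't exported via vmstat -s won't show up here either.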
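On the lz4 angle specifically, the authoritative list of which datasets
have compression enabled is already in zfs_get_all.txt above; for
completeness, here is a tiny hypothetical helper (not part of anything
referenced in this thread) that wraps the standard "zfs get -H -o
name,value compression" invocation and prints only the datasets where
compression isn't off:

    #!/usr/bin/env python3
    # Quick check for the lz4 question: list every dataset whose
    # compression property is not "off".  A sketch only -- it just wraps
    # `zfs get`, the same data already captured in zfs_get_all.txt.
    import subprocess

    def compression_by_dataset():
        """Return {dataset: compression value} via `zfs get -H`."""
        out = subprocess.run(
            ["zfs", "get", "-H", "-o", "name,value", "compression"],
            capture_output=True, text=True, check=True).stdout
        return dict(line.split("\t", 1) for line in out.splitlines() if line)

    if __name__ == "__main__":
        for name, value in sorted(compression_by_dataset().items()):
            if value != "off":
                print("%s\tcompression=%s" % (name, value))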