Date: Wed, 17 Oct 2018 13:25:31 +1100
From: Darek Margas <darek@tramada.com>
To: freebsd-hackers@freebsd.org
Subject: High load and MySQL slow without apparent reason
Message-ID: <CAG0rGZecYsycwuBzhRBngnBc7TG5Y5913VmdLPPhCbodZPKu8Q@mail.gmail.com>
Hi Everyone,

I'm trying to refresh my old FreeBSD experience by moving a MySQL platform from Linux onto FreeBSD+ZFS. Before I ask for your help I would like to give you some context.

The machine is a Dell server with 2x20 cores, an Intel IXL NIC, 1TB of RAM and lots of SAS SSD drives. The kernel is slightly modified: some unused stuff removed, the ixl driver replaced with the latest one from the Intel website, and NUMA enabled. The whole thing runs a number of MySQL daemons packed in jails (bridged network) with settings optimized for ZFS ARC caching (O_DIRECT, small buffers, etc). This is 11.2-RELEASE.

When I tested it the first time I ran into trouble with back pressure on the ARC while short of memory, leading the machine to death. I also found that disabling ARC compression solved the silent death, but decided to do some tuning to keep more memory free for sudden need. Ran some tests, used it for replication slaves, etc.

Here is the thing - how I crashed this machine without understanding what happened.

First my tunes. I adjusted v_free_target and v_free_min, aiming for 128G and 64G respectively. However, I overlooked the fact that these are in pages, not in 1k blocks. As a result I set:
- 700G max ARC size
- 512G v_free_target
- 256G v_free_min

Obviously this is nonsense; however, the machine worked calmly until the ARC got to half of memory. Then it all went wrong. As I had built the machine with no swap at all, I got a number of zombies and problems with reclaiming the console (say, open vi, which works, then exit, and vi stays on the console after becoming a zombie). That was "fixed" by disabling swapping via sysctl. I also noticed 25% of CPU taken by "system" with nothing showing up in top except pagedaemon and zfs (in arc_reclaim).

I added 40G of swap and rebooted the machine but kept the wrong settings. It was again calm until the ARC got to half of memory.
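The unit mix-up above can be checked with a little arithmetic. The following is only a sketch, assuming 4 KiB pages (the usual amd64 page size; the real value can be read from `hw.pagesize`) and binary gigabytes; the function names are made up for illustration.

```python
# Sketch of the pages-vs-1k-blocks mistake. PAGE_SIZE is an assumption
# (4 KiB pages); the helper names are hypothetical, not from any tool.
PAGE_SIZE = 4096
GIB = 1024 ** 3


def gib_to_pages(gib, page_size=PAGE_SIZE):
    """Number of pages that make up `gib` GiB of memory."""
    return gib * GIB // page_size


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Amount of memory, in GiB, covered by `pages` pages."""
    return pages * page_size / GIB


def gib_to_1k_blocks(gib):
    """The mistaken conversion: GiB expressed as a count of 1 KiB blocks."""
    return gib * GIB // 1024


for name, wanted_gib in (("v_free_target", 128), ("v_free_min", 64)):
    mistaken = gib_to_1k_blocks(wanted_gib)  # value set by mistake
    correct = gib_to_pages(wanted_gib)       # value that was intended
    print(f"{name}: wanted {wanted_gib}G, "
          f"mistaken value {mistaken} pages = {pages_to_gib(mistaken):.0f}G, "
          f"correct value {correct} pages")
```

With these assumptions the mistaken values come out four times too large, which matches the 512G and 256G figures listed above.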
This is when I found what I had done and fixed the v_free stuff to be:
- 128G v_free_target
- 64G v_free_min

The machine started managing memory the right way, moving inactive pages to laundry and laundering only when needed. I still observed 25% of unexplained load from "system" (floating between 5% and 60%) but all seemed OK. At this point I switched one replica to be master and put production queries on it.

Summarizing the above - the machine had had issues and had not been rebooted, but seemed OK with memory management while showing unexplained system load.

Once I switched my SQLs from the Linux master to FreeBSD I noticed slow performance. There is a stored proc called every 15 minutes. On the old machine and all others it takes around 30-40s to complete, and the previous master had a spike in ROW executions to 650kps (one minute sample), while the new one only got up to 350kps and ran for nearly 3 minutes.

I started looking deeper and found:
- Making all MySQL settings the same (where possible, as some follow the platform) gave no improvement
- A MySQL reload did not help
- Stopping all the replicas running on the same machine (5 of them) to release resources made it worse (over 5 minutes to complete the call). Starting the replicas again made it better by one minute.

BTW - the jail was limited to one NUMA zone and half the cores. Not all replicas had the same NUMA zone and CPU group.

I copied the ZFS content to a test machine which is exactly the same and started the same MySQL in the same jail with the same settings.
- The test instance ran correctly, with a completion time similar to the old Linux master
- The ARC on the test machine was loaded up to 700G so I thought it would be good enough to compare, but the machine still had lots of free memory

To make it closer I compiled a "memory allocator" which simply allocates and fills memory until it is killed or the system dies.
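The source of that allocator is not part of this message. Purely as an illustration of the idea, here is a minimal sketch in Python (not the compiled program that was actually used); `hog_memory`, `touch_pages` and their parameters are hypothetical names. It writes one byte per page so that the memory is really backed by physical pages rather than just reserved.

```python
# Illustrative memory hog: allocate chunks and touch every page.
# Not the original tool; all names here are made up for this sketch.
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size reported by the running system


def touch_pages(buf, page_size=PAGE_SIZE):
    """Write one non-zero byte into every page of buf to force it resident."""
    n_pages = (len(buf) + page_size - 1) // page_size
    buf[::page_size] = b"\x01" * n_pages


def hog_memory(chunk_bytes, max_chunks=None, page_size=PAGE_SIZE):
    """Allocate and fill chunks; with max_chunks=None, run until killed."""
    held = []  # keep references so nothing is freed
    while max_chunks is None or len(held) < max_chunks:
        chunk = bytearray(chunk_bytes)
        touch_pages(chunk, page_size)
        held.append(chunk)
    return held


# Small bounded demonstration: 4 chunks of 1 MiB each.
demo = hog_memory(chunk_bytes=1 << 20, max_chunks=4)
print(f"holding {sum(len(c) for c in demo)} bytes in {len(demo)} chunks")
```

Calling `hog_memory(1 << 30)` without `max_chunks` would keep allocating 1 GiB chunks until the process is killed or allocation fails, so that form should only be run on a machine that is meant to be put under memory pressure.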
I ran it on the test machine first:
- No effect until v_free_target was passed
- Once passed, pagedaemon kicked in, memory got wiped and shifted, swap got full (paging only anyway)
- Load of around 20% appeared from system, similar to the broken production machine
- Free memory got down to 50G, passing v_free_min
- Killed the allocator
- After freezing for 1-2s everything got back to normal and the load from system was gone
- Swap stayed in use for some time after but finally got clean (that was only 4G of swap on the test machine)
- After some time the machine is still calm and MySQL is fast

I repeated the same on the production machine:
- All as above, except:
  - after killing the allocator the machine froze for, say, 10-15s
  - memory was released but the load did not change - it neither got much higher while allocating memory nor lower after
- The machine remained slow

Finally I rebooted the whole machine and now it is fast while building up the ARC. I believe it won't have the same issue soon as the v_free stuff is set correctly; however, I need to understand why this MySQL process suffered and whether it was possible to recover it without a reboot. I can imagine it was something running in a loop, or contention on something otherwise unused, or simply another clash in settings triggering something in an unusual way, but I have no idea where to look to investigate it. Well, it's possible that there is a bug too.

Before the reboot I collected various vmstat and top outputs, ran ktrace on MySQL and dumped the settings with sysctl. I'm not posting them as I don't know what would be useful. Could you please point me in the right direction?

Cheers,
Darek
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAG0rGZecYsycwuBzhRBngnBc7TG5Y5913VmdLPPhCbodZPKu8Q>