Date: Sat, 21 Jan 2023 13:53:26 +0100
From: Peter
To: Mateusz Guzik
Cc: freebsd-stable@freebsd.org
Subject: Re: close() taking minutes (ZFS related)

On Fri, Jan 20, 2023 at 03:55:43PM +0100, Mateusz Guzik wrote:
! offcpu.d:
!
! #pragma D option dynvarsize=32m
!
! /*
!  * The p_flag & 0x4 test filters out kernel threads.
!  */
!
! sched:::sleep
! /(curthread->td_proc->p_flag & 0x4) == 0/
! {
!         self->ts = timestamp;
!         self->mesg = curthread->td_wmesg;
! }
!
! sched:::wakeup
! /self->ts/
! {
!         @[stack(30), stringof(self->mesg)] = sum(timestamp - self->ts);
!         self->ts = 0;
!         self->mesg = 0;
! }
!
! dtrace:::END
! {
!         printa("%k\n%s\n%@d\n\n", @);
! }
!
! dtrace -s offcpu.d -o offcpu.out
!
! git clone https://github.com/brendangregg/FlameGraph.git
! cat offcpu.out | perl FlameGraph/stackcollapse.pl | perl FlameGraph/flamegraph.pl --color=io > offcpu.svg

Okay, got it. And now?

I created two pictures, one when the machine is idle and one when it is
completely stuck, but I don't see significant differences. I dropped them
into the ACME directory, so you can have a look:

http://flag.daemon.contact/.well-known/acme-challenge/ZFS.idle.svg
http://flag.daemon.contact/.well-known/acme-challenge/ZFS.busy.svg

Probably this needs more time to really understand - my brain is currently
engaged with the fallout of these table migrations.
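Since the two SVGs look almost identical, comparing the raw numbers may be
more telling than the pictures. A rough sketch, assuming the two dtrace runs
were saved as offcpu.idle.out and offcpu.busy.out (those file names, and the
grep pattern, are just assumptions), to sum the off-CPU nanoseconds charged
to the txg waits that ps shows as "tx->tx_s":

    # file names assumed; adjust the grep pattern to the wait message of interest
    for f in offcpu.idle.out offcpu.busy.out; do
            printf '%s: ' "$f"
            perl FlameGraph/stackcollapse.pl "$f" | grep 'tx->tx' \
                | awk '{s += $NF} END {printf "%d ns off-CPU in tx->tx waits\n", s}'
    done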
I also tried to get a better view of what exactly is happening. This is a
small database load (ps columns: UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT
TT TIME COMMAND).

The beginning, everything normal:

770 85475 83995 0 21 0 17812 6032 select S+J 2 0:00.18 pg_restore -C -d postgres pgdump_TESTLOAD

The backend starts working, and things start to hang:

770 85503 44085 4 52 0 156652 109460 zcw->zcw DsJ - 0:05.78 postgres: postgres testload [local] ALTER TABLE (post
770 85504 44090 1 20 0 580 384 tx->tx_s DJ - 0:00.00 sh -c /ext/libexec/RedoLog.copy "00000001000000290000
770 85475 83995 4 22 0 17812 6036 select S+J 2 0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD

It's getting bad:

0 13187 1 1 20 0 13276 768 tx->tx_s DsJ - 0:14.51 /usr/sbin/blacklistd
770 44089 44085 16 20 0 80464 12620 tx->tx_s DsJ - 0:09.90 postgres: walwriter (postgres)
770 85503 44085 15 52 0 177132 124516 tx->tx_s DsJ - 0:06.65 postgres: postgres testload [local] CREATE INDEX (pos
770 85504 44090 10 20 0 580 384 tx->tx_s DJ - 0:00.00 sh -c /ext/libexec/RedoLog.copy "00000001000000290000
770 85475 83995 19 22 0 17812 6036 select S+J 2 0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD

And worse:

770 44089 44085 17 20 0 80464 12620 zcw->zcw DsJ - 0:09.92 postgres: walwriter (postgres)
770 85503 44085 2 27 0 177132 124980 usem SsJ - 0:07.39 postgres: postgres testload [local] CREATE INDEX (pos
8 85660 10053 1 52 0 428 252 tx->tx_s DJ - 0:00.00 expr ## : \\(.\\).*
770 85661 85504 8 22 0 416 232 zfs DJ - 0:00.00 fsync /var/db/pg-int/arch/000000010000002900000014
770 85475 83995 19 20 0 17812 6036 select S+J 2 0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD

DB loaded, but things are ugly now:

8 85660 10053 16 52 0 428 252 tx->tx_s DJ - 0:00.00 expr ## : \\(.\\).*
770 85661 85504 8 22 0 416 232 zfs DJ - 0:00.00 fsync /var/db/pg-int/arch/000000010000002900000014
5001 85683 85674 16 20 0 580 380 zfs DsJ - 0:00.00 /bin/sh -c /ext/libexec/svnsync.cron /ext/SVNR svn+ss
0 85687 85677 4 20 0 580 380 zfs DsJ - 0:00.00 /bin/sh -c /usr/libexec/atrun
8 85688 85679 5 20 0 580 380 zfs DsJ - 0:00.00 /bin/sh -c /usr/local/news/bin/send-uucp
770 85666 83995 10 20 0 1020 764 zfs D+J 2 0:00.00 psql -c drop database testload

Then I tried to do the load again with sync writes disabled, to see if
things change (they do not). But at that time the daily periodic run
appeared, and it does a VACUUM - that is, a first VACUUM on all the migrated
tables. And now I give up, because this is unusable:

0 1430 1429 4 20 0 12860 3496 tx->tx_s D - 0:00.09 / /usr /var /home /ext /j/kerb /media /j/oper /usr/loc
0 2569 10505 14 20 0 21072 5800 tx->tx_s DJ - 0:00.03 sendmail: 30L26S6D002569 wand-n.intra.daemon.contact [
0 2273 2269 13 30 0 13680 3480 tx->tx_s Ds - 0:00.02 /bin/sh /ext/libexec/Key.renew
770 2645 44085 2 23 0 83024 23528 tx->tx_s DsJ - 0:25.15 postgres: postgres bareos [local] VACUUM (postgres)
770 2924 89818 9 20 0 17176 5308 tx->tx_s DJ - 0:00.01 fetch --no-verify-peer -1 -T 30 -o /tmp/kurse.89801.ht
8 2936 10053 2 22 0 428 252 tx->tx_s DJ - 0:00.00 expr 29 + 1
0 2947 2845 0 20 0 18384 3736 tx->tx_s DJ - 0:00.00 sendmail: ./30L26SBv001628 from queue (sendmail)
770 2949 2668 2 21 0 424 240 zfs DJ - 0:00.00 mv /var/db/pg-int/arch/000000010000002900000023 /var/d
0 3110 3108 12 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SDj001446 from queue (sendmail)
0 3335 3334 11 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26Sjq001053 from queue (sendmail)
997 3504 4734 7 20 0 580 380 zfs DJ - 0:00.01 sh -c /sbin/ping -c2 wand-n.intra.daemon.contact > /de
997 3505 4734 15 20 0 580 380 zfs DJ - 0:00.01 sh -c sleep `jot -r 1 1 55`
997 3506 4734 12 20 0 580 380 zfs DJ - 0:00.01 sh -c /sbin/ping -c2 pole-n.intra.daemon.contact > /de
997 3507 4734 8 21 0 580 380 zfs DJ - 0:00.01 sh -c /usr/local/bin/xlsclients -display disp-e.intra.
0 3548 3547 4 21 0 18384 3696 zfs DJ - 0:00.00 sendmail: ./30L26SRV001167 from queue (sendmail)
0 3742 3740 9 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SrD001048 from queue (sendmail)
0 3955 3954 8 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26Sjs001053 from queue (sendmail)
0 4155 4154 17 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SBx001628 from queue (sendmail)
0 4375 4374 6 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SLE002237 from queue (sendmail)
0 4575 4574 17 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SDl001446 from queue (sendmail)
0 4753 4752 11 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SrF001048 from queue (sendmail)
0 4952 4951 12 20 0 18384 3696 zfs DJ - 0:00.00 sendmail: ./30L26SRX001167 from queue (sendmail)
0 5170 5169 3 20 0 18384 3700 zfs DJ - 0:00.00 sendmail: ./30L26SWb001552 from queue (sendmail)
0 5383 5382 1 20 0 18384 3660 zfs DJ - 0:00.00 sendmail: ./30L26SYD000965 from queue (sendmail)
0 10486 1 6 20 0 13000 620 tx->tx_s DsJ - 0:13.29 /usr/sbin/cron -s
770 3514 83995 16 20 0 596 288 zfs D+J 2 0:00.00 pg_restore -C -d postgres pgdump_TESTLOAD
1000 3518 83924 18 20 0 13396 3200 zfs D+ 4 0:00.00 ls -la /j/admn/etc/

I attached truss to the VACUUM worker, and it does file reads and writes in
8k blocks continuously, just as it should, keeping the disks busy. But
meanwhile, nothing else makes any progress. I had to wait half an hour until
the VACUUM was done; then everything went back to normal, and nothing bad
had happened (my scripting is usually written to cope with indefinite stalls
at any place).
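Next time it hangs, the kernel stacks of a few of the stuck (D-state)
processes might tell more than the wait channel alone - e.g. whether they
sit in txg_wait_synced(), zil_commit() or dmu_tx_wait(). A quick sketch (PID
2645 is the VACUUM backend from the listing above; any D-state PID would do):

    # kernel stack of one blocked process
    procstat -kk 2645

    # or sweep everything currently in disk wait
    ps -axo pid,state,wchan,comm | awk '$2 ~ /^D/ {print $1}' \
        | while read pid; do procstat -kk "$pid"; done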
The issue is a lot worse than I initially thought; it affects everything:
sort, reindex, vacuum, ...

As it looks, ZFS allows only a single process to write to the pool, and
everything else is blocked from accessing it during that time. And this
seems SSD-specific: the DB was on mechanical disks before and did fine for
the last 15 years, precisely saturating the SCSI bus and never stalling.

I suspect this may be related to some of the pseudo-AI that went into ZFS
over the years: all the knobs I used to tune with have disappeared, and even
arc_min/arc_max seem to be in the process of disappearing.

Back in 2008 this DB ran on a Pentium II with 750 MB RAM, and things did
work after proper tuning. Now it has 20 cores and 48 GB RAM, and it doesn't
work.
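For what it's worth, the write throttle and ARC knobs have not vanished
completely; with OpenZFS on 13.x they live under vfs.zfs. A rough sketch for
inspecting them - the exact names below are taken from OpenZFS 2.x and may
differ between releases, so treat them as an assumption and verify with
"sysctl -a | grep vfs.zfs":

    # transaction group / write throttle settings (names assumed from OpenZFS 2.x)
    sysctl vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_percent
    sysctl vfs.zfs.txg.timeout vfs.zfs.delay_min_dirty_percent

    # ARC limits; the old vfs.zfs.arc_max/arc_min appear to remain as
    # compatibility aliases for vfs.zfs.arc.max/arc.min
    sysctl vfs.zfs.arc_max vfs.zfs.arc_min

    # example only, not a recommendation: cap dirty data at 1 GiB so a single
    # bulk writer hits the throttle earlier
    # sysctl vfs.zfs.dirty_data_max=1073741824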