Subject: Re: ZFS + FreeBSD XEN dom0 panic
From: Ze Dupsys <zedupsys@gmail.com>
To: Roger Pau Monné
Cc: freebsd-xen@freebsd.org, buhrow@nfbcal.org
Date: Sun, 6 Mar 2022 14:41:17 +0200
Message-ID: <3d4691a7-c4b3-1c91-9eaa-7af071561bb6@gmail.com>
On 2022.03.03. 12:39, Roger Pau Monné wrote:
> On Wed, Mar 02, 2022 at 07:26:18PM +0200, Ze Dupsys wrote:
>> Today managed to crash lab Dom0 with:
>> xen_cmdline="dom0_mem=6144M dom0_max_vcpus=2 dom0=pvh,verbose=1
>> console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1
>> reboot=no"
> Hm, it's weird that reboot=no doesn't work for you. Does noreboot
> instead make a difference?
>
>> I wrote 'vmstat -m | sort -k 2 -r' every 120 seconds; the latest one is in
>> the attachment, and the panic had the same fingerprint as the one with the
>> "rman_is_region_manager" line already reported.
>
>> The scripts I ran in parallel were generally the same as attached in the bug
>> report, just a bit modified.
>> 1) ./libexec.sh zfs_volstress_fast_4g (this just creates new ZVOLs and,
>> instead of 2GB, writes 4GB into each created ZVOL with dd if=/dev/zero)
>> 2) ./test_vm1_zvol_3gb.sh (this loops: start first DomU, write 3GB in its
>> /tmp, restart DomU, remove /tmp, repeat)
>> 3) ./test_vm2_zvol_5_on_off.sh (this loops: start second DomU, which has 5
>> disks attached, turn off DomU, repeat)
> Right. So the trigger for this seems to be related to creating (and
> destroying) VMs in a loop?
>
> Do you still see the same if you only execute steps 1 and 4 from the
> repro described above?
>
>> 4) monitoring: sleep 120 seconds, print vmstat | sort to serial output.
>>
>> Around dom id 108, the system started to behave suspiciously: xl list showed
>> DomUs created, but they did not really start up, the script timed out on the
>> ssh connection, and there was no vnc. When I did xl destroy manually, and
>> then xl create, the system panic happened.
> Could you also add the output of `top -n1` to see where memory is
> going?
>
> I'm quite sure we have a leak in some of the backends, maybe the
> bounce buffer used by blkback.
>
> Thanks, Roger.

Hello,

About changing the XEN command line option "reboot=no" to "noreboot": this did not change anything, the physical machine still reboots on kernel panic. Maybe I have misunderstood the use of this option. I somehow thought that reboot=no meant that if Dom0 crashes/panics the physical machine would not be rebooted, but in reality it does reboot. This is not so important for me at the moment, since I am recording all console output from serial anyway, so messages are not lost as long as they are flushed out in a timely manner. Maybe the kernel itself calls a reset to XEN after the panic?
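In case it helps to picture the start/stop loop mentioned further below, scripts 2 and 3 essentially boil down to something like this sketch (simplified and hypothetical: the DomU name and config path are made up here, the real scripts are the ones attached to the bug report):

  #!/bin/sh
  # Simplified, illustrative sketch of the DomU start/stop loop (not the real script).
  CFG=/path/to/domu.cfg        # DomU config, e.g. one with 5 ZVOL-backed disks
  NAME=test-vm                 # must match the name set inside that config
  i=0
  while true; do
      xl create "$CFG"         # start the DomU
      sleep 60                 # let it boot and do its disk writes
      xl destroy "$NAME"       # turn it off again
      i=$((i + 1))
      echo "start/stop iteration $i done"
  done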
If I execute only steps 1 and 4, it seems that the machine does not crash; tested for 8 hours with an 8GB RAM Dom0 and 4 hours with a 2GB RAM Dom0. So creating a lot of ZVOLs and monitoring them does not seem to be the cause.

Yesterday was a happy and a sad day at the same time. The happy part is that I think I have understood one panic reason and located one problem somewhat better; the sad part is that the 2 bugs I think I have found explain only one specific panic, while the 2 other panic reasons seem to come from different circumstances.

What follows is pure speculation about two things, somewhat related; I do not know XEN or kernel internals, so take this with a grain of salt.

Knowing that this seemed to be a RAM problem, and thinking about how the ZFS sysctl tuning values differ, I started to look at "sysctl -a" diffs for a Dom0 with 2GB RAM and one with 8GB RAM. By pure luck, because I was comparing a Dom0 that had been running for a long time with one that had just been restarted, I noticed diff lines like these:

..
+dev.xbbd.202.%driver: xbbd
+dev.xbbd.202.%location:
+dev.xbbd.202.%parent: xenbusb_back0
+dev.xbbd.202.%pnpinfo:
+dev.xbbd.202.xenbus_connection_state: Closed
+dev.xbbd.202.xenbus_dev_type: vbd
+dev.xbbd.202.xenbus_peer_domid: 309
+dev.xbbd.202.xenstore_path: backend/vbd/309/51728
+dev.xbbd.202.xenstore_peer_path: /local/domain/309/device/vbd/51728
+dev.xbbd.203.%desc: Backend Virtual Block Device
+dev.xbbd.203.%driver: xbbd
+dev.xbbd.203.%location:
+dev.xbbd.203.%parent: xenbusb_back0
+dev.xbbd.203.%pnpinfo:
+dev.xbbd.203.xenbus_connection_state: Closed
..

even though there were at most 2 DomUs running on that system at the time. This seems to be why it felt like a memory leak: after a long time of starting/stopping VMs, the sysctl tree most probably hits some sort of RAM limit (which most probably is auto-calculated from total RAM somehow) and the system panics. I can reproduce this sysctl growth just by starting/stopping VMs in a loop, although stressing ZVOLs at the same time speeds it up, for reasons unknown to me. What I do not know, though, is what exactly triggers it, because there are also times when xbbd.$number does not grow beyond the total number of disk devices attached to all DomUs. I do not know what would be the best way to continue with this.
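For completeness, the way I watch this is nothing fancy; something along the lines of the sketch below is enough (this is not the exact script I use, the snapshot file names are arbitrary):

  #!/bin/sh
  # Rough sketch: snapshot the sysctl tree and count dev.xbbd.* entries over time.
  while true; do
      ts=$(date +%s)
      sysctl -a > /tmp/sysctl.$ts
      echo "$ts dev.xbbd entries: $(grep -c '^dev\.xbbd\.' /tmp/sysctl.$ts)"
      sleep 120
  done
  # Later, two snapshots taken before/after a start/stop run can be compared with:
  #   diff -u /tmp/sysctl.<before> /tmp/sysctl.<after> | grep '^+dev\.xbbd\.'

Diffing two such snapshots is where the +dev.xbbd lines above come from.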
Then I caught another sysctl variable that grows due to XEN: "kern.msgbuf: Contents of kernel message buffer". I do not know how this variable grows or which component manages it, but in the VM start/stop case it grows and contains lines with a pattern like this:

..
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
xnb(xnb_ring2pkt:1526): Unknown extra info type 255.  Discarding packet
xnb(xnb_dump_txreq:299): netif_tx_request index =0
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =0
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
xnb(xnb_dump_txreq:303): netif_tx_request.id    =69
xnb(xnb_dump_txreq:304): netif_tx_request.size  =1000
xnb(xnb_dump_txreq:299): netif_tx_request index =1
xnb(xnb_dump_txreq:300): netif_tx_request.gref  =255
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
xnb(xnb_dump_txreq:303): netif_tx_request.id    =0
xnb(xnb_dump_txreq:304): netif_tx_request.size  =0
..

Those lines just keep accumulating in that variable; they are never flushed or trimmed. Each time I get the same message on serial output, it has one more section of errors appended compared to the "same" previous serial message, and the sysctl variable grows in the same way. Thus at some point the serial output and the sysctl contain a large block of those errors while a VM is starting, and eventually the value of this sysctl could reach the maximum allowed/available and panic the system. I do not know the reason for those errors, but if there were a patch to suppress them, this could be "solved".
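To get an idea of how much of the message buffer those xnb lines occupy, something as simple as this can be used (a sketch; kern.msgbuf holds the buffer contents, so counting its bytes and matching lines is enough):

  #!/bin/sh
  # Rough sketch: measure kernel message buffer usage and count xnb backend lines in it.
  total=$(sysctl -n kern.msgbuf | wc -c)          # bytes currently in the buffer contents
  xnb=$(sysctl -n kern.msgbuf | grep -c 'xnb(')   # lines produced by the xnb backend
  echo "msgbuf bytes: $total  xnb lines: $xnb"

Comparing that byte count against kern.msgbuf_size over time should show whether the buffer really keeps growing or just wraps around, assuming that tunable is what bounds it.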
Another diff chunk might be related to this:

+dev.xnb.1.xenstore_peer_path: /local/domain/7/device/vif/0
+dev.xnb.1.xenbus_peer_domid: 7
+dev.xnb.1.xenbus_connection_state: InitWait
+dev.xnb.1.xenbus_dev_type: vif
+dev.xnb.1.xenstore_path: backend/vif/7/0
+dev.xnb.1.dump_rings:
+dev.xnb.1.unit_test_results: xnb_rxpkt2rsp_empty:1765 Assertion Error: nr_reqs == 0
+xnb_rxpkt2rsp_empty:1767 Assertion Error: memcmp(&rxb_backup, &xnb_unit_pvt.rxb, sizeof(rxb_backup)) == 0
+xnb_rxpkt2rsp_empty:1769 Assertion Error: memcmp(&rxs_backup, xnb_unit_pvt.rxs, sizeof(rxs_backup)) == 0
+52 Tests Passed
+1 Tests FAILED

What is suspicious about this is that I am running 13.0-RELEASE-p7, not a DEV version of XEN or anything, yet there are sysctl variables with a "dev" prefix and unit test results in them. Could it be that the XEN port has accidentally been compiled with some development flags turned on?

So now it feels like I understand this panic case:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x22710028
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c45892
stack pointer           = 0x28:0xfffffe009656a930
frame pointer           = 0x28:0xfffffe009656a930
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 17633 (devmatch)
trap number             = 12
panic: page fault
cpuid = 1
time = 1646139877
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff8108b1a7 at trap+0xbc7
#4 0xffffffff8108b1ff at trap+0xc1f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061b18 at calltrap+0x8
#7 0xffffffff80c62011 at rman_is_region_manager+0x241
#8 0xffffffff80c1a051 at sbuf_new_for_sysctl+0x101
#9 0xffffffff80c1949c at kernel_sysctl+0x43c
#10 0xffffffff80c19b13 at userland_sysctl+0x173
#11 0xffffffff80c1995f at sys___sysctl+0x5f
#12 0xffffffff8108baac at amd64_syscall+0x10c
#13 0xffffffff8106243e at Xfast_syscall+0xfe

The function "kernel_sysctl" in the backtrace shows that this is sysctl related, and from the above we know that there might be too many variables. If I assume that "Xfast_syscall" is a XEN function, that would mean the panic came from a XEN call that wanted to do something with a sysctl variable and panicked there. This is all just my assumption.

So those were my conclusions. What do you think? How should we proceed with this? Should I try to somehow build XEN from git sources?

For what it's worth, the system was crashed with 8GB of Dom0 RAM and 16GB of swap space on a dedicated HDD, not the one ZFS is on.

xen_cmdline="dom0_mem=8192M dom0_max_vcpus=2 dom0=pvh,verbose=1 console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1 noreboot"

"top -n 1" output before the panic:

last pid: 18906;  load averages:  1.39,  2.62,  2.90  up 0+02:22:00    19:54:50
42 processes:  1 running, 41 sleeping
CPU:  3.0% user,  0.0% nice, 27.7% system,  2.3% interrupt, 67.0% idle
Mem: 872K Active, 4176K Inact, 4356K Laundry, 7656M Wired, 48M Buf, 238M Free
ARC: 5183M Total, 6320K MFU, 4752M MRU, 219M Anon, 192M Header, 12M Other
     4674M Compressed, 4746M Uncompressed, 1.02:1 Ratio
Swap: 16G Total, 40M Used, 16G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME WCPU COMMAND
 1010 root          1  23    0    13M  3044K select   1   1:34 0.98% xenstored

At this point I do not think the "vmstat -m" output before the panic is useful. What I can tell, though, is that grep'ing the vmstat output for "sysctloid" shows its memory consumption steadily growing most of the time.

The sad part about all this is that I do not believe these are the bugs that cause the panic on my pre-production machine. Maybe the sysctl variable "kern.msgbuf" overgrows there as well, but other than that, normally I do not start/stop VMs so often that it would create so many xbbd sysctl variables... Anyway, we could try to deal with this first and then see if it fixes anything.

Best wishes!