Date: Wed, 11 Nov 2020 10:50:09 +0100
From: Roger Pau Monné <roger.pau@citrix.com>
To: Brian Buhrow <buhrow@nfbcal.org>
Cc: <freebsd-xen@freebsd.org>
Subject: Re: ZFS corruption using zvols as backingstore for hvm VM's
Message-ID: <20201111095009.6lcik5y3s7wrsh5k@Air-de-Roger>
In-Reply-To: <202011110913.0AB9DJr1025354@nfbcal.org>
References: <202011100516.0AA5Gp5K015697@nfbcal.org> <202011110913.0AB9DJr1025354@nfbcal.org>
On Wed, Nov 11, 2020 at 01:13:18AM -0800, Brian Buhrow wrote:
> 	hello.  Following up on my own message, I believe I've run into a
> serious problem that exists on FreeBSD-Xen with FreeBSD 12.1-p10 and
> Xen 4.14.0.  Just in case I was running into an old bug with
> yesterday's post, I updated to Xen 4.14.0 and QEMU 5.0.0.  The problem
> was still there, i.e. when writing to a second virtual hard drive on
> an HVM domU, the drive becomes corrupted.  Again, zpool scrub shows no
> errors.

Are you using volmode=dev when creating the zvol?

# zfs create -V16G -o volmode=dev zroot/foo

This is required when using a zvol with bhyve, but shouldn't be
required for Xen, since we lock the guest disks from the kernel so
GEOM cannot taste them.
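If you want to check how an existing zvol is configured, the property
can be inspected and changed after the fact; a quick sketch, reusing
the hypothetical zroot/foo dataset from the example above:

# zfs get volmode zroot/foo
# zfs set volmode=dev zroot/foo

Note that a volmode change might only take effect after the zvol is
renamed or the system rebooted (see zfsprops(8)), so it's safest to
set it at creation time.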
> 	So, I decided it might be some sort of memory error.  I wrote a
> memory test program, shown below, and ran it on my HVM domU.  It not
> only crashed the domU itself, it crashed the entire Xen server!  There
> are some dmesg messages that happened before the Xen server crash,
> shown below, which suggest a serious problem.  In my view, no matter
> how badly the HVM domU behaves, it shouldn't be able to crash the Xen
> server itself!  The domU is running NetBSD-5.2, an admittedly old
> version of the operating system, but I'm running a fleet of these
> machines, both on real hardware and on older versions of Xen, with no
> stability issues whatsoever!  And, as I say, I shouldn't be able to
> wipe out the Xen server from an HVM domU, no matter what I do!

Can you please paste the config file of the domain?  (A sketch of the
kind of file I mean is appended at the end of this message, in case
it's unclear.)

> 	The memory test program takes one argument: the amount of RAM,
> in megabytes, you want it to test.  It then allocates that memory and
> sequentially walks through it over and over again, writing to it and
> reading from it, checking to make sure the data read matches the data
> written.  This has the effect of causing the resident set size of the
> program to grow slowly over time as it works.  It was originally
> written to test the paging efficiency of a system, but I modified it
> to actually test the memory along the way.
>
> 	To reproduce the issue, perform the following steps:
>
> 1.  Set up an HVM guest; I think FreeBSD as an HVM domU will work
> fine.  Use ZFS zvols as the backing store for the virtual disk(s) of
> your guest.
>
> 2.  Compile this program for that guest and run it as follows:
>
> ./testmem 1000
>
> This asks the program to allocate 1G of memory and then walk through
> and test it.  It will report each megabyte of memory it has written
> and tested.  My test HVM guest had 4G of RAM, as it was a 32-bit OS
> running on the domU.  Nothing else was running on either the Xen
> server or the domU.  I'm not sure exactly how far the program got in
> its memory walk before things went south, but I think it touched about
> 100 megabytes of its 1000 megabyte allocation.  My program was not
> running as root, so it had no special privileges, even on the domU.
>
> 	I'm not sure if the problem is with QEMU, Xen, or some
> combination of the two.
>
> 	It would be great if someone could reproduce this issue and
> maybe shed a bit more light on what's going on.
>
> -thanks
> -Brian
>
> <error messages on xen server just before the crash!>
>
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_ring2pkt:1534): Unknown extra info type 255.  Discarding packet
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =8
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =69
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =1000
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =1
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =255
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_rxpkt2rsp:2068): Got error -1 for hypervisor gnttab_copy status
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_ring2pkt:1534): Unknown extra info type 255.  Discarding packet
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =8
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =69
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =1000
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =1
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =255
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_rxpkt2rsp:2068): Got error -1 for hypervisor gnttab_copy status

Do you have a serial line attached to the server?  If so, are those
the last messages you see before the server reboots?  I would expect
some kind of panic from the FreeBSD dom0 kernel, or from Xen itself,
before the server reboots.  (If you don't have a serial console set
up, a sketch of the relevant dom0 settings is appended below.)

Note that those error messages are actually from the PV network
backend (xnb), so I'm not sure they are related to the disk in any
way.  Are you doing anything else when this happens?

Also, since your test program didn't make it into the quote above,
I've appended a sketch of a tester matching your description, in case
it helps others reproduce.

Roger.
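For reference, a minimal sketch of the kind of xl domain config meant
above: an HVM guest with two zvol-backed disks.  All names here
(netbsd-test, zroot/vm-root, zroot/vm-data, bridge0) are hypothetical
placeholders, not taken from Brian's setup:

# hypothetical example config, not Brian's actual file
type = "hvm"
name = "netbsd-test"
memory = 4096
vcpus = 2

# two guest disks backed by zvols; the second is the kind of disk
# that showed corruption in the report above
disk = [
    'phy:/dev/zvol/zroot/vm-root,raw,hda,rw',
    'phy:/dev/zvol/zroot/vm-data,raw,hdb,rw',
]

vif = [ 'bridge=bridge0' ]
vnc = 1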
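And a minimal sketch of a tester matching the description quoted
above, NOT Brian's actual program: it allocates N MiB, then repeatedly
walks the buffer one MiB at a time, writing an address-derived pattern
and reading it back to verify.  Build with: cc -o testmem testmem.c

/*
 * testmem.c -- sketch only: allocate N MiB, then loop forever
 * writing and verifying a pattern, one MiB at a time, so the
 * resident set size grows slowly as pages are touched.
 */
#include <stdio.h>
#include <stdlib.h>

#define MEG		(1024UL * 1024UL)
#define WORDS_PER_MEG	(MEG / sizeof(unsigned long))

int
main(int argc, char **argv)
{
	unsigned long *buf, *p;
	size_t megs, i, j;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <megabytes>\n", argv[0]);
		return (1);
	}
	megs = strtoul(argv[1], NULL, 10);
	if (megs == 0 || (buf = malloc(megs * MEG)) == NULL) {
		perror("malloc");
		return (1);
	}
	for (;;) {
		for (i = 0; i < megs; i++) {
			p = buf + i * WORDS_PER_MEG;
			/* Write a pattern derived from each word's address. */
			for (j = 0; j < WORDS_PER_MEG; j++)
				p[j] = (unsigned long)&p[j] ^ 0xa5a5a5a5UL;
			/* Read it back and verify. */
			for (j = 0; j < WORDS_PER_MEG; j++) {
				if (p[j] != ((unsigned long)&p[j] ^ 0xa5a5a5a5UL)) {
					fprintf(stderr, "mismatch in MiB %zu\n", i);
					return (2);
				}
			}
			printf("tested %zu MiB\n", i + 1);
		}
	}
}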
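Finally, in case a serial console isn't already wired up: on a FreeBSD
dom0 something along these lines in /boot/loader.conf should route
both Xen's and the dom0 kernel's console output to COM1.  This is a
sketch assuming a standard COM1 at 115200; the dom0_mem and
dom0_max_vcpus values are placeholders to adjust for your machine:

# /boot/loader.conf (dom0), sketch only
xen_kernel="/boot/xen"
xen_cmdline="dom0_mem=4g dom0_max_vcpus=4 com1=115200,8n1 console=com1,vga guest_loglvl=all loglvl=all"
console="comconsole"
boot_multicons="YES"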
