From owner-freebsd-xen@freebsd.org Wed Nov 11 09:50:20 2020
Date: Wed, 11 Nov 2020 10:50:09 +0100
From: Roger Pau Monné
To: Brian Buhrow
Subject: Re: ZFS corruption using zvols as backingstore for hvm VM's
Message-ID: <20201111095009.6lcik5y3s7wrsh5k@Air-de-Roger>
References: <202011100516.0AA5Gp5K015697@nfbcal.org> <202011110913.0AB9DJr1025354@nfbcal.org>
In-Reply-To: <202011110913.0AB9DJr1025354@nfbcal.org>
List-Id: Discussion of the freebsd port to xen - implementation and usage

On Wed, Nov 11, 2020 at 01:13:18AM -0800, Brian Buhrow wrote:
> Hello. Following up on my own message, I believe I've run into a
> serious problem that exists on FreeBSD-xen with FreeBSD-12.1P10 and
> Xen-4.14.0. Just in case I was running into an old bug with yesterday's
> post, I updated to xen-4.14.0 and Qemu-5.0.0. The problem was still there,
> i.e. when writing to a second virtual hard drive on an hvm domu, the drive
> becomes corrupted. Again, zpool scrub shows no errors.

Are you using volmode=dev when creating the zvol?

# zfs create -V16G -o volmode=dev zroot/foo

This is required when using zvols with bhyve, but shouldn't be required
for Xen, since the guest disks are locked from the kernel so GEOM cannot
taste them.

> So, I decided it might be some sort of memory error. I wrote a memory
> test program, shown below, and ran it on my hvm domu. It not only
> crashed the domu itself, it crashed the entire xen server! There are some
> dmesg messages that happened before the xen server crash, shown below, which
> suggest a serious problem. In my view, no matter how badly the domu hvm
> host behaves, it shouldn't be able to crash the xen server itself! The
> domu is running NetBSD-5.2, an admittedly old version of the operating
> system, but I'm running a fleet of these machines, both on real hardware
> and on older versions of xen with no stability issues whatsoever! And, as
> I say, I shouldn't be able to wipe out the xen server from an hvm domu, no
> matter what I do!

Can you please paste the config file of the domain?

>
> The memory test program takes one argument, the amount of RAM, in
> megabytes, you want it to test. It then allocates that memory, and
> sequentially walks through that memory over and over again, writing to it
> and reading from it, checking to make sure the data read matches the data
> written. This has the effect of causing the resident set size of the
> program to grow slowly over time, as it works. It was originally written
> to test the paging efficiency of a system, but I modified it to actually
> test the memory along the way.
> To reproduce the issue, perform the following steps:
>
> 1. Set up an hvm host, I think FreeBSD as a domu hvm host will work fine.
> Use zfs zvols as the backingstore for the virtual disk(s) for your host.
>
> 2. Compile this program for that host and run it as follows:
> ./testmem 1000
> This should ask the program to allocate 1G of memory and then walk through
> and test it. It will report each megabyte of memory it's written and
> tested. My test hvm had 4G of RAM as it was a 32-bit OS running on the
> domu. Nothing else was running on either the xen server or the domu host.
> I'm not sure exactly how far the program got in its memory walk before
> things went south, but I think it touched about 100 megabytes of its 1000
> megabyte allocation.
> My program was not running as root, so it had no special privileges, even
> on the domu host.
>
> I'm not sure if the problem is with qemu, xen, or some combination of
> the two.
>
> It would be great if someone could reproduce this issue and maybe shed
> a bit more light on what's going on.
>
> -thanks
> -Brian
>
>
>
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_ring2pkt:1534): Unknown extra info type 255. Discarding packet
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =8
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =69
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =1000
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =1
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =255
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_rxpkt2rsp:2068): Got error -1 for hypervisor gnttab_copy status
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_ring2pkt:1534): Unknown extra info type 255. Discarding packet
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =8
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =69
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =1000
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =1
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref =255
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size =0
> Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_rxpkt2rsp:2068): Got error -1 for hypervisor gnttab_copy status

Do you have a serial line attached to the server, and if so are those
the last messages that you see before the server reboots? I would
expect some kind of panic from the FreeBSD dom0 kernel or Xen itself
before the server reboots.

Those error messages are actually from the PV network controller, so
I'm not sure they are related to the disk in any way. Are you doing
anything else when this happens?

Roger.
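
The memory test program referred to in the quoted report is not included
in this message. For anyone trying to reproduce the problem, the
following is only a rough Python sketch of the behaviour described there
(allocate N megabytes, walk them sequentially, write a pattern, read it
back, report each megabyte). It is not the original program: the names
walk_memory and pattern_for, the byte pattern, and the passes parameter
are assumptions made for illustration, and the original presumably was a
compiled program that looped until interrupted.

```python
MIB = 1024 * 1024


def pattern_for(index, pass_no):
    """Byte value written to megabyte `index` on pass `pass_no`.

    The formula is an arbitrary illustrative choice, not taken from the
    original program.
    """
    return (index * 31 + pass_no * 17 + 1) % 256


def walk_memory(megabytes, passes=1, report=None):
    """Write and verify `megabytes` MiB, one MiB at a time.

    Returns the number of megabyte-sized chunks whose read-back data did
    not match what was written. `report`, if given, is called as
    report(pass_no, index) after each megabyte has been tested.
    """
    # Zero-filled buffer; the resident set typically grows as chunks are
    # written, which mirrors the slow RSS growth described in the report.
    buf = bytearray(megabytes * MIB)
    mismatches = 0
    for pass_no in range(passes):
        for index in range(megabytes):
            start = index * MIB
            expected = bytes([pattern_for(index, pass_no)]) * MIB
            buf[start:start + MIB] = expected       # write phase
            if buf[start:start + MIB] != expected:  # read-back check
                mismatches += 1
            if report is not None:
                report(pass_no, index)
    return mismatches
```

Something like walk_memory(1000, passes=10, report=print) would roughly
correspond to the "./testmem 1000" invocation in the report, printing the
pass number and megabyte index as each megabyte is tested. A non-zero
return value would mean that data read back differed from data written.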