From owner-freebsd-fs@freebsd.org Sun Dec 13 17:09:00 2020 Return-Path: Delivered-To: freebsd-fs@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 577914BE3FD for ; Sun, 13 Dec 2020 17:09:00 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from CAN01-TO1-obe.outbound.protection.outlook.com (mail-to1can01on060e.outbound.protection.outlook.com [IPv6:2a01:111:f400:fe5d::60e]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "mail.protection.outlook.com", Issuer "GlobalSign Organization Validation CA - SHA256 - G3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4Cv9wW2QcNz4f7V for ; Sun, 13 Dec 2020 17:08:58 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=hMdvfi2ZYDLLhVMDu0cWx7Tm0Sizv+i68lH0nQPJqcXThDuUY3acyH1jHpjboENndkE3u14bf8JnESTHP3CGiYkJBb3bedBJtQvfecUbXnygC3Nh7P6Lotrqor+ZAvlNM3xGPY4EcQq2P4r5ME6pb2CBe+5+hcQnzQtyJQclfD+6BHQmFPzSlY0yhPhdXllvot1Ed9AmWy6cV2NJADgNFJiB5lckbCFP1pLucDZEiVLlkxfeMz9KFdMs0yVF61vH6Lu8Ld+gRBBh5i7hodOsS3CFxH/ophVYBWRTUgyuqeUQmMNp6oEnMxKikiqRiLUUiyU8XlrYwyhTFCWizrrQsw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=2DDeMw3S37NQzdOUGYpRnF+FdmbmzYCjv59MDYKGXmo=; b=HKBDDdSNe5dDQStyu1kyvKdTSCvWst9Ilo5GVVpFdoz+W40i3oDgOJkkAPb3EGe8kTv7i2+5bcAXvwSayOBvwTF8+OzItBh5UFOihQ/yCcq6k6d1pl/gRpvOk9FTFEq0K7AXd0k7GNYjU0VPmAgm07vwokUkiFLc1i5mJ6gmaD4U24prYVx7/45IGkhgmAXFzaoqe6059Vf8EdACsCWzhjOqvhmR5suT2tQVBR8NjgAUsAs+TD4M012I5bhquwq6BONzXVZmlIUoVQYOprhv8Z5YzbszaWsvtwS2ap1YC0Thc5OKqumyEuyhYDDPVlJH5YEYIlfiIuxl3+qmtXavaA== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=uoguelph.ca; dmarc=pass action=none header.from=uoguelph.ca; dkim=pass header.d=uoguelph.ca; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=uoguelph.ca; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=2DDeMw3S37NQzdOUGYpRnF+FdmbmzYCjv59MDYKGXmo=; b=O1ukXopONjZEZGT562CFYMu2UoG64HS1OhYlHsoj01UN2b4LcdBGo97E19w/j2MMsO96Zg33671XN67mSijcuZpBV9x/31xITiJE/yuOsvKEaz+dRxabIyP+vxuDjsGtZC96mZbeN37qmw5D720fj+uyHZhjr+1e2cjBMj//tKyK5DjRlRU4P3WUfXlBBzNktMH8x1xgvHLw5KxuMZL1b+Th6a1ZJfSdAghaSBAvcNrnv38RELE58YXIw5bZia0At8RLx4630O05/Ze7BUgVECdyFfNIvGivk+19eKOwotNf4yMAfsrgbfSCeczpE+on5r0rkdoxh/GCfKJmg2VnhQ== Received: from YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM (2603:10b6:c00:19::29) by QB1PR01MB2689.CANPRD01.PROD.OUTLOOK.COM (2603:10b6:c00:33::12) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.3654.15; Sun, 13 Dec 2020 17:08:55 +0000 Received: from YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM ([fe80::7d6b:aa68:78f4:5d94]) by YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM ([fe80::7d6b:aa68:78f4:5d94%7]) with mapi id 15.20.3654.024; Sun, 13 Dec 2020 17:08:48 +0000 From: Rick Macklem To: J David , Konstantin Belousov CC: "freebsd-fs@freebsd.org" Subject: Re: Major issues with nfsv4 Thread-Topic: Major issues with nfsv4 Thread-Index: AQHWzw/HDat+dHoH9kKG5K3Xpd53kqnxDteQgAFi0QCAABTa84AALLCAgAAVvcmAAAu0AIAA4wiAgAAI5gCAAO/IAIAAiDrm Date: Sun, 13 Dec 2020 17:08:48 +0000 Message-ID: References: , In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: f6df9b2a-5a2b-461e-edf6-08d89f89c206 x-ms-traffictypediagnostic: QB1PR01MB2689: x-microsoft-antispam-prvs: x-ms-oob-tlc-oobclassifiers: OLM:10000; x-ms-exchange-senderadcheck: 1 x-microsoft-antispam: BCL:0; x-microsoft-antispam-message-info: bd28ToaQmlU/Zp6nXt4dw0+hAK793SKXkZohF55MYE5mAAOtVPcjYb/rQzBg2Hm9TLRHZ934OTcQJzE9Jdah+xSE1EylMtAiaF1FruqzmQQ3vucSZRw98uRf0BvJhIvkyqPDXHvB15yDQ8uSzqml7nVEneM4ML10Pyl1k+tHinmNqDkYZmbzJuK4bdhBgJIH/MPBnvDKWqer3QUKt22oexW8YsXvAXcsUS7fwzXkMok8kVqsryfVNmkuqKh1zIv9Ip9s+/sweaMrr/6ybW5v8ypY8uiDLw3Y6a8UppE2ODmB/XBBZUF3gh1idmmF09z3BJhaiVHaQP/KiMNSvHeS9w== x-forefront-antispam-report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM; PTR:; CAT:NONE; SFS:(376002)(366004)(346002)(136003)(39850400004)(396003)(8936002)(86362001)(66476007)(66946007)(66556008)(186003)(2906002)(64756008)(91956017)(316002)(786003)(4326008)(83380400001)(6506007)(52536014)(5660300002)(55016002)(7696005)(478600001)(66446008)(76116006)(9686003)(8676002)(33656002)(110136005)(71200400001); DIR:OUT; SFP:1101; x-ms-exchange-antispam-messagedata: =?iso-8859-1?Q?NqJHeuguBIqHl7qpdfnu9rvFj4e02rUN/CUtgwuwaj3RBwmXKD570VyQ2G?= =?iso-8859-1?Q?8WynsIYAICkcdh8W+0dpDxh5fR0hskxO089xDmoNTl70QNSXuNZwrxsKwr?= =?iso-8859-1?Q?TpfX5MR3pprHJK6Y9Yn2a5eIz6+gNWz9z3KhWM8uh1b8yqh4YcMAiTqu9I?= =?iso-8859-1?Q?GUM6w8e0BzpgXhdtn8tj+BOwWheMb1IS2NHaxCZ/ZhaJ1+jDun8tPGZ90z?= =?iso-8859-1?Q?24+ym27pIz67nN8Y/Th89KoDuNLnlawbuR2+fd1dPpUFrqJOfSaWAF3LUO?= =?iso-8859-1?Q?hJ2Eu3bfSmVJwuD1t14Asi+ou9ocEEGBuVBgG206wECNM2d4fS0xBmkYF2?= =?iso-8859-1?Q?VZjxmsCSa+FbxGcrPWetsuAoZS+mPuY9vfGlZiTF7k6KPGT8QqICFGYGL+?= =?iso-8859-1?Q?jxcP3R0LWGupp3ddIPmLv9fhyiXNVroTEpuHQes5NjWjAZM/gUEogueSiN?= =?iso-8859-1?Q?r2m5HFSj/A+ycOG71JGc08DPXBSUPRq+aY4kqzMLI/19U7ZvqKXunPmSTN?= =?iso-8859-1?Q?w2Z4roBP0JCVRX+B+K2MT0wvjqTtClRR7JJQatdXyg3sNdClSdHBPCE/5P?= =?iso-8859-1?Q?4S7SHFS3ZpXJMtfYZjIX1MI7y6SRMuLZhAnzjPj7+CYaU0vzMmalzZUbwy?= =?iso-8859-1?Q?iJGELDt3GmnT39kouu5TTYBrbAKWfejGezf82mDZxmgLRALgyDNUprLmAf?= =?iso-8859-1?Q?v9CogLKtysNR0LuTlQqJZzgkWWwHKlctXx42J8VqF2lu1mEZEZas4L8ksE?= =?iso-8859-1?Q?o9m1jrPfuxB+eau9fRUanOc/zTxjgyaABRgfdHxnc1yLOBx+4yedJAB/Mr?= =?iso-8859-1?Q?ALMDDdI3BOQ6RtPVal5ylS7w1+EXWYJQ12lvuZItyLQBCPtdc7VAv5QzLH?= =?iso-8859-1?Q?NVLYIzDJ7zs0tU1lQgY8LtrKfJbuHAvEvQ6QCnUoUDMZt/05suh49L0whP?= =?iso-8859-1?Q?z3CtBXjP5Aolj3MTQRcKUAhVC70jiVcQ6i4vquBS/glBDzTGenMzLOTLGo?= =?iso-8859-1?Q?xiHEjr/TYPo/iapis+G+/vH9VvgMWEQ9zg6RQRmLgd86HJRbkQDI7I01de?= =?iso-8859-1?Q?F+ZzgDJwaJbhv1KdZNIKODM=3D?= x-ms-exchange-transport-forked: True Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: uoguelph.ca X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-AuthSource: YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM X-MS-Exchange-CrossTenant-Network-Message-Id: f6df9b2a-5a2b-461e-edf6-08d89f89c206 X-MS-Exchange-CrossTenant-originalarrivaltime: 13 Dec 2020 17:08:48.6835 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: be62a12b-2cad-49a1-a5fa-85f4f3156a7d X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-CrossTenant-userprincipalname: J6fEcc8wowZOaVeYG/ttW0Om7rQpCwg9An0EyVspW0AQOZqzbTipL7AqS6qzRXq8u116CYpx/yrV98u4pb9aPw== X-MS-Exchange-Transport-CrossTenantHeadersStamped: QB1PR01MB2689 X-Rspamd-Queue-Id: 4Cv9wW2QcNz4f7V X-Spamd-Bar: ------ Authentication-Results: mx1.freebsd.org; dkim=pass header.d=uoguelph.ca header.s=selector1 header.b=O1ukXopO; arc=pass (microsoft.com:s=arcselector9901:i=1); dmarc=pass (policy=none) header.from=uoguelph.ca; spf=pass (mx1.freebsd.org: domain of rmacklem@uoguelph.ca designates 2a01:111:f400:fe5d::60e as permitted sender) smtp.mailfrom=rmacklem@uoguelph.ca X-Spamd-Result: default: False [-6.00 / 15.00]; TO_DN_EQ_ADDR_SOME(0.00)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip6:2a01:111:f400::/48]; RCVD_COUNT_THREE(0.00)[3]; DKIM_TRACE(0.00)[uoguelph.ca:+]; DMARC_POLICY_ALLOW(-0.50)[uoguelph.ca,none]; NEURAL_HAM_SHORT(-1.00)[-1.000]; FREEMAIL_TO(0.00)[gmail.com]; FROM_EQ_ENVFROM(0.00)[]; RCVD_TLS_LAST(0.00)[]; RBL_DBL_DONT_QUERY_IPS(0.00)[2a01:111:f400:fe5d::60e:from]; ARC_ALLOW(-1.00)[microsoft.com:s=arcselector9901:i=1]; MIME_TRACE(0.00)[0:+]; ASN(0.00)[asn:8075, ipnet:2a01:111:f000::/36, country:US]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; R_DKIM_ALLOW(-0.20)[uoguelph.ca:s=selector1]; FREEFALL_USER(0.00)[rmacklem]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; NEURAL_HAM_LONG(-1.00)[-1.000]; TAGGED_RCPT(0.00)[]; MIME_GOOD(-0.10)[text/plain]; SPAMHAUS_ZRD(0.00)[2a01:111:f400:fe5d::60e:from:127.0.2.255]; DWL_DNSWL_LOW(-1.00)[uoguelph.ca:dkim]; TO_MATCH_ENVRCPT_SOME(0.00)[]; MAILMAN_DEST(0.00)[freebsd-fs] X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 13 Dec 2020 17:09:00 -0000 J David wrote:=0A= [stuff snipped]=0A= >The most common trace (91 times, or over 1/4 of all observed) is:=0A= >=0A= >__mtx_lock_sleep+0xf8 nfscl_nodeleg+0x207 nfs_lookup+0x314=0A= >VOP_LOOKUP_APV+0x75 null_lookup+0x98 VOP_LOOKUP_APV+0x75 >lookup+0x451=0A= >namei+0x414 kern_statat+0x72 freebsd11_stat+0x30 amd64_syscall+0x387=0A= >fast_syscall_common+0xf8=0A= This is just waiting on the mutex that protects the open/lock/delegation=0A= state.=0A= =0A= Now, is this just caused by heavy load and 130000 opens, or is there=0A= something in nullfs that results in more state handling than needed?=0A= [more stuff snipped]=0A= >It happens these were all Python-based jobs. Python seems predisposed=0A= >to trigger this, but non-Python jobs trigger it as well. Heavy use of=0A= >stat() does seem to be a common element regardless of job type.=0A= >=0A= >Here's the output of "nfsstat -E -c" 60 seconds after running it with -z:= =0A= >=0A= >Client Info:=0A= >RPC Counts:=0A= > Getattr Setattr Lookup Readlink Read W= rite=0A= > 1667 111 6376 0 42 = 153=0A= > Create Remove Rename Link Symlink M= kdir=0A= > 111 7 14 0 0 = 0=0A= > Rmdir Readdir RdirPlus Access Mknod Fs= stat=0A= > 0 0 0 2620 0 = 160=0A= > FSinfo pathConf Commit SetClId SetClIdCf = Lock=0A= > 0 0 113 0 0 = 48=0A= > LockT LockU Open OpenCfr=0A= > 0 48 320 0=0A= > OpenDownGr Close=0A= > 0 402=0A= --> So, during this time interval, there were 320 opens done and 402 closes= .=0A= > RelLckOwn FreeStateID PutRootFH DelegRet GetAcl Se= tAcl=0A= > 0 3 0 0 0 = 0=0A= > ExchangeId CreateSess DestroySess DestroyClId LayoutGet GetDev= Info=0A= > 0 0 0 0 0 = 0=0A= > LayoutCommit LayoutReturn ReclaimCompl ReadDataS WriteDataS >Commit= DataS=0A= > 0 0 0 0 0 = 0=0A= > OpenLayout CreateLayout=0A= > 0 0=0A= > OpenOwner Opens LockOwner Locks Delegs Loca= lOwn=0A= > 21175 130439 30 6 0 = 0=0A= So it has accumulated 130439 opens over 21175 different processes.=0A= (An openowner represents a process on the client.)=0A= Are there typically 20000+ processes/jobs running concurrently on this setu= p?=0A= =0A= That implies that any time a file is opened (and any time a vnode v_usecoun= t=0A= drops to 0) a linear traversal of a 130000 element linked list is done whil= e=0A= holding the mutex that the above common procstat entry is waiting on.=0A= (The nfscl_nodeleg() call to check for a delegation happens whenever lookup= =0A= has a name cache hit.)=0A= -->Also, every VOP_GETATTR() called from stat() will need to acquire the=0A= state mutex for a short period of time.=0A= =0A= -->So I am not surprised that the mount gets "constipated".=0A= To be honest, the NFSv4 state handling in the client was never designe= d=0A= to handle opens at this scale.=0A= --> Unlike the NFS server code, which uses hash tables of linked lists= , the client=0A= only uses a single linked list for all open structures.=0A= I have thought of changing the NFS client code to use hash tabl= es of lists=0A= and may work on doing so during the coming months, but who know= s=0A= when such a patch might be ready.=0A= So long as the lists are short, holding the mutex during list t= raversal=0A= seems to work ok, even on a heavily loaded NFS server, so doing= so=0A= should allow the client code to scale to larger numbers of open= =0A= structures.=0A= =0A= What you are seeing may just be contention for the client state mutex=0A= (called NFSLOCKCLSTATE() in the code), caused by repeated traversals=0A= of the 130000 element linked list.=0A= =0A= The other question is...=0A= Is having 130000 opens normal or is nullfs somehow delaying the=0A= VOP_INACTIVE() call on the underlying NFS vnode enough to cause=0A= these opens to accumulate?=0A= (Do to how these Windows oplock like opens are defined, they cannot=0A= be closed/discarded by the NFS client until VOP_INACTIVE() is called on=0A= the open file's vnode.)=0A= =0A= Counting how many processes are in __mtx_lock_sleep in the procstat=0A= output should give us a better idea if contention on the NFSv4 client=0A= state mutex is causing the problem.=0A= =0A= rick=0A= =0A= > LocalOpen LocalLown LocalLock=0A= > 0 0 0=0A= Rpc Info:=0A= TimedOut Invalid X Replies Retries Requests=0A= 0 0 0 0 12247=0A= Cache Info:=0A= Attr Hits Attr Misses Lkup Hits Lkup Misses=0A= 1110054 858 1002829 6361=0A= BioR Hits BioR Misses BioW Hits BioW Misses=0A= 2000 54 292 153=0A= BioRL Hits BioRL Misses BioD Hits BioD Misses=0A= 6911 0 208 0=0A= DirE Hits DirE Misses=0A= 104 0=0A= =0A= This does reflect the whole machine, not just those four processes.=0A= =0A= Finally, when I attempted to kill those four processes with ktrace=0A= running on all of them, the system panicked:=0A= =0A= Fatal trap 12: page fault while in kernel mode=0A= cpuid =3D 2; apic id =3D 02=0A= fault virtual address =3D 0x10=0A= fault code =3D supervisor read data, page not present=0A= instruction pointer =3D 0x20:0xffffffff8272d28f=0A= stack pointer =3D 0x28:0xfffffe008a5f24c0=0A= frame pointer =3D 0x28:0xfffffe008a5f25f0=0A= code segment =3D base 0x0, limit 0xfffff, type 0x1b=0A= =3D DPL 0, pres 1, long 1, def32 0, gran 1=0A= processor eflags =3D interrupt enabled, resume, IOPL =3D 0=0A= current process =3D 5627 (node)=0A= trap number =3D 12=0A= panic: page fault=0A= cpuid =3D 2=0A= time =3D 1607845622=0A= KDB: stack backtrace:=0A= db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008a5f2= 180=0A= vpanic() at vpanic+0x17b/frame 0xfffffe008a5f21d0=0A= panic() at panic+0x43/frame 0xfffffe008a5f2230=0A= trap_fatal() at trap_fatal+0x391/frame 0xfffffe008a5f2290=0A= trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008a5f22e0=0A= trap() at trap+0x286/frame 0xfffffe008a5f23f0=0A= calltrap() at calltrap+0x8/frame 0xfffffe008a5f23f0=0A= --- trap 0xc, rip =3D 0xffffffff8272d28f, rsp =3D 0xfffffe008a5f24c0, rbp= =0A= =3D 0xfffffe008a5f25f0 ---=0A= null_bypass() at null_bypass+0xaf/frame 0xfffffe008a5f25f0=0A= VOP_ADVLOCK_APV() at VOP_ADVLOCK_APV+0x80/frame 0xfffffe008a5f2620=0A= closef() at closef+0x8f/frame 0xfffffe008a5f26b0=0A= fdescfree_fds() at fdescfree_fds+0x3c/frame 0xfffffe008a5f2700=0A= fdescfree() at fdescfree+0x466/frame 0xfffffe008a5f27c0=0A= exit1() at exit1+0x488/frame 0xfffffe008a5f2820=0A= sigexit() at sigexit+0x159/frame 0xfffffe008a5f2b00=0A= postsig() at postsig+0x2fd/frame 0xfffffe008a5f2bc0=0A= ast() at ast+0x317/frame 0xfffffe008a5f2bf0=0A= doreti_ast() at doreti_ast+0x1f/frame 0x7fffffff67a0=0A= =0A= The fsck ate the ktrace.out file on reboot. This system does not have=0A= kernel debug symbols. If there's a way to try to figure out what=0A= null_bypass+0xaf corresponds to using the machine that built this=0A= kernel, which has the kernel, objects, debug symbols, and source,=0A= please let me know. It's built from releng/12.2 r368515.=0A= =0A= Thanks!=0A=