From: Rick Macklem
To: Karli Sjöberg
CC: Ronald Klop, "freebsd-fs@freebsd.org"
Subject: Re: when has a pNFS data server failed?
Date: Wed, 23 Aug 2017 12:36:04 +0000

Karli Sjöberg wrote:
[stuff snipped for brevity]
>>Rick Macklem wrote:
>>To be honest, I think the answer for version 1 will come down to...
>>
>>How long should the MDS try to communicate with the DS before it gives up and
>>considers it failed?
>>
>>It will probably be settable via a sysctl, but does need a reasonable default value.
>>(A "very large" value would indicate "leave it for the sysadmin to decide and do
>>manually".)
[more stuff snipped]
>This is what one prominent "customer" says about timeout:
>https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465
>"These issues occur when the guest operating system timeout values are exceeded for
>attached storage disks. This may be caused by an underlying storage problem or due to
>brief transient pauses during normal operations (such as path failover). To accommodate
>transient events, the VMware Tools increases the SCSI disk timeout to 60 seconds for
>Virtual Infrastructure 3 and 180 seconds for vSphere 4 and higher."
>
>Which means that you have a minute before the "customers" start complaining:)

Thanks. I was thinking that a minute or two is about what the default might
want to be.

It may need to be longer than that since, as one example, a DS needs to be
able to reboot and start servicing RPCs before this timeout expires.
(Fortunately, a DS does not need to wait for the "grace period" that an
NFSv4/MDS server observes after boot, since that time is for clients to recover
locks, and the locks are handled by the MDS and not the DSs.)

Thanks for the comment, rick
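
For illustration only, here is a minimal sketch of the timeout rule discussed above: the MDS would consider a DS failed only after it has been continuously unreachable for a configurable period. It is not taken from the FreeBSD pNFS code. The class name, method names, and the 120-second default are all assumptions made for this example, with the default standing in for "a minute or two".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DSFailureDetector:
    """Hypothetical helper: tracks how long a DS has been unreachable."""

    # Assumed default of "a minute or two"; float("inf") would model
    # "leave it for the sysadmin to decide and do manually".
    timeout_s: float = 120.0
    first_failure: Optional[float] = None

    def record_success(self) -> None:
        # Any successful RPC (e.g. after a DS reboot) resets the clock.
        self.first_failure = None

    def record_failure(self, now: float) -> bool:
        # Returns True once the DS has been failing for timeout_s seconds
        # without an intervening success.
        if self.first_failure is None:
            self.first_failure = now
        return now - self.first_failure >= self.timeout_s
```

With this shape, a DS that reboots and answers an RPC inside the window is never declared failed, which is why the default would need to exceed a typical DS reboot time.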