From: Rick Macklem
To: Ronald Klop, "freebsd-fs@freebsd.org"
Subject: Re: when has a pNFS data server failed?
Date: Tue, 22 Aug 2017 19:51:11 +0000
Ronald Klop wrote:
>On Fri, 18 Aug 2017 23:52:12 +0200, Rick Macklem wrote:
>> This is kind of a "big picture" question that I thought I'd throw out.
>>
>> As a brief background, I now have the code for running mirrored pNFS
>> Data Servers working for normal operation. You can look at:
>> http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
>> if you are interested in details related to the pNFS server code/testing.
>>
>> So, now I am facing the interesting part:
>> 1 - The Metadata Server (MDS) needs to decide that a mirrored DS has
>>     failed at some point. Once that happens, it stops using the DS, etc.
>> --> This brings me to the question of "when should the MDS decide that
>>     the DS has failed and should be taken offline?".
>>     - I'm not up to date w.r.t. the TCP stack, so I'm not sure how long
>>       it will take for the TCP connection to decide that a DS server is
>>       no longer working and fail the TCP connection. I think it takes a
>>       fair amount of time, so I'm not sure if TCP connection loss is a
>>       good indicator of DS server failure or not?
>>     - It seems to me that the MDS should wait a fairly long time before
>>       failing the DS, since this will have a major impact on the pNFS
>>       server, requiring repair/resilvering by a sysadmin once it happens.
>> So, any comments or thoughts on this? rick
>
>This is a quite common problem for all clustered/connected systems. I
>think there is no general answer. And there are a lot of papers written
>about it.
If you have a suggestion for one good paper, I might be willing to read it.
Short answer is I'm retired after 30 years of working for a University and
I have roughly zero interest in reading academic papers.

>For example: in NFS you have the 'soft' option. It is recommended not to
>use it.
>I can imagine that if your home-dir or /usr is mounted over NFS,
>but at work I want my http-servers to not hang and just give an IO-error
>when the backend fileserver with data is gone.
>Something similar happens here.
Yes. However, the analogy only goes so far, in that a failure of a "soft"
mount affects the integrity of the file, if it is a write that fails.
In this case, there shouldn't be data corruption/loss, however there may be
degraded performance during the mirror failure and subsequent resilvering.
(A closer analogy might be a drive failure when in a mirrored configuration
with another drive. These days drive hardware does try to indicate "hardware
health", which the mirrored server may not provide, at least in the early
version.)

> Doesn't the protocol definition say something about this?
Nope, except for some "on the wire" information that the pNFS client can
provide to indicate to the MDS that it is having problems with a DS.
(The RFCs deal with what goes on the wire and not how servers get
implemented.)

> Or what do other implementations do?
I have no idea. At this point, all extant pNFS server implementations are
proprietary blobs, such as a Netapp clustered configuration. I've only seen
"high level" white papers (one notch away from marketing).

To be honest, I think the answer for version 1 will come down to...
How long should the MDS try to communicate with the DS before it gives up
and considers it failed?
It will probably be settable via a sysctl, but does need a reasonable
default value.
(A "very large" value would indicate "leave it for the sysadmin to decide
and do manually".)
I also think there might be certain error returns from sosend()/soreceive()
that may want special handling.
A simple example I experienced in recent testing was...
- One system was misconfigured with the same IP# as one of the DS systems.
  After fixing the misconfiguration, the pNFS server was wedged because it
  had a bogus arp entry so it couldn't talk to the one mirror.
--> This was easily handled by an "arp -d" done by me on the MDS, but if the
    MDS had given up on the DS before I did that, it would have been a lot
    more work to fix. (The bogus arp entry had a very long timeout on it.)

Anyhow, thanks for the comments and we'll see if others have comments, rick
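
To make the "how long before giving up" idea concrete, here is a rough
sketch of the policy only, written in Python for brevity. It is not the
kernel code and not how the pNFS server does it. The class and method names,
the NEVER_FAIL constant and the 300 second value are all made up for
illustration; no default has been chosen. The sketch assumes the MDS takes a
DS offline only after it has been continuously unreachable for the
configured time, and that an infinite setting means "never decide
automatically".

```python
import math

# Hypothetical stand-in for a "very large" sysctl value:
# never fail the DS automatically, leave it to the sysadmin.
NEVER_FAIL = math.inf


class DSFailureDetector:
    """Tracks how long the MDS has been unable to reach one DS (sketch)."""

    def __init__(self, fail_timeout_s):
        self.fail_timeout_s = fail_timeout_s
        self.first_failure_at = None  # start of the current run of failures

    def record_success(self):
        # Any successful RPC to the DS ends the current outage window.
        self.first_failure_at = None

    def record_failure(self, now):
        # Remember only when the outage began, not each retry.
        if self.first_failure_at is None:
            self.first_failure_at = now

    def should_take_offline(self, now):
        # Offline only once the DS has been unreachable for the full timeout.
        if self.first_failure_at is None:
            return False
        return now - self.first_failure_at >= self.fail_timeout_s


# Usage with explicit timestamps in seconds (300 is an arbitrary example).
ds = DSFailureDetector(fail_timeout_s=300)
ds.record_failure(now=10)
print(ds.should_take_offline(now=100))   # still inside the grace period
ds.record_success()                      # e.g. after the admin ran "arp -d"
print(ds.should_take_offline(now=1000))  # outage window was reset
```

With a policy like this, the bogus arp entry case above stays harmless as
long as the admin fixes it before the timeout runs out, since the first
successful RPC resets the outage window.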