From: Rick Macklem
To: "freebsd-fs@freebsd.org"
Subject: when has a pNFS data server failed?
Date: Fri, 18 Aug 2017 21:52:12 +0000

This is kind of a "big picture" question that I thought I'd throw out.

As a brief background, I now have the code for running mirrored pNFS Data Servers working for normal operation. You can look at:
  http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
if you are interested in details related to the pNFS server code/testing.

So, now I am facing the interesting part:
1 - The Metadata Server (MDS) needs to decide that a mirrored DS has failed at some point. Once that happens, it stops using the DS, etc.
--> This brings me to the question of "when should the MDS decide that the DS has failed and should be taken offline?".
    - I'm not up to date w.r.t. the TCP stack, so I'm not sure how long it will take for the TCP connection to decide that a DS server is no longer working and fail the TCP connection. I think it takes a fair amount of time, so I'm not sure if TCP connection loss is a good indicator of DS server failure or not?
    - It seems to me that the MDS should wait a fairly long time before failing the DS, since this will have a major impact on the pNFS server, requiring repair/resilvering by a sysadmin once it happens.

So, any comments or thoughts on this?

rick
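
To make the "wait a fairly long time before failing the DS" idea concrete, here is a minimal policy sketch in Python. It is not the FreeBSD pNFS server code; the class name DSFailureDetector, the grace_period parameter, and the 300 second value in the usage note are all assumptions chosen for illustration. The sketch marks a DS as failed only after RPCs to it have failed continuously for the whole grace period, and keeps that decision sticky, since recovery is assumed to need a manual repair/resilver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DSFailureDetector:
    """Hypothetical policy sketch; not part of the FreeBSD pNFS server."""

    grace_period: float                    # seconds of continuous failure tolerated (assumed tunable)
    first_failure: Optional[float] = None  # start time of the current failure streak
    offline: bool = False                  # sticky once set: repair is assumed to be manual

    def record_success(self, now: float) -> None:
        # Any successful RPC to the DS ends the current failure streak,
        # unless the DS has already been taken offline.
        if not self.offline:
            self.first_failure = None

    def record_failure(self, now: float) -> bool:
        """Note a failed or timed-out RPC; return True if the DS is now considered failed."""
        if self.offline:
            return True
        if self.first_failure is None:
            self.first_failure = now
        # Declare failure only when the streak has lasted the whole grace period.
        if now - self.first_failure >= self.grace_period:
            self.offline = True
        return self.offline
```

With grace_period=300.0, a failure at t=0 followed by a success at t=10 resets the streak, and a new streak starting at t=100 only takes the DS offline once a failure is seen at or after t=400. A design like this decouples the decision from how long the TCP stack takes to drop the connection, at the cost of having to pick the grace period by hand.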