Date: Tue, 22 Aug 2017 11:10:29 +0200
From: "Ronald Klop" <ronald-lists@klop.ws>
To: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>, "Rick Macklem" <rmacklem@uoguelph.ca>
Subject: Re: when has a pNFS data server failed?
Message-ID: <op.y5c7rrupkndu52@klop.ws>
In-Reply-To: <YTXPR01MB018952E64C3026F95165B45FDD800@YTXPR01MB0189.CANPRD01.PROD.OUTLOOK.COM>
On Fri, 18 Aug 2017 23:52:12 +0200, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> This is kind of a "big picture" question that I thought I'd throw out.
>
> As a brief background, I now have the code for running mirrored pNFS
> Data Servers working for normal operation. You can look at:
> http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
> if you are interested in details related to the pNFS server code/testing.
>
> So, now I am facing the interesting part:
> 1 - The Metadata Server (MDS) needs to decide that a mirrored DS has
>     failed at some point. Once that happens, it stops using the DS, etc.
>     --> This brings me to the question of "when should the MDS decide
>         that the DS has failed and should be taken offline?".
>     - I'm not up to date w.r.t. the TCP stack, so I'm not sure how
>       long it will take for the TCP connection to decide that a DS
>       server is no longer working and fail the TCP connection. I think
>       it takes a fair amount of time, so I'm not sure if TCP connection
>       loss is a good indicator of DS server failure or not?
>     - It seems to me that the MDS should wait a fairly long time before
>       failing the DS, since this will have a major impact on the pNFS
>       server, requiring repair/resilvering by a sysadmin once it happens.
>
> So, any comments or thoughts on this? rick

Hi,

This is a very common problem for all clustered/networked systems. I don't
think there is a general answer, and a lot of papers have been written
about it.

For example, NFS has the 'soft' mount option, and the usual recommendation
is not to use it. I can understand that recommendation if your home dir or
/usr is mounted over NFS, but at work I want my HTTP servers not to hang and
just return an I/O error when the backend file server holding the data is
gone. Something similar applies here.

Doesn't the protocol specification say something about this? Or what do
other implementations do?

Regards,
Ronald.
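The "wait a fairly long time before failing the DS" policy discussed above can be sketched as a grace-period failure detector. This is a minimal illustration, not the FreeBSD pNFS code: it assumes the MDS can observe each RPC to a DS as succeeding or failing, and the names `DataServerHealth`, `record_failure`, `record_success` and the 300-second grace period are hypothetical. A single dropped TCP connection does not fail the DS; only an outage that lasts at least the grace period does, which is roughly the trade-off a 'soft' mount timeout makes on the client side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataServerHealth:
    """Hypothetical per-DS state an MDS might keep (illustration only)."""
    grace_period: float                    # seconds a DS may stay unreachable
    first_failure: Optional[float] = None  # start of the current outage, if any
    failed: bool = False                   # True once the DS is taken offline

    def record_success(self) -> None:
        # A successful RPC ends the current outage, so a short blip is forgiven.
        # A DS that has already been failed stays failed: bringing it back
        # would need repair/resilvering, which is not modelled here.
        if not self.failed:
            self.first_failure = None

    def record_failure(self, now: float) -> bool:
        """Record a failed RPC/connection at time `now`; return True if failed."""
        if self.failed:
            return True
        if self.first_failure is None:
            self.first_failure = now  # outage starts with this failure
        if now - self.first_failure >= self.grace_period:
            self.failed = True        # outage outlasted the grace period
        return self.failed
```

For example, with `grace_period=300.0`, failures at t=0 and t=120 followed by a success at t=130 leave the DS in service; a new outage starting at t=200 only fails the DS once a failure is recorded at t=500 or later. Choosing the grace period is exactly the open question in the thread: too short and a transient network problem forces a resilver, too long and clients stall on a dead DS.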
