Date:      Tue, 22 Aug 2017 11:10:29 +0200
From:      "Ronald Klop" <ronald-lists@klop.ws>
To:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>, "Rick Macklem" <rmacklem@uoguelph.ca>
Subject:   Re: when has a pNFS data server failed?
Message-ID:  <op.y5c7rrupkndu52@klop.ws>
In-Reply-To: <YTXPR01MB018952E64C3026F95165B45FDD800@YTXPR01MB0189.CANPRD01.PROD.OUTLOOK.COM>


On Fri, 18 Aug 2017 23:52:12 +0200, Rick Macklem <rmacklem@uoguelph.ca>  
wrote:

> This is kind of a "big picture" question that I thought I'd throw out.
>
> As a brief background, I now have the code for running mirrored pNFS
> Data Servers working for normal operation. You can look at:
> http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
> if you are interested in details related to the pNFS server code/testing.
>
> So, now I am facing the interesting part:
> 1 - The Metadata Server (MDS) needs to decide that a mirrored DS has
>     failed at some point. Once that happens, it stops using the DS, etc.
> --> This brings me to the question of "when should the MDS decide that
>     the DS has failed and should be taken offline?".
>     - I'm not up to date w.r.t. the TCP stack, so I'm not sure how long
>       it will take for the TCP connection to decide that a DS server is
>       no longer working and fail the TCP connection. I think it takes a
>       fair amount of time, so I'm not sure if TCP connection loss is a
>       good indicator of DS server failure or not?
>     - It seems to me that the MDS should wait a fairly long time before
>       failing the DS, since this will have a major impact on the pNFS
>       server, requiring repair/resilvering by a sysadmin once it happens.
> So, any comments or thoughts on this? rick

Hi,

This is quite a common problem for all clustered/connected systems. I
think there is no general answer, and a lot of papers have been written
about it.
For example: in NFS you have the 'soft' mount option. It is recommended
not to use it. I can imagine that advice makes sense if your home-dir or
/usr is mounted over NFS, but at work I want my http-servers to not hang
and to just give an IO-error when the backend fileserver with the data is
gone.
Something similar happens here.
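
To make the trade-off concrete, here is a minimal sketch of one possible
policy: only declare a DS failed after it has been unreachable for a
configurable grace period, and forget earlier failures as soon as one
probe succeeds. The class name, the 300 second value and the probe
timeline are made up for illustration; nothing here comes from the pNFS
server code or from the protocol.

```python
class DsFailureDetector:
    """Hypothetical policy: fail a DS only after a continuous outage
    of at least `grace_seconds`."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.first_failure = None  # start time of the current outage, if any

    def record(self, ok, now):
        """Record one probe result at time `now` (seconds).
        Return True if the DS should now be treated as failed."""
        if ok:
            # Any successful probe ends the outage and resets the timer.
            self.first_failure = None
            return False
        if self.first_failure is None:
            self.first_failure = now
        return now - self.first_failure >= self.grace_seconds


# Illustrative timeline of (probe succeeded?, time in seconds).
probes = [(False, 0), (False, 120), (True, 180), (False, 200), (False, 500)]
det = DsFailureDetector(grace_seconds=300)
results = [det.record(ok, t) for ok, t in probes]
```

A short grace period behaves like a 'soft' mount (fail fast, risk an
unnecessary resilver), a long one behaves like a 'hard' mount (clients
hang longer, but a brief outage does not take the DS offline).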

Doesn't the protocol definition say something about this? Or what do other
implementations do?

Regards,
Ronald.

