Date:      Tue, 22 Aug 2017 19:51:11 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Ronald Klop <ronald-lists@klop.ws>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: when has a pNFS data server failed?
Message-ID:  <YTXPR01MB0189D2D15AF6AA25FCF7E08FDD840@YTXPR01MB0189.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <op.y5c7rrupkndu52@klop.ws>
References:  <YTXPR01MB018952E64C3026F95165B45FDD800@YTXPR01MB0189.CANPRD01.PROD.OUTLOOK.COM>, <op.y5c7rrupkndu52@klop.ws>

Ronald Klop wrote:
>On Fri, 18 Aug 2017 23:52:12 +0200, Rick Macklem <rmacklem@uoguelph.ca>
>wrote:
>> This is kind of a "big picture" question that I thought I'd throw out.
>>
>> As a brief background, I now have the code for running mirrored pNFS
>> Data Servers
>> working for normal operation. You can look at:
>> http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
>> if you are interested in details related to the pNFS server code/testing.
>>
>> So, now I am facing the interesting part:
>> 1 - The Metadata Server (MDS) needs to decide that a mirrored DS has
>>     failed at some point. Once that happens, it stops using the DS, etc.
>> --> This brings me to the question of "when should the MDS decide that
>>     the DS has failed and should be taken offline?".
>>     - I'm not up to date w.r.t. the TCP stack, so I'm not sure how long
>>       it will take for the TCP connection to decide that a DS server is
>>       no longer working and fail the TCP connection. I think it takes a
>>       fair amount of time, so I'm not sure if TCP connection loss is a
>>       good indicator of DS server failure or not?
>>     - It seems to me that the MDS should wait a fairly long time before
>>       failing the DS, since this will have a major impact on the pNFS
>>       server, requiring repair/resilvering by a sysadmin once it happens.
>> So, any comments or thoughts on this? rick
>
>This is a quite common problem for all clustered/connected systems. I
>think there is no general answer. And there are a lot of papers written
>about it.
If you have a suggestion for one good paper, I might be willing to read it.
The short answer is that I'm retired after 30 years of working for a
university and have roughly zero interest in reading academic papers.

>For example: in NFS you have the 'soft' option. It is recommended not to
>use it. I can imagine not wanting it if your home-dir or /usr is mounted
>over NFS, but at work I want my http-servers to not hang and just give an
>I/O error when the backend fileserver holding the data is gone.
>Something similar happens here.
Yes. However, the analogy only goes so far, in that a failure of a "soft"
mount affects the integrity of the file if it is a write that fails.
In this case there shouldn't be data corruption/loss, although there may be
degraded performance during the mirror failure and subsequent resilvering.
(A closer analogy might be a drive failure in a configuration mirrored with
 another drive. These days drive hardware does try to indicate "hardware
 health", which the mirrored server may not provide, at least in the early
 version.)

> Doesn't the protocol definition say something about this?
Nope, except for some "on the wire" information that the pNFS client can
provide to indicate to the MDS that it is having problems with a DS.
(The RFCs deal with what goes on the wire and not with how servers get
 implemented.)

> Or what do other implementations do?
I have no idea. At this point, all extant pNFS server implementations are
proprietary blobs, such as a Netapp clustered configuration. I've only seen
"high level" white papers (one notch away from marketing).

To be honest, I think the answer for version 1 will come down to...

How long should the MDS try to communicate with the DS before it gives up
and considers it failed?

It will probably be settable via a sysctl, but does need a reasonable
default value. (A "very large" value would indicate "leave it for the
sysadmin to decide and do manually".)
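
To make that concrete, the tunable would presumably look something like the
sketch below. The vfs.nfsd.pnfsd_ds_failtimeout name, the variable and the
10 minute default are placeholders for illustration, not what the code does:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

SYSCTL_DECL(_vfs_nfsd);         /* assumes the existing vfs.nfsd sysctl node */

/* Hypothetical tunable: seconds before the MDS declares a DS failed. */
static int pnfsd_ds_failtimeout = 600;
SYSCTL_INT(_vfs_nfsd, OID_AUTO, pnfsd_ds_failtimeout, CTLFLAG_RW,
    &pnfsd_ds_failtimeout, 0,
    "Seconds the MDS retries a data server before taking it offline");

/*
 * Somewhere in the DS retry path, the elapsed time since the last good
 * reply (last_good_reply is a made-up field) would be checked against it:
 *
 *      if (time_uptime - ds->last_good_reply > pnfsd_ds_failtimeout)
 *              ... mark the DS failed and stop using it for layouts ...
 */

A sysadmin who prefers the "leave it to me" behaviour would then just set
the sysctl to a huge value.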

I also think there might be certain error returns from sosend()/soreceive()
that may want special handling.
A simple example I experienced in recent testing was...
- One system was misconfigured with the same IP# as one of the DS systems.
   After fixing the misconfiguration, the pNFS server was wedged because it
   had a bogus arp entry, so it couldn't talk to the one mirror.
--> This was easily handled by an "arp -d" done by me on the MDS, but if
      the MDS had given up on the DS before I did that, it would have been
      a lot more work to fix. (The bogus arp entry had a very long timeout
      on it.)
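
By "special handling" I mean something like the sketch below: treat errno
values that are probably transient as "retry later" rather than as a strike
against the DS. The helper name and the exact list of errnos are guesses
for illustration only:

#include <sys/errno.h>

/*
 * Hypothetical helper: should an error from sosend()/soreceive() on a DS
 * connection count toward declaring that DS failed?
 */
static int
pnfsd_ds_error_is_fatal(int error)
{
        switch (error) {
        case EHOSTUNREACH:
        case EHOSTDOWN:
        case ENETUNREACH:
        case ENETDOWN:
                /*
                 * Probably transient (the bogus arp entry above might show
                 * up as one of these, or just as a TCP timeout), so keep
                 * the DS and retry after a delay.
                 */
                return (0);
        default:
                /* Count this one toward giving up on the mirrored DS. */
                return (1);
        }
}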

Anyhow, thanks for the comments and we'll see if others have comments, rick


