Date: Fri, 22 Jan 2010 18:00:21 +0200
From: Mikolaj Golub <to.my.trociny@gmail.com>
To: freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org
Subject: Re: FreeBSD NFS client/Linux NFS server issue
Message-ID: <86vdeuuo2y.fsf@zhuzha.ua1>
In-Reply-To: <86vdeywmha.fsf@zhuzha.ua1> (Mikolaj Golub's message of
 "Tue, 19 Jan 2010 10:02:57 +0200")
References: <86ocl272mb.fsf@kopusha.onet> <86tyuqnz9x.fsf@zhuzha.ua1>
 <86zl4awmon.fsf@zhuzha.ua1> <86vdeywmha.fsf@zhuzha.ua1>
On Tue, 19 Jan 2010 10:02:57 +0200 Mikolaj Golub wrote:

> So, on some of our freebsd7.1 nfs clients (and it looks like we have
> had a similar case with 6.3), which have several nfs mounts to the
> same CentOS 5.3 NFS server (mount options:
> rw,-3,-T,-s,-i,-r=32768,-w=32768,-o=noinet6), at some moment access
> to one of the NFS mounts gets stuck, while access to the other mounts
> works ok.
>
> In all cases we have observed so far, the first process to get stuck
> was a php script (or two) that was appending to a log file.  In
> tcpdump we see that every write to the file causes the following
> sequence of RPCs: ACCESS - READ - WRITE - COMMIT.  And at some moment
> this stops after a READ RPC call and its successful reply.
>
> After this, successful readdir/access/lookup/fstat calls from our
> other utilities are still observed in tcpdump; these utilities just
> check for the presence of some files and work ok (df works too).  The
> php process in this state is in bo_wwait, invalidating the buffer
> cache [1].
>
> If at this time we try accessing the share with mc, it hangs
> acquiring the vn_lock held by the php process [2], and after this any
> operation on this NFS share hangs (df hangs too).
>
> If instead some other process is started that appends to some other
> file on this share, the first process "unfreezes" too (resuming with
> the WRITE RPC, so there are no retransmits).

So it looks to me like the problem is that the affected nfsmount
eventually ends up in this state:

(kgdb) p *nmp
$1 = {nm_mtx = {lock_object = {lo_name = 0xc0b808ee "NFSmount lock",
      lo_type = 0xc0b808ee "NFSmount lock", lo_flags = 16973824,
      lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness = 0x0}},
    mtx_lock = 4, mtx_recurse = 0}, nm_flag = 35399, nm_state = 1310720,
  nm_mountp = 0xc6b472cc, nm_numgrps = 16,
  nm_fh = "\001\000\000\000\000\223\000\000\001@\003\n", '\0' <repeats 115 times>,
  nm_fhsize = 12, nm_rpcclnt = {rc_flag = 0, rc_wsize = 0, rc_rsize = 0,
    rc_name = 0x0, rc_so = 0x0, rc_sotype = 0, rc_soproto = 0,
    rc_soflags = 0, rc_timeo = 0, rc_retry = 0, rc_srtt = {0, 0, 0, 0},
    rc_sdrtt = {0, 0, 0, 0}, rc_sent = 0, rc_cwnd = 0, rc_timeouts = 0,
    rc_deadthresh = 0, rc_authtype = 0, rc_auth = 0x0, rc_prog = 0x0,
    rc_proctlen = 0, rc_proct = 0x0}, nm_so = 0xc6e81d00, nm_sotype = 1,
  nm_soproto = 0, nm_soflags = 44, nm_nam = 0xc6948640, nm_timeo = 6000,
  nm_retry = 2, nm_srtt = {15, 15, 31, 52}, nm_sdrtt = {3, 3, 15, 15},
  nm_sent = 0, nm_cwnd = 4096, nm_timeouts = 0, nm_deadthresh = 9,
  nm_rsize = 32768, nm_wsize = 32768, nm_readdirsize = 4096,
  nm_readahead = 1, nm_wcommitsize = 1177026, nm_acdirmin = 30,
  nm_acdirmax = 60, nm_acregmin = 3, nm_acregmax = 60,
  nm_verf = "Jë¾W\000\004oí",
  nm_bufq = {tqh_first = 0xda82dc70, tqh_last = 0xda8058e0},
  nm_bufqlen = 2, nm_bufqwant = 0, nm_bufqiods = 1,
  nm_maxfilesize = 1099511627775, nm_rpcops = 0xc0c2b5bc,
  nm_tprintf_initial_delay = 12, nm_tprintf_delay = 30,
  nm_nfstcpstate = {rpcresid = 0, flags = 1, sock_send_inprog = 0},
  nm_hostname = "172.30.10.92\000/var/www/app31", '\0' <repeats 60 times>,
  nm_clientid = 0, nm_fsid = {val = {0, 0}}, nm_lease_time = 0,
  nm_last_renewal = 0}

We have a nonempty nm_bufq and nm_bufqiods = 1, but no nfsiod thread is
actually running for this mount, which is wrong: nm_bufq will not be
emptied until some other process starts writing to the nfsmount and
thereby starts an nfsiod thread for it.

Reviewing the code for how this could happen, I see the following path.
Could someone confirm or disprove it?
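An aside, to make the wedged state concrete.  As I understand
nfs_asyncio(), when no iod is free but nm_bufqiods is nonzero, it
assumes an iod is already working on this mount and just queues the
buffer.  Here is a toy user-space model of that decision -- all names
(mount_model, enqueue_async) are invented, this is not the kernel code:

    /* Toy model only.  The producer trusts a per-mount iod counter:
     * when it is nonzero it just queues, assuming some iod will
     * drain the queue. */
    #include <stdio.h>

    struct mount_model {
            int bufqlen;            /* like nm_bufqlen */
            int bufqiods;           /* like nm_bufqiods */
    };

    /* Returns 0 if queued for async processing, -1 if the caller
     * must first find or create an iod. */
    static int
    enqueue_async(struct mount_model *mp)
    {
            if (mp->bufqiods > 0) {
                    mp->bufqlen++;  /* nobody drains this if the
                                       counter is stale */
                    return (0);
            }
            return (-1);
    }

    int
    main(void)
    {
            /* The kgdb state above: bufqiods = 1, but no iod
             * actually serves the mount. */
            struct mount_model m = { .bufqlen = 2, .bufqiods = 1 };

            printf("queued: %s\n", enqueue_async(&m) == 0 ? "yes" : "no");
            printf("bufqlen now %d\n", m.bufqlen);
            return (0);
    }

With a stale counter, every enqueue "succeeds" and nothing ever drains
the queue -- exactly the kgdb picture above.  Now for the path itself.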
In nfs_bio.c:nfs_asyncio() we have:

1363	mtx_lock(&nfs_iod_mtx);
...
1374	/*
1375	 * Find a free iod to process this request.
1376	 */
1377	for (iod = 0; iod < nfs_numasync; iod++)
1378		if (nfs_iodwant[iod]) {
1379			gotiod = TRUE;
1380			break;
1381		}
1382
1383	/*
1384	 * Try to create one if none are free.
1385	 */
1386	if (!gotiod) {
1387		iod = nfs_nfsiodnew();
1388		if (iod != -1)
1389			gotiod = TRUE;
1390	}

Let's consider the situation when a new nfsiod is created.
nfs_nfsiod.c:nfs_nfsiodnew() unlocks nfs_iod_mtx before creating the
nfssvc_iod thread:

179	mtx_unlock(&nfs_iod_mtx);
180	error = kthread_create(nfssvc_iod, nfs_asyncdaemon + i, NULL, RFHIGHPID,
181	    0, "nfsiod %d", newiod);
182	mtx_lock(&nfs_iod_mtx);

And nfs_nfsiod.c:nfssvc_iod() does the following:

226	mtx_lock(&nfs_iod_mtx);
...
238	nfs_iodwant[myiod] = curthread->td_proc;
239	nfs_iodmount[myiod] = NULL;
...
244	error = msleep(&nfs_iodwant[myiod], &nfs_iod_mtx, PWAIT | PCATCH,
245	    "-", timo);

Suppose that at this moment another nfs_asyncio() request arrives for
another nfsmount, and that thread locks nfs_iod_mtx.  It will then find
nfs_iodwant[iod] set in the "for" loop and take the new iod for itself.
When the first thread finally returns from nfs_nfsiodnew(), it will
insert its buffer into nmp->nm_bufq, but the nfsiod will be processing
the other nmp.

It looks like the fix for this situation would be to check
nfs_iodwant[iod] after nfs_nfsiodnew():

--- nfs_bio.c.orig	2010-01-22 15:38:02.000000000 +0000
+++ nfs_bio.c	2010-01-22 15:39:58.000000000 +0000
@@ -1385,7 +1385,7 @@ again:
 	 */
 	if (!gotiod) {
 		iod = nfs_nfsiodnew();
-		if (iod != -1)
+		if ((iod != -1) && (nfs_iodwant[iod] == NULL))
 			gotiod = TRUE;
 	}

The scenario described here could be our case: we have 7 NFS mounts on
the problem host, and at the same moment cron started one or two
scripts for every mount.
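For what it's worth, the window is easy to model in user space.  Below
is a toy pthreads sketch of the interleaving -- my own invention, not
the kernel code (iod_thread, iodnew, asyncio and the sleeps that widen
the window are all made up).  Mount 1 finds no free iod and creates one
while the shared lock is dropped; the new worker advertises itself;
mount 2 claims it first; mount 1 then queues a buffer that nobody will
drain:

    /* cc -o iodrace iodrace.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t iod_mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  iod_cv  = PTHREAD_COND_INITIALIZER;
    static int iod_free;        /* models nfs_iodwant[iod] != NULL */
    static int iod_mount;       /* models nfs_iodmount[iod]; 0 = none */

    /* Models nfssvc_iod(): advertise as free, sleep until claimed. */
    static void *
    iod_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&iod_mtx);
            iod_free = 1;
            while (iod_mount == 0)
                    pthread_cond_wait(&iod_cv, &iod_mtx);
            printf("iod: serving mount %d\n", iod_mount);
            pthread_mutex_unlock(&iod_mtx);
            return (NULL);
    }

    /* Models nfs_nfsiodnew(): drops the lock across thread creation. */
    static void
    iodnew(void)
    {
            pthread_t t;

            pthread_mutex_unlock(&iod_mtx);     /* <-- the window */
            pthread_create(&t, NULL, iod_thread, NULL);
            sleep(1);                           /* widen it for the demo */
            pthread_mutex_lock(&iod_mtx);
    }

    /* Models nfs_asyncio() for mount m. */
    static void
    asyncio(int m)
    {
            pthread_mutex_lock(&iod_mtx);
            if (iod_free) {                     /* the "for" loop hit */
                    iod_free = 0;
                    iod_mount = m;              /* claim the iod */
                    pthread_cond_signal(&iod_cv);
            } else {
                    iodnew();
                    /* Unpatched logic: assume the new iod is ours, even
                     * though it may have been claimed in the window. */
            }
            printf("mount %d: buffer queued\n", m);
            pthread_mutex_unlock(&iod_mtx);
    }

    static void *
    second_mount(void *arg)
    {
            (void)arg;
            usleep(100000);     /* arrive inside mount 1's window */
            asyncio(2);
            return (NULL);
    }

    int
    main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, second_mount, NULL);
            asyncio(1);         /* no free iod: creates one, loses it */
            pthread_join(t, NULL);
            sleep(1);           /* let the iod finish printing */
            return (0);
    }

Typical output is "mount 2: buffer queued", "iod: serving mount 2",
"mount 1: buffer queued" -- mount 1 ends up with a queued buffer and no
iod, which is the stranded state from the kgdb dump.  In this toy, the
proposed patch corresponds to re-checking, once the lock is re-taken,
that the iod just created has not already been handed to another mount
before assuming it is ours.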
So we had something like this in top (the cron tasks started at 23:02):

last pid: 64884;  load averages:  0.28,  0.34,  0.24   up 0+22:15:41  23:02:04
300 processes: 6 running, 259 sleeping, 1 stopped, 17 zombie, 17 waiting
CPU: 10.2% user,  0.0% nice,  7.6% system,  1.0% interrupt, 81.2% idle
Mem: 174M Active, 2470M Inact, 221M Wired, 136M Cache, 112M Buf, 251M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
64793 app12     -1    0 23352K 11980K nfsreq 0   0:00  1.07% php
64789 app16     -1    0 21304K 11084K nfsreq 0   0:00  0.98% php
64784 app16     -1    0 19256K  9696K nfsreq 2   0:00  0.88% php
64768 app20     -1    0 19256K  9300K nfsreq 0   0:00  0.78% php
64759 app20     -1    0 18232K  8888K nfsreq 1   0:00  0.78% php
64722 app31     -1    0 20280K  9956K nfsreq 0   0:00  0.68% php
64781 app18     -1    0 19256K  9412K nfsreq 3   0:00  0.68% php
64778 app26     -1    0 18232K  8840K nfsreq 1   0:00  0.68% php
64800 app8      -1    0 18232K  8664K nfsreq 3   0:00  0.68% php
64728 app31     -1    0 18232K  8752K nfsreq 0   0:00  0.59% php
64795 app18     -1    0 18232K  8676K nfsreq 1   0:00  0.59% php
64777 app22     -1    0 18232K  8984K nfsreq 0   0:00  0.49% php
 2342 app31     -4    0 22236K  7780K nfs    1   0:13  0.00% icoms_agent_cox215
58920 root       8    -     0K     8K -      2   0:08  0.00% nfsiod 0
 2334 app31     -4    0 18908K  6356K nfs    1   0:05  0.00% icoms_agent_cox001
64297 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 1
64298 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 2
64303 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 3
64874 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 12
64870 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 9
64866 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 4
64873 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 11
64867 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 5
64869 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 8
64872 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 10
64868 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 7
64871 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 6

last pid: 64967;  load averages:  0.42,  0.37,  0.25   up 0+22:15:46  23:02:09
295 processes: 7 running, 251 sleeping, 1 stopped, 19 zombie, 17 waiting
CPU: 69.1% user,  0.0% nice,  8.3% system,  1.5% interrupt, 21.1% idle
Mem: 376M Active, 2488M Inact, 226M Wired, 124M Cache, 106M Buf, 37M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
64793 app12     99    0 86840K 59968K CPU3   3   0:02 16.55% php
64768 app20     -1    0 57144K 38424K nfsreq 1   0:02 15.19% php
64722 app31     99    0 61240K 41228K CPU0   0   0:02 15.19% php
64781 app18     -1    0 54072K 35612K nfsreq 2   0:02 13.67% php
64789 app16     -1    0 48952K 31660K nfsreq 3   0:01 10.60% php
64777 app22     -1    0 43832K 27876K nfsreq 0   0:01  9.86% php
64784 app16     -1    0 45880K 29648K nfsreq 0   0:01  9.77% php
64759 app20     -7    0 36664K 22792K bo_wwa 0   0:01  8.25% php
64800 app8      -7    0 24376K 13596K bo_wwa 1   0:01  2.39% php
64795 app18     -7    0 23352K 12788K bo_wwa 3   0:00  1.37% php
58920 root      -1    -     0K     8K nfsreq 2   0:08  0.00% nfsiod 0
64303 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 3
64866 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 4
64297 root      -1    -     0K     8K nfsreq 2   0:00  0.00% nfsiod 1
64298 root      -1    -     0K     8K nfsreq 3   0:00  0.00% nfsiod 2
64873 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 11
64868 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 7
64867 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 5
64871 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 6
64947 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 14
64950 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 17
64870 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 9
64869 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 8
64949 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 16
64874 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 12
64872 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 10
64952 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 19
64948 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 15
64951 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 18
64946 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 13

last pid: 64968;  load averages:  0.54,  0.39,  0.26   up 0+22:15:51  23:02:14
289 processes: 7 running, 243 sleeping, 1 stopped, 21 zombie, 17 waiting
CPU: 28.7% user,  0.0% nice,  8.8% system,  1.1% interrupt, 61.4% idle
Mem: 404M Active, 2503M Inact, 224M Wired, 83M Cache, 107M Buf, 37M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
64793 app12     -1    0   148M   106M nfsreq 1   0:07 41.55% php
64722 app31     -7    0 61240K 41232K bo_wwa 1   0:03 14.26% php
64768 app20     -7    0 57144K 38424K bo_wwa 3   0:03 13.67% php
64781 app18     -7    0 54072K 35612K bo_wwa 0   0:02 11.18% php
64789 app16     -7    0 48952K 31660K bo_wwa 0   0:02  7.96% php
64784 app16     -7    0 45880K 29648K bo_wwa 0   0:02  7.76% php
64777 app22     -7    0 43832K 27876K bo_wwa 0   0:01  6.40% php
64759 app20     -7    0 36664K 22792K bo_wwa 0   0:01  4.79% php
58920 root      -1    -     0K     8K nfsreq 2   0:08  0.00% nfsiod 0
64867 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 5
64873 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 11
64303 root      -1    -     0K     8K nfsreq 0   0:00  0.00% nfsiod 3
64866 root      -1    -     0K     8K nfsreq 1   0:00  0.00% nfsiod 4
64297 root      -1    -     0K     8K nfsreq 0   0:00  0.00% nfsiod 1
64298 root      -1    -     0K     8K nfsreq 3   0:00  0.00% nfsiod 2
64871 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 6
64868 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 7
64869 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 8
64947 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 14
64872 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 10
64874 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 12
64870 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 9
64949 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 16
64950 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 17
64948 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 15
64951 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 18
64946 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 13
64952 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 19

last pid: 64969;  load averages:  0.50,  0.39,  0.25   up 0+22:15:56  23:02:19
269 processes: 6 running, 219 sleeping, 1 stopped, 26 zombie, 17 waiting
CPU: 11.9% user,  0.0% nice,  5.8% system,  0.8% interrupt, 81.5% idle
Mem: 264M Active, 2504M Inact, 232M Wired, 83M Cache, 112M Buf, 169M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
64793 app12     -1    0   148M   106M nfsreq 3   0:08 33.69% php
64789 app16     -7    0 48952K 31660K bo_wwa 0   0:02  4.98% php
64784 app16     -7    0 45880K 29648K bo_wwa 0   0:02  4.88% php
58920 root       8    -     0K     8K -      1   0:08  0.20% nfsiod 0
64867 root       8    -     0K     8K -      2   0:00  0.10% nfsiod 5
64303 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 3
64297 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 1
64873 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 11
64866 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 4
64871 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 6
64298 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 2
64868 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 7
64869 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 8
64947 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 14
64872 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 10
64874 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 12
64870 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 9
64949 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 16
64950 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 17
64948 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 15
64951 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 18
64946 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 13
64952 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 19

last pid: 64970;  load averages:  0.46,  0.38,  0.25   up 0+22:16:02  23:02:25
263 processes: 5 running, 212 sleeping, 1 stopped, 28 zombie, 17 waiting
CPU:  8.7% user,  0.0% nice,  3.1% system,  0.3% interrupt, 87.9% idle
Mem: 160M Active, 2502M Inact, 232M Wired, 83M Cache, 112M Buf, 274M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
64789 app16     -7    0 48952K 31660K bo_wwa 0   0:02  3.27% php
64784 app16     -7    0 45880K 29648K bo_wwa 0   0:02  3.17% php
58920 root       8    -     0K     8K -      1   0:08  0.00% nfsiod 0
64867 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 5
64303 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 3
64297 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 1
64873 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 11
64866 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 4
64871 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 6
64298 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 2
64868 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 7
64869 root       8    -     0K     8K -      0   0:00  0.00% nfsiod 8
64947 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 14
64872 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 10
64874 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 12
64870 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 9
64949 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 16
64950 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 17
64948 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 15
64951 root       8    -     0K     8K -      3   0:00  0.00% nfsiod 18
64946 root       8    -     0K     8K -      2   0:00  0.00% nfsiod 13
64952 root       8    -     0K     8K -      1   0:00  0.00% nfsiod 19

And these two php processes (64789 and 64784) hung until 23:05, when
another process was started that wrote to another file on this NFS
mount.

-- 
Mikolaj Golub