Date:      Mon, 20 Jan 2014 09:01:35 +0200
From:      Daniel Braniss <danny@cs.huji.ac.il>
To:        Adrian Chadd <adrian@freebsd.org>
Cc:        Rick Macklem <rmacklem@uoguelph.ca>, FreeBSD stable <freebsd-stable@freebsd.org>
Subject:   Re: on 9.2-stable nfs/zfs and 10g hang
Message-ID:  <C2102616-3239-4425-8475-51B709A57737@cs.huji.ac.il>
In-Reply-To: <CAJ-VmomgV=W6O2fMXiaJnLopMBDV-=N6XDF17mSWe2Tok96Jkg@mail.gmail.com>
References:  <588564685.11730322.1389970076386.JavaMail.root@uoguelph.ca> <2C287272-7B57-4AAD-B22F-6A65D9F8677B@cs.huji.ac.il> <CAJ-VmomgV=W6O2fMXiaJnLopMBDV-=N6XDF17mSWe2Tok96Jkg@mail.gmail.com>


On Jan 18, 2014, at 6:13 PM, Adrian Chadd <adrian@freebsd.org> wrote:

> Hi!
>
> Please try reducing the size down to 32k but leave TSO enabled.
>
did so, it worked ok, but took longer:
with TSO disabled:       14834.61 real       609.29 user      1996.90 sys
with TSO + 32k:          15714.46 real       639.98 user      1828.07 sys
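
To put "took longer" in numbers, the two timing lines above can be compared directly. This is only a sketch: the dictionary keys and the helper name percent_change are made up for illustration, and the figures are copied from the two runs.

```python
# Timings (seconds) copied from the two runs reported above.
runs = {
    "tso_disabled": {"real": 14834.61, "user": 609.29, "sys": 1996.90},
    "tso_32k": {"real": 15714.46, "user": 639.98, "sys": 1828.07},
}


def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Compare the TSO + 32k run against the TSO-disabled run, field by field.
delta = {
    field: percent_change(runs["tso_disabled"][field], runs["tso_32k"][field])
    for field in ("real", "user", "sys")
}

for field, pct in delta.items():
    print(f"{field}: {pct:+.1f}%")
```

On these figures the TSO + 32k run is roughly 6% slower in wall-clock time, while spending somewhat less time in sys.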

> It's 9.2, so there may be some bugfixes that haven't been backported
> from 10 or -HEAD. Would you be able to try a -HEAD snapshot here?
>
ENOTIME :-).

> What's the NFS server and hosts? I saw the core.txt.16 that says
> "ix0/ix1" so I can glean the basic chipset family but which NIC in
> particular is it? What would people need to try and reproduce it?
>
The hosts involved are Dell 720/710;
the 10G cards are Intel:

ix0@pci0:5:0:0: class=0x020000 card=0x7a118086 chip=0x10fb8086 rev=0x01 hdr=0x00
   vendor     = 'Intel Corporation'
   device     = '82599EB 10-Gigabit SFI/SFP+ Network Connection'
   class      = network
   subclass   = ethernet

the server is exporting a big ZFS file system, which is served via 2 RAID controllers:

mfi1@pci0:65:0:0:       class=0x010400 card=0x1f2d1028 chip=0x005b1000 rev=0x05 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'MegaRAID SAS 2208 [Thunderbolt]'
    class      = mass storage
    subclass   = RAID
mfi2@pci0:66:0:0:       class=0x010400 card=0x1f151028 chip=0x00791000 rev=0x05 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'MegaRAID SAS 2108 [Liberator]'
    class      = mass storage
    subclass   = RAID

- just had the driver card lying around-

I will try a different client, which has a Broadcom NIC, later.

Q: is the TSO bug in the NIC/driver or in the kernel or both?

cheers
	danny




>
> -a
>
>
> On 18 January 2014 03:24, Daniel Braniss <danny@cs.huji.ac.il> wrote:
>>
>> On Jan 17, 2014, at 4:47 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>>> Daniel Braniss wrote:
>>>> hi all,
>>>>
>>>> All was going ok till I decided to connect this host via a 10g nic
>>>> and very soon it started
>>>> to hang. Running multiple make buildworlds from other hosts connected
>>>> via 10g and
>>>> using both src and obj on the server via tcp/nfs did ok. but running
>>>>     find ... -exec md5 {} + (the find finds over 6M files)
>>>> from another host (at 10g) will hang it very quickly.
>>>>
>>>> If I wait a while (can't be more specific) it sometimes recovers -
>>>> but my users are not very
>>>> patient :-)
>>>>
>>> This suggests that an RPC request/reply gets dropped in a way that TCP
>>> doesn't recover. Eventually (after up to about 15min, I think?) the TCP
>>> connection will be shut down and a new TCP connection started, with a
>>> retry of outstanding RPCs.
>>>
>>>> I will soon try the same experiment using the old 1G nic, but in the
>>>> meantime, if someone
>>>> could shed some light would be very helpful
>>>>
>>>> I'm attaching core.txt, but if it doesn't make it, it's also
>>>> available at:
>>>>     ftp://ftp.cs.huji.ac.il/users/danny/freebsd/core.txt.16
>>>>
>>> You might try disabling TSO on the net interface. There have been issues
>>> with TSO for segments around 64K in the past (or use rsize=32768,wsize=32768
>>> options on the client mount, to avoid RPCs over about 32K in size).
>>>
>> BINGO! disabling tso did it. I'll try reducing the packet size later.
>> some numbers:
>> there were some 7*10^6 files
>> doing it locally (the find + md5) took about 3 hrs,
>> via nfs at 1g took 11 hrs.
>> at 10g it took 4 hrs.
>>
>> thanks!
>>        danny
>>
>>
>>> Beyond that, capturing a packet trace for the case that hangs easily and
>>> looking at what goes on near the end of it in wireshark might give you
>>> a hint about what is going on.
>>>
>>> rick
>>>
>>>> thanks,
>>>>     danny
>>>> _______________________________________________
>>>> freebsd-stable@freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
>>>> To unsubscribe, send any mail to
>>>> "freebsd-stable-unsubscribe@freebsd.org"



