Date: Mon, 20 Jan 2014 09:01:35 +0200
From: Daniel Braniss <danny@cs.huji.ac.il>
To: Adrian Chadd <adrian@freebsd.org>
Cc: Rick Macklem <rmacklem@uoguelph.ca>, FreeBSD stable <freebsd-stable@freebsd.org>
Subject: Re: on 9.2-stable nfs/zfs and 10g hang
Message-ID: <C2102616-3239-4425-8475-51B709A57737@cs.huji.ac.il>
In-Reply-To: <CAJ-VmomgV=W6O2fMXiaJnLopMBDV-=N6XDF17mSWe2Tok96Jkg@mail.gmail.com>
References: <588564685.11730322.1389970076386.JavaMail.root@uoguelph.ca> <2C287272-7B57-4AAD-B22F-6A65D9F8677B@cs.huji.ac.il> <CAJ-VmomgV=W6O2fMXiaJnLopMBDV-=N6XDF17mSWe2Tok96Jkg@mail.gmail.com>

On Jan 18, 2014, at 6:13 PM, Adrian Chadd <adrian@freebsd.org> wrote:

> Hi!
>
> Please try reducing the size down to 32k but leave TSO enabled.
>
did so, it worked ok, but took longer:
with TSO disabled: 14834.61 real  609.29 user  1996.90 sys
with TSO + 32k:    15714.46 real  639.98 user  1828.07 sys

> It's 9.2, so there may be some bugfixes that haven't been backported
> from 10 or -HEAD. Would you be able to try a -HEAD snapshot here?
>
ENOTIME :-).

> What's the NFS server and hosts? I saw the core.txt.16 that says
> "ix0/ix1" so I can glean the basic chipset family but which NIC in
> particular is it? What would people need to try and reproduce it?
>
The hosts involved are Dell 720/710.
The 10G cards are Intel:
ix0@pci0:5:0:0: class=0x020000 card=0x7a118086 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599EB 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet

The server is exporting a big ZFS file system, which is served via 2 raid controllers:
mfi1@pci0:65:0:0: class=0x010400 card=0x1f2d1028 chip=0x005b1000 rev=0x05 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'MegaRAID SAS 2208 [Thunderbolt]'
    class      = mass storage
    subclass   = RAID
mfi2@pci0:66:0:0: class=0x010400 card=0x1f151028 chip=0x00791000 rev=0x05 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'MegaRAID SAS 2108 [Liberator]'
    class      = mass storage
    subclass   = RAID
- just had the driver card lying around-

I will try a divergent client, which has a Broadcom NIC, later.

Q: is the TSO bug in the NIC/driver or in the kernel or both?

cheers
danny

>
> -a
>
>
> On 18 January 2014 03:24, Daniel Braniss <danny@cs.huji.ac.il> wrote:
>>
>> On Jan 17, 2014, at 4:47 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>>> Daniel Braniss wrote:
>>>> hi all,
>>>>
>>>> All was going ok till I decided to connect this host via a 10g nic
>>>> and very soon it started to hang. Running multiple make buildworlds
>>>> from other hosts connected via 10g and using both src and obj on the
>>>> server via tcp/nfs did ok. but running
>>>> find ... -exec md5 {} + (the find finds over 6M files)
>>>> from another host (at 10g) will hang it very quickly.
>>>>
>>>> If I wait a while (can't be more specific) it sometimes recovers -
>>>> but my users are not very patient :-)
>>>>
>>> This suggests that an RPC request/reply gets dropped in a way that TCP
>>> doesn't recover. Eventually (after up to about 15min, I think?) the TCP
>>> connection will be shut down and a new TCP connection started, with a
>>> retry of outstanding RPCs.
>>>
>>>> I will soon try the same experiment using the old 1G nic, but in the
>>>> meantime, if someone could shed some light would be very helpful
>>>>
>>>> I'm attaching core.txt, but if it doesn't make it, it's also
>>>> available at:
>>>> ftp://ftp.cs.huji.ac.il/users/danny/freebsd/core.txt.16
>>>>
>>> You might try disabling TSO on the net interface. There have been issues
>>> with TSO for segments around 64K in the past (or use rsize=32768,wsize=32768
>>> options on the client mount, to avoid RPCs over about 32K in size).
>>>
>> BINGO! disabling tso did it. I'll try reducing the packet size later.
>> some numbers:
>> there were some 7*10^6 files
>> doing it locally (the find + md5) took about 3 hrs,
>> via nfs at 1g took 11 hrs.
>> at 10g it took 4 hrs.
>>
>> thanks!
>> danny
>>
>>
>>> Beyond that, capturing a packet trace for the case that hangs easily and
>>> looking at what goes on near the end of it in wireshark might give you
>>> a hint about what is going on.
>>>
>>> rick
>>>
>>>> thanks,
>>>> danny
>>>> _______________________________________________
>>>> freebsd-stable@freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
>>>> To unsubscribe, send any mail to
>>>> "freebsd-stable-unsubscribe@freebsd.org"
>>>>
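
For reference, the two workarounds discussed in this thread (turning TSO off on the interface, or leaving TSO on and capping NFS RPCs at 32K on the client mount) might be applied roughly as follows on a FreeBSD host. This is only a sketch: ix0 is the interface named in the thread, but the server name "nfsserver", the export path "/export" and the mount point "/mnt" are hypothetical placeholders, and the commands need root.

```shell
# Workaround 1: turn off TCP segmentation offload on the 10G interface.
# ix0 is the interface mentioned in the thread; adjust for other hardware.
ifconfig ix0 -tso

# Workaround 2 (alternative): leave TSO enabled, but limit NFS read and
# write RPCs to 32K on the client mount, as suggested in the thread.
# "nfsserver:/export" and "/mnt" are placeholders, not names from the thread.
mount -t nfs -o tcp,rsize=32768,wsize=32768 nfsserver:/export /mnt

# Inspect the interface afterwards; the options= line lists the
# offload features that are currently enabled.
ifconfig ix0
```

Neither setting persists across a reboot when applied this way; TSO can be switched back on with "ifconfig ix0 tso".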
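
To put the two "real" times reported above side by side, the relative slowdown of the TSO + 32k run against the TSO-disabled run can be computed with a small helper. The function name slowdown_pct is made up for this illustration; the two input values are the wall-clock seconds quoted in the message.

```shell
# Print how much slower run "b" is than run "a", as a percentage
# of a's wall-clock time. slowdown_pct is a hypothetical helper name.
slowdown_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

# "real" seconds from the thread: TSO disabled vs. TSO enabled with 32k RPCs
slowdown_pct 14834.61 15714.46
```

With these inputs the helper prints 5.9, so the 32k run took about 6% longer in wall-clock time. A single pair of runs says little about variance, so the figure is indicative only.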