From: "Steven Hartland"
To: Sébastien RICCIO
Subject: Re: FreeBSD 9.1 and BCM57711 issues (broadcom 10ge ethernet card)
Date: Tue, 23 Jul 2013 12:56:12 +0100
Message-ID: <7D8CE344ACD04EC8B8C7DC813D1EA3F6@multiplay.co.uk>
References: <51EE68E9.5010805@swisscenter.com>
List-Id: Networking and TCP/IP with FreeBSD

Have you tried a more recent version
e.g. 9.2-PRERELEASE or 9/stable?

Regards
Steve

----- Original Message -----
From: "Sébastien RICCIO"
To:
Sent: Tuesday, July 23, 2013 12:28 PM
Subject: FreeBSD 9.1 and BCM57711 issues (broadcom 10ge ethernet card)

Hi freebsd-net!

We recently installed FreeBSD 9.1 64-bit on a Dell PowerEdge R510 system in which we have two BCM57711 cards (for a total of four 10Gbit interfaces). We're planning to use it as a storage filer using ZFS/NFS.

Currently, in testing, the filer is connected with two 10GbE interfaces to a 10GbE Dell PowerConnect switch that serves some Linux clients, which also use 10GbE cards.

We have run into a lot of trouble trying to get this setup working.

-- First issue:

Without any special tweaking, when we're reading from or writing to the NFS server from a client, the network card crashes and becomes unusable. In the logs I can see:

Jul 19 11:49:26 filer-01-a kernel: bxe0: ---------- Begin crash dump ----------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------ Idle Check ------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CFC: AC > 1 - LCID 39 CID_CAM 0x7 Value is 0xc
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: VOQ_0, VOQ credit is not equal to initial credit. Values are 0xf8 0x140
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: P0 Byte credit is not equal to initial credit. Values are 0x5a1c 0x8000
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING CCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING XCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING BRB1: BRB is not empty. Value is 0x3
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING TCM: FIC0_INIT_CRD is not 64. Value is 0x30
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR TSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR XSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: bxe_idle_chk(): Failed with 4 error(s) and 0 warning(s)!
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------------------------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------ Idle Check ------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CFC: AC > 1 - LCID 39 CID_CAM 0x7 Value is 0xc
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: VOQ_0, VOQ credit is not equal to initial credit. Values are 0xf8 0x140
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: P0 Byte credit is not equal to initial credit. Values are 0x5a1c 0x8000
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING CCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING XCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING BRB1: BRB is not empty. Value is 0x4
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING TCM: FIC0_INIT_CRD is not 64. Value is 0x30
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING PRS: TCM current credit is not 0. Value is 0x10
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR TSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR XSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: bxe_idle_chk(): Failed with 4 error(s) and 0 warning(s)!
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------------------------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ---------- End crash dump ----------

Even a reboot of the system is not enough: after rebooting, I can't ping any host on the network.
It seems that the crash leaves the card in a bogus state that requires a complete power cycle to get the cards back in business.

We found out that disabling tso4, txcsum and rxcsum on the cards prevents this from happening. So, although I don't think it is a real fix, let's say we have a workaround for this by setting something like this in rc.conf:

ifconfig_bxe0="inet 10.50.50.11 netmask 255.255.255.0 mtu 9000 -tso4 -txcsum -rxcsum"

-- Second issue:

Issuing an ifconfig mtu 9000 on the interfaces randomly produces this error:

Jul 19 09:47:03 filer-01-a kernel: bxe0: /usr/src/sys/dev/bxe/if_bxe.c(10934): Memory allocation failure! Cannot fill fp[04] RX chain.
Jul 19 09:47:03 filer-01-a kernel: bxe0: /usr/src/sys/dev/bxe/if_bxe.c(3921): NIC initialization failed, aborting!
Jul 19 09:47:12 filer-01-a kernel: bxe3: /usr/src/sys/dev/bxe/if_bxe.c(10934): Memory allocation failure! Cannot fill fp[04] RX chain.
Jul 19 09:47:12 filer-01-a kernel: bxe3: /usr/src/sys/dev/bxe/if_bxe.c(3921): NIC initialization failed, aborting!

That sounds quite bad, and I can't reproduce it with an MTU of 1500. (But does it make sense to use an MTU of 1500 on a 10gig local network...?)

-- Third issue, part 1)

We've tried bonding two interfaces (each with an MTU of 9000) using lagg, like this:

ifconfig bxe0 up -tso4 -txcsum -rxcsum mtu 9000
ifconfig bxe2 up -tso4 -txcsum -rxcsum mtu 9000
ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24

This instantly crashes the kernel and causes a machine reboot.
The log says:

Jul 19 09:47:12 filer-01-a kernel:
Jul 19 09:47:12 filer-01-a kernel:
Jul 19 09:47:12 filer-01-a kernel: Fatal trap 12: page fault while in kernel mode
Jul 19 09:47:12 filer-01-a kernel: cpuid = 0; apic id = 20
Jul 19 09:47:12 filer-01-a kernel: fault virtual address = 0x6d
Jul 19 09:47:12 filer-01-a kernel: fault code = supervisor read data, page not present
Jul 19 09:47:12 filer-01-a kernel: instruction pointer = 0x20:0xffffffff808d5879
Jul 19 09:47:12 filer-01-a kernel: stack pointer = 0x28:0xffffff80003227f0
--*** BOOOM REBOOT ***--
Jul 19 09:49:49 filer-01-a syslogd: kernel boot file is /boot/kernel/kernel

/var/crash/core.txt.0 returns:

Unread portion of the kernel message buffer:

Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 33
fault virtual address = 0x6d
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff808d5879
stack pointer = 0x28:0xffffff80003227f0
frame pointer = 0x28:0xffffff8000322820
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi6: task queue)
trap number = 12
panic: page fault
cpuid = 5
KDB: stack backtrace:
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80bd8240 at trap_fatal+0x290
#3 0xffffffff80bd857d at trap_pfault+0x1ed
#4 0xffffffff80bd8b9e at trap+0x3ce
#5 0xffffffff80bc315f at calltrap+0x8
#6 0xffffffff8045da8c at bxe_free_buf_rings+0x4c
#7 0xffffffff8046c0d5 at bxe_init_locked+0x125
#8 0xffffffff80470cfe at bxe_ioctl+0x4fe
#9 0xffffffff8099d08f at if_setlladdr+0x1ff
#10 0xffffffff8174c94a at lagg_port_setlladdr+0x8a
#11 0xffffffff8092cf55 at taskqueue_run_locked+0x85
#12 0xffffffff8092d0da at taskqueue_run+0x3a
#13 0xffffffff808be8d4 at intr_event_execute_handlers+0x104
#14 0xffffffff808c0076 at ithread_loop+0xa6
#15 0xffffffff808bb9ef at fork_exit+0x11f
#16 0xffffffff80bc368e at fork_trampoline+0xe
Uptime: 39m41s
Dumping 1505 out of 32735 MB:..2%..11%..21%..31%..41%..52%..61%..71%..81%..91%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /boot/kernel/zfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /boot/kernel/opensolaris.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/if_lagg.ko...Reading symbols from /boot/kernel/if_lagg.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/if_lagg.ko
#0 doadump (textdump=Variable "textdump" is not available.
) at pcpu.h:224
224 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) #0 doadump (textdump=Variable "textdump" is not available.
) at pcpu.h:224
#1 0xffffffff808ea3a1 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:448
#2 0xffffffff808ea897 in panic (fmt=0x1) at /usr/src/sys/kern/kern_shutdown.c:636
#3 0xffffffff80bd8240 in trap_fatal (frame=0xc, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:857
#4 0xffffffff80bd857d in trap_pfault (frame=0xffffff8000322740, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:773
#5 0xffffffff80bd8b9e in trap (frame=0xffffff8000322740) at /usr/src/sys/amd64/amd64/trap.c:456
#6 0xffffffff80bc315f in calltrap () at /usr/src/sys/amd64/amd64/exception.S:228
#7 0xffffffff808d5879 in free (addr=0xffffff80083e5000, mtp=0xffffffff81198ba0) at uma_int.h:413
#8 0xffffffff8045da8c in bxe_free_buf_rings (sc=0xffffff8000c1c000) at /usr/src/sys/dev/bxe/if_bxe.c:3787
#9 0xffffffff8046c0d5 in bxe_init_locked (sc=0x0, load_mode=0) at /usr/src/sys/dev/bxe/if_bxe.c:4063
#10 0xffffffff80470cfe in bxe_ioctl (ifp=0xfffffe000ec59000, command=Variable "command" is not available.
) at /usr/src/sys/dev/bxe/if_bxe.c:9668
#11 0xffffffff8099d08f in if_setlladdr (ifp=0xfffffe000ec59000, lladdr=0xfffffe00125da4c8 "", len=6) at /usr/src/sys/net/if.c:3304
#12 0xffffffff8174c94a in lagg_port_setlladdr (arg=Variable "arg" is not available.
) at /usr/src/sys/modules/if_lagg/../../net/if_lagg.c:495
#13 0xffffffff8092cf55 in taskqueue_run_locked (queue=0xfffffe000e833980) at /usr/src/sys/kern/subr_taskqueue.c:308
#14 0xffffffff8092d0da in taskqueue_run (queue=0xfffffe000e833980) at /usr/src/sys/kern/subr_taskqueue.c:322
#15 0xffffffff808be8d4 in intr_event_execute_handlers (p=Variable "p" is not available.
) at /usr/src/sys/kern/kern_intr.c:1262
#16 0xffffffff808c0076 in ithread_loop (arg=0xfffffe000e66c140) at /usr/src/sys/kern/kern_intr.c:1275
#17 0xffffffff808bb9ef in fork_exit (callout=0xffffffff808bffd0, arg=0xfffffe000e66c140, frame=0xffffff8000322c40) at /usr/src/sys/kern/kern_fork.c:992
#18 0xffffffff80bc368e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:602
#19 0x0000000000000000 in ?? ()
#20 0x0000000000000000 in ?? ()
#21 0x0000000000000001 in ?? ()
#22 0x0000000000000000 in ?? ()
#23 0x0000000000000000 in ?? ()
#24 0x0000000000000000 in ?? ()
#25 0x0000000000000000 in ?? ()
#26 0x0000000000000000 in ?? ()
#27 0x0000000000000000 in ?? ()
#28 0x0000000000000000 in ?? ()
#29 0x0000000000000000 in ?? ()
#30 0x0000000000000000 in ?? ()
#31 0x0000000000000000 in ?? ()
#32 0x0000000000000000 in ?? ()
#33 0x0000000000000000 in ?? ()
#34 0x0000000000000000 in ?? ()
#35 0x0000000000000000 in ?? ()
#36 0x0000000000000000 in ?? ()
#37 0x0000000000000000 in ?? ()
#38 0x0000000000000000 in ?? ()
#39 0x0000000000000000 in ?? ()
#40 0x0000000000000000 in ?? ()
#41 0x0000000000000000 in ?? ()
#42 0x0000000000000000 in ?? ()
#43 0x0000000000000005 in ?? ()
#44 0xffffffff81244180 in tdq_cpu ()
#45 0xfffffe000e698000 in ?? ()
#46 0x0000000000000000 in ?? ()
#47 0xffffff8000322b30 in ?? ()
#48 0xffffff8000322ad8 in ?? ()
#49 0xfffffe000e6728e0 in ?? ()
#50 0xffffffff8091352e in sched_switch (td=0x0, newtd=0xfffffe000e66c140, flags=Variable "flags" is not available.
) at /usr/src/sys/kern/sched_ule.c:1921
Previous frame inner to this frame (corrupt stack?)
(kgdb)

Okay, I guess this again has something to do with the MTU of 9000, but this time it completely panics the kernel. This is no good.

Part 2) Trying bonding with the normal MTU of 1500

ifconfig bxe0 up -tso4 -txcsum -rxcsum mtu 1500
ifconfig bxe2 up -tso4 -txcsum -rxcsum mtu 1500
ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24

This time, no error messages, no crash. Yiha! But no: even though everything seems to be correct, the bonding is not working. We can't ping any host on the network.
Also, lagg0 reports "no carrier"; see:

bxe0: flags=8843 metric 0 mtu 1500
        options=b8
        ether 00:10:18:98:35:f8
        inet6 fe80::210:18ff:fe98:35f8%bxe0 prefixlen 64 scopeid 0x3
        nd6 options=29
        media: Ethernet autoselect (10Gbase-SR )
        status: active
bxe2: flags=8843 metric 0 mtu 1500
        options=b8
        ether 00:10:18:98:35:f8
        inet6 fe80::210:18ff:fe95:eaa0%bxe2 prefixlen 64 scopeid 0x5
        nd6 options=29
        media: Ethernet autoselect (10Gbase-SR )
        status: active
lagg0: flags=8843 metric 0 mtu 1500
        options=b8
        ether 00:10:18:98:35:f8
        inet6 fe80::7a2b:cbff:fe1a:eab1%lagg0 prefixlen 64 scopeid 0x14
        inet 10.50.50.11 netmask 0xffffff00 broadcast 10.50.50.255
        nd6 options=21
        media: Ethernet autoselect
        status: no carrier
        laggproto failover lagghash l2,l3,l4
        laggport: bxe2 flags=0<>
        laggport: bxe0 flags=1

Please note that prior to installing FreeBSD, the machine was running a Debian 7 GNU/Linux 64-bit OS where we had the cards bonded and the MTU set to 9000 without any crash or stability issue.

So it looks to me that there is something really wrong with the Broadcom driver on FreeBSD 9.1, at least with the NICs used in Dell servers.

Given that Broadcom themselves don't supply drivers for FreeBSD, is there any possible fix?

Thanks for your attention and your help.

Cheers,
Sébastien
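
For reference, here is a minimal sketch of how the manual commands from Part 2 might be made persistent in /etc/rc.conf, assuming the standard cloned_interfaces mechanism. The interface names, address and offload flags are the ones from the report above. This only recreates the manual setup at boot and is not expected to fix the no-carrier problem by itself.

```sh
# /etc/rc.conf fragment (sketch): lagg failover over bxe0 and bxe2.
# Offloads are disabled as in the workaround for the first issue.
# The MTU stays at 1500 because MTU 9000 triggered the allocation
# failure and the panic described in the report.
ifconfig_bxe0="up mtu 1500 -tso4 -txcsum -rxcsum"
ifconfig_bxe2="up mtu 1500 -tso4 -txcsum -rxcsum"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24"
```

After a reboot, the output of "ifconfig lagg0" should show both laggport entries; whether the status line reports a carrier still depends on the driver behaviour discussed in this thread.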