Date: Mon, 3 Dec 2012 11:23:46 -0500
From: George Neville-Neil <gnn@freebsd.org>
To: infiniband@freebsd.org
Subject: Old panic report...
Message-ID: <A6DCF1F6-3DD7-4B73-AE72-F4144FC611EB@freebsd.org>
References: <201205301517.25647.jhb@freebsd.org>
Howdy,

Was cleaning out my inbox and found this. Whoever has infiniband on
their plate atm should take a look at this report.

Best,
George

Begin forwarded message:

> From: John Baldwin <jhb@freebsd.org>
> Subject: Fwd: Re: Kernel panic caused by OFED mlx4 driver
> Date: May 30, 2012 3:17:25 EDT
> To: "George Neville-Neil" <gnn@freebsd.org>
>
> FYI...
>
> --
> John Baldwin
>
> From: Olivier Cinquin <ocinquin@uci.edu>
> Subject: Re: Kernel panic caused by OFED mlx4 driver
> Date: May 26, 2012 12:20:30 EDT
> To: John Baldwin <jhb@freebsd.org>
>
>
> Hi John,
> I thought I'd let you know I have things working now. Thanks for your fix.
>
> I also wanted to mention that I've identified another problem. This
> problem is unlikely to affect me in practice, and I don't know if it's
> closely related to your areas of expertise and interest, but I just
> thought I'd mention it. When running performance tests of the IP over
> Infiniband connection, I found that iperf reported dismal numbers
> (around ~1Mb/s). I did further testing and found much higher rates
> using the following:
> cat /dev/zero | ssh other_machines_ip "cat > /dev/null"
> and monitoring traffic with systat -ifs. The throughput of the latter
> test is limited by the CPU usage of ssh. Using multiple instances of
> the above test running in parallel, I could get total throughput up to
> ~10Gb/s. However, if after reaching that throughput I launched another
> instance of the test, total throughput suddenly dropped back down to
> very low levels.
>
> My guess is that there's a congestion management problem, which I have
> no idea how to solve (just to play around, I tried loading kernel
> modules cc_cubic.ko or cc_htcp.ko but that didn't address the problem).
> It doesn't matter that much to me because my usage is unlikely to
> produce rates above 10Gb/s, but other people might run into the problem
> (and the iperf results are misleading for all users).
>
> Best wishes,
> Olivier
>
>
> On May 23, 2012, at 6:35 AM, John Baldwin wrote:
>
>> On Tuesday, May 22, 2012 4:52:52 pm Olivier Cinquin wrote:
>>> Here you go...
>>> Olivier
>>>
>>>
>>> interrupt                          total       rate
>>> irq275: mlx4_core0                     0          0
>>> irq276: mlx4_core0                     0          0
>>> irq277: mlx4_core0                     0          0
>>> irq278: mlx4_core0                     0          0
>>> irq279: mlx4_core0                     0          0
>>> irq280: mlx4_core0                     0          0
>>> irq281: mlx4_core0                     0          0
>>> irq282: mlx4_core0                     0          0
>>> irq283: mlx4_core0                     0          0
>>> irq284: mlx4_core0                     0          0
>>> irq285: mlx4_core0                     0          0
>>> irq286: mlx4_core0                     0          0
>>> irq287: mlx4_core0                     0          0
>>> irq288: mlx4_core0                     0          0
>>> irq289: mlx4_core0                     0          0
>>> irq290: mlx4_core0                     0          0
>>> irq291: mlx4_core0                     0          0
>>> irq292: mlx4_core0                     0          0
>>> irq293: mlx4_core0                     0          0
>>> irq294: mlx4_core0                     0          0
>>> irq295: mlx4_core0                     0          0
>>> irq296: mlx4_core0                     0          0
>>> irq297: mlx4_core0                     0          0
>>> irq298: mlx4_core0                     0          0
>>> irq299: mlx4_core0                     0          0
>>> irq300: mlx4_core0                     0          0
>>> irq301: mlx4_core0                     0          0
>>> irq302: mlx4_core0                     0          0
>>> irq303: mlx4_core0                     0          0
>>> irq304: mlx4_core0                     0          0
>>> irq305: mlx4_core0                     0          0
>>> irq306: mlx4_core0                     0          0
>>> irq307: mlx4_core0                     0          0
>>> irq308: mlx4_core0                     0          0
>>> irq309: mlx4_core0                     0          0
>>> irq310: mlx4_core0                     0          0
>>> irq311: mlx4_core0                     0          0
>>> irq312: mlx4_core0                     0          0
>>> irq313: mlx4_core0                     0          0
>>> irq314: mlx4_core0                     0          0
>>> irq315: mlx4_core0                     0          0
>>> irq316: mlx4_core0                     0          0
>>> irq317: mlx4_core0                     0          0
>>> irq318: mlx4_core0                     0          0
>>> irq319: mlx4_core0                     0          0
>>> irq320: mlx4_core0                     0          0
>>> irq321: mlx4_core0                     0          0
>>> irq322: mlx4_core0                     0          0
>>> irq323: mlx4_core0                     0          0
>>> irq324: mlx4_core0                     0          0
>>> irq325: mlx4_core0                     0          0
>>> irq326: mlx4_core0                     0          0
>>> irq327: mlx4_core0                     0          0
>>> irq328: mlx4_core0                     0          0
>>> irq329: mlx4_core0                     0          0
>>> irq330: mlx4_core0                     0          0
>>> irq331: mlx4_core0                     0          0
>>> irq332: mlx4_core0                     0          0
>>> irq333: mlx4_core0                     0          0
>>> irq334: mlx4_core0                     0          0
>>> irq335: mlx4_core0                     0          0
>>> irq336: mlx4_core0                     0          0
>>> irq337: mlx4_core0                     0          0
>>> irq338: mlx4_core0                     0          0
>>> irq339: mlx4_core0                   426          0
>>> Total                            3076439        341
>>
>> 64 interrupts, wow!  Well, that explains why you hit this bug then.  I'll
>> commit the fix.
>>
>>>
>>> On May 22, 2012, at 1:42 PM, John Baldwin wrote:
>>>
>>>> On Tuesday, May 22, 2012 2:48:52 pm Olivier Cinquin wrote:
>>>>> Thanks, that seems to have fixed the problem! Will this patch make it
>>>>> into the next release?
>>>>> I have no idea how many interrupts my card has. I'm happy to find out
>>>>> if you let me know how, if that can help you in any way.
>>>>> Should I expect everything to work fine now? I take it the card is
>>>>> recognized since ib0 is attached to mlx4_0 port 1
>>>>>
>>>>> mlx4_core0: <mlx4_core> mem 0xdfe00000-0xdfefffff,0xdc800000-0xdcffffff irq 36 at device 0.0 on pci3
>>>>> mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.2 (August 4, 2010)
>>>>> mlx4_core: Initializing mlx4_core
>>>>> vboxdrv: fAsync=0 offMin=0x123c offMax=0xec01
>>>>> vboxnet0: Ethernet address: 0a:00:27:00:00:00
>>>>> mlx4_en: Mellanox ConnectX HCA Ethernet driver v1.5.2 (July 2010)
>>>>> mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0-ofed1.5.2 (August 4, 2010)
>>>>> ib0: max_srq_sge=31
>>>>> ib0: max_cm_mtu = 0x10000, num_frags=16
>>>>> ib0: Attached to mlx4_0 port 1
>>>>>
>>>>>
>>>>> (I need to get cables before I can test connectivity between different
>>>>> machines).
>>>>>
>>>>> Thanks again for your help!
>>>>
>>>> Very interesting!  Can you get the output of 'vmstat -ai | grep -v stray'?
>>>>
>>>> --
>>>> John Baldwin
>>>
>>>
>>
>> --
>> John Baldwin
>
>
>
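
For anyone who wants to reproduce the throughput collapse Olivier describes
above, here is a minimal sketch of the parallel-stream test as a plain
/bin/sh script. The HOST and NSTREAMS values are placeholders, not values
taken from the report.

    #!/bin/sh
    # Sketch only: HOST and NSTREAMS are placeholder values.
    HOST=other_machines_ip   # IPoIB address of the peer machine
    NSTREAMS=4               # number of concurrent ssh streams to launch

    i=0
    while [ "$i" -lt "$NSTREAMS" ]; do
        # Each stream pushes zeros over ssh; a single stream is limited by
        # the CPU cost of ssh, so several are needed to approach link speed.
        cat /dev/zero | ssh "$HOST" "cat > /dev/null" &
        i=$((i + 1))
    done

    # Watch aggregate interface throughput from another terminal, e.g.:
    #   systat -ifstat
    wait

Aggregate throughput should climb as streams are added; in the report above,
launching one more stream after reaching ~10Gb/s instead collapsed the total.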
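
A note on the congestion-control experiment mentioned above: with FreeBSD's
modular congestion-control framework, loading cc_cubic.ko or cc_htcp.ko only
registers the algorithm; the one actually used by new connections is selected
via the net.inet.tcp.cc.algorithm sysctl. A rough sketch (whether this changes
the behaviour reported above is untested):

    # Register the CUBIC module, then make it the default for new connections.
    kldload cc_cubic
    sysctl net.inet.tcp.cc.available        # list algorithms now registered
    sysctl net.inet.tcp.cc.algorithm=cubic  # select CUBIC (or htcp for H-TCP)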