Date: Mon, 3 Dec 2012 11:23:46 -0500
From: George Neville-Neil <gnn@freebsd.org>
To: infiniband@freebsd.org
Subject: Old panic report...
Message-ID: <A6DCF1F6-3DD7-4B73-AE72-F4144FC611EB@freebsd.org>
References: <201205301517.25647.jhb@freebsd.org>

Howdy,

Was cleaning out my inbox and found this.  Whoever has infiniband on their
plate atm should take a look at this report.

Best,
George

Begin forwarded message:

> From: John Baldwin <jhb@freebsd.org>
> Subject: Fwd: Re: Kernel panic caused by OFED mlx4 driver
> Date: May 30, 2012 3:17:25 EDT
> To: "George Neville-Neil" <gnn@freebsd.org>
>
> FYI...
>
> --
> John Baldwin
>
> From: Olivier Cinquin <ocinquin@uci.edu>
> Subject: Re: Kernel panic caused by OFED mlx4 driver
> Date: May 26, 2012 12:20:30 EDT
> To: John Baldwin <jhb@freebsd.org>
>
> Hi John,
> I thought I'd let you know I have things working now.  Thanks for your fix.
>
> I also wanted to mention that I've identified another problem.  This
> problem is unlikely to affect me in practice, and I don't know if it's
> closely related to your areas of expertise and interest, but I just
> thought I'd mention it.  When running performance tests of the IP over
> InfiniBand connection, I found that iperf reported dismal numbers
> (around 1 Mb/s).  I did further testing and found much higher rates
> using the following:
>
>     cat /dev/zero | ssh other_machines_ip "cat > /dev/null"
>
> and monitoring traffic with systat -ifs.  The throughput of the latter
> test is limited by the CPU usage of ssh.  Using multiple instances of
> the above test running in parallel, I could get total throughput up to
> ~10 Gb/s.  However, if after reaching that throughput I launched another
> instance of the test, total throughput suddenly dropped back down to
> very low levels.
>
> My guess is that there's a congestion management problem, which I have
> no idea how to solve (just to play around, I tried loading the kernel
> modules cc_cubic.ko and cc_htcp.ko, but that didn't address the
> problem).  It doesn't matter that much to me because my usage is
> unlikely to produce rates above 10 Gb/s, but other people might run into
> the problem (and the iperf results are misleading for all users).
>
> Best wishes,
> Olivier
>
>
> On May 23, 2012, at 6:35 AM, John Baldwin wrote:
>
>> On Tuesday, May 22, 2012 4:52:52 pm Olivier Cinquin wrote:
>>> Here you go...
>>> Olivier
>>>
>>>
>>> interrupt                          total       rate
>>> irq275: mlx4_core0                     0          0
>>> irq276: mlx4_core0                     0          0
>>> irq277: mlx4_core0                     0          0
>>> irq278: mlx4_core0                     0          0
>>> irq279: mlx4_core0                     0          0
>>> irq280: mlx4_core0                     0          0
>>> irq281: mlx4_core0                     0          0
>>> irq282: mlx4_core0                     0          0
>>> irq283: mlx4_core0                     0          0
>>> irq284: mlx4_core0                     0          0
>>> irq285: mlx4_core0                     0          0
>>> irq286: mlx4_core0                     0          0
>>> irq287: mlx4_core0                     0          0
>>> irq288: mlx4_core0                     0          0
>>> irq289: mlx4_core0                     0          0
>>> irq290: mlx4_core0                     0          0
>>> irq291: mlx4_core0                     0          0
>>> irq292: mlx4_core0                     0          0
>>> irq293: mlx4_core0                     0          0
>>> irq294: mlx4_core0                     0          0
>>> irq295: mlx4_core0                     0          0
>>> irq296: mlx4_core0                     0          0
>>> irq297: mlx4_core0                     0          0
>>> irq298: mlx4_core0                     0          0
>>> irq299: mlx4_core0                     0          0
>>> irq300: mlx4_core0                     0          0
>>> irq301: mlx4_core0                     0          0
>>> irq302: mlx4_core0                     0          0
>>> irq303: mlx4_core0                     0          0
>>> irq304: mlx4_core0                     0          0
>>> irq305: mlx4_core0                     0          0
>>> irq306: mlx4_core0                     0          0
>>> irq307: mlx4_core0                     0          0
>>> irq308: mlx4_core0                     0          0
>>> irq309: mlx4_core0                     0          0
>>> irq310: mlx4_core0                     0          0
>>> irq311: mlx4_core0                     0          0
>>> irq312: mlx4_core0                     0          0
>>> irq313: mlx4_core0                     0          0
>>> irq314: mlx4_core0                     0          0
>>> irq315: mlx4_core0                     0          0
>>> irq316: mlx4_core0                     0          0
>>> irq317: mlx4_core0                     0          0
>>> irq318: mlx4_core0                     0          0
>>> irq319: mlx4_core0                     0          0
>>> irq320: mlx4_core0                     0          0
>>> irq321: mlx4_core0                     0          0
>>> irq322: mlx4_core0                     0          0
>>> irq323: mlx4_core0                     0          0
>>> irq324: mlx4_core0                     0          0
>>> irq325: mlx4_core0                     0          0
>>> irq326: mlx4_core0                     0          0
>>> irq327: mlx4_core0                     0          0
>>> irq328: mlx4_core0                     0          0
>>> irq329: mlx4_core0                     0          0
>>> irq330: mlx4_core0                     0          0
>>> irq331: mlx4_core0                     0          0
>>> irq332: mlx4_core0                     0          0
>>> irq333: mlx4_core0                     0          0
>>> irq334: mlx4_core0                     0          0
>>> irq335: mlx4_core0                     0          0
>>> irq336: mlx4_core0                     0          0
>>> irq337: mlx4_core0                     0          0
>>> irq338: mlx4_core0                     0          0
>>> irq339: mlx4_core0                   426          0
>>> Total                            3076439        341
>>
>> 64 interrupts, wow!  Well, that explains why you hit this bug then.  I'll
>> commit the fix.
>>
>>>
>>> On May 22, 2012, at 1:42 PM, John Baldwin wrote:
>>>
>>>> On Tuesday, May 22, 2012 2:48:52 pm Olivier Cinquin wrote:
>>>>> Thanks, that seems to have fixed the problem!  Will this patch make it
>>>>> into the next release?
>>>>> I have no idea how many interrupts my card has.  I'm happy to find out
>>>>> if you let me know how, if that can help you in any way.
>>>>> Should I expect everything to work fine now?  I take it the card is
>>>>> recognized since ib0 is attached to mlx4_0 port 1
>>>>>
>>>>> mlx4_core0: <mlx4_core> mem 0xdfe00000-0xdfefffff,0xdc800000-0xdcffffff
>>>>> irq 36 at device 0.0 on pci3
>>>>> mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.2 (August 4, 2010)
>>>>> mlx4_core: Initializing mlx4_core
>>>>> vboxdrv: fAsync=0 offMin=0x123c offMax=0xec01
>>>>> vboxnet0: Ethernet address: 0a:00:27:00:00:00
>>>>> mlx4_en: Mellanox ConnectX HCA Ethernet driver v1.5.2 (July 2010)
>>>>> mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0-ofed1.5.2 (August 4, 2010)
>>>>> ib0: max_srq_sge=31
>>>>> ib0: max_cm_mtu = 0x10000, num_frags=16
>>>>> ib0: Attached to mlx4_0 port 1
>>>>>
>>>>>
>>>>> (I need to get cables before I can test connectivity between different
>>>>> machines).
>>>>>
>>>>> Thanks again for your help!
>>>>
>>>> Very interesting!  Can you get the output of 'vmstat -ai | grep -v stray'?
>>>>
>>>> --
>>>> John Baldwin
>>
>> --
>> John Baldwin
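
For anyone who wants to reproduce the throughput measurements Olivier
describes above, here is a minimal sketch of the parallel bulk-transfer
test, followed by the stock FreeBSD knobs for switching the TCP
congestion-control algorithm.  The remote hostname, the stream count, and
the choice of CUBIC are illustrative assumptions only; nothing here is
claimed to fix the reported throughput collapse.

    #!/bin/sh
    # Sketch of the parallel cat-over-ssh test from the thread above.
    # REMOTE and NSTREAMS are placeholders; adjust for your setup.
    REMOTE=other_machines_ip
    NSTREAMS=4
    i=0
    while [ $i -lt $NSTREAMS ]; do
        # Each stream pushes zeros over ssh and discards them remotely.
        cat /dev/zero | ssh "$REMOTE" "cat > /dev/null" &
        i=$((i + 1))
    done
    # The streams run until interrupted (Ctrl-C); watch aggregate
    # interface throughput in another terminal with: systat -ifstat
    wait

On the congestion-control guess: with the modular framework in FreeBSD
9.x, loading cc_cubic.ko or cc_htcp.ko only makes the algorithm
available; it still has to be selected via sysctl, e.g.:

    kldload cc_cubic                        # make CUBIC available
    sysctl net.inet.tcp.cc.available        # list loaded algorithms
    sysctl net.inet.tcp.cc.algorithm=cubic  # select it system-wide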
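
As a quick way to answer the "how many interrupts" question without
reading the full table, something like the following should work (the
grep pattern assumes the adapter shows up as mlx4_core0, as in the dmesg
output above):

    # Count the interrupt vectors allocated to the ConnectX HCA, then
    # show the per-vector counters; -a includes vectors that never fired.
    vmstat -ai | grep -c mlx4_core0
    vmstat -ai | grep -v stray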
