Date:      Mon, 3 Dec 2012 11:23:46 -0500
From:      George Neville-Neil <gnn@freebsd.org>
To:        infiniband@freebsd.org
Subject:   Old panic report...
Message-ID:  <A6DCF1F6-3DD7-4B73-AE72-F4144FC611EB@freebsd.org>
References:  <201205301517.25647.jhb@freebsd.org>

Howdy,

Was cleaning out my inbox and found this.  Whoever has InfiniBand on
their plate at the moment should take a look at this report.

Best,
George


Begin forwarded message:

> From: John Baldwin <jhb@freebsd.org>
> Subject: Fwd: Re: Kernel panic caused by OFED mlx4 driver
> Date: May 30, 2012 3:17:25 EDT
> To: "George Neville-Neil" <gnn@freebsd.org>
>
> FYI...
>
> --
> John Baldwin
>
> From: Olivier Cinquin <ocinquin@uci.edu>
> Subject: Re: Kernel panic caused by OFED mlx4 driver
> Date: May 26, 2012 12:20:30 EDT
> To: John Baldwin <jhb@freebsd.org>
>
> Hi John,
> I thought I'd let you know I have things working now. Thanks for your fix.
>
> I also wanted to mention that I've identified another problem. This
> problem is unlikely to affect me in practice, and I don't know if it's
> closely related to your areas of expertise and interest, but I just
> thought I'd mention it. When running performance tests of the IP over
> InfiniBand connection, I found that iperf reported dismal numbers
> (around 1 Mb/s). I did further testing and found much higher rates using
> the following:
> cat /dev/zero | ssh other_machines_ip "cat > /dev/null"
> and monitoring traffic with systat -ifs. The throughput of the latter
> test is limited by the CPU usage of ssh. Using multiple instances of the
> above test running in parallel, I could get total throughput up to
> ~10 Gb/s. However, if after reaching that throughput I launched another
> instance of the test, total throughput suddenly dropped back down to
> very low levels.
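>
> (As a minimal sketch, the parallel version of this test might be scripted
> along the following lines; "ib-peer" here is a placeholder for the remote
> host's IPoIB address:)
>
>   # Launch four parallel ssh streams over the IPoIB link, then wait on them.
>   for i in $(seq 1 4); do
>       cat /dev/zero | ssh ib-peer "cat > /dev/null" &
>   done
>   wait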
>
> My guess is that there's a congestion management problem, which I have no
> idea how to solve (just to play around, I tried loading the kernel modules
> cc_cubic.ko and cc_htcp.ko, but that didn't address the problem). It
> doesn't matter that much to me because my usage is unlikely to produce
> rates above 10 Gb/s, but other people might run into the problem (and the
> iperf results are misleading for all users).
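>
> (On FreeBSD, loading a congestion control module does not by itself make
> it the active algorithm; it also has to be selected through a sysctl. As a
> minimal sketch of that selection, in case anyone wants to experiment:)
>
>   # Load the CUBIC congestion control module and switch TCP over to it.
>   kldload cc_cubic
>   sysctl net.inet.tcp.cc.available        # list the algorithms now loaded
>   sysctl net.inet.tcp.cc.algorithm=cubic  # make CUBIC the system default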
>
> Best wishes,
> Olivier
>
> On May 23, 2012, at 6:35 AM, John Baldwin wrote:
>
>> On Tuesday, May 22, 2012 4:52:52 pm Olivier Cinquin wrote:
>>> Here you go...
>>> Olivier
>>>
>>> interrupt                          total       rate
>>> irq275: mlx4_core0                     0          0
>>> irq276: mlx4_core0                     0          0
>>> irq277: mlx4_core0                     0          0
>>> irq278: mlx4_core0                     0          0
>>> irq279: mlx4_core0                     0          0
>>> irq280: mlx4_core0                     0          0
>>> irq281: mlx4_core0                     0          0
>>> irq282: mlx4_core0                     0          0
>>> irq283: mlx4_core0                     0          0
>>> irq284: mlx4_core0                     0          0
>>> irq285: mlx4_core0                     0          0
>>> irq286: mlx4_core0                     0          0
>>> irq287: mlx4_core0                     0          0
>>> irq288: mlx4_core0                     0          0
>>> irq289: mlx4_core0                     0          0
>>> irq290: mlx4_core0                     0          0
>>> irq291: mlx4_core0                     0          0
>>> irq292: mlx4_core0                     0          0
>>> irq293: mlx4_core0                     0          0
>>> irq294: mlx4_core0                     0          0
>>> irq295: mlx4_core0                     0          0
>>> irq296: mlx4_core0                     0          0
>>> irq297: mlx4_core0                     0          0
>>> irq298: mlx4_core0                     0          0
>>> irq299: mlx4_core0                     0          0
>>> irq300: mlx4_core0                     0          0
>>> irq301: mlx4_core0                     0          0
>>> irq302: mlx4_core0                     0          0
>>> irq303: mlx4_core0                     0          0
>>> irq304: mlx4_core0                     0          0
>>> irq305: mlx4_core0                     0          0
>>> irq306: mlx4_core0                     0          0
>>> irq307: mlx4_core0                     0          0
>>> irq308: mlx4_core0                     0          0
>>> irq309: mlx4_core0                     0          0
>>> irq310: mlx4_core0                     0          0
>>> irq311: mlx4_core0                     0          0
>>> irq312: mlx4_core0                     0          0
>>> irq313: mlx4_core0                     0          0
>>> irq314: mlx4_core0                     0          0
>>> irq315: mlx4_core0                     0          0
>>> irq316: mlx4_core0                     0          0
>>> irq317: mlx4_core0                     0          0
>>> irq318: mlx4_core0                     0          0
>>> irq319: mlx4_core0                     0          0
>>> irq320: mlx4_core0                     0          0
>>> irq321: mlx4_core0                     0          0
>>> irq322: mlx4_core0                     0          0
>>> irq323: mlx4_core0                     0          0
>>> irq324: mlx4_core0                     0          0
>>> irq325: mlx4_core0                     0          0
>>> irq326: mlx4_core0                     0          0
>>> irq327: mlx4_core0                     0          0
>>> irq328: mlx4_core0                     0          0
>>> irq329: mlx4_core0                     0          0
>>> irq330: mlx4_core0                     0          0
>>> irq331: mlx4_core0                     0          0
>>> irq332: mlx4_core0                     0          0
>>> irq333: mlx4_core0                     0          0
>>> irq334: mlx4_core0                     0          0
>>> irq335: mlx4_core0                     0          0
>>> irq336: mlx4_core0                     0          0
>>> irq337: mlx4_core0                     0          0
>>> irq338: mlx4_core0                     0          0
>>> irq339: mlx4_core0                   426          0
>>> Total                            3076439        341
>>
>> 64 interrupts, wow!  Well, that explains why you hit this bug then.  I'll
>> commit the fix.
>>
>>>
>>> On May 22, 2012, at 1:42 PM, John Baldwin wrote:
>>>
>>>> On Tuesday, May 22, 2012 2:48:52 pm Olivier Cinquin wrote:
>>>>> Thanks, that seems to have fixed the problem! Will this patch make it
>>>>> into the next release?
>>>>> I have no idea how many interrupts my card has. I'm happy to find out
>>>>> if you let me know how, if that can help you in any way.
>>>>> Should I expect everything to work fine now? I take it the card is
>>>>> recognized since ib0 is attached to mlx4_0 port 1
>>>>>
>>>>> mlx4_core0: <mlx4_core> mem 0xdfe00000-0xdfefffff,0xdc800000-0xdcffffff irq 36 at device 0.0 on pci3
>>>>> mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.2 (August 4, 2010)
>>>>> mlx4_core: Initializing mlx4_core
>>>>> vboxdrv: fAsync=0 offMin=0x123c offMax=0xec01
>>>>> vboxnet0: Ethernet address: 0a:00:27:00:00:00
>>>>> mlx4_en: Mellanox ConnectX HCA Ethernet driver v1.5.2 (July 2010)
>>>>> mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0-ofed1.5.2 (August 4, 2010)
>>>>> ib0: max_srq_sge=31
>>>>> ib0: max_cm_mtu = 0x10000, num_frags=16
>>>>> ib0: Attached to mlx4_0 port 1
>>>>>
>>>>> (I need to get cables before I can test connectivity between different
>>>>> machines).
>>>>>
>>>>> Thanks again for your help!
>>>>
>>>> Very interesting!  Can you get the output of 'vmstat -ai | grep -v stray'?
>>>>
>>>> --
>>>> John Baldwin
>>>
>>
>> --
>> John Baldwin
>>
>



