Date: Thu, 5 May 2011 10:21:55 -0400
From: Arnaud Lacombe <lacombar@gmail.com>
To: Jack Vogel <jfvogel@gmail.com>
Cc: Olivier Smedts <olivier@gid0.org>, FreeBSD current mailing list <current@freebsd.org>
Subject: Re: problems with em(4) since update to driver 7.2.2
Message-ID: <BANLkTimVc2Chq9iKrRVCBfqg6WPmt_O=6w@mail.gmail.com>
In-Reply-To: <BANLkTikkbpW6_jE5QznGjAt4Zcpee0RagQ@mail.gmail.com>
References: <BANLkTinrfZbO%2BMUDDuzsoaN1y-=_O8LgNA@mail.gmail.com>
 <4D94A354.9080903@sentex.net>
 <AANLkTik_XPsVWL-KqHkPic1KQ0SdCSk6u_9ykRefi3VE@mail.gmail.com>
 <BANLkTi=K5ASG9TWLAh5r%2Bzo9Wy1stMf9WA@mail.gmail.com>
 <BANLkTikPPzxZ6XRAaqrvdeXBp=Ydvz7hNg@mail.gmail.com>
 <BANLkTi=rhZ0dyO6Zq13jY6-NKVE8n24YyQ@mail.gmail.com>
 <4DC07013.9070707@gmx.net>
 <BANLkTi=DmQsVvJOaoxMr5GPOLkjs7sdTxQ@mail.gmail.com>
 <4DC078BD.9080908@gmx.net>
 <BANLkTin1ykoo80%2B9iWe%2Bg5ib1DXw%2B05BgQ@mail.gmail.com>
 <BANLkTi=STPT13-50dxMRgjLP_pyxL9Utyw@mail.gmail.com>
 <BANLkTikX8gs7Ln2KLZkA=MyieeCR%2BzKXzQ@mail.gmail.com>
 <BANLkTikj-wSOFWQX9Y_yN54Q_jk-=vD3LA@mail.gmail.com>
 <BANLkTin0ANtbWGv4CTr%2BO5xEL58hVRDefg@mail.gmail.com>
 <BANLkTikzpjxe%2BcMYiTRak0B0tnkhrW%2BBow@mail.gmail.com>
 <BANLkTikUJOD%2BtzYoiHCoWHrD36PxLQgN7A@mail.gmail.com>
 <BANLkTin2j3QzO0pwVHe9Nm-L8otEf9pcbg@mail.gmail.com>
 <BANLkTinmKH40yx5Mgu9zgQ2qEF2O-n6HMQ@mail.gmail.com>
 <BANLkTikehcbxm0MQtb0SQ0giSfhmkHw99A@mail.gmail.com>
 <BANLkTikkbpW6_jE5QznGjAt4Zcpee0RagQ@mail.gmail.com>
Hi,

On Thu, May 5, 2011 at 2:59 AM, Jack Vogel <jfvogel@gmail.com> wrote:
> OK, but what this does not explain is why I do not see this if
> it's so easily reproduced, what causes the failure case, any idea?
>
It is completely random, as it depends on the content of the stack. I
spent 3 or 4 hours trying to reproduce it using different approaches on
different platforms, with different versions of the code, and failed.
Once `error' was explicitly colored, it popped up. That's the beauty of
errors related to uninitialized variables.

 - Arnaud

> As I said, given the code was not feasible for igb anyway I would not
> be unhappy about returning to the old way of doing things.
>
I am not sure what you mean by "old way of doing things", but I'd guess
that the ring only needs to be set up on a few occasions, like
initialization and MTU transition. I'm not sure either how other
drivers manage their rings.

> Jack
>
>
> On Wed, May 4, 2011 at 11:03 PM, Arnaud Lacombe <lacombar@gmail.com> wrote:
>>
>> Hi,
>>
>> On Thu, May 5, 2011 at 1:20 AM, Arnaud Lacombe <lacombar@gmail.com> wrote:
>> > Hi,
>> >
>> > On Wed, May 4, 2011 at 5:38 PM, Jack Vogel <jfvogel@gmail.com> wrote:
>> >> I have had my validation engineer busy all day, we have tried both
>> >> a 9 kernel as well as 8.2, using the code from HEAD, and we
>> >> cannot reproduce this problem.
>> >>
>> > Actually, it can be trivially reproduced by tainting `error'. As it is
>> > uninitialized in HEAD, its value can be _anything_, so let's mark it
>> > as explicitly invalid.
>> >
>> > diff -u ./if_em.c /data/src/freebsd/em-7.2.2/src/if_em.c
>> > --- ./if_em.c   2011-02-18 01:18:23.000000000 -0500
>> > +++ /data/src/freebsd/em-7.2.2/src/if_em.c      2011-05-05 01:12:01.000000000 -0400
>> > @@ -3912,7 +3912,7 @@
>> >        struct  adapter         *adapter = rxr->adapter;
>> >        struct em_buffer        *rxbuf;
>> >        bus_dma_segment_t       seg[1];
>> > -       int                     i, j, nsegs, error;
>> > +       int                     i, j, nsegs, error = -1;
>> >
>> > The error pointed out in this thread pops up in the next boot.
>> >
>> I put a call to kdb_enter() at the beginning of the function and, with
>> the help of some textdumps, got the backtrace [0] for every call to
>> em_setup_receive_ring(). All are exactly the same:
>>
>> kdb_enter_why(0,c09f6511,f391aaa8,c09be1e2,c09f6511,...) at kdb_enter_why+0x3b
>> kdb_enter(c09f6511,0,3810,ffffffff,5dc,...) at kdb_enter+0x19
>> em_setup_receive_ring(c3c8d600,c3c8d7a4,c3c96004,310000fa,c3c8d600,...) at em_setup_receive_ring+0x22
>> em_setup_receive_structures(c3c96000,f15f2000,38,8100,3,...) at em_setup_receive_structures+0x26
>> em_init_locked(c3c96000,0,c09f5de5,414,10000,...) at em_init_locked+0x2f2
>> em_ioctl(c3c7d000,80206934,c3ce9d00,c07b7a0b,c3f2a230,...) at em_ioctl+0x1c3
>> ifhwioctl(c3f2a230,f391ac34,c07b7a0b,c3f3e3d0,c08df1c0,...) at ifhwioctl+0x4b8
>> ifioctl(c3f3e3d0,80206934,c3ce9d00,c3f2a230,c3f2a230,...) at ifioctl+0x82
>> kern_ioctl(c3f2a230,3,80206934,c3ce9d00,c3ce9d00,...) at kern_ioctl+0xa8
>> ioctl(c3f2a230,f391acf8,c,c,f391ad2c,...) at ioctl+0xc5
>> syscall(f391ad38) at syscall+0x17d
>> Xint0x80_syscall() at Xint0x80_syscall+0x20
>> --- syscall (54, FreeBSD ELF32, ioctl), eip = 0x4816ee23, esp = 0xbfbfe67c, ebp = 0xbfbfe698 ---
>>
>> This fully explains why the main loop in em_setup_receive_ring() is
>> never entered, as we always verify `j == rxr->next_to_check' (provided
>> that mbufs have been refreshed if some packets were transferred) and
>> return the value on the stack. As of now, besides changing the
>> call-site of em_setup_receive_ring() to ensure it is never re-entered,
>> I'd guess that the patch I sent earlier today is the only way to
>> ensure that no junk is returned.
>>
>> I'd guess that the driver _is_ able to transmit, if the code was not
>> explicitly calling em_stop() upon em_setup_receive_structures()
>> failure.
>>
>>  - Arnaud
>>
>> [0]: I wish that would have been as easy as in Linux, where a WARN()
>> call does all the job automatically, but still, I should not hope for
>> that much unless I am the one implementing it ... yes, free whining,
>> it's 2a.m. ...
>>
>> >  - Arnaud
>> >
>> >> The data your netstat -m shows suggests to me that what's happening
>> >> is somehow setup of the receive ring is running more than once maybe??
>> >>
>> >> You asked at one point how this could go into STABLE, well, because
>> >> not only here at Intel, but at lots of external customers this code
>> >> has been used and tested thoroughly.
>> >>
>> >> I am not calling into question your problem, but until I understand
>> >> what it is I cannot "fix" it :)
>> >>
>> >> The thing I am guessing right now is the culprit is the setup code,
>> >> the reason is that when I ported to the igb driver I found that it
>> >> did not work on our newer hardware, and so I went back to the older
>> >> version of setup for igb. Now, even though I have not seen hardware
>> >> fail with em, maybe there is some.
>> >>
>> >> To help me give me a complete pciconf -lv, and if it's a namebrand
>> >> system tell me that, including all hardware in it.
>> >>
>> >> If you like Olivier I can make a version of em for you that also
>> >> reverts the setup code the way I did for igb, see if that fixes it
>> >> for you?
>> >>
>> >> Thanks for your patience,
>> >>
>> >> Jack
>> >> _______________________________________________
>> >> freebsd-current@freebsd.org mailing list
>> >> http://lists.freebsd.org/mailman/listinfo/freebsd-current
>> >> To unsubscribe, send any mail to
>> >> "freebsd-current-unsubscribe@freebsd.org"
>> >>
>> >
>
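
The failure mode discussed in this thread can be sketched in isolation.
The code below is not the em(4) source: `struct ring', `refill_slot()'
and `setup_ring()' are hypothetical names, and the loop shape is only an
assumption based on the description above (a loop that is skipped when
`j == next_to_check'). Because reading a truly uninitialized variable is
undefined behaviour in C, the starting value of `error' is passed in as
a parameter so that the effect of tainting it with -1 can be observed
safely.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a receive ring (not em(4) code). */
struct ring {
    size_t start;          /* first slot the setup loop would visit */
    size_t next_to_check;  /* slot at which the setup loop stops */
    size_t size;           /* number of slots in the ring */
};

/* Hypothetical per-slot setup step; always succeeds in this sketch. */
static int refill_slot(struct ring *r, size_t slot)
{
    (void)r;
    (void)slot;
    return 0;
}

/*
 * initial_error models whatever `error' holds before the loop runs:
 * stack garbage in the unpatched case, -1 when tainted, 0 when the
 * declaration is fixed to initialize it.
 */
int setup_ring(struct ring *r, int initial_error)
{
    int error = initial_error;
    size_t j = r->start;

    /* If j already equals next_to_check, the body never executes ... */
    while (j != r->next_to_check) {
        error = refill_slot(r, j);
        if (error != 0)
            return error;
        j = (j + 1) % r->size;
    }

    /* ... and the starting value of `error' is returned unchanged. */
    return error;
}
```

With `start == next_to_check' the function returns whatever it was
seeded with, so a tainted -1 surfaces as a failure while a seed of 0
does not. When the loop runs at least once, the seed is overwritten and
no longer matters, which would be consistent with the problem showing up
only on some calls.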