Date:      Wed, 23 Aug 2006 19:51:18 +0900
From:      Pyun YongHyeon <pyunyh@gmail.com>
To:        Gleb Smirnoff <glebius@FreeBSD.org>
Cc:        freebsd-current@FreeBSD.org
Subject:   Re: call for bge(4) testers
Message-ID:  <20060823105118.GJ17902@cdnetworks.co.kr>
In-Reply-To: <20060823100420.GG96644@cell.sick.ru>
References:  <20060822042023.GC12848@cdnetworks.co.kr> <20060823093741.GF96644@FreeBSD.org> <20060823095504.GI17902@cdnetworks.co.kr> <20060823100420.GG96644@cell.sick.ru>

On Wed, Aug 23, 2006 at 02:04:20PM +0400, Gleb Smirnoff wrote:
 > On Wed, Aug 23, 2006 at 06:55:04PM +0900, Pyun YongHyeon wrote:
 > P> On Wed, Aug 23, 2006 at 01:37:41PM +0400, Gleb Smirnoff wrote:
 > P>  > On Tue, Aug 22, 2006 at 01:20:23PM +0900, Pyun YongHyeon wrote:
 > P>  > P> After fixing the em(4) watchdog bug, I looked over bge(4) and I think
 > P>  > P> bge(4) may suffer from the same issue. So if you have seen occasional
 > P>  > P> watchdog timeout errors on bge(4), please give the attached patch a try.
 > P>  > P> The patch fixes only false watchdog timeout errors.
 > P>  > P> Typical symptoms of a false watchdog timeout error are
 > P>  > P>        o polling(4) fixes the issue
 > P>  > P>        o random watchdog errors
 > P>  > P> 
 > P>  > P> If my patch fixes the issue, you may see one of the following messages:
 > P>  > P> "missing Tx completion interrupt!" or "link lost -- resetting"
 > P>  > 
 > P>  > I still think that this fix is incorrect. It is just a more gentle
 > P>  > recovery from a fake watchdog timeout.
 > P> 
 > P> Its sole purpose is to reinitialize the hardware for real watchdog
 > P> timeouts. It's not a fix for watchdog timeouts in general. As I said in
 > P> other mails, a fake watchdog timeout (lost Tx interrupts) could be a
 > P> normal thing on hardware with Tx interrupt moderation capability. So I
 > P> just want to know whether bge(4) also has the same feature (bug).
 > 
 > According to several emails about em(4) fake watchdog timeouts, the
 > problem can be fixed by setting debug.mpsafenet=0. This makes me think
 > that the problem isn't caused by TX interrupt moderation, but some race
 > in the kernel. Really, if_slowtimo() doesn't acquire driver lock before
 > checking and modifying the if_timer field.
 > 

Hmm... I didn't say the problem was caused by TX interrupt moderation.
I can't be sure, but I'm under the impression that there are *two*
different issues. If you think the fake watchdog timeout fix is not an
adequate one, please let me know. I'll back out the change if you want.

 > AFAIK, NIC drivers that can do interrupt moderation should set the timer
 > to a sane value, based on the interrupt moderation settings, so that the
 > watchdog is never triggered spuriously.
 > 

Yes. Normally it should. But I saw the issues on Marvell Yukon too.
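
To make the point about a "sane value" concrete, here is a rough
user-space sketch, not code from bge(4) or any other driver. The struct,
its field, and the function name are hypothetical, and it assumes the
watchdog counter is decremented about once per second. It rounds the
moderation delay up to whole seconds and adds one tick of slack, because
the first decrement may arrive almost immediately after the timer is armed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical moderation setting; not a field of any real driver. */
struct tx_moderation {
	uint32_t coal_usecs;	/* max delay before a Tx completion interrupt */
};

/*
 * Return a watchdog timeout in whole seconds that cannot expire
 * before the moderation delay has elapsed.  base_secs is the time
 * the driver is willing to wait for a genuinely stuck transmitter.
 */
static int
watchdog_secs(const struct tx_moderation *m, int base_secs)
{
	uint64_t delay_secs;

	assert(base_secs > 0);
	/* Round the coalescing delay up to whole seconds. */
	delay_secs = ((uint64_t)m->coal_usecs + 999999U) / 1000000U;
	/* One extra tick: the countdown is not aligned with arming. */
	return (base_secs + (int)delay_secs + 1);
}
```

With a 5 second base and no moderation delay this gives 6; any delay
from 1 microsecond up to 1 second gives 7.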

 > P>  > The more I think, the more I doubt that we really need the
 > P>  > watchdog infrastructure that comes from old days.
 > P> 
 > P> Could you suggest another way to recover from a stuck Tx condition
 > P> without using the watchdog?
 > 
 > Maybe drivers should take care of that themselves, why not? At least
 > the callout routine will have access to the driver mutex, contrary to
 > if_slowtimo().
 > 
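
The driver-owned watchdog suggested in the quoted text could be modeled
roughly as follows. This is only a user-space sketch: a pthread mutex
stands in for the driver mutex, sc_tick() stands in for a once-per-second
callout routine, and every name here is hypothetical rather than taken
from bge(4) or em(4). The point is that the counter is armed, disarmed,
and decremented only while the same lock is held, so the check-and-modify
race described for if_slowtimo() cannot occur.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-device state; not a real driver softc. */
struct softc {
	pthread_mutex_t	mtx;		/* stands in for the driver mutex */
	int		wd_timer;	/* seconds left; 0 means disarmed */
	int		tx_pending;	/* frames awaiting completion */
	int		resets;		/* number of watchdog resets */
};

static void
sc_init(struct softc *sc)
{
	pthread_mutex_init(&sc->mtx, NULL);
	sc->wd_timer = 0;
	sc->tx_pending = 0;
	sc->resets = 0;
}

/* Tx start path: queue one frame and arm the watchdog. */
static void
sc_start(struct softc *sc, int timeout)
{
	pthread_mutex_lock(&sc->mtx);
	sc->tx_pending++;
	sc->wd_timer = timeout;
	pthread_mutex_unlock(&sc->mtx);
}

/* Tx completion path: reclaim frames, disarm when none are left. */
static void
sc_txeof(struct softc *sc, int done)
{
	pthread_mutex_lock(&sc->mtx);
	sc->tx_pending -= done;
	if (sc->tx_pending <= 0) {
		sc->tx_pending = 0;
		sc->wd_timer = 0;
	}
	pthread_mutex_unlock(&sc->mtx);
}

/* Periodic tick under the driver lock; returns true if it reset. */
static bool
sc_tick(struct softc *sc)
{
	bool fired = false;

	pthread_mutex_lock(&sc->mtx);
	if (sc->wd_timer > 0 && --sc->wd_timer == 0) {
		sc->resets++;		/* a real driver would reinit here */
		sc->tx_pending = 0;
		fired = true;
	}
	pthread_mutex_unlock(&sc->mtx);
	return (fired);
}
```

A completion that arrives before the counter reaches zero disarms the
timer, so only a transmitter that stays stuck for the full timeout
triggers a reset.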
-- 
Regards,
Pyun YongHyeon


