Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 08 Mar 1997 20:58:23 +0000
From:      Brian Somers <brian@awfulhak.demon.co.uk>
To:        =?KOI8-R?B?4c7E0sXKIP7F0s7P1w==?= <ache@nagual.ru>
Cc:        CVS-committers@freebsd.org, cvs-all@freebsd.org, cvs-usrsbin@freebsd.org
Subject:   Re: cvs commit: src/usr.sbin/ppp timer.c 
Message-ID:  <199703082058.UAA24419@awfulhak.demon.co.uk>
In-Reply-To: Your message of "Sat, 08 Mar 1997 20:17:04 %2B0300." <Pine.BSF.3.95q.970308195709.375C-100000@nagual.ru> 

next in thread | previous in thread | raw e-mail | index | archive | help
> On Sat, 8 Mar 1997, Brian Somers wrote:
> 
> > The only time the code can "forget" to pend a signal is if it gets caused
> > during the call to pending_signal() (between the signal() call and the
> > caused &= bit).
> 
> In original code modem carrier was checked every 10 secs to sense carrier
> drop (via SetTimer). After SIGALRM pending was introduced I got too often
> unkillable loop (eating all CPU time) on carrier drop. It means that
> reading from modem device occurse waiting for something, but got no input.
> Checking carrier code close modem device, so loop never occurse if SIGALRM 
> is not pended.

I don't understand.  The idea is that if an interrupt occurs (calling
the pending function), the select() is interrupted and the pending
interrupt routine is immediately called.  There should be very little
latency.... unless there's some other tight loops in the code ?  I don't
know of any.

> > This change is going to re-invent the recursive malloc() problem (or
> > am I missing something?).
> 
> I never saw this malloc() problem by myself,

Me neither unfortunately.

>                                              moreover checking ppp code I
> find too few malloc() calls, so I don't understand how recursion between
> them is ever possible. The only signal which comes in normal operating
> mode is SIGALRM and I not find any malloc() in SIGALRM handling. 

You're forgetting about SIGHUP and SIGTERM.  They call LogClose(), which
ends up in a call to mballoc() in LogFlush().  Needless to say, mballoc()
calls our friend malloc().  Also, TimerService calls logprintf() which
calls vlogprintf() which calls LogFlush()......

> 
> I consider having malloc() problem after two days of running is lesser bug
> than having dead hang after 5minutes of running and carrier drop.

Ah, but you're the first to complain of the "problem".  Similar code was
released in 2.2-GAMMA and nobody complained (AFAIK).

> Moreover, looking deeper in the code, I saw that all signals was simple
> replaced with pending signals (by editor replace command?) even when
> it is not needed, like after fork, kill signals and fatal signals.

Not needed, but didn't do any harm (I thought).  I'd agree that any
user-causable signals (such as 15) and any system signals (such as 11)
should probably not be pended, or should at least be SIG_DFL'd until
the next time the pending signals are dealt with.  In fact, in my
experience, anyone that traps SIGSEGV is asking for it.

> I think that so-called malloc() bug must be really fixed instead of
> masking it with introducing even more bugs: dead loop on carrier drop +
> unkillable ppp + continue operation on segmentation violation + broken
> fork signals.

Let's not jump to conclusions.  I agree with the SIGSEGV stuff (if it
has to be trapped), and fork signals aren't broken.  SIG_DFL & SIG_IGN
pass right through the pending code.

> Proper fixing assumed not pending SIGALRM calls (true time is valuable
> thing) but making all timer code recursion-safe.

The original problem wasn't *just* with recursive malloc()s in the Timer
code.  2.2-ALPHA (or was it GAMMA) went out with a pending SIGALRM, and
still exhibited the problem.

IMHO, "proper fixing" entails not allowing any malloc() calls to recurse.
AFAIK, POSIX doesn't say anything about malloc() needing to be re-entrant,
therefore it's up to the program not to re-enter.  As a signal may occur
during malloc(), we must make sure that no handler that calls malloc()
may be caused until it's safe (ie via handle_signals()).

You are not sure that all of the changes from pending_signal() to signal()
are changes to calls that use handlers that don't call malloc() (as I
pointed out above), so I will not agree with the changes.  If you insist
on leaving the code there, you can deal with the re-opened recursive
malloc() pr.

I agree with SIGSEGV being signal()d rather than pending_signal()d - or
even that the call is removed altogether.  A return will just re-cause
the signal giving an endless loop (unless it was sent manually).

Either way, I'd like to know where the code is when ppp loops.  I've heard
that this does happen from time to time, but nobody's ever identified where
the code was at the time.  I'd really appreciate if you could tell me
where so that I can scatter a few more calls to handle_signals().  If you
can reproduce the problem, could you remove the signal(SIGSEGV,...) call,
-11 it when it's hung, and ask the ensuing core where it was at ?  TIA.

> -- 
> Andrey A. Chernov
> <ache@null.net>
> http://www.nagual.ru/~ache/

-- 
Brian <brian@awfulhak.demon.co.uk>, <brian@freebsd.org>
      <http://www.awfulhak.demon.co.uk/>;
Don't _EVER_ lose your sense of humour....





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199703082058.UAA24419>