From owner-cvs-usrsbin  Sun Mar  9 04:11:44 1997
Return-Path: <owner-cvs-usrsbin>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id EAA16668
          for cvs-usrsbin-outgoing; Sun, 9 Mar 1997 04:11:44 -0800 (PST)
Received: from sovcom.kiae.su (sovcom.kiae.su [193.125.152.1])
          by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id EAA16663;
          Sun, 9 Mar 1997 04:11:39 -0800 (PST)
Received: by sovcom.kiae.su id AA08107
  (5.65.kiae-1 ); Sun, 9 Mar 1997 15:05:55 +0300
Received: by sovcom.KIAE.su (UUMAIL/2.0); Sun,  9 Mar 97 15:05:55 +0300
Received: (from ache@localhost)
	by nagual.ru (8.8.5/8.8.5) id OAA00815;
	Sun, 9 Mar 1997 14:56:08 +0300 (MSK)
Date: Sun, 9 Mar 1997 14:56:04 +0300 (MSK)
From: =?KOI8-R?B?4c7E0sXKIP7F0s7P1w==?= <ache@nagual.ru>
To: Brian Somers <brian@awfulhak.demon.co.uk>
Cc: CVS-committers@freebsd.org, cvs-all@freebsd.org, cvs-usrsbin@freebsd.org
Subject: Re: cvs commit: src/usr.sbin/ppp timer.c 
In-Reply-To: <199703082058.UAA24419@awfulhak.demon.co.uk>
Message-Id: <Pine.BSF.3.95q.970309144156.590B-100000@nagual.ru>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-cvs-usrsbin@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

On Sat, 8 Mar 1997, Brian Somers wrote:

> I don't understand.  The idea is that if an interrupt occurs (calling
> the pending function), the select() is interrupted and the pending
> interrupt routine is immediately called.  There should be very little
> latency.... unless there's some other tight loops in the code ?  I don't
> know of any.

It seems that some other tight loops are present...

> You're forgetting about SIGHUP and SIGTERM.  They call LogClose(), which
> ends up in a call to mballoc() in LogFlush().  Needless to say, mballoc()
> calls our friend malloc().  Also, TimerService calls logprintf() which
> calls vlogprintf() which calls LogFlush()......

These signals can't be the reason for the bug report, since the report
assumes a bug in normal mode only, not a termination bug. And in normal
mode only SIGALRM can happen; other signals are impossible.

> > I consider having malloc() problem after two days of running is lesser bug
> > than having dead hang after 5minutes of running and carrier drop.
> 
> Ah, but you're the first to complain of the "problem".  Similar code was
> released in 2.2-GAMMA and nobody complained (AFAIK).

We have bad phone lines in Russia, so carrier drop is a common situation
here. As I hear, it almost never occurs in the USA.

> Let's not jump to conclusions.  I agree with the SIGSEGV stuff (if it
> has to be trapped), and fork signals aren't broken.  SIG_DFL & SIG_IGN
> pass right through the pending code.

As I said, SIGALRM is the only signal which can happen in normal running
mode.  I didn't see a reason to pend other signals (except maybe
SIGTSTP, etc., which I left pending).

> > Proper fixing assumed not pending SIGALRM calls (true time is valuable
> > thing) but making all timer code recursion-safe.
> 
> The original problem wasn't *just* with recursive malloc()s in the Timer
> code.  2.2-ALPHA (or was it GAMMA) went out with a pending SIGALRM, and
> still exhibited the problem.

You mean that pending the signal does not fix the problem? Why does the
bug report stay closed in this case?

> IMHO, "proper fixing" entails not allowing any malloc() calls to recurse.
> AFAIK, POSIX doesn't say anything about malloc() needing to be re-entrant,
> therefore it's up to the program not to re-enter.  As a signal may occur
> during malloc(), we must make sure that no handler that calls malloc()
> may be caused until it's safe (ie via handle_signals()).

Yes, the proper fix is not allowing malloc in signal handlers. But pending
alarm ticks is not acceptable in any case. They are alarm ticks precisely
because they must not be delayed. I.e. the alarm signal handler must be
executed immediately but must not call malloc (you can pend the malloc
call, not the signal handler itself).
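
To make the idea concrete, here is a minimal sketch, not the actual ppp
code. All names (on_alarm, drain_ticks, install_alarm_handler, ticks_seen,
ticks_done) are hypothetical and assumed only for illustration. The
handler runs at once and only counts the tick; the work that needs
malloc() is done later from the main loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; nothing here is taken from ppp's timer.c. */
static volatile sig_atomic_t ticks_seen; /* written only by the handler */
static sig_atomic_t ticks_done;          /* written only by the main loop */

/* Runs immediately on SIGALRM. It only counts the tick, so it never
 * enters malloc() and cannot recurse into it. */
static void on_alarm(int sig)
{
    (void)sig;
    ticks_seen++;
}

/* Install the handler without SA_RESTART, so a blocking select() in the
 * main loop returns with EINTR when a tick arrives. Returns 0 on success. */
static int install_alarm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGALRM, &sa, NULL);
}

/* Called from the main loop, outside signal context. Does the deferred,
 * malloc()-using work once per tick not yet processed and returns the
 * number of ticks handled by this call. */
static int drain_ticks(void)
{
    int handled = 0;
    sig_atomic_t seen = ticks_seen; /* snapshot; later ticks wait for next call */

    while (ticks_done != seen) {
        char *buf = malloc(64);     /* safe here: not inside a handler */
        if (buf == NULL)
            break;
        /* ... build a log line or timer record in buf ... */
        free(buf);
        ticks_done++;
        handled++;
    }
    return handled;
}
```

The main loop would call drain_ticks() each time select() returns,
including when it fails with EINTR, so the tick itself is never delayed
and only the allocation is.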

> You are not sure that all of the changes from pending_signal() to signal()
> are changes to calls that use handlers that don't call malloc() (as I
> pointed out above), so I will not agree with the changes.  If you insist
> on leaving the code there, you can deal with the re-opened recursive
> malloc() pr.

Yes, I re-open it. But from your words, it happens whether signals are
pended or not?

> Either way, I'd like to know where the code is when ppp loops.  I've heard
> that this does happen from time to time, but nobody's ever identified where
> the code was at the time.  I'd really appreciate if you could tell me
> where so that I can scatter a few more calls to handle_signals().  If you
> can reproduce the problem, could you remove the signal(SIGSEGV,...) call,
> -11 it when it's hung, and ask the ensuing core where it was at ?  TIA.

I'll try to debug and tell you the exact place (it is a bit hard to debug
a daemon with an unstable effect). Right now I can only say that the
problem disappears when I remove the signal pending.

-- 
Andrey A. Chernov
<ache@null.net>
http://www.nagual.ru/~ache/