From owner-freebsd-current  Mon Mar 12 22:59:59 2001
Message-Id: <m14ciTH-004MkiC@servalan.servalan.com>
To: current@freebsd.org
Subject: Tracking down problem with booting large kernels (bug in locore.s)
Date: Tue, 13 Mar 2001 00:39:50 -0600
From: Richard Todd <rmtodd@ichotolot.servalan.com>
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On my system (dual PII/400 running -current), I've noticed for some time that
if I build a kernel with too many device drivers in it (where "too many" seems
to correspond to a text size of more than 3M for the resulting kernel), the
system reboots itself immediately upon booting the new kernel.  Other people
have noticed this before (see the thread "Recent kernels won't boot" in the
mailing list archives at
http://www.freebsd.org/mail/archive/2000/freebsd-current/20001015.freebsd-current.html).
However, neither a cause nor a fix was ever identified, and the problem still
exists in -current as cvsupped today.

I spent some time tonight trying to localize the crash, and had some luck.
The problem is annoyingly hard to track down: even booting with DDB and
boot -d doesn't catch the bug, because the kernel reboots before DDB starts.
I had to resort to sticking "hlt" instructions (or calls to cpu_halt()) in
various places and seeing whether the kernel hung, which would tell me that
it had gotten as far as the halt.  I narrowed the crash down to this area of
locore.s (note the arrows).

-----------------------------------
/* Now enable paging */
	movl	R(IdlePTD), %eax
	movl	%eax,%cr3			/* load ptd addr into mmu */
	movl	%cr0,%eax			/* get control word */
	orl	$CR0_PE|CR0_PG,%eax		/* enable paging */
	movl	%eax,%cr0			/* and let's page NOW! */

#ifdef BDE_DEBUGGER
/*
 * Complete the adjustments for paging so that we can keep tracing through
 * initi386() after the low (physical) addresses for the gdt and idt become
 * invalid.
 */
	call	bdb_commit_paging
#endif
<---- No crashes as of here
	pushl	$begin				/* jump to high virtualized address */
	ret   

/* now running relocated at KERNBASE where the system is linked to run */
begin:
<==== crashes before it gets here!!!
	/* set up bootstrap stack */
	movl	proc0paddr,%eax			/* location of in-kernel pages */
----------------------------------------------------------

The pushl and ret are where the boot code jumps to "begin:" at its proper
virtual address after the page tables are set up.  I'm guessing that
create_pagetables is somehow going wrong and creating bogus page tables, so
that the jump into the kernel virtual address space goes off into deep space
somewhere, but frankly the details of page tables on the i386 are beyond my
expertise.  So I'm posting this in hopes that someone here *does* know enough
to figure out what's going wrong when the kernel is sufficiently large.

Any takers?
