From: Richard Todd
To: current@freebsd.org
Subject: Tracking down problem with booting large kernels (bug in locore.s)
Date: Tue, 13 Mar 2001 00:39:50 -0600

On my system (dual PII/400 running -current), I've noticed for some time
that if I build a kernel with too many device drivers in it (where "too
many" seems to correspond to text size >3M for the resulting kernel), the
system reboots itself immediately upon booting with the new kernel. Other
people have noticed this before (see the thread "Recent kernels won't
boot" in the mailing list archives at
http://www.freebsd.org/mail/archive/2000/freebsd-current/20001015.freebsd-current.html ).
However, no fix for or cause of the problem was ever identified, and the
problem still exists in -current cvsuped as of today.

I spent some time tonight seeing if I could localize the exact place of
the crash, and had some luck finding where it's crashing. The problem is
annoyingly hard to track down, as even booting with DDB and boot -d
wouldn't catch the bug; the kernel reboots before DDB starts.
I had to resort to sticking "hlt" instructions (or calls to cpu_halt())
in various places and seeing if I could get the kernel to hang (telling
me that the kernel had gotten as far as where I stuck the halt). I
narrowed the crash down to this area of locore.s (note the arrows).

-----------------------------------
/* Now enable paging */
        movl    R(IdlePTD), %eax
        movl    %eax,%cr3               /* load ptd addr into mmu */
        movl    %cr0,%eax               /* get control word */
        orl     $CR0_PE|CR0_PG,%eax     /* enable paging */
        movl    %eax,%cr0               /* and let's page NOW! */

#ifdef BDE_DEBUGGER
/*
 * Complete the adjustments for paging so that we can keep tracing through
 * initi386() after the low (physical) addresses for the gdt and idt become
 * invalid.
 */
        call    bdb_commit_paging
#endif
                                <---- No crashes as of here
        pushl   $begin          /* jump to high virtualized address */
        ret

/* now running relocated at KERNBASE where the system is linked to run */
begin:                          <==== crashes before it gets here!!!
        /* set up bootstrap stack */
        movl    proc0paddr,%eax /* location of in-kernel pages */
----------------------------------------------------------

The pushl and ret is where the boot code is jumping to "begin:" at its
proper virtual address after the page tables are set up. I'm guessing that
create_pagetables is somehow losing and creating bogus page tables such
that the jump to the kernel virtual address space goes into deep space
somewhere, but frankly the details of page tables on the i386 are beyond
my expertise. So I'm posting this in hopes that someone on here *does*
know enough to figure out what's going wrong when the kernel size is
sufficiently large. Any takers?
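
For anyone who wants to check the page-table guess by hand, here is a rough sketch of the address arithmetic involved in the jump to "begin:". It is not taken from locore.s and does not identify the bug. It only assumes the standard i386 two-level paging layout (4 KB pages, 10 bits of page-directory index, 10 bits of page-table index, 12 bits of offset). KERNBASE_ASSUMED and all of the function names are made up for illustration; substitute the KERNBASE your kernel is actually built with.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard i386 non-PAE paging: 4 KB pages, 4 MB per page-directory entry. */
#define PAGE_SHIFT       12
#define PDR_SHIFT        22
#define PT_INDEX_MASK    0x3ffu

/* ASSUMPTION: illustrative value only; use your kernel's real KERNBASE. */
#define KERNBASE_ASSUMED 0xc0000000u

/* Page-directory slot that a virtual address selects (top 10 bits). */
static inline uint32_t pd_index(uint32_t va)
{
    return va >> PDR_SHIFT;
}

/* Page-table slot that a virtual address selects (middle 10 bits). */
static inline uint32_t pt_index(uint32_t va)
{
    return (va >> PAGE_SHIFT) & PT_INDEX_MASK;
}

/* Page tables needed to map `bytes` starting on a 4 MB boundary. */
static inline uint32_t page_tables_needed(uint32_t bytes)
{
    uint32_t whole = bytes >> PDR_SHIFT;
    uint32_t partial = (bytes & ((1u << PDR_SHIFT) - 1)) != 0;
    return whole + partial;
}

/*
 * True if `va` lies inside [KERNBASE_ASSUMED, KERNBASE_ASSUMED + mapped_bytes),
 * i.e. inside the region the boot-time page tables are supposed to cover.
 */
static inline bool va_is_mapped(uint32_t va, uint32_t mapped_bytes)
{
    return va >= KERNBASE_ASSUMED && (va - KERNBASE_ASSUMED) < mapped_bytes;
}
```

Feeding the address of "begin" (from nm on the kernel) and the number of bytes the boot code actually maps into va_is_mapped() would show whether the target of the pushl/ret falls outside the mapped range, which is one way the jump could end up in deep space.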