From owner-freebsd-stable@FreeBSD.ORG  Thu Apr 25 18:49:27 2013
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id E67D7C02
 for <freebsd-stable@freebsd.org>; Thu, 25 Apr 2013 18:49:27 +0000 (UTC)
 (envelope-from guy.helmer@gmail.com)
Received: from mail-yh0-x233.google.com (mail-yh0-x233.google.com
 [IPv6:2607:f8b0:4002:c01::233])
 by mx1.freebsd.org (Postfix) with ESMTP id AC7411423
 for <freebsd-stable@freebsd.org>; Thu, 25 Apr 2013 18:49:27 +0000 (UTC)
Received: by mail-yh0-f51.google.com with SMTP id l109so346720yhq.10
 for <freebsd-stable@freebsd.org>; Thu, 25 Apr 2013 11:49:27 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=x-received:content-type:mime-version:subject:from:in-reply-to:date
 :cc:content-transfer-encoding:message-id:references:to:x-mailer;
 bh=2PiAjlwWhpVguukhrCkWe1MK/SANX7SfjOSURJKHVSo=;
 b=Bw7XaAFNvJr/n4OiRw7/dfLAg02RZOLEe36/GtkqSnqU84XAT+zsJ3tMbVzk5oPSk3
 AsCYAAjdXVckaRHh9K700Z8RzchMAnMv3j2wrPE5vh2VNzyTqF3vJpKlze3W3CBcH4to
 S6OtxMUp5aCL7oR4VmKrOncyoYAr1QfxtRpDeTNqLkuo8gUqS0BZWFD1eI05zt5qR8dK
 TJITUe1ikkXyxd9ATQDXfzgsGVdDWalUC0+8iZAW31eh6U5o+jWp/U7e2baQavdeLfhF
 ospn601gABmCY4hnbntEABcPmHcUBx1Vzwlp0JzncGTgEe76BX9uFw7S/F9zAhZLBPSQ
 aITg==
X-Received: by 10.236.153.200 with SMTP id f48mr25889015yhk.117.1366915767209; 
 Thu, 25 Apr 2013 11:49:27 -0700 (PDT)
Received: from [172.16.2.51] ([216.81.189.10])
 by mx.google.com with ESMTPSA id s45sm11192285yhk.22.2013.04.25.11.49.25
 for <multiple recipients>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Thu, 25 Apr 2013 11:49:26 -0700 (PDT)
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Mac OS X Mail 6.3 \(1503\))
Subject: Re: FreeBSD 9: fdisk -It crashes kernel
From: Guy Helmer <guy.helmer@gmail.com>
In-Reply-To: <20130425175729.GA10142@icarus.home.lan>
Date: Thu, 25 Apr 2013 13:49:24 -0500
Content-Transfer-Encoding: quoted-printable
Message-Id: <064339D8-0896-4C4E-89FA-A01E31B15E0C@gmail.com>
References: <80F41679-9C3A-4E61-8AAD-403410344C32@gmail.com>
 <20130425155818.GA8454@icarus.home.lan>
 <257D766B-A296-43CD-A2B9-5F70A95A07A2@gmail.com>
 <20130425175729.GA10142@icarus.home.lan>
To: Jeremy Chadwick <jdc@koitsu.org>
X-Mailer: Apple Mail (2.1503)
Cc: FreeBSD Stable <freebsd-stable@freebsd.org>
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 25 Apr 2013 18:49:28 -0000


On Apr 25, 2013, at 12:57 PM, Jeremy Chadwick <jdc@koitsu.org> wrote:

> On Thu, Apr 25, 2013 at 11:58:42AM -0500, Guy Helmer wrote:
>> On Apr 25, 2013, at 10:58 AM, Jeremy Chadwick <jdc@koitsu.org> wrote:
>>=20
>>> On Thu, Apr 25, 2013 at 09:06:49AM -0500, Guy Helmer wrote:
>>>> Encountered a surprise when my disk resizing rc.d script caused =
FreeBSD 9.1-STABLE to crash. I used "fdisk -It ada0" to determine what =
the available size of the disk (which happened to be the root disk), and =
on FreeBSD 9.1 the kernel comes crashing down:
>>>>=20
>>>> + fdisk -It ada0
>>>> + /rescue/sed -En 's,.*start ([0-9]+).*size ([0-9]+).*,\1 + \2,p'
>>>> vnode_pager_getpages: I/O read error
>>>> vm_fault: pager read error, pid 65 (fdisk)
>>>> pid 65 (fdisk), uid 0: exited on signal 11
>>>> eval: arithmetic expression: expecting primary: ""
>>>> Entropy harvesting: point_to_pointeval: date: Device not configured
>>>> eval: df: Device not configured
>>>> eval: dmesg: Device not configured
>>>> cat: /bin/ls: Device not configured
>>>> kickstart.
>>>> eval: cannot open /etc/fstab: Device not configured
>>>> eval: cannot open /etc/fstab: Device not configured
>>>> eval: swapon: Device not configured
>>>> Warning! No /etc/fstab: skipping disk checks
>>>> fstab: /etc/fstab:0: Device not configured
>>>>=20
>>>> Fatal trap 12: page fault while in kernel mode
>>>> cpuid =3D 1; apic id =3D 01
>>>> fault virtual address   =3D 0x0
>>>> fault code                     =3D supervisor read, page not =
present
>>>> instruction pointer      =3D 0x20:0xc0825fc4
>>>> stack pointer               =3D 0x28:0xc5a088c8
>>>> frame pointer              =3D 0x28:0xc5a08914
>>>> code segment            =3D base 0x0, limit 0xfffff, type 0x1b
>>>>                                    =3D DLP 0, pres 1, def32 1, gran =
1
>>>> processor eflags       =3D interrupt enabled, resume, IOPL =3D 0
>>>> current process         =3D 91 (mount)
>>>> [ thread pid 91 tid 100056 ]
>>>> Stopped at  g_access+0x24: mlvl 0(%ebx),%eax
>>>> db> where
>>>> Tracing pid 91 tid 100056 td 0xc84c42f0
>>>> g_access(c8481d34,0,1,1,0,=85) at g_access+0x24/frame 0xc5a08914
>>>> ffs_mount(c8481d34,c0d78380,2,c5a08c00,c829ae6c,=85) af =
ffs_mount+0xf74/frame 0xc5a08a34
>>>> vfs_donmount(c84c42f0,10000,0,c84cf200,c84cf200,=85) at =
vfs_donmount+0x1423/frame 0xc5a08c24
>>>> sys_nmount(c84c42f0,c5a08ccc,c5a08cc4,1010006,c5a08d08,=85) at =
sys_nmount+0x7f/frame 0xc5a08c48
>>>> syscall(c5a08d08) at syscall+0x443/frame 0xc508cfc
>>>> Xint0x80_syscall() at Xint0x80_syscall+0x21/frame 0xc5a08cfc
>>>> --- syscall (378, FreeBSD ELF32, sys_nmount), eip =3D 0x480d5feb, =
esp =3D 0xbfbfce1c, ebp =3D 0xbfbfd378 ---
>>>>=20
>>>> I'll fix my script to not do this, but it seems odd that fdisk -It =
can make the disk "go away".
>>>=20
>>> Please provide a full, unmodified copy of your script.
>>>=20
>>> What's confusing to me is that after your sed call (which I don't =
even
>>> understand, because it doesn't appear to be operating on anything =
except
>>> stdin/stdout, and we don't know what that is -- again, show the =
script),
>>> the kernel starts outputting indications that the root =
disk/filesystem
>>> or its related metadata disappeared:
>>>=20
>>>> vnode_pager_getpages: I/O read error
>>>> vm_fault: pager read error, pid 65 (fdisk)
>>>> pid 65 (fdisk), uid 0: exited on signal 11
>>>=20
>>> Except the kernel stack trace indicates something called =
sys_nmount(),
>>> which called vfs_donmount(), which called ffs_mount(), which calls
>>> g_access().  All of those scream to me "someone tried to mount
>>> something".  fdisk does not do mounting.
>>=20
>> Right, which is why I copied the entire screen output -- it appears =
to me that the rc scripts had stumbled on until the kernel panicked.
>>=20
>>>=20
>>> fdisk also shouldn't be writing to LBA 0 (the MBR) if you used -I =
-t.
>>> I've been staring at fdisk.c for about 20 minutes now and I can't =
work
>>> out a situation where -I -t would cause the MBR to be rewritten
>>> actively.
>>>=20
>>> The only GEOM calls I see in fdisk.c that would get called are
>>> g_device_path(), g_open(), and g_close().  Actual device I/O uses =
read()
>>> and write() (only in write_s0() which shouldn't be called).
>>>=20
>>> Furthermore, GEOM has foot-shooting-prevention mechanisms in place =
(I'm
>>> talking about kern.geom.debugflags) to keep LBA 0 from being =
modified.
>>> Is your script setting that sysctl to 16/0x10 blindly?  Ahem.
>>=20
>> No. The script is intended only to work for drives other than the one =
containing the boot partition.
>>=20
>>>=20
>>> It would also help if you could state exactly what 9.1-STABLE source
>>> you're using; if using svn provide revision (rXXXXXX), else provide
>>> uname -a output.
>>=20
>> rev 249788
>>=20
>>>=20
>>> Finally: I would suggest using gpart(8) instead going forward.  This =
is
>>> a separate recommendation though; if somehow I'm overlooking =
something
>>> in fdisk.c where writes to LBA 0 really do happen, then that needs =
to
>>> get fixed.  But gpart(8) is what you should use in general these =
days
>>> anyway.
>>>=20
>>=20
>> Seems like gpart was giving me some frustration with earlier versions =
of FreeBSD (7, I think) so I went with fdisk instead. Might work OK =
now...
>>=20
>> I have included the full script below.
>>=20
>> { snipping for brevity; for reference, see this url: }
>> { =
http://lists.freebsd.org/pipermail/freebsd-stable/2013-April/073234.html =
}
>=20
> Thanks for this.
>=20
> I could practically write a book on what's going on here.  Rather than
> me spend hours of time reverse-engineering this, you're going to need =
to
> step up to the plate and see if you can figure out what exactly =
triggers
> the issue.
>=20
> I will give you this analysis about fdisk -I -t:
>=20
> When -I is specified, I_flag=3D1.
>=20
> When -t is specified, v_flag=3D1, and also v_flag=3D1.
>=20
> Function open_disk(), when fdisk is used with the -I option, will call
> g_open() with the read-write flag set to 1.  Whether or not this
> succeeds I don't know (and if it fails, but only with EPERM, then it
> retries in read-only mode silently).  The -I flag correlates with the
> I_flag variable (do not confuse this with i_flag):
>=20
> 726 static int
> 727 open_disk(int flag)
> 728 {
> 729         int rwmode;
> 730
> 731         /* Write mode if one of these flags are set. */
> 732         rwmode =3D (a_flag || I_flag || B_flag || flag);
> 733         fd =3D g_open(disk, rwmode);
> 734         /* If the mode fails, try read-only if we didn't. */
> 735         if (fd =3D=3D -1 && errno =3D=3D EPERM && rwmode)
> 736                 fd =3D g_open(disk, 0);
> 737         if (fd =3D=3D -1 && errno =3D=3D ENXIO)
> 738                 return -2;
> 739         if (fd =3D=3D -1) {
> 740                 warnx("can't open device %s", disk);
> 741                 return -1;
> 742         }
> 743         if (get_params() =3D=3D -1) {
> 744                 warnx("can't get disk parameters on %s", disk);
> 745                 return -1;
> 746         }
> 747         return fd;
> 748 }
>=20
> Variable fd is global.
>=20
> After this call to open_disk(), read_disk() is used, but that's only
> doing read operations on fd.
>=20
> After this, the if (I_flag) code gets run.  This calls read_s0(),
> reset_boot() (sounds ominous but isn't), and dos().
>=20
> read_s0() does not issue any write I/O to fd, or call any functions =
that
> issue write I/O.
>=20
> reset_boot() just resets the in-memory-copy of the partition table.
> It does not modify anything on disk.
>=20
> dos() does not do any I/O at all.
>=20
> At this point, if v_flag is set (which it is), print_s0() gets run.
>=20
> print_s0() calls print_params(), which simply prints out the
> in-memory-copy of C/H/S from the disk label and so on.  No file I/O is
> done.  Once that's done, it calls print_part() on each partition,
> which just outputs all the details -- again, no file I/O is done.
>=20
> Finally, at this stage, if t_flag ISN'T set, then write_s0() gets run.
> In this case write_s0() does not get called because t_flag=3D1.  FYI,
> write_s0() is what does the actual write I/O to LBA 0/MBR.  After =
that,
> exit(0) is called.
>=20
> So even though -I -t calls g_open() with the read-write flag set, I
> don't see anything that indicates writing to LBA 0/MBR happens.
>=20
> So I do not see how fdisk -I -t could cause this situation.
>=20
> fdisk -v, maybe, but again, you'll need to do the testing.

Thanks for the analysis.=20

> Now I have a question for you: how did you manage to get this output?
>=20
>>>> + fdisk -It ada0
>>>> + /rescue/sed -En 's,.*start ([0-9]+).*size ([0-9]+).*,\1 + \2,p'
>=20
> Because this looks like /bin/sh -x output, but I need to know if =
that's
> the case or not.

Yes, I inserted "set -x" at the top of the function to get this output.

> /bin/sh -x claims to echo commands to stderr ***before*** they're
> executed.
>=20
> So I'm then left wondering why we don't see output that equates to the
> equivalent of this line:
>=20
>    eval $(fdisk -v $DISK | $SED -En 's,.*start ([0-9]+).*size =
([0-9]+).*,curroff=3D\1 currsize=3D\2,p')
>=20
> Instead, we start seeing this:
>=20
>>>> vnode_pager_getpages: I/O read error
>>>> vm_fault: pager read error, pid 65 (fdisk)
>>>> pid 65 (fdisk), uid 0: exited on signal 11
>>>> eval: arithmetic expression: expecting primary: ""
>>> Entropy harvesting: point_to_pointeval: date: Device not configured
>>> eval: df: Device not configured
>>> eval: dmesg: Device not configured
>>> cat: /bin/ls: Device not configured
>=20
> Your script has only 1 eval statement (and eval is very very =
dangerous.
> I cannot stress this enough.  If you ever think you need eval in shell
> scripts, you probably don't.)

>=20
> Your script does not call df, dmesg, date, or /bin/ls.  So why are =
these
> mentioned?  And "Entropy harvesting" comes from dmesg/the kernel =
message
> buffer too, how is that ending up there?

Hmm, maybe the quoting inside the $() is not working the way I expected. =
Perhaps the * symbols are being expanded as wildcards by the shell as a =
result of the eval.

> Possibly the eval: error line only gets output by sh ***after*** all =
the
> preceding [broken] stuff gets run.
>=20
> But I'm also confused, because there isn't anything =
arithmetic-oriented
> in your eval line, so why is it talking about arithmetic expressions?
> You don't use expr either, so the only math operation comes BEFORE all
> of that, specifically here:
>=20
>    physsize=3D$(($(fdisk -It $DISK | $SED -En 's,.*start =
([0-9]+).*size ([0-9]+).*,\1 + \2,p')))
>=20
> My gut feeling here is that something "unexpected" happened and your
> script went totally haywire as a result (probably some unexpected =
output
> that got turned into something you didn't expect).  My favourite is
> seeing asterisk/wildcards expanded to pull in all the filenames in =
$cwd.
>=20
> I'm sorry to tell you, but there is a point when writing shell scripts
> becomes unreliable/unmanageable/results in too much risk, and is time =
to
> consider writing such things in an actual programming language
> (preferably one without reliance on CLI tools, but real APIs).  I know
> you don't need to hear that right now, but it's true.

There are few tools available at the time this code needs to run -- note =
the use of sed from /rescue as an example. The only alternative would =
probably be C code. Still seems odd that userland code could crash the =
kernel this way.

> See if you can work out exactly what line begins causing problems for
> you.  My guess is that it's the result of fdisk segfaulting, but I'm
> honestly not sure because the above output doesn't make entire sense.
>=20
> Let us know what you determine/find out.

Thanks for your time, Jeremy.

Guy