Date: Thu, 3 Feb 2011 07:13:18 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Kostik Belousov <kostikbel@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: ext2fs crash in -current (r218056)
Message-ID: <20110203151318.GA9986@icarus.home.lan>
In-Reply-To: <20110203140142.GH78089@deviant.kiev.zoral.com.ua>
References: <4D47B954.3010600@FreeBSD.org> <201102021704.04274.jhb@freebsd.org> <20110202222023.GA45401@icarus.home.lan> <201102030753.55820.jhb@freebsd.org> <20110203140142.GH78089@deviant.kiev.zoral.com.ua>

On Thu, Feb 03, 2011 at 04:01:42PM +0200, Kostik Belousov wrote:
> On Thu, Feb 03, 2011 at 07:53:55AM -0500, John Baldwin wrote:
> > On Wednesday, February 02, 2011 5:20:23 pm Jeremy Chadwick wrote:
> > > On Wed, Feb 02, 2011 at 05:04:03PM -0500, John Baldwin wrote:
> > > > On Wednesday, February 02, 2011 04:13:48 pm Doug Barton wrote:
> > > > > I haven't had a chance to test this patch yet, but John's did not work
> > > > > (sorry):
> > > > >
> > > > > http://dougbarton.us/ext2fs-crash-dump-2.jpg
> > > > >
> > > > > No actual dump this time either.
> > > > >
> > > > > I'm happy to test the patch below on Thursday if there is consensus that
> > > > > it will work.
> > > >
> > > > Err, this is a different panic than what you reported earlier.  Your disk died
> > > > and spewed a bunch of EIO errors.  I can look at the locking assertion failure
> > > > tomorrow, but this is a different issue.  Even UFS needed a good bit of work to
> > > > handle disks dying gracefully.
> > >
> > > Are the byte offsets shown in the screenshot within the range of the
> > > drive's capacity?  They're around the 10.7GB mark, but I have no idea
> > > what size disk is being used.
> > >
> > > The reason I ask is that there have been reported issues in the past
> > > where the offsets shown are way outside of the range of the permitted
> > > byte offsets of the disk itself (and in some cases even showing a
> > > negative number; what is it with people not understanding the difference
> > > between signed and unsigned types?  Sigh), and I want to make sure this
> > > isn't one of those situations.  I also don't know if underlying
> > > filesystem corruption could cause the problem in question ("filesystem
> > > says you should write to block N, which is outside of the permitted
> > > range of the device").
> >
> > Just one comment.  UFS uses negative block numbers to indicate an indirect
> > block (or some such) as opposed to a direct block of data.  It's a purposeful
> > feature that allows one to instantly spot if a problem relates to a direct
> > block vs an indirect block.
> Yes, but the block numbers are negative within the vnode address range,
> not for the on-disk block numbers. ufs_bmap() shall translate negative
> vnode block numbers to the positive disk block numbers before buffer is
> passed down.

I'm a bit out of my league here (going entirely off of kernel source
code), but this is educational for me as well as (probably) others.

The error string being discussed is something like:

    g_vfs_done():da0s2[WRITE(offset=10727313400, length=131072)]error = 5

The output comes from src/sys/geom/geom_vfs.c, function g_vfs_done():

    static void
    g_vfs_done(struct bio *bip)
    {
    ...
        if (bip->bio_error) {
            printf("g_vfs_done():");
            g_print_bio(bip);
            printf("error = %d\n", bip->bio_error);
        }
    ...

g_print_bio() comes from src/sys/geom/geom_io.c, and prints the contents
based on what bip->bio_cmd contains.  In this case, I believe it's
BIO_WRITE that is being handled (the message says WRITE); that case sets
cmd and then falls through to the printf() it shares with BIO_DELETE:

    void
    g_print_bio(struct bio *bp)
    {
        const char *pname, *cmd = NULL;

        if (bp->bio_to != NULL)
            pname = bp->bio_to->name;
        else
            pname = "[unknown]";

        switch (bp->bio_cmd) {
    ...
        case BIO_WRITE:
            if (cmd == NULL)
                cmd = "WRITE";
        case BIO_DELETE:
            if (cmd == NULL)
                cmd = "DELETE";
            printf("%s[%s(offset=%jd, length=%jd)]", pname, cmd,
                (intmax_t)bp->bio_offset, (intmax_t)bp->bio_length);
    ...

The offset and the length are both explicitly cast to intmax_t and
printed as signed numbers (%jd) here.

For me anyway, the next question is "what are bio_offset and bio_length
defined as?" (indirectly, "why the explicit cast?").  They're both
declared as part of struct bio in src/sys/sys/bio.h as shown:

    struct bio {
    ...
        off_t   bio_offset;     /* Offset into file. */
    ...
        off_t   bio_length;     /* Like bio_bcount */
    ...

Since I'm not familiar with the bio stuff, I can't determine if the
printf() statement in g_print_bio() is actually correct or incorrect.
Ultimately, of course, I'm trying to determine if "offset=XXX,
length=XXX" actually represents what folks think it does.

I'm now thinking the error message indicates something equivalent to "I
got EIO when attempting to work on file offset XXX, when writing or
reading length XXX bytes", and that "file" gets expanded to a device
name in this case.  Or do I have it wrong and the "file" is actually the
disk (filesystem) itself?  I imagine this is where the vfs stuff comes
into play... somehow.  *over my head*  :-)

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |