FreeBSD Mail Archives

Date:      Tue, 14 Feb 2012 11:10:15 -0700
From:      Ian Lepore <freebsd@damnhippie.dyndns.org>
To:        Jason Hellenthal <jhell@DataIX.net>
Cc:        john fleming <jflemingeds@yahoo.com>, "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject:   Re: 6.2-Release ..ish.. CF + ata == freeze?
Message-ID:  <1329243015.37092.185.camel@revolution.hippie.lan>
In-Reply-To: <20120214051255.GA82468@DataIX.net>
References:  <1329194588.14324.YahooMailNeo@web111720.mail.gq1.yahoo.com> <20120214051255.GA82468@DataIX.net>

On Tue, 2012-02-14 at 00:12 -0500, Jason Hellenthal wrote:
> 
> On Mon, Feb 13, 2012 at 08:43:08PM -0800, john fleming wrote:
> > Just thought i would post over here as i'm not getting a warm fuzzy from checkpoint about being able to find the root cause of an issue. I have a large install base of IPSO checkpoint firewalls, which are based on FreeBSD 6.2. I've had 3 firewalls hang basically the same way, with something that looks like a filesystem issue or an issue with a CF card. 
> >  
> > Does anyone happen to know of any bugs (i've been looking around) that could cause something like that? Granted, it could be a batch of bad CF cards, but its odd that i'm seeing the same thing on 3 different boxes and once rebooted they seem ok.
> >  
> > Also is it possible to get useful info form the atacontroller when things go south like this from the ddb prompt?
> >  
> > This is what shows in show msgbuf
> > ad0: timeout waiting to issue command
> > ad0: error issuing WRITE command
> > ad0: timeout waiting to issue command
> > ad0: error issuing WRITE command
> > ad0: timeout waiting to issue command
> > ad0: error issuing WRITE command
> > ad0: timeout waiting to issue command
> > ad0: error issuing WRITE command
> > g_vfs_done():ad0s4h[WRITE(offset=33849344, length=131072)]error = 5 
> > g_vfs_done():ad0s4h[WRITE(offset=33980416, length=131072)]error = 5 
> > g_vfs_done():ad0s4h[WRITE(offset=34111488, length=131072)]error = 5
> >  g_vfs_done():ad0s4h[WRITE(offset=34242560, length=131072)]error = 5 
> > g_vfs_done():ad0s4h[WRITE(offset=34373632, length=131072)]error = 5 
> >  
> > ad0: 1882MB <STEC M2+ CF 9.0.2 K1186-2> at ata0-master PIO4
> > atapci0: <Intel 6300ESB UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x5070-0x507f mem 0x80301000-0x803013ff at device 31.1 on pci0
> > ata0: <ATA channel 0> on atapci0
> > ata1: <ATA channel 1> on atapci0
> > atapci1: <Intel 6300ESB SATA150 controller> port 0x5088-0x508f,0x50a4-0x50a7,0x5080-0x5087,0x50a0-0x50a3,0x5060-0x506f irq 15 at device 31.2 on pci0
> > ata2: <ATA channel 0> on atapci1
> > ata3: <ATA channel 1> on atapci1ad0s4h is basically a r/w ufs partition on the box where almost anything that needs to be written goes.
> > trace
> > Tracing pid 1101 tid 100043 td 0x656d8460
> > kdb_enter(608cc388,6246,656d8460,64ba1400,6095d580,...) at kdb_enter+0x2b
> > siointr1(64ba1400) at siointr1+0xf0
> > siointr(64ba1400) at siointr+0x38
> > intr_execute_handler(6095d580,f0a4ab04,6,6095d580,f0a4aafc,...) at intr_execute_handler+0x61
> > intr_execute_handlers(6095d580,f0a4ab04,6,0,656d8460,...) at intr_execute_handlers+0x40
> > atpic_handle_intr(4) at atpic_handle_intr+0x96
> > Xatpic_intr4() at Xatpic_intr4+0x20
> > --- interrupt, eip = 0x606044af, esp = 0xf0a4ab48, ebp = 0xf0a4ab5c ---
> > lockmgr(e1456a04,6,0,656d8460) at lockmgr+0x58f
> > getdirtybuf(e14569a4,60a405e4,1) at getdirtybuf+0x2e2
> > flush_deplist(68b30850,1,f0a4abb8) at flush_deplist+0x30
> > flush_inodedep_deps(656fa28c,1f235) at flush_inodedep_deps+0xcf
> > softdep_sync_metadata(65964618) at softdep_sync_metadata+0x61
> > ffs_syncvnode(65964618,1) at ffs_syncvnode+0x3a2
> > ffs_fsync(f0a4ac74) at ffs_fsync+0x12
> > VOP_FSYNC_APV(60949260,f0a4ac74) at VOP_FSYNC_APV+0x38
> > fsync(656d8460,f0a4acb4) at fsync+0x170
> > syscall(805003b,806003b,5fbf003b,8050000,288be450,...) at syscall+0x2ee
> > Xint0x80_syscall() at Xint0x80_syscall+0x1f
> 
> This looks to be a problem with softupdates and CF cards. Can you get
> this to repeat on a brand new (good) card ?
> 

EIO errors on a write that lead to a panic nearly always backtrace into
the softupdates code, because that code pretty much has to panic if it
can't write things in the proper order.  That doesn't imply that the
softupdates code is at fault in any way, or that the errors would go
away if softupdates were turned off.  

In fact, I consider it important to have softupdates enabled on CF and
SDCard media.  The number of writes (and especially of repeated
re-writes of the same filesystem metadata sectors) goes way way up
without SU enabled, and that's bad for media with a limited number of
write cycles in its lifetime.

We've been using 6.2 with SU enabled on CF cards for many years at
Symmetricom; we're still shipping systems with that config.  Depending
on the motherboard or SBC, we often have to disable ata DMA, or limit it
to a max of WDMA2 mode.  The indication that you need to do so is
typically a lockup either trying to load the kernel and modules, or
sometimes that works but it locks up while initializing the ata driver.
[1]  If your systems have been running fine with DMA enabled, it's not
the sort of problem that suddenly appears out of the blue.  You find out
when you need to disable it pretty quickly on new hardware because it
doesn't boot reliably.

I tend to agree with Jeremy's assesment that you may have some CF cards
that have neared the end of their life, and especially if they're full
the automatic wear leveling can't find any un-worn cells to use.  If the
cards are old they may have primitive wear-leveling code in the
controllers. [2]  Having 3 cards fail at about the same time on similar
systems could be a sign of this if they were bought at the same time.
We had the unhappy experience of having a large batch of cards bought
around the same time all fail around the same time in the field.

[1] If it locks up in the mbr or btx loader you have to disable DMA in
the bios settings as well as with the loader tunable that's used by the
ata driver.

[2] I've read white papers about new algorithms in CF controllers that
do things like notice when a sector has been occupied with data that was
written once and has been read many times without re-writing.  These
controllers will actually relocate such data to a cell that doesn't have
many write cycles left but can still be read reliably, freeing up a cell
that has only one write in its life history.  Older cards are not nearly
as smart as that.

-- Ian

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1329243015.37092.185.camel>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation