Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 7 Jun 2012 07:28:38 +0400
From:      Sergey Kandaurov <pluknet@gmail.com>
To:        Oliver Fromme <olli@lurza.secnetix.de>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 9-stabe: cd device gone, ATA_CAM panics
Message-ID:  <CAE-mSO%2BwRwgb-GE1UXvVAffDJc8NoONQ05LoQFuhdoN%2BNLgPdw@mail.gmail.com>
In-Reply-To: <201206061929.q56JT6Be027916@lurza.secnetix.de>
References:  <201206061929.q56JT6Be027916@lurza.secnetix.de>

next in thread | previous in thread | raw e-mail | index | archive | help
On 6 June 2012 23:29, Oliver Fromme <olli@lurza.secnetix.de> wrote:
> Hi,
>

Hi, Oliver Fromme.

This is a wild guess, but see below.

> Anyway, I commented atapicam out because it seems that now
> "options ATA_CAM" does the same thing. =A0This time the kernel
> linked, but during boot I got the following panic:
>
> atapci0: <Promise PDC20269 UDMA133 controller> port 0xdc00-0xdc07,0xd880-=
0xd883,0xd800-0xd807,0xcc00-0xcc03,0xc880-0xc88f mem 0xfeaf8000-0xfeafbfff =
irq 21 at device 6.0 on pci3
> ata2: <ATA channel> at channel 0 on atapci0
> ata3: <ATA channel> at channel 1 on atapci0
> [...]
> ata2: reset tp1 mask=3D03 ostat0=3D50 ostat1=3D00
> ata2: stat0=3D0x00 err=3D0x01 lsb=3D0x14 msb=3D0xeb
> ata2: stat1=3D0x00 err=3D0x04 lsb=3D0x00 msb=3D0x00
> ata2: reset tp2 stat0=3D00 stat1=3D00 devices=3D0x10000

Looks like this is the first ata (re-)init pass.

There we walk the long way. xpt_register_async() receives AC_FOUND_DEVICE
and does allocatiion and initialization of cam periph with cd(4) functions.
cdregister() is one of them, it calls to cdstart() via periph_start()
callback with CD_STATE_PROBE, then to cddone() with CD_CCB_PROBE via xpt
action. There we get bad CCB state and eventually parse it as
CAM_AUTOSENSE_FAIL | CAM_DEV_QFRZN. This indicates that cam got an invalid
sense data. Somewhere on this way we seem to gain a reference count on a
peripheral at cdregister() (which is set to 1?), and drop it at cddone().
This looks odd, so I am likely wrong there.

> (cd0:ata2:0:0:0): AutoSense failed
> (cd0:ata2:0:0:0): Error 5, Unretryable error

Both messages are consequence from the generic error handler to indicate
CAM_AUTOSENSE_FAIL, which is not restartable, so EIO error is also set.

> (cd0:ata2:0:0:0): got CAM status 0x50

0x50 stands for CAM_AUTOSENSE_FAIL | CAM_DEV_QFRZN (queue is frozen)

> (cd0:ata2:0:0:0): fatal error, failed to attach to device

The two last messages originate from cddone() xpt completion function.
To the moment, periph is still valid, and periph refcount is also valid
(has a non-zero value).

> (cd0:ata2:0:0:0): lost device

This is from the callback function called from cam_periph_invalidate(),
called from cddone(). It is called to invalidate periph with its specific
callback function. Then periph is marked as invalid with CAM_PERIPH_INVALID
periph flag. periph refcount should still be valid. As such, it is not
freed. At least the last reference is held to be released at cddone().
I don't know that is the state of refcount after cddone() releases it.

> ata2: reset tp1 mask=3D03 ostat0=3D50 ostat1=3D00
> ata2: stat0=3D0x00 err=3D0x01 lsb=3D0x14 msb=3D0xeb
> ata2: stat1=3D0x00 err=3D0x04 lsb=3D0x00 msb=3D0x00
> ata2: reset tp2 stat0=3D00 stat1=3D00 devices=3D0x10000

Looks like this is the second ata (re-)init pass.

> (cd0:ata2:0:0:0): AutoSense failed
> (cd0:ata2:0:0:0): Error 5, Unretryable error

Same as above, but without refcount++ this time as cam periph became
invalidated previously; cam_periph_acquire() handles this and doesn't
allow to increment periph refcount. Messages are a bit shorter now due
to CAM_DEV_QFRZN. Any way we will get to cam_periph_invalidate() from
cddone(). When it comes there periph is already marked as invalid with
CAM_PERIPH_INVALID flag. periph refcount should also be zero due to
cam_periph_acquire() handling. As such, we go to camperiphfree() that
removes the periph from list and destroys it.

> (cd0:ata2:0:0:0): removing device entry

This is from cdcleanup() periph destructor callback called from camperiphfr=
ee().

> panic: cam_periph_release_locked_buses: release of 0xfffffe0007321700 whe=
n refcount is zero

So, at the second pass the periph refcount is only seems to be decremented
to zero, then periph is destroyed which results in geom tasting on a dead
provider.

>
> cpuid =3D 4
> KDB: stack backtrace:
> #0 0xffffffff807a3c96 at kdb_backtrace+0x66
> #1 0xffffffff8076d74e at panic+0x1ce
> #2 0xffffffff802a200e at cam_periph_release_locked_buses+0x3e
> #3 0xffffffff802a202e at cam_periph_release_locked+0x1e
> #4 0xffffffff802a2f52 at cam_periph_release+0x52
> #5 0xffffffff802babcd at cdclose+0xbd
> #6 0xffffffff806d7332 at g_disk_access+0x242
> #7 0xffffffff806db618 at g_access+0x188
> #8 0xffffffff807110f8 at g_raid_md_taste_sii+0x188
> #9 0xffffffff806e96d6 at g_raid_taste+0x126
> #10 0xffffffff806db0cd at g_new_provider_event+0x6d
> #11 0xffffffff806d8c08 at g_run_events+0x1e8
> #12 0xffffffff8073f00e at fork_exit+0x11e
> #13 0xffffffff809a2d4e at fork_trampoline+0xe

Probably the way to fix this is to modify
cam_periph_release_locked_buses() to test CAM_PERIPH_INVALID either.

how about this patch?
(beware: it was not compile tested, just speculating)

Index: sys/cam/cam_periph.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/cam/cam_periph.c        (revision 236694)
+++ sys/cam/cam_periph.c        (working copy)
@@ -374,7 +374,7 @@
 {
        if (periph->refcount !=3D 0) {
                periph->refcount--;
-       } else {
+       } else if ((periph->flags & CAM_PERIPH_INVALID) =3D=3D 0) {
                panic("%s: release of %p when refcount is zero\n ", __func_=
_,
                      periph);
        }

--=20
wbr,
pluknet



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAE-mSO%2BwRwgb-GE1UXvVAffDJc8NoONQ05LoQFuhdoN%2BNLgPdw>