Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 12 Mar 2010 22:26:59 +0100
From:      Kai Gallasch <gallasch@free.de>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        freebsd-fs@FreeBSD.org, Pawel Jakub Dawidek <pjd@FreeBSD.org>, freebsd-stable@FreeBSD.org
Subject:   Re: proliant server lockups with freebsd-amd64-stable (2010-03-10)
Message-ID:  <20100312222659.0198dd03@orwell.free.de>
In-Reply-To: <4B9A8A27.8050608@FreeBSD.org>
References:  <20100311133916.42ba69b0@orwell.free.de> <20100312115028.GG1819@garage.freebsd.pl> <4B9A8A27.8050608@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Am Fri, 12 Mar 2010 20:38:31 +0200
schrieb Alexander Motin <mav@FreeBSD.org>:

> Pawel Jakub Dawidek wrote:
> > On Thu, Mar 11, 2010 at 01:39:16PM +0100, Kai Gallasch wrote:
> >> I have some trouble with an opteron server locking up
> >> spontaneously. It looses all networks connectivity and even
> >> through console I can get no shell.
> >>
> >> Lockups occur mostly under disk load (periodic daily, bacula backup
> >> running, make buildworld/buildkernel) and I can provoke them
> >> easily.
> > [...]
> >>     4     0     0     0  LL     *cissmtx  0xffffff04ed820c00
> >> [g_down]
> > [...]
> >> 100046                   L      *cissmtx  0xffffff04ed820c00
> >> [irq257: ciss0]
> > [...]
> > 
> > I was analizing similar problem as potential ZFS bug. It turned out
> > to be bug in ciss(4) and I believe mav@ (CCed) has fix for that.
> 
> That my patch is already at 8-STABLE since r204873 of 2010-03-08. Make
> sure you have it.

Updating collection src-all/cvs
..
..
 Edit src/sys/dev/ciss/ciss.c
 Edit src/sys/dev/ciss/cissvar.h

Didn't have it! Must have been just a few hours I missed it,
when I built the last kernel. So I will rebuild my kernel and
give it a spin later on.

> In this case trap stopped process at ciss_get_request(), which indeed
> called holding cissmtx lock. But there is no place to sleep or loop
> there, so may be it was just spontaneous. With bugs I was fixing there
> was a chance to loop indefinitely between ciss and CAM on resource
> constraint. That increases chance for such situation to be caught.
> 
> You may try also look what's going on with `top -HS` and `systat -vm
> 1`.

FYI. Without your patch of ciss.c and cissvar.h a lockup with top -HS
and systat -vm running gives following information below. Is there a
pattern visible relating to your patch of the ciss driver?

 Kai.


------ vmstat -vm ------

   3 users    Load  0.21  0.36  0.45                  Mar 12 21:47

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act 2805456   40804 62699320    79936  12358k  count
All 6182560   95796 1136820k   212452          pages
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        cow   16018 total
            491       533   28  576   19  179  174        zfod        atkbd0 1
                                                          ozfod       uart0 irq4
12.9%Sys  12.5%Intr  0.1%User  0.0%Nice 74.5%Idle        %ozfod       ata0 irq14
|    |    |    |    |    |    |    |    |    |    |       daefr       uhci0 45
======+++++++                                             prcfr  2000 cpu0: time
                                        87 dtbuf          totfr    19 bce0 256
Namei     Name-cache   Dir-cache    100000 desvn          react       ciss0 257
   Calls    hits   %    hits   %     40811 numvn          pdwak  2000 cpu1: time
                                     17300 frevn          pdpgs  2000 cpu4: time
                                                          intrn  2000 cpu5: time
Disks   da0   da1   da2   da3   da4 pass0 pass1   4995516 wire   2000 cpu6: time
KB/t   0.00  0.00  0.00  0.00  0.00  0.00  0.00   2593276 act    1999 cpu7: time
tps       0     0     0     0     0     0     0    369560 inact  2000 cpu3: time
MB/s   0.00  0.00  0.00  0.00  0.00  0.00  0.00      9568 cache  2000 cpu2: time
%busy     0     0     0     0     0     0     0  12348752 free
                                                  1252272 buf


----------- top -HS ------------

last pid: 42561;  load averages:  0.35,  0.38,  0.46                    up 0+11:24:36  21:53:39
658 processes: 13 running, 623 sleeping, 21 waiting, 1 lock
CPU:  0.6% user,  0.0% nice, 12.6% system, 25.0% interrupt, 61.8% idle
Mem: 2559M Active, 363M Inact, 4892M Wired, 9548K Cache, 1223M Buf, 12G Free
Swap: 21G Total, 21G Free

  PID USERNAME      PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
   11 root          171 ki31     0K   128K CPU3    3 672:22 100.00% {idle: cpu3}
   11 root          171 ki31     0K   128K CPU1    1 663:50 100.00% {idle: cpu1}
   11 root          171 ki31     0K   128K RUN     4 649:18 100.00% {idle: cpu4}
   12 root          -32    -     0K   384K CPU7    0   6:38 100.00% {swi4: clock}
    4 root           -8    -     0K    16K CPU5    5   1:16 100.00% g_down
   12 root          -64    -     0K   384K CPU2    0   1:15 100.00% {swi2: cambio}
   11 root          171 ki31     0K   128K CPU6    6 672:13 97.27% {idle: cpu6}
   11 root          171 ki31     0K   128K CPU0    0 622:18 96.29% {idle: cpu0}
   11 root          171 ki31     0K   128K RUN     7 676:57 10.99% {idle: cpu7}
 2046 zope1          46    0   468M   379M ucond   6   7:30  1.76% {python}
   11 root          171 ki31     0K   128K RUN     5 663:27  0.00% {idle: cpu5}
   11 root          171 ki31     0K   128K RUN     2 656:03  0.00% {idle: cpu2}
   12 root          -68    -     0K   384K WAIT    0  14:54  0.00% {irq256: bce0}
 2046 zope1          46    0   468M   379M ucond   4   6:02  0.00% {python}
 1719 zope2          48    0   263M   178M ucond   3   2:00  0.00% {python2.4}
 1724 zope2          46    0   321M   235M ucond   5   1:52  0.00% {python2.4}
 6314 root           45    0 10432K  2564K ttyin   4   1:36  0.00% systat
    3 root           -8    -     0K    16K -       2   1:29  0.00% g_up
 6301 zope4          46    0   311M   225M ucond   1   1:28  0.00% {python2.4}
 2320 zope5          44    0   281M   197M ucond   1   1:22  0.00% {python2.4}
   20 root           44    -     0K    16K zio->i  6   1:21  0.00% syncer
 6301 zope4          46    0   311M   225M ucond   5   1:19  0.00% {python2.4}
 1729 zope6          44    0   311M   228M ucond   4   1:15  0.00% {python2.4}
 2320 zope5          45    0   281M   197M ucond   4   1:13  0.00% {python2.4}
 2320 zope5          50    0   281M   197M ucond   7   1:08  0.00% {python2.4}
 6301 zope4          44    0   311M   225M ucond   4   1:02  0.00% {python2.4}
 5138 dave           44    0   341M   248M select  4   0:58  0.00% {initial thread}
 1729 zope6          45    0   311M   228M ucond   6   0:56  0.00% {python2.4}
 8158  19421         44    0 27644K  7560K select  4   0:52  0.00% centerim
 1743 zope7          44    0   287M   201M zio->i  1   0:51  0.00% {python}
 2046 zope1          44    0   468M   379M select  1   0:48  0.00% {initial thread}
19943  19420         44    0 31740K 13044K select  4   0:46  0.00% centerim
 6301 zope4          44    0   311M   225M ucond   0   0:40  0.00% {python2.4}
 1743 zope7          44    0   287M   201M select  0   0:38  0.00% {initial thread}
 2052 zope8          44    0   300M   212M select  3   0:37  0.00% {initial thread}
 1724 zope3          46    0   321M   235M ucond   4   0:33  0.00% {python2.4}
 2320 zope5          44    0   281M   197M ucond   1   0:31  0.00% {python2.4}
 2052 zope8          45    0   300M   212M ucond   7   0:30  0.00% {python}
 2320 zope5          44    0   281M   197M select  6   0:29  0.00% {initial thread}
   12 root          -64    -     0K   384K *cissm  1   0:28  0.00% {irq257: ciss0}
 6301 zope4          44    0   311M   225M select  6   0:28  0.00% {initial thread}
    0 root          -16    0     0K   816K -       0   0:25  0.00% {spa_zio}
 2052 zope8          47    0   300M   212M ucond   2   0:24  0.00% {python}
 1729 zope6          44    0   311M   228M select  3   0:24  0.00% {initial thread}
 1719 zope2          44    0   263M   178M select  1   0:23  0.00% {initial thread}
 1724 zope3          44    0   321M   235M select  1   0:23  0.00% {initial thread}
   12 root          -48    -     0K   384K WAIT    0   0:22  0.00% {swi0: uart uart}
 1734 zope9          44    0   249M   163M select  4   0:21  0.00% {initial thread}
    0 root          -16    0     0K   816K -       4   0:16  0.00% {spa_zio_6}
    0 root          -16    0     0K   816K -       3   0:16  0.00% {spa_zio_4}
    0 root          -16    0     0K   816K -       0   0:16  0.00% {spa_zio_1}
    0 root          -16    0     0K   816K -       4   0:16  0.00% {spa_zio_2}
    0 root          -16    0     0K   816K -       1   0:16  0.00% {spa_zio_5}
    0 root          -16    0     0K   816K -       2   0:16  0.00% {spa_zio_0}
    0 root          -16    0     0K   816K -       1   0:16  0.00% {spa_zio_7}
    0 root          -16    0     0K   816K -       5   0:16  0.00% {spa_zio_3}
 4288 mysql          44    0 62460K 21608K select  4   0:16  0.00% {mysqld}
   59 root           -8    -     0K    92K tx->tx  0   0:12  0.00% {txg_thread_enter}
 6129 mysql          44    0 62788K 23420K select  0   0:12  0.00% {mysqld}
   59 root           -8    -     0K    92K vgeom:  5   0:09  0.00% {vdev da2}
   59 root           -8    -     0K    92K vgeom:  6   0:09  0.00% {vdev da3}
 4288 mysql          44    0 62460K 21608K select  0   0:08  0.00% {mysqld}
 6129 mysql          44    0 62788K 23420K select  0   0:08  0.00% {mysqld}
   13 root           44    -     0K    16K -       0   0:08  0.00% yarrow
    0 root          -16    0     0K   816K -       0   0:08  0.00% {spa_zio_6}
    0 root          -16    0     0K   816K -       7   0:08  0.00% {spa_zio_5}
    0 root          -16    0     0K   816K -       5   0:08  0.00% {spa_zio_2}
    0 root          -16    0     0K   816K -       2   0:08  0.00% {spa_zio_4}
    0 root          -16    0     0K   816K -       3   0:08  0.00% {spa_zio_3}
    0 root          -16    0     0K   816K -       4   0:07  0.00% {spa_zio_1}
    0 root          -16    0     0K   816K -       0   0:07  0.00% {spa_zio_7}
    0 root          -16    0     0K   816K -       1   0:07  0.00% {spa_zio_0}
 5099 root           44    0 11380K  4292K CPU4    4   0:07  0.00% top
   12 root          -44    -     0K   384K WAIT    0   0:07  0.00% {swi1: netisr 0}
 4864 news           45    0 24964K 12108K nanslp  6   0:06  0.00% perl
 5138 dave           76    0   341M   248M ucond   7   0:05  0.00% {python}
 4863 news           76    0  8260K  1920K wait    5   0:05  0.00% sh
 5155 root           44    0   442M  6592K kqread  1   0:04  0.00% {cache-kqueue}
 1734 zope9          44    0   249M   163M ucond   4   0:04  0.00% {python2.4}
 4862 news           44    0   139M 43180K select  6   0:04  0.00% innd
 8173  19421         44    0 30928K  8496K select  4   0:04  0.00% irssi











Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100312222659.0198dd03>