Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 25 Jun 2002 23:21:44 -0600
From:      "Kenneth D. Merry" <ken@kdm.org>
To:        Don Lewis <dl-freebsd@catspoiler.org>
Cc:        scsi@FreeBSD.ORG
Subject:   Re: SCSI timeouts under extreme load
Message-ID:  <20020625232144.A62998@panzer.kdm.org>
In-Reply-To: <200206242211.g5OMBMM1084295@gw.catspoiler.org>; from dl-freebsd@catspoiler.org on Mon, Jun 24, 2002 at 03:11:30PM -0700
References:  <200206242211.g5OMBMM1084295@gw.catspoiler.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, Jun 24, 2002 at 15:11:30 -0700, Don Lewis wrote:
> I'm seeing SCSI timeouts when my disks are under extreme load.  When
> this happens, iostat reports the activity in the range of 100 to 200
> transfers per second or even more, which I suspect is mostly hitting the
> disk cache.  I suspect that the problem is that the disk gets so busy
> that it defers some commands indefinitely, and it looks like CAM
> attempts to recover by resetting the device.  Typically something gets
> wedged waiting for a page to get swapped in and the system becomes
> totally unresponsive until CAM breaks out the big hammer.  I'm currently
> running 4.6-STABLE, but I've been seeing this problem for the last
> several months.
> 
> I think things might work more smoothly if the I/O service times were
> tracked and I/O requests were throttled back if the service times
> started getting unreasonably long or if any I/O requests were
> outstanding an excessively long time.

That would be nice nice to have.

> >From dmesg.boot:
> 
> ahc0: <Adaptec aic7890/91 Ultra2 SCSI adapter> port 0xd000-0xd0ff mem 0xe0000000
> -0xe0000fff irq 5 at device 6.0 on pci0
> aic7890/91: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs
> sa0 at ahc0 bus 0 target 4 lun 0
> sa0: <TANDBERG SLR5 4/8GB =09:> Removable Sequential Access SCSI-2 device
> sa0: 3.300MB/s transfers
> da1 at ahc0 bus 0 target 1 lun 0
> da1: <SEAGATE ST336737LW 0105> Fixed Direct Access SCSI-3 device
> da1: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
> da1: 35242MB (72176566 512 byte sectors: 255H 63S/T 4492C)
> da0 at ahc0 bus 0 target 0 lun 0
> da0: <SEAGATE ST39173LW 4292> Fixed Direct Access SCSI-2 device
> da0: 80.000MB/s transfers (40.000MHz, offset 15, 16bit), Tagged Queueing Enabled
> da0: 8683MB (17783240 512 byte sectors: 255H 63S/T 1106C)

Justin looked at the driver messages (below), and said that it doesn't
appear that there is anything wrong with the driver.  Rather it looks like
you're right about the cause -- the drive (da0) just isn't responding.
(Until we hit it with a BDR.)

My suggestion would be to reduce the number of outstanding tags on da0 with
'camcontrol tags'.  You'll have to see what the number of openings is, and
then perhaps reduce it from there and see if your problem goes away.

e.g.:

# camcontrol tags da1 -v
(pass1:ahc1:0:1:0): dev_openings  63
(pass1:ahc1:0:1:0): dev_active    0
(pass1:ahc1:0:1:0): devq_openings 63
(pass1:ahc1:0:1:0): devq_queued   0
(pass1:ahc1:0:1:0): held          0
(pass1:ahc1:0:1:0): mintags       2
(pass1:ahc1:0:1:0): maxtags       255
# camcontrol tags da1 -N 32 -v
(pass1:ahc1:0:1:0): tagged openings now 32
(pass1:ahc1:0:1:0): dev_openings  32
(pass1:ahc1:0:1:0): dev_active    0
(pass1:ahc1:0:1:0): devq_openings 32
(pass1:ahc1:0:1:0): devq_queued   0
(pass1:ahc1:0:1:0): held          0
(pass1:ahc1:0:1:0): mintags       2
(pass1:ahc1:0:1:0): maxtags       255

Another alternative would be to see if Seagate has any new firmware for
that drive.  They typically don't put firmware on their web site, so you'll
probably have to call Seagate tech support to get it.

> Jun 24 02:15:20 mousie /kernel: swap_pager: indefinite wait buffer: device: #da/
> 0x20001, blkno: 67392, size: 4096
> Jun 24 02:15:52 mousie /kernel: swap_pager: indefinite wait buffer: device: #da/
> 0x20001, blkno: 67392, size: 4096
> Jun 24 02:15:52 mousie /kernel: (da0:ahc0:0:0:0): SCB 0x4 - timed out
> Jun 24 02:15:52 mousie /kernel: ahc0: Dumping Card State while idle, at SEQADDR
> 0x9
> Jun 24 02:15:52 mousie /kernel: ACCUM = 0x4, SINDEX = 0x64, DINDEX = 0x65, ARG_2
>  = 0x3
> Jun 24 02:15:53 mousie /kernel: HCNT = 0x0 SCBPTR = 0x10
> Jun 24 02:15:53 mousie /kernel: SCSISEQ = 0x12, SBLKCTL = 0xa
> Jun 24 02:15:53 mousie /kernel: DFCNTRL = 0x0, DFSTATUS = 0x89
> Jun 24 02:15:53 mousie /kernel: LASTPHASE = 0x1, SCSISIGI = 0x0, SXFRCTL0 = 0x80
> Jun 24 02:15:53 mousie /kernel: SSTAT0 = 0x0, SSTAT1 = 0xa
> Jun 24 02:15:53 mousie /kernel: STACK == 0x3, 0x17b, 0x165, 0x0   
> Jun 24 02:15:53 mousie /kernel: SCB count = 130
> Jun 24 02:15:53 mousie /kernel: Kernel NEXTQSCB = 84
> Jun 24 02:15:53 mousie /kernel: Card NEXTQSCB = 84
> Jun 24 02:15:53 mousie /kernel: QINFIFO entries:
> Jun 24 02:15:53 mousie /kernel: Waiting Queue entries:
> Jun 24 02:15:53 mousie /kernel: Disconnected Queue entries: 16:14 23:2 2:3 15:9
> 24:47 20:59 0:109 31:49 25:16 22:4
> Jun 24 02:15:53 mousie /kernel: QOUTFIFO entries:
> Jun 24 02:15:53 mousie /kernel: Sequencer Free SCB List: 17 21 18 27 8 11 30 28
> 1 14 3 7 9 5 6 13 10 29 12 19 4 26
> Jun 24 02:15:53 mousie /kernel: Sequencer SCB Info: 0(c 0x64, s 0x7, l 0, t 0x6d
> ) 1(c 0x60, s 0x7, l 0, t 0xff) 2(c 0x66, s 0x7, l 0, t 0x3) 3(c 0x60, s 0x7, l 
> 0, t 0xff) 4(c 0x60, s 0x7, l 0, t 0xff) 5(c 0x60, s 0x7, l 0, t 0xff) 6(c 0x60,
>  s 0x7, l 0, t 0xff) 7(c 0x60, s 0x7, l 0, t 0xff) 8(c 0x60, s 0x7, l 0, t 0xff)
>  9(c 0x60, s 0x7, l 0, t 0xff) 10(c 0x60, s 0x7, l 0, t 0xff) 11(c 0x60, s 0x7,
> l 0, t 0xff) 12(c 0x60, s 0x7, l 0, t 0xff) 13(c 0x60, s 0x7, l 0, t 0xff) 14(c 
> 0x60, s 0x7, l 0, t 0xff) 15(c 0x64, s 0x7, l 0, t 0x9) 16(c 0x64, s 0x7, l 0, t
>  0xe) 17(c 0x60, s 0x7, l 0, t 0xff) 18(c 0x60, s 0x7, l 0, t 0xff) 19(c 0x60, s
>  0x7, l 0, t 0xff) 20(c 0x64, s 0x7, l 0, t 0x3b) 21(c 0x60, s 0x7, l 0, t 0xff)
>  22(c 0x64, s 0x7, l 0, t 0x4) 23(c 0x64, s 0x7, l 0, t 0x2) 24(c 0x64, s 0x7, l
>  0, t 0x2f) 25(c 0x66, s 0x7, l 0, t 0x10) 26(c 0x60, s 0x7, l 0, t 0xff) 27(c 0
> x60, s 0x7, l 0, t 0xff) 28(c 0x60, s 0x7, l 0, t 0xff) 29(c 0x60, s 0x7, l 0, t
>  0xff) 30(c 0x60, s 0x7, l 0, t 0xff) 31(c 0x64, s 0x7, l 0, t 0x31)
> Jun 24 02:15:54 mousie /kernel: Pending list: 14(c 0x60, s 0x7, l 0), 2(c 0x60, 
> s 0x7, l 0), 3(c 0x62, s 0x7, l 0), 9(c 0x60, s 0x7, l 0), 47(c 0x60, s 0x7, l 0
> ), 59(c 0x60, s 0x7, l 0), 109(c 0x60, s 0x7, l 0), 49(c 0x60, s 0x7, l 0), 16(c
>  0x62, s 0x7, l 0), 4(c 0x60, s 0x7, l 0)
> Jun 24 02:15:54 mousie /kernel: Kernel Free SCB list: 66 93 0 57 43 83 99 17 96 
> 29 10 56 35 42 65 15 117 19 31 67 115 33 51 101 21 37 39 103 40 13 50 52 94 7 18
>  91 11 102 58 81 119 24 20 98 116 44 6 85 110 114 26 118 22 113 30 80 129 36 104
>  97 92 5 38 105 12 54 107 28 46 112 108 1 100 82 48 95 27 90 32 45 55 8 25 68 41
>  111 34 69 106 53 23 86 87 88 89 70 71 72 73 74 75 76 77 78 79 60 61 62 63 64 12
> 8 127 126 125 124 123 122 121 120
> Jun 24 02:15:54 mousie /kernel: sg[0] - Addr 0x70ec000 : Length 4096
> Jun 24 02:15:54 mousie /kernel: sg[1] - Addr 0x56ed000 : Length 4096
> Jun 24 02:15:54 mousie /kernel: (da0:ahc0:0:0:0): Queuing a BDR SCB
> Jun 24 02:15:54 mousie /kernel: (da0:ahc0:0:0:0): Bus Device Reset Message Sent
> Jun 24 02:15:54 mousie /kernel: (da0:ahc0:0:0:0): no longer in timeout, status =
>  34b
> Jun 24 02:15:54 mousie /kernel: ahc0: Bus Device Reset on A:0. 10 SCBs aborted 

Ken
-- 
Kenneth Merry
ken@kdm.org

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20020625232144.A62998>