Message-ID: <20011217235144.76997.qmail@web14703.mail.yahoo.com>
Date: Mon, 17 Dec 2001 15:51:44 -0800 (PST)
From: Dennis Dai
Subject: rsync causes solid deadlock
To: questions@freebsd.org

Hi all,

I'm having problems with rsync hanging my server solid. Some background
first: several months ago I built a server to be a hot backup for another
server, which has about 242GB of disk space holding mainly images (this is
an image processing company). The configuration of the server:

- ASUS A7V-133E 1GHz
- 256MB SDRAM
- 30GB WD hard drive holding the OS
- 512MB swap
- 3Com 3C905C originally, changed to an Intel card (i82555 based) today
- 4 Promise Ultra100 TX2 cards
- 8 80GB WD hard drives attached to the 4 Promise cards
- RAID-5 vinum volume on those 8 drives
- all partitions mounted with soft updates
- cron job that uses rsync to back up from the main server

At first it ran fine. But recently, as the main server filled up with files
(right now about 190GB), the backup server started crashing while running
rsync. When it crashed, it hung the server solid and didn't produce a crash
dump after reboot, although I had configured it to do so. There were also no
traces in the log.

OK, there might be a problem with rsync. So I tried to copy the whole thing
over using tar over ssh:

    ssh root@main tar cpf - /var/images | (cd /data ; tar xpf -)

but that also had problems.
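The tar-over-ssh copy above can be wrapped in a small shell function for repeated runs. This is only a sketch: `copy_tree` and `REMOTE` are hypothetical names, it assumes the paths contain no spaces (ssh re-joins its arguments into one remote command line), and it uses `&&` instead of `;` so that a failed `cd` cannot make tar unpack into whatever the current directory happens to be. Unlike the original command, which recreates the full path under /data (tar normally strips the leading slash, giving /data/var/images), this version places the tree at DEST/images.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): copy the directory
# tree SRC into DEST through a tar pipe, the same technique as the
# tar-over-ssh command in the post.
#   REMOTE  - command prefix used to run the archiving tar, e.g.
#             REMOTE="ssh root@main"; leave empty to copy locally.
# Assumes paths contain no spaces (ssh re-joins its arguments).
copy_tree() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # Archive from SRC's parent so the tree lands at DEST/<basename of SRC>.
    # "&&" rather than ";" means tar never unpacks into the wrong directory
    # if the cd fails; "p" on extraction preserves permissions.
    $REMOTE tar cf - -C "$(dirname "$src")" "$(basename "$src")" |
        (cd "$dest" && tar xpf -)
}

# Example: pull /var/images from the main server into /data/images.
#   REMOTE="ssh root@main"
#   copy_tree /var/images /data
```

The function's exit status is that of the extracting tar, so a failure on the sending side is not reported; a cron wrapper would need its own check for that.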
This time I could still access the console, and it was full of error
messages like "watchdog timed out". I searched on Google and found that
this could be a problem with the 3Com card, so I decided to change to an
Intel card. After changing the card, I ran rsync again and it locked up
after a while. This is what I got from the log:

===================
Dec 17 10:43:20 opa /kernel: ad4: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:07 opa /kernel: ata2: resetting devices .. done
Dec 17 10:44:07 opa /kernel: ad6: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:07 opa /kernel: ata3: resetting devices .. done
Dec 17 10:44:07 opa /kernel: ad9: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:07 opa /kernel: ata4: resetting devices .. done
Dec 17 10:44:07 opa /kernel: ad10: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:07 opa /kernel: ata5: resetting devices .. done
Dec 17 10:44:07 opa /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:07 opa /kernel: ata0: resetting devices .. done
Dec 17 10:44:07 opa /kernel: fxp0: SCB timeout: 0x70 0x0 0x90 0x400
Dec 17 10:44:07 opa /kernel: fxp0: SCB timeout: 0xf0 0x0 0x90 0x400
Dec 17 10:44:07 opa /kernel: ad4: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:07 opa /kernel: ata2: resetting devices .. done
Dec 17 10:44:07 opa /kernel: ad6: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:07 opa /kernel: ata3: resetting devices .. done
Dec 17 10:44:07 opa /kernel: fxp0: device timeout
Dec 17 10:44:07 opa /kernel: fxp0: SCB timeout: 0xe0 0x0 0x90 0x400
Dec 17 10:44:07 opa /kernel: fxp0: SCB timeout: 0x86 0x0 0x90 0x400
Dec 17 10:44:07 opa /kernel: fxp0: SCB timeout: 0xc0 0x0 0x90 0x400
Dec 17 10:44:07 opa /kernel: fxp0: DMA timeout
Dec 17 10:44:07 opa /kernel: fxp0: SCB timeout: 0x90 0x0 0x90 0x400
Dec 17 10:44:07 opa /kernel: fxp0: DMA timeout
Dec 17 10:44:07 opa /kernel: fxp0: SCB timeout: 0x90 0x0 0x90 0x400
Dec 17 10:44:07 opa /kernel: fxp0: SCB timeout: 0x90 0x0 0x90 0x400
Dec 17 10:44:07 opa /kernel: ad9: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:07 opa /kernel: ata4: resetting devices .. done
Dec 17 10:44:07 opa /kernel: ad10: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata5: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata0: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad4: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata2: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad6: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata3: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad9: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata4: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad10: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata5: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata0: resetting devices .. done
Dec 17 10:44:08 opa /kernel: fxp0: command queue timeout
Dec 17 10:44:08 opa /kernel: ad4: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata2-master: timeout waiting for command=ef s=d0 e=00
Dec 17 10:44:08 opa /kernel: ad4: trying fallback to PIO mode
Dec 17 10:44:08 opa /kernel: ata2: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad6: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata3-master: timeout waiting for command=ef s=d0 e=00
Dec 17 10:44:08 opa /kernel: ad6: trying fallback to PIO mode
Dec 17 10:44:08 opa /kernel: ata3: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad9: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ad9: trying fallback to PIO mode
Dec 17 10:44:08 opa /kernel: ata4: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad10: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ad10: trying fallback to PIO mode
Dec 17 10:44:08 opa /kernel: ata5: resetting devices .. done
Dec 17 10:44:08 opa /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 17 10:44:08 opa /kernel: ata0-master: timeout waiting for command=ef s=d0 e=00
Dec 17 10:44:08 opa /kernel: ad0: trying fallback to PIO mode
Dec 17 10:44:08 opa /kernel: ata0: resetting devices .. done
Dec 17 10:44:08 opa /kernel: fxp0: SCB timeout: 0x81 0x0 0x90 0x400
Dec 17 10:44:10 opa last message repeated 14 times
Dec 17 10:44:11 opa /kernel: ad5: READ command timeout tag=0 serv=0 - resetting
Dec 17 10:44:11 opa /kernel: ata2: resetting devices .. done
======================

OK, then I tried the tar over ssh method again. It locked up again in less
than 10 minutes, and this time it didn't leave any trace in the log at all
(the log above is actually the only thing I got from all the crashes so
far); it just locked up the server, with no crash dump after reboot.

So I'm kind of stuck as to what is causing the deadlock. Does anyone have
any ideas?

TIA,
Dennis

PS.
dmesg below:

=============================
Dec 17 12:07:23 opa /kernel: Copyright (c) 1992-2001 The FreeBSD Project.
Dec 17 12:07:23 opa /kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Dec 17 12:07:23 opa /kernel: The Regents of the University of California. All rights reserved.
Dec 17 12:07:23 opa /kernel: FreeBSD 4.4-RELEASE-p1 #3: Wed Dec 12 09:03:56 PST 2001
Dec 17 12:07:23 opa /kernel: root@opa:/usr/obj/usr/src/sys/OPA
Dec 17 12:07:23 opa /kernel: Timecounter "i8254" frequency 1193182 Hz
Dec 17 12:07:23 opa /kernel: Timecounter "TSC" frequency 1008990763 Hz
Dec 17 12:07:23 opa /kernel: CPU: AMD Athlon(tm) Processor (1008.99-MHz 686-class CPU)
Dec 17 12:07:23 opa /kernel: Origin = "AuthenticAMD" Id = 0x642 Stepping = 2
Dec 17 12:07:23 opa /kernel: Features=0x183f9ff
Dec 17 12:07:23 opa /kernel: AMD Features=0xc0440000<,AMIE,DSP,3DNow!>
Dec 17 12:07:23 opa /kernel: real memory = 268353536 (262064K bytes)
Dec 17 12:07:23 opa /kernel: avail memory = 258154496 (252104K bytes)
Dec 17 12:07:23 opa /kernel: Preloaded elf kernel "kernel" at 0xc02f2000.
Dec 17 12:07:23 opa /kernel: Pentium Pro MTRR support enabled
Dec 17 12:07:23 opa /kernel: Using $PIR table, 9 entries at 0xc00f1690
Dec 17 12:07:23 opa /kernel: npx0: on motherboard
Dec 17 12:07:23 opa /kernel: npx0: INT 16 interface
Dec 17 12:07:23 opa /kernel: pcib0: on motherboard
Dec 17 12:07:23 opa /kernel: pci0: on pcib0
Dec 17 12:07:23 opa /kernel: pcib2: at device 1.0 on pci0
Dec 17 12:07:23 opa /kernel: pci1: on pcib2
Dec 17 12:07:23 opa /kernel: pci1: at 0.0 irq 11
Dec 17 12:07:23 opa /kernel: isab0: at device 7.0 on pci0
Dec 17 12:07:23 opa /kernel: isa0: on isab0
Dec 17 12:07:23 opa /kernel: atapci0: port 0xd800-0xd80f at device 7.1 on pci0
Dec 17 12:07:23 opa /kernel: ata0: at 0x1f0 irq 14 on atapci0
Dec 17 12:07:23 opa /kernel: ata1: at 0x170 irq 15 on atapci0
Dec 17 12:07:23 opa /kernel: atapci1: port 0x9000-0x900f,0x9400-0x9403,0x9800-0x9807,0xa000-0xa003,0xa400-0xa407 mem 0xf7800000-0xf7803fff irq 12 at device 10.0 on pci0
Dec 17 12:07:23 opa /kernel: ata2: at 0xa400 on atapci1
Dec 17 12:07:23 opa /kernel: ata3: at 0x9800 on atapci1
Dec 17 12:07:23 opa /kernel: atapci2: port 0x7400-0x740f,0x7800-0x7803,0x8000-0x8007,0x8400-0x8403,0x8800-0x8807 mem 0xf7000000-0xf7003fff irq 12 at device 11.0 on pci0
Dec 17 12:07:23 opa /kernel: ata4: at 0x8800 on atapci2
Dec 17 12:07:23 opa /kernel: ata5: at 0x8000 on atapci2
Dec 17 12:07:23 opa /kernel: fxp0: port 0x7000-0x703f mem 0xf6000000-0xf601ffff,0xf6800000-0xf6800fff irq 10 at device 14.0 on pci0
Dec 17 12:07:23 opa /kernel: fxp0: Ethernet address 00:02:b3:4b:a4:56
Dec 17 12:07:23 opa /kernel: inphy0: on miibus0
Dec 17 12:07:23 opa /kernel: inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
Dec 17 12:07:23 opa /kernel: pcib1: on motherboard
Dec 17 12:07:23 opa /kernel: pci2: on pcib1
Dec 17 12:07:23 opa /kernel: orm0: