From: "Artem Kuchin"
Delivered-To: freebsd-current@freebsd.org
Date: Mon, 29 Oct 2007 09:46:59 +0300
Organization: IT Legion
Message-ID: <00f101c819f7$833d5370$0c00a8c0@Artem>
Subject: Problems with gjournal or something else.

I am experiencing a very weird problem with a filesystem, and it seems
to be related to gjournal.

The system is:

FreeBSD 7-BETA1
RAID controller: 3WARE 7500x
device driver: twe
SMP enabled (Pentium D)
Mirror RAID

I have created the following partitions:

twed1s1a           1100MB    *
twed1s1b   swap    1024MB    SWAP
twed1s1d           5120MB    *
twed1s1e           30720MB   *
twed1s1f           261GB     *

I then rebooted, just in case something was cached.

Then I ran:

newfs -J -b 8192 -f 1024 -g 50000 -h 20 -i 40960 /dev/twed1s1f
gjournal load
gjournal label -f /dev/twed1s1f
tunefs -J enable -n disable /dev/twed1s1f
mount -o noatime /dev/twed1s1f.journal /NEW/suit

osiris# tunefs -p /dev/twed1s1f
tunefs: ACLs: (-a)                                         disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 disabled
tunefs: gjournal: (-J)                                     enabled
tunefs: maximum blocks per file in a cylinder group: (-e)  1024
tunefs: average file size: (-f)                            50000
tunefs: average number of files in a directory: (-s)       20
tunefs: minimum percentage of free space: (-m)             8%
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)

# newfs command for /dev/twed1s1f (/dev/twed1s1f)
newfs -O 2 -a 16 -b 8192 -d 8192 -e 1024 -f 1024 -g 50000 -h 20 -m 8 -o time -s 273771329 /dev/twed1s1f

Then I started a huge and long copying process from the old RAID 5
array (about 200GB of data). Some time later I found the machine
practically frozen because the log file was filling up with errors:

Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275085824, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278362624, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279272857600, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278493696, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275216896, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278624768, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279272988672, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275347968, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278755840, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279273119744, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278886912, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275479040, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279279017984, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279273250816, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279279149056, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275610112, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279279280128, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279273381888, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279279411200, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275741184, length=131072)]error = 5

Since error 5 is EIO, I started a verify on the controller:
everything is OK.

I ran cat /dev/random > /NEW/suit/aaa.dat, filling the whole fs with
a huge file: OK.
I ran dd if=/dev/twed1s1f of=/dev/null bs=1M: OK.

Then I re-newfs-ed this fs without -J, unloaded gjournal and did the
same copying. It took several hours and went just fine.

So it is not a hardware problem, and it seems to be related to gjournal.

One more weird thing happened here: gjournal complained that BIO_FLUSH
is not supported by the driver.
However, AFAIK twe works via the SCSI subsystem, and the author of
gjournal said somewhere that he had implemented BIO_FLUSH for SCSI. He
specifically mentioned that he had tested twe and twa and that they
both support BIO_FLUSH.

Also, I think the offset value in the error messages is out of range
for this filesystem.

The controller has 64MB of cache on board, and the author of gjournal
said in some discussion that if BIO_FLUSH support is missing and the
controller cache is larger than gjournal's cache, then there might be
problems. I did not find any specific value for the gjournal cache.
So the problem may be related to this issue (something gets messed up),
but I am not sure.

Any ideas, anyone?

--
Regards,
Artem
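
For reference, the suspicion that the logged offsets are out of range can be checked with simple arithmetic. What follows is only a rough sketch, not a diagnosis, and it rests on two assumptions that the output above does not confirm: that the -s value printed by tunefs -p counts 1024-byte fragments (this gives about 261 GiB, which matches the partition, whereas 512-byte sectors would give only about 130 GiB), and that gjournal label, run without -s, reserved its documented default of 1GB for the journal at the end of the same provider. The function names are made up for this illustration.

```python
# Rough range check of the logged WRITE offsets (a sketch, not a diagnosis).

FRAG_SIZE = 1024            # from "newfs ... -f 1024"
FS_SIZE_FRAGS = 273771329   # from "newfs ... -s 273771329"
# ASSUMPTION: the -s value is in 1024-byte fragments, not 512-byte sectors.

JOURNAL_BYTES = 1 << 30
# ASSUMPTION: the journal occupies gjournal's default 1 GiB at the end of
# the provider; the small gjournal metadata area is ignored here.

WRITE_LEN = 131072          # "length=131072" in every log line

# Lowest, first, and highest offsets from the kernel log.
LOGGED_OFFSETS = [279272857600, 279275085824, 279279411200]


def fs_bytes():
    """Size of the filesystem as created by newfs, in bytes."""
    return FS_SIZE_FRAGS * FRAG_SIZE


def classify(offset, length=WRITE_LEN):
    """Say where a write of `length` bytes at `offset` would end."""
    end = offset + length
    total = fs_bytes()
    data_area = total - JOURNAL_BYTES
    if end > total:
        return "past end of filesystem"
    if end > data_area:
        return "inside filesystem, past assumed data area"
    return "inside assumed data area"


if __name__ == "__main__":
    print("filesystem bytes:", fs_bytes())
    print("assumed data area bytes:", fs_bytes() - JOURNAL_BYTES)
    for off in LOGGED_OFFSETS:
        print(off, "->", classify(off))
```

Under these assumptions, every logged write ends inside the 280341840896-byte filesystem but starts after byte 279268099072, that is, in the last 1 GiB that the journal would occupy. If either assumption is wrong, the result does not hold.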