From owner-freebsd-current@FreeBSD.ORG  Tue Oct 30 01:42:54 2007
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2261C16A468
	for <freebsd-current@freebsd.org>; Tue, 30 Oct 2007 01:42:54 +0000 (UTC)
	(envelope-from matrix@itlegion.ru)
Received: from corpmail.itlegion.ru (corpmail.itlegion.ru [84.21.226.211])
	by mx1.freebsd.org (Postfix) with SMTP id 765A513C4A6
	for <freebsd-current@freebsd.org>; Tue, 30 Oct 2007 01:42:53 +0000 (UTC)
	(envelope-from matrix@itlegion.ru)
Received: (qmail 26184 invoked from network); 29 Oct 2007 09:47:05 +0300
Received: from unknown (HELO Artem) (192.168.0.12)
	by 84.21.226.211 with SMTP; 29 Oct 2007 09:47:05 +0300
X-AntiVirus: Checked by Dr.Web [version: 4.44, engine: 4.44.0.09170,
	virus records: 251535, updated: 28.10.2007]
Message-ID: <00f101c819f7$833d5370$0c00a8c0@Artem>
From: "Artem Kuchin" <matrix@itlegion.ru>
To: <freebsd-current@freebsd.org>
Date: Mon, 29 Oct 2007 09:46:59 +0300
Organization: IT Legion
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset="koi8-r"; reply-type=original
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.3138
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198
Subject: Problems with gjournal or something else.
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 30 Oct 2007 01:42:54 -0000

I am experiencing a very weird problem with a filesystem,
and it seems to be related to gjournal.

It is FreeBSD 7-BETA1
RAID controller: 3WARE 7500x
device driver: twe
SMP enabled (Pentium D)
Mirror raid.

I have created the following partitions:

twed1s1a  <none>       1100MB *
twed1s1b  swap         1024MB SWAP
twed1s1d  <none>       5120MB *
twed1s1e  <none>      30720MB *
twed1s1f  <none>        261GB *

I rebooted, just in case something was cached.

Then I ran:

newfs -J -b 8192 -f 1024 -g 50000 -h 20 -i 40960 /dev/twed1s1f

gjournal load
gjournal label -f /dev/twed1s1f
tunefs -J enable -n disable /dev/twed1s1f
mount -o noatime /dev/twed1s1f.journal /NEW/suit

osiris# tunefs -p /dev/twed1s1f
tunefs: ACLs: (-a)                                         disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 disabled
tunefs: gjournal: (-J)                                     enabled
tunefs: maximum blocks per file in a cylinder group: (-e)  1024
tunefs: average file size: (-f)                            50000
tunefs: average number of files in a directory: (-s)       20
tunefs: minimum percentage of free space: (-m)             8%
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)

# newfs command for /dev/twed1s1f (/dev/twed1s1f)
newfs -O 2 -a 16 -b 8192 -d 8192 -e 1024 -f 1024 -g 50000 -h 20 -m 8 -o time -s 273771329 /dev/twed1s1f


Then I started a long copy of about 200GB of data from the old RAID 5 array.
Some time later I found the machine practically frozen because the log file
was filling up with this error:

Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275085824, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278362624, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279272857600, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278493696, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275216896, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278624768, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279272988672, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275347968, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278755840, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279273119744, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279278886912, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275479040, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279279017984, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279273250816, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279279149056, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275610112, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279279280128, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279273381888, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279279411200, length=131072)]error = 5
Oct 28 22:18:42 osiris kernel: g_vfs_done():twed1s1f.journal[WRITE(offset=279275741184, length=131072)]error = 5

Since error 5 is EIO, I ran a verify on the controller: everything is OK.
Then I ran
cat /dev/random > /NEW/suit/aaa.dat
to fill the whole filesystem with one huge file: OK.
Then I ran
dd if=/dev/twed1s1f of=/dev/null bs=1M
which also completed OK.
Then I re-created the filesystem with newfs without -J, unloaded gjournal, and
repeated the same copy. It took several hours and went just fine.
So it is not a hardware problem, and it seems to be related to gjournal.

One more weird thing happened here: gjournal complained that BIO_FLUSH is not
supported by the driver. However, AFAIK twe works via the SCSI subsystem, and the
author of gjournal said somewhere that he had implemented BIO_FLUSH for SCSI. He
specifically mentioned that he had tested twe and twa and that both support BIO_FLUSH.

Also, I think the offset value in the error messages is out of range for this filesystem.
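
A rough way to check that guess is to compare the logged offsets with the
filesystem size. The sketch below is only an illustration under stated
assumptions: that the "-s 273771329" value printed above counts 1024-byte
fragments (which would agree with the 261GB partition), and that gjournal
reserved its default 1 GiB journal at the end of the same provider. Neither
assumption is confirmed by the output above, and the exact on-disk layout
(metadata sector and so on) is ignored.

```python
# Sanity check of the logged write offsets against the filesystem size.
# ASSUMPTIONS (not confirmed by the output above):
#   * the "-s" value is counted in 1024-byte fragments;
#   * gjournal keeps a 1 GiB journal at the end of the same provider.

FRAGMENT_SIZE = 1024                # from "newfs ... -f 1024"
FS_SIZE_FRAGMENTS = 273771329       # from "newfs ... -s 273771329"
ASSUMED_JOURNAL_BYTES = 1 << 30     # assumed default journal size (1 GiB)
WRITE_LENGTH = 131072               # "length=131072" in every log line

# Lowest and highest offsets seen in the g_vfs_done() messages.
lowest_offset = 279272857600
highest_offset = 279279411200

# Size of the filesystem as newfs laid it out on the raw partition.
fs_bytes = FS_SIZE_FRAGMENTS * FRAGMENT_SIZE

# Size left for data if the journal takes its share from the same partition.
journaled_bytes = fs_bytes - ASSUMED_JOURNAL_BYTES

# Do the failed writes fit inside the filesystem as created?
inside_filesystem = highest_offset + WRITE_LENGTH <= fs_bytes

# Do they start past the end of the (assumed) journaled data area?
past_journaled_area = lowest_offset >= journaled_bytes

print("filesystem size (bytes):    ", fs_bytes)
print("assumed data area (bytes):  ", journaled_bytes)
print("writes inside filesystem:   ", inside_filesystem)
print("writes past assumed area:   ", past_journaled_area)
```

If both checks come out true, the offsets would be valid for the filesystem as
newfs created it on twed1s1f, but beyond what twed1s1f.journal can hold under
these assumptions.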

The controller has 64MB of cache on board, and the author of gjournal said in some
discussion that if BIO_FLUSH support is missing and the controller cache is larger than
gjournal's cache, there might be problems. I did not find any specific value for
the gjournal cache. So the problem may be related to this issue (something gets messed up),
but I am not sure.

Any ideas, anyone?

--
Regards,
Artem