From owner-freebsd-stable@FreeBSD.ORG  Thu Mar 11 08:54:49 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 88CA7106564A;
	Thu, 11 Mar 2010 08:54:49 +0000 (UTC)
	(envelope-from borjam@sarenet.es)
Received: from proxypop1.sarenet.es (proxypop1.sarenet.es [194.30.0.99])
	by mx1.freebsd.org (Postfix) with ESMTP id D19048FC0A;
	Thu, 11 Mar 2010 08:54:48 +0000 (UTC)
Received: from [172.16.1.204] (unknown [192.148.167.2])
	by proxypop1.sarenet.es (Postfix) with ESMTP id 8F1635CDE;
	Thu, 11 Mar 2010 09:54:47 +0100 (CET)
Mime-Version: 1.0 (Apple Message framework v1077)
Content-Type: text/plain; charset=us-ascii
From: Borja Marcos <borjam@sarenet.es>
In-Reply-To: <20100311084527.2934034895hvgxaw@webmail.leidinger.net>
Date: Thu, 11 Mar 2010 09:54:47 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <764BD545-B86C-47DC-9004-964EB2216AF0@sarenet.es>
References: <864468D4-DCE9-493B-9280-00E5FAB2A05C@lassitu.de>
	<20100309122954.GE3155@garage.freebsd.pl>
	<EC9BC6B4-8D0E-4FE3-852F-0E3A24569D33@sarenet.es>
	<20100309125815.GF3155@garage.freebsd.pl>
	<CB854F58-03AF-46DD-8153-85FA96037C21@sarenet.es>
	<BFF1E2D6-B48A-4A5E-ACEE-8577FDB07820@sarenet.es>
	<20100310110202.GA1715@garage.freebsd.pl>
	<E04F91AA-B2C4-4166-A24A-74F1BEF01519@sarenet.es>
	<20100310173143.GD1715@garage.freebsd.pl>
	<20100311084527.2934034895hvgxaw@webmail.leidinger.net>
To: Alexander Leidinger <Alexander@Leidinger.net>
X-Mailer: Apple Mail (2.1077)
Cc: freebsd-fs@FreeBSD.org, Stable <freebsd-stable@FreeBSD.org>,
	FreeBSD@FreeBSD.ORG, Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject: Re: Many processes stuck in zfs
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 11 Mar 2010 08:54:49 -0000


On Mar 11, 2010, at 8:45 AM, Alexander Leidinger wrote:

> Quoting Pawel Jakub Dawidek <pjd@FreeBSD.org> (from Wed, 10 Mar 2010 18:31:43 +0100):
>
> There is a 4th possibility, if you can rule out everything else: bugs in the CPU. I stumbled upon this with ZFS (but UFS was exposing the problem much faster). The problem in my case was that the BIOS was not recognizing the CPU and as such was not uploading microcode updates.
>
> Borja, can you confirm that the CPU is correctly announced in FreeBSD (just look at "dmesg | grep CPU:" output, if it tells you it is a AMD or Intel XXX CPU it is correctly detected by the BIOS)?

A CPU bug? Weird. Very.

Let me explain the whole history of this.

We are using ZFS to maintain a couple of servers in an active/passive arrangement. At 30 second intervals we create a snapshot on the master server and send it to the slave. Actually I prefer this scheme to drbd-style arrangements, but that's another story ;)
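
For illustration, one replication cycle of this kind could be sketched as follows. This is not the actual replicating program: the snapshot names, the host name "slave", and the use of ssh as the transport are assumptions made for the example. The function only assembles the command lines and does not run them.

```python
import shlex


def replication_commands(dataset, prev_snap, new_snap, slave_host):
    """Return the shell commands for one replication cycle.

    Nothing is executed; the commands are only assembled as strings.
    prev_snap is None on the very first cycle (full send).
    """
    new = f"{dataset}@{new_snap}"
    snapshot_cmd = f"zfs snapshot {shlex.quote(new)}"

    if prev_snap is None:
        # No common snapshot yet: send the complete stream.
        send = f"zfs send {shlex.quote(new)}"
    else:
        # Incremental stream from the previous snapshot to the new one.
        prev = f"{dataset}@{prev_snap}"
        send = f"zfs send -i {shlex.quote(prev)} {shlex.quote(new)}"

    # Assumption: the stream is piped over ssh and received with -F,
    # which rolls the target back to its latest snapshot first.
    receive = (
        f"ssh {shlex.quote(slave_host)} "
        f"zfs receive -F {shlex.quote(dataset)}"
    )
    return [snapshot_cmd, f"{send} | {receive}"]


if __name__ == "__main__":
    for cmd in replication_commands("pool/src", "repl-0001", "repl-0002", "slave"):
        print(cmd)
```

A scheduler would run the two commands in order, wait for them to finish, and start the next cycle no sooner than 30 seconds after the previous one began, so that two receives never overlap.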

We started our tests and soon ran into problems: deadlocked filesystem. At one point I remember that the deadlock affected UFS as well, not only ZFS. I mean, having both ZFS and UFS, the system also lost access to the UFS filesystems when this happened.

Looking at the times when it happened, it coincided with one or both of these events: the periodic scripts running (which, among other things, traverse the whole filesystem) and/or a backup being made with Bacula. Either way, there seemed to be a problem: read activity on a dataset on which I was receiving a snapshot at the same time could lead to a deadlock. I am sure I have never tried to receive two snapshots simultaneously; the replicating program guarantees it.

As the servers had to be rolled into production, and such tests with real servers can be quite time consuming, I set up a couple of FreeBSD virtual machines, using VMware Fusion (version 2 then, now version 3) on a MacBook (MacBook 4,1, Intel Core 2 Duo, 2.1 GHz) and tried to reproduce it.

To reproduce it, I set up a "master" machine, with /usr/src and /usr/obj on a dataset (pool/src), replicating it at 30 second intervals to another virtual machine, the slave. On the slave, I launch "tar" in an infinite loop, so that the contents of the replicated dataset (pool/src) are copied to another dataset (pool/thecopy).

With that running (and, remember, with replications at 30 second intervals, or longer if a replication takes a long time), I run a make buildworld on the master machine. The destination soon deadlocks.
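
The read load on the slave can be sketched roughly like this. The exact tar invocation is not the one from my tests: the flags shown and the mount points /pool/src and /pool/thecopy (the default mount points for those dataset names) are assumptions for the example. Again, the code only builds the command strings.

```python
import shlex


def copy_pipeline(src_dir, dst_dir):
    """Build a tar pipeline that reads one tree and unpacks it into another."""
    return (
        f"tar -cf - -C {shlex.quote(src_dir)} . "
        f"| tar -xf - -C {shlex.quote(dst_dir)}"
    )


def reader_loop(src_dir, dst_dir):
    """Wrap the pipeline in an endless sh loop, as run on the slave."""
    return f"while :; do {copy_pipeline(src_dir, dst_dir)}; done"


if __name__ == "__main__":
    # Assumed mount points for pool/src and pool/thecopy.
    print(reader_loop("/pool/src", "/pool/thecopy"))
```

The point of the loop is simply to keep constant read activity on the dataset that is receiving snapshots, while buildworld on the master keeps the incremental streams large.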

I have tried to fiddle with the virtual machine, for example, trying to offer a single or dual core CPU, and there's no difference. With dual cores it *seems* to deadlock earlier, but I'm not sure. For the latest test results I've posted, I was using a single core CPU.

The original machines on which I detected the problem (a problem I have since reproduced on virtual machines running on VMware Fusion) are Dell PowerEdge 2950, and this is the CPU description:


Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz (2496.25-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x1067a  Stepping = 10
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x40ce3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1,XSAVE>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 8589934592 (8192 MB)
avail memory = 8250003456 (7867 MB)
ACPI APIC Table: <DELL   PE_SC3  >
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
 cpu4 (AP): APIC ID:  4
 cpu5 (AP): APIC ID:  5
 cpu6 (AP): APIC ID:  6
 cpu7 (AP): APIC ID:  7
ioapic0: Changing APIC ID to 8
ioapic0 <Version 2.0> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <DELL PE_SC3> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 900


The virtual machine (VMware Fusion 3.0.0, MacBook, Mac OS X 10.6.2) reports this:
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Core(TM)2 Duo CPU     T8100  @ 2.10GHz (2116.62-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x10676  Stepping = 6
  Features=0xfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS>
  Features2=0x80082201<SSE3,SSSE3,CX16,SSE4.1,<b31>>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 1153433600 (1100 MB)
avail memory = 1090441216 (1039 MB)
ACPI APIC Table: <PTLTD          APIC  >
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 <Version 1.1> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <INTEL 440BX> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
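
For anyone who wants to run the same "dmesg | grep CPU:" check across several machines, here is a minimal sketch that pulls the identification out of the "CPU:" line. The regular expression is my own guess at the line format, based only on the two outputs above, so it may not match other FreeBSD versions or CPUs. It only extracts the fields; whether a named model really means the BIOS loaded microcode is still a judgement call.

```python
import re

# Assumed layout, taken from the two dmesg excerpts above:
#   CPU: <model string> (<speed>-MHz <class>-class CPU)
CPU_LINE = re.compile(
    r"^CPU:\s+(?P<model>.+?)\s+"
    r"\((?P<mhz>[\d.]+)-MHz\s+(?P<cls>\S+)-class CPU\)\s*$"
)


def cpu_identification(dmesg_text):
    """Return model, speed and class from the first 'CPU:' line, or None."""
    for line in dmesg_text.splitlines():
        match = CPU_LINE.match(line)
        if match:
            # Collapse the runs of spaces inside the brand string.
            model = " ".join(match.group("model").split())
            return {
                "model": model,
                "mhz": float(match.group("mhz")),
                "class": match.group("cls"),
            }
    return None


if __name__ == "__main__":
    sample = (
        'Timecounter "i8254" frequency 1193182 Hz quality 0\n'
        "CPU: Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz "
        "(2496.25-MHz K8-class CPU)\n"
    )
    print(cpu_identification(sample))
```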


In order to compare to Solaris, I installed a virtual machine running Solaris 10 as well, and used it as a target for the replication. The same test didn't deadlock and it seemed to work like a charm.

Sometimes I've tried to run more than one "tar" job in parallel instead of just one. It just makes it deadlock earlier, no other difference.

Any more tests I can do?


Borja.