From: Paul Mather
To: Ian Lepore
Cc: Ralf Wenk, freebsd-arm@freebsd.org
Subject: Re: state of FreeBSD ARM (less stable than 6 months ago)
Date: Tue, 12 May 2015 12:25:40 -0400
Message-Id: <9F66E210-24E2-42C4-BEF9-5234F56686BA@gromit.dlib.vt.edu>
In-Reply-To: <1431438508.6170.258.camel@freebsd.org>
References: <5550C252.6030001@foxvalley.net> <1431357226.2428197.265704673.6A544F74@webmail.messagingengine.com> <555177D9.8080001@foxvalley.net> <1431438508.6170.258.camel@freebsd.org>
List-Id: "Porting FreeBSD to ARM processors."

On May 12, 2015, at 9:48 AM, Ian Lepore wrote:

> On Tue, 2015-05-12 at 08:45 +0200, Ralf Wenk wrote:
>> On Mon, 11 May 2015, at 21:47:37, Dan Raymond wrote:
>>> On 5/11/2015 9:13 AM, Mark Felder wrote:
>>>> On Mon, May 11, 2015, at 09:53, Dan Raymond wrote:
>>>>> I've been running an email and web server using FreeBSD 11 on a
>>>>> Raspberry Pi B+ since November.  It has crashed 3 times since then
>>>>> (roughly every two months).  I'm currently running r277334.  I thought
>>>>> I'd try the latest build to see if stability has improved.  I purchased a
>>>>> Raspberry Pi 2 and used the latest crochet to build r282738.  No
>>>>> problems building it and it booted up fine.  However, it crashes about
>>>>> an hour into building some ports I use for my server (nginx, php,
>>>>> etc.).  I tried twice last night and it crashed both times.  Is anybody
>>>>> looking into these stability issues?
>>>>>
>>>> RPi2 support is something like less than a week old for SMP and DMA
>>>> transport. I'm not sure more than a handful of people have actually
>>>> tried it yet. The bugs here will be worked out in time, but if you have
>>>> any core dumps or info that can assist in tracking down issues you're
>>>> experiencing that would certainly be appreciated.
>>>>
>>>
>>> These panics always seem to be mmcsd related.  I doubt it has anything
>>> to do with RPi2 or SMP.
>>>
>>> sdhci_bcm0-slot0: Controller timeout
>>> sdhci_bcm0-slot0: ============== REGISTER DUMP ==============
>>> sdhci_bcm0-slot0: Sys addr: 0x4d295a00 | Version: 0x00009902
>>> sdhci_bcm0-slot0: Blk size: 0x00000200 | Blk cnt: 0x00000020
>>> sdhci_bcm0-slot0: Argument: 0x002d19c0 | Trn mode: 0x0000193a
>>> sdhci_bcm0-slot0: Present: 0x01ff0506 | Host ctl: 0x00000003
>>> sdhci_bcm0-slot0: Power: 0x0000000f | Blk gap: 0x00000000
>>> sdhci_bcm0-slot0: Wake-up: 0x00000000 | Clock: 0x00000507
>>> sdhci_bcm0-slot0: Timeout: 0x0000000e | Int stat: 0x00000010
>>> sdhci_b
>>>
>>>
>>>
>>> mmcsd0: Error indicated: 1 Timeout
>>> g_vfs_done():mmcsd0s2a[WRITE(offset=1460830208, length=24576)]error = 5
>>> panic: No b_bufobj 0xd767ca00
>>> cpuid = 1
>>> KDB: enter: panic
>>> [ thread pid 12 tid 100013 ]
>>> Stopped at $d.7: ldrb r15, [r15, r15, ror r15]!
>>> db>
>>
>> I see such panics every two to three months. They happen on a RPi B
>> and RPi B+ as well. I have tried different SD cards on the B and
>> the B+ of course. So I think it is not related to SD-card, manufacturer
>> or RPi board.
>>
>> Usually they happen in the middle of the night when syslogd(8) tries to
>> write something. I have never seen them happen when the RPi has some work
>> to do, e.g. is compiling a port.
>>
>> Continuing out of the debugger prints the usual messages, but on reboot
>> the RPi freezes. Only a power cycle will get it back to operating.
>>
>> Very often after such a panic happened my RPi gets "unstable" and panics
>> within the next 48 hours again with the same cause. I found out that, if
>> that happened and I force an fsck ignoring the journal there will be some
>> minor issue fixed and the RPi is stable again. For the next 2 or 3 months.
>>
>>
>> Ralf
>
> IMO, the moral of that story is:  Never use softupdates with journaling
> enabled.
> For years there have been reports on the mailing lists of fsck
> failures when journaling is enabled (not arm-specific).  Sometimes a few
> months go by without a report and you wonder if it got fixed with some
> checkin you didn't notice, then the reports crop up again.  My
> conclusion is that journalling has never really worked right.
>
> The only advantage of journaling is to speed up fsck on huge
> filesystems.  An sdcard with a handful of GB isn't huge.

It would be really nice if the default for ARM images was not to enable
soft updates journalling on the root file system.  I've found it to
cause problems to the point that the first thing I do with a
newly-installed FreeBSD/arm image nowadays is to run "tunefs -j disable" on
the SD card root file system.  (I still use soft updates, which I don't
find to be a problem.)

(BTW, I do have sympathy for the point of view that says, "But if we
turn it off, and people aren't using it, how will we ever test it/fix
it?...")

Cheers,

Paul.
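
P.S. For anyone wanting to try the tunefs step, here is a rough sketch rather than a tested recipe. It assumes the root file system is UFS on /dev/mmcsd0s2a (the device named in the g_vfs_done() message quoted above; yours may differ) and that it is unmounted or mounted read-only, e.g. after booting to single-user mode, since tunefs cannot change an active file system. The ROOTDEV and DRYRUN variables and the run/disable_suj helpers are names made up for this illustration.

```shell
#!/bin/sh
# Sketch only. ROOTDEV, DRYRUN, run() and disable_suj() are hypothetical
# names for this example; the device is the one from the quoted panic log.
ROOTDEV="${ROOTDEV:-/dev/mmcsd0s2a}"

# By default the commands are only printed; set DRYRUN=0 to execute them.
run() {
    if [ "${DRYRUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

disable_suj() {
    run tunefs -p "$ROOTDEV"          # show the current tunables first
    run tunefs -j disable "$ROOTDEV"  # turn off soft updates journaling only
    run tunefs -p "$ROOTDEV"          # check that soft updates are still on
}
```

Calling disable_suj as-is just prints the three commands; running it with DRYRUN=0 from single-user mode would actually apply them. Printing by default is deliberate, so nothing touches the card until the device name has been checked.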