From owner-freebsd-arm@FreeBSD.ORG  Thu Sep 12 16:05:22 2013
Sender: Warner Losh <wlosh@bsdimp.com>
Subject: Re: Panic mounting root on BeagleBone Black
Mime-Version: 1.0 (Apple Message framework v1085)
Content-Type: text/plain; charset=windows-1252
From: Warner Losh <imp@bsdimp.com>
In-Reply-To: <1379001216.1111.633.camel@revolution.hippie.lan>
Date: Thu, 12 Sep 2013 10:05:12 -0600
Content-Transfer-Encoding: quoted-printable
Message-Id: <01C5A0CC-A0FC-4635-8370-EAFDC8E8A854@bsdimp.com>
References: <47E403AE-01A2-4AC8-8028-41F0298FAC3E@freebsd.org>
 <1378997738.1111.631.camel@revolution.hippie.lan>
 <F85C2A12-21DC-41C6-9037-15AFD0B1AD7E@bsdimp.com>
 <1379001216.1111.633.camel@revolution.hippie.lan>
To: Ian Lepore <ian@FreeBSD.org>
X-Mailer: Apple Mail (2.1085)
Cc: "freebsd-arm@freebsd.org" <freebsd-arm@FreeBSD.org>
X-BeenThere: freebsd-arm@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Porting FreeBSD to the StrongARM Processor <freebsd-arm.freebsd.org>


On Sep 12, 2013, at 9:53 AM, Ian Lepore wrote:

> On Thu, 2013-09-12 at 09:44 -0600, Warner Losh wrote:
>> On Sep 12, 2013, at 8:55 AM, Ian Lepore wrote:
>>
>>> On Wed, 2013-09-11 at 06:43 -0700, Tim Kientzle wrote:
>>>> Just built a new image for BBB from SVN r255438.
>>>>
>>>> At the second boot, I got this:
>>>>
>>>> Mounting local file systems:.
>>>> mmcsd0: Error indicated: 1 Timeout
>>>> g_vfs_done():mmcsd0s2a[READ(offset=2016903168, length=4096)]error = 5
>>>> vnode_pager_getpages: I/O read error
>>>> vm_fault: pager read error, pid 126 (ps)
>>>> mmcsd0: Error indicated: 1 Timeout
>>>> g_vfs_done():mmcsd0s2a[READ(offset=131072, length=32768)]error = 5
>>>> sdhci_ti0-slot0: Got data interrupt 0x00000010, but there is no active command.
>>>> sdhci_ti0-slot0: ============== REGISTER DUMP ==============
>>>> sdhci_ti0-slot0: Sys addr: 0x00000000 | Version:  0x00003101
>>>> sdhci_ti0-slot0: Blk size: 0x00000200 | Blk cnt:  0x00000010
>>>> sdhci_ti0-slot0: Argument: 0x0024679e | Trn mode: 0x0000193a
>>>> sdhci_ti0-slot0: Present:  0x01f70000 | Host ctl: 0x00000006
>>>> sdhci_ti0-slot0: Power:    0x0000000d | Blk gap:  0x00000000
>>>> sdhci_ti0-slot0: Wake-up:  0x00000000 | Clock:    0x00000007
>>>> sdhci_ti0-slot0: Timeout:  0x0000000d | Int stat: 0x00000000
>>>> sdhci_ti0-slot0: Int enab: 0x017f00fb | Sig enab: 0x017f00fb
>>>> sdhci_ti0-slot0: AC12 err: 0x00000000 | Slot int: 0x00000000
>>>> sdhci_ti0-slot0: Caps:     0x06e10080 | Max curr: 0x00000000
>>>> sdhci_ti0-slot0: ===========================================
>>>>
>>>> ... few more similar messages, then ...
>>>>
>>>> mmcsd0: Error indicated: 1 Timeout
>>>> g_vfs_done():mmcsd0s2a[WRITE(offset=20808192, length=512)]error = 5
>>>> g_vfs_done():mmcsd0s2a[WRITE(offset=1276346368, length=24576)]error = 5
>>>> panic: brelse: inappropriate B_PAGING or B_CLUSTER bp 0xcd148778
>>>> [bt snipped]
>>>>
>>>
>>> This was a single occurrence, right?  Like you're not dead in the water
>>> or anything?
>>>
>>> There's insanity in that info... the register dump shows a multi-block
>>> write (8kbytes) was set up, but the command that timed out was a read.
>>> If a prior write had timed out why isn't there a g_vfs_done() error
>>> logged for it?
>>>
>>> I think what we really need is some better error recovery in the mmc and
>>> sd layers.  Retrying a failed IO is cheap and easy.  More complex
>>> recovery is possible too (power cycling and re-initializing the card
>>> and/or controller).  But that has its own difficulties -- what if the
>>> nature of the problem was that the user swapped cards? -- you don't want
>>> to retry a write under those conditions.
>>
>> I'd disagree with this...  Retrying often is the wrong thing to do.
>> If the write didn't work the first time, why would it work the second?
>> Looks like a programming bug here in controlling the sdhci controller
>> since we got errors, then we got an interrupt with no pending commands.
>> This suggests that our timeout isn't quite right...
>>
>> Warner
>>
>
> Retrying too often or endlessly is wrong, but IMO so is not retrying at
> all, especially when the standard specifies error recovery strategies.

I thought we followed the standard's error recovery stuff... Maybe newer
versions have more extensive retry things than in the past...  But the
retry should be at the transaction to the card level, not any higher...
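
A minimal sketch of what a bounded retry at the card-transaction level
could look like, in plain C. Everything here is hypothetical and for
illustration only: enum xfer_err, xfer_with_retry() and fake_issue() are
not names from the mmc or sdhci code, and treating timeouts and CRC
errors as the only retryable cases is an assumption, not something taken
from the standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes; these are NOT the FreeBSD MMC_ERR_* values. */
enum xfer_err { XFER_OK = 0, XFER_TIMEOUT, XFER_BADCRC, XFER_NO_CARD };

/* Hypothetical callback that issues exactly one transaction to the card. */
typedef enum xfer_err (*xfer_fn)(void *ctx);

/*
 * Assumption: only bus-level glitches are worth a retry.  A missing or
 * swapped card (XFER_NO_CARD) must never be retried blindly.
 */
static bool
xfer_err_is_transient(enum xfer_err e)
{
	return (e == XFER_TIMEOUT || e == XFER_BADCRC);
}

/*
 * Issue one transaction, retrying at most max_retries additional times.
 * Returns the result of the last attempt; if attempts is not NULL, the
 * number of attempts actually made is stored there.
 */
static enum xfer_err
xfer_with_retry(xfer_fn issue, void *ctx, int max_retries, int *attempts)
{
	enum xfer_err e;
	int n = 0;

	assert(issue != NULL && max_retries >= 0);
	do {
		e = issue(ctx);
		n++;
	} while (xfer_err_is_transient(e) && n <= max_retries);
	if (attempts != NULL)
		*attempts = n;
	return (e);
}

/* Test double: times out fail_count times, then succeeds. */
struct fake_card {
	int fail_count;
	int calls;
};

static enum xfer_err
fake_issue(void *ctx)
{
	struct fake_card *fc = ctx;

	fc->calls++;
	return (fc->calls <= fc->fail_count ? XFER_TIMEOUT : XFER_OK);
}
```

The retry count is a hard bound, so a persistently failing card still
surfaces its error to the upper layers instead of looping, and the
layers above never see or repeat the individual attempts.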

Warner