Date:      Thu, 7 Nov 2013 12:02:07 +1000
From:      David Gwynne <david@gwynne.id.au>
To:        Mark Johnston <markj@FreeBSD.org>
Cc:        Steve McCoy <smccoy@greatbaysoftware.com>, freebsd-scsi@freebsd.org
Subject:   Re: adding BBU relearn support to mfiutil
Message-ID:  <7351EE9D-4250-450F-9D1F-57E12102B6B2@gwynne.id.au>
In-Reply-To: <20131106230356.GA86666@charmander.sandvine.com>
References:  <20130304033836.GA33631@oddish> <1365196956.17311.13.camel@localhost> <20130406000809.GA96223@raichu> <527A7603.7090303@greatbaysoftware.com> <20131106230356.GA86666@charmander.sandvine.com>

On 7 Nov 2013, at 9:03 am, Mark Johnston <markj@FreeBSD.org> wrote:

> On Wed, Nov 06, 2013 at 12:01:55PM -0500, Charles Owens wrote:
>> Hi, we've been playing with this patch in the context of 8.4-RELEASE-p4
>> (we extracted r250483 and r250497 from stable/8 and applied to
>> releng/8.4).  I'm seeing some results that make me question whether or
>> not caching is really working correctly after a BBU relearn operation
>> has completed -- or maybe whether or not the new BBU patch is talking to
>> the LSI controller properly.
>>
>> Our test system had a BBU in the failed state (relearn needed).  We used
>> the "start learn" command and it seemed to go well, but strangely,
>> although the process seems to have completed, several days later the
>> status is still LEARN_CYCLE_REQUESTED (as seen with "mfiutil show battery").
>> This may be entirely normal -- maybe it says that because the autolearn
>> feature is now enabled?
>
> I suspect that the status is bogus and that the battery is in fact dead.
> There seem to be a few firmware bugs in the BBU status reporting, at
> least with iBBU07. In your output below, I see:
>
>        Design Capacity: 1215 mAh
>   Full Charge Capacity: 65262 mAh
>       Current Capacity: 61543 mAh
>
> which clearly isn't right. I've seen this problem before as well: over
> time, the full charge capacity decreases, and eventually it seems to
> wrap around to 65535. MegaCli (LSI's binary RAID management tool) reports
> exactly the same thing, so it's a problem with the controller firmware.
> If you look at MegaCli output you get things like "Absolute charge: 6000%".
> So I suspect that the status is incorrect as well; when I've run into
> this problem, I still see "status: normal".
>
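
To make the wraparound described above concrete, here is a rough sketch that reinterprets the quoted capacity figures as signed 16-bit values. That the firmware stores these fields as 16-bit integers is only an assumption inferred from the "wrap around to 65535" remark, and `as_signed16` is a made-up helper name, not anything in mfiutil.

```python
def as_signed16(value):
    """Reinterpret an unsigned 16-bit reading as a signed 16-bit integer."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned value")
    return value - 0x10000 if value >= 0x8000 else value


# Figures from the "mfiutil show battery" output quoted in this thread.
readings = {
    "Design Capacity": 1215,
    "Full Charge Capacity": 65262,
    "Current Capacity": 61543,
}

for name, mah in readings.items():
    # Assumption: the counter is 16 bits wide and underflowed past zero.
    print(f"{name}: {mah} mAh unsigned, {as_signed16(mah)} mAh as signed 16-bit")
```

If that assumption holds, the two large readings are small negative numbers that underflowed, which fits a capacity estimate that kept decreasing past zero.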

ive been staring at bbus on dell perc5s and perc6s recently after we had
a bunch of bbus get too old.

i havent seen the full charge or current capacity values wrap, but what
i did figure out is that the write cache wont be enabled if the SOH flag
is set in whats reported by the BBU STATE response. the SOH flag seems
to either be based on whether the firmware thinks the battery will last
a reasonable amount of time (like 72h or something), or whether the bbu
full capacity is above 30% of its design capacity.
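
As a rough illustration of the capacity side of that guess, the check could look something like the sketch below. The 30% figure is just the guess from the paragraph above, not a documented firmware constant, and the 150% upper bound and the function name are arbitrary choices made for this example.

```python
SOH_MIN_PERCENT = 30         # guess from this message, not a documented constant
PLAUSIBLE_MAX_PERCENT = 150  # arbitrary sanity bound, chosen for this sketch


def capacity_verdict(design_mah, full_charge_mah):
    """Classify a full-charge reading relative to the design capacity."""
    if design_mah <= 0:
        return "unknown"
    if full_charge_mah * 100 > PLAUSIBLE_MAX_PERCENT * design_mah:
        # Far above design capacity: the reading itself is not believable.
        return "implausible"
    if full_charge_mah * 100 <= SOH_MIN_PERCENT * design_mah:
        return "replace"
    return "ok"


# Design and full charge capacity from the output quoted in this thread.
print(capacity_verdict(1215, 65262))
```

With the quoted numbers the reading lands in the "implausible" bucket, so the capacity figures alone cant tell you much about the real state of that battery.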

either way, the reality is that batteries degrade and need to be
replaced. the nearly four year old battery that has gone through 120
learn cycles in your output below is what i call a good candidate for
replacement.

later megaraid firmwares (well, firmwares on later megaraids) have more
status bits that clearly indicate whether the firmware wants you to
replace the battery. it takes an annoying amount of interpretation on
the older ones.
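
If you end up doing that interpretation by script, a minimal sketch is to scrape the text output into a dict first. This assumes the "label: value" layout of the "mfiutil show battery" output quoted in this thread; the sample below is a subset of it, and `parse_battery_report` and `mah` are hypothetical helper names.

```python
SAMPLE = """\
mfi0: Battery State:
      Design Capacity: 1215 mAh
 Full Charge Capacity: 65262 mAh
        Charge Cycles: 120
      Next learn time: Tue Nov 26 20:06:40 2013
               Status: LEARN_CYCLE_REQUESTED
"""


def parse_battery_report(text):
    """Turn 'label: value' lines into a dict, skipping header lines."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.endswith(":"):
            continue  # blank line, or a header such as "mfi0: Battery State:"
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def mah(fields, label):
    """Return the leading integer of a value like '1215 mAh'."""
    return int(fields[label].split()[0])


report = parse_battery_report(SAMPLE)
print(report["Status"], mah(report, "Design Capacity"), mah(report, "Full Charge Capacity"))
```

From there the status string and the capacity numbers can be fed into whatever replacement policy you settle on.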

dlg

>>
>> The output of the "cache" status command is also a bit strange. Here is
>> the raw output of these status commands:
>>
>> # mfiutil cache mfid0
>> mfi0 volume mfid0 cache settings:
>>              I/O caching: disabled
>>            write caching: write-back
>> write cache with bad BBU: disabled
>>               read ahead: adaptive
>>        drive write cache: enabled
>> Cache disabled due to dead battery or ongoing battery relearn
>>
>>
>> # ./mfiutil show battery
>> mfi0: Battery State:
>>      Manufacture Date: 3/18/2010
>>         Serial Number: 77
>>          Manufacturer: LS1111001A
>>                 Model: 3598501
>>             Chemistry: LION
>>       Design Capacity: 1215 mAh
>>  Full Charge Capacity: 65262 mAh
>>      Current Capacity: 61543 mAh
>>         Charge Cycles: 120
>>        Current Charge: 94%
>>        Design Voltage: 3700 mV
>>       Current Voltage: 4081 mV
>>           Temperature: 23 C
>>      Autolearn period: 30 days
>>       Next learn time: Tue Nov 26 20:06:40 2013
>>  Learn delay interval: 0 hours
>>        Autolearn mode: enabled
>>                Status: LEARN_CYCLE_REQUESTED
>>
>>
>> /Why does cache status now say "Cache disabled due to dead battery or
>> ongoing battery relearn"/?  Shouldn't this no longer be the case since
>> I've run the "learn" operation?  Does this indicate that the I/O caching
>> is really disabled?
>
> I believe so. You can try changing the write caching policy to write-back
> with bad BBU and see if that re-enables the cache. If it does, that's
> more evidence that the BBU is dead and needs to be replaced.
>
>>
>> I'd appreciate any and all assistance.  Here's a bit of other info that
>> might be of interest:
>>
>> # mfiutil show adapter
>> mfi0 Adapter:
>>     Product Name: Integrated Intel(R) RAID Controller SROMBSASMP2
>>    Serial Number:
>>         Firmware: 11.0.1-0036
>>      RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
>>   Battery Backup: present
>>            NVRAM: 32K
>>   Onboard Memory: 512M
>>   Minimum Stripe: 8k
>>   Maximum Stripe: 1M
>>
>> # mfiutil show drives
>> mfi0 Physical Drives:
>>  1 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005JE> SAS E1:S0
>>  2 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005JV> SAS E1:S1
>>  3 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005KD> SAS E1:S4
>>  4 (  136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005BQ> SAS E1:S2
>>  5 (  136G) HOT SPARE <SEAGATE ST9146852SS 0005 serial=6TB005FJ> SAS E1:S3
>>
>> The storage volume is 4-drives, RAID10.  System has 16GB RAM, dual Xeon
>> E5530 CPUs, on an Intel S5520UR motherboard.
>
> It might be useful to check the output of "mfiutil show events -c info".
>
>>
>> Thanks!
>>
>> Charles Owens
>> Great Bay Software
>>
>>
>>
>> On Fri Apr 5 20:08:09 2013, Mark Johnston wrote:
>>>
>>> On Fri, Apr 05, 2013 at 02:22:36PM -0700, Sean Bruno wrote:
>>>>
>>>> On Sun, 2013-03-03 at 22:38 -0500, Mark Johnston wrote:
>>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I recently needed to add a couple of features to mfiutil related to BBU
>>>>> relearning. I've pasted a patch below which
>>>>>
>>>>> 1. adds extra fields to the output of "mfiutil show battery" showing BBU
>>>>> properties. This is essentially the output of
>>>>>
>>>>> # MegaCli -AdpBbuInfo -GetBbuProperties -aLL
>>>>>
>>>>> and consists of info about battery learning: the learn period, the
>>>>> time at which the controller will start the next relearn, and the BBU
>>>>> mode (which indicates whether the battery supports transparent
>>>>> relearning).
>>>>>
>>>>> 2. adds a couple of subcommands under "mfiutil bbu" which lets users set
>>>>> the BBU properties which can be set by MegaCli.
>>>>>
>>>>> 3. adds a command "mfiutil start learn" which immediately kicks off a
>>>>> battery relearn.
>>>>>
>>>>> These changes grew out of concern about the fact that the controller
>>>>> write cache is set to write-through mode during a relearn period (which
>>>>> usually lasts for several hours). This ended up causing some mysterious
>>>>> and intermittent performance issues, so I needed a way of getting more
>>>>> info about what was going on (using MegaCli isn't really an option for
>>>>> several reasons). Some BBUs support transparent relearning, which
>>>>> basically means that the controller write cache doesn't get turned off
>>>>> during a relearn. However, LSI's default config doesn't enable it, and
>>>>> now mfiutil can be used to do that (through "mfiutil bbu bbu-mode").
>>>>>
>>>>> I was hoping someone would be able to review the patch. If anyone's able
>>>>> and willing to test it, I'd very much appreciate feedback from that.
>>>>>
>>>>> Thanks!
>>>>> -Mark
>>>>
>>>>
>>>> Just to document for the record. Finally got around to testing this
>>>> today with Mark providing updates. Looks good overall with a couple of
>>>> nits that he is handling at the moment (man page and variable name
>>>> collision).
>>>
>>>
>>> The updated patch is here:
>>> http://people.freebsd.org/~markj/patches/20130405-mfi-bbu.diff
>>>
>>> I'll commit it in a few days if there aren't any problems.
>>>
>>> Thanks,
>>> -Mark
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
>>>
>>>
>>>



