Date: Thu, 7 Nov 2013 12:02:07 +1000
From: David Gwynne <david@gwynne.id.au>
To: Mark Johnston <markj@FreeBSD.org>
Cc: Steve McCoy <smccoy@greatbaysoftware.com>, freebsd-scsi@freebsd.org
Subject: Re: adding BBU relearn support to mfiutil
Message-ID: <7351EE9D-4250-450F-9D1F-57E12102B6B2@gwynne.id.au>
In-Reply-To: <20131106230356.GA86666@charmander.sandvine.com>
References: <20130304033836.GA33631@oddish> <1365196956.17311.13.camel@localhost> <20130406000809.GA96223@raichu> <527A7603.7090303@greatbaysoftware.com> <20131106230356.GA86666@charmander.sandvine.com>

On 7 Nov 2013, at 9:03 am, Mark Johnston <markj@FreeBSD.org> wrote:

> On Wed, Nov 06, 2013 at 12:01:55PM -0500, Charles Owens wrote:
>> Hi, we've been playing with this patch in the context of 8.4-RELEASE-p4
>> (we extracted r250483 and r250497 from stable/8 and applied to
>> releng/8.4). I'm seeing some results that make me question whether or
>> not caching is really working correctly after a BBU relearn operation
>> has completed -- or maybe whether or not the new BBU patch is talking to
>> LSI controller properly.
>>
>> Our test system had a BBU in the failed state (relearn needed). We used
>> the "start learn command" and it seemed to go well, but strangely, when
>> process is seems to have completed, and now several days later, status
>> is still LEARN_CYCLE_REQUESTED (as seen with "mfiutil show battery").
>> This may be entirely normal -- maybe it says that because the autolearn
>> feature is now enabled?
>
> I suspect that the status is bogus and that the battery is in fact dead.
> There seem to be a few firmware bugs in the BBU status reporting, at
> least with iBBU07. In your output below, I see:
>
> Design Capacity: 1215 mAh
> Full Charge Capacity: 65262 mAh
> Current Capacity: 61543 mAh
>
> which clearly isn't right. I've seen this problem before as well: over
> time, the full charge capacity decreases, and eventually it seems to
> wrap around to 65535. MegaCli (LSI's binary RAID management tool) reports
> exactly the same thing, so it's a problem with the controller firmware.
> If you look at MegaCli output you get things like "Absolute charge: 6000%".
> So I suspect that the status is incorrect as well; when I've run into
> this problem, I still see "status: normal".
>

ive been staring at bbus on dell perc5s and perc6s recently after we had
a bunch of bbus get too old.

i havent seen the full charge or current capacity values wrap, but what
i did figure out is that the write cache wont be enabled if the SOH flag
is set in whats reported by the BBU STATE response. the SOH flag seems
to either be based on whether the firmware thinks the battery will last
a reasonable amount of time (like 72h or something), or whether the bbu
full capacity is above 30% of its design capacity.

either way, the reality is that batteries degrade and need to be
replaced. the nearly four year old battery that has gone through 120
learn cycles in your output below is what i call a good candidate for
replacement.

later megaraid firmwares (well, firmwares on later megaraids) have more
status bits that clearly indicate whether the firmware wants you to
replace the battery. it takes an annoying amount of interpretation on
the older ones.

dlg

>>
>> The "cache" status command also suggests also is a bit strange. Here is
>> the raw output of these status commands:
>>
>> # mfiutil cache mfid0
>> mfi0 volume mfid0 cache settings:
>> I/O caching: disabled
>> write caching: write-back
>> write cache with bad BBU: disabled
>> read ahead: adaptive
>> drive write cache: enabled
>> Cache disabled due to dead battery or ongoing battery relearn
>>
>>
>> # ./mfiutil show battery
>> mfi0: Battery State:
>> Manufacture Date: 3/18/2010
>> Serial Number: 77
>> Manufacturer: LS1111001A
>> Model: 3598501
>> Chemistry: LION
>> Design Capacity: 1215 mAh
>> Full Charge Capacity: 65262 mAh
>> Current Capacity: 61543 mAh
>> Charge Cycles: 120
>> Current Charge: 94%
>> Design Voltage: 3700 mV
>> Current Voltage: 4081 mV
>> Temperature: 23 C
>> Autolearn period: 30 days
>> Next learn time: Tue Nov 26 20:06:40 2013
>> Learn delay interval: 0 hours
>> Autolearn mode: enabled
>> Status: LEARN_CYCLE_REQUESTED
>>
>>
>> /Why does cache status now say "Cache disabled due to dead battery or
>> ongoing battery relearn"/?
>> Shouldn't this no longer be the case since
>> I've run the "learn" operation? Does this indicate that the I/O caching
>> is really disabled?
>
> I believe so. You can try changing the write caching policy to write-back
> with bad BBU and see if that re-enables the cache. If it does, that's
> more evidence that the BBU is dead and needs to be replaced.
>
>>
>> I'd appreciate any and all assistance. Here's a bit of other info that
>> might be of interest:
>>
>> # mfiutil show adapter
>> mfi0 Adapter:
>> Product Name: Integrated Intel(R) RAID Controller SROMBSASMP2
>> Serial Number:
>> Firmware: 11.0.1-0036
>> RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
>> Battery Backup: present
>> NVRAM: 32K
>> Onboard Memory: 512M
>> Minimum Stripe: 8k
>> Maximum Stripe: 1M
>>
>> # mfiutil show drives
>> mfi0 Physical Drives:
>> 1 ( 136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005JE> SAS E1:S0
>> 2 ( 136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005JV> SAS E1:S1
>> 3 ( 136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005KD> SAS E1:S4
>> 4 ( 136G) ONLINE    <SEAGATE ST9146852SS 0005 serial=6TB005BQ> SAS E1:S2
>> 5 ( 136G) HOT SPARE <SEAGATE ST9146852SS 0005 serial=6TB005FJ> SAS E1:S3
>>
>> The storage volume is 4-drives, RAID10. System has 16GB RAM, dual Xeon
>> E5530 CPUs, on an Intel S5520UR motherboard.
>
> It might be useful to check the output of "mfiutil show events -c info".
>
>>
>> Thanks!
>>
>> Charles Owens
>> Great Bay Software
>>
>>
>>
>> On Fri Apr 5 20:08:09 2013, Mark Johnston wrote:
>>>
>>> On Fri, Apr 05, 2013 at 02:22:36PM -0700, Sean Bruno wrote:
>>>>
>>>> On Sun, 2013-03-03 at 22:38 -0500, Mark Johnston wrote:
>>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I recently needed to add a couple of features to mfiutil related to BBU
>>>>> relearning. I've pasted a patch below which
>>>>>
>>>>> 1. adds extra fields to the output of "mfiutil show battery" showing BBU
>>>>> properties. This is essentially the output of
>>>>>
>>>>> # MegaCli -AdpBbuInfo -GetBbuProperties -aLL
>>>>>
>>>>> and consists of info about battery learning: the learn period, the
>>>>> time at which the controller will start the next relearn, and the BBU
>>>>> mode (which indicates whether the battery supports transparent
>>>>> relearning).
>>>>>
>>>>> 2. adds a couple of subcommands under "mfiutil bbu" which lets users set
>>>>> the BBU properties which can be set by MegaCli.
>>>>>
>>>>> 3. adds a command "mfiutil start learn" which immediately kicks off a
>>>>> battery relearn.
>>>>>
>>>>> These changes grew out of concern about the fact that the controller
>>>>> write cache is set to write-through mode during a relearn period (which
>>>>> usually lasts for several hours). This ended up causing some mysterious
>>>>> and intermittent performance issues, so I needed a way of getting more
>>>>> info about what was going on (using MegaCli isn't really an option for
>>>>> several reasons). Some BBUs support transparent relearning, which
>>>>> basically means that the controller write cache doesn't get turned off
>>>>> during a relearn. However, LSI's default config doesn't enable it, and
>>>>> now mfiutil can be used to do that (through "mfiutil bbu bbu-mode").
>>>>>
>>>>> I was hoping someone would be able to review the patch. If anyone's able
>>>>> and willing to test it, I'd very much appreciate feedback from that.
>>>>>
>>>>> Thanks!
>>>>> -Mark
>>>>
>>>>
>>>> Just to document for the record. Finally got around to testing this
>>>> today with Mark providing updates. Looks good overall with a couple of
>>>> nits that he is handling at the moment (man page and variable name
>>>> collision).
>>>
>>>
>>> The updated patch is here:
>>> http://people.freebsd.org/~markj/patches/20130405-mfi-bbu.diff
>>>
>>> I'll commit it in a few days if there aren't any problems.
>>>
>>> Thanks,
>>> -Mark
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
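
For anyone wanting to sanity-check the capacity figures discussed in this thread (design 1215 mAh, full charge 65262 mAh), here is a minimal Python sketch. It is not part of mfiutil or MegaCli. It assumes the firmware reports capacity as an unsigned 16-bit value, which is only a guess based on the "wrap around to 65535" observation, and it borrows the 30% of design capacity figure from the speculation above about how the SOH flag might be derived. The function names are hypothetical.

```python
# Hypothetical helper, not part of mfiutil: flags capacity readings that look
# like a 16-bit underflow and applies the 30% rule of thumb from the thread.

def as_signed16(value: int) -> int:
    """Reinterpret an unsigned 16-bit reading as a signed 16-bit value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def capacity_report(design_mah: int, full_mah: int, min_ratio: float = 0.30) -> dict:
    """Summarise a full-charge reading relative to the design capacity.

    min_ratio is an assumption (the 30% guess from the thread), not a
    documented firmware threshold.
    """
    ratio = full_mah / design_mah
    # A reading with the high bit set is taken as a sign that the counter
    # went below zero and wrapped, so the ratio cannot be trusted.
    suspect_wrap = as_signed16(full_mah) < 0
    soh_ok = None if suspect_wrap else ratio >= min_ratio
    return {"ratio": ratio, "suspect_wrap": suspect_wrap, "soh_ok": soh_ok}


if __name__ == "__main__":
    # Values taken from the "mfiutil show battery" output quoted above.
    print(as_signed16(65262))
    print(capacity_report(1215, 65262))
```

With the values from the quoted output, the full charge reading decodes to a small negative number when treated as signed, which is consistent with (but does not prove) the wrap-around theory.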