Date: Thu, 7 Nov 2013 14:44:03 -0500
From: Mark Johnston <markj@freebsd.org>
To: Charles Owens <cowens@greatbaysoftware.com>
Cc: Jason Damron <jdamron@greatbaysoftware.com>, freebsd-scsi@freebsd.org, Steve McCoy <smccoy@greatbaysoftware.com>
Subject: Re: adding BBU relearn support to mfiutil
Message-ID: <20131107194402.GA1695@charmander.sandvine.com>
In-Reply-To: <527BD440.8010701@greatbaysoftware.com>
References: <20130304033836.GA33631@oddish> <1365196956.17311.13.camel@localhost> <20130406000809.GA96223@raichu> <527A7603.7090303@greatbaysoftware.com> <20131106230356.GA86666@charmander.sandvine.com> <527BD440.8010701@greatbaysoftware.com>

On Thu, Nov 07, 2013 at 12:56:16PM -0500, Charles Owens wrote:
> On 11/6/13 6:03 PM, Mark Johnston wrote:
> > On Wed, Nov 06, 2013 at 12:01:55PM -0500, Charles Owens wrote:
> >> Hi, we've been playing with this patch in the context of 8.4-RELEASE-p4
> >> (we extracted r250483 and r250497 from stable/8 and applied to
> >> releng/8.4). I'm seeing some results that make me question whether or
> >> not caching is really working correctly after a BBU relearn operation
> >> has completed -- or maybe whether or not the new BBU patch is talking to
> >> the LSI controller properly.
> >>
> >> Our test system had a BBU in the failed state (relearn needed). We used
> >> the "start learn command" and it seemed to go well, but strangely, when
> >> the process seems to have completed, and now several days later, status
> >> is still LEARN_CYCLE_REQUESTED (as seen with "mfiutil show battery").
> >> This may be entirely normal -- maybe it says that because the autolearn
> >> feature is now enabled?
> > I suspect that the status is bogus and that the battery is in fact dead.
> > There seem to be a few firmware bugs in the BBU status reporting, at
> > least with iBBU07. In your output below, I see:
> >
> >     Design Capacity: 1215 mAh
> >     Full Charge Capacity: 65262 mAh
> >     Current Capacity: 61543 mAh
> >
> > which clearly isn't right. I've seen this problem before as well: over
> > time, the full charge capacity decreases, and eventually it seems to
> > wrap around to 65535. MegaCli (LSI's binary RAID management tool) reports
> > exactly the same thing, so it's a problem with the controller firmware.
> > If you look at MegaCli output you get things like "Absolute charge: 6000%".
> > So I suspect that the status is incorrect as well; when I've run into
> > this problem, I still see "status: normal".
> >
> >> The "cache" status command output is also a bit strange. Here is
> >> the raw output of these status commands:
> >>
> >> # mfiutil cache mfid0
> >> mfi0 volume mfid0 cache settings:
> >>     I/O caching: disabled
> >>     write caching: write-back
> >>     write cache with bad BBU: disabled
> >>     read ahead: adaptive
> >>     drive write cache: enabled
> >> Cache disabled due to dead battery or ongoing battery relearn
> >>
> >>
> >> # ./mfiutil show battery
> >> mfi0: Battery State:
> >>     Manufacture Date: 3/18/2010
> >>     Serial Number: 77
> >>     Manufacturer: LS1111001A
> >>     Model: 3598501
> >>     Chemistry: LION
> >>     Design Capacity: 1215 mAh
> >>     Full Charge Capacity: 65262 mAh
> >>     Current Capacity: 61543 mAh
> >>     Charge Cycles: 120
> >>     Current Charge: 94%
> >>     Design Voltage: 3700 mV
> >>     Current Voltage: 4081 mV
> >>     Temperature: 23 C
> >>     Autolearn period: 30 days
> >>     Next learn time: Tue Nov 26 20:06:40 2013
> >>     Learn delay interval: 0 hours
> >>     Autolearn mode: enabled
> >>     Status: LEARN_CYCLE_REQUESTED
> >>
> >>
> >> /Why does cache status now say "Cache disabled due to dead battery or
> >> ongoing battery relearn"/? Shouldn't this no longer be the case since
> >> I've run the "learn" operation? Does this indicate that the I/O caching
> >> is really disabled?
> > I believe so. You can try changing the write caching policy to write-back
> > with bad BBU and see if that re-enables the cache. If it does, that's
> > more evidence that the BBU is dead and needs to be replaced.
> >
> >> I'd appreciate any and all assistance. Here's a bit of other info that
> >> might be of interest:
> >>
> >> # mfiutil show adapter
> >> mfi0 Adapter:
> >>     Product Name: Integrated Intel(R) RAID Controller SROMBSASMP2
> >>     Serial Number:
> >>     Firmware: 11.0.1-0036
> >>     RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
> >>     Battery Backup: present
> >>     NVRAM: 32K
> >>     Onboard Memory: 512M
> >>     Minimum Stripe: 8k
> >>     Maximum Stripe: 1M
> >>
> >> # mfiutil show drives
> >> mfi0 Physical Drives:
> >>  1 ( 136G) ONLINE <SEAGATE ST9146852SS 0005 serial=6TB005JE> SAS E1:S0
> >>  2 ( 136G) ONLINE <SEAGATE ST9146852SS 0005 serial=6TB005JV> SAS E1:S1
> >>  3 ( 136G) ONLINE <SEAGATE ST9146852SS 0005 serial=6TB005KD> SAS E1:S4
> >>  4 ( 136G) ONLINE <SEAGATE ST9146852SS 0005 serial=6TB005BQ> SAS E1:S2
> >>  5 ( 136G) HOT SPARE <SEAGATE ST9146852SS 0005 serial=6TB005FJ> SAS E1:S3
> >>
> >> The storage volume is 4-drives, RAID10. System has 16GB RAM, dual Xeon
> >> E5530 CPUs, on an Intel S5520UR motherboard.
> > It might be useful to check the output of "mfiutil show events -c info".
> >
> >
>
> This is good info, thank you.
>
> The "show events" command tells us when the battery first was detected
> as "failed":
>
> 49336 (Sun Mar 3 21:53:40 UTC 2013/BATTERY/info) - Battery charge complete
> 49340 (boot + 4s/BATTERY/info) - Battery Present
> 49341 (boot + 4s/BATTERY/FATAL) - Battery has failed and cannot support data retention. Please replace the battery
> 49365 (boot + 45s/BATTERY/WARN) - BBU disabled; changing WB virtual disks to WT
> 49367 (Mon Mar 4 05:13:09 UTC 2013/BATTERY/info) - Battery temperature is normal
>
>
>
> So, given this strong indication that the BBU is really dead, and that
> I'd still like to test the effects of write-caching, I used this
> command: mfiutil cache mfid0 bad-bbu-write-cache enable
>
> Now the "cache disabled" message is gone:
>
> # mfiutil cache mfid0
> mfi0 volume mfid0 cache settings:
>     I/O caching: writes
>     write caching: write-back
>     write cache with bad BBU: enabled
>     read ahead: adaptive
>     drive write cache: enabled
>
>
> The obvious interpretation is that write-caching is now operational (in
> the preferred write-back mode). Strangely, though, my performance tests
> (with both pgbench and bonnie) still showed no meaningful effect from
> having the cache operational! I toggled between caching / no-caching
> with these commands:
>
> # mfiutil cache mfid0 writes
> Setting write cache policy to write-back
>
> # mfiutil cache mfid0 disable
> Disabling caching of I/O writes
>
>
> Again, no difference in performance was seen.
>
> On a whim, I also tried write-through mode, and to my surprise, bonnie
> showed significantly reduced performance! (consistent over multiple
> samples) This is really confusing. To me it suggests that there's some
> kind of disconnect between caching-status as seen with mfiutil and
> caching-status in reality. Chief exhibits being that write-caching
> appears to have still been happening even:
>
> * after the "cache mfid0 disable" command was issued, and
> * earlier, before the "cache mfid0 bad-bbu-write-cache enable" command
>   was issued (when "mfiutil cache mfid0" still showed "Cache disabled
>   due to dead battery or ongoing battery relearn").
>
> ** If this is the case then it suggests that the system before today was
> in a dangerous state... actively doing write-back caching with a bad BBU
> (despite what mfiutil claimed about the cache being disabled)! **

Yup. That's rather frightening. :(

>
> Your thoughts? Is there any other way to explain this?

Nothing that comes to mind. The reason I did some work to improve LSI
BBU reporting was because we were noticing intermittent performance
problems that turned out to be caused by the controller flipping to
write-through mode during BBU relearn cycles. However, I've never
bothered verifying that the cache is actually in write-through mode
when the battery is dead.

I think there's a machine in my lab which shows similar problems, so I
will try to take a look at it soon, do some write perf testing and see
what MegaCli reports. It'll take me a few days at least to get to this
though. I'm not sure how this might be fixed in the case that it turns
out to be another firmware bug.

-Mark

>
>
> Here is the data from bonnie:
>
> ***** write-through caching (2 samples)
>
> # bonnie -s 2000
> File './Bonnie.1351', size: 2097152000
> ...
>          -------Sequential Output-------- ---Sequential Input-- --Random--
>          -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
>       2000 61515 21.3 46388 4.3 57432 16.0 247823 99.9 1629696 100.0 55687.0 212.4
>
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
>       2000 60001 20.7 51828 4.9 51666 13.9 247501 100.0 1657454 100.0 53136.4 251.0
>
> ***** write-back caching (2 samples)
>
>          -------Sequential Output-------- ---Sequential Input-- --Random--
>          -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
>       2000 128564 44.6 90065 8.7 245325 47.8 248492 100.0 1558747 99.7 61967.5 179.1
>
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
>       2000 184059 64.0 141360 13.8 129801 22.2 246222 99.2 1556723 100.0 51728.4 159.7
>
> (and, again... same performance is seen after issuing "cache disable"
> command)
>
>
> Thanks much,
>
> Charles Owens
> Great Bay Software
>
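
The capacity anomaly discussed above (a full charge capacity of 65262 mAh against a design capacity of 1215 mAh) can be flagged mechanically. The following is a minimal Python sketch, not part of mfiutil: the function names, the 1.5x slack factor, and the reading of the field as a wrapped 16-bit counter are all assumptions made for illustration, and the parser only handles the "Key: value" text layout quoted in this message.

```python
import re

# Excerpt of the "mfiutil show battery" output quoted in this thread.
SAMPLE = """\
mfi0: Battery State:
    Design Capacity: 1215 mAh
    Full Charge Capacity: 65262 mAh
    Current Capacity: 61543 mAh
    Status: LEARN_CYCLE_REQUESTED
"""

def parse_battery(text):
    """Hypothetical helper: split 'Key: value' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def mah(value):
    """Return the integer in a string like '1215 mAh', or None."""
    m = re.match(r"(\d+)\s*mAh", value)
    return int(m.group(1)) if m else None

def capacity_looks_bogus(fields, slack=1.5):
    """Flag a full charge capacity far above design capacity.

    The slack factor is an arbitrary assumption, not a firmware rule.
    """
    design = mah(fields.get("Design Capacity", ""))
    full = mah(fields.get("Full Charge Capacity", ""))
    if design is None or full is None:
        return False
    return full > design * slack

def as_signed16(value):
    """Reinterpret an unsigned 16-bit reading as signed (assumed width)."""
    return value - 0x10000 if value >= 0x8000 else value

fields = parse_battery(SAMPLE)
print(capacity_looks_bogus(fields))                       # True
print(as_signed16(mah(fields["Full Charge Capacity"])))   # -274
```

Read as a signed 16-bit value, 65262 becomes -274, which fits the description of a capacity that kept decreasing until it wrapped. A check like this only says that the reported numbers are implausible; it says nothing about whether the controller is actually caching writes.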