From owner-freebsd-scsi@FreeBSD.ORG Thu Nov 7 17:57:59 2013
Message-ID: <527BD440.8010701@greatbaysoftware.com>
Date: Thu, 07 Nov 2013 12:56:16 -0500
From: Charles Owens
Organization: Great Bay Software
To: Mark Johnston
Cc: Jason Damron, freebsd-scsi@freebsd.org, Steve McCoy
Subject: Re: adding BBU relearn support to mfiutil
References: <20130304033836.GA33631@oddish> <1365196956.17311.13.camel@localhost> <20130406000809.GA96223@raichu> <527A7603.7090303@greatbaysoftware.com> <20131106230356.GA86666@charmander.sandvine.com>
In-Reply-To: <20131106230356.GA86666@charmander.sandvine.com>

On 11/6/13 6:03 PM, Mark Johnston wrote:
> On Wed, Nov 06, 2013 at 12:01:55PM -0500, Charles Owens wrote:
>> Hi, we've been playing with this patch in the context of 8.4-RELEASE-p4
>> (we extracted r250483 and r250497 from stable/8 and applied to
>> releng/8.4). I'm seeing some results that make me question whether or
>> not caching is really working correctly after a BBU relearn operation
>> has completed -- or maybe whether or not the new BBU patch is talking to
>> the LSI controller properly.
>>
>> Our test system had a BBU in the failed state (relearn needed). We used
>> the "start learn" command and it seemed to go well, but strangely, when
>> the process seems to have completed, and now several days later, status
>> is still LEARN_CYCLE_REQUESTED (as seen with "mfiutil show battery").
>> This may be entirely normal -- maybe it says that because the autolearn
>> feature is now enabled?
> I suspect that the status is bogus and that the battery is in fact dead.
> There seem to be a few firmware bugs in the BBU status reporting, at
> least with iBBU07. In your output below, I see:
>
>     Design Capacity: 1215 mAh
>     Full Charge Capacity: 65262 mAh
>     Current Capacity: 61543 mAh
>
> which clearly isn't right. I've seen this problem before as well: over
> time, the full charge capacity decreases, and eventually it seems to
> wrap around to 65535. MegaCli (LSI's binary RAID management tool) reports
> exactly the same thing, so it's a problem with the controller firmware.
> If you look at MegaCli output you get things like "Absolute charge: 6000%".
> So I suspect that the status is incorrect as well; when I've run into
> this problem, I still see "status: normal".
>
>> The output of the "cache" status command is also a bit strange. Here is
>> the raw output of these status commands:
>>
>> # mfiutil cache mfid0
>> mfi0 volume mfid0 cache settings:
>>     I/O caching: disabled
>>     write caching: write-back
>>     write cache with bad BBU: disabled
>>     read ahead: adaptive
>>     drive write cache: enabled
>> Cache disabled due to dead battery or ongoing battery relearn
>>
>>
>> # ./mfiutil show battery
>> mfi0: Battery State:
>>     Manufacture Date: 3/18/2010
>>     Serial Number: 77
>>     Manufacturer: LS1111001A
>>     Model: 3598501
>>     Chemistry: LION
>>     Design Capacity: 1215 mAh
>>     Full Charge Capacity: 65262 mAh
>>     Current Capacity: 61543 mAh
>>     Charge Cycles: 120
>>     Current Charge: 94%
>>     Design Voltage: 3700 mV
>>     Current Voltage: 4081 mV
>>     Temperature: 23 C
>>     Autolearn period: 30 days
>>     Next learn time: Tue Nov 26 20:06:40 2013
>>     Learn delay interval: 0 hours
>>     Autolearn mode: enabled
>>     Status: LEARN_CYCLE_REQUESTED
>>
>>
>> /Why does cache status now say "Cache disabled due to dead battery or
>> ongoing battery relearn"/? Shouldn't this no longer be the case since
>> I've run the "learn" operation? Does this indicate that the I/O caching
>> is really disabled?
> I believe so. You can try changing the write caching policy to write-back
> with bad BBU and see if that re-enables the cache. If it does, that's
> more evidence that the BBU is dead and needs to be replaced.
>
>> I'd appreciate any and all assistance. Here's a bit of other info that
>> might be of interest:
>>
>> # mfiutil show adapter
>> mfi0 Adapter:
>>     Product Name: Integrated Intel(R) RAID Controller SROMBSASMP2
>>     Serial Number:
>>     Firmware: 11.0.1-0036
>>     RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
>>     Battery Backup: present
>>     NVRAM: 32K
>>     Onboard Memory: 512M
>>     Minimum Stripe: 8k
>>     Maximum Stripe: 1M
>>
>> # mfiutil show drives
>> mfi0 Physical Drives:
>>     1 ( 136G) ONLINE SAS E1:S0
>>     2 ( 136G) ONLINE SAS E1:S1
>>     3 ( 136G) ONLINE SAS E1:S4
>>     4 ( 136G) ONLINE SAS E1:S2
>>     5 ( 136G) HOT SPARE SAS E1:S3
>>
>> The storage volume is 4-drives, RAID10. System has 16GB RAM, dual Xeon
>> E5530 CPUs, on an Intel S5520UR motherboard.
> It might be useful to check the output of "mfiutil show events -c info".
>
>

This is good info, thank you. The "show events" command tells us when
the battery was first detected as "failed":

    49336 (Sun Mar 3 21:53:40 UTC 2013/BATTERY/info) - Battery charge complete
    49340 (boot + 4s/BATTERY/info) - Battery Present
    49341 (boot + 4s/BATTERY/FATAL) - Battery has failed and cannot support data retention. Please replace the battery
    49365 (boot + 45s/BATTERY/WARN) - BBU disabled; changing WB virtual disks to WT
    49367 (Mon Mar 4 05:13:09 UTC 2013/BATTERY/info) - Battery temperature is normal

So, given this strong indication that the BBU is really dead, and that
I'd still like to test the effects of write-caching, I used this command:

    mfiutil cache mfid0 bad-bbu-write-cache enable

Now the "Cache disabled" message is gone:

    # mfiutil cache mfid0
    mfi0 volume mfid0 cache settings:
        I/O caching: writes
        write caching: write-back
        write cache with bad BBU: enabled
        read ahead: adaptive
        drive write cache: enabled

The obvious interpretation is that write-caching is now operational (in
the preferred write-back mode). Strangely, though, my performance tests
(with both pgbench and bonnie) still showed no meaningful effect from
having the cache operational! I toggled between caching / no-caching
with these commands:

    # mfiutil cache mfid0 writes
    Setting write cache policy to write-back
    # mfiutil cache mfid0 disable
    Disabling caching of I/O writes

Again, no difference in performance was seen. On a whim, I also tried
write-through mode, and to my surprise, bonnie showed significantly
reduced performance (consistent over multiple samples)!

This is really confusing. To me it suggests that there's some kind of
disconnect between the caching status reported by mfiutil and the
caching status in reality. The chief exhibits are that write-caching
appears to have still been happening even:

  * after the "cache mfid0 disable" command was issued, and
  * earlier, before the "cache mfid0 bad-bbu-write-cache enable" command
    was issued (when "mfiutil cache mfid0" still showed "Cache disabled
    due to dead battery or ongoing battery relearn").

** If this is the case then it suggests that the system before today was
in a dangerous state... actively doing write-back caching with a bad BBU
(despite what mfiutil claimed about the cache being disabled)! **

Your thoughts? Is there any other way to explain this?

Here is the data from bonnie:

***** write-through caching (2 samples)

# bonnie -s 2000
File './Bonnie.1351', size: 2097152000
...
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000 61515 21.3 46388 4.3 57432 16.0 247823 99.9 1629696 100.0 55687.0 212.4
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000 60001 20.7 51828 4.9 51666 13.9 247501 100.0 1657454 100.0 53136.4 251.0

***** write-back caching (2 samples)

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000 128564 44.6 90065 8.7 245325 47.8 248492 100.0 1558747 99.7 61967.5 179.1
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
         2000 184059 64.0 141360 13.8 129801 22.2 246222 99.2 1556723 100.0 51728.4 159.7

(and, again... the same performance is seen after issuing the "cache
disable" command)

Thanks much,

Charles Owens
Great Bay Software
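
For a rough sense of how large the gap between the two modes is, the sequential-output columns from the bonnie runs above can be averaged and compared. The following is only a minimal Python sketch: the function and variable names are hypothetical, the K/sec figures are copied by hand from the four runs above, and with two samples per mode the ratios are indicative at best.

```python
from statistics import mean

# K/sec figures copied from the bonnie runs above, two samples per mode:
# (per-char write, block write, rewrite). Names here are illustrative only.
WRITE_THROUGH = [(61515, 46388, 57432), (60001, 51828, 51666)]
WRITE_BACK = [(128564, 90065, 245325), (184059, 141360, 129801)]
COLUMNS = ("per-char", "block", "rewrite")


def column_means(samples):
    """Average each output column across the samples of one cache mode."""
    return [mean(col) for col in zip(*samples)]


def speedups(baseline, candidate):
    """Ratio of candidate mean to baseline mean, one value per column."""
    base = column_means(baseline)
    cand = column_means(candidate)
    return [c / b for b, c in zip(base, cand)]


if __name__ == "__main__":
    for name, ratio in zip(COLUMNS, speedups(WRITE_THROUGH, WRITE_BACK)):
        print(f"{name:>8}: write-back is {ratio:.2f}x write-through")
```

The sequential-input and seek columns are left out on purpose, since they barely move between the two modes in the data above and the question here is about write caching.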