Date:        Fri, 11 Feb 2011 19:24:27 -0800
From:        Greg Bonett <greg@bonett.org>
To:          Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:          freebsd-stable@freebsd.org
Subject:     Re: 8.1 amd64 lockup (maybe zfs or disk related)
Message-ID:  <1297481067.16594.39.camel@ubuntu>
In-Reply-To: <20110209092858.GA35033@icarus.home.lan>
References:  <1297026074.23922.8.camel@ubuntu> <20110207045501.GA15568@icarus.home.lan>
             <1297065041.754.12.camel@ubuntu> <20110207085537.GA20545@icarus.home.lan>
             <1297143276.9417.400.camel@ubuntu> <20110208055239.GA2557@icarus.home.lan>
             <1297145806.9417.413.camel@ubuntu> <20110208064633.GA3367@icarus.home.lan>
             <1297235241.4729.35.camel@ubuntu> <20110209092858.GA35033@icarus.home.lan>
Thanks for all the help. I've learned some new things, but haven't fixed
the problem yet.

> 1) Re-enable both CPU cores; I can't see this being responsible for the
> problem.  I do understand the concern over added power draw, but see
> recommendation (4a) below.

I re-enabled all cores but experienced a lockup while running zpool
scrub. I was able to run a scrub twice with 4 of 6 cores enabled without
a lockup. Also, when the lockup occurs I'm not able to break into the
debugger with Ctrl-Alt-Esc. Just to keep things straight: since I'm
running geli, more cores means more I/O throughput during a scrub.

If I'm not able to use the kernel debugger to diagnose this problem,
should I disable it? Could it be a security risk?

> 1) Disable the JMicron SATA controller entirely.
>
> 2) Disable the ATI IXP700/800 SATA controller entirely.
>
> 3a) Purchase a Silicon Image controller (one of the models I referenced
> in my previous mail).  Many places sell them, but lots of online vendors
> hide or do not disclose what ASIC they're using for the controller.  You
> might have to look at their Driver Downloads section to find out what
> actual chip is used.

This is on my to-do list, but as of now I'm still running the
controllers on the motherboard. I should have the controller replaced by
next week.

> 3b) You've stated you're using one of your drives on an eSATA cable.  If
> you are using a SATA-to-eSATA adapter bracket[1][2], please stop
> immediately and use a native eSATA port instead.
>
> Adapter brackets are known to cause all sorts of problems that appear as
> bizarre/strange failures (xxx_DMAxx errors are quite common in this
> situation), not to mention with all the internal cabling and external
> cabling, a lot of the time people exceed the maximum SATA cable length
> without even realising it -- it's the entire length from the SATA port
> on your motherboard, to and through the adapter (good luck figuring out
> how much wire is used there), to the end of the eSATA cable.  Native
> eSATA removes use of the shoddy adapters and also extends the maximum
> cable length (from 1 metre to 2 metres), plus provides the proper amount
> of power for eSATA devices (yes this matters!).  Wikipedia has
> details[3].
>
> Silicon Image and others do make chips that offer both internal SATA and
> an eSATA port on the same controller.  Given your number of disks, you
> might have to invest in multiple controllers.

My motherboard has an eSATA port and that's what I'm using (not an
extension bracket). Do you still recommend against it? I figured one
fewer drive in the case would reduce the load on my PSU.

> 4a) Purchase a Kill-a-Watt meter and measure exactly how much power your
> entire PC draws, including on power-on (it will be a lot higher during
> power-on than during idle/use, as drives spinning up draw lots of amps).
> I strongly recommend the Kill-a-Watt P4600 model[4] over the P4400 model.
> Based on the wattage and amperage results, you should be able to
> determine if you're nearing the maximum draw of your PSU.

The Kill-a-Watt meter arrived today. It looks like during boot the draw
doesn't exceed 200 watts. During a zpool scrub it gets up to ~255 watts
(with all cores enabled). So I don't think the problem is gross power
consumption.
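For reference, this is roughly how I've been driving the scrubs while
watching the meter -- just a sketch, with "tank" standing in for my
actual pool name:

  $ zpool scrub tank     # start a scrub; with geli in the path this loads the CPU
  $ zpool status tank    # check progress and the estimated time remaining
  $ zpool scrub -s tank  # stop the scrub early if the box starts misbehaving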
> 4b) However, even if you're way under-draw (say, 400W), the draw may not
> be the problem but instead the maximum amount of power/amperage/whatever
> a single physical power cable can provide.  I imagine to some degree it
> depends on the gauge of wire being used; excessive use of Y-splitters to
> provide more power connectors than the physical cable provides means
> that you might be drawing too much across the existing gauge of cable
> that runs to the PSU.  I have seen setups where people have 6 hard disks
> coming off of a single power cable (with Y-splitters and molex-to-SATA
> power adapters) and have their drives randomly drop off the bus.  Please
> don't do this.

Yes, this seems like it could be a problem. I'll shut down and figure
out which drives are connected to which cables. Maybe with some
rearranging I can even out the load. Even if I have a bunch of drives on
a single cable, would a voltage drop on one cable full of drives be
enough to lock up the machine? It seems like the motherboard power would
be unaffected.

> A better solution might be to invest in a server-grade chassis, such as
> one from Supermicro, that offers a hot-swap SATA backplane.  The
> backplane provides all the correct amounts of power to the maximum
> number of disks that can be connected to it.  Here are some cases you
> can look at[5][6][7].  Also be aware that if you're already using a
> hot-swap backplane, most consumer-grade ones are complete junk and have
> been known to cause strange anomalies; it's always best in those
> situations to go straight from motherboard-to-drive or card-to-drive.

This would be nice, but it's not in my budget right now. I'll keep it in
mind for my next major upgrade.

> After reviewing your SMART stats on the drive, I agree -- it looks
> perfectly healthy (for a Seagate disk).  Nothing wrong there.
>
> > > > calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
> > > > calcru: runtime went backwards from 363 usec to 317 usec for pid 8 (pagedaemon)
> > > > calcru: runtime went backwards from 111 usec to 95 usec for pid 7 (xpt_thrd)
> > > > calcru: runtime went backwards from 1892 usec to 1629 usec for pid 1 (init)
> > > > calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)
> > >
> > > This is a problem that has plagued FreeBSD for some time.  It's usually
> > > caused by EIST (est) being used, but that's on Intel platforms.  AMD has
> > > something similar called Cool'n'Quiet (see cpufreq(4) man page).  Are
> > > you running powerd(8) on this system?  If so, try disabling that and see
> > > if these go away.
> >
> > Sadly, I don't know if I'm running powerd.
> > "ps aux | grep power" gives nothing, so no, I guess...
> > As far as I can tell, this error is the least of my problems right now,
> > but I would like to fix it.
>
> Yes, that's an accurate ps/grep to use; powerd_enable="yes" in
> /etc/rc.conf is how you make use of it.

Is this recommended for desktop machines?
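In case it helps, my understanding is that turning it on would look
roughly like this (a sketch only -- I'm assuming adaptive mode is a
reasonable default for a desktop):

  # add to /etc/rc.conf
  powerd_enable="YES"
  powerd_flags="-a adaptive"   # adaptive CPU frequency scaling on AC power

  # start it without a reboot, then check that the clock steps down at idle
  /etc/rc.d/powerd start
  sysctl dev.cpu.0.freq dev.cpu.0.freq_levels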
> Could you provide output from "sysctl -a | grep freq"?  That might help
> shed some light on the above errors as well, but as I said, I'm not
> familiar with AMD systems.

$ sysctl -a | grep freq
kern.acct_chkfreq: 15
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.TSC.frequency: 3491654411
net.inet.sctp.sack_freq: 2
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
machdep.acpi_timer_freq: 3579545
machdep.tsc_freq: 3491654411
machdep.i8254_freq: 1193182
dev.cpu.0.freq: 3000
dev.cpu.0.freq_levels: 3000/19507 2625/17068 2300/14500 2012/12687 1725/10875 1600/10535 1400/9218 1200/7901 1000/6584 800/6345 700/5551 600/4758 500/3965 400/3172 300/2379 200/1586 100/793
dev.acpi_throttle.0.freq_settings: 10000/-1 8750/-1 7500/-1 6250/-1 5000/-1 3750/-1 2500/-1 1250/-1
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%parent: cpu0
dev.hwpstate.0.freq_settings: 3000/19507 2300/14500 1600/10535 800/6345
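For what it's worth, once powerd is running I plan to sanity-check that
the frequency actually moves between those levels -- nothing fancy, just
a loop like:

  # print the current CPU frequency every 2 seconds (Ctrl-C to stop)
  while true; do sysctl -n dev.cpu.0.freq; sleep 2; done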