Date: Sun, 16 Apr 2017 09:49:00 +0100
From: Frank Leonhardt <freebsd-doc@fjl.co.uk>
To: freebsd-hardware@freebsd.org
Subject: Re: SSD errors
Message-ID: <02898e76-9285-03e7-e76a-77a5290376b9@fjl.co.uk>
In-Reply-To: <20170413205932.GJ2149@shrubbery.net>
References: <20170413205932.GJ2149@shrubbery.net>
On 13/04/2017 21:59, heasley wrote:
> <snip>
> When I push a lot of data to them, such as an rsync, I receive errors like
> the below. If I move drives between slots, it seems to follow the chassis
> slots, those closest to the power supply, but I'm not positive about this.
>
> I suppose the questions for list are:
> - have I missed any fbsd ssd-specific configuration?
>
> - all 4 have non-zero UDMA_CRC_Error_Count counters; not many, about the
> same number, which I believe implies electrical interference - most
> likely in the cable or chassis backplane. Should I buy some specific
> model cable? other recommendations?
<snip>

I'm not aware of any SSD-specific configuration you've missed. The SSD option in the BIOS initialisation code is probably just there because there's no need to wait for spin-up time (as you probably thought too). So I don't have an answer, but here are a few thoughts.

Of that lot, I think it's the CRC errors you should be worried about. They mean that the drive wrote data, but when it read it back it didn't match. With ST506 this could be (and often was) a cable fault, but not with IDE. This doesn't mean dodgy cables can't cause you problems with IDE; only that they'd manifest differently.

If the drive wrote the data to the flash with a CRC and the CRC didn't match later, it makes no difference whether the data was corrupted on its way to the drive (the CRC would have been computed over the already-corrupted data, so it would still match). If it was corrupted on its way back, ZFS would pick that up. So it must have been corrupted on-drive. Right? (I could be wrong about where your CRC errors are being tested/detected, so not necessarily right.)

So with this in mind, why should the drive's location on the shelf matter (if it does make a difference)? I can think of two reasons: electromagnetic interference from adjacent circuits, or PSU problems. So if it were me, I'd check the interference theory by using longer cables and spreading the drives out.
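To make the CRC argument above concrete, here is a toy Python model. It assumes, as the email does, that the drive computes a CRC over the data when it stores it and re-checks that CRC on read-back; the author himself notes he may be wrong about where your counter is actually measured, so treat this as an illustration of the reasoning, not of how any particular drive or the UDMA_CRC_Error_Count attribute works. The function names `store`, `read_back_ok` and `flip_bit` are hypothetical.

```python
import zlib


def store(data: bytes) -> tuple[bytes, int]:
    """Model a drive that computes a CRC over whatever bytes actually arrive."""
    return data, zlib.crc32(data)


def read_back_ok(stored: tuple[bytes, int]) -> bool:
    """Recompute the CRC on read-back; False means the stored bytes changed."""
    data, crc = stored
    return zlib.crc32(data) == crc


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the lowest bit of byte `index` inverted."""
    damaged = bytearray(data)
    damaged[index] ^= 0x01
    return bytes(damaged)


original = b"block of rsync data " * 10

# Case 1: damaged on the way TO the drive. The CRC is computed over the
# already-damaged bytes, so the on-drive check still passes.
in_transit = store(flip_bit(original, 5))
print(read_back_ok(in_transit))  # → True

# Case 2: stored intact, then damaged on the drive itself.
data, crc = store(original)
on_drive = (flip_bit(data, 5), crc)
print(read_back_ok(on_drive))  # → False
```

Under this model only case 2 is flagged by the stored CRC, which is the step the argument relies on; CRC-32 is guaranteed to catch a single flipped bit, so the second check cannot pass by accident.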
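One way to test whether the errors really follow particular slots is to snapshot each drive's UDMA_CRC_Error_Count before and after a heavy rsync and see which counters move. The sketch below parses saved text rather than running anything. It assumes the attribute-table layout printed by `smartctl -A` from smartmontools (raw value in the tenth column); the device names `ada0`/`ada1` and the sample counts are hypothetical.

```python
from typing import Dict, Optional


def udma_crc_count(smartctl_output: str) -> Optional[int]:
    """Return the raw UDMA_CRC_Error_Count from saved `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed layout:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None


def crc_deltas(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, int]:
    """Per-drive increase in the counter between two snapshots."""
    deltas = {}
    for drive, text in before.items():
        start = udma_crc_count(text)
        end = udma_crc_count(after.get(drive, ""))
        if start is not None and end is not None:
            deltas[drive] = end - start
    return deltas


def snapshot(raw: int) -> str:
    """Hypothetical sample line in the assumed smartctl layout."""
    return f"199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       {raw}\n"


before = {"ada0": snapshot(3), "ada1": snapshot(3)}
after = {"ada0": snapshot(3), "ada1": snapshot(7)}
print(crc_deltas(before, after))  # → {'ada0': 0, 'ada1': 4}
```

If the drives that increment change when you swap slots, that points at the chassis or backplane rather than the drives themselves.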
Serial transfer on long cables isn't really a problem like it was with parallel. That's the easy check.

Then it's on to PSU issues. Does an SSD use more or less power than spinning rust? Really? Most people assume it'll use less, but it's not as much less as you think, and its draw varies in different ways. If the PSU can't cope with the peak (e.g. while the drive is writing), you'll have trouble.

IT people will know all about watts. Add up the number of watts on all your drives, and if it's <= the number of watts written on your PSU, cushty. Wrong! An engineer will tell you that you can't add watts together and get anything meaningful: what matters is whether each rail can deliver the peak current the drives demand, not the average total. And believing the label on a PSU is a mug's game.

So, if you've got a decent oscilloscope, take a look at the supply rails where they enter the drives. Try writing, and if you get so much as a blip on the voltage, do something about it. If you haven't got a 'scope to hand, I'd try running (some of) the drives off a different PSU and see if that makes a difference.

Although I haven't hit this problem myself, I'd be surprised if a PSU designed to power spinning rust at a relatively constant current could cope well with an SSD going from nothing much to lots to nothing much again over a very short space of time. If I were connecting a different PSU to the SSDs, I'd load it with some real drives just to stabilise the current output a bit (i.e. plug an old drive or two into some of the other spare outlets).

Then there's always the chance they're over-cooking, but I think you'd have mentioned if they were getting very hot.

Regards,
Frank.
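A rough illustration of why adding up watts misleads: each supply rail has its own current limit, so the budget has to be done per rail, using peak rather than typical draw. The sketch below takes the pessimistic view that every drive peaks at the same moment and trusts only a fraction of the labelled rail current. All figures (drive currents, rail limits, the 0.8 derating factor) and the names `Drive` and `rail_headroom` are hypothetical placeholders; real numbers belong to the drive and PSU datasheets, and a 'scope on the rails still beats any arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    peak_amps_5v: float   # hypothetical peak draw on the +5 V rail
    peak_amps_12v: float  # hypothetical peak draw on the +12 V rail


def rail_headroom(drives: list[Drive], limit_5v: float, limit_12v: float,
                  derate: float = 0.8) -> dict[str, float]:
    """Worst case: assume every drive peaks at once, and only trust `derate`
    of each rail's labelled current. Negative headroom means over budget."""
    need_5v = sum(d.peak_amps_5v for d in drives)
    need_12v = sum(d.peak_amps_12v for d in drives)
    return {
        "5V": limit_5v * derate - need_5v,
        "12V": limit_12v * derate - need_12v,
    }


# Four hypothetical SSDs against a hypothetical PSU label.
ssds = [Drive(f"ssd{i}", peak_amps_5v=1.5, peak_amps_12v=0.0) for i in range(4)]
for rail, spare in rail_headroom(ssds, limit_5v=20.0, limit_12v=30.0).items():
    status = "OK" if spare >= 0 else "OVER BUDGET"
    print(f"{rail}: {spare:.1f} A spare ({status})")
```

Budgeting per rail catches the case a single watts total hides: a PSU with plenty of total capacity on +12 V can still run short on +5 V when several drives peak together.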