From owner-freebsd-scsi@freebsd.org Fri Mar 4 09:16:25 2016
Subject: Re: mpr(4) SAS3008 Repeated Crashing
To: Borja Marcos, Scott Long
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <56D805FD.50500@multiplay.co.uk>
Cc: FreeBSD-scsi
From: Steven Hartland
Message-ID: <56D95266.301@multiplay.co.uk>
Date: Fri, 4 Mar 2016 09:16:22 +0000
List-Id: SCSI subsystem

On 04/03/2016 08:02, Borja Marcos wrote:
>> On 03 Mar 2016, at 18:09, Scott Long wrote:
>>
>>
>> SYNC CACHE seems to have been involved this time, and while it’s sometimes a source of trouble with SATA disks, I’m very hesitant to blame it.  Given the seemingly random nature of your problems, I’m not as certain anymore to rule out a fault of the disk enclosure.  This looks to be a different disk than your last report, and your statement that a sibling system exhibits no problems is very interesting.  Maybe there’s an issue with the power supply, and the disks are getting under-voltage conditions periodically.
>> If you can run smartctl against the disks, the output might be useful.  Also, if you’re able, could you make sure that both this system and the one that is working well are being fed with sufficient and similar AC power?  And if the power supply modules in your enclosures are swappable, maybe swap them between systems and see if the problem follows the module?  If that doesn’t fix it then I’ll think of ways to provide more instrumentation.
> The affected disks are completely random. I didn’t copy a lot of instances to avoid too much litter, but each time it’s a different disk.
>
> Both systems are in the same datacenter, and yes, the power infrastructure is working. Swapping modules can be done if the dealer sends us another one because I prefer not to mess with a working system.
>
> The fact that it’s a different disk each time, and that the other system works perfectly is what makes me quite certain that it’s a hardware problem. Either some trouble with the backplane or a power problem.
>
> I am tempted to go the oscilloscope route (monitoring the internal power rails). But if the problem is in the power distribution of the backplane itself I’ll need to destroy a broken disk to build a backplane power probe :)
>
It's very rare, but we've also seen this type of behaviour from a failing Intel CPU. There was no other indication the CPU had an issue, which one might expect, so just wanted to make you aware of the possibility.

That said, the most common cause of this we've seen, when it's not a common disk or disks, is a bad backplane or cabling to the backplane.

    Regards
    Steve