Date:      Thu, 12 May 2016 20:03:07 +0100
From:      Steven Hartland <steven@multiplay.co.uk>
To:        Nikolaj Hansen <nikolaj.hansen@barnabas.dk>
Cc:        "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject:   Re: HP DL 585 / ACPI ID / ECC Memory / Panic
Message-ID:  <CAHEMsqbe-1B7T_x0YDfvmCaGRbMrcve58_YOf1bh-M-h%2BhcV1A@mail.gmail.com>
In-Reply-To: <57349ED3.7060606@barnabas.dk>
References:  <57349D5B.50202@barnabas.dk> <57349ED3.7060606@barnabas.dk>

I wouldn't rule out a bad CPU. We had a very similar issue, and that's
what it turned out to be.

A quick way to confirm is to move all the DRAM from the disabled CPU to one
of the other CPUs and see whether the issue stays away with that CPU
still disabled.

If it does, the on-chip memory controller has likely developed a fault.
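
To work out which CPU card the disabled core sits on, a rough sketch of
the arithmetic follows. It assumes (this is not confirmed anywhere in the
thread) that local APIC IDs are assigned consecutively within each socket,
so a dual-core Opteron on socket n gets IDs 2n and 2n+1. The function name
apic_to_socket_core is made up for illustration. Check the result against
the APIC IDs the kernel reports at boot; the socket index still has to be
matched to a physical card by hand.

```sh
#!/bin/sh
# Hypothetical helper: map a local APIC ID to "socket core" indices.
# ASSUMPTION: cores are numbered consecutively within each socket, so a
# dual-core part on socket n uses APIC IDs 2n and 2n+1. Firmware may
# number them differently.
apic_to_socket_core() {
    apic_id=$1
    cores_per_socket=${2:-2}   # dual-core Opterons, per the quoted message
    echo "$((apic_id / cores_per_socket)) $((apic_id % cores_per_socket))"
}

# The core disabled by hint.lapic.2.disabled=1:
apic_to_socket_core 2    # prints "1 0"
```

Under that assumption, hint.lapic.2 points at the first core of the second
socket (index 1), so that card's DIMMs would be the ones to move.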

On Thursday, 12 May 2016, Nikolaj Hansen <nikolaj.hansen@barnabas.dk> wrote:

> Hi,
>
> I recently added a ZFS disk array to my old HP DL585 G1 server.
> Kernel panics started immediately, and I have spent quite a bit of time
> figuring out what was really wrong.
>
> The system has 4 CPU cards with dual-core Opteron processors. Each
> card has 4 x 2 GB of memory: 4 cards x 4 DIMMs x 2 GB = 32 GB of total
> system memory. The memory is DDR400 ECC.
>
> The panic was very easily reproducible. I just had to issue enough reads
> to the system until the faulty memory was accessed.
>
> Strangely, I can run memtest86+ with the DDR setting on and it finds no
> errors whatsoever.
>
> Adding
>
> hint.lapic.2.disabled=1
>
> to /boot/loader.conf immediately mitigates the error for FreeBSD. So
> here is my conclusion:
>
> If you can make the system stable by disabling one core on one cpu card:
>
> 1) The other cards / mem must be ok.
> 2) The mainboard must be ok since one of the cores on the cpu is still
> running / not barfing panics.
> 3) The CPU core with APIC ID 2 is probably also OK; it is on the same
> chip as a non-disabled core.
> 4) It is likely down to a rotten DIMM.
>
> Rather than blindly swapping DIMMs to find the culprit, I would really
> like to identify the CPU, card and memory module from the OS.
>
> Info here:
>
> http://pastebin.com/jqufNKck
>
> Thank you for your time and help.
>
> --
>
>
> Med venlig hilsen / with regards
>
> Nikolaj Hansen


