Date:      Tue, 1 Jul 2014 18:56:03 +0200
From:      "O. Hartmann" <ohartman@zedat.fu-berlin.de>
To:        Willem Jan Withagen <wjw@digiware.nl>
Cc:        "Rang, Anton" <anton.rang@isilon.com>, Adrian Chadd <adrian@freebsd.org>, FreeBSD CURRENT <freebsd-current@freebsd.org>, Dimitry Andric <dim@FreeBSD.org>
Subject:   Re: [CURRENT]: weird memory/linker problem?
Message-ID:  <20140701185603.00be87ef.ohartman@zedat.fu-berlin.de>
In-Reply-To: <53B2DA66.9010506@digiware.nl>
References:  <20140622165639.17a1ba1e.ohartman@zedat.fu-berlin.de> <CAJ-Vmok0Oh6XGe62acXE-82pTmEaouibd1GqDT0pCo8P6x6Hog@mail.gmail.com> <20140623163115.03bdd675.ohartman@zedat.fu-berlin.de> <F427210C-D7A9-499F-AFF9-C0B29CC6D51B@FreeBSD.org> <20140701150755.548ed6b9.ohartman@zedat.fu-berlin.de> <F21EDC44C64DB34B90AF485AC3CEDD4B3539868C@MX104CL01.corp.emc.com> <53B2D262.2040502@digiware.nl> <20140701173335.394414c3.ohartman@zedat.fu-berlin.de> <53B2DA66.9010506@digiware.nl>


On Tue, 01 Jul 2014 17:57:26 +0200,
Willem Jan Withagen <wjw@digiware.nl> wrote:

> On 2014-07-01 17:33, O. Hartmann wrote:
> > On Tue, 01 Jul 2014 17:23:14 +0200,
> > Willem Jan Withagen <wjw@digiware.nl> wrote:
> >
> >> On 2014-07-01 16:48, Rang, Anton wrote:
> >>> DOT => DOD
> >>>
> >>> 444F54 => 444F44
> >>>
> >>> That's a single-bit flip.  Bad memory, perhaps?
> >>
> >> Very likely, especially if the system does not have ECC....
> >> It just happens on rare occasions that an alpha particle, a power cycle,
> >> or anything else disruptive damages a memory cell. And it could be that
> >> it requires a special pattern of accesses to actually exhibit the error.
> >>
> >> In the past (199x's) 'make buildworld' used to be a rather good memory
> >> tester. But nowadays look at
> >> 	http://www.memtest.org/
> >>
> >> This tool has found all of the bad memory in all the systems I have used
> >> or built for others.
> >> Note that it might take a few runs and some more heat to actually
> >> trigger the faulty cell, but memtest86 will usually find it.
> >>
> >> Note that on big systems with lots of memory it can take a loooooong
> >> time to run just one full testset to completion.
> >>
> >> --WjW
> >
> > I already tested with memtest86+ (I had to download the Linux image; the port on
> > FreeBSD is broken on CURRENT). It didn't find anything strange so far.
> >
> > I will do another test.
> >
> > I realised that on that specific box the chipset temperature is 81 degrees Celsius.
> > The chipset is an Eaglelake P45, in which the memory controller resides on that old
> > platform. dmidecode gives:
> >
> >          Manufacturer: ASUSTeK Computer INC.
> >          Product Name: P5Q-WS
> >          Version: Rev 1.xx
>
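
As a side note on Anton's observation quoted above (DOT => DOD, 444F54 => 444F44): the single-bit claim can be checked by XOR-ing the two byte strings and listing the bits that differ. The following is only a minimal sketch in Python and is not part of the original exchange; the function name flipped_bits is made up for illustration.

```python
def flipped_bits(before: bytes, after: bytes) -> list:
    """Return (byte_index, bit_index) for every bit that differs.

    bit_index 0 is the least significant bit of the byte.
    """
    if len(before) != len(after):
        raise ValueError("inputs must have the same length")
    flips = []
    for i, (a, b) in enumerate(zip(before, after)):
        diff = a ^ b  # set bits mark the positions that changed
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((i, bit))
    return flips


# The corruption reported in this thread: "DOT" (444F54) became "DOD" (444F44).
flips = flipped_bits(b"DOT", b"DOD")
print(flips)  # [(2, 4)]: third byte, bit 4 (0x54 ^ 0x44 == 0x10)
```

Exactly one differing bit is consistent with a flipped memory cell, although it does not by itself say where in the data path (RAM, chipset, bus) the flip happened.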


Hello Willem,

> Hi Oliver,
>
> I've built several (5+) systems with these boards (from memory, they date
> from around 2009??). And if I recall right, one of them is still functional.
> The first one broke down within a couple of weeks, and the others did not
> survive time either.
>
> The auxiliary chips on that board do run hot, but I never realized they ran
> this hot. Is 81C the CPU temp from sysctl, or did you measure the cooling
> body on the motherboard? In the latter case it is just too hot, probably.
> But even if it is the temp on the chip itself, I've rarely seen temps
> go up this high.

The temperature is shown in the BIOS and by one of those health daemons found in
ports (I forgot the name).
There is no sysctl MIB showing the chipset temperature on that board, as far as I know.

>
> You may need to run memtest86 for more than 6-10 complete passes with
> all the tests.

Last time I ran memtest86+ it took ~ 1 1/2 days to finish.

>
> If the memtests do not reveal anything broken, then you get into even
> more wizardry, like bad power etc. Especially since it only
> occurs on occasion, it is going to be a nightmare to find the root cause
> of this, other than by replacing hardware piece by piece, which won't be
> easy given the age of the board and parts.
>
> You could go into the BIOS and try to configure RAM access at a slower
> speed and see if the problem goes away. If it does, it could be that you are
> running at the edge of the spec with regard to RAM timing.
>
> But like I said, it is all lots of funky details that can interact in
> strange and unexpected ways.
>
> --WjW

I will check the memory again in the next few days.

Regards,
Oliver




