Date: Tue, 1 Jul 2014 18:56:03 +0200
From: "O. Hartmann" <ohartman@zedat.fu-berlin.de>
To: Willem Jan Withagen <wjw@digiware.nl>
Cc: "Rang, Anton" <anton.rang@isilon.com>, Adrian Chadd <adrian@freebsd.org>,
    FreeBSD CURRENT <freebsd-current@freebsd.org>, Dimitry Andric <dim@FreeBSD.org>
Subject: Re: [CURRENT]: weird memory/linker problem?
Message-ID: <20140701185603.00be87ef.ohartman@zedat.fu-berlin.de>
In-Reply-To: <53B2DA66.9010506@digiware.nl>
References: <20140622165639.17a1ba1e.ohartman@zedat.fu-berlin.de>
    <CAJ-Vmok0Oh6XGe62acXE-82pTmEaouibd1GqDT0pCo8P6x6Hog@mail.gmail.com>
    <20140623163115.03bdd675.ohartman@zedat.fu-berlin.de>
    <F427210C-D7A9-499F-AFF9-C0B29CC6D51B@FreeBSD.org>
    <20140701150755.548ed6b9.ohartman@zedat.fu-berlin.de>
    <F21EDC44C64DB34B90AF485AC3CEDD4B3539868C@MX104CL01.corp.emc.com>
    <53B2D262.2040502@digiware.nl>
    <20140701173335.394414c3.ohartman@zedat.fu-berlin.de>
    <53B2DA66.9010506@digiware.nl>

On Tue, 01 Jul 2014 17:57:26 +0200, Willem Jan Withagen <wjw@digiware.nl> wrote:

> On 2014-07-01 17:33, O. Hartmann wrote:
> > On Tue, 01 Jul 2014 17:23:14 +0200,
> > Willem Jan Withagen <wjw@digiware.nl> wrote:
> >
> >> On 2014-07-01 16:48, Rang, Anton wrote:
> >>> DOT => DOD
> >>>
> >>> 444F54 => 444F44
> >>>
> >>> That's a single-bit flip. Bad memory, perhaps?
> >>
> >> Very likely, especially if the system does not have ECC....
> >> It just happens on rare occasions that an alpha particle, power cycle, or
> >> anything else disruptive damages a memory cell. And it could be that
> >> it requires a special pattern of accesses to actually exhibit the error.
> >>
> >> In the past (199x's) 'make buildworld' used to be a rather good memory
> >> tester. But nowadays look at
> >> http://www.memtest.org/
> >>
> >> This tool has found all of the bad memory in all the systems I used
> >> and/or built for others...
> >> Note that it might take a few runs and some more heat to actually
> >> trigger the faulty cell, but memtest86 will usually find it.
> >>
> >> Note that on big systems with lots of memory it can take a loooooong
> >> time to run just one full testset to completion.
> >>
> >> --WjW
> >
> > I already tested via memtest86+ (had to download the Linux image, the port on FreeBSD
> > is broken on CURRENT). It didn't find anything strange so far.
> >
> > I will do another test.
> >
> > I realised that on that specific box, the chipset temperature is 81 degrees Celsius.
> > The chipset is an Eaglelake P45 - in which the memory controller resides on that old
> > platform. dmidecode gives:
> >
> > Manufacturer: ASUSTeK Computer INC.
> > Product Name: P5Q-WS
> > Version: Rev 1.xx
>

Hello Willem,

> Hi Oliver,
>
> I've built several (5+) systems with these boards (from memory they date
> around 2009??). And if I recall right, one of them is still functional.
> The first one broke down in a couple of weeks, and the other did not
> survive time either.
>
> The auxiliary chips on that board do run hot, but I never realized this
> hot. Is 81C the CPU temp from sysctl, or did you measure the heat sink
> on the motherboard? In the latter case it is just too hot, probably.
> But even if it is the temp on the chip itself, I've rarely seen temps
> go up this high.

The temperature is seen in the BIOS and via one of those health daemons
found in ports (forgot the name).
There is no sysctl MIB showing the chipset temperature on that board, as far
as I know.

>
> You may need to run memtest86 for more than 6-10 complete runs with
> all the tests.

Last time I ran memtest86+ it took ~ 1 1/2 days to finish.

>
> If the memtests do not reveal anything broken, then you get into even
> more wizardry stuff, like bad power etc... Especially since it only
> occurs on occasion, it is going to be a nightmare to find the root cause
> of this. Other than replacing hardware piece by piece, which won't be
> easy given the age of the board and parts.
>
> You could go into the BIOS, and try to configure RAM access at a slower
> speed and see if the problem goes away. Then it could be that you are
> running on the edge of the spec with regards to RAM timing.
>
> But like I said, it is all lots of funky details that can interact in
> strange and unexpected ways.
>
> --WjW

I will check the memory again in the next few days.

Regards,
Oliver
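
The single-bit-flip reading of "DOT => DOD" (444F54 => 444F44) quoted in this thread can be checked by XOR-ing the two strings byte by byte and listing the bits that differ. What follows is only a minimal Python sketch of that arithmetic; the helper name bit_differences is hypothetical and does not come from the thread or from any FreeBSD tool.

```python
def bit_differences(a: bytes, b: bytes):
    """Return (byte_index, bit_index) for every bit that differs between a and b.

    bit_index 0 is the least significant bit of the byte.
    Hypothetical helper for illustration only.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    diffs = []
    for i, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y  # set bits mark positions where the two bytes disagree
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((i, bit))
    return diffs


expected = b"DOT"  # bytes 0x44 0x4F 0x54
observed = b"DOD"  # bytes 0x44 0x4F 0x44

print(expected.hex().upper(), "=>", observed.hex().upper())  # 444F54 => 444F44
print(bit_differences(expected, observed))                   # [(2, 4)]
```

Only one entry comes back: bit 4 of the third byte (0x54 XOR 0x44 = 0x10). That matches a single flipped bit, but it says nothing on its own about whether RAM, the chipset, or something else caused it.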