Date:      Fri, 18 Jul 2014 00:22:52 +0200
From:      "O. Hartmann" <ohartman@zedat.fu-berlin.de>
To:        Willem Jan Withagen <wjw@digiware.nl>
Cc:        "Rang, Anton" <anton.rang@isilon.com>, Adrian Chadd <adrian@freebsd.org>, FreeBSD CURRENT <freebsd-current@freebsd.org>, Dimitry Andric <dim@FreeBSD.org>
Subject:   Re: [CURRENT]: weird memory/linker problem?
Message-ID:  <20140718002252.09f55fc1.ohartman@zedat.fu-berlin.de>
In-Reply-To: <53B2D262.2040502@digiware.nl>
References:  <20140622165639.17a1ba1e.ohartman@zedat.fu-berlin.de> <CAJ-Vmok0Oh6XGe62acXE-82pTmEaouibd1GqDT0pCo8P6x6Hog@mail.gmail.com> <20140623163115.03bdd675.ohartman@zedat.fu-berlin.de> <F427210C-D7A9-499F-AFF9-C0B29CC6D51B@FreeBSD.org> <20140701150755.548ed6b9.ohartman@zedat.fu-berlin.de> <F21EDC44C64DB34B90AF485AC3CEDD4B3539868C@MX104CL01.corp.emc.com> <53B2D262.2040502@digiware.nl>


On Tue, 01 Jul 2014 17:23:14 +0200
Willem Jan Withagen <wjw@digiware.nl> wrote:

> On 2014-07-01 16:48, Rang, Anton wrote:
> > DOT => DOD
> >
> > 444F54 => 444F44
> >
> > That's a single-bit flip.  Bad memory, perhaps?
>
> Very likely, especially if the system does not have ECC....
> It just happens on rare occasions that an alpha particle, a power cycle, or
> anything else disruptive damages a memory cell. And it could be that
> it requires a special pattern of accesses to actually exhibit the error.
>
> In the past (199x's) 'make buildworld' used to be a rather good memory
> tester. But nowadays look at
> 	http://www.memtest.org/
>
> This tool has found all of the bad memory in all the systems I used
> and/or built for others...
> Note that it might take a few runs and some more heat to actually
> trigger the faulty cell, but memtest86 will usually find it.
>
> Note that on big systems with lots of memory it can take a loooooong
> time to run just one full test set to completion.
>
> --WjW
>
>
> >
> > Anton
> >
> > -----Original Message-----
> > From: owner-freebsd-current@freebsd.org [mailto:owner-freebsd-current@freebsd.org] On Behalf Of O. Hartmann
> > Sent: Tuesday, July 01, 2014 8:08 AM
> > To: Dimitry Andric
> > Cc: Adrian Chadd; FreeBSD CURRENT
> > Subject: Re: [CURRENT]: weird memory/linker problem?
> >
> > On Mon, 23 Jun 2014 17:22:25 +0200
> > Dimitry Andric <dim@FreeBSD.org> wrote:
> >
> >> On 23 Jun 2014, at 16:31, O. Hartmann <ohartman@zedat.fu-berlin.de> wrote:
> >>> On Sun, 22 Jun 2014 10:10:04 -0700
> >>> Adrian Chadd <adrian@freebsd.org> wrote:
> >>>> When they segfault, where do they segfault?
> >> ...
> >>> GIMP, LaTeX work, nothing special, but a bit memory consuming
> >>> regarding GIMP) I tried updating the ports tree and surprisingly the
> >>> tree is left over in an unclean condition while /usr/bin/svn segfaults
> >>> (on console: pid 18013 (svn), uid 0: exited on signal 11 (core dumped)).
> >>>
> >>> Using /usr/local/bin/svn, which is from the devel/subversion port,
> >>> performs well, while FreeBSD 11's svn contribution dies as described. It did not
> >>> hours ago!
> >>
> >> I think what Adrian meant was: can you run svn (or another crashing
> >> program) in gdb, and post a backtrace?  Or maybe run ktrace, and see
> >> where it dies?
> >>
> >> Alternatively, put a core dump and the executable (with debug info) in
> >> a tarball, and upload it somewhere, so somebody else can analyze it.
> >>
> >> -Dimitry
> >>
> >
> > It's me again, with the same weird story.
> >
> > After a couple of days' silence, the mysterious entity in my computer is back. This
> > time it is again a weird compiler message of failure (trying to buildworld):
> >
> > [...]
> > c++  -O2 -pipe -O3 -O3
> > c++ -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/tools/clang/include
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support -I.
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/../../lib/clang/include
> > -DLLVM_ON_UNIX -DLLVM_ON_FREEBSD -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS
> > -fno-strict-aliasing -DLLVM_DEFAULT_TARGET_TRIPLE=\"x86_64-unknown-freebsd11.0\"
> > -DLLVM_HOST_TRIPLE=\"x86_64-unknown-freebsd11.0\" -DDEFAULT_SYSROOT=\"\"
> > -Qunused-arguments -I/usr/obj/usr/src/tmp/legacy/usr/include -std=c++11
> > -fno-exceptions -fno-rtti -Wno-c++11-extensions
> > -c /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support/Host.cpp -o Host.o
> > --- GraphWriter.o ---
> > In file included from /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support/GraphWriter.cpp:14:
> > /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include/llvm/Support/GraphWriter.h:269:10: error: use of undeclared identifier 'DOD'; did you mean 'DOT'?
> >     O << DOD::EscapeString(Label);
> >          ^~~
> >          DOT
> > /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include/llvm/Support/GraphWriter.h:35:11: note: 'DOT' declared here
> > namespace DOT {  // Private functions...
> >           ^
> > 1 error generated.
> > *** [GraphWriter.o] Error code 1
> >
> >
> > Well, in the past I saw many of those messages, especially labels of routines
> > not found in shared objects/libraries, or even those "funny" misspelled messages
> > shown above.
> >
> > I can not reproduce them after a reboot, but once the error has occurred it is
> > sticky for as long as the system keeps running. So in order to compile the OS
> > successfully, I reboot.
> >
> > Does anyone have an idea what this could be? Since it affects at the moment only one
> > machine (the other CoreDuo has been retired in the meanwhile), it feels a bit like a
> > miscompilation on a certain type of CPU.
> >
> > Thanks for your patience,
> >
> > Oliver


Hello all.

Well, I'd like to share an update. It doesn't resolve the underlying concern, but it
may add some useful experience.

The box in question is now running with only 4 GB and is operable as expected. With
8 GB, I see those reported weird bugs, and they have indeed revealed themselves as bit
flips. I can not reproduce them on demand; they occur spontaneously, but I can raise the
frequency by permuting the RAM sticks. That is the state so far. As reported, the
memtest86+ test doesn't show anything, even after three days(!) of testing!
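
For illustration, the single-bit nature of the DOT => DOD corruption quoted above can be
checked with a few lines of Python. This is only a sketch; the helper name bit_flips is
made up for this example and is not part of any tool mentioned in this thread.

```python
def bit_flips(expected: bytes, observed: bytes):
    """Return a list of (byte_offset, bit_index) for every differing bit."""
    if len(expected) != len(observed):
        raise ValueError("inputs must have the same length")
    flips = []
    for offset, (a, b) in enumerate(zip(expected, observed)):
        diff = a ^ b  # bits set here differ between the two bytes
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((offset, bit))
    return flips


# 'T' is 0x54 and 'D' is 0x44; they differ only in bit 4 (value 0x10).
print(bit_flips(b"DOT", b"DOD"))  # -> [(2, 4)]
```

A single entry in the result means exactly one bit differs, which matches the
444F54 => 444F44 observation.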

The box was built in 2009 as a development system with 4 GB RAM. At that time, the
developer ordered special and expensive overclocker RAM, Ballistix, from Crucial.
Usually, I purchase JEDEC-conformant RAM - I have an allergic reaction to this stupid
overclocking and "planned destruction with fun" of silicon by overdriving it, especially
when it concerns equipment we have to rely on. The box was later upgraded with a further
4 GB of RAM (two sticks) of the same type and brand, running at 2+ volts (as far as I
know).

Last summer, after 4 years of trouble-free operation, I suddenly had to fight
spontaneous crashes and blamed FreeBSD CURRENT, but very quickly the memory was revealed
to be the culprit. The funny thing was: the box literally "roasted" the upper 4 GB bank,
and the sticks got so hot that you could seriously burn your fingers touching them (I
did!). The end of that game, after a cascade of tests and swapping RAM sticks, was that
the sticks in the upper slots (B1 and B2) were destroyed! After I replaced the RAM
completely with 8 GB of JEDEC-conformant modules, the system ran perfectly, until this
summer. When the temperatures in my lab went beyond 20 degrees Celsius at the end of
May, the box started having the issues with these bit flips.

I guess that there is a temperature-triggered problem with the voltage regulation, or
something else that is slowly killing the RAM modules/sticks. This is only a guess. As I
reported, the chipset itself reports 81 - 85 degrees C (in BIOS and with healthd). This
high temperature occurred suddenly last year, and I first thought it could be a
mismeasurement.

After testing VBox and occupying all available memory without any obvious error or
crash, I tried compiling the OS, and it seems that the notable load which LLVM/CLANG
produces when building world/kernel in parallel also triggers this bit flip, which very
quickly results in strange errors as reported earlier in this thread. The ultimate
failure arose when I tried to install Windows 7 on a free hard drive with 8 GB of RAM
installed: the install process died with a file corruption or not-copied file. I didn't
dare to try the FreeBSD installation since I know from the past that even FreeBSD's
copying very quickly triggers hardware issues if there are any (overheating and
siblings). With only 4 GB everything works as expected, but 4 GB is a pain in the ass
with ZFS and 11.0-CURRENT alone, not to mention the pain when doing memory-intensive
calculations/simulations or even VBox.

In the end, there is a mixed conclusion. I realise that I can not trust the verdict of
memtest86+. There is no suitable "burn-in" test for FreeBSD that consumes, stresses and
tortures memory and bus systems as well as all cores of the CPU, starting with Core2Duo
CPUs, since cpuburn/burncpu from the ports do not utilise AVX/SIMD or other "hot"
facilities of modern Intel-like CPUs, nor do they stress the integrated memory controller
in a "brutal" way. Prime95 is only available for i386 - and that is a pity on amd64 with
> 4 GB RAM.
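
To make concrete what the memory part of such a burn-in test does at its core, here is a
rough sketch: fill a buffer with a bit pattern, read it back, and report mismatches. All
names (fill, verify, stress, PATTERNS) are invented for this example. It assumes that
plain userland Python is acceptable for illustration, which also means it cannot choose
physical addresses, may be served from CPU caches, and generates far too little heat and
bus load to replace memtest86+ or a real torture tool.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def fill(buf: bytearray, pattern: int) -> None:
    """Overwrite every byte of buf with the given pattern byte."""
    buf[:] = bytes([pattern]) * len(buf)


def verify(buf: bytearray, pattern: int):
    """Return (offset, value) for every byte that does not hold the pattern."""
    expected = bytes([pattern]) * len(buf)
    if buf == expected:  # fast path: nothing differs
        return []
    return [(i, v) for i, v in enumerate(buf) if v != pattern]


def stress(size_mib: int = 16, passes: int = 1):
    """Run fill/verify over one buffer; return (pass, pattern, offset, value) tuples."""
    buf = bytearray(size_mib * 1024 * 1024)
    errors = []
    for n in range(passes):
        for pattern in PATTERNS:
            fill(buf, pattern)
            for offset, value in verify(buf, pattern):
                errors.append((n, pattern, offset, value))
    return errors


if __name__ == "__main__":
    found = stress()
    print("mismatches:", len(found))
```

On healthy hardware the list of mismatches stays empty. Since the errors described here
only appear under heat and load, a clean run of something this gentle proves very little.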

In the end, there is no reason to purchase a "Workstation-grade" mainboard again, as
advertised by ASUS, for instance, with this overclocking crap. It leaves behind a very
bitter taste, in my personal view. Since the memory problems I experienced do not show
up as permanent, "steady-state" problems, I fear data corruption not indicated by any
protection - so for the future, ECC is a must. And this means, even for "low end"
workstations, bye-bye cheap crappy Intel toy CPUs! At least a XEON type, ECC-capable
processor is a prerequisite, and I wish AMD had not followed the cheap man's path of
ripping the ECC facilities out of their consumer CPUs. It is a matter of fact that even
in the academic environment "cheap" ECC-less systems are purchased for "cost
effectiveness".

Finally, I personally wish for some massive torture tools like cpuburn, or something
more sophisticated, to stress the CPU to its limit - to test the reliability, the
cooling facilities and the power supply (is it flaky under heavy load, etc.?). FreeBSD's
ports do not even have the simplest Prime95 in a 64-bit version as it is available for
Linux or Windows. I'm sure some professionals are capable of pulling together some
massive stress test tools, but could these please be made available to the
not-so-professional and more "common" users? Maybe a naive Christmas wish?

I need to replace the system since I can not rely on that flaky box anymore, even when
using encrypted devices. That is, after a painful time and many hopes, the final
conclusion.

Regards and thanks for the patience reading this far,
Oliver



