Date:      Sun, 22 Feb 2015 12:22:59 -0800
From:      David Benfell <benfell@parts-unknown.org>
To:        galtsev@kicp.uchicago.edu
Cc:        cpet <cpet@sdf.org>, Polytropon <freebsd@edvax.de>, freebsd-questions@freebsd.org
Subject:   Re: why would I get a segmentation fault on one system but not the other?
Message-ID:  <590FB195-C4E9-4D22-8900-ABE784CE9896@parts-unknown.org>
In-Reply-To: <9134.76.193.19.10.1424620110.squirrel@cosmo.uchicago.edu>
References:  <20150221224006.GA5501@home.parts-unknown.org> <09da5ec0816e098badc49432c802dc18@sdf.org> <390c4c0547fc27e91d28872d29aa2e04@sdf.org> <20150222091956.fd1ec914.freebsd@edvax.de> <20150222104425.GA44573@home.parts-unknown.org> <9134.76.193.19.10.1424620110.squirrel@cosmo.uchicago.edu>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On February 22, 2015 7:48:30 AM PST, Valeri Galtsev <galtsev@kicp.uchicago.edu> wrote:
>
>On Sun, February 22, 2015 4:44 am, David Benfell wrote:
>> On Sun, Feb 22, 2015 at 09:19:56AM +0100, Polytropon wrote:
>>> On Sat, 21 Feb 2015 17:03:50 -0600, cpet wrote:
>>> > As well as don't use stable on a production box as STABLE doesn't mean
>>> > what it means.
>>>
>>> STABLE means that the API/ABI is stable. Unlike HEAD (CURRENT),
>>> STABLE still is actually _stable_ in most cases, so it's a valid
>>> solution for production systems (given that you're prepared well,
>>> and you know what you're doing). I'm running STABLE on few
>>> production machines myself (where this is needed), but I usually
>>> prefer (and often recommend) using RELEASE and add the security
>>> patches when they are available.
>>>
>> Thinking about this more, I'm inclined to think my problem is not with
>> the base system. I haven't seen *any* crashes with stuff that can be
>> clearly identified as being in the base system, let alone the kernel.
>>
>> My memory test has just completed a 4th pass with zero errors. It's
>> now been running for 7.5 hours.
>>
>
>How long does the box run before segfault? Some memory errors may happen
>with smaller probability, then short memtest may be OK, not detecting
>memory errors happening less often.
>
>What is the load of machine when segfault happens? During memtest86 the
>load is "zero". During actual server run, you may be heating the interior
>of the box to higher temperatures, namely memory controller to higher
>temperatures, which increases chance of malfunction.
>
>Do you have ECC memory or non-ECC? If non-ECC can you replace it with ECC?
>(some memory controllers accept both). Is it possible that you have
>mixture of different types of RAM attached to the same memory controller
>(I've seen even different brands claiming the same specs did cause
>occasional malfunctions). Also, which slots do you use for RAM? If not all
>slots have RAM, start filling the slots that are farther away from memory
>controller (which is on CPU substrate these days, hence from CPU). If you
>leave farthest slots open you will have open (not terminated) portion of
>transmission line causing reflections interfering with signal, leading to
>trouble. Some fancy system boards do have memory bus terminators so what I
>said about slots doesn't matter for them, but majority of boards do not.
>If the hardware is a suspect, I would begin with minimal amount of known
>good RAM.
>
>Swapping RAM between good and bad machines is another thing to try. I,
>however, would try instead to swap hard drives, and see which of machines
>will start failing after that. This way you will know for sure if software
>(+ hard drive) is to blame (if different machine starts failing) or
>hardware (if the same machine with system from good machine keeps
>failing).
>
>Good luck!
>
>Valeri
>
>++++++++++++++++++++++++++++++++++++++++
>Valeri Galtsev
>Sr System Administrator
>Department of Astronomy and Astrophysics
>Kavli Institute for Cosmological Physics
>University of Chicago
>Phone: 773-702-4247
>++++++++++++++++++++++++++++++++++++++++

Sorry for the top post; I'm on my phone now. A photo of the memtest from just before I shut it down is here: https://parts-unknown.org/wp/wp-content/uploads/2015/02/0222150941.jpg Hopefully it will answer some of the questions you pose.

The segfaults occur at start-up and consistently thereafter but only, so far as I know, with apache and php-fpm. I have not seen segfaults anywhere else on this system. It is plausible that apache is simply reporting segfaults from php. This is why I think something nefarious is happening within the ports.
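
One way to check the "only apache and php-fpm" impression is to tally signal-11 exits per process name from the system log. The following is only a rough sketch: it assumes the kernel logs lines of the form "pid 1234 (httpd), uid 80: exited on signal 11" (some releases add a "jid N," field, which the pattern tolerates), and the function name count_segfaults is made up for illustration.

```python
import re
from collections import Counter

# Assumed log format (not taken from this thread), e.g.:
#   pid 1234 (httpd), uid 80: exited on signal 11 (core dumped)
# The optional "jid N," field covers releases that log the jail ID.
SIGNAL_LINE = re.compile(
    r"pid \d+ \((?P<name>[^)]+)\),(?: jid \d+,)? uid \d+: "
    r"exited on signal (?P<sig>\d+)"
)


def count_segfaults(lines):
    """Return a Counter of signal-11 (SIGSEGV) exits keyed by process name."""
    counts = Counter()
    for line in lines:
        match = SIGNAL_LINE.search(line)
        if match and match.group("sig") == "11":
            counts[match.group("name")] += 1
    return counts
```

Feeding it the lines of the log file (commonly /var/log/messages, though that path is also an assumption about the setup) and printing counts.most_common() would show whether anything besides the web server and PHP is crashing.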

Your other suggestions will have to wait until I get back on site.

Thanks
- --
Sent from my Android device with K-9 Mail. Please excuse my brevity.
-----BEGIN PGP SIGNATURE-----
Version: APG v1.1.1

iQJfBAEBCgBJBQJU6jqjQhxEYXZpZCBCZW5mZWxsICgqc2lnaCouIFRyeWluZyBh
Z2Fpbi4pIDxiZW5mZWxsQHBhcnRzLXVua25vd24ub3JnPgAKCRAVeuMeEjZgK8uv
D/9sXo5akh5MCcjMKy7OIFKI6eM8pnh1VMZNe/qPgVNB7usCTS0/zGqwPEsXAsjH
Oh1Yrm/XQ7rMSpYMRzR7DrjIiUl2Qz1cH9Xj5xpB4dtD99pCMgo7XeyB7P0598fS
60MUcLqtDnzoN4QIKA0lSi/7mu50KYm+nhylIkl3C9S+ZtadeccM+Z+sXmW84JVB
DMtyEvcpV8fgA88uv5VPqGga2mOMlYRBKWFbElSdCJjS5L3mUIiNIxiN7iGZagyw
W89KbdbkJoEZ7ID/V0maMU9CzGA0QQWmfD3O4c0YQJAZeUiFuIf5VY9SgUi3rkvP
E7meqsJkHF4kpO/Iadr/C5zWetiAnhGU4FcdqdY1wqHGC2jywyQervyTDN6VPAnq
slwQQj1xJ99WPQ0Io7Ok4Td6vODtMESUDJ/NAcE4eWugp8FR6WQOldiCNeQnSsf6
LSWgaISijNRhDitM15ooWjpK/ehYECodmAItYPjWzwDwqnLrrM5R1bexGzmXuOoH
P6UdLHzWDYQvltdistw4seEOliufu8NQpHKnueMAtyCOyYsH5Fe3QJYaRajp+6FB
niyYDiN8CCugxcW8C1xoU+RNsRmhmDBqEvHUqckYs23rbuWAYxSc65zz5VD9WFro
hPYyUfHRQCtUjuV3AiyOb0QH78Y4EL1nrEargo07Md3gXg==
=bbfG
-----END PGP SIGNATURE-----



