Date: Sun, 22 Feb 2015 12:22:59 -0800
From: David Benfell <benfell@parts-unknown.org>
To: galtsev@kicp.uchicago.edu
Cc: cpet <cpet@sdf.org>, Polytropon <freebsd@edvax.de>, freebsd-questions@freebsd.org
Subject: Re: why would I get a segmentation fault on one system but not the other?
Message-ID: <590FB195-C4E9-4D22-8900-ABE784CE9896@parts-unknown.org>
In-Reply-To: <9134.76.193.19.10.1424620110.squirrel@cosmo.uchicago.edu>
References: <20150221224006.GA5501@home.parts-unknown.org>
 <09da5ec0816e098badc49432c802dc18@sdf.org>
 <390c4c0547fc27e91d28872d29aa2e04@sdf.org>
 <20150222091956.fd1ec914.freebsd@edvax.de>
 <20150222104425.GA44573@home.parts-unknown.org>
 <9134.76.193.19.10.1424620110.squirrel@cosmo.uchicago.edu>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On February 22, 2015 7:48:30 AM PST, Valeri Galtsev <galtsev@kicp.uchicago.edu> wrote:
>
>On Sun, February 22, 2015 4:44 am, David Benfell wrote:
>> On Sun, Feb 22, 2015 at 09:19:56AM +0100, Polytropon wrote:
>>> On Sat, 21 Feb 2015 17:03:50 -0600, cpet wrote:
>>> > As well as don't use stable on a production box as STABLE doesn't
>>> > mean what it means.
>>>
>>> STABLE means that the API/ABI is stable. Unlike HEAD (CURRENT),
>>> STABLE still is actually _stable_ in most cases, so it's a valid
>>> solution for production systems (given that you're prepared well,
>>> and you know what you're doing). I'm running STABLE on a few
>>> production machines myself (where this is needed), but I usually
>>> prefer (and often recommend) using RELEASE and adding the security
>>> patches when they are available.
>>>
>> Thinking about this more, I'm inclined to think my problem is not
>> with the base system. I haven't seen *any* crashes with stuff that
>> can be clearly identified as being in the base system, let alone the
>> kernel.
>>
>> My memory test has just completed a 4th pass with zero errors. It's
>> now been running for 7.5 hours.
>>
>
>How long does the box run before it segfaults? Some memory errors
>occur with low probability, so a short memtest run may come back clean
>simply because it did not run long enough to hit them.
>
>What is the load on the machine when the segfault happens? During
>memtest86 the load is "zero". During an actual server run, you may be
>heating the interior of the box, and the memory controller in
>particular, to higher temperatures, which increases the chance of
>malfunction.
>
>Do you have ECC or non-ECC memory? If non-ECC, can you replace it with
>ECC? (Some memory controllers accept both.) Is it possible that you
>have a mixture of different types of RAM attached to the same memory
>controller? (I've seen even different brands claiming the same specs
>cause occasional malfunctions.)
>
>Also, which slots do you use for RAM? If not all slots have RAM, start
>by filling the slots that are farther away from the memory controller
>(which is on the CPU substrate these days, hence from the CPU). If you
>leave the farthest slots open, you will have an open (unterminated)
>portion of the transmission line, causing reflections that interfere
>with the signal and lead to trouble. Some fancy system boards do have
>memory bus terminators, so what I said about slots doesn't matter for
>them, but the majority of boards do not.
>
>If the hardware is suspect, I would begin with a minimal amount of
>known good RAM.
>
>Swapping RAM between the good and bad machines is another thing to
>try. I, however, would instead swap the hard drives and see which of
>the machines starts failing after that. This way you will know for
>sure whether software (+ hard drive) is to blame (if the other machine
>starts failing) or hardware (if the same machine, now running the
>system from the good machine, keeps failing).
>
>Good luck!
>
>Valeri
>
>++++++++++++++++++++++++++++++++++++++++
>Valeri Galtsev
>Sr System Administrator
>Department of Astronomy and Astrophysics
>Kavli Institute for Cosmological Physics
>University of Chicago
>Phone: 773-702-4247
>++++++++++++++++++++++++++++++++++++++++

Sorry for the top post; I'm on my phone now.

A photo of the memtest from just before I shut it down is here:
https://parts-unknown.org/wp/wp-content/uploads/2015/02/0222150941.jpg

Hopefully it will answer some of the questions you pose.

The segfaults occur at start-up and consistently thereafter but only,
so far as I know, with apache and php-fpm. I have not seen segfaults
anywhere else on this system. It is plausible that apache is simply
reporting segfaults from php. This is why I think something nefarious
is happening within the ports.

Your other suggestions will have to wait until I get back on site.

Thanks
- -- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
-----BEGIN PGP SIGNATURE----- Version: APG v1.1.1 iQJfBAEBCgBJBQJU6jqjQhxEYXZpZCBCZW5mZWxsICgqc2lnaCouIFRyeWluZyBh Z2Fpbi4pIDxiZW5mZWxsQHBhcnRzLXVua25vd24ub3JnPgAKCRAVeuMeEjZgK8uv D/9sXo5akh5MCcjMKy7OIFKI6eM8pnh1VMZNe/qPgVNB7usCTS0/zGqwPEsXAsjH Oh1Yrm/XQ7rMSpYMRzR7DrjIiUl2Qz1cH9Xj5xpB4dtD99pCMgo7XeyB7P0598fS 60MUcLqtDnzoN4QIKA0lSi/7mu50KYm+nhylIkl3C9S+ZtadeccM+Z+sXmW84JVB DMtyEvcpV8fgA88uv5VPqGga2mOMlYRBKWFbElSdCJjS5L3mUIiNIxiN7iGZagyw W89KbdbkJoEZ7ID/V0maMU9CzGA0QQWmfD3O4c0YQJAZeUiFuIf5VY9SgUi3rkvP E7meqsJkHF4kpO/Iadr/C5zWetiAnhGU4FcdqdY1wqHGC2jywyQervyTDN6VPAnq slwQQj1xJ99WPQ0Io7Ok4Td6vODtMESUDJ/NAcE4eWugp8FR6WQOldiCNeQnSsf6 LSWgaISijNRhDitM15ooWjpK/ehYECodmAItYPjWzwDwqnLrrM5R1bexGzmXuOoH P6UdLHzWDYQvltdistw4seEOliufu8NQpHKnueMAtyCOyYsH5Fe3QJYaRajp+6FB niyYDiN8CCugxcW8C1xoU+RNsRmhmDBqEvHUqckYs23rbuWAYxSc65zz5VD9WFro hPYyUfHRQCtUjuV3AiyOb0QH78Y4EL1nrEargo07Md3gXg== =bbfG -----END PGP SIGNATURE-----