From: "Mikhail A. Sokolov"
Date: Sun, 4 May 1997 01:37:01 +0400
To: hackers@freebsd.org
Cc: isp@freebsd.org
Subject: strange 2.2.1 behaviour.
Message-ID: <19970504013700.25396@skraldespand.demos.su>
Organization: Demos Company, Ltd., Moscow, Russian Federation.

Hello,

there is one problem I would dare to disturb you people with. Take the 4 machines described below: 2 HP and 2 self-made (rack-mounted industrial PCs). They have all been rebooting themselves without warning since they were moved to 2.2.1. They are all heavily loaded servers with 2 x 100 Mbit connections, and I think it is best to describe each of them in turn:

MH
    model HP 6/200 VA
    P6-200
    chipset Intel "Natoma"
    128MB EDO RAM
    adaptec 3949UW (TAG and SCB enabled)
    seagate ST32151W
    Intel EtherExpress Pro 10/100B (two, 100Mb full duplex)
    nfs client
    network activity is 200-400in/200-400out 1k packets/sec

The *&^&^ crashes every 5-30 minutes with the following reason:

    Trap 12: fault while in kernel mode
    ...
    virtual page address 0x0
    page not present

and it is only on rare occasions that this shy box lets out a yell like that; usually it just goes down without a word.

GK
    model HP 5/166 VL series 4
    P5-166
    chipset Intel 82437FX
    128MB RAM
    adaptec 2949UW (aic7880, TAG and SCB enabled)
    seagate ST32550W
    Intel EtherExpress Pro 10/100B (one, 100Mb full duplex)
    nfs client
    network activity is also around 200-400in/200-400out packets/sec
    crashes every 5-30 minutes

This one never lets society know why it wants to crash.

SB
    asus P/I-P6NP5
    P6-200
    chipset Intel "Natoma"
    128MB RAM
    adaptec 3949UW (TAG and SCB enabled)
    seagate ST32151W and ST19171W
    Intel EtherExpress Pro 10/100B (two, 100Mb full duplex)
    nfs server
    network activity 500-1000in/500-1000out packets/sec
    crashes once every 24-48 hours

This one is silent too, but it is definitely more loaded and, for some unknown reason, more stable. Of course I know HP sucks (pardon, but it does), but machines with ASUS motherboards definitely seem to be more stable than any HP-made PC.

Anyhow, there is another one, also self-made: ASUS, 2 x PPro 200, Natoma, 256 RAM, 3 x 3940 adaptecs, 10 disks (2 x 9gb and 8 x 4gb seagates) plus 2 fxp intel cards. It reboots about once a week, but without _any_ notice. This one is the most loaded, handling a huge ftp server, a proxy server etc.

The most interesting part is that the hardware is _not_ the culprit in these situations: we changed the memory in the boxes, the disks, the ethernet cards (tried de0's by SMC), even the power supplies. They are all on double UPSes, and all the supplies have enough power to feed those pieces of iron, but still the reboots happen.

When we investigated what was wrong, we tried to correlate the reboots with a) high disk activity, b) network activity, c) changes in the network situation. We got:

a) has nothing to do with the situation, since both ppro200's use their disks more than the others, and the last, unnamed one serves 10 disks easily, yet crashes less often than the others.

b) should be the culprit here: the MH and GK boxes were made to run looped find's with -exec ls -alRt (etc) over 100mbit full duplex NFS v3 (tested both the TCP and UDP variants) on a disk mounted from SB. MH and GK crash within 10/20/30 minutes, while the server stays up, and that is while serving 40/60 clients simultaneously (which gives 200-300 processes, a la sh/slirp). It is odd, but when you unplug the boxes from the network, they run fine for weeks (tested).

c) we tried to correlate SB's crashes with ARP info changes made via proxy ARP by the cisco standing nearby (4500/IOS 10.3): no luck. We also tried to correlate the growing number of virtual interfaces on SB (now ~130) with its reboots: no luck here either.

Now we are totally at a loss as to what is going on, what it can be and why; these boxes don't run anything other than well-known software like squid, ircd, slirpd and the like.

Sorry for the complicated explanation,

Sincerely yours,
Mikhail A. Sokolov.

P.S. Please, all ideas are welcome; if they don't fit the list, mail them to me directly. Don't let the bosses' decision happen, the one where ftp.ru.freebsd.org ends up living on some Sun box :-(
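
For anyone who wants to reproduce the load from (b), the looped find can be sketched roughly as below. This is only a sketch under assumptions: the function name nfs_stress, the pass counter and the progress line are made up for illustration, and the exact find arguments used in the original test (the "etc" above) are not reproduced here.

```shell
#!/bin/sh
# Sketch of the looped NFS load described in (b).
# nfs_stress DIR [PASSES]: walk DIR repeatedly, run "ls -alRt" on every
# entry that find reports, and print one line per completed pass.
# DIR is meant to be the NFS mount point on the client. The function
# name and the pass counter are assumptions, not part of the original test.
nfs_stress() {
    dir=$1
    passes=${2:-1}
    n=0
    while [ "$n" -lt "$passes" ]; do
        # The listing itself is thrown away; only the NFS traffic matters.
        find "$dir" -exec ls -alRt {} \; >/dev/null 2>&1
        n=$((n + 1))
        echo "pass $n done"
    done
}
```

On a client it could be started as "nfs_stress /mnt/sb 1000", where /mnt/sb is a hypothetical mount point for the disk exported by SB.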