Date: Thu, 14 Feb 2008 15:27:29 -0700 (MST)
From: Brett Bump <bbump@rsts.org>
To: Kris Kennaway <kris@freebsd.org>
Cc: freebsd-performance@freebsd.org
Subject: Re: System perforamance 4.x vs. 5.x and 6.x
Message-ID: <20080214131026.Y75492@mail.rsts.org>
In-Reply-To: <47B49A16.1080103@FreeBSD.org>
References: <20080214114759.R75215@mail.rsts.org> <47B49A16.1080103@FreeBSD.org>

On Thu, 14 Feb 2008, Kris Kennaway wrote:

> We are going to need more information about your system. What do you
> mean by "peak activity"? What is running on the system when it performs
> badly (check top -S, ps, gstat, vmstat -w, vmstat -i). What is your
> kernel configuration, dmesg and relevant aspects of the system
> configuration?
>
> Kris
>

I would call 120 processes with a load average of 0.03 and 99.9% idle,
with 10-20 sendmail processes and 30 apache jobs, nothing to write home
about. But when that jumps to 250 processes and a load average of 30 with
50% idle (5-10 second waits on single character ssh echo), I would call it
a bit busy. That usually means my heavy pop3 users are checking in at the
same time someone (or 2 or 3) has sent email to the large volume listservs.

Proc stat doesn't show as much as gstat and iostat. Gstat always shows my
drive with /var/mail being 97-100% busy, and iostat will always show high
tps rates, but never anything above 8MB/s (4.10 gave me 30MB/s+).

Kernel is generic with ipfirewall, quota and smp (no ipfw rules yet).

On Thu, 14 Feb 2008, Bill Moran wrote:

> What _is_ the hardware?

Dell PowerEdge 1750 1U, 146Gig U320s. The Broadcoms seem to be a change
from the earlier 1550s with Intel PRO/100s (I prefer the Intels).

On Thu, 14 Feb 2008, Kris Kennaway wrote:

> All it takes is a single bug (e.g. in a driver) to affect performance on
> a certain specific configuration. However, bugs tend to get fixed over
> time. Maybe that is the case for you. It is well worth verifying
> whether the problem persists on the most up-to-date sources, so that
> everyone's time is not wasted in tracking down a problem that is already
> fixed. You can just do a source upgrade from 6.2, which will be quite
> straightforward.

Agreed. I have a second machine, identical to this one, that I could put
6.3 on to test this.

> It is pretty unusual for applications to be aborting, but usually they
> do it because they fail an application-specific run-time check. What
> diagnostics are logged by the applications? You may need to increase
> their respective verbosity/debug levels.
>
> Kris
>

I was suspicious that maybe we needed more memory, but swap has barely
even been touched (232k used...with 1400meg inactive).

On Thu, 14 Feb 2008, Mike Tancsa wrote:

> No, but you havent given the list much to go on as to what the
> problems are or what hardware you are using, or really quantified the
> issue. By "slow" is the disk blocking on IO ? or are processes
> blocking on network IO etc etc. 6.2 was not a "bad" release, but 6.3
> is better than 6.2. By starting with a more contemporary release,
> less effort by developers and other users need to be exerted in
> figuring out if the problem(s) you are running into have already been
> fixed.

It appears to me that disk access is extremely slow. I can transfer
large files between the machines faster than I can make a duplicate copy
on disk.

> Because the drivers have changed since 4.10. "improvements" could
> have introduced regressions... Change in the driver to support newer
> versions of a chipset might break older chipsets.

Is anyone aware of any known issues with the Dell PERC RAID driver? I can
start there.

> bge is a good example of a driver that has had a lot of changes and
> hasnt worked all that well at times.... hence the suggestion to try
> 6.3 as there have been many bug fixes. Whether or not it fixes your
> problem its hard to say, but start there to see if things are faster
> and stable for you etc.
> e.g.
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/bge/if_bge.c
>
> You should also post a full dmesg of the box as well as kernel config
> etc...

The kernel is generic with ipfirewall, quota and smp.

Feb 14 02:53:37 mail sm-mta[33143]: m1E9qKLZ033143: SYSERR(root): collect: I/O error on connection from astro.pryor.com, from=<CUSTOMERSERVICE@EM.PRYOR.COM>
pid 31611 (milter-greylist), uid 25: exited on signal 3
Feb 14 03:17:08 mail sshd[34844]: warning: /etc/hosts.allow, line 45: can't verify hostname: getaddrinfo(host-200-6-102-230.iia.cl, AF_INET) failed
Feb 14 03:17:08 mail sshd[34844]: refused connect from 200.6.102.230 (200.6.102.230)
Feb 14 03:36:30 mail sshd[35944]: refused connect from 202.129.44.218 (202.129.44.218)
Feb 14 03:45:21 mail sshd[36667]: refused connect from 202.129.44.218 (202.129.44.218)
Feb 14 03:52:01 mail sm-mta[33092]: m1E9peX3033092: SYSERR(root): collect: read timeout on connection from astro.pryor.com, from=<CUSTOMERSERVICE@EM.PRYOR.COM>
Feb 14 07:24:01 mail sshd[52723]: warning: /etc/hosts.allow, line 45: can't verify hostname: getaddrinfo(42.215.6.200.intelnet.net.gt, AF_INET) failed
Feb 14 07:24:01 mail sshd[52723]: refused connect from 200.6.215.42 (200.6.215.42)
Feb 14 07:28:56 mail sm-mta[52866]: m1EEPPLC052866: SYSERR(root): collect: I/O error on connection from astro.pryor.com, from=<CUSTOMERSERVICE@EM.PRYOR.COM>
Feb 14 07:29:15 mail sshd[53465]: warning: /etc/hosts.allow, line 45: can't verify hostname: getaddrinfo(42.215.6.200.intelnet.net.gt, AF_INET) failed
Feb 14 07:29:15 mail sshd[53465]: refused connect from 200.6.215.42 (200.6.215.42)
Feb 14 08:01:57 mail sshd[58183]: refused connect from mail.rsib.net (12.46.46.98)
Feb 14 08:07:22 mail sshd[59017]: refused connect from mail.rsib.net (12.46.46.98)
Feb 14 09:50:00 mail su: bbump to root on /dev/ttyp0
pid 43464 (httpd), uid 80: exited on signal 6
pid 86995 (imapd), uid 2151: exited on signal 6
pid 85706 (httpd), uid 80: exited on signal 6
pid 87600 (imapd), uid 1376: exited on signal 6
pid 45621 (httpd), uid 80: exited on signal 6
pid 45617 (httpd), uid 80: exited on signal 6
Feb 14 11:28:36 mail inetd[48076]: imap4 from 208.107.161.82 exceeded counts/min (limit 60/min)
Feb 14 11:28:38 mail last message repeated 2 times
Feb 14 11:52:34 mail sm-mta[99563]: m1EHqX9u099563: SYSERR(root): collect: read timeout on connection from fulltimeconsult.com, from=<AARPMembership@wlq.fulltimsgeconsult.com>
Feb 14 13:06:27 mail su: bbump to root on /dev/ttyp0
pid 45995 (imapd), uid 3115: exited on signal 6
pid 46407 (imapd), uid 1873: exited on signal 6
pid 46418 (imapd), uid 2769: exited on signal 6
pid 46402 (imapd), uid 1873: exited on signal 6
pid 46651 (imapd), uid 2769: exited on signal 6
pid 46653 (imapd), uid 2769: exited on signal 6
pid 44499 (httpd), uid 80: exited on signal 6
pid 47035 (imapd), uid 1873: exited on signal 6
pid 46083 (httpd), uid 80: exited on signal 6
pid 46395 (httpd), uid 80: exited on signal 6
pid 46604 (httpd), uid 80: exited on signal 6
pid 46603 (httpd), uid 80: exited on signal 6

> what does
> netstat -ni
> give

-bash-2.05b$ netstat -ni
Name   Mtu Network       Address              Ipkts Ierrs    Opkts Oerrs  Coll
bge0  1500 <Link#1>      00:0f:1f:66:0e:e6 12511748   902 12025487     0     0
bge0  1500 208.107.160/2 208.107.161.82    17011211     - 16533277     -     -
bge1  1500 <Link#2>      00:0f:1f:66:0e:e8  3523091   586  4089056     0     0
bge1  1500 10.1.1/24     10.1.1.1           3516790     -  4087415     -     -
lo0  16384 <Link#3>                         4659734     0  4659733     0     0
lo0  16384 fe80:3::1/64  fe80:3::1                0     -        0     -     -
lo0  16384 ::1/128       ::1                   2772     -     2772     -     -
lo0  16384 127           127.0.0.1           147255     -   147255     -     -

> and what options do you have on ifconfig ? Are the errors seen on
> your switch port as well or just in netstat -ni ?

ifconfig_bge0="inet 208.107.161.82 netmask 255.255.254.0 media 100baseTX mediaopt full-duplex"
ifconfig_bge1="inet 10.1.1.1 netmask 255.255.255.0 media 100baseTX mediaopt full-duplex"

No, the switch shows clear, they only show up as input errors on this
box. The box sitting under this one has an uptime of 621 days with 1
Oerr.

> Why are the processes sigabrting ? Is there anything in the
> application logs to indicate why they are exiting ?
>
> ---Mike
>

[Thu Feb 14 09:59:23 2008] [notice] child pid 43464 exit signal Abort trap (6)
httpd in malloc(): error: recursive call
[Thu Feb 14 10:07:34 2008] [notice] child pid 85706 exit signal Abort trap (6)
httpd in free(): error: recursive call
[Thu Feb 14 10:48:39 2008] [notice] child pid 45621 exit signal Abort trap (6)
httpd in free(): error: recursive call

Memory. This is why I was willing to throw another 2gig of memory in
it, but why am I only seeing 268K of swap used?

Brett

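
One way to put numbers on the aborts before raising application log verbosity is to tally the kernel's "exited on signal" lines by program and signal. What follows is only a minimal Python sketch, not something used in this thread: the function name tally_signal_exits, the regular expression, and the embedded sample (a small subset of the kernel lines quoted in this message) are all illustrative assumptions.

```python
import re
from collections import Counter

# Sample input: a subset of the kernel messages quoted in this message.
# In practice the text would come from the system log (location assumed).
LOG = """\
pid 31611 (milter-greylist), uid 25: exited on signal 3
pid 43464 (httpd), uid 80: exited on signal 6
pid 86995 (imapd), uid 2151: exited on signal 6
pid 85706 (httpd), uid 80: exited on signal 6
pid 87600 (imapd), uid 1376: exited on signal 6
"""

# Matches lines of the form "pid N (prog), uid N: exited on signal N".
PATTERN = re.compile(
    r"pid (\d+) \(([^)]+)\), uid (\d+): exited on signal (\d+)"
)


def tally_signal_exits(text):
    """Count (program, signal) pairs among 'exited on signal' lines."""
    counts = Counter()
    for match in PATTERN.finditer(text):
        _pid, program, _uid, signal = match.groups()
        counts[(program, int(signal))] += 1
    return counts


if __name__ == "__main__":
    for (program, signal), n in sorted(tally_signal_exits(LOG).items()):
        print(f"{program}: signal {signal} x {n}")
```

Grouping by program and signal keeps the httpd and imapd aborts (signal 6) apart from the single milter-greylist exit (signal 3), which may have a different cause.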
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080214131026.Y75492>