Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 29 Sep 2010 00:00:17 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Don Lewis <truckman@FreeBSD.org>
Cc:        stable@FreeBSD.org, sterling@camdensoftware.com
Subject:   Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
Message-ID:  <20100929070017.GA82362@icarus.home.lan>
In-Reply-To: <201009290531.o8T5VRZJ061189@gw.catspoiler.org>
References:  <201009282344.o8SNiqSK060715@gw.catspoiler.org> <201009290531.o8T5VRZJ061189@gw.catspoiler.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, Sep 28, 2010 at 10:31:27PM -0700, Don Lewis wrote:
> On 28 Sep, Don Lewis wrote:
> 
> > Looking at the timestamps of things and comparing to my logs, I
> > discovered that the last instance of ntp instability happened when I was
> > running "make index" in /usr/ports.  I tried it again with entertaining
> > results.  After a while, the machine became unresponsive.  I was logged
> > in over ssh and it stopped echoing keystrokes.  In parallel I was
> > running a script that echoed the date, the results of "vmstat -i", and
> > the results of "ntpq -c pe".  The latter showed jitter and offset going
> > insane.  Eventually "make index" finished and the machine was responsive
> > again, but the time was way off and ntpd croaked because the necessary
> > time correction was too large.  Nothing else anomalous showed up in the
> > logs.  Hmn, about half an hour after ntpd died I started my CPU time
> > accounting test and two minutes into that test I got a spew of calcru
> > messages ...
> 
> I tried this experiment again using a kernel with WITNESS and
> DEBUG_VFS_LOCKS compiled in, and pinging this machine from another.
> Things look normal for a while, then the ping times get huge for a while
> and then recover.
> 
> 64 bytes from 192.168.101.3: icmp_seq=1169 ttl=64 time=0.135 ms
> 64 bytes from 192.168.101.3: icmp_seq=1170 ttl=64 time=0.141 ms
> 64 bytes from 192.168.101.3: icmp_seq=1171 ttl=64 time=0.130 ms
> 64 bytes from 192.168.101.3: icmp_seq=1172 ttl=64 time=0.131 ms
> 64 bytes from 192.168.101.3: icmp_seq=1173 ttl=64 time=0.128 ms
> 64 bytes from 192.168.101.3: icmp_seq=1174 ttl=64 time=38232.140 ms
> 64 bytes from 192.168.101.3: icmp_seq=1175 ttl=64 time=37231.309 ms
> 64 bytes from 192.168.101.3: icmp_seq=1176 ttl=64 time=36230.470 ms
> 64 bytes from 192.168.101.3: icmp_seq=1177 ttl=64 time=35229.632 ms
> 64 bytes from 192.168.101.3: icmp_seq=1178 ttl=64 time=34228.791 ms
> 64 bytes from 192.168.101.3: icmp_seq=1179 ttl=64 time=33227.953 ms
> 64 bytes from 192.168.101.3: icmp_seq=1180 ttl=64 time=32227.091 ms
> 64 bytes from 192.168.101.3: icmp_seq=1181 ttl=64 time=31226.262 ms
> 64 bytes from 192.168.101.3: icmp_seq=1182 ttl=64 time=30225.425 ms
> 64 bytes from 192.168.101.3: icmp_seq=1183 ttl=64 time=29224.597 ms
> 64 bytes from 192.168.101.3: icmp_seq=1184 ttl=64 time=28223.757 ms
> 64 bytes from 192.168.101.3: icmp_seq=1185 ttl=64 time=27222.918 ms
> 64 bytes from 192.168.101.3: icmp_seq=1186 ttl=64 time=26222.086 ms
> 64 bytes from 192.168.101.3: icmp_seq=1187 ttl=64 time=25221.164 ms
> 64 bytes from 192.168.101.3: icmp_seq=1188 ttl=64 time=24220.407 ms
> 64 bytes from 192.168.101.3: icmp_seq=1189 ttl=64 time=23219.575 ms
> 64 bytes from 192.168.101.3: icmp_seq=1190 ttl=64 time=22218.737 ms
> 64 bytes from 192.168.101.3: icmp_seq=1191 ttl=64 time=21217.905 ms
> 64 bytes from 192.168.101.3: icmp_seq=1192 ttl=64 time=20217.066 ms
> 64 bytes from 192.168.101.3: icmp_seq=1193 ttl=64 time=19216.228 ms
> 64 bytes from 192.168.101.3: icmp_seq=1194 ttl=64 time=18215.333 ms
> 64 bytes from 192.168.101.3: icmp_seq=1195 ttl=64 time=17214.503 ms
> 64 bytes from 192.168.101.3: icmp_seq=1196 ttl=64 time=16213.720 ms
> 64 bytes from 192.168.101.3: icmp_seq=1197 ttl=64 time=15210.912 ms
> 64 bytes from 192.168.101.3: icmp_seq=1198 ttl=64 time=14210.044 ms
> 64 bytes from 192.168.101.3: icmp_seq=1199 ttl=64 time=13209.194 ms
> 64 bytes from 192.168.101.3: icmp_seq=1200 ttl=64 time=12208.376 ms
> 64 bytes from 192.168.101.3: icmp_seq=1201 ttl=64 time=11207.536 ms
> 64 bytes from 192.168.101.3: icmp_seq=1202 ttl=64 time=10206.694 ms
> 64 bytes from 192.168.101.3: icmp_seq=1203 ttl=64 time=9205.816 ms
> 64 bytes from 192.168.101.3: icmp_seq=1204 ttl=64 time=8205.014 ms
> 64 bytes from 192.168.101.3: icmp_seq=1205 ttl=64 time=7204.186 ms
> 64 bytes from 192.168.101.3: icmp_seq=1206 ttl=64 time=6203.294 ms
> 64 bytes from 192.168.101.3: icmp_seq=1207 ttl=64 time=5202.510 ms
> 64 bytes from 192.168.101.3: icmp_seq=1208 ttl=64 time=4201.677 ms
> 64 bytes from 192.168.101.3: icmp_seq=1209 ttl=64 time=3200.851 ms
> 64 bytes from 192.168.101.3: icmp_seq=1210 ttl=64 time=2200.013 ms
> 64 bytes from 192.168.101.3: icmp_seq=1211 ttl=64 time=1199.100 ms
> 64 bytes from 192.168.101.3: icmp_seq=1212 ttl=64 time=198.331 ms
> 64 bytes from 192.168.101.3: icmp_seq=1213 ttl=64 time=0.129 ms
> 64 bytes from 192.168.101.3: icmp_seq=1214 ttl=64 time=58223.470 ms
> 64 bytes from 192.168.101.3: icmp_seq=1215 ttl=64 time=57222.637 ms
> 64 bytes from 192.168.101.3: icmp_seq=1216 ttl=64 time=56221.800 ms
> 64 bytes from 192.168.101.3: icmp_seq=1217 ttl=64 time=55220.960 ms
> 64 bytes from 192.168.101.3: icmp_seq=1218 ttl=64 time=54220.116 ms
> 64 bytes from 192.168.101.3: icmp_seq=1219 ttl=64 time=53219.282 ms
> 64 bytes from 192.168.101.3: icmp_seq=1220 ttl=64 time=52218.444 ms
> 64 bytes from 192.168.101.3: icmp_seq=1221 ttl=64 time=51217.618 ms
> 64 bytes from 192.168.101.3: icmp_seq=1222 ttl=64 time=50216.778 ms
> 64 bytes from 192.168.101.3: icmp_seq=1223 ttl=64 time=49215.932 ms
> 64 bytes from 192.168.101.3: icmp_seq=1224 ttl=64 time=48215.095 ms
> 64 bytes from 192.168.101.3: icmp_seq=1225 ttl=64 time=47214.262 ms
> 64 bytes from 192.168.101.3: icmp_seq=1226 ttl=64 time=46213.440 ms
> 64 bytes from 192.168.101.3: icmp_seq=1227 ttl=64 time=45212.623 ms
> 64 bytes from 192.168.101.3: icmp_seq=1228 ttl=64 time=44211.783 ms
> 64 bytes from 192.168.101.3: icmp_seq=1229 ttl=64 time=43210.903 ms
> 64 bytes from 192.168.101.3: icmp_seq=1230 ttl=64 time=42210.111 ms
> 64 bytes from 192.168.101.3: icmp_seq=1231 ttl=64 time=41209.274 ms
> 64 bytes from 192.168.101.3: icmp_seq=1232 ttl=64 time=40208.448 ms
> 64 bytes from 192.168.101.3: icmp_seq=1233 ttl=64 time=39207.608 ms
> 64 bytes from 192.168.101.3: icmp_seq=1234 ttl=64 time=38206.774 ms
> 64 bytes from 192.168.101.3: icmp_seq=1235 ttl=64 time=37205.842 ms
> 64 bytes from 192.168.101.3: icmp_seq=1236 ttl=64 time=36205.104 ms
> 64 bytes from 192.168.101.3: icmp_seq=1237 ttl=64 time=35204.270 ms
> 64 bytes from 192.168.101.3: icmp_seq=1238 ttl=64 time=34203.433 ms
> 64 bytes from 192.168.101.3: icmp_seq=1239 ttl=64 time=33202.603 ms
> 64 bytes from 192.168.101.3: icmp_seq=1240 ttl=64 time=32201.764 ms
> 64 bytes from 192.168.101.3: icmp_seq=1241 ttl=64 time=31200.924 ms
> 64 bytes from 192.168.101.3: icmp_seq=1242 ttl=64 time=30200.082 ms
> 64 bytes from 192.168.101.3: icmp_seq=1243 ttl=64 time=29198.883 ms
> 64 bytes from 192.168.101.3: icmp_seq=1244 ttl=64 time=28198.414 ms
> 64 bytes from 192.168.101.3: icmp_seq=1245 ttl=64 time=27197.434 ms
> 64 bytes from 192.168.101.3: icmp_seq=1246 ttl=64 time=26196.738 ms
> 64 bytes from 192.168.101.3: icmp_seq=1247 ttl=64 time=25195.912 ms
> 64 bytes from 192.168.101.3: icmp_seq=1248 ttl=64 time=24195.074 ms
> 64 bytes from 192.168.101.3: icmp_seq=1249 ttl=64 time=23194.231 ms
> 64 bytes from 192.168.101.3: icmp_seq=1250 ttl=64 time=22193.407 ms
> 64 bytes from 192.168.101.3: icmp_seq=1251 ttl=64 time=21192.565 ms
> 64 bytes from 192.168.101.3: icmp_seq=1252 ttl=64 time=20191.725 ms
> 64 bytes from 192.168.101.3: icmp_seq=1253 ttl=64 time=19190.852 ms
> 64 bytes from 192.168.101.3: icmp_seq=1254 ttl=64 time=18190.060 ms
> 64 bytes from 192.168.101.3: icmp_seq=1255 ttl=64 time=17189.220 ms
> 64 bytes from 192.168.101.3: icmp_seq=1256 ttl=64 time=16188.381 ms
> 64 bytes from 192.168.101.3: icmp_seq=1257 ttl=64 time=15183.118 ms
> 64 bytes from 192.168.101.3: icmp_seq=1258 ttl=64 time=14182.711 ms
> 64 bytes from 192.168.101.3: icmp_seq=1259 ttl=64 time=13181.876 ms
> 64 bytes from 192.168.101.3: icmp_seq=1260 ttl=64 time=12181.034 ms
> 64 bytes from 192.168.101.3: icmp_seq=1261 ttl=64 time=11180.192 ms
> 64 bytes from 192.168.101.3: icmp_seq=1262 ttl=64 time=10179.357 ms
> 64 bytes from 192.168.101.3: icmp_seq=1263 ttl=64 time=9178.522 ms
> 64 bytes from 192.168.101.3: icmp_seq=1264 ttl=64 time=8177.692 ms
> 64 bytes from 192.168.101.3: icmp_seq=1265 ttl=64 time=7176.850 ms
> 64 bytes from 192.168.101.3: icmp_seq=1266 ttl=64 time=6176.026 ms
> 64 bytes from 192.168.101.3: icmp_seq=1267 ttl=64 time=5175.185 ms
> 64 bytes from 192.168.101.3: icmp_seq=1268 ttl=64 time=4174.355 ms
> 64 bytes from 192.168.101.3: icmp_seq=1269 ttl=64 time=3173.479 ms
> 64 bytes from 192.168.101.3: icmp_seq=1270 ttl=64 time=2172.658 ms
> 64 bytes from 192.168.101.3: icmp_seq=1271 ttl=64 time=1171.835 ms
> 64 bytes from 192.168.101.3: icmp_seq=1272 ttl=64 time=170.971 ms
> 64 bytes from 192.168.101.3: icmp_seq=1273 ttl=64 time=0.138 ms
> 64 bytes from 192.168.101.3: icmp_seq=1274 ttl=64 time=0.162 ms
> 64 bytes from 192.168.101.3: icmp_seq=1275 ttl=64 time=0.133 ms
> 64 bytes from 192.168.101.3: icmp_seq=1276 ttl=64 time=0.140 ms
> 64 bytes from 192.168.101.3: icmp_seq=1277 ttl=64 time=0.138 ms
> 64 bytes from 192.168.101.3: icmp_seq=1278 ttl=64 time=0.132 ms
> 64 bytes from 192.168.101.3: icmp_seq=1279 ttl=64 time=0.132 ms
> 64 bytes from 192.168.101.3: icmp_seq=1280 ttl=64 time=0.132 ms
> 64 bytes from 192.168.101.3: icmp_seq=1281 ttl=64 time=0.129 ms
> 
> At that point the machine silently rebooted inspite of being compiled
> with KDB and DDB and not KDB_UNATTENDED.  This silent reboot is
> reproduceable.

Given all the information here, in addition to the other portion of the
thread (indicating ntpd reports extreme offset between the system clock
and its stratum 1 source), I would say the motherboard is faulty or
there is a system device which is behaving badly (possibly something
pertaining to interrupts, but I don't know how to debug this on a low
level).

Can you boot verbosely and provide all of the output here or somewhere
on the web?

If possible, I would start by replacing the mainboard.  The board looks
to be a consumer-level board (I see an nfe(4) controller, for example).

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100929070017.GA82362>