From owner-freebsd-arm@freebsd.org Thu Jan 7 19:48:27 2016
From: Ian Lepore <ian@freebsd.org>
To: Mark Millard
Cc: freebsd-arm
Subject: Re: FYI: various 11.0-CURRENT -r293227 (and older) hangs on arm (rpi2): a description of sorts
Date: Thu, 07 Jan 2016 12:48:19 -0700
Message-ID: <1452196099.1215.12.camel@freebsd.org>
References: <1452183170.1215.4.camel@freebsd.org>
X-Mailer: Evolution 3.16.5 FreeBSD GNOME Team Port
List-Id: "Porting FreeBSD to ARM processors."

On Thu, 2016-01-07 at 11:24 -0800, Mark Millard wrote:
> On 2016-Jan-7, at 8:12 AM, Ian Lepore wrote:
> > 
> > On Thu, 2016-01-07 at 02:19 -0800, Mark Millard wrote:
> > > I've had various hangs when the rpi2 was busy over longish
> > > periods, both debug buildkernel/buildworld builds of the arm and
> > > non-debug variants. No log files or console messages produced.
> > > 
> > > I've not had any analogous issues with powerpc64 (PowerMac G5)
> > > or with amd64 (Virtual Box used on Mac OS X).
> > > 
> > > I've finally discovered that if I have, say, top running on the
> > > rpi2 serial console, then top continues to update its display so
> > > long as I leave it alone during the hang. (Otherwise it hangs
> > > too.) So I finally have a little window for seeing some of what
> > > is happening.
> > > 
> > > An example top display showed after the hang:
> > > 
> > > Mem: 764M Active 12M Inact 141M Wired 98M Buf 8k free
> > > Swap: 2048M Total 29M Used 2019 Free 1% in use
> > > 
> > > (Yep: Just 8K free Mem.)
> > 
> > That's not a problem.
> > 
> > > The unusual STATEs for processes seemed to be (for the specific
> > > hang):
> > > 
> > > STATE   COMMANDs
> > > pfault  [ld] [ld] /usr/sbin/syslogd
> > > vmwait  [ld] [md0] [kernel]
> > > wswbuf  [pagedaemon]
> > > 
> > > Those same 3 states seem to always be involved. Some of the
> > > processes vary from one hang to the next: the prior hang had
> > > build/genautoma , /usr/sbin/moused , and /usr/sbin/ntpd instead
> > > of 3 [ld]'s.
> > > 
> > > /usr/sbin/syslogd, [md0], [kernel], and [pagedaemon] and their
> > > states do not seem to vary (so far).
> > 
> > Everything is backed up waiting for slow sdcard IO.
> > You can get an amd64 system with many cores and gigabytes of ram
> > into the same state with an sdcard (or any other storage device
> > that takes literally seconds for any individual IO to complete).
> > All the available buffers get queued up to the one slow device,
> > then you can't do anything that requires IO (even launch tools to
> > try to figure out what's going on).
> > 
> > -- Ian
> 
> This is not the (or a) sdcard for the root file system; it is a
> fast, 400GB+ SSD, USB 3.0 capable (not that the rpi2 uses it that
> way). Note below the "da0" and the size and such (other than
> /boot/msdos):
> 
> ugen0.5: at usbus0
> umass0: addr 5> on usbus0
> umass0: SCSI over Bulk-Only; quirks = 0x0100
> umass0:0:0: Attached to scbus0
> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
> da0: Fixed Direct Access SPC-4 SCSI device
> da0: Serial Number XXXXXXXXXXXX
> Release APs
> da0: 40.000MB/s transfers
> da0: 457862MB (937703088 512 byte sectors)
> da0: quirks=0x2
> Trying to mount root from ufs:/dev/ufs/RPI2rootfs [rw,noatime]...
> . . .
> Starting file system checks:
> /dev/ufs/RPI2rootfs: FILE SYSTEM CLEAN; SKIPPING CHECKS
> /dev/ufs/RPI2rootfs: clean, 109711666 free (14002 frags, 13712208
> blocks, 0.0% fragmentation)
> Mounting local file systems:.
> . . .
> 
> > Filesystem          1M-blocks   Used  Avail Capacity  Mounted on
> > /dev/ufs/RPI2rootfs    443473  16791 391203     4%    /
> > devfs                       0      0      0   100%    /dev
> > /dev/mmcsd0s1              49      7     42    15%    /boot/msdos
> 
> In USB 3.0 contexts I have never observed seconds for an IO with
> these types of SSDs, and I use them that way extensively. Nor for
> USB 2.0 use, though that is not as common a context for me. Nor have
> I had any problems with the type of USB 3.0 capable hub messing up
> IO.
> 
> I use this type of SSD to hold my Virtual Box virtual machine(s)
> that I run amd64 FreeBSD in on Mac OS X. No problems there. But it
> is true that I've never directly booted amd64 FreeBSD from one of
> these SSDs in a non-virtual amd64 context.
> 
> Ignoring that for a moment: so this is acceptable/expected FreeBSD
> behavior when a "disk" device is slow? Interesting. I've let it sit
> for hours and the hangup does not clear: it is effectively
> deadlocked for overall usage. The rpi2 will never be able to
> buildworld, buildkernel, ports, etc. reliably if this is the sort of
> behavior that results.
> 
> Back to this context: is there a way for me to confirm the queuing
> of buffers to the SSD? Or at least some detail about its buffer
> usage? Can I get some information from ddb that would
> confirm/deny/provide insight?

If the filesystems and swap space are on a usb drive, then maybe it's
the usb subsystem that's hanging. The wait states you showed for
those processes are consistent with what I've seen when all buffers
get backed up in a queue on one non-responsive or slow device. It may
be that there's a way to get the system deadlocked when it's low on
buffers and there is memory pressure causing the swap to be used (I
generally run arm systems without any swap configured).

Running gstat in another window while this is going on may give you
some insight into the situation (a sample invocation is sketched
below). Beyond that I don't know what to look at, especially since
you generally can't launch any new tools once the system gets into
this kind of state.

-- Ian
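
For example, a minimal gstat invocation for watching whether IO really
is stacking up on da0 might look like this (the one-second interval
and the physical-providers filter are just one reasonable choice; see
gstat(8) for the details):

    # refresh once per second, physical providers (da0, mmcsd0) only
    gstat -p -I 1s

If the L(q) column (queue length) for da0 stays pegged while ms/r and
ms/w climb into the hundreds of milliseconds or worse, that would be
consistent with all the available buffers being queued behind slow IO
on that one device.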