Date: Tue, 18 Oct 2016 17:27:15 +0200
From: Arrigo Marchiori <ardovm@yahoo.it>
To: freebsd-fs@freebsd.org
Subject: Random truncated files on USB hard disk with timeouts; how to debug?
Message-ID: <20161018152715.GC89691@nuvolo>
Hello List,

I am encountering a strange problem that happens rarely and at random, and I don't know how to address it.

Short description: some files sometimes become ``sort of truncated'': ls(1) tells me their size is not zero, but cat(1), less(1) and vi(1) show they are empty.

The system is 11.0-STABLE amd64, r307550, with a GENERIC kernel. CPU: Intel Core 2 Duo. RAM: 2 GB. The root filesystem is mounted from a USB hard drive with an MBR partitioning scheme, formatted with UFS, SU+J enabled.

The USB hard drive occasionally times out for ~10 seconds, but I do not see any warning or error messages in dmesg suggesting that such timeouts could lead to broken files. In fact, dmesg(8) does not show anything at all about those timeouts unless I tweak the standard kernel verbosity options. If I set hw.usb.ehci.debug to 1, I see ehci_timeout indications. If I set the sysctl to any higher value, the console is flooded with messages.

The problem appears while the computer is under heavy load: building world or ports. When it appears, the compilations stop with funny error messages: the source files are empty! Running truss(1) on cat(1) shows that the read(2) system call returns 0 bytes.

I tried disabling journaling, but the problem still appears, apparently with the same frequency.

Once the problem appears, I can reboot the system normally. I see no errors during shutdown or during the next startup. The filesystem is considered clean, and no fsck is run (BTW, I disabled background fsck). The funny part is that after rebooting, the file contents are visible! I can resume the port compilation as if nothing had ever happened.

What can I do to get more information on this problem? Is there a well-known stress test I could run to trigger this problem more frequently?

I consider this a big problem, because the system logs give me no indication that anything is going wrong. If the HDD were broken, I would expect the kernel to yell about it loud and often.
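To make the symptom easier to catch while a build is running, a check along the following lines could be run periodically over the source tree. It is only a minimal sketch, assuming Python 3 is installed; the function name find_suspect_files is hypothetical, and the script is not a stress test in itself. It merely reports regular files whose reported size is non-zero but whose first read returns no data, which is the same condition seen with ls(1) and cat(1).

```python
import os
import stat
import sys


def find_suspect_files(root):
    """Yield (path, size) for regular files whose reported size is non-zero
    but whose first read returns no data (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
                # Skip symlinks, devices, and files that are really empty.
                if not stat.S_ISREG(st.st_mode) or st.st_size == 0:
                    continue
                # Unbuffered open, so read(1) maps to a single read(2) call.
                with open(path, "rb", buffering=0) as f:
                    empty_read = f.read(1) == b""
            except OSError as exc:
                # Report unreadable files and keep scanning.
                print(f"{path}: {exc}", file=sys.stderr)
                continue
            if empty_read:
                yield path, st.st_size


if __name__ == "__main__":
    # Directory to scan; defaults to the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in find_suspect_files(root):
        print(f"{path}: size {size} but read returned 0 bytes")
```

Saved under a name of your choice, the script takes the directory to scan as its only argument. Logging the time of each hit would allow comparing it against the ehci_timeout messages.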
Please Cc me on replies, as I am not subscribed to this list.

Thank you in advance!
-- 
rigo
http://rigo.altervista.org