Date:      Tue, 18 Oct 2016 17:27:15 +0200
From:      Arrigo Marchiori <ardovm@yahoo.it>
To:        freebsd-fs@freebsd.org
Subject:   Random truncated files on USB hard disk with timeouts; how to debug?
Message-ID:  <20161018152715.GC89691@nuvolo>

Hello List,

I am encountering a strange problem that happens rarely and at random,
and I don't know how to approach it.

Short description: some files sometimes become ``sort of truncated'':
ls(1) tells me their size is not zero, but cat(1), less(1) and vi(1)
show they are empty.

The system runs FreeBSD 11.0-STABLE amd64, r307550, with a GENERIC
kernel.  CPU: Intel Core 2 Duo. RAM: 2 GB.  The root filesystem is
mounted from a USB hard drive, with MBR partitioning scheme, formatted
with UFS, SU+J enabled.

The USB hard drive occasionally times out for ~10 seconds. But I do
not see any warning or error messages in dmesg suggesting that such
timeouts could lead to broken files. In fact, dmesg(8) shows nothing
at all about those timeouts unless I tweak the standard kernel
verbosity options.

If I set hw.usb.ehci.debug to 1, then I see ehci_timeout
indications. If I set the sysctl to any higher value, the console is
flooded with messages.

The problem appears while the computer is under heavy load: building
world or ports. When it appears, the compilations stop with funny
error messages: the source files are empty!

Running truss(1) on cat(1) shows that the read(2) system call
returns 0 bytes.
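
The symptom above (non-zero size from stat, but an immediate 0-byte
read) can be checked for mechanically. What follows is only a rough
sketch, assuming Python 3 is installed; the function name
find_phantom_empty_files is made up for illustration and is not part
of any FreeBSD tool. It walks a directory tree and lists regular files
whose reported size is non-zero but whose first read(2) returns
nothing.

```python
import os
import stat


def find_phantom_empty_files(root):
    """Return paths of regular files with st_size > 0 whose first read is empty."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
                # Skip symlinks, devices, and files that really are empty.
                if not stat.S_ISREG(st.st_mode) or st.st_size == 0:
                    continue
                fd = os.open(path, os.O_RDONLY)
                try:
                    # One byte is enough: a healthy non-empty file returns it.
                    data = os.read(fd, 1)
                finally:
                    os.close(fd)
            except OSError:
                # Unreadable or vanished files are a different problem; skip them.
                continue
            if not data:
                suspects.append(path)
    return suspects
```

Calling it on the tree being compiled, for instance
print(find_phantom_empty_files("/usr/src")), right after a build fails
would show how many files are affected at that moment.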

I tried to disable journaling, but the problem still appears,
apparently with the same frequency.

Once the problem appears, I can reboot the system normally. I see no
errors during either shutdown or the next startup. The filesystem is
considered clean, and no fsck is run (BTW I disabled background fsck).

The funny part is that after rebooting, the file contents are visible!
I can resume the port compilation as if nothing ever happened.

What can I do to get more information on this problem? Is there a
well-known stress test I could run to trigger this problem more
frequently?
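
To make the question concrete: this is not a well-known stress test,
only a rough sketch of the kind of write-then-verify loop meant here,
assuming Python 3. The function name write_and_verify and its default
parameters are arbitrary choices for illustration. It writes files
with random content, syncs them, then re-reads them repeatedly and
reports any file whose content no longer matches.

```python
import hashlib
import os


def write_and_verify(directory, count=100, size=64 * 1024, rounds=10):
    """Write `count` files of `size` bytes, then re-read them `rounds` times.

    Returns a list of (round, path, bytes_read) for every mismatching read.
    """
    expected = {}
    for i in range(count):
        path = os.path.join(directory, f"stress_{i:05d}.bin")
        payload = os.urandom(size)
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data towards the disk
        expected[path] = hashlib.sha256(payload).hexdigest()

    failures = []
    for round_no in range(rounds):
        for path, digest in expected.items():
            with open(path, "rb") as f:
                data = f.read()
            if hashlib.sha256(data).hexdigest() != digest:
                # len(data) == 0 would match the "empty file" symptom.
                failures.append((round_no, path, len(data)))
    return failures
```

Reads may well be served from the buffer cache rather than from the
disk, so whether such a loop reproduces the problem at all is an open
question; running it alongside a world or ports build would at least
add I/O load.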

I consider this a serious problem, because the system logs give no
indication that anything is going wrong. If the HDD were broken, I
would expect the kernel to complain loudly and often.

Please keep me in CC, as I am not subscribed to this list.

Thank you in advance!
-- 
rigo

http://rigo.altervista.org


