Date:      Tue, 7 Aug 2018 03:29:33 +0100
From:      Li-Wen Hsu <lwhsu@freebsd.org>
To:        Mark Millard <marklmi@yahoo.com>
Cc:        Ed Maste <emaste@freebsd.org>, Bryan Drewery <bdrewery@freebsd.org>,  FreeBSD Current <freebsd-current@freebsd.org>, Alexander Motin <mav@freebsd.org>
Subject:   Re: A head buildworld race visible in the ci.freebsd.org build history
Message-ID:  <CAKBkRUz06XcpjmhQ-KA2p8K40RHb=JYsQU1jS2E1BtyGLXb7Pg@mail.gmail.com>
In-Reply-To: <29F7FD25-147A-4B87-AC96-23CB3B1C38C7@yahoo.com>
References:  <74EAD684-0E0B-453A-B746-156777CF604A@yahoo.com> <1884103f-d1fb-aca6-2edd-062e11d05617@FreeBSD.org> <BCD47660-EE57-490C-90E8-83FC3B720B09@yahoo.com> <CAKBkRUxAfXi81yw93ejcJVpXQ0JetaACFtuS8tFprQvMeWx75A@mail.gmail.com> <33a43aac-231f-6158-1de4-f5dbfaf195df@FreeBSD.org> <CAPyFy2C47y-KDBg6MDjC_KxjcaqD2cs2CoyfDAMX8DkDsmH7EA@mail.gmail.com> <CAKBkRUzRL0NkhKs5-Aoee3upP1Qtfr7-ssAtmwDxtP74A2E3=w@mail.gmail.com> <EB806F90-7DC1-4617-93FE-078FF6FA7B72@yahoo.com> <CAKBkRUzxVxpQszhVstWa=7s16i7QzV30zcSFTVB1aJsQVZfG1w@mail.gmail.com> <29F7FD25-147A-4B87-AC96-23CB3B1C38C7@yahoo.com>

On Thu, Jun 21, 2018 at 10:49 PM Mark Millard <marklmi@yahoo.com> wrote:
> Has the range r328278 < PROBLEM_START <= r330304 been narrowed down
> some more?
>
> (I'm just curious where the problem started.)

After several rounds of binary search, I found that it might have
something to do with r329625.

The only connection I can see between this commit and the situation we
hit is that it touched the unmount code.  But I cannot confirm that it
is the cause.

The problem is a bit tricky to reproduce.  I will try to keep the
description concise.

We do builds for head in a jail (11.2-RELEASE) on a -CURRENT host.
The jail is on a dedicated zfs, and a daemon that does jail/zfs cleanup
runs outside of the jail.

In some edge cases, that cleanup daemon tries to destroy the zfs of
a jail in which a build is still running.  When that happens on an
earlier -CURRENT, the daemon just gets "cannot unmount
'/jenkins/jails/test-ranlib': Device busy" and nothing serious
happens.  Recently, although it still does not destroy the busy zfs,
it has started causing the build to error out with "ranlib: fatal:
Failed to open 'libXXX.a'"

To reproduce this, create a zfs, use it as the root of a jail, and
run this build script under /usr/src inside the jail:

https://gist.github.com/lwhsu/ae3b8b1f0c856837f93984ab2493f629#file-build-sh

Run this cleanup script on the host:

https://gist.github.com/lwhsu/ae3b8b1f0c856837f93984ab2493f629#file-clean-test-ranlib-sh
(you need to modify the zfs path)

I use powerpcspe as TARGET_ARCH here because one iteration takes less
time.  The problem should not be specific to any architecture.

I am not sure what the next step is.  Maybe modifying ranlib to log
more about what it gets when it reports "fatal: Failed to open
'libxxx.a'"?  Any good ideas about debugging this?
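
As a rough illustration of the kind of extra detail such logging could
capture, here is a minimal sketch in Python rather than a patch to
ranlib itself.  It assumes Python is available inside the jail; the
function name probe_open and the fields it reports are hypothetical and
not part of ranlib or of the scripts linked above.  It only records the
errno of a failed open and whether the parent directory is still
visible.

```python
import errno
import os


def probe_open(path):
    """Try to open path read-only and report why the open failed, if it did.

    Hypothetical helper: it returns the symbolic errno name, the system
    error message, and whether the parent directory is still visible.
    """
    parent = os.path.dirname(os.path.abspath(path))
    result = {
        "path": path,
        "ok": False,
        "errno": None,
        "message": None,
        # Checked before the open attempt, so it can go stale in a race.
        "parent_exists": os.path.isdir(parent),
    }
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as e:
        # Map the numeric errno to its symbolic name, e.g. ENOENT.
        result["errno"] = errno.errorcode.get(e.errno, str(e.errno))
        result["message"] = e.strerror
    else:
        os.close(fd)
        result["ok"] = True
    return result
```

Calling probe_open() on the archive path right after a ranlib failure
would show whether the file is missing (ENOENT) or fails for another
reason, and whether its directory has disappeared from view.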


Li-wen

--
Li-Wen Hsu <lwhsu@FreeBSD.org>
https://lwhsu.org
