Date: Fri, 30 Aug 2019 09:25:22 -0700
From: Enji Cooper <yaneurabeya@gmail.com>
To: Li-Wen Hsu <lwhsu@freebsd.org>
Cc: fcp@freebsd.org, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: FCP 20190401-ci_policy: CI policy
Message-ID: <339B7A20-F88D-4F60-B133-612189663272@gmail.com>
In-Reply-To: <CAKBkRUwKKPKwRvUs00ja0+G9vCBB1pKhv6zBS-F-hb=pqMzSxQ@mail.gmail.com>
References: <CAKBkRUwKKPKwRvUs00ja0+G9vCBB1pKhv6zBS-F-hb=pqMzSxQ@mail.gmail.com>
> On Aug 27, 2019, at 21:29, Li-Wen Hsu <lwhsu@freebsd.org> wrote:
>
> It seems I was doing wrong that just changed the content of this FCP
> to "feedback", but did not send to the right mailing lists.
>
> So I would like to make an announcement that the FCP
> 20190401-ci_policy "CI policy":
>
> https://github.com/freebsd/fcp/blob/master/fcp-20190401-ci_policy.md
>
> is officially in "feedback" state to hopefully receive more comments
> and suggestions, then we can move on for the next FCP state.

First off, thank you Li-Wen and Kristof for spearheading this proposal; it's a very contentious topic with a lot of strong emotions associated with it.

As the person who has integrated a number of tests and helped manage them for a few years (along with some of the care and feeding associated with them), I can say this task is non-trivial, in particular when issues I file in Bugzilla are not fixed quickly and linger in the tree for some time, impacting a lot of folks who rely on build and test suite stability.

The issue, as I see it from a CI/release perspective, is that the new policy attempts to define a notion of "stable", in terms of both tests and other code; right now, stability is defined on an honor-system basis, with the FreeBSD test suite acting as a litmus test of sorts to convey a sense of stability.

======

One thing I don't see in the proposal is the health of the "make tinderbox" target in a CI world (this is a gap in our current CI process).

Another thing I don't see in the proposal is the health of head vs. stable and how it relates to MFCs. I see a lot more issues on stable branches go unfixed for some time, in part because some fixes or enhancements haven't been MFCed. Part of the problem I see these days is a human/resource problem: if developers can't test their changes easily, they don't MFC them.

This issue has caused me to do a fair amount of triage in the past when backporting changes, in order to discover the missing puzzle pieces needed to make my tests and code work.

======

The big issues, as I see them based on the discussion that has taken place in the thread, are revert timing and etiquette, and dealing with unreliable tests.

First off, revert timing and etiquette: while I see the FCP as an initial framework, I am a bit concerned with the heavy-handedness of "what constitutes needing reversion": should this be done after N consistent failures in a certain period (be they build or test)? Furthermore, why is a human involved in making this decision (apart from, perhaps, a technical solution via automation not being available yet)?

Second off, unreliable tests:

* Unreliable tests need to be qualified not based on a single run, but on a pattern of runs.

The way this worked at Facebook is that, if a test failed, the harness would rerun it multiple times (10 in total, IIRC). If the test failed consistently on a build, it would be automatically disabled, and all committers in the relevant revision range would be nagged as part of disabling it. This generally works because Facebook components are siloed, but it is a much harder problem to solve for FreeBSD, because FreeBSD is a complete OS distribution and sometimes small, seemingly disconnected changes can cause a lot of grief.
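To make the rerun-and-quarantine idea concrete, here is a rough sketch of the shape of it. This is illustrative only: the kyua(1) invocation, the rerun budget, and the quarantine file are assumptions on my part, not how Facebook's tooling works internally and not an existing feature of our CI.

    #!/usr/bin/env python3
    """Rough sketch of a rerun-and-quarantine policy for unreliable tests.
    Illustrative only: the `kyua test` invocation, the rerun count, and the
    quarantine file are assumptions, not an existing FreeBSD CI feature."""
    import subprocess
    import sys

    RERUNS = 10                                  # assumed rerun budget
    QUARANTINE_FILE = "quarantined_tests.txt"    # hypothetical list the CI job reads

    def run_once(test_id: str) -> bool:
        """Run a single test case via kyua(1); True means it passed."""
        return subprocess.run(["kyua", "test", test_id]).returncode == 0

    def classify(test_id: str) -> str:
        if run_once(test_id):
            return "pass"
        # Initial failure: rerun to tell a flaky test from a consistently broken one.
        failures = sum(1 for _ in range(RERUNS) if not run_once(test_id))
        if failures == RERUNS:
            # Consistent failure: disable the test and (in a real system) nag the
            # committers in the suspect revision range.
            with open(QUARANTINE_FILE, "a") as f:
                f.write(test_id + "\n")
            return "quarantined"
        return "flaky"

    if __name__ == "__main__":
        print(classify(sys.argv[1]))

A real implementation would live in the CI job rather than a standalone script and would also have to handle reruns at the suite level, but the shape is the same: classify on a pattern of runs, not a single run.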
So what to do?

I suggest expanding the executors and running individual suites instead of the whole batch of tests. While it wouldn't fix everything and would be an expensive thing to do with our current test infrastructure, it would allow folks to better pinpoint issues and get some level of coverage, as opposed to throwing all of test execution out like the baby with the bathwater.

How do we get there?

- Expand the CI executor pool.
- Provide a tool or process with which we can define test suites.
- Make spinning up executors faster: with virtual machines this is typically done using big-iron infrastructure clusters (e.g., ESXi clusters) and something like thin provisioning, where one starts from a common image/snapshot instead of taking the hit of copying images around. Linux can do this with btrfs; we can do this with ZFS using per-VM datasets, snapshots, etc. (a rough sketch of this appears below).

While this only gets part of the way to a potential solution, it is a good way to begin solving the isolation/execution problem.

* A number of the tests in the tree have varying quality/reliability; I agree that system-level tests (of which the pf tests are one of many) are less reliable than unit/API functional tests. This is the nature of the beast with testing.

The core issue I see with the test suite as it stands is that it mixes integration/system-level tests (less deterministic) with functional/unit tests (generally more deterministic).

Using test mock frameworks would be a good technical solution for turning system tests into functional/unit tests (googlemock and unittest.mock are two of many good tools I know of in this area), but we need a way to run both kinds.
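As a small illustration of the unittest.mock approach (a toy example, not something taken from the tree; the interface_is_up() helper is hypothetical): instead of exercising a real network interface, the system-facing call is replaced with a mock, which turns a system-level check into a deterministic unit test.

    #!/usr/bin/env python3
    """Toy example of using unittest.mock to turn a system-level check into a
    deterministic unit test. The interface_is_up() helper is hypothetical."""
    import subprocess
    import unittest
    from unittest.mock import patch

    def interface_is_up(ifname: str) -> bool:
        """Hypothetical helper that shells out to ifconfig(8)."""
        out = subprocess.run(["ifconfig", ifname], capture_output=True, text=True)
        return out.returncode == 0 and "UP" in out.stdout

    class InterfaceTest(unittest.TestCase):
        @patch("subprocess.run")
        def test_interface_is_up(self, mock_run):
            # No real interface is touched; the subprocess call is mocked out,
            # so the result no longer depends on the host's configuration.
            mock_run.return_value = subprocess.CompletedProcess(
                args=["ifconfig", "em0"], returncode=0,
                stdout="em0: flags=8843<UP,BROADCAST,RUNNING> ...", stderr="")
            self.assertTrue(interface_is_up("em0"))

    if __name__ == "__main__":
        unittest.main()

googlemock gives you the same shape in C++; the point is that the nondeterministic system dependency gets replaced with something deterministic, so the test can run anywhere.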
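Going back to the provisioning bullet above: here is a minimal sketch of what ZFS-backed thin provisioning of executors could look like. The pool/dataset layout and the idea of cloning a golden snapshot per worker are assumptions on my part, not a description of the existing cluster.

    #!/usr/bin/env python3
    """Minimal sketch of thin-provisioning CI executors from a golden ZFS
    snapshot. Dataset names and layout are hypothetical."""
    import subprocess

    GOLDEN_SNAPSHOT = "zroot/ci/golden-image@base"   # assumed golden VM image

    def zfs(*args: str) -> None:
        subprocess.run(["zfs", *args], check=True)

    def provision_executor(worker_id: int) -> str:
        """Clone the golden snapshot into a per-VM dataset; clones share blocks
        with the snapshot, so no full image copy is needed."""
        clone = f"zroot/ci/executor{worker_id}"
        zfs("clone", GOLDEN_SNAPSHOT, clone)
        return clone

    def destroy_executor(worker_id: int) -> None:
        zfs("destroy", f"zroot/ci/executor{worker_id}")

    if __name__ == "__main__":
        for i in range(4):
            print("provisioned", provision_executor(i))

The same trick works for jail- or bhyve-backed executors; the point is that creating a worker becomes a near-instant metadata operation rather than an image copy.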
I can now see why labeling test types was a concern when I first started this work (des@ and phk@ aired this concern).

Part of the technical/procedural solution to allowing tests to be commingled is to go back and label the tests appropriately. I'll send out an FCP for this sometime in the next week or two.

======

Taking a step back, as others have brought up, we're currently hindered by tooling: we are applying a DVCS-based (git, hg) technique (CI) to Subversion, and testing changes after they've hit head instead of before they hit head.

While Phabricator can partially solve this by testing up front (we don't enforce this; I've made my concerns about it not being a requirement well known in the past), the solution is limited by bandwidth for testing, i.e., testing is an all-or-nothing exercise right now, and building multiple toolchains/architectures takes a considerable amount of time. We could leverage cloud/distributed solutions for this (Cirrus CI, or Travis if the integration existed), but this would require using GitHub, or teaching a tool how to make the appropriate REST API calls to run the tests and query their status (in progress, pass, fail, etc.).

Applying labels and filtering on test suites will get us partway to a final solution from a test perspective, but a lot of work needs to be done with Phabricator, etc.

We also need to make build failures on tier 1 architectures with GENERIC a commit-blocking event. Full stop.

======

While some of the thoughts I put down here aren't complete solutions, I have subproposals that should be done/things that could be worked on before implementing the proposed CI policy. Some of the things I brought up above I can't work on right now, but December break is coming up, and with it I'll have more time to work on projects like this. I'll put down some TODO items so I can look at tackling them during the break.

Thank you,
-Enji