Date: Fri, 17 Mar 2023 07:23:59 +0100
From: Matthias Pfaller <leo@marco.de>
To: stable@freebsd.org
Subject: Re: Fwd: Kernel DHCP unpredictable/fails (PXE boot), userspace DHCP works just fine
Message-ID: <0b95a502-eea0-46cc-5d0d-ec6e861ad51f@marco.de>
In-Reply-To: <CAM2hQG-oDRsoccg3S1LykyUF=joWbdJz=GSPOnUroDRxjZ2_iQ@mail.gmail.com>
References: <CAM2hQG-p=bfSh_nxuah9zcTBbz7HQ9pYyvOR2f6rC=CUGePKsg@mail.gmail.com> <CAM2hQG-oDRsoccg3S1LykyUF=joWbdJz=GSPOnUroDRxjZ2_iQ@mail.gmail.com>
On 2023-03-16 21:44, Attila Nagy wrote:
> Hi,
>
> As this is super annoying, I'm willing to pay a $500 bounty for solving this issue
> (whoever is first, though I don't anticipate a big competition :) Having an invoice
> would be best, but I'm willing to accept individuals as well).
> I can't give remote access, but can run debug builds with serial console. stable/13
> branch.
>
> I have a bunch of netbooted machines, one set in a cluster is older (HP DL80 G9,
> 2x8C, Intel I350 -igb- NICs), the other set is newer (HP XL225n G10, AMD EPYC2x16C,
> BCM57412 -bnxt- NICs).
> All of these boot from the network, which is basically:
> - get IP and options with DHCP with the help of the NIC's PXE stack
> - get the loader and kernel, start it
> - do another round of DHCP from the kernel (bootp_subr.c)
> - mount the root via NFS and let everything work as usual
>
> The problem is that the newer machines take an indefinite time to boot. The older
> ones (with igb NIC) work reliably, they always boot fast.
> The process of getting an IP address via DHCP (bootpc_call from bootp_subr.c) either
> succeeds normally (in a few seconds), or takes a lot of time.
> Common (measured) boot times range from tens of minutes to a few hours (1-6).
> Sometimes it just gets stuck and can't get past bootpc_call (getting the DHCP lease).
Do you have STP/RSTP enabled on the switch ports? If the link goes down during the
switch from firmware mode to kernel mode, the port will go back to blocking. If the
DHCP requests don't reach the DHCP server because of this, and the link goes down and
up again while retrying (I don't know whether this happens), you will hit the same
problem on the next try. As a simple test, you could put a dumb unmanaged switch
between your core network and the server.
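To make the timing argument concrete, here is a rough sketch of the idea in Python. It only models the port state, nothing in the kernel. The 15 s forward delay is the classic 802.1D default (listening plus learning gives about 30 s before a port forwards); the request send times and the flap times are made-up values for illustration, not what bootpc_call actually does, and the function name is hypothetical.

```python
# Rough model: which DHCP request (if any) gets through an STP port
# that restarts its listening/learning timers on every link-up.

STP_FORWARD_DELAY_S = 15.0                 # classic 802.1D default forward delay
STP_BLOCKED_S = 2 * STP_FORWARD_DELAY_S    # listening + learning before forwarding


def first_forwarded_attempt(send_times, link_flaps=(), blocked_s=STP_BLOCKED_S):
    """Return the send time of the first request the switch forwards, or None.

    send_times: seconds (after the first link-up at t=0) at which the client
                sends a DHCP request. Hypothetical schedule, supplied by caller.
    link_flaps: times at which the link drops and comes straight back up,
                which restarts the blocking period on the port.
    blocked_s:  how long the port discards frames after each link-up.
    """
    for t in sorted(send_times):
        # Most recent link-up at or before this request (t=0 is the first one).
        last_up = max([0.0] + [f for f in link_flaps if f <= t])
        if t - last_up >= blocked_s:
            return t
    return None


if __name__ == "__main__":
    attempts = [1, 5, 13, 29, 61]          # assumed retry times, in seconds
    print(first_forwarded_attempt(attempts))                    # no flaps
    print(first_forwarded_attempt(attempts, link_flaps=[60]))   # flap during retry
    print(first_forwarded_attempt(attempts, blocked_s=0.0))     # port forwards at once
```

With a flap shortly before each late retry, no request ever lands in a forwarding window, which would look like the "stuck" case. Setting blocked_s to 0 corresponds to a port that forwards immediately, as behind an unmanaged switch or on a port configured as an edge port.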
best regards, Matthias