Date:      Tue, 22 Jan 2002 17:54:30 -0800
From:      Peter Wemm <peter@wemm.org>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        Andrew Gallatin <gallatin@cs.duke.edu>, alpha@FreeBSD.ORG
Subject:   Re: Is anybody actually able to netboot at the moment? 
Message-ID:  <20020123015430.0DECD39F1@overcee.wemm.org>
In-Reply-To: <3C4DFC23.F5391D2D@mindspring.com> 

Terry Lambert wrote:
> Peter Wemm wrote:
> > > Actually, there's a bug in the one's complement case on the
> > > FreeBSD checksum calculation, sometimes.  I was able to see
> > > incorrect checksums on a number of packets.  I think it's in
> > > the incremental update code, but since it doesn't seem to
> > > stop things from working, I never tracked down the source of
> > > the ethereal traces where I saw this.
> > 
> > Terry, what crack are you smoking this time?  We don't do incremental
> > checksums in the libstand code.  That stuff is as simple and as unoptimized
> > as it gets.
> 
> The bug is on transmit, not on receive, Peter.  8-).  Working
> validation on the receive with packets with bad checksums would
> stop the load.

You said "I think it's in the incremental update code", which is what
I was responding to.  There is no "incremental update code".  And we do
not calculate ethernet frame CRCs either, that is done by the chip
and/or SRM itself.
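
For reference, the kind of simple, non-incremental one's-complement sum
under discussion can be sketched as follows.  This is an illustrative
Python sketch, not the libstand code; the name inet_checksum is made up
here, and the 20 header bytes are copied from the bootp request dump
further down (offsets 0x0e through 0x21).  Recomputing with the checksum
field zeroed should reproduce the b5 a6 seen in the dump.

```python
def inet_checksum(data: bytes) -> int:
    """Plain one's-complement Internet checksum; nothing incremental."""
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16) # fold carries back in
    return ~total & 0xFFFF

# IP header of the bootp request, copied from the hex dump in this message.
ip_header = bytes.fromhex(
    "45 00 01 48 00 00 00 00 04 11 b5 a6 00 00 00 00 ff ff ff ff"
)

# Recompute with the checksum field (header bytes 10-11) zeroed out.
zeroed = ip_header[:10] + b"\x00\x00" + ip_header[12:]
computed = inet_checksum(zeroed)
print(f"computed {computed:04x}, in the dump {ip_header[10:12].hex()}")
```

The two values agree, so the IP header checksum on the transmitted
packet looks fine.  The UDP checksum field in the same packet (offsets
0x28-0x29) is 00 00, which for UDP over IPv4 means no checksum was
computed at all.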

> To see if this is the problem, it would be wise to do a dump
> of a failed boot attempt with ethreal, which flags checksum
> errors on packets on the wire.

Sure, but there appear to be no packets on the wire (as I already said).
That is the problem. tcpdump in promiscuous mode is not seeing a damn
thing.  So either there is a frame CRC error (outside our jurisdiction)
which the switch is killing or SRM is not transmitting it.  The prom write
call is reporting that it succeeded though.

Here are the actual packet contents being sent.  The first one did not
make it to the wire:

bootpsend: d=20036d28 called.^M
bootpsend: calling sendudp^M
sendudp: d=20036d28 called.^M
saddr: 0.0.0.0:68 daddr: 255.255.255.255:67^M
sendudp: dest ethernet addr = ff:ff:ff:ff:ff:ff^M
sendether: called, len 328^M
0000: ff ff ff ff ff ff 00 00 f8 75 67 16 08 00 45 00^M
0010: 01 48 00 00 00 00 04 11 b5 a6 00 00 00 00 ff ff^M
0020: ff ff 00 44 00 43 01 34 00 00 01 01 06 00 00 00^M
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0040: 00 00 00 00 00 00 00 00 f8 75 67 16 00 00 00 00^M
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0110: 00 00 00 00 00 00 63 82 53 63 ff 00 00 00 00 00^M
0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0150: 00 00 00 00 00 00^M
prom0: netif_put^M
prom_write: len=0x156, pkt=0x20036336, hate@0x20035a10^M
ret.bits = 0x0000000000000156^M
ret.u.retval = 0x156^M
ret.u.unit = 0x0^M
ret.u.mbz = 0x0^M
ret.u.error = 0x0^M
ret.u.status = 0x0^M
prom0: netif_put returning 342^M
sendudp: sendether returned 328^M
readudp: called^M
readether: called, len 328, tleft 1^M
prom0: netif_get^M
ret.bits = 0x8000000000000000^M
ret.u.retval = 0x0^M
ret.u.unit = 0x0^M
ret.u.mbz = 0x0^M
ret.u.error = 0x0^M
ret.u.status = 0x4^M
cc = 0^M
prom0: netif_get returning 0^M


i.e. prom_write returned success, but we timed out waiting for a reply.
tcpdump on two other boxes confirms that the broadcast never made it out.
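
The ret.u.* fields in the log are just bitfields of ret.bits.  A small
decoding sketch, assuming the layout retval:32, unit:8, mbz:8, error:13,
status:3 from the low bit up (the layout is an assumption, although it
is consistent with every value printed above); decode_prom_return is a
made-up helper name:

```python
def decode_prom_return(bits: int) -> dict:
    """Split a 64-bit PROM return value into fields (assumed layout)."""
    return {
        "retval": bits & 0xFFFFFFFF,       # bits 0-31
        "unit":   (bits >> 32) & 0xFF,     # bits 32-39
        "mbz":    (bits >> 40) & 0xFF,     # bits 40-47
        "error":  (bits >> 48) & 0x1FFF,   # bits 48-60
        "status": (bits >> 61) & 0x7,      # bits 61-63
    }

# Values copied from the netif_put and netif_get log lines above.
write_ret = decode_prom_return(0x0000000000000156)
read_ret = decode_prom_return(0x8000000000000000)
print(write_ret)
print(read_ret)
```

Decoded this way, the write reports retval 0x156 (all 342 bytes) with
status 0, while the read that follows comes back with status 4 and a
retval of 0.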

This second write did work, and tcpdump on both other boxes confirms the
packet, and we happened to catch the reply.  Both sent packets are identical
except for the bootp bp_secs field (seconds counter).  Here is the second
one which worked:

bootpsend: d=20036d28 called.^M
bootpsend: calling sendudp^M
sendudp: d=20036d28 called.^M
saddr: 0.0.0.0:68 daddr: 255.255.255.255:67^M
sendudp: dest ethernet addr = ff:ff:ff:ff:ff:ff^M
sendether: called, len 328^M
0000: ff ff ff ff ff ff 00 00 f8 75 67 16 08 00 45 00^M
0010: 01 48 00 00 00 00 04 11 b5 a6 00 00 00 00 ff ff^M
0020: ff ff 00 44 00 43 01 34 00 00 01 01 06 00 00 00^M
0030: 00 00 00 4e 00 00 00 00 00 00 00 00 00 00 00 00^M
0040: 00 00 00 00 00 00 00 00 f8 75 67 16 00 00 00 00^M
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0110: 00 00 00 00 00 00 63 82 53 63 ff 00 00 00 00 00^M
0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0150: 00 00 00 00 00 00^M
prom0: netif_put^M
prom_write: len=0x156, pkt=0x20036336, hate@0x20035a10^M
ret.bits = 0x0000000000000156^M
ret.u.retval = 0x156^M
ret.u.unit = 0x0^M
ret.u.mbz = 0x0^M
ret.u.error = 0x0^M
ret.u.status = 0x0^M
prom0: netif_put returning 342^M
sendudp: sendether returned 328^M
readudp: called^M
readether: called, len 328, tleft 3^M
prom0: netif_get^M
ret.bits = 0x000000000000015a^M
ret.u.retval = 0x15a^M
ret.u.unit = 0x0^M
ret.u.mbz = 0x0^M
ret.u.error = 0x0^M
ret.u.status = 0x0^M
cc = 346^M
prom0: netif_get returning 346^M
readether: got len 346^M
0000: 00 00 f8 75 67 16 00 00 f8 75 92 0b 08 00 45 10^M
0010: 01 48 00 00 00 00 10 11 60 02 d8 88 cc 40 d8 88^M
0020: cc 41 00 43 00 44 01 34 9b f9 02 01 06 00 00 00^M
0030: 00 00 00 4e 00 00 00 00 00 00 d8 88 cc 41 d8 88^M
0040: cc 40 00 00 00 00 00 00 f8 75 67 16 00 00 00 00^M
0050: 00 00 00 00 00 00 68 30 68 30 20 6d 61 67 69 63^M
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0090: 00 00 00 00 00 00 6e 65 74 62 6f 6f 74 00 00 00^M
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00^M
0110: 00 00 00 00 00 00 63 82 53 63 01 04 ff ff ff 80^M
0120: 03 04 d8 88 cc 01 06 04 d8 88 cc 12 11 21 32 31^M
0130: 36 2e 31 33 36 2e 32 30 34 2e 36 34 3a 2f 61 2f^M
0140: 6e 66 73 72 6f 6f 74 73 2f 34 2e 64 69 72 34 ff^M
0150: 00 00 00 00 00 00^M
bootprecv: checked.  bp = 0x200364c0, n = 304^M
bootprecv: got one!^M
vend_rfc1048 bootp info. len=64^M
'native netmask' is 255.255.255.0^M
mask: 255.255.255.128^M
net_open: server addr: 216.136.204.64^M
net_open: server path: /a/nfsroots/4.dir4^M
sendrecv: called^M
SEND^M
sendudp: d=20036d28 called.^M
saddr: 216.136.204.65:1023 daddr: 216.136.204.64:111^M

....
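
The "identical except for bp_secs" claim can be checked mechanically
from the two dumps.  A sketch, assuming the usual layout (14-byte
ethernet header, 20-byte IP header, 8-byte UDP header, then bootp with
bp_secs at offset 8).  Only the 0030 rows are pasted, since all the
other rows of the two sent packets match; parse_row is a made-up helper:

```python
def parse_row(row):
    """Parse one 'OFFS: xx xx ...' dump row into (offset, bytes)."""
    offset, _, hexbytes = row.partition(":")
    return int(offset, 16), bytes.fromhex(hexbytes)

first = "0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
second = "0030: 00 00 00 4e 00 00 00 00 00 00 00 00 00 00 00 00"

base, a = parse_row(first)
_, b = parse_row(second)
# Absolute packet offsets at which the two rows differ.
diff_offsets = [base + i for i, (x, y) in enumerate(zip(a, b)) if x != y]

BOOTP_START = 14 + 20 + 8     # ether + IP + UDP headers
BP_SECS = BOOTP_START + 8     # op, htype, hlen, hops, xid(4), then secs(2)
secs = int.from_bytes(b[BP_SECS - base:BP_SECS - base + 2], "big")
print([hex(o) for o in diff_offsets], hex(BP_SECS), secs)
```

The only differing byte is at offset 0x33, which is the low byte of the
two-byte bp_secs field at 0x32, and the second request carries a seconds
count of 0x4e (78).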

> > I have experimented with alignment in the ethernet frame send code.. it
> > seems that we are trying to send with 2-byte alignment for the bootp case.
> > Fixing it doesn't seem to make much difference.  However, I wonder if SRM
> > is doing some length rounding or something because the lengths are not 4 or
> > 8 byte multiples for the bootp queries but are for the working rarp
> > queries.  However, even that doesn't make sense because it sometimes works.
> > I'm more suspicious of interactions between the tulip cards when being
> > driven by SRM and the switch at the moment.
> 
> OK, another shot in the dark.  The first 16 bit NE1000 cards
> had an interesting problem, in that, unless you sent an even
> number of bus transfer units, it would always do an even
> transfer anyway, and the last two bytes would be byte-swapped
> when you went to checksum them, and you'd sum some garbage
> byte instead of the right byte.
> 
> The fix for this was to always send an even number of bytes,
> even if the payload was an odd length, to get around the
> problem.

Well, we're sending even byte counts in this case.  (as I already said)
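
To put numbers on that, from the debug output above: sendether is called
with len 328, prom_write gets len=0x156 (328 plus the 14-byte ethernet
header), and the buffer sits at pkt=0x20036336.  A quick arithmetic
sketch using nothing beyond those logged values:

```python
ETHER_HDR_LEN = 14
payload_len = 328            # "sendether: called, len 328"
frame_len = 0x156            # "prom_write: len=0x156"
pkt_addr = 0x20036336        # "prom_write: ... pkt=0x20036336"

# 342 bytes are handed to the PROM: payload plus ethernet header.
assert frame_len == payload_len + ETHER_HDR_LEN

checks = {
    "frame length is even": frame_len % 2 == 0,
    "frame length % 4":     frame_len % 4,
    "frame length % 8":     frame_len % 8,
    "buffer address % 4":   pkt_addr % 4,
    "buffer address % 8":   pkt_addr % 8,
}
for name, value in checks.items():
    print(f"{name}: {value}")
```

So the length is even, but neither the length nor the buffer address is
a multiple of 4 or 8: both are 2 mod 4, which matches the 2-byte
alignment mentioned earlier.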

> Maybe this is a byte-order problem?

Doesn't explain why it is not making it to the wire.  And it doesn't
explain why it works *sometimes*.  (about 1 in 50, as I already said).

> If it is, the place to fix it is on the server (again), by
> making it pad packets out to a 2 (or 4 or 8?) byte boundary
> so that the received packets are transferred as a unit, but
> only the payload portion is checked.
> 
> This "fix" would only apply if the packets sent on the wire
> were good in both directions (i.e. it's still time for the
> ethereal trace by an otherwise uninvolved third party machine).
> 
> Hope this helps... I'm waving my hands as fast as I can... ;^)

I'd love to see an explanation for why prom_write() doesn't seem to
work for bootp requests.

> -- Terry
> 

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

