Date:      Thu, 28 Sep 2000 14:25:44 -0700 (PDT)
From:      Guy Harris <guy@netapp.com>
To:        Danny Braniss <danny@cs.huji.ac.il>
Cc:        Guy Harris <gharris@flashcom.net>, guy@netapp.com, freebsd-hackers@freebsd.org
Subject:   Re: nfs v2
Message-ID:  <200009282125.OAA25500@tooting.eng.netapp.com>
In-Reply-To: <E13eikP-00016O-00@sexta.cs.huji.ac.il> from Danny Braniss at "Sep 28, 2000 09:49:33 pm"

> }	1) NFS V2 having, as I remember, insufficient bits in the
> }	   major/minor device value used when creating special files to
> }	   support more than 8 bits of major and 8 bits of minor device;
> if i remember correctly, i copied the / over to the NetApp via nfsv3,
> with either tar or dump, and all is ok. it's when it gets mounted v2 (which the
> diskless boot does) that dev is wrong.

Originally:

	UNIX systems had 8-bit major and 8-bit minor devices;

	NFS V2 had no mechanism for creating special files.

Then Sun needed V2 "mknod" support for NFS-only diskless operation,
so they added a hack to V2 wherein a V2 CREATE operation in which the
"mode" field of the "attributes" member of the arguments had the upper 4
bits set was treated as an attempt to create a file other than a plain
file, and those bits contained a standard UNIX file type, e.g. 0020000
for a character special file; the "size" field of the attributes was to
be interpreted as the major/minor device.

Later, for SV-style named pipe support, they added an additional hack
wherein a CREATE of a character special file with a size of 0xffffffff
was treated as an attempt to create a FIFO special file.  (That was
sufficiently long ago that I forget why passing a file type of 010000,
i.e. IFIFO, wasn't the way it was done.)
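
The two V2 CREATE hacks described above can be sketched as a small decision function.  This is only an illustration in Python, not Sun's code: the function name and the returned labels are invented here, and treating every other mode as a plain file is an assumption.  The file type constants are the standard UNIX ones.

```python
# Standard UNIX file type bits (octal), as used in the "mode" field.
S_IFMT = 0o170000   # mask for the file type bits (the upper 4 bits)
S_IFCHR = 0o020000  # character special
S_IFBLK = 0o060000  # block special

FIFO_MAGIC_SIZE = 0xFFFFFFFF  # size value that signals "make a FIFO"


def classify_v2_create(mode, size):
    """Return (kind, rdev) for a V2 CREATE request (hypothetical helper).

    kind is one of "fifo", "chr", "blk", "regular"; rdev is the raw
    size field for device files and None otherwise.
    """
    ftype = mode & S_IFMT
    if ftype == S_IFCHR and size == FIFO_MAGIC_SIZE:
        # The named-pipe hack: character special + size 0xffffffff.
        return ("fifo", None)
    if ftype == S_IFCHR:
        return ("chr", size)
    if ftype == S_IFBLK:
        return ("blk", size)
    # Assumption: anything else is handled as an ordinary file create.
    return ("regular", None)


print(classify_v2_create(0o020666, 0x0202))
print(classify_v2_create(0o020666, 0xFFFFFFFF))
```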

Later, SVR4 extended the major/minor device to 32 bits, with 14 bits of
major device and 18 bits of minor device.  To handle this over NFS V2,
what SunOS 5.5.1's NFS server code, at least, appears to do is:

	1) store the major and minor device as 14-bit and 18-bit fields
	   in a 32-bit word;

	2) in a V2 CREATE request that attempts to create a character or
	   block special file, check whether any of the upper 16 bits of
	   the 32-bit size field are 1 and:

		if not, treat the size field as an 8-bit major device
		and an 8-bit minor device, and store the upper 8 bits as
		the upper 14 bits of the resulting file's "rdev" and
		store the lower 8 bits as the lower 18 bits of the
		resulting file's "rdev";

		if so, treat the size field as a 14-bit major device and
		an 18-bit minor device, and store the field as the
		resulting file's "rdev";

	3) when constructing V2 attributes of a file, if the major and
	   minor device will both fit in 8 bits, shift the major left by
	   8 and OR in the minor and stuff the result into the "rdev"
	   field, otherwise shift the major left by 18 and OR in the
	   minor and stuff the result into the "rdev" field.

This was, presumably, done to allow both SunOS 4.x (8-bit major, 8-bit
minor) and SunOS 5.x (14-bit major, 18-bit minor) systems to work
together.
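
As a rough sketch of the V2 behaviour just described for the SunOS 5.5.1 server (14-bit major, 18-bit minor stored in a 32-bit word), the two directions might look like this in Python.  The function names are made up for illustration; only the bit widths and the "upper 16 bits" test come from the description above.

```python
MINOR_BITS = 18                       # 14-bit major, 18-bit minor
MINOR_MASK = (1 << MINOR_BITS) - 1


def rdev_from_v2_size(size):
    """Turn the size field of a V2 CREATE into a native 14/18 rdev."""
    size &= 0xFFFFFFFF
    if (size >> 16) == 0:
        # Looks like an old 8/8 device number (e.g. a SunOS 4.x client).
        major, minor = (size >> 8) & 0xFF, size & 0xFF
    else:
        # Already in 14/18 form.
        major, minor = size >> MINOR_BITS, size & MINOR_MASK
    return (major << MINOR_BITS) | minor


def v2_rdev_from_native(rdev):
    """Build the "rdev" value for V2 attributes from a native rdev."""
    major, minor = rdev >> MINOR_BITS, rdev & MINOR_MASK
    if major <= 0xFF and minor <= 0xFF:
        return (major << 8) | minor       # compact 8/8 form
    return (major << MINOR_BITS) | minor  # full 14/18 form


print(hex(rdev_from_v2_size(0x0202)))
print(hex(v2_rdev_from_native(0x00080002)))
```

Note that the encoding is ambiguous by design: a value whose upper 16 bits are zero is always read as 8/8, whichever kind of client sent it.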

Then NFS V3 came along; in V3, there's a MKNOD operation, and it
supplies "specdata1" and "specdata2" for character and block special
files, which are, on UNIX systems, interpreted as major and minor
devices, respectively.

For V3, what SunOS 5.5.1 appears to do is:

	1) in a V3 MKNOD request that attempts to create a character or
	   block special file, combine the "specdata1" and "specdata2"
	   fields as if they were a 14-bit major and 18-bit minor;

	2) when constructing V3 attributes for the file, stuff the major
	   into "specdata1" and the minor into "specdata2".

What FreeBSD 3.4's client code, at least, does on a "mknod" is:

	for V2, do a CREATE, pass the appropriate mode bits, and pass
	the "rdev" value as the size;

	for V3, do a MKNOD, and pass the major and minor as "specdata1"
	and "specdata2".

The FreeBSD 32-bit major/minor value is 8 bits of major and 24 bits of
minor, which isn't the same as SVR4's 14/18.

On a "getattr", what FreeBSD 3.4's client code does is:

	for V2, treat the "rdev" value as a dev_t;

	for V3, treat "specdata1" as an 8-bit major and "specdata2" as a
	24-bit minor, and combine them with "makedev" into a dev_t.
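
To make the client side concrete, here is a small Python sketch of the "getattr" handling just described.  It follows the 8-bit major / 24-bit minor layout as stated in this message; the helper names are invented, and this is not a transcription of FreeBSD's actual "makedev" macro.

```python
def makedev_8_24(major, minor):
    """Pack an 8-bit major and 24-bit minor, as described in the text."""
    return ((major & 0xFF) << 24) | (minor & 0xFFFFFF)


def split_8_24(dev):
    """Split a 32-bit device value into (major, minor) using 8/24."""
    return (dev >> 24) & 0xFF, dev & 0xFFFFFF


def client_dev_from_v2(rdev):
    # V2: the "rdev" value from the wire is used directly as the dev_t.
    return rdev & 0xFFFFFFFF


def client_dev_from_v3(specdata1, specdata2):
    # V3: specdata1 is the major, specdata2 the minor.
    return makedev_8_24(specdata1, specdata2)


print(hex(client_dev_from_v3(2, 2)))
# A 14/18-style value arriving over V2 is read with the wrong split:
print(split_8_24(client_dev_from_v2(0x00080002)))
```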

NetApp filers originally just, as I remember, stuffed the size field
into the 32-bit "rdev" field of our inode on a CREATE operation, and
returned it in the "rdev" field of an "fattr" structure on a GETATTR
operation.

When we added V3 support, on a V2 CREATE we interpreted the "size" field
as containing an 8-bit major and an 8-bit minor, and passed those on to
the file system as the "specdata1" and "specdata2" values, and passed
"specdata1" and "specdata2" from a V3 MKNOD on in the same fashion; the
file system then treated them both as 8-bit values, and stuffed them
into the "rdev" field of the inode.  (At the time, Solaris didn't
*support* NFS V3.) On a GETATTR operation, the file system split the
"rdev" field into 8-bit "specdata1" and "specdata2" fields, and then:

	for V2, combined them into an 8-bit+8-bit rdev field in the NFS
	reply;

	for V3, returned them as "specdata1" and "specdata2" in the NFS
	reply.

(NOTE: the rdev field occupies the same space as the top-level file
block pointers; we don't waste 32 bits of the inode for files that
aren't character or block special files - we don't have 32 bits to
waste, as we have to stuff various DOS/Windows gunk in there as well,
for Windows CIFS clients.)

Later, when we had to support diskless Solaris clients using V3:

	for a V2 create, we just passed the size field on to the file
	system unchanged as what amounts to a "rdev" value;

	for a V3 create, we assumed that the client was a 14/18 system,
	stuffed "specdata1" into the upper 14 bits and "specdata2" into
	the lower 18 bits, and passed that on to the file system as what
	amounts to an "rdev" value;

	we stuffed the "rdev" value into the inode's "rdev" field;

	for V2 GETATTR, we returned the "rdev" field as the "rdev"
	field;

	for V3 GETATTR, we checked whether the "rdev" field had any of
	the upper 16 bits set, and:

		if so, we split it into a 14-bit major and an 18-bit
		minor device, and returned those as "specdata1" and
		"specdata2";

		if not (meaning that either the inode was created by a
		version of our software that didn't have support for
		14/18 device values, or had a major device of 0), we
		split it into an 8-bit major and an 8-bit minor device,
		and returned those as "specdata1" and "specdata2".
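
The later filer behaviour listed above can be summarized in a few lines of Python.  This is a sketch of the description only, with hypothetical function names; it is not the filer's code.

```python
def rdev_from_v2_create(size):
    # V2 CREATE: the size field is stored unchanged as the rdev.
    return size & 0xFFFFFFFF


def rdev_from_v3_mknod(specdata1, specdata2):
    # V3: assume a 14/18 client.
    return ((specdata1 & 0x3FFF) << 18) | (specdata2 & 0x3FFFF)


def v2_getattr_rdev(rdev):
    # V2 GETATTR: return the stored rdev as is.
    return rdev


def v3_getattr_specdata(rdev):
    """Return (specdata1, specdata2) for a V3 GETATTR reply."""
    if rdev >> 16:
        # Some upper bit is set: treat as 14/18.
        return rdev >> 18, rdev & 0x3FFFF
    # Otherwise treat as an old 8/8 value.
    return (rdev >> 8) & 0xFF, rdev & 0xFF


print(v3_getattr_specdata(rdev_from_v3_mknod(2, 2)))
print(v3_getattr_specdata(rdev_from_v2_create(0x02000002)))
```

The second call shows the mismatch discussed in this thread: a value created over V2 by an 8/24 client comes back over V3 split as 14/18.  The 8/8 fallback also misreads a 14/18 device whose major is 0 and whose minor is larger than 255.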

A V2 CREATE from FreeBSD of device (2, 2) looks as if it'd pass

	(2 << 24) | 2

over the wire, i.e. 0x02000002.  We'd stuff 0x02000002 into the inode,
and should, in an NFS V2 reply, return it as 0x02000002.  It looks as if
FreeBSD would handle that.

A V3 MKNOD from FreeBSD of device (2, 2) would pass 2 over the wire as
"specdata1" and 2 over the wire as "specdata2"; what the server does
with that depends on the server:

	Solaris would, I suspect, turn that into 14 bits of 2 and 18
	bits of 2, i.e. 0x00080002;

	NetApp filers would do the same;

	an OS with 12-bit majors and 20-bit minors would turn it into
	0x00200002;

	an OS such as FreeBSD with 8-bit majors and 24-bit minors would
	turn it into 0x02000002.

A V2 GETATTR would then get back whichever of those values the server's
OS stored, which the FreeBSD client would not interpret correctly unless
the server had 8-bit majors and 24-bit minors and thus sent 0x02000002.
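
The arithmetic behind those values is easy to check.  The following Python snippet packs device (2, 2) with each of the three layouts mentioned above and then reads the result back with the 8/24 split; the "pack" and "unpack" helpers are just illustrative names.

```python
def pack(major, minor, minor_bits):
    """Pack major/minor with the given number of minor bits."""
    return (major << minor_bits) | minor


def unpack(dev, minor_bits):
    """Split a device value using the given number of minor bits."""
    return dev >> minor_bits, dev & ((1 << minor_bits) - 1)


for name, minor_bits in [("14/18", 18), ("12/20", 20), ("8/24", 24)]:
    stored = pack(2, 2, minor_bits)
    # What an 8/24 client sees if this value comes back in a V2 reply.
    seen = unpack(stored, 24)
    print(name, "0x%08x" % stored, seen)
```

Only the 8/24 row reads back as (2, 2); the other two come back with a major of 0 and a large minor.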

> }	2) some OSes - Solaris was the one with which we were having
> }	   problems, as I remember - requiring those extra bits.
> i tried solaris 2.6 and it's ok.

If the major device is non-zero, the size field won't fit in 16 bits, so
the SunOS 5.5.1 server will probably misinterpret the size field of a V2
create as being 14/18 rather than 8/24.

(If the major device *is* zero, then, at least with a SunOS 5.5 client
and SunOS 5.5.1 server, the command

	mknod foobar c 0 8192

when using V2 created a file with a major device of 32 and a minor
device of 0; the upper 16 bits of the size were zero, so the 5.5.1
server assumed that the request was probably coming from a 4.x client.)
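
That result can be reproduced with a few lines of Python, assuming the client packs the device as 14/18 and the server applies the "upper 16 bits" test described earlier.  The function names are hypothetical.

```python
def sunos5_client_v2_size(major, minor):
    # 14-bit major, 18-bit minor, sent in the V2 size field.
    return (major << 18) | minor


def server_guess_major_minor(size):
    if (size >> 16) == 0:
        # Upper 16 bits clear: assume an 8/8 value from a 4.x client.
        return (size >> 8) & 0xFF, size & 0xFF
    return size >> 18, size & 0x3FFFF


size = sunos5_client_v2_size(0, 8192)
print(hex(size), server_guess_major_minor(size))  # prints: 0x2000 (32, 0)
```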

Perhaps later versions of SunOS 5.x don't do this; are you saying that
you tried a Solaris 2.6 server? If so, what happens if you do an "ls
-l", *on the Solaris server*, of the FreeBSD client's "/dev/null" file? 
Does it report "2, 2", or does it report something else?

> }NFS V3 is probably a better idea, if you can use it; we (NetApp) have
> }supported it for many years, and I suspect most if not all other vendors
> }of NFS servers do so as well.
> }
> and it's the preferred mount here too, the problem is the FreeBSD nfs_root/boot
> that is booting using V2. i'm trying to see how to get the boot to do its magic
> via V3, but that does not fix the problem :-)

To which problem are you referring?

I don't think there *is* a solution to the "create special files using
V3, get their attributes using V2" problem other than "only run servers
whose OSes use the same major/minor bitfield sizes as your client".

One solution to the "special files don't work" problem is "create all
the special files using the same version of NFS as will be used to get
their attributes", in which case, if you're going to be creating the
special files with V3, getting the OS to mount the root file system
using V3 *would* fix the problem.

> 
> }Also, could you get a network trace of:
> }
> }	the creation of the "/dev/null" entry, if it was done over NFS;
> }
> }	attempts by the FreeBSD box to get the attributes of "/dev/null"
> }	via NFS (e.g., an "ls -l /mnt/tmp/null", from your example);
> }
> }and send them to me?
> if you mean a tcpdump, that will have to wait till the morning (my morning :-)

Yes, I mean tcpdumps - if you use tcpdump, use "-s 65535", so that
tcpdump's annoying default teeny tiny snapshot length of 68 doesn't end
up cutting off a lot of the interesting parts of the NFS requests and
replies.  Also, send me the raw tcpdump captures (i.e., capture with
"-w" to a savefile), rather than tcpdump's printed interpretation
thereof - I may want to run them through Ethereal or convert them to
snoop format and run them through snoop.

Do captures for all the servers on which you've tried this, both of the
creation of the special files and the attempts to get the attributes.

> PS: out of curiosity, what os is NetApp based on?

The core kernel is our own; it's a message-passing, non-preemptive,
kernel-mode-only, single-address-space, no-demand-paging kernel.

The networking stack is 4.4-Lite-derived, with some bits of code from
various later BSDs added in, although it's drifted a fair bit from the
BSD base (i.e., it's not a no-brainer to move stuff into it from BSD
stacks or from it to BSD stacks at this point).  A number of the
commands are 4.4-Lite-derived as well, although we had to assault them
with a chainsaw to get them to run in our single-address-space
environment.

The NFS server code takes some stuff from 4.4-Lite, but we changed that
code a lot.

The file system is our own, as is the CIFS server code (no Samba
involved), the disk/SCSI/Fibre Channel subsystem, and RAID.  Some of the
platform support code on x86 originally came from BSD, and the Alpha
divide/remainder routine came from NetBSD, but, at this point, most of
the platform code is our own, even on x86.

I.e., it's based mostly on our code, with a bunch of BSD stuff, mainly
in the networking area, but even that stuff's often been changed a fair
bit.  It's not running a standard general-purpose OS with an appliance
wrapper.

