Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 13 Feb 2011 15:14:11 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r218604 - head/sbin/fsck_ffs
Message-ID:  <20110213131411.GI78089@deviant.kiev.zoral.com.ua>
In-Reply-To: <20110213215436.L1544@besplex.bde.org>
References:  <201102121317.p1CDHE9v002742@svn.freebsd.org> <20110213215436.L1544@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help

--Ve0F8fDhlCYJZdTN
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Feb 13, 2011 at 11:30:56PM +1100, Bruce Evans wrote:
> On Sat, 12 Feb 2011, Konstantin Belousov wrote:
>=20
> >Log:
> > In checker, read journal by sectors.
> >
> > Due to UFS insistence to pretend that device sector size is 512 bytes,
>=20
> Do you mean that ffs_fsck does this?  UFS doesn't exist, and ffs in the
> kernel didn't use device sectors at all until recently (it used DEV_BSIZE
> units, but those are not sectors but are the units for bread() and friend=
s).
Yes, the journal writer started using DEV_BSIZE as _sector_ size to
write the journal blocks, and fixing this was the point of the series
of commits. Journal was unoperable on the providers with non-512 byte
sectors.

> newfs uses a hack to do this.  When handling kernel sizes in units of
> DEV_BSIZE (not a sector size!), ffs_fsck should use DEV_BSIZE (or maybe
> just fsbtodb()) instead of `dev_bsize' or `sectorsize' and not
> pretend that DEV_BSIZE (or the corresponding superblock value for the
> conversion macro) is the sector size.
>=20
> > sector size is obtained from ioctl(DIOCGSECTORSIZE) for real devices,
> > and from the label otherwise. The file images without label have to
> > be made with 512 sector size.
>=20
> >Modified: head/sbin/fsck_ffs/fsck.h
> >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
> >--- head/sbin/fsck_ffs/fsck.h	Sat Feb 12 13:12:45 2011	(r218603)
> >+++ head/sbin/fsck_ffs/fsck.h	Sat Feb 12 13:17:14 2011	(r218604)
> >@@ -268,6 +268,7 @@ char	snapname[BUFSIZ];	/* when doing sna
> >char	*cdevname;		/* name of device being checked */
> >long	dev_bsize;		/* computed value of DEV_BSIZE */
> >long	secsize;		/* actual disk sector size */
> >+long	real_dev_bsize;
>=20
> If `secsize' is the actual disk sector size, then we don't need another
> variable giving the real sector size.
>=20
> The new variable has no comment.  What is the difference between a sector
> size and a dev_bsize?  There is enough confusion between DEV_BSIZE and
> sector sizes already.  DEV_BSIZE has come to means nothing to do with
> sectors and little to do with devices or block sizes.  It is just a minim=
al
> i/o sizes for use in old interfaces (ones that didn't want to use more th=
an
> 32 bits for disk addresses, but now use 64 bits anyway, so that they can
> address 72 bits after multiplication by DEV_BSIZE).
There is no "actual" difference (I tired of the word actual when debugging
the patches), they are all DEV_BSIZE.

>=20
> These variables are named and documented slightly better in mkfs:
>=20
> % extern int	sectorsize;	/* bytes/sector */
> % extern int	realsectorsize;	/* bytes/sector in hardware*/
>=20
> At lease the fake sector size variable doesn't claim to be actual, and
> the real sector size variable doesn't have extra underscores and claims
> to be a sector size variable.
>=20
> >Modified: head/sbin/fsck_ffs/setup.c
> >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
> >--- head/sbin/fsck_ffs/setup.c	Sat Feb 12 13:12:45 2011 (r218603)
> >+++ head/sbin/fsck_ffs/setup.c	Sat Feb 12 13:17:14 2011 (r218604)
> >@@ -446,7 +446,7 @@ sblock_init(void)
> >	if (sblk.b_un.b_buf =3D=3D NULL || asblk.b_un.b_buf =3D=3D NULL)
> >		errx(EEXIT, "cannot allocate space for superblock");
> >	if ((lp =3D getdisklabel(NULL, fsreadfd)))
> >-		dev_bsize =3D secsize =3D lp->d_secsize;
> >+		real_dev_bsize =3D dev_bsize =3D secsize =3D lp->d_secsize;
> >	else
> >		dev_bsize =3D secsize =3D DEV_BSIZE;
> >}
>=20
> Both the variables are real enough here, provided getdisklabel() works.
> Otherwise, you are in trouble and should fail instead of defaulting the
> size unless the size is set by another means.  newfs seems to fail, and
> I'm sure newfs_msdos does fail if the sector size is unknown.
>=20
> >Modified: head/sbin/fsck_ffs/suj.c
> >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
> >--- head/sbin/fsck_ffs/suj.c	Sat Feb 12 13:12:45 2011	(r218603)
> >+++ head/sbin/fsck_ffs/suj.c	Sat Feb 12 13:17:14 2011	(r218604)
> >@@ -28,6 +28,7 @@
> >__FBSDID("$FreeBSD$");
> >
> >#include <sys/param.h>
> >+#include <sys/disk.h>
> >#include <sys/disklabel.h>
> >#include <sys/mount.h>
> >#include <sys/stat.h>
> >@@ -201,6 +202,11 @@ opendisk(const char *devnam)
> >		    disk->d_error);
> >	}
> >	fs =3D &disk->d_fs;
> >+	if (real_dev_bsize =3D=3D 0 && ioctl(disk->d_fd, DIOCGSECTORSIZE,
> >+	    &real_dev_bsize) =3D=3D -1)
> >+		real_dev_bsize =3D secsize;
> >+	if (debug)
> >+		printf("dev_bsize %ld\n", real_dev_bsize);
> >}
> >
> >/*
>=20
> If this gives a different "real_dev_bsize", and this is different from the
> sector size in the label or defaulted-to, then the latter is presumably
> wrong, and we're left with a `secsize' that is not the "actual sector siz=
e".
Yes, secsize is not the "actual sector size", it is equal to DEV_BSIZE,
see setup.c:sblock_init().

>=20
> It is a bug to replace the sector size in the label with the one obtained
> here.  The former should be correct, and it is the only one under user
> control.  It might need to be changed if one given by the ioctl is wrong
> or just if it is fake.  (Labels only have 32-bit disk addresses, so fake
> sector sizes are needed to address more than 41 bits if the physical sect=
or
> size has is 512.)
>=20
> newfs doesn't have the  precedence that I want:
> - sectorsize set on the command line (-S option) has precedence.  [Bug;
>   -S 0 is needed to cancel any previous setting of sectorsize on the
>   command line by -S or maybe by -T, but it is treated as an error.]
>   fsck_ffs unfortunately doesn't seem to have any -S or -T option.  These
>   used to be unneeded since sector size weren't used, and labels worked
>   and contain enough info to locate the superblock where there is more
>   info.
> - then try setting sectorsize using the ioctl if sectorsize is not already
>   set.  [Bug: no error checking for this ioctl.  If it fails, then it
>   may clobber sectorsize, which breaks the next step.]
> - then set sectorsize from the label if there is a label and sectorsize
>   is not already set
> - fail if sectorsize is not set
newfs.c does not do what you describe, in fact. After going to all
troubles finding the sector size, it performs the following:

	if (sectorsize !=3D DEV_BSIZE) {		/* XXX */
		int secperblk =3D sectorsize / DEV_BSIZE;

		sectorsize =3D DEV_BSIZE;
		fssize *=3D secperblk;
		if (pp !=3D NULL)
			pp->p_size *=3D secperblk;
	}

And now layout calculations happen as if sector size is 512 bytes. I was
puzzled for long time why libufs givess me 512 in d_bsize for a volume over
4096-bytes per sector md:
	disk->d_bsize =3D fs->fs_fsize / fsbtodb(fs, 1);

>=20
> Then the hacks involving realsectorsize:
> - set it to sectorsize
> - if sectorsize !=3D DEV_BSIZE, change it to DEV_BSIZE and fix up related
>   quantities (I hope the change to the partition table is never written).
>   Now `sectorsize' is no longer "actual", though it was actual before the
>   hacks.  We continue with the actual sector size recorded in=20
>   realsectorsize.
realsectorsize is not used. You are citing the piece of code I pasted above.

>=20
> >@@ -2262,7 +2268,7 @@ suj_build(void)
> >		rec =3D (union jrec *)seg->ss_blk;
> >		for (i =3D 0; i < seg->ss_rec.jsr_cnt; off +=3D JREC_SIZE,=20
> >		rec++) {
> >			/* skip the segrec. */
> >-			if ((off % DEV_BSIZE) =3D=3D 0)
> >+			if ((off % real_dev_bsize) =3D=3D 0)
> >				continue;
> >			switch (rec->rec_jrefrec.jr_op) {
> >			case JOP_ADDREF:
> >...
>=20
> This is all sort of backwards.  DEV_BSIZE is the real DEV_BSIZE.  Old code
> that uses units of DEV_BSIZE spells this as dev_bsize although it is
> constant.  New code that uses units of sectors spelled this as DEV_BSIZE
> although the correct size is variable; now the new code spelled this as
> real_dev_bsize, which is variable but unrelated to dev_bsize =3D DEV_BSIZ=
E=20
> :-).
We need to write the journal block atomically, and device/geom level only
provides that:
- we must write at least sector;
- any write bigger then sector is not atomic.

>=20
> I think using fragments instead of sectors would work better.  For a star=
t,
> you can get the fragment size from the superblock and not have to worry
> about labels of ioctls.
No, it would not, since fragment write can be non-atomic.

>=20
> There is a problem locating the superblock.  The kernel and various
> utilities only try reading the superblock with size SBSIZE (8K).  This
> fails if the sector size is not a divisor of 8K.  They also only try
> reading at various offsets which might not be a multiple of the sector
> size, but which are larger than 8K so they are more likely to be a
> multiple.  I fixed this problem in fsck_msdosfs where it is larger
> (fsck_msdosfs assumes that the sector size is a divisor of 512, but
> sector sizes like 2K and 4K are now common relative to sizes of 16K).
>=20
> Hmm, the initialization of dev_bsize is even more obfuscated than I
> remembered:
>=20
> % 	/*
> % 	 * Compute block size that the file system is based on,
> % 	 * according to fsbtodb, and adjust superblock block number
> % 	 * so we can tell if this is an alternate later.
> % 	 */
> % 	super *=3D dev_bsize;
>=20
> We may have needed a fake sector size to read the superblock.  I forget if
> we got a non-fake one from the label or ioctl before here.  We should have
> just tried all reasonable power of 2 sizes between 1 and 1M (I use
> 64K down to DOSBOOTBLOCKSIZE =3D 512 in fsck_msdosfs).
>=20
> % 	dev_bsize =3D sblock.fs_fsize / fsbtodb(&sblock, 1);
>=20
> This is an obfuscated way of spelling DEV_BSIZE.  Although fsbtodb()
> converts to blocks with fixed size DEV_BSIZE, it needs a variable shift
> count to handle the variable fragment size.  This shift count is
> fs_fsbtodb, which was initialized in newfs to
> ilog2(fs_fsize / sectorsize).  But sectorsize was fake there
> (always DEV_BSIZE).  Suppose for example than fs_fsize is the default
> (2K).  Then fs_fsbtodb =3D ilog2(4) =3D 2, so fsbtodb(&sbblock, 1) is 4
> and dev_bsize =3D 2K / 4 =3D DEV_BSIZE again.  It is always DEV_BSIZE aga=
in,
> since the fake sectorsize in newfs is always DEV_BSIZE.  The fakery is
> extended to fsck_ffs by the above assignment.  The above assignment is
> necessary if the kernel is changed to not fake things so much, or to
> fake them differently (newfs can't can change it's i/o methods but can't
> change the generated fs_fsbtodb etc. unless the kernel changes.  All it
> could do much better is to make the conversions between different block
> sizes (real and virtual either way) more explicit).  However, things
> are already confusing enough with a fixed DEV_BSIZE.
Again, you are describing what I pasted above, and what I referred  to
as "UFS (ok, FFS) pretending that sector size is 512 bytes".

I agree that it would be cleaner if FFS does not make such obscuring
arithmetic, but it did not matter while i/o was done in units of
whole fragments. Journal requirements are different.

--Ve0F8fDhlCYJZdTN
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (FreeBSD)

iEYEARECAAYFAk1X2SIACgkQC3+MBN1Mb4hGJACfUvdEVje1OLZamE9cVtDKFqcY
7dgAoLnMYTVl21xIyzO8lQ9tUnJKTDpV
=yLzZ
-----END PGP SIGNATURE-----

--Ve0F8fDhlCYJZdTN--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20110213131411.GI78089>