Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 21 Aug 2018 02:45:35 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 230792] sort -R, --random-source issues
Message-ID:  <bug-230792-227@https.bugs.freebsd.org/bugzilla/>

next in thread | raw e-mail | index | archive | help
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D230792

            Bug ID: 230792
           Summary: sort -R, --random-source issues
           Product: Base System
           Version: CURRENT
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: bin
          Assignee: bugs@FreeBSD.org
          Reporter: cem@freebsd.org

sort(1)'s --random-source has major problems.

It attempts to MD5 the entire[1] provided file to seed its RNG.  As a speci=
al
case (and the default), if the path matches "/dev/random" exactly, it "only"
fetches 1024 bytes to MD5 for its seed.

[1]: It just loops until read(2) returns EOF.  This may never happen for un=
ix
sockets or devices like /dev/urandom.

So the first and most obvious bug fix is that --random-source cannot be rea=
ding
megabytes out of /dev/urandom and it is unlikely that reading megabytes out=
 of
random regular files is beneficial either.

My suggestions from least controversial to more controversial:

1. Check for random via st_ino/st_rdev rather than path.  This will match
urandom or non-absolute path spellings of the random device.
   a.  In this case, only read 16-32 bytes =E2=80=94 not 1024.  32 bytes fr=
om
/dev/[u]random is more than the context size of MD5 (16 bytes).

2. Reject non-regular files other than /dev/random entirely.

3. Reject regular files larger than 1024 bytes.

4. Don't MD5 the output of the random device.  It doesn't give any benefit.


------------------------------------------------


-R's implementation leaves a lot to be desired too.

The implementation is:

1. An initial MD5 context, "md5_ctx", is seeded with the ASCII digest from =
the
random source file.

2. A comparator, "randomcoll", copies that digest twice and concatenates to
each the two lines to be compared.

3. The two MD5s are finalized and the digests are converted to (!)malloced
ASCII strings via MD5End().

4. If MD5End() hit ENOMEM, *that is incorporated into the sort order*.

5. Otherwise, the digests are compared *with strcmp()*. and the result
returned.

The goal is to provide "random" ordering, but stable results for repeated
comparisons of the same pair of lines.

Major problems:

0. ENOMEM should cause immediate exit and abort =E2=80=94 not affect sort o=
rder!

1. There is no need to malloc out an ASCII digest instead of just memcmping=
 the
binary digest.

2. Even so, this (primitive keyed hash) has got to be one of the most expen=
sive
possible ways of producing a repeatable ordering between keys.  A better me=
thod
might be to pre-fill an NxN lookup table of line number pairs with e.g.
"(int)arc4random_uniform(3) - 1" (or some keyed random function if
--random-source and a regular file was provided) and simply reference that =
LUT
from the comparator.  (That's a literal 3, not the manual page section.)

That's what I've got off the top of my head.  There are probably other issu=
es.

--=20
You are receiving this mail because:
You are the assignee for the bug.=



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?bug-230792-227>