Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 12 Oct 2024 11:46:56 -0400
From:      Mark Johnston <markj@freebsd.org>
To:        "David E. Cross" <david@crossfamilyweb.com>
Cc:        Cy Schubert <Cy.Schubert@cschubert.com>, Enji Cooper <yaneurabeya@gmail.com>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Review D38047 ... and then there was one....
Message-ID:  <ZwqZ8C5Zwizh1rLF@nuc>
In-Reply-To: <2b399f81-ccca-0eb5-3175-5b59e4253b7b@crossfamilyweb.com>
References:  <20241007150830.939DC334@slippy.cwsent.com> <AE37187C-79D0-4B5E-87F0-6BB52822F03B@gmail.com> <20241011153526.431EA110@slippy.cwsent.com> <2b399f81-ccca-0eb5-3175-5b59e4253b7b@crossfamilyweb.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Oct 11, 2024 at 03:42:54PM -0400, David E. Cross wrote:
> TLDR: This IS almost exclusively a single feature; and the requested
> piecemeal change is actually DANGEROUS because the current brokenness of the
> code is preventing dangerous code from being executed in the first place.
> 
> This is in most ways a single change, 'get nscd to support
> getgrouplist(3)'.  That's it.  The reason I want that is that I have an LDAP
> directory backing my users, but I don't want unauthenticated access to my
> LDAP hierarchy, I also don't want a user readable plaintext password just
> sitting on the machine for anyone to grab; so I use kerberos with machine
> credentials.  But what happens when a non-root local service needs to look
> up a user?  it cannot.  NSCD running in 'perform lookup' mode addresses this
> concern beautifully.  But getgrouplist(3) was never done, and because it was
> never done the fallback code isn't intercepted and it just outright fails.
> 
> This is a entirely new (and previously overlooked) access method. Almost
> certainly the reason it was overlooked is because it is fundamentally
> different than every other entity database operation in the system.  
> Because it is fundamentally different it shares almost no access patterns
> and code with them.
> 
> Because nscd so tightly cooperates with libc, there's a dependency there as
> well.  libc cannot make calls  nscd doesn't understand and nscd cannot
> return data that libc cannot process..
> 
> Yes, there were numerous other bugs encountered during the journey, but at
> the point they were encountered it was already deep into the implementation;
> could they be fixed independently? Maybe, but extracting them increases the
> risk to both those bugs (this code was developed and tested together), and
> to the larger and more complicated feature set).  And the implication here
> is that even if I go through the time and complexity to pull those out the
> actual feature that I want and did the majority of the effort for is just
> going to sit on a  shelf because the complexity is too high.
> 
> All of the substantive fixes made were in the agent code, which I am going
> to guess that nobody (but me) is actually using.  Since the current code
> just does not even work at all (loop/cycle detection is broken); like
> seriously attempt to turn it on and see what happens.
> (perform-actual-lookups passwd yes , perform-actual-lookups group yes; set
> nsswitch.com to " group: files cache [notfound=return] compat" "passwd:
> files cache [notfound=return] compat" and then run "id nosuchuser" ).  In
> which case doing fixes like the loop detection code fix without the rest is
> MORE dangerous, since it then exposes the underlying broken code.
> 
> 
> Why is getgrouplist(3) so different, and so problematic requiring a large
> effectively atomic change?
> 
> The way that nscd is structured is around the typical 'entity' database in
> classic unix.  Give me a name and I will give you the record, give me the ID
> and I will give you the record.  These operations are standard for ethers,
> password, group, hosts, services...
> 
> The way nscd is structured (and it makes sense), is that calls are mapped to
> a key of database, a lookup type, and the lookup value. This 'key' also
> becomes the effective parameter for an RPC call (a loading cache).  The
> previously defined lookup types are 'id', 'name', and 'all' (all being used
> to step through an entire database).
> 
> getgouplist(3) is exactly none of these, additionally the return value is 
> NOT a group structure, or anything like it, it is an array of gid_ts.  So we
> need either a new lookup type, a new database type, or a 4th parameter to
> "some" calls (and thus changing the ENTIRE processing of key lookups in nscd
> and libc). Regardless we still need new marshaling and demarshaling routines
> because this data is unlike anything else in the system.
> 
> An alternative would have been to make a new entity type ('grouplist'?) but
> that would also necessitate creating a new 'agent' on the nscd side, and
> duplicating a tremendous amount of the boilerplate code in many places to
> create an entire new database type that only supported a single operation.
> 
> There really isn't any middle of the road.  To support getgrouplist(3) you
> need ALL of the code.  Could the code be in and be entirely dead and
> uncalled?  sure, but then you're also not supporting getgrouplist(3), and
> then at some point all of that code goes live all at once again, and the
> only thing validated in the interim is that it compiles; none of the code is
> called by anything that insn't getgrouplist(3), and getgrouplist(3) requires
> all of it.

And if all of the commits are pushed together, which they would be, then
users will only see the end result anyway, so there is nothing
inherently dangerous about splitting the change into easier-to-review
commits.

Even if the whole patch is applied atomically, it affects multiple
binaries in the system (at least libc.so and nscd), and there's nothing
guaranteeing that the update is somehow applied atomically across the
whole system.  What if some applications using an older libc end up
still running with an updated nscd?  That's an untested configuration,
but it might still arise if the whole patch is committed as one unit.

I would split it up like this:
1. Rename the cycle prevent symbol.  This is a cosmetic change which is
   irrelevant to the correctness of the rest of the patch.
2. In nscd, fix the problem with negative lookups.  I can't easily see
   if anything other than the query.c change is required there, which is
   part of the reason I asked for the patch to be split up.
3. Use thread-safe lookups in nscd's passwd.c and group.c.  For the
   latter, a bit of work is needed to tease out the bits that add
   getgrouplist() handling.  I believe that consists of
   wrap_getgrouplist() and the handling of nss_lt_pivot in
   group_lookup_func().
4. Add getgrouplist() support to nscd.  At this point, libc doesn't
   dispatch getgrouplist() to nscd, so this change has no effect.
5. Teach libc to marshall and unmarshall getgrouplist() requests.  This
   could reasonably be combined with 4 since the logic is coupled.
6. Fix nscd to export the cycle prevention symbol.

Just the exercise of reading through the patch again with an eye to
splitting it up helped me understand it better and find a couple of
buglets.  It can be split up mechanically using git add -p for the most
part (aside from the fact that parts 3 and 4 touch nscd's group.c).

>From my point of view, your patch hasn't gotten a lot of attention
because
1. It's big and affects multiple binaries.
2. There are few committers familiar with our NSS implementation.

The point of splitting it is to help with 1.  It requires more work from
developers, but then reviewers are much more likely to read and actually
understand it, which helps with 2 in the long run.

> On 10/11/24 11:35, Cy Schubert wrote:
> > In message<AE37187C-79D0-4B5E-87F0-6BB52822F03B@gmail.com>, Enji Cooper
> > writes
> > :
> > > > On Oct 7, 2024, at 08:08, Cy Schubert<Cy.Schubert@cschubert.com> wrote:
> > > > Yes. I was about to suggest this. Plus, any proposed commit log message
> > > > must answer the questions why, what and how. With special
> > > > attention to why=
> > > .
> > > 
> > > I have the same feelings as Cy.
> > > 
> > > FWIW, part of the reason why large/complex changes like this
> > > languish in my r
> > > =
> > > eview queues is in part due to reasons like this.
> > Totally!
> > 
> > > Unless I am a SME in the area who is driven to understand what the
> > > change ai=
> > > ms to achieve, I will not take the time to review large/complex
> > > chances. I h=
> > > ave a lot of other things in my life which take priority over large
> > > code rev=
> > > iews.
> > Agreed. In the case of nscd, I applied one fix to it years ago. Like
> > you, I
> > do not feel qualified to review a large and complex jumbo patch (group of
> > patches).
> > 
> > > Please break the large change down into a smaller set of
> > > changes/reviews to m
> > > =
> > > ake it easier to review the overall change effectively.
> > Yes. I have the same issue at $JOB. We use github enterprise. I refuse to
> > review such patches and yet others, less experienced, summarily approve
> > such patches without even looking at them, breaking our infrastructure.
> > 
> > The choice is clear.
> > 
> > > Cheers,
> > > -Enji=
> 



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?ZwqZ8C5Zwizh1rLF>