Date:      Tue, 7 Apr 2015 14:24:00 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Simon Gerraty <sjg@juniper.net>, "arch@FreeBSD.org" <arch@freebsd.org>, Anuranjan Shukla <anshukla@juniper.net>, freebsd-arch@freebsd.org
Subject:   Re: RFC: chgsbsize doesn't handle -ve sbsize correctly
Message-ID:  <20150407114358.B1007@besplex.bde.org>
In-Reply-To: <476184366.nOLsX7PO1g@ralph.baldwin.cx>
References:  <D14391A1.294AE%anshukla@juniper.net> <476184366.nOLsX7PO1g@ralph.baldwin.cx>

On Mon, 6 Apr 2015, John Baldwin wrote:

> On Friday, April 03, 2015 07:38:59 AM Anuranjan Shukla wrote:
>> Hi,
>> In struct uidinfo{}, ui_sbsize member is an int. In a system with large
>> number of sockets owned by a user it's possible for ui_sbsize to roll over
>> to a negative value. When that happens, further calls to _increase_ sbsize
>> don't detect this condition as they should per design. But when the
>> sockets start shutting down this condition would be detected and you'll
>> see the console prints ("negative sbsize for uid =") happen. As a worst
>> case the console being flooded with these prints can create a DoS scenario
>> (specially on single CPU systems).
>>
>> Regardless of the end result, the code itself needs to be fixed for
>> correctness. There are two things to note:
>> - Int doesn't seem to be the correct type for ui_sbsize. Should be a u_int
>> atleast.
>> - Since there's no real prevention of the overflow as it happens, the
>> warning print isn't helpful and should be a DEBUG level log at best. If we
>> change ui_sbsize to be a u_int, the negative check itself can go.

It could be an unconditional panic.

>> ...
>
> Note that ui_sbsize is long, not int.  That doesn't help you on 32-bit
> platforms, but you are also stuck with 32 bits since you typically don't
> have 64-bit atomics on 32-bit platforms.
>
> Making it unsigned just means you can't detect overflow anymore.  Instead

You can detect it as wrap.  I don't like letting the long actually overflow,
but this seems simplest.

> I think it should be fixed to detect the overflow when increasing the size
> and fail then.  Several nearby functions would need the same fix btw:
>
> Index: kern_resource.c
> ===================================================================
> --- kern_resource.c	(revision 281146)
> +++ kern_resource.c	(working copy)
> @@ -1380,17 +1380,18 @@ chgproccnt(struct uidinfo *uip, int diff, rlim_t m
> int
> chgsbsize(struct uidinfo *uip, u_int *hiwat, u_int to, rlim_t max)
> {
> +	long new;
> 	int diff;
>
> 	diff = to - *hiwat;

Note that the apparently-simple subtraction here is hard to understand
due to the misuse of unsigned types.  For negative adjustments, the
RHS is a huge unsigned value.  This is converted to a non-huge negative
value by assignment to diff.  Callers must ensure that the unsigned
values are non-huge so that they would have fitted in ints for the
final step to work.

> +	new = atomic_fetchadd_long(&uip->ui_sbsize, (long)diff) + diff;
> 	if (diff > 0) {
> -		if (atomic_fetchadd_long(&uip->ui_sbsize, (long)diff) + diff > max) {
> +		if (new < 0 || new > max) {
> 			atomic_subtract_long(&uip->ui_sbsize, (long)diff);
> 			return (0);
> 		}
> 	} else {
> -		atomic_add_long(&uip->ui_sbsize, (long)diff);
> -		if (uip->ui_sbsize < 0)
> +		if (new < 0)
> 			printf("negative sbsize for uid = %d\n", uip->ui_uid);
> 	}
> 	*hiwat = to;

It is simpler to fix up any mess in both cases.  The above could panic
when the size goes negative, but it uses extra logic to continue with
a corrupted size.  Also remove the redundant casts.  This code can be
assumed to not be compiled with a K&R compiler or otherwise not have
a prototype for the (inline) atomic functions in scope.  Also partially
fix the printf format error for the uid.  uid_t is unsigned and can have
any size, but its promotion was assumed to be signed int.  Keep assuming
that its promition has the same size as u_int.

 	new = atomic_fetchadd_long(&uip->ui_sbsize, diff) + diff;
 	if (new < 0 || new > max) {
 		atomic_subtract_long(&uip->ui_sbsize, diff);
 		if (diff < 0 && new < 0)
 			printf("negative sbsize for uid = %u\n", uip->ui_uid);
 		return (0);
 	}
 	*hiwat = to;

Note: it is dangerous to let ui_sbsize actually overflow, but this can't
happen unless there is a bug in the accounting.

'max' is usually RLIM_INFINITY when the size is being reduced, but really
shouldn't be used in that case.  The old code doesn't use it, but my new
code depends on it being the max value of an rlim_t, which it normally is.

When the size is being increased, 'max' is the user's sbsize rlimit.
That defaults to RLIM_INFINITY, so it is no limit at all, and is 2**31
times larger than the virtual address space on 32-bit arches.  The size
is also limited by the 32-bit args.  On 64-bit arches, you just can't
add up enough u_ints for ui_sbsize to get near overflowing (physical
resources are probably at most a few petabytes, so they run out first
by a factor of 2**20 or so).  On 32-bit systems, with the default
maxsockbuf limit of 2M, you need just 1024 maximally sized sockets
to reach the limit.

There are still races in the above on 32-bit arches.  Physical limits
would run out long before overflow occurs.  However:
- 1024 threads running in parallel might cause the overflow before any
   resources are allocated
- just 1 more thread running in parallel and trying to reduce the count
   might see it remain negative.  This happens for the order:
   - 1 thread allocates uncontended
   - 1024 threads increase the count to 2 steps past overflow
   - first thread decreases the count to 1 step past overflow
- another 1024 threads running in parallel might wrap the count completely
   (1024 steps to overflow to LONG_MAX + 1 = negative; another 1024 steps
   to overflow to -1 + 1 = 0).  This breaks the overflow detection.

Using signed types gives more chance for the overflow detection to
work, unless someone shoots themself in the foot by increasing sb_max
to LONG_MAX.  (BTW, sb_max has a bogus type (u_long), the sysctl for
changing it has a bad name (not matching the variable name), and the
sysctl is badly implemented (it uses lots of code just to prevent the
foot-shooting of setting the limit to a small value; large values are
allowed and sb_max is changed non-atomically).  I prefer to allow any
foot shooting.  This is most useful for testing limits.)

I first thought that the races could be fixed by limiting ui_sbsize to
a value much smaller than the max of its type (say LONG_MAX / 2).  This
doesn't quite work due to the possibility of concurrent accesses.
The atomic accesses would have to be slightly more careful to prevent
the count ever exceeding a limit.  E.g., read the value non-atomically,
check that it is not too large, add in a local variable, and use
atomic_cmpset to store the result if the value read didn't change.
This may be simpler anyway since it doesn't need the fixup step.
Probably-buggy implementation:

 	if (diff < -LONG_MAX / 2 || diff > LONG_MAX / 2)
 		return (1);		/* XXX callers should limit it. */
 	for (;;) {
 		old = uip->ui_sbsize;	/* May be stale even with atom. read. */
 		new = old + diff;	/* XXX bogus overflow detection next. */
 		KASSERT(new >= 0, ("negative total sbsize for uid = %u\n",
 		    uip->ui_uid));
 		if (new > max || new > LONG_MAX / 2)
 			return (0);	/* Give up a bit too easily. */
 		if (atomic_cmpset_long(&uip->ui_sbsize, old, new)) {
 			*hiwat = to;
 			return (1);
 		}
 	}

Remove the hackish error checking to make this easier to understand.
It should be removed in the final version.  I don't like KASSERT()s
taking more space than the main code to check for things that can't
happen.

Bruce


