From owner-freebsd-net@freebsd.org  Thu Jan  7 16:12:20 2016
Return-Path: <owner-freebsd-net@freebsd.org>
Delivered-To: freebsd-net@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id B54E0A67747
 for <freebsd-net@mailman.ysv.freebsd.org>;
 Thu,  7 Jan 2016 16:12:20 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 44CA61360
 for <freebsd-net@freebsd.org>; Thu,  7 Jan 2016 16:12:20 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u07GCF82070949
 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Thu, 7 Jan 2016 18:12:15 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u07GCF82070949
Received: (from kostik@localhost)
 by tom.home (8.15.2/8.15.2/Submit) id u07GCDRC070948;
 Thu, 7 Jan 2016 18:12:13 +0200 (EET)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Thu, 7 Jan 2016 18:12:13 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Boris Astardzhiev <boris.astardzhiev@gmail.com>
Cc: Luigi Rizzo <rizzo@iet.unipi.it>, Mark Delany <c2h@romeo.emu.st>,
 "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject: Re: Does FreeBSD have sendmmsg or recvmmsg system calls?
Message-ID: <20160107161213.GZ3625@kib.kiev.ua>
References: <20160103214720.72014.qmail@f5-external.bushwire.net>
 <20160104085415.GS3625@kib.kiev.ua>
 <20160104091108.50654.qmail@f5-external.bushwire.net>
 <20160104093859.GV3625@kib.kiev.ua>
 <20160104101747.58347.qmail@f5-external.bushwire.net>
 <20160104194044.GD3625@kib.kiev.ua>
 <20160104210741.32812.qmail@f5-external.bushwire.net>
 <CAP=KkTwfpjec2Tgnm4PRR3u8t4GEqN9Febm5HRcqapifBG-B6g@mail.gmail.com>
 <CA+hQ2+h4NNz9tgSpjJdv7fXteq5tAR7o3LvjV=u08NHjRLPwmA@mail.gmail.com>
 <CAP=KkTzFUDsZwDDLD3n97xJW0qLVZMPduZGSX+eXC3UuLpVjMg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAP=KkTzFUDsZwDDLD3n97xJW0qLVZMPduZGSX+eXC3UuLpVjMg@mail.gmail.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net/>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 07 Jan 2016 16:12:20 -0000

On Thu, Jan 07, 2016 at 03:29:23PM +0200, Boris Astardzhiev wrote:
> First of all thanks for the insightful code review.
> 
> >> +     struct file *fp;
> >> +     cap_rights_t rights;
> >> +
> >> +     if (fget(td, uap->s, cap_rights_init(&rights, CAP_RECV), &fp) != 0)
> >> +             return (EBADF);
> >> +
> >> +     so = fp->f_data;
> >What if uap->s filedescriptor does not reference a socket ?
> 
> kern_sendit/recvit() eventually call getsock_cap().
> To me this means that send/recv(m)msg have to be limited only to sockets.
> I was in a need to grab access to so_error due to the error handling later.
> Probably I need to use something similar like kern_sendit/recvit's
> getsock_cap approach.
I missed this check, so my comment is not applicable.

Hm, why do you need to fget() the uap->s filedescriptor at all ?
Rather, I see why would you need it, but what is done is not completely
correct.  Note that userspace may close uap->s between two calls to
kern_sendit(), or close and reopen for different kind of file,
so your check at the start of the function is mostly useless.

You should change the kern_sendit() interface to accept socket for this
race to be eliminated, or move the guts of kern_sendit() into new function
which takes socket as an argument.  Then getsock_cap() would be done
by your new syscall code and by kern_sendit() etc.

After looking closer at it, additional argument to not use kern_sendit()
is because you miss conditions like nullification of EINTR/ERESTART, which
should stop the iteration.

> Comments or a better approach to so_error?
> 
> >>> +
> >>> +     vlen = uap->vlen;
> >>> +     if (vlen > UIO_MAXIOV)
> >>> +             vlen = UIO_MAXIOV;
> >>> +
> >>> +     len = vlen * sizeof(*uap->msgvec);
> >>> +     vec = malloc(len, M_TEMP, M_WAITOK);
> >>This is a user-controlled malloc(9), i.e. an easy kernel panic due to
> >>impossible to satisfy request, or KVA and memory exhaustion.
> 
> Having a glance in the manpage there it is stated that:
>      M_WAITOK
>              Indicates that it is OK to wait for resources.  If the request
>              cannot be immediately fulfilled, the current process is put to
>              sleep to wait for resources to be released by other processes.
>              The malloc(), realloc(), and reallocf() functions cannot return
>              NULL if M_WAITOK is specified.
> I.e we'll be put to sleep and no NULL will be returned. I don't see how I'll
> get a panic here. Or you're stressing the malloc type M_TEMP here?
> More clarifications?
Suppose that malicious user code passed a value of 10 000 000 for uap->vlen.
How many bytes for kernel malloc would be consumed ?  This is local DoS.

What would happen if the requested size is bigger than the kmem size, so
that the allocation cannot succeed even theoretically ?

> >What are the reasons to go with the trivial kernel implementation
> >right now, instead of trivial and simpler libc wrapper ?  I think the
> >speed would be same, but the code _much_ simpler.
> 
> My initial idea was such that with the kernel approach we get closer to the
> driver itself and if it has mm optimizations we may have a benefit in
> future.
> And thus one day we won't need to redo the work scrubbing the libc-only
> implementation.

I suspect that the actual syscall interface would be more rich than the
direct translation of the function signature into the syscall.  IMO it
would be easier for you to get the pure libc implementation correct
for the first step.

With such changes, I alway much prefer to have the hard part prototyped
before making the bound changes in the API/ABI, which in this case means
that some protocol should be enchanced to the multi-message interface.