From: Xavier Hernandez
Date: Fri, 8 Jan 2016 09:02:15 +0100
Subject: Re: [Gluster-devel] FreeBSD port of GlusterFS racks up a lot of CPU usage
To: Raghavendra G, Rick Macklem
Cc: freebsd-fs, Gluster Devel, Hubbard Jordan
Message-ID: <568F6D07.6070500@datalab.es>

On 08/01/16 05:42, Raghavendra G wrote:
> Sorry for the delayed reply. I had missed this mail. Please find my
> comments inline.
>
> On Thu, Dec 31, 2015 at 4:56 AM, Rick Macklem wrote:
>
> > Jordan Hubbard wrote:
> > >
> > > > On Dec 30, 2015, at 2:31 AM, Niels de Vos wrote:
> > > >
> > > >> I'm guessing that Linux uses the event-epoll stuff instead of event-poll,
> > > >> so it wouldn't exhibit this. Is that correct?
> > > >
> > > > Well, both. Most (if not all) Linux builds will use event-poll. But
> > > > that calls epoll_wait() with a timeout of 1 millisecond as well.
> > > >
> > > >> Thanks for any information on this, rick
> > > >> ps: I am tempted to just crank the timeout of 1msec up to 10 or 20msec.
> > > >
> > > > Yes, that is probably what I would do too. And have both poll functions
> > > > use the same timeout, have it defined in libglusterfs/src/event.h. We
> > > > could make it a configurable option too, but I do not think it is very
> > > > useful to have.
> > >
> > > I guess this begs the question - what’s the actual purpose of polling for an
> > > event with a 1 millisecond timeout? If it was some sort of heartbeat check,
> > > one might imagine that would be better served by a timer with nothing close
> > > to 1 millisecond as an interval (that would be one seriously aggressive
> > > heartbeat) and if filesystem events are incoming that glusterfs needs to
> > > respond to, why timeout at all?
> > >
> > If I understand the code (I probably don't) the timeout allows the loop
> > to call a function that may add new fd's to be polled. (If I'm right,
> > the new ones might not get serviced.)
>
> Yes, that's correct. Since in poll we pass the fds to be polled in an
> array as an argument, the only place where we can add/remove fds to be
> polled is at the time we call the poll syscall. To make adding/removing
> fds more responsive, poll times out "frequently enough".
> The trade-off we are considering here is between:
>
> 1. Number of calls to poll
> vs
> 2. Responsiveness of adding/removing a new fd from polling.
>
> For clients, the list of fds that are polled does not change much.
> However, for bricks/server this list can vary frequently as new
> clients are connected/disconnected.
>
> Since epoll provides a way to add new fds for polling while an
> epoll_wait is in progress (unlike poll), the timeout of epoll_wait is
> infinite. Also note that on systems where both epoll and poll are
> available, epoll is preferred over poll.

I don't know anything about gluster's poll implementation so I may be
totally wrong, but would it be possible to use an eventfd (or a pipe if
eventfd is not supported) to signal the need to add more file
descriptors to the poll call?

The poll call should listen on this new fd. When we need to change the
fd list, we should simply write to the eventfd or pipe from another
thread. This will cause the poll call to return and we will be able to
change the fd list without needing a short timeout or having to decide
on any trade-off.

Just an idea...

Xavi

>
> > I'll post once I've tried a longer timeout and if it seems ok, I will
> > put it in the Redhat bugs database (as mentioned in the last post).
> > In its current form, it's fine for testing.
> >
> > > I also have a broader question to go with the specific one: We (at
> > > iXsystems) were attempting to engage with some of the Red Hat folks back
> > > when the FreeBSD port was first done, in the hope of getting it more
> > > “officially supported” for FreeBSD and perhaps even donating some more
> > > serious stress-testing and integration work for it, but when those Red Hat
> > > folks moved on we lost continuity and the effort stalled. Who at Red Hat
> > > would / could we work with in getting this back on track? We’d like to
> > > integrate glusterfs with FreeNAS 10, and in fact have already done so but
> > > it’s still early days and we’re not even really sure what we have yet.
> > >
> > Just fyi.. so far, working with FreeBSD11/head and the port of 3.7.6
> > (the port tarball is in FreeBSD PR#194409), the only GlusterFS problem
> > I've encountered is the above one.
> > I'm not sure why this isn't in /usr/ports, but that would be
> > nice as it might get more people trying it. (I'm a src committer, but
> > not a ports one.)
> >
> > However, I have several patches for the FreeBSD fuse interface and for
> > a mount_glusterfs mount to work ok you need a couple of them.
> > 1 - When an open decides to do DIRECT_IO after the file has done buffer
> >     cache I/O the buffer cache needs to be invalidated so you don't get
> >     stale cached data.
> > 2 - For a WRONLY write, you need to force DIRECT_IO (or do a
> >     read/write open).
> >     If you don't do this, the buffer cache code will get stuck when
> >     trying to read a block in before writing a partial block. (I think
> >     this is what FreeBSD PR#194293 is caused by.)
> >
> > Because I won't be able to do svn until April, these patches won't
> > make it into head for a while, but they will both be in PR#194293
> > within hours.
> >
> > The others add features like extended attributes, advisory byte
> > range locking and the changes needed to export the fuse/glusterfs
> > mount via the FreeBSD kernel nfsd. If anyone wants/needs these
> > patches, email me and I can send them to you.
> >
> > A bit off your topic, but until you have the fixes for FreeBSD fuse, you
> > probably can't do a lot of serious testing.
> > (I don't know, but I'd guess that FreeNAS has about the same fuse module
> > code as FreeBSD's head, since it hasn't been changed much in head
> > recently.)
> >
> > Thanks everyone for your help with this, rick
> >
> > > Thanks,
> > >
> > > - Jordan
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
>
> --
> Raghavendra G
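
A minimal sketch of the wake-up idea from Xavi's reply above (have poll
listen on an eventfd or pipe, and write to it from another thread when
the fd list must change, instead of relying on a short timeout). This is
not GlusterFS code and makes several assumptions: it is written in
Python rather than C for brevity, it uses a pipe rather than an eventfd
so that it does not depend on eventfd support, and the names
WakeablePoller, add_fd and demo are hypothetical, chosen only for
illustration.

```python
import os
import select
import threading


class WakeablePoller:
    """Poll loop that blocks with no timeout but can be woken to add fds.

    Hypothetical illustration only; not taken from GlusterFS.
    """

    def __init__(self):
        # Wake-up pipe: the read end is always polled, and writing to the
        # write end forces poll() to return. An eventfd could play the
        # same role where it is available.
        self._wake_r, self._wake_w = os.pipe()
        os.set_blocking(self._wake_r, False)
        self._poll = select.poll()
        self._poll.register(self._wake_r, select.POLLIN)
        self._pending = []               # fds queued by other threads
        self._lock = threading.Lock()
        self._stop = False
        self.ready = []                  # (fd, data) pairs read by the loop
        self.got_data = threading.Event()

    def add_fd(self, fd):
        """Callable from any thread: queue the fd, then wake the loop."""
        with self._lock:
            self._pending.append(fd)
        os.write(self._wake_w, b"x")

    def stop(self):
        """Ask the loop to exit and wake it so it notices."""
        self._stop = True
        os.write(self._wake_w, b"x")

    def _drain_wakeups(self):
        # Several wake-ups may have accumulated; empty the pipe.
        try:
            while os.read(self._wake_r, 4096):
                pass
        except BlockingIOError:
            pass

    def run(self):
        while not self._stop:
            # No timeout: returns only on real I/O or on a wake-up.
            for fd, _events in self._poll.poll():
                if fd == self._wake_r:
                    self._drain_wakeups()
                    with self._lock:
                        new, self._pending = self._pending, []
                    for new_fd in new:
                        self._poll.register(new_fd, select.POLLIN)
                else:
                    data = os.read(fd, 4096)
                    if data:
                        self.ready.append((fd, data))
                        self.got_data.set()
                    else:
                        self._poll.unregister(fd)   # peer closed


def demo():
    """Add a pipe to the poll set while the loop is blocked in poll()."""
    poller = WakeablePoller()
    loop = threading.Thread(target=poller.run)
    loop.start()
    r, w = os.pipe()
    poller.add_fd(r)                     # wakes the loop, which registers r
    os.write(w, b"hello")
    poller.got_data.wait(timeout=5)
    poller.stop()
    loop.join(timeout=5)
    os.close(r)
    os.close(w)
    return [data for _fd, data in poller.ready]
```

Calling demo() adds a new fd while the loop thread is already blocked in
poll() with no timeout; the loop is woken through the pipe, registers
the fd and then reads from it. A C version would presumably follow the
same structure with poll(2) and either eventfd(2) or pipe(2), which is
the trade-off-free behaviour the reply is asking about.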