MIME-Version: 1.0
From: Thomas Munro <munro@ip9.org>
Date: Tue, 2 Oct 2018 15:24:23 +1300
Message-ID: <CADLWmXXcdbL6wyLUktGzp=41zmbRjxw30FU=Ait-jfd8NcQSyQ@mail.gmail.com>
Subject: Regression when trying to replace poll() with kqueue()
To: freebsd-hackers@freebsd.org
Cc: mjg@freebsd.org, alc@freebsd.org, markj@freebsd.org, 
 Konstantin Belousov <kib@freebsd.org>
Content-Type: text/plain; charset="UTF-8"

Hello FreeBSD hackers,

(CCing mjg and a list of other FreeBSD hackers he suggested)

In a fit of enthusiasm for FreeBSD, a couple of years ago I wrote a
patch to teach PostgreSQL to use kqueue(2).  That was after we
switched over to epoll(2) on Linux for performance reasons.  Our
default is to use poll(2) unless we have something better.  The most
common usage pattern is simply waiting for read/write readiness on the
socket that is connected to the client + a pipe connected to the
parent supervisor process ("postmaster"), but we have plans for more
interesting kinds of multiplexing involving many more descriptors, and
in general this sits behind our very thin abstraction called
WaitEventSet (see latch.c in the PostgreSQL source tree) that can be
used for many things.
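
As a rough illustration of that usage pattern (this is not the
PostgreSQL C code in latch.c), the sketch below waits for read
readiness on a socket plus a pipe.  It uses Python's selectors module,
whose DefaultSelector picks kqueue on the BSDs, epoll on Linux, and
falls back to poll or select elsewhere.  The socketpair, the pipe, the
labels and the helper wait_for_events are hypothetical stand-ins.

```python
import os
import selectors
import socket


def wait_for_events(sel, timeout=None):
    """Return the labels of all registered descriptors that are ready."""
    return [key.data for key, _mask in sel.select(timeout)]


# Hypothetical stand-ins: a socketpair plays the client connection and a
# pipe plays the link to the supervisor process.
client_end, server_end = socket.socketpair()
pipe_r, pipe_w = os.pipe()

# DefaultSelector chooses the best mechanism the platform offers.
sel = selectors.DefaultSelector()
sel.register(server_end, selectors.EVENT_READ, data="client socket")
sel.register(pipe_r, selectors.EVENT_READ, data="supervisor pipe")

# Nothing has been written yet, so a zero-timeout wait reports nothing.
nothing_ready = wait_for_events(sel, timeout=0)

# Once the "client" sends a query, only the socket becomes readable.
client_end.sendall(b"SELECT 1")
ready = wait_for_events(sel, timeout=1.0)
print(type(sel).__name__, ready)

sel.close()
client_end.close()
server_end.close()
os.close(pipe_r)
os.close(pipe_w)
```

The set of descriptors is registered once and then waited on
repeatedly, which is the property that makes kqueue and epoll
attractive compared with rebuilding a poll array for every wait.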

We did some testing using "pgbench" (instructions below) on various
platforms that have kqueue(2), and we got some conflicting results
from FreeBSD.  When the system is heavily overloaded (a scenario we
want to work well, or at least not get worse under kqueue, even if
it's not the ideal way to run your database server), mjg reported that
with the kqueue patch performance was way better than unpatched when
the pgbench test client was running on a different host.  Huzzah!

Unfortunately, another tester reported the performance was worse when
running pgbench from the same host (originally he complained about
NetBSD performance and then we realised FreeBSD was the same under
those conditions), and I confirmed that was the case for both Unix
sockets and TCP sockets.  In one 96 (!) thread test, the throughput
reported by pgbench dropped from 70k to 50k queries per second on an 8 CPU
system.  As crazy as those test conditions may seem, that is not a
good result.

Curiously, when truss'd, in the overloaded scenario that performs
worse, we very rarely seem to actually reach kevent(2).  It seems like
there is some kind of scheduling difference producing the change.
Each PostgreSQL server process looks like this over ~10 seconds:

syscall                     seconds   calls  errors
sendto                  0.396840146    3452       0
recvfrom                0.415802029    3443       6
kevent                  0.000626393       6       0
gettimeofday            2.723923249   24053       0
                      ------------- ------- -------
                        3.537191817   30954       6

(That was captured on a virtualised system which had gettimeofday as a
syscall, but the effect has also been reported on bare metal, where no
gettimeofday calls show up; I don't believe that is a factor).
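
The matching counts in that table (6 recvfrom errors, 6 kevent calls)
are consistent with a server that tries a non-blocking read first and
only waits for readiness when the read would block.  A minimal Python
sketch of that assumed pattern follows; read_message and the stats
counters are hypothetical names, and the selectors module stands in
for kevent(2) or poll(2).

```python
import selectors
import socket
import threading


def read_message(sock, sel, stats):
    """Try recv() first; wait for readiness only if it would block."""
    while True:
        try:
            data = sock.recv(4096)
            stats["recv_ok"] += 1
            return data
        except BlockingIOError:  # EWOULDBLOCK / EAGAIN
            stats["recv_would_block"] += 1
            sel.select()  # the only place the multiplexer is used
            stats["waits"] += 1


client, server = socket.socketpair()
server.setblocking(False)
sel = selectors.DefaultSelector()
sel.register(server, selectors.EVENT_READ)
stats = {"recv_ok": 0, "recv_would_block": 0, "waits": 0}

# Case 1: data is already queued, so recv() succeeds with no wait at all.
client.sendall(b"query 1")
first = read_message(server, sel, stats)

# Case 2: the peer is slow, so recv() fails once and we wait for readiness.
timer = threading.Timer(0.2, client.sendall, args=(b"query 2",))
timer.start()
second = read_message(server, sel, stats)
timer.join()

print(stats)
sel.close()
client.close()
server.close()
```

When the peer always has the next message queued, as in an overloaded
benchmark, only the first case occurs and the wait primitive is almost
never called.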

The pgbench client looks like this:

syscall                     seconds   calls  errors
ppoll                   0.002773195       1       0
sendto                 16.597880468    7217       0
recvfrom               25.646406008    7238       0
                      ------------- ------- -------
                       42.247059671   14456       0

(For whatever reason pgbench uses ppoll() instead, but I assume that's
irrelevant here; it's also multi-threaded, unlike the server.)  The
truss -c results for the server are not much different when using
poll(2) instead of kevent(2), although recvfrom in the pgbench client
seems to show a few seconds less total time, which is curious.  You
can see that we're mostly able to do sendto() and recvfrom() without
seeing EWOULDBLOCK.  So it's not direct access to the kqueue that is
affecting performance.  It's something else, something caused by the
mere existence of the kqueue object holding the descriptor.
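
To make the numbers in the two truss -c summaries easier to compare,
the following sketch computes the mean time per call and the fraction
of server recvfrom calls that returned an error.  The figures are
copied from the tables above; the variable names are hypothetical.
Note the assumptions: the reported times presumably include time spent
blocked inside a call, and the client's totals are summed over many
threads, so these are not CPU costs.

```python
# (total seconds, calls) copied from the truss -c summaries above
server = {
    "sendto": (0.396840146, 3452),
    "recvfrom": (0.415802029, 3443),
    "kevent": (0.000626393, 6),
    "gettimeofday": (2.723923249, 24053),
}
client = {
    "ppoll": (0.002773195, 1),
    "sendto": (16.597880468, 7217),
    "recvfrom": (25.646406008, 7238),
}


def mean_microseconds(table):
    """Mean elapsed time per call, in microseconds, for each syscall."""
    return {name: secs / calls * 1e6 for name, (secs, calls) in table.items()}


server_us = mean_microseconds(server)
client_us = mean_microseconds(client)

# Fraction of server recvfrom calls that failed (6 errors in the table).
recv_error_fraction = 6 / server["recvfrom"][1]

for name in ("sendto", "recvfrom"):
    print(f"{name}: server {server_us[name]:.0f} us, client {client_us[name]:.0f} us")
print(f"server recvfrom error fraction: {recv_error_fraction:.4f}")
```

On these figures the server spends on the order of a hundred
microseconds per sendto or recvfrom, while the client spends a few
milliseconds per call, and fewer than 1 in 500 server reads fail.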

That led several people to speculate that there may be a difference in
the wakeup logic, when one end of a descriptor is in a kqueue (mjg
speculated wake-up-one vs broadcast could be a factor), and that may
be leading to worse scheduling behaviour.

To be clear, nobody thinks that 96 client threads talking to 96
processes on a single 8 CPU box is a great way to run a system in real
life!  But it's still surprising that we lose performance when using
kqueue, and it'd be great to understand why, and hopefully improve it.

The complete discussion on pgsql-hackers is here:

https://www.postgresql.org/message-id/flat/CAEepm%3D37oF84-iXDTQ9MrGjENwVGds%2B5zTr38ca73kWR7ez_tA%40mail.gmail.com

Any ideas would be most welcome.

Thanks for reading!

====

Reproduction steps (assuming you have git, gmake, flex, bison,
readline, curl, ccache):

# grab postgres
git clone https://github.com/postgres/postgres.git
cd postgres

# grab kqueue patch
curl -O https://www.postgresql.org/message-id/attachment/65098/0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch
git checkout -b kqueue
git am 0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch

# build
./configure --prefix=$HOME/install --with-includes=/usr/local/include \
    --with-libs=/usr/local/lib CC="ccache cc"
gmake -s -j8
gmake -s install
gmake -C contrib/pg_prewarm install

# create a db cluster and set it to use 2GB of shmem so we can hold
# the whole dataset
~/install/bin/initdb -D ~/pgdata
echo "shared_buffers = '2GB'" >> ~/pgdata/postgresql.conf

# you can either start (and later stop) postgres in the background with pg_ctl:
~/install/bin/pg_ctl start -D ~/pgdata
# ... or just run it in the foreground and hit ^C to stop it:
# ~/install/bin/postgres -D ~/pgdata

# this should produce about 1.1GB of data under ~/pgdata
~/install/bin/pgbench -s 10 -i postgres

# install the prewarm extension, so we can run the test without doing
# any file IO
~/install/bin/psql postgres -c "create extension pg_prewarm"

# after that, after any server restart, prewarm like so:
~/install/bin/psql postgres -c "select pg_prewarm(c.oid::regclass)
from pg_class c where relkind in ('r', 'i')" | cat

# then 60 second pgbench runs are simply:
~/install/bin/pgbench -c 96 -j 96 -M prepared -S -T 60 postgres

# to make pgbench use TCP instead of Unix sockets, add -h localhost;
# to allow connection from another host, update ~/pgdata/postgresql.conf's
# listen_addresses