Date: Tue, 2 Oct 2018 15:24:23 +1300
From: Thomas Munro <munro@ip9.org>
To: freebsd-hackers@freebsd.org
Cc: mjg@freebsd.org, alc@freebsd.org, markj@freebsd.org, Konstantin Belousov <kib@freebsd.org>
Subject: Regression when trying to replace poll() with kqueue()
Message-ID: <CADLWmXXcdbL6wyLUktGzp=41zmbRjxw30FU=Ait-jfd8NcQSyQ@mail.gmail.com>

Hello FreeBSD hackers,

(CCing mjg and a list of other FreeBSD hackers he suggested)

In a fit of enthusiasm for FreeBSD, a couple of years ago I wrote a patch to teach PostgreSQL to use kqueue(2). That was after we switched over to epoll(2) on Linux for performance reasons. Our default is to use poll(2) unless we have something better. The most common usage pattern is simply waiting for read/write readiness on the socket that is connected to the client + a pipe connected to the parent supervisor process ("postmaster"), but we have plans for more interesting kinds of multiplexing involving many more descriptors, and in general this sits behind our very thin abstraction called WaitEventSet (see latch.c in the PostgreSQL source tree) that can be used for many things.

We did some testing using "pgbench" (instructions below) on various platforms that have kqueue(2), and we got some conflicting results from FreeBSD. When the system is heavily overloaded (a scenario we want to work well, or at least not get worse under kqueue, even if it's not the ideal way to run your database server), mjg reported that with the kqueue patch performance was way better than unpatched when the pgbench test client was running on a different host. Huzzah!

Unfortunately, another tester reported that performance was worse when running pgbench from the same host (originally he complained about NetBSD performance and then we realised FreeBSD was the same under those conditions), and I confirmed that was the case for both Unix sockets and TCP sockets. In one 96 (!) thread test, the TPS reported by pgbench dropped from 70k to 50k queries per second on an 8 CPU system. As crazy as those test conditions may seem, that is not a good result.

Curiously, when truss'd, in the overloaded scenario that performs worse, we very rarely seem to actually reach kevent(2). It seems like there is some kind of scheduling difference producing the change.

Each PostgreSQL server process looks like this over ~10 seconds:

  syscall                     seconds   calls  errors
  sendto                  0.396840146    3452       0
  recvfrom                0.415802029    3443       6
  kevent                  0.000626393       6       0
  gettimeofday            2.723923249   24053       0
                        ------------- ------- -------
                          3.537191817   30954       6

(That was captured on a virtualised system which had gettimeofday as a syscall, but the effect has been reported on bare metal too, and there no gettimeofday calls show up; I don't believe that is a factor.)

The pgbench client looks like this:

  syscall                     seconds   calls  errors
  ppoll                   0.002773195       1       0
  sendto                 16.597880468    7217       0
  recvfrom               25.646406008    7238       0
                        ------------- ------- -------
                         42.247059671   14456       0

(For whatever reason pgbench uses ppoll() instead, but I assume that's irrelevant here; it's also multi-threaded, unlike the server.)

The truss -c results for the server are not much different when using poll(2) instead of kevent(2), although recvfrom in the pgbench client seems to show a few seconds less total time, which is curious.

You can see that we're mostly able to do sendto() and recvfrom() without seeing EWOULDBLOCK. So it's not direct access to the kqueue that is affecting performance. It's something else, something caused by the mere existence of the kqueue object holding the descriptor. That led several people to speculate that there may be a difference in the wakeup logic when one end of a descriptor is in a kqueue (mjg speculated wake-up-one vs broadcast could be a factor), and that may be leading to worse scheduling behaviour.

To be clear, nobody thinks that 96 client threads talking to 96 processes on a single 8 CPU box is a great way to run a system in real life! But it's still surprising that we lose performance when using kqueue, and it'd be great to understand why, and hopefully improve it.

The complete discussion on pgsql-hackers is here:

https://www.postgresql.org/message-id/flat/CAEepm%3D37oF84-iXDTQ9MrGjENwVGds%2B5zTr38ca73kWR7ez_tA%40mail.gmail.com

Any ideas would be most welcome.
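
For readers who want to see why the server so rarely reaches kevent(2) in the truss output above, here is a minimal C sketch of the "try the I/O first, wait only on EWOULDBLOCK" pattern. It is an illustration under assumptions, not PostgreSQL's actual code: the names set_nonblocking, recv_try_first and struct recv_stats are hypothetical, and the fallback wait uses poll(2) so that the sketch builds on any POSIX system.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical counters, for illustration only. */
struct recv_stats {
    long recv_calls; /* recv() attempts, successful or not */
    long waits;      /* times recv() would have blocked and we had to wait */
};

/* Put a descriptor into non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Read from a non-blocking socket: attempt recv() first and enter the
 * readiness wait only if it reports EAGAIN/EWOULDBLOCK.
 * Returns the number of bytes read, 0 on EOF, or -1 on error.
 */
static ssize_t recv_try_first(int fd, void *buf, size_t len,
                              struct recv_stats *st)
{
    assert(st != NULL);

    for (;;) {
        ssize_t n = recv(fd, buf, len, 0);

        st->recv_calls++;
        if (n >= 0)
            return n;               /* data or EOF without ever waiting */
        if (errno == EINTR)
            continue;               /* interrupted by a signal, retry */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;              /* real error */

        /* Only now do we reach the wait primitive (poll here, as a stand-in). */
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

        st->waits++;
        while (poll(&pfd, 1, -1) < 0) {
            if (errno != EINTR)
                return -1;
        }
    }
}
```

If the peer's data has usually arrived by the time the process is scheduled, waits stays near zero while recv_calls grows, which matches the shape of the server-side numbers above (3443 recvfrom calls, 6 of them errors, 6 kevent calls).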

Thanks for reading!

====

Reproduction steps (assuming you have git, gmake, flex, bison, readline, curl, ccache):

# grab postgres
git clone https://github.com/postgres/postgres.git
cd postgres

# grab kqueue patch
curl -O https://www.postgresql.org/message-id/attachment/65098/0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch
git checkout -b kqueue
git am 0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch

# build
./configure --prefix=$HOME/install --with-includes=/usr/local/include --with-libs=/usr/local/lib CC="ccache cc"
gmake -s -j8
gmake -s install
gmake -C contrib/pg_prewarm install

# create a db cluster and set it to use 2GB of shmem so we can hold whole dataset
~/install/bin/initdb -D ~/pgdata
echo "shared_buffers = '2GB'" >> ~/pgdata/postgresql.conf

# you can either start (and later stop) postgres in the background with pg_ctl:
~/install/bin/pg_ctl start -D ~/pgdata
# ... or just run it in the foreground and hit ^C to stop it:
# ~/install/bin/postgres -D ~/pgdata

# this should produce about 1.1GB of data under ~/pgdata
~/install/bin/pgbench -s 10 -i postgres

# install the prewarm extension, so we can run the test without doing any file IO
~/install/bin/psql postgres -c "create extension pg_prewarm"

# after that, after any server restart, prewarm like so:
~/install/bin/psql postgres -c "select pg_prewarm(c.oid::regclass) from pg_class c where relkind in ('r', 'i')" | cat

# then 60 second pgbench runs are simply:
~/install/bin/pgbench -c 96 -j 96 -M prepared -S -T 60 postgres

# to make pgbench use TCP instead of Unix sockets, add -h localhost;
# to allow connection from another host, update ~/pgdata/postgresql.conf's
# listen_addresses
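
====

As a companion to the usage pattern described near the top of this message (wait for readiness on the client socket plus the postmaster pipe, using poll(2) unless something better is available), here is a small C sketch. It is only an illustration and not the WaitEventSet code from latch.c: struct wait_set, wait_set_init, wait_set_wait and the SKETCH_USE_KQUEUE macro are hypothetical names, and the platform check is a simplifying assumption.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Assumption: these platforms provide kqueue(2); everything else uses poll(2). */
#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__) || \
    defined(__DragonFly__) || defined(__APPLE__)
#define SKETCH_USE_KQUEUE 1
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#else
#include <poll.h>
#endif

/* Hypothetical two-descriptor wait set: client socket + postmaster pipe. */
struct wait_set {
    int sock_fd;
    int pipe_fd;
#ifdef SKETCH_USE_KQUEUE
    int kq; /* long-lived kqueue holding both descriptors */
#endif
};

/* Register both descriptors for read readiness. Returns 0 or -1. */
static int wait_set_init(struct wait_set *ws, int sock_fd, int pipe_fd)
{
    assert(ws != NULL);
    ws->sock_fd = sock_fd;
    ws->pipe_fd = pipe_fd;
#ifdef SKETCH_USE_KQUEUE
    struct kevent changes[2];

    ws->kq = kqueue();
    if (ws->kq < 0)
        return -1;
    /* With kqueue the registration happens once, not on every wait. */
    EV_SET(&changes[0], sock_fd, EVFILT_READ, EV_ADD, 0, 0, 0);
    EV_SET(&changes[1], pipe_fd, EVFILT_READ, EV_ADD, 0, 0, 0);
    if (kevent(ws->kq, changes, 2, NULL, 0, NULL) < 0) {
        close(ws->kq);
        return -1;
    }
#endif
    return 0;
}

/*
 * Wait up to timeout_ms milliseconds (negative means forever).
 * Returns a ready descriptor, -1 on timeout, or -2 on error.
 */
static int wait_set_wait(struct wait_set *ws, int timeout_ms)
{
#ifdef SKETCH_USE_KQUEUE
    struct kevent ev;
    struct timespec ts = { timeout_ms / 1000, (timeout_ms % 1000) * 1000000L };
    int n = kevent(ws->kq, NULL, 0, &ev, 1, timeout_ms < 0 ? NULL : &ts);

    if (n < 0)
        return -2;
    return n == 0 ? -1 : (int) ev.ident;
#else
    /* With poll the descriptor list is passed to the kernel on every call. */
    struct pollfd pfds[2] = {
        { .fd = ws->sock_fd, .events = POLLIN, .revents = 0 },
        { .fd = ws->pipe_fd, .events = POLLIN, .revents = 0 },
    };
    int n = poll(pfds, 2, timeout_ms);

    if (n < 0)
        return -2;
    if (n == 0)
        return -1;
    return pfds[0].revents ? ws->sock_fd : ws->pipe_fd;
#endif
}
```

The point of the sketch is the structural difference: in the kqueue branch the descriptors stay attached to a kernel object between waits, which is the "mere existence of the kqueue object holding the descriptor" referred to above, whereas the poll branch leaves nothing registered once the call returns.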